# Snowplow web with dbt
This guide demonstrates how to set up a dbt project using the Snowplow Web package with Embucket as your target database. Embucket allows you to run existing Snowflake-compatible tools without modification.
## What you'll learn

Follow this guide to:
- Configure dbt project dependencies and database connections for Embucket
- Set up sample Snowplow event data for testing and development
- Execute dbt transformations using the Snowplow Web package
- Check results and explore generated analytics models
- Plan next steps for production deployment
## What this guide covers

This guide covers local development setup with sample data and basic dbt transformations. It doesn't cover production deployment, advanced dbt customizations, custom Snowplow tracking implementation, or enterprise security configurations.
## Prerequisites

- A running Embucket instance
- A Python virtual environment manager: `virtualenv`, `uv`, `conda`, etc.
## Configuration

Add the Snowplow Web package (a dbt package for web analytics transformations) to your `packages.yml`:

```yaml
packages:
  - package: snowplow/snowplow_web
    version: ['>=0.16.0', '<0.17.0']
```
Configure your `profiles.yml` to connect to Embucket:

```yaml
snowplow_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      host: localhost
      port: 3000
      protocol: http
      account: 'embucket.local'
      user: 'embucket'
      password: 'embucket'
      role: 'ACCOUNTADMIN'
      database: 'analytics'
      schema: 'snowplow'
      warehouse: 'compute_wh'
      threads: 4
```
Configuration requirements:

- **Connection settings**: set `host`, `port`, and `protocol` to match your Embucket instance
- **Authentication**: use the default credentials `embucket`/`embucket` unless you changed them
- **Identifiers**: choose descriptive names for `account`, `database`, and `schema`
- **Warehouse**: specify any valid identifier; Embucket ignores this parameter
- **Performance**: set `threads` based on your system capacity
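Once dbt is installed (see the environment steps below), you can validate these settings with `dbt debug`, which checks the profile and attempts a connection to Embucket before you run any models.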
Update your `dbt_project.yml`:

```yaml
name: 'snowplow_analytics'
version: '1.0.0'
profile: 'snowplow_project'

target-path: 'target'
clean-targets:
  - 'target'
  - 'dbt_packages'

dispatch:
  - macro_namespace: dbt
    search_order: ['snowplow_utils', 'dbt']

vars:
  snowplow_web:
    snowplow__atomic_schema: 'snowplow'
    snowplow__database: 'analytics'
    snowplow__start_date: '2022-08-19'
    snowplow__end_date: '2022-08-26'
    snowplow__events_table: 'events'
    snowplow__enable_cwv: false
    snowplow__enable_consent: false
```
This setup directs the Snowplow Web package to read from `analytics.snowplow.events` and process data for August 19-26, 2022.
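These two dates bound the package's processing window, so events outside them are ignored. Once the sample data is loaded (see the next section), a quick sanity check that the events actually fall inside the window looks like this (a sketch; adjust the table qualifier to your setup):

```sql
-- Confirm the loaded events fall inside the configured date window.
SELECT
    MIN(collector_tstamp) AS first_event,
    MAX(collector_tstamp) AS last_event,
    COUNT(*)              AS total_events
FROM analytics.snowplow.events;
```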
Set up your Python environment:

- Create and activate a Python virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install the required dbt packages:

  ```bash
  pip install dbt-core dbt-snowflake
  ```

- Download dbt dependencies:

  ```bash
  dbt deps
  ```
## Load source data

Download the Snowplow sample dataset:

```bash
curl -o Web_Analytics_sample_events.csv \
  https://snowplow-demo-datasets.s3.eu-central-1.amazonaws.com/Web_Analytics/Web_Analytics_sample_events.csv
```
Quoting issues in the downloaded CSV (the embedded JSON columns use single quotes) prevent proper JSON parsing. Fix them with the following script:
```python
import csv

filename = "Web_Analytics_sample_events.csv"
output_filename = "Web_Analytics_sample_events_fixed.csv"

with open(filename, mode="r", newline="") as infile, \
     open(output_filename, mode="w+", newline="") as outfile:
    reader = csv.DictReader(infile)
    # Strip stray double quotes, then promote single quotes to double
    # quotes so the embedded context columns become valid JSON.
    data = [
        {k: v.replace('"', "").replace("'", '"') for k, v in row.items()}
        for row in reader
    ]
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(data)
```
Create the events table structure in Embucket:

```sql
CREATE OR REPLACE TABLE analytics.snowplow.events (
    app_id TEXT, platform TEXT,
    etl_tstamp TIMESTAMP_NTZ(9), collector_tstamp TIMESTAMP_NTZ(9) NOT NULL,
    dvce_created_tstamp TIMESTAMP_NTZ(9),
    event TEXT, event_id TEXT, txn_id NUMBER(38,0),
    name_tracker TEXT, v_tracker TEXT, v_collector TEXT, v_etl TEXT,
    user_id TEXT, user_ipaddress TEXT, user_fingerprint TEXT,
    domain_userid TEXT, domain_sessionidx NUMBER(38,0), network_userid TEXT,
    geo_country TEXT, geo_region TEXT, geo_city TEXT, geo_zipcode TEXT,
    geo_latitude FLOAT, geo_longitude FLOAT, geo_region_name TEXT,
    ip_isp TEXT, ip_organization TEXT, ip_domain TEXT, ip_netspeed TEXT,
    page_url TEXT, page_title TEXT, page_referrer TEXT,
    page_urlscheme TEXT, page_urlhost TEXT, page_urlport NUMBER(38,0),
    page_urlpath TEXT, page_urlquery TEXT, page_urlfragment TEXT,
    refr_urlscheme TEXT, refr_urlhost TEXT, refr_urlport NUMBER(38,0),
    refr_urlpath TEXT, refr_urlquery TEXT, refr_urlfragment TEXT,
    refr_medium TEXT, refr_source TEXT, refr_term TEXT,
    mkt_medium TEXT, mkt_source TEXT, mkt_term TEXT, mkt_content TEXT, mkt_campaign TEXT,
    se_category TEXT, se_action TEXT, se_label TEXT, se_property TEXT, se_value FLOAT,
    tr_orderid TEXT, tr_affiliation TEXT, tr_total NUMBER(18,2),
    tr_tax NUMBER(18,2), tr_shipping NUMBER(18,2),
    tr_city TEXT, tr_state TEXT, tr_country TEXT,
    ti_orderid TEXT, ti_sku TEXT, ti_name TEXT, ti_category TEXT,
    ti_price NUMBER(18,2), ti_quantity NUMBER(38,0),
    pp_xoffset_min NUMBER(38,0), pp_xoffset_max NUMBER(38,0),
    pp_yoffset_min NUMBER(38,0), pp_yoffset_max NUMBER(38,0),
    useragent TEXT,
    br_name TEXT, br_family TEXT, br_version TEXT, br_type TEXT,
    br_renderengine TEXT, br_lang TEXT,
    br_features_pdf BOOLEAN, br_features_flash BOOLEAN, br_features_java BOOLEAN,
    br_features_director BOOLEAN, br_features_quicktime BOOLEAN,
    br_features_realplayer BOOLEAN, br_features_windowsmedia BOOLEAN,
    br_features_gears BOOLEAN, br_features_silverlight BOOLEAN,
    br_cookies BOOLEAN, br_colordepth TEXT,
    br_viewwidth NUMBER(38,0), br_viewheight NUMBER(38,0),
    os_name TEXT, os_family TEXT, os_manufacturer TEXT, os_timezone TEXT,
    dvce_type TEXT, dvce_ismobile BOOLEAN,
    dvce_screenwidth NUMBER(38,0), dvce_screenheight NUMBER(38,0),
    doc_charset TEXT, doc_width NUMBER(38,0), doc_height NUMBER(38,0),
    tr_currency TEXT, tr_total_base NUMBER(18,2),
    tr_tax_base NUMBER(18,2), tr_shipping_base NUMBER(18,2),
    ti_currency TEXT, ti_price_base NUMBER(18,2), base_currency TEXT,
    geo_timezone TEXT, mkt_clickid TEXT, mkt_network TEXT, etl_tags TEXT,
    dvce_sent_tstamp TIMESTAMP_NTZ(9),
    refr_domain_userid TEXT, refr_dvce_tstamp TIMESTAMP_NTZ(9),
    domain_sessionid TEXT, derived_tstamp TIMESTAMP_NTZ(9),
    event_vendor TEXT, event_name TEXT, event_format TEXT,
    event_version TEXT, event_fingerprint TEXT,
    true_tstamp TIMESTAMP_NTZ(9), load_tstamp TIMESTAMP_NTZ(9),
    contexts_com_snowplowanalytics_snowplow_ua_parser_context_1 TEXT,
    contexts_com_snowplowanalytics_snowplow_web_page_1 TEXT,
    contexts_com_iab_snowplow_spiders_and_robots_1 TEXT,
    contexts_nl_basjes_yauaa_context_1 TEXT
);
```
Upload the fixed CSV file to Embucket through the SQL interface or the web UI. Via SQL:

```sql
COPY INTO analytics.snowplow.events
FROM '/home/ec2-user/tmp/Web_Analytics_sample_events_fixed.csv'
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',');
```

Via the web UI:
- Open Embucket at `http://localhost:3000`
- Navigate to the `analytics.snowplow.events` table
- Click Upload Data
- Select `Web_Analytics_sample_events_fixed.csv`
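Whichever route you use, confirm the load before running any models. A minimal check (assuming Embucket supports Snowflake's `PARSE_JSON`, which the quoting fix above is meant to enable):

```sql
-- Count the loaded rows and spot-check that a context column parses as JSON.
SELECT COUNT(*) AS loaded_rows
FROM analytics.snowplow.events;

SELECT PARSE_JSON(contexts_com_snowplowanalytics_snowplow_web_page_1) AS web_page_context
FROM analytics.snowplow.events
LIMIT 1;
```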
## Running dbt

Load the reference data required by the Snowplow Web package:

```bash
dbt seed
```
The output confirms successful loading of dimension tables:

```
Running with dbt=1.9.4
Registered adapter: snowflake=1.9.4
Found 18 models, 103 data tests, 3 seeds, 2 operations, 8 sources, 781 macros

1 of 3 START seed file analytics.snowplow_web_dim_ga4_source_categories ......... [RUN]
1 of 3 OK loaded seed file analytics.snowplow_web_dim_ga4_source_categories ..... [INSERT 819 in 1.71s]
2 of 3 START seed file analytics.snowplow_web_dim_geo_country_mapping ........... [RUN]
2 of 3 OK loaded seed file analytics.snowplow_web_dim_geo_country_mapping ....... [INSERT 249 in 1.88s]
3 of 3 START seed file analytics.snowplow_web_dim_rfc_5646_language_mapping ..... [RUN]
3 of 3 OK loaded seed file analytics.snowplow_web_dim_rfc_5646_language_mapping . [INSERT 232 in 1.48s]

Completed successfully
```
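To spot-check a seed table directly, query it where the log shows it was loaded (this assumes the two-part name from the log resolves in your session; adjust the qualifier if your search path differs):

```sql
-- Peek at the country-mapping dimension loaded by dbt seed.
SELECT *
FROM analytics.snowplow_web_dim_geo_country_mapping
LIMIT 5;
```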
Run the dbt pipeline:

```bash
dbt run
```

The output shows a successful run of all models:
```
17:23:01  Running with dbt=1.9.8
17:23:02  Registered adapter: snowflake=1.9.1
17:23:03  Found 18 models, 103 data tests, 4 seeds, 2 operations, 8 sources, 768 macros
17:23:03
17:23:03  Concurrency: 4 threads (target='*****')
17:23:03
17:23:04  1 of 1 START hook: snowplow_web.on-run-start.0 ................................. [RUN]
17:23:04  1 of 1 OK hook: snowplow_web.on-run-start.0 .................................... [OK in 0.03s]
17:23:04
17:23:04  1 of 18 START sql incremental model public_snowplow_manifest.snowplow_web_base_quarantined_sessions [RUN]
17:23:04  2 of 18 START sql incremental model public_snowplow_manifest.snowplow_web_incremental_manifest [RUN]
17:23:04  1 of 18 OK created sql incremental model public_snowplow_manifest.snowplow_web_base_quarantined_sessions [ok 1 in 0.61s]
17:23:04  2 of 18 OK created sql incremental model public_snowplow_manifest.snowplow_web_incremental_manifest [ok 1 in 0.61s]
...
17:23:10  18 of 18 START sql incremental model public_derived.snowplow_web_users ......... [RUN]
17:23:10  18 of 18 OK created sql incremental model public_derived.snowplow_web_users .... [ok 1 in 0.50s]
17:23:10
17:23:10  1 of 1 START hook: snowplow_web.on-run-end.0 ................................... [RUN]
17:23:11  1 of 1 OK hook: snowplow_web.on-run-end.0 ...................................... [OK in 0.30s]
17:23:11
17:23:11  Finished running 7 incremental models, 2 project hooks, 11 table models in 0 hours 0 minutes and 7.54 seconds (7.54s).
17:23:11
17:23:11  Completed successfully
17:23:11
17:23:11  Done. PASS=20 WARN=0 ERROR=0 SKIP=0 TOTAL=20
```
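With the run complete, you can explore the generated analytics models. A hedged example, assuming the derived models landed in the `public_derived` schema shown in the run log and that the standard `snowplow_web_page_views` model name applies (adjust the qualifiers to your setup):

```sql
-- Daily page views from the derived page-views model.
SELECT
    DATE_TRUNC('day', start_tstamp) AS day,
    COUNT(*)                        AS page_views
FROM analytics.public_derived.snowplow_web_page_views
GROUP BY 1
ORDER BY 1;
```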