Troubleshooting
Why are there duplicate event IDs in the All Events view in my data warehouse?
Some duplicate (user_id, event_id) pairs are expected in the all_events table, as long as the event_table_name value differs for the rows that share an event ID.
If you have two event definitions that overlap, it's possible for a single event with a unique event ID to qualify for multiple event definitions.
Let's say you have two click event definitions: one that captures every click on your platform and one that captures clicks on a specific element. A click on that specific element qualifies for the second definition and automatically qualifies for the first as well, because the first definition captures any click.
If both event definitions are synced to Data Connect, they will each have their own event table downstream and a single event that qualifies for both definitions will appear on both tables with the same event ID. An event will always have just one event ID, even if it qualifies for multiple event definitions (and by extension, is included on multiple event tables in your data warehouse).
The all_events table is a UNION of every individual event table that has been synced to your data warehouse. If both individual event tables include the same event, then the all_events table will show duplicate event IDs. However, these two rows will have different event_table_name values.
The primary key for all_events should be a composite of user_id, event_id, and event_table_name. In individual event tables, a composite of user_id and event_id should suffice.
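Given that composite key, one way to check for true duplicates (rows that collide on the full composite) is a grouped count. This is a sketch against the all_events schema described above; adjust names to your warehouse:

```sql
-- Rows sharing the full composite key would indicate true duplicates;
-- rows that differ only in event_table_name are expected and will not match here.
SELECT user_id, event_id, event_table_name, COUNT(*) AS row_count
FROM all_events
GROUP BY user_id, event_id, event_table_name
HAVING COUNT(*) > 1;
```

An empty result means the table is consistent with the composite primary key.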
Why don't I see initial user properties in Data Connect?
We don't currently sync Initial User properties (such as Initial Marketing Channel or Initial Browser) downstream in Data Connect. Refer to the documentation on how to recreate those properties in your warehouse.
Why is a given event_id included more than once on the all_events table?
This FAQ applies to using Data Connect for warehouses (Redshift, BigQuery, Snowflake) only.
A given event_id across all tables corresponds with a unique event recorded by Contentsquare. That said, the same event_id may exist in multiple event tables, as Contentsquare provides the flexibility to create multiple event definitions that may correspond to the same raw event. For instance, you may define the following two events in the product:
- Click CTA defined as Click on .cta
- Click CTA - Homepage defined as Click on .cta with a filter where Path equals /
If a user clicks the CTA on the homepage, a new event will be recorded on both event tables that correspond with both event definitions. As a result, two events with the same event_id will be included in the all_events table, as it contains every recorded instance of all defined and custom events.
Additionally, pageviews can share an event_id with a defined event. After excluding rows where event_name = 'pageviews', each event_id for a defined event corresponds to a unique time.
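The exclusion above can be expressed directly in a query. This sketch assumes the all_events columns described in this FAQ (event_id, event_name, time):

```sql
-- Work with defined events only: exclude pageview rows first,
-- then each event_id should map to a single time.
SELECT event_id, COUNT(DISTINCT time) AS distinct_times
FROM all_events
WHERE event_name <> 'pageviews'
GROUP BY event_id;
```

Every distinct_times value should be 1 for defined events once pageview rows are filtered out.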
Table constraints are not enforced
Data Connect defines primary and foreign key constraints in the table schema, but no supported warehouse (Redshift, BigQuery, Snowflake) enforces them. Constraints exist for documentation and query optimization only; data integrity is your responsibility.
To avoid duplicate records:
- Use DISTINCT or window functions for deduplication in queries
- Implement deduplication logic in your ETL pipelines
- Consider creating views with built-in deduplication for commonly used datasets
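As a sketch of the third option, a view can deduplicate with a window function. The ORDER BY time DESC tiebreaker is an assumption (keep the most recent row); choose whatever ordering fits your schema:

```sql
-- Keep one row per composite key (user_id, event_id, event_table_name).
CREATE VIEW all_events_dedup AS
SELECT *
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, event_id, event_table_name
           ORDER BY time DESC
         ) AS rn
  FROM all_events e
) t
WHERE rn = 1;
```

Downstream queries can then read from all_events_dedup instead of repeating the deduplication logic.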
Query performance issues
If you're experiencing slow query performance when working with Data Connect data, consider these optimization strategies:
General optimization tips
- Filter early: Apply date range filters and other conditions as early as possible in your queries

  ```sql
  -- Good: Filter before joining
  SELECT *
  FROM sessions
  WHERE time >= DATEADD('day', -30, GETDATE())
  ```
- Select specific columns: Avoid SELECT * and only query the columns you need

  ```sql
  -- Good: Select only needed columns
  SELECT user_id, session_id, time, device_type
  FROM sessions
  ```
- Use appropriate joins: Choose the right join type (INNER, LEFT, RIGHT) based on your analysis needs
- Limit result sets: Use LIMIT for exploratory queries

  ```sql
  SELECT *
  FROM all_events
  WHERE time >= DATEADD('day', -1, GETDATE())
  LIMIT 1000
  ```
- Use EXPLAIN to understand query plans: Before optimizing, review the execution plan to identify bottlenecks like full table scans or expensive joins
- Create materialized views for frequent queries: Cache results of commonly-run analytical queries to avoid recomputation

  ```sql
  CREATE MATERIALIZED VIEW daily_metrics AS
  SELECT DATE(time) AS day,
         COUNT(*) AS event_count
  FROM all_events
  GROUP BY DATE(time);
  ```
Warehouse-specific optimization
Redshift:
- Use distribution keys and sort keys for frequently queried tables
- Run VACUUM and ANALYZE regularly on Data Connect tables
- Consider creating materialized views for complex, frequently-run queries
BigQuery:
- Take advantage of table partitioning (Data Connect tables are partitioned by date)
- Use approximate aggregation functions like
APPROX_COUNT_DISTINCT()for large datasets - Consider clustering keys for frequently filtered columns
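For example, an approximate distinct-user count per day in BigQuery syntax. The 30-day filter is illustrative, and the partition pruning it enables assumes the date-partitioned Data Connect tables noted above:

```sql
-- Approximate distinct users per day; far cheaper than exact
-- COUNT(DISTINCT user_id) on large tables.
SELECT DATE(time) AS day,
       APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM all_events
WHERE DATE(time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day;
```

APPROX_COUNT_DISTINCT trades a small, bounded error for a large reduction in memory and compute, which is usually acceptable for dashboard-style metrics.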
Snowflake:
- Configure warehouse size appropriately for your workload
- Use Snowflake's automatic clustering or define clustering keys
- Set up automatic warehouse suspension to control costs
- Create materialized views for frequently-run analytical queries
S3:
- Optimize your ETL process to create partitioned Parquet or ORC files
- Use AWS Athena or EMR with proper partition pruning
- Consider using AWS Glue for schema management and cataloging
Common query patterns to optimize
Large table scans:

```sql
-- Instead of scanning all events
SELECT COUNT(*) FROM all_events

-- Filter to specific time range and event types
SELECT COUNT(*)
FROM all_events
WHERE time >= DATEADD('month', -1, GETDATE())
  AND event_table_name IN ('page_view', 'button_click')
```

Expensive joins:
```sql
-- Instead of joining all sessions with all events
SELECT *
FROM sessions s
INNER JOIN all_events e ON s.session_id = e.session_id

-- Filter both tables first, then join
WITH recent_sessions AS (
  SELECT * FROM sessions
  WHERE time >= DATEADD('day', -7, GETDATE())
),
recent_events AS (
  SELECT * FROM all_events
  WHERE time >= DATEADD('day', -7, GETDATE())
)
SELECT *
FROM recent_sessions s
INNER JOIN recent_events e
  ON s.session_id = e.session_id
 AND s.user_id = e.user_id
```