Troubleshooting

Why are there duplicate event IDs in the All Events view in my data warehouse?


Some duplicate (user_id, event_id) pairs are expected in the all_events table, as long as the rows that share an event ID have different event_table_name values.

If you have two event definitions that overlap, it's possible for a single event with a unique event ID to qualify for multiple event definitions.

Let's say you have two click event definitions: one that captures every click on your platform and one that captures clicks on a specific element. A click on that specific element qualifies for the second definition and automatically qualifies for the first as well, because the first definition captures any click.

If both event definitions are synced to Data Connect, they will each have their own event table downstream and a single event that qualifies for both definitions will appear on both tables with the same event ID. An event will always have just one event ID, even if it qualifies for multiple event definitions (and by extension, is included on multiple event tables in your data warehouse).

The all_events table is a UNION of every individual event table that has been synced to your data warehouse. If both individual event tables include the same event, then the all_events table will show duplicate event IDs. However, these two rows will have different event_table_name values.

The primary key for all_events should be a composite of user_id, event_id, and event_table_name. In individual event tables, a composite of user_id and event_id should suffice.
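To confirm that duplicates are fully explained by the composite key, you can group on all three columns and look for counts above one. A minimal sketch in warehouse SQL, using the table and column names described above:

```sql
-- Any row returned here is a true duplicate that is NOT explained
-- by overlapping event definitions.
SELECT user_id, event_id, event_table_name, COUNT(*) AS row_count
FROM all_events
GROUP BY user_id, event_id, event_table_name
HAVING COUNT(*) > 1;
```

An empty result means every repeated event ID differs in event_table_name, which is the expected behavior.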

Why don't I see initial user properties in Data Connect?


We don't currently sync Initial User properties (such as Initial Marketing Channel or Initial Browser) downstream in Data Connect. Refer to the documentation on how to recreate those properties in your warehouse.
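One common way to approximate an initial property is to take the value from each user's earliest event. This is a sketch only, assuming an all_events table with user_id, time, and an illustrative marketing_channel column (not a documented Data Connect column):

```sql
-- The first observed marketing_channel per user approximates
-- Initial Marketing Channel; the frame clause makes FIRST_VALUE
-- read the whole partition.
SELECT DISTINCT
  user_id,
  FIRST_VALUE(marketing_channel) OVER (
    PARTITION BY user_id
    ORDER BY time
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS initial_marketing_channel
FROM all_events;
```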

Why is a given event_id included more than once on the all_events table?


This FAQ applies to using Data Connect for warehouses (Redshift, BigQuery, Snowflake) only.

A given event_id across all tables corresponds with a unique event recorded by Contentsquare. That said, the same event_id may exist in multiple event tables, as Contentsquare provides the flexibility to create multiple event definitions that may correspond to the same raw event. For instance, you may define the following two events in the product:

  • Click CTA defined as Click on .cta
  • Click CTA - Homepage defined as Click on .cta with a filter where Path equals /

If a user clicks the CTA on the homepage, the event is recorded in both event tables, one for each definition. As a result, two rows with the same event_id appear in the all_events table, since that table contains every recorded instance of all defined and custom events.

Additionally, a pageview can share an event_id with a defined event. After excluding rows where event_name = 'pageviews', each defined event's event_id corresponds to a unique time.
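Because of this, count distinct IDs rather than rows when you need the number of unique events. A small sketch:

```sql
-- COUNT(*) counts one row per event table an event landed in;
-- COUNT(DISTINCT event_id) counts each recorded event once.
SELECT COUNT(*) AS rows_in_all_events,
       COUNT(DISTINCT event_id) AS unique_events
FROM all_events
WHERE event_name <> 'pageviews';
```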

Table constraints are not enforced


Data Connect defines primary and foreign key constraints in the table schema, but no supported warehouse (Redshift, BigQuery, Snowflake) enforces them. Constraints exist for documentation and query optimization only — data integrity is your responsibility.

To avoid duplicate records:

  • Use DISTINCT or window functions for deduplication in queries
  • Implement deduplication logic in your ETL pipelines
  • Consider creating views with built-in deduplication for commonly used datasets
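A deduplicating view can be sketched with a window function over the composite key described earlier (Redshift-style syntax; adapt to your warehouse):

```sql
-- Keep one row per (user_id, event_id, event_table_name).
CREATE VIEW all_events_dedup AS
SELECT *
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, event_id, event_table_name
           ORDER BY time
         ) AS rn
  FROM all_events e
) deduped
WHERE rn = 1;
```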

Slow query performance

If you're experiencing slow query performance when working with Data Connect data, consider these optimization strategies:

  1. Filter early: Apply date range filters and other conditions as early as possible in your queries

    -- Good: Filter before joining
    SELECT *
    FROM sessions
    WHERE time >= DATEADD('day', -30, GETDATE())
  2. Select specific columns: Avoid SELECT * and only query the columns you need

    -- Good: Select only needed columns
    SELECT user_id, session_id, time, device_type
    FROM sessions
  3. Use appropriate joins: Choose the right join type (INNER, LEFT, RIGHT) based on your analysis needs

  4. Limit result sets: Use LIMIT for exploratory queries

    SELECT * FROM all_events
    WHERE time >= DATEADD('day', -1, GETDATE())
    LIMIT 1000
  5. Use EXPLAIN to understand query plans: Before optimizing, review the execution plan to identify bottlenecks like full table scans or expensive joins

  6. Create materialized views for frequent queries: Cache results of commonly-run analytical queries to avoid recomputation

    CREATE MATERIALIZED VIEW daily_metrics AS
    SELECT DATE(time) as day, COUNT(*) as event_count
    FROM all_events
    GROUP BY DATE(time);

Warehouse-specific optimization


Redshift:

  • Use distribution keys and sort keys for frequently queried tables
  • Run VACUUM and ANALYZE regularly on Connect tables
  • Consider creating materialized views for complex, frequently-run queries
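Both maintenance commands are run per table; for example, assuming a synced sessions table:

```sql
-- Reclaim space and re-sort rows, then refresh planner statistics.
VACUUM sessions;
ANALYZE sessions;
```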

BigQuery:

  • Take advantage of table partitioning (Data Connect tables are partitioned by date)
  • Use approximate aggregation functions like APPROX_COUNT_DISTINCT() for large datasets
  • Consider clustering keys for frequently filtered columns
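For example, an approximate distinct count in BigQuery Standard SQL trades a small error margin for a much cheaper scan:

```sql
-- Approximate unique users per day over the last 30 days.
SELECT DATE(time) AS day,
       APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM all_events
WHERE time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day;
```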

Snowflake:

  • Configure warehouse size appropriately for your workload
  • Use Snowflake's automatic clustering or define clustering keys
  • Set up automatic warehouse suspension to control costs
  • Create materialized views for frequently-run analytical queries
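As a sketch, clustering and auto-suspension are both configured with ALTER statements (analytics_wh is an illustrative warehouse name):

```sql
-- Cluster the table on commonly filtered columns.
ALTER TABLE all_events CLUSTER BY (time, event_table_name);

-- Suspend after 60 seconds idle; resume automatically on the next query.
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```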

S3:

  • Optimize your ETL process to create partitioned Parquet or ORC files
  • Use AWS Athena or EMR with proper partition pruning
  • Consider using AWS Glue for schema management and cataloging
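With partitioned Parquet files, an Athena external table can prune partitions by date. A sketch with an illustrative bucket, path, and column set:

```sql
-- The event_date partition column maps to folders like
-- s3://your-bucket/events/event_date=2024-01-31/.
CREATE EXTERNAL TABLE events_parquet (
  user_id STRING,
  event_id STRING,
  event_name STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3://your-bucket/events/';
```

After new partitions land, run MSCK REPAIR TABLE events_parquet (or add partitions explicitly) so Athena can see them.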

Common query patterns to optimize


Large table scans:

-- Instead of scanning all events
SELECT COUNT(*) FROM all_events
-- Filter to specific time range and event types
SELECT COUNT(*)
FROM all_events
WHERE time >= DATEADD('month', -1, GETDATE())
AND event_table_name IN ('page_view', 'button_click')

Expensive joins:

-- Instead of joining all sessions with all events
SELECT *
FROM sessions s
INNER JOIN all_events e ON s.session_id = e.session_id
-- Filter both tables first, then join
WITH recent_sessions AS (
  SELECT * FROM sessions
  WHERE time >= DATEADD('day', -7, GETDATE())
),
recent_events AS (
  SELECT * FROM all_events
  WHERE time >= DATEADD('day', -7, GETDATE())
)
SELECT *
FROM recent_sessions s
INNER JOIN recent_events e
  ON s.session_id = e.session_id
  AND s.user_id = e.user_id