Troubleshooting

Why are there duplicate event IDs in the All Events view in my data warehouse?


Some duplicate (user_id, event_id) pairs are expected in the all_events table, as long as the rows that share an event ID have different event_table_name values.

If you have two event definitions that overlap, it's possible for a single event with a unique event ID to qualify for multiple event definitions.

Let's say you have two click event definitions: one that captures every click on your platform and one that captures clicks on a specific element. A click on that specific element qualifies for the second definition and automatically qualifies for the first as well, because the first definition captures any click.

If both event definitions are synced to Data Connect, they will each have their own event table downstream and a single event that qualifies for both definitions will appear on both tables with the same event ID. An event will always have just one event ID, even if it qualifies for multiple event definitions (and by extension, is included on multiple event tables in your data warehouse).

The all_events table is a UNION of every individual event table that has been synced to your data warehouse. If both individual event tables include the same event, then the all_events table will show duplicate event IDs. However, these two rows will have different event_table_name values.

The primary key for all_events should be a composite of user_id, event_id, and event_table_name. In individual event tables, a composite of user_id and event_id should suffice.
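To confirm that duplicates are fully explained by the composite key, you can group on all three columns and look for counts above one. A minimal sketch in warehouse SQL, using the table and column names described above:

```sql
-- Any row returned here is a true duplicate that is NOT explained
-- by overlapping event definitions.
SELECT user_id, event_id, event_table_name, COUNT(*) AS row_count
FROM all_events
GROUP BY user_id, event_id, event_table_name
HAVING COUNT(*) > 1;
```

An empty result means every repeated event ID differs in event_table_name, which is the expected behavior.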

Why don't I see initial user properties in Data Connect?


We don't currently sync Initial User properties (such as Initial Marketing Channel or Initial Browser) downstream in Data Connect. Refer to the documentation on how to recreate those properties in your warehouse.
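One common way to approximate an initial property is to take the value from each user's earliest event. This is a sketch only, assuming an all_events table with user_id, time, and an illustrative marketing_channel column (not a documented Data Connect column):

```sql
-- The first observed marketing_channel per user approximates
-- Initial Marketing Channel; the frame clause makes FIRST_VALUE
-- read the whole partition.
SELECT DISTINCT
  user_id,
  FIRST_VALUE(marketing_channel) OVER (
    PARTITION BY user_id
    ORDER BY time
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS initial_marketing_channel
FROM all_events;
```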

Why is a given event_id included more than once on the all_events table?


This FAQ applies to using Data Connect for warehouses (Redshift, BigQuery, Snowflake) only.

A given event_id across all tables corresponds with a unique event recorded by Contentsquare. That said, the same event_id may exist in multiple event tables, as Contentsquare provides the flexibility to create multiple event definitions that may correspond to the same raw event. For instance, you may define the following two events in the product:

  • Click CTA defined as Click on .cta
  • Click CTA - Homepage defined as Click on .cta with a filter where Path equals /

If a user clicks the CTA on the homepage, the event is recorded in both event tables, one for each definition. As a result, two rows with the same event_id appear in the all_events table, since that table contains every recorded instance of all defined and custom events.

Additionally, a pageview can share an event_id with a defined event. After excluding rows where event_name = 'pageviews', each defined event's event_id corresponds to a unique time.
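Because of this, count distinct IDs rather than rows when you need the number of unique events. A small sketch:

```sql
-- COUNT(*) counts one row per event table an event landed in;
-- COUNT(DISTINCT event_id) counts each recorded event once.
SELECT COUNT(*) AS rows_in_all_events,
       COUNT(DISTINCT event_id) AS unique_events
FROM all_events
WHERE event_name <> 'pageviews';
```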

Table constraints are not enforced


Data Connect defines primary and foreign key constraints in the table schema, but no supported warehouse (Redshift, BigQuery, Snowflake) enforces them. Constraints exist for documentation and query optimization only — data integrity is your responsibility.

To avoid duplicate records:

  • Use DISTINCT or window functions for deduplication in queries
  • Implement deduplication logic in your ETL pipelines
  • Consider creating views with built-in deduplication for commonly used datasets
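A deduplicating view can be sketched with a window function over the composite key described earlier (Redshift-style syntax; adapt to your warehouse):

```sql
-- Keep one row per (user_id, event_id, event_table_name).
CREATE VIEW all_events_dedup AS
SELECT *
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, event_id, event_table_name
           ORDER BY time
         ) AS rn
  FROM all_events e
) deduped
WHERE rn = 1;
```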

Slow query performance

If you're experiencing slow query performance when working with Data Connect data, consider these optimization strategies:

  1. Filter early: Apply date range filters and other conditions as early as possible in your queries

    -- Good: Filter before joining
    SELECT *
    FROM sessions
    WHERE time >= DATEADD('day', -30, GETDATE())
  2. Select specific columns: Avoid SELECT * and only query the columns you need

    -- Good: Select only needed columns
    SELECT user_id, session_id, time, device_type
    FROM sessions
  3. Use appropriate joins: Choose the right join type (INNER, LEFT, RIGHT) based on your analysis needs

  4. Limit result sets: Use LIMIT for exploratory queries

    SELECT * FROM all_events
    WHERE time >= DATEADD('day', -1, GETDATE())
    LIMIT 1000
  5. Use EXPLAIN to understand query plans: Before optimizing, review the execution plan to identify bottlenecks like full table scans or expensive joins

  6. Create materialized views for frequent queries: Cache results of commonly-run analytical queries to avoid recomputation

    CREATE MATERIALIZED VIEW daily_metrics AS
    SELECT DATE(time) as day, COUNT(*) as event_count
    FROM all_events
    GROUP BY DATE(time);

Warehouse-specific optimization


Redshift:

  • Use distribution keys and sort keys for frequently queried tables
  • Run VACUUM and ANALYZE regularly on Connect tables
  • Consider creating materialized views for complex, frequently-run queries
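Both maintenance commands are run per table; for example, assuming a synced sessions table:

```sql
-- Reclaim space and re-sort rows, then refresh planner statistics.
VACUUM sessions;
ANALYZE sessions;
```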

BigQuery:

  • Take advantage of table partitioning (Data Connect tables are partitioned by date)
  • Use approximate aggregation functions like APPROX_COUNT_DISTINCT() for large datasets
  • Consider clustering keys for frequently filtered columns
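For example, an approximate distinct count in BigQuery Standard SQL trades a small error margin for a much cheaper scan:

```sql
-- Approximate unique users per day over the last 30 days.
SELECT DATE(time) AS day,
       APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM all_events
WHERE time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day;
```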

Snowflake:

  • Configure warehouse size appropriately for your workload
  • Use Snowflake's automatic clustering or define clustering keys
  • Set up automatic warehouse suspension to control costs
  • Create materialized views for frequently-run analytical queries
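As a sketch, clustering and auto-suspension are both configured with ALTER statements (analytics_wh is an illustrative warehouse name):

```sql
-- Cluster the table on commonly filtered columns.
ALTER TABLE all_events CLUSTER BY (time, event_table_name);

-- Suspend after 60 seconds idle; resume automatically on the next query.
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```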

S3:

  • Optimize your ETL process to create partitioned Parquet or ORC files
  • Use AWS Athena or EMR with proper partition pruning
  • Consider using AWS Glue for schema management and cataloging
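With partitioned Parquet files, an Athena external table can prune partitions by date. A sketch with an illustrative bucket, path, and column set:

```sql
-- The event_date partition column maps to folders like
-- s3://your-bucket/events/event_date=2024-01-31/.
CREATE EXTERNAL TABLE events_parquet (
  user_id STRING,
  event_id STRING,
  event_name STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION 's3://your-bucket/events/';
```

After new partitions land, run MSCK REPAIR TABLE events_parquet (or add partitions explicitly) so Athena can see them.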

Common query patterns to optimize


Large table scans:

-- Instead of scanning all events
SELECT COUNT(*) FROM all_events
-- Filter to specific time range and event types
SELECT COUNT(*)
FROM all_events
WHERE time >= DATEADD('month', -1, GETDATE())
AND event_table_name IN ('page_view', 'button_click')

Expensive joins:

-- Instead of joining all sessions with all events
SELECT *
FROM sessions s
INNER JOIN all_events e ON s.session_id = e.session_id
-- Filter both tables first, then join
WITH recent_sessions AS (
  SELECT * FROM sessions
  WHERE time >= DATEADD('day', -7, GETDATE())
),
recent_events AS (
  SELECT * FROM all_events
  WHERE time >= DATEADD('day', -7, GETDATE())
)
SELECT *
FROM recent_sessions s
INNER JOIN recent_events e
  ON s.session_id = e.session_id
  AND s.user_id = e.user_id