Amazon S3

Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.

Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.

Before setting up the S3 integration, ensure you have:

  • An AWS account with S3 access
  • An S3 bucket to store Contentsquare data
  • AWS credentials with appropriate permissions for the S3 bucket
  1. Create an S3 bucket to store Contentsquare data (if you don't already have one).

  2. Create an IAM user or role with appropriate permissions.

    The IAM policy should include:

    • s3:PutObject
    • s3:GetObject
    • s3:ListBucket
    • s3:DeleteObject
    • s3:PutObjectAcl

  3. Apply the appropriate bucket policy for your region:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1441164338000",
          "Effect": "Allow",
          "Action": [
            "s3:*"
          ],
          "Resource": [
            "arn:aws:s3:::",
            "arn:aws:s3:::/*"
          ],
          "Principal": {
            "AWS": [
              "arn:aws:iam::085120003701:root"
            ]
          }
        }
      ]
    }

  4. Generate AWS access keys for the IAM user (if using user-based authentication).

Configure Contentsquare Connect

  1. Log in to Contentsquare.
  2. Navigate to Analysis setup > Data Connect.
  3. Create the S3 bucket csq-rs3-<bucket_name> to sync Data Connect data with.
  4. Select Next.
  5. Add the displayed policy to your CSQ bucket on S3.
  6. Input your S3 credentials to connect to your bucket.
  7. Select Connect.

Once setup is complete, you'll see a sync within 24 hours containing the built-in tables (users, sessions, pageviews) and any custom event tables you've enabled.

When Data Connect syncs data to S3, it creates the following structure:

s3://your-bucket/
├── sync_[sync_id]/
│   ├── _heap_table_name=users/
│   │   ├── part-00000-[uuid].avro
│   │   ├── part-00001-[uuid].avro
│   │   └── ...
│   ├── _heap_table_name=sessions/
│   │   ├── part-00000-[uuid].avro
│   │   ├── part-00001-[uuid].avro
│   │   └── ...
│   ├── _heap_table_name=pageviews/
│   │   └── ...
│   ├── _heap_table_name=[custom_event_name]/
│   │   └── ...
│   └── ...
├── manifests/
│   └── ...
└── sync_reports/
    └── ...

Each sync folder follows the pattern sync_[sync_id]/_heap_table_name=[table_name]/part-[part_number]-[uuid].avro. For example:

sync_1010103140/_heap_table_name=sessions/part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro

The data is organized by:

  • Sync ID (a unique identifier for each sync operation)
  • Table name (users, sessions, pageviews, custom events), prefixed with _heap_table_name=
  • Part files (data is split into multiple Avro-encoded files)

Each periodic data delivery will be accompanied by a manifest metadata file, which will describe the target schema and provide a full list of relevant data files for each table.

{
  "dump_id": 1234,
  "tables": [
    {
      "name": "users",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=users/part-00000-a97432cba49732.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00001-584cdba3973c32.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00002-32917bc3297a3c.avro"
      ],
      "columns": [
        "user_id",
        "last_modified",
        // ...
      ],
      "incremental": true
    },
    {
      "name": "user_migrations",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00000-2a345bc452456c.avro",
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00001-4382abc432862c.avro"
      ],
      "columns": [
        "from_user_id",
        "to_user_id",
        // ...
      ],
      "incremental": false // always false for migrations
    },
    {
      "name": "defined_event",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=defined_event/part-00000-2fa2dbe2456c.avro"
      ],
      "columns": [
        "user_id",
        "event_id",
        "time",
        "session_id",
        // ...
      ],
      "incremental": true
    }
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}

It includes the following information:

  • dump_id: A monotonically increasing sequence number for dumps.
  • tables: For each table synced:
    • name: The name of the table.
    • columns: An array consisting of the columns contained in the table. This can be used to determine which columns need to be added or removed downstream.
    • files: An array of full S3 paths to the Avro-encoded files for the relevant table.
    • incremental: A boolean denoting whether the data for the table is incremental on top of previous dumps. A value of false means it is a full/fresh resync of this table, and all previous data is invalid.
  • property_definitions: The S3 path to the defined property definitions file.

Ignore any files in the data delivery that aren't listed in the manifest metadata file.
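Because only manifest-listed files should be processed, a consumer can build its work list directly from the manifest. Below is a minimal Python sketch of that rule; the trimmed-down manifest literal is illustrative, mirroring the example above:

```python
import json

def files_to_process(manifest_json: str) -> dict:
    """Return {table_name: [file paths]} taken strictly from the manifest.

    Only files listed in the manifest are processed; anything else in the
    delivery is ignored, as the documentation requires.
    """
    manifest = json.loads(manifest_json)
    return {table["name"]: table["files"] for table in manifest["tables"]}

# Trimmed-down example manifest (illustrative):
manifest_json = json.dumps({
    "dump_id": 1234,
    "tables": [
        {"name": "users",
         "files": ["s3://customer/sync_1234/_heap_table_name=users/part-00000-a97432cba49732.avro"],
         "columns": ["user_id", "last_modified"],
         "incremental": True},
    ],
})

plan = files_to_process(manifest_json)
# plan["users"] lists exactly the one Avro file named in the manifest
```

A real pipeline would fetch the newest object under manifests/ and feed each table's file list to its loader.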

Each sync will be accompanied by a sync log file that reports on delivery status. These log files will be placed in the sync_reports directory. Each report will be in a JSON format as follows:

{
  "start_time": 1566968405225,
  "finish_time": 1566968649169,
  "status": "succeeded",
  "next_sync_at": 1567054800000,
  "error": null
}

start_time, finish_time, and next_sync_at are represented as epoch timestamps in milliseconds.
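Because these fields are millisecond epoch values, convert them before use. A small Python sketch, using the report values from the example above:

```python
from datetime import datetime, timezone

def ms_to_datetime(epoch_ms: int) -> datetime:
    """Convert a millisecond epoch timestamp to an aware UTC datetime."""
    seconds, ms = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=ms * 1000)

report = {
    "start_time": 1566968405225,
    "finish_time": 1566968649169,
    "status": "succeeded",
    "next_sync_at": 1567054800000,
    "error": None,
}

started = ms_to_datetime(report["start_time"])
# started is 2019-08-28 05:00:05.225 UTC
```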

See Data Syncing to learn how the data will be structured upon sync.

The user_id, event_id, and session_id are the only columns that are long types. All other columns should be inferred as string types.
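This typing rule is simple enough to encode directly when building a reader schema. A sketch, assuming columns arrive as plain name strings (the `infer_column_type` helper is hypothetical, not part of any Contentsquare SDK):

```python
# Per the documentation, only these three ID columns are long types;
# every other column should be inferred as a string.
LONG_COLUMNS = {"user_id", "event_id", "session_id"}

def infer_column_type(column: str) -> str:
    """Apply the documented typing rule for Data Connect S3 columns."""
    return "long" if column in LONG_COLUMNS else "string"

schema = {col: infer_column_type(col)
          for col in ["user_id", "session_id", "time", "path", "event_id"]}
# {"user_id": "long", "session_id": "long", "time": "string",
#  "path": "string", "event_id": "long"}
```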

When working with Contentsquare data in S3, you need to perform several processing steps to ensure data accuracy and completeness. The following sections outline the key considerations for building your ETL pipeline.

Contentsquare's identify API allows you to create a single, cohesive view of a user across devices, browsers, and domains. When a user is identified via the identify API, their anonymous user_id is updated to a new user_id, which is a hash of the identity. Once this is set up, when this user is identified on another device or browser, this tells us that these users are the same, and that we should join their data together.

Internally, we call this a “user migration” because we are migrating user and event data from one record to another. In other words, we are resolving data from two users into one identity. This is identity resolution, at a high-level.

In the Contentsquare app, we handle this for you. In warehouse exports, we also handle it: on write for Redshift, and in a view for Snowflake and BigQuery. However, in S3 exports, we do not resolve identities for you. To merge this user activity correctly, you must resolve identity using the user_migrations table.

How do I apply this identity resolution mapping in my data warehouse? The user_migrations table contains a mapping of from_user_id to to_user_id, and should be joined against the users table as well as all pertinent event tables. You should create a view based on these joins that will refresh on a regular cadence.

Below is an example of the view you should create to resolve identity on the users table: join user_migrations.from_user_id on users.user_id, then coalesce to_user_id and user_id to obtain the user's final state:

CREATE VIEW users_view AS
SELECT
  user_id,
  MIN("joindate") AS "joindate",
  MAX("last_modified") AS "last_modified",
  MAX("identity") AS "identity",
  MAX("handle") AS "handle",
  MAX("email") AS "email"
FROM
  (
    SELECT
      COALESCE("to_user_id", "user_id") AS "user_id",
      "joindate", "last_modified", "identity", "handle", "email"
    FROM users u
    LEFT JOIN user_migrations m
      ON u.user_id = m.from_user_id
  ) x
GROUP BY user_id;

The following example illustrates the view you should create to resolve identity on each of your synced event tables. Make sure to select all unique columns from each event table in order to replicate the desired table with migrations applied. Each time you toggle on a new event table to sync in the Contentsquare UI, create this migrated view for that table:

CREATE VIEW example_event_migrated_view AS
SELECT
  COALESCE("to_user_id", "user_id") AS "user_id",
  "event_column_1", "event_column_2", "event_id", "session_id", "time",
  "session_time", "type", "library", "platform", "device_type", "country",
  "region", "city", "ip", "referrer", "landing_page", "browser",
  "search_keyword", "utm_source", "utm_campaign", "utm_medium", "utm_term",
  "utm_content", "domain", "query", "path", "hash", "title", "href",
  "target_text"
FROM example_event_to_be_migrated e
LEFT JOIN user_migrations m
  ON e.user_id = m.from_user_id;

Data across dumps/files is not guaranteed to be disjoint. As a result, downstream consumers are responsible for de-duplication, which must happen after applying user migrations. Here's a strategy you can adopt:

Table          De-duplication Columns
Sessions       session_id, user_id
Users          user_id
Event tables   event_id, user_id

Updated users (users with properties that have changed since the last sync) will re-appear in the sync files, and thus every repeated occurrence of a user (check on user_id) should replace the old one to ensure that the corresponding property updates are picked up.

user_migrations is a fully materialized mapping of from_user_ids to to_user_ids. Downstream consumers are responsible for joining this with events/users tables downstream to resolve identity retroactively.
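To make the ordering concrete (migrations first, then de-duplication with the newest record winning), here is a minimal in-memory Python sketch. The record dicts and values are made up for illustration, and last_modified is shown as an integer for brevity; a real pipeline would operate on the Avro data at scale:

```python
def resolve_user_id(record: dict, migrations: dict) -> dict:
    """Map from_user_id -> to_user_id when a migration exists (the
    COALESCE step from the SQL views above)."""
    rec = dict(record)
    rec["user_id"] = migrations.get(rec["user_id"], rec["user_id"])
    return rec

def dedupe_users(records: list, migrations: dict) -> list:
    """Apply migrations first, then keep one row per user_id, letting the
    record with the newest last_modified replace older occurrences."""
    latest: dict = {}
    for rec in records:
        rec = resolve_user_id(rec, migrations)
        current = latest.get(rec["user_id"])
        if current is None or rec["last_modified"] > current["last_modified"]:
            latest[rec["user_id"]] = rec
    return list(latest.values())

# Illustrative data: an anonymous user later identified as "hash-abc".
migrations = {"anon-123": "hash-abc"}  # from_user_id -> to_user_id
records = [
    {"user_id": "anon-123", "last_modified": 100, "email": None},
    {"user_id": "hash-abc", "last_modified": 200, "email": "a@example.com"},
]

users = dedupe_users(records, migrations)
# A single user remains, carrying the newer properties.
```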

For v2, we only sync defined property definitions rather than the actual defined property values. Downstream consumers are responsible for applying these definitions to generate the defined property values for each row.

Schemas are expected to evolve over time: properties can be added to the user and events tables.

Working with Data Connect in S3 requires understanding several S3-specific behaviors and best practices.

The following actions will trigger a full resync of individual tables:

  • Adding new events: When you define and enable a new event in Contentsquare
  • Toggling events on/off: Turning an event sync on or off in the Contentsquare UI
  • Configuration changes: Modifying sync settings or warehouse configuration

The user_migrations table always performs a full sync, regardless of changes.

To minimize disruption:

  • Plan event definition changes during low-traffic periods
  • Test new events in a development environment first
  • Monitor sync reports after making changes

Data Connect supports server-side encryption for S3 buckets.

Supported Encryption:

  • SSE-S3 (Amazon S3-Managed Keys): Fully supported, no additional configuration required
  • Set default bucket encryption in your S3 bucket settings

Not Supported:

  • SSE-KMS (AWS Key Management Service Keys): Not currently supported

To enable SSE-S3 encryption:

  1. Navigate to your S3 bucket in the AWS Console
  2. Go to Properties > Default encryption
  3. Select AES-256 (SSE-S3)
  4. Save changes

No additional IAM roles or permissions are required for SSE-S3. See AWS documentation for detailed instructions.

Schema changes in S3 behave differently than in other warehouses:

Adding Properties:

  • New columns appear in subsequent syncs automatically
  • Historical data will not include the new property (no backfill)
  • The property will be present in new data files going forward

Removing Properties:

  • Archived properties stop syncing immediately
  • The column/field stops appearing in new data files
  • Historical files still contain the property data

Property Type Changes:

  • Not automatically handled - you must manage in your ETL process
  • Consider versioning your data schema
  • Use manifest files to track schema evolution

Unlike warehouse integrations, S3 does not automatically update table schemas. You are responsible for:

  • Detecting schema changes via manifest files
  • Updating your ETL process to handle new/removed columns
  • Managing schema versioning in your data lake

If a sync fails, Data Connect will automatically retry:

  1. Automatic Retries: Most failures are transient and resolve automatically
  2. Monitoring: Check sync reports in the sync_reports directory for status
  3. Manual Intervention: In rare cases, you may need to contact support

To monitor sync health:

# Check latest sync report
aws s3 cp s3://your-bucket/sync_reports/latest.json -
# Monitor for failed syncs
aws s3 ls s3://your-bucket/sync_reports/ | grep -i "failed"

Common failure causes:

  • Insufficient S3 permissions
  • Bucket policy misconfigurations
  • S3 service disruptions
  • Network connectivity issues

To determine if a sync has completed:

  1. Monitor manifest files: A new manifest signals sync completion

    # List recent manifests
    aws s3 ls s3://your-bucket/manifests/ --recursive | tail -5
  2. Check sync reports: Look for "status": "succeeded" in sync reports

    # Get latest sync report
    aws s3 cp s3://your-bucket/sync_reports/$(aws s3 ls s3://your-bucket/sync_reports/ | tail -1 | awk '{print $4}') -
  3. Validate data files: Confirm all expected tables have new data files

    # List recent syncs
    aws s3 ls s3://your-bucket/ | grep sync_
  4. Schedule ETL jobs: Trigger your ETL pipeline after manifest delivery
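Step 2 above is easy to automate. A small Python sketch that decides, from a fetched sync report, whether to trigger downstream ETL (the report literal mirrors the example earlier on this page; fetching the object itself is left out):

```python
import json

def sync_completed(report_json: str) -> bool:
    """Return True when a sync report indicates a successful sync."""
    report = json.loads(report_json)
    return report.get("status") == "succeeded" and report.get("error") is None

report_json = json.dumps({
    "start_time": 1566968405225,
    "finish_time": 1566968649169,
    "status": "succeeded",
    "next_sync_at": 1567054800000,
    "error": None,
})

if sync_completed(report_json):
    pass  # trigger the ETL pipeline here
```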

Each S3 sync creates folders with numeric identifiers:

Sync ID (sync_[number]):

  • Incremental identifier for each sync operation
  • Example: sync_1234567890/
  • Used to organize data by sync batch
  • Corresponds to dump_id in manifest files

Dump ID:

  • Monotonically increasing sequence number
  • Found in manifest metadata files
  • Links manifest files to their corresponding data folders
  • Example: "dump_id": 1234567890 matches sync_1234567890/

Table Name Prefix:

  • Each table is stored under _heap_table_name=[table_name]/
  • Example: sync_1234/_heap_table_name=sessions/
  • This format enables Hive-style partitioning in tools like Athena

Use these identifiers to:

  • Track which sync batch each file belongs to
  • Build incremental ETL processes
  • Audit data lineage and sync history
  • Partition data in your data lake
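These naming conventions make sync keys straightforward to parse. A Python sketch that extracts the identifiers from an object key; the regex assumes exactly the documented sync_[sync_id]/_heap_table_name=[table]/part-[part]-[uuid].avro layout:

```python
import re

# Pattern for the documented layout:
# sync_[sync_id]/_heap_table_name=[table]/part-[part]-[uuid].avro
KEY_PATTERN = re.compile(
    r"sync_(?P<sync_id>\d+)/_heap_table_name=(?P<table>[^/]+)/"
    r"part-(?P<part>\d+)-(?P<uuid>[^/]+)\.avro$"
)

def parse_key(key: str):
    """Extract sync ID, table name, part number, and uuid from an S3 key.

    Returns None for objects outside the data folders (e.g. manifests/
    or sync_reports/).
    """
    match = KEY_PATTERN.search(key)
    if match is None:
        return None
    info = match.groupdict()
    info["sync_id"] = int(info["sync_id"])
    info["part"] = int(info["part"])
    return info

info = parse_key(
    "sync_1010103140/_heap_table_name=sessions/"
    "part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro"
)
# info["sync_id"] == 1010103140, info["table"] == "sessions", info["part"] == 61
```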

Contentsquare does not perform deduplication or identity resolution in S3 exports: your organization needs to manage the ETL process.

Data Deduplication:

  • Not performed automatically - you must deduplicate in your ETL
  • Use the deduplication columns specified in the Amazon S3 setup guide
  • Apply deduplication after identity resolution for accurate results

Timestamp Formats:

  • All timestamp columns use UNIX timestamp format (milliseconds since epoch)

  • Warehouses use native timestamp types (such as TIMESTAMP, DATETIME)

  • You must convert timestamps during ETL:

    -- Example conversion in Athena
    FROM_UNIXTIME(joindate / 1000) AS joindate_timestamp

Schema Enforcement:

  • No automatic schema validation or enforcement
  • Your ETL process must handle schema mismatches
  • Monitor manifest files for schema changes

Partitioning and Performance:

  • Data is not pre-partitioned by date
  • You should repartition data during ETL for optimal query performance
  • Consider using Parquet or ORC formats with date-based partitioning
  • Example partitioning strategy: partition by year, month, and day
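As an illustration of the suggested year/month/day strategy, this Python sketch derives a Hive-style partition prefix from a millisecond event timestamp. The path layout is an example of this approach, not a Contentsquare convention:

```python
from datetime import datetime, timezone

def partition_path(table: str, event_time_ms: int) -> str:
    """Build a Hive-style year/month/day partition prefix from a
    millisecond epoch timestamp (illustrative layout)."""
    dt = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return f"{table}/year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/"

path = partition_path("sessions", 1566968405225)
# "sessions/year=2019/month=08/day=28/"
```

Writing repartitioned Parquet or ORC under prefixes like this lets engines such as Athena prune partitions by date.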

Why don't I see initial user properties in Data Connect?

Initial User properties (such as Initial Marketing Channel or Initial Browser) are not currently synced downstream in Data Connect. See Creating An Enhanced Users Table for an example of how to recreate those properties in your warehouse.

What triggers a full sync for tables in S3?

The addition of new events, as well as toggling an event on or off within the Contentsquare UI, will result in the full sync of an individual table. Note that the user_migrations table always does a full sync.

How do I set up encryption with S3?

In terms of server-side encryption, Contentsquare currently supports only the Amazon S3-managed keys (SSE-S3) encryption key type. Buckets using the AWS Key Management Service key (SSE-KMS) encryption key type are not currently supported.

No additional user/role in S3 is required.

For instructions on how to edit your bucket's default encryption, see AWS documentation.

Does an S3 schema change initiate a full sync?

Schema changes should not initiate a full sync. In cases where a property is synced/unsynced from within the Contentsquare app, the property (column) will either be added or stop being included going forward. However, if you need to populate the column retroactively, a full resync of a given table is needed.

How often is the schema of each table expected to change?

For the users, pageviews, sessions, and user_migrations tables and their respective built-in properties, the schema typically stays the same. However, it is common for properties to be added/removed (archived/unsynced) from the Contentsquare app.

Note that when a new property is captured, it is automatically included in Data Connect.

What happens after an S3 sync fails?

The majority of sync failures will re-attempt and resolve themselves. In rare cases, manual intervention may be required.

How can I tell if an S3 sync is completed for today?

Completion of a sync is signaled by the delivery of a new manifest file. You should poll s3://<BUCKET>/manifests/* for new manifests. Once a manifest is delivered, you can process it via your ETL pipeline.

What is the numeric value appended to each sync folder?

This value is the dump_id, which is appended to each sync folder and associates the folder with its corresponding manifest located in s3://<BUCKET>/manifests/*. For example, the folder name may look like sync_123456789/.