Amazon S3

This guide provides detailed instructions for setting up and using Data Connect with Amazon S3.

Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.

Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.

Before setting up the S3 integration, ensure you have:

  • An AWS account with S3 access
  • An S3 bucket to store Contentsquare data
  • AWS credentials with appropriate permissions for the S3 bucket

To set up the integration:

  1. Create an S3 bucket to store Contentsquare data (if you don’t already have one)

  2. Create an IAM user or role with appropriate permissions:

    • The IAM policy should include the following actions (see the sample policy after these steps):

      • s3:PutObject
      • s3:GetObject
      • s3:ListBucket
      • s3:DeleteObject (optional, for cleanup)

  3. Generate AWS access keys for the IAM user (if using user-based authentication)
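
For reference, a minimal IAM policy granting these actions might look like the following sketch; "your-bucket" is a placeholder for your actual bucket name:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": "arn:aws:s3:::your-bucket/*"
        },
        {
          "Effect": "Allow",
          "Action": "s3:ListBucket",
          "Resource": "arn:aws:s3:::your-bucket"
        }
      ]
    }

Note that the object-level actions apply to the bucket contents (the /* resource), while s3:ListBucket applies to the bucket itself.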

When Data Connect syncs data to S3, it creates the following structure:

s3://your-bucket/[optional-prefix]/
├── users/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── sessions/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── pageviews/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
├── [custom_event_name]/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
└── ...

The data is organized by:

  • Table name (users, sessions, pageviews, custom events)
  • Date partition (based on sync date)
  • Part files (data is split into multiple files)
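
For example, you can browse the synced layout with the AWS CLI; the bucket name, prefix, and date below are placeholders:

    # List the top-level table folders
    aws s3 ls s3://your-bucket/your-prefix/

    # List the part files for one date partition of the sessions table
    aws s3 ls s3://your-bucket/your-prefix/sessions/date=2023-01-01/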

Data Connect supports multiple file formats for S3 exports:

Format  | Description             | Best For
------- | ----------------------- | ---------------------------------
JSON    | Line-delimited JSON     | Flexibility, human readability
CSV     | Comma-separated values  | Compatibility, ease of processing
Parquet | Columnar storage format | Performance, efficient querying

Compress data to reduce storage costs and improve transfer speeds:

Compression | Pros                                      | Cons
----------- | ----------------------------------------- | ----------------------------
GZIP        | High compression ratio, widely supported  | Slower decompression
Snappy      | Fast compression/decompression            | Lower compression ratio
None        | No processing overhead                    | Larger storage requirements
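
To spot-check a compressed export without saving it locally, you can stream an object through the AWS CLI; the object key below is illustrative, and actual part-file names will vary:

    # Stream a GZIP-compressed JSON part file to stdout and show the first few records
    aws s3 cp s3://your-bucket/your-prefix/sessions/date=2023-01-01/part-00000.json.gz - | gunzip -c | head -n 5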

Data in S3 is partitioned by date, which offers several benefits:

  • Efficient querying by date ranges
  • Easier management of data lifecycle
  • Improved performance when using services like Athena

Unlike direct warehouse integrations, S3 data requires additional processing for analysis:

Amazon Athena is a serverless interactive query service that lets you analyze data directly in S3 using standard SQL.

  1. Create an Athena database:

    CREATE DATABASE heap_data;
  2. Create external tables pointing to your S3 data:

    -- `time` and `date` are escaped with backticks because they are reserved words in Athena DDL
    CREATE EXTERNAL TABLE heap_data.sessions (
      session_id STRING,
      user_id STRING,
      `time` TIMESTAMP
      -- add other columns as needed
    )
    PARTITIONED BY (`date` STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 's3://your-bucket/[optional-prefix]/sessions/'
    TBLPROPERTIES ('parquet.compression'='SNAPPY');
  3. Load partitions:

    MSCK REPAIR TABLE heap_data.sessions;
  4. Query the data:

    SELECT
      user_id,
      COUNT(DISTINCT session_id) AS session_count
    FROM
      heap_data.sessions
    WHERE
      date >= '2023-01-01'
    GROUP BY
      user_id
    ORDER BY
      session_count DESC
    LIMIT 100;
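
The table definition above assumes Parquet files. If you export line-delimited JSON instead, the DDL would use a JSON SerDe; a sketch (the table name sessions_json is only an example) could look like this:

    CREATE EXTERNAL TABLE heap_data.sessions_json (
      session_id STRING,
      user_id STRING,
      `time` TIMESTAMP
    )
    PARTITIONED BY (`date` STRING)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://your-bucket/[optional-prefix]/sessions/';

Athena decompresses GZIP-compressed files automatically based on the .gz extension, so the same definition works for compressed JSON exports.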

AWS Glue is a fully managed ETL service that can automatically discover and catalog metadata from S3 data.

  1. Create a Glue Crawler to catalog your Contentsquare data in S3
  2. Run the crawler to discover schema and create table definitions
  3. Query the data using Athena, or process it using Glue ETL jobs
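
If you prefer to script this, the crawler can also be created and started with the AWS CLI; the crawler name, IAM role, database, and path below are placeholders:

    # Create a crawler that catalogs the exported tables into the heap_data database
    aws glue create-crawler \
      --name contentsquare-s3-crawler \
      --role AWSGlueServiceRole-DataConnect \
      --database-name heap_data \
      --targets '{"S3Targets": [{"Path": "s3://your-bucket/your-prefix/"}]}'

    # Run the crawler to discover schemas and create table definitions
    aws glue start-crawler --name contentsquare-s3-crawler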

The S3 integration works with many other data processing tools:

  • Amazon EMR: Process data using Spark, Hive, or Presto
  • AWS Lambda: Create event-driven processing for new data (see the example notification configuration below)
  • Third-party tools: Connect tools such as Databricks, Snowflake, or Tableau directly to your S3 data
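
For the Lambda option, new part files can trigger a function through S3 event notifications. A sketch of a notification configuration, applied with aws s3api put-bucket-notification-configuration, is shown below; the function ARN, account ID, and prefix are placeholders:

    {
      "LambdaFunctionConfigurations": [
        {
          "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-new-export",
          "Events": ["s3:ObjectCreated:*"],
          "Filter": {
            "Key": {
              "FilterRules": [
                { "Name": "prefix", "Value": "your-prefix/sessions/" }
              ]
            }
          }
        }
      ]
    }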

The S3 integration handles identity resolution differently from other Data Connect destinations:

  • Identity updates: When users are identified or merged in Contentsquare, historical data files may be overwritten with updated identity information
  • File versioning: Consider enabling S3 versioning if you need to track changes to identity data
  • Data freshness: Always use the most recent data files for analysis to ensure the most current identity resolution
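
If you decide to enable versioning to track identity updates, it can be turned on with a single CLI call (the bucket name is a placeholder):

    aws s3api put-bucket-versioning \
      --bucket your-bucket \
      --versioning-configuration Status=Enabled

Keep in mind that versioning retains every overwritten copy of a file, which increases storage costs; pair it with a lifecycle rule for noncurrent versions if needed.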

Optimize S3 storage and access costs:

  1. Lifecycle policies: Set up lifecycle rules to transition older data to lower-cost storage tiers

    {
      "Rules": [
        {
          "ID": "transition-old-data",
          "Status": "Enabled",
          "Filter": { "Prefix": "your-prefix/" },
          "Transitions": [
            {
              "Days": 90,
              "StorageClass": "STANDARD_IA"
            }
          ]
        }
      ]
    }
  2. Request optimization: Minimize LIST operations by using well-organized prefixes

  3. Compression: Use compression to reduce storage costs

  4. Clean up temporary files: Remove temporary or duplicate files regularly (see the example expiration rule below)
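
As a complement to the transition rule in step 1, an expiration rule can automate cleanup; the your-prefix/tmp/ prefix below is purely illustrative, so point it at whichever location holds files you no longer need:

    {
      "Rules": [
        {
          "ID": "expire-temporary-files",
          "Status": "Enabled",
          "Filter": { "Prefix": "your-prefix/tmp/" },
          "Expiration": { "Days": 7 }
        }
      ]
    }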