---
title: Amazon S3 - Data Connect
description: Instructions for setting up and using Connect with Amazon S3
lastUpdated: 19 February 2026
source_url:
  html: https://docs.contentsquare.com/en/connect/amazon-s3/
  md: https://docs.contentsquare.com/en/connect/amazon-s3/index.md
---

Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.

Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.

## Prerequisites

Before setting up the S3 integration, ensure you have:

* **An AWS account** with S3 access
* **An S3 bucket** to store Contentsquare data
* **AWS credentials** with appropriate permissions for the S3 bucket

## Set up AWS S3

1. Create an S3 bucket to store Contentsquare data (if you don't already have one)

2. Create an IAM user or role with appropriate permissions.

   The IAM policy should include:

   * `s3:PutObject`
   * `s3:GetObject`
   * `s3:ListBucket`
   * `s3:DeleteObject`
   * `s3:PutObjectAcl`

3. Apply the appropriate bucket policy for your region:

   US

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "Stmt1441164338000",
         "Effect": "Allow",
         "Action": [
           "s3:*"
         ],
         "Resource": [
       "arn:aws:s3:::<BUCKET>",
       "arn:aws:s3:::<BUCKET>/*"
         ],
         "Principal": {
           "AWS": [
             "arn:aws:iam::085120003701:root"
           ]
         }
       }
     ]
   }
   ```

   EU

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "ConnectEU",
         "Effect": "Allow",
         "Action": [
           "s3:*"
         ],
         "Resource": [
       "arn:aws:s3:::<BUCKET>",
       "arn:aws:s3:::<BUCKET>/*"
         ],
         "Principal": {
           "AWS": [
             "arn:aws:iam::556519846140:root"
           ]
         }
       }
     ]
   }
   ```

4. Generate AWS access keys for the IAM user (if using user-based authentication).

## Configure Contentsquare Connect

**Admin only**

Configuration of this destination is only available to admin users in Contentsquare.

1. Log in to Contentsquare.
2. Navigate to **Analysis setup** > **Data Connect**.
3. Create the S3 bucket `csq-rs3-<bucket_name>` to sync Data Connect data with.
4. Select **Next**.
5. Add the displayed policy to your CSQ bucket on S3.
6. Input your S3 credentials to connect to your bucket.
7. Select **Connect**.

Once setup is complete, you’ll see a sync within 24 hours containing the built-in tables (users, sessions, pageviews) and any custom event tables you’ve enabled.

## Understand your data delivery

### Data organization in S3

When Data Connect syncs data to S3, it creates the following structure:

```plaintext
s3://your-bucket/
  ├── sync_[sync_id]/
  │   ├── _heap_table_name=users/
  │   │   ├── part-00000-[uuid].avro
  │   │   ├── part-00001-[uuid].avro
  │   │   └── ...
  │   ├── _heap_table_name=sessions/
  │   │   ├── part-00000-[uuid].avro
  │   │   ├── part-00001-[uuid].avro
  │   │   └── ...
  │   ├── _heap_table_name=pageviews/
  │   │   └── ...
  │   ├── _heap_table_name=[custom_event_name]/
  │   │   └── ...
  │   └── ...
  ├── manifests/
  │   └── ...
  └── sync_reports/
      └── ...
```

Each sync folder follows the pattern `sync_[sync_id]/_heap_table_name=[table_name]/part-[part_number]-[uuid].avro`. For example:

`sync_1010103140/_heap_table_name=sessions/part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro`

The data is organized by:

* Sync ID (a unique identifier for each sync operation)
* Table name (users, sessions, pageviews, custom events), prefixed with `_heap_table_name=`
* Part files (data is split into multiple Avro-encoded files)
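As one way to consume this layout, the path components can be pulled apart with a small parser. The regular expression below is an assumption that mirrors the example path above, not an official format guarantee:

```python
import re

# Matches keys like:
#   sync_1010103140/_heap_table_name=sessions/part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro
PATH_RE = re.compile(
    r"sync_(?P<sync_id>\d+)/"
    r"_heap_table_name=(?P<table>[^/]+)/"
    r"part-(?P<part>\d+)-(?P<uuid>[0-9a-f-]+)\.avro$"
)

def parse_sync_path(key):
    """Return (sync_id, table, part_number) for a data file key, or None."""
    m = PATH_RE.search(key)
    if m is None:
        return None
    return int(m.group("sync_id")), m.group("table"), int(m.group("part"))
```

A `None` result also gives you a cheap way to skip non-data keys such as objects under `manifests/` or `sync_reports/`.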

### Manifest metadata

Each periodic data delivery will be accompanied by a manifest metadata file, which will describe the target schema and provide a full list of relevant data files for each table.

```json
{
  "dump_id": 1234,
  "tables": [
    {
      "name": "users",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=users/part-00000-a97432cba49732.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00001-584cdba3973c32.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00002-32917bc3297a3c.avro"
      ],
      "columns": [
        "user_id",
        "last_modified",
        // ...
      ],
      "incremental": true
    },
    {
      "name": "user_migrations",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00000-2a345bc452456c.avro",
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00001-4382abc432862c.avro"
      ],
      "columns": [
        "from_user_id",
        "to_user_id",
        // ...
      ],
      "incremental": false // always false for migrations
    },
    {
      "name": "defined_event",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=defined_event/part-00000-2fa2dbe2456c.avro"
      ],
      "columns": [
        "user_id",
        "event_id",
        "time",
        "session_id",
        // ...
      ],
      "incremental": true
    }
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}
```

It includes the following information:

* `dump_id`: A monotonically increasing sequence number for dumps.

* `tables`: For each table synced:

  * `name`: The name of the table.
  * `columns`: An array of the columns contained in the table. Use it to determine which columns need to be added or removed downstream.
  * `files`: An array of full S3 paths to the Avro-encoded files for the table.
  * `incremental`: A boolean denoting whether the data for the table is incremental on top of previous dumps. A value of `false` means it is a full resync of this table, and all previous data is invalid.

* `property_definitions`: The S3 path to the defined property definitions file.

Ignore any files in the data delivery that aren't listed in the manifest metadata file.
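A minimal sketch of honoring that contract: build the processing plan strictly from the manifest, so unlisted files are never read. The helper name is hypothetical:

```python
import json

def files_to_process(manifest_json):
    """Build a per-table processing plan from a manifest, using only the
    files the manifest lists. incremental=False signals a full resync, so
    previously loaded data for that table should be dropped first."""
    manifest = json.loads(manifest_json)
    return {
        table["name"]: {
            "files": table["files"],
            "incremental": table["incremental"],
        }
        for table in manifest["tables"]
    }
```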

### Sync reporting

Each sync will be accompanied by a sync log file that reports on delivery status. These log files will be placed in the `sync_reports` directory. Each report will be in a JSON format as follows:

```json
{
  "start_time":1566968405225,
  "finish_time":1566968649169,
  "status":"succeeded",
  "next_sync_at":1567054800000,
  "error":null
}
```

`start_time`, `finish_time`, and `next_sync_at` are represented as epoch timestamps.
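Since these fields are epoch milliseconds, a consumer typically normalizes them before alerting or scheduling. A small sketch (the function name is illustrative):

```python
from datetime import datetime, timezone

def parse_sync_report(report):
    """Convert the epoch-millisecond fields of a sync report to aware UTC
    datetimes; status and error pass through unchanged."""
    out = dict(report)
    for field in ("start_time", "finish_time", "next_sync_at"):
        if report.get(field) is not None:
            out[field] = datetime.fromtimestamp(report[field] / 1000, tz=timezone.utc)
    return out
```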

See [Data Syncing](https://docs.contentsquare.com/en/connect/data-schema/#data-syncing) to learn how the data will be structured upon sync.

### Data types and formats

The `user_id`, `event_id`, and `session_id` are the only columns that are long types. All other columns should be inferred as string types.
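When loading the Avro files into a typed store, that rule can be applied mechanically; a sketch under those assumptions (the column set comes from the statement above):

```python
# The only columns delivered as 64-bit integers; everything else is a string.
LONG_COLUMNS = {"user_id", "event_id", "session_id"}

def coerce(column, value):
    """Coerce a raw value to the documented type for its column."""
    if value is None:
        return None
    return int(value) if column in LONG_COLUMNS else str(value)
```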

## Process your S3 data

When working with Contentsquare data in S3, you need to perform several processing steps to ensure data accuracy and completeness. The following sections outline the key considerations for building your ETL pipeline.

### Identity resolution

Contentsquare's identify API allows you to create a single, cohesive view of a user across devices, browsers, and domains. When a user is identified via the identify API, their anonymous `user_id` is updated to a new `user_id`, which is a hash of the identity. Once this is set up, when this user is identified on another device or browser, this tells us that these users are the same, and that we should join their data together.

Internally, we call this a “user migration” because we are migrating user and event data from one record to another. In other words, we are resolving data from two users into one identity. This is identity resolution, at a high-level.

In the Contentsquare app, we handle this for you. In Redshift exports, we resolve identities on write; in Snowflake and BigQuery exports, we resolve them in a view. In S3 exports, however, we do not resolve identities for you. To merge this user activity correctly, you must resolve identities yourself using the `user_migrations` table.

How do I apply this identity resolution mapping in my data warehouse? The `user_migrations` table contains a mapping of `from_user_id` to `to_user_id`, and should be joined against the `users` table as well as all pertinent event tables. You should create a view based on these joins that will refresh on a regular cadence.

Below is an example of the view you should create to resolve identity on the users table - join `user_migrations.from_user_id` on `users.user_id`, and then coalesce `to_user_id` and `user_id` to obtain the user's final state:

```sql
CREATE VIEW users_view AS
SELECT
  user_id,
  MIN("joindate") AS "joindate",
  MAX("last_modified") AS "last_modified",
  MAX("identity") AS "identity",
  MAX("handle") AS "handle",
  MAX("email") AS "email"
FROM
(
  SELECT
  COALESCE("to_user_id", "user_id") AS "user_id",
  "joindate","last_modified","identity","handle","email"
  FROM users u
  LEFT JOIN user_migrations m
    ON u.user_id = m.from_user_id
) x
GROUP BY user_id;
```

The following example illustrates the view you should create to resolve identity on each of your synced event tables. Make sure to select all unique columns from each event table in order to replicate the desired table with migrations applied. Each time you toggle on a new event table to sync in the Contentsquare UI, create this migrated view for that table:

```sql
CREATE VIEW example_event_migrated_view AS
SELECT
COALESCE("to_user_id", "user_id") AS "user_id",
"event_column_1","event_column_2","event_id","session_id","time","session_time","type","library","platform","device_type","country","region","city","ip","referrer","landing_page","browser","search_keyword","utm_source","utm_campaign","utm_medium","utm_term","utm_content","domain","query","path","hash","title","href","target_text"
FROM example_event_to_be_migrated e
LEFT JOIN user_migrations m
  ON e.user_id = m.from_user_id
;
```

### Deduplication

Data across dumps/files is not guaranteed to be disjoint. As a result, downstream consumers are responsible for deduplication, which must happen after applying user migrations. Here's a strategy you can adopt:

| Table | De-duplication Columns |
| - | - |
| Sessions | `session_id`, `user_id` |
| Users | `user_id` |
| Event tables | `event_id`, `user_id` |
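Assuming rows have been decoded into dicts, the strategy in the table above can be sketched as a keep-last pass over the combined data (the helper name is hypothetical):

```python
def deduplicate(rows, key_columns):
    """Keep exactly one row per key tuple. Later occurrences replace
    earlier ones, so run this after applying user migrations and in
    sync-delivery order."""
    latest = {}
    for row in rows:
        latest[tuple(row[c] for c in key_columns)] = row
    return list(latest.values())
```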

### Handle user updates

Updated users (users with properties that have changed since the last sync) will re-appear in the sync files, and thus every repeated occurrence of a user (check on `user_id`) should replace the old one to ensure that the corresponding property updates are picked up.
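In code, that replace-on-repeat rule amounts to upserting on `user_id` as each sync is processed; a minimal sketch under those assumptions:

```python
def apply_user_rows(users_by_id, synced_rows):
    """Upsert each delivered user row keyed on user_id, so a re-appearing
    user overwrites the stored row and its property updates are picked up."""
    for row in synced_rows:
        users_by_id[row["user_id"]] = row
    return users_by_id
```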

### Apply user migrations

`user_migrations` is a fully materialized mapping of `from_user_ids` to `to_user_ids`. Downstream consumers are responsible for joining it with the events and users tables to resolve identity retroactively.

### Process property definitions

For v2, we only sync defined property definitions rather than the actual defined property values. Downstream consumers are responsible for applying these definitions to generate the defined property values for each row.

### Manage schema evolution

Schemas are expected to evolve over time: properties can be added to the user and events tables.

## Limitations

Contentsquare does not perform deduplication or identity resolution: your organization needs to manage the ETL process.

## FAQs

### Why don't I see initial user properties in Data Connect?

Initial User properties (such as Initial Marketing Channel or Initial Browser) are not currently synced downstream in Data Connect. See [Creating An Enhanced Users Table](https://docs.contentsquare.com/en/connect/common-queries/#creating-an-enhanced-users-table) for an example of how to recreate those properties in your warehouse.

### What triggers a full sync for tables in S3?

The addition of new events, as well as toggling an event on or off within the Contentsquare UI, will result in the full sync of an individual table. Note that the `user_migrations` table always does a full sync.

### How do I set up encryption with S3?

In terms of server-side encryption, Contentsquare currently supports only the Amazon S3-managed keys (SSE-S3) encryption key type. Buckets using the AWS Key Management Service key (SSE-KMS) encryption key type are **not** currently supported.

No additional user/role in S3 is required.

For instructions on how to edit your bucket's default encryption, see [AWS documentation ↗](https://docs.aws.amazon.com/AmazonS3/latest/userguide/default-bucket-encryption.html).

### Does an S3 schema change initiate a full sync?

Schema changes should not initiate a full sync. In cases where a property is synced/unsynced from within the Contentsquare app, the property (column) will either be added or stop being included going forward. However, if you need to populate the column retroactively, a full resync of a given table is needed.

### How often is the schema of each table expected to change?

For the users, pageviews, sessions, and `user_migrations` tables and their respective built-in properties, the schema typically stays the same. However, it is common for properties to be added/removed (archived/unsynced) from the Contentsquare app.

Note that when a new property is captured, it is automatically included in Data Connect.

### What happens after an S3 sync fails?

The majority of sync failures will re-attempt and resolve themselves. In rare cases, manual intervention may be required.

### How can I tell if an S3 sync is completed for today?

Completion of a sync is signaled by the delivery of a new manifest file. You should poll `s3://<BUCKET>/manifests/*` for new manifests. Once a manifest is delivered, you can process it via your ETL pipeline.
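A polling loop would typically list the `manifests/` prefix (for example with boto3's `list_objects_v2`) and diff the result against the keys it has already handled; the pure bookkeeping half of that can be sketched as:

```python
def unprocessed_manifests(listed_keys, handled_keys):
    """Return manifest keys not yet processed, sorted so they can be
    handled in order. listed_keys is whatever your listing of
    s3://<BUCKET>/manifests/ returned."""
    handled = set(handled_keys)
    return sorted(k for k in listed_keys if k not in handled)
```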

### What is the numeric value appended to each sync folder?

This value is the `dump_id` which is appended to each sync and is used to associate the folder to the respective manifest located in `s3://<BUCKET>/manifests/*`. For example, the folder name may look like `sync_123456789/`.
