Amazon S3
Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.
Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.
Prerequisites
Before setting up the S3 integration, ensure you have:
- An AWS account with S3 access
- An S3 bucket to store Contentsquare data
- AWS credentials with appropriate permissions for the S3 bucket
Set up AWS S3
1. Create an S3 bucket to store Contentsquare data (if you don't already have one).

2. Create an IAM user or role with appropriate permissions. The IAM policy should include:

   - `s3:PutObject`
   - `s3:GetObject`
   - `s3:ListBucket`
   - `s3:DeleteObject`
   - `s3:PutObjectAcl`

3. Apply the appropriate bucket policy for your region, replacing `<bucket-name>` with the name of your bucket.

   For US-hosted accounts:

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "Stmt1441164338000",
         "Effect": "Allow",
         "Action": ["s3:*"],
         "Resource": [
           "arn:aws:s3:::<bucket-name>",
           "arn:aws:s3:::<bucket-name>/*"
         ],
         "Principal": {
           "AWS": ["arn:aws:iam::085120003701:root"]
         }
       }
     ]
   }
   ```

   For EU-hosted accounts (Sid `ConnectEU`):

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "ConnectEU",
         "Effect": "Allow",
         "Action": ["s3:*"],
         "Resource": [
           "arn:aws:s3:::<bucket-name>",
           "arn:aws:s3:::<bucket-name>/*"
         ],
         "Principal": {
           "AWS": ["arn:aws:iam::556519846140:root"]
         }
       }
     ]
   }
   ```

4. Generate AWS access keys for the IAM user (if using user-based authentication).
Configure Contentsquare Connect
1. Log in to Contentsquare.
2. Navigate to Analysis setup > Data Connect.
3. Create the S3 bucket `csq-rs3-<bucket_name>` to sync Data Connect data with.
4. Select Next.
5. Add the displayed policy to your CSQ bucket on S3.
6. Input your S3 credentials to connect to your bucket.
7. Select Connect.
Once setup is complete, you'll see a sync within 24 hours containing the built-in tables (users, sessions, pageviews, and your synced custom events).
Understand your data delivery
Data organization in S3
When Data Connect syncs data to S3, it creates the following structure:
```
s3://your-bucket/
├── sync_[sync_id]/
│   ├── _heap_table_name=users/
│   │   ├── part-00000-[uuid].avro
│   │   ├── part-00001-[uuid].avro
│   │   └── ...
│   ├── _heap_table_name=sessions/
│   │   ├── part-00000-[uuid].avro
│   │   ├── part-00001-[uuid].avro
│   │   └── ...
│   ├── _heap_table_name=pageviews/
│   │   └── ...
│   ├── _heap_table_name=[custom_event_name]/
│   │   └── ...
│   └── ...
├── manifests/
│   └── ...
└── sync_reports/
    └── ...
```

Each sync folder follows the pattern `sync_[sync_id]/_heap_table_name=[table_name]/part-[part_number]-[uuid].avro`. For example:
`sync_1010103140/_heap_table_name=sessions/part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro`
The data is organized by:
- Sync ID (a unique identifier for each sync operation)
- Table name (users, sessions, pageviews, custom events), prefixed with `_heap_table_name=`
- Part files (data is split into multiple Avro-encoded files)
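The naming convention above can be parsed mechanically in an ETL pipeline. The following Python snippet is an illustrative sketch (the helper is not part of the integration): a regular expression pulls the sync ID, table name, and part number out of an object key.

```python
import re

# Pattern from the layout above:
# sync_[sync_id]/_heap_table_name=[table_name]/part-[part_number]-[uuid].avro
PART_FILE_RE = re.compile(
    r"sync_(?P<sync_id>\d+)/"
    r"_heap_table_name=(?P<table>[^/]+)/"
    r"part-(?P<part>\d+)-(?P<uuid>[0-9a-f-]+)\.avro$"
)

def parse_part_key(key: str) -> dict:
    """Extract sync ID, table name, and part number from an S3 object key."""
    m = PART_FILE_RE.search(key)
    if m is None:
        raise ValueError(f"not a Data Connect part file: {key}")
    return {
        "sync_id": int(m.group("sync_id")),
        "table": m.group("table"),
        "part": int(m.group("part")),
    }

info = parse_part_key(
    "sync_1010103140/_heap_table_name=sessions/part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro"
)
# info == {"sync_id": 1010103140, "table": "sessions", "part": 61}
```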
Manifest metadata
Each periodic data delivery will be accompanied by a manifest metadata file, which will describe the target schema and provide a full list of relevant data files for each table.
```json
{
  "dump_id": 1234,
  "tables": [
    {
      "name": "users",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=users/part-00000-a97432cba49732.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00001-584cdba3973c32.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00002-32917bc3297a3c.avro"
      ],
      "columns": [
        "user_id",
        "last_modified"
        // ...
      ],
      "incremental": true
    },
    {
      "name": "user_migrations",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00000-2a345bc452456c.avro",
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00001-4382abc432862c.avro"
      ],
      "columns": [
        "from_user_id",
        "to_user_id"
        // ...
      ],
      "incremental": false // always false for migrations
    },
    {
      "name": "defined_event",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=defined_event/part-00000-2fa2dbe2456c.avro"
      ],
      "columns": [
        "user_id",
        "event_id",
        "time",
        "session_id"
        // ...
      ],
      "incremental": true
    }
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}
```

It includes the following information:
- `dump_id`: A monotonically increasing sequence number for dumps.
- `tables`: For each table synced:
  - `name`: The name of the table.
  - `columns`: An array consisting of the columns contained in the table. This can be used to determine which columns need to be added or removed downstream.
  - `files`: An array of full S3 paths to the Avro-encoded files for the relevant table.
  - `incremental`: A boolean denoting whether the data for the table is incremental on top of previous dumps. A value of `false` means it is a full/fresh resync of this table, and all previous data is invalid.
- `property_definitions`: The S3 path to the defined property definitions file.
Ignore any files in the data delivery that aren't listed in the manifest metadata file.
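As an illustrative Python sketch of that rule (the function name and sample manifest are hypothetical), build the work list purely from the manifest, so anything else in the delivery is ignored by construction:

```python
import json

def files_to_process(manifest_json: str) -> dict:
    """Return {table_name: [file paths]} from a manifest.

    Only files listed here should be processed; any other objects in the
    delivery are ignored.
    """
    manifest = json.loads(manifest_json)
    return {t["name"]: t["files"] for t in manifest["tables"]}

# Trimmed sample manifest for illustration
manifest = """{
  "dump_id": 1234,
  "tables": [
    {"name": "users",
     "files": ["s3://customer/sync_1234/_heap_table_name=users/part-00000-a97432cba49732.avro"],
     "columns": ["user_id", "last_modified"],
     "incremental": true}
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}"""
print(files_to_process(manifest))
```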
Sync reporting
Each sync will be accompanied by a sync log file that reports on delivery status. These log files will be placed in the `sync_reports` directory. Each report will be in a JSON format as follows:
```json
{
  "start_time": 1566968405225,
  "finish_time": 1566968649169,
  "status": "succeeded",
  "next_sync_at": 1567054800000,
  "error": null
}
```

`start_time`, `finish_time`, and `next_sync_at` are represented as epoch timestamps in milliseconds.
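For monitoring, the epoch-millisecond timestamps can be rendered as UTC datetimes. A minimal Python sketch (the helper name is illustrative):

```python
from datetime import datetime, timezone

def summarize_report(report: dict) -> str:
    """Render a sync report's epoch-millisecond timestamps as UTC datetimes."""
    start = datetime.fromtimestamp(report["start_time"] / 1000, tz=timezone.utc)
    finish = datetime.fromtimestamp(report["finish_time"] / 1000, tz=timezone.utc)
    return f"{report['status']}: {start.isoformat()} -> {finish.isoformat()}"

report = {"start_time": 1566968405225, "finish_time": 1566968649169,
          "status": "succeeded", "next_sync_at": 1567054800000, "error": None}
print(summarize_report(report))
```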
See Data Syncing to learn how the data will be structured upon sync.
Data types and formats
The `user_id`, `event_id`, and `session_id` columns are the only columns with a long type. All other columns should be inferred as string types.
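This typing rule is simple enough to encode directly when building downstream schemas. A minimal Python sketch, assuming only the rule stated above:

```python
# Per the rule above: only these three columns are longs.
LONG_COLUMNS = {"user_id", "event_id", "session_id"}

def column_type(name: str) -> str:
    """user_id/event_id/session_id are longs; everything else is a string."""
    return "long" if name in LONG_COLUMNS else "string"

schema = {c: column_type(c) for c in ["user_id", "session_id", "time", "path"]}
# schema == {"user_id": "long", "session_id": "long", "time": "string", "path": "string"}
```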
Process your S3 data
When working with Contentsquare data in S3, you need to perform several processing steps to ensure data accuracy and completeness. The following sections outline the key considerations for building your ETL pipeline.
Identity resolution
Contentsquare's identify API allows you to create a single, cohesive view of a user across devices, browsers, and domains. When a user is identified via the identify API, their anonymous user_id is updated to a new user_id, which is a hash of the identity. Once this is set up, when this user is identified on another device or browser, this tells us that these users are the same, and that we should join their data together.
Internally, we call this a “user migration” because we are migrating user and event data from one record to another. In other words, we are resolving data from two users into one identity. This is identity resolution, at a high-level.
In the Contentsquare app, we handle this for you. In Redshift exports, we resolve identity on write; in Snowflake and BigQuery exports, we resolve it in a view. However, in S3 exports, we do not resolve identities for you. To merge this user activity correctly, you must resolve identity using the user_migrations table.
How do I apply this identity resolution mapping in my data warehouse?
The user_migrations table contains a mapping of from_user_id to to_user_id, and should be joined against the users table as well as all pertinent event tables. You should create a view based on these joins that will refresh on a regular cadence.
Below is an example of the view you should create to resolve identity on the users table - join user_migrations.from_user_id on users.user_id, and then coalesce to_user_id and user_id to obtain the user's final state:
```sql
CREATE VIEW users_view AS
SELECT
  user_id,
  MIN("joindate") AS "joindate",
  MAX("last_modified") AS "last_modified",
  MAX("identity") AS "identity",
  MAX("handle") AS "handle",
  MAX("email") AS "email"
FROM (
  SELECT
    COALESCE("to_user_id", "user_id") AS "user_id",
    "joindate", "last_modified", "identity", "handle", "email"
  FROM users u
  LEFT JOIN user_migrations m ON u.user_id = m.from_user_id
) x
GROUP BY user_id;
```

The following example illustrates the view you should create to resolve identity on each of your synced event tables. Make sure to select all unique columns from each event table in order to replicate the desired table with migrations applied. Each time you toggle on a new event table to sync in the Contentsquare UI, create this migrated view for that table:
```sql
CREATE VIEW example_event_migrated_view AS
SELECT
  COALESCE("to_user_id", "user_id") AS "user_id",
  "event_column_1",
  "event_column_2",
  "event_id",
  "session_id",
  "time",
  "session_time",
  "type",
  "library",
  "platform",
  "device_type",
  "country",
  "region",
  "city",
  "ip",
  "referrer",
  "landing_page",
  "browser",
  "search_keyword",
  "utm_source",
  "utm_campaign",
  "utm_medium",
  "utm_term",
  "utm_content",
  "domain",
  "query",
  "path",
  "hash",
  "title",
  "href",
  "target_text"
FROM example_event_to_be_migrated e
LEFT JOIN user_migrations m ON e.user_id = m.from_user_id;
```

Deduplication
Data across dumps/files is not guaranteed to be disjoint. As a result, downstream consumers are responsible for de-duplication. De-duplication must happen after applying user migrations. Here's a strategy you can adopt:
| Table | De-duplication Columns |
|---|---|
| Sessions | session_id, user_id |
| Users | user_id |
| Event tables | event_id, user_id |
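As an illustrative Python sketch of this strategy (table and column names follow the table above; everything else is hypothetical), keep the last occurrence of each key so that repeated rows, such as re-synced users with updated properties, replace earlier ones:

```python
# De-duplication keys per table, as listed above. Event tables default to
# (event_id, user_id). Run this only after user migrations are applied.
DEDUP_KEYS = {
    "sessions": ("session_id", "user_id"),
    "users": ("user_id",),
}

def deduplicate(table: str, rows: list[dict]) -> list[dict]:
    keys = DEDUP_KEYS.get(table, ("event_id", "user_id"))
    latest = {}
    for row in rows:  # rows arrive in sync order, so the last write wins
        latest[tuple(row[k] for k in keys)] = row
    return list(latest.values())

rows = [
    {"user_id": 1, "email": "old@example.com"},
    {"user_id": 2, "email": "b@example.com"},
    {"user_id": 1, "email": "new@example.com"},  # repeated user: keep this one
]
print(deduplicate("users", rows))
```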
Handle user updates
Updated users (users whose properties have changed since the last sync) will reappear in the sync files, so every repeated occurrence of a user (keyed on user_id) should replace the old one to ensure that the corresponding property updates are picked up.
Apply user migrations
user_migrations is a fully materialized mapping of from_user_ids to to_user_ids. Downstream consumers are responsible for joining it with the events and users tables to resolve identity retroactively.
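A Python analogue of the SQL views above, shown as an illustrative sketch: map from_user_id to to_user_id and fall back to the original user_id when no migration exists (the same effect as the COALESCE in the views):

```python
def apply_migrations(rows: list[dict], migrations: list[dict]) -> list[dict]:
    """Rewrite each row's user_id via the user_migrations mapping."""
    mapping = {m["from_user_id"]: m["to_user_id"] for m in migrations}
    return [{**row, "user_id": mapping.get(row["user_id"], row["user_id"])}
            for row in rows]

events = [{"user_id": 101, "event_id": 1}, {"user_id": 202, "event_id": 2}]
migrations = [{"from_user_id": 101, "to_user_id": 999}]
print(apply_migrations(events, migrations))
# [{'user_id': 999, 'event_id': 1}, {'user_id': 202, 'event_id': 2}]
```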
Process property definitions
For v2, we only sync defined property definitions rather than the actual defined property values. Downstream consumers are responsible for applying these definitions to generate the defined property values for each row.
Manage schema evolution
Schemas are expected to evolve over time: properties can be added to the user and events tables.
S3-Specific Considerations
Working with Data Connect in S3 requires understanding several S3-specific behaviors and best practices.
Full Sync Triggers
The following actions will trigger a full resync of individual tables:
- Adding new events: When you define and enable a new event in Contentsquare
- Toggling events on/off: Turning an event sync on or off in the Contentsquare UI
- Configuration changes: Modifying sync settings or warehouse configuration
The user_migrations table always performs a full sync, regardless of changes.
To minimize disruption:
- Plan event definition changes during low-traffic periods
- Test new events in a development environment first
- Monitor sync reports after making changes
Encryption Setup
Data Connect supports server-side encryption for S3 buckets.
Supported Encryption:
- SSE-S3 (Amazon S3-Managed Keys): Fully supported, no additional configuration required
- Set default bucket encryption in your S3 bucket settings
Not Supported:
- SSE-KMS (AWS Key Management Service Keys): Not currently supported
To enable SSE-S3 encryption:
- Navigate to your S3 bucket in the AWS Console
- Go to Properties > Default encryption
- Select AES-256 (SSE-S3)
- Save changes
No additional IAM roles or permissions are required for SSE-S3. See AWS documentation ↗ for detailed instructions.
Schema Change Handling
Schema changes in S3 behave differently than in other warehouses:
Adding Properties:
- New columns appear in subsequent syncs automatically
- Historical data will not include the new property (no backfill)
- The property will be present in new data files going forward
Removing Properties:
- Archived properties stop syncing immediately
- The column/field stops appearing in new data files
- Historical files still contain the property data
Property Type Changes:
- Not automatically handled - you must manage in your ETL process
- Consider versioning your data schema
- Use manifest files to track schema evolution
Unlike warehouse integrations, S3 does not automatically update table schemas. You are responsible for:
- Detecting schema changes via manifest files
- Updating your ETL process to handle new/removed columns
- Managing schema versioning in your data lake
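Detecting schema changes can be as simple as diffing a table's `columns` array across two consecutive manifests. An illustrative Python sketch:

```python
def schema_changes(prev_cols: list[str], curr_cols: list[str]) -> dict:
    """Compare a table's 'columns' arrays across two manifests."""
    prev, curr = set(prev_cols), set(curr_cols)
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}

print(schema_changes(["user_id", "time", "old_prop"],
                     ["user_id", "time", "new_prop"]))
# {'added': ['new_prop'], 'removed': ['old_prop']}
```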
Sync Failure Recovery
If a sync fails, Data Connect will automatically retry:
- Automatic Retries: Most failures are transient and resolve automatically
- Monitoring: Check sync reports in the `sync_reports` directory for status
- Manual Intervention: In rare cases, you may need to contact support
To monitor sync health:
```shell
# Check latest sync report
aws s3 cp s3://your-bucket/sync_reports/latest.json -

# Monitor for failed syncs
aws s3 ls s3://your-bucket/sync_reports/ | grep -i "failed"
```

Common failure causes:
- Insufficient S3 permissions
- Bucket policy misconfigurations
- S3 service disruptions
- Network connectivity issues
Sync Completion Status
To determine if a sync has completed:

1. Monitor manifest files: A new manifest signals sync completion.

   ```shell
   # List recent manifests
   aws s3 ls s3://your-bucket/manifests/ --recursive | tail -5
   ```

2. Check sync reports: Look for `"status": "succeeded"` in sync reports.

   ```shell
   # Get latest sync report
   aws s3 cp s3://your-bucket/sync_reports/$(aws s3 ls s3://your-bucket/sync_reports/ | tail -1 | awk '{print $4}') -
   ```

3. Validate data files: Confirm all expected tables have new data files.

   ```shell
   # List recent syncs
   aws s3 ls s3://your-bucket/ | grep sync_
   ```

4. Schedule ETL jobs: Trigger your ETL pipeline after manifest delivery.
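To decide which manifests still need an ETL run, compare the object keys currently under `manifests/` (for example, from an S3 listing) against a record of keys already processed. The sketch below is illustrative; key names and the state store are up to your pipeline:

```python
def new_manifests(listed_keys: list[str], seen_keys: set[str]) -> list[str]:
    """Return unprocessed manifest keys, in sorted order."""
    return sorted(k for k in listed_keys if k not in seen_keys)

listed = ["manifests/sync_1001.json", "manifests/sync_1002.json"]
print(new_manifests(listed, {"manifests/sync_1001.json"}))
# ['manifests/sync_1002.json']
```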
Numeric Folder Values
Each S3 sync creates folders with numeric identifiers:

Sync ID (`sync_[number]`):

- Incremental identifier for each sync operation
- Example: `sync_1234567890/`
- Used to organize data by sync batch
- Corresponds to `dump_id` in manifest files

Dump ID:

- Monotonically increasing sequence number
- Found in manifest metadata files
- Links manifest files to their corresponding data folders
- Example: `"dump_id": 1234567890` matches `sync_1234567890/`

Table Name Prefix:

- Each table is stored under `_heap_table_name=[table_name]/`
- Example: `sync_1234/_heap_table_name=sessions/`
- This format enables Hive-style partitioning in tools like Athena
Use these identifiers to:
- Track which sync batch each file belongs to
- Build incremental ETL processes
- Audit data lineage and sync history
- Partition data in your data lake
Limitations
Contentsquare does not perform deduplication or identity resolution in S3 exports: your organization needs to manage the ETL process.
Data Deduplication:
- Not performed automatically - you must deduplicate in your ETL
- Use the deduplication columns specified in the Amazon S3 setup guide
- Apply deduplication after identity resolution for accurate results
Timestamp Formats:
- All timestamp columns use UNIX timestamp format (milliseconds since epoch)
- Warehouses use native timestamp types (such as `TIMESTAMP`, `DATETIME`)
- You must convert timestamps during ETL:

  ```sql
  -- Example conversion in Athena
  FROM_UNIXTIME(joindate / 1000) AS joindate_timestamp
  ```
Schema Enforcement:
- No automatic schema validation or enforcement
- Your ETL process must handle schema mismatches
- Monitor manifest files for schema changes
Partitioning and Performance:
- Data is not pre-partitioned by date
- You should repartition data during ETL for optimal query performance
- Consider using Parquet or ORC formats with date-based partitioning
- Example partitioning strategy: partition by year, month, and day
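For date-based partitioning, the epoch-millisecond timestamps can be turned into Hive-style prefixes during ETL. An illustrative Python sketch (this prefix layout is one common choice, not a requirement of the integration):

```python
from datetime import datetime, timezone

def partition_prefix(epoch_ms: int) -> str:
    """Build a Hive-style year/month/day prefix from an epoch-ms timestamp."""
    t = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"year={t.year}/month={t.month:02d}/day={t.day:02d}"

print(partition_prefix(1566968405225))
# year=2019/month=08/day=28
```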
Why don't I see initial user properties in Data Connect?
Initial User properties (such as Initial Marketing Channel or Initial Browser) are not currently synced downstream in Data Connect. See Creating An Enhanced Users Table for an example of how to recreate those properties in your warehouse.
What triggers a full sync for tables in S3?
The addition of new events, as well as toggling an event on or off within the Contentsquare UI, will result in the full sync of an individual table. Note that the user_migrations table always does a full sync.
How do I set up encryption with S3?
In terms of server-side encryption, Contentsquare currently supports only the Amazon S3-managed keys (SSE-S3) encryption key type. Buckets using the AWS Key Management Service key (SSE-KMS) encryption key type are not currently supported.
No additional user/role in S3 is required.
For instructions on how to edit your bucket's default encryption, see AWS documentation ↗.
Does an S3 schema change initiate a full sync?
Schema changes should not initiate a full sync. In cases where a property is synced/unsynced from within the Contentsquare app, the property (column) will either be added or stop being included going forward. However, if you need to populate the column retroactively, a full resync of a given table is needed.
How often is the schema of each table expected to change?
For the users, pageviews, sessions, and user_migrations tables and their respective built-in properties, the schema typically stays the same. However, it is common for properties to be added/removed (archived/unsynced) from the Contentsquare app.
Note that when a new property is captured, it is automatically included in Data Connect.
What happens after an S3 sync fails?
The majority of sync failures will re-attempt and resolve themselves. In rare cases, manual intervention may be required.
How can I tell if an S3 sync is completed for today?
Completion of a sync is signaled by the delivery of a new manifest file. You should poll s3://<BUCKET>/manifests/* for new manifests. Once a manifest is delivered, you can process it via your ETL pipeline.
What is the numeric value appended to each sync folder?
This value is the dump_id which is appended to each sync and is used to associate the folder to the respective manifest located in s3://<BUCKET>/manifests/*. For example, the folder name may look like sync_123456789/.