Amazon S3

Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.

Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.

Before setting up the S3 integration, ensure you have:

  • An AWS account with S3 access
  • An S3 bucket to store Contentsquare data
  • AWS credentials with appropriate permissions for the S3 bucket

To set up the AWS side of the integration:

  1. Create an S3 bucket to store Contentsquare data (if you don't already have one).

  2. Create an IAM user or role with appropriate permissions.

    The IAM policy should include:

    • s3:PutObject
    • s3:GetObject
    • s3:ListBucket
    • s3:DeleteObject
    • s3:PutObjectAcl
  3. Apply the appropriate bucket policy for your region (a programmatic sketch follows these setup steps):

    US

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1441164338000",
          "Effect": "Allow",
          "Action": [
            "s3:*"
          ],
          "Resource": [
            "arn:aws:s3:::<BUCKET>",
            "arn:aws:s3:::<BUCKET>/*"
          ],
          "Principal": {
            "AWS": [
              "arn:aws:iam::085120003701:root"
            ]
          }
        }
      ]
    }

    EU

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ConnectEU",
          "Effect": "Allow",
          "Action": [
            "s3:*"
          ],
          "Resource": [
            "arn:aws:s3:::<BUCKET>",
            "arn:aws:s3:::<BUCKET>/*"
          ],
          "Principal": {
            "AWS": [
              "arn:aws:iam::556519846140:root"
            ]
          }
        }
      ]
    }
  4. Generate AWS access keys for the IAM user (if using user-based authentication).
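
If you prefer to apply the bucket policy programmatically, here is a minimal sketch using boto3. The bucket name is a placeholder, and the sketch assumes a US-hosted account; swap in the EU principal (556519846140) if applicable.

import json
import boto3

BUCKET = "csq-rs3-example"  # hypothetical bucket name

# The US bucket policy from step 3, with the bucket ARN filled in.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1441164338000",
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            # Use arn:aws:iam::556519846140:root for EU-hosted accounts.
            "Principal": {"AWS": ["arn:aws:iam::085120003701:root"]},
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))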

Configure Contentsquare Connect

  1. Log in to Contentsquare.
  2. Navigate to Analysis setup > Data Connect.
  3. Create the S3 bucket csq-rs3-<bucket_name> that Data Connect will sync data to.
  4. Select Next.
  5. Add the displayed policy to your CSQ bucket on S3.
  6. Input your S3 credentials to connect to your bucket.
  7. Select Connect.

Once setup is complete, you’ll see the first sync within 24 hours. It includes the built-in tables (users, sessions, and pageviews) as well as any custom events you’ve toggled on.

When Data Connect syncs data to S3, it creates the following structure:

s3://your-bucket/
├── sync_[sync_id]/
│   ├── _heap_table_name=users/
│   │   ├── part-00000-[uuid].avro
│   │   ├── part-00001-[uuid].avro
│   │   └── ...
│   ├── _heap_table_name=sessions/
│   │   ├── part-00000-[uuid].avro
│   │   ├── part-00001-[uuid].avro
│   │   └── ...
│   ├── _heap_table_name=pageviews/
│   │   └── ...
│   ├── _heap_table_name=[custom_event_name]/
│   │   └── ...
│   └── ...
├── manifests/
│   └── ...
└── sync_reports/
    └── ...

Each sync folder follows the pattern sync_[sync_id]/_heap_table_name=[table_name]/part-[part_number]-[uuid].avro. For example:

sync_1010103140/_heap_table_name=sessions/part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro

The data is organized by:

  • Sync ID (a unique identifier for each sync operation)
  • Table name (users, sessions, pageviews, custom events), prefixed with _heap_table_name=
  • Part files (data is split into multiple Avro-encoded files)
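
If your pipeline needs to parse these keys, here is a minimal sketch (not an official utility) that extracts the sync ID, table name, and part file name from an object key:

import re

# Matches sync_[sync_id]/_heap_table_name=[table_name]/part-[part_number]-[uuid].avro
KEY_PATTERN = re.compile(
    r"sync_(?P<sync_id>\d+)/_heap_table_name=(?P<table>[^/]+)/"
    r"(?P<part>part-\d+-[0-9a-f-]+\.avro)$"
)

key = "sync_1010103140/_heap_table_name=sessions/part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro"
match = KEY_PATTERN.search(key)
if match:
    print(match.group("sync_id"), match.group("table"), match.group("part"))
    # 1010103140 sessions part-00061-45d3eb01-f863-460b-a4ec-c6ab45d.avro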

Each periodic data delivery is accompanied by a manifest metadata file, which describes the target schema and provides a full list of the relevant data files for each table.

{
  "dump_id": 1234,
  "tables": [
    {
      "name": "users",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=users/part-00000-a97432cba49732.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00001-584cdba3973c32.avro",
        "s3://customer/sync_1234/_heap_table_name=users/part-00002-32917bc3297a3c.avro"
      ],
      "columns": [
        "user_id",
        "last_modified",
        // ...
      ],
      "incremental": true
    },
    {
      "name": "user_migrations",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00000-2a345bc452456c.avro",
        "s3://customer/sync_1234/_heap_table_name=user_migrations/part-00001-4382abc432862c.avro"
      ],
      "columns": [
        "from_user_id",
        "to_user_id",
        // ...
      ],
      "incremental": false // always false for migrations
    },
    {
      "name": "defined_event",
      "files": [
        "s3://customer/sync_1234/_heap_table_name=defined_event/part-00000-2fa2dbe2456c.avro"
      ],
      "columns": [
        "user_id",
        "event_id",
        "time",
        "session_id",
        // ...
      ],
      "incremental": true
    }
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}

It includes the following information:

  • dump_id: A monotonically increasing sequence number for dumps.
  • tables: For each table synced:
    • name: The name of the table.
    • columns: An array of the columns contained in the table. This can be used to determine which columns need to be added or removed downstream.
    • files: An array of full S3 paths to the Avro-encoded files for the table.
    • incremental: A boolean denoting whether the table's data is incremental on top of previous dumps. A value of false means this dump is a full resync of the table, and all previously delivered data for it is invalid.
  • property_definitions: The S3 path to the defined property definitions file.

Ignore any files in the data delivery that aren't listed in the manifest metadata file.
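
As a starting point for an ETL pipeline, here is a minimal sketch that loads a manifest with boto3 and walks only the files it declares. The bucket name and manifest key are placeholders; the actual keys will appear under the manifests/ prefix.

import json
import boto3

s3 = boto3.client("s3")

def load_manifest(bucket, key):
    """Fetch and parse a manifest metadata file from S3."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)

# Hypothetical bucket and key names.
manifest = load_manifest("csq-rs3-example", "manifests/example-manifest.json")
for table in manifest["tables"]:
    mode = "incremental" if table["incremental"] else "full resync"
    print(table["name"], mode)
    for path in table["files"]:  # process only files listed in the manifest
        print("  ", path)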

Each sync will be accompanied by a sync log file that reports on delivery status. These log files will be placed in the sync_reports directory. Each report will be in a JSON format as follows:

{
  "start_time": 1566968405225,
  "finish_time": 1566968649169,
  "status": "succeeded",
  "next_sync_at": 1567054800000,
  "error": null
}

start_time, finish_time, and next_sync_at are epoch timestamps in milliseconds.
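
Here is a minimal sketch for checking a report, assuming boto3 and placeholder object names; note the millisecond timestamps:

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
body = s3.get_object(
    Bucket="csq-rs3-example",                 # hypothetical bucket
    Key="sync_reports/example-report.json",   # hypothetical key
)["Body"].read()
report = json.loads(body)

if report["status"] != "succeeded":
    raise RuntimeError(f"Sync failed: {report['error']}")

# Convert epoch milliseconds to an aware datetime.
next_sync = datetime.fromtimestamp(report["next_sync_at"] / 1000, tz=timezone.utc)
print("Next sync expected at", next_sync.isoformat())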

See Data Syncing to learn how the data will be structured upon sync.

user_id, event_id, and session_id are the only columns with long types. All other columns should be inferred as string types.
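
When reading the Avro part files, you can enforce this typing explicitly. Here is a minimal sketch using the fastavro library (an assumption; any Avro reader works) against a locally downloaded part file:

import fastavro

LONG_COLUMNS = {"user_id", "event_id", "session_id"}

with open("part-00000-example.avro", "rb") as f:  # hypothetical local copy
    for record in fastavro.reader(f):
        # Coerce the three long columns to int and everything else to str.
        row = {
            key: (int(value) if key in LONG_COLUMNS else str(value))
            if value is not None else None
            for key, value in record.items()
        }
        print(row)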

When working with Contentsquare data in S3, you need to perform several processing steps to ensure data accuracy and completeness. The following sections outline the key considerations for building your ETL pipeline.

Contentsquare's identify API allows you to create a single, cohesive view of a user across devices, browsers, and domains. When a user is identified via the identify API, their anonymous user_id is updated to a new user_id, which is a hash of the identity. Once this is set up, when this user is identified on another device or browser, this tells us that these users are the same, and that we should join their data together.

Internally, we call this a “user migration” because we are migrating user and event data from one record to another. In other words, we are resolving data from two users into a single identity; at a high level, this is identity resolution.

In the Contentsquare app, we handle this for you. In Redshift exports, we resolve identities on write; in Snowflake and BigQuery exports, we resolve them in a view. In S3 exports, however, we do not resolve identities for you. To merge this user activity correctly, you must resolve identities yourself using the user_migrations table.

How do I apply this identity resolution mapping in my data warehouse? The user_migrations table contains a mapping of from_user_id to to_user_id, and should be joined against the users table as well as all pertinent event tables. You should create a view based on these joins that will refresh on a regular cadence.

Below is an example of the view you should create to resolve identity on the users table: join user_migrations.from_user_id on users.user_id, then coalesce to_user_id and user_id to obtain the user's final state:

CREATE VIEW users_view AS
SELECT
  user_id,
  MIN("joindate") AS "joindate",
  MAX("last_modified") AS "last_modified",
  MAX("identity") AS "identity",
  MAX("handle") AS "handle",
  MAX("email") AS "email"
FROM
  (
    SELECT
      COALESCE("to_user_id", "user_id") AS "user_id",
      "joindate", "last_modified", "identity", "handle", "email"
    FROM users u
    LEFT JOIN user_migrations m
      ON u.user_id = m.from_user_id
  ) x
GROUP BY user_id;

The following example illustrates the view you should create to resolve identity on each of your synced event tables. Make sure to select all unique columns from each event table in order to replicate the desired table with migrations applied. Each time you toggle on a new event table to sync in the Contentsquare UI, create this migrated view for that table:

CREATE VIEW example_event_migrated_view AS
SELECT
  COALESCE("to_user_id", "user_id") AS "user_id",
  "event_column_1", "event_column_2", "event_id", "session_id", "time",
  "session_time", "type", "library", "platform", "device_type", "country",
  "region", "city", "ip", "referrer", "landing_page", "browser",
  "search_keyword", "utm_source", "utm_campaign", "utm_medium", "utm_term",
  "utm_content", "domain", "query", "path", "hash", "title", "href",
  "target_text"
FROM example_event_to_be_migrated e
LEFT JOIN user_migrations m
  ON e.user_id = m.from_user_id;

Data across dumps/files is not guaranteed to be disjoint. As a result, downstream consumers are responsible for de-duplication, which must happen after applying user migrations. Here's a strategy you can adopt:

  • Sessions: de-duplicate on session_id, user_id
  • Users: de-duplicate on user_id
  • Event tables: de-duplicate on event_id, user_id

Updated users (users whose properties have changed since the last sync) will reappear in the sync files, so each repeated occurrence of a user (keyed on user_id) should replace the old record to ensure the corresponding property updates are picked up.
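
Here is a minimal sketch of this replacement rule for the users table, assuming records arrive as dicts and that last_modified (shown in the manifest example above) identifies the most recent version; for event tables, key on (event_id, user_id) instead:

def dedupe_users(records):
    """Keep only the most recent record per user_id."""
    latest = {}
    for record in records:
        key = record["user_id"]
        if key not in latest or record["last_modified"] > latest[key]["last_modified"]:
            latest[key] = record
    return list(latest.values())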

user_migrations is a fully materialized mapping of from_user_ids to to_user_ids. Downstream consumers are responsible for joining it with the events and users tables to resolve identity retroactively.

For v2, we only sync defined property definitions rather than the actual defined property values. Downstream consumers are responsible for applying these definitions to generate the defined property values for each row.

Schemas are expected to evolve over time: properties can be added to the user and events tables.

Contentsquare does not perform deduplication or identity resolution: your organization needs to manage the ETL process.

Why don't I see initial user properties in Data Connect?

Initial User properties (such as Initial Marketing Channel or Initial Browser) are not currently synced downstream in Data Connect. See Creating An Enhanced Users Table for an example of how to recreate those properties in your warehouse.

What triggers a full sync for tables in S3?

The addition of new events, as well as toggling an event on or off within the Contentsquare UI, will result in the full sync of an individual table. Note that the user_migrations table always does a full sync.

How do I set up encryption with S3?

For server-side encryption, Contentsquare currently supports only the Amazon S3-managed keys (SSE-S3) encryption type. Buckets using AWS Key Management Service keys (SSE-KMS) are not currently supported.

No additional user/role in S3 is required.

For instructions on how to edit your bucket's default encryption, see AWS documentation.
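
Here is a minimal boto3 sketch that sets a bucket's default encryption to SSE-S3 (AES256); the bucket name is a placeholder:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="csq-rs3-example",  # hypothetical bucket
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)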

Does an S3 schema change initiate a full sync?

Schema changes should not initiate a full sync. In cases where a property is synced/unsynced from within the Contentsquare app, the property (column) will either be added or stop being included going forward. However, if you need to populate the column retroactively, a full resync of a given table is needed.

How often is the schema of each table expected to change?

For the users, pageviews, sessions, and user_migrations tables and their respective built-in properties, the schema typically stays the same. However, it is common for properties to be added/removed (archived/unsynced) from the Contentsquare app.

Note that when a new property is captured, it is automatically included in Data Connect.

What happens after an S3 sync fails?

The majority of sync failures will re-attempt and resolve themselves. In rare cases, manual intervention may be required.

How can I tell if an S3 sync is completed for today?

Completion of a sync is signaled by the delivery of a new manifest file. You should poll s3://<BUCKET>/manifests/* for new manifests. Once a manifest is delivered, you can process it via your ETL pipeline.
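
Here is a minimal polling sketch with boto3, assuming you track already-processed manifest keys yourself (the seen_keys set here is a stand-in for real state):

import boto3

s3 = boto3.client("s3")

def new_manifest_keys(bucket, seen_keys):
    """Yield manifest keys that have not been processed yet."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="manifests/"):
        for obj in page.get("Contents", []):
            if obj["Key"] not in seen_keys:
                yield obj["Key"]

for key in new_manifest_keys("csq-rs3-example", seen_keys=set()):  # hypothetical bucket
    print("New manifest:", key)  # hand off to your ETL pipeline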

What is the numeric value appended to each sync folder?

This value is the dump_id which is appended to each sync and is used to associate the folder to the respective manifest located in s3://<BUCKET>/manifests/*. For example, the folder name may look like sync_123456789/.