Amazon S3

Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.

Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.

Before setting up the S3 integration, ensure you have:

  • An AWS account with S3 access
  • An S3 bucket to store Contentsquare data
  • AWS credentials with appropriate permissions for the S3 bucket
To prepare your AWS environment, complete the following steps:

  1. Create an S3 bucket to store Contentsquare data (if you don’t already have one)

  2. Create an IAM user or role with appropriate permissions:

    • The IAM policy should include:

      • s3:PutObject
      • s3:GetObject
      • s3:ListBucket
      • s3:DeleteObject
      • s3:PutObjectAcl
  3. Apply the appropriate bucket policy for your region, replacing <bucket_name> with the name of your bucket (if you prefer to script this step, a boto3 sketch follows this list):

    US

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ConnectUS",
          "Effect": "Allow",
          "Action": [
            "s3:*"
          ],
          "Resource": [
            "arn:aws:s3:::<bucket_name>",
            "arn:aws:s3:::<bucket_name>/*"
          ],
          "Principal": {
            "AWS": [
              "arn:aws:iam::556519846140:root"
            ]
          }
        }
      ]
    }

    EU

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ConnectEU",
          "Effect": "Allow",
          "Action": [
            "s3:*"
          ],
          "Resource": [
            "arn:aws:s3:::<bucket_name>",
            "arn:aws:s3:::<bucket_name>/*"
          ],
          "Principal": {
            "AWS": [
              "arn:aws:iam::556519846140:root"
            ]
          }
        }
      ]
    }
  4. Generate AWS access keys for the IAM user (if using user-based authentication)
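
If you prefer to apply the bucket policy programmatically, the following is a minimal boto3 sketch. The bucket name and the choice of the US policy are illustrative assumptions; substitute your own bucket and the region-appropriate policy, and run it with credentials that allow s3:PutBucketPolicy on the bucket.

import json
import boto3

# Hypothetical bucket name -- replace with your own bucket.
BUCKET_NAME = "csq-rs3-example"

# Region-appropriate bucket policy from step 3 (US shown here as an example).
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ConnectUS",
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET_NAME}",
                f"arn:aws:s3:::{BUCKET_NAME}/*",
            ],
            "Principal": {"AWS": ["arn:aws:iam::556519846140:root"]},
        }
    ],
}

s3 = boto3.client("s3")

# Attach the policy to the bucket.
s3.put_bucket_policy(Bucket=BUCKET_NAME, Policy=json.dumps(bucket_policy))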

Next, configure the integration in Contentsquare:

  1. Log in to Contentsquare.
  2. Navigate to Analysis setup > Data Connect.
  3. Create the S3 bucket csq-rs3-<bucket_name> to sync Data Connect data with.
  4. Select Next.
  5. Add the displayed policy to your CSQ bucket on S3.
  6. Input your S3 credentials to connect to your bucket.
  7. Select Connect.

Once setup is complete, you’ll see a sync within 24 hours containing the built-in tables (users, sessions, pageviews, and your custom events).

When Data Connect syncs data to S3, it creates the following structure:

s3://your-bucket/[optional-prefix]/
├── users/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── sessions/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── pageviews/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
├── [custom_event_name]/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
└── ...

The data is organized by:

  • Table name (users, sessions, pageviews, custom events)
  • Date partition (based on sync date)
  • Part files (data is split into multiple files)
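
As a sketch of how you might enumerate the part files for one table and date partition, the snippet below lists objects under the corresponding prefix with boto3. The BUCKET, PREFIX, TABLE, and DATE values are illustrative assumptions; substitute your own.

import boto3

# Illustrative values -- substitute your own bucket, prefix, table, and date.
BUCKET = "your-bucket"
PREFIX = "optional-prefix/"   # use "" if you did not configure a prefix
TABLE = "sessions"
DATE = "2024-01-15"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Enumerate every part file under <prefix><table>/date=<date>/.
partition_prefix = f"{PREFIX}{TABLE}/date={DATE}/"
part_files = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=partition_prefix):
    for obj in page.get("Contents", []):
        part_files.append(obj["Key"])

print(f"{len(part_files)} part files in {partition_prefix}")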

Each periodic data delivery will be accompanied by a manifest metadata file, which will describe the target schema and provide a full list of relevant data files for each table. Ignore any files in the data delivery that aren’t listed in the manifest metadata file.

It includes the following information:

  • dump_id: A monotonically increasing sequence number for dumps.
  • tables: For each table synced:
    • name: The name of the table.

    • columns: An array consisting of the columns contained in the table. This can be used to determine which columns need to be added or removed downstream.

    • files: An array of full s3 paths to the Avro-encoded files for the relevant table.

    • incremental: A boolean denoting whether the data for the table is incremental on top of previous dumps. A value of false means it is a full/fresh resync of this table, and all previous data is invalid.
  • property_definitions: The s3 path to the defined property definition file.

Example

{
  "dump_id": 1234,
  "tables": [
    {
      "name": "users",
      "files": [
        "s3://customer/sync_1234/users/a97432cba49732.avro",
        "s3://customer/sync_1234/users/584cdba3973c32.avro",
        "s3://customer/sync_1234/users/32917bc3297a3c.avro"
      ],
      "columns": [
        "user_id",
        "last_modified",
        // ...
      ],
      "incremental": true
    },
    {
      "name": "user_migrations",
      "files": [
        "s3://customer/sync_1234/user_migrations/2a345bc452456c.avro",
        "s3://customer/sync_1234/user_migrations/4382abc432862c.avro"
      ],
      "columns": [
        "from_user_id",
        "to_user_id",
        // ...
      ],
      "incremental": false // always false for migrations
    },
    {
      "name": "defined_event",
      "files": [
        "s3://customer/sync_1234/defined_event/2fa2dbe2456c.avro"
      ],
      "columns": [
        "user_id",
        "event_id",
        "time",
        "session_id",
        // ...
      ],
      "incremental": true
    }
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}

The user_id, event_id, and session_id columns are the only columns with long types. All other columns should be inferred as string types.
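
To make the manifest-driven workflow concrete, here is a minimal sketch that reads a manifest from S3, applies the typing rule above, and processes only the files the manifest lists. The bucket and manifest key are illustrative assumptions (the actual manifest path comes with your delivery), and read_avro_records is a hypothetical placeholder for whatever Avro reader your pipeline uses.

import json
import boto3

s3 = boto3.client("s3")

# Illustrative locations -- substitute the bucket and manifest key from your delivery.
BUCKET = "customer"
MANIFEST_KEY = "sync_1234/manifest.json"

# Only these columns are long types; everything else is a string.
LONG_COLUMNS = {"user_id", "event_id", "session_id"}

def column_type(name: str) -> str:
    return "long" if name in LONG_COLUMNS else "string"

# Load and parse the manifest for this dump.
manifest = json.loads(
    s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)["Body"].read()
)
print("Processing dump", manifest["dump_id"])

for table in manifest["tables"]:
    schema = {col: column_type(col) for col in table["columns"]}

    if not table["incremental"]:
        # Full resync: previous data for this table is invalid and should be replaced.
        print(f"{table['name']}: full resync, dropping previously loaded data")

    # Process only the files listed in the manifest; ignore anything else in the delivery.
    for path in table["files"]:
        # read_avro_records is a hypothetical helper for your Avro reader of choice.
        # records = read_avro_records(path, schema)
        print(f"{table['name']}: would load {path}")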

Each sync will be accompanied by a sync log file that reports on delivery status. These log files will be placed in the sync_reports directory. Each report will be in a JSON format as follows:

{
  "start_time": 1566968405225,
  "finish_time": 1566968649169,
  "status": "succeeded",
  "next_sync_at": 1567054800000,
  "error": null
}

start_time, finish_time, and next_sync_at are represented as epoch timestamps in milliseconds.
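
As a sketch, the following reads the most recent sync report and converts its timestamps. The bucket name is an illustrative assumption, and the sync_reports/ prefix follows the description above; adjust both to match your delivery location.

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Illustrative bucket; reports are delivered under the sync_reports directory.
BUCKET = "your-bucket"
REPORT_PREFIX = "sync_reports/"

# Find the most recently modified report.
paginator = s3.get_paginator("list_objects_v2")
reports = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=REPORT_PREFIX):
    reports.extend(page.get("Contents", []))
latest = max(reports, key=lambda obj: obj["LastModified"])

report = json.loads(s3.get_object(Bucket=BUCKET, Key=latest["Key"])["Body"].read())

# Epoch timestamps are in milliseconds.
finished = datetime.fromtimestamp(report["finish_time"] / 1000, tz=timezone.utc)
next_sync = datetime.fromtimestamp(report["next_sync_at"] / 1000, tz=timezone.utc)

if report["status"] == "succeeded":
    print(f"Last sync finished at {finished}; next sync expected at {next_sync}")
else:
    print(f"Sync failed: {report['error']}")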

See Data Syncing to learn how the data will be structured upon sync.

Note that Contentsquare does not perform deduplication or identity resolution; your organization will need to manage these steps as part of your ETL process.