Amazon S3

This guide provides detailed instructions for setting up and using Data Connect with Amazon S3.

Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.

Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.

Before setting up the S3 integration, ensure you have:

  • An AWS account with S3 access
  • An S3 bucket to store Contentsquare data
  • AWS credentials with appropriate permissions for the S3 bucket

To set up the integration:

  1. Create an S3 bucket to store Contentsquare data (if you don’t already have one)

  2. Create an IAM user or role with appropriate permissions:

    • The IAM policy should include the following actions (see the sample policy after these steps):

      • s3:PutObject
      • s3:GetObject
      • s3:ListBucket
      • s3:DeleteObject (optional, for cleanup)

  3. Generate AWS access keys for the IAM user (if using user-based authentication)
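
For reference, a minimal IAM policy granting these actions might look like the following sketch; "your-bucket" is a placeholder for your actual bucket name:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": "arn:aws:s3:::your-bucket/*"
        },
        {
          "Effect": "Allow",
          "Action": "s3:ListBucket",
          "Resource": "arn:aws:s3:::your-bucket"
        }
      ]
    }

Note that the object-level actions apply to the bucket contents (the /* resource), while s3:ListBucket applies to the bucket itself.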

When Data Connect syncs data to S3, it creates the following structure:

s3://your-bucket/[optional-prefix]/
├── users/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── sessions/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── pageviews/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
├── [custom_event_name]/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
└── ...

The data is organized by:

  • Table name (users, sessions, pageviews, custom events)
  • Date partition (based on sync date)
  • Part files (data is split into multiple files)
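
For example, you can browse the synced layout with the AWS CLI; the bucket name, prefix, and date below are placeholders:

    # List the top-level table folders
    aws s3 ls s3://your-bucket/your-prefix/

    # List the part files for one date partition of the sessions table
    aws s3 ls s3://your-bucket/your-prefix/sessions/date=2023-01-01/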

Data Connect supports multiple file formats for S3 exports:

Format  | Description             | Best For
------- | ----------------------- | ---------------------------------
JSON    | Line-delimited JSON     | Flexibility, human readability
CSV     | Comma-separated values  | Compatibility, ease of processing
Parquet | Columnar storage format | Performance, efficient querying

Compress data to reduce storage costs and improve transfer speeds:

Compression | Pros                                      | Cons
----------- | ----------------------------------------- | ----------------------------
GZIP        | High compression ratio, widely supported  | Slower decompression
Snappy      | Fast compression/decompression            | Lower compression ratio
None        | No processing overhead                    | Larger storage requirements
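
To spot-check a compressed export without saving it locally, you can stream an object through the AWS CLI; the object key below is illustrative, and actual part-file names will vary:

    # Stream a GZIP-compressed JSON part file to stdout and show the first few records
    aws s3 cp s3://your-bucket/your-prefix/sessions/date=2023-01-01/part-00000.json.gz - | gunzip -c | head -n 5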

Data in S3 is partitioned by date, which offers several benefits:

  • Efficient querying by date ranges
  • Easier management of data lifecycle
  • Improved performance when using services like Athena

Unlike direct warehouse integrations, S3 data requires additional processing for analysis:

Amazon Athena is a serverless interactive query service that lets you analyze data directly in S3 using standard SQL.

  1. Create an Athena database:

    CREATE DATABASE heap_data;
  2. Create external tables pointing to your S3 data:

    -- `time` and `date` are escaped with backticks because they are reserved words in Athena DDL
    CREATE EXTERNAL TABLE heap_data.sessions (
      session_id STRING,
      user_id STRING,
      `time` TIMESTAMP
      -- add other columns as needed
    )
    PARTITIONED BY (`date` STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 's3://your-bucket/[optional-prefix]/sessions/'
    TBLPROPERTIES ('parquet.compression'='SNAPPY');
  3. Load partitions:

    MSCK REPAIR TABLE heap_data.sessions;
  4. Query the data:

    SELECT
      user_id,
      COUNT(DISTINCT session_id) AS session_count
    FROM
      heap_data.sessions
    WHERE
      date >= '2023-01-01'
    GROUP BY
      user_id
    ORDER BY
      session_count DESC
    LIMIT 100;
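
The table definition above assumes Parquet files. If you export line-delimited JSON instead, the DDL would use a JSON SerDe; a sketch (the table name sessions_json is only an example) could look like this:

    CREATE EXTERNAL TABLE heap_data.sessions_json (
      session_id STRING,
      user_id STRING,
      `time` TIMESTAMP
    )
    PARTITIONED BY (`date` STRING)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://your-bucket/[optional-prefix]/sessions/';

Athena decompresses GZIP-compressed files automatically based on the .gz extension, so the same definition works for compressed JSON exports.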

AWS Glue is a fully managed ETL service that can automatically discover and catalog metadata from S3 data.

  1. Create a Glue Crawler to catalog your Contentsquare data in S3
  2. Run the crawler to discover schema and create table definitions
  3. Query the data using Athena, or process it using Glue ETL jobs
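
If you prefer to script this, the crawler can also be created and started with the AWS CLI; the crawler name, IAM role, database, and path below are placeholders:

    # Create a crawler that catalogs the exported tables into the heap_data database
    aws glue create-crawler \
      --name contentsquare-s3-crawler \
      --role AWSGlueServiceRole-DataConnect \
      --database-name heap_data \
      --targets '{"S3Targets": [{"Path": "s3://your-bucket/your-prefix/"}]}'

    # Run the crawler to discover schemas and create table definitions
    aws glue start-crawler --name contentsquare-s3-crawler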

The S3 integration works with many other data processing tools:

  • Amazon EMR: Process data using Spark, Hive, or Presto
  • AWS Lambda: Create event-driven processing for new data (see the example notification configuration below)
  • Third-party tools: Connect tools such as Databricks, Snowflake, or Tableau directly to your S3 data
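
For the Lambda option, new part files can trigger a function through S3 event notifications. A sketch of a notification configuration, applied with aws s3api put-bucket-notification-configuration, is shown below; the function ARN, account ID, and prefix are placeholders:

    {
      "LambdaFunctionConfigurations": [
        {
          "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-new-export",
          "Events": ["s3:ObjectCreated:*"],
          "Filter": {
            "Key": {
              "FilterRules": [
                { "Name": "prefix", "Value": "your-prefix/sessions/" }
              ]
            }
          }
        }
      ]
    }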

The S3 integration handles identity resolution differently from other Data Connect destinations:

  • Identity updates: When users are identified or merged in Contentsquare, historical data files may be overwritten with updated identity information
  • File versioning: Consider enabling S3 versioning if you need to track changes to identity data
  • Data freshness: Always use the most recent data files for analysis to ensure the most current identity resolution
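
If you decide to enable versioning to track identity updates, it can be turned on with a single CLI call (the bucket name is a placeholder):

    aws s3api put-bucket-versioning \
      --bucket your-bucket \
      --versioning-configuration Status=Enabled

Keep in mind that versioning retains every overwritten copy of a file, which increases storage costs; pair it with a lifecycle rule for noncurrent versions if needed.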

Optimize S3 storage and access costs:

  1. Lifecycle policies: Set up lifecycle rules to transition older data to lower-cost storage tiers

    {
      "Rules": [
        {
          "ID": "transition-old-data",
          "Status": "Enabled",
          "Filter": { "Prefix": "your-prefix/" },
          "Transitions": [
            {
              "Days": 90,
              "StorageClass": "STANDARD_IA"
            }
          ]
        }
      ]
    }
  2. Request optimization: Minimize LIST operations by using well-organized prefixes

  3. Compression: Use compression to reduce storage costs

  4. Clean up temporary files: Remove temporary or duplicate files regularly (see the example expiration rule below)
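
As a complement to the transition rule in step 1, an expiration rule can automate cleanup; the your-prefix/tmp/ prefix below is purely illustrative, so point it at whichever location holds files you no longer need:

    {
      "Rules": [
        {
          "ID": "expire-temporary-files",
          "Status": "Enabled",
          "Filter": { "Prefix": "your-prefix/tmp/" },
          "Expiration": { "Days": 7 }
        }
      ]
    }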