Amazon S3
This guide provides detailed instructions for setting up and using Data Connect with Amazon S3.
Amazon S3 (Simple Storage Service) is an object storage service that offers industry-leading scalability, data availability, security, and performance. The Data Connect integration with S3 allows you to export your Contentsquare data to S3 for flexible storage and analysis options.
Unlike the other warehouse integrations (BigQuery, Redshift, Snowflake), the S3 integration provides raw data files that you can process with your preferred analytics tools, such as Athena, EMR, or third-party data processing services.
Prerequisites
Before setting up the S3 integration, ensure you have:
- An AWS account with S3 access
- An S3 bucket to store Contentsquare data
- AWS credentials with appropriate permissions for the S3 bucket
Setup Process
Prepare Your S3 Environment
1. Create an S3 bucket to store Contentsquare data (if you don’t already have one)
2. Create an IAM user or role with appropriate permissions. The IAM policy should include the following actions (an example policy sketch follows this list):
   - s3:PutObject
   - s3:GetObject
   - s3:ListBucket
   - s3:DeleteObject (optional, for cleanup)
3. Generate AWS access keys for the IAM user (if using user-based authentication)
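As an example, here is a minimal sketch of such a policy created with boto3. The bucket name and policy name are placeholders, not values Data Connect requires:

```python
# Sketch only: creates an IAM policy with the actions listed above.
# "your-bucket" and the policy name are placeholders -- use your own values.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::your-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::your-bucket",
        },
    ],
}

iam.create_policy(
    PolicyName="contentsquare-data-connect-s3",
    PolicyDocument=json.dumps(policy_document),
)
```

Attach the resulting policy to the IAM user or role you created, for example with iam.attach_user_policy or iam.attach_role_policy.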
Configure Data Connect
Data Structure in S3
When Data Connect syncs data to S3, it creates the following structure:
```
s3://your-bucket/[optional-prefix]/
├── users/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── sessions/
│   ├── date=YYYY-MM-DD/
│   │   ├── part-00000.[format].[compression]
│   │   ├── part-00001.[format].[compression]
│   │   └── ...
├── pageviews/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
├── [custom_event_name]/
│   ├── date=YYYY-MM-DD/
│   │   └── ...
└── ...
```
The data is organized by:
- Table name (users, sessions, pageviews, custom events)
- Date partition (based on sync date)
- Part files (data is split into multiple files)
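As a quick sanity check of this layout, you can list the part files for one table and one sync date with boto3. The bucket name, prefix, and date below are placeholders:

```python
# Sketch: list the part files for one table and one sync date.
# Bucket name, prefix, and date are placeholders.
import boto3

s3 = boto3.client("s3")

response = s3.list_objects_v2(
    Bucket="your-bucket",
    Prefix="optional-prefix/sessions/date=2023-01-01/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```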
S3-specific Features
File Formats
Data Connect supports multiple file formats for S3 exports:
| Format | Description | Best For |
|---|---|---|
| JSON | Line-delimited JSON | Flexibility, human readability |
| CSV | Comma-separated values | Compatibility, ease of processing |
| Parquet | Columnar storage format | Performance, efficient querying |
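For example, if you export in Parquet, a downloaded part file can be inspected locally with pandas. The filename is a placeholder, and pandas needs a Parquet engine (pyarrow or fastparquet) installed:

```python
# Sketch: inspect a downloaded Parquet part file locally.
# Requires pandas plus a Parquet engine (pyarrow or fastparquet).
import pandas as pd

df = pd.read_parquet("part-00000.parquet")
print(df.dtypes)
print(df.head())
```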
Compression Options
Compress data to reduce storage costs and improve transfer speeds:
| Compression | Pros | Cons |
|---|---|---|
| GZIP | High compression ratio, widely supported | Slower decompression |
| Snappy | Fast compression/decompression | Lower compression ratio |
| None | No processing overhead | Larger storage requirements |
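As an illustration, a GZIP-compressed, line-delimited JSON part file can be streamed record by record with the standard library. The filename is a placeholder:

```python
# Sketch: stream records from a GZIP-compressed, line-delimited JSON part file.
import gzip
import json

with gzip.open("part-00000.json.gz", "rt") as f:
    for line in f:
        record = json.loads(line)
        print(record)
```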
Partitioning
Data in S3 is partitioned by date, which offers several benefits:
- Efficient querying by date ranges
- Easier management of data lifecycle
- Improved performance when using services like Athena
Analyzing S3 Data
Unlike direct warehouse integrations, S3 data requires additional processing for analysis:
Using Amazon Athena
Amazon Athena ↗ is a serverless interactive query service that lets you analyze data in S3 using standard SQL.
1. Create an Athena database:

   ```sql
   CREATE DATABASE heap_data;
   ```

2. Create external tables pointing to your S3 data:

   ```sql
   CREATE EXTERNAL TABLE heap_data.sessions (
     session_id STRING,
     user_id STRING,
     time TIMESTAMP
     -- other columns as needed
   )
   PARTITIONED BY (date STRING)
   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION 's3://your-bucket/[optional-prefix]/sessions/'
   TBLPROPERTIES ('parquet.compression'='SNAPPY');
   ```

3. Load partitions:

   ```sql
   MSCK REPAIR TABLE heap_data.sessions;
   ```

4. Query the data:

   ```sql
   SELECT
     user_id,
     COUNT(DISTINCT session_id) AS session_count
   FROM heap_data.sessions
   WHERE date >= '2023-01-01'
   GROUP BY user_id
   ORDER BY session_count DESC
   LIMIT 100;
   ```
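If you prefer to run these queries programmatically, the same statements can be submitted through boto3's Athena client. The query results location below is a placeholder bucket and prefix that you control:

```python
# Sketch: submit the query above through the Athena API.
# The results OutputLocation is a placeholder you must replace.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT user_id, COUNT(DISTINCT session_id) AS session_count
        FROM heap_data.sessions
        WHERE date >= '2023-01-01'
        GROUP BY user_id
        ORDER BY session_count DESC
        LIMIT 100
    """,
    QueryExecutionContext={"Database": "heap_data"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
print("Query execution ID:", response["QueryExecutionId"])
```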
Using AWS Glue
AWS Glue ↗ is a fully managed ETL service that can automatically discover and catalog metadata from S3 data.
- Create a Glue Crawler to catalog your Contentsquare data in S3
- Run the crawler to discover schema and create table definitions
- Query the data using Athena, or process it using Glue ETL jobs
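A minimal sketch of the first two steps with boto3 follows. The crawler name, IAM role ARN, database name, and S3 path are placeholders:

```python
# Sketch: create and start a Glue crawler over the Contentsquare prefix.
# Crawler name, IAM role ARN, database name, and S3 path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="contentsquare-s3-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="heap_data",
    Targets={"S3Targets": [{"Path": "s3://your-bucket/optional-prefix/"}]},
)
glue.start_crawler(Name="contentsquare-s3-crawler")
```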
Using Other Tools
The S3 integration works with many other data processing tools:
- Amazon EMR: Process data using Spark, Hive, or Presto
- AWS Lambda: Create event-driven processing for new data (see the sketch after this list)
- Third-party tools: Connect tools like Databricks, Snowflake, or Tableau directly to S3
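For the AWS Lambda option above, a minimal handler might look like the following, assuming you configure an S3 event notification that invokes the function when new objects land in the bucket:

```python
# Sketch: a minimal Lambda handler for S3 "object created" events.
# Assumes the bucket is configured to send event notifications to this function.
def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Kick off whatever downstream processing you need here.
        print(f"New Data Connect file: s3://{bucket}/{key}")
```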
S3-specific Considerations
Identity Resolution
The S3 integration handles identity resolution differently from other Data Connect destinations:
- Identity updates: When users are identified or merged in Contentsquare, historical data files may be overwritten with updated identity information
- File versioning: Consider enabling S3 versioning if you need to track changes to identity data (see the sketch after this list)
- Data freshness: Always use the most recent data files for analysis to ensure the most current identity resolution
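If you opt into versioning, one way to enable it is with boto3. The bucket name is a placeholder:

```python
# Sketch: enable versioning on the destination bucket.
# "your-bucket" is a placeholder.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="your-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```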
Cost Management
Optimize S3 storage and access costs:
- Lifecycle policies: Set up lifecycle rules to transition older data to lower-cost storage tiers (a sketch for applying a rule like this follows this list):

  ```json
  {
    "Rules": [
      {
        "Status": "Enabled",
        "Prefix": "your-prefix/",
        "Transition": {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        }
      }
    ]
  }
  ```

- Request optimization: Minimize LIST operations by using well-organized prefixes
- Compression: Use compression to reduce storage costs
- Clean up temporary files: Remove temporary or duplicate files regularly
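One way to apply a lifecycle rule like the one above is with boto3, which expects the rule in its Filter/Transitions form. The bucket name, rule ID, and prefix are placeholders:

```python
# Sketch: apply the lifecycle rule above with boto3.
# boto3 expects "Filter" and a "Transitions" list rather than the
# single "Prefix"/"Transition" keys shown in the JSON example.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="your-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "transition-contentsquare-data-to-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": "your-prefix/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                ],
            }
        ]
    },
)
```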