Redshift
This guide provides detailed instructions for setting up and using Data Connect with Amazon Redshift.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The Data Connect integration with Redshift allows you to automatically sync your Heap data to Redshift for advanced analysis and joining with other business data.
Prerequisites
Section titled PrerequisitesBefore setting up the Redshift integration, ensure you have:
- An Amazon Redshift cluster that is provisioned and accessible
- Sufficient permissions to create schemas and tables in Redshift
- Data Connect enabled for your Heap account
- Database credentials with appropriate permissions
Setup Process
Section titled Setup ProcessPrepare Your Redshift Environment
Section titled Prepare Your Redshift Environment-
Ensure your Redshift cluster is running and accessible
-
Create a dedicated schema for Heap data (recommended):
CREATE SCHEMA heap_data; -
Create a dedicated user for Data Connect (optional but recommended):
CREATE USER heap_connect_user WITH PASSWORD 'your-secure-password';GRANT USAGE ON SCHEMA heap_data TO heap_connect_user;GRANT CREATE ON SCHEMA heap_data TO heap_connect_user;GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA heap_data TO heap_connect_user;ALTER DEFAULT PRIVILEGES IN SCHEMA heap_data GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO heap_connect_user; -
Ensure your Redshift cluster is accessible from Heap’s servers:
- If your cluster is public, ensure the security group allows access
- If your cluster is private, you may need to set up VPC peering or a VPN connection
Configure Data Connect
Section titled Configure Data ConnectTable Structure in Redshift
Section titled Table Structure in RedshiftWhen Data Connect syncs data to Redshift, it creates the following table structure:
Standard Tables
Section titled Standard TablesTable Name | Description |
---|---|
users | One row per user with user-level properties |
sessions | One row per session with session-level properties |
pageviews | One row per pageview with page-level properties |
all_events | Union of all event tables for comprehensive querying |
Custom Event Tables
Section titled Custom Event TablesIn addition to the standard tables, Data Connect creates individual tables for each defined event in your Heap account, named after the event.
Redshift-specific Features
Section titled Redshift-specific FeaturesData Types
Section titled Data TypesData Connect maps Contentsquare data types to Redshift data types as follows:
Cotentsquare Data Type | Redshift Data Type |
---|---|
Text | VARCHAR |
Number | FLOAT8 |
Boolean | BOOLEAN |
Date | TIMESTAMP |
Array | VARCHAR (JSON string) |
Object | VARCHAR (JSON string) |
Optimizations
Section titled OptimizationsTo optimize Redshift performance with Heap data:
- Distribution Keys: Tables are distributed using appropriate keys
- Sort Keys: Tables use sort keys to improve query performance
- Compression: Appropriate compression encodings are applied
Table Design
Section titled Table DesignData Connect configures Redshift tables with the following optimizations:
CREATE TABLE heap_data.sessions ( session_id VARCHAR(255) NOT NULL DISTKEY, user_id VARCHAR(255) NOT NULL, time TIMESTAMP NOT NULL SORTKEY, -- other columns)
Querying Data in Redshift
Section titled Querying Data in RedshiftBasic Query Syntax
Section titled Basic Query SyntaxSELECT user_id, COUNT(DISTINCT session_id) AS session_countFROM heap_data.sessionsWHERE time::date >= DATEADD(day, -30, CURRENT_DATE)GROUP BY user_idORDER BY session_count DESCLIMIT 100;
Using Redshift-specific Functions
Section titled Using Redshift-specific FunctionsRedshift offers powerful functions that can enhance your analysis:
-- Calculate user retention by weekWITH first_seen AS ( SELECT user_id, DATE_TRUNC('week', MIN(time::date)) AS first_week FROM heap_data.sessions GROUP BY user_id),weekly_activity AS ( SELECT user_id, DATE_TRUNC('week', time::date) AS week FROM heap_data.sessions GROUP BY 1, 2)SELECT first_week, week, DATEDIFF('week', first_week, week) AS week_number, COUNT(DISTINCT wa.user_id) AS active_users, COUNT(DISTINCT fs.user_id) AS cohort_size, ROUND(100.0 * COUNT(DISTINCT wa.user_id) / COUNT(DISTINCT fs.user_id), 2) AS retention_rateFROM weekly_activity waJOIN first_seen fs ON wa.user_id = fs.user_idWHERE week >= first_weekGROUP BY 1, 2, 3ORDER BY 1, 3;
Performance Optimization
Section titled Performance OptimizationRedshift performance can be optimized in several ways:
Query Optimization
Section titled Query Optimization-
Add date filters: Limit queries to specific time ranges
WHERE time >= DATEADD(day, -30, CURRENT_DATE) -
Use appropriate GROUP BY columns: Order columns from highest to lowest cardinality
-- More efficient (if user_id has higher cardinality than device_type)GROUP BY user_id, device_type-- Less efficientGROUP BY device_type, user_id -
Limit columns in your SELECT: Instead of
SELECT *
, specify only the columns you needSELECT user_id, session_id, time, event_nameFROM heap_data.all_events
Cluster Management
Section titled Cluster Management- WLM Configuration: Configure Workload Management for optimal query performance
- Vacuum and Analyze: Regularly run VACUUM and ANALYZE operations
VACUUM heap_data.sessions;ANALYZE heap_data.sessions;
- Concurrency Scaling: Enable concurrency scaling for high-concurrency workloads
Troubleshooting
Section titled TroubleshootingCommon Issues
Section titled Common Issues- Connection failures: Check network access, security groups, and credentials
- Slow queries: Analyze query plans and optimize
EXPLAINSELECT COUNT(*) FROM heap_data.sessionsWHERE time >= DATEADD(day, -30, CURRENT_DATE);
- Disk space issues: Monitor cluster disk usage and resize if necessary
- Failed syncs: Check Data Connect logs for error messages
Support Resources
Section titled Support ResourcesIf you encounter issues with the Redshift integration:
- Check the Troubleshooting FAQs
- Review AWS Redshift documentation for specific error messages
- Run the Redshift Advisor for performance recommendations
- Contact Heap support with detailed information about the issue
Additional Resources
Section titled Additional Resources- Common Queries for analysis examples
- Amazon Redshift Documentation ↗
- Redshift Best Practices ↗