Redshift

This guide provides detailed instructions for setting up and using Data Connect with Amazon Redshift.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The Data Connect integration with Redshift allows you to automatically sync your Heap data to Redshift for advanced analysis and joining with other business data.

Before setting up the Redshift integration, ensure you have:

  • An Amazon Redshift cluster that is provisioned and accessible
  • Sufficient permissions to create schemas and tables in Redshift
  • Data Connect enabled for your Heap account
  • Database credentials with appropriate permissions

Prepare Your Redshift Environment

Section titled Prepare Your Redshift Environment
  1. Ensure your Redshift cluster is running and accessible

  2. Create a dedicated schema for Heap data (recommended):

    CREATE SCHEMA heap_data;
  3. Create a dedicated user for Data Connect (optional but recommended):

    CREATE USER heap_connect_user WITH PASSWORD 'your-secure-password';
    GRANT USAGE ON SCHEMA heap_data TO heap_connect_user;
    GRANT CREATE ON SCHEMA heap_data TO heap_connect_user;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA heap_data TO heap_connect_user;
    ALTER DEFAULT PRIVILEGES IN SCHEMA heap_data GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO heap_connect_user;
  4. Ensure your Redshift cluster is accessible from Heap’s servers:

    • If your cluster is public, ensure the security group allows access
    • If your cluster is private, you may need to set up VPC peering or a VPN connection

When Data Connect syncs data to Redshift, it creates the following table structure:

Table NameDescription
usersOne row per user with user-level properties
sessionsOne row per session with session-level properties
pageviewsOne row per pageview with page-level properties
all_eventsUnion of all event tables for comprehensive querying

In addition to the standard tables, Data Connect creates individual tables for each defined event in your Heap account, named after the event.

Data Connect maps Contentsquare data types to Redshift data types as follows:

Cotentsquare Data TypeRedshift Data Type
TextVARCHAR
NumberFLOAT8
BooleanBOOLEAN
DateTIMESTAMP
ArrayVARCHAR (JSON string)
ObjectVARCHAR (JSON string)

To optimize Redshift performance with Heap data:

  • Distribution Keys: Tables are distributed using appropriate keys
  • Sort Keys: Tables use sort keys to improve query performance
  • Compression: Appropriate compression encodings are applied

Data Connect configures Redshift tables with the following optimizations:

CREATE TABLE heap_data.sessions (
session_id VARCHAR(255) NOT NULL DISTKEY,
user_id VARCHAR(255) NOT NULL,
time TIMESTAMP NOT NULL SORTKEY,
-- other columns
)
SELECT
user_id,
COUNT(DISTINCT session_id) AS session_count
FROM
heap_data.sessions
WHERE
time::date >= DATEADD(day, -30, CURRENT_DATE)
GROUP BY
user_id
ORDER BY
session_count DESC
LIMIT 100;

Using Redshift-specific Functions

Section titled Using Redshift-specific Functions

Redshift offers powerful functions that can enhance your analysis:

-- Calculate user retention by week
WITH first_seen AS (
SELECT
user_id,
DATE_TRUNC('week', MIN(time::date)) AS first_week
FROM
heap_data.sessions
GROUP BY
user_id
),
weekly_activity AS (
SELECT
user_id,
DATE_TRUNC('week', time::date) AS week
FROM
heap_data.sessions
GROUP BY
1, 2
)
SELECT
first_week,
week,
DATEDIFF('week', first_week, week) AS week_number,
COUNT(DISTINCT wa.user_id) AS active_users,
COUNT(DISTINCT fs.user_id) AS cohort_size,
ROUND(100.0 * COUNT(DISTINCT wa.user_id) / COUNT(DISTINCT fs.user_id), 2) AS retention_rate
FROM
weekly_activity wa
JOIN
first_seen fs ON wa.user_id = fs.user_id
WHERE
week >= first_week
GROUP BY
1, 2, 3
ORDER BY
1, 3;

Redshift performance can be optimized in several ways:

  1. Add date filters: Limit queries to specific time ranges

    WHERE time >= DATEADD(day, -30, CURRENT_DATE)
  2. Use appropriate GROUP BY columns: Order columns from highest to lowest cardinality

    -- More efficient (if user_id has higher cardinality than device_type)
    GROUP BY user_id, device_type
    -- Less efficient
    GROUP BY device_type, user_id
  3. Limit columns in your SELECT: Instead of SELECT *, specify only the columns you need

    SELECT user_id, session_id, time, event_name
    FROM heap_data.all_events
  1. WLM Configuration: Configure Workload Management for optimal query performance
  2. Vacuum and Analyze: Regularly run VACUUM and ANALYZE operations
    VACUUM heap_data.sessions;
    ANALYZE heap_data.sessions;
  3. Concurrency Scaling: Enable concurrency scaling for high-concurrency workloads
  1. Connection failures: Check network access, security groups, and credentials
  2. Slow queries: Analyze query plans and optimize
    EXPLAIN
    SELECT COUNT(*) FROM heap_data.sessions
    WHERE time >= DATEADD(day, -30, CURRENT_DATE);
  3. Disk space issues: Monitor cluster disk usage and resize if necessary
  4. Failed syncs: Check Data Connect logs for error messages

If you encounter issues with the Redshift integration:

  • Check the Troubleshooting FAQs
  • Review AWS Redshift documentation for specific error messages
  • Run the Redshift Advisor for performance recommendations
  • Contact Heap support with detailed information about the issue