Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The Data Connect integration with Redshift allows you to automatically sync your Contentsquare data to Redshift for advanced analysis and joining with other business data.

Before setting up the Redshift integration, ensure you have:

  • An Amazon Redshift cluster that is provisioned and accessible
  • Sufficient permissions to create schemas and tables in Redshift
  • Database credentials with appropriate permissions
Setup Steps:

  1. Ensure your Redshift cluster is running and accessible

  2. Create a dedicated schema for Contentsquare data:

    CREATE SCHEMA heap_data;
  3. Create a dedicated user for Data Connect:

    CREATE USER heap_connect_user WITH PASSWORD 'your-secure-password';
    GRANT USAGE ON SCHEMA heap_data TO heap_connect_user;
    GRANT CREATE ON SCHEMA heap_data TO heap_connect_user;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA heap_data TO heap_connect_user;
    ALTER DEFAULT PRIVILEGES IN SCHEMA heap_data GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO heap_connect_user;
  4. Ensure your Redshift cluster is accessible from Contentsquare's servers:

    • If your cluster is public, ensure the security group allows access
    • If your cluster is private, you may need to set up VPC peering or a VPN connection
Connect in Contentsquare:

  1. Log in to Contentsquare.
  2. Navigate to Analysis setup > Data Connect.
  3. Select Connect next to Redshift.
  4. Enter the required information:
    • Host (endpoint)
    • Port (typically 5439)
    • Database name
    • Username
    • Password
    • Schema (the schema you created for Contentsquare data)
  5. Select Connect.
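Before submitting the form, it can help to see how these fields fit together: Redshift speaks the PostgreSQL wire protocol, so the values map onto a standard libpq-style DSN. A minimal sketch with hypothetical placeholder values (any PostgreSQL-compatible driver, such as psycopg2, could consume such a DSN):

```python
# Sketch: assemble the Data Connect form fields into a libpq-style DSN.
# All values below are hypothetical placeholders.
params = {
    "host": "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "port": 5439,                  # Redshift's default port
    "dbname": "analytics",
    "user": "heap_connect_user",
    "password": "your-secure-password",
}

# libpq DSN format: space-separated key=value pairs.
dsn = " ".join(f"{key}={value}" for key, value in params.items())
```

The schema field is not part of the DSN; it tells Data Connect where to create its tables.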

If your Redshift cluster is in a private subnet and not publicly accessible, you can use SSH tunneling to allow Data Connect to connect securely.

When SSH Tunneling is Required:

  • Your Redshift cluster is in a private VPC subnet
  • Security policies prohibit public Redshift endpoints
  • You need an additional layer of network security
  • Direct VPC peering is not available

Configuration Steps:

  1. Set up a bastion host (jump server) in a public subnet with access to your Redshift cluster

  2. Generate SSH key pair on the bastion host:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/contentsquare_rsa
  3. Add Contentsquare's public key to authorized_keys:

    # Contentsquare will provide their public key
    echo "ssh-rsa AAAAB3NzaC1yc2E..." >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
  4. Configure SSH daemon on bastion host:

    # Edit /etc/ssh/sshd_config
    PermitRootLogin no
    PubkeyAuthentication yes
    PasswordAuthentication no
  5. Provide connection details to Contentsquare:

    • Bastion host IP address
    • SSH username
    • SSH port (typically 22)
    • Redshift cluster endpoint (internal IP/hostname)
    • Redshift port (typically 5439)

Connection Flow:

The connection will be established as:

User → Contentsquare → SSH Tunnel → Bastion Host → Redshift Cluster
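Conceptually, the bastion host just relays TCP bytes between Contentsquare's SSH client and the Redshift endpoint, with SSH providing the encryption. The stdlib-only toy relay below illustrates that forwarding step; it is a sketch of the data path, not a substitute for a real SSH tunnel:

```python
import socket
import threading

def pipe(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes one way until the sender closes its side.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def serve_tunnel(listener: socket.socket, target: tuple) -> None:
    # Accept one client and relay its traffic to the target, the way the
    # bastion relays the SSH tunnel's payload to the Redshift endpoint.
    client, _ = listener.accept()
    upstream = socket.create_connection(target)
    back = threading.Thread(target=pipe, args=(upstream, client))
    back.start()
    pipe(client, upstream)
    back.join()
    client.close()
    upstream.close()
```

In the real setup, `sshd` on the bastion performs this relay for a tunnel such as `ssh -L 5439:<redshift-endpoint>:5439 <user>@<bastion>`, and the traffic between Contentsquare and the bastion is encrypted.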

Security Best Practices:

  • Use key-based authentication only (no passwords)
  • Restrict bastion host security group to only necessary IPs
  • Enable AWS CloudTrail logging on bastion host
  • Regularly rotate SSH keys
  • Use non-standard SSH port if required by security policy

Troubleshooting SSH Connectivity:

  • Verify bastion host security group allows SSH from Contentsquare IPs
  • Check Redshift security group allows bastion host IP
  • Test SSH connection manually from bastion to Redshift
  • Review SSH daemon logs: sudo tail -f /var/log/auth.log

Optimizing Redshift performance for Data Connect tables ensures fast queries and cost-effective operations.

Distribution keys determine how data is distributed across nodes in your Redshift cluster.

Recommended Distribution Keys:

-- Users table: distribute by user_id
ALTER TABLE main_production.users
ALTER DISTSTYLE KEY DISTKEY user_id;
-- Sessions table: distribute by user_id for joins with users
ALTER TABLE main_production.sessions
ALTER DISTSTYLE KEY DISTKEY user_id;
-- Event tables: distribute by user_id
ALTER TABLE main_production.page_viewed
ALTER DISTSTYLE KEY DISTKEY user_id;

Why user_id?

  • Most queries join tables on user_id
  • Co-locates related data on same nodes
  • Reduces network traffic during joins
  • Improves query performance
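The effect of KEY distribution can be sketched with a toy hash (the real slice assignment is internal to Redshift; `md5` here is only for illustration): every table hashed on the same user_id value resolves to the same slice, so join partners are already co-located.

```python
import hashlib

# Toy model of DISTSTYLE KEY (assumption: Redshift's actual hash function
# is internal and undocumented; md5 stands in for it here).
def slice_for(dist_key: str, num_slices: int = 8) -> int:
    digest = hashlib.md5(dist_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_slices

# Rows from users, sessions, and page_viewed that share a user_id
# all hash to the same slice, so the join needs no network shuffle.
user_slice = slice_for("user-42")
session_slice = slice_for("user-42")
```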

Sort keys determine the physical order of data on disk, enabling efficient range queries.

Recommended Sort Keys:

-- Sort by time for date-range queries
ALTER TABLE main_production.sessions
ALTER COMPOUND SORTKEY (time);
ALTER TABLE main_production.page_viewed
ALTER COMPOUND SORTKEY (time);
ALTER TABLE main_production.all_events
ALTER COMPOUND SORTKEY (time, event_table_name);

Compound vs. Interleaved:

  • Compound: Use when queries filter primarily on time
  • Interleaved: Use when queries filter on multiple columns equally
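Why the leading sort-key column matters most in a compound key can be shown with an in-memory sketch: once rows are ordered by (time, event), a time-range filter touches only a contiguous slice, which is roughly what zone maps let Redshift do with sorted blocks. The rows below are hypothetical.

```python
import bisect

# Hypothetical rows in compound sort order: sorted by time first, then
# event name, mirroring COMPOUND SORTKEY (time, event_table_name).
rows = sorted([
    (1, "page_viewed"), (2, "sessions"), (3, "page_viewed"),
    (4, "click"), (5, "sessions"), (7, "page_viewed"),
])
times = [time for time, _ in rows]

# A filter on the leading column becomes a contiguous range scan.
lo = bisect.bisect_left(times, 3)
hi = bisect.bisect_right(times, 5)
in_range = rows[lo:hi]   # only rows with 3 <= time <= 5 are touched
```

A filter on the second column alone gets no such pruning from a compound key, which is the case where an interleaved key can help.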

Regular maintenance ensures optimal performance:

-- Reclaim space and re-sort rows
VACUUM FULL main_production.users;
VACUUM FULL main_production.sessions;
VACUUM FULL main_production.page_viewed;
-- Update table statistics for query planner
ANALYZE main_production.users;
ANALYZE main_production.sessions;
ANALYZE main_production.page_viewed;

Maintenance Schedule:

  • Run VACUUM weekly or after large data loads
  • Run ANALYZE after VACUUM and after schema changes
  • Consider relying on Redshift's automatic vacuum (auto vacuum delete and auto table sort) for routine maintenance
  • Schedule during low-usage periods

Use EXPLAIN to analyze queries:

EXPLAIN
SELECT u.user_id, COUNT(s.session_id)
FROM users u
JOIN sessions s ON u.user_id = s.user_id
WHERE s.time >= DATEADD('day', -30, GETDATE())
GROUP BY u.user_id;

Optimization Techniques:

  1. Filter early: Apply WHERE clauses before joins
  2. Use appropriate joins: INNER vs LEFT based on need
  3. Limit columns: Select only needed columns
  4. Use temp tables: For complex multi-step queries
  5. Leverage sort keys: Filter on sort key columns when possible

Example Optimized Query:

-- Create temp table with filtered data
CREATE TEMP TABLE recent_sessions AS
SELECT user_id, session_id, time
FROM sessions
WHERE time >= DATEADD('day', -30, GETDATE());
-- Analyze temp table
ANALYZE recent_sessions;
-- Join with optimized temp table
SELECT u.user_id, u.identity, COUNT(rs.session_id) as session_count
FROM users u
INNER JOIN recent_sessions rs ON u.user_id = rs.user_id
GROUP BY u.user_id, u.identity;

Common Permission Errors:

-- Error: "permission denied for schema"
-- Solution: Grant schema usage
GRANT USAGE ON SCHEMA main_production TO heap_connect_user;
-- Error: "permission denied for table"
-- Solution: Grant table permissions
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA main_production TO heap_connect_user;
-- Ensure future tables are accessible
ALTER DEFAULT PRIVILEGES IN SCHEMA main_production
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO heap_connect_user;

Verify Permissions:

-- Check user permissions
SELECT
schemaname,
tablename,
usename,
HAS_TABLE_PRIVILEGE(usename, schemaname||'.'||tablename, 'SELECT') as can_select,
HAS_TABLE_PRIVILEGE(usename, schemaname||'.'||tablename, 'INSERT') as can_insert
FROM pg_tables
CROSS JOIN pg_user
WHERE schemaname = 'main_production'
AND usename = 'heap_connect_user'
ORDER BY tablename;

Symptoms:

  • Sync fails with timeout errors
  • Connection drops during large syncs
  • Intermittent connectivity issues

Solutions:

  1. Check cluster connection usage:

    -- Check current connections
    SELECT COUNT(*) FROM stv_sessions;
    -- Redshift caps concurrent connections per cluster (500 on most node types);
    -- free idle connections or adjust limits with ALTER USER ... CONNECTION LIMIT
  2. Check network connectivity:

    # Test from Contentsquare IPs (if self-testing)
    telnet your-cluster.region.redshift.amazonaws.com 5439
  3. Review VPC and security groups:

    • Ensure Contentsquare IPs are allowed
    • Check Network ACLs in VPC
    • Verify route tables
  4. Enable enhanced VPC routing (if using S3):

    • Improves reliability for data loads
    • Configure in Redshift cluster settings
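The telnet check in step 2 can also be scripted. This stdlib-only sketch attempts a plain TCP connection; the endpoint in the comment is a placeholder:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    # Same check as `telnet host 5439`: can a TCP connection be opened?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder endpoint):
# port_reachable("your-cluster.region.redshift.amazonaws.com", 5439)
```

A `False` result points at security groups, Network ACLs, or routing rather than at Redshift itself.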

Issue: Cannot see tables after sync

-- Check if schema exists
SELECT nspname FROM pg_namespace WHERE nspname = 'main_production';
-- Check if user has access
SELECT
n.nspname as schema_name,
HAS_SCHEMA_PRIVILEGE(u.usename, n.nspname, 'USAGE') as has_usage
FROM pg_namespace n
CROSS JOIN pg_user u
WHERE u.usename = 'heap_connect_user'
AND n.nspname = 'main_production';
-- Grant access if needed
GRANT USAGE ON SCHEMA main_production TO heap_connect_user;

Why are constraints not applied to the tables?

Primary and foreign key constraints are not applied to the Data Connect tables because Amazon Redshift does not enforce them. In Redshift, primary and foreign keys are informational only: the query planner can use them as hints, but uniqueness and referential integrity are not guaranteed by the database.