Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The Data Connect integration with Redshift allows you to automatically sync your Contentsquare data to Redshift for advanced analysis and joining with other business data.
Prerequisites
Before setting up the Redshift integration, ensure you have:
- An Amazon Redshift cluster that is provisioned and accessible
- Sufficient permissions to create schemas and tables in Redshift
- Database credentials with appropriate permissions
Redshift Setup
1. Ensure your Redshift cluster is running and accessible.

2. Create a dedicated schema for Contentsquare data:

   ```sql
   CREATE SCHEMA heap_data;
   ```

3. Create a dedicated user for Data Connect:

   ```sql
   CREATE USER heap_connect_user WITH PASSWORD 'your-secure-password';
   GRANT USAGE ON SCHEMA heap_data TO heap_connect_user;
   GRANT CREATE ON SCHEMA heap_data TO heap_connect_user;
   GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA heap_data TO heap_connect_user;
   ALTER DEFAULT PRIVILEGES IN SCHEMA heap_data GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO heap_connect_user;
   ```

4. Ensure your Redshift cluster is accessible from Contentsquare's servers:
   - If your cluster is public, ensure its security group allows inbound access from Contentsquare's IP addresses
   - If your cluster is private, you may need to set up VPC peering or a VPN connection
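For a publicly accessible cluster, opening the security group to Contentsquare can be done with a rule like the following. This is a sketch: the security group ID and CIDR below are placeholders, and Contentsquare will supply the actual IP ranges to allow.

```sh
# Allow inbound Redshift traffic (port 5439) from a given CIDR
# sg-0123456789abcdef0 and 203.0.113.0/24 are placeholders
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5439 \
  --cidr 203.0.113.0/24
```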
Configure Data Connect
1. Log in to Contentsquare.
2. Navigate to Analysis setup > Data Connect.
3. Select Connect next to Redshift.
4. Enter the required information:
   - Host (cluster endpoint)
   - Port (typically 5439)
   - Database name
   - Username
   - Password
   - Schema (the schema you created for Contentsquare data)
5. Select Connect.
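Before entering credentials, it can help to confirm that the cluster endpoint is actually reachable on the configured port. A minimal sketch in Python (the endpoint in the comment is a placeholder, not a real cluster):

```python
import socket

def check_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical endpoint):
# check_reachable("my-cluster.abc123.us-east-1.redshift.amazonaws.com", 5439)
```

If this returns False from a network location similar to Contentsquare's, the connection form will fail for the same reason, and the security group or routing should be checked first.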
SSH Tunneling Setup
If your Redshift cluster is in a private subnet and not publicly accessible, you can use SSH tunneling to allow Data Connect to connect securely.
When SSH Tunneling is Required:
- Your Redshift cluster is in a private VPC subnet
- Security policies prohibit public Redshift endpoints
- You need an additional network security layer
- Direct VPC peering is not available
Configuration Steps:
1. Set up a bastion host (jump server) in a public subnet with access to your Redshift cluster.

2. Generate an SSH key pair on the bastion host:

   ```sh
   ssh-keygen -t rsa -b 4096 -f ~/.ssh/contentsquare_rsa
   ```

3. Add Contentsquare's public key to authorized_keys:

   ```sh
   # Contentsquare will provide their public key
   echo "ssh-rsa AAAAB3NzaC1yc2E..." >> ~/.ssh/authorized_keys
   chmod 600 ~/.ssh/authorized_keys
   ```

4. Configure the SSH daemon on the bastion host:

   ```sh
   # Edit /etc/ssh/sshd_config
   PermitRootLogin no
   PubkeyAuthentication yes
   PasswordAuthentication no
   ```

5. Provide connection details to Contentsquare:
- Bastion host IP address
- SSH username
- SSH port (typically 22)
- Redshift cluster endpoint (internal IP/hostname)
- Redshift port (typically 5439)
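Before handing these details over, you can verify the path yourself by opening a tunnel from an allowed machine. The hostnames, usernames, and database name below are placeholders for your own values:

```sh
# Open a local tunnel through the bastion to the Redshift endpoint
ssh -i ~/.ssh/contentsquare_rsa -N \
  -L 5439:my-cluster.internal.example.com:5439 \
  ssh_user@bastion.example.com

# In another terminal, connect through the tunnel
psql -h 127.0.0.1 -p 5439 -U heap_connect_user -d analytics
```

If `psql` connects, the bastion can reach the cluster and the same path should work for Data Connect.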
Connection Flow:
The connection will be established as:
User → Contentsquare → SSH Tunnel → Bastion Host → Redshift Cluster
Security Best Practices:
- Use key-based authentication only (no passwords)
- Restrict bastion host security group to only necessary IPs
- Enable AWS CloudTrail logging on bastion host
- Regularly rotate SSH keys
- Use non-standard SSH port if required by security policy
Troubleshooting SSH Connectivity:
- Verify bastion host security group allows SSH from Contentsquare IPs
- Check Redshift security group allows bastion host IP
- Test SSH connection manually from bastion to Redshift
- Review SSH daemon logs:

  ```sh
  sudo tail -f /var/log/auth.log
  ```
Performance Optimization
Optimizing Redshift performance for Data Connect tables ensures fast queries and cost-effective operations.
Distribution Keys
Distribution keys determine how data is distributed across the nodes in your Redshift cluster.
Recommended Distribution Keys:
```sql
-- Users table: distribute by user_id
ALTER TABLE main_production.users ALTER DISTSTYLE KEY DISTKEY user_id;

-- Sessions table: distribute by user_id for joins with users
ALTER TABLE main_production.sessions ALTER DISTSTYLE KEY DISTKEY user_id;

-- Event tables: distribute by user_id
ALTER TABLE main_production.page_viewed ALTER DISTSTYLE KEY DISTKEY user_id;
```

Why user_id?
- Most queries join tables on user_id
- Co-locates related data on the same nodes
- Reduces network traffic during joins
- Improves query performance
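To confirm how a table is currently distributed, and whether the chosen key distributes rows evenly, you can query Redshift's SVV_TABLE_INFO system view. A sketch, assuming the main_production schema used above:

```sql
-- diststyle shows the current distribution style and key;
-- skew_rows is the ratio of rows on the largest vs. smallest slice
-- (values near 1 mean even distribution)
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE schema = 'main_production'
ORDER BY skew_rows DESC;
```

A heavily skewed distribution key can be worse than EVEN distribution, so it is worth checking after changing DISTKEY.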
Sort Keys
Sort keys determine the physical order of data on disk, enabling efficient range queries.
Recommended Sort Keys:
```sql
-- Sort by time for date-range queries
ALTER TABLE main_production.sessions ALTER COMPOUND SORTKEY (time);

ALTER TABLE main_production.page_viewed ALTER COMPOUND SORTKEY (time);

ALTER TABLE main_production.all_events ALTER COMPOUND SORTKEY (time, event_table_name);
```

Compound vs. Interleaved:
- Compound: use when queries filter primarily on time
- Interleaved: use when queries filter on multiple columns equally
VACUUM and ANALYZE
Regular maintenance ensures optimal performance:
```sql
-- Reclaim space and re-sort rows
VACUUM FULL main_production.users;
VACUUM FULL main_production.sessions;
VACUUM FULL main_production.page_viewed;

-- Update table statistics for the query planner
ANALYZE main_production.users;
ANALYZE main_production.sessions;
ANALYZE main_production.page_viewed;
```

Maintenance Schedule:
- Run VACUUM weekly or after large data loads
- Run ANALYZE after VACUUM and after schema changes
- Consider Redshift's automatic vacuum for routine maintenance
- Schedule during low-usage periods
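Rather than vacuuming every table on a fixed schedule, you can target the ones that actually need it. A sketch using SVV_TABLE_INFO (the 10% thresholds are illustrative, not a recommendation):

```sql
-- unsorted: percent of rows not in sort-key order (candidates for VACUUM)
-- stats_off: how stale the planner statistics are (candidates for ANALYZE)
SELECT "table", unsorted, stats_off
FROM svv_table_info
WHERE schema = 'main_production'
  AND (unsorted > 10 OR stats_off > 10)
ORDER BY unsorted DESC;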
Query Performance Tuning
Use EXPLAIN to analyze queries:

```sql
EXPLAIN
SELECT u.user_id, COUNT(s.session_id)
FROM users u
JOIN sessions s ON u.user_id = s.user_id
WHERE s.time >= DATEADD('day', -30, GETDATE())
GROUP BY u.user_id;
```

Optimization Techniques:
- Filter early: Apply WHERE clauses before joins
- Use appropriate joins: INNER vs LEFT based on need
- Limit columns: Select only needed columns
- Use temp tables: For complex multi-step queries
- Leverage sort keys: Filter on sort key columns when possible
Example Optimized Query:
```sql
-- Create a temp table with filtered data
CREATE TEMP TABLE recent_sessions AS
SELECT user_id, session_id, time
FROM sessions
WHERE time >= DATEADD('day', -30, GETDATE());

-- Analyze the temp table
ANALYZE recent_sessions;

-- Join against the smaller temp table
SELECT u.user_id, u.identity, COUNT(rs.session_id) AS session_count
FROM users u
INNER JOIN recent_sessions rs ON u.user_id = rs.user_id
GROUP BY u.user_id, u.identity;
```

Advanced Troubleshooting
Permission Issues Deep-Dive
Common Permission Errors:

```sql
-- Error: "permission denied for schema"
-- Solution: grant schema usage
GRANT USAGE ON SCHEMA main_production TO heap_connect_user;

-- Error: "permission denied for table"
-- Solution: grant table permissions
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA main_production TO heap_connect_user;

-- Ensure future tables are accessible
ALTER DEFAULT PRIVILEGES IN SCHEMA main_production
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO heap_connect_user;
```

Verify Permissions:

```sql
-- Check user permissions
SELECT schemaname,
       tablename,
       usename,
       HAS_TABLE_PRIVILEGE(usename, schemaname || '.' || tablename, 'SELECT') AS can_select,
       HAS_TABLE_PRIVILEGE(usename, schemaname || '.' || tablename, 'INSERT') AS can_insert
FROM pg_tables
CROSS JOIN pg_user
WHERE schemaname = 'main_production'
  AND usename = 'heap_connect_user'
ORDER BY tablename;
```

Connection Timeout Resolution
Symptoms:
- Sync fails with timeout errors
- Connection drops during large syncs
- Intermittent connectivity issues
Solutions:
1. Increase the cluster connection limit:

   ```sql
   -- Check current connections
   SELECT COUNT(*) FROM stv_sessions;

   -- If needed, modify the max_connections parameter
   -- in the cluster's parameter group (AWS Console)
   ```

2. Check network connectivity:

   ```sh
   # Test from Contentsquare IPs (if self-testing)
   telnet your-cluster.region.redshift.amazonaws.com 5439
   ```

3. Review VPC and security groups:
- Ensure Contentsquare IPs are allowed
- Check Network ACLs in VPC
- Verify route tables
4. Enable enhanced VPC routing (if using S3):
- Improves reliability for data loads
- Configure in Redshift cluster settings
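Enhanced VPC routing can also be enabled from the CLI. A sketch, where the cluster identifier is a placeholder for your own:

```sh
# Force COPY/UNLOAD traffic through the VPC instead of the public internet
aws redshift modify-cluster \
  --cluster-identifier my-redshift-cluster \
  --enhanced-vpc-routing
```

Note that the change requires the cluster to reboot before it takes effect.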
Schema Access Problems
Issue: Cannot see tables after sync
```sql
-- Check if the schema exists
SELECT nspname FROM pg_namespace WHERE nspname = 'main_production';

-- Check if the user has access
SELECT n.nspname AS schema_name,
       HAS_SCHEMA_PRIVILEGE(u.usename, n.nspname, 'USAGE') AS has_usage
FROM pg_namespace n
CROSS JOIN pg_user u
WHERE u.usename = 'heap_connect_user'
  AND n.nspname = 'main_production';

-- Grant access if needed
GRANT USAGE ON SCHEMA main_production TO heap_connect_user;
```

Why are constraints not applied to the tables?
Primary and foreign key constraints are not applied to the Data Connect tables because Amazon Redshift does not enforce them: constraints can be declared, but they are informational only and are used by the query planner as hints.
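For example, a declared primary key in Redshift will happily accept duplicate rows (the table and values below are illustrative):

```sql
CREATE TABLE demo (id INT PRIMARY KEY, name VARCHAR(32));
INSERT INTO demo VALUES (1, 'a');
INSERT INTO demo VALUES (1, 'b');  -- succeeds: the constraint is not enforced
SELECT COUNT(*) FROM demo;         -- returns 2
```

Because of this, deduplication and referential integrity checks belong in your queries, not in the table definitions.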