In this blog, I will show how to enable high-performance queries for interactive analytics on Redshift using Rockset. I will walk through the steps for setting up an integration between Rockset and a Redshift table, then run millisecond-latency SQL on it to power an interactive Tableau dashboard.
Data warehouse services like Amazon Redshift are ideal for running complex queries on low-concurrency workloads. They easily scale to petabytes of data and are great for running business reports. Now suppose an organization wants to operationalize the data in Redshift, in the form of a dashboard that lets users interactively query the data. There are two challenges:
- Such interactive dashboards demand millisecond query latency for ad hoc queries, which Redshift does not typically support.
- If the dashboard is used by tens of users simultaneously, Redshift cannot support that level of concurrent queries since it's not built for high QPS.
To solve this, we can connect Rockset to Redshift and have the operational dashboard issue queries against Rockset instead of Redshift. With Rockset, you can continuously import the data sitting in your Amazon Redshift clusters without any ETL, run fast SQL, and perform operational analytics without worrying about capacity planning, cluster sizing, or performance tuning.
Each Amazon Redshift cluster can have multiple databases, schemas, and tables, and each table requires a data definition before data can be inserted. Rockset makes it easy to connect a Redshift cluster and use the same set of permissions to access all the tables inside it. You also don't need to provide any data schema to create a collection in Rockset. Rockset uses Redshift's unload capability to stage data into an S3 bucket in the same region as the cluster, then ingests the data from that S3 bucket. Rockset unloads data using the PARALLEL option to stage it faster.
Rockset also allows users to specify a timestamp field in the source Redshift table, such as last_updated_at, to monitor for new updates. The sync latency is no more than a few seconds when the source Redshift table is updated continuously, and no more than 5 minutes when the source is updated infrequently. This currently handles only updates and new inserts in the source table; support for record deletes is coming soon.
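The incremental sync described above follows the classic watermark pattern: each pass fetches only rows whose timestamp field is newer than the last value seen. The sketch below is a simplified illustration of that pattern (not Rockset's actual implementation); the in-memory `table` and its rows are hypothetical stand-ins for a Redshift table and a `WHERE updated_at > :watermark` clause.

```python
from datetime import datetime, timedelta

def poll_new_rows(rows, watermark):
    """Return rows updated strictly after the watermark, plus the new watermark.

    `rows` stands in for the source Redshift table; in the real pattern
    the filter runs as a WHERE clause on the monitored timestamp column.
    """
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

# Simulated source table with an updated_at column.
t0 = datetime(2019, 6, 1, 12, 0, 0)
table = [
    {"REQUESTID": "1", "updated_at": t0},
    {"REQUESTID": "2", "updated_at": t0 + timedelta(minutes=5)},
]

fresh, wm = poll_new_rows(table, t0)    # first pass picks up row 2 only
table.append({"REQUESTID": "3", "updated_at": t0 + timedelta(minutes=9)})
fresh2, wm2 = poll_new_rows(table, wm)  # next pass sees only the new insert
```

Advancing the watermark after every pass is what keeps the per-poll work proportional to the number of changed rows rather than the table size.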
Rockset requires the source Redshift table to have a primary key. Primary key values from the Redshift table are used to construct the _id field in Rockset, which uniquely identifies a document in a Rockset collection. This ensures that updates to an existing item in the Redshift table are applied to the corresponding document in Rockset.
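The effect of keying documents by the primary key can be illustrated with a tiny upsert sketch: re-ingesting a changed row replaces the existing document instead of creating a duplicate. This is only an illustration of the semantics; Rockset's actual _id construction is internal, and the field names here mirror the demo table.

```python
collection = {}  # maps _id -> document, mimicking a Rockset collection

def upsert(doc, primary_key="REQUESTID"):
    """Apply one ingested row, keyed by its primary key value."""
    collection[doc[primary_key]] = doc

upsert({"REQUESTID": "42", "STATUS": "OPEN"})
upsert({"REQUESTID": "42", "STATUS": "CANCEL"})  # an update, not a duplicate
```

Without a stable key derived from the source row, every synced update would land as a new document and the collection would drift from the table.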
Connecting Redshift to Rockset
For this demo, I have loaded sample Oakland Call Center data into Amazon Redshift, which I will use to create the Redshift integration below. The table uses REQUESTID as its primary key. I have also created a column named updated_at in the source Redshift table, which is set to the current time whenever a record is inserted or updated. The create command on Redshift looks like this:
```sql
create table oakland_call_center (
    ....
    ....
    updated_at datetime default sysdate
);
```
Creating a Redshift Integration
To let Rockset access the Redshift cluster, I will create an integration with all the permissions required to access it. This includes IAM permissions for the S3 staging bucket, which must exist in the same account and region as the Redshift cluster, and database permissions for the Redshift user. For more information, you can refer to the docs.
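As a rough sketch, the IAM policy for the staging bucket might look like the fragment below. The bucket name is a placeholder, and the exact set of actions Rockset requires is documented in its docs; this is just the typical shape of a policy that allows Redshift to unload into a bucket and a reader to consume the staged files.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-rockset-staging-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-rockset-staging-bucket/*"
    }
  ]
}
```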
Creating a Rockset collection
Once the Redshift integration is set up, we are ready to use it to ingest tables from the Redshift cluster. Rockset requires the database, schema, and table name at this step.
At this point the collection is created and is being populated with data from the specified Redshift table. We can now start querying the data.
Querying Redshift Data in Rockset
Each row in the Redshift table corresponds to one document in the Rockset collection. Let's describe the collection and see all of its fields. For datetime-type fields in the Redshift table, Rockset stores the values as timestamps in the default UTC timezone.
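To make the timestamp handling concrete, here is a small Python sketch of the conversion: an ISO 8601 datetime string is parsed and normalized to UTC. This is a simplified stand-in for what `PARSE_DATETIME_ISO8601` does in the queries that follow, and the sample value is made up.

```python
from datetime import datetime, timezone

def parse_iso8601_utc(s):
    """Parse an ISO 8601 datetime string into a UTC-aware datetime."""
    dt = datetime.fromisoformat(s)
    if dt.tzinfo is None:  # naive values are assumed to already be UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

ts = parse_iso8601_utc("2019-06-01T09:30:00")
```

Normalizing everything to UTC at ingest time is what makes range filters like "the last 3 days" behave consistently regardless of where the data was produced.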
```
rockset> describe "oakland-call-center";
+---------------------+---------------+---------+-----------+
| field               | occurrences   | total   | type      |
|---------------------+---------------+---------+-----------|
| ['BEAT']            | 608949        | 608949  | string    |
| ['COUNCILDISTRICT'] | 608949        | 608949  | string    |
| ['City']            | 608949        | 608949  | string    |
| ['DATETIMECLOSED']  | 608949        | 608949  | string    |
| ['DATETIMEINIT']    | 608949        | 608949  | string    |
| ['DESCRIPTION']     | 608949        | 608949  | string    |
| ['PROBADDRESS']     | 608949        | 608949  | string    |
| ['REFERREDTO']      | 608949        | 608949  | string    |
| ['REQADDRESS']      | 608949        | 608949  | string    |
| ['REQCATEGORY']     | 608949        | 608949  | string    |
| ['REQUESTID']       | 608949        | 608949  | string    |
| ['SOURCE']          | 608949        | 608949  | string    |
| ['SRX']             | 608949        | 608949  | string    |
| ['SRY']             | 608949        | 608949  | string    |
| ['STATUS']          | 608949        | 608949  | string    |
| ['State']           | 608949        | 608949  | string    |
| ['updated_at']      | 608949        | 608949  | timestamp |
| ['_event_time']     | 608949        | 608949  | timestamp |
| ['_id']             | 608949        | 608949  | string    |
+---------------------+---------------+---------+-----------+
```
Now, let's run some queries on this dataset to understand call center operations. The first query below looks at requests across different sources over the last 3 days.
```
rockset> select SOURCE, EXTRACT(DATE FROM PARSE_DATETIME_ISO8601(DATETIMEINIT)) as date
         from "oakland-call-center"
         where PARSE_DATETIME_ISO8601(DATETIMEINIT) > CURRENT_DATETIME() - DAYS(3);
```
Using Tableau, I also plotted this chart to analyze the trend.
Most of the requests come through SeeClickFix (a mobile app for raising requests). Next, let's check how many of these were canceled (a request is canceled if it was created erroneously). Agents answering customer calls spend time on such requests as well, so a large number of cancellations is a good indicator that something in the request flow needs fixing.
```
rockset> select count(*) as cancelled_requests
         from "oakland-call-center"
         where STATUS = 'CANCEL'
           and SOURCE = 'SeeClickFix'
           and PARSE_DATETIME_ISO8601(DATETIMEINIT) > CURRENT_DATETIME() - DAYS(3);
+----------------------+
| cancelled_requests   |
|----------------------|
| 44                   |
+----------------------+
```
The collection also tracks when each issue was resolved. Let's check the average number of days taken to resolve a case, broken down by request category. Resolving a case involves external factors, so this metric can be used to dig deeper into the operations of teams that take a long time to resolve requests.
Note that the first CTE groups by `REQCATEGORY`, the column it selects, so the per-category day totals line up with the per-category counts in the join.

```
rockset> WITH sum_days as (
           select sum(EXTRACT(day from PARSE_DATETIME_ISO8601(o."DATETIMECLOSED")
                                     - PARSE_DATETIME_ISO8601(o."DATETIMEINIT"))) as days,
                  o."REQCATEGORY"
           from "oakland-call-center" o
           where o."DATETIMECLOSED" != '' and o."DATETIMEINIT" != ''
           GROUP BY o."REQCATEGORY"
         ),
         sum_sources as (
           select o."REQCATEGORY", count(*) as count
           from "oakland-call-center" o
           group by o."REQCATEGORY"
         )
         select sum_days.REQCATEGORY, (sum_days.days / sum_sources.count) as avg_days
         from sum_days
         join sum_sources on sum_days.REQCATEGORY = sum_sources.REQCATEGORY;
+----------------+------------+
| REQCATEGORY    | avg_days   |
|----------------+------------|
| WATERSHED      | 260        |
| METER_REPAIR   | 223        |
| VEGCONTR       | 150        |
| STREETSW       | 105        |
| ROW            | 99         |
| LAB            | 89         |
| GIS            | 81         |
| TREES          | 79         |
| BLDGMAINT      | 63         |
| PARKS          | 54         |
| SURVEY         | 52         |
| TRAFFIC_ENGIN  | 46         |
...................
...................
| DRAINAGE       | 13         |
| POLICE         | 11         |
| ELECTRICAL     | 10         |
| ROW_INSPECTORS | 8          |
| RECYCLING      | 8          |
| GRAFFITI       | 8          |
| ILLDUMP        | 3          |
| HE_CLEAN       | 2          |
| OTHER          | 1          |
| PARKING        | 0          |
+----------------+------------+
```
The queries above are just a subset of the queries that operational dashboards typically require. Rockset supports JOINs, so you can run complex queries across collections. I simply created the Redshift integration with Rockset and ran fast SQL without any ETL or cluster resizing. The entire process of loading the data, querying the collection, and building charts took about a couple of hours. Rockset makes it easy for data practitioners to ingest and join data across different Redshift tables, or even other sources!