Office Hours Recap: Optimize Cost and Query Latency With SQL Transformations and Real-Time Rollups

May 25, 2022

,
See Rockset
in action

Get a product tour with a Rockset engineer

Visit our Rockset Community to review previous Office Hours or to see what's coming up.


During our Office Hours a few weeks ago, Tyler and I went over what are SQL transformations and real-time rollups, how to apply them, and how they affect your query performance and index storage size. Below, we’ll cover some of the highlights.

SQL transformations and real-time rollups occur at ingestion time before the Rockset collection is populated with data. Here’s the diagram I did during Rockset Office Hours.

office-hours-image-2


office-hours-image-1

Tyler demonstrated how query performance and storage are impacted when you use SQL transformations and real-time rollups with three different queries. Below, I’ll describe how we built the collection and what we’re doing in the queries.

Initial Query With no SQL Transformations or Rollups Applied

In this query, we’re building a time-series object that grabs the most active tweeters within the last day. There are no SQL transformations or rollups, so the collection contains just the raw data.

-- Initial query against the plain collection 1day: 12sec
with _data as (
    SELECT
        count(*) tweets,
        cast(DATE_TRUNC('HOUR',PARSE_TIMESTAMP('%a %h %d %H:%M:%S %z %Y', t.created_at)) as string) as event_date_hour,
        t.user.id,
        arbitrary(t.user.name) name
    FROM
        officehours."twitter-firehose" t hint(access_path=column_scan)
    where
        t.user.id is not null
        and t.user.id is not undefined
        and PARSE_TIMESTAMP('%a %h %d %H:%M:%S %z %Y', t.created_at) > CURRENT_TIMESTAMP() - DAYS(1)
    group by
        t.user.id,
        event_date_hour
    order by
        event_date_hour desc
),
_intermediate as (
    select
        array_agg(event_date_hour) _keys,
        array_agg(tweets) _values,
        id,
        arbitrary(name) name
    from
        _data
    group by
        _data.id
)
select
    object(_keys, _values) as timeseries,
    id,
    name
from
    _intermediate
    order by length(_keys) desc
limit 100

Source: GitHub gist

  • On line 4 we are counting the total tweets
  • On line 7 we are pulling the ARBITRARY for t.user.name — you can read more about ARBITRARY
  • On lines 15 and 16 we’re doing aggregations on t.user.id and event_date_hour
  • On line 5 we create the event_date_hour by doing a CAST
  • On line 11-12 we filter user.id that is not null or undefined
  • On line 13 we get the latest tweeters from the last day
  • On lines 14-16 we do a GROUP BY with t.user.id and event_date_hour
  • On lines 20-37 we build our time series object
  • On line 38 we return the top 100 tweeters

This inefficient contrived query was run on live data with a medium VI and took about 7 seconds to execute.

Second Query With SQL Transformation Applied Only

In the second query, we applied SQL transformations when we created the collection.

SELECT
  *
  , cast(DATE_TRUNC('HOUR', PARSE_TIMESTAMP('%a %h %d %H:%M:%S %z %Y', i.created_at)) as string) as event_date_hour
  , PARSE_TIMESTAMP('%a %h %d %H:%M:%S %z %Y', i.created_at) as _event_time
  , cast(i.id as string) as id
FROM
  _input i
where
  i.user.id is not null
  and i.user.id is not undefined

Source: GitHub gist

  • On line 3, we create an event_date_hour
  • On line 4, we create an event_time
  • On line 5, we create an id as a string
  • On lines 9 and 10, we select user.id that is not null or undefined

After we apply the transformations, our SQL query looks more simplified than the initial query:

with _data as (
    SELECT
        count(*) tweets,
        event_date_hour,
        t.user.id,
        arbitrary(t.user.name) name
    FROM
        officehours."twitter-firehose_sqlTransformation" t hint(access_path=column_scan)
    where
        _event_time > CURRENT_TIMESTAMP() - DAYS(1)
    group by
        t.user.id,
        event_date_hour
    order by
        event_date_hour desc
),
_intermediate as (
    select
        array_agg(event_date_hour) _keys,
        array_agg(tweets) _values,
        id,
        arbitrary(name) name
    from
        _data
    group by
        _data.id
)
select
    object(_keys, _values) as timeseries,
    id,
    name
from
    _intermediate
    order by length(_keys) desc
limit 100

Source: GitHub gist

  • On line 3, we are counting the total tweets
  • On line 6 we are pulling the ARBITRARY for t.user.name
  • On line 10, the filter is now on the timestamp
  • On lines 11-13 we still do a GROUP BY with t.user.id and event_date_hour
  • On lines 17-34 we still create our time-series object

Basically, we excluded whatever we applied during SQL transformations in the query itself. When we run the query, the storage index size doesn’t change too much, but the query performance goes from seven seconds to three seconds or so. By doing SQL transformations, we save on compute, and it shows — the query performs much faster.

Third Query With SQL Transformation and Rollups Applied

In the third query we performed SQL transformations and rollups when we created the collection.

SELECT
  count(*) tweets,
  cast(DATE_TRUNC('HOUR', PARSE_TIMESTAMP('%a %h %d %H:%M:%S %z %Y', i.created_at)) as string) as event_date_hour_str,
  DATE_TRUNC('HOUR', PARSE_TIMESTAMP('%a %h %d %H:%M:%S %z %Y', i.created_at)) as event_date_hour,
  cast(i.user.id as string) id,
  arbitrary(i.user.name) name
FROM
  _input i
where
  i.user.id is not null
  and i.user.id is not undefined
group by
  i.user.id,
  event_date_hour_str,
  event_date_hour

Source: GitHub gist

In addition to what we did applied earlier for the SQL transformations, we’re now applying rollups as well.

  • On line 2, we are counting all the tweets
  • On line 6 we are pulling the ARBITRARY
  • On lines 12-15 we are applying the GROUP_BY

So now, our final SQL query looks like this:

with _data as (
    SELECT
        tweets,
        event_date_hour_str,
        event_date_hour,
        id,
        name
    FROM
        officehours."twitter-firehose-rollup" t hint(access_path=column_scan) 
    where
        t.event_date_hour > CURRENT_TIMESTAMP() - DAYS(1)
    order by
        event_date_hour desc
),
_intermediate as (
    select
        array_agg(event_date_hour_str) _keys,
        array_agg(tweets) _values,
        id,
        arbitrary(name) name
    from
        _data
    group by
        _data.id
)
select
    object(_keys, _values) as timeseries,
    id,
    name
from
    _intermediate
order by length(_keys) desc
Limit 100

Source: GitHub gist

When we apply the SQL transformations with the rollups, our query goes from a womping seven seconds to two seconds. Also, our storage index size goes from 250 GiB to 11 GiB now!

Advantages/Considerations for SQL Transformations and Real-Time Rollups

SQL Transformations

Advantages:

  • Improves query performance
  • Can drop and mask fields at ingestion time
  • Improve compute cost

Consideration:

  • Need to know what your data looks like

Real-Time Rollups

Advantages:

  • Improves query performance and storage index size
  • Data is updated within the second
  • Don’t need to worry about out-of-order arrivals
  • Exactly-once semantics
  • Improve compute cost

Considerations:

  • Data resolution — You’ll lose the raw data resolution. If you need a copy of the raw data, create another collection without rollups. If you want to avoid double storage, you can set a retention policy when you create a collection.

Rockset’s SQL-based transformations and rollups allow you to perform data transformation that improves query performance and reduces storage index size. The final data transformation is what is persisted in the Rockset collection. It’s important to note that real-time rollups will continuously run on incoming data. In terms of out-of-order arrivals, Rockset will process them and update the required data exactly as though those events actually arrived in-order and on-time. Finally, Rockset guarantees exactly-once semantics for streaming sources, like Kafka and Kinesis.

You can catch the replay of Tyler’s Office Hours session on the Rockset Community. If you have more questions, please find Tyler and Nadine in the Rockset Community.

Embedded content: https://youtu.be/dUrHqoVKC34

Resources:


Rockset is the leading real-time analytics platform built for the cloud, delivering fast analytics on real-time data with surprising efficiency. Learn more at rockset.com.