- Loading Data
Rollups
Rollups are a special type of SQL transformation that enables you to aggregate data as it is ingested. Normal SQL transformations can transform individual documents or filter out a set based on some criteria. Rollups, on the other hand, can combine multiple documents into individual ones.
In practice, it's as simple as adding aggregation functions to your SELECT
clause or adding a
GROUP BY
clause to your SQL transformation. As new data comes in, Rockset will transform and
aggregate it before storing it in your rollup collection. For time-series data, even out-of-order
data arrivals that come in after the fact will be properly aggregated automatically.
Using rollups provides two main benefits:
- Reduced storage size because only the aggregated data is stored and indexed.
- Improved query performance because the metrics you care about have been precomputed at ingestion time.
Rollups are currently available for streaming data sources: Apache Kafka and Amazon Kinesis.
Real-time Rollups
Rollups in Rockset are unique and preserve the real-time nature of your data analytics.
Traditionally, rollups are implemented with ahead-of-time aggregation, which means you don’t have access to your data until the end of the aggregation interval. In this approach, aggregation groups are considered immutable, so all data for an aggregation interval must be present for the aggregation to happen.
In contrast, rollups in Rockset allow updates to aggregation groups, so your data remains real-time. You can query your Rockset rollup collection and get up-to-date results as incoming data continues to be aggregated. In addition, because updates to aggregation groups are allowed, out-of-order arrivals "just work" in Rockset.
Creating Rollups
To use rollups for a collection, you must provide a single SQL query that contains all of the required data transformations and aggregations. You cannot change a collection's rollup query once the collection has been created.
The structure of a rollup query looks as follows:
SELECT
< expressions >
FROM
_input
WHERE
< condition > -- optional
GROUP BY
< fields >
HAVING
< condition > -- optional
The FROM
and WHERE
clauses of a rollup query are the same as those of a field mapping query.
Refer to the SQL transformations documentation for more
information on those parts of the rollup query.
Let’s take a closer look at the other parts of a rollup query.
GROUP BY
clause
The GROUP BY
clause in a rollup query works the same way as in a regular SQL query. It is used to
define the grouping set, which is a list of fields that defines how input rows should be aggregated.
All input rows with the same values for the fields in the grouping set will be aggregated together
to generate one output row.
A rollup query is only allowed to contain one grouping set. This means that you can have
GROUP BY a, b, c
in your rollup query but not GROUP BY GROUPING SETS ((a), (a,b))
or
GROUP BY ROLLUP (a, b)
.
Finally, the GROUP BY
clause is optional for a rollup query. If your rollup query
contains aggregate functions but no GROUP BY
clause, all of the input rows are considered as part
of one group and will be aggregated together to form one output row. The rollup collection in this
case will contain only one row.
HAVING
clause
The HAVING
clause in a rollup query allows you to apply a filter on the aggregated rows generated
by the GROUP BY
clause and aggregations. Unlike the WHERE
clause, which filters out input rows,
the HAVING
clause only filters out rows after the GROUP BY
clause and aggregations have been
applied.
As with a normal SQL query, the HAVING
clause can refer to the fields in the GROUP BY
clause or
aggregate functions. For example, if your rollup query groups by fields a
and b
, then
including HAVING a IS NOT NULL
in your rollup query will drop the aggregated rows where a
is null
. If you change the rollup query to contain HAVING COUNT(*) > 1
instead, this keeps the
aggregated rows where at least two rows were used to form the aggregated row.
SELECT
clause
The SELECT
clause of a rollup query is composed differently than the SELECT
clause of a field
mapping query.
Because all input rows with the same values for the fields in the grouping set will be aggregated
together to form one output row, the SELECT
clause for a rollup query can only contain aggregate
functions or expressions based on the fields in the grouping set.
Examples
The following are some example rollup queries.
Example:
SELECT
DATE_TRUNC('HOUR', PARSE_TIMESTAMP_ISO8601(ts)) AS _event_time,
location,
SUM(cat_count) AS cat_count,
MAX(cat_count + dog_count) AS peak_animals_seen,
SQRT(COUNT(*) + AVG(cat_count)) * APPROX_DISTINCT(dog_count) AS my_crazy_metric
FROM
_input
GROUP BY
_event_time,
location
HAVING
SUM(cat_count) > 5
Imagine the data source is a network of cameras that detect cats and dogs. Every time a cat and/or
dog is found, an event is recorded in the form: {"ts": "2021-04-20T16:20:42+0000", "location": "San Francisco", "cat_count": 1, "dog_count": 2}
. There is a requirement to count the total number of cats
per location, per hour. There is also a requirement to track the locations and times with the highest
numbers of animals seen (perhaps to detect if cameras need faster processors that can count more
animals). In addition, the my_crazy_metric
expression is included here as an example to show just
how complex a rollup query expression can get. The HAVING
clause keeps the aggregated rows
where the total number of cats seen is greater than five.
Example:
SELECT
DATE_TRUNC(
'DAY',
PARSE_TIMESTAMP('%Y/%m/%d %H:%M', datetimeStr)
) AS _event_time,
region,
SUM(purchasePrice) AS revenue,
AVG(purchasePrice) AS avgPrice,
MAX(purchasePrice) AS largestSale,
SUM(CAST(purchasePrice > 3 AS int)) AS largeSales,
SUM(purchasePrice * 2) + 1 AS allTheTransforms,
COUNT(*) AS counts
FROM
_input
GROUP BY
_event_time,
region
In this case, there is raw data for a business’s sales in different regions. Each time a purchase is
made, the timestamp, the location of the sale, and the price of the item purchased are recorded. To
best understand the business’s financial health, a data-driven approach is taken to aggregate
the purchase price data in several different ways. The GROUP BY
clause groups data by date and region.