Rockset
  • Loading Data

Rollups

Rollups are a special type of SQL transformation that enables you to aggregate data as it is ingested. Normal SQL transformations can transform individual documents or filter out a set based on some criteria. Rollups, on the other hand, can combine multiple documents into individual ones.

In practice, it's as simple as adding aggregation functions to your SELECT clause or adding a GROUP BY clause to your SQL transformation. As new data comes in, Rockset will transform and aggregate it before storing it in your rollup collection. For time-series data, even out-of-order data arrivals that come in after the fact will be properly aggregated automatically.

Using rollups provides two main benefits:

  • Reduced storage size because only the aggregated data is stored and indexed.
  • Improved query performance because the metrics you care about have been precomputed at ingestion time.

Rollups are currently available for the following streaming data sources:

  • Apache Kafka
  • Confluent Cloud
  • Confluent Platform (self-managed)
  • Amazon Kinesis
  • Azure Service Bus
  • Azure Event Hubs

Rollups are also available for the following Data Lakes and Warehouses:

  • Amazon S3
  • Google Cloud Storage
  • Azure Blob Storage

Real-time Rollups

Rollups in Rockset are unique and preserve the real-time nature of your data analytics.

Traditionally, rollups are implemented with ahead-of-time aggregation, which means you don’t have access to your data until the end of the aggregation interval. In this approach, aggregation groups are considered immutable, so all data for an aggregation interval must be present for the aggregation to happen.

In contrast, rollups in Rockset allow updates to aggregation groups, so your data remains real-time. You can query your Rockset rollup collection and get up-to-date results as incoming data continues to be aggregated. In addition, because updates to aggregation groups are allowed, out-of-order arrivals "just work" in Rockset.

Creating Rollups

To use rollups for a collection, you must provide a single SQL query that contains all of the required data transformations and aggregations. You cannot change a collection's rollup query once the collection has been created.

The structure of a rollup query looks as follows:

SELECT
    < expressions >
FROM
    _input
WHERE
    < condition > -- optional
GROUP BY
    < fields >
HAVING
    < condition > -- optional

The FROM and WHERE clauses of a rollup query are the same as those of a field mapping query. Refer to the SQL transformations documentation for more information on those parts of the rollup query.

Let’s take a closer look at the other parts of a rollup query.

GROUP BY clause

The GROUP BY clause in a rollup query works the same way as in a regular SQL query. It is used to define the grouping set, which is a list of fields that defines how input rows should be aggregated. All input rows with the same values for the fields in the grouping set will be aggregated together to generate one output row.

A rollup query is only allowed to contain one grouping set. This means that you can have GROUP BY a, b, c in your rollup query but not GROUP BY GROUPING SETS ((a), (a,b)) or GROUP BY ROLLUP (a, b).

Finally, the GROUP BY clause is optional for a rollup query. If your rollup query contains aggregate functions but no GROUP BY clause, all of the input rows are considered as part of one group and will be aggregated together to form one output row. The rollup collection in this case will contain only one row.

HAVING clause

The HAVING clause in a rollup query allows you to apply a filter on the aggregated rows generated by the GROUP BY clause and aggregations. Unlike the WHERE clause, which filters out input rows, the HAVING clause only filters out rows after the GROUP BY clause and aggregations have been applied.

As with a normal SQL query, the HAVING clause can refer to the fields in the GROUP BY clause or aggregate functions. For example, if your rollup query groups by fields a and b, then including HAVING a IS NOT NULL in your rollup query will drop the aggregated rows where a is null. If you change the rollup query to contain HAVING COUNT(*) > 1 instead, this keeps the aggregated rows where at least two rows were used to form the aggregated row.

SELECT clause

The SELECT clause of a rollup query is composed differently than the SELECT clause of a field mapping query.

Because all input rows with the same values for the fields in the grouping set will be aggregated together to form one output row, the SELECT clause for a rollup query can only contain aggregate functions or expressions based on the fields in the grouping set.

Examples

The following are some example rollup queries.

Example:

SELECT
    DATE_TRUNC('HOUR', PARSE_TIMESTAMP_ISO8601(ts)) AS _event_time,
    location,
    SUM(cat_count) AS cat_count,
    MAX(cat_count + dog_count) AS peak_animals_seen,
    SQRT(COUNT(*) + AVG(cat_count)) * APPROX_DISTINCT(dog_count) AS my_crazy_metric
FROM
    _input
GROUP BY
    _event_time,
    location
HAVING
    SUM(cat_count) > 5

Imagine the data source is a network of cameras that detect cats and dogs. Every time a cat and/or dog is found, an event is recorded in the form: {"ts": "2021-04-20T16:20:42+0000", "location": "San Francisco", "cat_count": 1, "dog_count": 2}. There is a requirement to count the total number of cats per location, per hour. There is also a requirement to track the locations and times with the highest numbers of animals seen (perhaps to detect if cameras need faster processors that can count more animals). In addition, the my_crazy_metric expression is included here as an example to show just how complex a rollup query expression can get. The HAVING clause keeps the aggregated rows where the total number of cats seen is greater than five.

Example:

SELECT
    DATE_TRUNC(
        'DAY',
        PARSE_TIMESTAMP('%Y/%m/%d %H:%M', datetimeStr)
    ) AS _event_time,
    region,
    SUM(purchasePrice) AS revenue,
    AVG(purchasePrice) AS avgPrice,
    MAX(purchasePrice) AS largestSale,
    SUM(CAST(purchasePrice > 3 AS int)) AS largeSales,
    SUM(purchasePrice * 2) + 1 AS allTheTransforms,
    COUNT(*) AS counts
FROM
    _input
GROUP BY
    _event_time,
    region

In this case, there is raw data for a business’s sales in different regions. Each time a purchase is made, the timestamp, the location of the sale, and the price of the item purchased are recorded. To best understand the business’s financial health, a data-driven approach is taken to aggregate the purchase price data in several different ways. The GROUP BY clause groups data by date and region.