• Loading Your Data
  • Collections
  • Rollups

Rollups

Rollups are a type of ingest transformation that can pre-aggregate time-series data. Regular field mappings can transform individual documents or filter out documents based on some criteria. On the other hand, rollups can combine multiple documents into individual ones by aggregating fields.

In this way, rollups can dramatically reduce the amount of storage needed for your collection. You will also see improved query performance when using rollups because you will be querying pre-aggregated data, not raw data. As more raw data streams into your collection, the underlying aggregations will be updated in real-time.

#Creating rollups

To use rollups for a collection, you just need provide a single SQL query that contains all of the required data transformations and aggregations. You cannot change a collection's rollup query once the collection has been created.

The structure of the rollup query will look like this:

SELECT
    < expressions >
FROM
    _input
WHERE
    < condition > -- optional
GROUP BY
    < fields >
HAVING
    < condition > -- optional

The FROM and WHERE clauses of a rollup query are the same as those of a field mapping query. Please refer to the field mappings documentation for more information on those parts of the rollup query.

Let’s take a closer look at the other parts of our rollup query.

#GROUP BY clause

The GROUP BY clause in a rollup query works the same way as in a regular SQL query. It is used to define the grouping set, which is a list of fields that defines how input rows should be aggregated. All input rows with the same values for the fields in the grouping set will be aggregated together to generate one output row.

A rollup query is only allowed to contain one grouping set. This means that you can have GROUP BY a, b, c in your rollup query but not GROUP BY GROUPING SETS ((a), (a,b)) or GROUP BY ROLLUP (a, b).

Finally, the GROUP BY clause is technically optional for a rollup query. If your rollup query contains aggregate functions but no GROUP BY clause, all of the input rows are considered as part of one group and aggregated together to form one output row. The final collection in this case will only contain one row.

#HAVING clause

The HAVING clause in a rollup query allows you to apply a filter on the aggregated rows generated by the GROUP BY clause and aggregations. Unlike the WHERE clause, which filters out input rows, the HAVING clause only filters out rows after the GROUP BY clause and aggregations have been applied.

Like in a normal SQL query, the HAVING clause can refer to the fields in the GROUP BY clause or aggregate functions. For example, assuming your rollup query is grouping by fields a and b, then including HAVING a IS NOT NULL in your rollup query means that you want to drop the aggregated rows where a is null. If we change our rollup query to contain HAVING COUNT(\*) > 1 instead, we are specifying that we only want to keep the aggregated rows where at least two rows were used to form the aggregated row.

#SELECT clause

The SELECT clause of a rollup query is composed differently than the SELECT clause of a field mapping query.

Because all input rows with the same values for the fields in the grouping set will be aggregated together to form one output row, the SELECT clause for a rollup query can only contain aggregate functions or expressions based on the fields in the grouping set.

#Examples

Let’s look at some example rollup queries.

SELECT
    DATE_TRUNC('HOUR', PARSE_TIMESTAMP_ISO8601(ts)) AS _event_time,
    location,
    SUM(cat_count) AS cat_count,
    MAX(cat_count + dog_count) AS peak_animals_seen,
FROM
    _input
GROUP BY
    _event_time,
    location
HAVING
    SUM(cat_count) > 5

Imagine our data source is a network of cameras that detect cats and dogs. Every time a cat and/or dog is found, we record an event of the form {"ts": "2021-04-20T16:20:42+0000", "location": "San Francisco", "cat_count": 1, "dog_count": 2}. We are interested in counting the total number of cats per location, per hour. Also, we are interested in tracking the locations and times with the highest numbers of animals seen (perhaps to detect if cameras need faster processors that can count more animals).

This rollup query will aggregate our raw data at ingest-time, effectively precomputing the expressions we are interested in so we can query them effectively later.

SELECT
    DATE_TRUNC(
        'DAY',
        PARSE_TIMESTAMP('%Y/%m/%d %H:%M', datetimeStr)
    ) AS _event_time,
    region,
    SUM(purchasePrice) AS revenue,
    AVG(purchasePrice) AS avgPrice,
    MAX(purchasePrice) AS largestSale,
    SUM(CAST(purchasePrice > 3 AS int)) AS largeSales,
    SUM(purchasePrice * 2) + 1 AS allTheTransforms,
    COUNT(*) AS counts
FROM
    _input
GROUP BY
    _event_time,
    region

In this case, we have raw data for a business’s sales in different regions. Each time a purchase is made, we record the timestamp, the location of the sale, and the price of the item purchased. To best understand our business’s financial health, we will take a data-driven approach and aggregate the purchase price data in several different ways. We use the GROUP BY clause to let us group our data by date and region.