Rockset
  • Loading Data

SQL Transformations

When inserting data into Rockset, you can transform the data by providing a single SQL query, that contains all of the desired data transformations. This is referred to as the collection’s field mapping query.

Every ingested document is passed through the field mapping query specified at collection creation time. The result of the field mapping query for an input document is the final document that will be persisted in the collection. In addition, query-based field mappings enable you to drop input documents based on some specified condition so they don’t become persisted in the collection.

Use Cases

Field mappings are often used to apply data transformations in the following use cases:

Type Coercion

You can map an input field to a specific advanced data type. For example, mapping an input string field into a date, datetime, or timestamp field.

Example:

CAST(last_updated_at as DATETIME)

PII/PHI Masking

If your input dataset has PII (personally identifiable information) or PHI (protected health information) fields, you can use a one-way crypto hash function such as SHA256() on the input field, so that Rockset only stores the hashed value and not the original PII/PHI field.

Example:

TO_HEX(SHA256(email_address))

Search Tokenization

Text fields in the input can be tokenized at data ingestion time so you can more efficiently run text search queries later.

Example:

TOKENIZE(tweet_text, 'en_US')

Functional Index

If your application queries involve complex SQL functions which result in slow query processing, you can use a field mapping to pre-compute the result of the expensive SQL expression at data ingestion. Queries on these computed output fields are much faster than executing the SQL expression at query time.

Example:

-- extract the domain from the input email address
REGEXP_EXTRACT_ALL(
    email_address,
    '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
    2
)

Importing Geographic Data

You can convert latitude/longitude points or well known text strings to Rockset's native geography type at ingest time by specifying an appropriate field mapping.

Example:

-- create a geography from a longitude and latitude field
ST_GEOGPOINT(longitude, latitude)

Dropping Fields

If you have unused large fields in your input dataset, you can drop them upon data ingestion to save storage space.

Special Fields

You can use field mappings to specify the values of special fields, such as _id and _event_time. See the special fields documentation for more information, and when to use field mappings to specify their values.

Creating Field Mappings

The basic structure of a field mapping query is as follows:

SELECT
    < expressions >
FROM
    _input
WHERE
    < condition > -- optional

Note: As with a normal SQL query, all expressions in the SELECT clause are executed independently of one another, and in parallel. Additionally, you cannot change a collection's field mapping query once the collection has been created.

The following sections provide more detail about the components of a field mapping query.

SELECT Clause

The SELECT clause is where you define:

  • The input fields you want to keep or drop in the final document.
  • The new expressions you want to add as fields to your final document.

To keep all input fields from the input document by default, you include SELECT * in your field mapping query. To drop certain input fields, include something like: SELECT * EXCEPT (<excluded fields>) in your field mapping query. For example, SELECT * EXCEPT (a, b) keeps all input fields from the input document by default, except for fields a and b.

If you don’t want to include any input fields in the final document by default, omit the * field from your field mapping query's SELECT clause.

To add a new field to your final document, add SELECT <expression> AS <field> to your field mapping query. For example SELECT CONCAT(a, b) AS ab will add a new field named ab to your final document, and its value will be computed from the expression CONCAT(a, b).

Note: New fields added to the final document using the SELECT <expression> AS <field> syntax will take naming precedence over input fields from the input document. This means that if you have SELECT *, CONCAT(a, b) AS a as part of your field mapping query, the final document will contain field a with value CONCAT(a, b), not the value of field a in the input document.

FROM Clause

The FROM clause does not change for different field mapping queries. It always looks like FROM _input, and it represents the stream of input documents.

WHERE Clause

The WHERE clause enables you to selectively keep input documents based on some condition. Including this clause in the field mapping query is optional, unlike the SELECT and FROM clauses which are required.

As with a normal SQL query, the WHERE clause condition can refer to input fields from the input document. For example, you can include WHERE a IS NOT NULL AND b IS NOT NULL in your field mapping query to keep non-null input documents and drop documents where the field a is null or the field b is null.

Examples

Let’s look at some example field mapping queries.

Example:

SELECT
    *
EXCEPT
    -- drop some fields
    (extra_data, more_extra_data, email_address),
    -- type coercion
    CAST(last_updated_at AS DATETIME) AS last_updated_at,
    -- PII/PHI masking
    TO_HEX(SHA256(email_address)) AS email_address_hash,
FROM
    _input
WHERE
    -- filter input documents
    email_address IS NOT NULL

This field mapping query keeps all input fields by default, except for extra_data, more_extra_data, and email_address. It adds two new fields to the final document: last_updated_at and email_address_hash. If there are any input fields named last_updated_at or email_address_hash in the input document, their values in the final document will be the values of the specified expressions, not their values in the input document. It also filters out all input documents where email_address is null.

Example:

SELECT
    twitter_handle,
    follower_count,
    -- search tokenization
    TOKENIZE(tweet_text, 'en_US') AS tweet_text_tokens,
    -- functional index
    REGEXP_EXTRACT_ALL(
        email_address,
        '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
        2
    ) AS email_address_domain,
    -- native geography type
    ST_GEOGPOINT(longitude, latitude) AS geo,
FROM
    _input

This field mapping query drops all input fields by default and only keeps the input fields twitter_handle and follower_count. It also adds three new fields to our final document: tweet_text_tokens, email_address_domain, and geo.

Example:

SELECT
    *,
    IF(
        user IS NOT NULL,
        CONCAT(user, '@rockset.com'),
        UNDEFINED
    ) AS email_address
FROM
    _input

This field mapping query keeps all input fields by default and adds one new field to the final document: email_address.

Note: If user is null, the email_address field will be set to undefined in the final document. Setting a field’s value to undefined in a field mapping query means that the field will not be present in the final document. Thus, email_address will only be present in the final document if the input document’s user field is not null. See documentation on undefined for more details.

Example:

SELECT
    *,
    IF (
        TRY(total_weight / num_items) IS NOT NULL,
        total_weight / num_items,
        UNDEFINED
    ) AS avg_weight
FROM
    _input

This field mapping query keeps all input fields by default. It adds one new field to the final document: avg_weight. As with the example above, the avg_weight field will only be present in the final document if the expression TRY(total_weight/num_items) is not null. TRY is a special operator that evaluates an expression and returns null if an error is encountered during the expression evaluation (rather than failing the query entirely). Otherwise, TRY will return the expression evaluation result. You can use TRY in a field mapping query so that it only includes a field in the final document if its expression can be evaluated successfully. See the documentation for TRY for more details.