Transforming Incoming Data

This page describes how to apply transformations to incoming data as it is inserted into a collection using SQL.

Note: If you're looking for documentation on our legacy field mappings feature, you can find that here.

#Overview

Field mappings allow you to define transformations to be applied to all documents inserted into a collection. To use query-based field mappings on a collection, you must provide a single SQL query that contains all of the desired data transformations. We will refer to this as the collection’s field mapping query.

Every ingested document is passed through the field mapping query specified at collection creation time. The result of the field mapping query for an input document is the final document that will be persisted in the collection. In addition, query-based field mappings allow you to drop input documents based on some specified condition so they don’t get persisted in the collection at all.

#Use Cases

Field mappings are often used to apply data transformations in the following use cases:

#Type Coercion

You can map an input field to a specific advanced data type, such as mapping an input string field into a date, datetime, or timestamp field.

Example:

CAST(last_updated_at AS DATETIME)

#PII/PHI Masking

If your input dataset has PII (personally identifiable information) or PHI (protected health information) fields, you may use a one-way crypto hash function such as SHA256() on the input field, so that Rockset only stores the hashed value and not the original PII/PHI field.

Example:

TO_HEX(SHA256(email_address))

#Search Tokenization

Text fields in the input can be tokenized upon data ingestion so that you can run search queries (still using SQL).

Example:

TOKENIZE(tweet_text, 'en_US')

#Functional Index

If your application queries involve complex SQL functions that make query processing slow, you may use a field mapping to pre-compute the result of the expensive SQL expression at data ingestion time. Queries on such pre-computed output fields are much faster than evaluating the SQL expression at query time.

Example:

-- extract the domain from the input email address
REGEXP_EXTRACT_ALL(
    email_address,
    '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
    2
)

#Importing Geographic Data

You can convert latitude/longitude points or well-known text (WKT) strings to Rockset's native geography type at ingest time by specifying an appropriate field mapping.

Example:

-- create a geography from a longitude and latitude field
ST_GEOGPOINT(longitude, latitude)

#Dropping Fields

If you have unused large fields in your input data set, you can drop them upon data ingestion to save storage space.
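For example, the following sketch (using hypothetical field names raw_payload and debug_trace for the unused large fields) keeps everything else while dropping them:

SELECT
    -- raw_payload and debug_trace are hypothetical large fields
    -- that are never queried; dropping them saves storage space
    * EXCEPT (raw_payload, debug_trace)
FROM
    _input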

#Special Fields

You can use field mappings to specify the values of special fields, such as _id and _event_time. See the documentation here for more details on special fields and why you would want to use field mappings to specify their values.
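As a sketch, the query below maps hypothetical input fields order_id and created_at onto the special fields, assuming created_at holds a value in a format that _event_time accepts (see the special fields documentation for the accepted formats):

SELECT
    *,
    -- use a natural key from the input as the document id (hypothetical field)
    order_id AS _id,
    -- use an input timestamp as the event time (hypothetical field)
    created_at AS _event_time
FROM
    _input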

#Creating Field Mappings

The basic structure of the field mapping query is as follows:

SELECT <expressions>
FROM _input
WHERE <condition>       -- optional

Note that, as in a normal SQL query, all expressions in the SELECT clause are executed independently of one another and in parallel. Additionally, you cannot change a collection's field mappings once the collection has been created.
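One consequence of this independence is that an expression in the SELECT clause cannot reference an alias defined by a sibling expression. The sketch below (using hypothetical fields name and email) repeats the full expression instead:

SELECT
    LOWER(name) AS name_lower,
    -- name_lower is not visible here, because all SELECT expressions
    -- are evaluated independently; repeat the full expression instead
    CONCAT(LOWER(name), '@example.com') AS email
FROM
    _input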

Let’s dive into the components of our field mapping query in more detail.

#SELECT clause

The SELECT clause is where you define:

  • the input fields you want to keep or drop in the final document
  • the new expressions you want to add as fields to your final document

To keep all input fields from the input document by default, you would include SELECT * in your field mapping query. If you want to drop certain input fields, you would include something like SELECT * EXCEPT (<excluded fields>) in your field mapping query. For example, SELECT * EXCEPT (a, b) means keep all input fields from the input document by default except for fields a and b.

If you don’t want to include any input fields in the final document by default, omit * from the SELECT clause of your field mapping query.

To add a new field to your final document, include <expression> AS <field> in the SELECT clause of your field mapping query. For example, SELECT CONCAT(a, b) AS ab adds a new field named ab to the final document, with its value computed from the expression CONCAT(a, b).

Note that new fields added to the final document using the SELECT <expression> AS <field> syntax take naming precedence over fields from the input document. This means that if your field mapping query includes SELECT *, CONCAT(a, b) AS a, the final document will contain a field a whose value is CONCAT(a, b), not the input document's value of a.
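Putting these rules together, here is a sketch of a field mapping query (using hypothetical fields a and b) that keeps all input fields except b, and shadows the input field a:

SELECT
    -- keep all input fields except b
    * EXCEPT (b),
    -- takes naming precedence over the input field a
    CONCAT(a, b) AS a
FROM
    _input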

#FROM clause

The FROM clause does not change for different field mapping queries. It always looks like FROM _input and it represents the stream of input documents.

#WHERE clause

The WHERE clause allows you to drop input documents based on some condition. This clause is optional, unlike the SELECT and FROM clauses, which are required.

Like in a normal SQL query, the WHERE clause condition can refer to input fields from the input document. For example, you can include WHERE a IS NOT NULL AND b IS NOT NULL in your field mapping query to drop input documents where the field a is null or the field b is null.
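A complete field mapping query using that condition would look like this:

SELECT
    *
FROM
    _input
WHERE
    -- drop input documents where a or b is null
    a IS NOT NULL AND b IS NOT NULL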

#Examples

Let’s look at some example field mapping queries.

SELECT
    -- drop some fields
    * EXCEPT (extra_data, more_extra_data, email_address),
    -- type coercion
    CAST(last_updated_at AS DATETIME) AS last_updated_at,
    -- PII/PHI masking
    TO_HEX(SHA256(email_address)) AS email_address_hash
FROM
    _input
WHERE
    -- filter input documents
    email_address IS NOT NULL

This field mapping query is keeping all input fields by default, except for extra_data, more_extra_data, and email_address, which are dropped. We are adding two new fields to our final document: last_updated_at and email_address_hash. If there are any input fields named last_updated_at or email_address_hash in the input document, their values in the final document will be the values of the specified expressions, not their values in the input document. We also filter out all input documents where email_address is null.

SELECT
    twitter_handle,
    follower_count,
    -- search tokenization
    TOKENIZE(tweet_text, 'en_US') AS tweet_text_tokens,
    -- functional index
    REGEXP_EXTRACT_ALL(
        email_address,
        '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
        2
    ) AS email_address_domain,
    -- native geography type
    ST_GEOGPOINT(longitude, latitude) AS geo
FROM
    _input

Because this field mapping query omits *, it is dropping all input fields by default and only keeping the explicitly selected input fields twitter_handle and follower_count. We are also adding three new fields to our final document: tweet_text_tokens, email_address_domain, and geo.

SELECT
    *,
    IF(
        user IS NOT NULL,
        CONCAT(user, '@rockset.com'),
        UNDEFINED
    ) AS email_address
FROM
    _input

This field mapping query is keeping all input fields by default. We are adding one new field to our final document: email_address. Note that if user is null, the email_address field will be set to undefined in the final document. Setting a field’s value to undefined in a field mapping query means that the field will not be present in the final document. Thus, email_address will only be present in the final document if the input document’s user field is not null. See our documentation on undefined for more details.

SELECT
    *,
    IF(
        TRY(total_weight / num_items) IS NOT NULL,
        total_weight / num_items,
        UNDEFINED
    ) AS avg_weight
FROM
    _input

This field mapping query is keeping all input fields by default. We are adding one new field to our final document: avg_weight. Like the example above, the avg_weight field will only be present in the final document if the expression TRY(total_weight/num_items) is not null. TRY is a special operator that evaluates an expression and returns null if an error is encountered during the expression evaluation (rather than failing the query entirely). Otherwise, TRY will return the expression evaluation result. You can use TRY in a field mapping query to only include a field in the final document if its expression can be evaluated successfully. See the documentation for TRY for more details.