SQL Transformations
When inserting data into Rockset, you can transform it by providing a single SQL query that contains all of the desired data transformations. This is referred to as the collection’s field mapping query.
Every ingested document is passed through the field mapping query specified at collection creation time. The result of the field mapping query for an input document is the final document persisted in the collection. Query-based field mappings also enable you to drop input documents based on a specified condition, so that they are never persisted in the collection.
Use Cases
Field mappings are often used to apply data transformations in the following use cases:
Type Coercion
You can map an input field to a specific advanced data type. For example, you can map an input string field to a date, datetime, or timestamp field.
Example:
CAST(last_updated_at as DATETIME)
PII/PHI Masking
If your input dataset has PII (personally identifiable information) or PHI (protected health information) fields, you can use a one-way cryptographic hash function such as SHA256() on the input field, so that Rockset stores only the hashed value and not the original PII/PHI field.
Example:
TO_HEX(SHA256(email_address))
Search Tokenization
Text fields in the input can be tokenized at data ingestion time so you can more efficiently run text search queries later.
Example:
TOKENIZE(tweet_text, 'en_US')
Functional Index
If your application queries involve complex SQL functions which result in slow query processing, you can use a field mapping to pre-compute the result of the expensive SQL expression at data ingestion. Queries on these computed output fields are much faster than executing the SQL expression at query time.
Example:
-- extract the domain from the input email address
REGEXP_EXTRACT_ALL(
email_address,
'^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
2
)
Importing Geographic Data
You can convert latitude/longitude points or well-known text (WKT) strings to Rockset's native geography type at ingest time by specifying an appropriate field mapping.
Example:
-- create a geography from a longitude and latitude field
ST_GEOGPOINT(longitude, latitude)
Dropping Fields
If you have unused large fields in your input dataset, you can drop them upon data ingestion to save storage space.
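Example (a sketch assuming hypothetical input fields raw_payload and debug_logs that are not needed at query time):

```sql
SELECT
  -- keep everything except the hypothetical large, unused fields
  * EXCEPT (raw_payload, debug_logs)
FROM
  _input
```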
Special Fields
You can use field mappings to specify the values of special fields, such as _id and _event_time. See the special fields documentation for more information, and for when to use field mappings to specify their values.
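As a sketch, a field mapping query could populate these special fields from input fields. The field names below are hypothetical, and the accepted types for _event_time are covered in the special fields documentation:

```sql
SELECT
  *,
  -- hypothetical input field used as the unique document identifier
  user_id AS _id,
  -- hypothetical input string field coerced to a timestamp
  CAST(created_at AS TIMESTAMP) AS _event_time
FROM
  _input
```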
Creating Field Mappings
The basic structure of a field mapping query is as follows:
SELECT
  <expressions>
FROM
  _input
WHERE
  <condition> -- optional
Note: As with a normal SQL query, all expressions in the SELECT clause are evaluated independently of one another, and in parallel. Additionally, you cannot change a collection's field mapping query once the collection has been created.
The following sections provide more detail about the components of a field mapping query.
SELECT Clause
The SELECT clause is where you define:
- The input fields you want to keep or drop in the final document.
- The new expressions you want to add as fields to your final document.
To keep all input fields from the input document by default, include SELECT * in your field mapping query. To drop certain input fields, include SELECT * EXCEPT (<excluded fields>) instead. For example, SELECT * EXCEPT (a, b) keeps all input fields from the input document by default, except for fields a and b.
If you don’t want to include any input fields in the final document by default, omit * from your field mapping query's SELECT clause.
To add a new field to your final document, add SELECT <expression> AS <field> to your field mapping query. For example, SELECT CONCAT(a, b) AS ab adds a new field named ab to your final document, whose value is computed from the expression CONCAT(a, b).
Note: New fields added to the final document using the SELECT <expression> AS <field> syntax take naming precedence over input fields from the input document. This means that if your field mapping query includes SELECT *, CONCAT(a, b) AS a, the final document will contain field a with the value of CONCAT(a, b), not the value of field a in the input document.
FROM Clause
The FROM clause does not change across field mapping queries. It is always FROM _input, which represents the stream of input documents.
WHERE Clause
The WHERE clause enables you to drop input documents based on some condition. Including this clause in the field mapping query is optional, unlike the SELECT and FROM clauses, which are required.
As with a normal SQL query, the WHERE clause condition can refer to input fields from the input document. For example, you can include WHERE a IS NOT NULL AND b IS NOT NULL in your field mapping query to drop input documents where field a is null or field b is null.
Examples
Let’s look at some example field mapping queries.
Example:
SELECT
*
EXCEPT
-- drop some fields
(extra_data, more_extra_data, email_address),
-- type coercion
CAST(last_updated_at AS DATETIME) AS last_updated_at,
-- PII/PHI masking
TO_HEX(SHA256(email_address)) AS email_address_hash
FROM
_input
WHERE
-- filter input documents
email_address IS NOT NULL
This field mapping query keeps all input fields by default, except for extra_data, more_extra_data, and email_address. It adds two new fields to the final document: last_updated_at and email_address_hash. If the input document has any input fields named last_updated_at or email_address_hash, their values in the final document will be the values of the specified expressions, not their values in the input document. The query also filters out all input documents where email_address is null.
Example:
SELECT
twitter_handle,
follower_count,
-- search tokenization
TOKENIZE(tweet_text, 'en_US') AS tweet_text_tokens,
-- functional index
REGEXP_EXTRACT_ALL(
email_address,
'^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
2
) AS email_address_domain,
-- native geography type
ST_GEOGPOINT(longitude, latitude) AS geo
FROM
_input
This field mapping query drops all input fields by default, keeping only the input fields twitter_handle and follower_count. It also adds three new fields to the final document: tweet_text_tokens, email_address_domain, and geo.
Example:
SELECT
*,
IF(
user IS NOT NULL,
CONCAT(user, '@rockset.com'),
UNDEFINED
) AS email_address
FROM
_input
This field mapping query keeps all input fields by default and adds one new field to the final document: email_address.
Note: If user is null, the email_address field will be set to undefined in the final document. Setting a field’s value to undefined in a field mapping query means that the field will not be present in the final document. Thus, email_address will only be present in the final document if the input document’s user field is not null. See the documentation on undefined for more details.
Example:
SELECT
*,
IF (
TRY(total_weight / num_items) IS NOT NULL,
total_weight / num_items,
UNDEFINED
) AS avg_weight
FROM
_input
This field mapping query keeps all input fields by default. It adds one new field to the final document: avg_weight. As in the example above, the avg_weight field will only be present in the final document if the expression TRY(total_weight / num_items) is not null. TRY is a special operator that evaluates an expression and returns null if an error is encountered during expression evaluation (rather than failing the query entirely); otherwise, TRY returns the result of the expression. You can use TRY in a field mapping query so that a field is included in the final document only if its expression can be evaluated successfully. See the documentation for TRY for more details.