(Legacy) Field Mappings
This page describes how to use field mappings to apply data transformations to incoming data as the transformations are inserted into a collection.
Note: This is a legacy feature. We have a new way of specifying ingest transformations via SQL, documented here
Overview
A field mapping allows you to specify transformations to be applied on all documents inserted into a collection. Any valid SQL expression may be used to describe the transformation logic in a field mapping definition.
Each ingested document is parsed through the set of field mapping transformations specified at collection creation time. Each field mapping takes a set of fields from the incoming document, evaluates the corresponding SQL expression with the values of these input fields as named parameters, and then stores the result of the SQL expression as another field in the document.
Why Field Mappings?
Field mappings are often used to apply data transformations in the following use cases:
Type Coercion
You can map an input field to a specific advanced data type, such as mapping an input string field
into a date
, datetime
or timestamp
field.
Example of a field mapping expression:
CAST(:last_updated_at as DATETIME)
PII/PHI Masking
If your input dataset has PII (personally identifiable information) or PHI (protected health
information) fields, you may use a one-way crypto hash function such as
SHA256()
on the input field, so that Rockset only stores the hashed value and not the original
PII/PHI field.
Example:
TO_HEX(SHA256(:email_address))
Search Tokenization
Text fields in the input can be tokenized upon data ingestion so that you can run search queries (still using SQL).
Example:
TOKENIZE(:tweet_text, 'en_US')
Functional Index
If your application queries involve complex SQL functions that makes query processing slow, you may use a field mapping to pre-compute the result of the expensive SQL expression at data ingestion. Queries on such computed output fields are much faster than executing the SQL expression at query time.
Example:
-- extract the domain from the input email address
REGEXP_EXTRACT_ALL(
:email_address,
'^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
2
)
Importing Geographic Data
You can convert latitude/longitude points or well known text strings to Rockset's native geography type at ingest time by specifying an appropriate field mapping.
Example:
-- create a geography from a longitude and latitude field
ST_GEOGPOINT(:longitude, :latitude)
Dropping Fields
If you have unused large fields in your input data set, you can drop them upon data ingestion to save storage space.
Creating Field Mappings
Field mappings are defined during collection creation, either
by defining the field_mappings
parameter through the Rockset API or
by using the Field Mappings selector in the
Rockset Console:
Notes:
- All specified field mappings are executed independently of one another, and in parallel. Therefore, each field mapping specification should not depend on the output of any other field mapping specification.
- Once the collection has been created, you cannot change a collection's field mappings.
_id
and _event_time
mappings
The following field mappings are especially useful:
_id
You may use the same ID twice. The existing entry will be updated rather than a new entry created, to save storage.
Requirements: _id
must return a string
data type and be unique across documents.
_event_time
This field will map the time of a document event. As this field is specially indexed, time-based query performance will improve. This field will also set a proper retention duration that matches your document's date/timestamp versus just the time they were inserted into Rockset. For more information, see the Rocket retention documentation.
Requirements: _event_time
must return a timestamp
data type. For example, let's say you have
a field time in your data which is an integer (UNIX timestamp in milliseconds). You often filter on
it along with other predicates, such as this query:
SELECT \* FROM users WHERE time > UNIX_MILLIS(CURRENT_TIMESTAMP() - INTERVAL 1 WEEK) AND
state='CA'
This will be much faster if we instead use the _event_time
field, which is specially indexed for
this type of query:
SELECT \* FROM users WHERE \_event_time > CURRENT_TIMESTAMP() - INTERVAL 1 WEEK AND state='CA'
To accomplish this in the console, use the settings shown in the following image:
NOTES: The
_id
and_event_time
fields are mandatory in every collection and cannot be deleted. If you attempt to delete those fields, the delete operation will be ignored.Customer scripts may already contain
_id
and_event_time
fields. If they do, Rockset will process them. If they do not, Rockset will assign both fields to the scripts. So if the customer deletes either field using field mapping, they will be deleted from the original source scripts. Rockset will go ahead and compute these fields at the time of data ingestion.
Attributes
Three attributes need to be set when creating field mappings:
input_fields
output_field
is_drop_all_fields
The input_fields
and output_field
attributes are used to specify data transformations, and the
is_drop_all_fields
attribute is used to designate whitelisting.
Input Fields
The input_fields
attribute specifies behaviors to be applied to input data as it is inserted into
your collection. There are four parameters you may specify with each input_fields
field mapping:
- The
field_name
parameter specifies the name of the field in your input data to apply this field mapping to. - The
if_missing
parameter specifies the behavior for when the field evaluates to eitherNULL
orUNDEFINED
. It accepts two valid strings as input:SKIP
, which skips the update for this document entirely, orPASS
, which will simply set this field toNULL
. - The
is_drop
parameter accepts a boolean specifying whether or not to drop this field completely from the document as it is being inserted. - The
param
parameter exports name alias for this field which can be referred to in a SQL expression in theoutput_field
attribute.
Output Field
The output_field
attribute defines new fields to be added to your collection with each document
during ingest. They are created using SQL expressions, and may optionally use other existing fields
in the document as parameters. For instance, you might define an output_field
with the SQL
expression CURRENT_TIMESTAMP()
to record the insertion time for every input record as a new field
in the document, or define an output_field
with the SQL expression TO_HEX(SHA256(:name))
to
store a particular field's hash in a another field.
You may specify the following three parameters with each output_field
field mapping:
- The
field_name
parameter specifies the name of the new field created by your SQL expression. - The
sql
parameter accepts a string SQL expression used to define the new field being created. It may optionally take another field name as a parameter, or aparam
name alias specified in aninput_fields
field mapping. - The
on_error
parameter specifies the behavior for when there is an error while evaluating the SQL expression defined in thesql
parameter. It accepts two valid strings as input:SKIP
, which skips only this output field but continues the update, orFAIL
, which causes this update to fail entirely.
Any field may be used as the output of a field mapping. Note, however, that
Special fields require the result of the output_field
created using the SQL
expression to be of a specific type:
_id
: The field mapping SQL expression must return astring
data type that is used as a primary key, which thus must be unique across all documents in a collection._event_time
: The field mapping SQL expression must return atimestamp
data type.
You may also specify the same field as both an input_fields
and output_field
for example, if you
wish to overwrite that field with the result of the field mapping SQL expression.
Whitelisting Fields
You can use field mappings to whitelist certain fields. This is useful for ingesting data where the schema is unknown, and only certain fields are of interest. Rockset will automatically drop all fields in each ingested document, except for whitelisted and special fields.
To drop all fields by default and only whitelist specified fields, set the is_drop_all_fields
field mappings property to true
during collection creation.