Field Mappings (Legacy)
This page describes how to use field mappings to apply data transformations to incoming data as the transformations are inserted into a collection.
Note: This is a legacy feature. We have a new way of specifying ingest transformations via SQL, documented here
A field mapping allows you to specify transformations to be applied on all documents inserted into a collection. Any valid SQL expression may be used to describe the transformation logic in a field mapping definition.
Each ingested document is parsed through the set of field mapping transformations specified at collection creation time. Each field mapping takes a set of fields from the incoming document, evaluates the corresponding SQL expression with the values of these input fields as named parameters, and then stores the result of the SQL expression as another field in the document.
#Why Field Mappings?
Field mappings are often used to apply data transformations in the following use cases:
You can map an input field to a specific advanced data type, such as mapping an input string field
Example of a field mapping expression:
CAST(:last_updated_at as DATETIME)
If your input dataset has PII (personally identifiable information) or PHI (protected health
information) fields, you may use a one-way crypto hash function such as
SHA256() on the input field, so that Rockset only stores the hashed value and not the original
Text fields in the input can be tokenized upon data ingestion so that you can run search queries (still using SQL).
If your application queries involve complex SQL functions that makes query processing slow, you may use a field mapping to pre-compute the result of the expensive SQL expression at data ingestion. Queries on such computed output fields are much faster than executing the SQL expression at query time.
-- extract the domain from the input email address
#Importing Geographic Data
-- create a geography from a longitude and latitude field
If you have unused large fields in your input data set, you can drop them upon data ingestion to save storage space.
#Creating Field Mappings
- All specified field mappings are executed independently of one another, and in parallel. Therefore, each field mapping specification should not depend on the output of any other field mapping specification.
- Once the collection has been created, you cannot change a collection's field mappings.
The following field mappings are especially useful:
You may use the same ID twice. The existing entry will be updated rather than a new entry created, to save storage.
_id must return a
string data type and be unique across documents.
This field will map the time of a document event. As this field is specially indexed, time-based query performance will improve. This field will also set a proper retention duration that matches your document's date/timestamp versus just the time they were inserted into Rockset. For more information, see the Rocket retention documentation.
_event_time must return a
timestamp data type. For example, let's say you have
a field time in your data which is an integer (UNIX timestamp in milliseconds). You often filter on
it along with other predicates, such as this query:
SELECT \* FROM users WHERE time > UNIX_MILLIS(CURRENT_TIMESTAMP() - INTERVAL 1 WEEK) AND state='CA'
This will be much faster if we instead use the
_event_time field, which is specially indexed for
this type of query:
SELECT \* FROM users WHERE \_event_time > CURRENT_TIMESTAMP() - INTERVAL 1 WEEK AND state='CA'
To accomplish this in the console, use the settings shown in the following image:
_event_timefields are mandatory in every collection and cannot be deleted. If you attempt to delete those fields, the delete operation will be ignored.
Customer scripts may already contain
_event_timefields. If they do, Rockset will process them. If they do not, Rockset will assign both fields to the scripts. So if the customer deletes either field using field mapping, they will be deleted from the original source scripts. Rockset will go ahead and compute these fields at the time of data ingestion.
Three attributes need to be set when creating field mappings:
output_field attributes are used to specify data transformations, and the
is_drop_all_fields attribute is used to designate whitelisting.
input_fields attribute specifies behaviors to be applied to input data as it is inserted into
your collection. There are four parameters you may specify with each
input_fields field mapping:
field_nameparameter specifies the name of the field in your input data to apply this field mapping to.
if_missingparameter specifies the behavior for when the field evaluates to either
UNDEFINED. It accepts two valid strings as input:
SKIP, which skips the update for this document entirely, or
PASS, which will simply set this field to
is_dropparameter accepts a boolean specifying whether or not to drop this field completely from the document as it is being inserted.
paramparameter exports name alias for this field which can be referred to in a SQL expression in the
output_field attribute defines new fields to be added to your collection with each document
during ingest. They are created using SQL expressions, and may optionally use other existing fields
in the document as parameters. For instance, you might define an
output_field with the SQL
CURRENT_TIMESTAMP() to record the insertion time for every input record as a new field
in the document, or define an
output_field with the SQL expression
store a particular field's hash in a another field.
You may specify the following three parameters with each
output_field field mapping:
field_nameparameter specifies the name of the new field created by your SQL expression.
sqlparameter accepts a string SQL expression used to define the new field being created. It may optionally take another field name as a parameter, or a
paramname alias specified in an
on_errorparameter specifies the behavior for when there is an error while evaluating the SQL expression defined in the
sqlparameter. It accepts two valid strings as input:
SKIP, which skips only this output field but continues the update, or
FAIL, which causes this update to fail entirely.
Any field may be used as the output of a field mapping. Note, however, that
Special fields require the result of the
output_field created using the SQL
expression to be of a specific type:
_id: The field mapping SQL expression must return a
stringdata type that is used as a primary key, which thus must be unique across all documents in a collection.
_event_time: The field mapping SQL expression must return a
You may also specify the same field as both an
output_field for example, if you
wish to overwrite that field with the result of the field mapping SQL expression.
You can use field mappings to whitelist certain fields. This is useful for ingesting data where the schema is unknown, and only certain fields are of interest. Rockset will automatically drop all fields in each ingested document, except for whitelisted and special fields.
To drop all fields by default and only whitelist specified fields, set the
field mappings property to
true during collection creation.