Field Mappings

This page describes how to use field mappings to apply data transformations to incoming data as the transformations are inserted into a collection.

💡

This is a legacy feature. We have a new way of specifying ingest transformations via SQL, documented here

Overview

A field mapping allows you to specify transformations to be applied on all documents inserted into a collection. Any valid SQL expression may be used to describe the transformation logic in a field mapping definition.

Each ingested document is parsed through the set of field mapping transformations specified at collection creation time. Each field mapping takes a set of fields from the incoming document, evaluates the corresponding SQL expression with the values of these input fields as named parameters, and then stores the result of the SQL expression as another field in the document.

Why Field Mappings?

Field mappings are often used to apply data transformations in the following use cases:

Type Coercion

You can map an input field to a specific advanced data type, such as mapping an input string field into a date, datetime or timestamp field.

Example of a field mapping expression:

CAST(:last_updated_at as DATETIME)

PII/PHI Masking

If your input dataset has PII (personally identifiable information) or PHI (protected health information) fields, you may use a one-way crypto hash function such as SHA256() on the input field, so that Rockset only stores the hashed value and not the original PII/PHI field.

Example:

TO_HEX(SHA256(:email_address))

Search Tokenization

Text fields in the input can be tokenized upon data ingestion so that you can run search queries (still using SQL).

Example:

TOKENIZE(:tweet_text, 'en_US')

Functional Index

If your application queries involve complex SQL functions that makes query processing slow, you may use a field mapping to pre-compute the result of the expensive SQL expression at data ingestion. Queries on such computed output fields are much faster than executing the SQL expression at query time.

Example:

-- extract the domain from the input email address
REGEXP_EXTRACT_ALL(:email_address, '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', 2)

Importing Geographic Data

You can convert latitude/longitude points or well known text strings to Rockset's native geography type at ingest time by specifying an appropriate field mapping.

Example:

-- create a geography from a longitude and latitude field
ST_GEOGPOINT(:longitude, :latitude)

Dropping Fields

If you have unused large fields in your input data set, you can drop them upon data ingestion to save storage space.

Creating Field Mappings

Field mappings are defined during collection creation, either by defining the field_mappings parameter through the Rockset API or by using the Field Mappings selector in the Rockset Console:

!undefined

Notes:

  • All specified field mappings are executed independently of one another, and in parallel.
    Therefore, each field mapping specification should not depend on the output of any other field mapping specification.
  • Once the collection has been created, you cannot change a collection's field mappings.

_id and _event_time mappings

The following field mappings are especially useful:

_id

You may use the same ID twice. The existing entry will be updated rather than a new entry created, to save storage.

Requirements: _id must return a string data type and be unique across documents.

_event_time

This field will map the time of a document event. As this field is specially indexed, time-based query performance will improve. This field will also set a proper retention duration that matches your document's date/timestamp versus just the time they were inserted into Rockset. For more information, see the Rocket retention documentation.

Requirements: _event_time must return a timestamp data type. For example, let's say you have
a field time in your data which is an integer (UNIX timestamp in milliseconds). You often filter on it along with other predicates, such as this query:

SELECT \* FROM users WHERE time > UNIX_MILLIS(CURRENT_TIMESTAMP() - INTERVAL 1 WEEK) AND
state='CA'

This will be much faster if we instead use the _event_time field, which is specially indexed for this type of query:

SELECT \* FROM users WHERE \_event_time > CURRENT_TIMESTAMP() - INTERVAL 1 WEEK AND state='CA'

To accomplish this in the console, use the settings shown in the following image:

!undefined

NOTES: The _id and _event_time fields are mandatory in every collection and cannot be
deleted. If you attempt to delete those fields, the delete operation will be ignored.

Customer scripts may already contain _id and _event_time fields. If they do, Rockset will
process them. If they do not, Rockset will assign both fields to the scripts. So if the customer
deletes either field using field mapping, they will be deleted from the original source scripts.
Rockset will go ahead and compute these fields at the time of data ingestion.

Attributes

Three attributes need to be set when creating field mappings:

  • input_fields
  • output_field
  • is_drop_all_fields

The input_fields and output_field attributes are used to specify data transformations, and the is_drop_all_fields attribute is used to designate whitelisting.

Input Fields

The input_fields attribute specifies behaviors to be applied to input data as it is inserted into your collection. There are four parameters you may specify with each input_fields field mapping:

  1. The field_name parameter specifies the name of the field in your input data to apply this field
    mapping to.
  2. The if_missing parameter specifies the behavior for when the field evaluates to either NULL
    or UNDEFINED. It accepts two valid strings as input: SKIP, which skips the update for this document entirely, or PASS, which will simply set this field to NULL.
  3. The is_drop parameter accepts a boolean specifying whether or not to drop this field completely
    from the document as it is being inserted.
  4. The param parameter exports name alias for this field which can be referred to in a SQL
    expression in the output_field attribute.

Output Field

The output_field attribute defines new fields to be added to your collection with each document during ingest. They are created using SQL expressions, and may optionally use other existing fields in the document as parameters. For instance, you might define an output_field with the SQL expression CURRENT_TIMESTAMP() to record the insertion time for every input record as a new field in the document, or define an output_field with the SQL expression TO_HEX(SHA256(:name)) to store a particular field's hash in a another field.

You may specify the following three parameters with each output_field field mapping:

  1. The field_name parameter specifies the name of the new field created by your SQL expression.
  2. The sql parameter accepts a string SQL expression used to define the new field being created.
    It may optionally take another field name as a parameter, or a param name alias specified in an input_fields field mapping.
  3. The on_error parameter specifies the behavior for when there is an error while evaluating the
    SQL expression defined in the sql parameter. It accepts two valid strings as input: SKIP, which skips only this output field but continues the update, or FAIL, which causes this update to fail entirely.

Any field may be used as the output of a field mapping. Note, however, that Special fields require the result of the output_field created using the SQL expression to be of a specific type:

  • _id: The field mapping SQL expression must return a string data type that is used as a primary
    key, which thus must be unique across all documents in a collection.
  • _event_time: The field mapping SQL expression must return a timestamp data type.

You may also specify the same field as both an input_fields and output_field for example, if you wish to overwrite that field with the result of the field mapping SQL expression.

Whitelisting Fields

You can use field mappings to whitelist certain fields. This is useful for ingesting data where the schema is unknown, and only certain fields are of interest. Rockset will automatically drop all fields in each ingested document, except for whitelisted and special fields.

To drop all fields by default and only whitelist specified fields, set the is_drop_all_fields field mappings property to true during collection creation.