Special Fields

This page describes fields with unique roles and behaviors in Rockset documents.

Overview

Special fields are prefixed with an underscore and have important effects on the ingestion and querying behavior of documents in your collections. Some are automatically generated by Rockset during data ingestion, while others can be specified from a source document or ingest transformation.

Special fields are immutable. Specify your own values for special fields only if you are certain those values will never need to be updated.

You can execute the following query on any collection to view some of its special fields:

SELECT
    _id,
    _meta,
    _event_time
FROM
    mycollection
LIMIT
    5;
+---------------------------------------------------------------------------------+----------------+----------------------------------------+
| _meta                                                                           | _event_time    | _id                                    |
|---------------------------------------------------------------------------------+----------------+----------------------------------------|
| {'s3': {'bucket': 'mycollection', 'offset': 0, 'path': '47304-1709397620935'}}  | 1535068823524  | '5d99f42c-44ec-2024-a032-ac14d1cbf44d' |
| {'s3': {'bucket': 'mycollection', 'offset': 0, 'path': '46242-1709397472272'}}  | 1535068823426  | 'e5d4c136-8d56-ac39-38f5-2f795ae38007' |
| {'s3': {'bucket': 'mycollection', 'offset': 0, 'path': '46242-1503238554088'}}  | 1535068823326  | '71df712e-04bc-7c9b-0114-97b19741b215' |
| {'s3': {'bucket': 'mycollection', 'offset': 0, 'path': '47302-1151051235692'}}  | 1535068822852  | '517412d2-1a23-1b49-fc3b-441c630ed4ff' |
| {'s3': {'bucket': 'mycollection', 'offset': 0, 'path': '45189-1709397393868'}}  | 1535068822527  | 'd16bf924-f758-21db-c832-4a25323b2938' |
+---------------------------------------------------------------------------------+----------------+----------------------------------------+

The sections below cover every special field in Rockset.

The _id field

Every document in a Rockset collection is uniquely identified by its _id field.

  • If the document source does not already have an _id field, Rockset populates it with an automatically generated uuid.
  • If the document source has _id specified, or an ingest transformation outputs an _id field, its value is preserved. A newly ingested document will overwrite any existing document with the same _id value.
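The two rules above can be sketched in a few lines of Python. This is only an illustrative model, not Rockset's implementation: `ensure_id` is a hypothetical helper, and the use of a version 4 UUID is an assumption, since the text says only that a uuid is generated.

```python
import uuid


def ensure_id(doc):
    """Return a copy of doc that is guaranteed to carry an _id.

    An _id supplied by the source (or by an ingest transformation) is kept
    as-is; otherwise a UUID string is generated. The UUID version is an
    assumption made for this sketch.
    """
    if "_id" in doc:
        return dict(doc)
    return {**doc, "_id": str(uuid.uuid4())}
```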

For collections with rollups, this field is populated by Rockset and cannot be specified by the user.

The _meta field

Metadata regarding each document is stored in a _meta field of object type.

If the source of a document specifies a _meta field, Rockset ignores it. Currently, _meta holds information about the source from which the document was inserted into the collection (such as the bucket name and path in the case of S3). If Rockset is unable to parse a source document, it creates a document that contains none of the source's fields and whose _meta has a nested field named bad.
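As a small client-side illustration, documents that failed to parse can be recognized by the presence of that nested field. The helper name `failed_to_parse` is hypothetical, and the sketch checks only that a `bad` key exists, because the text above does not describe what the field contains.

```python
def failed_to_parse(doc):
    """Return True if doc's _meta object has a nested field named 'bad'.

    Only the presence of the key is checked; the contents of 'bad' are
    not described in the text above, so nothing is assumed about them.
    """
    meta = doc.get("_meta")
    return isinstance(meta, dict) and "bad" in meta
```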

This field is never populated for collections with rollups.

The _event_time field

Rockset associates a timestamp with each document in a field named _event_time, recorded as microseconds since the Unix epoch. By default, _event_time is set as the time a document is inserted into a Rockset collection.

Users can specify their own _event_time by including the field in their source records, or defining an ingest transformation with a mapping for _event_time. User-specified _event_time values must be of either int (microseconds since epoch) or timestamp type, otherwise the ingestion of the document will fail.
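If you prepare source records in application code, the integer form can be computed before the record is sent. The following Python sketch is one way to do that under stated assumptions: `to_event_time_micros` is a hypothetical helper, and rejecting naive datetimes is a design choice of this sketch rather than a documented Rockset rule.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_event_time_micros(value):
    """Return value as integer microseconds since the Unix epoch.

    Integers are assumed to already be in microseconds and are passed
    through. Timezone-aware datetimes are converted with exact integer
    arithmetic. Anything else is rejected, mirroring the rule that other
    types cause ingestion to fail.
    """
    if isinstance(value, bool):
        raise TypeError("_event_time must be an int or a timestamp")
    if isinstance(value, int):
        return value
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError("naive datetimes are ambiguous; attach a timezone")
        return (value - EPOCH) // timedelta(microseconds=1)
    raise TypeError("_event_time must be an int or a timestamp")
```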

Rockset's time-based retention feature uses _event_time to determine when a document has fallen outside the retention window and should be removed from a collection. Using the default document insertion time for retention is often appropriate, but some use cases need to expire records according to a different timestamp, in which case they must define their own _event_time.
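The retention check can be pictured as a comparison between each document's _event_time and a cutoff. The Python sketch below is only a model of that idea: `expired_ids` is a hypothetical helper, and both the strict `<` comparison at the boundary and the timing of actual removal are assumptions that the text above does not specify.

```python
def expired_ids(docs, now_micros, retention_micros):
    """Return the _id of every document older than the retention window.

    All times are microseconds since the Unix epoch. A document counts as
    expired when its _event_time is strictly before (now - retention);
    the strictness of the boundary is an assumption of this sketch.
    """
    cutoff = now_micros - retention_micros
    return [doc["_id"] for doc in docs if doc["_event_time"] < cutoff]
```

With a user-defined _event_time, the same comparison runs against the timestamp you supplied instead of the insertion time, which is why the choice of _event_time determines when a record expires.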

If your collection has rollups and your rollup query does not contain an _event_time mapping, this field is populated with the initial insertion time of the rolled up document. It does not change as more input documents are aggregated into the rolled up document.

The _stored field

_stored reduces the hot storage size of your collections by excluding data from certain indexes. More specifically, we exclude _stored and its children from the inverted and range indexes, but we still include them in our columnar and row indexes. You can explore the Rockset Storage Architecture to learn more about Rockset's Converged Indexing.

Using _stored can significantly reduce storage size by lowering the storage amplification associated with indexes. The size of the reduction, however, depends on the data distribution. For certain data distributions, the columnar and row indexes can be much larger than the inverted and range indexes, which limits the relative impact of _stored. We see this pattern with large text fields, since the inverted index only tracks prefixes of these fields and not the entire fields. Using _stored with large text fields will therefore have a limited impact on overall storage size.

You must configure _stored in your ingest transformations. We recommend using _stored as an object, so you can consistently store and reference multiple fields, although you can also use _stored as a scalar or array. The following two examples show how to use _stored in your ingest transformations. The first selects foo and places bar under _stored; the second keeps every field except bar at the top level and places bar under _stored.

SELECT
    foo,
    { 'bar': bar } _stored
FROM
    _input

SELECT
    * EXCEPT(bar),
    { 'bar': bar } _stored
FROM
    _input
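To make the effect on a single document concrete, here is a Python sketch of the reshaping that the second ingest transformation above performs. It is only a model of the resulting document shape: `move_to_stored` is a hypothetical helper, and skipping fields that are absent from the input is an assumption of this sketch, not a statement about how the SQL treats missing fields.

```python
def move_to_stored(doc, fields):
    """Return a copy of doc with the named top-level fields moved under _stored.

    Fields that are absent from doc are skipped, and _stored is only added
    when at least one field was moved (both are choices of this sketch).
    """
    stored = {name: doc[name] for name in fields if name in doc}
    result = {key: value for key, value in doc.items() if key not in stored}
    if stored:
        result["_stored"] = stored
    return result
```

For example, `move_to_stored({"foo": 1, "bar": "long text"}, ["bar"])` yields a document with `foo` at the top level and `bar` nested under `_stored`.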

Limitations: Since we exclude _stored and its subfields from the inverted index, queries with predicates on _stored must use the columnar index during execution. Thus, you should not include fields in _stored on which you expect to apply selective filters, as the associated queries will run much more efficiently with those fields in the inverted index. You can still efficiently project fields from _stored after applying selective filters on other fields in your collections.

The _op field

The _op field gives you control over how each record is applied to a Rockset collection. Each document ingested into Rockset may have an optional _op field that affects its ingestion behavior. The value of _op can come directly from a source document or from an ingest transformation. Unlike other special fields, _op is purely an ingest-time concept: it does not materialize in a Rockset collection and consequently can't be queried.

Here are the supported _op values (case insensitive). If no _op value is explicitly included in the document, the default operation is UPSERT. Any value other than the supported ones below will lead to an ingestion error for the document.

  • INSERT: If no document exists with the same _id, insert this document. If another document with this _id exists, do nothing. _id is optional for this operation and will be automatically generated if not specified.
  • UPDATE: If a document exists with the same _id, overwrite the top-level fields present in this document, leaving all other fields in the existing document unchanged. If no document exists with the same _id, do nothing. _id is required for this operation and ingestion will error if it is not specified. The special field _event_time cannot be changed and will be ignored if specified.
  • UPSERT: If a document exists with the same _id, do an UPDATE. If it does not exist, do an INSERT. _id is optional for this operation and will be automatically generated if not specified. This is the default behavior if no _op value is specified.
  • DELETE: Delete the document with this _id if it exists. If no such document exists, do nothing. _id is required for this operation and ingestion will error if it is not specified.
  • REPLACE: If a document exists with the same _id, delete the entire existing document and insert this one instead. If no such document exists, do nothing. _id is required for this operation and ingestion will error if it is not specified. Unlike UPDATE, this will change the _event_time of the document.
  • REPSERT: If a document exists with the same _id, do a REPLACE. If no such document exists, do an INSERT. _id is required for this operation and ingestion will error if it is not specified.

Not all collections support _op. Namely, it is not supported for:

  • Rollup collections
  • Managed sources which have their own semantics for sending deletes (MongoDB and DynamoDB)

Creating a collection that uses one of these unsupported configurations together with a mapping for _op will lead to an error at collection creation time. If a record being ingested into such a collection contains _op from the source, ingestion of that document will fail.

To illustrate the behavior of _op, the sample documents below are applied one after another to an initially empty collection. The comment on each line describes the effect of that document.

{"_op": "INSERT",  "_id": "abc", "x": 1, "y": "bar"} // _id=abc is inserted with x=1, y=bar
{"_op": "INSERT",  "_id": "abc", "x": 2, "y": "baz"} // this is a no-op since _id=abc already exists
{"_op": "UPDATE",  "_id": "abc", "x": 2}             // now x=2 for _id=abc and y=bar (unchanged)
{"_op": "UPDATE",  "_id": "def", "x": 3}             // this is a no-op since _id=def does not exist
{"_op": "UPSERT",  "_id": "def", "x": 3}             // _id=def is inserted with x=3
{"_op": "UPSERT",  "_id": "def", "y": "baz"}         // now y=baz for _id=def and x=3 (unchanged)
{"_op": "DELETE",  "_id": "xyz"}                     // this is a no-op since _id=xyz does not exist
{"_op": "DELETE",  "_id": "def"}                     // this deletes _id=def leaving only _id=abc in the collection
{"_op": "REPLACE", "_id": "abc", "z": 4}             // now _id=abc is exactly {"z": 4} and fields x and y have been removed
{"_op": "REPLACE", "_id": "ghi", "z": 5}             // this is a no-op since _id=ghi does not exist
{"_op": "REPSERT", "_id": "ghi", "z": 6}             // _id=ghi is inserted with z=6 
{"_op": "REPSERT", "_id": "ghi", "y": 7}             // now _id=ghi is exactly {"y": 7} and field z has been deleted
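
The rules for each _op value can also be expressed as a small in-memory model, which may help when reasoning about a sequence of records. The Python sketch below is not Rockset's implementation: `apply_op` is a hypothetical function, the collection is modeled as a plain dict keyed by _id, and _event_time is not tracked beyond being ignored on UPDATE.

```python
import uuid

SUPPORTED_OPS = {"INSERT", "UPDATE", "UPSERT", "DELETE", "REPLACE", "REPSERT"}
OPS_REQUIRING_ID = {"UPDATE", "DELETE", "REPLACE", "REPSERT"}


def apply_op(collection, record):
    """Apply one ingested record to collection, a dict keyed by _id."""
    # _op is case insensitive and defaults to UPSERT.
    op = str(record.get("_op", "UPSERT")).upper()
    if op not in SUPPORTED_OPS:
        raise ValueError(f"unsupported _op: {op!r}")

    # _op itself never materializes in the stored document.
    fields = {k: v for k, v in record.items() if k != "_op"}
    doc_id = fields.get("_id")
    if doc_id is None and op in OPS_REQUIRING_ID:
        raise ValueError(f"_id is required for {op}")
    exists = doc_id is not None and doc_id in collection

    # UPSERT and REPSERT resolve to a simpler operation.
    if op == "UPSERT":
        op = "UPDATE" if exists else "INSERT"
    elif op == "REPSERT":
        op = "REPLACE" if exists else "INSERT"

    if op == "INSERT":
        if not exists:
            if doc_id is None:
                doc_id = fields["_id"] = str(uuid.uuid4())
            collection[doc_id] = fields
    elif op == "UPDATE":
        if exists:
            # Overwrite only the top-level fields present; _event_time is ignored.
            changes = {k: v for k, v in fields.items() if k != "_event_time"}
            collection[doc_id].update(changes)
    elif op == "REPLACE":
        if exists:
            collection[doc_id] = fields
    elif op == "DELETE":
        collection.pop(doc_id, None)
```

Replaying the twelve sample records above through `apply_op`, starting from an empty dict, leaves two documents in this model: `abc` containing only `z = 4` and `ghi` containing only `y = 7` (each alongside its `_id`).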