
Creating Collections

A collection is a set of Rockset documents. All documents within a collection, and all fields within those documents, are mutable. Like tables in traditional SQL databases, collections are queried using FROM clauses in SQL queries.

Creating Collections

Collections can be created in the Collections tab of the Rockset Console or by using the Rockset API.
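For example, a collection can be created programmatically with a single API call. The sketch below only builds the request; the endpoint path and body fields follow Rockset's public REST API, but treat them as assumptions and verify them against the current API reference.

```python
def create_collection_request(workspace: str, name: str, description: str = "") -> dict:
    """Build the request for Rockset's create-collection endpoint
    (POST /v1/orgs/self/ws/{workspace}/collections). The path and body
    shape follow the public REST API; verify before use."""
    return {
        "method": "POST",
        "path": f"/v1/orgs/self/ws/{workspace}/collections",
        "body": {"name": name, "description": description},
    }

# Hypothetical workspace and collection names.
req = create_collection_request("commons", "my_collection", "demo collection")
```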

Create New Collection

To create a collection using a managed integration with an external data source (such as MongoDB or Amazon S3), first set up the integration using the instructions in the Integrations section. We generally recommend mapping each data source (such as a MongoDB collection, DynamoDB table, or Kafka topic) to a single collection, and joining those collections at query time with JOIN when necessary.
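For instance, with one collection per upstream source, related data can still be combined at read time. The query below is a hypothetical sketch joining an `orders` collection (say, sourced from DynamoDB) with a `customers` collection (say, sourced from MongoDB); all collection and field names are illustrative.

```python
# Hypothetical query joining two collections, one per upstream source.
# "commons" is the workspace; "_id" is Rockset's document id field.
query = """
SELECT c.name, COUNT(*) AS order_count
FROM commons.orders o
JOIN commons.customers c ON o.customer_id = c._id
GROUP BY c.name
"""
```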

Note: Using an external data source is not required.

You can always create collections using the Rockset Console (either empty, or from a file upload) or by using the Rockset API directly without referencing any sources. You can read more about how to create a collection from a self-managed data source.

Bulk Ingest Mode

The BULK_INGEST_MODE feature is available only to collections created from one of the following managed data sources: Amazon DynamoDB, MongoDB Atlas, Amazon S3, Azure Blob Storage, and Google Cloud Storage. When a collection is first created from one of those data sources, Rockset scans the source to determine whether it exceeds the 5 GiB minimum size required for BULK_INGEST_MODE. If so, Rockset changes the collection status to BULK_INGEST_MODE as soon as the collection is initialized. This prevents the collection from being queried, but allows data ingestion at speeds several orders of magnitude higher than typical streaming ingestion.

Note: Bulk ingest mode is available only on dedicated Virtual Instances (VIs). On shared and free VIs, ingest speed is limited to 1 MB/sec.

Streaming ingest mode is not recommended for collections larger than 5 GiB.

Once the bulk ingest completes, the collection enters the READY state, at which point you can begin executing queries. Rockset continues to actively scan your external data source, but any new documents are added at the normal streaming ingestion speed. A collection only ever enters BULK_INGEST_MODE once, immediately following its creation.

If Rockset determines that your data source does not meet the requirements for BULK_INGEST_MODE, the collection immediately enters the READY state, and documents are added via normal streaming ingestion. You may execute queries while the initial ingest is still occurring, but some data may not be visible in query results until the initial ingestion has fully completed.
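The mode selection described above amounts to a simple threshold check. The sketch below models it: the 5 GiB figure and the dedicated-VI requirement come from this page, and the status names mirror the documented states. This is a model of the documented behavior, not Rockset code, and the exact boundary semantics (>= vs. >) are an assumption.

```python
GIB = 1024 ** 3
BULK_INGEST_THRESHOLD_BYTES = 5 * GIB  # documented 5 GiB minimum source size

def initial_status(source_size_bytes: int, dedicated_vi: bool) -> str:
    """Model the initial collection status described above.

    Bulk ingest requires a dedicated Virtual Instance and a source of at
    least ~5 GiB; otherwise the collection goes straight to READY and
    ingests at normal streaming speed. Boundary handling is an assumption.
    """
    if dedicated_vi and source_size_bytes >= BULK_INGEST_THRESHOLD_BYTES:
        return "BULK_INGEST_MODE"
    return "READY"
```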

Field Mappings

Rockset automatically indexes every field path in a collection, including nested paths.

To transform incoming data by indexing a compound or derived field, configure field mappings. Field mappings create new fields by applying SQL expressions to fields of incoming documents.
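As an illustration, a field mapping can derive a new field from an incoming one with a SQL expression. The structure below follows the general shape of Rockset's field-mapping configuration, but the exact schema and the field names (`price`, `price_cents`) are assumptions to check against the API reference.

```python
# Hypothetical mapping: derive an integer "price_cents" field from an
# incoming "price" field. The schema below mirrors the general shape of
# Rockset's field-mapping config; verify it against the API reference.
field_mapping = {
    "name": "price_to_cents",
    "input_fields": [
        {"field_name": "price", "if_missing": "SKIP", "param": "price"}
    ],
    "output_field": {
        "field_name": "price_cents",
        "value": {"sql": "CAST(:price * 100 AS int)"},
        "on_error": "FAIL",
    },
}
```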

Retention Duration

For each collection, you can set a custom retention duration, which determines how long each document is retained after being added to the collection. Low retention values keep the total amount of indexed data small while still ensuring your applications query the latest data.
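Retention is typically specified in seconds at collection-creation time (for example, as a `retention_secs` parameter in Rockset's API; treat the field name as an assumption and confirm it in the API reference). A small helper makes the conversion explicit:

```python
def retention_secs(days: int) -> int:
    """Convert a retention window in days to seconds, as expected by the
    collection's retention setting (assumed to be Rockset's
    `retention_secs` creation parameter; verify in the API reference)."""
    return days * 24 * 60 * 60

# Hypothetical collection keeping only the most recent 30 days of documents.
body = {"name": "recent_events", "retention_secs": retention_secs(30)}
```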

Special Fields

Every document ingested into a Rockset collection carries several special fields that Rockset generates and adds automatically. Learn more about special fields here.

The _events Collection

When your organization is created, Rockset will automatically create a collection named _events which is used for audit logging to provide visibility into your account. Learn more about the _events collection.

Updating Collections

When a collection is created from a managed integration, Rockset will automatically sync your collection with its data source, usually within seconds. For more information about individual source behavior, see the Data Sources section.

If you choose not to create your collection using a managed integration, or wish to make manual changes to data in your collection after Rockset has synced it with your external data source, you can learn more about manually adding, deleting, or patching documents.
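For example, documents can be added manually through Rockset's Documents API. The sketch below only builds the request; the endpoint path and the `data` wrapper follow Rockset's public REST API, but should be verified against the current reference before use.

```python
def add_docs_request(workspace: str, collection: str, docs: list) -> dict:
    """Build a request for Rockset's add-documents endpoint
    (POST /v1/orgs/self/ws/{workspace}/collections/{collection}/docs).
    The path and body shape follow the public REST API; verify before use."""
    return {
        "method": "POST",
        "path": f"/v1/orgs/self/ws/{workspace}/collections/{collection}/docs",
        "body": {"data": docs},
    }

# Hypothetical workspace, collection, and document.
req = add_docs_request("commons", "users", [{"name": "ada", "age": 36}])
```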

Querying Collections

Collections can be queried using SQL the same way tables are queried in traditional SQL databases.

You can write and execute queries on your collections in the Query Editor tab of the Rockset Console. Queries can also be executed against collections using the Rockset API and SDKs (Node.js, Python, Java, or Go). SQL queries can also JOIN documents across different Rockset collections and workspaces.

See our SQL Reference for the full list of supported data types, commands, functions, and more.
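As a sketch, a query can also be issued through the Query API rather than the Console. The endpoint and body shape below follow Rockset's public REST API (`POST /v1/orgs/self/queries`), but both, along with the collection name, are assumptions to confirm against the API reference.

```python
def query_request(sql: str, parameters=None) -> dict:
    """Build a request for Rockset's Query API
    (POST /v1/orgs/self/queries). The path and body shape follow the
    public REST API; verify against the current API reference."""
    return {
        "method": "POST",
        "path": "/v1/orgs/self/queries",
        "body": {"sql": {"query": sql, "parameters": parameters or []}},
    }

# Hypothetical workspace and collection.
req = query_request("SELECT COUNT(*) FROM commons.my_collection")
```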

Collection Errors

Out-of-Sync Collections

Collections sourced from external systems such as Amazon DynamoDB or MongoDB can occasionally fall out of sync with their data source. This happens rarely, when Rockset's replication pipeline encounters errors or bugs.

You can re-sync a collection by dropping and recreating it:

  1. Navigate to the Collection Details page for your collection in the Rockset Console.
  2. Click the "Clone this Collection" button. This will open the Create Collection form in a new tab, and populate the configuration from your existing collection. Double-check that this configuration is correct, but do not click the Create Collection button.
  3. In the original tab displaying your Collection Details, click the "Delete Collection" button. Wait at least 30 seconds for the collection deletion to complete.
  4. Click the "Create Collection" button in the Collection Creation tab to recreate the collection.
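The manual steps above can also be scripted against the API. The sketch below models the sequence with a duck-typed `client` whose methods (`get_collection`, `delete_collection`, `create_collection`) are hypothetical stand-ins for the real endpoints, not the actual SDK surface; the 30-second pause mirrors step 3.

```python
import time

def resync_collection(client, workspace: str, name: str, wait_secs: int = 30) -> None:
    """Drop and recreate a collection to force a full re-sync.

    `client` is a hypothetical wrapper over Rockset's REST endpoints;
    the method names here are illustrative. Note that the collection is
    unavailable for the duration of this call, and recreation re-ingests
    all data from the source.
    """
    config = client.get_collection(workspace, name)   # step 2: capture config
    client.delete_collection(workspace, name)         # step 3: drop
    time.sleep(wait_secs)                             # wait for deletion
    client.create_collection(workspace, config)       # step 4: recreate
```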

Clone Collection

Note that recreating collections will incur additional ingest costs in Rockset and potentially additional charges on the data source (e.g. RCU charges in Amazon DynamoDB). Furthermore, because re-syncing involves dropping and recreating a collection, there will be a period of unavailability. Thus, if you choose to re-sync a collection, we recommend that you do so during a scheduled maintenance window.