Creating Collections

A collection is a set of Rockset documents. All documents within a collection, and all fields within a document, are mutable. Like tables in traditional SQL databases, collections are queried using FROM clauses in SQL queries.

Creating Collections

Collections can be created in the Collections tab of the Rockset Console or by using the Rockset API.
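As a minimal sketch of the API route, the Create Collection endpoint can be called with a small JSON body. The workspace and collection names below are hypothetical, and the API server hostname varies by region; check the API reference for your deployment.

```python
import json

# The API server hostname is region-specific; this one is an example,
# not a given.
API_SERVER = "https://api.usw2a1.rockset.com"

def create_collection_request(workspace: str, name: str, description: str = "") -> dict:
    """Build (but do not send) the HTTP request that creates a collection."""
    body = {"name": name}
    if description:
        body["description"] = description
    return {
        "method": "POST",
        "url": f"{API_SERVER}/v1/orgs/self/ws/{workspace}/collections",
        "headers": {
            "Authorization": "ApiKey YOUR_API_KEY",  # placeholder credential
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }

# Hypothetical workspace and collection names.
req = create_collection_request("commons", "orders", "Orders from the web store")
```

The returned dict can be sent with any HTTP client; the same body shape applies when creating a collection through the SDKs.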


To create a collection using a managed integration with an external data source (such as MongoDB or Amazon S3), first set up the respective integration using the instructions in the Integrations section. We generally recommend mapping each data source (such as MongoDB collection, DynamoDB table, or Kafka topic) to a single collection, and joining those collections at query time using JOIN when necessary.

Note: Using an external data source is not required.

You can always create collections using the Rockset Console (either empty, or from a file upload) or by using the Rockset API directly without referencing any sources. You can read more about how to create a collection from a self-managed data source.

Data Ingest Modes

When a collection is created using one of the managed integrations with an external data source, the data ingestion happens in two phases:

Phase 1: Bulk Ingest Mode

The upstream data set (often a consistent snapshot of it) is downloaded and indexed into the collection in a batch mode called Bulk Ingest Mode. In this phase, Rockset’s data connectors automatically scale the compute required to complete the phase as quickly and efficiently as possible. This compute does not come from your Virtual Instance; it comes from dedicated autoscaling Bulk Ingest Compute Units that are fully isolated from your Virtual Instance, and you are charged for the total Bulk Ingest Compute Units consumed in addition to your Virtual Instance. Bulk Ingest Mode often speeds up the initial data ingest by 10-100x while consuming 5-10x less compute than ingesting the same large data sets through your regular Virtual Instance. During this mode, the collection status is set to “Bulk Ingest” and the collection cannot be queried.

Phase 2: Streaming Ingest Mode

Once the initial bulk ingest phase completes, the data connectors will seamlessly transition to a streaming ingest mode, and your collection will become available for query processing. Once the data connectors transition to this mode, they will be watching for changes happening in the upstream data source and then replicate those changes into the downstream collection in real-time. During this mode, the collection status will be set to “Ready”.
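The two phases can be summarized in a small sketch that mirrors the table that follows. The status strings here are assumptions about what the Collections API returns, not verbatim values; check the API response for your deployment.

```python
# Maps each ingest phase to its compute source and queryability.
# Status strings are illustrative, not guaranteed API values.
PHASES = {
    "BULK_INGEST": {"compute": "Bulk Ingest Compute Units", "queryable": False},
    "READY": {"compute": "Virtual Instance", "queryable": True},
}

def can_query(status: str) -> bool:
    """Return True once the collection has reached streaming ingest mode."""
    phase = PHASES.get(status.upper())
    return bool(phase and phase["queryable"])
```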

| Ingest Mode | Compute Resource | Compute Measure | Optimized For | Collection Status | Available for Queries |
|---|---|---|---|---|---|
| Bulk Ingest | Autoscaling Bulk Ingest Compute Units | Bulk Ingest Compute Units | Data ingestion throughput | Bulk Ingest | No |
| Streaming Ingest | Virtual Instance | Virtual Instance Seconds | Streaming ingest and query processing | Ready | Yes |

Bulk Ingest Mode

Bulk Ingest Mode is supported only for certain data sources: Amazon DynamoDB, MongoDB Atlas, Amazon S3, Azure Blob Storage, Google Cloud Storage, and Snowflake. Check the documentation for your particular data source to see whether Bulk Ingest Mode is supported.

When the upstream data set is small (typically less than 5 GiB), Bulk Ingest Mode is skipped, since it does not necessarily provide any benefit.

Also note that a collection will only ever enter Bulk Ingest Mode immediately following its creation. If you plan to periodically ingest large volumes of data, we recommend loading them into a new collection (whose creation will trigger Bulk Ingest Mode) and then using collection aliases to point to the new collection, so that your queries transition seamlessly.
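The alias switch can be sketched as a single API call. This assumes the Aliases API accepts an update listing the new target collection; the endpoint path, payload shape, and all names below are assumptions to verify against the API reference.

```python
def make_alias_update(workspace: str, alias: str, new_collection: str) -> dict:
    """Build the request that repoints an alias at a freshly ingested
    collection. Workspace, alias, and collection names are hypothetical."""
    return {
        "method": "POST",
        "path": f"/v1/orgs/self/ws/{workspace}/aliases/{alias}",
        "body": {"collections": [f"{workspace}.{new_collection}"]},
    }

# Once orders_v2 finishes bulk ingest, repoint the alias; queries that
# reference the alias switch over with no client-side change.
alias_req = make_alias_update("commons", "orders_alias", "orders_v2")
```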

Note: Bulk ingest mode is only available to dedicated Virtual Instances (VIs). For shared and free VIs the ingest speed is limited to 1MiB/sec.

Factors affecting Bulk Ingest Mode

Since Bulk Ingest Compute Units automatically scale to fit the needs of the initial bulk ingest phase, it is important to understand what can make your bulk ingest more or less efficient. This section explains the factors that influence the efficiency, and thereby the cost, of your bulk ingest:

  • Data Size
    • This is the most intuitive factor: the larger your data set, the more compute it takes to ingest.
    • Ingesting a 100 GiB data set will require more Bulk Ingest Compute seconds than ingesting a 10 GiB one, assuming both of those data sets were uncompressed and in the same data format.
  • Data Format
    • If your source uses compressed files (Parquet, gzip, etc.), then compute resources are required to decompress them. Some data formats, such as Parquet, can also cause significant memory bloat during the indexing process, resulting in additional compute consumption.
  • Ingest Transformation
    • Transformations require additional compute resources. Simple transformations, such as skipping rows when certain conditions are met, dropping fields, or renaming fields, are computationally trivial. But if your ingest transformations parse JSON-formatted string fields into native objects and arrays, cast strings to date/time fields, or invoke complex JavaScript UDFs, they will require additional compute resources.
  • Rollups
    • Rollups are a special case of ingest transformations that aggregate an incoming data stream using a GROUP BY clause in the ingest SQL transformation. When used appropriately, rollups can significantly reduce the compute required, since they shrink the total amount of data that needs to be indexed. But beware of relatively expensive aggregation functions such as ARRAY_AGG() or MAP_AGG(), which may require more compute resources.
  • Data Clustering
    • If the new collection being created uses data clustering on certain fields, this currently disables many of the data ingest throughput optimizations that are usually available during bulk ingest. The result can be both slower data ingestion during bulk ingest and additional compute consumption.
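To make the transformation and rollup factors concrete, here is a hypothetical rollup ingest transformation. `_input` refers to the incoming data stream; the field names are invented, and the SQL is an illustrative sketch rather than a tested transformation.

```python
# A rollup that aggregates raw order events per merchant and hour before
# indexing, shrinking the data that must be stored. The timestamp parsing
# and cast are examples of ingest-time work that consumes extra compute.
ROLLUP_TRANSFORMATION = """
SELECT
    merchant_id,
    DATE_TRUNC('HOUR', PARSE_TIMESTAMP_ISO8601(event_time)) AS event_hour,
    COUNT(*) AS order_count,
    SUM(TRY_CAST(amount AS float)) AS total_amount
FROM _input
GROUP BY merchant_id, event_hour
"""
```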

Best Practices for Bulk Ingest

If your new collection has a combination of the above factors then please follow these best practices:

  • The best way to estimate the actual compute required is to experiment with a small subset of your data first. For example, if you have a 10 TiB data set, first take a 100 GiB subset of it and import that using the same ingest SQL transformation, data clustering, and all other collection configuration that you plan to use on the full data set. You can expect total bulk ingest compute consumption to grow at most linearly with data size; in most cases, built-in throughput optimizations make it scale sub-linearly with data growth.
  • If you have to experiment with complex ingest time data transformations, then ingest a smaller subset of say 10,000 to 100,000 rows of the raw untransformed dataset. Then build the transformations you want to perform on that raw data collection as a view. Once the view returns the data in the form that fits the needs of your application, then take the view definition SQL, change the FROM clause to _input and use that as your ingest SQL transform.
  • If you have to periodically reload all the data in your collection, then instead of deleting and inserting/updating the data in your collection, import the new data set into a brand new collection and use collection aliasing to switch your queries once the new collection is ready for query processing.
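The view-to-ingest-transform step above can be sketched as a tiny helper: take the view's SQL and swap its FROM clause for `_input`. This is a naive single-FROM rewrite, sufficient only for simple view definitions; the view SQL below is hypothetical.

```python
import re

# A hypothetical view definition built against a raw, untransformed collection.
VIEW_SQL = """
SELECT order_id, TRY_CAST(total AS float) AS total
FROM commons.raw_orders
WHERE order_id IS NOT NULL
"""

def view_to_ingest_transform(view_sql: str) -> str:
    """Replace the view's source collection with _input. Assumes a single
    FROM clause referencing exactly one collection (no joins/subqueries)."""
    return re.sub(r"FROM\s+\S+", "FROM _input", view_sql, count=1)
```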

Field Mappings

Collections automatically create indexes on every field path that is included in that collection, including those of nested paths.

To transform incoming data by creating an index on a compound or derived field, configure and create field mappings. Field Mappings provide the means to create new fields by applying SQL expressions on fields of incoming documents.
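As a sketch, one field mapping that derives a new field from two incoming fields might look like the following. The payload shape (`input_fields`, `output_field`, `param` placeholders in the SQL expression) follows the Field Mappings API as best recalled here, and every name is hypothetical; confirm the exact schema in the API reference before use.

```python
# One field mapping: bind incoming fields to SQL parameters, then build a
# derived full_name field with a SQL expression. Field names are invented.
field_mapping = {
    "name": "full_name_mapping",
    "input_fields": [
        {"field_name": "first", "param": "f", "if_missing": "SKIP", "is_drop": False},
        {"field_name": "last", "param": "l", "if_missing": "SKIP", "is_drop": False},
    ],
    "output_field": {
        "field_name": "full_name",
        "value": {"sql": "CONCAT(:f, ' ', :l)"},
        "on_error": "SKIP",
    },
}
```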

Retention Duration

For each collection, you can set a custom retention duration. This determines how long each document is retained after being added to the collection. Low retention values can be used to keep total data indexed low, while still ensuring your applications are querying the latest data.
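Retention is typically expressed in seconds at collection-creation time. A minimal sketch, assuming a `retention_secs` parameter on the create-collection body (verify the parameter name in the API reference):

```python
def retention_secs(days: int) -> int:
    """Convert a retention window in days to seconds."""
    return days * 24 * 60 * 60

# Hypothetical create-collection body keeping only the last 30 days of data.
create_body = {"name": "recent_events", "retention_secs": retention_secs(30)}
```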

Special Fields

Every document ingested into Rockset collections has several system-generated fields which are automatically created by Rockset and added to each document. Learn more about special fields here.

The _events Collection

When your organization is created, Rockset will automatically create a collection named _events which is used for audit logging to provide visibility into your account. Learn more about the _events collection.

Updating Collections

When a collection is created from a managed integration, Rockset will automatically sync your collection with its data source, usually within seconds. For more information about individual source behavior, see the Data Sources section.

If you choose not to create your collection using a managed integration, or wish to make manual changes to data in your collection after Rockset has synced it with your external data source, you can learn more about manually adding, deleting, or patching documents.
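For manual writes, the Add Documents endpoint takes a JSON array of documents. A hedged sketch of the request shape, with hypothetical workspace, collection, and field names:

```python
def add_docs_request(workspace: str, collection: str, docs: list) -> dict:
    """Build (but do not send) an Add Documents request; each element of
    docs becomes one document in the collection."""
    return {
        "method": "POST",
        "path": f"/v1/orgs/self/ws/{workspace}/collections/{collection}/docs",
        "body": {"data": docs},
    }

docs_req = add_docs_request("commons", "orders", [{"order_id": 1, "total": 9.99}])
```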

Querying Collections

Collections can be queried using SQL the same way tables are queried in traditional SQL databases.

You can write and execute queries on your collections in the Query Editor tab of the Rockset Console. Queries can also be executed against collections using the Rockset API and SDKs (Node.js, Python, Java, or Go). SQL queries can also JOIN documents across different Rockset collections and workspaces.
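A sketch of a query issued through the API, joining two collections across workspaces. The endpoint path and `sql.query` body shape follow the Queries API as understood here, and the workspace and collection names are invented:

```python
# Build (but do not send) a query request that joins collections in two
# different workspaces. Names and fields are hypothetical.
query_request = {
    "method": "POST",
    "path": "/v1/orgs/self/queries",
    "body": {
        "sql": {
            "query": """
                SELECT o.order_id, c.email
                FROM sales.orders o
                JOIN commons.customers c ON o.customer_id = c._id
                LIMIT 10
            """,
            "parameters": [],
        }
    },
}
```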

See our SQL Reference for the full list of supported data types, commands, functions, and more.

Collection Errors

Out-of-Sync Collections

Collections sourced from external systems such as Amazon DynamoDB or MongoDB can occasionally become out-of-sync with the data source. This happens rarely when Rockset's replication pipeline encounters errors or bugs.

You can re-sync a collection by dropping and recreating it:

  1. Navigate to the Collection Details page for your collection in the Rockset Console.
  2. Click the "Clone this Collection" button. This will open the Create Collection form in a new tab, and populate the configuration from your existing collection. Double-check that this configuration is correct, but do not click the Create Collection button.
  3. In the original tab displaying your Collection Details, click the "Delete Collection" button. Wait at least 30 seconds for the collection deletion to complete.
  4. Click the "Create Collection" button in the Collection Creation tab to recreate the collection.

Note that recreating collections will incur additional ingest costs in Rockset and potentially additional charges on the data source (e.g. RCU charges in Amazon DynamoDB). Furthermore, because re-syncing involves dropping and recreating a collection, there will be a period of unavailability. Thus, if you choose to re-sync a collection, we recommend that you do so during a scheduled maintenance window.