Monitoring And Alerting

Metrics Endpoint

Beyond the Console Metrics page, additional metrics are accessible through the metrics endpoint in Prometheus/OpenMetrics format. This format is compatible with monitoring and alerting tools such as Prometheus, Datadog, and AWS CloudWatch, among many others.

$ curl https://$ROCKSET_SERVER/v1/orgs/self/metrics -u {API key}:
# HELP rockset_collections Number of collections.
# TYPE rockset_collections gauge
rockset_collections{virtual_instance_id="30",workspace_name="commons",} 20.0
rockset_collections{virtual_instance_id="30",workspace_name="myWorkspace",} 2.0
rockset_collections{virtual_instance_id="30",workspace_name="myOtherWorkspace",} 1.0
# HELP rockset_collection_size_bytes Collection size in bytes.
# TYPE rockset_collection_size_bytes gauge
rockset_collection_size_bytes{virtual_instance_id="30",workspace_name="commons",collection_name="_events",} 3.74311622E8
...

You can enable the metrics endpoint for your Virtual Instance from the Metrics tab in the Rockset Console.
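
For ad hoc inspection outside a monitoring tool, the endpoint can also be read from a script. The following is a minimal sketch, assuming HTTP basic auth with the API key as the username and an empty password (mirroring the curl call above). The helper names `fetch_metrics` and `parse_samples` are hypothetical and not part of any Rockset client, and the regex-based parser only covers simple sample lines like those shown above.

```python
import base64
import re
import urllib.request

# Matches lines such as: name{label="value",} 20.0  (optional trailing timestamp)
SAMPLE_RE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(server, api_key):
    """Return the raw metrics text (hypothetical helper; needs network access)."""
    url = f"https://{server}/v1/orgs/self/metrics"
    # Basic auth: API key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


def parse_samples(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a plain sample line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Calling `parse_samples(fetch_metrics(server, api_key))` would then yield one tuple per sample. For production monitoring, a Prometheus-compatible scraper is usually the better choice.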

The endpoint currently uses three metric types: Gauge, Counter, and Histogram.

Some metric types (e.g. Histogram) are represented through a set of sub-items.

For example, the rockset_query_latency_seconds metric (a Histogram) would be represented by several rockset_query_latency_seconds_bucket records along with a rockset_query_latency_seconds_sum.

Most monitoring clients will handle these complex types automatically on your behalf.
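
To make the sub-item representation concrete, the sketch below estimates a percentile from cumulative histogram buckets by linear interpolation within the matching bucket, similar in spirit to what PromQL's `histogram_quantile` does. The function name `estimate_quantile` and the bucket values in the usage note are hypothetical. It assumes each `_bucket` record carries a cumulative count for its upper bound, as in the standard Prometheus histogram format.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, one per
    *_bucket record, with math.inf as the bound of the last bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][1] == 0:
        return math.nan  # no observations recorded
    rank = q * ordered[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into the open-ended bucket.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

For example, with made-up buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, `estimate_quantile(0.95, ...)` falls halfway into the 0.5 to 1.0 second bucket. The result is only as precise as the bucket boundaries allow.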

The following metrics are provided and updated at one-minute intervals:

Virtual Instance Metrics

| Metric | Type | Description |
| --- | --- | --- |
| rockset_leaf_cpu_utilization_percentage | Gauge | Average CPU utilization across the leaves in a Virtual Instance. Leaf nodes store and ingest data. Leaf CPU utilization reflects both data ingestion and query processing. |
| rockset_leaf_memory_utilization_percentage | Gauge | Average memory utilization across the leaves in a Virtual Instance. Leaf nodes store and ingest data. Leaf memory utilization reflects both data ingestion and query processing. |
| rockset_leaf_block_cache_utilization_percentage | Gauge | Percentage of total memory on the Virtual Instance that the block cache is using. The block cache is where Rockset caches data for reads. |
| rockset_leaf_block_cache_allocation_percentage | Gauge | The block cache can use up to this percentage of total memory of the Virtual Instance. The block cache is where Rockset caches data for reads. |
| rockset_leaf_block_cache_hit_percentage | Gauge | The hit rate measures how often the queried data is found in the block cache. It is computed as block cache hits / (block cache hits + block cache misses). |
| rockset_leaf_memtable_utilization_percentage | Gauge | Percentage of total memory on the Virtual Instance that the memtable is using. The memtable is an in-memory data structure that stores recently updated data before flushing it to the on-disk storage (SST). We call this the ingest buffer or tailing buffer in the console. |
| rockset_leaf_memtable_allocation_percentage | Gauge | The memtable can use up to this percentage of total memory of the Virtual Instance. The memtable is an in-memory data structure that stores recently updated data before flushing it to the on-disk storage (SST). We call this the ingest buffer or tailing buffer in the console. |
| rockset_leaf_tailing_stopped_timestamp_seconds | Gauge | The timestamp at which tailing stopped on your Virtual Instance. If tailing is active, this value is 0. Tailing stops when you exceed the memory limit of your memtable. Periodically, the VI restarts to try to recover to a stable state, so you may see tailing resume temporarily. However, if the Virtual Instance continues to have insufficient memory, tailing will stop again. |

Virtual Instance metrics are useful for monitoring compute usage and alerting when your VI is near the limits of its performance. Query performance and ingest latency may both degrade as these metrics near 100%.
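
As an illustration of alerting on these values, the sketch below checks a snapshot of Virtual Instance metrics against a threshold. The function name `virtual_instance_warnings` and the 90% default are assumptions chosen for the example, not recommended settings. Memtable usage is compared against its allocation, since both are expressed as percentages of total memory.

```python
def virtual_instance_warnings(values, threshold=90.0):
    """Return warning strings for a dict of {metric_name: latest_value}.

    `threshold` is a hypothetical alerting level in percent.
    """
    warnings = []
    for name in (
        "rockset_leaf_cpu_utilization_percentage",
        "rockset_leaf_memory_utilization_percentage",
    ):
        value = values.get(name)
        if value is not None and value >= threshold:
            warnings.append(f"{name} is at {value:.1f}%")

    # Memtable usage relative to the share of memory it is allowed to use.
    used = values.get("rockset_leaf_memtable_utilization_percentage")
    allowed = values.get("rockset_leaf_memtable_allocation_percentage")
    if used is not None and allowed:
        share = 100.0 * used / allowed
        if share >= threshold:
            warnings.append(f"memtable is at {share:.1f}% of its allocation")

    # A non-zero value means tailing has stopped.
    stopped = values.get("rockset_leaf_tailing_stopped_timestamp_seconds", 0.0)
    if stopped > 0:
        warnings.append(f"tailing stopped (timestamp {stopped:.0f})")
    return warnings
```

In practice the same checks are usually expressed as alert rules in the monitoring tool itself; the function only shows the logic.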

Collection Metrics

| Metric | Type | Description |
| --- | --- | --- |
| rockset_collections | Gauge | Number of collections. |
| rockset_collection_size_bytes | Gauge | Collection size in bytes. Note that this size reflects the current storage size and will decrease as documents are deleted or expire under the specified retention duration. |
| rockset_collection_documents | Gauge | Number of documents currently in each collection. |
| rockset_collection_total_ingest_bytes | Counter | Number of bytes ingested over the history of each collection. Note that this count only ever increases and is therefore well suited for increase and rate functions to compute ingest over time. |
| rockset_collection_parse_errors | Counter | Number of parse errors for each collection. |
| rockset_collection_data_discovery_latency | Histogram | The duration (in seconds) from when new or updated data appears in a data source until Rockset first detects it. Elevated values for this metric often reflect configuration issues in the underlying data source (e.g. an inadequate number of RCUs provisioned for DynamoDB sources). |
| rockset_collection_data_process_latency | Histogram | The duration (in seconds) from when new or updated data is first detected by Rockset until the data is fully processed and queryable. Elevated values for this metric can be alleviated by allocating additional compute to your Virtual Instance. |
| rockset_collection_memtable_utilization_percentage | Gauge | Percentage of total memory on the Virtual Instance that the memtable is using to tail this collection. The memtable is an in-memory data structure that stores recently updated data before flushing it to the on-disk storage (SST). We call this the ingest buffer or tailing buffer in the console. |
| rockset_data_discovery_latency | Histogram | Data discovery latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections. |
| rockset_data_process_latency | Histogram | Data process latency across all collections. Unlike the collection-specific metric, this metric continues to include data from deleted collections. |
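
Counters such as rockset_collection_total_ingest_bytes are typically turned into ingest-over-time figures by increase and rate functions. The sketch below shows roughly what those functions compute from successive scrapes. The names `counter_increase` and `counter_rate` and the sample values are hypothetical. Treating a drop in value as a counter reset is an assumption borrowed from common Prometheus practice.

```python
def counter_increase(samples):
    """Total increase across time-ordered (timestamp_seconds, value) samples.

    A drop between consecutive samples is treated as a counter reset, so
    the new value is counted as the increase since the reset.
    """
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += current - previous if current >= previous else current
    return increase


def counter_rate(samples):
    """Average per-second increase over the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed
```

With three one-minute scrapes reading 1000, 1600, and 2200 bytes, the increase is 1200 bytes and the rate is 10 bytes per second.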

Query Metrics

| Metric | Type | Description |
| --- | --- | --- |
| rockset_queries | Counter | Cumulative count of queries run on this Virtual Instance. |
| rockset_query_latency_seconds | Histogram | Query latency, including admission control duration. Note that this metric is exposed as a histogram, so you can compute any PXX that you'd like with an accuracy of +/- ~15% in almost all cases. |
| rockset_query_admission_latency_seconds | Histogram | Admission control queue duration per query if admission control is enabled for your account. |
| rockset_query_queue_size | Gauge | Number of queries currently queued (throttled by admission control). |
| rockset_query_errors | Counter | Number of query execution errors, labeled by HTTP error code (e.g. 404, 500). |
| rockset_query_lambda_queries | Counter | Number of queries by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
| rockset_query_lambda_latency_seconds | Histogram | Query latency by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
| rockset_query_lambda_admission_latency_seconds | Histogram | Query admission latency by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
| rockset_query_lambda_errors | Counter | Number of query execution errors by Query Lambda. Note that the tag label is tracked if and only if the execution is specified by tag. |
| rockset_running_queries | Gauge | Number of queries that are currently running on the Virtual Instance. |

Reference Configurations & Templates

You can find reference configurations and templates for Prometheus, Datadog, Grafana and Alertmanager here.

Below is an example entry for the Prometheus scrape_configs section:

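  - job_name: Rockset Metrics API
    # Metrics are updated at one-minute intervals, so scrape once per minute.
    scrape_interval: 1m
    scrape_timeout: 1m
    honor_timestamps: true
    static_configs:
      - targets:
        # Replace with your Rockset API server ($ROCKSET_SERVER in the curl example).
        - api.usw2a1.rockset.com
    scheme: https
    basic_auth:
      # The API key is the username; the password is left empty.
      username: <API Key>
      password:
    metrics_path: /v1/orgs/self/metrics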