Running SQL on Nested JSON

When we surveyed the market, we saw the need for a solution that could perform fast SQL queries on fluid JSON data, including arrays and nested objects:

The Challenge of SQL on JSON

Some form of ETL to transform JSON to tables in SQL databases may be workable for basic JSON data with fixed fields that are known up front. However, JSON with nested objects or new fields that “can spring up every 2-4 weeks,” as the original Stack Overflow poster put it, is impossible to handle in such a rigid manner.

Relational databases offer alternative approaches to accommodate more complex JSON data. SQL Server stores JSON in varchar columns, while Postgres and MySQL have JSON data types. In these scenarios, users can ingest JSON data without conversion to SQL fields, but take a performance hit when querying the data because these columns support minimal indexing at best.

SQL on Nested JSON Using Rockset

With lots of fields that change, get added/removed, etc, it can be rather cumbersome to maintain ETL pipelines. Rockset was designed to help with this problem—by indexing all fields in JSON documents, along with all type information, and exposing a SQL API on top of it.

For example, with a Rockset collection named new_collection, I can start by adding a single document to an empty collection that looks like:

{
    "my-field": "doc1",
    "my-other-field": "some text"
}

... and then query it.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+------------------+
| my-field   | my-other-field   |
|------------+------------------|
| doc1       | some text        |
+------------+------------------+

Now, if a new JSON document comes in with some new fields - maybe with some arrays, nested JSON objects, etc, I can still query it with SQL.

{
    "my-field": "doc2",
    "my-other-field":[
        {
            "c1": "this",
            "c2": "field",
            "c3": "has",
            "c4": "changed"
        }
    ]
}

I add that to the same collection and can query it just as before.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+---------------------------------------------------------------+
| my-field   | my-other-field                                                |
|------------+---------------------------------------------------------------|
| doc1       | some text                                                     |
| doc2       | [{'c1': 'this', 'c2': 'field', 'c3': 'has', 'c4': 'changed'}] |
+------------+---------------------------------------------------------------+

I can further flatten nested JSON objects and array fields at query time and construct the table I want to get to - without having to do any transformations beforehand.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------+-------+------+---------+
| c1   | c2    | c3   | c4      |
|------+-------+------+---------|
| this | field | has  | changed |
+------+-------+------+---------+

In addition to this, there is strong type information stored, which means I won't get tripped up by having mixed types, etc. Adding a third document:

{
    "my-field": "doc3",
    "my-other-field":[
        {
            "c1": "unexpected",
            "c2": 99,
            "c3": 100,
            "c4": 101
        }
    ]
}

It still adds my document as expected.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------------+-------+------+---------+
| c1         | c2    | c3   | c4      |
|------------+-------+------+---------|
| unexpected | 99    | 100  | 101     |
| this       | field | has  | changed |
+------------+-------+------+---------+

... and the fields are strongly typed.

rockset> select typeof(mof.c2) 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+-----------+
| ?typeof   |
|-----------|
| int       |
| string    |
+-----------+

If being able to run SQL on complex JSON, without any ETL, data pipelines, or fixed schema, sounds interesting to you, you should give Rockset a try.

Real-time SQL on NoSQL

Related Posts

Grafana Time-Series Dashboards with the Rockset-Grafana Plugin

How Rockset uses Grafana dashboards for monitoring production systems, Kubernetes, and GitHub metrics, and how we built a Rockset-Grafana plugin.

Data-Driven Decisions for Where to Park in SF

We built an app to estimate the risk of a car break-in based on historical incidents.

SQL Query Planning for Operational Analytics

We discuss how SQL query planning is implemented to support operational analytics requirements, like low latency and high concurrency, in Rockset.