Running SQL on Nested JSON

When I run SQL on nested JSON during a demo, a viewer will sometimes remark, "It's like magic!" It's definitely awesome to take raw JSON, without any ETL, data pipelines, or fixed schema, and make it immediately queryable via SQL, but it's far from magic. Instead, it has everything to do with how Rockset is designed.

The Challenge of SQL on JSON

When we surveyed the market, we saw the need for a solution that could perform fast SQL queries on fluid JSON data. The existing approaches each fall short:

Some form of ETL to transform JSON into tables in SQL databases may be workable for basic JSON data with fixed fields that are known up front. However, JSON with nested objects or new fields that "can spring up every 2-4 weeks," as the original Stack Overflow poster put it, is impractical to handle in such a rigid manner.

Relational databases offer alternative approaches to accommodate more complex JSON data. SQL Server stores JSON in varchar columns, while Postgres and MySQL have JSON data types. In these scenarios, users can ingest JSON data without conversion to SQL fields, but take a performance hit when querying the data because these columns support minimal indexing at best.
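
As a rough illustration of that tradeoff, here is a minimal sketch of the Postgres route, storing raw JSON in a jsonb column and extracting fields at query time. The events table and doc column are hypothetical names used only for this example:

-- Store the raw JSON in a jsonb column and pull fields out at query time.
CREATE TABLE events (
    doc jsonb
);

INSERT INTO events (doc)
VALUES ('{"my-field": "doc1", "my-other-field": "some text"}');

-- Extract top-level fields as text with the ->> operator. Predicates on
-- nested fields generally won't use an index unless one is created
-- explicitly, e.g. CREATE INDEX ON events USING GIN (doc);
SELECT doc->>'my-field'       AS my_field,
       doc->>'my-other-field' AS my_other_field
FROM events;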

But what if there was a better way? What if we could ingest JSON, with its arrays and nested objects, without any upfront schema, and provide millisecond-latency SQL queries on the data at the same time?

SQL on Nested JSON Using Rockset

With lots of fields that change or get added and removed over time, it can be rather cumbersome to maintain ETL pipelines. Rockset was designed to help with this problem: it indexes all fields in JSON documents, along with their type information, and exposes a SQL API on top of that index.

For example, starting with an empty Rockset collection named new_col, I can add a single document that looks like this:

{
    "my-field": "doc1",
    "my-other-field": "some text"
}

... and then query it.

rockset> select "my-field", "my-other-field" 
         from new_col;

+------------+------------------+
| my-field   | my-other-field   |
|------------+------------------|
| doc1       | some text        |
+------------+------------------+

Now, if a new JSON document comes in with new fields, perhaps including arrays or nested JSON objects, I can still query it with SQL.

{
    "my-field": "doc2",
    "my-other-field":[
        {
            "c1": "this",
            "c2": "field",
            "c3": "has",
            "c4": "changed"
        }
    ]
}

I add that to the same collection and can query it just as before.

rockset> select "my-field", "my-other-field" 
         from new_col;

+------------+---------------------------------------------------------------+
| my-field   | my-other-field                                                |
|------------+---------------------------------------------------------------|
| doc1       | some text                                                     |
| doc2       | [{'c1': 'this', 'c2': 'field', 'c3': 'has', 'c4': 'changed'}] |
+------------+---------------------------------------------------------------+

I can further flatten nested JSON objects and array fields at query time to construct the table I want, without having to do any transformations beforehand.

rockset> select mof.* 
         from new_col, unnest(new_col."my-other-field") as mof;

+------+-------+------+---------+
| c1   | c2    | c3   | c4      |
|------+-------+------+---------|
| this | field | has  | changed |
+------+-------+------+---------+
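
Because unnest produces one row per array element, I can also combine fields from the parent document with fields from the unnested elements. A sketch of what that might look like against the same collection (the column aliases here are my own):

rockset> select new_col."my-field" as doc, mof.c1, mof.c4
         from new_col, unnest(new_col."my-other-field") as mof;

Given the documents above, this would return doc2's array element alongside the "my-field" value of its parent document.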

In addition, Rockset stores strong type information for each field, which means I won't get tripped up by mixed types. Adding a third document:

{
    "my-field": "doc3",
    "my-other-field":[
        {
            "c1": "unexpected",
            "c2": 99,
            "c3": 100,
            "c4": 101
        }
    ]
}

The document is added as expected, and the same query now returns both rows.

rockset> select mof.* 
         from new_col, unnest(new_col."my-other-field") as mof;

+------------+-------+------+---------+
| c1         | c2    | c3   | c4      |
|------------+-------+------+---------|
| unexpected | 99    | 100  | 101     |
| this       | field | has  | changed |
+------------+-------+------+---------+

... and the fields are strongly typed.

rockset> select typeof(mof.c2) 
         from new_col, unnest(new_col."my-other-field") as mof;

+-----------+
| ?typeof   |
|-----------|
| int       |
| string    |
+-----------+
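
Since the type of each value is queryable, it can also be used in a predicate, for example to keep only the rows where c2 is an integer. A sketch along the same lines:

rockset> select mof.c2
         from new_col, unnest(new_col."my-other-field") as mof
         where typeof(mof.c2) = 'int';

Given the documents above, this would return just the 99 from doc3.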

If all this sounds like a great way to run SQL on your nested JSON, you should give Rockset a try.