Sign up to receive blog updates in your inbox.

Running SQL on Nested JSON

When we surveyed the market, we saw the need for a solution that could perform fast SQL queries on fluid JSON data, including arrays and nested objects:

The Challenge of SQL on JSON

Some form of ETL to transform JSON to tables in SQL databases may be workable for basic JSON data with fixed fields that are known up front. However, JSON with nested objects or new fields that “can spring up every 2-4 weeks,” as the original Stack Overflow poster put it, is impossible to handle in such a rigid manner.

Relational databases offer alternative approaches to accommodate more complex JSON data. SQL Server stores JSON in varchar columns, while Postgres and MySQL have JSON data types. In these scenarios, users can ingest JSON data without conversion to SQL fields, but take a performance hit when querying the data because these columns support minimal indexing at best.

SQL on Nested JSON Using Rockset

With lots of fields that change, get added/removed, etc, it can be rather cumbersome to maintain ETL pipelines. Rockset was designed to help with this problem—by indexing all fields in JSON documents, along with all type information, and exposing a SQL API on top of it.

For example, with a Rockset collection named new_collection, I can start by adding a single document to an empty collection that looks like:

{
    "my-field": "doc1",
    "my-other-field": "some text"
}

... and then query it.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+------------------+
| my-field   | my-other-field   |
|------------+------------------|
| doc1       | some text        |
+------------+------------------+

Now, if a new JSON document comes in with some new fields - maybe with some arrays, nested JSON objects, etc, I can still query it with SQL.

{
    "my-field": "doc2",
    "my-other-field":[
        {
            "c1": "this",
            "c2": "field",
            "c3": "has",
            "c4": "changed"
        }
    ]
}

I add that to the same collection and can query it just as before.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+---------------------------------------------------------------+
| my-field   | my-other-field                                                |
|------------+---------------------------------------------------------------|
| doc1       | some text                                                     |
| doc2       | [{'c1': 'this', 'c2': 'field', 'c3': 'has', 'c4': 'changed'}] |
+------------+---------------------------------------------------------------+

I can further flatten nested JSON objects and array fields at query time and construct the table I want to get to - without having to do any transformations beforehand.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------+-------+------+---------+
| c1   | c2    | c3   | c4      |
|------+-------+------+---------|
| this | field | has  | changed |
+------+-------+------+---------+

In addition to this, there is strong type information stored, which means I won't get tripped up by having mixed types, etc. Adding a third document:

{
    "my-field": "doc3",
    "my-other-field":[
        {
            "c1": "unexpected",
            "c2": 99,
            "c3": 100,
            "c4": 101
        }
    ]
}

It still adds my document as expected.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------------+-------+------+---------+
| c1         | c2    | c3   | c4      |
|------------+-------+------+---------|
| unexpected | 99    | 100  | 101     |
| this       | field | has  | changed |
+------------+-------+------+---------+

... and the fields are strongly typed.

rockset> select typeof(mof.c2) 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+-----------+
| ?typeof   |
|-----------|
| int       |
| string    |
+-----------+

If being able to run SQL on complex JSON, without any ETL, data pipelines, or fixed schema, sounds interesting to you, you should give Rockset a try.

Real-time SQL on NoSQL. Start today.

Related Posts

Index Scan: Using Rockset's Search Index to Speed up Range Scans Over a Specific Field

Rockset uses Converged Indexing to make different types of queries run fast. We look at how Rockset's Index Scan uses the search index to accelerate range scans.

Can I Do SQL-Style Joins in Elasticsearch?

While joins are primarily an SQL concept, they are equally important in the NoSQL world as well. We explore how to perform the equivalent of SQL joins when using Elasticsearch.

Joining Data in DynamoDB and S3 for Live, Ad-Hoc Analysis

Using SQL to join DynamoDB and S3 data, operations teams can perform live, ad-hoc analysis across multiple cloud systems.