Building a SQL Development Environment for Messy, Semi-Structured Data

June 13, 2019

,

Why build a new SQL development environment?

We love SQL — our mission is to bring fast, real-time queries to messy, semi-structured real-world data and SQL is a core part of our effort. A SQL API allows our product to fit neatly into the stacks of our users without any workflow re-architecting. Our users can easily integrate Rockset with a multitude of existing tools for SQL development (e.g. Datagrip, Jupyter, RStudio) and data exploration / visualization (e.g. Tableau, Redash, Superset). Why ‘reinvent the wheel’ and create our own SQL development environment?

Despite the quantity and quality of editors and dashboards available in the SQL community, we realized that using SQL on raw data (e.g. nested JSON, Parquet, XML) was a novel concept to our users. While Rockset supports standard ANSI SQL, we did add some extensions for arrays and object. And we built Rockset around two core principles: strong dynamic typing and the document object model. While these enable data queries that haven’t traditionally been feasible, they can also run against traditional query development workflows. For example:

  • Strong dynamic typing (TLDR: many different types of data can live in a Rockset field at once): Despite its advantages, strong dynamic typing can lead to some puzzling query results. For example, a

    SELECT *
    WHERE field > 0
    

    query on data [{ field: '1'}, { field: '2'}, { field: 3 }] will return only one value (3), or none on data [{ field: '1'}, { field: '2'}, { field: '3' }]. If a query editor fails to relate the multiple field types present in the field to the user, confusion can ensue.

  • Document object model / Smart schemas (TLDR: Rockset ‘schemas' resemble more JSON objects than field lists): Fields can be nested within other fields or even within arrays. Traditional schema viewers struggle to represent this, especially when multiple types or nested arrays are involved. Additionally, even seasoned SQL veterans might not be familiar with some of the array and object functions that we support.

With these challenges in mind, we decided to build our own SQL development environment from the ground up. We still expect (and hope) our users will take their queries to explore and visualize on the third-party tools of their choice, but hope that we can help along the way in their quest to run familiar SQL on their messy data with as little pain as possible. To do so, our new editor incorporates several key features that we felt we uniquely could provide.

Full Editor

Screen Shot 2019-06-13 at 4.54.26 PM

Custom Features

  • Inline interactive documentation: Unsure what functions we support or what arguments a function requires? From now on all functions supported by Rockset will be included in our autocomplete widget along with a description and link into the relevant portions of our documentation for more details.

Screen Shot 2019-06-10 at 2.10.05 PM

  • Inline field type distribution: Don’t remember what type a field is? See it as you build and ensure you’re writing the query you’re intending to. Or use it to debug a query when the results don’t quite match your expectations.

Screen Shot 2019-06-10 at 2.11.18 PM

  • Instant feedback: We run every query fragment through our SQL parser in real time so that typos, syntax mistakes and other common errors can be discovered as early in the construction process as possible.

Screen Shot 2019-06-10 at 2.31.01 PM

  • Completions for nested fields: Our field completion system is modeled on the document model of the underlying data. No matter the level of nesting, you’ll always get available field completions.

Screen Shot 2019-06-10 at 2.51.42 PM

These new features are accompanied by all the usual things you’d expect in your SQL development environment (schemas, query history, etc).

Technical Challenges

Along the way, we ran into several interesting technical challenges:

  • Tokenizing nested paths and alias processing: some fun language processing / tokenization hacking. CodeMirror (the editor framework we chose) comes with basic SQL syntax highlighting and SQL keyword / table / column completion, but we ultimately built our own parser and completion generators that better accounted for nested field paths and could better interface with our schemas.
  • Bringing in function signatures and descriptions: how could we avoid hardcoding these in our frontend code? To do so would leave this information in three places (frontend code, documentation files, and backend code) - a precarious situation that would almost certainly lose consistency over time. However, as we store our raw documentation files in XML format, we were able to add semantic XML parsing tags directly to our documentation codebase, which we then preprocess out of the docs and into our product at compile time on every release.
  • Showing ‘live’ parse errors: we didn’t want to actually run the query each time, as that would be expensive and wasteful. However we dug into our backend code processes and realized that queries go through two phases - syntax parsing and execution planning - without touching data whatsoever. We added an ‘out switch’ so that validation queries could go through these two stages and report success or failure without continuing on into the execution process. All it took was a bit of hacking around our backend.

Conclusion

We’re excited to introduce these new features as a first step in building the ultimate environment for querying complex, nested mixed-type data, and we’ll be continually improving it over the coming months. Take it for a spin and let us know what you think!

Something else you'd like to see in our SQL development environment? Shoot me an email at scott [at] rockset [dot] com

Resources: CodeMirror (editor and basic autocomplete), Numeracy (widget design inspiration)