Why SQL on Raw Data?
November 1, 2018
Over a decade after the inception of the Hadoop project, the amount of unstructured data available to modern applications continues to increase. Moreover, despite forecasts to the contrary, SQL remains the lingua franca of data processing; today’s NoSQL and Big Data platforms are often used through some form of SQL-based querying. This longevity is a testament to the community of analysts and data practitioners who are familiar with SQL, as well as to the mature ecosystem of tools around the language.
A Major Pain Point
However, querying unstructured data with SQL on modern platforms remains painful. Preparing an unstructured data source for SQL queries in analytics, data science, or application development requires a sequence of tedious steps: figure out how the data is currently formatted, determine a desired schema, input this schema into a SQL engine, and finally load the data and issue queries. This setup is a major overhead, and it isn’t a one-time tax: users must repeat these steps as data sources and formats evolve.
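To make the four steps concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a traditional SQL engine. The sample events, the table name, and the column names are all invented for illustration; the point is only that a schema has to be worked out and declared by hand before the first query can run.

```python
import json
import sqlite3

# Hypothetical raw input: newline-delimited JSON events (invented for illustration).
RAW = """\
{"user_id": "ana", "kind": "click", "ms": 120}
{"user_id": "bo", "kind": "view", "ms": 45}
{"user_id": "ana", "kind": "view", "ms": 80}
"""

# Step 1: figure out how the data is currently formatted.
records = [json.loads(line) for line in RAW.splitlines() if line.strip()]
observed_fields = sorted({key for rec in records for key in rec})

# Step 2: determine a desired schema (written by hand from the observed fields).
SCHEMA = "CREATE TABLE events (user_id TEXT, kind TEXT, ms INTEGER)"

# Step 3: input the schema into a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Step 4: load the data, then issue queries.
conn.executemany(
    "INSERT INTO events (user_id, kind, ms) VALUES (:user_id, :kind, :ms)",
    records,
)
totals = conn.execute(
    "SELECT user_id, SUM(ms) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()

print(observed_fields)
print(totals)
```

If a new field appears in the source, or an existing one changes type, steps 1 through 4 have to be revisited, which is the recurring tax described above.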
Fortunately, storage and compute substrates are changing quickly, leading to new opportunities in the form of optimized schemaless SQL processing systems. Specifically:
Storage. With an abundance of inexpensive storage, we can afford to build new types of indexes that allow us to ingest raw data in multiple formats. Instead of having to select a single storage representation optimized for a single type of query, we can store multiple representations of data, and use the best representation for each query as it arrives. To find a single record, we can use a record-based index; to search by a given term, use an inverted index; and, to perform fast aggregation, use columnar encodings. With a range of representations, it’s possible to automatically shred and slice raw data into each index type, allowing us to skip the overhead of schema declaration without sacrificing performance.
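As a rough illustration of the storage idea above, the following toy sketch shreds raw JSON documents into three in-memory representations at ingest time: a record index, an inverted index, and a per-field column store. It is not a description of how Rockset or any real system is implemented; the documents, field names, and the ingest helper are hypothetical, and a real system would persist and compress these structures.

```python
import json
from collections import defaultdict

# Hypothetical raw documents; no schema is declared and the fields vary.
RAW_DOCS = [
    '{"id": "a1", "text": "fast sql on raw data", "bytes": 120}',
    '{"id": "b2", "text": "raw json logs", "bytes": 300, "region": "eu"}',
    '{"id": "c3", "text": "sql dashboards", "bytes": 80}',
]

record_index = {}                   # doc id -> full document (point lookups)
inverted_index = defaultdict(set)   # term -> set of doc ids (term search)
column_store = defaultdict(dict)    # field -> {doc id -> value} (aggregation)


def ingest(raw):
    """Shred one raw JSON document into all three representations."""
    doc = json.loads(raw)
    doc_id = doc["id"]
    record_index[doc_id] = doc
    for field, value in doc.items():
        column_store[field][doc_id] = value
        if isinstance(value, str):
            # Index every whitespace-separated term of every string field.
            for term in value.lower().split():
                inverted_index[term].add(doc_id)


for raw in RAW_DOCS:
    ingest(raw)

# Each kind of query is served by the representation that suits it best.
one_record = record_index["b2"]                     # find a single record
sql_docs = sorted(inverted_index["sql"])            # search by a given term
total_bytes = sum(column_store["bytes"].values())   # aggregate one column

print(one_record, sql_docs, total_bytes)
```

The trade-off is the one the paragraph describes: every document is stored several times, which costs space, in exchange for not having to pick one layout or declare a schema up front.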
Compute. The cloud has made distributed, elastic compute cheaper than ever. As a result, we can scale our query processing quickly and efficiently in response to workload requirements. With serverless execution, it’s possible to scale up for bursts of query processing in seconds or less. For horizontally scalable analytics queries, we can precisely scale a set of worker nodes to match a query-specific latency SLA. In addition, we can use this elasticity to allocate heterogeneous resources, for example by aging SSD-resident data to cold storage nodes over time. Compared to on-premises designs, cloud-native design makes this elasticity orders of magnitude more powerful, and means queries on unstructured data can run fast, even for complex operations.
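The idea of sizing a worker pool to a latency SLA can be sketched with back-of-the-envelope arithmetic. The function below assumes a perfectly parallel scan with no coordination overhead, which real queries will not achieve, and the throughput and data-size figures are made up for illustration.

```python
import math


def workers_needed(scan_bytes, bytes_per_worker_per_sec, sla_seconds):
    """Smallest worker count that finishes the scan within the SLA.

    Assumes the scan parallelizes perfectly across workers (an idealization).
    """
    if scan_bytes < 0 or bytes_per_worker_per_sec <= 0 or sla_seconds <= 0:
        raise ValueError("sizes and rates must be positive")
    per_worker_capacity = bytes_per_worker_per_sec * sla_seconds
    return max(1, math.ceil(scan_bytes / per_worker_capacity))


# Hypothetical numbers: a 200 GB scan at 500 MB/s per worker.
SCAN = 200 * 10**9
RATE = 500 * 10**6

tight = workers_needed(SCAN, RATE, sla_seconds=2)
relaxed = workers_needed(SCAN, RATE, sla_seconds=10)
print(tight, relaxed)
```

Under these assumptions, tightening the SLA from ten seconds to two means asking for five times as many workers for the duration of the query, which is practical only when compute can be acquired and released elastically.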
Pulling It Off
In theory, one could simply “bolt on” these kinds of optimizations to traditional data systems. However, the last twenty years of database development suggest that the result would be unlikely to perform well. Instead, taking full advantage of these opportunities requires a new platform that’s built from scratch with these shifts in data, compute, and storage in mind.
With today’s release, Dhruba, Venkat, and the Rockset team are unveiling a serious step towards realizing this potential. Working with the Rockset team over the past two years has been a wonderful experience for me. By combining deep experience in production data analytics and database platforms, like RocksDB, Facebook search, and Google, with an ambitious vision for the future of data-oriented development, Rockset has managed to build a first-of-its-kind, truly schemaless SQL data platform. Rockset allows users to go from raw, unstructured data to SQL queries without first defining a schema, manually loading data, or compromising on performance.
The resulting opportunity for both application developers and data scientists is exciting. Rockset stands to deliver lower data engineering and setup overheads for data-driven dashboards and reporting, data science pipelines, and complex data products. As a systems researcher, I’m particularly excited about the opportunity to incorporate even more index types such as learned index structures, dynamic query replanning in response to load and multi-tenancy, and automated schema inference for highly nested data.