Empowering Developers With Query Flexibility

March 24, 2022

,

Analytics has evolved substantially in the last decade. Companies are adopting streaming data, they are dealing with greater volumes and amounts of data, and more of them are working with diverse third party vendors to receive data. In fact, you can describe big data from many different sources by these five characteristics: volume, value, variety, velocity and veracity.

Even though the complexity, data shape and data volume are increasing and changing, companies are looking for simpler and faster database solutions. More so now than before, companies want to easily query data across different sources without worrying about data ops.

It’s difficult to create data analytics systems that can easily do this while maintaining fast query performance and real-time capabilities. It’s even harder to do this without constantly updating your data ops in some way.

Being able to write and adjust any SQL queries you want on the fly on semi-structured data and across various data sources should be something every data engineer should be empowered to do. Query flexibility allows you to prototype and build new features quickly, without investing in heavy data preparation upfront, saving time and effort and increasing overall productivity. This requires a database to automatically ingest and index semi-structured data and generate an underlying schema even as data shape changes. Relational and non-relational databases each have their own unique challenges when it comes to query flexibility.

Relational databases need a fixed schema in order to write to the row in the table. If the data shape changes, you need to alter the table and update the schema. Just as well, you need to create an index on a column when working with relational databases. This causes an administrative overhead and forces you to think about the queries you want to write in order to create the proper indexes. In terms of query flexibility, well, these things limit it. The second your schema changes or the types of queries you want to execute changes, you’re back and updating your data ops, such as the table or index. This investment is very time-consuming and restricting.

Non-relational databases easily ingest semi-structured, regardless if the data shape changes. However, query time JOINs can be resource-intensive, complex, or even impossible in some non-relations systems. You’ll need to denormalize the data, but this is not a good idea if your data changes frequently. In such cases, denormalization would require updating all of the documents when any subset of the data was to change and so should be avoided. Another option besides denormalization is application-side JOINs, but there’s an operational overhead component because you need to create and maintain the codebase.

The point I want to drive is a database that gives you query flexibility without worrying about the underlying data ops empowers you to prototype and iterate quickly.

There are not many databases out there that give you query flexibility. Here are some real-time analytical databases with good performance that provide some query flexibility:

  • Elasticsearch is optimized for search-like queries like log analytics. When it comes to writing queries outside that scope, you might have some challenges, like aggregations. Also, data that needs to be joined typically has to be denormalized to start with. This requires setting up a data pipeline to denormalize the data upfront. If the data shape change, you’ll have to update the data pipeline.
  • Druid supports broadcast JOINs. However, you need to specify a schema during ingest time, and you need to flatten nested data in order to query it.
  • Rockset ingests semi-structured and nested data without the need to specify a schema or denormalize data. Data is automatically indexed by Rockset via a Converged Index. Converged Index indexes all data, allowing you to write different types of SQL queries (including full JOINs) while still maintaining high query performance.

How important is query flexibility to you for iterating and prototyping when building real-time analytical applications, such as real-time reporting and real-time personalization? What databases are you using for real-time analytics? We invite you to join the discussion in the Rockset Community.


Rockset is the real-time analytics database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.