Mythbusting: The Venerable SQL Database and Today’s Real-Time Analytics

January 5, 2022




It's not your father’s Oracle cluster, but better.

We all know the lightning pace of software innovation.

Show me a technology or platform that’s been around for a decade, and I’ll show you an outmoded relic that’s been leapfrogged by faster, more efficient competitors.

So I don’t fault you for resisting my message, which is that the SQL database that came of age in the 80s still has a critical role to play today in moving data-driven companies from batch to real-time analytics.

This may come as a surprise. In many tech circles, SQL databases remain synonymous with old-school on-premises databases like Oracle or DB2. A good number of organizations have moved on from SQL databases, believing they cannot possibly meet the demanding requirements of modern data applications. But nothing could be further from the truth.

In this article, we’ll examine some commonly held misconceptions about SQL databases and see how they are no longer bound by the limitations of yesteryear, which is why they remain very relevant in an era of real-time analytics.

Once Upon a Time

A Brief History of SQL Databases

SQL was originally developed in 1974 by IBM researchers for use with its pioneering relational database, System R. System R ran only on IBM mainframes that were incredibly powerful for the time and incredibly expensive as well, putting them out of reach of anyone but the NASAs and NOAAs (the National Oceanic and Atmospheric Administration, which runs the National Weather Service) of this world.

SQL only really took off in the 1980s, when Oracle Corp. introduced its SQL-powered database to run on less-expensive minicomputers and servers. Other competitors, such as Microsoft (SQL Server) and Teradata, soon followed.

Different flavors of SQL databases have emerged over time. Data warehousing arrived in the 1990s, and open-source databases such as MySQL and PostgreSQL came into play in the late 90s and 2000s.

Let’s not gloss over the fact that SQL, as a language, remains incredibly popular, the lingua franca of the data world. It ranks third among ALL programming languages according to a 2020 Stack Overflow survey, used by 54.7% of developers.

You may think that engineering teams would favor building on SQL databases as much as possible, given their rich heritage. Yet, when I talk to CTOs and VPs of engineering, I continually hear three myths about how SQL databases can’t possibly support real-time analytics well. Let’s tackle these myths one by one.

Myth №1: SQL Databases Cannot Support Large Streaming Write Rates

Back before real-time analytics was even a dream, the first SQL databases ran on a single machine. As database sizes grew, vendors rewrote them to run on clusters of servers. But this also meant that data had to be distributed across multiple servers. A column-oriented database would be partitioned by column, with each column stored on a particular server. While this made it efficient to retrieve data from a subset of columns, writing a record would require writes to multiple servers. A row-oriented database could do a range partition instead and keep entire records together on one server. However, once secondary indexes sharded by different keys come into play, writing a single record again means writing to multiple servers: the one holding the primary table and the ones holding the secondary indexes.
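To make that write fan-out concrete, here is a tiny Python sketch (the shard layout and server names are hypothetical, not any particular vendor’s design) showing how a single record write touches multiple servers once columns or secondary indexes are sharded by different keys:

```python
# Hypothetical illustration: how one record write fans out across servers
# when columns (or secondary indexes) are sharded by different keys.

record = {"user_id": 42, "country": "DE", "signup_ts": "2022-01-05T10:00:00Z"}

# Column-oriented partitioning: each column lives on its own server,
# so writing one record means touching every column's server.
column_servers = {"user_id": "server-1", "country": "server-2", "signup_ts": "server-3"}
writes_column_store = {column_servers[col] for col in record}
print(writes_column_store)   # three servers for a single record

# Row-oriented partitioning with a secondary index sharded by a different key:
# the primary row goes to one server, but the index entry may go elsewhere.
def shard_for(key, num_shards=4):
    return f"server-{hash(key) % num_shards}"

primary_write = shard_for(record["user_id"])              # where the row lives
index_write = shard_for(("country", record["country"]))   # where the index entry lives
print({primary_write, index_write})                       # often two different servers
```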

Because a single data record gets sent off to many machines to be written, these distributed databases, whether row- or column-oriented, must ensure that the data gets updated on multiple servers in the correct order, so that earlier updates don’t overwrite later ones. This is ensured by one of two techniques: distributed locking or a two-phase commit protocol. While it ensured data integrity, the distributed two-phase commit added a massive delay to SQL database writes, so massive that it inspired the rise of NoSQL databases optimized for fast data writes, such as HBase, Couchbase, and Cassandra.
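Here is a toy Python sketch of the happy path of a two-phase commit (the server names and per-call latency are made up purely for illustration), just to show why every cross-server write pays for two full rounds of coordination before it can complete:

```python
import time

# Toy two-phase commit, simplified to the happy path, to show why every
# cross-server write pays for two full rounds of network calls.
NETWORK_ROUND_TRIP = 0.002  # assume ~2 ms per call; purely illustrative

def send(server, message):
    time.sleep(NETWORK_ROUND_TRIP)  # stand-in for a real RPC
    return "ok"

def two_phase_commit(servers, record):
    # Phase 1: every server must promise it can apply the write (holding locks).
    votes = [send(s, ("prepare", record)) for s in servers]
    if any(v != "ok" for v in votes):
        for s in servers:
            send(s, ("abort", record))
        return False
    # Phase 2: only after all votes arrive can the write actually commit.
    for s in servers:
        send(s, ("commit", record))
    return True

start = time.perf_counter()
two_phase_commit(["primary-shard", "index-shard-1", "index-shard-2"], {"id": 1})
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"one write took ~{elapsed_ms:.0f} ms of simulated coordination")
# roughly 12 ms of simulated network time here, versus ~2 ms for one local write
```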

Newer SQL databases are built differently. Optimized for real-time analytics, they avoid the pitfalls of older SQL databases by using an alternative storage technique called document sharding. When a new document is ingested, a document-sharded database writes the entire document at once to the nearest available machine, rather than splitting it apart and sending different fields to different servers. All secondary indexes for that document reside locally on the same server. This makes storing and writing data extremely fast: because every field and every secondary index entry for a document lives on one single server, there is no need for a distributed cross-server transaction on every update.
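Here is a minimal sketch of the document-sharding idea (a simplification for illustration, not Rockset’s actual implementation): the whole document and all of its secondary index entries land on the same shard, so ingestion is a single local write.

```python
# Minimal document-sharding sketch (hypothetical, not any vendor's real layout):
# route the whole document, plus all of its secondary index entries, to one shard.

NUM_SHARDS = 8
shards = [{"docs": {}, "indexes": {}} for _ in range(NUM_SHARDS)]

def shard_for(doc_id):
    return hash(doc_id) % NUM_SHARDS

def ingest(doc_id, doc, indexed_fields=("email", "country")):
    shard = shards[shard_for(doc_id)]
    shard["docs"][doc_id] = doc                      # the entire document, unsplit
    for field in indexed_fields:                     # secondary index entries stay local
        if field in doc:
            shard["indexes"].setdefault((field, doc[field]), set()).add(doc_id)
    # One local write path; no cross-server transaction needed for this document.

ingest("user:42", {"email": "a@example.com", "country": "DE", "plan": "pro"})
print(shards[shard_for("user:42")])
```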

Document sharding also reminds me of how Amazon stores items in its warehouses for maximum speed. Rather than putting all of the laptops in one aisle and all of the vacuum cleaners in another, most items are stored in the nearest available location, adjacent to unrelated items, albeit tracked by Amazon’s inventory software.

Besides document sharding, new real-time SQL databases achieve super-fast write speeds because they can use a Log-Structured Merge (LSM) tree, a structure popularized by NoSQL databases, rather than the highly structured B-Tree used by prior SQL databases. I will skip the details of how LSM and B-Tree databases work. Suffice it to say that in a B-Tree database, data is laid out as storage pages organized in the form of a B-Tree, and an update performs a read-modify-write of the relevant B-Tree pages. That creates additional I/O overhead during the write phase.

By comparison, an LSM-based database can immediately write data to any free location, with no read-modify-write I/O cycle required first. LSM has other features, such as compaction (merging and rewriting data files to discard obsolete entries and reclaim space), but it’s the ability to write data flexibly and immediately that enables extremely high write rates. Here is a research paper that shows the higher write rates of the RocksDB LSM engine versus the B-Tree-based InnoDB storage engine.
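To see why this helps write throughput, here is a toy LSM-style write path in Python: an in-memory memtable that is periodically flushed to an immutable sorted run, so no existing page is ever read and rewritten on the write path. This is a deliberately simplified sketch of the technique, not how RocksDB actually works.

```python
# Toy LSM-style write path: writes go straight to an in-memory memtable and are
# only sorted and flushed later, so no existing on-disk page is read and rewritten.
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # latest value per key, in memory
        self.segments = []            # immutable sorted runs, our stand-in for disk
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # O(1); no read-modify-write of storage pages
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One sequential write of a sorted run; compaction would later merge runs.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"))   # 3
```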

By using document sharding and LSM trees, SQL-based real-time databases can ingest and store massive amounts of data and make it available within seconds.

Myth №2: SQL Databases Cannot Handle the Changing Schemas of Streaming Data

This myth is also based on outdated perceptions about SQL databases.

It is true that all SQL databases require data to be structured, or organized in the form of schemas. In the past, SQL databases required those schemas to be defined in advance. Any ingested data would have to comply exactly with the schema, thus requiring ETL (Extract, Transform, Load) steps.

However, streaming data typically arrives raw and semi-structured, in the form of JSON, Avro or Protobuf. These streams also continually deliver new fields and columns of data that can be incompatible with existing schemas. That is why traditional, rigid SQL databases cannot ingest raw data streams.

But some newer SQL databases can ingest streaming data by inspecting it on the fly, automatically building schemas from the semi-structured data itself, no matter how nested it is.
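As a rough illustration of this schema-on-ingest idea (a simplified sketch, not any particular database’s implementation), the ingest path can walk each incoming JSON document and grow a field-to-type map as new, possibly nested fields appear:

```python
import json

# Simplified schema inference at ingest time: walk each document and record the
# set of types observed for every (possibly nested) field path.
def update_schema(schema, doc, prefix=""):
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            update_schema(schema, value, prefix=f"{path}.")
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

schema = {}
for raw in [
    '{"user": {"id": 1, "name": "Ada"}, "clicks": 3}',
    '{"user": {"id": 2, "name": "Bo", "plan": "pro"}, "clicks": "3"}',  # new field, new type
]:
    update_schema(schema, json.loads(raw))

print(schema)
# e.g. {'user.id': {'int'}, 'user.name': {'str'}, 'clicks': {'int', 'str'}, 'user.plan': {'str'}}
```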

Data typing is another seeming obstacle for streaming data and SQL databases. As part of its commitment to schemas, SQL requires that data be strongly typed — every value must be assigned a data type, e.g. integer, text string, etc. Strong data typing helps prevent mixing incompatible data types in your queries and generating bad results.

Traditional SQL databases assign a data type to every column in a table when the schema is created. The data type, like the rest of the schema, is static and never changes. That would seem to rule out raw data feeds, where the data type can change constantly due to the data’s dynamic nature.

However, there is a newer approach, supported by some real-time SQL databases, called strong dynamic typing. These databases still assign a data type to all data, except now they can do it at an extremely granular level. Rather than assigning a whole column the same data type, every individual value within a single column can be assigned its own data type. Just because SQL is strongly typed doesn’t mean the database has to be statically typed. Programming languages have shown that strong dynamic typing is both possible and powerful, and many recent advances in compilers and runtimes prove that it can also be extremely efficient; just look at the performance improvements of the V8 JavaScript engine in recent years!
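Here is a tiny illustrative sketch of the difference: instead of one static type per column, each stored value carries its own type tag, and the query layer resolves types per value at read time (the coercion rules below are made up for the example):

```python
# Illustrative contrast between static column typing and strong dynamic typing.

incoming = [29, "31", None, 27.5]   # a raw feed where the type drifts over time

# Static typing: the whole column has one declared type, so a mixed feed fails fast.
static_column_type = int
try:
    typed = [static_column_type(v) for v in incoming]   # int(None) raises TypeError
except TypeError as err:
    print("static column rejects the feed:", err)

# Strong dynamic typing: every individual value keeps its own type tag.
dynamic_column = [(value, type(value).__name__) for value in incoming]
print(dynamic_column)
# [(29, 'int'), ('31', 'str'), (None, 'NoneType'), (27.5, 'float')]

# A query can still treat the column numerically by coercing per value at read time.
def as_number(value, tag):
    if tag in ("int", "float"):
        return value
    if tag == "str":
        try:
            return float(value)
        except ValueError:
            return None
    return None

print([as_number(v, t) for v, t in dynamic_column])   # [29, 31.0, None, 27.5]
```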

Not all newer SQL databases are equal in their support for semi-structured, real-time data. Some data warehouses can extract JSON document data and assign it to different columns. However, if a single null value is detected, the operation fails, forcing the data warehouse to dump the rest of the document into a single universal ‘Other’ data type that is slow and inconvenient to query. Other databases won’t even try to schematize a semi-structured data stream, instead dumping a whole ingested document into a single blob field with one data type. That also makes them slow and difficult to query.

Myth №3: SQL Databases Cannot Scale Writes Without Impacting Queries

This is yet another outdated myth that is untrue of new real-time SQL databases. Traditional on-premises SQL databases tightly coupled the resources used for ingesting data with the resources used for querying it. That meant that whenever a database scaled up reads and writes simultaneously, the two functions contended for the same resources and both slowed down. The workaround was to overprovision your hardware, but that was expensive and wasteful.

As a result, many organizations turned to NoSQL-based systems such as key-value stores, graph databases, and others for big data workloads, and NoSQL databases were celebrated for their performance in handling massive datasets. In truth, NoSQL databases suffer from the same contention problem as traditional SQL databases; users just didn’t encounter it, because big data and machine learning tend to be batch-oriented workloads, with datasets ingested far in advance of the actual queries. It turns out that when NoSQL database clusters try to read and write large amounts of data at the same time, they are just as susceptible to slowdowns.

New cloud-native SQL database services avoid this problem entirely by decoupling the resources used for ingestion from the resources used for querying, so companies can enjoy fast read and write speeds as well as the power of complex analytical queries at the same time. The latest providers explicitly design their systems to separate the ingest and query functions, which eliminates the resource contention problem and means that scaling one does not slow down the other.
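Conceptually (the names and API below are hypothetical, not any vendor’s actual interface), you can picture ingest and query as two independently sized compute pools that share only durable storage, so scaling one never steals capacity from the other:

```python
from dataclasses import dataclass

# Conceptual sketch of compute separation: ingest and query are sized
# independently and share only durable storage (all names are hypothetical).
@dataclass
class ComputePool:
    name: str
    workers: int

    def scale(self, workers):
        # Scaling this pool changes only its own capacity.
        return ComputePool(self.name, workers)

shared_storage = []                      # stands in for shared cloud storage

ingest = ComputePool("ingest", workers=4)
query = ComputePool("query", workers=2)

# A burst of streaming writes: scale ingest up; query capacity is untouched.
ingest = ingest.scale(16)
shared_storage.append({"event": "click", "ts": "2022-01-05T10:00:00Z"})

# A dashboard spike: scale query up; ingest capacity is untouched.
query = query.scale(8)
results = [row for row in shared_storage if row["event"] == "click"]

print(ingest.workers, query.workers, len(results))   # 16 8 1
```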

Conclusion

SQL databases have come a long way. The latest ones blend the time-tested power and efficiency of SQL with the large-scale capabilities of NoSQL and the flexible scalability of cloud-native technologies. Cutting-edge SQL databases can deliver real-time analytics on the freshest data. You can run many complex queries at the same time and still get results instantly. And perhaps the most underrated feature: SQL’s enduring popularity among data engineers and developers makes it the most pragmatic choice for your company as it makes the leap from batch to real-time analytics.

If this blog post helped bust some long-held myths you had about SQL, then perhaps it’s time you took another look at the benefits and power that SQL databases can deliver for your use cases.


Rockset is the real-time analytics database in the cloud for modern data teams. Get faster analytics on fresher data, at lower costs, by exploiting indexing over brute-force scanning.