Expert Talk TLDR: SQL vs NoSQL Databases in the Modern Data Stack
July 22, 2022
Last week, Rockset hosted a conversation with a few seasoned data architects and data practitioners steeped in NoSQL databases to talk about the current state of NoSQL in 2022 and how data teams should think about it. Much was discussed.
Embedded content: https://youtu.be/_rL65XsrB-o
Here are the top 10 takeaways from that conversation.
1. NoSQL is great for well-understood access patterns. It’s not best suited for ad hoc queries or operational analytics.
Where does NoSQL fit in the modern data stack? It fits in workloads where I have high-velocity, well-understood access patterns. NoSQL is about tuning the data model for specific access patterns: removing the JOINs and replacing them with indexes across items on a table that is sharded or partitioned, or across documents in a collection that share indexes. Those index lookups have low time complexity, which satisfies your high-velocity patterns. That’s what’s going to make it cheaper.
2. Regardless of data management systems, everything starts with getting the data model right.
It doesn’t matter what interface you use. What’s important is getting the data model right. If you don’t understand the complexity of how the data is stored, partitioned, denormalized, and the indexes you created, it doesn’t matter what query language you use; it’s just syntactic sugar on top of a complex data model. The first thing to understand is knowing what you’re trying to do with your data and then choosing the right system to power that.
3. Flexibility comes primarily from dynamic typing.
There is a reason why there is a lot more flexibility that you can achieve with the data models in NoSQL systems than SQL systems. That reason is the type system. [This flexibility is not from the programming language]. NoSQL systems are dynamically typed, while typical SQL based systems are statically typed. It’s like going from C++ to Python. Developers can move fast, and build and launch new apps quickly and it is way easier to iterate on.
In relational DBs, you have to store those types in homogeneous containers that are indexed independently of each other. The fundamental purpose of the relational DB is to JOIN those indexes. A NoSQL DB lets you put all those types of items into one table and cut across a common index on shared attributes. This reduces the time complexity of the index join to that of an index lookup.
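To make the join-versus-lookup point concrete, here is a minimal in-memory sketch. It is only an illustration under stated assumptions: plain Python dicts stand in for database indexes, and the `CUSTOMER#...` / `ORDER#...` key naming and both function names are hypothetical, not taken from the conversation.

```python
# Relational-style layout: one homogeneous container per type,
# each indexed independently of the other.
customers_by_id = {
    "c1": {"name": "Ada"},
    "c2": {"name": "Grace"},
}
orders_by_customer = {
    "c1": [{"order_id": "o1", "total": 40}, {"order_id": "o2", "total": 15}],
    "c2": [{"order_id": "o3", "total": 99}],
}


def customer_with_orders_join(customer_id):
    """Join: probe two separate indexes and stitch the results together."""
    customer = customers_by_id[customer_id]
    orders = orders_by_customer.get(customer_id, [])
    return {"customer": customer, "orders": orders}


# Single-table layout: items of different shapes share one partition key,
# so related items are already grouped together at write time.
table = {
    "CUSTOMER#c1": [
        {"sk": "PROFILE", "name": "Ada"},
        {"sk": "ORDER#o1", "total": 40},
        {"sk": "ORDER#o2", "total": 15},
    ],
    "CUSTOMER#c2": [
        {"sk": "PROFILE", "name": "Grace"},
        {"sk": "ORDER#o3", "total": 99},
    ],
}


def customer_with_orders_lookup(customer_id):
    """One lookup on the shared key returns the whole item collection."""
    return table.get(f"CUSTOMER#{customer_id}", [])


joined = customer_with_orders_join("c1")
collection = customer_with_orders_lookup("c1")
```

Note that the profile item and the order items carry different attributes yet live in the same container, which is the dynamic typing described above.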
4. Developers are asking for more from their NoSQL databases and other purpose-built tools are a good complement.
Developers want more than just a database. They want things like online archiving, SQL APIs for downstream consumers, and search indexes that are real, not just tags. For DynamoDB users who need these missing features, Rockset is the other half. I say go there because it’s more tightly coupled and a richer developer experience.
At AWS, a big problem the Amazon service team had with Elasticsearch was the synchronization. One of the reasons why I talked to customers about using Rockset was because it was a seamless integration rather than trying to stitch it together themselves.
5. Don’t blindly dump data into a NoSQL system. You need to know your partitions.
NoSQL is a great solution for storing data and doing quick lookups, but if you don’t know what that partition is, you’re wasting a lot of the benefits of the fast lookup because you’re never going to look it up by that particular thing. A mistake I see a lot of people make is to dump data into a NoSQL system and assume they can just scan it later. If you’re dumping data into a partition, that partition should be known somehow before issuing your query. There should be some way to tie back to that direct lookup. If not, then I don’t think NoSQL is the right way.
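A rough way to see the difference between a direct lookup and a scan is to count how many items each one has to read. The sketch below assumes a toy in-memory store; `PartitionedStore`, its methods, and the device/temperature data are hypothetical names invented for the illustration, not a real database API.

```python
class PartitionedStore:
    """Toy key-partitioned store that counts how many items each call reads."""

    def __init__(self):
        self._partitions = {}
        self.items_read = 0

    def put(self, partition_key, item):
        self._partitions.setdefault(partition_key, []).append(item)

    def query(self, partition_key):
        """Direct lookup: touches only the one partition named by the key."""
        items = self._partitions.get(partition_key, [])
        self.items_read += len(items)
        return list(items)

    def scan(self, predicate):
        """Scan: reads every item in every partition, then filters."""
        matches = []
        for items in self._partitions.values():
            for item in items:
                self.items_read += 1
                if predicate(item):
                    matches.append(item)
        return matches


store = PartitionedStore()
for device in range(100):
    for hour in range(24):
        reading = {"hour": hour, "temp": 20 + (device + hour) % 10}
        store.put(f"device-{device}", reading)

# Known partition key: only that device's 24 readings are read.
store.items_read = 0
by_key = store.query("device-7")
query_cost = store.items_read

# No partition key: all 2,400 readings are read to answer the question.
store.items_read = 0
hot = store.scan(lambda item: item["temp"] >= 29)
scan_cost = store.items_read
```

Here the lookup reads 24 items and the scan reads 2,400, and the gap grows with every device added.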
6. All tools have limitations. You need to understand the tradeoffs within each tool to best leverage it.
One thing I really appreciate about learning about NoSQL is I now really understand the fundamentals a lot more. I worked with SQL for years before NoSQL and I just didn’t know what was happening under the hood. The query planner hides so much. With Dynamo and NoSQL, you learn how partitions work, how that sort key is working, and how global secondary indexes work. You get an understanding of the infrastructure and understand what’s expensive and not expensive. All data systems have tradeoffs and if they hide them from you, then you can’t really take advantage of the good and avoid the bad.
7. Make decisions based on your business stage. When small, optimize on making your people more efficient. When bigger, optimize on making your systems more efficient.
The rule of thumb is to figure out where you are spending the most. Is it infrastructure? Is it software? Is it people? Often, when you’re small, people are the biggest expense so the best decision is to pick a tool that makes your developers more effective and productive. So it’s actually cheaper to use NoSQL systems in this case. But once the scale crosses a threshold [and infrastructure becomes your biggest expense], it makes sense to go from a generic solution [like a NoSQL DB] to a special purpose solution because you’re going to save way more on hardware and infrastructure costs. At that point, there is room for a special purpose system.
My take is developers may want to start with a single platform, but then are going to move to special purpose systems when the CFO starts asking about costs. It may be that the threshold point is getting higher and higher as the tech gets more advanced, but it will happen.
The big data problem is becoming everybody’s problem. We’re not talking about terabytes, we’re talking about petabytes.
8. NoSQL is easy to get started with. Just be aware of how costs are managed as things scale.
I find that DynamoDB is this utility platform, which is great because you can build all kinds of stuff, but if you want to create aggregations, you have to enable DynamoDB Streams and set up Lambda functions so that you can write back to the table and do the aggregations. This is a massive investment in terms of people in setting all those things up: all bespoke, all things you have to do after the fact. The amount of cognitive load that goes into building these things out and then continuing to manage them is huge. And then you get to a point where, for example in DynamoDB, you are now provisioning 3,000 RCUs and things get very expensive as you go. The scale is great, but you start spending a lot of money to do things that could be done more efficiently. And I think in some cases, providers are taking advantage of people.
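To give a sense of the bespoke work involved, here is a hedged sketch of the aggregation step such a Lambda function might perform. It assumes the stream is configured to include both new and old item images, and that items carry a `customer_id` key and a numeric `total` attribute; those attribute names, the helper names, and the sample event are all made up for the example. The write-back to the table is deliberately left out.

```python
from collections import defaultdict
from decimal import Decimal


def _amount(image, attr="total"):
    """Read a numeric attribute from a DynamoDB-JSON image, e.g. {"N": "12.5"}."""
    if not image or attr not in image:
        return Decimal("0")
    return Decimal(image[attr]["N"])


def aggregate_deltas(event):
    """Turn a batch of stream records into a per-customer change in total."""
    deltas = defaultdict(Decimal)
    for record in event.get("Records", []):
        change = record["dynamodb"]
        customer = change["Keys"]["customer_id"]["S"]
        # INSERT has no OldImage and REMOVE has no NewImage, so the missing
        # side counts as zero and one formula covers all three event types.
        new_total = _amount(change.get("NewImage"))
        old_total = _amount(change.get("OldImage"))
        deltas[customer] += new_total - old_total
    return dict(deltas)


def _key(customer_id):
    return {"customer_id": {"S": customer_id}}


# Hypothetical batch: two inserts, one modification, one removal.
sample_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"Keys": _key("c1"), "NewImage": {"total": {"N": "40"}}}},
        {"eventName": "INSERT",
         "dynamodb": {"Keys": _key("c1"), "NewImage": {"total": {"N": "15"}}}},
        {"eventName": "MODIFY",
         "dynamodb": {"Keys": _key("c1"),
                      "OldImage": {"total": {"N": "15"}},
                      "NewImage": {"total": {"N": "20"}}}},
        {"eventName": "REMOVE",
         "dynamodb": {"Keys": _key("c2"), "OldImage": {"total": {"N": "99"}}}},
    ]
}

deltas = aggregate_deltas(sample_event)
```

In a deployed function, each delta would still have to be written back to an aggregate item, with retries and duplicate deliveries handled, which is part of the ongoing cost the speaker describes.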
9. Data that is accessed together should be stored together.
Don’t muck with time series tables, just drop those things every day. Roll up the raw data into summaries, and maybe store the summary data in with your configuration data, because that might be interesting depending on the access patterns. Data accessed together should all be in the same item or the same table or the same collection. If it’s not accessed together, then who cares? The access patterns are totally independent.
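One way to picture the roll-up-and-drop pattern is the small sketch below. It assumes one raw table per day, modeled here as a dict entry, and a device configuration item that also holds the daily summaries; the function names, the `DEVICE#d1` key, and the sample readings are hypothetical.

```python
def summarize(values):
    """Collapse one day's raw readings into a compact summary."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }


def roll_up_and_drop(raw_tables, day, device_item):
    """Summarize one day's raw table, store the summary next to the
    configuration it is read with, and drop the raw table in one step."""
    values = raw_tables.pop(day)  # drop the whole table, no per-row deletes
    device_item.setdefault("daily_summary", {})[day] = summarize(values)
    return device_item


# One raw "table" per day, each holding that day's readings.
raw_tables = {
    "2022-07-20": [21, 25, 23],
    "2022-07-21": [30, 28],
}
device_item = {"pk": "DEVICE#d1", "alert_threshold": 30}

roll_up_and_drop(raw_tables, "2022-07-20", device_item)
```

After the call, a single read of the device item returns both the configuration and the summary, and the raw table for that day is gone.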
10. Change data capture is an unsung innovation in NoSQL systems.
People used to write open source oplog tailers for MongoDB not so long ago, and now the change stream API is wonderful. And with DynamoDB, DynamoDB Streams can give Kinesis a run for its money. It’s that good. Because if you don’t really need key value lookups, you know what? You can still write to Dynamo and get DynamoDB Streams out of there, and it can be both performant and reliable. Rockset takes advantage of this for our built-in connectors. We tapped into this. Now if you make a change within Dynamo or Mongo, within one or two seconds, you have a fully typed, fully indexed SQL table on the other side and you can instantly have full-featured SQL on that data.
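The general idea behind change data capture can be sketched in a few lines: a consumer reads an ordered stream of change events and applies them to a downstream copy. The event shape below (`op`, `key`, `doc`) is a simplified, hypothetical format chosen for the illustration; it is not the MongoDB change stream or DynamoDB Streams record format, and it says nothing about how Rockset's connectors are implemented.

```python
def apply_changes(replica, changes):
    """Apply an ordered list of change events to a downstream copy."""
    for change in changes:
        key = change["key"]
        if change["op"] == "delete":
            replica.pop(key, None)
        else:
            # "insert" and "update" both upsert, merging in the changed fields.
            replica[key] = {**replica.get(key, {}), **change["doc"]}
    return replica


changes = [
    {"op": "insert", "key": "u1", "doc": {"name": "Ada", "plan": "free"}},
    {"op": "update", "key": "u1", "doc": {"plan": "pro"}},
    {"op": "insert", "key": "u2", "doc": {"name": "Bob"}},
    {"op": "delete", "key": "u2"},
]

replica = apply_changes({}, changes)
```

Because events are applied in order, the copy converges on the same state as the source without the application having to write to both places.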
About the Speakers
Alex DeBrie is the author of The DynamoDB Book, a comprehensive guide to data modeling with DynamoDB, and the external reference recommended internally within AWS to its developers. He is an AWS Data Hero and speaks regularly at conferences such as AWS re:Invent and AWS Summits. Alex helps many teams with DynamoDB, from designing or reviewing data models and migrations to providing professional training to level up developer teams.
Rick Houlihan currently leads the developer relations team for strategic accounts at MongoDB. Before this, Rick was at AWS for 7 years where he led the architecture and design effort for migrating thousands of relational workloads from RDBMS to NoSQL and built the center of excellence team responsible for defining the best practices and design patterns used today by thousands of Amazon internal service teams and AWS customers.
Jeremy Daly is the GM of Serverless Cloud at Serverless and an AWS Serverless Hero. He began building cloud-based applications with AWS in 2009, but after discovering Lambda, became a passionate advocate for FaaS and managed services. He now writes extensively about serverless on his blog jeremydaly.com, publishes a weekly newsletter about all things serverless called Off-by-none, and hosts the Serverless Chats podcast.
Venkat Venkataramani is CEO and co-founder of Rockset. He was previously an Engineering Director in the Facebook infrastructure team responsible for all online data services that stored and served Facebook user data. Prior to Facebook, Venkat worked on the Oracle Database.
Rockset is the leading real-time analytics platform built for the cloud, delivering fast analytics on real-time data with surprising efficiency. Rockset is serverless and fully managed. It offloads the work of managing configuration, cluster provisioning, denormalization and shard/index management. Rockset is also SOC 2 Type II compliant and offers encryption at rest and in flight, securing and protecting any sensitive data. Learn more at rockset.com.