Expert Roundtable: Batch vs Streaming in the Modern Data Stack [Video]
August 11, 2022
I recently had the pleasure of hosting an expert data engineering discussion on a topic that I know many of you are wrestling with – when to deploy batch or streaming data in your organization’s data stack.
Our esteemed roundtable included leading practitioners, thought leaders and educators in the space, including:
- Ben Rogojan, aka Seattle Data Guy, is a data engineering and data science consultant (now based in the Rocky Mountain city of Denver) with a popular YouTube channel, Medium blog, and newsletter.
- Andreas Kretz is a Germany-based consultant and CEO of LearnDataEngineering.com, which provides online courses for budding data engineers as well as recruiting services for companies. Andreas also produces the well-known ‘Plumbers of Data Science’ blog and podcast, as well as a YouTube channel.
- Joe Reis is a “recovering data nerd” and CEO of Ternary Data, a Salt Lake City-based data consultancy. Besides interviewing leading data engineering experts for his own Monday Morning Data Chat podcast, Reis is also the co-author of the new O’Reilly book, Fundamentals of Data Engineering.
We covered this intriguing issue from many angles:
- where companies – and data engineers! – are in the evolution from batch to streaming data;
- the business and technical advantages of each mode, as well as some of the less-obvious disadvantages;
- best practices for those tasked with building and maintaining these architectures;
- and much more.
Our talk follows an earlier video roundtable hosted by Rockset CEO Venkat Venkataramani, who was joined by a different but equally-respected panel of data engineering experts, including:
- DynamoDB author Alex DeBrie;
- MongoDB director of developer relations Rick Houlihan;
- Jeremy Daly, GM of Serverless Cloud.
They tackled the topic, “SQL versus NoSQL Databases in the Modern Data Stack.” You can read the TLDR blog summary of the highlights here.
Below I have curated eight highlights from our discussion. Click on the video preview to watch the full 45-minute event on YouTube, where you can also share your thoughts and reactions.
Embedded content: https://youtu.be/g0zO_1Z7usI
1. On the most-common mistake that data engineers make with streaming data.
Joe Reis: Data engineers tend to treat everything like a batch problem, when streaming is really not the same thing at all. When you try to translate batch practices to streaming, you get pretty mixed results. To understand streaming, you need to understand the upstream sources of data as well as the mechanisms to ingest that data. That’s a lot to know. It’s like learning a different language.
2. Whether the stereotype of real-time streaming being prohibitively expensive still holds true.
Andreas Kretz: Stream processing has been getting cheaper over time. I remember back in the day when you had to set up your clusters and run Hadoop and Kafka clusters on top, it was quite expensive. Nowadays (with cloud) it's quite cheap to actually start and run a message queue there. Yes, if you have a lot of data then these cloud services might eventually get expensive, but to start out and build something isn't a big deal anymore.
Joe Reis: You need to understand things like frequency of access, data sizes, and potential growth so you don’t get hamstrung with something that fits today but doesn't work next month. Also, I would take the time to actually just RTFM so you understand what this tool is going to cost on given workloads. There's no cookie-cutter formula, as there are no streaming benchmarks like TPC, which has been around for data warehousing and which people know how to use.
Ben Rogojan: A lot of cloud tools are promising reduced costs, and I think a lot of us are finding that challenging when we don’t really know how the tool works. Doing the pre-work is important. In the past, DBAs had to understand how many bytes a column took, because they would use that to calculate how much space they would need within two years. Now we don’t have to care about bytes, but we do have to care about how many gigabytes or terabytes we are going to process.
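The pre-work Ben describes is simple arithmetic. A minimal sketch of the old-school DBA capacity estimate (the row size, growth rate, and horizon below are made-up illustrative numbers, not figures from the discussion):

```python
def projected_storage_gb(bytes_per_row: int, rows_per_day: int, days: int) -> float:
    """Back-of-envelope capacity planning: bytes per row times row growth.

    The same multiplication works today with processed gigabytes/terabytes
    substituted for stored bytes."""
    return bytes_per_row * rows_per_day * days / 1e9

# e.g. 200-byte rows arriving at 1M rows/day over two years (~730 days)
print(projected_storage_gb(200, 1_000_000, 730))  # 146.0 GB
```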
3. On today’s most-hyped trend, the ‘data mesh’.
Ben Rogojan: All the companies that are doing data meshes were doing it five or ten years ago by accident. At Facebook, that was just how they set things up. They didn’t call it a data mesh, it was just the way to effectively manage all of their features.
Joe Reis: I suspect a lot of job descriptions are starting to include data mesh and other cool buzzwords just because they are catnip for data engineers. This is like what happened with data science back in the day. It happened to me. I showed up on the first day of the job and was like, ‘Um, there’s no data here.’ Then you realize it was a whole bait and switch.
4. Schemas or schemaless for streaming data?
Andreas Kretz: Yes, you can have schemaless data infrastructure and services in order to optimize for speed. I recommend putting an API in front of your message queue. Then if you find out that your schema is changing, you have some control and can react to it. However, at some point, an analyst is going to come in. And they are always going to work with some kind of data model or schema. So I would make a distinction between the technical and business side. Because ultimately you still have to make the data usable.
Joe Reis: It depends on how your team is structured and how they communicate. Does your application team talk to the data engineers? Or do you each do your own thing and lob things over the wall at each other? Hopefully, discussions are happening, because if you're going to move fast, you should at least understand what you're doing. I’ve seen some wacky stuff happen. We had one client that was using dates as [database] keys. Nobody was stopping them from doing that, either.
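Andreas’s “API in front of your message queue” could look something like the sketch below — a validation gate so schema drift is caught at the edge instead of poisoning downstream consumers. The event schema, field names, and the in-memory list standing in for the queue are all hypothetical:

```python
import json

# Hypothetical event contract: field name -> required Python type.
EVENT_SCHEMA = {"user_id": int, "event_type": str, "timestamp": float}

def validate_event(raw: str) -> dict:
    """Parse and type-check one incoming event.

    Raises ValueError on a schema violation, so the producer gets
    immediate feedback rather than the analyst finding out weeks later."""
    event = json.loads(raw)
    for field, expected in EVENT_SCHEMA.items():
        if field not in event:
            raise ValueError(f"missing field: {field}")
        if not isinstance(event[field], expected):
            raise ValueError(f"{field} should be {expected.__name__}")
    return event

def ingest(raw: str, queue: list) -> bool:
    """The API layer: only validated events reach the queue."""
    try:
        queue.append(validate_event(raw))
        return True
    except ValueError:
        # Schema drift detected -- in practice, log it or route the
        # event to a dead-letter topic instead of silently dropping it.
        return False
```

In a real deployment the gate would sit in an HTTP service in front of Kafka or a similar queue, but the control point — one place that knows the schema and can react when it changes — is the same.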
5. The data engineering tools they see the most out in the field.
Ben Rogojan: Airflow is big and popular. People kind of love and hate it because there's a lot of things you deal with that are both good and bad. Azure Data Factory is decently popular, especially among enterprises. A lot of them are on the Azure data stack, and so Azure Data Factory is what you're going to use because it's just easier to implement. I also see people using Google Dataflow, and Workflows as step functions, because using Cloud Composer on GCP is really expensive since it's always running. There’s also Fivetran and dbt for data pipelines.
Andreas Kretz: For data integration, I see Airflow and Fivetran. For message queues and processing, there is Kafka and Spark. All of the Databricks users are using Spark for batch and stream processing. Spark works great and if it's fully managed, it's awesome. The tooling is not really the issue, it’s more that people don’t know when they should be doing batch versus stream processing.
Joe Reis: A good litmus test for (choosing) data engineering tools is the documentation. If they haven't taken the time to properly document, and there's a disconnect between how it says the tool works versus the real world, that should be a clue that it is not going to get any easier over time. It’s like dating.
6. The most common production issues in streaming.
Ben Rogojan: Software engineers want to develop. They don't want to be limited by data engineers saying ‘Hey, you need to tell me when something changes’. The other thing that happens is data loss if you don’t have a good way to track when the last data point was loaded.
Andreas Kretz: Let’s say you have a message queue that is running perfectly. And then your message processing breaks. Meanwhile, your data is building up because the message queue is still running in the background. Then you have this mountain of data piling up. You need to fix the message processing quickly. Otherwise, it will take a lot of time to get rid of that lag. Or you have to figure out if you can make a batch ETL process in order to catch up again.
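The "mountain of data" Andreas describes has a simple back-of-envelope model: while the recovered consumer catches up, producers keep writing, so the backlog only shrinks at the difference between the two rates. A sketch (the rates below are illustrative numbers, not from the discussion):

```python
def catchup_seconds(backlog: int, produce_rate: float, consume_rate: float) -> float:
    """Estimate how long a recovered consumer needs to clear a backlog.

    New messages keep arriving at produce_rate during recovery, so the
    backlog drains at only (consume_rate - produce_rate) messages/sec.
    Returns infinity if the consumer can never catch up -- the case
    where falling back to a batch ETL job becomes the only option."""
    drain_rate = consume_rate - produce_rate
    if drain_rate <= 0:
        return float("inf")
    return backlog / drain_rate

# One hour of broken processing at 500 msg/s leaves a 1.8M-message backlog;
# a consumer that is only 20% faster than the producers (600 msg/s)
# needs five hours to clear it.
backlog = 500 * 3600
print(catchup_seconds(backlog, 500, 600) / 3600)  # 5.0
```

The asymmetry is the point: an hour of downtime can cost several hours of lag, which is why fixing the consumer quickly matters so much.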
7. Why Change Data Capture (CDC) is so important to streaming.
Joe Reis: I love CDC. People want a point-in-time snapshot of their data as it gets extracted from a MySQL or Postgres database. This helps a ton when someone comes up and asks why the numbers look different from one day to the next. CDC has also become a gateway drug into ‘real’ streaming of events and messages. And CDC is pretty easy to implement with most databases. The only thing I would say is that you have to understand how you are ingesting your data, and don’t do direct inserts. We had one client doing CDC. They were carpet bombing their data warehouse as quickly as they could, AND doing live merges. I think they blew through 10 percent of their annual credits on this data warehouse in a couple days. The CFO was not happy.
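One common alternative to the row-by-row "live merges" Joe warns about is compacting a batch of CDC events first, so only the latest change per key reaches the warehouse in a single periodic MERGE. A minimal sketch (the event shape is hypothetical; real connectors like Debezium emit richer envelopes):

```python
def compact_cdc_batch(events):
    """Keep only the latest change per primary key.

    Each event is (key, op, row). Replaying thousands of raw changes as
    individual warehouse merges burns compute credits; compacting them
    first means one MERGE applies the whole batch."""
    latest = {}
    for key, op, row in events:
        latest[key] = (op, row)  # later events overwrite earlier ones
    return latest

events = [
    (1, "insert", {"name": "a"}),
    (1, "update", {"name": "b"}),   # supersedes the insert for key 1
    (2, "insert", {"name": "c"}),
    (2, "delete", None),            # supersedes the insert for key 2
]
print(compact_cdc_batch(events))
# {1: ('update', {'name': 'b'}), 2: ('delete', None)}
```

Four raw changes collapse to two effective ones; on a busy table the ratio, and the warehouse bill, can be far more dramatic.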
8. How to determine when you should choose real-time streaming over batch.
Joe Reis: Real time is most appropriate for answering What? or When? questions in order to automate actions. This frees analysts to focus on How? and Why? questions in order to add business value. I foresee this ‘live data stack’ really starting to shorten the feedback loops between events and actions.
Ben Rogojan: I get clients who say they need streaming for a dashboard they only plan to look at once a day or once a week. And I’ll question them: ‘Hmm, do you?’ They might be doing IoT, or analytics for sporting events, or even a logistics company that wants to track their trucks. In those cases, I’ll recommend that, instead of a dashboard, they automate those decisions. Basically, if someone will look at information on a dashboard, more than likely that can be batch. If it’s something that's automated or personalized through ML, then it’s going to be streaming.