17 New Things Every Modern Data Engineer Should Know in 2022

February 17, 2022


It’s the start of 2022 and a great time to look ahead and think about what changes we can expect in the coming months. If we’ve learned any lessons from the past, it’s that keeping ahead of the waves of change is one of the primary challenges of working in this industry.

We asked thought leaders in our industry to ponder what they believe will be the new ideas that will influence or change the way we do things in the coming year. Here are their contributions.

New Thing 1: Data Products

Barr Moses, Co-Founder & CEO, Monte Carlo

In 2022, the next big thing will be “data products.” One of the buzziest topics of 2021 was the concept of “treating data like a product,” in other words, applying the same rigor and standards around usability, trust, and performance to analytics pipelines as you would to SaaS products. Under this framework, teams should treat data systems like production software, a process that requires contracts and service-level agreements (SLAs) to help measure reliability and ensure alignment with stakeholders. In 2022, data discovery, knowledge graphs, and data observability will be critical when it comes to abiding by SLAs and maintaining a pulse on the health of data for both real-time and batch processing infrastructures.
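
For example, one concrete way to back an SLA is a freshness check that compares the latest update in a table against an agreed threshold. A minimal sketch in Python, assuming a hypothetical analytics.orders table and a 30-minute freshness target (in practice a data observability platform or scheduled tests would own this):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: analytics.orders may be at most 30 minutes stale.
FRESHNESS_SLA = timedelta(minutes=30)

def check_orders_freshness(cursor) -> bool:
    """Return True if the table meets its freshness SLA."""
    cursor.execute("SELECT MAX(updated_at) FROM analytics.orders")
    last_update = cursor.fetchone()[0]  # assumed to be a timezone-aware datetime
    lag = datetime.now(timezone.utc) - last_update
    if lag > FRESHNESS_SLA:
        # In practice: page the on-call engineer, open an incident, fail the pipeline run.
        print(f"SLA breach: analytics.orders is {lag} stale (limit {FRESHNESS_SLA})")
        return False
    return True
```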


New Thing 2: Fresh Features for Real-Time ML

Mike Del Balso, Co-Founder and CEO, Tecton.ai

Real-time machine learning systems benefit dramatically from fresh features. Fraud detection, search results ranking, and product recommendations all perform significantly better with an understanding of current user behavior.

Fresh features come in two flavors: streaming features (near-real-time) and request-time features. Streaming features can be pre-computed asynchronously, and they have unique challenges to address when it comes to backfilling, efficient aggregations, and scale. Request-time features can only be computed at the time of the request and can take into account current data that can’t be pre-computed. Common patterns are a user’s current location or a search query they just typed in.

These signals can become particularly powerful when combined with pre-computed features. For example, you can express a feature like “distance between the user’s current location and the average of their last three known locations” to detect a fraudulent transaction. However, request-time features are difficult for data scientists to productionize if doing so requires modifying a production application. Knowing how to use a system like a feature store to include streaming and request-time features makes a large difference in real-time ML applications.
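
As a rough illustration, here is how a request-time signal (the user’s current location) might be combined with a pre-computed feature (their recent known locations) at inference time. The values and the shape of the lookup are hypothetical; this is a sketch of the idea, not any particular feature store’s API:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_from_recent_locations(current_location, recent_locations):
    """Request-time feature: distance between the user's current location and
    the average of their last three known locations."""
    last_three = recent_locations[-3:]
    avg_lat = sum(lat for lat, _ in last_three) / len(last_three)
    avg_lon = sum(lon for _, lon in last_three) / len(last_three)
    return haversine_km(*current_location, avg_lat, avg_lon)

# current_location arrives with the request; recent_locations would be served by
# a feature store fed from a streaming pipeline (hard-coded here for illustration).
feature = distance_from_recent_locations(
    current_location=(40.7128, -74.0060),
    recent_locations=[(37.7749, -122.4194), (37.7750, -122.4180), (37.7760, -122.4170)],
)
print(f"distance_from_recent_locations = {feature:.1f} km")
```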

New Thing 3: Data Empowers Business Team Members

Zack Khan, Hightouch

In 2022, every modern company has a cloud data warehouse like Snowflake or BigQuery. Now what? Chances are, you’re primarily using it to power dashboards in BI tools. But the challenge is that business team members don’t live in BI tools: your sales team checks Salesforce every day, not Looker.

You put in so much work already to set up your data warehouse and prepare data models for analysis. To solve this last-mile problem and ensure your data models actually get used by business team members, you need to sync data directly to the tools those team members use day-to-day, from CRMs like Salesforce to ad networks, email tools and more. But no data engineer likes writing API integrations to Salesforce: that’s why Reverse ETL tools let data engineers send data from their warehouse to any SaaS tool with just SQL, no API integrations required.
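
A hand-rolled sketch of what a Reverse ETL tool automates: run a SQL model against the warehouse and push each row to a SaaS tool’s API. The warehouse cursor, table name, and CRM endpoint here are hypothetical placeholders:

```python
import requests

def sync_qualified_leads(warehouse_cursor, crm_api_url: str, api_token: str) -> None:
    """Read a warehouse data model and push each row to a CRM-style HTTP API."""
    warehouse_cursor.execute(
        "SELECT email, lifetime_value, last_seen_at FROM analytics.qualified_leads"
    )
    for email, ltv, last_seen in warehouse_cursor.fetchall():
        requests.post(
            crm_api_url,  # e.g. a hypothetical https://crm.example.com/api/contacts
            headers={"Authorization": f"Bearer {api_token}"},
            json={"email": email, "lifetime_value": ltv, "last_seen_at": str(last_seen)},
            timeout=10,
        )
```

A Reverse ETL tool replaces the HTTP plumbing, retries, rate limiting and field mapping; the part you keep writing is the SQL.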

You might also be wondering: why now? First-party data (data explicitly collected from customers) has never been more important. With Apple and Google changing their browsers and operating systems this year to prevent identifying anonymous traffic and protect consumer privacy (changes that will affect over 40% of internet users), companies now need to send their first-party data (like which users converted) to ad networks like Google & Facebook in order to optimize their algorithms and reduce costs.

With the adoption of data warehouses, increased privacy concerns, an improved data modeling stack (e.g., dbt) and Reverse ETL tools, there’s never been a more important, but also easier, time to activate your first-party data and turn your data warehouse into the center of your business.


New Thing 4: Point-in-Time Correctness for ML Applications

Mike Del Balso, Co-Founder and CEO, Tecton.ai

Machine learning is all about predicting the future. We use labeled examples from the past to train ML models, and it’s critical that we accurately represent the state of the world at that point in time. If events that happened in the future leak into training, models will perform well in training but fail in production.

When future data creeps into the training set, we call it data leakage. It’s far more common than you would expect and difficult to debug. Here are three common pitfalls:

  1. Each label needs its own cutoff time, so it only considers data prior to that label’s timestamp. With real-time data, your training set can have millions of cutoff times where labels and training data must be joined. Naively implementing these joins will quickly blow up the size of the processing job.
  2. All of your features must also have an associated timestamp, so the model can accurately represent the state of the world at the time of the event. For example, if the user has a credit score in their profile, we need to know how that score has changed over time.
  3. Data that arrives late must be handled carefully. For analytics, you want to have the most accurate data even if it means updating historical values. For machine learning, you should avoid updating historical values at all costs, as it can have disastrous effects on your model’s accuracy.

As a data engineer, if you know how to handle the point-in-time correctness problem, you’ve solved one of the key challenges with putting machine learning into production at your organization.
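
As a concrete sketch, a point-in-time join can be expressed with pandas’ merge_asof: each label row only picks up the most recent feature value at or before its own timestamp, never a future one. Column names and values here are hypothetical:

```python
import pandas as pd

# Labels: one row per (user, event time, outcome).
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2022-01-10", "2022-01-20", "2022-01-15"]),
    "is_fraud": [0, 1, 0],
})

# Feature values, stamped with the time at which each value became known.
credit_scores = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2022-01-05", "2022-01-15", "2022-01-01"]),
    "credit_score": [700, 640, 720],
})

# Point-in-time join: for each label, take the latest credit score with
# feature_time <= event_time for that user, so no future data leaks in.
training_set = pd.merge_asof(
    labels.sort_values("event_time"),
    credit_scores.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training_set)
```

At feature-store scale this join is handled by the platform rather than in pandas, but the semantics are the same.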

New Thing 5: Application of Domain-Driven Design

Robert Sahlin, Senior Data Engineer, MatHem.se

I think stream processing/analytics will get a huge boost from the implementation of data mesh, as data producers apply domain-driven design (DDD) and take ownership of their data products, since that will:

  1. Decouple the events published from how they are persisted in the operational source system (i.e. not bound to traditional change data capture [CDC])
  2. Result in nested/repeated data structures that are much easier to process as a stream, since row-level joins are already done (compared to CDC on an RDBMS, which produces tabular data streams that you then need to join downstream). This is partly due to the decoupling mentioned above, but also to the use of key/value or document stores as the operational persistence layer instead of an RDBMS.
  3. CDC with the outbox pattern – we shouldn't throw out the baby with the bathwater. CDC is an excellent way to publish analytical events since it already has many connectors and practitioners and often supports transactions; a minimal sketch of the pattern follows below.
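
To make the outbox idea concrete: the business row and a nested analytical event are written in the same transaction, and a CDC connector then publishes rows from the outbox table. sqlite3 is used here only to keep the sketch self-contained; in production this would be the operational RDBMS plus a connector such as Debezium, and the table and event shapes are hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total_cents INT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        aggregate_type TEXT, aggregate_id INT, event_type TEXT, payload TEXT
    );
""")

def place_order(order_id: int, customer_id: int, items: list) -> None:
    total = sum(i["price_cents"] * i["qty"] for i in items)
    # One transaction: the state change and the analytical event commit (or roll back) together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total))
        event = {  # nested/repeated structure, so consumers need no row-level joins
            "order_id": order_id,
            "customer_id": customer_id,
            "total_cents": total,
            "items": items,
        }
        conn.execute(
            "INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            ("order", order_id, "OrderPlaced", json.dumps(event)),
        )

place_order(1, 42, [{"sku": "A1", "qty": 2, "price_cents": 995}])
# A CDC connector tails the outbox table and publishes each payload to a topic.
```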

New Thing 6: Controlled Schema Evolution

Robert Sahlin, Senior Data Engineer, MatHem.se

Another thing that isn't really new, but is even more important in streaming applications, is controlled schema evolution. Downstream consumers will increasingly be machines rather than humans, and those machines act in real time (operational analytics), so you don't want to break that chain: a breaking schema change has an immediate impact.
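
One simple way to keep evolution controlled is to allow only additive, backward-compatible changes: new fields must carry defaults so existing consumers keep working. Below is a deliberately strict, registry-free sketch of that check on Avro-style schema definitions; in practice a schema registry's compatibility mode enforces this for you:

```python
OLD_SCHEMA = {
    "name": "OrderPlaced",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "total_cents", "type": "long"},
    ],
}

NEW_SCHEMA = {
    "name": "OrderPlaced",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "total_cents", "type": "long"},
        # Safe addition: the new field has a default that old events never set.
        {"name": "currency", "type": "string", "default": "SEK"},
    ],
}

def is_safe_evolution(old: dict, new: dict) -> bool:
    """Allow only additive changes, and require a default on every added field."""
    old_names = {f["name"] for f in old["fields"]}
    new_by_name = {f["name"]: f for f in new["fields"]}
    if not old_names <= set(new_by_name):
        return False  # dropping a field would break consumers that still read it
    added = set(new_by_name) - old_names
    return all("default" in new_by_name[name] for name in added)

assert is_safe_evolution(OLD_SCHEMA, NEW_SCHEMA)
```

Running a check like this (or a registry compatibility check) in CI before a producer deploys is what makes the evolution controlled rather than accidental.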


New Thing 7: Data That is Useful For Everyone

Ben Rogojan, The Seattle Data Guy

With all the focus on the modern data stack, it can be easy to lose the forest for the trees. As data engineers, our goal is to create a data layer that is usable by analysts, data scientists and business users. It’s easy for us as engineers to get caught up by the fancy new toys and solutions that can be applied to our data problems. But our goal is not purely to move data from point A to point B, although that’s how I describe my job to most people.

Our end goal is to create some form of a reliable, centralized, and easy-to-use data storage layer that can then be utilized by multiple teams. We aren’t just creating data pipelines, we are creating data sets that analysts, data scientists and business users rely on to make decisions.

To me, this means our product, at the end of the day, is the data. How usable, reliable and trustworthy that data is matters. Yes, it’s nice to use all the fancy tools, but it’s important to remember that our product is the data. As data engineers, how we engineer said data is important.

New Thing 8: The Power of SQL

David Serna, Data Architect/BI Developer

For me, one of the most important things that a modern data engineer needs to know is SQL. SQL is our principal language for data. If you have sufficient knowledge in SQL, you can save time creating appropriate query lambdas in Rockset, avoid time redundancies in your data model, or create complex graphs using SQL with Grafana that can give you important information about your business.

The most important data warehouses nowadays are all based on SQL, so if you want to be a good data engineering consultant, you need to have a deep knowledge of SQL.


New Thing 9: Beware Magic

Alex DeBrie, Principal and Founder, DeBrie Advisory

What a time to be working with data. We're seeing an explosion in the data infrastructure space. The NoSQL movement is continuing to mature after fifteen years of innovation. Cutting-edge data warehouses can generate insights from unfathomable amounts of data. Stream processing has helped to decouple architectures and unlock the rise of real-time. Even our trusty relational database systems are scaling further than ever before. And yet, despite this cornucopia of options, I warn you: beware "magic."

Tradeoffs abound in software engineering, and no piece of data infrastructure can excel at everything. Row-based stores excel at transactional operations and low-latency response times, while column-based tools can chomp through gigantic aggregations at a more leisurely clip. Streaming systems can handle enormous throughput, but are less flexible for querying the current state of a record. Moore's Law and the rise of cloud computing have both pushed the limits of what is possible, but this does not mean we've escaped the fundamental reality of tradeoffs.

This is not a plea for your team to adopt an extreme polyglot persistence approach, as each new piece of infrastructure requires its own set of skills and learning curve. But it is a plea both for careful consideration in choosing your technology and for honesty from vendors. Data infrastructure vendors have taken to larding up their products with a host of features designed to win checkbox comparisons in decision documents but that fall short during actual usage. If a vendor isn't honest about what they are good at – or, even more importantly, what they're not good at – examine their claims carefully. Embrace the future, but don't believe in magic quite yet.

New Thing 10: Data Warehouses as CDP

Timo Dechau, Tracking & Analytics Engineer, deepskydata

I think in 2022 we will see more manifestations of the data warehouse as the customer data platform (CDP). It's a logical development that we are now starting to move beyond separate CDPs, which were just special-case data warehouses, often with few or no connections to the real data warehouse. In the modern data stack, the data warehouse is the center of everything, so naturally it handles all customer data and collects all events from all sources. With the rise of operational analytics, we now have reliable back channels that can bring customer data back into marketing systems, where it can be included in email workflows, targeting campaigns and so much more.

And now we also get the new possibilities from services like Rockset, where we can model our real-time customer event use cases. This closes the gap to use cases like the good old cart abandonment notification, but on a bigger scale.
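
As a rough sketch of that use case: query the warehouse (or a real-time analytics service over the event stream) for carts that have gone quiet, then hand the segment back to a messaging tool. Table names, the 30-minute window, and the messaging endpoint are all hypothetical, and a Reverse ETL tool would typically own the last step:

```python
import requests

ABANDONED_CARTS_SQL = """
SELECT user_id, cart_id
FROM cart_activity
WHERE last_item_added_at < CURRENT_TIMESTAMP - INTERVAL '30 minutes'
  AND checkout_completed_at IS NULL
"""

def notify_abandoned_carts(cursor, messaging_api_url: str) -> None:
    """Find abandoned carts and trigger a reminder email for each one."""
    cursor.execute(ABANDONED_CARTS_SQL)
    for user_id, cart_id in cursor.fetchall():
        requests.post(
            messaging_api_url,  # hypothetical email/marketing automation endpoint
            json={"user_id": user_id, "cart_id": cart_id, "template": "cart_reminder"},
            timeout=10,
        )
```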


New Thing 11: Data in Motion

Kai Waehner, Field CTO, Confluent

Real-time data beats slow data. That’s true for almost every business scenario, no matter if you work in retail, banking, insurance, automotive, manufacturing, or any other industry.

If you want to fight against fraud, sell your inventory, detect cyber attacks, or keep machines running 24/7, then acting proactively while the data is hot is crucial.

Event streaming powered by Apache Kafka became the de facto standard for integrating and processing data in motion. Building automated actions with native SQL queries enables any development and data engineering team to use the streaming data to add business value.
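
A minimal sketch of acting while the data is hot: a consumer reads a payments topic and triggers an operational action per event. This uses the confluent-kafka Python client rather than streaming SQL, and the topic name, payload shape, and block_card helper are hypothetical:

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-actions",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])

def block_card(card_id: str) -> None:
    print(f"blocking card {card_id}")  # placeholder for a real operational action

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    payment = json.loads(msg.value())
    if payment.get("fraud_score", 0) > 0.9:
        block_card(payment["card_id"])  # act on the event while it is still hot
```

The same logic can be pushed into a stream processor (Kafka Streams, ksqlDB, Flink) when you want the platform, rather than a bespoke consumer, to own scaling and state.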

New Thing 12: Bringing ML to Your Data

Lewis Gavin, Data Architect, lewisgavin.co.uk

A new thing that has grown in influence in recent years is the abstraction of machine learning (ML) techniques so that they can be used relatively simply without a hardcore data science background. Over time, this has progressed from manually coding and building statistical models, to using libraries, and now to serverless technologies that do most of the hard work.

One thing I noticed recently, however, is the introduction of these machine learning techniques within the SQL domain. Amazon recently introduced Redshift ML, and I expect this trend to continue growing. Technologies that help analyze data at scale have, in one way or another, matured to support some sort of SQL interface, because this makes the technology more accessible.

By providing ML functionality on an existing data platform, you are taking the processing to the data instead of the other way around, which solves a key problem that most data scientists face when building models. If your data is stored in a data warehouse and you want to perform ML, you first have to move that data somewhere else. This brings a number of issues: first, you’ve gone through all of the hard work of prepping and cleaning your data in the data warehouse, only for it to be exported elsewhere to be used; second, you then have to find a suitable place to store the exported data in order to build your model, which often incurs a further cost; and finally, if your dataset is large, exporting it takes time.

Chances are, the database where you are storing your data, whether that be a real-time analytics database or a data warehouse, is powerful enough to perform the ML tasks and is able to scale to meet this demand. It therefore makes sense to move the computation to the data and increase the accessibility of this technology to more people in the business by exposing it via SQL.
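
As an illustration of taking the processing to the data, Redshift ML exposes model training and inference through SQL run where the data already lives. The sketch below wraps that SQL in Python only for consistency with the other examples; table, column, role, and bucket names are hypothetical, and the exact syntax may vary by version:

```python
CREATE_MODEL_SQL = """
CREATE MODEL customer_churn
FROM (SELECT age, plan, monthly_spend, churned FROM analytics.customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftML'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');
"""

SCORE_SQL = """
SELECT customer_id, predict_churn(age, plan, monthly_spend) AS churn_risk
FROM analytics.customer_activity;
"""

def train_and_score(cursor):
    # Both statements execute inside the warehouse: no export step, no separate
    # storage, no waiting on a large data transfer. Note that model training is
    # asynchronous in practice, so scoring only works once training has completed.
    cursor.execute(CREATE_MODEL_SQL)
    cursor.execute(SCORE_SQL)
    return cursor.fetchall()
```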


New Thing 13: The Shift to Real-Time Analytics in the Cloud

Andreas Kretz, CEO, Learn Data Engineering

From a data engineering standpoint I currently see a big shift towards real-time analytics in the cloud. Decision makers as well as operational teams are more and more expecting insight into live data as well as real-time analytics results. The constantly growing amount of data within companies only amplifies this need. Data engineers have to move beyond ETL jobs and start learning techniques as well as tools that help integrate, combine and analyze data from a wide variety of sources in real time.

The combination of data lakes and real-time analytics platforms is very important and here to stay for 2022 and beyond.


New Thing 14: Democratization of Real-Time Data

Dhruba Borthakur, Co-Founder and CTO, Rockset

This "real-time revolution," as per the recent cover story by the Economist magazine, has only just begun. The democratization of real-time data follows upon a more general democratization of data that has been happening for a while. Companies have been bringing data-driven decision making out of the hands of a select few and enabling more employees to access and analyze data for themselves.

As access to data becomes commodified, data itself becomes differentiated. The fresher the data, the more valuable it is. Data-driven companies such as Doordash and Uber proved this by building industry-disrupting businesses on the backs of real-time analytics.

Every other business is now feeling the pressure to take advantage of real-time data to provide instant, personalized customer service, automate operational decision making, or feed ML models with the freshest data. Businesses that provide their developers unfettered access to real-time data in 2022, without requiring them to be data engineering heroes, will leap ahead of laggards and reap the benefits.

New Thing 15: Move from Dashboards to Data-Driven Apps

Dhruba Borthakur, Co-Founder and CTO, Rockset

Analytical dashboards have been around for more than a decade. There are several reasons they are becoming outmoded. First off, most are built with batch-based tools and data pipelines. By real-time standards, the freshest data is already stale. Of course, dashboards and the services and pipelines underpinning them can be made more real time, minimizing the data and query latency.

The problem is that there is still latency – human latency. Yes, humans may be the smartest animal on the planet, but we are painfully slow at many tasks compared to a computer. Chess grandmaster Garry Kasparov discovered that more than two decades ago against Deep Blue, while businesses are discovering that today.

If humans, even augmented by real-time dashboards, are the bottleneck, then what is the solution? Data-driven apps that can provide personalized digital customer service and automate many operational processes when armed with real-time data.

In 2022, look to many companies to rebuild their processes for speed and agility supported by data-driven apps.


New Thing 16: Data Teams and Developers Align

Dhruba Borthakur, Co-Founder and CTO, Rockset

As developers rise to the occasion and start building data applications, they are quickly discovering two things: 1) they are not experts in managing or utilizing data; 2) they need the help of those who are, namely data engineers and data scientists.

Engineering and data teams have long worked independently. It's one reason why ML-driven applications requiring cooperation between data scientists and developers have taken so long to emerge. But necessity is the mother of invention. Businesses are begging for all manner of applications to operationalize their data. That will require new teamwork and new processes that make it easier for developers to take advantage of data.

It will take work, but less than you may imagine. After all, the drive for more agile application development led to the successful marriage of developers and (IT) operations in the form of DevOps.

In 2022, expect many companies to restructure to closely align their data and developer teams in order to accelerate the successful development of data applications.

New Thing 17: The Move From Open Source to SaaS

Dhruba Borthakur, Co-Founder and CTO, Rockset

While many individuals love open-source software for its ideals and communal culture, companies have always been clear-eyed about why they chose open-source: cost and convenience.

Today, SaaS and cloud-native services trump open-source software on both counts. SaaS vendors handle all infrastructure, updates, maintenance, security, and more. This low-ops, serverless model sidesteps the high human cost of managing software, while enabling engineering teams to easily build high-performing and scalable data-driven applications that satisfy their external and internal customers.

2022 will be an exciting year for data analytics. Not all of the changes will be immediately obvious. Many of the changes are subtle, albeit pervasive cultural shifts. But the outcomes will be transformative, and the business value generated will be huge.



Do you have ideas for what will be the New Things in 2022 that every modern data engineer should know? We invite you to join the Rockset Community and contribute to the discussion on New Things!


Don't miss this series by Rockset's CTO Dhruba Borthakur

Designing the Next Generation of Data Systems for Real-Time Analytics

The first post in the series is Why Mutability Is Essential for Real-Time Data Analytics.
