Rockset and Feast Feature Store for Real-Time Machine Learning
April 17, 2023
Latency matters in machine learning applications. When latency is high, fraud goes undetected and causes millions in losses, security vulnerabilities go unchecked and give attackers an open door, and recommendations fail to incorporate the latest user interactions and become irrelevant. The 2022 Uber hack showed the world that companies are still very vulnerable to social engineering attacks, and being able to detect anomalous behavior like IP address scanning within seconds rather than hours can make all the difference.
Real-time machine learning (ML) involves deploying and maintaining machine learning models to perform on-demand predictions for use cases like product recommendations, ETA forecasting, fraud detection and more. In real-time ML, the freshness of the features, the serving latency, and the uptime and availability of the data pipeline and model matter. Making a decision late has operational and cost ramifications.
To better serve real-time machine learning, Rockset integrates with the Feast Feature Store, which acts as a centralized platform for deploying, monitoring and managing production ML features. The feature store is one of many tools that have been created to support shipping and maintaining models in production, an area of expertise recently coined MLOps. The goal of the feature store is to unify the set of features available for training and serving across an organization. With feature stores, different teams are able to train and deploy on standardized features instead of being siloed off and generating similar features on their own. Just as a git repo lets an engineering team use and modify the same pool of code, a feature repo lets people share and manage the same set of features.
In addition to standardizing how features are stored and generated, feature stores can also help monitor your training data. By keeping an eye on the quality of the data being used to generate the features you can add a new layer of protection to avoid training a bad model (garbage in, garbage out as they say).
Here are some of the benefits of adopting a feature store like Feast:
- Feature Management: deduplicate and standardize features across an organization
- Feature Computation: materialize features in a deterministic way
- Feature Validation: perform validation on features to avoid training on “junk” data
Now you might think “Wow, that sounds a whole lot like materialized views. How do feature stores differ from standard analytical databases?” Well, that’s a bit of a trick question. Feature stores help provide ML orchestration and often leverage multiple databases for model training and serving. Here are the benefits you get from using Rockset as the database for real-time ML:
- Real-time, streaming data for ML: Rockset handles real-time streaming data for machine learning with compute-compute separation, isolating streaming ingest and query compute for predictable performance even in the face of high-volume writes and low latency reads.
- Turn events into real-time features: Rockset turns events into features in real time with SQL ingest transformations. Time-windowed aggregation features can be computed efficiently, within 1-2 seconds of when the data was generated.
- Serve real-time features with millisecond-latency: Rockset uses its Converged Index to serve features to applications in milliseconds.
- Ensure service-levels at scale: Rockset meets the strict latency requirements of real-time analytics and is designed for high availability and durability with no scheduled downtime.
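To make the idea of a time-windowed aggregation feature from the list above more concrete, here is a minimal plain-Python sketch of one such feature: the number of events seen per key over a trailing window. It only illustrates the concept. It is not how Rockset computes features (Rockset uses SQL ingest transformations), and the class name `SlidingWindowCounter` and the sample keys are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per key over the trailing `window_seconds` (hypothetical helper)."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def add(self, key: str, timestamp: float) -> None:
        # Assumption: timestamps arrive in non-decreasing order for each key.
        self._events[key].append(timestamp)

    def count(self, key: str, now: float) -> int:
        window = self._events[key]
        cutoff = now - self.window_seconds
        # Evict events at or before the cutoff, so the window is (now - w, now].
        while window and window[0] <= cutoff:
            window.popleft()
        return len(window)


counter = SlidingWindowCounter(window_seconds=60)
for ts in (0, 30, 50):
    counter.add("10.0.0.5", ts)

# At t=70 the event at t=0 has aged out of the 60-second window.
print(counter.count("10.0.0.5", now=70))
```

A feature like this (for example, connection attempts per source address in the last minute) is the kind of value an anomaly detection model would consume at serving time.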
In today’s demo we’re going to walk through how to use Rockset with the Feast Feature Store, which is tailored to make machine learning feature management a breeze.
Learn more about how Rockset extends its real-time analytics capabilities to machine learning. Join VP of Engineering Louis Brandy and product manager John Solitario for the talk From Spam Fighting at Facebook to Vector Search at Rockset: How to Build Real-Time Machine Learning at Scale on May 17th.
Overview of the Feast Integration
Feast is one of the most popular feature stores out there. It is open source and backed by Tecton, the feature platform for machine learning. Feast provides the ability to train models on a consistent set of features and separates storage out as an abstraction, which makes model training portable. Along with hosting offline features for batch training, Feast also supports online features, so users can quickly fetch materialized features as input for a trained model used for real-time prediction.
Recently, Rockset integrated with the popular open source Feast Feature Store as a community-contributed online store. Rockset is a great fit for serving features in production, as the database is purpose-built for real-time ingestion and millisecond-latency queries.
Real-Time Anomaly Detection with Feast and Rockset
One common use case that requires real-time feature serving is anomaly detection. By detecting anomalies in real time, immediate actions can be taken to mitigate risk and prevent harm.
In this example, given some service logs, we want to quickly extract features and pipe them into a model that outputs a threat probability. We showcase how to serve features in Rockset using the BETH Dataset, a cybersecurity dataset with 8M+ data points that was purpose-built for anomaly detection training. Benign and nefarious kernel and network activity data was collected using a honeypot, in this case a server set up with low-level monitoring tools that allowed access with any SSH key. After the data was collected, each event in the dataset was manually labeled “sus” for unusual behavior or “evil” for malicious behavior. We can imagine training a model offline on this dataset and then running model prediction on a real-time activity log to predict ongoing levels of threat.
Connect Feast to Rockset
First let’s install Feast/Rockset:
Embedded content: https://gist.github.com/julie-mills/17b3a0499fcf9ff727aa762a826e2bcd
And then initialize the feast repo:
Embedded content: https://gist.github.com/julie-mills/ba48c3871f53754b35028b9fcd8a72f3
You will be prompted for an API key and a host URL, which you can find in the Rockset console. Alternatively, you can leave these blank and set the environment variables described below. If we go into the created project:
Embedded content: https://gist.github.com/julie-mills/7f7bd8e3b6ceefcad44f5942241a3811
We will find our feature_store.yaml config file. Let’s update this file to point to our Rockset account. Following the Feast reference guide for Rockset, fill in the feature_store.yaml file:
Embedded content: https://gist.github.com/julie-mills/ee6518f64a60db67f5958bd96cce1654
If we provided input to the prior initialization prompts we should already see our values here. If we want to update this we can generate an API key in the Rockset console as well as fetch the Region Endpoint URL (host). Note: If api_key or host in feature_store.yaml is left empty, the driver will attempt to grab these values from local environment variables ROCKSET_APIKEY and ROCKSET_APISERVER.
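The lookup order described in the note above can be sketched in a few lines of Python. This is only an illustration of the fallback behavior, not the driver's actual code: the function name `resolve_rockset_credentials` and the error message are hypothetical, while the config keys and environment variable names come from the note.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_rockset_credentials(
    online_store_config: Mapping[str, Optional[str]],
    environ: Mapping[str, str] = os.environ,
) -> Tuple[str, str]:
    """Return (api_key, host), preferring config values over environment variables."""
    # An empty or missing config value falls back to the environment variable.
    api_key = online_store_config.get("api_key") or environ.get("ROCKSET_APIKEY")
    host = online_store_config.get("host") or environ.get("ROCKSET_APISERVER")
    if not api_key or not host:
        # Hypothetical error handling; the real driver may behave differently.
        raise ValueError(
            "Set api_key/host in feature_store.yaml or export "
            "ROCKSET_APIKEY/ROCKSET_APISERVER"
        )
    return api_key, host
```

Passing the environment as a parameter keeps the sketch easy to exercise without touching the real process environment.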
Generating Features for Real-Time Anomaly Detection
Now download the anomaly detection dataset to the data/ directory. We will use one of the files for the demo but the steps below can be applied to all files. There are two types of data stored by this dataset: kernel-level process calls and network traffic. Let’s analyze the process calls.
Embedded content: https://gist.github.com/julie-mills/364d1e9ad7530f85d2b8b807a431278b
View one of the data files we’ve downloaded as an example:
Embedded content: https://gist.github.com/julie-mills/958f5f0027e4fccf8b72c3b227f64a84
See all of the kernel process calls for security analysis:
Embedded content: https://gist.github.com/danielin917/e4d2d21b66c873460a58180ba731de8b
Ok, we have the imported data. Let’s write some code that will generate interesting features by creating a feature definition file anomaly_detection_repo.py. This file declares entities, logical objects described by a set of features, and feature views, a group of features associated with zero or more entities. You can read more on feature definition files here. For our demo setup we will use the processName, processId and eventName features collected in the kernel-process logs as our online features.
Embedded content: https://gist.github.com/julie-mills/e3060b687c8a2a8b5abe13a2ceb261e5
We can apply newly written feature definitions by saving them to the repo using feast apply.
Serve Features in Milliseconds
In Feast, populating the online store involves materializing features from the offline store over some time frame, taking the latest value of each feature within that window. Once the materialized features have been loaded into the online store, we should be able to query them within the namespace of their Feature View. Let’s start up the Feast Feature Server, materialize some online features and query! First, write up a small script to start the server:
Embedded content: https://gist.github.com/julie-mills/38e52f50ebd263dd9105e48f4ac077ab
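To illustrate what "taking the latest values over a time frame" means during materialization, here is a small conceptual sketch in plain Python. It is not Feast's internal implementation. The function name `latest_values`, the row layout, the inclusive window boundaries and the sample rows are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Tuple

# Hypothetical row layout: (entity_key, event_timestamp, feature_values)
Row = Tuple[str, datetime, Dict[str, Any]]


def latest_values(
    rows: Iterable[Row], start: datetime, end: datetime
) -> Dict[str, Dict[str, Any]]:
    """Keep, per entity, the feature values with the newest timestamp in [start, end]."""
    latest: Dict[str, Tuple[datetime, Dict[str, Any]]] = {}
    for entity_key, ts, features in rows:
        if not (start <= ts <= end):
            continue  # Outside the materialization window.
        current = latest.get(entity_key)
        if current is None or ts > current[0]:
            latest[entity_key] = (ts, features)
    return {key: features for key, (_, features) in latest.items()}


# Made-up sample rows, for illustration only.
rows = [
    ("host-1", datetime(2023, 4, 1, 10, 0), {"eventName": "open"}),
    ("host-1", datetime(2023, 4, 1, 11, 0), {"eventName": "connect"}),
    ("host-2", datetime(2023, 3, 1, 9, 0), {"eventName": "close"}),
]
online = latest_values(rows, datetime(2023, 4, 1), datetime(2023, 4, 2))
print(online)
```

Only the newest in-window row per entity survives, which is why the online store answers "what is the current value?" rather than returning the full event history.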
After starting our script, let’s query some input features that would get passed to our trained detection model:
Embedded content: https://gist.github.com/julie-mills/bde2635723627d28f5679cfd176d74d6
Response:
Embedded content: https://gist.github.com/julie-mills/39a0967098992a7ac9686287d20b8f7f
And that’s it! We can now serve our features from views which are each backed by a Rockset collection that is queryable with sub-second latency.
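For readers who want to build the query programmatically, the following sketch assembles a JSON request body in the shape that the Feast feature server's get-online-features endpoint is documented to accept: a list of `feature_view:feature` references plus a mapping of entity names to values. The helper name `build_online_features_request`, the feature view name `process_events` and the entity name `user_id` are hypothetical; substitute the names from your own feature definition file.

```python
import json
from typing import Any, Mapping, Sequence


def build_online_features_request(
    feature_view: str,
    feature_names: Sequence[str],
    entities: Mapping[str, Sequence[Any]],
) -> str:
    """Build a JSON body with feature references and entity rows (hypothetical helper)."""
    payload = {
        # Feast feature references take the form "<feature_view>:<feature>".
        "features": [f"{feature_view}:{name}" for name in feature_names],
        # Each entity name maps to a list of values, one per row to look up.
        "entities": {name: list(values) for name, values in entities.items()},
    }
    return json.dumps(payload)


body = build_online_features_request(
    feature_view="process_events",  # hypothetical feature view name
    feature_names=["processName", "processId", "eventName"],
    entities={"user_id": [0]},  # hypothetical entity name and value
)
print(body)
```

The resulting string can be sent as the body of a POST request to the running feature server with any HTTP client.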
Real-time Machine Learning with Rockset
Feature stores, including Feast, have become an integral part of the real-time machine learning data pipeline. With Rockset’s new integration with Feast, you can use Rockset as an online feature store and serve features for real-time personalization, anomaly detection, logistics tracking applications and more.
Rockset is currently available as an online store for Feast and you can take a look at the code here. Get started with the integration and real-time machine learning with $300 in free Rockset credits. Happy hacking✌️
Rockset adds support for vector search for real-time personalization, recommendations and anomaly detection. Learn more about how to use vector search on the Rockset blog.