Real-Time Data Ingestion: Snowflake, Snowpipe and Rockset
August 4, 2021
Organizations that depend on data for their success and survival need robust, scalable data architecture, typically employing a data warehouse for analytics needs. Snowflake is often their cloud-native data warehouse of choice. With Snowflake, organizations get simple data management combined with the power of scaled-out storage and distributed processing.
Although Snowflake is great at querying massive amounts of data, the database still needs to ingest that data first. Ingestion must keep pace with large data volumes; otherwise, you run the risk of querying outdated values and returning irrelevant analytics.
Snowflake provides a couple of ways to load data. The first, bulk loading, moves data from files in cloud storage or on a local machine into a staging area in Snowflake's cloud storage. Once the files are staged, the COPY command loads the data into a specified table. Bulk loading relies on user-specified virtual warehouses, which must be sized appropriately to accommodate the expected load.
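As a rough sketch of what bulk loading looks like in practice, the example below uses the snowflake-connector-python package to stage a local file and then run COPY. The warehouse, stage, and table names are hypothetical placeholders.

```python
# Minimal bulk-load sketch; all object names below are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account_identifier>",
    warehouse="LOAD_WH",   # user-specified warehouse, sized for the expected load
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# Stage a local file into a named internal stage in Snowflake cloud storage.
cur.execute("PUT file:///tmp/events.csv @events_stage")

# Load the staged file into the target table with COPY.
cur.execute("""
    COPY INTO events
    FROM @events_stage
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")
```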
The second method for loading a Snowflake warehouse uses Snowpipe. It continuously loads small data batches and incrementally makes them available for data analysis, typically within minutes of files arriving in the staging area. This gives users the latest results soon after the data lands.
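A Snowpipe is defined once and then runs continuously. As a sketch (again with hypothetical names), the pipe below wraps a COPY statement and, with AUTO_INGEST enabled, fires whenever the external stage's cloud storage emits a new-file notification:

```python
import snowflake.connector

conn = snowflake.connector.connect(user="<user>", password="<password>",
                                   account="<account_identifier>",
                                   database="ANALYTICS", schema="PUBLIC")

# Define a pipe that auto-ingests files as they land in an external stage.
conn.cursor().execute("""
    CREATE OR REPLACE PIPE events_pipe
      AUTO_INGEST = TRUE   -- load on cloud storage event notifications
    AS
      COPY INTO events
      FROM @events_ext_stage
      FILE_FORMAT = (TYPE = 'JSON')
""")
```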
Although Snowpipe is continuous, it's not real-time: data might not be available for querying until minutes after it's staged. Throughput can also be an issue, as writes queue up if too much data is pushed through at one time.
The rest of this article examines Snowpipe’s challenges and explores techniques for decreasing Snowflake’s data latency and increasing data throughput.
Import Delays
When Snowpipe imports data, it can take minutes before that data shows up in the database and becomes queryable. This is too slow for certain types of analytics, especially when near real-time responsiveness is required. Snowpipe data ingestion might be too slow for three categories of use cases: real-time personalization, operational analytics, and security.
Real-Time Personalization
Many online businesses employ some level of personalization today. Using data that is only minutes or even seconds old for real-time personalization has always been elusive, but it can significantly boost user engagement.
Operational Analytics
Applications such as e-commerce, gaming, and the Internet of Things (IoT) commonly require real-time views of what's happening on a site, in a game, or at a manufacturing plant. This enables the operations staff to react quickly to situations unfolding in real time.
Security
Data applications providing security and fraud detection need to react to streams of data in near real-time. This way, they can provide protective measures immediately if the situation warrants.
You can speed up Snowpipe data ingestion by writing smaller files to your data lake. Chunking a large file into smaller ones lets Snowflake process each file much more quickly, which makes the data available sooner.
Smaller files also trigger cloud notifications more often, prompting Snowpipe to process the data more frequently. This may reduce import latency to as low as 30 seconds, which is enough for some, but not all, use cases. The latency reduction is not guaranteed, and it can increase Snowpipe costs because more file ingestions are triggered.
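As a rough illustration of the smaller-files approach, the sketch below splits a large newline-delimited JSON file into fixed-size chunks before they are uploaded to the stage. The chunk size and file naming are arbitrary choices to tune for your workload.

```python
import os

CHUNK_LINES = 50_000  # tune so each chunk stays small (for example, 10-100 MB)

def split_file(src_path: str, out_dir: str) -> list:
    """Split a newline-delimited file into smaller chunk files."""
    os.makedirs(out_dir, exist_ok=True)
    chunks, lines, idx = [], [], 0
    with open(src_path) as src:
        for line in src:
            lines.append(line)
            if len(lines) >= CHUNK_LINES:
                chunks.append(_write_chunk(out_dir, idx, lines))
                lines, idx = [], idx + 1
    if lines:  # flush the final partial chunk
        chunks.append(_write_chunk(out_dir, idx, lines))
    return chunks

def _write_chunk(out_dir: str, idx: int, lines) -> str:
    path = os.path.join(out_dir, f"events_{idx:05d}.json")
    with open(path, "w") as out:
        out.writelines(lines)
    return path
```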
Throughput Limitations
A Snowflake data warehouse can only handle a limited number of simultaneous file imports. Snowflake’s documentation is deliberately vague about what these limits are.
Although you can parallelize file loading, it's unclear how much improvement this yields. You can specify from 1 to 99 parallel threads, but too many threads can lead to excessive context switching, which slows performance. Another issue is that, depending on file size, the threads may split a large file into chunks rather than loading multiple files at once. So, parallelism is not guaranteed.
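The 1-to-99 thread count corresponds to the PARALLEL option of Snowflake's PUT command (the default is 4). A sketch, with a hypothetical stage name:

```python
import snowflake.connector

conn = snowflake.connector.connect(user="<user>", password="<password>",
                                   account="<account_identifier>")

# Upload files with up to 8 client threads. For large files, the threads may
# be spent splitting one file into chunks rather than sending several files
# concurrently, so file-level parallelism isn't guaranteed.
conn.cursor().execute(
    "PUT file:///tmp/chunks/events_*.json @events_stage PARALLEL = 8"
)
```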
You are likely to encounter throughput issues when trying to continuously import many data files with Snowpipe, as the import queue backs up and increases the latency before data is queryable.
One way to mitigate queue backups is to avoid sending cloud notifications to Snowpipe when imports are already queued up. Instead, you can trigger file imports through Snowpipe's REST API and implement your own back-pressure algorithm, holding off on import calls whenever the number of pending files would overload the automated Snowpipe import queue. Unfortunately, slowing file importing delays queryable data.
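A back-pressure loop along these lines could be built on the snowflake-ingest Python SDK, which wraps the Snowpipe REST API (and requires key-pair authentication). The threshold, names, and response handling below are illustrative assumptions, not a production implementation:

```python
import time
from snowflake.ingest import SimpleIngestManager, StagedFile

MAX_IN_FLIGHT = 100  # illustrative queue-depth threshold

ingest_manager = SimpleIngestManager(
    account="<account>",
    host="<account>.snowflakecomputing.com",
    user="<ingest_user>",
    pipe="ANALYTICS.PUBLIC.EVENTS_PIPE",  # hypothetical pipe name
    private_key="<private_key_pem>",      # key-pair auth for the REST API
)

def ingest_with_backpressure(paths):
    """Submit files to Snowpipe, pausing when too many are still pending."""
    in_flight = []
    for path in paths:
        while len(in_flight) >= MAX_IN_FLIGHT:
            time.sleep(5)  # back off before polling the load report again
            report = ingest_manager.get_history()
            # Simplified: assumes the report lists processed files under
            # "files" entries that carry a "path" field.
            done = {f.get("path") for f in report.get("files", [])}
            in_flight = [p for p in in_flight if p not in done]
        ingest_manager.ingest_files([StagedFile(path, None)])
        in_flight.append(path)
```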
Another way to improve throughput is to expand your Snowflake cluster. Upgrading to a larger Snowflake warehouse can improve throughput when importing hundreds or thousands of files simultaneously, but it comes at a significantly increased cost.
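Resizing is a one-line statement. Note that this applies to user-managed warehouses running bulk COPY loads (Snowpipe itself runs on Snowflake-managed compute), and that larger sizes bill at correspondingly higher rates. The warehouse name is hypothetical:

```python
import snowflake.connector

conn = snowflake.connector.connect(user="<user>", password="<password>",
                                   account="<account_identifier>")

# Resize a warehouse to absorb a heavier bulk-load workload.
conn.cursor().execute("ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'LARGE'")
```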
Alternatives
So far, we’ve explored some ways to optimize Snowflake and Snowpipe data ingestion. If those solutions are insufficient, it may be time to explore alternatives.
One possibility is to augment Snowflake with Rockset. Rockset is designed for real-time analytics. It indexes all data, including data with nested fields, making queries performant. Rockset uses an architecture called Aggregator Leaf Tailer (ALT). This architecture allows Rockset to scale ingest compute and query compute separately.
Also, like Snowflake, Rockset queries data via SQL, enabling your developers to come up to speed on Rockset swiftly. What truly sets Rockset apart from the Snowflake-plus-Snowpipe combination is its ingestion speed via the ALT architecture: millions of records per second, available to queries within two seconds. This speed is what lets Rockset call itself a real-time database: one that can sustain a high write rate of incoming data while simultaneously making that data available to the latest application queries. The combination of the ALT architecture and indexing everything enables Rockset to greatly reduce database latency.
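For a sense of what querying Rockset looks like, here is a rough sketch that posts SQL to Rockset's REST query endpoint using the requests library. The API host is region-specific, and the events collection is a hypothetical example:

```python
import requests

ROCKSET_API_KEY = "<api_key>"
ROCKSET_HOST = "https://api.rs2.usw2.rockset.com"  # region-specific host

# Run a SQL query over recently ingested data in a hypothetical collection.
resp = requests.post(
    f"{ROCKSET_HOST}/v1/orgs/self/queries",
    headers={"Authorization": f"ApiKey {ROCKSET_API_KEY}"},
    json={"sql": {"query":
        "SELECT COUNT(*) FROM events "
        "WHERE _event_time > CURRENT_TIMESTAMP() - INTERVAL 1 MINUTE"}},
)
resp.raise_for_status()
print(resp.json()["results"])
```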
Like Snowflake, Rockset scales as needed in the cloud to support growth. Given this combination of fast ingestion, fast queries, and scalability, Rockset can fill Snowflake's throughput and latency gaps.
Next Steps
Snowflake is a scalable, cloud-native relational database. It can ingest large amounts of data either by loading it on demand or automatically as it becomes available via Snowpipe.
Unfortunately, if your data application needs real-time or near real-time data, Snowpipe might not be fast enough. You can architect your Snowpipe data ingestion to increase throughput and decrease latency, but it can still take minutes before the data is queryable. If you have large amounts of data to ingest, you can increase your Snowpipe compute or Snowflake cluster size. But, this will quickly become cost-prohibitive.
If your applications have data availability needs in seconds, you may want to augment Snowflake with other tools or explore an alternative such as Rockset. Rockset is built from the ground up for fast data ingestion, and its “index everything” approach enables lightning-fast analytics. Furthermore, Rockset’s Aggregator Leaf Tailer architecture with separate scaling for data ingestion and query compute enables Rockset to vastly lower data latency.
Rockset is designed to meet the needs of industries such as gaming, IoT, logistics, and security. You are welcome to explore Rockset for yourself.