Serverless Data Management: A SQL Search and Analytics Engine

March 21, 2019

When we started Rockset, we envisioned building a powerful cloud data management system that was really easy to use. Making the data stack simpler is fundamental to making data usable by developers and data scientists.

Simplifying the Data Stack

To that end, we incorporated user-friendly features that alleviate the pain we personally experienced as data practitioners. We pushed the boundaries of the SQL type system to natively support dynamic typing, so that the need for ETL is eliminated in a large number of situations. This makes turning any type of data—from JSON, XML, Parquet, and CSV to even Excel files—into SQL tables a trivial pursuit. We automatically build multiple general-purpose indexes on all data ingested into Rockset, so that we can eliminate the need for database administration and query tuning for a wide spectrum of applications.
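
As a rough illustration of what dynamic typing in SQL means, the sketch below uses SQLite, which also attaches types to individual values rather than to columns. This is only an analogy, not Rockset's implementation. The `events` table and the `type_histogram` helper are hypothetical names chosen for this example.

```python
import sqlite3


def type_histogram(values):
    """Store mixed-type values in one untyped column and count them by type."""
    conn = sqlite3.connect(":memory:")
    # The column has no declared type, so each value keeps its own type.
    conn.execute("CREATE TABLE events (value)")
    conn.executemany(
        "INSERT INTO events (value) VALUES (?)", [(v,) for v in values]
    )
    rows = conn.execute(
        "SELECT typeof(value) AS t, COUNT(*) FROM events GROUP BY t ORDER BY t"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # The same field arrives as an integer, a string, a float, and a null.
    print(type_histogram([42, "42", 3.5, None, "hello"]))
```

Because no schema has to be declared before loading, the same field can hold an integer in one record and a string in the next, and a query can still inspect or filter on the type of each value.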

Another key aspect of Rockset that makes it simple to use is its serverless nature. Serverless frameworks allow you to build and run applications and services without thinking about provisioning, scaling, and managing any servers. Function-as-a-Service frameworks, such as AWS Lambda, Azure Functions, and Google Cloud Functions, go quite far in realizing that vision for stateless applications, but the real challenge comes when applications need to deal with state. For serverless computing to truly become a reality, we need data management systems that are also truly serverless, and in Rockset we have implemented such a system.

Truly Serverless Data Management

A data management system is serverless if one can load data, persist data, and run queries without ever having to think about servers. Some of the key aspects of a serverless data management system are:

  • No provisioning – Users shouldn't have to concern themselves with what type of hardware they need to provision to set up the data management system.
  • No capacity planning – Users shouldn't need to plan cluster capacity at any point during the lifetime of the application. This means situations such as over-provisioned capacity burning a hole in their pockets or under-provisioned capacity causing performance and reliability issues should not be possible.
  • No scaling limits – Users shouldn't have to worry about hitting a wall as their data footprint grows. The data management system should feel limitless.
  • No server maintenance – Users shouldn't have to think about security patching, upgrading dependent modules, or monitoring servers—all the tasks required to support 24 x 7 server uptime.

Without the burden of server management, teams can direct all their efforts towards their business and their products, thereby yielding significantly faster time to market.

Perhaps the most impactful consequence of serverless data management is that users pay for actual usage and not for provisioned capacity. The entire concept of provisioning capacity should be obsolete in a truly serverless world. If you evaluate all cloud data services with this perspective, it is rare that any passes this litmus test, irrespective of what their marketing materials claim. Cloud SQL-based services (such as Amazon RDS, Redshift, Snowflake), cloud NoSQL key-value services, or cloud search services (such as Amazon Elasticsearch Service, Elastic Cloud) do not meet the serverless bar. All these systems require users to pay for provisioned capacity and require active capacity planning to control costs and ensure reliability.
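
The difference between paying for provisioned capacity and paying for actual usage can be made concrete with a little arithmetic. The sketch below is purely illustrative: the capacity units, the price, the headroom factor, and the demand profile are all hypothetical and are not taken from any vendor's price list.

```python
def provisioned_cost(hourly_demand, price_per_unit_hour, headroom=1.25):
    """Cost when capacity is sized for peak demand (plus headroom) every hour."""
    capacity = max(hourly_demand) * headroom
    return capacity * price_per_unit_hour * len(hourly_demand)


def usage_cost(hourly_demand, price_per_unit_hour):
    """Cost when only the capacity actually consumed in each hour is billed."""
    return sum(hourly_demand) * price_per_unit_hour


if __name__ == "__main__":
    # Hypothetical day: 20 quiet hours at 2 units, 4 busy hours at 10 units.
    demand = [2] * 20 + [10] * 4
    price = 1.0  # hypothetical price per unit of capacity per hour
    print("provisioned:", provisioned_cost(demand, price))
    print("usage-based:", usage_cost(demand, price))
```

Under these assumed numbers, the provisioned bill is driven entirely by the four busy hours, while the usage-based bill follows the demand curve. Sizing below the peak lowers the provisioned bill, but at the price of the performance and reliability problems described earlier.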

The only data services that truly meet the serverless criteria are cloud object stores, such as Amazon S3, Azure Blob Storage, and Google Cloud Storage. To use these cloud object stores, you don't need to do any provisioning or ongoing capacity planning, and the service is truly limitless. No wonder they are hugely popular and are perhaps the biggest driver for enterprises flocking to public clouds. But when it comes to operational data management systems, just about every one requires you to pick instance types and the number of instances, configure compute, RAM, and storage individually, select the correct version of server software, and set up clusters. Even the seemingly serverless ones ask you to provision capacity in terms of peak read ops/sec and write ops/sec.

Here at Rockset, we want to right this wrong. Rockset is a truly serverless search and analytics engine that can power fast sub-second queries over any of your data sets. Rockset automatically and seamlessly provisions more compute and network capacity based on the total amount of data you have stored in it, providing enough compute to cover almost any real-world application. We will write about how Rockset enables this behind the scenes in future posts.

You pay a flat monthly fee, based on the total amount of data you've stored in Rockset. All data stored is automatically indexed in multiple ways to make all your queries fast out of the box. There are no additional costs for query processing or the additional storage required to store all the indexes built on your data sets.

Bringing Serverless Data Management to Developers and Data Scientists

With the simplicity Rockset affords, our early users have realized significant value from their data with small teams and in short amounts of time. Coatue adopted Rockset to minimize all the ETL work and pipeline maintenance they needed to handle complex, changing data. Fynd went a completely serverless route, pairing Rockset with AWS Lambda functions to create a serverless microservice to track key metrics in real time. Rockset makes collecting and analyzing messy data very easy, even for individual developers and students.

I’m very happy to announce that Rockset is now generally available, bringing the power and simplicity of Rockset to you. You can create your Rockset account right away from this link, and getting started is absolutely free for up to 2GB of data. Go ahead and unleash your curiosity!