Azure Blob Storage

This page covers how to use an Azure Blob Storage container as a data source in Rockset. This includes:

  • Creating an Azure Blob Storage Integration to securely connect containers in your Azure Blob Storage account.
  • Creating a Collection which syncs your data from an Azure Blob Storage container into Rockset.

🚧

For the following steps, you must have access to an Azure Blob Storage account and be able to manage Azure Blob Storage account keys.

If you do not have access, please invite your Azure account administrator to Rockset.

Create an Azure Blob Storage Integration

These instructions explain how to set up an Azure Blob Storage integration. An integration can provide access to one or more containers within your Azure Blob Storage account. You can use an integration to create collections that sync data from your Azure Blob Storage containers.

Create a new shared access signature (SAS)

To access your Azure Blob Storage resources, Rockset authenticates with a shared access signature (SAS) that grants access to your desired containers. Once you complete these steps, you can use the connection string associated with the SAS to create the integration in the Rockset console.

To create a new shared access signature (SAS), navigate to the Azure Blob Storage account console, and then select "Shared access signature" from the left sidebar.

From there, you can create a new key with read-only access, with the following settings:

  • Allowed services: Blob
  • Allowed resource types: Container, Object
  • Allowed permissions: Read, List
  • Expiry date: 2023-01-14, or a date in the distant future.

Then click "Generate SAS and connection string" and copy the "Connection string" field. You will need this string to create the Azure Blob Storage integration in the Rockset console.
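Before pasting the connection string into Rockset, it can be useful to confirm that the SAS carries the settings listed above. The following is a minimal sketch, not part of Rockset or the Azure SDK. It assumes the connection string has the usual `Key=Value;` layout with a `SharedAccessSignature` entry, and that the SAS uses the standard account SAS query parameters (`ss` for services, `srt` for resource types, `sp` for permissions, `se` for expiry). The function names are hypothetical.

```python
from urllib.parse import parse_qs


def parse_connection_string(conn_str):
    """Split a 'Key=Value;Key=Value' connection string into a dict."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue
        # Split on the first '=' only: the SAS value itself contains '='.
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts


def check_sas_settings(conn_str):
    """Return a list of problems found; an empty list means the SAS
    matches the read-only settings described above."""
    parts = parse_connection_string(conn_str)
    sas = parts.get("SharedAccessSignature", "").lstrip("?")
    if not sas:
        return ["no SharedAccessSignature entry found"]

    params = {k: v[0] for k, v in parse_qs(sas).items()}
    problems = []
    if params.get("ss") != "b":
        problems.append("allowed services should be Blob only (ss=b)")
    if not set("co") <= set(params.get("srt", "")):
        problems.append("resource types should include Container and Object (srt=co)")
    if set(params.get("sp", "")) != set("rl"):
        problems.append("permissions should be exactly Read and List (sp=rl)")
    if "se" not in params:
        problems.append("no expiry date (se) present")
    return problems
```

For example, calling `check_sas_settings` on a string such as `BlobEndpoint=https://exampleaccount.blob.core.windows.net/;SharedAccessSignature=sv=2021-06-08&ss=b&srt=co&sp=rl&se=2023-01-14T00:00:00Z&sig=PLACEHOLDER` (placeholder account name and signature) returns an empty list.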

Create Azure Blob Storage shared access signature (SAS)

Create a Collection

Once you have set up an integration, you can proceed to create an Azure Blob Storage sourced collection. When you are creating a collection, you can choose which paths you want to include in your collection by adding multiple sources with distinct path names.

You can create a collection from an Azure Blob Storage source in the Collections tab of the Rockset Console.

Create Azure Blob Storage Collection

💡

These operations can also be performed using any of the Rockset client libraries, the Rockset API, or the Rockset CLI.

Specifying Blob Path

You can ingest all data in a container by specifying just the container name, or restrict ingestion to a subset of the objects in the container by specifying an additional prefix or pattern.

By default, if the blob path has no special characters, a prefix match is performed. However, if the blob path contains any of the following special characters, pattern matching is used instead.

  • ? matches one character
  • * matches zero or more characters
  • ** matches zero or more directories in a path

The following examples of path values explain exactly how the patterns can be used:

  • Empty path matches all directories and files in the container.
  • xyz/ - uses prefix match, matches all files in the folder xyz.
  • xyz/t?st.csv - matches xyz/test.csv but also xyz/tast.csv or xyz/txst.csv in the container.
  • xyz/*.csv - matches all .csv files in the xyz directory in the container.
  • xyz/**/test.json - matches all test.json files in any subdirectory under the xyz path in the container.
  • 05/2018/**/*.json - matches all .json files underneath any subdirectory under the 05/2018/ path in the container.
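To try out a blob path before creating a collection, the matching rules above can be approximated locally. This is a rough sketch, not Rockset's actual matcher. It assumes that `?` and `*` do not cross `/` boundaries, that `**/` stands for zero or more whole directories, and that a pattern must match the entire blob name. The function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a blob path pattern into a compiled regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")           # trailing '**': anything below this point
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")        # zero or more characters in one segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")         # exactly one character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def blob_matches(path, blob_name):
    """Prefix match when the path has no special characters,
    pattern match otherwise."""
    if "?" not in path and "*" not in path:
        return blob_name.startswith(path)
    return pattern_to_regex(path).fullmatch(blob_name) is not None
```

With these assumptions, `blob_matches("xyz/*.csv", "xyz/data.csv")` is true while `blob_matches("xyz/*.csv", "xyz/sub/data.csv")` is false, and an empty path matches every blob name.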

Source Configurations

Rockset allows updating the following configurations for Azure Blob Storage using the Sources API:

| Configuration Name | Default Value | Min Value | Max Value | Description |
| --- | --- | --- | --- | --- |
| azblob_scan_frequency | PT5M | PT1S | PT1H | Rockset scans an Azure Blob Storage container at a defined time interval. The scan frequency determines the length of time between a new scan and the previous scan. If the previous scan finds new objects or updates to existing objects, Rockset immediately scans the container again after processing changes from the previous scan. The duration value uses the ISO 8601 format (e.g. PT5H, PT4M, PT3S). It doesn't account for DST, leap seconds and leap years. |
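A candidate azblob_scan_frequency value can be checked against the documented minimum (PT1S) and maximum (PT1H) before it is sent to the Sources API. The sketch below is illustrative only and assumes that values use just the time components of an ISO 8601 duration (hours, minutes, whole seconds), as in the examples above. The helper names are hypothetical and are not part of any Rockset client library.

```python
import re
from datetime import timedelta

# Accepts forms such as PT5M, PT1H30M or PT45S (time components only).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value):
    """Parse a time-only ISO 8601 duration into a timedelta."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"not a supported ISO 8601 duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


MIN_SCAN_FREQUENCY = parse_duration("PT1S")
MAX_SCAN_FREQUENCY = parse_duration("PT1H")


def is_valid_scan_frequency(value):
    """True if the value parses and lies within the documented bounds."""
    try:
        duration = parse_duration(value)
    except ValueError:
        return False
    return MIN_SCAN_FREQUENCY <= duration <= MAX_SCAN_FREQUENCY
```

Under these assumptions the default PT5M is accepted, while a value such as PT2H is well-formed but falls outside the allowed range.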