Google Cloud Storage

This page covers how to use a Google Cloud Storage bucket as a data source in Rockset. This includes:

  • Creating a Google Cloud Storage integration to securely connect buckets in your GCP account with Rockset.
  • Creating a collection which syncs your data from a Google Cloud Storage bucket into Rockset.

For the following steps, you must have access to a Google Cloud account and be able to manage Google Cloud Service Accounts and Roles. If you do not have access, please invite your GCP account administrator to Rockset.

Create a Google Cloud Storage Integration

These instructions explain how to set up a Google Cloud Storage integration using a GCP Service Account. An integration can provide access to one or more GCS buckets within your GCP account. You can use an integration to create collections that sync data from your GCS buckets.

Step 1: Set Up Your GCP Service Account

To access your GCP resources, Rockset uses a GCP Service Account with permissioned access to your desired GCS buckets. You can either use an existing service account or create a new one for Rockset to use. Once you complete these steps, you can use the JSON key associated with the service account to create the Rockset integration in the Rockset console.

Create a New Service Account

If you don't have an existing service account, or want to use a new one, navigate to the "IAM & Admin" section in the Google Cloud Console sidebar and select the "Service Accounts" tab within that section.

From there, create a new service account by selecting the "Create Service Account" button at the top and then following the instructions on the page.

Create GCP Service Account

For more details, see the GCP documentation on creating and managing service accounts.

Create a New Key Pair

On the service accounts home page in the Google Cloud Console, select your desired service account (if you just created a new service account for Rockset above, select your newly created account) to view its details. Under the "Keys" section, select "Add Key", and then "Create New Key". Select "JSON" for the key type and then click "Create".

Once the key is created, your browser should automatically download a JSON file containing the key. You will need this JSON to create the GCP integration in the Rockset Console.
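Before pasting the key into the Rockset Console, it can help to confirm that the downloaded file really is a service account key. The following is a minimal sketch using only the Python standard library; the function name and the choice of which fields to check are assumptions made for illustration, not part of Rockset or GCP tooling.

```python
import json

# Fields that a GCP service account JSON key is expected to contain.
# Checking only this subset is an assumption made for this sketch.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(text: str) -> str:
    """Return the service account email if the key JSON looks well-formed."""
    key = json.loads(text)
    if not isinstance(key, dict):
        raise ValueError("key file must contain a JSON object")
    missing = [field for field in REQUIRED_FIELDS if not key.get(field)]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))
    if key["type"] != "service_account":
        raise ValueError("expected type 'service_account', got %r" % key["type"])
    return key["client_email"]
```

To use it, read the downloaded file and pass its contents in, for example check_service_account_key(open("key.json").read()), where key.json is a hypothetical file name. The returned email is the same address you enter when granting bucket permissions.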

Step 2: Configure Your GCS Bucket Permissions

In order to access Google Cloud Storage buckets, you must provide roles to the service account that allow access to specific buckets. To do so, you will need to navigate to the "Storage" section in the Google Cloud Console sidebar, and then select the "Browser" tab within that section.

Find the GCS bucket whose data you would like to sync into your Rockset collection, and then click the three dots on the right-hand side to select "Edit Bucket Permissions".

Setting up per-bucket permission

From here, select the "Add Member" button to give the service account the appropriate permissions. When adding the service account as a new member, be sure to enter the account's full email address (service account emails have the form NAME@PROJECT_ID.iam.gserviceaccount.com).

For a set of standard roles, refer to the GCP IAM permissions documentation. For example, the Storage Object Viewer role (roles/storage.objectViewer) grants read access to objects; when granted at the project level, it applies to all your GCS buckets.

You can also configure individual buckets to be accessible by the service account you created. The permissions that Rockset needs are:

  • storage.objects.get - Required to retrieve an object from Google Cloud Storage.
  • storage.objects.list - Required to list objects within a given bucket in Google Cloud Storage.

You can grant a role that provides these permissions to the service account across your project, or grant it only on a specific bucket.
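To confirm that the service account ended up with both permissions, you can compare what it has been granted against the two that Rockset needs. The sketch below is one possible way to do that and is not part of Rockset. The helper names are hypothetical, and granted_on_bucket assumes the google-cloud-storage Python package is installed and that the key file path you pass points at the JSON key downloaded earlier.

```python
# The two permissions this page lists as required by Rockset.
REQUIRED_PERMISSIONS = frozenset({"storage.objects.get", "storage.objects.list"})


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


def granted_on_bucket(bucket_name, key_path):
    """Ask GCS which of the required permissions the key's account holds.

    Assumes the google-cloud-storage package; imported lazily so that the
    pure helper above works without it.
    """
    from google.cloud import storage

    client = storage.Client.from_service_account_json(key_path)
    bucket = client.bucket(bucket_name)
    # test_iam_permissions returns the subset of permissions the caller has.
    return bucket.test_iam_permissions(sorted(REQUIRED_PERMISSIONS))
```

An empty result from missing_permissions(granted_on_bucket(...)) suggests the bucket is ready to use; a non-empty result names the permission that still needs to be granted.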

Create a Collection

Once you have set up an integration, you can create a collection sourced from Google Cloud Storage. When creating a collection, you can choose which paths to include by adding multiple sources with distinct path names.

You can create a collection from a GCS source in the Collections tab of the Rockset Console.

Create GCS Collection

Note that these operations can also be performed using any of the Rockset client libraries, the Rockset API, or the Rockset CLI.

Specifying GCS Path

You can ingest all data in a bucket by specifying just the bucket name or restrict to a subset of the objects in the bucket by specifying an additional prefix or pattern.
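As an illustration of how a path value breaks down into a bucket name and an optional prefix or pattern, here is a small sketch. The function name is hypothetical and this is not Rockset's own parser. It deliberately uses plain string splitting instead of a URL parser, because a generic URL parser would treat a "?" in the path as the start of a query string.

```python
def split_gcs_path(uri: str):
    """Split 'gs://bucket/some/path' into (bucket, path_within_bucket).

    The path part is empty when only the bucket name is given.
    """
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError("expected a path starting with gs://")
    bucket, _, path = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("bucket name is empty")
    return bucket, path
```

With this split, "gs://bucket" yields an empty path (ingest everything), while "gs://bucket/xyz/t?st.csv" keeps the pattern "xyz/t?st.csv" intact.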

By default, if the object path has no special characters, a prefix match is performed. However, if any of the following special characters are used in the GCS path, it triggers pattern matching semantics.

  • ? matches one character
  • * matches zero or more characters
  • ** matches zero or more directories in a path

The following examples of path values explain exactly how the patterns can be used:

  • gs://bucket/xyz - uses prefix match, matches all files that have a prefix of xyz.
  • gs://bucket/xyz/t?st.csv - matches xyz/test.csv but also xyz/tast.csv or xyz/txst.csv in
    the bucket.
  • gs://bucket/xyz/*.csv - matches all .csv files in the xyz directory in the bucket.
  • gs://bucket/xyz/**/test.json - matches all test.json files in any subdirectory under the xyz
    path in the bucket.
  • gs://bucket/05/2018/**/*.json - matches all .json files underneath any subdirectory under the
    /05/2018/ path in the bucket.
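The matching rules above can be approximated in a few lines of Python, which may be useful for checking a path value against a list of object names before creating a collection. This is only a sketch of the semantics as described on this page, not Rockset's implementation. It assumes that ? and * do not match the / separator, which the page does not state explicitly, and the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate a path pattern into a regular expression string."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:[^/]+/)*")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")           # trailing **: anything below
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")        # zero or more non-separator chars
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")         # exactly one non-separator char
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return "".join(parts)


def path_matches(path_spec: str, object_name: str) -> bool:
    """Apply prefix matching unless the spec contains ? or *."""
    if "?" not in path_spec and "*" not in path_spec:
        return object_name.startswith(path_spec)
    return re.fullmatch(pattern_to_regex(path_spec), object_name) is not None
```

Here path_spec is the part of the path after the bucket name, so path_matches("xyz/*.csv", "xyz/a.csv") corresponds to testing gs://bucket/xyz/*.csv against the object xyz/a.csv. Under the stated assumption, the same pattern does not match xyz/sub/a.csv; if * is meant to cross directory boundaries, the "[^/]*" fragment would need to become ".*".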