Rockset
  • Loading Data
  • Adding a Data Source

Amazon DynamoDB

This page covers how to use an Amazon DynamoDB table as a data source in Rockset. You will learn the following:

  • How to create an Amazon DynamoDB integration to securely connect tables in your AWS account with Rockset.
  • How to create a collection which syncs your data from an Amazon DynamoDB table into Rockset in real-time.
  • Understand how the DynamoDB integration works.

To apply the following steps, you must:

  • Have access to an AWS account.
  • Be able to manage AWS IAM policies and IAM users within it.

If you do not have access, please invite your AWS administrator to Rockset.

Create a DynamoDB Integration

The steps below explain how to set up an Amazon DynamoDB integration using AWS Cross-Account IAM Roles and AWS Access Keys (deprecated). An integration can provide access to one or more DynamoDB tables within your AWS account. You can use an integration to create collections that continuously sync data from your DynamoDB tables.

Follow the steps below to set up an Amazon DynamoDB integration:

  1. Configure AWS IAM Policy
  2. Configure Role / Access Key

Configure AWS IAM Policy

  1. Navigate to the IAM Service in the AWS Management Console.

    1. Set up a new policy. Navigate to Policies.

    2. Click Create policy. If you already have a policy set up for Rockset, you can update that existing policy. For more details, refer to AWS Documentation on IAM Policies. AWS IAM Policies

  2. Set up access to your DynamoDB table. You can switch to the JSON tab and paste the policy shown below. This JSON enables you to configure all permissions rather than configuring them one by one. If you already have a Rockset policy set up, you can add the body of the Statement attribute to it.
    AWS IAM JSON

    Note:

    • You must replace <YOUR-DYNAMODB-TABLE-NAME> with the name of your DynamoDB table.
    • You must replace <YOUR-BUCKET-NAME> with the name of your S3 bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:Scan",
        "dynamodb:DescribeStream",
        "dynamodb:DescribeExport",
        "dynamodb:GetRecords",
        "dynamodb:GetShardIterator",
        "dynamodb:DescribeTable",
        "dynamodb:DescribeContinuousBackups",
        "dynamodb:ExportTableToPointInTime",
        "dynamodb:UpdateTable",
        "dynamodb:UpdateContinuousBackups",
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:dynamodb:*:*:table/<YOUR-DYNAMODB-TABLE-NAME>",
        "arn:aws:dynamodb:*:*:table/<YOUR-DYNAMODB-TABLE-NAME>/stream/*",
        "arn:aws:dynamodb:*:*:table/<your-table>/export/*",
        "arn:aws:s3:::<Your-Bucket-Name>",
        "arn:aws:s3:::<Your-Bucket-Name>/*"
      ]
    }
  ]
}
  1. Save the newly created (or updated) policy and give it a descriptive name. You will attach this policy to a user or role in the next step.

The following explains why these permissions are necessary:

  • dynamodb:Scan: Required for initial table scans when reading data using scan method, and to provide a preview of your data during collection creation.
  • dynamodb:DescribeStream: Required for retrieving metadata about DynamoDB streams.
  • dynamodb:GetRecords: Required for retrieving records from DynamoDB streams.
  • dynamodb:GetShardIterator: Required for retrieving records from DynamoDB streams.
  • dynamodb:DescribeTable: Required for metadata about DynamoDB tables.
  • dynamodb:UpdateTable: Optional. Enables streams on a DynamoDB table. DynamoDB streams are required for live sync to work correctly. You can omit this permission if you would like to enable streams on your tables manually, as described here. You have to specify the StreamViewType as NEW_AND_OLD_IMAGES while creating the stream.
  • dynamodb:DescribeContinuousBackups: Required for checking the status of the data export.
  • dynamodb:DescribeExport: Required for exporting your data to the S3 bucket.
  • dynamodb:UpdateContinuousBackups: Required for enabling export on your DynamoDB table.
  • dynamodb:ExportTableToPointInTime: Required for initiating the export of your DynamoDB table.
  • s3:ListBucket: Required to read the Amazon S3 bucket and object metadata, where your table is exported to.
  • s3:GetObject: Required to retrieve objects from the designated Amazon S3 bucket.
  • s3:PutObject: Required to export DynamoDB documents to your Amazon S3 bucket.

We recommend using an empty S3 bucket for this integration. This enables you to easily find and delete any table exports you are not interested in keeping, to avoid paying additional S3 storage costs.

Advanced Permissions

You can set up permissions for multiple tables (or even all tables) by modifying the Resource ARNs. The format of the ARN for DynamoDB is as follows: arn:aws:dynamodb:region:account-id:table/tablename.

You can substitute the following resources in the policy above, to grant access to multiple tables as shown below:

  • All tables in your account:
    • arn:aws:dynamodb:*:*:table/*
    • arn:aws:dynamodb:*:*:table/*/stream/*
  • All tables starting with prod:
    • arn:aws:dynamodb:*:*:table/prod*
    • arn:aws:dynamodb:*:*:table/prod*/stream/*
  • All tables in us-west-2:
    • arn:aws:dynamodb:us-west-2:*:table/*
    • arn:aws:dynamodb:us-west-2:*:table/*/stream/*

Note: You must also include the corresponding /stream/* permissions with the above, for live sync to work correctly. For more details on how to specify a resource path, refer to AWS documentation on DynamoDB ARNs.

You can also set up permissions for multiple buckets or specific paths, by modifying the Resource ARNs. The format of the ARN for S3 is as follows: arn:aws:s3:::bucket_name/key_name.

You can substitute the following resources in the policy above, to grant access to multiple buckets or prefixes as shown below:

  • All paths under mybucket/salesdata:
    • arn:aws:s3:::mybucket
    • arn:aws:s3:::mybucket/salesdata/*
  • All buckets starting with sales:
    • arn:aws:s3:::sales*
    • arn:aws:s3:::sales*/*
  • All buckets in your account:
    • arn:aws:s3:::*
    • arn:aws:s3:::*/*

For more details on how to specify a resource path, refer to AWS documentation on S3 ARNs.

Configure Role / Access Key

There are two mechanisms you can use to grant Rockset permissions to your AWS resource:

AWS Cross-Account IAM Role

The most secure way to grant Rockset access to your AWS account, is to add cross-account access in Rockset. To do so, you'll need to create an IAM Role that assumes your newly created policy on Rockset's behalf.

You'll need information from the Rockset Console to create and save this integration.

Follow the steps below to set up this role:

  1. Navigate to the IAM service in the AWS Management Console.

  2. Set up a new role.

    1. Navigate to Roles.

    2. Click Create role.

      Note: If you already have a role for Rockset set up, you can re-use it and either add or update the above policy directly.

AWS IAM Roles

  1. Select Another AWS account as a trusted entity type, and configure the following:

    1. Enable Require External ID.

    2. Populate the Account ID and External ID fields with the Rockset Account ID and External ID found on the Create Integration page of Rockset Console (under the Cross-Account Role option).

AWS IAM Create Role

  1. Click Next.

  2. Choose the policy created for this role. You can create the policy now if needed by clicking Create policy.

AWS IAM Roles Attach Policy

  1. Click Next.

  2. Name the role descriptively (e.g., 'rockset-role'). Finally, add a description and any tags and click Next.

Note: Ensure you record the Role ARN for the Rockset integration in the Rockset Console.

AWS Access Key (deprecated)

Follow the steps below to set up an access key:

  1. Navigate to the IAM service in the AWS Management Console.

  2. Create a new user.

    1. Navigate to Users.

    2. Click Add User. If you already created a user for Rockset, you can attach the policy created in the previous section to that user.

    AWS IAM Users

  3. Enter a name for the user, enable Programmatic access, and click Next:Permissions to continue:

    AWS IAM Create User

  4. Choose Attach existing policies directly and select the policy you created in step 1. Click through the remaining steps to finish creating the user. AWS IAM Attach Policy

  5. When the new user is successfully created, you should see the Access key ID and Secret access key displayed on the screen.

    AWS IAM Access Key

    If you are attaching the policy to an existing IAM user, navigate to Security Credentials under the IAM user and generate a new access key.

  6. Record both of these values in the Rockset Console within a new DynamoDB integration.

Create a Collection

Once you create a collection backed by Amazon DynamoDB and the initial load is finished, Rockset scans the DynamoDB tables to continuously ingest data changes (inserts, updates, and deletes). It subsequently uses the stream to update collections with new objects, which are added to the DynamoDB table. The sync latency is no more than five seconds under a regular load.

You can create a collection from a DynamoDB source in the Collections tab of the Rockset Console. Create DynamoDB Integration

Fill the form sections with the table information and any SQL ingest transformations that should be applied to the ingest.

Note: The Rockset and DynamoDB integration requires that _id is not specified. Rockset will calculate a unique _id from existing identifiers in DynamoDB and use that to update records accordingly. If your DynamoDB table has _id specified, you can map it to a new field (i.e. dynamodbId) using the following ingest transformation:

SELECT
    _id as dynamodbId,
    *
EXCEPT
    (_id)
FROM
    _input

Best Practices

Amazon DynamoDB allows up to 2 simultaneous readers per shard in DynamoDB streams. We recommend no more than one collection reads from a DynamoDB stream. If multiple collections read from the same DynamoDB stream, they will encounter rate limits imposed by DynamoDB, which results in throttling. When encountering throttling, Rockset retries with backoff, which can increase data ingestion latencies. Read more about the limits here.

How Does The DynamoDB Integration Work

Rockset uses two modes to import data:

Export Using S3

If you choose to have DynamoDB export to an S3 bucket, Rockset will start to copy the data from the designated S3 bucket and index it, once the export is finished. There is an export set-up time of about seven minutes, regardless of table size.

The export capability for Bulk Ingest is a new, preferred method because it:

  • Removes the limitation on the table size that can be ingested to Rockset, due to DynamoDB’s 24-hour limit for the scan method.
  • Prevents the initial data import phase from hitting any bottlenecks or impacting your production workload.
  • Can significantly accelerate the overall ingest time.

Read more about the benefits of the new export method here.

After a DynamoDB backed collection is set up, it will be a replica of the DynamoDB table.

Note: In order to avoid additional charges by AWS, you can delete the S3 export once you receive the notification that the initial ingest is done, and optionally, disable the Point In Time Recovery functionality.

Table Scan

This method is only recommended for small tables (<5 GB) where the above S3 export set-up time will exceed the time it takes to scan the DynamoDB table into a Rockset collection.

A one-time full scan of the DynamoDB table will be performed, in which all items are indexed and stored in a Rockset collection. This method uses your DynamoDB Read Capacity Units (RCUs).

Read Capacity Units (RCUs)

Each DynamoDB table is configured with RCUs, which represent an upper bound on the read requests a client can issue. Rockset consumes RCUs to perform strongly-consistent scans during the initial table full scan. Strongly-consistent scans guarantee that no updates to the table are missed between the beginning of the scan and the start of ingestion of the DynamoDB stream.

You can manually configure the RCUs allocated for ingest, when you create a new collection using your DynamoDB integration:

Create DynamoDB Collection RCU

RCUs can be configured in DynamoDB to enable application reads at a faster rate. Likewise, you can specify an upper bound on RCUs in Rockset, to use during the initial scan. If you configure a high number of RCUs for Rockset in your DynamoDB table, it will result in faster ingest speeds.

RCUs in DynamoDB can be configured with two different table modes:

  • Provisioned RCU mode: We recommend not allocating more than 50% of RCUs to Rockset, as production workloads may get affected during the initial scan.
  • On-Demand RCU mode: We recommend allocating enough RCUs based on the data size and how fast you want it to be ingested in Rockset.

Note: Allocating more than the RCUs configured on DynamoDB table will result in throttled read requests.

Streaming Ingest

When the initial ingest phase is complete, Rockset will switch to the Streaming Ingest mode, regardless of the method you choose for initial table loads. Rockset will continuously monitor and sync changes from the DynamoDB table (inserts, deletes, and updates) to the Rockset collection in real-time, using DynamoDB streams. If you used the Export Using S3 method, you can safely delete the S3 bucket contents, as they are not used during streaming ingest. You will receive a notification email when the initial table load using the S3 method is complete, after which, you can delete the contents of the S3 bucket.

Note: If using a provisioning tool to manage your DynamoDB table(s), be careful to not allow it to disable your DynamoDB streams. For Terraform, we recommend updating terraform to have stream enabled:

stream_enabled   = true
stream_view_type = "NEW_AND_OLD_IMAGES"

Or having Terraform ignore this attribute by adding the following to your AWS DynamoDB table resource:

lifecycle {
  ignore_changes = [stream_enabled, stream_view_type]
}