Client Configuration¶
Usage¶
Client objects allow you to connect securely to the Rockset service. All other API calls require a valid Client object.
In order to create a Client object, you will need a valid Rockset API key. If you have access to the Rockset Console, you can use the console to create an API key. If not, please contact the Rockset team at support@rockset.io.
You will also need to set a valid API server. To view a list of our servers, check out our documentation.
from rockset import Client
# connect securely to Rockset production API servers
client = Client(api_server='api.rs2.usw2.rockset.com',
api_key='XKQL6YCU0zDUglhWHPMDDmDYyMxDHrASGk5apCnn3A07twh')
You can manage your api_key credentials using the rock command-line tool. Run rock configure to set up one or more api_key credentials and select the one that you want all rock commands and the Python Rockset Client to use. Once set up, you should expect the following to work.
from rockset import Client
# connect to the active credentials profile
# you can see the active profile by running ``rock configure ls``
rs = Client()
# connect to credentials profile 'prod' as defined by ``rock configure``
rs = Client(profile='prod')
Example
Connect to Rockset API server and then subsequently use the client object to retrieve collections.
from rockset import Client
# connect securely to Rockset dev API server
rs = Client(api_server='api-us-west-2.rockset.io',
api_key='adkjf234rksjfa23waejf2')
# list all collections in the account that I have access to
all_collections = rs.Collection.list()
# create a new collection; returns a collection object
new_collection = rs.Collection.create('customer_info')
# get details of an existing collection as a collection object
users = rs.Collection.retrieve('users')
Reference¶
- class rockset.Client(api_key=None, api_server=None, profile=None, driver=None, **kwargs)[source]¶
Securely connect to Rockset using an API key.
Optionally, an alternate API server host can also be provided. If you have configured credentials using the rock configure command, then those credentials will act as fallback values when none of the api_key/api_server parameters are specified.
- Parameters
api_key (str) – API key
api_server (str) – API server URL. Will default to https if URL does not specify a scheme.
profile (str) – Optionally, you can also specify the name of a credentials profile you have set up using rock configure
- Returns
Client object
- Return type
- Raises
ValueError – when the API key is not specified and could not be fetched from rock CLI credentials, or when the api_server URL is invalid.
- classmethod config_dir()[source]¶
Returns the name of the directory where Rockset credentials, config, and logs are stored.
Defaults to "~/.rockset/". Can be overridden via the ROCKSET_CONFIG_HOME environment variable.
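For example, a quick way to check which directory is in effect (a minimal sketch):
from rockset import Client
# print the directory used for credentials, config, and logs;
# honors ROCKSET_CONFIG_HOME when that environment variable is set
print(Client.config_dir())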
- sql(q, **kwargs)[source]¶
Execute a query against Rockset.
This method prepares the given query object and binds it to a Cursor object, and returns that Cursor object. The request is not actually dispatched to the backend until the results are fetched from the cursor.
Input query needs to be supplied as a Query object.
Cursor objects are iterable, and you can iterate through a cursor to fetch the results. The entire result data set can also be retrieved from the cursor object using a single
results() call.
When you iterate through the cursor in a loop, the cursor object implements automatic pagination behind the scenes. If the query returns a large number of results, only a portion of the results are buffered into the cursor at a time. As the cursor iterator reaches the end of the current batch, it will automatically issue a new query to fetch the next batch and seamlessly resume. The cursor's default iterator uses a batch size of 10,000, and you can create an iterator with a different batch size using the iter() method on the cursor object.
Example:
...
rs = Client()
cursor = rs.sql(q)

# fetch all results in one go
all_results = cursor.results()

# iterate through all results;
# automatic pagination with default iterator batch size of 10,000
# if len(all_results) == 21,442, then as part of looping
# through the results, three distinct queries would be
# issued with (limit, skip) of (10000, 0), (10000, 10000),
# (10000, 20000)
for result in cursor:
    print(result)

# iterate through all results;
# automatic pagination with iterator batch size of 20,000
# if len(all_results) == 21,442, then as part of looping
# through the results, two distinct queries would have
# been issued with (limit, skip) of (20000, 0), (20000, 20000).
for result in cursor.iter(20000):
    print(result)
...
- Parameters
q (Query) – Input Query object
timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.
- Returns
returns a cursor that can fetch query results with or without automatic pagination
- Return type
- MAX_DOCUMENT_SIZE_BYTES = 41943040¶
Maximum allowed size of a single document
- MAX_FIELD_NAME_LENGTH = 10240¶
Maximum allowed length of a field name
- MAX_FIELD_VALUE_BYTES = 4194304¶
Maximum allowed size of a field value
- MAX_ID_VALUE_LENGTH = 10240¶
Maximum allowed length of
_id
field value
- MAX_NAME_LENGTH = 2048¶
Maximum allowed length of a collection name
- MAX_NESTED_FIELD_DEPTH = 30¶
Maximum allowed levels of depth for nested documents
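These limits are exposed as attributes on the Client class, so a client-side pre-check can reference them directly. A minimal sketch (check_doc is a hypothetical helper, not part of the client):
import json
from rockset import Client

def check_doc(doc):
    # hypothetical pre-flight check against the documented client limits
    assert len(json.dumps(doc).encode('utf-8')) <= Client.MAX_DOCUMENT_SIZE_BYTES
    for field_name in doc:
        assert len(field_name) <= Client.MAX_FIELD_NAME_LENGTH
    return doc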
Workspaces¶
Usage¶
Workspace objects represent containers of other workspaces and Rockset collections.
Example
from rockset import Client, Q, F
# connect securely to Rockset
rs = Client()
# create a workspace
rs.Workspace.create('marketing')
# create a collection in the workspace
user_events = rs.Collection.create('user_events', workspace='marketing')
Create a new workspace¶
Creating a workspace using the Client object is as simple as calling client.Workspace.create('my-new-workspace'):
from rockset import Client
rs = Client()
new_ws = rs.Workspace.create('my-new-workspace')
List all workspaces¶
List all workspaces using the Client object:
from rockset import Client
rs = Client()
workspaces = rs.Workspace.list()
Retrieve an existing workspace¶
Retrieve a workspace to run various operations on that workspace:
from rockset import Client
rs = Client()
marketing = rs.Workspace.retrieve('marketing')
Drop a workspace¶
Use the drop()
method to remove a workspace permanently from Rockset.
Note
This is a permanent and non-recoverable operation. Beware.
from rockset import Client
rs = Client()
marketing = rs.Workspace.drop('marketing')
Collections¶
Usage¶
Collection objects represent a single Rockset collection. These objects are generally created using a Rockset Client object using methods such as:
from rockset import Client
# connect to Rockset
rs = Client()
# create a new collection
user_events = rs.Collection.create('user-events')
# retrieve an existing collection
users = rs.Collection.retrieve('users')
You can add documents to the collection using the add_docs()
method. Each
document in a collection is uniquely identified by its _id
field.
If documents added have _id
fields that match existing documents,
then their contents will be merged. Otherwise, the new documents will be
added to the collection.
You can remove documents from a collection using the remove_docs()
method.
Refer to the Query module for documentation and examples on how to query collections.
Example
from rockset import Client, Q, F
# connect securely to Rockset
rs = Client()
# retrieve the relevant collection
emails = rs.Collection.retrieve('emails')
# look for all emails to johndoe that contain the term 'secret'
johndoe_secret_q = Q('emails').where(
(F["to"].startswith('johndoe@')) & (F["body"][:] == 'secret')
)
# query the collection
docs = rs.sql(johndoe_secret_q).results()
Create a new collection¶
Creating a collection using the Client object is as simple as calling client.Collection.create("my-new-collection"):
from rockset import Client
rs = Client()
new_collection = rs.Collection.create("my-new-collection")
# create a collection in a workspace
leads_collection = rs.Collection.create("leads",
workspace="marketing")
# create a collection and map timestamp field to event-time
field_mappings = [
rs.FieldMapping.mapping(
name="transformation1",
input_fields=[
rs.FieldMapping.input_field(
field_name="ts",
if_missing="PASS",
is_drop=True,
param="ts"
)
],
output_field=rs.FieldMapping.output_field(
field_name="_event_time",
sql_expression="CAST(:ts AS TIMESTAMP)",
on_error="SKIP"
)
)
]
event_data_collection = rs.Collection.create("event-data-collection",
field_mappings=field_mappings)
Creating a collection with a retention duration of 10 days:
from rockset import Client
rs = Client()
new_collection_with_retention = rs.Collection.create("my-event-collection",
retention_secs=10*24*60*60)
List all collections¶
List all collections using the Client object:
from rockset import Client
rs = Client()
# list all collections
collections = rs.Collection.list()
Retrieve an existing collection¶
Retrieve a collection to run various operations on that collection, such as adding or removing documents or executing queries:
from rockset import Client
rs = Client()
users = rs.Collection.retrieve('users')
# retrieve a collection in a workspace
users = rs.Collection.retrieve('users', workspace='marketing')
Describe an existing collection¶
The describe method can be used to fetch all the details about the collection, such as which data sets act as the collection's sources, along with various performance and usage statistics:
from rockset import Client
rs = Client()
users = rs.Collection.retrieve('users')
print(users.describe())
Drop a collection¶
Use the drop()
method to remove a collection permanently from Rockset.
Note
This is a permanent and non-recoverable operation. Beware.
from rockset import Client
rs = Client()
users = rs.Collection.retrieve('users')
users.drop()
Add documents to a collection¶
Python dicts can be added as documents to a collection using the add_docs method. Documents are uniquely identified by the _id field. If an input document does not have an _id field, then a unique id will be assigned by Rockset.
If the _id
field of an input document does not match an existing document,
then a new document will be created.
If the _id
field of an input document matches an existing document,
then the new document will be merged with the existing document:
from rockset import Client
import json
rs = Client()
with open('my-json-array-of-dicts.json') as data_fh:
    ret = rs.Collection.add_docs('users', json.load(data_fh))
Delete documents from a collection¶
Remove documents from a collection using the remove_docs
method:
from rockset import Client
rs = Client()
users_to_remove = ['user007', 'user042', 'user435']
docs_to_remove = [{'_id': u} for u in users_to_remove]
ret = rs.Collection.remove_docs('users', docs_to_remove)
Reference¶
- class rockset.collection.Collection(*args, **kwargs)[source]¶
Collection objects represent a single Rockset collection.
Objects of this class are never instantiated directly and are generally returned by methods such as:
from rockset import Client
rs = Client()
first = rs.Collection.create('my-first-collection')
another = rs.Collection.retrieve('another-collection')
See more examples and documentation here.
- add_docs(docs, timeout=None)[source]¶
Adds or merges documents to the collection. Provides document level atomicity.
Documents within a collection are uniquely identified by the _id field. If an input document does not specify _id, then a unique UUID will be assigned to the document.
If the _id field of an input document does not match any existing collection document, then the input document will be inserted.
If the _id field of an input document matches an existing collection document, then the input document will be merged atomically as described below:
All fields present in both the input document and the collection document will be updated to values from the input document.
Fields present in the input document but not the collection document will be inserted.
Fields present in the collection document but not the input document will be left untouched.
All fields within every input document will be inserted or updated atomically. No atomicity guarantees are provided across two different documents added.
Example:
from rockset import Client
rs = Client()
docs = [
    {"_id": "u42", "name": {"first": "Jim", "last": "Gray"}},
    {"_id": "u1201", "name": {"first": "David", "last": "DeWitt"}},
]
results = rs.Collection.add_docs("my-favorite-collection", docs)
...
- Parameters
name (str) – Name of the collection
docs (list of dicts) – New documents to be added or merged
timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.
- Returns
The response dict will have one field: data.
The data field will be a list of document status records, one for each input document, in the same order as the list of input documents provided as part of the request. Each document status record will have fields such as the document _id, the _collection name, a status describing whether that particular document add request succeeded, and an optional error field with more details.
- Return type
Dict
- classmethod create(name, workspace='commons', description=None, sources=None, retention_secs=None, field_mappings=None, clustering_key=None, **kwargs)[source]¶
Creates a new Rockset collection.
Use it via rockset.Client().Collection.create()
Only alphanumeric characters, _, and - are allowed in collection names.
- Parameters
name (str) – name of the collection to be created.
description (str) – a human readable description of the collection
sources (Source) – array of Source objects that defines the set of input data sources for this collection
retention_secs (int) – an integer representing minimum duration (in seconds), for which documents are retained in this collection before being automatically deleted.
field_mappings (FieldMapping) – array of FieldMapping objects that defines the set of transformations to apply on all documents
clustering_key (ClusteringKey) – array of ClusteringKey objects that defines the clustering config for this collection
- Returns
Collection object
- Return type
- describe()[source]¶
Returns all properties of the collection as a dict.
- Returns
properties of the collection
- Return type
dict
- drop()[source]¶
Deletes the collection represented by this object.
If successful, the current object will contain a property named dropped with value True.
Example:
...
print(my_coll.asdict())
my_coll.drop()
print(my_coll.dropped)   # will print True
...
- classmethod list(**kwargs)[source]¶
Returns list of all collections.
Use it via rockset.Client().Collection.list()
- Returns
A list of Collection objects
- Return type
List
- patch_docs(docpatches, timeout=None)[source]¶
Updates documents in the collection.
This method expects a list of docpatches, where each docpatch describes a set of updates that need to be applied to a single document in the collection.
All updates specified in a single docpatch will be applied atomically to the document. If a single patch operation specified in a patch fails, the entire patch operation will be aborted.
Each docpatch is a dict that is required to have 2 fields:
_id, which holds the _id field of the document to be updated
patch, which holds a list of patch operations to be applied to that document, following the JSONPatch standard defined at http://jsonpatch.com
Example:
from rockset import Client
rs = Client()
docpatch = {
    "_id": "u42",
    "patch": [
        {"op": "replace", "path": "/name/middle", "value": "Nicholas"}
    ]
}
rs.Collection.patch_docs('my-collection', [docpatch])
Each patch is a list of individual patch operations, where each patch operation specifies how a particular field or field path within the target document should be updated.
Each patch operation is a dict with a key called "op" (for operation) and a few more keys that act as arguments to the "op", which differ from one operation type to another. The JSONPatch standard defines several types of patch operations, their arguments and their behavior. Refer to http://jsonpatch.com/#operations for more details.
If a patch operation's argument is a field path, then it is specified using the JSON Pointer standard defined at https://tools.ietf.org/html/rfc6901. In essence, field paths are represented as a string of tokens separated by / characters. These tokens either specify keys in objects or indexes into arrays, and arrays are 0-based.
For example, in this document:
{
    "biscuits": [
        { "name": "Digestive" },
        { "name": "Choco Leibniz" }
    ]
}
"/biscuits" would point to the array of biscuits, and "/biscuits/1/name" would point to "Choco Leibniz".
Here are some examples of individual patch operations:
Add
Example:
{ "op": "add", "path": "/biscuits/1", "value": { "name": "Ginger Nut" } }
Adds a value to an object or inserts it into an array. In the case of an array, the value is inserted before the given index. The - character can be used instead of an index to insert at the end of an array.
Remove
Example:
{ "op": "remove", "path": "/biscuits" }
Removes a value from an object or array.
Another Example:
{ "op": "remove", "path": "/biscuits/0" }
Removes the first element of the array at biscuits (or just removes the “0” key if biscuits is an object)
Replace
Example:
{ "op": "replace", "path": "/biscuits/0/name", "value": "Chocolate Digestive" }
Replaces a value. Equivalent to a “remove” followed by an “add”.
Copy
Example:
{ "op": "copy", "from": "/biscuits/0", "path": "/best_biscuit" }
Copies a value from one location to another within the JSON document. Both “from” and “path” are JSON Pointers.
Move
Example:
{ "op": "move", "from": "/biscuits", "path": "/cookies" }
Moves a value from one location to the other. Both “from” and “path” are JSON Pointers.
Test
Example:
{ "op": "test", "path": "/best_biscuit/name", "value": "Choco Leibniz" }
Tests that the specified value is set in the document. If the test fails, then the patch as a whole will not apply.
- Parameters
name (str) – Name of the collection
docpatches (list of dicts) – List of document patches to be applied.
timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.
- Returns
The response dict will have one field: data.
The data field will be a list of document status records, one for each input document, in the same order as the list of input documents provided as part of the request. Each document status record will have fields such as the document _id, the _collection name, a status describing whether that particular document patch request succeeded, and an optional error field with more details.
- Return type
Dict
- remove_docs(docs, timeout=None)[source]¶
Deletes documents from the collection. The _id field needs to be populated in each input document. Other fields in each document will be ignored.
- Parameters
name (str) – Name of the collection
docs (list of dicts) – Documents to be deleted.
timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.
- Returns
The response dict will have one field: data.
The data field will be a list of document status records, one for each input document, in the same order as the list of input documents provided as part of the request. Each document status record will have fields such as the document _id, the _collection name, a status describing whether that particular document delete request succeeded, and an optional error field with more details.
- Return type
Dict
Collection Sources¶
Introduction¶
Source objects represent various data sources that could be used to create collections.
Example usage¶
from rockset import Client, Q, F
import os
rs = Client()
# create a collection from an AWS S3 bucket
integration = rs.Integration.retrieve('aws-rockset-read-only')
s3 = rs.Source.s3(bucket='my-s3-bucket',
integration=integration)
newcoll = rs.Collection.create(name='newcoll', sources=[s3])
Create AWS S3 source for a collection¶
AWS S3 buckets can be used as a data source for collections:
from rockset import Client, Q, F
import os
rs = Client()
# create a collection from an AWS S3 bucket
integration = rs.Integration.retrieve('aws-rockset-read-only')
s3 = rs.Source.s3(bucket='my-s3-bucket',
integration=integration)
newcoll = rs.Collection.create(name='newcoll', sources=[s3])
- Source.s3()¶
Gets the s3 configuration of this Source (configuration for ingestion from S3).
- Returns
The s3 configuration of this Source.
- Return type
SourceS3
Reference¶
- class rockset.source.Source(integration, format_params)[source]¶
- classmethod azure_blob_storage(container=None, prefix=None, integration=None, format_params=None)[source]¶
Creates a source object to represent an Azure Blob Storage container as a data source for a collection.
- Parameters
container (str) – Name of the Azure Blob Storage container
prefix – selects objects whose path matches the specified prefix within the container
integration (rockset.Integration) – An Integration object
format_params (FormatParams) – the specifications of the format, CsvParams or XmlParams
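A minimal usage sketch based on the parameters above; the container, prefix, and integration names are placeholders:
from rockset import Client
rs = Client()
# placeholder integration name
integration = rs.Integration.retrieve('azure-blob-read-only')
abs_source = rs.Source.azure_blob_storage(container='my-container',
                                          prefix='events/',
                                          integration=integration)
newcoll = rs.Collection.create(name='azure-events', sources=[abs_source])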
- classmethod csv_params(separator=None, encoding=None, first_line_as_column_names=None, column_names=None, column_types=None)[source]¶
Creates an object to represent the options needed to parse a CSV file
- Parameters
separator (str) – The separator between column values in a line
encoding (str) – The encoding format of data, one of "UTF-8", "UTF-16", "US-ASCII" [default: "US-ASCII"]
first_line_as_column_names (boolean) – Set to true if the first line of a data object has the names of columns to be used. If this is set to false, the column names are auto-generated. [default: False]
column_names (list of strings) – The names of columns
column_types (list of strings) – The types of columns
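A sketch combining csv_params with an S3 source, following the S3 example above; the bucket and integration names are placeholders:
from rockset import Client
rs = Client()
integration = rs.Integration.retrieve('aws-rockset-read-only')
csv = rs.Source.csv_params(separator=',',
                           encoding='UTF-8',
                           first_line_as_column_names=True)
s3 = rs.Source.s3(bucket='my-csv-bucket',
                  integration=integration,
                  format_params=csv)
csvcoll = rs.Collection.create(name='csv-data', sources=[s3])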
- classmethod dynamo(table_name, integration=None, aws_region='us-west-2', use_scan_api=False)[source]¶
Creates a source object to represent an AWS DynamoDB table as a data source for a collection.
- Parameters
table_name (str) – Name of the DynamoDB table
integration (rockset.Integration) – An Integration object (optional)
aws_region – AWS region of the DynamoDB table
use_scan_api – Whether the initial table scan should use the DynamoDB scan API. If false, export will be performed using an S3 bucket.
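A minimal usage sketch based on the signature above; the table name, region, and integration name are placeholders:
from rockset import Client
rs = Client()
integration = rs.Integration.retrieve('aws-rockset-read-only')
dynamo = rs.Source.dynamo(table_name='my-dynamodb-table',
                          integration=integration,
                          aws_region='us-east-1')
dynamocoll = rs.Collection.create(name='dynamo-backed-collection',
                                  sources=[dynamo])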
- classmethod gcs(bucket=None, prefix=None, integration=None, format_params=None)[source]¶
Creates a source object to represent a Google Cloud Storage (GCS) bucket as a data source for a collection.
- Parameters
bucket (str) – Name of the GCS bucket
prefix – selects objects whose path matches the specified prefix within the bucket
integration (rockset.Integration) – An Integration object (optional)
format_params (FormatParams) – the specifications of the format, CsvParams or XmlParams
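A minimal usage sketch following the same pattern as the S3 example; the bucket, prefix, and integration names are placeholders:
from rockset import Client
rs = Client()
# placeholder integration name
integration = rs.Integration.retrieve('gcs-rockset-read-only')
gcs = rs.Source.gcs(bucket='my-gcs-bucket',
                    prefix='logs/',
                    integration=integration)
gcscoll = rs.Collection.create(name='gcs-logs', sources=[gcs])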
- classmethod kafka(kafka_topic_name, integration, use_v3=None, offset_reset_policy='EARLIEST')[source]¶
Creates a source object to represent Apache Kafka as a data source for a collection.
- Parameters
kafka_topic_name (str) – Kafka topic to be tailed
use_v3 (boolean) – whether to use v3
offset_reset_policy (str, LATEST/EARLIEST) – the offset reset policy
integration (rockset.Integration) – An Integration object
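A minimal usage sketch based on the parameters above; the topic and integration names are placeholders:
from rockset import Client
rs = Client()
# placeholder integration name
integration = rs.Integration.retrieve('kafka-integration')
kafka = rs.Source.kafka(kafka_topic_name='clickstream',
                        integration=integration,
                        offset_reset_policy='EARLIEST')
kafkacoll = rs.Collection.create(name='clickstream-events', sources=[kafka])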
- classmethod kinesis(stream_name, integration, dms_primary_key=None, format_params=None, aws_region='us-west-2')[source]¶
Creates a source object to represent a Kinesis Stream as a data source for a collection
- Parameters
stream_name (str) – Name of the Kinesis Stream
integration (rockset.Integration) – An Integration object (optional)
dms_primary_key (list of strings) – DMS primary key (optional)
format_params (FormatParams) – the specifications of the format (optional)
aws_region (str) – AWS region the stream is created in (defaults to us-west-2)
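A minimal usage sketch based on the signature above; the stream and integration names are placeholders:
from rockset import Client
rs = Client()
integration = rs.Integration.retrieve('aws-rockset-read-only')
kinesis = rs.Source.kinesis(stream_name='my-kinesis-stream',
                            integration=integration,
                            aws_region='us-west-2')
kinesiscoll = rs.Collection.create(name='kinesis-events', sources=[kinesis])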
- classmethod mongo(database_name, collection_name, integration)[source]¶
Creates a source object to represent a MongoDB collection as a data source for a Rockset collection.
- Parameters
database_name (str) – Name of the MongoDB database
collection_name (str) – Name of the MongoDB collection
integration (rockset.Integration) – An Integration object
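A minimal usage sketch based on the parameters above; the database, collection, and integration names are placeholders:
from rockset import Client
rs = Client()
# placeholder integration name
integration = rs.Integration.retrieve('mongodb-integration')
mongo = rs.Source.mongo(database_name='appdb',
                        collection_name='orders',
                        integration=integration)
mongocoll = rs.Collection.create(name='mongo-orders', sources=[mongo])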
- classmethod s3(bucket, prefixes=None, prefix=None, pattern=None, integration=None, format_params=None)[source]¶
Creates a source object to represent an AWS S3 bucket as a data source for a collection.
- Parameters
bucket (str) – Name of the S3 bucket
prefix – Path prefix to only source S3 objects that are recursively within the given path. (optional)
pattern – Path pattern to only source S3 objects that match the given pattern. (optional)
integration (rockset.Integration) – An Integration object (optional)
format_params (FormatParams) – the specifications of the format, CsvParams or XmlParams
- classmethod xml_params(root_tag=None, doc_tag=None, encoding=None, value_tag=None, attribute_prefix=None)[source]¶
Creates an object to represent the options needed to parse an XML file
- Parameters
root_tag (str) – Outermost tag within an XML file to be treated as the root. Any content outside the root tag is ignored.
doc_tag (str) – Every Rockset document is contained between a <doc_tag> and a </doc_tag>
encoding (str) – The encoding format of data. [default: ‘UTF-8’]
value_tag (str) – Tag used for the value when there are attributes in the element having no child. [default: ‘value’]
attribute_prefix (str) – Attributes are transformed into key-value pairs in a Rockset document. This prefix is used to tell attributes apart from nested tags in a Rockset document.
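A sketch combining xml_params with an S3 source; the tag names, bucket, and integration names are placeholders:
from rockset import Client
rs = Client()
integration = rs.Integration.retrieve('aws-rockset-read-only')
xml = rs.Source.xml_params(root_tag='catalog',
                           doc_tag='item',
                           encoding='UTF-8')
s3 = rs.Source.s3(bucket='my-xml-bucket',
                  integration=integration,
                  format_params=xml)
xmlcoll = rs.Collection.create(name='catalog-items', sources=[s3])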
Collection Field Mappings¶
Introduction¶
FieldMapping objects allow you to specify transformations to be applied on all documents inserted into a collection. Any valid SQL can be applied on any field in a document.
For more information on field mappings, refer to the official documentation.
Example of basic field mapping¶
from rockset import Client
rs = Client()
mappings = [
rs.FieldMapping.mapping(
name="anonymize_name",
input_fields=[
rs.FieldMapping.input_field(
field_name="name",
if_missing="SKIP",
is_drop=True,
param="name"
)
],
output_field=rs.FieldMapping.output_field(
field_name="name", sql_expression="SHA256(:name)", on_error="FAIL"
)
)
]
# This collection will have all its `name` fields anonymized through the SQL expression
# in the output field above.
collection = rs.Collection.create(name="collection", field_mappings=mappings)
Example of field whitelisting¶
from rockset import Client
rs = Client()
mappings = [
rs.FieldMapping.mapping(name="drop_all_fields", is_drop_all_fields=True),
rs.FieldMapping.mapping(
name="whitelist_name",
input_fields=[
rs.FieldMapping.input_field(
field_name="name",
if_missing="SKIP",
is_drop=True,
param="name"
)
],
output_field=rs.FieldMapping.output_field(
field_name="name", sql_expression=":name", on_error="FAIL"
)
)
]
# This collection will have `name` as a whitelisted field, essentially dropping all fields
# except for `name`.
collection = rs.Collection.create(name="collection", field_mappings=mappings)
Reference¶
- class rockset.field_mapping.FieldMapping[source]¶
- classmethod input_field(field_name, param=None, if_missing=None, is_drop=None)[source]¶
Create a new InputField object
- Parameters
field_name (str) – The name of the field, parsed as a SQL qualified name
param (str) – SQL parameter name (default: same as the field name; required if the field name is nested)
if_missing (str) – Define the behavior if the field is missing from the document or is NULL, one of:
SKIP: skip the SQL expression in the output field, i.e. acts as if the field mapping does not exist
PASS: pass NULL to the SQL expression specified in the output field
Default is SKIP.
is_drop (boolean) – Set to true if the input field needs to be dropped
- classmethod mapping(name, is_drop_all_fields=None, input_fields=None, output_field=None)[source]¶
Creates a new mapping object
- Parameters
name (str) – Name of the mapping
is_drop_all_fields (bool) – Whether to drop all the fields in a document. This can only be set once in a list of field mappings, and a mapping specifying is_drop_all_fields cannot have any input or output fields
input_fields – An Array of InputField objects
output_field – An OutputField object
- classmethod output_field(field_name, sql_expression, on_error=None)[source]¶
Create a new OutputField object
- Parameters
field_name (str) – The name of the field, parsed as SQL qualified name
sql_expression (str) – SQL expression
on_error (str) –
Define the behavior if the SQL expression fails, one of:
SKIP: skip the SQL expression, i.e. acts as if the mapping does not exist
FAIL: fail the entire mapping, i.e. acts as if the document does not exist
Default is SKIP.
Collection Aliases¶
Usage¶
Aliases are references to Rockset collections. You can use an alias name in your queries instead of the actual collection name.
Example
from rockset import Client, Q
import time
# connect securely to Rockset
rs = Client()
# create an alias
alias = rs.Alias.create(
'myFirstAlias',
workspace='commons',
collections=['commons._events'])
# wait for the alias to be ready
while not alias.collections:
    alias = rs.Alias.retrieve(
        'myFirstAlias', workspace='commons'
    )
    time.sleep(1)
# create a Query Lambda
qlambda = rs.QueryLambda.create(
'myQueryLambda',
query=Q('SELECT * FROM commons.myFirstAlias LIMIT 10'))
# execute a Query Lambda
results = qlambda.execute()
print(results)
Create a new alias¶
Create a new alias using the Client object:
from rockset import Client
rs = Client()
alias = rs.Alias.create(
'myFirstAlias',
description='alias referencing collection _events in workspace commons',
workspace='commons',
collections=['commons._events'])
List all aliases¶
List all aliases using the Client object:
from rockset import Client
rs = Client()
# list all aliases across all workspaces
aliases = rs.Alias.list()
# list all aliases in a workspace
aliases = rs.Alias.list(workspace='commons')
Retrieve an existing alias¶
Retrieve an existing alias by workspace and alias name:
from rockset import Client
rs = Client()
# retrieve an alias
alias = rs.Alias.retrieve('myFirstAlias', workspace='commons')
# retrieve collections referenced
referenced = alias.collections
Fetch the collections referenced by an alias¶
Fetch the collection objects referenced by an alias:
from rockset import Client
import time
rs = Client()
# create an alias
alias = rs.Alias.create(
'myFirstAlias',
workspace='commons',
collections=['commons._events'])
# wait for the alias to be ready
while not alias.collections:
    alias = rs.Alias.retrieve(
        'myFirstAlias', workspace='commons'
    )
    time.sleep(1)
# fetch the _events collection
events = alias.resolve()[0]
Update an alias¶
Update an alias to reference a new collection:
from rockset import Client
rs = Client()
# create a new collection
rs.Collection.create("my-new-collection")
# retrieve an alias
alias = rs.Alias.retrieve('myFirstAlias', workspace='commons')
# update collection referenced
alias.update(collections=['commons.my-new-collection'])
Drop an alias¶
Use the drop()
method to remove an alias permanently from Rockset.
Note
This is a permanent and non-recoverable operation. Beware.
from rockset import Client
rs = Client()
alias = rs.Alias.retrieve('myFirstAlias', workspace='commons')
alias.drop()
Query Lambdas¶
Usage¶
Query Lambdas are named, parameterized, and versioned SQL queries stored in Rockset that can be executed through a dedicated HTTPS endpoint.
Example
from rockset import Client, Q
# connect securely to Rockset
rs = Client()
# create a Query Lambda
qlambda = rs.QueryLambda.create(
'myQueryLambda',
query=Q('SELECT 1'))
# print details about Query Lambda
print(qlambda.workspace, qlambda.name, qlambda.version, qlambda.query)
# execute a Query Lambda
results = qlambda.execute()
print(results)
Create a new Query Lambda¶
Creating a Query Lambda using the Client object is as simple as calling rs.QueryLambda.create('myQueryLambda', query=Q('SELECT...')):
from rockset import Client, Q, F, P
rs = Client()
# construct a parameterized query
q = Q('_events').where(F['type'] == P['target_type']).limit(100)
# set default value for query parameter
q.P['target_type'] = 'INFO'
# create a new Query Lambda
qlambda = rs.QueryLambda.create(
'myQueryLambda',
workspace='commons',
query=q,
)
List all Query Lambdas¶
List all Query Lambdas using the Client object:
from rockset import Client
rs = Client()
# List latest version of all Query Lambdas across all workspaces
qlambdas = rs.QueryLambda.list()
# List latest version of all Query Lambdas in a given workspace
qlambdas = rs.QueryLambda.list(workspace='commons')
Retrieve an existing Query Lambda version by a given tag¶
Retrieve the version of a given Query Lambda associated with a given tag:
from rockset import Client
rs = Client()
# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByTag(
'myQueryLambda',
tag='dev',
workspace='commons')
Retrieve an existing Query Lambda version¶
Retrieve a particular version of a given Query Lambda:
from rockset import Client
rs = Client()
# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByVersion(
'myQueryLambda',
version='ac34bfg234ee',
workspace='commons')
Describe an existing Query Lambda version¶
Fetch all details about a particular version of a given Query Lambda:
from rockset import Client
rs = Client()
# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByTag(
'myQueryLambda',
tag='dev',
workspace='commons')
# print all details about this Query Lambda version
print(qlambda.describe())
Execute a specific Query Lambda version¶
Execute a Query Lambda version, optionally passing in parameters:
from rockset import Client, ParamDict
rs = Client()
# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByTag(
'myQueryLambda',
tag='dev',
workspace='commons')
params = ParamDict()
for target_type in ['INFO', 'DEBUG', 'WARN', 'ERROR']:
    params['target_type'] = target_type
    for result in qlambda.execute(parameters=params).results:
        print(target_type, result)
Update a Query Lambda by creating a new version¶
Update the query associated with the Query Lambda by creating a new version of it and optionally giving it a tag:
from rockset import Client, Q, F, P, ParamDict
rs = Client()
# retrieve Query Lambda
ql = rs.QueryLambda.retrieveByTag(
'myQueryLambda',
tag='dev',
workspace='commons')
# construct a new query
q = (Q('_events')
.where(
(F['type'] == P['target_type']) &
(F['userEmail'] == P['user_email'])
)
.limit(100)
)
# optionally, set default value for some or all of the query parameters
q.P['target_type'] = 'INFO'
# update Query Lambda
# optionally, add a version tag at this time
ql_new_version = ql.update(q, tag='dev')
# execute the new Query Lambda for different user_emails
params = ParamDict()
for email in ['scott@rockset.com', 'veeve@rockset.com']:
    params['user_email'] = email
    results = ql_new_version.execute(params).results
    print(email, results)
Tag a version of a Query Lambda¶
Apply a tag to a specific Query Lambda version:
from rockset import Client
rs = Client()
# retrieve Query Lambda
ql = rs.QueryLambda.retrieveByVersion(
'myQueryLambda',
version='ac34bfg234ee',
workspace='commons')
# add tag 'dev' to this Query Lambda version
ql.tag('dev')
List all versions of a Query Lambda¶
Fetch all versions of a given Query Lambda:
from rockset import Client
rs = Client()
# retrieve Query Lambda
ql = rs.QueryLambda.retrieveByTag(
'myQueryLambda',
tag='dev',
workspace='commons')
# fetch all versions of this Query Lambda
all_qlambda_versions = ql.history()
Drop Query Lambda along with all previous versions¶
Use the drop()
method to remove a Query Lambda permanently from Rockset.
Note
This is a permanent and non-recoverable operation. Beware.
from rockset import Client
rs = Client()
qlambda = rs.QueryLambda.retrieveByVersion(
'myQueryLambda',
version='ac34bfg234ee',
workspace='commons')
qlambda.drop()
Python Query Builder¶
Usage¶
The Query module contains a set of APIs that allow you to compose powerful queries over collections.
This module comprises two major components:
Q
: Query Builder
Used to compose complex and powerful queries.
>>> from rockset import Q
>>> q = Q('hello-world').limit(10)
>>> (sqltxt, sqlargs) = q.sql()
>>> print(sqltxt)
SELECT *
FROM "hello-world"
LIMIT 10
>>>
F
: Field Reference
Used to construct field expressions that refer to particular fields within a document.
>>> from rockset import F
>>> (F['answer'] == 42).sqlexpression()
'"answer" = 42'
>>>
Example
from rockset import Client, Q, F
# connect to Rockset
rs = Client()
# fetch user whose "_id" == "u42"
u = rs.sql(Q('users').where(F["_id"] == "u42"))
# fetch the 100 oldest users in the 'users' collection
q = Q('users').highest(100, F["age"])
old100 = rs.sql(q)
# find the average rating of all songs by "The Beatles"
q = Q('songs').where(F["artist"] == "The Beatles").select(F["rating"].avg())
avg_rating = rs.sql(q)
Query Operators: Overview¶
- rockset.Q(query, alias=None)[source]¶
All query objects are constructed using the
Q(<collection-name>)
query builder construct and are then followed by a chain of query operators to build the full query expression.
Constructor
Q
: Specify the collection to be queried
>>> # return all documents in the logins collection.
>>> q = Q('logins')
Filter queries
where
: Classic selection operator to only return documents that match the given criteria. Use F to construct field expressions to specify the selection criteria.
>>> # return all docs in logins where field "user_id" is equal to "u42"
>>> q = Q('logins').where(F['user_id'] == 'u42')
Projection
select
: Specify the list of desired fields to be returned
>>> # will return the fields "user_id" and "login_ip" from all docs in logins
>>> q = Q('logins').select(F['user_id'], F['login_ip'])
Pagination
limit
: Specify limit with skip support
>>> # return 10 documents from logins after skipping the first 40 results
>>> q = Q('logins').limit(10, skip=40)
Sorting
highest
,lowest
: Find the top N or the bottom N
>>> # will return 10 documents with the most recent "login_time"
>>> q = Q('logins').highest(10, F['login_time'])
Aggregation
aggregate
: Group by and aggregate fields
>>> # will aggregate all documents in logins by "user_id",
>>> # and return "user_id", max("login_time") and count(*) after aggregation.
>>> Q('logins').aggregate(F['user_id'], F['login_time'].max(), F.count())
Query Operator: Filters¶
Where operator¶
Syntax:
<Query>.where(<Query>)
where
allows you to chain a new query object as a conjunction. In most cases, field reference expressions are sufficient, but where comes in especially handy when you want to sub-select documents following another operation such as a sort or an aggregation.
Examples:
# find all "Jim"s who are in the top 100 highest scorers
Q('players') \
    .highest(100, F["score"]) \
    .where(F["first_name"] == "Jim")
Query Operator: Projection¶
Select operator¶
Syntax:
<Query>.select(<field_ref> [, <field_ref> [, ...]])
Allows you to specify the fields that you wish to include in the query results.
Examples:
Q('authors') \
    .where(F["last_name"] == "Gray") \
    .select(F["first_name"], F["last_name"], F["age"])
Query Operator: Pagination¶
Limit operator¶
Syntax:
<Query>.limit(<max_results> [, <skip_count>])
Limit operator allows you to perform pagination queries and positional filters.
Examples:
# find the "_id" field of the 5 most recently uploaded documents
# since the default sorting is more recently updated first,
# this query will simply be:
Q('uploads').limit(5)

# fetch a third batch of 100 results, for all users older than 18
# i.e., skip the first 200 results
Q('uploads').where(F["age"] >= 18).limit(100, skip=200)
- Query.limit(limit, skip=0)[source]¶
Returns a new query object that when executed will only return a subset of the results. The query when executed will return no more than limit results after skipping the first skip number of results. The limit operator is most commonly used for pagination.
- Parameters
limit (int) – maximum number of results to return
skip (int) – the number of results to skip
- Returns
new query object that only returns the desired subset
- Return type
Query Operator: Sorting¶
Highest, Lowest operators¶
Syntax:
<Query>.highest(N, <field_ref> [, <field_ref> [, ...]])
,<Query>.lowest(N, <field_ref> [, <field_ref> [, ...]])
Examples:
Q(F["last_name"] == "Gray").highest(5, F["score"], F["first_name"])
Q(F["last_name"] == "Gray").lowest(10, F["salary"])
- Query.highest(limit, *fields)[source]¶
Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input in descending order and return top N defined by the limit parameter.
- Parameters
limit (int) – top N results you wish to fetch
fields (list of FieldRef) – fields you wish to sort by (descending)
- Returns
new query object that returns top N descending
- Return type
- Query.lowest(limit, *fields)[source]¶
Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input in ascending order and return top N defined by the limit parameter.
- Parameters
limit (int) – top N results you wish to fetch
fields (list of FieldRef) – fields you wish to sort by (ascending)
- Returns
new query object that returns top N ascending
- Return type
Query Operator: JOINs¶
JOINS¶
Syntax:
<Query>.join(<Query>, on=<field_ref expression>)
<Query>.left_outer_join(<Query>, on=<field_ref expression>)
<Query>.right_outer_join(<Query>, on=<field_ref expression>)
Examples:
Q('emails') \
    .join(Q('users'), on=(F['emails']['from'] == F['users']['email']))

Q('emails', alias='e') \
    .left_outer_join(Q('users', alias='u'), on=(F['e']['from'] == F['u']['email']))
Query Operator: Unnest - to expand nested array fields¶
Unnest operators¶
Syntax:
<Query>.unnest(<field_ref>)
Examples:
Q('emails').unnest(F['to'])

(Q('linkedin_people', 'p')
    .unnest(F['p']['experiences'], alias='exp')
    .where(F['exp']['company_name'] == 'Rockset'))
- Query.unnest(field, field_alias=None, alias=None)[source]¶
Returns a new query object that when executed will unnest the specified array field present in the results of the current query object.
- Parameters
field (FieldRef) – array field that you wish to unnest
field_alias (str) – This field is required if the specified array field holds an array of scalar values. This field is optional if the specified array field holds an array of objects.
alias (str) – subquery alias name for the unnested fields. This field is optional and defaults to the input array field name.
- Returns
new query object for the unnested query
- Return type
Query Operator: Aggregation¶
Aggregate operator and field ref aggregate functions¶
Syntax:
<Query>.aggregate(<field_ref> [, <field_ref> [, ...]])
Field reference objects can also include any of the following aggregation functions:
min
max
avg
sum
count
countdistinct
approximatecountdistinct
collect
You can also optionally provide a field name alias in the field reference using the named function. This comes in especially handy for the aggregated fields.
Examples:
# find min and max salaries broken down by age
Q('employees').aggregate(F["age"], F["salary"].min(), F["salary"].max())
# will return documents such as:
# {"age": "18", "min(salary)": 50000, "max(salary)": 150000}
# {"age": "19", "min(salary)": 50000, "max(salary)": 152000}

# example using field name alias
Q('employees').aggregate(F["age"], F["salary"].avg().named("avg_salary"))
# will return documents such as:
# {"age": "18", "avg_salary": 82732}
- Query.aggregate(*fields)[source]¶
Returns a new query object that when executed will aggregate results from the current query object by the list of fields provided as input.
Field reference objects can include one of the supported aggregate functions such as max, min, avg, sum, count, countdistinct, approximatecountdistinct, collect, as follows: <field_ref>.max(), <field_ref>.min(), and so on.
The list of fields provided as input can contain a mix of field references that include an aggregate function and field references that do not.
- Parameters
fields (list of FieldRef) – fields you wish to aggregate by
- Returns
new query object that includes the desired field aggregations
- Return type
- FieldRef.min()[source]¶
Returns a new FieldRef that represents a MIN() aggregation of the given field.
- Returns
FieldRef object that represents the desired
min
aggregation.
- Return type
AggFieldRef
- FieldRef.max()[source]¶
Returns a new FieldRef that represents a MAX() aggregation of the given field.
- Returns
FieldRef object that represents the desired
max
aggregation.
- Return type
AggFieldRef
- FieldRef.avg()[source]¶
Returns a new FieldRef that represents an AVG() aggregation of the given field.
- Returns
FieldRef object that represents the desired
avg
aggregation.
- Return type
AggFieldRef
- FieldRef.sum()[source]¶
Returns a new FieldRef that represents a SUM() aggregation of the given field.
- Returns
FieldRef object that represents the desired
sum
aggregation.
- Return type
AggFieldRef
- FieldRef.count()[source]¶
Returns a new FieldRef that represents a COUNT() aggregation of the given field.
When called from a field, say F['username'].count(), then the SQL expression generated will be of the following form, which when executed will return the number of rows where "username" IS NOT NULL:
COUNT("username")
When called on the root field, say F.count(), then the SQL expression generated will be of the form:
COUNT(*)
- Returns
FieldRef object that represents the desired
count
aggregation.
- Return type
AggFieldRef
Field Expressions Overview¶
- rockset.F = <rockset.query.FieldRef object>¶
F
is a field reference object that helps in building query expressions natively using Python language expressions.
F uses Python operator overloading heavily, and operations on field references generate Query objects that can be used in conjunction with Q to compose complex queries.
Value comparators
==
,!=
,<
,<=
,>
,>=
:
>>> # match all docs where "first_name" is equal to "Jim"
>>> F["first_name"] == "Jim"
>>> # match all docs where "rating" is greater than or equal to 4.5
>>> F["rating"] >= 4.5
>>> # match all docs where "title" text is lexicographically greater than "Star Wars"
>>> F["title"] >= "Star Wars"
String functions
startswith
,like
: Prefix and classic SQL LIKE expressions
>>> # match all docs where "title" starts with "Star Wars"
>>> F["title"].startswith("Star Wars")
>>> # match all docs where "title" contains the word "Wars"
>>> F["title"].like("% Wars %")
Boolean compositions
&
,|
,~
: AND, OR and NOT expressions
>>> # match all records with "rating" >= 4.5 AND "title" starts with "Star Wars"
>>> e1 = (F["rating"] >= 4.5) & F["title"].startswith("Star Wars")
>>> # match all records with "director" == "George Lucas" OR "title" starts with "Star Wars"
>>> e2 = (F["director"] == 'George Lucas') | F["title"].startswith("Star Wars")
>>> # match all records that are not included in expressions e1 or e2
>>> e1e2_complement = ~(e1 | e2)
IN and EXISTS operators
exists
: Construct a query of the formSELECT ... WHERE <field> IN <subquery>
Can also be used to just check if a subquery returns one or more results
exists
: Construct a query of the formSELECT ... WHERE EXISTS (<subquery>)
>>> # find records where "source_ip" is one of the "ip"s in "wolves" collection
>>> e1 = F["source_ip"].exists(Q("wolves").select(F["ip"]))
>>> # match all such records in the "logins" collection
>>> q = Q("logins").where(e1)
>>> # another way to write the same query
>>> q = (Q("logins")
>>>      .where(
>>>          F.exists(
>>>              Q("wolves")
>>>              .where(F["wolves"]["ip"] == F["logins"]["source_ip"])
>>>          )
>>>      )
>>> )
Field aggregations
avg
,collect
,count
,countdistinct
,max
,min
,sum
>>> # count(*)
>>> F.count()
>>> # min(login_time)
>>> F["login_time"].min()
>>> # max(login_time) as last_login_time
>>> F["login_time"].max().named('last_login_time')
Nested documents and arrays
[]
: The [] notation can be used to refer to fields within nested documents and arrays.
Consider a collection where documents look like the example below.
{
    "_id": "u42",
    "name": {
        "first": "James",
        "middle": "Nicholas",
        "last": "Gray"
    },
    "tags": [
        "ACID",
        "database locking",
        "two phase commits",
        "five-minute rule",
        "data cube",
        "turing award"
    ]
}
Example field references to access nested documents and arrays:
>>> # expression to find all documents where field "name" contains a
>>> # nested field "middle" with value equal to "Nicholas"
>>> F["name"]["middle"] == "Nicholas"
>>>
>>> # similarly, for array fields, you can specify the array offset.
>>> # expression to find all documents where the first "tags" field
>>> # is equal to "ACID"
>>> F["tags"][0] == "ACID"
In order to match against any element within an array field, you can use Python’s empty slice
[:]
notation.
>>> # expression to find all documents where the "tags" array field
>>> # contains "ACID" as one of the elements
>>> F["tags"][:] == "ACID"
>>> # find all documents where one of the "tags" is "turing award"
>>> F["tags"][:] == "turing award"
Field Expression: Value Comparators¶
Equality operator: ==¶
Supported types:
int
,float
,bool
,str
Syntax:
<field_ref> == <value>
Examples:
F["first_name"] == "Jim"
F["year"] == 2017
F["score"] == 5.0
F["tags"][:] == 'critical'
Value comparators: <, <=, >=, >¶
Supported types:
int
,float
,str
Syntax:
<field_ref> < <value>
,<field_ref> <= <value>
,<field_ref> >= <value>
,<field_ref> > <value>
Examples:
F["year"] < 2000
F["year"] >= 2007
F["rating"] >= 4.5
F["title"] >= "Star Wars"
Prefix operator: startswith¶
Supported types:
str
Syntax:
<field_ref>.startswith(<prefix>)
Examples:
F["first_name"].startswith("Ben")
Like operator: like¶
Supported types:
str
Syntax:
<field_ref>.like(<pattern>)
Examples:
F["address"].like("%State St%")
Field alias: named¶
Supported types: All field references
Syntax:
<field_ref>.named(<new-field-name>)
Examples:
F["full_name"].named("name")
F["login_time"].max().named("last_login_time")
Field existence: is_defined¶
Supported types: All
Syntax:
<field_ref>.is_defined()
Field existence tested with
<field_ref>.is_defined()
will match all documents where the field is defined, even if it has a null value.
Null comparison: is_not_null¶
Supported types: All
Syntax:
<field_ref>.is_not_null()
Field expression
<field_ref>.is_not_null()
will match all documents where the field is defined and has a non-null value.
Tip
There is no is_null()
because of the potential confusion of calling is_null()
on an undefined field. Use ~<field_ref>.is_not_null()
or <field_ref>.is_defined() & ~<field_ref>.is_not_null()
depending on your use case.
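For instance, a short sketch of both operators on a hypothetical 'users' collection with an 'email' field:
from rockset import Q, F
# match documents where "email" is defined, even if its value is null
q1 = Q('users').where(F["email"].is_defined())
# match documents where "email" is defined and has a non-null value
q2 = Q('users').where(F["email"].is_not_null())
# match documents where "email" is defined but null
q3 = Q('users').where(F["email"].is_defined() & ~F["email"].is_not_null())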
Field Expression: Boolean Compositions¶
Three different boolean operators (&, |, and ~) are overloaded to allow easy composition of boolean expressions.
Note
The boolean operators are NOT the Python keywords and, or, and not, as those are special and cannot be overridden in Python.
AND operator: & (intersection)¶
Syntax:
<Query object> & <Query object>
Examples:
# find all documents where field tags contains the "turing award"
# and the age is greater than 40
(F["tags"][:] == "turing award") & (F["age"] > 40)
OR operator: | (union)¶
Syntax:
<Query object> | <Query object>
Examples:
# find all documents where the first_name is "jim"
# or last_name is "gray"
(F["first_name"] == "jim") | (F["last_name"] == "gray")
NOT operator: ~ (negation)¶
Syntax:
~<Query object>
Examples:
# find all documents whose title does not contain the term "confidential"
~(F["title"][:] == "confidential")
Field Expression: Array operators¶
nested operator¶
Syntax:
<field_ref>.nested(<query>)
nested
operator makes it easy to work with nested arrays of documents.
Examples:
# find all books authored by 'jim gray'
F["authors"].nested((F["first_name"] == "jim") & (F["last_name"] == "gray"))
# find all users who logged in from given ip on june 06, 2006
F["logins"].nested((F["ipv4"] == "10.6.6.6") & (F["login_date"] == "2006-06-06"))
Unnest operator¶
Syntax:
<field_ref>.unnest(alias=None)
unnest
operator should be called only on array fields and will return a query object to represent the SQL expression of the form UNNEST(<field_ref>).
Example:
# unnest all authors from the books collection, so that
# there is a record for every (book x book's author)
Q("books").join(F["authors"].unnest())

# find all books authored by 'jim gray'
(Q("books")
    .join(Q(F["authors"].unnest(), alias="a"))
    .where((F["a"]["first_name"] == "jim") & (F["a"]["last_name"] == "gray"))
)
Reference for Q and Query¶
- rockset.query.Q(query, alias=None)[source]¶
All query objects are constructed using the
Q(<collection-name>)
query builder construct and are then followed by a chain of query operators to build the full query expression.
- class rockset.query.Query(source=None, alias=None, child=None, children=None)[source]¶
- aggregate(*fields)[source]¶
Returns a new query object that when executed will aggregate results from the current query object by the list of fields provided as input.
Field reference objects can include one of the supported aggregate functions such as max, min, avg, sum, count, countdistinct, approximatecountdistinct, collect, as follows: <field_ref>.max(), <field_ref>.min(), and so on.
The list of fields provided as input can contain a mix of field references that include an aggregate function and field references that do not.
- Parameters
fields (list of FieldRef) – fields you wish to aggregate by
- Returns
new query object that includes the desired field aggregations
- Return type
- highest(limit, *fields)[source]¶
Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input in descending order and return top N defined by the limit parameter.
- Parameters
limit (int) – top N results you wish to fetch
fields (list of FieldRef) – fields you wish to sort by (descending)
- Returns
new query object that returns top N descending
- Return type
- join(query, on=None, join_type='JOIN')[source]¶
Returns a new query object that when executed will do an INNER JOIN of the current query object with the input query object based on the given join predicate.
- left_outer_join(query, on)[source]¶
Returns a new query object that when executed will do a LEFT OUTER JOIN of the current query object with the input query object based on the given join predicate.
- Parameters
- Returns
new query object that incorporates the left outer join
- Return type
- limit(limit, skip=0)[source]¶
Returns a new query object that when executed will only return a subset of the results. The query when executed will return no more than limit results after skipping the first skip number of results. The limit operator is most commonly used for pagination.
- Parameters
limit (int) – maximum number of results to return
skip (int) – the number of results to skip
- Returns
new query object that only returns the desired subset
- Return type
- lowest(limit, *fields)[source]¶
Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input in ascending order and return top N defined by the limit parameter.
- Parameters
limit (int) – top N results you wish to fetch
fields (list of FieldRef) – fields you wish to sort by (ascending)
- Returns
new query object that returns top N ascending
- Return type
- right_outer_join(query, on)[source]¶
Returns a new query object that when executed will do a RIGHT OUTER JOIN of the current query object with the input query object based on the given join predicate.
- Parameters
- Returns
new query object that incorporates the right outer join
- Return type
- select(*fields)[source]¶
Returns a new query object that when executed will only include the list of fields provided as input.
- Parameters
fields (list of FieldRef) – fields you wish to select
- Returns
new query object that includes the desired field selection
- Return type
- sqlbuild(sqlsel)[source]¶
Returns an SQLSelect object, which can be used to generate the SQL text for the query.
- sqlprepare(sqlsel)[source]¶
Returns an SQLSelect object, which can be used to build the SQL version of the query.
- unnest(field, field_alias=None, alias=None)[source]¶
Returns a new query object that when executed will unnest the specified array field present in the results of the current query object.
- Parameters
field (FieldRef) – array field that you wish to unnest
field_alias (str) – This field is required if the specified array field holds an array of scalar values. This field is optional if the specified array field holds an array of objects.
alias (str) – subquery alias name for the unnested fields. This field is optional and defaults to the input array field name.
- Returns
new query object for the unnested query
- Return type
Query
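For example, assuming a hypothetical books collection whose tags field holds an array of scalar strings, a sketch of unnesting that field might look like this:
from rockset import Q, F

# hypothetical collection and field names:
# expand every element of the "tags" array into its own result row;
# field_alias is required here because "tags" holds scalar values
q = Q("books").unnest(F["tags"], field_alias="tag")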
Reference for F
and FieldRef¶
- rockset.query.F = <rockset.query.FieldRef object>¶
F
is a field reference object that helps in building query expressions natively using Python language expressions. F uses Python operator overloading heavily, and operations on field references generate Query objects that can be used in conjunction with Q to compose complex queries.
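A minimal sketch of building a filter with F and combining it with Q (the collection and field names are illustrative):
from rockset import Q, F

# hypothetical collection and field names:
# documents where "status" equals 'failed' and "country" equals 'US'
q = Q("logins").where((F["status"] == 'failed') & (F["country"] == 'US'))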
- class rockset.query.FieldRef(name=None, parent=None, source=None)[source]¶
- array_agg()[source]¶
Returns a new FieldRef that represents an ARRAY_AGG() aggregation of the given field.
- Returns
FieldRef object that represents the desired collect aggregation.
- Return type
AggFieldRef
- avg()[source]¶
Returns a new FieldRef that represents an AVG() aggregation of the given field.
- Returns
FieldRef object that represents the desired avg aggregation.
- Return type
AggFieldRef
- count()[source]¶
Returns a new FieldRef that represents a COUNT() aggregation of the given field.
When called from a field, say F["username"].count(), then the SQL expression generated will be of the following form, which when executed will return the number of rows where "username" IS NOT NULL:
COUNT("username")
When called on the root field, say F.count(), then the SQL expression generated will be of the form:
COUNT(*)
- Returns
FieldRef object that represents the desired count aggregation.
- Return type
AggFieldRef
- countdistinct()[source]¶
Returns a new FieldRef that represents COUNT(DISTINCT <field_ref>) aggregation of the given field.
- Returns
FieldRef object that represents the desired countdistinct aggregation.
- Return type
AggFieldRef
- exists(inner_query)[source]¶
Returns a new query object that represents a SQL expression as an IN clause or an EXISTS clause.
If the current field ref is just F and represents the root field, then the SQL expression will be of the form:
EXISTS (<inner_query>)
If the current field ref represents a field within the collection, then the SQL expression will be of the form:
<field_ref> IN (<inner_query>)
Example usage:
Q("logins").where(F["source_ip"].exists(Q("wolves").select(F["ip"])))
will construct the following SQL:
SELECT * FROM "logins" WHERE "source_ip" IN (SELECT "ip" FROM "wolves")
- map_agg(fvalue)[source]¶
Returns a new FieldRef that represents a MAP_AGG(key=<self>, value=<fvalue>) aggregation of the given fields.
- Returns
FieldRef object that represents the desired map_agg aggregation.
- Return type
AggFieldRef
- max()[source]¶
Returns a new FieldRef that represents a MAX() aggregation of the given field.
- Returns
FieldRef object that represents the desired max aggregation.
- Return type
AggFieldRef
- min()[source]¶
Returns a new FieldRef that represents a MIN() aggregation of the given field.
- Returns
FieldRef object that represents the desired min aggregation.
- Return type
AggFieldRef
- nested(nested_query, field_alias=None, alias=None)[source]¶
Returns a new query object that matches all documents where the given inner query matches on one or more individual nested documents present within the field path of the given field.
Useful to run complex query expressions on fields that contain a nested array of documents.
Example
Say you have a collection where every document describes a book, and each document has an “authors” field that is a nested array of documents describing each author:
{"title": "Transaction Processing: Concepts and Techniques", "authors": [ {"first_name": "Jim", "last_name": "Gray"}, {"first_name": "Andreas", "last_name": "Reuter"}, ], "publisher": ... }
If you want to find all books where 'Jim Gray' was one of the authors, you can use the following query:
F["authors"].nested((F["first_name"] == 'Jim') & (F["last_name"] == 'Gray'))
Note: Constructing the same query as follows is incorrect:
# CAUTION: This is not the same as the query above
(F["authors"][:]["first_name"] == 'Jim') & (F["authors"][:]["last_name"] == 'Gray')
The incorrect version will return all books that have at least one author with first name 'Jim' and at least one author with last name 'Gray', but they need not be the same author. A book with two authors named 'Jim Smith' and 'Alice Gray' will also match, which is not what is intended.
- sum()[source]¶
Returns a new FieldRef that represents a SUM() aggregation of the given field.
- Returns
FieldRef object that represents the desired sum aggregation.
- Return type
AggFieldRef
- unnest(alias=None)[source]¶
Returns a new query object that represents a SQL expression of the form:
UNNEST(<field_ref>)
or, when alias is not None
UNNEST(<field_ref> as <alias>)
Example usage:
Q("books").join(F["authors"].unnest())
will construct the following SQL:
SELECT * FROM "books" CROSS JOIN UNNEST("books"."authors")
- Parameters
alias (str) – Required when the field ref is an array of scalars (such as ['vacation', 'beach', 'sand', 'dog']). Not required if the field is an array of objects.
- Returns
query object that represents the desired UNNEST() expression
- Return type
Query
Cursor¶
Introduction¶
Cursor objects are returned by the Client.sql() API. A cursor simply binds the query to a particular collection, and the query is not executed server side until the application starts to fetch results from the cursor.
Note
Cursors are never instantiated directly by the application and are always instantiated by Rockset Client sql()
APIs.
Fetch all results¶
Use the cursor’s results() method to fetch all the results of the query in one shot:
results = mycollection.sql(q).results()
The above technique will work well if the number of results returned by the query is limited, say because it employs a LIMIT clause.
For queries that return a large number of results, please use the cursor iterators as described below.
Iterators with automatic pagination¶
Cursor objects are iterables, so you can do something like:
results = mycollection.sql(q)
for r in results:
print(r)
Cursor objects support seamless automatic pagination to iterate over large result sets. The cursor iterator will fetch and buffer a small portion of the results and as the iterator reaches the end of the current batch of buffered results, it will automatically issue the query with the appropriate pagination parameters to buffer the next batch and seamlessly continue results iteration.
The default cursor iterator uses a batch size of 10,000. You can create a cursor
iterator with a different batch size by using the iter()
method.
Example using the default cursor iterator:
results = mycollection.sql(q)
for r in results:
print(r)
Example using a custom cursor iterator with batch size 200:
results = mycollection.sql(q)
for r in results.iter(200):
print(r)
- Cursor.iter()[source]¶
Returns an iterator that does seamless automatic pagination behind the scenes while fetching no more than the specified batch size number of results at a time.
- Parameters
batch (int) – maximum number of results fetched at a time
- Returns
Iterator that will return all results one Document object at a time with automatic pagination
- Return type
Iterator Object
Async requests¶
Cursors support asyncio.Future to schedule and run queries concurrently along with other async events in your application.
One can create an asyncio.Future from a cursor object using the
Cursor.async_request()
method. These futures are not scheduled in any async threads; the application has to schedule them in an asyncio event loop. Once the futures are scheduled and run to completion, the results of their respective queries can be accessed by calling future.result(). The return value of future.result() will be identical to calling the Cursor.results() API on the original query.
For example:
import asyncio

jims_future = people.sql(F["first_name"] == "Jim").async_request()
sfmerch_future = merchants.sql(F["zipcode"] == "94107").async_request()
# can schedule these asyncio.futures along with other futures
# issue both queries concurrently and block until both of them finish
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(jims_future, sfmerch_future))
all_jims = jims_future.result()
all_sfmerchants = sfmerch_future.result()
Even if a future was originally issued by an async_request() API call, one can still call the blocking Cursor.results() API to fetch the results synchronously. Cursor.results() will schedule the future, block until the future finishes execution, and then return those results. A subsequent future.result() call will return the query results immediately.
For example:
jims_cursor = people.sql(F["first_name"] == "Jim")
jims_future = jims_cursor.async_request()
# do a blocking results() will block on the future behind the scenes
results = jims_cursor.results()
# this will return immediately without incurring any server round-trip
results2 = jims_future.result()
- Cursor.async_request()[source]¶
Returns an asyncio.Future object that can be scheduled in an asyncio event loop. Once scheduled and run to completion, the results can be fetched via the future.result() API. The return value of future.result() will be the same as the return value of Cursor.results()
- Returns
Returns a Future object that can be scheduled in an asyncio event loop and future.result() will hold the same return value as Cursor.results()
- Return type
asyncio.Future
Reference¶
- class rockset.cursor.Cursor(q=None, client=None, generate_warnings=False, timeout=None)[source]¶
Fetch the results of a query executed against a collection
- async_request()[source]¶
Returns an asyncio.Future object that can be scheduled in an asyncio event loop. Once scheduled and run to completion, the results can be fetched via the future.result() API. The return value of future.result() will be the same as the return value of Cursor.results()
- Returns
Returns a Future object that can be scheduled in an asyncio event loop and future.result() will hold the same return value as Cursor.results()
- Return type
asyncio.Future
- fields()[source]¶
Once the query has been executed, this method will return all the fields that are present in the query result, while preserving the order in which those fields were selected in the input query.
- NOTE: this method will return None when called before the query has
actually been executed.
- Returns
List of column headers in the query result, preserving the ordering from the input query.
- Return type
list[dicts]
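A short sketch of how fields() might be used, assuming rs is an existing Client object and q is a Query object:
# rs and q are assumed to already exist
cursor = rs.sql(q)
results = cursor.results()   # executes the query
print(cursor.fields())       # column headers, in the order selected by q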
- iter(batch=10000)[source]¶
Returns an iterator that does seamless automatic pagination behind the scenes while fetching no more than the specified batch size number of results at a time.
- Parameters
batch (int) – maximum number of results fetched at a time
- Returns
Iterator that will return all results one Document object at a time with automatic pagination
- Return type
Iterator Object
- results()[source]¶
Execute the query and fetch all the results in one shot.
- Returns
Results of the query represented as a list of Document objects
- Return type
list[Document]
- rowcount()[source]¶
Number of rows returned by the last query
- Returns
The number of rows that the last query produced. Returns None in case no query has been executed yet.
- Return type
int
Document¶
Introduction¶
A Document object represents a single record or row in the result set retrieved from executing a query.
Note
Document objects are generally instantiated by the Cursor object and do not need to be instantiated directly by the application while retrieving results of a query.
This class adapts Rockset’s SQL types to standard Python types for all the top level fields retrieved by the query.
Rockset Data Type | Python Data Type | Comments
---|---|---
SQL NULL | None |
JSON NULL | None |
bool | bool |
int | int |
float | float |
string | str |
bytes | str |
array | list |
object | dict |
date | datetime.date |
datetime | datetime.datetime | tzinfo will be None
time | datetime.time |
timestamp | str |
month_interval | dict |
microsecond_interval | datetime.timedelta |
geography.Point | geojson.Point |
geography.LineString | geojson.LineString |
geography.Polygon | geojson.Polygon |
Note
Please note that this type adaptation is only done for the top level fields returned in a query. If a top level field retrieved by the query is a map or an array, then fields nested within that map or array are not adapted to the respective Python data types.
Reference¶
- class rockset.document.Document(*args, **kwargs)[source]¶
Represents a single record or row in query results.
This is a sub-class of dict. So, treat this object as a dict for all practical purposes.
Only the constructor is overridden to handle the type adaptations shown in the table above.
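Since Document is a dict subclass, query results can be consumed with ordinary dict operations. A minimal sketch, assuming rs is an existing Client object and using illustrative collection and field names:
from rockset import Q

# hypothetical collection and field names; rs is an existing Client
for doc in rs.sql(Q("users").limit(10)):
    # each result is a Document, i.e. a dict of field name to adapted value
    print(doc["name"], doc.get("email"))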
Limits¶
This section lists all the system-wide limits, such as the maximum size of a document that can be added to a collection, along with other limits relating to field sizes.
- Client.MAX_DOCUMENT_SIZE_BYTES = 41943040¶
Maximum allowed size of a single document
- Client.MAX_FIELD_NAME_LENGTH = 10240¶
Maximum allowed length of a field name
- Client.MAX_FIELD_VALUE_BYTES = 4194304¶
Maximum allowed size of a field value
- Client.MAX_ID_VALUE_LENGTH = 10240¶
Maximum allowed length of
_id
field value
- Client.MAX_NAME_LENGTH = 2048¶
Maximum allowed length of a collection name
- Client.MAX_NESTED_FIELD_DEPTH = 30¶
Maximum allowed levels of depth for nested documents
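These constants are exposed on the Client class, so an application can, for example, sanity-check a document before adding it to a collection. A rough sketch (sizing the document via its JSON encoding is an approximation, not Rockset's exact accounting):
import json

from rockset import Client

# approximate the size of a document using its JSON encoding
doc = {"name": "example", "tags": ["a", "b"]}
approx_size = len(json.dumps(doc).encode("utf-8"))
if approx_size > Client.MAX_DOCUMENT_SIZE_BYTES:
    raise ValueError("document exceeds the maximum allowed size")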
Exceptions¶
Introduction¶
Various Python exceptions thrown by the rockset module are explained in this section, along with possible reasons and remedies to assist in troubleshooting.
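A minimal sketch of catching the exception classes described below (the query, the collection name, and the client-side timeout keyword are illustrative assumptions):
from rockset import Client, Q
from rockset.exception import InputError, RequestTimeout

rs = Client()
try:
    # hypothetical collection name; timeout is a client-side deadline in seconds
    results = rs.sql(Q("users").limit(10), timeout=30).results()
except InputError as e:
    # malformed request, e.g. a query syntax error
    print("input error:", e.code, e.message)
except RequestTimeout as e:
    print("query timed out after", e.timeout, "seconds")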
Authentication Errors¶
The server is rejecting your request because you have either an expired or an invalid token. Ensure you have a valid API key or generate a new one using the Rockset Console before trying your request again.
Input Errors¶
The server is unable to understand the API request as it was sent. This most likely means the API was badly formed (like the input query has a syntax error). When you encounter this error, please refer to the relevant documentation and verify if the request is constructed properly and if the resource is still present.
- class rockset.exception.InputError(**kwargs)[source]¶
User request has a missing or invalid parameter and cannot be processed as is. Syntax errors in queries fall in this category.
- code¶
HTTP status code obtained from server
- Type
int
- message¶
error message with more details
- Type
str
- type¶
error sub-category
- Type
str
Limit Reached¶
The server could understand the input request but refuses to execute it. This commonly happens when an account limit has been reached. Please reach out to Rockset Support with more details to alter your account limit.
- class rockset.exception.LimitReached(**kwargs)[source]¶
The API request has exceeded some user-defined limit (such as max deadline set for a query) or a system limit. Refer to documentation to increase the limit or reach out to Rockset support with more details to alter your account limit.
- code¶
HTTP status code obtained from server
- Type
int
- message¶
error message with more details
- Type
str
- type¶
error sub-category
- Type
str
Not Yet Implemented¶
Your API request needs a feature that is not present in your cluster in order to complete. Either your cluster needs an upgrade, or this feature is on our roadmap and we haven't gotten around to implementing it yet. Please reach out to Rockset support with more details to help us prioritize this feature.
- class rockset.exception.NotYetImplemented(**kwargs)[source]¶
Your request is expecting a feature that has not been deployed in your cluster or has not yet been implemented. Please reach out to Rockset support with more details to help us prioritize this feature. Thank you.
- code¶
HTTP status code obtained from server
- Type
int
- message¶
error message with more details
- Type
str
- type¶
error sub-category
- Type
str
Request Timeouts¶
The server did not complete the API request before the timeout you set for the request expired. To troubleshoot, see if your request succeeds when you don't set a timeout. If it does, then you need to recalibrate your timeout value. If it doesn't, then debug the issue based on the new error you receive.
- class rockset.exception.RequestTimeout(**kwargs)[source]¶
Request timed out.
Many API calls allow a client side timeout to be specified. When specified, this exception will be thrown when the timeout expires and the API call has not received a valid response or an error from the servers.
- message¶
timeout error message
- Type
str
- timeout¶
timeout specified with the API call, in seconds
- Type
int
Server Errors¶
These errors mean the server correctly parsed the input request, but couldn't process it for some reason. If a particular request or application is seeing this while other requests are fine, then you probably uncovered a bug in Rockset. Please contact Rockset support to report the bug, and we will provide a time estimate for resolution and send you a t-shirt.
- class rockset.exception.ServerError(**kwargs)[source]¶
Something totally unexpected happened on our servers while processing your request and most likely you have encountered a bug in Rockset. Please contact Rockset support and provide all the details you received along with the error for quick diagnosis, resolution, and to collect your t-shirt.
- code¶
HTTP status code obtained from server
- Type
int
- message¶
error message with more details
- Type
str
- type¶
error sub-category
- Type
str
Transient Server Errors¶
When many of your requests are failing with TransientServerErrors, it means our servers are going through a period of instability or unplanned downtime. This always means our alerts are firing, our pagers are ringing, phones are buzzing, and little adorable kittens are getting lost in the woods. We are actively investigating and fixing this issue. Look for updates on our status page with estimates on time to resolution. Sorry.
- class rockset.exception.TransientServerError(**kwargs)[source]¶
Some transient hiccup made us fail this request. This means our oncall engineers are actively working on the issue and should resolve it soon. Please retry after some time. Sorry.
- code¶
HTTP status code obtained from server
- Type
int
- message¶
error message with more details
- Type
str
- type¶
error sub-category
- Type
str