File Formats

When importing data from a source like S3 or Google Cloud Storage, your data will arrive as one or more files (the same is true if you manually upload a file). This page covers the file formats we support, which currently include JSON, CSV, XML, Parquet, XLS, and PDF.

JSON Files

When a JSON file is ingested into a Rockset collection, each top-level, comma-separated or newline-separated JSON object is parsed as one document. Within each of those JSON objects, the top-level keys are parsed as column names and their corresponding values as document fields; nested data is parsed as nested fields.

If a new column name is encountered, Rockset adds a new column to the collection, and all other documents have that field set to NULL.

For example, the following file is valid JSON input and would create three documents:

{ "color": "red", "size": "small" }
{ "color": "blue", "size": "medium" }
{ "color": "green", "size": "large" }
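The mapping above can be sketched in plain Python. This is an illustrative parser, not Rockset's actual implementation; the third object's `weight` key is added here only to show the NULL-fill behavior for new columns:

```python
import json

def parse_ndjson(text):
    """Parse newline-delimited JSON objects into documents,
    filling columns missing from a document with None (NULL)."""
    docs = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Collect every top-level key seen across all documents.
    columns = set()
    for doc in docs:
        columns.update(doc)
    # Any document missing a column gets NULL for that field.
    return [{col: doc.get(col) for col in sorted(columns)} for doc in docs]

ndjson = """\
{ "color": "red", "size": "small" }
{ "color": "blue", "size": "medium" }
{ "color": "green", "size": "large", "weight": 3 }
"""
documents = parse_ndjson(ndjson)
```

Here the first two documents end up with `weight` set to NULL, because the column only appears in the third object.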

CSV Files

When a CSV file is ingested into a Rockset collection, each line of each CSV file is parsed as one document. Five configurations must be specified to ingest CSV files:

  • Header – This tells Rockset how to define names for each column in the collection. There are
    two supported options for this:

    • First line of file as column names - Select this option if the CSV source contains column names
      in the first line.
    • Generate column names automatically - Rockset will automatically generate unique column names
      (c1, c2, ..) for the CSV data source.

Additional configuration involving your headers and fields can be done with Ingest Transformations.

  • Separator - The separator used in the CSV data source. Supported separators include
    comma (,) (default), tab (\t), and pipe (|).
  • Encoding - The encoding format. Supported encodings are UTF-8 (default), UTF-16,
    ISO 8859-1.
  • Quote Character - A one-character string used to quote fields containing special characters,
    such as quotechar (") (default) or space (\s).
  • Escape Character - A one-character string used to escape special characters, such as the
    backslash (\) (default) or quotechar (").

For example, the following file is a valid CSV input:

color,size
red,small
blue,medium
green,large

It would be parsed into the following Rockset collection (excludes special fields):

+-----------+------------+
| color     | size       |
|-----------+------------|
| 'red'     | 'small'    |
| 'blue'    | 'medium'   |
| 'green'   | 'large'    |
+-----------+------------+
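The five options above map closely onto Python's standard csv module. A hedged sketch (the parameter names below are the csv module's, not Rockset's) showing how the example file parses under the default settings, plus the automatic c1, c2, ... naming:

```python
import csv
import io

csv_text = "color,size\nred,small\nblue,medium\ngreen,large\n"

reader = csv.reader(
    io.StringIO(csv_text),
    delimiter=",",      # Separator (default: comma)
    quotechar='"',      # Quote Character (default: ")
    escapechar="\\",    # Escape Character (default: backslash)
)
rows = list(reader)

# "First line of file as column names": zip the header with each data row.
header, data = rows[0], rows[1:]
documents = [dict(zip(header, row)) for row in data]

# "Generate column names automatically": c1, c2, ... for every row instead.
auto_documents = [
    {f"c{i}": value for i, value in enumerate(row, start=1)} for row in rows
]
```

Encoding would be handled when reading the file itself, e.g. `open(path, encoding="utf-8")`.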

XML Files

When an XML file is ingested into a Rockset collection, each tag at the document tag level is parsed as one document. Five configurations must be specified to ingest XML files:

  • Root Tag: Specifies where to start parsing documents. If not specified, Rockset will ingest at
    all root tags it finds.
  • Document Tag: Specifies what XML level corresponds to a document in Rockset. If not specified,
    the first tag encountered after the root tag will be used as the document tag.
  • Attribute Prefix: Used to tell attributes apart from nested tags in a Rockset document.
    Attributes are transformed into key-value pairs in a Rockset document.
  • Value Tag: Used for data within leaf tags with one or more attributes. The default value is
    value.
  • Encoding: The encoding format. Supported encodings are UTF-8 (default), UTF-16,
    ISO 8859-1.
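A minimal sketch with Python's standard xml.etree showing how these options could shape the resulting documents. This is purely illustrative — the XML sample, the "@" attribute prefix, and the flattening logic are assumptions for the example, not Rockset's parser:

```python
import xml.etree.ElementTree as ET

xml_text = """
<items>
  <item color="red"><size>small</size></item>
  <item color="blue"><size unit="cm">medium</size></item>
</items>
"""

ATTRIBUTE_PREFIX = "@"   # tells attributes apart from nested tags
VALUE_TAG = "value"      # holds text of leaf tags that also have attributes

def element_to_doc(elem):
    # Attributes become key-value pairs, marked with the attribute prefix.
    doc = {ATTRIBUTE_PREFIX + k: v for k, v in elem.attrib.items()}
    for child in elem:
        if len(child) == 0 and not child.attrib:
            doc[child.tag] = (child.text or "").strip()  # plain leaf tag
        else:
            doc[child.tag] = element_to_doc(child)       # nested structure
    # A leaf with attributes keeps its text under the value tag.
    if elem.attrib and (elem.text or "").strip():
        doc[VALUE_TAG] = elem.text.strip()
    return doc

root = ET.fromstring(xml_text)                  # Root Tag: <items>
documents = [element_to_doc(e) for e in root]   # Document Tag: <item>
```

The second item shows why the value tag exists: `<size unit="cm">medium</size>` has both an attribute and text, so the text needs its own key.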

Parquet Files

When an Apache Parquet file is ingested into a Rockset collection, each row group of the data table stored by the columnar storage format is parsed as one document. Each column is individually decompressed and decoded using that column's metadata, and then each value in that column is parsed as a separate document field in Rockset, including all nested fields. Please ensure that all rows of your Parquet file contain all required metadata fields, which can be found in this Apache Parquet Format GitHub file. Snappy, gzip, and zstd compression formats are currently supported.
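Conceptually, the columnar layout means values are stored per column and must be recombined row-wise into documents. A toy pure-Python sketch of that recombination — no Parquet library involved, purely illustrative of the column-to-field mapping described above:

```python
# A decoded row group: one list of values per column (columnar layout).
row_group = {
    "color": ["red", "blue", "green"],
    "size": ["small", "medium", "large"],
}

# Recombine the columns row-wise: value i of every column
# becomes a field of document i.
columns = list(row_group)
documents = [
    dict(zip(columns, values)) for values in zip(*row_group.values())
]
```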

XLS Files

When an XLS/XLSX file is ingested into a Rockset collection, each row of each spreadsheet is parsed as one document. Each document will have the following fields, based on the row it corresponds to:

  • rownum - the index of the row (starting at 1)
  • sheet.number - the index of the sheet the row is in (starting at 1)
  • sheet.name - the name of the sheet the row is in

Additionally, Rockset will add a field for each of the row's cells, with field name corresponding to the column's header (such as A, B, and so on).
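A hedged sketch in plain Python (no spreadsheet library; the sheet name and cell values are made up for the example) of how one sheet's rows could map to documents with the fields listed above:

```python
import string

# Hypothetical sheet: a name, a 1-based sheet number, and rows of cell values.
sheet_name, sheet_number = "Sheet1", 1
rows = [
    ["red", "small"],
    ["blue", "medium"],
]

documents = []
for rownum, row in enumerate(rows, start=1):  # rownum starts at 1
    doc = {
        "rownum": rownum,
        "sheet": {"number": sheet_number, "name": sheet_name},
    }
    # One field per cell, named after the column header (A, B, ...).
    for letter, cell in zip(string.ascii_uppercase, row):
        doc[letter] = cell
    documents.append(doc)
```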

For example, a collection that ingested data from Excel files could have the schema shown below.

DESCRIBE spreadsheet_data
+--------------------------------------------+-------------+-------+-----------+
| field                                      | occurrences | total | type      |
|--------------------------------------------+-------------+-------+-----------|
| ['B']                                      | 1           | 10860 | string    |
| ['B']                                      | 5384        | 10860 | float     |
| ['C']                                      | 10856       | 10860 | string    |
| ['D']                                      | 10855       | 10860 | string    |
| ['E']                                      | 10852       | 10860 | string    |
| ['F']                                      | 10852       | 10860 | string    |
| ['G']                                      | 5468        | 10860 | string    |
| ['G']                                      | 5385        | 10860 | float     |
| ['_event_time']                            | 10860       | 10860 | timestamp |
| ['_id']                                    | 10860       | 10860 | string    |
| ['_meta']                                  | 10860       | 10860 | object    |
| ['_meta', 'file_upload']                   | 10860       | 10860 | object    |
| ['_meta', 'file_upload', 'file']           | 10860       | 10860 | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 10860       | 10860 | string    |
| ['_meta', 'file_upload', 'upload_time']    | 10860       | 10860 | string    |
| ['rownum']                                 | 10860       | 10860 | int       |
| ['sheet']                                  | 10860       | 10860 | object    |
| ['sheet', 'name']                          | 10860       | 10860 | string    |
| ['sheet', 'number']                        | 10860       | 10860 | int       |
+--------------------------------------------+-------------+-------+-----------+

PDF Files

When a PDF file is ingested into a Rockset collection, Rockset parses metadata and text data from the PDF. Important fields include:

  • text: The text of the PDF (typically where most of the information in a PDF resides)
  • _meta: The metadata object with information about the PDF upload to Rockset.

For example, a collection that ingested data from PDF files could have the schema shown below.

DESCRIBE pdf_data
+--------------------------------------------+-------------+-------+-----------+
| field                                      | occurrences | total | type      |
|--------------------------------------------+-------------+-------+-----------|
| ['Author']                                 | 9           | 9     | string    |
| ['CreationDate']                           | 9           | 9     | string    |
| ['Creator']                                | 9           | 9     | string    |
| ['ModDate']                                | 9           | 9     | string    |
| ['Producer']                               | 9           | 9     | string    |
| ['Subject']                                | 9           | 9     | string    |
| ['Title']                                  | 9           | 9     | string    |
| ['_event_time']                            | 9           | 9     | timestamp |
| ['_id']                                    | 9           | 9     | string    |
| ['_meta']                                  | 9           | 9     | object    |
| ['_meta', 'file_upload']                   | 9           | 9     | object    |
| ['_meta', 'file_upload', 'file']           | 9           | 9     | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 9           | 9     | string    |
| ['_meta', 'file_upload', 'upload_time']    | 9           | 9     | string    |
| ['author']                                 | 9           | 9     | string    |
| ['creation_date']                          | 9           | 9     | int       |
| ['creator']                                | 9           | 9     | string    |
| ['modification_date']                      | 9           | 9     | int       |
| ['producer']                               | 9           | 9     | string    |
| ['subject']                                | 9           | 9     | string    |
| ['text']                                   | 9           | 9     | string    |
| ['title']                                  | 9           | 9     | string    |
+--------------------------------------------+-------------+-------+-----------+

💡

Most of the data in a PDF will be stored in the text field, which may look like this:

+--------------------------------------------------------------+
| text                                                         |
|--------------------------------------------------------------|
| ....                                                         |
| ....                                                         |
| Statement Date: 10/11/2018                                   |
| Your Account Summary                                         |
| ....                                                         |
| Total Amount Due:                                            |
| $157.57                                                      |
| Amount Enclosed:                                             |
| ...                                                          |
+--------------------------------------------------------------+
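Because most of the information lands in the text field, downstream extraction usually comes down to string or regex matching over that field. A hedged sketch (the sample text and pattern are illustrative, not a Rockset API) pulling the amount due out of text like the statement above:

```python
import re

# Illustrative text field from an ingested PDF document.
text = """\
Statement Date: 10/11/2018
Your Account Summary
Total Amount Due:
$157.57
Amount Enclosed:
"""

# The amount appears on the line after "Total Amount Due:".
match = re.search(r"Total Amount Due:\s*\$([\d,]+\.\d{2})", text)
amount_due = float(match.group(1).replace(",", "")) if match else None
```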