• Loading Your Data
  • Collections
  • Configuring File Formats

Configuring File Formats

This page covers how each file format should be configured for data ingestion. The file formats currently supported include JSON, CSV, XML, Parquet, XLS and PDF.

#JSON Files

When a JSON file is ingested into a Rockset collection, each top-level, comma-separated or newline-separated, JSON object is parsed as one document. Within each of those JSON objects, the top-level keys will be parsed as column names, and their corresponding values will be parsed as document fields, including nested data which are parsed as nested fields.

If a new column name is parsed, Rockset will add a new column in the collection, and all other documents will have that field set to NULL.

For example, the following file is a valid JSON input:

{ "color": "red", "size": "small" }
{ "color": "blue", "size": "medium" }
{ "color": "green", "size": "large" }

It would be parsed into the following Rockset collection (excludes special fields):

+-----------+------------+
| color     | size       |
|-----------+------------+
| 'red'     | 'small'    |
| 'blue'    | 'medium'   |
| 'green'   | 'large'    |
+-----------+------------+

#CSV Files

When a CSV file is ingested into a Rockset collection, each line of each CSV file is parsed as one document. There are five configurations to be specified to ingest CSV files:

  • Header – This tells Rockset how to define names for each column in the collection. There are three supported options for this:

    • First line of file as column names - Select this option if the CSV source contains column names in the first line.
    • Specify Columns manually - Select this option if you want to provide custom names for each column in the CSV data source. This option will ask you to provide a name and datatype for each column.
    • Generate column names automatically - Rockset will automatically generate unique column names (c1, c2, ..) for the CSV data source.
  • Separator - The separator used in the CSV data source. Supported separators include comma (,) (default), tab ( ), and pipe (|).

  • Encoding - The encoding format. Supported encodings are UTF-8 (default), UTF-16, ISO 8859-1.

  • Quote Character - A one-character string used to quote fields containing special characters, such as quotechar (") (default) or space (\s).

  • Escape Character - A one-character string used to escape special characters, such as the backslash (\) (default) or quotechar (").

For example, the following file is a valid CSV input:

color,size
red,small
blue,medium
green,large

It would be parsed into the following Rockset collection (excludes special fields):

+-----------+------------+
| color     | size       |
|-----------+------------+
| 'red'     | 'small'    |
| 'blue'    | 'medium'   |
| 'green'   | 'large'    |
+-----------+------------+

#XML Files

When an XML file is ingested into a Rockset collection, each tag at the document tag level is parsed as one document. There are five configurations to be specified to ingest XML files:

  • Root Tag – The root tag specifies where to start parsing documents. If no root tag is specified, Rockset will ingest at all root tags it finds.
  • Document Tag – The document tag specifies what XML level corresponds to a document in Rockset. If no document tag is specified, the first tag encountered after the root tag will be used as the document tag.
  • Attribute Prefix – The attribute prefix is used to tell attributes apart from nested tags in a Rockset document. Attributes are transformed into key-value pairs in a Rockset document.
  • Value Tag The value tag is used for data within leaf tags with one or more attributes. The default value is value.
  • Encoding - The encoding format. Supported encodings are UTF-8 (default), UTF-16, ISO 8859-1.

#Parquet Files

When an Apache Parquet file is ingested into a Rockset collection, each row group of the data table stored by the columnar storage format is parsed as one document. Each column is individually decompressed and decoded using that column's metadata, and then each value in that column is parsed as a separate document field in Rockset, including all nested fields.

#XLS Files

When an XLS/XLSX file is ingested into a Rockset collection, each row of each spreadsheet is parsed as one document. Each document will have the following fields, based on the row it corresponds to:

  • rownum - the index of the row (starting at 1)
  • sheet.number - the index of the sheet the row is in (starting at 1)
  • sheet.name - the name of the sheet the row is in

Additionally, Rockset will add a field for each of the row's cells, with field name corresponding to the column's header (i.e., "A", "B", and so on).

For example, a collection that ingested data from Excel files could have the schema shown below.

DESCRIBE spreadsheet_data
+--------------------------------------------+---------------+---------+-----------+
| field                                      | occurrences   | total   | type      |
|--------------------------------------------+---------------+---------+-----------|
| ['B']                                      | 1             | 10860   | string    |
| ['B']                                      | 5384          | 10860   | float     |
| ['C']                                      | 10856         | 10860   | string    |
| ['D']                                      | 10855         | 10860   | string    |
| ['E']                                      | 10852         | 10860   | string    |
| ['F']                                      | 10852         | 10860   | string    |
| ['G']                                      | 5468          | 10860   | string    |
| ['G']                                      | 5385          | 10860   | float     |
| ['_event_time']                            | 10860         | 10860   | timestamp |
| ['_id']                                    | 10860         | 10860   | string    |
| ['_meta']                                  | 10860         | 10860   | object    |
| ['_meta', 'file_upload']                   | 10860         | 10860   | object    |
| ['_meta', 'file_upload', 'file']           | 10860         | 10860   | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 10860         | 10860   | string    |
| ['_meta', 'file_upload', 'upload_time']    | 10860         | 10860   | string    |
| ['rownum']                                 | 10860         | 10860   | int       |
| ['sheet']                                  | 10860         | 10860   | object    |
| ['sheet', 'name']                          | 10860         | 10860   | string    |
| ['sheet', 'number']                        | 10860         | 10860   | int       |
+--------------------------------------------+---------------+---------+-----------+

#PDF Files

When a PDF file is ingested into a Rockset collection, Rockset parses metadata and text data from the PDF. Important fields include:

  • text - the text of the pdf (typically where most of the information in a PDF resides)
  • _meta - metadata object of information regarding the pdf upload to Rockset.

For example, a collection that ingested data from PDF files could have the schema shown below.

DESCRIBE pdf_data
+--------------------------------------------+---------------+---------+-----------+
| field                                      | occurrences   | total   | type      |
|--------------------------------------------+---------------+---------+-----------|
| ['Author']                                 | 9             | 9       | string    |
| ['CreationDate']                           | 9             | 9       | string    |
| ['Creator']                                | 9             | 9       | string    |
| ['ModDate']                                | 9             | 9       | string    |
| ['Producer']                               | 9             | 9       | string    |
| ['Subject']                                | 9             | 9       | string    |
| ['Title']                                  | 9             | 9       | string    |
| ['_event_time']                            | 9             | 9       | timestamp |
| ['_id']                                    | 9             | 9       | string    |
| ['_meta']                                  | 9             | 9       | object    |
| ['_meta', 'file_upload']                   | 9             | 9       | object    |
| ['_meta', 'file_upload', 'file']           | 9             | 9       | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 9             | 9       | string    |
| ['_meta', 'file_upload', 'upload_time']    | 9             | 9       | string    |
| ['author']                                 | 9             | 9       | string    |
| ['creation_date']                          | 9             | 9       | int       |
| ['creator']                                | 9             | 9       | string    |
| ['modification_date']                      | 9             | 9       | int       |
| ['producer']                               | 9             | 9       | string    |
| ['subject']                                | 9             | 9       | string    |
| ['text']                                   | 9             | 9       | string    |
| ['title']                                  | 9             | 9       | string    |
+--------------------------------------------+---------------+---------+-----------+

Note that most of the data in a PDF will be stored in the text field, which may look like this

+--------------------------------------------------------------+
| text                                                         |
|--------------------------------------------------------------|
| ....                                                         |
| ....                                                         |
| Statement Date: 10/11/2018                                   |
| Your Account Summary                                         |
| ....                                                         |
| Total Amount Due:                                            |
| $157.57                                                      |
| Amount Enclosed:                                             |
| ...                                                          |
+--------------------------------------------------------------+