How to Run SQL on PDF Files

PDFs are the de facto standard for distributing and sharing fixed-layout documents today. A quick survey of my laptop folders reveals account statements, receipts, technical papers, book chapters, and presentation slides—all PDFs. Lots of valuable information finds its way into all manner of PDF files. Which is a great reason for Rockset to support SQL queries on PDF files, in our mission to make data more usable to everyone.

Fast SQL on PDFs in Rockset

Rockset makes it easy for developers and data practitioners to ingest and run fast SQL on semi-structured data in a variety of data formats, such as JSON, CSV, and XLSX, without any upfront data prep. Now add PDFs to the mix, and users can combine PDF data with data of other formats, from various sources, into their SQL analyses. Or analyzing multiple PDFs together might be valuable too, if you have a series of electricity bills like I do, as we'll see in our short example below.

bill-pdf

Uploading PDFs

From an existing collection, click the Upload File button at the top right of the console and specify PDF format to ingest into Rockset.

pdf-upload

Querying Data in PDFs

I uploaded 9 months of electricity bills. We can use the DESCRIBE command to view the fields that were extracted from the PDFs.

> describe "elec-bills";
+--------------------------------------------+---------------+---------+-----------+
| field                                      | occurrences   | total   | type      |
|--------------------------------------------+---------------+---------+-----------|
| ['Author']                                 | 9             | 9       | string    |
| ['CreationDate']                           | 9             | 9       | string    |
| ['Creator']                                | 9             | 9       | string    |
| ['ModDate']                                | 9             | 9       | string    |
| ['Producer']                               | 9             | 9       | string    |
| ['Subject']                                | 9             | 9       | string    |
| ['Title']                                  | 9             | 9       | string    |
| ['_event_time']                            | 9             | 9       | timestamp |
| ['_id']                                    | 9             | 9       | string    |
| ['_meta']                                  | 9             | 9       | object    |
| ['_meta', 'file_upload']                   | 9             | 9       | object    |
| ['_meta', 'file_upload', 'file']           | 9             | 9       | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 9             | 9       | string    |
| ['_meta', 'file_upload', 'upload_time']    | 9             | 9       | string    |
| ['author']                                 | 9             | 9       | string    |
| ['creation_date']                          | 9             | 9       | int       |
| ['creator']                                | 9             | 9       | string    |
| ['modification_date']                      | 9             | 9       | int       |
| ['producer']                               | 9             | 9       | string    |
| ['subject']                                | 9             | 9       | string    |
| ['text']                                   | 9             | 9       | string    |
| ['title']                                  | 9             | 9       | string    |
+--------------------------------------------+---------------+---------+-----------+

Rockset parses out all the metadata like author, creation_date, etc. from the document along with the text.

The text field is typically where most of the information in a PDF resides, so let's examine what's in a sample text field.

+--------------------------------------------------------------+
| text                                                         |
|--------------------------------------------------------------|
| ....                                                         |
| ....                                                         |
| Statement Date: 10/11/2018                                   |
| Your Account Summary                                         |
| ....                                                         |
| Total Amount Due:                                            |
| $157.57                                                      |
| Amount Enclosed:                                             |
| ...                                                          |
+--------------------------------------------------------------+

Combining Data from Multiple PDFs

With my 9 months of eletricity bills ingested and indexed in Rockset, I can do some simple analysis of my usage over this timespan. We can run a SQL query to select the month/year and billing amount out of text.

> with details as (
    select tokenize(REGEXP_EXTRACT(text, 'Statement Date: .*'))[3] as month,
    tokenize(REGEXP_EXTRACT(text, 'Statement Date: .*'))[5] as year,
    cast(tokenize(REGEXP_EXTRACT(text, 'Total Amount Due:\n.*\nAmount Enclosed'))[4] as float) as amount
    from "elec-bills"
) 
select concat(month, '/', year) as billing_period, amount
from details
order by year asc, month;

+----------+------------------+
| amount   | billing_period   |
|----------+------------------|
| 47.55    | 04/2018          |
| 76.5     | 05/2018          |
| 52.28    | 06/2018          |
| 50.58    | 07/2018          |
| 47.62    | 08/2018          |
| 39.7     | 09/2018          |
| <null>   | 10/2018          |
| 72.93    | 11/2018          |
| 157.57   | 12/2018          |
+----------+------------------+

And plot the results in Superset.

pdf-graph

My October bill was surprisingly zero. Was the billing amount not extracted correctly? I went back and checked, and it turns out I received a California Climate Credit in October which zeroed out my bill, so ingesting and querying PDFs is working as it should!

Real-time SQL on raw data

Related Posts

true

Visualize Data in Rockset with Redash

Connect Redash to Rockset to create visualizations quickly and easily.
true

Rockset adds Excel spreadsheet support: Use SQL across XLSX files and join with other JSON, CSV or Parquet data

Run complex SQL across multiple Excel spreadsheets and join XLSX files with JSON, Parquet or CSV data.
true

How to Do Data Science Using SQL on Raw JSON

How to query nested JSON and CSV using SQL (including joins), without any upfront data preparation or complex data pipelines - for interactive data science using Python notebooks.