How to Run SQL on PDF Files

February 21, 2019

,

PDFs are the de facto standard for distributing and sharing fixed-layout documents today. A quick survey of my laptop folders reveals account statements, receipts, technical papers, book chapters, and presentation slides—all PDFs. Lots of valuable information finds its way into all manner of PDF files. Which is a great reason for Rockset to support SQL queries on PDF files, in our mission to make data more usable to everyone.

Fast SQL on PDFs in Rockset

Rockset makes it easy for developers and data practitioners to ingest and run fast SQL on semi-structured data in a variety of data formats, such as JSON, CSV, and XLSX, without any upfront data prep. Now add PDFs to the mix, and users can combine PDF data with data of other formats, from various sources, into their SQL analyses. Or analyzing multiple PDFs together might be valuable too, if you have a series of electricity bills like I do, as we'll see in our short example below.

bill-pdf

Uploading PDFs

From an existing collection, click the Upload File button at the top right of the console and specify PDF format to ingest into Rockset.

pdf-upload

Querying Data in PDFs

I uploaded 9 months of electricity bills. We can use the DESCRIBE command to view the fields that were extracted from the PDFs.

> describe "elec-bills";
+--------------------------------------------+---------------+---------+-----------+
| field                                      | occurrences   | total   | type      |
|--------------------------------------------+---------------+---------+-----------|
| ['Author']                                 | 9             | 9       | string    |
| ['CreationDate']                           | 9             | 9       | string    |
| ['Creator']                                | 9             | 9       | string    |
| ['ModDate']                                | 9             | 9       | string    |
| ['Producer']                               | 9             | 9       | string    |
| ['Subject']                                | 9             | 9       | string    |
| ['Title']                                  | 9             | 9       | string    |
| ['_event_time']                            | 9             | 9       | timestamp |
| ['_id']                                    | 9             | 9       | string    |
| ['_meta']                                  | 9             | 9       | object    |
| ['_meta', 'file_upload']                   | 9             | 9       | object    |
| ['_meta', 'file_upload', 'file']           | 9             | 9       | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 9             | 9       | string    |
| ['_meta', 'file_upload', 'upload_time']    | 9             | 9       | string    |
| ['author']                                 | 9             | 9       | string    |
| ['creation_date']                          | 9             | 9       | int       |
| ['creator']                                | 9             | 9       | string    |
| ['modification_date']                      | 9             | 9       | int       |
| ['producer']                               | 9             | 9       | string    |
| ['subject']                                | 9             | 9       | string    |
| ['text']                                   | 9             | 9       | string    |
| ['title']                                  | 9             | 9       | string    |
+--------------------------------------------+---------------+---------+-----------+

Rockset parses out all the metadata like author, creation_date, etc. from the document along with the text.

The text field is typically where most of the information in a PDF resides, so let's examine what's in a sample text field.

+--------------------------------------------------------------+
| text                                                         |
|--------------------------------------------------------------|
| ....                                                         |
| ....                                                         |
| Statement Date: 10/11/2018                                   |
| Your Account Summary                                         |
| ....                                                         |
| Total Amount Due:                                            |
| $157.57                                                      |
| Amount Enclosed:                                             |
| ...                                                          |
+--------------------------------------------------------------+

Combining Data from Multiple PDFs

With my 9 months of eletricity bills ingested and indexed in Rockset, I can do some simple analysis of my usage over this timespan. We can run a SQL query to select the month/year and billing amount out of text.

> with details as (
    select tokenize(REGEXP_EXTRACT(text, 'Statement Date: .*'))[3] as month,
    tokenize(REGEXP_EXTRACT(text, 'Statement Date: .*'))[5] as year,
    cast(tokenize(REGEXP_EXTRACT(text, 'Total Amount Due:\n.*\nAmount Enclosed'))[4] as float) as amount
    from "elec-bills"
) 
select concat(month, '/', year) as billing_period, amount
from details
order by year asc, month;

+----------+------------------+
| amount   | billing_period   |
|----------+------------------|
| 47.55    | 04/2018          |
| 76.5     | 05/2018          |
| 52.28    | 06/2018          |
| 50.58    | 07/2018          |
| 47.62    | 08/2018          |
| 39.7     | 09/2018          |
| <null>   | 10/2018          |
| 72.93    | 11/2018          |
| 157.57   | 12/2018          |
+----------+------------------+

And plot the results in Superset.

pdf-graph

My October bill was surprisingly zero. Was the billing amount not extracted correctly? I went back and checked, and it turns out I received a California Climate Credit in October which zeroed out my bill, so ingesting and querying PDFs is working as it should!