Sign up to receive blog updates in your inbox.

February 21st, 2019
Fast SQL
How to Run SQL on PDF Files

PDFs are the de facto standard for distributing and sharing fixed-layout documents today. A quick survey of my laptop folders reveals account statements, receipts, technical papers, book chapters, and presentation slides—all PDFs. Lots of valuable information finds its way into all manner of PDF files. Which is a great reason for Rockset to support SQL queries on PDF files, in our mission to make data more usable to everyone.

Fast SQL on PDFs in Rockset

Rockset makes it easy for developers and data practitioners to ingest and run fast SQL on semi-structured data in a variety of data formats, such as JSON, CSV, and XLSX, without any upfront data prep. Now add PDFs to the mix, and users can combine PDF data with data of other formats, from various sources, into their SQL analyses. Or analyzing multiple PDFs together might be valuable too, if you have a series of electricity bills like I do, as we'll see in our short example below.

bill-pdf

Uploading PDFs

From an existing collection, click the Upload File button at the top right of the console and specify PDF format to ingest into Rockset.

pdf-upload

Querying Data in PDFs

I uploaded 9 months of electricity bills. We can use the DESCRIBE command to view the fields that were extracted from the PDFs.

> describe "elec-bills";
+--------------------------------------------+---------------+---------+-----------+
| field                                      | occurrences   | total   | type      |
|--------------------------------------------+---------------+---------+-----------|
| ['Author']                                 | 9             | 9       | string    |
| ['CreationDate']                           | 9             | 9       | string    |
| ['Creator']                                | 9             | 9       | string    |
| ['ModDate']                                | 9             | 9       | string    |
| ['Producer']                               | 9             | 9       | string    |
| ['Subject']                                | 9             | 9       | string    |
| ['Title']                                  | 9             | 9       | string    |
| ['_event_time']                            | 9             | 9       | timestamp |
| ['_id']                                    | 9             | 9       | string    |
| ['_meta']                                  | 9             | 9       | object    |
| ['_meta', 'file_upload']                   | 9             | 9       | object    |
| ['_meta', 'file_upload', 'file']           | 9             | 9       | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 9             | 9       | string    |
| ['_meta', 'file_upload', 'upload_time']    | 9             | 9       | string    |
| ['author']                                 | 9             | 9       | string    |
| ['creation_date']                          | 9             | 9       | int       |
| ['creator']                                | 9             | 9       | string    |
| ['modification_date']                      | 9             | 9       | int       |
| ['producer']                               | 9             | 9       | string    |
| ['subject']                                | 9             | 9       | string    |
| ['text']                                   | 9             | 9       | string    |
| ['title']                                  | 9             | 9       | string    |
+--------------------------------------------+---------------+---------+-----------+

Rockset parses out all the metadata like author, creation_date, etc. from the document along with the text.

The text field is typically where most of the information in a PDF resides, so let's examine what's in a sample text field.

+--------------------------------------------------------------+
| text                                                         |
|--------------------------------------------------------------|
| ....                                                         |
| ....                                                         |
| Statement Date: 10/11/2018                                   |
| Your Account Summary                                         |
| ....                                                         |
| Total Amount Due:                                            |
| $157.57                                                      |
| Amount Enclosed:                                             |
| ...                                                          |
+--------------------------------------------------------------+

Combining Data from Multiple PDFs

With my 9 months of eletricity bills ingested and indexed in Rockset, I can do some simple analysis of my usage over this timespan. We can run a SQL query to select the month/year and billing amount out of text.

> with details as (
    select tokenize(REGEXP_EXTRACT(text, 'Statement Date: .*'))[3] as month,
    tokenize(REGEXP_EXTRACT(text, 'Statement Date: .*'))[5] as year,
    cast(tokenize(REGEXP_EXTRACT(text, 'Total Amount Due:\n.*\nAmount Enclosed'))[4] as float) as amount
    from "elec-bills"
) 
select concat(month, '/', year) as billing_period, amount
from details
order by year asc, month;

+----------+------------------+
| amount   | billing_period   |
|----------+------------------|
| 47.55    | 04/2018          |
| 76.5     | 05/2018          |
| 52.28    | 06/2018          |
| 50.58    | 07/2018          |
| 47.62    | 08/2018          |
| 39.7     | 09/2018          |
| <null>   | 10/2018          |
| 72.93    | 11/2018          |
| 157.57   | 12/2018          |
+----------+------------------+

And plot the results in Superset.

pdf-graph

My October bill was surprisingly zero. Was the billing amount not extracted correctly? I went back and checked, and it turns out I received a California Climate Credit in October which zeroed out my bill, so ingesting and querying PDFs is working as it should!

Real-time Analytics at Cloud Scale.
Get started with $300 in free credits. No credit card required.
Share
Search and analyze your data in real-time

Ready to start building?

Related Posts

Index Scan: Using Rockset's Search Index to Speed up Range Scans Over a Specific Field

Rockset uses Converged Indexing to make different types of queries run fast. We look at how Rockset's Index Scan uses the search index to accelerate range scans.

Can I Do SQL-Style Joins in Elasticsearch?

While joins are primarily an SQL concept, they are equally important in the NoSQL world as well. We explore how to perform the equivalent of SQL joins when using Elasticsearch.

Joining Data in DynamoDB and S3 for Live, Ad-Hoc Analysis

Using SQL to join DynamoDB and S3 data, operations teams can perform live, ad-hoc analysis across multiple cloud systems.