Sign up to receive blog updates in your inbox.
An incredible amount of business data is floating around in Excel spreadsheets - so data scientists often need to analyze data across multiple worksheets or even multiple spreadsheets using SQL. Additionally, this data may need to be joined with other data sets that are in JSON, CSV or Parquet formats.
Microsoft Excel currently has some basic SQL support in place:
- Use SQL for connecting to an external database like Access or SQL Server, parsing field or table contents and importing the data.
- Use SQL for reading a worksheet (
SELECT * FROM [Sheet1$]) or reading a range (
SELECT * FROM MyRange).
However it does not support complex SQL analysis across multiple spreadsheets and other data sets.
Using Rockset to analyze Excel spreadsheets
Rockset's core superpower is the ability to ingest different data formats like JSON, CSV or Parquet from different sources like local desktops, data lakes, streaming sources and online databases - and immediately power fast SQL across all these data sets. We recently added support for Excel spreadsheets (see documentation), which means you can now ingest XLSX files into a Rockset collection and instantly query across them using full-featured SQL with millisecond latency.
Start by creating a new collection, say MyCollection, in Rockset and ingesting your Excel spreadsheets. Your XLSX files can be uploaded from your local host using Rockset's file uploader, or bulk ingested from a data lake like AWS S3. Rockset will automatically parse and index the contents of the spreadsheet so that it is ready to query. We take advantage of Rockset's strong dynamic typing in SQL to achieve this.
Start by using the
DESCRIBE command to list the available fields in your collection. Each row in your spreadsheet will correspond to a document in Rockset. You might want to query the first several rows to see the shape of the data:
SELECT * FROM MyCollection ORDER BY rownum LIMIT 10
If you have other Rockset collections with other spreadsheets or nested JSON, CSV, Parquet data you can now run standard SQL to join and analyze your data sets. We often see examples of interesting data science on nested JSON.
Once you are done with your analysis you can use Rockset as the serving layer for an app or a live dashboard using a visualization tool of your choice.
As an example, here is an interesting analysis of trends in college financial aid using SQL across XLSX and CSV files.