How to Do Data Science Using SQL on Raw JSON

January 10, 2019


This post outlines how to use SQL for querying and joining raw data sets like nested JSON and CSV - for enabling fast, interactive data science.

Data scientists and analysts deal with complex data. Much of what they analyze could be third-party data, over which there is little control. In order to make use of this data, significant effort is spent in data engineering. Data engineering transforms and normalizes high-cardinality, nested data into relational databases or into an output format that can then be loaded into data science notebooks to derive insights. At many organizations, data scientists or, more commonly, data engineers implement data pipelines to transform their raw data into something usable.

Data pipelines, however, regularly get in the way of data scientists and analysts getting to insights with their data. They are time-consuming to write and maintain, especially as the number of pipelines grows with each new data source added. They are often brittle, don't handle schema changes well, and add complexity to the data science process. Data scientists are typically dependent on others—data engineering teams—to build these pipelines as well, reducing their speed to value with their data.

data pipelines

Analyzing Third-Party Data to Support Investment Decisions

I've had the opportunity to work with a number of data scientists and analysts in investment management firms, who are analyzing complex data sets in order to support investment decisions. They increasingly bring in alternative, third-party data—app usage, website visits, people hired, and fundraising—to enhance their research. And they use this data to evaluate their existing portfolio and source new investment opportunities. The typical pipeline for these data sets includes scripts and Apache Spark jobs to transform data, relational databases like PostgreSQL to store the transformed data, and finally, dashboards that serve information from the relational database.

In this blog, we take a specific example where a data scientist may combine two data sets—an App Annie nested JSON data set that has statistics of mobile app usage and engagement, and Crunchbase CSV data set that tracks public and private companies globally. The CSV data to be queried is stored in AWS S3. We will use SQL to transform the nested JSON and CSV data sets and then join them together to derive some interesting insights in the form of interactive data science, all without any prior preparation or transformation. We will use Rockset for running SQL on the JSON and CSV data sets.

Understanding the shape of the nested JSON data set using Jupyter notebook

We begin by loading the App Annie dataset into a Rockset collection named app_annie_monthly. App Annie data is in the form of nested JSON, and has up to 3 levels of nested arrays in it. It has descriptions of fields in columns, including statistics of Monthly Active Users (MAU) that we'll be using later. The rows contain the data corresponding to those columns in the description.

app annie json

Following this, we can set up our Jupyter notebook configured to use our Rockset account. Immediately after setup, we can run some basic SQL queries on the nested JSON data set that we have loaded.

data science jupyter 1

Running SQL on nested JSON data

Once we have understood the overall structure of the nested JSON data set, we can start unpacking the parts we're interested in using the UNNEST command in SQL. In our case, we care about the app name, the percentage increase in MAU month over month, and the company that makes the app.

data science jupyter 2

Once we have gotten to this table, we can do some basic statistical calculations by exporting the data to dataframes. Dataframes can be used to visualize the percentage growth in MAU over the data set for a particular month.

data science jupyter 3

Using SQL to join the nested JSON data with CSV data

Now we can create the crunchbase_funding_rounds collection in Rockset from CSV files stored in Amazon S3 so that we can query them using SQL. This is a fairly simple CSV file with many fields. We are particularly interested in some fields: company_name, country_code, investment_type, investor_names, and last_funding. These fields provide us additional information about the companies. We can join these on the company_name field, and apply a few additional filters to arrive at the final list of prospects for investment, ranked from maximum to minimum increase in MAU.



-- # compute application statistics, MAU and percent change in MAU.
appStats AS 
      rows.r[2][1]."name" AS app,
      rows.r[2][1]."company_name" AS company,
      rows.r[4][1] AS mau,
      rows.r[4][4] AS mau_percent_change 
      app_annie_monthly a,
      unnest(a."data"."table"."rows" AS r) AS rows 
      a._meta.s3.path LIKE 'app\_annie/monthly/2018-05/01/data/all\_users\_top\_usage\_US\_iphone\_100\_%' 

-- # Get list of crunchbase orgs to join with.
crunchbaseOrgs AS 
      founded_on AS founded_on,
      uuid AS company_uuid,
      short_description AS short_description,
      company_name as company_name

-- # Get the JOINED relation from the above steps.
appStatsWithCrunchbaseOrgs as 
   SELECT as App,
      appStats.mau as mau,
      appStats.mau_percent_change as mau_percent_change,
      crunchbaseOrgs.company_uuid as company_uuid,
      crunchbaseOrgs.company_name as company_name,
      crunchbaseOrgs.founded_on as founded_on,
      crunchbaseOrgs.short_description as short_description
         ON = crunchbaseOrgs.company_name 

-- # Compute companyStatus = (IPO|ACQUIRED|CLOSED|OPERATING)
-- # There may be more than one status associated with a company, so, we do the Group By and Min.
companyStatus as 
) as status 

-- #  JOIN with companyStatus == (OPERATING), call it ventureFunded
ventureFunded as (SELECT
      ON appStatsWithCrunchbaseOrgs.company_name = companyStatus.company_name 
      AND companyStatus.status = 4),

-- # Find the latest round that each company raised, grouped by company UUID
latestRound AS 
      company_uuid as cuid,
      max(announced_on) as announced_on,
      max(raised_amount_usd) as raised_amount_usd 

-- # Join it back with crunchbase_funding_rounds to get other details about that company
fundingRounds AS 
      cfr.company_uuid as company_uuid,
      cfr.announced_on as announced_on,
      cfr.funding_round_uuid as funding_round_uuid,
      cfr.company_name as company_name,
      cfr.investment_type as investment_type,
      cfr.raised_amount_usd as raised_amount_usd,
      cfr.country_code as country_code,
      cfr.state_code as state_code,
      cfr.investor_names as investor_names 
      "crunchbase_funding_rounds" cfr 
     ON latestRound.company_uuid = cfr.company_uuid 
     AND latestRound.announced_on = cfr.announced_on

-- # Finally, select the dataset with all the fields that are interesting to us. ventureFundedAllRegions
ventureFundedAllRegions AS (
       ventureFunded.App as App,
       ventureFunded.company_name as company_name,
       ventureFunded.mau as mau,
       ventureFunded.mau_percent_change as mau_percent_change,
       ventureFunded.short_description as short_description,
       fundingRounds.announced_on as last_funding,
       fundingRounds.raised_amount_usd as raised_amount_usd,
       fundingRounds.country_code as country_code,
       fundingRounds.state_code as state_code,
       fundingRounds.investor_names as investor_names,
       fundingRounds.investment_type as investment_type 
          ON fundingRounds.company_uuid = ventureFunded.company_uuid)

SELECT * FROM ventureFundedAllRegions
   mau_percent_change DESC LIMIT 10

This final large query does several operations one after another. In order, the operations that it performs and the intermediate SQL query names are:

  • appStats: UNNEST operation on the App Annie dataset that extracts the interesting fields into a format resembling a flat table.
  • crunchbaseOrgs: Extracts relevant fields from the crunchbase collection.
  • appStatsWithCrunchbaseOrgs: Joins the App Annie and Crunchbase data on the company name.
  • companyStatus: Sets up filtering for companies based on their current status - IPO/Acquired/Closed/Operating. Each company may have multiple records but the ordering ensures that the latest status is captured.
  • ventureFunded: Uses the above metric to filter out organizations that are not currently privately held and operating.
  • latestRound: Finds the latest funding round—in total sum invested (USD) and the date when it was announced.
  • fundingRounds & ventureFundedAllRegions: Wrap it all together and extract other details of relevance that we can use.

Data Science Insights on Prospective Investments

We can run one final query on the named query we have, ventureFundedAllRegions to generate the best prospective investments for the investment management firm.

data science jupyter 4

As we see above, we get data that can help with decision making from an investment perspective. We started with applications that have posted significant growth in active users month over month. Then we performed some filtering to impose some constraints to improve the relevance of our list. Then we also extracted other details about the companies that created those applications and came up with a final list of prospects above. In this entire process, we did not make use of any ETL processes that transform the data from one format to another or wrangle it. The last query which was the longest took less than 4 seconds to run, due to Rockset's indexing of all fields and using those indexes to speed up the individual queries.