Case Study: The Path to Better Pollution Forecasting Goes Through Nested JSON
March 19, 2019
Think about the steel industry in the US, and you’ll likely think of Pittsburgh. Known as the “Steel City” for leading the nation in steel production in the first half of the 20th century, Pittsburgh also went by the moniker “the Smoky City,” due to the air pollution from steel and other heavy industries. With increased regulation and the decline of the steel industry, Pittsburgh has gotten much cleaner since its darkest, smokiest days in the 1940s, but it still hasn’t shed all the vestiges of steel-related pollution.
Coke, one of the raw materials in steelmaking, is manufactured by heating coal at high temperatures. The largest coke plant in North America resides in Allegheny County, which includes Pittsburgh. During the coke production process, the facility emits a mixture of particulate and gas pollutants that can aggravate existing respiratory ailments, such as asthma and emphysema. This is where Pittsburgh resident, Doug Balog, a data engineer for a large retailer by day and civic hacker by night, comes into the picture. He aims to use his technical skills to bring about a greater recognition of the impact of pollution in his Pittsburgh community.
Gaining Greater Visibility into Pollution
Doug is particularly interested in tracking temperature inversions, so called because the normal decrease in temperature with altitude is inverted. During an inversion, a layer of warmer air traps cooler air close to the ground. This phenomenon also prevents smoke and air pollution from escaping, and exacerbates the poor air quality in the areas surrounding the coke plant.
Doug has been collecting National Weather Service (NWS) data on inversions for more than a year. He hopes to combine this weather data with crowdsourced pollution data—occurrences of pollution odors logged through a self-reporting app—for analysis. His goal is to reliably forecast periods of heavier pollution to provide adequate warning to sensitive populations, so that they may take appropriate precautionary measures. He also hopes to use the collected data to support calls for stricter enforcement of air pollution regulations by the county.
Taming Complex Weather Data Using Rockset
Doug has developed tools that scrape NWS forecasts hourly for about a hundred points within Allegheny County. The NWS data is represented in nested JSON format, which is difficult to handle in a relational database. The data either has to be converted into SQL columns, requiring a fixed schema along with considerable ETL, or stored in JSON columns that support limited indexing, neither of which is an ideal solution. Instead, using Rockset, Doug never has to specify any schema, and is able to run fast SQL queries directly on fully indexed JSON.
Doug also encounters unexpected situations with field types and values from the NWS data. To indicate gusting wind, the NWS data shows a value like “20G30,” for example, instead of a numeric value. With Rockset, Doug can ingest and analyze unanticipated data types and values without errors and without any additional data cleaning.
Accelerating the Path from Data to Insight
As a solo developer attempting to use data to help the community tackle pollution, Rockset has proven particularly useful to Doug, saving him significant time and effort compared to alternative approaches.
“There is a lot of data we can gather that can provide pieces of the answer to the problem of pollution in Pittsburgh, but it’s a difficult job to bring it together for analysis because the data quality is lacking. There’s always going to be something unexpected in the data that trips you up,” says Doug. “With Rockset, I don’t have to worry about data being typed or formatted in a way I didn’t anticipate, and I don’t have to modify my code every time the schema changes. Rockset just sucks in all the raw data and makes it accessible using SQL, so it's faster and easier to develop on the data.”
Having spent much of his career around data management, Doug is well aware of the true cost of standing up a SQL database to store his data. Using Rockset’s cloud service, he has been able to get a reliable SQL API into all his data, while avoiding the challenges associated with setting up and managing a database. In Doug's words, Rockset required no setup on his part, and creating Rockset collections for the NWS data was very easy—simply point Rockset to the data, with no data preparation required.
Doug’s next steps will be to find more uses for the data he has gathered. He is working to provide pollution researchers an interface for them to query the NWS data he has collected in Rockset. He also intends to train machine learning models on the data to predict pollution levels in the community.