
HealthyHomes: Predicting local air quality for healthy housing decisions

Nearly 1 in 10 childhood asthma cases in Los Angeles are due to traffic pollution exposure. Similarly, recent studies from Oakland, CA have demonstrated up to 40% increases in heart attack risk among the elderly across just a few blocks of the city. Knowing these local risks matters: airborne pollutant concentrations can vary by a factor of eight across a single block, and long-term exposure to pollutants such as nitrogen dioxide and black carbon is a major driver of a number of chronic health problems. There is a clear need for more transparent exposure risk information so the public can make health-informed housing decisions.

As a Fellow in the Insight Health Data Science program, I decided to tackle this problem by creating a web-app, HealthyHomes, that predicts address-level pollutant exposures. It helps answer the question of how close is too close when it comes to living near highways or industry.

Do you think you live too close to the highway?

This web-app allows the user to input their address in East Bay cities (currently Oakland, Berkeley, Albany, Emeryville, and El Cerrito) and generate an immediate prediction of their exposure risk relative to the overall region. Users can also view exposure estimates through a map-based interface, and generate suggestions for nearby neighborhoods within their housing budget where pollutant exposures are reduced. A number of steps went into this project that I will outline below.

The Data and Machine Learning Pipeline

The overall data pipeline for building HealthyHomes. Features were engineered for the Google Street View pollutant mapping data using OpenStreetMaps, US Census, urban zoning, and weather data. Random forest models were then trained to predict gas concentrations throughout the East Bay. Finally, recommendations for healthier neighborhoods within your price range are provided by web scraping Zillow.
Leveraging a unique high-resolution pollutant mapping effort by Google and the Environmental Defense Fund.

1. The Data

The data needed to make address-level assessments is generally unavailable, as common traffic- and industry-related pollutants may only be measured at a few locations in a city. Though these stationary measurements are important, they don't allow the general public to understand their exposure risks on a neighborhood-to-neighborhood or block-by-block basis.

So how could I go about making these hyperlocal predictions? Here, I leveraged a unique dataset collected by the Environmental Defense Fund and Google Street View, in which pollutants were mapped at high resolution throughout large parts of Oakland from Google Street View cars.

A team of scientists at the University of Texas at Austin processed the raw measurements and provided median pollutant concentrations every 30 meters for all of West, Downtown, and parts of East Oakland. These mapping efforts give a unique insight into the variability of pollutant concentrations on a block-by-block basis and provide a means to model exposures throughout the greater East Bay. However, these data only provide a snapshot into certain locations of Oakland, so how did I go about creating a generalized model and providing predictions to those living in areas outside of the study zone?

Example of Google Street View coverage in Oakland. How can we predict in the areas not sampled?

2. Feature Engineering

To predict exposures at unique addresses I needed to create a generalizable feature set to describe pollutant concentrations. This presented a challenge as the Google Street View data only contained location information (lat, long) and gas concentrations. To overcome this challenge I brought together a diverse set of data sources, including OpenStreetMaps, US Census, city zoning, and weather data, to engineer a feature set upon which I could train a machine learning model. I created 20 features in total to predict concentrations of nitrogen dioxide and black carbon. These features fell into three main categories:

  1. Distance-based features: Such as distance to nearest highway, traffic intersection, or industrial area
  2. Census and zoning features: Such as the population density and zoning type for the region where a measurement took place
  3. Weather features: Such as the mean annual wind speed in 1 km² blocks
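The distance-based features above can be computed without any GIS stack once you have point coordinates for highways or industrial parcels (e.g. extracted from OpenStreetMap). Here is a minimal sketch using the haversine formula; the coordinates below are illustrative stand-ins, not real OSM data:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    R = 6_371_000  # mean Earth radius in meters
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def dist_to_nearest(lat, lon, points):
    """Distance (m) from a measurement site to the closest feature point,
    e.g. highway segments or industrial parcels from OpenStreetMap."""
    return min(haversine_m(lat, lon, p_lat, p_lon) for p_lat, p_lon in points)

# Toy example: a site in West Oakland and two hypothetical highway points.
highway_pts = [(37.8120, -122.2950), (37.8044, -122.2712)]
d = dist_to_nearest(37.8080, -122.2900, highway_pts)  # ~625 m to the first point
```

In practice one would densify highway polylines into points (or use shapely's nearest-geometry operations) so the minimum reflects the true segment distance rather than node spacing.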
Random forest predictions of nitrogen dioxide against the one-third test set

3. Model Development

At first I started simple, using multiple linear regression and generalized additive models with splines, but these models failed to capture important interactions within the data. From there, I moved on to tree-based models that could handle the skewed data and capture important interactions while still providing feature importances.

Both random forest regression and gradient boosting did a great job of predicting pollutant concentrations, achieving R² values of 0.95 and 0.84 on the one-third test set (~6,000 samples) for nitrogen dioxide and black carbon, respectively. Both models had near-identical test set accuracy (RMSE) following tuning, so I chose to implement random forests in the final product, since random forests are less prone to overfitting and my model needs to generalize beyond the study area. The most important model features were distance to nearest highway, distance to closest industrial area, average wind speed, whether the measurement occurred on a residential street, and the local population density. All of these features are closely tied to the intensity of car traffic and industrial activities.
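The training setup can be sketched with scikit-learn. This is a toy reproduction of the workflow on synthetic data (the real model uses the 20 engineered features and the Street View measurements), but it shows the same pieces: a one-third held-out test set, a tuned-ish random forest, R² evaluation, and feature importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
# Synthetic stand-ins for the engineered features (real model has 20):
# col 0 ~ distance to highway, col 3 ~ population density, etc.
X = rng.uniform(size=(n, 4))
# Pollution falls off with distance to highway and rises with density;
# noise mimics block-scale variability.
y = np.exp(-3 * X[:, 0]) + 0.5 * X[:, 3] + rng.normal(0, 0.05, n)

# One-third held-out test set, as in the article.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
r2 = r2_score(y_te, rf.predict(X_te))
importances = rf.feature_importances_  # distance feature should dominate
```

The `feature_importances_` attribute is what surfaces the distance-to-highway and wind-speed rankings discussed above.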

Now that I had a working machine learning model, I could provide address-specific estimates of pollutant exposures for any address queried throughout the East Bay. This is done on HealthyHomes by first extracting the exact location for an address using the Google Maps API, generating all 20 features for the location as described above, and then inputting these features into the trained random forest. The heatmap on the web-app is similarly created by generating features for a point grid evenly spaced across the East Bay at a 50 meter resolution.
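Generating that evenly spaced grid is a small geometry exercise: one degree of latitude is roughly 111,320 m, while longitude spacing shrinks with the cosine of latitude. A minimal sketch (the bounding box below is a small demo tile, not the full East Bay extent used in the app):

```python
import math

def point_grid(lat_min, lat_max, lon_min, lon_max, spacing_m=50):
    """Yield (lat, lon) points on an approximately evenly spaced grid.
    Longitude step is computed at the grid's mid-latitude."""
    dlat = spacing_m / 111_320
    mid_lat = (lat_min + lat_max) / 2
    dlon = spacing_m / (111_320 * math.cos(math.radians(mid_lat)))
    lat = lat_min
    while lat <= lat_max:
        lon = lon_min
        while lon <= lon_max:
            yield (lat, lon)
            lon += dlon
        lat += dlat

# Demo: a ~200 m x ~150 m tile near downtown Oakland (illustrative bounds).
pts = list(point_grid(37.80, 37.8018, -122.27, -122.2682))
```

Each grid point then gets the same 20-feature treatment as a queried address before being scored by the random forest to color the heatmap.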

HealthyHomes interface visualizing air quality. HealthyHomes provides air quality estimates at specified addresses and an interactive neighborhood heatmap. Areas that are red have worse air quality.

4. Neighborhood Suggestions

In a final step, I created alternative neighborhood suggestions by scraping all the current Zillow rental data for Oakland using BeautifulSoup to estimate average rent for each neighborhood in the city. The pollutant prediction heatmap was then used to estimate average pollutant exposures for each neighborhood. From there, I could find nearby neighborhoods with similar rents and lower pollutant exposures. Neighborhood suggestions are limited to Oakland, but could easily be extended with more extensive web scraping.