9. Februar 2019

Inspecting air pollution data from OpenAQ using Colab, Pandas, and BigQuery

OpenAQ is publishing real-time air quality data from around the world to BigQuery. Today, I have poked around in the dataset to inspect air quality from many places of the world.

Why bother

As mentioned in a news bullet from Google Cloud, U.S. EPA and OpenAQ air quality data now available in BigQuery. I decided to have a look at the data for the following two reasons:

  1. I want to learn how to work with data from BigQuery in Colab.
  2. I care about breathing air of good quality.

In case you’re interested in any of the above, continue reading. All of what I am writing here is my personal opinion.

If you haven’t heard of it, BigQuery is a data warehouse by Google for analytics, and allows you to query enormous amounts of data, expressing your needs in SQL. Colab allows you to interactively work with data in Python.

Download Colab notebook

You can find the Colab notebook which I used for this article here:

File: Inspecting_air_pollution_data_from_OpenAQ_using_Colab_Pandas_and_BigQuery.ipynb [135.03 kB]
Category:
Download: 39

You can open this Colab notebook using Go to File>Upload Notebook… in Colab.

Fetching the OpenAQ data with BigQuery in Colab

OpenAQ (https://openaq.org) “aggregate[s] physical air quality data from public data sources provided by government, research-grade and other sources.” with the purpose of fighting air inequality.

I might not contribute much to OpenAQ’s purpose today (except spreading the word), but am curious about air pollution, particularly since I moved to London about five years ago. I also played with a portable sensor once, see my earlier post on Air Pollution – Green, Yellow, Red: Toying with CleanSpace.

But back to the task at hand. OpenAQ regularly publishes the latest readings on air pollution to a public dataset stored in BigQuery (bigquery-public-data.openaq.global_air_quality). The OpenAQ air quality dataset is small (at the time of writing, it has about 25k), so much of the potential power behind BigQuery is not really required because I can simply load the whole table into memory in Colab.

I should mention that there is an alternative to Colab, which is Cloud Datalab. Both are based on similar technology (Jupyter notebooks), but Cloud Datalab is more closely integrated into Google Cloud Platform (GCP). Today, I don’t need anything else than a connection to BigQuery, so Colab it is. The very first thing is to authenticate you as a user such that it is possible to run the BigQuery jobs against my GCP project.

#@title Authenticate user
from google.colab import auth
auth.authenticate_user()

As I said, let’s load the whole dataset into memory.

#@title Pull OpenAQ data via BigQuery
%%bigquery --project silicon-perigee-208521 p
#standardSQL
select
  pollutant,
  country,
  city,
  location,
  value
from `bigquery-public-data.openaq.global_air_quality`;

Inspecting data with Pandas in Colab

The next step is preparing the data such that I can easily pull out the worst spots reported in the world, and within each country (see pandas.DataFrame.rank).

#@title For each pollutant, calculate rank of each location
#          within its country and world-wide
# Rank within world
p = p.assign(world_rank=(
      p.groupby(['pollutant'])['value']
        .rank(method='min', ascending=False)
        .astype(int)))
# Rank within country
p = p.assign(country_rank=(
      p.groupby(['pollutant', 'country'])['value']
        .rank(method='min', ascending=False)
        .astype(int)))
# Rank within city
p = p.assign(city_rank=(
      p.groupby(['pollutant', 'country', 'city'])['value']
        .rank(method='min', ascending=False)
        .astype(int)))
p

#@title Pollutants in dataset
print(', '.join(p['pollutant'].unique()))

pm25, co, bc, no2, pm10, o3, so2

Thus, there are seven different pollutants recorded in the dataset. I was not able to find a reference, but by looking at https://www.epa.gov/criteria-air-pollutants/naaqs-table, and by interpreting some of the codes as empirical formulas for molecules, we have:

no2 Nitrogen Dioxide
so2 Sulphur Dioxide
o3 Ozone
bc Black Carbon
co Carbon Monoxide
pm25 Particulate Matter 2.5
pm10 Particulate Matter 10

Nitrogen Dioxide (NO2)

Let us look at the three locations with the highest NO2 concentration─of course this only includes locations where there is a measurement in the dataset.

#@title Top three locations by Nitrogen Dioxide (NO2)
(p.query('pollutant == "no2"')
  .nsmallest(3, 'world_rank')
  [['country', 'city', 'location', 'value', 'world_rank']])

	country	city		location			value	world_rank
19239	XK	Hani i Elezit	Hani i Elezit			397.926	1
6078	IN	Solapur		Solapur, Solapur - MPCB		268.360	2
16204	IN	Kalaburagi	Lal Bahadur Shastri Nagar...	229.210	3

Hani i Elezit is in Kosovo. Note that these aren’t long term observations, and I can only take a guess here at why this location is listed high. There is a cement factory there, and “cement plants are a significant source of sulfur dioxide, nitrogen oxide and carbon monoxide” according to the EPA (https://www.epa.gov/enforcement/cement-manufacturing-enforcement-initiative).

Solapur is reported as “one of the most-polluted cities in Maharashtra” (https://en.wikipedia.org/wiki/Solapur#Environment). Kalaburagi is also located in India, but I have not come up with any hypotheses here.

#@title Top three locations in Great Britain by Nitrogen Dioxide (NO2)
(p.query('pollutant == "no2" & country == "GB"')
  .nsmallest(3, 'country_rank')
  [['city', 'location', 'value', 'world_rank']])

	city		location			value	world_rank
4585	Belfast		Belfast Stockman's Lane		63.0	359
6707	Chepstow	Chepstow A48			60.0	418
6442	Aberdeen	Aberdeen Union Street Roadside	54.0	544

Sulphur Dioxide (SO2)

#@title Top three locations by Sulphur Dioxide (SO2)
(p.query('pollutant == "so2"')
  .nsmallest(3, 'world_rank')
  [['country', 'city', 'location', 'value', 'world_rank']])

	country	city		location	value		world_rank
24629	MK	State Air Quality ...	Karpos		2.050019e+15	1
7092	CL	Hualpén			JUNJI		2.088350e+03	2
3412	ES	Murcia			ES1633A		1.301000e+03	3

The highest readings of SO2 were reported in Skopje, Macedonia (Karpoš is part of Skopje). I did quick research and found last year’s report from UN Enironment about The most polluted capital in Europe, you didn’t even know about, in which you can read that Skopje is Europe’s most pollted capital city (WHO Global Ambient Air Quality Database).

Hualpén has an oil refinery, and oil refineries are known to be sources for sulphur dioxide. So that’s a likely reason for the high readings.

You shall keep in mind here that I have not spent further time on evaluating what the usual range is for amounts of sulphur dioxide in the air. We also have to be careful when searching for maximum values, as any measurement errors (that are way off the scale) can easily distort the results.

#@title Top three locations in Great Britain by Sulphur Dioxide (SO2)
(p.query('pollutant == "so2" & country == "GB"')
  .nsmallest(3, 'country_rank')
  [['city', 'location', 'value', 'world_rank']])

	city		location		value	world_rank
19339	Middlesbrough	Middlesbrough		19.0	323
17408	Southampton	Southampton Centre	14.0	497
1188	Derry		Derry Rosemount		13.0	545

Middlesbrough, Southampton, and Derry (Londonderry) have made the top when I queried the database. These cities also have made it into the news for air pollution in the past:

Particulate Matter (PM2.5)

#@title Top three locations by PM2.5
(p.query('pollutant == "pm25"')
  .nsmallest(3, 'world_rank')
  [['country', 'city', 'location', 'value', 'world_rank']])

	country	city	location		value	world_rank
9849	CN	义乌市	江东			690.0	1
1695	IN	Patna	IGSC Planetarium...	674.0	2
9845	CN	义乌市	北苑			494.0	3

义乌市 ─ that’s Yiwu in Zhejiang, China.

Patna has been previously known as making it high in the list when it comes to air pollution (see references to WHO report in https://www.nytimes.com/2014/05/09/world/asia/cities-in-india-among-the-most-polluted-who-says.html).

#@title Top three locations in Great Britain by PM2.5
(p.query('pollutant == "pm25" & country == "GB"')
  .nsmallest(3, 'country_rank')
  [['city', 'location', 'value', 'world_rank']])

	city		location			value	world_rank
17442	Storrington	Storrington Roadside		62.0	523	
19611	West Midlands	Birmingham A4540 Roadside	28.0	1118	
8973	Newcastle	Newcastle Centre		17.0	1672

Newcastle has been mentioned in several reports about poor air quality in 2018 in its local newspaper, the Chronicle Live. More surprising was the result of my quick online research about Storrington, which looks on the map like a quiet town near the South Downs but recently got some fame: traffic-choked Storrington was named by the World Health Organisation among the 32 worst polluted areas in the UK.

Comparing Great Britain and Sweden (Nitrogen Dioxide)

To close this off, let’s just do a little bit of Panda querying and pull out the worst location with regards to NO2 in Great Britain and Sweden, respectively.

#@title Compare the worst locations (NO2) in Great Britain and Sweden
(p.query('pollutant == "no2"')
  .query('country_rank == 1')
  .query('(country == "GB") | (country == "SE")')
  [['country', 'city', 'location', 'value', 'world_rank']])

	country	city		location	value	world_rank
4585	GB	Belfast		Belfast Stockma...	63.0000	359
9241	SE	Stockholm	Sveavägen		42.6175	917

Belfast came up in the corresponding query above. And in Sweden, the worst location is near a central street in Stockholm.

Conclusion

You have now seen how we can pull OpenAQ air quality data from BigQuery into Colab, and then to find pollution hotspots within either the world, or within Great Britain. I have tried─with very limited time at hand, and rather superficial research─to come up with hypotheses why some of these places showed high values for pollutants.