Module 12: Maps
Let's draw some maps. 🗺🧐
A dot map with Altair
Let's start with Altair.
When your dataset is large, it's useful to enable Altair's "json" data transformer. Instead of embedding the whole dataset in the chart specification (and holding it all in memory), it writes the data to a temporary JSON file and references it by URL, which makes the whole plotting process much more efficient. For more information, check out: https://altair-viz.github.io/user_guide/data_transformers.html
import altair as alt
# saving data into a file rather than embedding into the chart
alt.data_transformers.enable('json')
DataTransformerRegistry.enable('json')
We need a dataset with geographical coordinates. This `zipcodes` dataset contains the location (latitude and longitude) of each US zip code area.
from vega_datasets import data
zipcodes_url = data.zipcodes.url
zipcodes = data.zipcodes()
zipcodes.head()
| | zip_code | latitude | longitude | city | state | county |
|---|---|---|---|---|---|---|
| 0 | 00501 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 1 | 00544 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 2 | 00601 | 18.165273 | -66.722583 | Adjuntas | PR | Adjuntas |
| 3 | 00602 | 18.393103 | -67.180953 | Aguada | PR | Aguada |
| 4 | 00603 | 18.455913 | -67.145780 | Aguadilla | PR | Aguadilla |
zipcodes = data.zipcodes(dtype={'zip_code': 'category'})
zipcodes.head()
| | zip_code | latitude | longitude | city | state | county |
|---|---|---|---|---|---|---|
| 0 | 00501 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 1 | 00544 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 2 | 00601 | 18.165273 | -66.722583 | Adjuntas | PR | Adjuntas |
| 3 | 00602 | 18.393103 | -67.180953 | Aguada | PR | Aguada |
| 4 | 00603 | 18.455913 | -67.145780 | Aguadilla | PR | Aguadilla |
zipcodes.zip_code.dtype
CategoricalDtype(categories=['00501', '00544', '00601', '00602', '00603', '00604', '00605', '00606', '00610', '00611', ... '99919', '99921', '99922', '99923', '99925', '99926', '99927', '99928', '99929', '99950'], ordered=False, categories_dtype=object)
Btw, you'll have fewer issues if you pass the URL, instead of a dataframe, to `alt.Chart`.
Let's draw it
Now that we have the dataset loaded, let's start drawing some plots. Let's say you don't know anything about map projections. What would you try with geographical data? Probably the simplest way is to treat (longitude, latitude) as Cartesian coordinates and plot them directly.
alt.Chart(zipcodes_url).mark_circle().encode(
x='longitude:Q',
y='latitude:Q',
)
Actually, this itself is a map projection, called the equirectangular projection. This projection (or almost a non-projection) is super straightforward and doesn't require any processing of the data, so it is often used to quickly explore geographical data. As you dig deeper, you'll still want to think about which map projection best fits your needs. Don't just use the equirectangular projection without any thought!
Anyway, let's make it look slightly better by reducing the size of the circles and adjusting the aspect ratio. Q: Can you adjust the width and height?
# YOUR SOLUTION HERE
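If you're stuck, here's one possible solution as a sketch (the circle size and the 700×400 canvas are arbitrary choices):

alt.Chart(zipcodes_url).mark_circle(size=3).encode(
    x='longitude:Q',
    y='latitude:Q',
).properties(
    width=700,
    height=400,
)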
But a much better way to do this is to explicitly specify that these are geographic coordinates by using `longitude=` and `latitude=` rather than `x=` and `y=`. If you do that, Altair automatically adjusts the aspect ratio. Q: Can you try it?
# YOUR SOLUTION HERE
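One possible answer, sketched out: swap the `x`/`y` channels for `longitude`/`latitude` and let Altair handle the aspect ratio.

alt.Chart(zipcodes_url).mark_circle(size=3).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
)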
Because the American empire is far-reaching and complicated, the information density of this map is very low (although interesting). Moreover, the US looks distorted because the default projection is not centered on the US.
A common projection for visualizing US data is albersUsa, which is based on the Albers (equal-area) projection, a standard projection used by the United States Geological Survey and the United States Census Bureau. The albersUsa projection is a composite of the US mainland, Alaska, and Hawaii.
To use it, we call the `project` method and specify which variables are `longitude` and `latitude`.
Q: Use the `project` method to draw the map in the albersUsa projection.
# YOUR SOLUTION HERE
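A sketch of one possible solution (width and height are arbitrary):

alt.Chart(zipcodes_url).mark_circle(size=3).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
).project(
    type='albersUsa'
).properties(
    width=700,
    height=400,
)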
Now we're talking. 😎
Let's visualize the large-scale zipcode patterns. We can use the fact that zipcodes are hierarchically organized: the first digit captures the largest area divisions, and the subsequent digits indicate progressively smaller geographical divisions.
Altair provides some data transformation functionalities. One of them is extracting a substring from a variable.
from altair.expr import datum, substring
alt.Chart(zipcodes_url).mark_circle(size=2).transform_calculate(
'first_digit', substring(datum.zip_code, 0, 1)
).encode(
longitude='longitude:Q',
latitude='latitude:Q',
color='first_digit:N',
).project(
type='albersUsa'
).properties(
width=700,
height=400,
)
For each row (`datum`), you take the `zip_code` variable, extract a substring (imagine Python string slicing), and name the result `first_digit`. Now you can use this `first_digit` variable to color the circles. Also note that we declare `first_digit` as a nominal variable, not quantitative, to obtain a categorical colormap. But we can play with that too.
Q: Why don't you extract the first two digits, name the result `two_digits`, and declare it as a quantitative variable? Any interesting patterns? What does it tell us about the history of the US?
# YOUR SOLUTION HERE
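One way to do it, as a sketch: same as the `first_digit` example above, but grab two characters and declare the result quantitative.

alt.Chart(zipcodes_url).mark_circle(size=2).transform_calculate(
    'two_digits', substring(datum.zip_code, 0, 2)
).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    color='two_digits:Q',
).project(
    type='albersUsa'
).properties(
    width=700,
    height=400,
)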
Q: Also try declaring the first two digits as a categorical variable.
zipcodes
| | zip_code | latitude | longitude | city | state | county |
|---|---|---|---|---|---|---|
| 0 | 00501 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 1 | 00544 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 2 | 00601 | 18.165273 | -66.722583 | Adjuntas | PR | Adjuntas |
| 3 | 00602 | 18.393103 | -67.180953 | Aguada | PR | Aguada |
| 4 | 00603 | 18.455913 | -67.145780 | Aguadilla | PR | Aguadilla |
| ... | ... | ... | ... | ... | ... | ... |
| 42044 | 99926 | 55.094325 | -131.566827 | Metlakatla | AK | Prince Wales Ketchikan |
| 42045 | 99927 | 55.517921 | -132.003244 | Point Baker | AK | Prince Wales Ketchikan |
| 42046 | 99928 | 55.395359 | -131.675370 | Ward Cove | AK | Ketchikan Gateway |
| 42047 | 99929 | 56.449893 | -132.364407 | Wrangell | AK | Wrangell Petersburg |
| 42048 | 99950 | 55.542007 | -131.432682 | Ketchikan | AK | Ketchikan Gateway |

42049 rows × 6 columns
# Implement
# YOUR SOLUTION HERE
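A sketch: identical to the previous chart except that `two_digits` is declared nominal (`:N`) to get a categorical colormap.

alt.Chart(zipcodes_url).mark_circle(size=2).transform_calculate(
    'two_digits', substring(datum.zip_code, 0, 2)
).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    color='two_digits:N',
).project(
    type='albersUsa'
).properties(
    width=700,
    height=400,
)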
Btw, you can always click "view source" or "open in Vega Editor" to look at the JSON object that defines this visualization. You can embed this JSON object in your webpage to easily put up an interactive visualization.
Q: Can you put a tooltip that displays the zipcode when you mouse-over?
# YOUR SOLUTION HERE
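One minimal sketch: add a `tooltip` encoding for the `zip_code` field.

alt.Chart(zipcodes_url).mark_circle(size=2).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    tooltip='zip_code:N',
).project(
    type='albersUsa'
).properties(
    width=700,
    height=400,
)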
zipcodes
| | zip_code | latitude | longitude | city | state | county |
|---|---|---|---|---|---|---|
| 0 | 00501 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 1 | 00544 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 2 | 00601 | 18.165273 | -66.722583 | Adjuntas | PR | Adjuntas |
| 3 | 00602 | 18.393103 | -67.180953 | Aguada | PR | Aguada |
| 4 | 00603 | 18.455913 | -67.145780 | Aguadilla | PR | Aguadilla |
| ... | ... | ... | ... | ... | ... | ... |
| 42044 | 99926 | 55.094325 | -131.566827 | Metlakatla | AK | Prince Wales Ketchikan |
| 42045 | 99927 | 55.517921 | -132.003244 | Point Baker | AK | Prince Wales Ketchikan |
| 42046 | 99928 | 55.395359 | -131.675370 | Ward Cove | AK | Ketchikan Gateway |
| 42047 | 99929 | 56.449893 | -132.364407 | Wrangell | AK | Wrangell Petersburg |
| 42048 | 99950 | 55.542007 | -131.432682 | Ketchikan | AK | Ketchikan Gateway |

42049 rows × 6 columns
Choropleth
Let's try some choropleths now. The vega datasets include US county / state boundary data (`us_10m`) and world country boundary data (`world-110m`). You can take a look at the boundaries on GitHub (it renders TopoJSON files):
- https://github.com/vega/vega-datasets/blob/main/data/us-10m.json
- https://github.com/vega/vega-datasets/blob/main/data/world-110m.json
If you click "Raw" then you can take a look at the actual file, which is hard to read.
Essentially, each file is a large dictionary with the following keys.
usmap = data.us_10m()
usmap.keys()
dict_keys(['type', 'transform', 'objects', 'arcs'])
usmap['type']
'Topology'
usmap['transform']
{'scale': [0.003589294092944858, 0.0005371535195261037], 'translate': [-179.1473400003406, 17.67439566600018]}
This `transform` is used to quantize the data and store the coordinates as integers (which are easier to store than floating-point numbers). See: https://github.com/topojson/topojson-specification#212-transforms
usmap['objects'].keys()
dict_keys(['counties', 'states', 'land'])
This data contains not only county-level boundaries (objects) but also states and land boundaries.
usmap['objects']['land']['type'], usmap['objects']['states']['type'], usmap['objects']['counties']['type']
('MultiPolygon', 'GeometryCollection', 'GeometryCollection')
`land` is a MultiPolygon (one object), while `states` and `counties` contain many geometries (multipolygons) because there are many states (counties). We can look at a state as the set of arcs that define it. Its `id` captures the identity of the state and is the key for linking to other datasets.
state1 = usmap['objects']['states']['geometries'][1]
state1
{'type': 'MultiPolygon', 'arcs': [[[10337]], [[10342]], [[10341]], [[10343]], [[10834, 10340]], [[10344]], [[10345]], [[10338]]], 'id': 15}
The `arcs` referred to here are defined in `usmap['arcs']`.
usmap['arcs'][:10]
[[[15739, 57220], [0, 0]], [[15739, 57220], [29, 62], [47, -273]], [[15815, 57009], [-6, -86]], [[15809, 56923], [0, 0]], [[15809, 56923], [-36, -8], [6, -210], [32, 178]], [[15811, 56883], [9, -194], [44, -176], [-29, -151], [-24, -319]], [[15811, 56043], [-12, -216], [26, -171]], [[15825, 55656], [-2, 1]], [[15823, 55657], [-19, 10], [26, -424], [-26, -52]], [[15804, 55191], [-30, -72], [-47, -344]]]
It seems pretty daunting to work with this dataset, right? But fortunately people have already built tools to handle such data.
states = alt.topo_feature(data.us_10m.url, 'states')
states
UrlData({ format: TopoDataFormat({ feature: 'states', type: 'topojson' }), url: 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/us-10m.json' })
Can you find a mark for geographical shapes from here https://altair-viz.github.io/user_guide/marks/index.html# and draw the states?
# YOUR SOLUTION HERE
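A minimal sketch with the `geoshape` mark (the fill and stroke colors are just one choice):

alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
)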
And then project it using `albersUsa`?
# YOUR SOLUTION HERE
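For instance (a sketch):

alt.Chart(states).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project(
    type='albersUsa'
).properties(
    width=700,
    height=400,
)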
Can you do the same thing with counties and draw county boundaries?
# YOUR SOLUTION HERE
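A sketch of one solution; note that the choropleth cell below assumes a `us_counties` variable defined like this:

us_counties = alt.topo_feature(data.us_10m.url, 'counties')

alt.Chart(us_counties).mark_geoshape(
    fill='lightgray',
    stroke='white',
    strokeWidth=0.2
).project(
    type='albersUsa'
).properties(
    width=700,
    height=400,
)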
Let's load some county-level unemployment data.
unemp_data = data.unemployment(sep='\t')
unemp_data.head()
| | id | rate |
|---|---|---|
| 0 | 1001 | 0.097 |
| 1 | 1003 | 0.091 |
| 2 | 1005 | 0.134 |
| 3 | 1007 | 0.121 |
| 4 | 1009 | 0.099 |
This dataset has unemployment rates. From when? I don't know. We don't care about data provenance here because the goal is to quickly try out a choropleth. But if you're working with a real dataset, you should be very sensitive about its provenance: make sure you understand where the data came from and how it was processed.
Anyway, the unemployment rate is given for each county, identified by `id`. To combine the two datasets, we use the "Lookup transform": https://vega.github.io/vega/docs/transforms/lookup/. Essentially, we use the `id` in the map data to look up the `id` field in `unemp_data` and bring in the `rate` variable. Then we can use that `rate` variable to encode the color of the `geoshape` mark.
alt.Chart(us_counties).mark_geoshape().project(
type='albersUsa'
).transform_lookup(
lookup='id',
from_=alt.LookupData(unemp_data, 'id', ['rate'])
).encode(
color='rate:Q'
).properties(
width=700,
height=400
)
There you have it, a nice choropleth map. 😎
Raster visualization with datashader
Although many geovisualizations use vector graphics, raster visualization is still useful, especially when you deal with images and lots of data points. Datashader is a package that aggregates and visualizes a large amount of data very quickly. Given a scene (visualization boundary, resolution, etc.), it quickly aggregates the data, produces the pixels, and sends them to you.
To appreciate its power, we need a fairly large dataset. Let's use the NYC taxi trip dataset on Kaggle: https://www.kaggle.com/kentonnlp/2014-new-york-city-taxi-trips You can download even bigger trip data from the NYC Open Data website: https://opendata.cityofnewyork.us/data/
Ah, and you'll want to install datashader, bokeh, holoviews, and jupyter-bokeh first if you don't have them yet.
pip install datashader bokeh holoviews jupyter-bokeh
or
conda install datashader bokeh holoviews jupyter-bokeh
import pandas as pd
import datashader as ds
from datashader import transfer_functions as tf
from colorcet import fire
Because the dataset is pretty big, let's use a small sample first. For this visualization, we only keep the dropoff location.
nyctaxi_small = pd.read_csv('~/Downloads/archive/nyc_taxi_data_2014.csv', nrows=10000,
usecols=['dropoff_longitude', 'dropoff_latitude'])
nyctaxi_small.head()
| | dropoff_longitude | dropoff_latitude |
|---|---|---|
| 0 | -73.982227 | 40.731790 |
| 1 | -73.960449 | 40.763995 |
| 2 | -73.986626 | 40.765217 |
| 3 | -73.979863 | 40.777050 |
| 4 | -73.984367 | 40.720524 |
Although the dataset is different, we can still follow the example here: https://datashader.org/getting_started/Introduction.html
agg = ds.Canvas().points(nyctaxi_small, 'dropoff_longitude', 'dropoff_latitude')
tf.set_background(tf.shade(agg, cmap=fire),"black")
Why can't we see anything? Wait, do you see the small dots at the top left? Could that be New York City? Maybe we don't see anything because some people travel very far? Or because the dataset has some missing data?
Q: Can you first check whether there are NaNs? Then drop them and draw the map again?
# YOUR SOLUTION HERE
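For instance, a minimal check:

nyctaxi_small.isna().sum()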
dropoff_longitude    1
dropoff_latitude     1
dtype: int64
# drop the rows with NaN and then draw the map again.
# YOUR SOLUTION HERE
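A sketch (the variable name `nyctaxi_small_clean` is just for illustration):

# drop rows with missing coordinates, then aggregate and shade again
nyctaxi_small_clean = nyctaxi_small.dropna()
agg = ds.Canvas().points(nyctaxi_small_clean, 'dropoff_longitude', 'dropoff_latitude')
tf.set_background(tf.shade(agg, cmap=fire), "black")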
So it's not about the missing data.
Q: Can you identify the issue and draw the map like the following?
hint: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html this method may be helpful.
# You can use multiple cells to figure out what's going on.
# YOUR SOLUTION HERE
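One way to get there (a sketch; the bounding box below is an assumption, just a rough box around NYC):

# keep only dropoffs that fall inside a rough NYC bounding box
nyctaxi_small_filtered = nyctaxi_small_clean[
    nyctaxi_small_clean.dropoff_longitude.between(-74.1, -73.7)
    & nyctaxi_small_clean.dropoff_latitude.between(40.6, 40.9)
]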
agg = ds.Canvas().points(nyctaxi_small_filtered, 'dropoff_longitude', 'dropoff_latitude')
tf.set_background(tf.shade(agg, cmap=fire), "black")
Do you see the black empty space near the center? That looks like Central Park. This is cool, but it'd be awesome if we could explore the data interactively.
Ok, now let's get serious by loading the whole dataset. It may take some time. Apply the same data cleaning procedure.
# YOUR SOLUTION HERE
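A sketch, reusing the same cleaning steps (the bounding box is the same assumption as before):

# load only the two columns we need, then drop NaNs and filter to NYC
nyctaxi = pd.read_csv('~/Downloads/archive/nyc_taxi_data_2014.csv',
                      usecols=['dropoff_longitude', 'dropoff_latitude']).dropna()
nyctaxi_filtered = nyctaxi[
    nyctaxi.dropoff_longitude.between(-74.1, -73.7)
    & nyctaxi.dropoff_latitude.between(40.6, 40.9)
]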
Can you feed the data directly to datashader to reproduce the static plot, this time with the full data?
# YOUR SOLUTION HERE
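For instance (the same two lines as before, just pointed at the full dataframe):

agg = ds.Canvas().points(nyctaxi_filtered, 'dropoff_longitude', 'dropoff_latitude')
tf.set_background(tf.shade(agg, cmap=fire), "black")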
Wow, that's fast. Also it looks cool!
Let's try the interactive version from here: https://datashader.org/getting_started/Introduction.html
nyctaxi_filtered
| | dropoff_longitude | dropoff_latitude |
|---|---|---|
| 0 | -73.982227 | 40.731790 |
| 1 | -73.960449 | 40.763995 |
| 2 | -73.986626 | 40.765217 |
| 3 | -73.979863 | 40.777050 |
| 4 | -73.984367 | 40.720524 |
| ... | ... | ... |
| 14999994 | -74.000675 | 40.725737 |
| 14999995 | -73.991287 | 40.692535 |
| 14999996 | -73.776505 | 40.740790 |
| 14999997 | -74.005953 | 40.710922 |
| 14999998 | -73.972407 | 40.747463 |

14751421 rows × 2 columns
We currently only have longitudes and latitudes. We need to convert them into a coordinate system that the map tiles understand (Web Mercator).
from datashader.utils import lnglat_to_meters
df = nyctaxi_filtered
df['dropoff_x'], df['dropoff_y'] = lnglat_to_meters(df.dropoff_longitude, df.dropoff_latitude)
/var/folders/y9/6g21ty616b783y993xqz28hr0000gq/T/ipykernel_40065/1009409252.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['dropoff_x'], df['dropoff_y'] = lnglat_to_meters(df.dropoff_longitude, df.dropoff_latitude)
Now we can visualize the data interactively. See https://datashader.org/getting_started/Introduction.html
import holoviews as hv
import colorcet as cc
from holoviews.element.tiles import EsriImagery
from holoviews.operation.datashader import datashade
hv.extension('bokeh')
# YOUR SOLUTION HERE
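A sketch following the datashader Introduction page linked above, adapted to our column names:

# overlay datashaded points on a satellite tile layer
map_tiles = EsriImagery().opts(alpha=0.5, width=900, height=480, bgcolor='black')
points = hv.Points(nyctaxi_filtered, ['dropoff_x', 'dropoff_y'])
taxi_trips = datashade(points, x_sampling=1, y_sampling=1, cmap=cc.fire, width=900, height=480)
map_tiles * taxi_trips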
BokehModel(combine_events=True, render_bundle={'docs_json': {'4d577e04-d836-4d78-ab13-0b8eb85d880b': {'version…
Q: how many rows (data points) are we visualizing right now?
# YOUR SOLUTION HERE
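For instance:

len(nyctaxi_filtered)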
14751421
That's a lot of data points. If we were using a vector format, it would probably be hopeless to expect any interactivity because you'd need to move that many points! Yet datashader + holoviews + bokeh renders everything almost in real time!
Leaflet
Another useful tool is Leaflet. It allows you to use various map tile data (Google Maps, OpenStreetMap, ...) with many types of marks (points, heatmaps, etc.). Leaflet.js is one of the easiest options for doing this on the web, and there is a Python bridge for it: https://github.com/jupyter-widgets/ipyleaflet.
You can install it simply by
pip install ipyleaflet
It is quite easy to display an interactive map with it. You can also add markers, polygons, etc.
from ipyleaflet import Map, Marker
center = (52.204793, 360.121558)
m = Map(center=center, zoom=15)
marker = Marker(location=center, draggable=True)
m.add(marker);
display(m)
marker.location = center
ipyleaflet lets you use various basemaps. You can find the list of available basemaps here: https://leaflet-extras.github.io/leaflet-providers/preview/
For instance, let's use OpenStreetMap and zoom into the Luddy School building.
from ipyleaflet import basemaps
center = (39.172681590059604, -86.5233788123735)
zoom = 17
Map(basemap=basemaps.OpenStreetMap.HOT, center=center, zoom=zoom)
Map(center=[39.172681590059604, -86.5233788123735], controls=(ZoomControl(options=['position', 'zoom_in_text',…
Q: can you identify a combination of location, zoom-level, and basemap to create an interesting(?) map?
# YOUR SOLUTION HERE
Let's create a heatmap. First create a small list of random points around the center.
center
(39.172681590059604, -86.5233788123735)
from random import uniform

random_points = [
    [
        uniform(center[0] - 0.01, center[0] + 0.01),
        uniform(center[1] - 0.01, center[1] + 0.01),
        uniform(0, 5)
    ] for x in range(100)
]
random_points[:2]
[[39.171944747304686, -86.53044665858606, 0.8156528026270821], [39.1694065767711, -86.529215155277, 2.0383181623818603]]
Q: using these random points, can you create a heatmap?
documentation: https://ipyleaflet.readthedocs.io/en/latest/layers/heatmap.html
from ipyleaflet import Map, Heatmap
# YOUR SOLUTION HERE
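A minimal sketch following the linked documentation (the zoom level and radius are arbitrary choices):

m = Map(center=center, zoom=13)
heatmap = Heatmap(locations=random_points, radius=20)
m.add(heatmap)
m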
Map(center=[39.172681590059604, -86.5233788123735], controls=(ZoomControl(options=['position', 'zoom_in_text',…