# Module 2: John Snow's map

Follow the contents of this notebook and answer all questions (e.g. **Q1: ...**)

## Jupyter + Pandas = Awesomeness

Jupyter Notebook (or JupyterLab) and Pandas may be the two most important tools behind Python's rise in data science. `Jupyter` lets you interactively explore datasets and code; `Pandas` lets you handle tabular datasets with superb speed and convenience. And they work so well together! In many cases, `Jupyter` and `Pandas` are all you need to load, clean, transform, visualize, and understand a dataset.

If you are not familiar with `Pandas`, you may want to follow the official tutorial, 10 Minutes to pandas, now or in the near future.

## Importing pandas

The convention for importing `pandas` is the following:

```
import pandas as pd
```

You can check the version of the library. Because pandas is a fast-evolving library, you want to make sure that you have an up-to-date version.

```
pd.__version__
```

'2.1.1'

You also need `matplotlib`, which `pandas` uses to plot figures. The following is the most common convention for importing the `matplotlib` library.

```
import matplotlib.pyplot as plt
```

Let's check its version too.

```
import matplotlib
matplotlib.__version__
```

'3.8.0'

## Loading a CSV data file

Using pandas, you can read tabular data files in many formats and through many protocols. Pandas supports not only flat files such as `.csv`, but also various other formats and sources, including the clipboard, Excel, JSON, HTML, Feather, Parquet, SQL, Google BigQuery, and so on. Moreover, you can pass a local file path or a URL. If it's on Amazon S3, just pass a URL like `s3://path/to/file.csv`. If it's on a webpage, then just use `https://some/url.csv`.
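As a small illustration of how flexible the input source is, here is a minimal sketch showing that `pd.read_csv` also accepts a file-like object, not only a path or URL. The inline CSV text and the name `toy_df` are made up for this example and are not part of the course data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file or URL.
csv_text = "X,Y\n1.0,2.0\n3.5,4.5\n"

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

Swapping the `StringIO` object for a local path or an `https://` URL should work the same way, as long as the file is reachable.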

Let's load a dataset about the location of pumps in John Snow's map. You can download the file to your computer and try to load it using the local path too.

```
pump_df = pd.read_csv('https://raw.githubusercontent.com/yy/dviz-course/master/data/pumps.csv')
```

`df` stands for "Data Frame", which is a fundamental data object in Pandas. You can take a look at the dataset by looking at the first few rows.

```
pump_df.head()
```

| | X | Y |
|---|---|---|
| 0 | 8.651201 | 17.891600 |
| 1 | 10.984780 | 18.517851 |
| 2 | 13.378190 | 17.394541 |
| 3 | 14.879830 | 17.809919 |
| 4 | 8.694768 | 14.905470 |

**Q1: can you print only the first three lines?** Refer: http://pandas.pydata.org/pandas-docs/stable/index.html

```
# YOUR SOLUTION HERE
```

| | X | Y |
|---|---|---|
| 0 | 8.651201 | 17.891600 |
| 1 | 10.984780 | 18.517851 |
| 2 | 13.378190 | 17.394541 |

You can also sample several rows randomly. If the data is sorted in some way, sampling may give you a less biased view of the dataset than looking only at the first few rows.

```
pump_df.sample(5)
```

| | X | Y |
|---|---|---|
| 12 | 8.999440 | 5.101023 |
| 2 | 13.378190 | 17.394541 |
| 7 | 10.660970 | 7.428647 |
| 10 | 18.914391 | 9.737819 |
| 0 | 8.651201 | 17.891600 |
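Because `sample()` draws rows at random, the output changes from run to run. If you want a reproducible draw, one option is the `random_state` argument. The sketch below uses a made-up `toy_df` rather than the pump data.

```python
import pandas as pd

# Hypothetical toy data, sorted by X on purpose.
toy_df = pd.DataFrame({"X": range(10), "Y": [v * 2 for v in range(10)]})

# random_state fixes the seed, so the same rows come back on every run.
first_draw = toy_df.sample(3, random_state=42)
second_draw = toy_df.sample(3, random_state=42)

# Compare the two draws row by row.
print(first_draw.equals(second_draw))
```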

You can also figure out the number of rows in the dataset by running

```
len(pump_df)
```

13

Note that `df.size` does not give you the number of rows. It tells you the number of elements.

```
pump_df.size
```

26

You can also look at the shape of the dataset as well as its columns.

```
pump_df.shape # 13 rows and 2 columns
```

(13, 2)

```
pump_df.columns
```

Index(['X', 'Y'], dtype='object')
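The relationship between `len`, `size`, and `shape` can be checked on a small frame. This is a minimal sketch with a made-up `toy_df`; the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
toy_df = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy_df.shape

print(len(toy_df))       # number of rows
print(toy_df.size)       # number of elements, i.e. rows * columns
print(n_rows * n_cols)   # same value as toy_df.size
```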

You can also check out basic descriptive statistics of the whole dataset by using the `describe()` method.

```
pump_df.describe()
```

| | X | Y |
|---|---|---|
| count | 13.000000 | 13.000000 |
| mean | 12.504677 | 11.963446 |
| std | 3.376869 | 4.957821 |
| min | 8.651201 | 5.046838 |
| 25% | 8.999440 | 7.958250 |
| 50% | 12.571360 | 11.727170 |
| 75% | 14.879830 | 17.394541 |
| max | 18.914391 | 18.517851 |

You can slice the data like a list

```
pump_df[:2]
```

| | X | Y |
|---|---|---|
| 0 | 8.651201 | 17.891600 |
| 1 | 10.984780 | 18.517851 |

```
pump_df[-2:]
```

| | X | Y |
|---|---|---|
| 11 | 16.00511 | 5.046838 |
| 12 | 8.99944 | 5.101023 |

```
pump_df[1:5]
```

| | X | Y |
|---|---|---|
| 1 | 10.984780 | 18.517851 |
| 2 | 13.378190 | 17.394541 |
| 3 | 14.879830 | 17.809919 |
| 4 | 8.694768 | 14.905470 |
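List-style slicing selects rows by position. A more explicit way to say the same thing is `.iloc`. The sketch below, on a made-up `toy_df`, suggests that the two forms return the same rows.

```python
import pandas as pd

# Hypothetical toy data with the default integer index.
toy_df = pd.DataFrame({"X": [10, 20, 30, 40, 50], "Y": [1, 2, 3, 4, 5]})

# Plain slicing on a DataFrame selects rows by position, like a list.
by_slice = toy_df[1:3]

# .iloc does the same positional selection, but explicitly.
by_iloc = toy_df.iloc[1:3]

print(by_slice.equals(by_iloc))
```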

or filter rows using some conditions.

```
pump_df[pump_df.X > 13]
```

| | X | Y |
|---|---|---|
| 2 | 13.378190 | 17.394541 |
| 3 | 14.879830 | 17.809919 |
| 8 | 13.521460 | 7.958250 |
| 9 | 16.434891 | 9.252130 |
| 10 | 18.914391 | 9.737819 |
| 11 | 16.005110 | 5.046838 |
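Filtering works because the comparison itself produces a boolean Series, which is then used to pick rows. Here is a minimal sketch on a made-up `toy_df`, including one way to combine two conditions.

```python
import pandas as pd

# Hypothetical toy data, not the pump dataset.
toy_df = pd.DataFrame({"X": [8.0, 11.0, 13.5, 16.0], "Y": [18.0, 7.0, 9.0, 5.0]})

# The comparison produces a boolean Series, one True/False per row.
mask = toy_df["X"] > 13
print(mask.tolist())

# Combine conditions with & (and) or | (or); each condition needs parentheses.
subset = toy_df[(toy_df["X"] > 13) & (toy_df["Y"] < 8)]
print(subset)
```

The parentheses matter because `&` binds more tightly than `>` and `<` in Python.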

Now let's load another CSV file that documents the cholera deaths. The URL is https://raw.githubusercontent.com/yy/dviz-course/master/data/deaths.csv

**Q2: load the death dataset and inspect it**

1. **load this dataset as** `death_df`.
2. **show the first 2 rows.**
3. **show the total number of rows.**

```
# YOUR SOLUTION HERE
```

```
# YOUR SOLUTION HERE
```

| | X | Y |
|---|---|---|
| 0 | 13.588010 | 11.09560 |
| 1 | 9.878124 | 12.55918 |

```
# YOUR SOLUTION HERE
```

578

## Some visualizations?

Let's visualize them! Pandas actually provides a nice visualization interface that uses matplotlib under the hood. You can do many basic plots without learning `matplotlib`. So let's try.

```
death_df.plot()
```

<Axes: >

This is not what we want! When asked to plot the data, it tries to figure out what we want based on the type of the data. However, that doesn't mean that it will successfully do so!

Oh by the way, depending on your environment, you may not see any plot. If you don't see anything, run the following command.

```
%matplotlib inline
```

Commands that start with `%` are called magic commands, which are available in IPython and Jupyter. The purpose of this command is to tell IPython / Jupyter to show the plot right here instead of trying to use an external viewer.

Anyway, this doesn't seem like the plot we want. Instead of putting each row as a point in a 2D plane by using X and Y as the coordinates, it just created a line chart. Let's fix it. Please take a look at the plot method documentation. How should we change the command? Which `kind` of plot do we want to draw?

Yes, we want to draw a *scatter plot* using x and y as the Cartesian coordinates.

```
death_df.plot(x='X', y='Y', kind='scatter', label='Deaths')
```

<Axes: xlabel='X', ylabel='Y'>

I think I want to reduce the size of the dots and change the color to black. But it is difficult to find how to do that! It is sometimes quite annoying to figure out how to change how the visualization looks, especially when we use `matplotlib`. Unlike some other advanced tools, `matplotlib` does not provide a very coherent way to adjust your visualizations. That's one of the reasons why there are lots of visualization libraries that *wrap* `matplotlib`. Anyway, this is how you do it.

```
death_df.plot(x='X', y='Y', kind='scatter', label='Deaths', s=2, c='black')
```

<Axes: xlabel='X', ylabel='Y'>
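For reference, here is a small sketch of the same styling arguments on a made-up `toy_df`. It uses the `plot.scatter` accessor, which should be equivalent to passing `kind='scatter'`. The title text and the `alpha` value are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical toy data, not the deaths dataset.
toy_df = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [3.0, 1.0, 2.0]})

# s sets the marker size and c sets the marker color;
# alpha (passed through to matplotlib) makes overlapping markers easier to see.
ax = toy_df.plot.scatter(x="X", y="Y", s=20, c="black", alpha=0.5, label="toy")

# The returned Axes object can be adjusted further, e.g. by adding a title.
ax.set_title("Styled scatter plot")
```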

Can we visualize both deaths and pumps?

```
death_df.plot(x='X', y='Y', s=2, c='black', kind='scatter', label='Deaths')
pump_df.plot(x='X', y='Y', kind='scatter', c='red', s=8, label='Pumps')
```

<Axes: xlabel='X', ylabel='Y'>

Oh well, this is not what we want! We want to overlay them to see them together, right? How can we do that? Before going into that, we probably want to understand some key components of matplotlib figures.

## Figure and Axes

Why do we have two separate plots? The reason is that, by default, each call to the `plot` method creates a new *figure* instead of drawing into an existing one. To avoid this, we need to create an axes and tell `plot` to use that axes. What is an `axes`?

A figure can contain multiple axes, and an axes can contain multiple plots.
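To make the figure/axes relationship concrete, here is a minimal sketch using `plt.subplots`, which the notebook returns to later in this section. The data and the names `ax_left` and `ax_right` are made up for illustration.

```python
import matplotlib.pyplot as plt

# One figure holding two axes, side by side.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# The left axes holds two line plots; the right axes holds one scatter plot.
ax_left.plot([0, 1, 2], [0, 1, 4], label="first")
ax_left.plot([0, 1, 2], [4, 1, 0], label="second")
ax_left.legend()
ax_right.scatter([0, 1, 2], [2, 0, 1])

# Count the axes in the figure and the lines drawn on the left axes.
print(len(fig.axes), len(ax_left.lines))
```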

Conveniently, when you call the `plot` method, it creates an axes and returns it to you:

```
ax = death_df.plot(x='X', y='Y', s=2, c='black', kind='scatter', label='Deaths')
```

```
ax
```

<Axes: xlabel='X', ylabel='Y'>

This object contains all the information and objects in the plot we see. Whatever we want to do with this axes (e.g., changing x or y scale, overlaying other data, changing the color or size of symbols, etc.) can be done by accessing this object.

Then you can pass this axes object to another plot to put both plots in the same axes. Note `ax=ax` in the second plot command. It tells the plot command *where* to draw the points.

```
ax = death_df.plot(x='X', y='Y', s=2, c='black', alpha=0.5, kind='scatter', label='Deaths')
pump_df.plot(x='X', y='Y', kind='scatter', c='red', s=8, label='Pumps', ax=ax)
```

<Axes: xlabel='X', ylabel='Y'>

Although simply invoking the `plot()` command is quick and easy when doing exploratory data analysis, it is usually better to be explicit about the figure and axes objects.

Here is the recommended way to create a plot. Call the `subplots()` function (see https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html) to get the figure and axes objects explicitly.

As you can see below, `subplots()` creates an empty figure and returns the figure and axes objects to you. Then you can fill this empty canvas with your plots. Whatever change you want to make to your figure (e.g., changing the size of the figure) or axes (e.g., drawing a new plot on it) can be done with the `fig` and `ax` objects. So whenever possible, use this method!

**Now, can you use this method to produce the same plot just above?**

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# YOUR SOLUTION HERE
```

<Axes: xlabel='X', ylabel='Y'>

## Voronoi diagram

Let's try the Voronoi diagram. You can use `scipy.spatial.Voronoi` and `scipy.spatial.voronoi_plot_2d` from `scipy`, the *scientific python* library.

```
from scipy.spatial import Voronoi, voronoi_plot_2d
```
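Before working with the real data, it may help to see what a `Voronoi` object contains. The sketch below uses five made-up points (`toy_points`) and only inspects the result; it does not draw anything and is not a solution to the question that follows.

```python
import numpy as np
from scipy.spatial import Voronoi

# Hypothetical sites: four corners of a square plus its center.
toy_points = np.array(
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0]]
)

vor = Voronoi(toy_points)

# vor.points echoes the input sites; vor.vertices holds the corners of the cells.
print(vor.points.shape)
print(vor.vertices)

# point_region maps each input point to the index of its Voronoi region.
print(vor.point_region)
```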

Take a look at the documentation of Voronoi and voronoi_plot_2d, and then answer the following question.

**Q3: produce a Voronoi diagram that shows the deaths, pumps, and voronoi cells**

```
# you'll need this
points = pump_df.values
points
```

array([[ 8.6512012, 17.8915997], [10.9847803, 18.5178509], [13.37819 , 17.3945408], [14.8798304, 17.8099194], [ 8.694768 , 14.9054699], [ 8.8644161, 12.75354 ], [12.5713596, 11.72717 ], [10.6609697, 7.428647 ], [13.5214596, 7.95825 ], [16.4348907, 9.2521296], [18.9143906, 9.7378187], [16.0051098, 5.0468378], [ 8.9994402, 5.1010232]])

```
# YOUR SOLUTION HERE
```