# Module 7: 1D data

Let's first import the basic packages and then load a dataset from the `vega_datasets` package. If you don't have `vega_datasets` or `altair` installed yet, use `pip` or `conda` to install them.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from vega_datasets import data
```

```
cars = data.cars()
cars.head()
```

| | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |

## 1D scatter plot

Let's consider the `Acceleration` column as our 1D data. If we ask pandas to plot this series, it will produce a line graph in which the index becomes the horizontal axis.

```
cars.Acceleration.plot()
```

<Axes: >

Because the index is not really meaningful, drawing a line between consecutive values is misleading! This is definitely not the plot we want!

It's actually not trivial to create a 1D scatter plot with pandas. Instead, we can use `matplotlib`'s `scatter` function. We can first create an array of zeros to use as the vertical coordinates of the points that we will plot. `np.zeros_like` returns an array of zeros that matches the shape and type of the input array.

```
np.zeros_like([1,2,3])
```

array([0, 0, 0])

**Q: Now can you create a 1D scatter plot with matplotlib's `scatter` function?** Make the figure wide (e.g. set `figsize=(10,2)`) and then remove the y ticks.

```
# YOUR SOLUTION HERE
```

<matplotlib.collections.PathCollection at 0x129e03fe0>
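One possible shape for such a plot, as a minimal sketch rather than the reference solution. It uses a synthetic array (`values`, an assumption made here so the block runs on its own) in place of `cars.Acceleration`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for cars.Acceleration (an assumption for illustration).
rng = np.random.default_rng(0)
values = rng.normal(loc=15.5, scale=2.8, size=400)

fig, ax = plt.subplots(figsize=(10, 2))
# Every point gets y = 0, so only the horizontal position carries information.
points = ax.scatter(values, np.zeros_like(values))
# The vertical axis is meaningless here, so hide its ticks.
ax.set_yticks([])
```

Swapping `values` for `cars.Acceleration` should give the same kind of figure for the real data.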

As you can see, there is a lot of occlusion, so this plot cannot show the distribution properly and we would like to fix it. How about adding some jitter? Instead of an array of zeros, you can use `numpy`'s `random.rand()` function to generate uniform random numbers in [0, 1).

**Q: create a jittered 1D scatter plot.**

```
# jittered_y = ...
# YOUR SOLUTION HERE
```
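A minimal sketch of the jitter idea, again on a synthetic stand-in array (`values` is an assumption, not the course data). The only change from a zero-valued vertical coordinate is that each point receives a random height.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for cars.Acceleration (an assumption for illustration).
rng = np.random.default_rng(1)
values = rng.normal(loc=15.5, scale=2.8, size=400)

# np.random.rand(n) draws n uniform samples from [0, 1).
jittered_y = np.random.rand(len(values))

fig, ax = plt.subplots(figsize=(10, 2))
ax.scatter(values, jittered_y)
ax.set_yticks([])  # the random heights have no meaning, so hide the ticks
```

The jitter only spreads points apart vertically; the horizontal positions, which hold the data, are unchanged.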

We can further improve this by adding transparency to the symbols. The transparency option of the `scatter` function is called `alpha`. Set it to 0.2.

**Q: create a jittered 1D scatter plot with transparency (alpha=0.2)**

```
# YOUR SOLUTION HERE
```

Another strategy is to use empty symbols. The option is `facecolors`. You can also change the stroke color (`edgecolors`).

**Q: create a jittered 1D scatter plot with empty symbols.**

```
# YOUR SOLUTION HERE
```
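The two strategies can be compared side by side. The sketch below stacks a transparent version and a hollow-symbol version on synthetic data; the array names and the `"tab:blue"` edge color are illustrative choices, not part of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(2)
values = rng.normal(loc=15.5, scale=2.8, size=300)
jittered_y = rng.random(len(values))

fig, (ax_alpha, ax_hollow) = plt.subplots(2, 1, figsize=(10, 4), sharex=True)

# Transparent filled symbols: overlapping points add up to darker regions.
filled = ax_alpha.scatter(values, jittered_y, alpha=0.2)
ax_alpha.set_yticks([])

# Hollow symbols: facecolors="none" leaves only the outline of each marker.
hollow = ax_hollow.scatter(values, jittered_y,
                           facecolors="none", edgecolors="tab:blue")
ax_hollow.set_yticks([])
```

Transparency tends to emphasize density, while hollow symbols keep individual points distinguishable; which reads better depends on how many points overlap.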

## What happens if you have lots and lots of points?

Whatever strategy you use, it becomes almost useless if you have too many data points. Let's play with different numbers of data points and see how it looks.

Not only does the plot become completely useless, it also takes a while to draw.

```
# TODO: play with N and see what happens.
N = 100000
x = np.random.rand(N)
jittered_y = np.random.rand(N)
# YOUR SOLUTION HERE
```
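To get a rough feel for the drawing cost, one option is to time the rendering step for a few values of `N`. This is only a sketch: the chosen sizes are arbitrary, the timings depend heavily on the machine and backend, and each figure is closed right after timing, so nothing is displayed.

```python
import time
import numpy as np
import matplotlib.pyplot as plt

timings = {}
for n in (100, 1_000, 10_000):
    x = np.random.rand(n)
    jittered_y = np.random.rand(n)

    fig, ax = plt.subplots(figsize=(10, 2))
    ax.scatter(x, jittered_y, alpha=0.2)

    start = time.perf_counter()
    fig.canvas.draw()  # force rendering so the drawing cost is what we measure
    timings[n] = time.perf_counter() - start
    plt.close(fig)     # free the figure; we only wanted the timing

for n, seconds in timings.items():
    print(f"N={n:>6}: {seconds:.3f} s to render")
```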

## Histogram and boxplot

When you have lots of data points, you can no longer use scatter plots. Even when you don't have millions of data points, you often want a quick summary of the distribution rather than a view of the whole dataset. For 1D datasets, the two major approaches are the histogram and the boxplot. A histogram aggregates and counts the data, while a boxplot summarizes it. Let's first draw some histograms.

### Histogram

It's very easy to draw a histogram with pandas.

```
cars.Acceleration.hist()
```

<Axes: >

You can adjust the number of bins (and thereby the bin width), which is the main parameter of a histogram.

```
cars.Acceleration.hist(bins=15)
```

<Axes: >
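To see what an integer `bins` value means, it can help to look at the binning itself without any plotting. The sketch below calls `np.histogram` directly on a synthetic array (an assumption for illustration); matplotlib's `hist`, which pandas calls, relies on the same kind of binning.

```python
import numpy as np

# Synthetic stand-in for cars.Acceleration (an assumption for illustration).
rng = np.random.default_rng(3)
values = rng.normal(loc=15.5, scale=2.8, size=400)

# An integer `bins` asks for that many equal-width bins between min and max.
counts, edges = np.histogram(values, bins=15)
widths = np.diff(edges)

print("number of bins: ", len(counts))
print("number of edges:", len(edges))   # always one more than the bins
print("equal widths:   ", np.allclose(widths, widths[0]))
print("points counted: ", counts.sum())
```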

You can even specify the actual bins.

```
bins = [7.5, 8.5, 10, 15, 30]
cars.Acceleration.hist(bins=bins)
```

<Axes: >

Do you see anything funky going on with this histogram? What's wrong? Can you fix it?

**Q: Explain what's wrong with this histogram and fix it.**

(Hints: do you remember what we discussed regarding histograms? Also, the pandas documentation does not show the option that you should use. You should take a look at `matplotlib`'s documentation.)

```
# YOUR SOLUTION HERE
```

<Axes: >
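One way to reason about unequal bins is to compare raw counts with counts normalized by bin width. The sketch below does this numerically with `np.histogram` on synthetic, roughly flat data (an assumption chosen to make the effect obvious); matplotlib's `hist` accepts a `density` parameter that applies the same normalization to the bar heights.

```python
import numpy as np

# Synthetic data, roughly uniform between 8 and 25 (an assumption).
rng = np.random.default_rng(4)
values = rng.uniform(8, 25, size=1000)
bins = [7.5, 8.5, 10, 15, 30]

counts, edges = np.histogram(values, bins=bins)
density, _ = np.histogram(values, bins=bins, density=True)
widths = np.diff(edges)

# Raw counts make wide bins look tall simply because they cover more range.
print("widths: ", widths)
print("counts: ", counts)
# Dividing by the total count and the bin width makes the *area* of each bar
# proportional to the fraction of data in that bin.
print("density:", density)
print("total area:", (density * widths).sum())
```

With density normalization, a flat distribution gives bars of similar height regardless of how wide each bin is, which is what the eye expects from a histogram.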

### Boxplot

A boxplot can be created very easily with pandas. Check out the `plot` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

**Q: create a box plot of Acceleration**

```
# YOUR SOLUTION HERE
```

<Axes: >
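To make concrete what a boxplot summarizes, the sketch below computes the usual ingredients by hand on a small made-up array. It assumes the common 1.5 × IQR whisker rule, which is matplotlib's default (`whis=1.5`); other tools may use different conventions.

```python
import numpy as np

# Small made-up sample with one unusually large value (an assumption).
data = np.array([8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 25], dtype=float)

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("box:     ", q1, median, q3)       # 11.5 14.0 16.5
print("whiskers:", whisker_low, whisker_high)  # 8.0 18.0
print("outliers:", outliers)             # [25.]
```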

## 1D scatter plot with Seaborn and Altair

As you may have noticed, `matplotlib` is not very easy to use. The organization of its plotting functions and parameters is not very systematic. Whenever you draw something, you have to look up how to do it, which parameters you can tweak, and so on. You need to tweak a lot of things manually when you work with `matplotlib`.

There are more systematic approaches to data visualization, such as the "Grammar of Graphics". This idea of a *grammar* led to the famous `ggplot2` (http://ggplot2.tidyverse.org) package in R as well as Vega and Vega-Lite for the web. The grammar-based approach lets you work with *tidy data* in a natural way, and also lets you approach data visualization systematically. In other words, they are very cool. 😎

I'd like to introduce two nice Python libraries. One is `seaborn` (https://seaborn.pydata.org), which focuses on creating complex statistical data visualizations. The other is `altair` (https://altair-viz.github.io/), a Python library that lets you *define* a visualization and translates it into Vega-Lite JSON.

Seaborn would be useful when you are doing exploratory data analysis; altair may be useful if you are thinking about creating and putting an interactive visualization on the web.

If you don't have them yet, check the installation page of altair. With `conda`:

`$ conda install -c conda-forge altair vega_datasets jupyterlab`

Let's play with it.

```
import seaborn as sns
import altair as alt
# Uncomment the following line if you are using Jupyter notebook
# alt.renderers.enable('notebook')
```

```
cars.head()
```

| | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |

### Beeswarm plots with seaborn

Seaborn has a built-in function to create 1D scatter plots with multiple categories.

```
sns.stripplot(x='Origin', y='Acceleration', data=cars)
```

<Axes: xlabel='Origin', ylabel='Acceleration'>

And you can easily add jitter or even create a beeswarm plot.

```
sns.stripplot(x='Origin', y='Acceleration', data=cars, jitter=True)
```

<Axes: xlabel='Origin', ylabel='Acceleration'>
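For an actual beeswarm layout, where points are nudged sideways just enough to avoid overlapping, seaborn provides `swarmplot`. The sketch below uses a small synthetic table (`toy`, an assumption so the block runs without `vega_datasets`) with the same two column names as `cars`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the cars table (an assumption for illustration).
rng = np.random.default_rng(5)
toy = pd.DataFrame({
    "Origin": np.repeat(["USA", "Europe", "Japan"], 30),
    "Acceleration": rng.normal(loc=15.5, scale=2.8, size=90),
})

fig, ax = plt.subplots(figsize=(6, 4))
# swarmplot places points so they do not overlap, instead of random jitter.
sns.swarmplot(x="Origin", y="Acceleration", data=toy, ax=ax)
```

Passing `data=cars` instead of `toy` should produce the same kind of plot for the real dataset, although a swarm can get crowded when a category holds many points.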