Module 8: Histogram and CDF
A deep dive into histograms and boxplots.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
The tricky histogram with pre-counted data
Let's revisit the table from the class
Hours | Frequency |
---|---|
0-1 | 4,300 |
1-3 | 6,900 |
3-5 | 4,900 |
5-10 | 2,000 |
10-24 | 2,100 |
You can draw a histogram by just providing bins and counts instead of a list of numbers. So, let's try that.
bins = [0, 1, 3, 5, 10, 24]
data = {0.5: 4300, 2: 6900, 4: 4900, 7: 2000, 15: 2100}
data.keys()
dict_keys([0.5, 2, 4, 7, 15])
Q: Draw a histogram using this data. Useful Google search query: matplotlib histogram pre-counted
# YOUR SOLUTION HERE
Text(0, 0.5, 'Frequency')
As you can see, the default histogram does not normalize by bin width and simply shows the counts! This can be very misleading if you are working with variable bin widths (e.g. logarithmic bins). So please be mindful about histograms when you work with variable bins.
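To see how strongly the raw counts depend on bin width, here is a minimal sketch that divides each count from the table by the width of its bin. The variable names (`edges`, `counts`, `per_hour`, `density`) are ours and only for illustration; this is a by-hand calculation, not the plotting call the exercise asks for.

```python
import numpy as np

# Bin edges and pre-counted frequencies taken from the table above.
edges = np.array([0, 1, 3, 5, 10, 24])
counts = np.array([4300, 6900, 4900, 2000, 2100])

widths = np.diff(edges)                      # 1, 2, 2, 5, 14 hours
per_hour = counts / widths                   # count per hour of bin width
density = counts / (counts.sum() * widths)   # bar heights whose areas sum to 1

for lo, hi, c, r in zip(edges[:-1], edges[1:], counts, per_hour):
    print(f"{lo:>2}-{hi:<2}  count={c:>5}  count per hour={r:7.1f}")

print("total area under density bars:", (density * widths).sum())
```

The 10-24 bin holds slightly more observations than the 5-10 bin, but once you account for its 14-hour width it is much sparser per hour.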
Q: You can fix this by using the density option.
# YOUR SOLUTION HERE
Text(0, 0.5, 'Density')
Let's use an actual dataset
import vega_datasets
movies = vega_datasets.data.movies()
movies.head()
Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |
1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |
2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |
3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN |
4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
Let's plot the histogram of IMDB ratings.
plt.hist(movies.IMDB_Rating)
(array([ 9., 39., 76., 133., 293., 599., 784., 684., 323., 48.]), array([1.4 , 2.18, 2.96, 3.74, 4.52, 5.3 , 6.08, 6.86, 7.64, 8.42, 9.2 ]), <BarContainer object of 10 artists>)
Did you get an error or a warning? What's going on?
The problem is that the column contains NaN (Not a Number) values, which represent missing data points. The following command checks whether each value is NaN and returns the result.
movies.IMDB_Rating.isna()
0       False
1       False
2       False
3        True
4       False
        ...
3196    False
3197     True
3198    False
3199    False
3200    False
Name: IMDB_Rating, Length: 3201, dtype: bool
As you can see there are a bunch of missing rows. You can count them.
sum(movies.IMDB_Rating.isna())
213
or drop them.
IMDB_ratings_nan_dropped = movies.IMDB_Rating.dropna()
len(IMDB_ratings_nan_dropped)
2988
213 + 2988
3201
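The same bookkeeping can be tried on a tiny example. The sketch below uses a made-up Series (the values and the name `ratings` are hypothetical, not from the movies dataset) and the pandas idiom `isna().sum()`, which counts missing values without going through Python's built-in `sum`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: five ratings, two of them missing.
ratings = pd.Series([6.1, np.nan, 7.4, np.nan, 5.0], name="rating")

n_missing = ratings.isna().sum()   # number of NaN entries
kept = ratings.dropna()            # a new Series without the NaN entries

print(n_missing, len(kept))
print(n_missing + len(kept) == len(ratings))
```

Note that `dropna()` returns a new object and leaves `ratings` itself unchanged.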
The dropna method can be applied to the dataframe too.
Q: Drop rows from the movies dataframe where either IMDB_Rating or IMDB_Votes is NaN.
# YOUR SOLUTION HERE
# Both should be zero.
print(sum(movies.IMDB_Rating.isna()), sum(movies.IMDB_Votes.isna()))
0 0
How does matplotlib decide the bins? Actually, matplotlib's hist function uses numpy's histogram function under the hood.
Q: Plot the histogram of movie ratings (IMDB_Rating) using the plt.hist() function.
# YOUR SOLUTION HERE
Text(0, 0.5, 'Frequency')
Have you noticed that this function returns three objects? Take a look at the documentation here to figure out what they are.
To get the returned three objects:
n_raw, bins_raw, patches = plt.hist(movies.IMDB_Rating)
print(n_raw)
print(bins_raw)
[  9.  39.  76. 133. 293. 599. 784. 684. 323.  48.]
[1.4  2.18 2.96 3.74 4.52 5.3  6.08 6.86 7.64 8.42 9.2 ]
Here, n_raw contains the heights of the histogram bars, i.e., the number of movies in each of the 10 bins. Thus, the sum of the elements in n_raw should be equal to the total number of movies.
Q: Test whether the sum of values in n_raw is equal to the number of movies in the movies dataset.
# YOUR SOLUTION HERE
2988.0 2988
True
The second returned object (bins_raw) is an array containing the 11 edges of the 10 bins: the first bin is [1.4, 2.18), the second [2.18, 2.96), and so on; only the last bin, [8.42, 9.2], includes its right edge. What's the width of the bins?
np.diff(bins_raw)
array([0.78, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78])
The width is the same as the maximum value minus the minimum value, divided by 10.
min_rating = min(movies.IMDB_Rating)
max_rating = max(movies.IMDB_Rating)
print(min_rating, max_rating)
print( (max_rating-min_rating) / 10 )
1.4 9.2
0.7799999999999999
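Equal-width edges like these are just evenly spaced points between the minimum and the maximum. As a minimal sketch, assuming synthetic data in place of the ratings (the array `values` is made up), you can compare edges built by hand with `np.linspace` against the ones `np.histogram_bin_edges` returns for 10 bins.

```python
import numpy as np

# Synthetic stand-in for a ratings column (not the movies data).
rng = np.random.default_rng(0)
values = rng.uniform(1.4, 9.2, size=500)

n_bins = 10
manual_edges = np.linspace(values.min(), values.max(), n_bins + 1)
numpy_edges = np.histogram_bin_edges(values, bins=n_bins)

print(np.diff(manual_edges))                   # ten (nearly) identical widths
print(np.allclose(manual_edges, numpy_edges))  # do the two sets of edges agree?
```

Ten bins always need eleven edges, which is why `bins_raw` has one more element than `n_raw`.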
Now, let's plot a normalized (density) histogram.
n, bins, patches = plt.hist(movies.IMDB_Rating, density=True)
print(n)
print(bins)
[0.0038616  0.0167336  0.03260907 0.05706587 0.12571654 0.25701095 0.33638829 0.29348162 0.13858854 0.0205952 ]
[1.4  2.18 2.96 3.74 4.52 5.3  6.08 6.86 7.64 8.42 9.2 ]
The ten bins do not change. But now n represents the density of the data inside each bin. In other words, the areas of the bars sum to 1.
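Written out, with $n_i$ the raw count in bin $i$, $w_i$ its width, and $N = \sum_i n_i$ the number of data points (our notation, not matplotlib's), the density heights are:

$$
d_i = \frac{n_i}{N \, w_i},
\qquad
\sum_i d_i \, w_i = \sum_i \frac{n_i}{N} = 1 .
$$

For the first bin above, $9 / (2988 \times 0.78) \approx 0.00386$, which matches the first printed value of n.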
Q: Can you verify this?
Hint: the area of each bar is calculated as height * width. You may get something like 0.99999999999999978 instead of 1.
# YOUR SOLUTION HERE
1.0
Anyway, the data returned by the hist function are calculated by numpy's histogram function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
Note that the counts and bin edges returned by np.histogram() are the same as those returned by plt.hist().
np.histogram(movies.IMDB_Rating)
(array([ 9, 39, 76, 133, 293, 599, 784, 684, 323, 48]), array([1.4 , 2.18, 2.96, 3.74, 4.52, 5.3 , 6.08, 6.86, 7.64, 8.42, 9.2 ]))
plt.hist(movies.IMDB_Rating)
(array([ 9., 39., 76., 133., 293., 599., 784., 684., 323., 48.]), array([1.4 , 2.18, 2.96, 3.74, 4.52, 5.3 , 6.08, 6.86, 7.64, 8.42, 9.2 ]), <BarContainer object of 10 artists>)
If you look at the documentation, you can see that numpy simply uses 10 as the default number of bins. But you can set it manually or set it to auto, which is the "Maximum of the sturges and fd estimators." Let's try this auto option.
_ = plt.hist(movies.IMDB_Rating, bins='auto')
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
movies.IMDB_Rating.hist(bins=3)
plt.subplot(1,2,2)
movies.IMDB_Rating.hist(bins=20)
<Axes: >
What does the argument in plt.subplot(1,2,1) mean? If you're not sure, check out: http://stackoverflow.com/questions/3584805/in-matplotlib-what-does-the-argument-mean-in-fig-add-subplot111
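In short, the three numbers are (number of rows, number of columns, index), where the index starts at 1 and runs left to right, then top to bottom. The helper below is a hypothetical function of ours (`subplot_position` is not part of matplotlib) that only does the arithmetic, so you can see which grid cell a given index lands in.

```python
def subplot_position(nrows, ncols, index):
    """Return the zero-based (row, col) cell for a 1-based, row-major subplot index."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    return divmod(index - 1, ncols)

print(subplot_position(1, 2, 1))  # left panel of a 1x2 grid
print(subplot_position(1, 2, 2))  # right panel of a 1x2 grid
print(subplot_position(2, 4, 6))  # second row, second column of a 2x4 grid
```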
Q: Create 8 subplots (2 rows and 4 columns) with the following numbers of bins.
nbins = [2, 3, 5, 10, 30, 40, 60, 100 ]
figsize = (18, 10)
# TODO
# YOUR SOLUTION HERE
Do you see the issues with having too few bins or too many bins? In particular, do you notice weird patterns that emerge from bins=30?
Q: Can you guess why you see such patterns? What are the peaks and what are the empty bars? What do they tell you about choosing the bin size in histograms?
# YOUR SOLUTION HERE
# YOUR SOLUTION HERE
40
YOUR SOLUTION HERE
Formulae for choosing the number of bins
We can manually choose the number of bins based on those formulae.
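For reference, the three rules used in the code below can be written as follows, with $N$ the number of data points, $k$ the number of bins, $h$ the bin width, and $\mathrm{IQR}$ the interquartile range. The floor and ceiling operations mirror the int() and np.ceil() calls in the code; other references round differently.

$$
k_{\text{sqrt}} = \left\lfloor \sqrt{N} \right\rfloor,
\qquad
k_{\text{Sturges}} = \left\lceil \log_2 N + 1 \right\rceil,
\qquad
h_{\text{FD}} = \frac{2\,\mathrm{IQR}}{N^{1/3}},
\quad
k_{\text{FD}} = \left\lfloor \frac{\max - \min}{h_{\text{FD}}} \right\rfloor .
$$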
N = len(movies)
plt.figure(figsize=(12,4))
# Sqrt
nbins = int(np.sqrt(N))
plt.subplot(1,3,1)
plt.title("SQRT, {} bins".format(nbins))
movies.IMDB_Rating.hist(bins=nbins)
# Sturges' formula
nbins = int(np.ceil(np.log2(N) + 1))
plt.subplot(1,3,2)
plt.title("Sturges, {} bins".format(nbins))
movies.IMDB_Rating.hist(bins=nbins)
# Freedman-Diaconis
iqr = np.percentile(movies.IMDB_Rating, 75) - np.percentile(movies.IMDB_Rating, 25)
width = 2*iqr/np.power(N, 1/3)
nbins = int((max(movies.IMDB_Rating) - min(movies.IMDB_Rating)) / width)
plt.subplot(1,3,3)
plt.title("F-D, {} bins".format(nbins))
movies.IMDB_Rating.hist(bins=nbins)
<Axes: title={'center': 'F-D, 35 bins'}>
We can also use the built-in formulae. Let's try all of them.
plt.figure(figsize=(20,4))
plt.subplot(161)
movies.IMDB_Rating.hist(bins='fd')
plt.subplot(162)
movies.IMDB_Rating.hist(bins='doane')
plt.subplot(163)
movies.IMDB_Rating.hist(bins='scott')
plt.subplot(164)
movies.IMDB_Rating.hist(bins='rice')
plt.subplot(165)
movies.IMDB_Rating.hist(bins='sturges')
plt.subplot(166)
movies.IMDB_Rating.hist(bins='sqrt')
<Axes: >
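The plots alone do not tell you how many bins each rule picked. As a minimal sketch, assuming synthetic normally distributed data in place of the ratings (the array `values` and the dictionary `bin_counts` are ours), `np.histogram_bin_edges` accepts the same rule names and lets you count the bins directly.

```python
import numpy as np

# Synthetic stand-in for the ratings column (not the movies data).
rng = np.random.default_rng(42)
values = rng.normal(loc=6.3, scale=1.2, size=2000)

rules = ["fd", "doane", "scott", "rice", "sturges", "sqrt", "auto"]
bin_counts = {}
for rule in rules:
    edges = np.histogram_bin_edges(values, bins=rule)
    bin_counts[rule] = len(edges) - 1   # k bins are delimited by k + 1 edges

for rule, k in bin_counts.items():
    print(f"{rule:>8}: {k} bins")
```

Replacing `values` with the NaN-free ratings column should give the bin counts behind the six panels above.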