Module 9: Estimation
In this lab, we will learn about Kernel Density Estimation (KDE), interpolation, and (briefly) regression.
In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
import scipy.stats as ss
%matplotlib inline
Kernel density estimation
Let's load the movies dataset from vega_datasets, which includes IMDb ratings.
In [2]:
import vega_datasets
movies = vega_datasets.data.movies()
movies.head()
Out[2]:
 | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |
1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |
2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |
3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN |
4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
Although we have learned that dropping missing values can be dangerous, we will do so here for the sake of simplicity. We are not trying to draw any conclusions from the data, so it is acceptable for this lab. But be careful with missing data in practice!
Q: Can you drop the rows that have a NaN value in either IMDB_Rating or Rotten_Tomatoes_Rating?
In [3]:
# YOUR SOLUTION HERE
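As a hint for the exercise, here is a minimal sketch of row-wise NaN filtering with pandas, shown on a small made-up DataFrame rather than on the movies data. The frame `toy` and its values are invented for illustration; only `DataFrame.dropna` and its `subset` parameter are the real API. Note that NaN values in columns outside `subset` do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the movies dataset) with NaNs in several columns.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDB_Rating": [6.1, np.nan, 7.4, 5.0],
    "Rotten_Tomatoes_Rating": [80.0, 55.0, np.nan, 42.0],
    "US_DVD_Sales": [np.nan, np.nan, np.nan, np.nan],
})

# Drop a row if EITHER of the two listed columns is NaN.
# NaNs in US_DVD_Sales are ignored because that column is not in `subset`.
cleaned = toy.dropna(subset=["IMDB_Rating", "Rotten_Tomatoes_Rating"])

print(cleaned["Title"].tolist())
```

`dropna` returns a new DataFrame by default, so the result has to be assigned to a name to be kept.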
We can plot a histogram and a KDE using pandas:
In [4]:
movies['IMDB_Rating'].hist(bins=10, density=True)
movies['IMDB_Rating'].plot(kind='kde')
Out[4]:
<Axes: ylabel='Density'>
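To make the two curves above more concrete, the following sketch computes both quantities by hand on synthetic data (a stand-in, not the movies ratings). It assumes a Gaussian kernel with Scott's rule for the bandwidth, which is the default of `scipy.stats.gaussian_kde`; the variable names are illustrative. A histogram with `density=True` is scaled so its bar areas sum to 1, and a KDE also integrates to about 1, which is why the two can share a y-axis.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
# Synthetic stand-in for a ratings column (assumption, not real data).
ratings = rng.normal(loc=6.5, scale=1.2, size=500)

# Histogram with density=True: heights are scaled so total bar area is 1.
heights, edges = np.histogram(ratings, bins=10, density=True)
hist_area = np.sum(heights * np.diff(edges))

# Gaussian KDE by hand: average of Gaussian bumps centred on each data point.
n = ratings.size
h = ratings.std(ddof=1) * n ** (-1 / 5)  # Scott's rule bandwidth in 1-D
grid = np.linspace(ratings.min() - 4 * h, ratings.max() + 4 * h, 1000)
z = (grid[:, None] - ratings[None, :]) / h
manual_kde = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# The same estimate from SciPy, evaluated on the same grid.
scipy_kde = ss.gaussian_kde(ratings)(grid)

# Riemann-sum approximation of the area under the KDE curve.
kde_area = np.sum(manual_kde) * (grid[1] - grid[0])

print(hist_area, kde_area, np.allclose(manual_kde, scipy_kde))
```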
Or using seaborn (two ways):
In [5]:
sns.displot(movies['IMDB_Rating'], bins=15, kde=True)
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x147ba3560>
In [6]:
sns.histplot(movies['IMDB_Rating'], bins=15, kde=True)
Out[6]:
<Axes: xlabel='IMDB_Rating', ylabel='Count'>
Q: Can you plot the histogram and KDE of the Rotten_Tomatoes_Rating?
In [7]:
# YOUR SOLUTION HERE
Out[7]:
<Axes: ylabel='Density'>
In [8]:
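f = plt.figure(figsize=(15, 8))
sample_sizes = [10, 50, 100, 500, 1000, 2000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2, 3, i)
    plt.title("Sample size: {}".format(N))
    # Draw five independent random samples of size N and overlay their KDEs.
    for _ in range(5):
        s = movies['IMDB_Rating'].sample(N)
        sns.kdeplot(s, legend=False)
    # Set the x-range on each panel; calling plt.xlim before the subplots exist does not affect them.
    plt.xlim(0, 10)
plt.tight_layout()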
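The panels above compare KDE curves from repeated random samples at different sample sizes. One rough way to put a number on how much the curves disagree is sketched below, using synthetic data and a hand-written Gaussian KDE with Scott's rule bandwidth. The helper `gaussian_kde_1d`, the synthetic `population`, and the `spread` measure are all assumptions made for this illustration and are not part of the lab.

```python
import numpy as np

def gaussian_kde_1d(sample, grid):
    """Gaussian KDE on `grid` with Scott's rule bandwidth (illustrative helper)."""
    n = sample.size
    h = sample.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(42)
# Synthetic "population" standing in for a ratings column (assumption).
population = rng.normal(6.5, 1.2, size=20_000)
grid = np.linspace(1, 10, 200)

spread = {}
for n in [10, 100, 1000]:
    # 20 independent samples of size n, each turned into a KDE curve.
    curves = np.array([
        gaussian_kde_1d(rng.choice(population, size=n, replace=False), grid)
        for _ in range(20)
    ])
    # Average pointwise standard deviation across the 20 curves.
    spread[n] = curves.std(axis=0).mean()

for n, value in spread.items():
    print(n, value)
```

Under these assumptions the spread shrinks as the sample size grows, which matches the visual impression that curves from larger samples overlap more closely.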
We can also draw KDE plots using scikit-learn, which lets us choose among different kernel functions.
First, we need points to score across.
- Remember the np.linspace() function?
- IMDb ratings range from 1 to 10. Let's create 1000 points between 1 and 10.
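As a quick refresher, here is a small sketch of how `np.linspace` behaves for this range. The names `xs`, `step`, and `X` are illustrative. Both endpoints are included by default, and because scikit-learn estimators generally expect a 2-D array of shape (n_samples, n_features), the 1-D grid is also reshaped into a single column.

```python
import numpy as np

# 1000 evenly spaced points from 1 to 10, endpoints included.
xs = np.linspace(1, 10, 1000)

# Spacing between neighbouring points: (10 - 1) / (1000 - 1).
step = xs[1] - xs[0]

# Column-vector view with shape (1000, 1), the layout scikit-learn expects.
X = xs[:, np.newaxis]

print(xs[0], xs[-1], step, X.shape)
```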
In [9]: