Module 9: Estimation¶

In this lab, we will learn about Kernel Density Estimation (KDE), interpolation, and (briefly) regression.

In [1]:

Copied!





import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
import scipy.stats as ss
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import pandas as pd
import scipy.stats as ss
%matplotlib inline

Kernel density estimation¶

Let's import the IMDb data.

In [2]:

Copied!

import vega_datasets

movies = vega_datasets.data.movies()
movies.head()
import vega_datasets

movies = vega_datasets.data.movies()
movies.head()

Out[2]:

	Title	US_Gross	Worldwide_Gross	US_DVD_Sales	Production_Budget	Release_Date	MPAA_Rating	Running_Time_min	Distributor	Source	Major_Genre	Creative_Type	Director	Rotten_Tomatoes_Rating	IMDB_Rating	IMDB_Votes
0	The Land Girls	146083.0	146083.0	NaN	8000000.0	Jun 12 1998	R	NaN	Gramercy	None	None	None	None	NaN	6.1	1071.0
1	First Love, Last Rites	10876.0	10876.0	NaN	300000.0	Aug 07 1998	R	NaN	Strand	None	Drama	None	None	NaN	6.9	207.0
2	I Married a Strange Person	203134.0	203134.0	NaN	250000.0	Aug 28 1998	None	NaN	Lionsgate	None	Comedy	None	None	NaN	6.8	865.0
3	Let's Talk About Sex	373615.0	373615.0	NaN	300000.0	Sep 11 1998	None	NaN	Fine Line	None	Comedy	None	None	13.0	NaN	NaN
4	Slam	1009819.0	1087521.0	NaN	1000000.0	Oct 09 1998	R	NaN	Trimark	Original Screenplay	Drama	Contemporary Fiction	None	62.0	3.4	165.0

Although we have learned that it is dangerous to drop any missing values, we will do so for the sake of simplicity. We are also really not trying to draw any conclusions about the data so it is okay. But be careful with missing data in practice!

Q: Can you drop rows that have NaN value in either IMDB_Rating or Rotten_Tomatoes_Rating?

In [3]:

Copied!

# YOUR SOLUTION HERE
# YOUR SOLUTION HERE

We can plot histogram and KDE using pandas:

In [4]:

Copied!

movies['IMDB_Rating'].hist(bins=10, density=True)
movies['IMDB_Rating'].plot(kind='kde')
movies['IMDB_Rating'].hist(bins=10, density=True)
movies['IMDB_Rating'].plot(kind='kde')

Out[4]:

<Axes: ylabel='Density'>

Or using seaborn (two ways):

In [5]:

Copied!

sns.displot(movies['IMDB_Rating'], bins=15, kde=True)
sns.displot(movies['IMDB_Rating'], bins=15, kde=True)

Out[5]:

<seaborn.axisgrid.FacetGrid at 0x12bdd1580>

In [6]:

Copied!

sns.histplot(movies['IMDB_Rating'], bins=15, kde=True)
sns.histplot(movies['IMDB_Rating'], bins=15, kde=True)

Out[6]:

<Axes: xlabel='IMDB_Rating', ylabel='Count'>

Q: Can you plot the histogram and KDE of the Rotten_Tomatoes_Rating?

In [7]:

Copied!

# YOUR SOLUTION HERE
# YOUR SOLUTION HERE

Out[7]:

<Axes: ylabel='Density'>

Let's study how KDE changes with the size of the data.

We can get a random sample using the pandas' sample() function. The kdeplot() function in seaborn provides many options (like kernel types) to do KDE. Let's sample some data points and see how does KDE plot changes with the size of the samples.

In [8]:

Copied!





f = plt.figure(figsize=(15,8))
plt.xlim(0, 10)

sample_sizes = [10, 50, 100, 500, 1000, 2000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2,3,i)
    plt.title("Sample size: {}".format(N))
    for j in range(5):
        s = movies['IMDB_Rating'].sample(N)
        sns.kdeplot(s, legend=False)
plt.tight_layout()
f = plt.figure(figsize=(15,8))
plt.xlim(0, 10)

sample_sizes = [10, 50, 100, 500, 1000, 2000]
for i, N in enumerate(sample_sizes, 1):
    plt.subplot(2,3,i)
    plt.title("Sample size: {}".format(N))
    for j in range(5):
        s = movies['IMDB_Rating'].sample(N)
        sns.kdeplot(s, legend=False)
plt.tight_layout()

We can also draw KDE plots using scikit-learn to change kernel functions.

First, we need points to score across.

Remember the np.linspace() function?
IMDB scores are only between 1 and 10. Let's create 1000 points between 1 and 10.

In [9]:

Copied!

# use the linespace function to create a list of 1000 values
# between 1 and 10
# Store the sequence in a variable called X_vals

# YOUR SOLUTION HERE

X_vals[0:10]
# use the linespace function to create a list of 1000 values
# between 1 and 10
# Store the sequence in a variable called X_vals

# YOUR SOLUTION HERE

X_vals[0:10]

Out[9]:

array([1.        , 1.00900901, 1.01801802, 1.02702703, 1.03603604,
       1.04504505, 1.05405405, 1.06306306, 1.07207207, 1.08108108])

In [10]:

Copied!

X_vals.shape
X_vals.shape

Out[10]:

(1000,)

Our array needs to have 2 dimensions for score_samples() to work. Let's add another dimension to our array using np.newaxis.

In [11]:

Copied!

X_plot = X_vals[:, np.newaxis]
X_plot[0:10]
X_plot = X_vals[:, np.newaxis]
X_plot[0:10]

Out[11]:

array([[1.        ],
       [1.00900901],
       [1.01801802],
       [1.02702703],
       [1.03603604],
       [1.04504505],
       [1.05405405],
       [1.06306306],
       [1.07207207],
       [1.08108108]])

In [12]:

Copied!

X_plot.shape
X_plot.shape

Out[12]:

(1000, 1)

The KernelDensity function in scikit-learn needs a 2d array for the samples when we fit to the different kernels. Multidimensional scaling is deprecated for pandas series. We can cast the series to a numpy array before adding another dimension to solve this issue.

In [13]:

Copied!





sample = movies['IMDB_Rating'].sample(5)
print(sample)
print(type(sample))
print(sample.shape)
sample = movies['IMDB_Rating'].sample(5)
print(sample)
print(type(sample))
print(sample.shape)

2490    5.8
434     8.0
3040    6.4
2943    7.3
2353    5.9
Name: IMDB_Rating, dtype: float64
<class 'pandas.core.series.Series'>
(5,)

In [14]:

Copied!





sample = np.array(sample)
print(sample)
print(type(sample))
print(sample.shape)
sample = np.array(sample)
print(sample)
print(type(sample))
print(sample.shape)

[5.8 8.  6.4 7.3 5.9]
<class 'numpy.ndarray'>
(5,)

In [15]:

Copied!

# Add another dimension to sample using np.newaxis
# Just like we did to X_vals.

# YOUR SOLUTION HERE
# Add another dimension to sample using np.newaxis
# Just like we did to X_vals.

# YOUR SOLUTION HERE

[[5.8]
 [8. ]
 [6.4]
 [7.3]
 [5.9]]
<class 'numpy.ndarray'>
(5, 1)

Now, let's plot using the tophat and gaussian kernels using the different sample sizes we used before. We'll do this using scikit-learn's KernelDensity() function.

In [16]:

Copied!





from sklearn.neighbors import KernelDensity
sample_sizes = [10, 50, 100, 500, 1000, 2000]

def plot_kde(X_plot, sample_sizes, movies, kernel='gaussian'):
    for sample_size in sample_sizes:
        sample = np.array(movies['IMDB_Rating'].sample(sample_size))
        sample = sample[:, np.newaxis]
        kde = KernelDensity(kernel=kernel).fit(sample)
        log_density = kde.score_samples(X_plot)
        plt.plot(X_plot[:, 0], np.exp(log_density), '-')

fig, ax = plt.subplots(figsize=(10,5))

# Let's look at the gaussian kernel like before
# put it in the first spot on a 1x2 grid
plt.subplot(1,2,1)
plt.title("kernel: {}".format('gaussian'))
plot_kde(X_plot, sample_sizes, movies)
# Now let's look at the tophat kernel
# put it in the second spot on a 1x2 grid
plt.subplot(1,2,2)
plt.title("kernel: {}".format('tophat'))
plot_kde(X_plot, sample_sizes, movies, kernel='tophat')
from sklearn.neighbors import KernelDensity
sample_sizes = [10, 50, 100, 500, 1000, 2000]

def plot_kde(X_plot, sample_sizes, movies, kernel='gaussian'):
    for sample_size in sample_sizes:
        sample = np.array(movies['IMDB_Rating'].sample(sample_size))
        sample = sample[:, np.newaxis]
        kde = KernelDensity(kernel=kernel).fit(sample)
        log_density = kde.score_samples(X_plot)
        plt.plot(X_plot[:, 0], np.exp(log_density), '-')

fig, ax = plt.subplots(figsize=(10,5))

# Let's look at the gaussian kernel like before
# put it in the first spot on a 1x2 grid
plt.subplot(1,2,1)
plt.title("kernel: {}".format('gaussian'))
plot_kde(X_plot, sample_sizes, movies)
# Now let's look at the tophat kernel
# put it in the second spot on a 1x2 grid
plt.subplot(1,2,2)
plt.title("kernel: {}".format('tophat'))
plot_kde(X_plot, sample_sizes, movies, kernel='tophat')

Let's try all kernel types using scikit-learn's KernelDensity(). Plot a 2x3 grid, however, instead of changing sample sizes like above, make each plot use the 6 different kernels supported by scikit-learn. Keep the sample size to 2 so we can more easily see the different distribution shapes. Also, draw a different sample (of 2) 5 times like the plots we drew with seaborn above.

Helpful links:

In [17]:

Copied!

# YOUR SOLUTION HERE
# YOUR SOLUTION HERE

Q: Let's use Seaborn again. We can play with the bandwidth option. Make sure to set the xlim so that all plots have the same x range, so that we can compare.

In [18]:

Copied!





f = plt.figure(figsize=(15,8))
bw = ['scott', 'silverman', 0.01, 0.1, 1, 5]
sample_size = 10

# YOUR SOLUTION HERE
plt.tight_layout()
f = plt.figure(figsize=(15,8))
bw = ['scott', 'silverman', 0.01, 0.1, 1, 5]
sample_size = 10

# YOUR SOLUTION HERE
plt.tight_layout()

Q: What's your takeaway? Explain how bandwidth affects the result of your visualization.

In [19]:

Copied!

# YOUR SOLUTION HERE
# YOUR SOLUTION HERE

Interpolation¶

One area where interpolation is used a lot is image processing. Play with it!

https://matplotlib.org/examples/images_contours_and_fields/interpolation_methods.html

In [20]:

Copied!





methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16',
           'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric',
           'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos']
np.random.seed(0)
grid = np.random.rand(4, 4)

plt.imshow(grid, interpolation=None, cmap='viridis')
methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16',
           'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric',
           'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos']
np.random.seed(0)
grid = np.random.rand(4, 4)

plt.imshow(grid, interpolation=None, cmap='viridis')

Out[20]:

<matplotlib.image.AxesImage at 0x138239eb0>

In [21]:

Copied!

plt.imshow(grid, interpolation='bicubic', cmap='viridis')
plt.imshow(grid, interpolation='bicubic', cmap='viridis')

Out[21]:

<matplotlib.image.AxesImage at 0x1382bdb50>

Let's look at some time series data.

In [22]:

Copied!

co2 = vega_datasets.data.co2_concentration()
co2.head()
co2 = vega_datasets.data.co2_concentration()
co2.head()

Out[22]:

	Date	CO2
0	1958-03-01	315.70
1	1958-04-01	317.46
2	1958-05-01	317.51
3	1958-07-01	315.86
4	1958-08-01	314.93

In [23]:

Copied!

co2.Date.dtype
co2.Date.dtype

Out[23]:

dtype('O')

The Date colume is stored as strings. Let's convert it to datetime so that we can manipulate.

In [24]:

Copied!

pd.to_datetime(co2.Date).head()
pd.to_datetime(co2.Date).head()

Out[24]:

0   1958-03-01
1   1958-04-01
2   1958-05-01
3   1958-07-01
4   1958-08-01
Name: Date, dtype: datetime64[ns]

In [25]:

Copied!

co2.Date = pd.to_datetime(co2.Date)
co2.Date = pd.to_datetime(co2.Date)

In [26]:

Copied!

co2.set_index('Date', inplace=True)
co2.head()
co2.set_index('Date', inplace=True)
co2.head()

Out[26]:

	CO2
Date
1958-03-01	315.70
1958-04-01	317.46
1958-05-01	317.51
1958-07-01	315.86
1958-08-01	314.93

In [27]:

Copied!

co2.plot()
co2.plot()

Out[27]:

<Axes: xlabel='Date'>

😢

In [28]:

Copied!

recent_co2 = co2.tail(8)
recent_co2 = co2.tail(8)

In [29]:

Copied!

recent_co2.plot()
recent_co2.plot()

Out[29]:

<Axes: xlabel='Date'>

This standard line chart above can be considered as a chart with linear interpolation between data points.

The data contains measurements at the resolution of about a month. Let's up-sample the data. This process create new rows that fill the gap between data points. We can use interpolate() function to fill the gaps.

In [30]:

Copied!

upsampled = recent_co2.resample('D')
upsampled.interpolate().head()
upsampled = recent_co2.resample('D')
upsampled.interpolate().head()

Out[30]:

	CO2
Date
2017-05-01	409.910000
2017-05-02	409.884516
2017-05-03	409.859032
2017-05-04	409.833548
2017-05-05	409.808065

If we do linear interpolation, we get more or less the same plot, but just with more points.

In [31]:

Copied!

recent_co2.resample('D').interpolate(method='linear').plot(style='+-')
recent_co2.resample('D').interpolate(method='linear').plot(style='+-')

Out[31]:

<Axes: xlabel='Date'>

In [32]:

Copied!

recent_co2.plot(style='+-')
recent_co2.plot(style='+-')

Out[32]:

<Axes: xlabel='Date'>

Nearest interpolation is just a process of assigning the nearest value to each missing rows.

In [33]:

Copied!

ax = recent_co2.resample('D').interpolate(method='nearest').plot(style='+-')
recent_co2.plot(ax=ax, style='o', ms=5)
ax = recent_co2.resample('D').interpolate(method='nearest').plot(style='+-')
recent_co2.plot(ax=ax, style='o', ms=5)

Out[33]:

<Axes: xlabel='Date'>

Let's try a spline too.

In [34]:

Copied!

ax = recent_co2.resample('D').interpolate(method='spline', order=5).plot(style='+-', lw=1)
recent_co2.plot(ax=ax, style='o', ms=5)
ax = recent_co2.resample('D').interpolate(method='spline', order=5).plot(style='+-', lw=1)
recent_co2.plot(ax=ax, style='o', ms=5)

Out[34]:

<Axes: xlabel='Date'>

Moving average¶

Pandas has a nice method called rolling(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html

It lets you do operations on the rolling windows. For instance, if you want to calculate the moving average, you can simply

In [35]:

Copied!

ax = co2[-100:].plot(lw=0.5)
co2[-100:].rolling(12).mean().plot(ax=ax)
ax = co2[-100:].plot(lw=0.5)
co2[-100:].rolling(12).mean().plot(ax=ax)

Out[35]:

<Axes: xlabel='Date'>

By default, it consider every data point inside each window equally (win_type=None) but there are many window types supported by scipy. Also by default, the mean value is put at the right end of the window (trailing average).

Q: can you create a plot with triang window type and centered average?

In [36]:

Copied!

# YOUR SOLUTION HERE
# YOUR SOLUTION HERE

Out[36]:

<Axes: xlabel='Date'>

Examining relationsips¶

Remember Anscombe's quartet? Actually, the dataset is not only included in vega_datasets but also in seaborn.

In [37]:

Copied!

df = sns.load_dataset("anscombe")
df.head()
df = sns.load_dataset("anscombe")
df.head()

Out[37]:

	dataset	x	y
0	I	10.0	8.04
1	I	8.0	6.95
2	I	13.0	7.58
3	I	9.0	8.81
4	I	11.0	8.33

All four datasets are in this single data frame and the 'dataset' indicator is one of the columns. This is a form often called tidy data, which is easy to manipulate and plot. In tidy data, each row is an observation and columns are the properties of the observation. Seaborn makes use of the tidy form. Using seaborn's lmplot, you can very quickly examine relationships between variables, separated by some facets of the dataset.

Q: Can you produce the plot below using lmplot()?

In [38]:

Copied!





# plotting parameters you can use
palette = "muted"
scatter_kws={"s": 50, "alpha": 1}
ci=None
height=4

# Implement

# YOUR SOLUTION HERE
# plotting parameters you can use
palette = "muted"
scatter_kws={"s": 50, "alpha": 1}
ci=None
height=4

# Implement

# YOUR SOLUTION HERE

Out[38]:

<seaborn.axisgrid.FacetGrid at 0x1387951f0>

Q: So let's look at the relationship between IMDB_Rating and Rotten_Tomatoes_Rating in the movies dataset, separated with respect to MPAA_Rating. Put 4 plots in a row.

In [39]:

Copied!

# YOUR SOLUTION HERE

# YOUR SOLUTION HERE

Out[39]:

<seaborn.axisgrid.FacetGrid at 0x1383ce390>

It may be interesting to dig up what are the movies that have super high Rotten Tomatoes rating and super low IMDB rating (and vice versa)!

Another useful method for examining relationships is jointplot(), which produces a scatter plot with two marginal histograms.

In [40]:

Copied!

g = sns.jointplot(x = movies['IMDB_Rating'], y = movies['Rotten_Tomatoes_Rating'], s=10, alpha=0.4, facecolors='none', edgecolor='b')
g = sns.jointplot(x = movies['IMDB_Rating'], y = movies['Rotten_Tomatoes_Rating'], s=10, alpha=0.4, facecolors='none', edgecolor='b')

Hexbin density plot¶

In 2D, heatmap can be considered as a color-based histogram. You divide the space into bins and show the frequency with colors. A common binning method is the hexagonal bin.

We can again use the jointplot() and setting the kind to be hexbin.

Q: Can you create one?

In [41]:

Copied!

# YOUR SOLUTION HERE
# YOUR SOLUTION HERE

2D KDE¶

We can also do 2D KDE using seaborn's kdeplot() function.

Q: Can you draw one like this?

In [42]:

Copied!

cmap = "Reds"
fill = True         # what happens if you change this?
thresh = 0        # what happens if you change this?

# YOUR SOLUTION HERE
cmap = "Reds"
fill = True         # what happens if you change this?
thresh = 0        # what happens if you change this?

# YOUR SOLUTION HERE

Or again using jointplot() by setting the kind parameter. Look, we also have the 1D marginal KDE plots!

Q: create jointplot with KDE

Note that the X-axis is logarithm of IMDB_Votes.

In [43]:

Copied!

x = np.log(movies['IMDB_Votes'])

# YOUR SOLUTION HERE
x = np.log(movies['IMDB_Votes'])

# YOUR SOLUTION HERE

Out[43]:

<seaborn.axisgrid.JointGrid at 0x138d88e60>