Module 11: Visualizing high dimensional data¶
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
sns.set_style('white')
Scatterplot matrix for moderately high-dimensional data¶
In many cases, the number of dimensions is not too large. For instance, the "Iris" dataset contains four measured dimensions for three species of iris flowers. That's more than two dimensions, yet still manageable.
iris = sns.load_dataset('iris')
iris.head(2)
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
We get four dimensions (sepal_length, sepal_width, petal_length, petal_width). One direct way to visualize them is to draw a scatter plot for each pair of dimensions. We can use the pairplot() function in seaborn to do this.
sns.pairplot(iris)
By using colors, you can get a much more useful plot.
sns.pairplot(iris, hue='species')
Seaborn also lets us specify what to put on the diagonal. When hue is used, it defaults to a KDE plot. We can change it back to a histogram. See: https://seaborn.pydata.org/generated/seaborn.pairplot.html
Q: draw a pairplot with hue and histogram on the diagonal
# YOUR SOLUTION HERE
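If you get stuck, here is a minimal sketch of one possible solution; the diag_kind parameter of pairplot() controls what is drawn on the diagonal:
# a sketch: hue for species, histograms on the diagonal
sns.pairplot(iris, hue='species', diag_kind='hist')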
We can use altair to create an interactive scatterplot matrix. Can you create a scatterplot matrix of the iris dataset by consulting https://altair-viz.github.io/gallery/scatter_matrix.html?
Q: Draw an interactive scatterplot matrix for the iris dataset in altair
# YOUR SOLUTION HERE
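For reference, a minimal sketch following the linked gallery example (the row/column ordering and the chart size are just one possible choice):
import altair as alt

# repeat a scatter plot over every pair of the four measurement columns
alt.Chart(iris).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='species:N'
).properties(
    width=130,
    height=130
).repeat(
    row=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    column=['petal_width', 'petal_length', 'sepal_width', 'sepal_length']
).interactive()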
Parallel coordinates¶
Another useful visualization for not-so-high-dimensional datasets is the parallel coordinates plot. pandas supports parallel coordinate plots as well as "Andrews curves" (you can think of an Andrews curve as a smoothed version of a parallel coordinates plot).
- https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#parallel-coordinates
- https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#andrews-curves
Q: Can you draw a parallel coordinates plot and an Andrews curves plot of the iris dataset? (I'm using the viridis and winter colormaps, btw.)
# YOUR SOLUTION HERE
# YOUR SOLUTION HERE
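For reference, a minimal sketch using the pandas plotting functions linked above (the viridis and winter colormaps are the ones mentioned in the question):
# parallel coordinates plot: one vertical axis per measurement column
pd.plotting.parallel_coordinates(iris, 'species', colormap='viridis')
plt.show()

# Andrews curves: each flower becomes a smooth curve
pd.plotting.andrews_curves(iris, 'species', colormap='winter')
plt.show()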
We can also use altair.
import altair as alt

# melt into long format: one row per (flower index, variable) pair
iris_transformed = iris.reset_index().melt(['species', 'index'])

alt.Chart(iris_transformed).mark_line().encode(
    x='variable:N',
    y='value:Q',
    color='species:N',
    detail='index:N',
    opacity=alt.value(0.5),
).properties(width=500)
Q: can you explain how iris_transformed is different from the original iris dataset, and why we need to transform it in this way?
YOUR SOLUTION HERE¶
PCA¶
Principal component analysis (PCA) is the most basic dimensionality reduction method. For example, in the Iris dataset we have four variables (sepal_length, sepal_width, petal_length, petal_width). If we can reduce the number of variables to two, then we can easily visualize them in two dimensions.
PCA is already implemented in the scikit-learn package, a machine learning library in Python, which should have been included in Anaconda. If you don't have it, install it with:
conda install scikit-learn
or
pip install scikit-learn
iris.head(2)
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
This is four-dimensional data. To run PCA, we first isolate the numerical columns.
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_only_features = iris[features]
iris_only_features.head()
|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
We should first create a PCA object and specify the number of components to obtain. Note that you can obtain more than two principal components.
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # set the number of components to 2
Now you can run the fit() method to identify the principal components.
pca_iris_fitted = pca.fit(iris_only_features)
An important set of numbers that you want to look at is the explained variance ratio.
pca_iris_fitted.explained_variance_ratio_
array([0.92461872, 0.05306648])
It tells you how much of the variance in the original dataset is explained by each of the principal components that you obtained. The first two components together capture about 97.8% of the variance in the original dataset. This means that PCA is very effective on this dataset: using just two principal components is a very good approximation of using all four dimensions. Now you can use the result to transform the original dataset.
iris_pca = pca_iris_fitted.transform(iris_only_features)
iris_pca[:5]
array([[-2.68412563, 0.31939725], [-2.71414169, -0.17700123], [-2.88899057, -0.14494943], [-2.74534286, -0.31829898], [-2.72871654, 0.32675451]])
A convenient way to do both fitting and transforming is
iris_pca = pca.fit_transform(iris_only_features)
iris_pca[:5]
array([[-2.68412563, 0.31939725], [-2.71414169, -0.17700123], [-2.88899057, -0.14494943], [-2.74534286, -0.31829898], [-2.72871654, 0.32675451]])
You can see that this transformed matrix has two columns. Each column contains the coordinates of the data points along one of the principal components.
iris_pca_df = pd.DataFrame(data=iris_pca, columns=['PC1', 'PC2'])
iris_pca_df.head()
|   | PC1 | PC2 |
|---|---|---|
| 0 | -2.684126 | 0.319397 |
| 1 | -2.714142 | -0.177001 |
| 2 | -2.888991 | -0.144949 |
| 3 | -2.745343 | -0.318299 |
| 4 | -2.728717 | 0.326755 |
Let's add the species information to the dataframe.
Q: add a species column to iris_pca_df.
# YOUR SOLUTION HERE
|   | PC1 | PC2 | species |
|---|---|---|---|
| 0 | -2.684126 | 0.319397 | setosa |
| 1 | -2.714142 | -0.177001 | setosa |
| 2 | -2.888991 | -0.144949 | setosa |
| 3 | -2.745343 | -0.318299 | setosa |
| 4 | -2.728717 | 0.326755 | setosa |
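For reference, a minimal sketch of one way to produce the table above; the assignment works because both dataframes share the default integer index:
# attach the species labels to the PCA coordinates
iris_pca_df['species'] = iris['species']
iris_pca_df.head()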
Now we can produce a scatterplot based on the two principal components. Well, let's just draw a pairplot.
# YOUR SOLUTION HERE
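For reference, a minimal sketch:
# scatterplot matrix of the two principal components, colored by species
sns.pairplot(iris_pca_df, hue='species')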
PC1 seems to capture the inter-species variation while PC2 seems to capture the intra-species variation. 🧐 Interesting!
PCA with faces¶
Let's play with PCA with some faces. 🙄😬🤓
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data
n_samples, n_features = faces.shape
print(n_samples)
print(n_features)
400
4096
So, this dataset contains 400 faces, and each of them has 4096 features (=pixels). Let's look at the first face:
print(faces[0].shape)
faces[0]
(4096,)
array([0.6694215 , 0.6363636 , 0.6487603 , ..., 0.08677686, 0.08264463, 0.07438017], dtype=float32)
It's a one-dimensional array with 4096 numbers. But a face should be a two-dimensional picture, right? Using numpy's reshape() function and matplotlib's imshow() function, transform this one-dimensional array into an appropriate 2-D matrix and draw it to show the face. You probably want to use plt.cm.gray as the colormap.
Be sure to play with different shapes (e.g., 2 x 2048, 1024 x 4, 128 x 32, and so on) and think about why they look the way they do. What is the right shape of the array?
Q: reshape the one-dimensional array into an appropriate two-dimensional array and show the face
# TODO: draw faces[0] with various shapes. Find the correct dimension.
# image_shape = (xx, yy)
# YOUR SOLUTION HERE
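For reference, a minimal sketch: since 4096 = 64 × 64, the square shape recovers the face.
# the Olivetti faces are 64 x 64 pixel images
image_shape = (64, 64)
plt.imshow(faces[0].reshape(image_shape), cmap=plt.cm.gray)
plt.xticks(())
plt.yticks(())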
Let's perform PCA on this dataset.
from sklearn.decomposition import PCA
Set the number of components to 6:
n_components=6
pca = PCA(n_components=n_components)
Fit the faces data:
pca.fit(faces)
PCA(n_components=6)
PCA has an attribute called components_. It is an $n_{\text{components}} \times n_{\text{features}}$ matrix, in our case $6 \times 4096$. Each row is a component.
pca.components_
array([[ 0.00419121, 0.00710952, 0.00933611, ..., -0.00018518, -0.00337968, -0.00318828], [ 0.02859139, 0.03328836, 0.03784649, ..., -0.02962784, -0.027213 , -0.024889 ], [ 0.00135774, -0.00032581, -0.00019797, ..., -0.01541367, -0.0137098 , -0.01188343], [ 0.00112523, -0.00179056, -0.01168219, ..., 0.02943025, 0.02781942, 0.02521872], [ 0.02384258, 0.02359162, 0.02216286, ..., 0.04243959, 0.04007483, 0.04110369], [ 0.02909828, 0.03130282, 0.02877657, ..., -0.01636213, -0.0163774 , -0.01491205]], dtype=float32)
pca.components_.shape
(6, 4096)
We can display the 6 components as images:
# use the `image_shape` that you defined in the previous question.
for i, comp in enumerate(pca.components_, 1):
    plt.subplot(2, 3, i)
    plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
    plt.xticks(())
    plt.yticks(())
😱 Looks a bit scary...
They are the "principal faces", which means that, by adding up these images with some appropriate weights, we can get a close approximation of the 400 images in the dataset!
We can get the coordinates of the 6 components to understand how each face is composed with the components.
faces_pca_transformed = pca.transform(faces)
faces_pca_transformed.shape
(400, 6)
faces_pca_transformed is a $400 \times 6$ matrix. Each row corresponds to one face and contains its coordinates on the 6 components. For instance, the coordinates for the first face are
faces_pca_transformed[0]
array([-0.81575775, 4.144042 , 2.483271 , -0.90309095, -0.83137465, 0.8865826 ], dtype=float32)
It seems that the second component (with coordinate 4.144042) contributes the most to the first face. Let's display them together and see how similar they are:
# display the first face image
plt.subplot(1, 2, 1)
plt.imshow(faces[0].reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())
# display the second component
plt.subplot(1, 2, 2)
plt.imshow(pca.components_[1].reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())
We can display the composition of faces in an "equation" style:
from matplotlib import gridspec

def display_image(ax, image):
    ax.imshow(image, cmap=plt.cm.gray, interpolation='nearest')
    ax.set_xticks(())
    ax.set_yticks(())

def display_text(ax, text):
    ax.text(.5, .5, text, size=12)
    ax.axis('off')

face_idx = 0
plt.figure(figsize=(16, 4))
gs = gridspec.GridSpec(2, 10, width_ratios=[5, 1, 1, 5, 1, 1, 5, 1, 1, 5])

# display the face
ax = plt.subplot(gs[0])
display_image(ax, faces[face_idx].reshape(image_shape))

# display the equal sign
ax = plt.subplot(gs[1])
display_text(ax, r'$=$')

# display the 6 coordinates
for coord_i, gs_i in enumerate([2, 5, 8, 12, 15, 18]):
    ax = plt.subplot(gs[gs_i])
    display_text(ax, r'$%.3f \times $' % faces_pca_transformed[face_idx][coord_i])

# display the 6 components
for comp_i, gs_i in enumerate([3, 6, 9, 13, 16, 19]):
    ax = plt.subplot(gs[gs_i])
    display_image(ax, pca.components_[comp_i].reshape(image_shape))

# display the plus signs
for gs_i in [4, 7, 11, 14, 17]:
    ax = plt.subplot(gs[gs_i])
    display_text(ax, r'$+$')
We can directly see the results of this addition.
f, axes = plt.subplots(1, 6, figsize=(16, 4))

faceid = 0
constructed_faces = []
for i in range(2, 10):
    constructed_faces.append(np.dot(faces_pca_transformed[faceid][:i], pca.components_[:i]))

# the face that we want to construct.
display_image(axes[0], faces[0].reshape(image_shape))

for idx, ax in enumerate(axes[1:]):
    display_image(ax, constructed_faces[idx].reshape(image_shape))
The reconstruction becomes more and more realistic, although it is still quite far off with only a few components.
NMF¶
There is another pretty cool dimensionality reduction method called NMF (non-negative matrix factorization). It is widely used in many domains, such as identifying topics in documents, identifying key components in images, and so on. The key idea is that, by forcing every element in the factor matrices to be non-negative, NMF breaks the data into parts that can only be added together.
Q: fit the faces data with NMF, set the number of iterations higher than default
from sklearn.decomposition import NMF
n_components=20
MAX_ITER = 10000
# YOUR SOLUTION HERE
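For reference, a minimal sketch that defines nmf_fitted for the plotting cell below (raising max_iter above the default helps the factorization converge):
# fit NMF with 20 components and a generous iteration budget
nmf = NMF(n_components=n_components, max_iter=MAX_ITER)
nmf_fitted = nmf.fit(faces)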
for i, comp in enumerate(nmf_fitted.components_, 1):
    plt.subplot(4, 5, i)
    plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
    plt.xticks(())
    plt.yticks(())
As you can see here, each 'component' of NMF picks up a certain part of the face (light area), such as eyes, chin, nose, and so on. Very cool.
faces_nmf_transformed = nmf_fitted.transform(faces)
faces_nmf_transformed[0]
array([0.01088603, 0.01812078, 0. , 0.00771494, 0.02231103, 0.00375471, 0.07526012, 0.00344858, 0.00679955, 0.03952554, 0.00696117, 0. , 0.03756884, 0. , 0.07927559, 0.2600933 , 0.16911623, 0.05870812, 0.16087928, 0.16433261], dtype=float32)
Can you show the reconstructed faces using the first n components, as we did for the PCA?
f, axes = plt.subplots(1, 8, figsize=(20,4))
faceid = 0
constructed_faces = []
# YOUR SOLUTION HERE
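For reference, a minimal sketch that continues the cell above, mirroring the PCA reconstruction (the list of component counts is an arbitrary choice that fills the seven remaining axes):
# reconstruct the face with an increasing number of NMF components
for i in [1, 3, 5, 8, 12, 16, 20]:
    constructed_faces.append(
        np.dot(faces_nmf_transformed[faceid][:i], nmf_fitted.components_[:i]))

# the original face, followed by the reconstructions
display_image(axes[0], faces[faceid].reshape(image_shape))
for idx, ax in enumerate(axes[1:]):
    display_image(ax, constructed_faces[idx].reshape(image_shape))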
Unlike PCA, which keeps superposing positive and negative images, NMF tends to gradually add parts to the image. This is why it is widely used for many decomposition tasks, such as detecting topics in documents.
t-SNE, Isomap, and MDS¶
Isomap, t-SNE, and MDS are nonlinear dimensionality reduction methods. Isomap focuses on preserving local (neighborhood) relationships, MDS tries to preserve all pairwise distances, and t-SNE is more flexible. t-SNE is especially popular in machine learning.
Let's try t-SNE out with the iris data.
Q: Fit-transform the iris data with t-SNE and create a scatterplot of it.
from sklearn.manifold import TSNE
from sklearn.manifold import Isomap
from sklearn.manifold import MDS
from sklearn.datasets import load_iris
# YOUR SOLUTION HERE
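For reference, a minimal sketch (the random_state and the species-to-color mapping are arbitrary choices):
# 2-D t-SNE embedding of the four iris measurement columns
iris_tsne = TSNE(n_components=2, random_state=42).fit_transform(iris[features].values)

species_codes = iris['species'].astype('category').cat.codes
plt.scatter(iris_tsne[:, 0], iris_tsne[:, 1], c=species_codes)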
The hyperparameter perplexity determines how to balance attention between the local and global aspects of your data. Changing this parameter (the default is 30) may cause drastic changes in the output. Play with multiple values of perplexity.
# YOUR SOLUTION HERE
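For reference, a minimal sketch comparing a few perplexity values side by side (the particular values are arbitrary):
species_codes = iris['species'].astype('category').cat.codes

plt.figure(figsize=(16, 4))
for i, perplexity in enumerate([2, 5, 30, 100], 1):
    proj = TSNE(n_components=2, perplexity=perplexity, random_state=42).fit_transform(iris[features].values)
    plt.subplot(1, 4, i)
    plt.scatter(proj[:, 0], proj[:, 1], c=species_codes, s=10)
    plt.title(f'perplexity = {perplexity}')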
If you want to learn more about t-SNE, play with https://distill.pub/2016/misread-tsne/ and https://experiments.withgoogle.com/visualizing-high-dimensional-space
Visualizing the Digits dataset¶
This is a classic dataset of images of handwritten digits. It contains 1797 images with (8*8=64) pixels each.
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
(1797, 64)
digits.data stores the images:
digits.data[0]
array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
and digits.target contains the classes (or labels) that the images belong to. There are 10 classes in total.
digits.target
array([0, 1, 2, ..., 8, 9, 8])
Q: use imshow to display the first image.
# YOUR SOLUTION HERE
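For reference, a minimal sketch (the reversed grayscale colormap makes the digit dark on a light background; plain gray works too):
# the 64 pixel values reshape to an 8 x 8 image
plt.imshow(digits.data[0].reshape(8, 8), cmap=plt.cm.gray_r)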
# stack the samples grouped by digit class, together with the matching labels
X = np.vstack([digits.data[digits.target == i] for i in range(10)])
y = np.hstack([digits.target[digits.target == i] for i in range(10)])
Then initialize a t-SNE model. For the meaning of the parameters, see the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
tsne = TSNE(n_components=2, init='pca', random_state=0)
Fit the model on the data.
digits_proj = tsne.fit_transform(X)
Plot the results. Seaborn's hls palette provides evenly spaced colors in HLS color space; we divide it into 10 colors.
palette = np.array(sns.color_palette("hls", 10))
Make a scatter plot of the first component against the second component, with color based on the numbers.
plt.figure(figsize=(6, 6))
plt.scatter(digits_proj[:, 0], digits_proj[:, 1], c=palette[y])
We can add a text label for each cluster. The label can be placed at the center of the cluster, and we can use np.median to find the centers. To simplify things, let's wrap the code in a function.
def plot_scatter(projection):
    plt.figure(figsize=(6, 6))
    plt.scatter(projection[:, 0], projection[:, 1], c=palette[y])
    for i in range(10):
        # Position of each label: the median of the cluster.
        xtext, ytext = np.median(projection[y == i, :], axis=0)
        txt = plt.text(xtext, ytext, str(i), fontsize=24)
plot_scatter(digits_proj)
Comparison with Isomap and MDS¶
We talked about MDS and Isomap in class as two other manifold learning methods. Scikit-learn also has implementations of these two algorithms (MDS and Isomap), and the usage is very similar. Examples of using these methods can be found in the scikit-learn manifold learning documentation: https://scikit-learn.org/stable/modules/manifold.html
Q: Can you make another two plots with these two methods? You only need to change the models and call the plot_scatter function.
# ISOMAP
# YOUR SOLUTION HERE
(Isomap may emit a warning that the neighbors graph has more than one connected component; increasing the number of neighbors avoids this.)
# MDS
# YOUR SOLUTION HERE
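For reference, a minimal sketch (hyperparameters are left at their defaults, except a fixed random_state for MDS):
# Isomap projection of the digits
digits_isomap = Isomap(n_components=2).fit_transform(X)
plot_scatter(digits_isomap)

# MDS projection of the digits
digits_mds = MDS(n_components=2, random_state=0).fit_transform(X)
plot_scatter(digits_mds)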
(Optional) Feel free to try UMAP as well! It's a newer dimensionality reduction method that has become very popular.
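If you want to try it, here is a minimal sketch assuming the umap-learn package is installed (pip install umap-learn):
import umap

# UMAP projection of the digits, plotted with the same helper
digits_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
plot_scatter(digits_umap)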