Module 4: Perception¶
In this lab, we will learn basic usage of the pandas library and then perform a small experiment to test Stevens' power law.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Vega datasets¶
Before going into the perception experiment, let's first talk about some handy datasets that you can play with.
It's nice to have clean datasets handy for practicing data visualization, and there is a nice small package for exactly this: vega-datasets, from the Altair project.
You can install the package by running
$ pip install vega-datasets
or
$ pip3 install vega-datasets
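If you are working inside Jupyter, you can also install it from a notebook cell (assuming pip is available in the notebook's environment):
!pip install vega-datasets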
Once you install the package, you can import and see the list of datasets:
from vega_datasets import data
data.list_datasets()
['7zip', 'airports', 'annual-precip', 'anscombe', 'barley', 'birdstrikes', 'budget', 'budgets', 'burtin', 'cars', 'climate', 'co2-concentration', 'countries', 'crimea', 'disasters', 'driving', 'earthquakes', 'ffox', 'flare', 'flare-dependencies', 'flights-10k', 'flights-200k', 'flights-20k', 'flights-2k', 'flights-3m', 'flights-5k', 'flights-airport', 'gapminder', 'gapminder-health-income', 'gimp', 'github', 'graticule', 'income', 'iowa-electricity', 'iris', 'jobs', 'la-riots', 'londonBoroughs', 'londonCentroids', 'londonTubeLines', 'lookup_groups', 'lookup_people', 'miserables', 'monarchs', 'movies', 'normal-2d', 'obesity', 'ohlc', 'points', 'population', 'population_engineers_hurricanes', 'seattle-temps', 'seattle-weather', 'sf-temps', 'sp500', 'stocks', 'udistrict', 'unemployment', 'unemployment-across-industries', 'uniform-2d', 'us-10m', 'us-employment', 'us-state-capitals', 'volcano', 'weather', 'weball26', 'wheat', 'windvectors', 'world-110m', 'zipcodes']
Or you can work with only the smaller, local datasets:
from vega_datasets import local_data
local_data.list_datasets()
['airports', 'anscombe', 'barley', 'burtin', 'cars', 'crimea', 'driving', 'iowa-electricity', 'iris', 'la-riots', 'ohlc', 'seattle-temps', 'seattle-weather', 'sf-temps', 'stocks', 'us-employment', 'wheat']
Ah, we have the anscombe data here! Let's see the description of the dataset.
print(local_data.anscombe.description)
Anscombe's Quartet is a famous dataset constructed by Francis Anscombe [1]_. Common summary statistics are identical for each subset of the data, despite the subsets having vastly different characteristics.
Anscombe's quartet dataset¶
What does the actual data look like? Very conveniently, calling the dataset returns a Pandas dataframe.
df = local_data.anscombe()
df.head()
|   | Series | X | Y |
|---|---|---|---|
| 0 | I | 10 | 8.04 |
| 1 | I | 8 | 6.95 |
| 2 | I | 13 | 7.58 |
| 3 | I | 9 | 8.81 |
| 4 | I | 11 | 8.33 |
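Incidentally, you can check the description's claim yourself: grouping by Series and computing summary statistics shows that the subsets come out (nearly) identical. A quick sketch:
df.groupby('Series')[['X', 'Y']].agg(['mean', 'std'])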
Q1: Can you draw a scatterplot of the dataset "I"? You can filter the dataframe based on the Series column and use the plot function that you used for Snow's map. We have already learned how to filter a dataframe; if you're unsure, you can check the Getting started tutorials.
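As a hint, boolean indexing is one way to pull out a single series (a minimal sketch, not the full solution):
df_one = df[df['Series'] == 'I']  # keep only the rows belonging to series "I"
df_one.head()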
# YOUR SOLUTION HERE
Some histograms with pandas¶
Let's look at a slightly more complicated dataset.
car_df = local_data.cars()
car_df.head()
|   | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |
Pandas provides useful summary functions. For instance, describe() identifies the numerical data columns and gives you a table of their summary statistics.
car_df.describe()
|   | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year |
|---|---|---|---|---|---|---|---|
| count | 398.000000 | 406.000000 | 406.000000 | 400.000000 | 406.000000 | 406.000000 | 406 |
| mean | 23.514573 | 5.475369 | 194.779557 | 105.082500 | 2979.413793 | 15.519704 | 1975-12-30 13:49:57.044334976 |
| min | 9.000000 | 3.000000 | 68.000000 | 46.000000 | 1613.000000 | 8.000000 | 1970-01-01 00:00:00 |
| 25% | 17.500000 | 4.000000 | 105.000000 | 75.750000 | 2226.500000 | 13.700000 | 1973-01-01 00:00:00 |
| 50% | 23.000000 | 4.000000 | 151.000000 | 95.000000 | 2822.500000 | 15.500000 | 1976-01-01 00:00:00 |
| 75% | 29.000000 | 8.000000 | 302.000000 | 130.000000 | 3618.250000 | 17.175000 | 1979-01-01 00:00:00 |
| max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 1982-01-01 00:00:00 |
| std | 7.815984 | 1.712160 | 104.922458 | 38.768779 | 847.004328 | 2.803359 | NaN |
If you ask pandas to draw a histogram, you get one for every numerical column. :)
car_df.hist()
(Output: a 3×3 grid of small histograms titled Miles_per_Gallon, Cylinders, Displacement, Horsepower, Weight_in_lbs, Acceleration, and Year.)
Well, these are too small. You can check out the documentation and change the size of the figure.
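For instance, hist() accepts a figsize argument (width and height in inches) that is passed through to matplotlib; a minimal sketch:
car_df.hist(figsize=(12, 8))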
Q2: By consulting the documentation, can you make the figure larger so that we can see all the labels clearly? Then change the layout so that there are two rows and four columns, and change the number of bins to 20.
# YOUR SOLUTION HERE
Stevens' power law and your own psychophysics experiment!¶
Let's do an experiment! The procedure is as follows:
- Generate a random number in [1, 10];
- Use a horizontal bar to represent the number, i.e., the length of the bar is equal to the number;
- Guess the length of the bar by comparing it to two other bars with lengths 1 and 10, respectively;
- Store your guess (perceived length) and the actual length in two separate lists;
- Repeat the above steps many times;
- Check whether Stevens' power law holds.
First, let's define the lengths of a short and a long bar. We also create two empty lists to store the perceived and actual lengths.
import random
import time
import numpy as np
l_short_bar = 1
l_long_bar = 10
perceived_length_list = []
actual_length_list = []
Perception of length¶
Let's run the experiment.
The random module in Python provides various random number generators; the random.uniform(a, b) function returns a floating-point number in [a, b].
We can plot horizontal bars using the pyplot.barh() function, which produces a bar chart like this:
mystery_length = random.uniform(1, 10) # generate a number between 1 and 10. this is the *actual* length.
plt.barh(np.arange(3), [l_short_bar, mystery_length, l_long_bar], align='center')
plt.yticks(np.arange(3), ('1', '?', '10'))
plt.xticks([]) # no hint!
(Output: three horizontal bars labeled 1, ?, and 10, with no x-axis ticks.)
Btw, np.arange is used to create a simple integer array, [0, 1, 2].
np.arange(3)
array([0, 1, 2])
Now let's define a function to perform the experiment once. When you run this function, it picks a random number between 1.0 and 10.0 and shows the bar chart. It then asks you to input your estimate of the length of the middle bar, saving your estimate to perceived_length_list and the actual answer to actual_length_list.
Note: if the input box does not appear for you, try (1) switching to Firefox, OR (2) removing the input line from the code and manually recording the numbers in a cell, OR (3) adding a prompt to it, like input("enter estimation").
def run_exp_once():
    mystery_length = random.uniform(1, 10)  # generate a number between 1 and 10.
    plt.barh(np.arange(3), [l_short_bar, mystery_length, l_long_bar], height=0.5, align='center')
    plt.yticks(np.arange(3), ('1', '?', '10'))
    plt.xticks([])  # no hint!
    plt.show()
    perceived_length_list.append(float(input()))
    actual_length_list.append(mystery_length)
run_exp_once()
Now, run the experiment many times to gather your data. Check the two lists to make sure that you have the proper dataset. The length of the two lists should be the same.
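One simple way is a loop; each round draws a new bar and waits for your input (a sketch, assuming the input prompt works in your environment):
for i in range(10):  # repeat the experiment ten times
    run_exp_once()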
# YOUR SOLUTION HERE
plt.scatter(x=[1,5,10], y=[1,10, 5])
Q3: Now plot your result using the scatter() function. You should also use plt.title(), plt.xlabel(), and plt.ylabel() to label your axes and the plot itself.
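One possible labeling sketch, using the two lists gathered above (the exact title and label strings are up to you):
plt.scatter(actual_length_list, perceived_length_list)
plt.title('Perception of length')
plt.xlabel('Actual')
plt.ylabel('Perceived')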
# YOUR SOLUTION HERE
After plotting, let's fit the relation between actual and perceived lengths with a power function. We can do this easily using curve_fit(f, x, y) from SciPy, which fits the function f to the data $x$ and $y$. In our case, $f(x) = a x^b + c$. For instance, we can check that this works by creating a fake dataset that follows the exact form:
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.power(x, b) + c
x = np.arange(20) # [0,1,2,3, ..., 19]
y = np.power(x, 2) # [0,1,4,9, ... ]
popt, pcov = curve_fit(func, x, y)
print('{:.2f} x^{:.2f} + {:.2f}'.format(*popt))
1.00 x^2.00 + 0.00
Q4: Now fit your data! Do you see a roughly linear relationship between the actual and the perceived lengths? It's ok if you don't!
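A minimal fitting sketch, reusing func from the cell above (curve_fit converts plain Python lists to arrays for you):
popt, pcov = curve_fit(func, actual_length_list, perceived_length_list)
print('{:.2f} x^{:.2f} + {:.2f}'.format(*popt))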
# YOUR SOLUTION HERE
0.52 x^1.18 + 1.76
Perception of area¶
Similar to the above experiment, we now represent a random number as a circle whose area is equal to the number.
First, calculate the radius of a circle from its area (since $A = \pi r^2$, we have $r = \sqrt{A/\pi}$), and then plot it using the Circle() function. plt.Circle((0,0), r) draws a circle centered at (0,0) with radius r.
n1 = 0.005
n2 = 0.05
radius1 = np.sqrt(n1/np.pi) # area = pi * r * r
radius2 = np.sqrt(n2/np.pi)
random_radius = np.sqrt(n1*random.uniform(1,10)/np.pi)
plt.axis('equal')
plt.axis('off')
circ1 = plt.Circle( (0,0), radius1, clip_on=False )
circ2 = plt.Circle( (4*radius2,0), radius2, clip_on=False )
rand_circ = plt.Circle((2*radius2,0), random_radius, clip_on=False )
plt.gca().add_artist(circ1)
plt.gca().add_artist(circ2)
plt.gca().add_artist(rand_circ)
(Output: the two reference circles with areas 0.005 and 0.05, and the mystery circle between them.)
Let's have two lists for this experiment.
perceived_area_list = []
actual_area_list = []
And define a function for the experiment.
Note: if the input box does not appear for you, try (1) switching to Firefox, OR (2) removing the input line from the code and manually recording the numbers in a cell, OR (3) adding a prompt to it, like input("enter estimation").
def run_area_exp_once(n1=0.005, n2=0.05):
    radius1 = np.sqrt(n1/np.pi)  # area = pi * r * r
    radius2 = np.sqrt(n2/np.pi)
    mystery_number = random.uniform(1, 10)
    random_radius = np.sqrt(n1*mystery_number/np.pi)
    plt.axis('equal')
    plt.axis('off')
    circ1 = plt.Circle((0, 0), radius1, clip_on=False)
    circ2 = plt.Circle((4*radius2, 0), radius2, clip_on=False)
    rand_circ = plt.Circle((2*radius2, 0), random_radius, clip_on=False)
    plt.gca().add_artist(circ1)
    plt.gca().add_artist(circ2)
    plt.gca().add_artist(rand_circ)
    plt.show()
    perceived_area_list.append(float(input()))
    actual_area_list.append(mystery_number)
Q5: Now run the experiment many times, plot the result, and fit a power-law curve to test Stevens' power law!
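A sketch of one possible workflow, assuming the function and lists defined above:
# gather data: each call shows the circles and waits for your guess
for i in range(10):
    run_area_exp_once()

# plot perceived vs. actual area, then fit the same power-law form as before
plt.scatter(actual_area_list, perceived_area_list)
plt.xlabel('Actual')
plt.ylabel('Perceived')
plt.show()

popt, pcov = curve_fit(func, actual_area_list, perceived_area_list)
print('{:.2f} x^{:.2f} + {:.2f}'.format(*popt))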
# TODO: put your code here. You can use multiple cells.
# YOUR SOLUTION HERE