Module 6 assignment: Data types and tidy data¶

Tidy data¶

Let's take a look at this small dataset: https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv

In [2]:

Copied!

import pandas as pd
import pandas as pd

In [3]:

Copied!

pew_df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv')
pew_df
pew_df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv')
pew_df

Out[3]:

	religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k	Don't know/refused
0	Agnostic	27	34	60	81	76	137	122	109	84	96
1	Atheist	12	27	37	52	35	70	73	59	74	76
2	Buddhist	27	21	30	34	33	58	62	39	53	54
3	Catholic	418	617	732	670	638	1116	949	792	633	1489
4	Don’t know/refused	15	14	15	11	10	35	21	17	18	116
5	Evangelical Prot	575	869	1064	982	881	1486	949	723	414	1529
6	Hindu	1	9	7	9	11	34	47	48	54	37
7	Historically Black Prot	228	244	236	238	197	223	131	81	78	339
8	Jehovah's Witness	20	27	24	24	21	30	15	11	6	37
9	Jewish	19	19	25	25	30	95	69	87	151	162
10	Mainline Prot	289	495	619	655	651	1107	939	753	634	1328
11	Mormon	29	40	48	51	56	112	85	49	42	69
12	Muslim	6	7	9	10	9	23	16	8	6	22
13	Orthodox	13	17	23	32	32	47	38	42	46	73
14	Other Christian	9	7	11	13	13	14	18	14	12	18
15	Other Faiths	20	33	40	46	49	63	46	40	41	71
16	Other World Religions	5	2	3	4	2	7	3	4	4	8
17	Unaffiliated	217	299	374	365	341	528	407	321	258	597

This dataset is about the relationships between income and religion, assembled from a research by the Pew Research Center. Is this dataset tidy or not? Why?

Yes, many of the columns are values, not variable names. How should we fix it?

Pandas provides a convenient function called melt. You specify the id_vars that are variable columns, and value_vars that are value columns, and provide the name for the variable as well as the name for the values.

Q: so please go ahead and tidy it up! I'd suggest to use the variable name "income" and value name "frequency"

In [4]:

Copied!

# YOUR SOLUTION HERE
# YOUR SOLUTION HERE

If you were successful, you'll have something like this:

In [5]:

Copied!

pew_tidy_df
pew_tidy_df

Out[5]:

	religion	income	frequency
0	Agnostic	<$10k	27
1	Atheist	<$10k	12
2	Buddhist	<$10k	27
3	Catholic	<$10k	418
4	Don’t know/refused	<$10k	15
...	...	...	...
175	Orthodox	Don't know/refused	73
176	Other Christian	Don't know/refused	18
177	Other Faiths	Don't know/refused	71
178	Other World Religions	Don't know/refused	8
179	Unaffiliated	Don't know/refused	597

180 rows × 3 columns

This "tidy" form is the default form that you want to keep your datasets. It is easy to filter, manipulate, and plot. Many of the visualization libraries we'll use in this course expect the data to be in this form.

Data types¶

Let's talk about data types briefly. Understanding data types is not only important for choosing the right visualizations, but also important for efficient computing and storage of data. You may not have thought about how pandas represent data in memory. A Pandas Dataframe is essentially a bunch of Series, and those Series are essentially numpy arrays. An array may contain a fixed-length items such as integers or variable length items such as strings. Putting some efforts to think about the correct data type can potentially save a lot of memory as well as time.

A nice example would be the categorical data type. If you have a variable that only has several possible values, it's essentially a categorical data. Take a look at the income variable.

In [6]:

Copied!

pew_tidy_df.income.value_counts()
pew_tidy_df.income.value_counts()

Out[6]:

income
<$10k                 18
$10-20k               18
$20-30k               18
$30-40k               18
$40-50k               18
$50-75k               18
$75-100k              18
$100-150k             18
>150k                 18
Don't know/refused    18
Name: count, dtype: int64

These were the column names in the original non-tidy data. The value can take only one of these income ranges and thus it is a categorical data. What is the data type that pandas use to store this column?

In [7]:

Copied!

pew_tidy_df.income.dtype
pew_tidy_df.income.dtype

Out[7]:

dtype('O')

The O means that it is an Object data type, which does not have a fixed size like integer or float. The series contains a sort of pointer to the actual text objects. You can actually inspect the amount of memory used by the dataset.

In [8]:

Copied!

pew_tidy_df.memory_usage()
pew_tidy_df.memory_usage()

Out[8]:

Index         132
religion     1440
income       1440
frequency    1440
dtype: int64

In [9]:

Copied!

pew_tidy_df.memory_usage(deep=True)
pew_tidy_df.memory_usage(deep=True)

Out[9]:

Index          132
religion     11260
income       10260
frequency     1440
dtype: int64

What's going on with the deep=True option? When you don't specify deep=True, the memory usage method just tells you the amount of memory used by the numpy arrays in the pandas dataframe. When you pass deep=True, it tells you the total amount of memory by including the memory used by all the text objects. So, the religion and income columns occupies almost ten times of memory than the frequency column, which is simply an array of integers.

In [10]:

Copied!

pew_tidy_df.frequency.dtype
pew_tidy_df.frequency.dtype

Out[10]:

dtype('int64')

Is there any way to save up the memory? Note that there are only 10 categories in the income variable. That means we just need 10 numbers to represent the categories! Of course we need to store the names of each category, but that's just one-time cost. The simplest way to convert a column is using astype method.

In [11]:

Copied!

income_categorical_series = pew_tidy_df.income.astype('category')
# you can do pew_tidy_df.income = pew_tidy_df.income.astype('category')
income_categorical_series = pew_tidy_df.income.astype('category')
# you can do pew_tidy_df.income = pew_tidy_df.income.astype('category')

Now, this series has the CategoricalDtype dtype.

In [12]:

Copied!

income_categorical_series.dtype
income_categorical_series.dtype

Out[12]:

CategoricalDtype(categories=['$10-20k', '$100-150k', '$20-30k', '$30-40k', '$40-50k',
                  '$50-75k', '$75-100k', '<$10k', '>150k',
                  'Don't know/refused'],
, ordered=False, categories_dtype=object)

How much memory do we use?

In [13]:

Copied!

income_categorical_series.memory_usage(deep=True)
income_categorical_series.memory_usage(deep=True)

Out[13]:

Compared with the original column?

In [14]:

Copied!

pew_tidy_df.income.memory_usage(deep=True)
pew_tidy_df.income.memory_usage(deep=True)

Out[14]:

We have reduced the memory usage by almost 10 fold! Not only that, because now the values are just numbers, it will be much faster to match, filter, manipulate. If your dataset is huge, this can save up a lot of space and time.

If the categories have ordering, you can specify the ordering too.

In [15]:

Copied!





from pandas.api.types import CategoricalDtype
income_type = CategoricalDtype(categories=["Don't know/refused", '<$10k', '$10-20k', '$20-30k', '$30-40k',
                                           '$40-50k', '$50-75k', '$75-100k', '$100-150k', '>150k'], ordered=True)
income_type
from pandas.api.types import CategoricalDtype
income_type = CategoricalDtype(categories=["Don't know/refused", '<$10k', '$10-20k', '$20-30k', '$30-40k',
                                           '$40-50k', '$50-75k', '$75-100k', '$100-150k', '>150k'], ordered=True)
income_type

Out[15]:

CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',
                  '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',
                  '>150k'],
, ordered=True, categories_dtype=object)

In [16]:

Copied!

pew_tidy_df.income.astype(income_type).dtype
pew_tidy_df.income.astype(income_type).dtype

Out[16]:

CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',
                  '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',
                  '>150k'],
, ordered=True, categories_dtype=object)

This data type now allows you to compare and sort based on the ordering.

Q: ok, now convert both religion and income columns of pew_tidy_df as categorical dtype (in place) and show that pew_tidy_df now uses much less memory

In [17]:

Copied!

# YOUR SOLUTION HERE

pew_tidy_df.memory_usage(deep=True)
# YOUR SOLUTION HERE

pew_tidy_df.memory_usage(deep=True)

Out[17]:

Index         132
religion     1883
income       1050
frequency    1440
dtype: int64

Module 6 assignment: Data types and tidy data¶

Tidy data¶

Data types¶

If you want to know more¶