Module 10: Logscale¶
In this module, we will learn why log scale is useful for some types of data, and strategies for using it.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
import vega_datasets
Ratio and logarithm¶
If you use a linear scale to visualize ratios, it can be quite misleading. As discussed in class, ratio values larger than 1 can vary between 1 and infinity, while ratio values smaller than 1 can only vary between 0 and 1. For instance, the ratios 100:1 (100/1) and 1000:1 (1000/1) are represented as 100 and 1000; their distances from 1:1 (1) are 99 and 999, respectively. On the other hand, the ratios 1:100 (1/100) and 1:1000 (1/1000) are represented as 0.01 and 0.001; their distances from 1:1 (1) are only 0.99 and 0.999. In other words, on a linear scale there is no visual symmetry between reciprocal ratios!
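A log scale removes this asymmetry: because $\log(1/r) = -\log(r)$, reciprocal ratios sit at equal distances from $\log(1) = 0$. A quick sanity check:

print(np.log10(100/1), np.log10(1/100))  # 2.0 -2.0: symmetric around 0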
You can watch my video "Why you should use logarithmic scale when visualizing ratios".
To see this clearly, let's first create some ratios.
x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1 ])
ratio = x/y
print(ratio)
[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
Q: Plot the ratios on a linear scale using the scatter() function. Also draw a horizontal line at ratio=1 for reference. The x-axis is simply the data ID of each ratio data point; the y-axis shows the ratio values.
X = np.arange(len(ratio))
# YOUR SOLUTION HERE
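A minimal sketch of one possible solution (the styling choices are arbitrary):

plt.scatter(X, ratio)
plt.axhline(1, color='gray', linestyle='--')  # reference line at ratio = 1
plt.xlabel('Data ID')
plt.ylabel('Ratio')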
Text(0, 0.5, 'Ratio')
Q: Is this a good visualization of the ratio data? Why or why not? Explain.
YOUR SOLUTION HERE¶
Q: Can you fix it?
# YOUR SOLUTION HERE
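One way to fix it is to put the y-axis on a log scale, which makes reciprocal ratios symmetric around the reference line, e.g.:

plt.scatter(X, ratio)
plt.axhline(1, color='gray', linestyle='--')
plt.yscale('log')  # reciprocal ratios now sit symmetrically around 1
plt.xlabel('Data ID')
plt.ylabel('Ratio')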
Log-binning¶
One way to draw a histogram of broadly distributed data on a log scale is to use log-binning.
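The idea is to place the bin edges evenly in log space rather than linear space. For example, np.logspace generates points that are evenly spaced on a log scale:

np.logspace(0, 3, num=4)  # array([1., 10., 100., 1000.]): edges at 10^0, 10^1, 10^2, 10^3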
Let's first see what happens if we do not use the log scale for a dataset with a heavy tail.
Q: Load the movie dataset from vega_datasets and remove the NaN rows based on the following columns: IMDB_Rating, IMDB_Votes, Worldwide_Gross, Rotten_Tomatoes_Rating.
# YOUR SOLUTION HERE
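A minimal sketch of one way to do this, using the vega_datasets import from the top of the module:

movies = vega_datasets.data.movies()
movies = movies.dropna(subset=["IMDB_Rating", "IMDB_Votes",
                               "Worldwide_Gross", "Rotten_Tomatoes_Rating"])
movies.head()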
| | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
| 8 | Pirates | 1641825.0 | 6341825.0 | NaN | 40000000.0 | Jul 01 1986 | R | NaN | None | None | None | None | Roman Polanski | 25.0 | 5.8 | 3275.0 |
| 9 | Duel in the Sun | 20400000.0 | 20400000.0 | NaN | 6000000.0 | Dec 31 2046 | None | NaN | None | None | None | None | None | 86.0 | 7.0 | 2906.0 |
| 10 | Tom Jones | 37600000.0 | 37600000.0 | NaN | 1000000.0 | Oct 07 1963 | None | NaN | None | None | None | None | None | 81.0 | 7.0 | 4035.0 |
| 11 | Oliver! | 37402877.0 | 37402877.0 | NaN | 10000000.0 | Dec 11 1968 | None | NaN | Sony Pictures | None | Musical | None | None | 84.0 | 7.5 | 9111.0 |
If you simply call the hist() method on a dataframe object, it identifies all the numeric columns and draws a histogram for each.
Q: Draw all possible histograms of the movie dataframe. Adjust the size of the plots if needed.
# YOUR SOLUTION HERE
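A sketch (the figure size is just a suggestion):

_ = movies.hist(figsize=(16, 12))  # one histogram per numeric column
plt.tight_layout()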
As we can see, the majority of the columns are not normally distributed. In particular, if you look at the worldwide gross variable, only a couple of bins carry any visible counts in the histogram. Is this a problem of resolution? How about increasing the number of bins?
ax = movies["Worldwide_Gross"].hist(bins=200)
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')
Maybe a bit more useful, but it still doesn't tell us anything about the data distribution above a certain point. How about changing the vertical scale to a logarithmic scale?
# YOUR SOLUTION HERE
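A sketch: keep the 200 linear bins but switch the counts axis to log scale with set_yscale:

ax = movies["Worldwide_Gross"].hist(bins=200)
ax.set_yscale('log')  # log-scale counts reveal the sparse tail
ax.set_xlabel("Worldwide gross")
ax.set_ylabel("Frequency")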
Text(0, 0.5, 'Frequency')
Now, let's try log-binning. Recall that when plotting histograms we can specify the edges of the bins through the bins parameter. For example, we can specify the edges as [0, 1, 2, ..., 10] as follows.
movies["IMDB_Rating"].hist(bins=range(0,11))
<Axes: >
Here, we can specify the edges of the bins in a similar way. Instead of specifying them on a linear scale, we do it in log space.
Hint: since $10^{\text{start}}= \text{min(Worldwide\_Gross)}$, $\text{start} = \log_{10}(\text{min(Worldwide\_Gross)})$
min(movies["Worldwide_Gross"])
0.0
Because there are movie(s) that made $0, and because log(0) is undefined while log(1) = 0, let's add 1 to the variable.
movies["Worldwide_Gross"] = movies["Worldwide_Gross"]+1.0
# TODO: specify the edges of bins using np.logspace
# YOUR SOLUTION HERE
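One way to produce edges like the ones shown below (20 edges is a choice, not a requirement; since the minimum is now 1, the starting exponent is 0):

bins = np.logspace(np.log10(movies["Worldwide_Gross"].min()),
                   np.log10(movies["Worldwide_Gross"].max()), num=20)
bins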
array([1.00000000e+00, 3.14018485e+00, 9.86076088e+00, 3.09646119e+01, 9.72346052e+01, 3.05334634e+02, 9.58807191e+02, 3.01083182e+03, 9.45456845e+03, 2.96890926e+04, 9.32292387e+04, 2.92757043e+05, 9.19311230e+05, 2.88680720e+06, 9.06510822e+06, 2.84661155e+07, 8.93888645e+07, 2.80697558e+08, 8.81442219e+08, 2.76789150e+09])
Now we can plot a histogram with log-bins. Set both axes to log scale.
ax = (movies["Worldwide_Gross"]+1.0).hist(bins=bins)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')
What is going on? Is this the right plot?
Q: Explain and fix.
# YOUR SOLUTION HERE
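One possible explanation and fix (a sketch): with log-bins the bin widths are wildly unequal, so raw counts grow with bin width and distort the shape of the distribution. Normalizing by bin width, e.g. via density=True, gives a proper probability density:

ax = movies["Worldwide_Gross"].hist(bins=bins, density=True)  # normalize counts by bin width
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel("Worldwide gross")
ax.set_ylabel("Probability density")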
Text(0, 0.5, 'Probability density')
CCDF¶
The cumulative distribution function $F_X(x)$ at $x$ is defined by
$$F_X(x) = P(X \le x),$$
which is, in other words, the probability that $X$ takes a value less than or equal to $x$. When empirically calculated (empirical CDF), $F_X(x)$ is the fraction of data points that are less than or equal to $x$. CDF allows us to examine any percentile of the data distribution and is also useful for comparing distributions.
However, when the data spans multiple orders of magnitude, CDF may not be useful. Let's try.
gross_sorted = movies.Worldwide_Gross.sort_values()
N = len(gross_sorted)
Y = np.linspace(1/N, 1, num=N)
plt.xlabel("World wide gross")
plt.ylabel("Empirical CDF")
_ = plt.plot(gross_sorted, Y)
Although the most interesting movies are those with a large worldwide gross, we can't see any detail about their distribution because their CDF values are all close to 1. In other words, the CDF is poor at revealing the details of the tail.
The CCDF is a nice alternative for examining distributions with heavy tails. The idea is the same as the CDF, but the direction of aggregation is reversed. Because we accumulate from the largest value, it can reveal the details of those large values (the tail).
CCDF is defined as follows:
$$ \bar{F}_X(x) = P(X > x)$$
And thus,
$$ \bar{F}_X(x) = P(X > x) = 1 - F_X(x)$$
In other words, we can use CDF to calculate CCDF.
Q: Draw the CCDF using the CDF code above.
# YOUR SOLUTION HERE
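A sketch that reuses gross_sorted and Y from the CDF cell above; the CCDF values are simply 1 - Y:

plt.xlabel("Worldwide gross")
plt.ylabel("Empirical CCDF")
_ = plt.plot(gross_sorted, 1 - Y)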
How about making the y-axis log scale?
# YOUR SOLUTION HERE
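The same plot with a logarithmic y-axis, e.g.:

plt.yscale('log')
plt.xlabel("Worldwide gross")
plt.ylabel("Empirical CCDF")
_ = plt.plot(gross_sorted, 1 - Y)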
Although this is technically the correct CCDF plot, there is a very subtle issue. Do you see the vertical line at the rightmost side of the CCDF plot? To understand what's going on, let's look at the Y values of this plot. We used 1 - CDF to calculate CCDF. So,
1 - Y
array([9.99556738e-01, 9.99113475e-01, 9.98670213e-01, ..., 8.86524823e-04, 4.43262411e-04, 0.00000000e+00])
What happens when we take the log of these values?
np.log(1-Y)
/var/folders/d0/wgh1l_5905x4crqpp1b7whz40000gn/T/ipykernel_89007/1767632406.py:1: RuntimeWarning: divide by zero encountered in log np.log(1-Y)
array([-4.43360681e-04, -8.86918018e-04, -1.33067219e-03, ..., -7.02820143e+00, -7.72134861e+00, -inf])
Because the last value of 1 - Y is 0.0, we got -inf as its log. That means the coordinate of the largest value (say $x$) in our CCDF plot will be $(x, -\infty)$ if we use a log scale for the y-axis, and thus we will not be able to see it in the plot. This happens because we are drawing the CDF in a simplified way: in reality, the ECDF and ECCDF are step functions, and this wouldn't matter. But because we draw a line between the points, we run into this issue.
This is somewhat problematic because the largest value in our dataset can be quite important and therefore we want to see it in the plot!
This is why, in practice, we sometimes use an "incorrect" version of the CCDF. We can consider $\bar{F}_X(x)$ as a "flipped" version of the CDF:
$$ \bar{F}_X(x) = P(X \ge x) $$
instead of
$$ \bar{F}_X(x) = P(X > x) $$
In doing so, we can see the largest value in the data in our CCDF plot. We can also draw the correct version of CCDF, but this quick-and-dirty version is often easier and good enough to show what we want to show.
A simple way is just to define the y coordinates as follows:
Y = np.linspace(1.0, 1/N, num=N)
Q: Draw a CCDF of the worldwide gross data. Use log scale for the y-axis.
# YOUR SOLUTION HERE
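A sketch using the flipped y coordinates defined above:

Y = np.linspace(1.0, 1/N, num=N)  # "flipped": starts at 1.0, ends at 1/N
plt.yscale('log')
plt.xlabel("Worldwide gross")
plt.ylabel("Empirical CCDF")
_ = plt.plot(gross_sorted, Y)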
A straight line on a semilog scale indicates exponential decay (cf. a straight line on a log-log scale indicates power-law decay). So it seems that the amount of money a movie makes worldwide roughly follows an exponential distribution, while a few outliers make an insane amount of money.
Q: Which is the most successful movie in our dataset?
You can use the following cell.
# YOUR SOLUTION HERE
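One possible approach (a sketch): idxmax() returns the index label of the largest value, which loc can then retrieve.

movies.loc[movies["Worldwide_Gross"].idxmax()]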
Title                              Avatar
US_Gross                      760167650.0
Worldwide_Gross              2767891500.0
US_DVD_Sales                  146153933.0
Production_Budget             237000000.0
Release_Date                  Dec 18 2009
MPAA_Rating                         PG-13
Running_Time_min                      NaN
Distributor              20th Century Fox
Source                Original Screenplay
Major_Genre                        Action
Creative_Type             Science Fiction
Director                    James Cameron
Rotten_Tomatoes_Rating               83.0
IMDB_Rating                           8.3
IMDB_Votes                       261439.0
Name: 1234, dtype: object