Module 10: Logscale¶
In this module, we will learn why log scale is useful for some types of data, and strategies for using it.
import matplotlib.pyplot as plt
import numpy as np
import vega_datasets
Ratio and logarithm¶
If you use a linear scale to visualize ratios, it can be quite misleading. As discussed in class, ratio values larger than 1 can vary between 1 and infinity, while ratio values smaller than 1 are confined between 0 and 1. For instance, the ratios 100:1 (100/1) and 1000:1 (1000/1) are represented as 100 and 1000, whose distances from 1:1 (1) are 99 and 999, respectively. On the other hand, the ratios 1:100 (1/100) and 1:1000 (1/1000) are represented as 0.01 and 0.001, whose distances from 1:1 (1) are only 0.99 and 0.999, respectively. In other words, there is no symmetry between symmetric ratios!
You can watch my video Why you should use logarithmic scale when visualizing ratios.
To see this clearly, let's first create some ratios.
x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1 ])
ratio = x/y
print(ratio)
[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
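Note that taking the logarithm restores the symmetry: reciprocal ratios map to values equidistant from zero. A quick check on the ratio array above:
# reciprocal ratios become symmetric around 0 on the log scale
np.log10(ratio)
# -> array([-3., -2., -1.,  0.,  1.,  2.,  3.])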
Q: Plot the ratios on a linear scale using the scatter() function. Also draw a horizontal line at ratio=1 for reference. The x-axis will simply be the data ID that refers to each ratio data point; the y-axis will be the ratio values.
X = np.arange(len(ratio))
# YOUR SOLUTION HERE
Text(0, 0.5, 'Ratio')
Q: Is this a good visualization of the ratio data? Why? Why not? Explain.
YOUR SOLUTION HERE¶
Q: Can you fix it?
# YOUR SOLUTION HERE
Log-binning¶
One way to draw a histogram of broadly distributed data in log scale is to use log-binning.
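Conceptually, log-binning just means choosing bin edges that are evenly spaced in log space, so each bin is a constant factor wider than the previous one. A minimal illustration with np.logspace (the function we will use below):
# log-spaced bin edges: each edge is 10x the previous one
np.logspace(0, 3, num=4)
# -> array([   1.,   10.,  100., 1000.])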
Let's first see what happens if we do not use the log scale for a dataset with a heavy tail.
Q: Load the movie dataset from vega_datasets and remove the NaN rows based on the following columns: IMDB_Rating, IMDB_Votes, Worldwide_Gross, Rotten_Tomatoes_Rating.
# YOUR SOLUTION HERE
 | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
8 | Pirates | 1641825.0 | 6341825.0 | NaN | 40000000.0 | Jul 01 1986 | R | NaN | None | None | None | None | Roman Polanski | 25.0 | 5.8 | 3275.0 |
9 | Duel in the Sun | 20400000.0 | 20400000.0 | NaN | 6000000.0 | Dec 31 2046 | None | NaN | None | None | None | None | None | 86.0 | 7.0 | 2906.0 |
10 | Tom Jones | 37600000.0 | 37600000.0 | NaN | 1000000.0 | Oct 07 1963 | None | NaN | None | None | None | None | None | 81.0 | 7.0 | 4035.0 |
11 | Oliver! | 37402877.0 | 37402877.0 | NaN | 10000000.0 | Dec 11 1968 | None | NaN | Sony Pictures | None | Musical | None | None | 84.0 | 7.5 | 9111.0 |
If you simply call the hist() method on a dataframe object, it identifies all the numeric columns and draws a histogram for each.
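For instance, on a toy dataframe of random numbers (a minimal sketch with made-up column names, so as not to give away the exercise below):
import pandas as pd

# hist() on a dataframe draws one histogram per numeric column
toy = pd.DataFrame(np.random.randn(500, 3), columns=["a", "b", "c"])
toy.hist(figsize=(9, 3))
plt.tight_layout()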
Q: draw all possible histograms of the movie dataframe. Adjust the size of the plots if needed.
# YOUR SOLUTION HERE
As we can see, a majority of the columns are not normally distributed. In particular, if you look at the worldwide gross variable, you can read only a couple of meaningful values from the histogram. Is this a problem of resolution? How about increasing the number of bins?
ax = movies["Worldwide_Gross"].hist(bins=200)
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')
Maybe a bit more useful, but it doesn't tell us anything about the data distribution above a certain point. How about changing the vertical scale to a logarithmic scale?
# YOUR SOLUTION HERE
Text(0, 0.5, 'Frequency')
Now, let's try log-binning. Recall that when plotting histograms we can specify the edges of the bins through the bins parameter. For example, we can specify the edges of the bins as [1, 2, 3, ..., 10] as follows.
movies["IMDB_Rating"].hist(bins=range(0,11))
<Axes: >
Here, we can specify the edges of the bins in a similar way. Instead of spacing them evenly on the linear scale, we space them evenly in log space.
Hint: since $10^{\text{start}}= \text{min(Worldwide\_Gross)}$, $\text{start} = \log_{10}(\text{min(Worldwide\_Gross)})$
min(movies["Worldwide_Gross"])
0.0
Because there seem to be movie(s) that made $0, and because log(0) is undefined and log(1) = 0, let's add 1 to the variable.
movies["Worldwide_Gross"] = movies["Worldwide_Gross"]+1.0
# TODO: specify the edges of bins using np.logspace
# YOUR SOLUTION HERE
array([1.00000000e+00, 3.14018485e+00, 9.86076088e+00, 3.09646119e+01, 9.72346052e+01, 3.05334634e+02, 9.58807191e+02, 3.01083182e+03, 9.45456845e+03, 2.96890926e+04, 9.32292387e+04, 2.92757043e+05, 9.19311230e+05, 2.88680720e+06, 9.06510822e+06, 2.84661155e+07, 8.93888645e+07, 2.80697558e+08, 8.81442219e+08, 2.76789150e+09])
Now we can plot a histogram with log-bins. Set both axes to log scale.
ax = (movies["Worldwide_Gross"]+1.0).hist(bins=bins)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')
What is going on? Is this the right plot?
Q: explain and fix
# YOUR SOLUTION HERE
Text(0, 0.5, 'Probability density')
Interactive exploration of log-binning¶
Let's put the linear and log-binned histogram side by side to explore how the number of bins affects the log-binned histogram visualization. The slider below allows you to experiment with different numbers of bins to see how it impacts the visualization of the heavy-tailed distribution.
from ipywidgets import interact
import matplotlib.pyplot as plt
import numpy as np
@interact(n_bins=(5, 100, 5))
def plot_log_histogram(n_bins):
    """
    Interactive visualization to explore how bin number affects log-binned histograms.

    Parameters:
        n_bins: Number of bins for the histogram
    """
    # Create linear-spaced bins
    linear_bins = np.linspace(movies["Worldwide_Gross"].min(), movies["Worldwide_Gross"].max(), n_bins)

    # Create log-spaced bins
    log_bins = np.logspace(np.log10(movies["Worldwide_Gross"].min()),
                           np.log10(movies["Worldwide_Gross"].max()),
                           n_bins)

    # Create figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Left plot: Linear scale
    ax1.hist(movies["Worldwide_Gross"], bins=linear_bins, density=True, alpha=0.7, color='blue', edgecolor='black')
    ax1.set_xlabel("Worldwide Gross Revenue ($)")
    ax1.set_ylabel("Probability Density")
    ax1.set_title(f"Linear Scale (n_bins={n_bins})")
    ax1.grid(True, alpha=0.3)

    # Right plot: Log-log scale
    ax2.hist(movies["Worldwide_Gross"], bins=log_bins, density=True, alpha=0.7, color='green', edgecolor='black')
    ax2.set_xscale('log')
    ax2.set_yscale('log')
    ax2.set_xlabel("Worldwide Gross Revenue ($, log scale)")
    ax2.set_ylabel("Probability Density (log scale)")
    ax2.set_title(f"Log-Log Scale (n_bins={n_bins})")
    ax2.grid(True, which="both", ls="-", alpha=0.2)

    plt.tight_layout()
    plt.show()
interactive(children=(IntSlider(value=50, description='n_bins', min=5, step=5), Output()), _dom_classes=('widg…
CCDF¶
The cumulative distribution function $F_X(x)$ at $x$ is defined by
$$F_X(x) = P(X \le x),$$
which is, in other words, the probability that $X$ takes a value less than or equal to $x$. When calculated empirically (the empirical CDF, or ECDF), $F_X(x)$ is the fraction of data points that are less than or equal to $x$. The CDF allows us to examine any percentile of the data distribution and is also useful for comparing distributions.
However, when the data spans multiple orders of magnitude, the CDF may not be very useful. Let's try.
gross_sorted = movies.Worldwide_Gross.sort_values()
N = len(gross_sorted)
Y = np.linspace(1/N, 1, num=N)
plt.xlabel("World wide gross")
plt.ylabel("Empirical CDF")
_ = plt.plot(gross_sorted, Y)
Although the interesting movies are those with a large worldwide gross, we don't see any details about their distribution because their CDF values are all close to 1. In other words, the CDF sucks at revealing the details of the tail.
The CCDF is a nice alternative for examining distributions with heavy tails. The idea is the same as the CDF, but the direction of aggregation is reversed. Because we aggregate starting from the largest value, it can reveal the details of those large values (the tail).
CCDF is defined as follows:
$$ \bar{F}_X(x) = P(X > x)$$
And thus,
$$ \bar{F}_X(x) = P(X > x) = 1 - F_X(x)$$
In other words, we can use CDF to calculate CCDF.
Q: draw CCDF using the CDF code above.
# YOUR SOLUTION HERE
How about making the y axis in log scale?
# YOUR SOLUTION HERE
Although this is technically the correct CCDF plot, there is a very subtle issue. Do you see the vertical line at the rightmost side of the CCDF plot? To understand what's going on, let's look at the Y values of this plot. We used 1 - CDF to calculate CCDF. So,
1 - Y
array([9.99556738e-01, 9.99113475e-01, 9.98670213e-01, ..., 8.86524823e-04, 4.43262411e-04, 0.00000000e+00], shape=(2256,))
What happens when we take the log of these values?
np.log(1-Y)
/var/folders/d0/wgh1l_5905x4crqpp1b7whz40000gn/T/ipykernel_8464/1767632406.py:1: RuntimeWarning: divide by zero encountered in log
  np.log(1-Y)
array([-4.43360681e-04, -8.86918018e-04, -1.33067219e-03, ..., -7.02820143e+00, -7.72134861e+00, -inf], shape=(2256,))
Because the last value of 1 - Y is 0.0, we got -inf as the log value. That means the largest value's (say $x$) coordinate in our CCDF plot will be $(x, -\infty)$ if we use a log scale for the y-axis, and thus we will not be able to see it in the plot. This occurs because we are drawing the CDF in a simplified way. In reality, the ECDF and ECCDF are step functions, so this shouldn't matter; however, because we are drawing a line between the points, we run into this issue.
This is somewhat problematic because the largest value in our dataset can be quite important and therefore we want to see it in the plot!
This is why, in practice, we sometimes use an "incorrect" version of the CCDF. We can consider $\bar{F}_X(x)$ as a "flipped" version of the CDF:
$$ \bar{F}_X(x) = P(X \ge x) $$
instead of
$$ \bar{F}_X(x) = P(X > x) $$
In doing so, we can see the largest value in the data in our CCDF plot. We can also draw the correct version of CCDF, but this quick-and-dirty version is often easier and good enough to show what we want to show.
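For reference, here is a minimal sketch of the strictly correct step-function ECCDF, reusing gross_sorted and Y from the CDF code above. Matplotlib simply clips the final zero on the log-scaled axis, so the largest movie remains visible at the end of the last step:
# Step-function ECCDF: P(X > x) is constant between data points and
# drops at each one; the final drop to 0 is clipped by the log y-axis
plt.step(gross_sorted, 1 - Y, where='post')
plt.yscale('log')
plt.xlabel("World wide gross")
plt.ylabel("Empirical CCDF")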
For the quick-and-dirty version, a simple way is just to define the y coordinates as follows:
Y = np.linspace( 1.0, 1/N, num=N)
Q: Draw a CCDF of worldwide gross data. Use log scale for y-axis.
# YOUR SOLUTION HERE
A straight line on a semilog-y plot means exponential decay (cf. a straight line on a log-log plot means power-law decay). So it seems that the amount of money a movie makes across the world roughly follows an exponential distribution, while there are a few outliers that make an insane amount of money.
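To see why this reading works, here is a small synthetic sketch (illustrative curves, not the movie data): an exponential curve appears straight on semilog-y axes, while a power law appears straight on log-log axes.
xs = np.linspace(1, 10, 100)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Exponential decay is a straight line on a semilog-y plot
ax1.plot(xs, np.exp(-xs))
ax1.set_yscale('log')
ax1.set_title("Exponential on semilog-y")

# Power-law decay is a straight line on a log-log plot
ax2.plot(xs, xs**-2.0)
ax2.set_xscale('log')
ax2.set_yscale('log')
ax2.set_title("Power law on log-log")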
Q: Which is the most successful movie in our dataset?
# YOUR SOLUTION HERE
Title                              Avatar
US_Gross                      760167650.0
Worldwide_Gross              2767891500.0
US_DVD_Sales                  146153933.0
Production_Budget             237000000.0
Release_Date                  Dec 18 2009
MPAA_Rating                         PG-13
Running_Time_min                      NaN
Distributor              20th Century Fox
Source                Original Screenplay
Major_Genre                        Action
Creative_Type             Science Fiction
Director                    James Cameron
Rotten_Tomatoes_Rating               83.0
IMDB_Rating                           8.3
IMDB_Votes                       261439.0
Name: 1234, dtype: object
Practical Summary: When to Use Each Technique¶
Ok! After working through this lab, you should now understand when and how to use logarithmic scales and related techniques. Here's a practical guide for real-world applications:
When to Use Log Scale¶
Use log scale when:
- Comparing ratios: Ratios need symmetric representation (e.g., 10:1 vs 1:10)
- Data spans multiple orders of magnitude: Values range from 1 to 1,000,000+
- Multiplicative relationships: Growth rates, compound interest, population growth
- Visualizing percentage changes: Stock prices, economic indicators
- Heavy-tailed distributions: Income, city sizes, word frequencies
Common applications:
- Financial data (stock prices, market caps)
- Scientific measurements (earthquake magnitudes, sound intensity)
- Web analytics (page views, user engagement)
- Biological data (population sizes, gene expression)
When to Use Log-binning¶
Use log-binning when:
- Creating histograms for heavy-tailed distributions
- Linear bins would result in most data falling into the first few bins
- You need to see the structure across the entire range of values
- The data follows a power law, log-normal, or some heavy-tailed distribution
Key considerations:
- Normalize by bin width (use density=True). Otherwise, you may get very misleading results.
- Experiment with different numbers of bins to find the right balance
- Try both linear and log-scale views for comparison
When to Use CCDF (Complementary CDF)¶
Use CCDF when:
- Analyzing tail behavior of distributions
- Comparing multiple heavy-tailed distributions
- Identifying outliers or extreme events
Advantages over regular/log-binned histograms:
- No binning artifacts
- Full use of the data
- Easier to compare multiple distributions
- Clear visualization of rare events
Quick Decision Guide¶
- "My data values vary across many orders of magnitude, like from 10 to 10,000,000" → try log-scale and log-binning
- "I'm comparing growth rates or ratios" → try log-scale
- "My histogram has one tall bar and everything else is tiny" → try log-binning
Common Pitfalls to Avoid¶
- Pay attention to zeros and negative values in your data (log is undefined for them; see the sketch at the end of this module)
- Don't forget to normalize histograms when your bins are not uniform (e.g., log-binning)
- Always label axes clearly to indicate log scale usage and binning
- Consider your audience - add explanatory notes if they may not be familiar with log scales and binning
Python Quick Reference¶
# Log scale for current plot
plt.yscale('log') # using matplotlib.pyplot
ax.set_yscale('log') # using ax object
# Log-binning
bins = np.logspace(np.log10(data.min()), np.log10(data.max()), n_bins)
plt.hist(data, bins=bins, density=True) # density=True is important!
# Quick and dirty CCDF
data_sorted = np.sort(data)
ccdf = np.linspace(1, 1/len(data), len(data))
plt.loglog(data_sorted, ccdf) # for log-log scale
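Finally, one sketch for the first pitfall above (a hedged workaround, not covered in the lab itself): when data contain zeros or negative values, either shift the data before log-scaling, as we did with Worldwide_Gross, or use matplotlib's 'symlog' scale, which is linear near zero and logarithmic beyond a threshold.
# Handling zeros / negatives (hypothetical `data` array)
data_shifted = data - data.min() + 1   # shift so the minimum is 1
plt.yscale('symlog', linthresh=1)      # linear within [-1, 1], log outside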