QQ-Plots#

Michael J. Pyrcz, Professor, The University of Texas at Austin

Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn

Chapter of e-book “Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy”.

Cite as: Pyrcz, M.J., 2024, Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy, https://geostatsguy.github.io/GeostatsPyDemos_Book.

By Michael J. Pyrcz
© Copyright 2024.

This chapter is a tutorial for / demonstration of QQ-Plots and PP-Plots.

YouTube Lecture: check out my lecture on Q-Q and P-P Plots. For your convenience here’s a summary of salient points.

QQ-Plot#

The scatter plot of matching percentiles between two distributions. Why learn about QQ-Plots?

  • convenient plot to compare distributions for 2 features

  • a function fit to a QQ-plot is the distribution transform, forward,

\[ y = F_y^{-1}\left( F_x(x) \right) \]

and reverse,

\[ x = F_x^{-1}\left( F_y(y) \right) \]
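As a minimal sketch of the forward and reverse transforms above, the matched percentiles of a QQ-plot can be used as a lookup table with linear interpolation. The samples and the names `x_sample`, `y_sample`, `forward` and `reverse` are assumptions for illustration and are not part of GeostatsPy.

```python
import numpy as np

rng = np.random.default_rng(42)                          # assumed seed for repeatability
x_sample = rng.normal(loc=0.35, scale=0.06, size=1000)   # illustrative feature x
y_sample = rng.normal(loc=0.30, scale=0.05, size=1000)   # illustrative feature y

probs = np.linspace(1, 99, 99)                           # percentiles to match
x_q = np.percentile(x_sample, probs)                     # F_x^{-1}(p) from the data
y_q = np.percentile(y_sample, probs)                     # F_y^{-1}(p) from the data

def forward(x):
    """Approximate y = F_y^{-1}(F_x(x)) by interpolating the QQ-plot points."""
    return np.interp(x, x_q, y_q)

def reverse(y):
    """Approximate x = F_x^{-1}(F_y(y)) by interpolating the QQ-plot points."""
    return np.interp(y, y_q, x_q)

x0 = 0.40
y0 = forward(x0)
print(x0, y0, reverse(y0))                               # the round trip should recover x0
```

Note that `np.interp` holds values constant outside the range of the matched percentiles, so this sketch should only be trusted between the 1st and 99th percentiles.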

To visualize the calculation of a QQ-plot see this illustration of the calculation of a single point for the 80th percentile.

Illustration of the calculation of a single point (80th percentile) for a QQ-plot.
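The single-point calculation can also be sketched in a few lines. The two samples below are synthetic stand-ins (an assumption, not the data behind the illustration); the QQ-plot point is the pair of 80th percentiles.

```python
import numpy as np

rng = np.random.default_rng(42)                 # assumed seed for repeatability
x1 = rng.normal(loc=0.35, scale=0.06, size=100) # illustrative sample for the x-axis
x2 = rng.normal(loc=0.30, scale=0.05, size=50)  # illustrative sample for the y-axis

p = 80                                          # percentile to match
point = (np.percentile(x1, p), np.percentile(x2, p))
print(point)                                    # (x, y) location of one QQ-plot point
```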

QQ-Plot Interpretation#

If the two distributions are the same, then all the percentiles will be equal and the points will all fall on the 45 degree line. Here’s an example with very similar distributions.

QQ-plot for two very similar distributions.

If the means of the two distributions are different, then the points will be shifted from the 45 degree line.

  • down and right from the 45 degree line if the distribution on the x-axis has a larger mean than the distribution on the y-axis

  • up and left from the 45 degree line if the distribution on the x-axis has a smaller mean than the distribution on the y-axis

QQ-plot with mean of \(X_1\) greater than mean of \(X_2\).

If the variances (or standard deviations) of the two distributions are different, then the points will appear to be stretched out along the axis for the distribution with greater variance.

  • difference in variance will appear like a “rotation” from the 45 degree line

QQ-plot with variance of \(X_1\) greater than variance of \(X_2\).
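A quick numerical check of the “rotation”, as a sketch that assumes both distributions are exactly Gaussian and uses theoretical quantiles from `scipy.stats.norm.ppf` rather than samples. The parameter values are illustrative only.

```python
import numpy as np
from scipy import stats

mean1, stdev1 = 0.30, 0.06                      # X1 on the x-axis, larger variance
mean2, stdev2 = 0.30, 0.03                      # X2 on the y-axis, same mean

probs = np.linspace(0.01, 0.99, 99)             # cumulative probabilities to match
q1 = stats.norm.ppf(probs, loc=mean1, scale=stdev1)
q2 = stats.norm.ppf(probs, loc=mean2, scale=stdev2)

slope, intercept = np.polyfit(q1, q2, 1)        # straight line through the QQ-plot points
print(slope, stdev2 / stdev1)                   # slope below 1: stretched along the x-axis
```

With the larger variance on the x-axis the fitted slope is less than 1, which is the stretching along the x-axis described above.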

Of course, both the mean and variance can be different.

QQ-plot with mean of \(X_1\) greater than mean of \(X_2\), and variance of \(X_1\) greater than variance of \(X_2\).
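For intuition about the shift and the “rotation”, consider the special case where both distributions are Gaussian (an assumption made here only for illustration), with \(X_1 \sim N(\mu_1, \sigma_1^2)\) on the x-axis and \(X_2 \sim N(\mu_2, \sigma_2^2)\) on the y-axis. With \(G^{-1}(p)\) the standard normal quantile at cumulative probability \(p\), the matched percentiles are,

\[ x = \mu_1 + \sigma_1 G^{-1}(p), \quad y = \mu_2 + \sigma_2 G^{-1}(p) \]

and eliminating \(G^{-1}(p)\) gives a straight line,

\[ y = \mu_2 + \frac{\sigma_2}{\sigma_1} \left( x - \mu_1 \right) \]

Under this assumption the QQ-plot points pass through \((\mu_1, \mu_2)\), so a difference in means shifts the points off the 45 degree line, and the slope is \(\sigma_2 / \sigma_1\), so a difference in variances moves the slope away from 1.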

Finally, it is possible for distributions to be similar and then to diverge only for part of the distribution. This will be quite clear on a QQ-plot.
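One way to locate such a partial divergence, sketched here with a hypothetical pair of quantile sets: the second set matches the first up to the 70th percentile and is stretched above it. The tolerance `tol` is an assumed value in porosity units.

```python
import numpy as np
from scipy import stats

probs = np.arange(1, 100)                                # percentiles 1, 2, ..., 99
q1 = stats.norm.ppf(probs / 100, loc=0.30, scale=0.05)   # reference quantiles (Gaussian assumed)

# hypothetical second distribution: identical up to P70, upper tail stretched by a factor of 2
q70 = q1[probs == 70][0]
q2 = np.where(probs > 70, q70 + 2.0 * (q1 - q70), q1)

tol = 0.005                                              # assumed tolerance for "on the 45 degree line"
diverging = probs[np.abs(q2 - q1) > tol]                 # percentiles that depart from the line
print(diverging.min(), diverging.max())
```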

PP-plot#

There is an alternative to the QQ-plot, the PP-plot. Instead of matching by percentiles like a QQ-plot, a PP-plot matches by values and plots the cumulative probabilities.

  • tails better expressed (difference magnified) on QQ-plot

  • mode better expressed (difference magnified) with PP-plot

  • QQ-plot is a distribution transform function

  • PP-plot lower and upper tails are forced to (0.0, 0.0) and (1.0, 1.0) respectively, forcing the plots to be more similar than QQ-plots

To visualize the calculation of a PP-plot see this illustration of the calculation of a single point for the 0.35 porosity.

The QQ-plot and PP-plot methods for matching between distributions (left), an illustration of the calculation of a single point 0.3 cumulative probability for a QQ-plot (center), and an illustration of the calculation of a single point 0.35 porosity for a PP-plot (right).
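The single PP-plot point can be sketched as follows, using synthetic samples as stand-ins (an assumption, not the data behind the illustration). The value 0.35 is looked up in both distributions and the two cumulative probabilities form the point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                 # assumed seed for repeatability
x1 = rng.normal(loc=0.35, scale=0.06, size=100) # illustrative sample for the x-axis
x2 = rng.normal(loc=0.30, scale=0.05, size=50)  # illustrative sample for the y-axis

value = 0.35                                    # porosity value to match
p1 = stats.percentileofscore(x1, value) / 100   # cumulative probability in X1
p2 = stats.percentileofscore(x2, value) / 100   # cumulative probability in X2
print((p1, p2))                                 # (x, y) location of one PP-plot point
```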

Getting Started#

Here are the steps to get set up in Python with the GeostatsPy package:

  1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/).

  2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal.

  3. In the terminal type: pip install geostatspy.

  4. Open Jupyter and, in the top block, copy and paste the code block below from this Jupyter Notebook to start using the geostatspy functionality.

You will need to copy the data file to your working directory. The dataset is available on my GitHub account in my GeoDataSets repository at:

Load the Required Libraries#

The following code loads the required libraries.

import geostatspy.GSLIB as GSLIB                              # GSLIB utilities, visualization and wrapper
import geostatspy.geostats as geostats                        # GSLIB methods converted to Python  
import geostatspy
print('GeostatsPy version: ' + str(geostatspy.__version__))     
GeostatsPy version: 0.0.72

We will also need some standard packages. These should have been installed with Anaconda 3.

ignore_warnings = True                                        # ignore warnings?
import numpy as np                                            # ndarrays for gridded data
import pandas as pd                                           # DataFrames for tabular data
from scipy import stats                                       # inverse percentiles, percentileofscore function for P-P plots
import os                                                     # set working directory, run executables
import matplotlib.pyplot as plt                               # plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
import matplotlib.gridspec as gridspec
plt.rc('axes', axisbelow=True)
if ignore_warnings:                                           # suppress warnings if requested
    import warnings
    warnings.filterwarnings('ignore')
cmap = plt.cm.inferno                                         # color map
seed = 42                                                     # random number seed

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.

Declare Functions#

Let’s define two convenience functions that add major and minor gridlines to improve plot interpretability, one for the current axes and one for a specified subplot.

def add_grid():
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks 

def add_grid2(sub_plot):
    sub_plot.grid(True, which='major',linewidth = 1.0); sub_plot.grid(True, which='minor',linewidth = 0.2) # add y grids
    sub_plot.tick_params(which='major',length=7); sub_plot.tick_params(which='minor', length=4)
    sub_plot.xaxis.set_minor_locator(AutoMinorLocator()); sub_plot.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks  

Set the Working Directory#

I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time). Set this to your working directory, with the above mentioned data file.

#os.chdir("d:/PGE383")                                        # set the working directory

Make Data#

Let’s specify two univariate Gaussian distributions and then sample from these distributions.

  • This allows us to vary the distributions and number of data and visualize the impact on the QQ-plot.

n1 = 100; mean1 = 0.35; stdev1 = 0.06                         # specify the two distributions (assume Gaussian) 
n2 = 50; mean2 = 0.3; stdev2 = 0.05
np.random.seed(seed)                                          # set the random number seed for repeatable samples
X1 = np.random.normal(loc=mean1,scale=stdev1,size=n1)
X2 = np.random.normal(loc=mean2,scale=stdev2,size=n2)
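As a hedged alternative sketch, NumPy's `default_rng` generator can be used in place of the global random state. With small samples the sample statistics will differ from the specified parameters, so it is worth printing them before interpreting the QQ-plot. The parameter values repeat those above; the summary loop is an addition for illustration.

```python
import numpy as np

seed = 42                                                     # random number seed
rng = np.random.default_rng(seed)                             # local, repeatable generator

n1, mean1, stdev1 = 100, 0.35, 0.06                           # the two Gaussian distributions
n2, mean2, stdev2 = 50, 0.30, 0.05

X1 = rng.normal(loc=mean1, scale=stdev1, size=n1)
X2 = rng.normal(loc=mean2, scale=stdev2, size=n2)

for name, X, mean, stdev in [('X1', X1, mean1, stdev1), ('X2', X2, mean2, stdev2)]:
    std_err = stdev / np.sqrt(len(X))                         # standard error of the mean
    print(f'{name}: sample mean {X.mean():.3f} (specified {mean}, std. error {std_err:.3f}), '
          f'sample stdev {X.std(ddof=1):.3f} (specified {stdev})')
```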

Calculate the QQ-plot#

Calculate and match percentiles from both data distributions.

nq = 100                                                      # the number of points (equal cumulative probability) sampled for the QQ-plot
xmin=0.0; xmax=0.6                                            # the range values for the plot axes

cumul_prob = np.linspace(1,99,nq)                             # cumulative probability array
X1_percentiles = np.percentile(X1,cumul_prob)                 # calculate all percentiles for plotting
X2_percentiles = np.percentile(X2,cumul_prob)
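The same calculation can be wrapped in a small helper, sketched below. The function name `qq_points` and its arguments are hypothetical and not part of GeostatsPy. Because percentiles are matched by cumulative probability, the two samples do not need the same number of data. As a sanity check, a sample compared with a shifted copy of itself should plot as a line parallel to the 45 degree line.

```python
import numpy as np

def qq_points(a, b, nq=100, p_low=1.0, p_high=99.0):
    """Return matched percentiles of samples a and b at nq equally spaced cumulative probabilities."""
    probs = np.linspace(p_low, p_high, nq)                    # percentiles in percent
    return np.percentile(a, probs), np.percentile(b, probs)

rng = np.random.default_rng(42)                               # assumed seed for repeatability
a = rng.normal(loc=0.35, scale=0.06, size=100)                # illustrative sample

qa, qb = qq_points(a, a + 0.05)                               # compare with a shifted copy
print(np.allclose(qb - qa, 0.05))                             # constant offset from the 45 degree line
```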

Make the Q-Q Plot Visualization#

Let’s look at the data histograms, cumulative distribution functions and QQ-plot.

fig = plt.figure()
spec = fig.add_gridspec(2, 3)

ax0 = fig.add_subplot(spec[:, 1:])
plt.scatter(X1_percentiles,X2_percentiles,color='darkorange',edgecolor='black',s=10,label='Q-Q plot')
plt.plot([0,1],[0,1],ls='--',color='red')
plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([xmin,xmax]); plt.xlabel(r'$X_1$ - Porosity (fraction)'); plt.ylabel(r'$X_2$ - Porosity (fraction)'); 
plt.title('Q-Q Plot'); plt.legend(loc='lower right')
add_grid2(ax0)

ax10 = fig.add_subplot(spec[0, 0])
plt.hist(X1,bins=np.linspace(xmin,xmax,30),color='red',alpha=0.5,edgecolor='black',label=r'$X_1$',density=True)
plt.hist(X2,bins=np.linspace(xmin,xmax,30),color='yellow',alpha=0.5,edgecolor='black',label=r'$X_2$',density=True)
plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([0,15]); plt.xlabel('Porosity (fraction)'); plt.ylabel('Density')
plt.title('Histograms'); plt.legend(loc='upper right')
add_grid2(ax10)

ax11 = fig.add_subplot(spec[1, 0])
plt.scatter(np.sort(X1),np.linspace(0,1,len(X1)),color='red',edgecolor='black',s=10,label=r'$X_1$')
plt.scatter(np.sort(X2),np.linspace(0,1,len(X2)),color='yellow',edgecolor='black',s=10,label=r'$X_2$')
plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([0,1]); plt.xlabel('Porosity (fraction)'); plt.title('CDFs'); plt.legend(loc='lower right')
add_grid2(ax11)

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.4, wspace=0.3, hspace=0.3); plt.show()

Calculate the P-P plot#

Calculate and match the values from both data distributions and plot the cumulative probabilities.

min_X = min(X1.min(),X2.min())                                # find the min and max feature values for interval to sample
max_X = max(X1.max(),X2.max())

X_values = np.linspace(min_X,max_X,nq)                        # values to sample

X1_cumul_probs = []; X2_cumul_probs = []

for X in X_values:                                            # final percentiles to plot
    X1_cumul_probs.append(stats.percentileofscore(X1,X)/100)
    X2_cumul_probs.append(stats.percentileofscore(X2,X)/100)
    
X1_cumul_probs = np.asarray(X1_cumul_probs); X2_cumul_probs = np.asarray(X2_cumul_probs)
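For larger datasets the loop can be replaced with a vectorized empirical CDF, sketched below with `np.searchsorted`. The helper name `ecdf` is hypothetical. It uses the "less than or equal" convention, which corresponds to `kind='weak'` in `stats.percentileofscore`; the default `kind='rank'` used above differs only when a sampled value exactly ties with a data value.

```python
import numpy as np
from scipy import stats

def ecdf(sample, values):
    """Fraction of the sample less than or equal to each value (the 'weak' convention)."""
    sorted_sample = np.sort(sample)
    return np.searchsorted(sorted_sample, values, side='right') / len(sorted_sample)

rng = np.random.default_rng(42)                               # assumed seed for repeatability
sample = rng.normal(loc=0.35, scale=0.06, size=100)           # illustrative sample
values = np.linspace(sample.min(), sample.max(), 100)         # values to sample

fast = ecdf(sample, values)                                   # vectorized cumulative probabilities
loop = np.array([stats.percentileofscore(sample, v, kind='weak') / 100 for v in values])
print(np.allclose(fast, loop))                                # compare against the loop version
```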

Make the P-P Plot Visualization#

Let’s look at the data histograms, cumulative distribution functions and PP-plot.

fig = plt.figure()
spec = fig.add_gridspec(2, 3)

ax0 = fig.add_subplot(spec[:, 1:])
plt.scatter(X1_cumul_probs,X2_cumul_probs,color='darkorange',edgecolor='black',s=10,label='P-P plot')
plt.plot([0,1.0],[0,1.0],ls='--',color='red')
plt.grid(); plt.xlim([0.0,1.0]); plt.ylim([0.0,1.0]); plt.xlabel(r'$X_1$ - Porosity Percentile (fraction)'); plt.ylabel(r'$X_2$ - Porosity Percentile (fraction)'); 
plt.title('P-P Plot'); plt.legend(loc='lower right')

ax10 = fig.add_subplot(spec[0, 0])
plt.hist(X1,bins=np.linspace(xmin,xmax,30),color='red',alpha=0.5,edgecolor='black',label=r'$X_1$',density=True)
plt.hist(X2,bins=np.linspace(xmin,xmax,30),color='yellow',alpha=0.5,edgecolor='black',label=r'$X_2$',density=True)
plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([0,15]); plt.xlabel('Porosity (fraction)'); plt.ylabel('Density')
plt.title('Histograms'); plt.legend(loc='upper right')

ax11 = fig.add_subplot(spec[1, 0])
plt.scatter(np.sort(X1),np.linspace(0,1,len(X1)),color='red',edgecolor='black',s=10,label=r'$X_1$')
plt.scatter(np.sort(X2),np.linspace(0,1,len(X2)),color='yellow',edgecolor='black',s=10,label=r'$X_2$')
plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([0,1]); plt.xlabel('Porosity (fraction)'); plt.title('CDFs'); plt.legend(loc='lower right')

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.4, wspace=0.3, hspace=0.3); plt.show()

Comments#

I hope you found this chapter helpful. Much more could be done and discussed, and I have many more resources. Check out my shared resource inventory,

Michael

The Author:#

Michael J. Pyrcz, Professor, The University of Texas at Austin

Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael J. Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
