Multivariate Analysis#

Michael J. Pyrcz, Professor, The University of Texas at Austin

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn

Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.

Cite as: Pyrcz, M.J., 2024, Applied Machine Learning in Python: a Hands-on Guide with Code, https://geostatsguy.github.io/MachineLearningDemos_Book.

By Michael J. Pyrcz
© Copyright 2024.

This chapter is a tutorial for / demonstration of Multivariate Analysis.

YouTube Lecture: check out my lectures on:

These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael’s Story.

Motivation for Multivariate Analysis#

To build good machine learning models, we build on a foundation of statistical analysis to explore and understand the relationships within our data. This includes the relationships between 2 or more features at a time!

  • in many cases these multivariate statistics are actually critical components of the machine learning methods, e.g., eigenvalues and eigenvectors are calculated from the covariance matrix for principal component analysis.

Let’s talk about bivariate and multivariate methods briefly. I confess the statistics beyond bivariate are a placeholder right now. Note, in the feature ranking workflow (link above) I have included partial correlation, maximum relevance minimum redundancy (MRMR) based on mutual information, and random forest feature importance methods that extract information beyond bivariate.

Bivariate Analysis#

Understand and quantify the relationship between two variables

  • example: relationship between porosity and permeability

  • how can we use this relationship?

What would be the impact if we ignore this relationship and simply modeled porosity and permeability independently?

  • no relationship beyond constraints at data locations

  • independent away from data

  • nonphysical results, unrealistic uncertainty models

Bivariate Statistics#

Pearson’s Product‐Moment Correlation Coefficient

  • Provides a measure of the degree of linear relationship.

  • We refer to it as the ‘correlation coefficient’

Let’s review the sample variance of variable \(x\). Of course, I’m truncating our notation as \(x\) is a set of samples at locations in our modeling space, \(x(\bf{u_\alpha}), \, \forall \, \alpha = 1, \dots, n\).

\[ \sigma^2_{x} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{(n-1)} \]

We can expand the squared term and replace one of the terms with \(y\), another variable in addition to \(x\).

\[ C_{xy} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{(n-1)} \]

We now have a measure that represents the manner in which variables \(x\) and \(y\) co-vary, or vary together. We can standardize the covariance by the product of the standard deviations of \(x\) and \(y\) to calculate the correlation coefficient.

\[ \rho_{xy} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{(n-1)\sigma_x \sigma_y}, \, -1.0 \le \rho_{xy} \le 1.0 \]

In summary we can state that the correlation coefficient is related to the covariance as:

\[ \rho_{xy} = \frac{C_{xy}}{\sigma_x \sigma_y} \]
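To confirm these formulas, here is a minimal NumPy check with placeholder arrays (the values below are illustrative only, not from the dataset), calculating the covariance and correlation coefficient by hand and comparing to NumPy’s built-in functions.

import numpy as np                                            # ndarrays and statistics

x = np.array([0.10, 0.15, 0.12, 0.18, 0.11])                  # placeholder porosity-like values
y = np.array([1.2, 9.8, 3.1, 25.0, 2.4])                      # placeholder permeability-like values
n = len(x)

cov_xy = np.sum((x - x.mean())*(y - y.mean()))/(n - 1)        # covariance, C_xy
rho_xy = cov_xy/(x.std(ddof=1)*y.std(ddof=1))                 # correlation, C_xy/(sigma_x * sigma_y)

print(np.isclose(cov_xy, np.cov(x,y)[0,1]))                   # check against NumPy's covariance
print(np.isclose(rho_xy, np.corrcoef(x,y)[0,1]))              # check against NumPy's correlation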

Pearson’s correlation coefficient is quite sensitive to outliers and departures from linear behavior (in the bivariate sense). We have an alternative known as Spearman’s rank correlation coefficient.

\[ \rho_{R_x R_y} = \frac{\sum_{i=1}^{n} (R_{x_i} - \overline{R_x})(R_{y_i} - \overline{R_y})}{(n-1)\sigma_{R_x} \sigma_{R_y}}, \, -1.0 \le \rho_{R_x R_y} \le 1.0 \]

The rank correlation applies the rank transform to the data prior to calculating the correlation coefficient. To calculate the rank transform simply replace the data values with their ranks, \(R_x = 1,\dots,n\), where rank \(1\) is assigned to the minimum value and rank \(n\) to the maximum value.

\[ x_\alpha, \, \forall \alpha = 1,\dots, n, \, | \, x_i \ge x_j \, \forall \, i \gt j \]
\[ R_{x_i} = i \]
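Here is a minimal sketch of the rank transform with SciPy’s rankdata, reusing the placeholder arrays from the check above; applying Pearson’s correlation to the ranks reproduces SciPy’s spearmanr.

from scipy import stats                                       # rank transform and rank correlation

R_x = stats.rankdata(x); R_y = stats.rankdata(y)              # ranks, 1 for the minimum, n for the maximum
rho_rank = np.corrcoef(R_x,R_y)[0,1]                          # Pearson's correlation on the ranks

print(np.isclose(rho_rank, stats.spearmanr(x,y)[0]))          # matches Spearman's rank correlation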

The correlation coefficients provide useful metrics to quantify relationships between two variables at a time. We can also consider bivariate scatter plots and matrix scatter plots to visualize multivariate data. In general, current practical subsurface modeling is bivariate, two variables at a time.

Multivariate Statistics#

See my lecture on Multivariate Analysis that includes more on multivariate statistics, including the concepts of joint, conditional and marginal probability. For convenience I summarize some concepts here.

For example, if we define a probability density function (PDF) of feature \(X_1\) as,

\[ f_{X_1}(x_1) \]

we can extend the definition to a joint PDF for any arbitrary number, e.g., \(m\), features as,

\[ f_{X_1,\ldots,X_m}(x_1,\ldots,x_m) \]

Similarly, conditional PDFs can be calculated,

\[ f_{Y | X}(y | x) \]

and extended to any number of features, e.g., \(m\), features as,

\[ f_{Y | X_1, \ldots, X_m}(y | x_1,\ldots,x_m) \]

and many of our statistical concepts extend to high dimensions, for example consider the correlation coefficient,

\[ \rho_{X,Y} \]

between 2 features and now the partial correlation coefficient that calculates correlation between 2 features (\(X\) and \(Y\)) while controlling for all other features (\(Z\)).

\[ \rho_{X,Y | Z} \]
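Partial correlation is not built into pandas, so here is a minimal sketch, with a hypothetical helper function of my own naming, that calculates \(\rho_{X,Y|Z}\) as the correlation between the residuals after linearly regressing each of \(X\) and \(Y\) on the controlling features \(Z\).

import numpy as np                                            # linear algebra for the regressions

def partial_corr(df,x,y,z_list):                              # rho_{X,Y|Z} from regression residuals
    Z = np.column_stack([np.ones(len(df))] + [df[z].values for z in z_list]) # design matrix with intercept
    res_x = df[x].values - Z @ np.linalg.lstsq(Z,df[x].values,rcond=None)[0] # residual of X given Z
    res_y = df[y].values - Z @ np.linalg.lstsq(Z,df[y].values,rcond=None)[0] # residual of Y given Z
    return np.corrcoef(res_x,res_y)[0,1]                      # correlation of the residuals

# example, once the DataFrame is loaded below:
# partial_corr(df,'Porosity','Perm',['AI'])                   # porosity vs. permeability controlling for AI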

Ultimately the difficulty with calculating high dimensional (massively multivariate) probabilities and statistics is inferential; we rarely have enough data to observe the vast multivariate space and to model probabilities and statistics in that space!

  • See my linked lecture on the curse of dimensionality above.

Load the Required Libraries#

The following code loads the required libraries.

import geostatspy.GSLIB as GSLIB                              # GSLIB utilities, visualization and wrapper
import geostatspy.geostats as geostats                        # GSLIB methods convert to Python  
import geostatspy
print('GeostatsPy version: ' + str(geostatspy.__version__))     
GeostatsPy version: 0.0.71

We will also need some standard packages. These should have been installed with Anaconda 3.

ignore_warnings = True                                        # ignore warnings?
import numpy as np                                            # ndarrays for gridded data
import pandas as pd                                           # DataFrames for tabular data
from scipy import stats                                       # summary statistics
import copy                                                   # for deep copies
import os                                                     # set working directory, run executables
import matplotlib.pyplot as plt                               # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
from matplotlib.colors import ListedColormap                  # custom color maps
import seaborn as sns                                         # advanced plotting
import matplotlib.ticker as mtick                             # control tick label formatting
plt.rc('axes', axisbelow=True)                                # plot all grids below the plot elements
if ignore_warnings == True:                                   
    import warnings
    warnings.filterwarnings('ignore')
cmap = plt.cm.inferno                                         # color map
seed = 42                                                     # random number seed

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.

Declare Functions#

Let’s define a single function to streamline plotting correlation matrices. I also added a convenience function to add major and minor gridlines to improve plot interpretability.

def plot_corr(corr_matrix,title,limits,mask):                 # plots a graphical correlation matrix 
    my_colormap = plt.cm.get_cmap('RdBu_r', 256)          
    newcolors = my_colormap(np.linspace(0, 1, 256))
    white = np.array([256/256, 256/256, 256/256, 1])
    white_low = int(128 - mask*128); white_high = int(128+mask*128)
    newcolors[white_low:white_high, :] = white                # mask correlations with absolute value below the mask fraction
    newcmp = ListedColormap(newcolors)
    m = corr_matrix.shape[0]
    im = plt.matshow(corr_matrix,fignum=0,vmin = -1.0*limits, vmax = limits,cmap = newcmp)
    plt.xticks(range(len(corr_matrix.columns)), corr_matrix.columns)
    plt.yticks(range(len(corr_matrix.columns)), corr_matrix.columns)
    plt.colorbar(im, orientation = 'vertical')
    plt.title(title)
    for i in range(0,m):
        plt.plot([i-0.5,i-0.5],[-0.5,m-0.5],color='black')
        plt.plot([-0.5,m-0.5],[i-0.5,i-0.5],color='black')
    plt.ylim([-0.5,m-0.5]); plt.xlim([-0.5,m-0.5])

def add_grid():
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks 

def add_grid2(sub_plot):
    sub_plot.grid(True, which='major',linewidth = 1.0); sub_plot.grid(True, which='minor',linewidth = 0.2) # add y grids
    sub_plot.tick_params(which='major',length=7); sub_plot.tick_params(which='minor', length=4)
    sub_plot.xaxis.set_minor_locator(AutoMinorLocator()); sub_plot.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks  

Design Custom Color Map#

Accounting for significance by masking nonsignificant values

  • for demonstration only currently, could be updated for each plot based on results confidence and uncertainty

my_colormap = plt.cm.get_cmap('RdBu_r', 256)                  # make a custom colormap
newcolors = my_colormap(np.linspace(0, 1, 256))               # define colormap space
white = np.array([250/256, 250/256, 250/256, 1])              # define white color (4 channel)
#newcolors[26:230, :] = white                                 # mask all correlations less than abs(0.8)
#newcolors[56:200, :] = white                                 # mask all correlations less than abs(0.6)
newcolors[76:180, :] = white                                  # mask all correlations less than abs(0.4)
signif = ListedColormap(newcolors)                            # assign as listed colormap
        
my_colormap = plt.cm.get_cmap('inferno', 256)                 # make a custom colormap
newcolors = my_colormap(np.linspace(0, 1, 256))               # define colormap space
white = np.array([250/256, 250/256, 250/256, 1])              # define white color (4 channel)
#newcolors[26:230, :] = white                                 # mask all correlations less than abs(0.8)
newcolors[0:12, :] = white                                    # mask the lowest values (bottom ~5% of the range)
#newcolors[86:170, :] = white                                 # mask all correlations less than abs(0.4)
sign1 = ListedColormap(newcolors)                             # assign as listed colormap

Set the Working Directory#

I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time).

#os.chdir("c:/PGE383")                                        # set the working directory

Loading Tabular Data#

Here’s the command to load our comma delimited data file into a pandas DataFrame object.

Let’s load the provided multivariate, spatial dataset ‘sample_data_MV_biased.csv’. It is a comma delimited file with:

  • X and Y coordinates (\(m\))

  • facies 0 and 1

  • porosity (fraction)

  • permeability (\(mD\))

  • acoustic impedance (\(\frac{kg}{m^3} \cdot \frac{m}{s} \cdot 10^3\)).

#df = pd.read_csv('sample_data_MV_biased.csv')                # load our data table
df = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/sample_data_MV_biased.csv')
df = df.iloc[:,1:]                                            # drop the first column, an unneeded index

Visualize the DataFrame#

Visualizing the DataFrame is useful.

We can preview by utilizing the ‘head’ DataFrame member function (with a nice and clean format, see below).

  • add parameter ‘n=13’ to see the first 13 rows of the dataset.

df.head(n=13)                                                 # we could also use this command for a table preview
        X      Y  Facies  Porosity        Perm           AI
0   100.0  900.0     0.0  0.101319    1.996868  5590.417154
1   100.0  800.0     1.0  0.147676   10.711789  3470.845666
2   100.0  700.0     1.0  0.145912   17.818143  3586.988513
3   100.0  600.0     1.0  0.186167  217.109365  3732.114787
4   100.0  500.0     1.0  0.146088   16.717367  2534.551236
5   200.0  900.0     1.0  0.129949   23.348473  4781.590782
6   200.0  700.0     1.0  0.185299  595.674540  4729.017454
7   200.0  600.0     1.0  0.164923  154.555841  2147.668805
8   200.0  500.0     1.0  0.167026   50.970899  4583.679953
9   200.0  400.0     1.0  0.125382    4.061772  4477.930200
10  200.0  200.0     0.0  0.077419    1.162642  6844.535923
11  300.0  800.0     1.0  0.164797  497.593578  3881.489046
12  300.0  700.0     1.0  0.165155  486.504799  3726.023776

Summary Statistics for Tabular Data#

There are a lot of efficient methods to calculate summary statistics from tabular data in DataFrames. The describe command provides count, mean, minimum, maximum, and quartiles all in a nice data table.

  • We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.

df.describe().transpose()                                     # DataFrame summary statistics
          count         mean         std          min          25%          50%          75%          max
X         368.0   499.565217  289.770794     0.000000   240.000000   500.000000   762.500000   990.000000
Y         368.0   520.644022  277.412187     9.000000   269.000000   539.000000   769.000000   999.000000
Facies    368.0     0.597826    0.491004     0.000000     0.000000     1.000000     1.000000     1.000000
Porosity  368.0     0.127026    0.030642     0.041122     0.103412     0.125842     0.148623     0.210258
Perm      368.0    85.617362  228.362654     0.094627     2.297348    10.377292    50.581288  1991.097723
AI        368.0  4791.736646  974.560569  1981.177309  4110.728374  4713.325533  5464.043562  7561.250336

Visualizing Tabular Data with Location Maps#

It is natural to set the x and y coordinate and feature ranges manually rather than defaulting to the data extents, e.g., do you want your color bar to go from 0.05887 to 0.24230 exactly? Also, let’s pick a color map for display.

  • I heard that plasma is known to be friendly to the color blind as the color and intensity vary together (hope I got that right, it was an interesting Twitter conversation started by Matt Hall from Agile if I recall correctly).

  • We will assume a study area of 0 to 1,000m in x and y and omit any data outside this area.

xmin = 0.0; xmax = 1000.0                                     # range of x values
ymin = 0.0; ymax = 1000.0                                     # range of y values
pormin = 0.05; pormax = 0.25;                                 # range of porosity values
permmin = 0.01; permmax = 2000.0                              # range of permeability values
AImin = 2000.0; AImax = 8000.0                                # range of AI values
nx = 100; ny = 100; csize = 10.0                              # grid dimensions and cell size

Let’s try out locmap. This is a reimplementation of GSLIB’s locmap program that uses matplotlib. I hope you find it simpler than matplotlib; if you want to get more advanced and build custom plots, look at the source. If you improve it, send me the new code.

Now we can populate the plotting parameters and visualize the porosity data.

plt.subplot(221)
GSLIB.locmap_st(df,'X','Y','Facies',xmin,xmax,ymin,ymax,0,1,'Well Data - Facies','X(m)','Y(m)','Facies (0-shale, 1-sand)',cmap)

plt.subplot(222)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Well Data - Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)

plt.subplot(223)
GSLIB.locmap_st(df,'X','Y','Perm',xmin,xmax,ymin,ymax,permmin,permmax,'Well Data - Permeability','X(m)','Y(m)','Permeability (md)',cmap)

plt.subplot(224)
GSLIB.locmap_st(df,'X','Y','AI',xmin,xmax,ymin,ymax,AImin,AImax,'Well Data - Acoustic Impedance','X(m)','Y(m)','Acoustic Impedance (m/s x g/cm^3)',cmap)

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.7, wspace=0.3, hspace=0.3); plt.show()
[Figure: location maps of the well data; facies, porosity, permeability and acoustic impedance.]

Bivariate Analysis#

Let’s start with simple bivariate scatter plots, and calculating bivariate statistics. Here’s the scatter plots.

  • I included code below (commented out) for writing the plot as an image file. Note, you can write to many different formats, e.g., .tif, .jpg, etc., and you can also control the resolution.

  • by saving in .pdf, .svg, or .eps formats you create vector graphics that retain their sharpness at any resolution with economy of file size

plt.subplot(121)
plt.plot(df['Porosity'].values,df['Perm'].values, 'o', label='', markerfacecolor='darkorange', markeredgecolor='black', alpha=0.8)
plt.title('Well Data Permeability vs. Porosity')
plt.xlabel('Porosity (fraction)'); plt.ylabel('Permeability (mD)')
plt.xlim([pormin,pormax]); plt.ylim([permmin,permmax]); add_grid()

plt.subplot(122)
plt.plot(df['AI'].values,df['Porosity'].values, 'o', label='', markerfacecolor='darkorange', markeredgecolor='black', alpha=0.8)
plt.title('Well Data Porosity vs. Acoustic Impedance')
plt.ylabel('Porosity (fraction)'); plt.xlabel('Acoustic Impedance (m/s x g/cm^3)')
plt.xlim([AImin,AImax]); plt.ylim([pormin,pormax]); add_grid()

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=0.8, wspace=0.3, hspace=0.2)
#plt.savefig('Test.pdf', dpi=600, bbox_inches = 'tight',format='pdf')   
plt.show()
[Figure: scatter plots of permeability vs. porosity and porosity vs. acoustic impedance.]

Correlation and Covariance#

It is straightforward to calculate the covariance and correlation from the pairs of data in our dataset. Here’s the covariance matrix.

  • Notice that the matrix is symmetrical? Makes sense, as \(C_{Por,Perm} = C_{Perm,Por}\).

  • Also, note that the diagonal values (\(C_{i,j}\) where \(i=j\)) are equal to the feature variances.

As a check, we also calculate the variance of porosity directly. Note, NumPy’s var applies the biased estimator (divide by \(n\)) by default, so it is slightly smaller than the \(n-1\) based value on the covariance matrix diagonal.

print(df.iloc[:,3:7].cov())                                   # the covariance matrix for Porosity, Perm and AI over all rows
print('The variance of porosity is ' + str(round(np.var(df['Porosity'].values),6)))
           Porosity          Perm             AI
Porosity   0.000939      4.055029     -17.132244
Perm       4.055029  52149.501968  -46471.695092
AI       -17.132244 -46471.695092  949768.302409
The variance of porosity is 0.000936

Here’s the correlation coefficient.

df.iloc[:,3:7].corr()                                         # correlation matrix
          Porosity      Perm        AI
Porosity  1.000000  0.579493 -0.573700
Perm      0.579493  1.000000 -0.208812
AI       -0.573700 -0.208812  1.000000

Visualize the Correlation Matrix#

It is convenient to visualize the correlation matrix as follows.

  • I added a custom colour bar to communicate significance. Demonstration only, I did not calculate confidence intervals, but that could be added.

corr_matrix = df.iloc[:,2:].corr()
plt.subplot(111)
plot_corr(corr_matrix,'Correlation Matrix',1.0,0.5)           # using our correlation matrix visualization function
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()
[Figure: correlation matrix with significance masking.]

Spearman’s Rank Correlation Coefficient#

The Spearman’s rank correlation coefficient provides a measure of the degree of monotonic relationship.

  • Rank transform, i.e., \(R_{x_i}\): sort the data in ascending order and replace each data value with its index, \(i=1,\ldots,n\).

  • Spearman’s rank correlation coefficient is more robust in the presence of outliers and some nonlinear features than the Pearson’s correlation coefficient

Let’s try out the rank correlation coefficient with SciPy’s stats module function.

  • it also provides a p-value for each measure for significance testing, i.e., if p-value < alpha then we reject the null hypothesis that the rank correlation coefficient is 0.0.

rank_correlation, rank_correlation_pval = stats.spearmanr(df.iloc[:,2:]) # calculate the rank correlation coefficient
rank_matrix = pd.DataFrame(rank_correlation,columns=corr_matrix.columns)
print('Rank Correlation:')
print(rank_correlation)
print('\nRank Correlation p-value:')
print(rank_correlation_pval)
Rank Correlation:
[[ 1.          0.79973189  0.71923703 -0.50503617]
 [ 0.79973189  1.          0.88314655 -0.56863773]
 [ 0.71923703  0.88314655  1.         -0.34876935]
 [-0.50503617 -0.56863773 -0.34876935  1.        ]]

Rank Correlation p-value:
[[0.00000000e+000 4.11417796e-083 7.49150387e-060 3.24513314e-025]
 [4.11417796e-083 0.00000000e+000 2.26847950e-122 6.59769641e-033]
 [7.49150387e-060 2.26847950e-122 0.00000000e+000 5.77157354e-012]
 [3.24513314e-025 6.59769641e-033 5.77157354e-012 0.00000000e+000]]

Plot Correlation and Rank Correlation Matrices#

plt.subplot(131)                                              # plot correlation matrix with significance colormap
plot_corr(corr_matrix,'Correlation Matrix',1.0,0.5)           # using our correlation matrix visualization function

plt.subplot(132)                                              # plot correlation matrix with significance colormap
plot_corr(rank_matrix,'Rank Correlation Matrix',1.0,0.5)      # using our correlation matrix visualization function

plt.subplot(133)                                              # plot correlation matrix with significance colormap
diff = corr_matrix.values - rank_matrix.values
diff_matrix = pd.DataFrame(diff,columns=corr_matrix.columns)
plot_corr(diff_matrix,'Correlation - Rank Correlation',0.3,0.0) # using our correlation matrix visualization function

plt.subplots_adjust(left=0.0, bottom=0.0, right=2.5, top=0.6, wspace=0.2, hspace=0.3); plt.show()
[Figure: correlation matrix, rank correlation matrix and their difference.]

From the correlation minus rank correlation heatmap we see,

  • the nonlinearity is potentially decreasing the permeability correlation with facies and porosity, and increasing the correlation with AI.

Matrix Scatter Plots#

If we have 3 or more variables to consider then matrix scatter plots offer an efficient method to display the multivariate relationships, 2 variables at a time. One can identify:

  1. the range, envelope of the paired data

  2. homoscedastic and heteroscedastic behaviors

  3. nonlinear features

Here’s the seaborn package matrix scatter plot function, pairplot. Let’s color the results by facies.

sns.pairplot(df, hue='Facies',vars=['Porosity','Perm','AI'],markers='o')
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.9, top=0.9, wspace=0.2, hspace=0.3); plt.show()
[Figure: matrix scatter plot of porosity, permeability and acoustic impedance colored by facies.]

Remember, although these matrix scatterplots are very useful, they are still a bivariate marginalization of the higher dimensionality relationships.

  • There may be structures in the other dimensions that we do not see.

Pair Grid Plot#

For maximum control of the resulting plot we can work with pair grid plots from seaborn.

  • we can independently set lower, upper and diagonal components of the plot

We demonstrate this with all facies pooled together for:

  • scatter plots for upper

  • histograms for diagonal

  • kernel density plots for the lower

pairgrid = sns.PairGrid(df,vars=['Porosity','Perm','AI'])
pairgrid = pairgrid.map_upper(plt.scatter, color = 'darkorange', edgecolor = 'black', alpha = 0.8, s = 10) # scatter plots on the upper triangle
pairgrid = pairgrid.map_diag(plt.hist, bins = 20, color = 'darkorange',alpha = 0.8, edgecolor = 'k') # histograms on the diagonal
pairgrid = pairgrid.map_lower(sns.kdeplot, cmap = plt.cm.inferno,
                              fill = False, levels = 10, alpha = 1.0) # kernel density contours on the lower triangle; fill/levels replace seaborn's deprecated shade/shade_lowest/n_levels arguments
pairgrid.add_legend()
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.9, top=0.9, wspace=0.2, hspace=0.2); plt.show()
[Figure: pair grid plot with scatter plots (upper), histograms (diagonal) and kernel density contours (lower).]

Joint, Conditional and Marginals#

We can use kernel density estimation to estimate the joint probability density function (PDF) for the paired data, a 2D PDF!

  • We could use this to estimate any required joint, marginal and conditional probability (care must be taken with normalization).

Let’s use the seaborn package ‘kdeplot’ function to estimate the joint pdf for porosity and acoustic impedance.

ax = sns.kdeplot(data=df,x='AI',y='Porosity', fill=True, levels = 10,cmap=cmap,cbar= True, thresh = 0.05) # fill/levels/thresh replace seaborn's deprecated shade/n_levels/shade_lowest arguments
ax.set_xlabel('Acoustic Impedance (m/s x g/cm^3)'); ax.set_ylabel('Porosity (fraction)'); ax.set_title('Porosity vs. Acoustic Impedance')
add_grid()
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.9, top=0.9, wspace=0.2, hspace=0.2); plt.show()
[Figure: kernel density estimate of the joint PDF of porosity and acoustic impedance.]
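To make the normalization caveat above concrete, here is a sketch with SciPy’s gaussian_kde, assuming the df and plotting ranges already defined above: evaluate the joint PDF on a grid, integrate out porosity for the acoustic impedance marginal, and divide to obtain the conditional PDF of porosity given acoustic impedance.

kde = stats.gaussian_kde(np.vstack([df['AI'].values,df['Porosity'].values])) # fit the joint PDF
ai_grid = np.linspace(AImin,AImax,60); por_grid = np.linspace(pormin,pormax,60) # evaluation grid
AI2,POR2 = np.meshgrid(ai_grid,por_grid)                      # 2D grid over AI and porosity
joint = kde(np.vstack([AI2.ravel(),POR2.ravel()])).reshape(AI2.shape) # joint PDF on the grid

marginal_AI = joint.sum(axis=0)*(por_grid[1]-por_grid[0])     # integrate out porosity
conditional = joint/(marginal_AI[np.newaxis,:] + 1.0e-12)     # f(porosity | AI); each column integrates to 1.0 over porosity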

I think it is useful to visualize the joint PDF with the marginal PDFs on a single plot. We can use seaborn’s ‘jointplot’ to accomplish this.

ax = sns.jointplot(data=df,x='AI',y='Porosity',kind='kde',fill = False,levels = 10,cmap=cmap) # fill/levels replace seaborn's deprecated shade/n_levels arguments
[Figure: joint PDF with marginal PDFs of porosity and acoustic impedance.]

Recall, the correlation coefficient has an associated p-value; for a two-tailed test we reject the null hypothesis of zero correlation when the p-value is less than the selected significance level, \(\alpha\).
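For example, SciPy’s pearsonr, from the stats module imported above, returns both the correlation coefficient and its p-value in a single call.

r, pval = stats.pearsonr(df['AI'].values,df['Porosity'].values) # Pearson's correlation and its p-value
print('correlation = ' + str(round(r,3)) + ', p-value = ' + str(pval)) # reject the null hypothesis of zero correlation if p-value < alpha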

Calculating Conditional Statistics#

Of course, we could just calculate the conditional statistics by hand. We need to select some bins over the variable that we will condition to. Let’s calculate the conditional statistics of porosity given acoustic impedance. We will select 9 equally spaced bins.

AI_bins = np.linspace(2000,8000,10)                           # set the bin boundaries and then the centroids for plotting
AI_centroids = np.linspace((AI_bins[0]+AI_bins[1])*0.5,(AI_bins[8]+AI_bins[9])*0.5,9)
print(AI_bins)                                                # check the boundaries
print(AI_centroids)                                           # check the centroids
df['AI_bins'] = pd.cut(df['AI'], AI_bins,labels = AI_centroids) # cut on boundaries and label with centroids 
df.head()                                                     # check the new column in the DataFrame
[2000.         2666.66666667 3333.33333333 4000.         4666.66666667
 5333.33333333 6000.         6666.66666667 7333.33333333 8000.        ]
[2333.33333333 3000.         3666.66666667 4333.33333333 5000.
 5666.66666667 6333.33333333 7000.         7666.66666667]
       X      Y  Facies  Porosity        Perm           AI      AI_bins
0  100.0  900.0     0.0  0.101319    1.996868  5590.417154  5666.666667
1  100.0  800.0     1.0  0.147676   10.711789  3470.845666  3666.666667
2  100.0  700.0     1.0  0.145912   17.818143  3586.988513  3666.666667
3  100.0  600.0     1.0  0.186167  217.109365  3732.114787  3666.666667
4  100.0  500.0     1.0  0.146088   16.717367  2534.551236  2333.333333

Now we can use the ‘groupby’ function built into pandas’ DataFrames to extract subsets of porosity values in each bin from the DataFrame and then calculate the conditional statistics: expectation, P90 and P10. Let’s plot the result.

cond_exp = df.groupby('AI_bins')['Porosity'].mean()
cond_P90 = df.groupby('AI_bins')['Porosity'].quantile(.9)
cond_P10 = df.groupby('AI_bins')['Porosity'].quantile(.1)

plt.subplot(111)
plt.plot(AI_centroids,cond_exp,color='black')                 # conditional expectation
plt.plot(AI_centroids,cond_P90,'--',color='black',linewidth = 1.0) # dashed P90 envelope
plt.plot(AI_centroids,cond_P10,'--',color='black',linewidth = 1.0) # dashed P10 envelope

plt.xlabel('Acoustic Impedance (m/s x g/cm^3)')
plt.ylabel('Porosity (fraction) | Acoustic Impedance')
t = plt.title('Porosity Conditional to Acoustic Impedance')
plt.ylim(pormin,pormax)
plt.xlim(AImin,AImax)
plt.text(3200, .10, 'P10')
plt.text(3200, .15, 'Expectation')
plt.text(3200, .19, 'P90')

add_grid()

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.2, top=1.2, wspace=0.2, hspace=0.2); plt.show()
[Figure: conditional expectation, P10 and P90 of porosity given acoustic impedance.]

Does acoustic impedance provide information about porosity?

Yes, clearly the conditional statistics vary over acoustic impedance; knowing the acoustic impedance reduces the uncertainty about porosity.

  • this provides a powerful, nonparametric model of the relationship, i.e., without any assumptions such as linearity.

Comments#

I hope you found this tutorial useful. I’m always happy to discuss geostatistics, statistical modeling, uncertainty modeling and machine learning,

Michael

The Author:#

Michael J. Pyrcz, Professor, The University of Texas at Austin Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn#

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael J. Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin

More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn#