Multivariate Analysis#
Michael J. Pyrcz, Professor, The University of Texas at Austin
Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn
Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.
Cite this e-Book as:
Pyrcz, M.J., 2024, Applied Machine Learning in Python: a Hands-on Guide with Code, https://geostatsguy.github.io/MachineLearningDemos_Book.
The workflows in this book and more are available here:
Cite the MachineLearningDemos GitHub Repository as:
Pyrcz, M.J., 2024, MachineLearningDemos: Python Machine Learning Demonstration Workflows Repository (0.0.1). Zenodo.
By Michael J. Pyrcz
© Copyright 2024.
This chapter is a tutorial for / demonstration of Multivariate Analysis.
YouTube Lecture: check out my lectures on:
These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael’s Story.
Motivation for Multivariate Analysis#
To build good machine learning models, we build on a foundation of statistical analysis to explore and understand the relationships within our data. This includes the relationships between 2 or more features at a time!
in many cases these multivariate statistics are actually critical components of the machine learning methods, e.g., eigenvalues and eigenvectors are calculated from the covariance matrix for principal component analysis.
Let’s talk about bivariate and multivariate methods briefly. I confess the coverage of statistics beyond bivariate is a placeholder right now. Note, in the feature ranking workflow (link above) I have included partial correlation, maximum relevance minimum redundancy (MRMR) based on mutual information, and random forest feature importance methods that extract information beyond bivariate.
Bivariate Analysis#
Understand and quantify the relationship between two variables
example: relationship between porosity and permeability
how can we use this relationship?
What would be the impact if we ignore this relationship and simply modeled porosity and permeability independently?
no relationship beyond constraints at data locations
independent away from data
nonphysical results, unrealistic uncertainty models
Bivariate Statistics#
Pearson’s Product‐Moment Correlation Coefficient
Provides a measure of the degree of linear relationship.
We refer to it as the ‘correlation coefficient’
Let’s review the sample variance of variable \(x\). Of course, I’m truncating our notation as \(x\) is a set of samples at locations in our modeling space, \(x(\bf{u_\alpha}), \, \forall \, \alpha = 0, 1, \dots, n - 1\).
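A standard form of the sample variance is,

$$
\sigma^2_x = \frac{1}{n-1} \sum_{\alpha=0}^{n-1} \left( x_\alpha - \overline{x} \right)^2
$$

where \(\overline{x}\) is the sample mean.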
We can expand the squared term and replace one of them with \(y\), another variable in addition to \(x\).
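This yields the sample covariance between \(x\) and \(y\),

$$
C_{xy} = \frac{1}{n-1} \sum_{\alpha=0}^{n-1} \left( x_\alpha - \overline{x} \right) \left( y_\alpha - \overline{y} \right)
$$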
We now have a measure that represents the manner in which variables \(x\) and \(y\) co-vary, or vary together. We can standardize the covariance by the product of the standard deviations of \(x\) and \(y\) to calculate the correlation coefficient.
In summary we can state that the correlation coefficient is related to the covariance as:
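$$
\rho_{xy} = \frac{C_{xy}}{\sigma_x \sigma_y}, \quad -1.0 \le \rho_{xy} \le 1.0
$$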
The Pearson’s correlation coefficient is quite sensitive to outliers and departure from linear behavior (in the bivariate sense). We have an alternative known as the Spearman’s rank correlation coefficient.
The rank correlation applies the rank transform to the data prior to calculating the correlation coefficient. To calculate the rank transform simply replace the data values with their ranks, \(R_x = 1,\dots,n\), where rank \(1\) is assigned to the minimum value and rank \(n\) to the maximum value.
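The Spearman’s rank correlation is then the Pearson’s correlation applied to the ranks,

$$
\rho_{R_x R_y} = \frac{C_{R_x R_y}}{\sigma_{R_x} \sigma_{R_y}}
$$

Here’s a minimal sketch to confirm this equivalence, assuming only NumPy and SciPy, with a small synthetic dataset for illustration (the variable names and values are arbitrary).

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)                    # synthetic monotonic, nonlinear data
x = rng.uniform(0.05,0.25,size=100)                     # e.g., porosity-like values
y = np.exp(40.0*x) + rng.normal(0.0,5.0,size=100)       # e.g., permeability-like values

rank_x = stats.rankdata(x); rank_y = stats.rankdata(y)  # rank transform, ranks 1,...,n
print(np.corrcoef(rank_x,rank_y)[0,1])                  # Pearson's correlation of the ranks
print(stats.spearmanr(x,y)[0])                          # matches SciPy's Spearman's rank correlation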
The correlation coefficients provide useful metrics to quantify relationships between two variables at a time. We can also consider bivariate scatter plots and matrix scatter plots to visualize multivariate data. In general, current practical subsurface modeling is bivariate, two variables at a time.
Multivariate Statistics#
See my lecture on Multivariate Analysis that includes more on multivariate statistics, including the concepts of joint, conditional and marginal probability. For convenience I summarize some concepts here.
For example, if we define a probability density function (PDF) of feature \(X_1\) as,
we can extend the definition to a joint PDF for any arbitrary number, e.g., \(m\), features as,
Similarly, conditional PDFs can be calculated,
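$$
f_{X_1 \mid X_2}(x_1 \mid x_2) = \frac{f_{X_1,X_2}(x_1,x_2)}{f_{X_2}(x_2)}
$$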
and extended to any number of features, e.g., \(m\), features as,
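$$
f_{X_1 \mid X_2,\dots,X_m}(x_1 \mid x_2,\dots,x_m) = \frac{f_{X_1,X_2,\dots,X_m}(x_1,x_2,\dots,x_m)}{f_{X_2,\dots,X_m}(x_2,\dots,x_m)}
$$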
and many of our statistical concepts extend to high dimensions, for example, the correlation coefficient, \(\rho_{XY}\), between 2 features, and now the partial correlation coefficient that calculates the correlation between 2 features (\(X\) and \(Y\)) while controlling for all other features (\(Z\)).
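For the case of a single controlling feature \(Z\), a standard form is,

$$
\rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ} \, \rho_{YZ}}{\sqrt{1-\rho_{XZ}^2} \, \sqrt{1-\rho_{YZ}^2}}
$$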
Ultimately the difficulty with calculating high dimensionality (massively multivariate) probabilities and statistics is inferential; we rarely have enough data to observe the vast multivariate space and to model probabilities and statistics in that space!
See my linked lecture on the curse of dimensionality above.
Load the Required Libraries#
The following code loads the required libraries.
import geostatspy.GSLIB as GSLIB # GSLIB utilities, visualization and wrapper
import geostatspy.geostats as geostats # GSLIB methods convert to Python
import geostatspy
print('GeostatsPy version: ' + str(geostatspy.__version__))
GeostatsPy version: 0.0.72
We will also need some standard packages. These should have been installed with Anaconda 3.
ignore_warnings = True # ignore warnings?
import numpy as np # ndarrays for gridded data
import pandas as pd # DataFrames for tabular data
from scipy import stats # summary statistics
import copy # for deep copies
import os # set working directory, run executables
import matplotlib.pyplot as plt # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
from matplotlib.colors import ListedColormap # custom color maps
import seaborn as sns # advanced plotting
import matplotlib.ticker as mtick # control tick label formatting
plt.rc('axes', axisbelow=True) # plot all grids below the plot elements
if ignore_warnings == True:
import warnings
warnings.filterwarnings('ignore')
cmap = plt.cm.inferno # color map
seed = 42 # random number seed
If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.
Declare Functions#
Let’s define a function to streamline plotting correlation matrices. I also added convenience functions to add major and minor gridlines to improve plot interpretability.
def plot_corr(corr_matrix,title,limits,mask): # plots a graphical correlation matrix
my_colormap = plt.cm.get_cmap('RdBu_r', 256)
newcolors = my_colormap(np.linspace(0, 1, 256))
white = np.array([256/256, 256/256, 256/256, 1])
white_low = int(128 - mask*128); white_high = int(128+mask*128)
newcolors[white_low:white_high, :] = white # mask correlations with absolute value less than mask x limits
newcmp = ListedColormap(newcolors)
m = corr_matrix.shape[0]
im = plt.matshow(corr_matrix,fignum=0,vmin = -1.0*limits, vmax = limits,cmap = newcmp)
plt.xticks(range(len(corr_matrix.columns)), corr_matrix.columns)
plt.yticks(range(len(corr_matrix.columns)), corr_matrix.columns)
plt.colorbar(im, orientation = 'vertical')
plt.title(title)
for i in range(0,m):
plt.plot([i-0.5,i-0.5],[-0.5,m-0.5],color='black')
plt.plot([-0.5,m-0.5],[i-0.5,i-0.5],color='black')
plt.ylim([-0.5,m-0.5]); plt.xlim([-0.5,m-0.5])
def add_grid():
plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks
def add_grid2(sub_plot):
sub_plot.grid(True, which='major',linewidth = 1.0); sub_plot.grid(True, which='minor',linewidth = 0.2) # add y grids
sub_plot.tick_params(which='major',length=7); sub_plot.tick_params(which='minor', length=4)
sub_plot.xaxis.set_minor_locator(AutoMinorLocator()); sub_plot.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks
Design Custom Color Map#
Accounting for significance by masking nonsignificant values
for demonstration only currently, could be updated for each plot based on results confidence and uncertainty
my_colormap = plt.cm.get_cmap('RdBu_r', 256) # make a custom colormap
newcolors = my_colormap(np.linspace(0, 1, 256)) # define colormap space
white = np.array([250/256, 250/256, 250/256, 1]) # define white color (4 channel)
#newcolors[26:230, :] = white # mask all correlations less than abs(0.8)
#newcolors[56:200, :] = white # mask all correlations less than abs(0.6)
newcolors[76:180, :] = white # mask all correlations less than abs(0.4)
signif = ListedColormap(newcolors) # assign as listed colormap
my_colormap = plt.cm.get_cmap('inferno', 256) # make a custom colormap
newcolors = my_colormap(np.linspace(0, 1, 256)) # define colormap space
white = np.array([250/256, 250/256, 250/256, 1]) # define white color (4 channel)
#newcolors[26:230, :] = white # mask all correlations less than abs(0.8)
newcolors[0:12, :] = white # mask all correlations less than abs(0.6)
#newcolors[86:170, :] = white # mask all correlations less than abs(0.4)
sign1 = ListedColormap(newcolors) # assign as listed colormap
Set the Working Directory#
I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time).
#os.chdir("c:/PGE383") # set the working directory
Loading Tabular Data#
Here’s the command to load our comma delimited data file into a Pandas DataFrame object.
Let’s load the provided multivariate, spatial dataset ‘sample_data_MV_biased.csv’. It is a comma delimited file with:
X and Y coordinates (\(m\))
facies 0 and 1
porosity (fraction)
permeability (\(mD\))
acoustic impedance (\(\frac{kg}{m^3} \cdot \frac{m}{s} \cdot 10^3\)).
#df = pd.read_csv('sample_data_MV_biased.csv') # load our data table
df = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/sample_data_MV_biased.csv')
df = df.iloc[:,1:] # remove the unneeded first column, an index from the file
Visualize the DataFrame#
Visualizing the DataFrame is useful.
We can preview by utilizing the ‘head’ DataFrame member function (with a nice and clean format, see below).
add parameter ‘n=13’ to see the first 13 rows of the dataset.
df.head(n=13) # we could also use this command for a table preview
| | X | Y | Facies | Porosity | Perm | AI |
|---|---|---|---|---|---|---|
| 0 | 100.0 | 900.0 | 0.0 | 0.101319 | 1.996868 | 5590.417154 |
| 1 | 100.0 | 800.0 | 1.0 | 0.147676 | 10.711789 | 3470.845666 |
| 2 | 100.0 | 700.0 | 1.0 | 0.145912 | 17.818143 | 3586.988513 |
| 3 | 100.0 | 600.0 | 1.0 | 0.186167 | 217.109365 | 3732.114787 |
| 4 | 100.0 | 500.0 | 1.0 | 0.146088 | 16.717367 | 2534.551236 |
| 5 | 200.0 | 900.0 | 1.0 | 0.129949 | 23.348473 | 4781.590782 |
| 6 | 200.0 | 700.0 | 1.0 | 0.185299 | 595.674540 | 4729.017454 |
| 7 | 200.0 | 600.0 | 1.0 | 0.164923 | 154.555841 | 2147.668805 |
| 8 | 200.0 | 500.0 | 1.0 | 0.167026 | 50.970899 | 4583.679953 |
| 9 | 200.0 | 400.0 | 1.0 | 0.125382 | 4.061772 | 4477.930200 |
| 10 | 200.0 | 200.0 | 0.0 | 0.077419 | 1.162642 | 6844.535923 |
| 11 | 300.0 | 800.0 | 1.0 | 0.164797 | 497.593578 | 3881.489046 |
| 12 | 300.0 | 700.0 | 1.0 | 0.165155 | 486.504799 | 3726.023776 |
Summary Statistics for Tabular Data#
There are a lot of efficient methods to calculate summary statistics from tabular data in DataFrames. The describe command provides count, mean, standard deviation, minimum, maximum, and quartiles all in a nice data table.
We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.
df.describe().transpose() # DataFrame summary statistics
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| X | 368.0 | 499.565217 | 289.770794 | 0.000000 | 240.000000 | 500.000000 | 762.500000 | 990.000000 |
| Y | 368.0 | 520.644022 | 277.412187 | 9.000000 | 269.000000 | 539.000000 | 769.000000 | 999.000000 |
| Facies | 368.0 | 0.597826 | 0.491004 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| Porosity | 368.0 | 0.127026 | 0.030642 | 0.041122 | 0.103412 | 0.125842 | 0.148623 | 0.210258 |
| Perm | 368.0 | 85.617362 | 228.362654 | 0.094627 | 2.297348 | 10.377292 | 50.581288 | 1991.097723 |
| AI | 368.0 | 4791.736646 | 974.560569 | 1981.177309 | 4110.728374 | 4713.325533 | 5464.043562 | 7561.250336 |
Visualizing Tabular Data with Location Maps#
It is natural to set the x and y coordinate and feature ranges manually, e.g. do you want your color bar to go from 0.05887 to 0.24230 exactly? Also, let’s pick a color map for display.
I heard that plasma is known to be friendly to the color blind as the color and intensity vary together (hope I got that right, it was an interesting Twitter conversation started by Matt Hall from Agile if I recall correctly).
We will assume a study area of 0 to 1,000m in x and y and omit any data outside this area.
xmin = 0.0; xmax = 1000.0 # range of x values
ymin = 0.0; ymax = 1000.0 # range of y values
pormin = 0.05; pormax = 0.25; # range of porosity values
permmin = 0.01; permmax = 2000.0 # range of permeability values
AImin = 2000.0; AImax = 8000.0 # range of AI values
nx = 100; ny = 100; csize = 10.0 # number of grid cells in x and y, and cell size
Let’s try out locmap. This is a reimplementation of GSLIB’s locmap program that uses matplotlib. I hope you find it simpler than matplotlib; if you want to get more advanced and build custom plots, look at the source. If you improve it, send me the new code.
Now we can populate the plotting parameters and visualize the porosity data.
plt.subplot(221)
GSLIB.locmap_st(df,'X','Y','Facies',xmin,xmax,ymin,ymax,0,1,'Well Data - Facies','X(m)','Y(m)','Facies (0-shale, 1-sand)',cmap)
plt.subplot(222)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Well Data - Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)
plt.subplot(223)
GSLIB.locmap_st(df,'X','Y','Perm',xmin,xmax,ymin,ymax,permmin,permmax,'Well Data - Permeability','X(m)','Y(m)','Permeability (md)',cmap)
plt.subplot(224)
GSLIB.locmap_st(df,'X','Y','AI',xmin,xmax,ymin,ymax,AImin,AImax,'Well Data - Acoustic Impedance','X(m)','Y(m)','Acoustic Impedance (m/s x g/cm^3)',cmap)
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.7, wspace=0.3, hspace=0.3); plt.show()
Bivariate Analysis#
Let’s start with simple bivariate scatter plots and calculating bivariate statistics. Here are the scatter plots.
I included code below (commented out) for writing the plot as an image file. Note, you can write to many different formats, e.g., .tif, .jpg, etc., and you can also control the resolution.
by saving in .pdf, .svg, or .eps formats you create vector graphics that retain their sharpness at any resolution with economy of file size
plt.subplot(121)
plt.plot(df['Porosity'].values,df['Perm'].values, 'o', label='', markerfacecolor='darkorange', markeredgecolor='black', alpha=0.8)
plt.title('Well Data Permeability vs. Porosity')
plt.xlabel('Porosity (fraction)'); plt.ylabel('Permeability (mD)')
plt.xlim([pormin,pormax]); plt.ylim([permmin,permmax]); add_grid()
plt.subplot(122)
plt.plot(df['AI'].values,df['Porosity'].values, 'o', label='', markerfacecolor='darkorange', markeredgecolor='black', alpha=0.8)
plt.title('Well Data Porosity vs. Acoustic Impedance')
plt.ylabel('Porosity (fraction)'); plt.xlabel('Acoustic Impedance (m/s x g/cm^3)')
plt.xlim([AImin,AImax]); plt.ylim([pormin,pormax]); add_grid()
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=0.8, wspace=0.3, hspace=0.2)
#plt.savefig('Test.pdf', dpi=600, bbox_inches = 'tight',format='pdf')
plt.show()
Correlation and Covariance#
It is straightforward to calculate the covariance and correlation from the pairs of data in our dataset. Here’s the covariance matrix.
Notice that the matrix is symmetrical? Makes sense, as \(C_{Por,Perm} = C_{Perm,Por}\).
Also, note that the diagonal values (\(C_{i,j}\) where \(i=j\)) are equal to the variances.
We check porosity by calculating the variance. Note, NumPy’s var defaults to the population variance (divide by \(n\)), while the DataFrame cov applies Bessel’s correction (divide by \(n-1\)), so the two values differ slightly.
print(df.iloc[:,3:7].cov()) # the covariance matrix for Porosity, Perm and AI over all rows
print('The variance of porosity is ' + str(round(np.var(df['Porosity'].values),6)))
Porosity Perm AI
Porosity 0.000939 4.055029 -17.132244
Perm 4.055029 52149.501968 -46471.695092
AI -17.132244 -46471.695092 949768.302409
The variance of porosity is 0.000936
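As a quick check, passing ddof=1 to NumPy’s var applies Bessel’s correction and should reproduce the diagonal term of the covariance matrix above.

print('The sample variance of porosity is ' + str(round(np.var(df['Porosity'].values,ddof=1),6)))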
Here’s the correlation matrix.
df.iloc[:,3:7].corr() # correlation matrix
| | Porosity | Perm | AI |
|---|---|---|---|
| Porosity | 1.000000 | 0.579493 | -0.573700 |
| Perm | 0.579493 | 1.000000 | -0.208812 |
| AI | -0.573700 | -0.208812 | 1.000000 |
Visualize the Correlation Matrix#
It is convenient to visualize the correlation matrix as follows.
I added a custom color bar to communicate significance. This is a demonstration only; I did not calculate confidence intervals, but that could be added.
corr_matrix = df.iloc[:,2:].corr()
plt.subplot(111)
plot_corr(corr_matrix,'Correlation Matrix',1.0,0.5) # using our correlation matrix visualization function
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()
Spearman’s Rank Correlation Coefficient#
The Spearman’s rank correlation coefficient provides a measure of the degree of monotonic relationship.
Rank transform, e.g., \(R_{x_i}\): sort the data in ascending order and replace each datum with its rank, \(i=1,\ldots,n\).
Spearman’s rank correlation coefficient is more robust in the presence of outliers and some nonlinear features than the Pearson’s correlation coefficient.
Let’s try out the rank correlation coefficient with SciPy’s stats module function.
it also provides a p-value for each measure for significance testing, i.e., if p-value < \(\alpha\) then we reject the null hypothesis that the rank correlation coefficient is 0.0.
rank_correlation, rank_correlation_pval = stats.spearmanr(df.iloc[:,2:]) # calculate the rank correlation coefficients and p-values
rank_matrix = pd.DataFrame(rank_correlation,columns=corr_matrix.columns)
print('Rank Correlation:')
print(rank_correlation)
print('\nRank Correlation p-value:')
print(rank_correlation_pval)
Rank Correlation:
[[ 1. 0.79973189 0.71923703 -0.50503617]
[ 0.79973189 1. 0.88314655 -0.56863773]
[ 0.71923703 0.88314655 1. -0.34876935]
[-0.50503617 -0.56863773 -0.34876935 1. ]]
Rank Correlation p-value:
[[0.00000000e+000 4.11417796e-083 7.49150387e-060 3.24513314e-025]
[4.11417796e-083 0.00000000e+000 2.26847950e-122 6.59769641e-033]
[7.49150387e-060 2.26847950e-122 0.00000000e+000 5.77157354e-012]
[3.24513314e-025 6.59769641e-033 5.77157354e-012 0.00000000e+000]]
Plot Correlation and Rank Correlation Matrices#
plt.subplot(131) # plot correlation matrix with significance colormap
plot_corr(corr_matrix,'Correlation Matrix',1.0,0.5) # using our correlation matrix visualization function
plt.subplot(132) # plot correlation matrix with significance colormap
plot_corr(rank_matrix,'Rank Correlation Matrix',1.0,0.5) # using our correlation matrix visualization function
plt.subplot(133) # plot correlation matrix with significance colormap
diff = corr_matrix.values - rank_matrix.values
diff_matrix = pd.DataFrame(diff,columns=corr_matrix.columns)
plot_corr(diff_matrix,'Correlation - Rank Correlation',0.3,0.0) # using our correlation matrix visualization function
plt.subplots_adjust(left=0.0, bottom=0.0, right=2.5, top=0.6, wspace=0.2, hspace=0.3); plt.show()
From the correlation minus rank correlation heatmap we see,
the nonlinearity is potentially decreasing the permeability correlation with facies and porosity, and increasing the correlation with AI.
Matrix Scatter Plots#
If we have 3 or more variables to consider, then matrix scatter plots offer an efficient method to display the multivariate relationships, 2 variables at a time. One can identify:
the range, envelope of the paired data
homoscedastic and heteroscedastic behaviors
nonlinear features
Here’s the seaborn package matrix scatter plot function, pairplot. Let’s color the results by facies.
sns.pairplot(df, hue='Facies',vars=['Porosity','Perm','AI'],markers='o')
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.9, top=0.9, wspace=0.2, hspace=0.3); plt.show()
Remember, although these matrix scatterplots are very useful, they are still a bivariate marginalization of the higher dimensionality relationships.
There may be structures in the other dimensions that we do not see.
Pair Grid Plot#
For maximum control of the resulting plot we can work with pair grid plots from seaborn.
we can independently set lower, upper and diagonal components of the plot
We demonstrate this with all facies pooled together for:
scatter plots for upper
histograms for diagonal
kernel density plots for the lower
pairgrid = sns.PairGrid(df,vars=['Porosity','Perm','AI'])
pairgrid = pairgrid.map_upper(plt.scatter, color = 'darkorange', edgecolor = 'black', alpha = 0.8, s = 10)
pairgrid = pairgrid.map_diag(plt.hist, bins = 20, color = 'darkorange',alpha = 0.8, edgecolor = 'k') # map histograms to the diagonal
pairgrid = pairgrid.map_lower(sns.kdeplot, cmap = plt.cm.inferno,
        shade = False, shade_lowest = False, alpha = 1.0, n_levels = 10) # map density plots to the lower triangle
pairgrid.add_legend()
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.9, top=0.9, wspace=0.2, hspace=0.2); plt.show()
Joint, Conditional and Marginals#
We can use kernel density estimation to estimate the joint probability density function (PDF) for the paired data, a 2D PDF!
We could use this to estimate any required joint, marginal and conditional probability (care must be taken with normalization).
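Here’s a minimal sketch of that normalization step, assuming SciPy’s gaussian_kde and an arbitrary 100 x 100 evaluation grid; the variable names are for illustration only. We estimate the joint PDF on the grid, integrate out porosity to approximate the marginal, and divide to approximate the conditional.

from scipy.stats import gaussian_kde                    # kernel density estimation

kde = gaussian_kde(np.vstack([df['AI'].values,df['Porosity'].values])) # fit the bivariate KDE
ai_grid = np.linspace(AImin,AImax,100); por_grid = np.linspace(pormin,pormax,100)
AI_g, POR_g = np.meshgrid(ai_grid,por_grid)             # evaluation grid, rows are porosity
joint = kde(np.vstack([AI_g.ravel(),POR_g.ravel()])).reshape(AI_g.shape) # joint PDF, f(AI,por)
marginal_AI = np.trapz(joint,por_grid,axis=0)           # integrate out porosity, marginal f(AI)
conditional = joint / marginal_AI[np.newaxis,:]         # conditional PDF, f(por | AI)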
Let’s use the seaborn package ‘kdeplot’ function to estimate the joint pdf for porosity and acoustic impedance.
ax = sns.kdeplot(df,x=df['AI'].values,y=df['Porosity'].values, shade=True, n_levels = 10,cmap=cmap,cbar= True, shade_lowest = False)
#ax = sns.kdeplot(df.loc[:,['AI','Porosity']], shade=True, n_levels = 10,cmap=cmap,cbar= True, shade_lowest = False)
ax.set_xlabel('Acoustic Impedance (m/s x g/cm^3)'); ax.set_ylabel('Porosity (fraction)'); ax.set_title('Porosity vs. Acoustic Impedance')
add_grid()
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.9, top=0.9, wspace=0.2, hspace=0.2); plt.show()
I think it is useful to visualize the joint PDF with the marginal PDFs on a single plot. We can use seaborn’s ‘jointplot’ to accomplish this.
ax = sns.jointplot(df,x='AI',y='Porosity',kind='kde',shade = False,n_levels = 10,cmap=cmap,shade_lowest = True);
The correlation coefficient and the p-value of the correlation coefficient can also be calculated (significant if the p-value is less than \(\alpha\)).
Calculating Conditional Statistics#
Of course, we could just calculate the conditional statistics by hand. We need to select some bins over the variable that we will condition to. Let’s calculate conditional statistics of porosity given acoustic impedance. We will select 9 equally spaced bins.
AI_bins = np.linspace(2000,8000,10) # set the bin boundaries and then the centroids for plotting
AI_centroids = np.linspace((AI_bins[0]+AI_bins[1])*0.5,(AI_bins[8]+AI_bins[9])*0.5,9)
print(AI_bins) # check the boundaries
print(AI_centroids) # check the centroids
df['AI_bins'] = pd.cut(df['AI'], AI_bins,labels = AI_centroids) # cut on boundaries and label with centroids
df.head() # check the new column in the DataFrame
[2000. 2666.66666667 3333.33333333 4000. 4666.66666667
5333.33333333 6000. 6666.66666667 7333.33333333 8000. ]
[2333.33333333 3000. 3666.66666667 4333.33333333 5000.
5666.66666667 6333.33333333 7000. 7666.66666667]
| | X | Y | Facies | Porosity | Perm | AI | AI_bins |
|---|---|---|---|---|---|---|---|
| 0 | 100.0 | 900.0 | 0.0 | 0.101319 | 1.996868 | 5590.417154 | 5666.666667 |
| 1 | 100.0 | 800.0 | 1.0 | 0.147676 | 10.711789 | 3470.845666 | 3666.666667 |
| 2 | 100.0 | 700.0 | 1.0 | 0.145912 | 17.818143 | 3586.988513 | 3666.666667 |
| 3 | 100.0 | 600.0 | 1.0 | 0.186167 | 217.109365 | 3732.114787 | 3666.666667 |
| 4 | 100.0 | 500.0 | 1.0 | 0.146088 | 16.717367 | 2534.551236 | 2333.333333 |
Now we can use the ‘groupby’ function built-in to Pandas’ DataFrames to extract subsets of porosity values in each bin from the DataFrame and then to calculate the conditional statistics: expectation, P90 and P10. Let’s plot the result.
cond_exp = df.groupby('AI_bins')['Porosity'].mean()
cond_P90 = df.groupby('AI_bins')['Porosity'].quantile(.9)
cond_P10 = df.groupby('AI_bins')['Porosity'].quantile(.1)
plt.subplot(111)
plt.plot(AI_centroids,cond_exp,color='black')
plt.plot(AI_centroids,cond_P90,'--',color='black',linewidth = 1.0)
plt.plot(AI_centroids,cond_P10,'--',color='black',linewidth = 1.0)
plt.xlabel('Acoustic Impedance (m/s x g/cm^3)')
plt.ylabel('Porosity (fraction) | Acoustic Impedance')
t = plt.title('Porosity Conditional to Acoustic Impedance')
plt.ylim(pormin,pormax)
plt.xlim(AImin,AImax)
plt.text(3200, .10, 'P10')
plt.text(3200, .15, 'Expectation')
plt.text(3200, .19, 'P90')
add_grid()
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.2, top=1.2, wspace=0.2, hspace=0.2); plt.show()
Does acoustic impedance provide information about porosity?
Yes, clearly the conditional statistics vary over acoustic impedance; knowing the acoustic impedance reduces the uncertainty about porosity.
this provides a powerful, nonparametric model of the relationship, i.e., without any assumptions such as linearity.
Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn#
Want to Work Together?#
I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.
Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!
Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
I can be reached at mpyrcz@austin.utexas.edu.
I’m always happy to discuss,
Michael
Michael J. Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
Comments#
I hope you found this chapter helpful. Much more could be done and discussed; I have many more resources. Check out my shared resource inventory,
Michael