![](_static/intro/title_page.png)
Principal Component Analysis#
Michael J. Pyrcz, Professor, The University of Texas at Austin
Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn
Chapter of e-book "Applied Machine Learning in Python: a Hands-on Guide with Code".
Cite this e-Book as:
Pyrcz, M.J., 2024, Applied Machine Learning in Python: a Hands-on Guide with Code, https://geostatsguy.github.io/MachineLearningDemos_Book.
The workflows in this book and more are available here:
Cite the MachineLearningDemos GitHub Repository as:
Pyrcz, M.J., 2024, MachineLearningDemos: Python Machine Learning Demonstration Workflows Repository (0.0.1). Zenodo.
By Michael J. Pyrcz
Β© Copyright 2024.
This chapter is a tutorial for / demonstration of Principal Component Analysis.
YouTube Lecture: check out my lectures on:
These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael's Story.
Motivation for Principal Component Analysis#
Working with more features / variables is harder!
More difficult to visualize data and model
More data are required to infer the joint probabilities
Less data coverage of feature space
More difficult to interrogate / check the model
More likely redundant features, e.g., multicollinearity, resulting in model instability
More computational effort, more computational resources and longer run times
A more complicated model is more likely to be overfit
More professional time for model construction
We get a better model with fewer, informative features than throwing all our features into the model! A big part of this motivation is driven by the curse of dimensionality.
Curse of Dimensionality#
Data and Model Visualization - we cannot visualize beyond 3D, i.e., assess the model fit to the data, evaluate interpolation vs. extrapolation.
consider a 5D example shown as a matrix scatter plot, even in this case there is an extreme marginalization to 2D for each plot,
![](_static/feature_ranking/matrix_scatterplot.png)
Sampling - the number of samples sufficient to infer statistics like the joint probability, \(P(x_1,\ldots,x_m)\).
recall the calculation of a histogram or normalized histogram: we establish bins and calculate frequencies or probabilities in each bin.
we require a nominal number of data samples for each bin, so we require \(n = n_{s/bin} \cdot n_{bins}\) samples in 1D
but in mD we require \(n = n_{s/bin} \cdot n_{bins}^m\) samples to calculate the discretized joint probability,
for example, 10 samples per bin with 35 bins requires 12,250 samples in 2D, and 428,750 samples in 3D
![](_static/feature_ranking/bivariate_example.png)
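As a quick check on this sample size scaling, here is a minimal sketch using only the example values from above (10 samples per bin and 35 bins per feature):

n_per_bin = 10; n_bins = 35                                   # example values from above
for m in [1,2,3]:                                             # number of features
    print('For', m, 'feature(s) we require', n_per_bin * n_bins**m, 'samples') # n = n_per_bin * n_bins^m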
Sample Coverage - the range of the sample values cover the predictor feature space.
fraction of the possible solution space that is sampled, for 1 feature we assume 80% coverage
remember, we usually, directly sample only \(\frac{1}{10^7}\) of the volume of the subsurface
yes, the concept of coverage is subjective, how much data to cover? What about gaps? etc.
![](_static/feature_ranking/coverage1D.png)
now if there is 80% coverage for 2 features the 2D coverage is 64%
![](_static/feature_ranking/coverage2D.png)
coverage is the product of the individual feature coverages, i.e., \(0.8^m\) given 80% coverage of each of the \(m\) features.
Distorted Space - high dimensional space is distorted.
take the ratio of the volume of an inscribed hypersphere to the volume of the hypercube, \(\frac{V_{sphere}}{V_{cube}} = \frac{\pi^{m/2}}{2^m \, \Gamma \left( \frac{m}{2}+1 \right)}\),
recall, \(\Gamma(n)=(n-1)!\).
high dimensional space is all corners and no middle and most of high dimensional space is far from the middle (all corners!).
as a result, distances in high dimensional space lose sensitivity, i.e., for random points in the space the expected pairwise distances all become nearly the same,
the limit of the expectation of the range of pairwise distances, relative to the mean pairwise distance, over random points in hyper-dimensional space tends to zero. If distances are almost all the same, Euclidean distance is no longer meaningful!
![](_static/feature_ranking/distortion_chart.png)
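To see this loss of distance sensitivity numerically, here is a minimal sketch (random uniform points with arbitrary dimensionalities chosen for illustration, and assuming SciPy is available) that compares the spread of pairwise distances to their mean as the number of dimensions grows:

import numpy as np                                            # arrays and random sampling
from scipy.spatial.distance import pdist                      # pairwise Euclidean distances
rng = np.random.default_rng(seed=73073)                       # reproducible random points
for m in [2,10,100,1000]:                                     # increasing dimensionality
    points = rng.uniform(size=(500,m))                        # 500 random points in the m-dimensional unit hypercube
    d = pdist(points)                                         # all pairwise Euclidean distances
    print('m =',m,', (max - min) / mean pairwise distance =',np.round((d.max()-d.min())/d.mean(),2)) # relative spread shrinks as m grows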
here's the severity of the distortion for various dimensionalities,
m | nD / 2D
---|---
2 | 1.0
5 | 0.28
10 | 0.003
20 | 0.00000003
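The inscribed hypersphere to hypercube volume ratio above can be computed directly with the gamma function; here is a minimal sketch (assuming SciPy is available):

import numpy as np                                            # pi and numerics
from scipy.special import gamma                               # gamma function, recall gamma(m) = (m-1)!
def sphere_to_cube_ratio(m):                                  # volume of the inscribed hypersphere over the hypercube volume
    return np.pi**(m/2.0)/(2.0**m * gamma(m/2.0 + 1.0))
for m in [2,5,10,20]:                                         # increasing dimensionality
    print('m =',m,', inscribed hypersphere to hypercube volume ratio =',sphere_to_cube_ratio(m))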
Multicollinearity - higher dimensional datasets are more likely to have collinearity or multicollinearity.
A feature is linearly described by other features, resulting in high model variance.
Inferential Machine Learning#
Principal component analysis is an inferential, unsupervised machine learning method.
there are no response features, \(y\), just predictor features,
The machine learns, by mimicry, a compact representation of the data
Captures patterns as feature projections, group assignments, neural network latent features, etc.
We focus on inference of the population, the natural system, instead of prediction of response features.
Principal Component Analysis#
Principal Component Analysis is one of a variety of methods for dimensional reduction:
Dimensional reduction transforms the data to a lower dimension
Given features, \(X_1,\dots,X_m\), we would require \({m \choose 2}=\frac{m \cdot (m-1)}{2}\) scatter plots just to visualize all of the two-dimensional projections of the data.
Once we have 4 or more variables understanding our data gets very hard.
Recall the curse of dimensionality impacts inference, modeling and visualization.
One solution is to find a good lower dimensional, \(p\), representation of the original \(m\) dimensions
Benefits of Working in a Reduced Dimensional Representation:
Data storage / Computational Time
Easier visualization
Also takes care of multicollinearity
Orthogonal Transformation#
Convert a set of observations into a set of linearly uncorrelated variables known as principal components
The number of principal components (\(k\)) available is \(\min(n-1, m)\)
Limited by the number of variables/features, \(m\), and the number of data, \(n\)
Components are ordered,
First component describes the largest possible variance / accounts for as much variability as possible
Next component describes the largest possible remaining variance
Up to the maximum number of principal components
There are multiple ways to interpret principal component analysis,
Best Fitting Interpretation#
Minimizes the orthogonal projection error between the data and the principal components, \(\sum_{i=1}^{n} \left\| \left( X_i - \overline{X} \right) - V_p V_p^T \left( X_i - \overline{X} \right) \right\|^2\),
where \(V_p\) is a matrix of our first \(p\) vectors, \(X_i\) is a vector for sample \(i\) over all \(m\) features, and \(\overline{X}\) is a vector of the means,
![](_static/PCA/error.png)
where the principal components describe a vector in 1D and a plane in 2D, and where the principal component scores in the projected space are, \(t_i = V_p^T \left( X_i - \overline{X} \right)\),
and the back transformation to the original space with reduced dimensionality is, \(\hat{X}_i = V_p \, t_i + \overline{X}\),
note, given the \(V_p\) matrix is orthogonal, \(V_p^T V_p = I\).
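Here is a minimal sketch of this projection and back transformation with NumPy, using a small made-up \(n \times m\) data array (not our dataset) and the leading eigenvector of its covariance matrix as \(V_p\):

import numpy as np                                            # linear algebra

X = np.array([[10.0,1.0],[12.0,1.2],[15.0,1.5],[20.0,2.1]])   # small made-up (n x m) data array for illustration only
X_bar = X.mean(axis=0)                                        # vector of the feature means
eig_val, V = np.linalg.eigh(np.cov(X,rowvar=False))           # eigenvalues (ascending) and eigenvectors (columns) of the covariance matrix
V_p = V[:,::-1][:,:1]                                         # retain the first principal component, p = 1

scores = (X - X_bar) @ V_p                                    # principal component scores, projections onto V_p
X_hat = scores @ V_p.T + X_bar                                # back transformation to the original space with reduced dimensionality
print(np.round(X_hat,2))                                      # rank 1 approximation of the original data array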
Rotation-based Interpretation#
Orthogonal transformation is a rotation that maximizes the variance explained on the first principal component, maximizes the remaining variance on the second principal component, etc.
If you would like to see PCA as a rotation in action, check out my PCA Rotation interactive Python dashboard,
![](_static/PCA/rotation.png)
from this dashboard it is clear that there is a rotation that maximizes the variance explained by the first principal component while removing the correlation between the first and second principal components.
Eigenvalues / Eigenvectors Interpretation#
For principal component analysis we calculate the data covariance matrix, the pairwise covariances over all combinations of features.
Then we calculate the eigenvectors and eigenvalues of the covariance matrix.
The eigenvalues are the variance explained for each component.
The eigenvectors of the data covariance matrix are the principal components.
The Principal Components Analysis Workflow#
Standardize the Features
standardize each feature, $X^s_i = \frac{X_i - \overline{X}_i}{\sigma_{X_i}}$, where $X_i$ are original features and $X^s_i$ are transformed features.
standardization is required to prevent features with larger variance dominating the solution, i.e., first principal component aligned with feature with greatest variance
Calculate the standardized feature covariance matrix
given the features are standardized the matrix is a correlation matrix
Calculate the eigenvalues and eigenvectors of the covariance matrix, \(C\),
given \(C\) is a square matrix \((m \times m)\), \(v\) is a vector \((m \times 1)\) and \(\lambda\) is a scalar, such that, \(C v = \lambda v\),
we can reorder to, \(\left( C - \lambda I \right) v = 0\),
where $I$ is an identity matrix. By Cramer's rule, we have a nontrivial solution only if the determinant is 0, \(\det \left( C - \lambda I \right) = 0\),
find the possible eigenvalues, $\lambda_\alpha$, and solve for the eigenvectors, $v_\alpha, \quad \alpha=1,\ldots,m$
the resulting \(\min(m, n-1)\) eigenvectors in a matrix, \(V_p\)
![](_static/PCA/components.png)
that form a basis on which the data are projected for dimensionality reduction,
![](_static/PCA/components_basis.png)
If you would like to see the principal components loadings and the variance partitioning between components, check out my PCA loadings interactive Python dashboard,
![](_static/PCA/interactive_loadings.png)
Load the Required Libraries#
The following code loads the required libraries. These should have been installed with Anaconda 3.
ignore_warnings = True # ignore warnings?
from sklearn.decomposition import PCA # PCA program from scikit learn (package for machine learning)
from sklearn.preprocessing import StandardScaler # standardize variables to mean of 0.0 and variance of 1.0
import numpy as np # ndarrays for gridded data
import pandas as pd # DataFrames for tabular data
import pandas.plotting as pd_plot # pandas plotting functions
import copy # for deep copies
import os # set working directory, run executables
import matplotlib.pyplot as plt # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
import matplotlib.ticker as mtick # control tick label formatting
import seaborn as sns # advanced plotting
plt.rc('axes', axisbelow=True) # plot all grids below the plot elements
if ignore_warnings == True:
    import warnings
    warnings.filterwarnings('ignore')
cmap = plt.cm.inferno # color map
seed = 42 # random number seed
If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing "python -m pip install [package-name]". More assistance is available with the respective package docs.
Declare Functions#
Let's define a single function to streamline plotting correlation matrices. I also added convenience functions to add major and minor gridlines to improve plot interpretability.
def plot_corr(df,size=10): # plots a graphical correlation matrix
    from matplotlib.colors import ListedColormap # make a custom colormap
    my_colormap = plt.cm.get_cmap('RdBu_r', 256)
    newcolors = my_colormap(np.linspace(0, 1, 256))
    white = np.array([256/256, 256/256, 256/256, 1])
    newcolors[65:191, :] = white # mask correlations with absolute value less than about 0.5
    newcmp = ListedColormap(newcolors)
    m = len(df.columns)
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    im = ax.matshow(corr,vmin = -1.0, vmax = 1.0,cmap = newcmp)
    plt.xticks(range(len(corr.columns)), corr.columns);
    plt.yticks(range(len(corr.columns)), corr.columns);
    plt.colorbar(im, orientation = 'vertical')
    plt.title('Correlation Matrix')
    for i in range(0,m):
        plt.plot([i-0.5,i-0.5],[-0.5,m-0.5],color='black')
        plt.plot([-0.5,m-0.5],[i-0.5,i-0.5],color='black')
    plt.ylim([-0.5,m-0.5]); plt.xlim([-0.5,m-0.5])

def add_grid():
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks

def add_grid2(sub_plot):
    sub_plot.grid(True, which='major',linewidth = 1.0); sub_plot.grid(True, which='minor',linewidth = 0.2) # add y grids
    sub_plot.tick_params(which='major',length=7); sub_plot.tick_params(which='minor', length=4)
    sub_plot.xaxis.set_minor_locator(AutoMinorLocator()); sub_plot.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks
Set the Working Directory#
I always like to do this so I don't lose files and to simplify subsequent reads and writes (avoid including the full address each time).
#os.chdir("c:/PGE383") # set the working directory
You will have to update the part in quotes with your own working directory and the format is different on a Mac (e.g. "~/PGE").
Loading Tabular Data#
Here's the command to load our comma delimited data file into a pandas DataFrame object.
Let's load the provided multivariate, spatial dataset "unconv_MV.csv". This dataset has variables from 1,000 unconventional wells including:
well average porosity
log transform of permeability (to linearize the relationships with other variables)
acoustic impedance (kg/m^3 x m/s x 10^6)
brittleness ratio (%)
total organic carbon (%)
vitrinite reflectance (%)
initial production 90 day average (MCFPD).
Note, the dataset is synthetic.
We load it with the pandas "read_csv" function into a DataFrame we called "my_data" and then preview it to make sure it loaded correctly.
#my_data = pd.read_csv("unconv_MV.csv")
my_data = pd.read_csv(r"https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/unconv_MV.csv") # load the comma delimited data file
my_data = my_data.iloc[:,1:] # remove the well index
Visualize the DataFrame#
Visualizing the DataFrame is a useful first check.
We can preview by slicing the DataFrame.
we are showing all records from 0 up to and not including 7
my_data[:7] # preview the first 7 rows of the dataframe
| | Por | LogPerm | AI | Brittle | TOC | VR | Production |
|---|---|---|---|---|---|---|---|
0 | 15.91 | 1.67 | 3.06 | 14.05 | 1.36 | 1.85 | 177.381958 |
1 | 15.34 | 1.65 | 2.60 | 31.88 | 1.37 | 1.79 | 1479.767778 |
2 | 20.45 | 2.02 | 3.13 | 63.67 | 1.79 | 2.53 | 4421.221583 |
3 | 11.95 | 1.14 | 3.90 | 58.81 | 0.40 | 2.03 | 1488.317629 |
4 | 19.53 | 1.83 | 2.57 | 43.75 | 1.40 | 2.11 | 5261.094919 |
5 | 19.47 | 2.04 | 2.73 | 54.37 | 1.42 | 2.12 | 5497.005506 |
6 | 12.70 | 1.30 | 3.70 | 43.03 | 0.45 | 1.95 | 1784.266285 |
Summary Statistics for Tabular Data#
There are a lot of efficient methods to calculate summary statistics from tabular data in DataFrames. The describe command provides count, mean, standard deviation, minimum, maximum, and quartiles all in a nice data table.
We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.
my_data.describe().transpose() # calculate summary statistics for the data
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
Por | 1000.0 | 14.950460 | 3.029634 | 5.400000 | 12.85750 | 14.98500 | 17.080000 | 24.65000 |
LogPerm | 1000.0 | 1.398880 | 0.405966 | 0.120000 | 1.13000 | 1.39000 | 1.680000 | 2.58000 |
AI | 1000.0 | 2.982610 | 0.577629 | 0.960000 | 2.57750 | 3.01000 | 3.360000 | 4.70000 |
Brittle | 1000.0 | 49.719480 | 15.077006 | -10.500000 | 39.72250 | 49.68000 | 59.170000 | 93.47000 |
TOC | 1000.0 | 1.003810 | 0.504978 | -0.260000 | 0.64000 | 0.99500 | 1.360000 | 2.71000 |
VR | 1000.0 | 1.991170 | 0.308194 | 0.900000 | 1.81000 | 2.00000 | 2.172500 | 2.90000 |
Production | 1000.0 | 2247.295809 | 1464.256312 | 2.713535 | 1191.36956 | 1976.48782 | 3023.594214 | 12568.64413 |
It is good that we checked the summary statistics; we have some negative values for brittleness and total organic carbon. This is physically impossible.
The values must be in error. We know the lowest possible values are 0.0, so we will truncate at 0.0.
We use the:
df._get_numeric_data()
DataFrame member function to get a shallow copy of the data from the DataFrame.
Since it is a shallow copy, any changes we make to the copy are made to the data in the original DataFrame.
This allows us to apply this simple conditional statement to all the data values in the DataFrame all at once.
num = my_data._get_numeric_data() # get the numerical values
num[num < 0] = 0 # truncate negative values to 0.0
my_data.describe().transpose() # calculate summary statistics for the data
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
Por | 1000.0 | 14.950460 | 3.029634 | 5.400000 | 12.85750 | 14.98500 | 17.080000 | 24.65000 |
LogPerm | 1000.0 | 1.398880 | 0.405966 | 0.120000 | 1.13000 | 1.39000 | 1.680000 | 2.58000 |
AI | 1000.0 | 2.982610 | 0.577629 | 0.960000 | 2.57750 | 3.01000 | 3.360000 | 4.70000 |
Brittle | 1000.0 | 49.731480 | 15.033593 | 0.000000 | 39.72250 | 49.68000 | 59.170000 | 93.47000 |
TOC | 1000.0 | 1.006170 | 0.499838 | 0.000000 | 0.64000 | 0.99500 | 1.360000 | 2.71000 |
VR | 1000.0 | 1.991170 | 0.308194 | 0.900000 | 1.81000 | 2.00000 | 2.172500 | 2.90000 |
Production | 1000.0 | 2247.295809 | 1464.256312 | 2.713535 | 1191.36956 | 1976.48782 | 3023.594214 | 12568.64413 |
Calculate the Correlation Matrix#
For dimensional reduction, a good first step is data visualization.
Let's start with the correlation matrix.
We can calculate it and view it in the console with these commands.
corr_matrix = np.corrcoef(my_data, rowvar = False)
the input data is a 2D ndarray and \(rowvar\) specifies if the variables are in the rows instead of columns.
corr_matrix = np.corrcoef(my_data, rowvar = False)
print(np.around(corr_matrix,2)) # print the correlation matrix to 2 decimals
[[ 1. 0.81 -0.51 -0.25 0.71 0.08 0.69]
[ 0.81 1. -0.32 -0.15 0.51 0.05 0.57]
[-0.51 -0.32 1. 0.17 -0.55 0.49 -0.33]
[-0.25 -0.15 0.17 1. -0.24 0.3 -0.07]
[ 0.71 0.51 -0.55 -0.24 1. 0.31 0.5 ]
[ 0.08 0.05 0.49 0.3 0.31 1. 0.14]
[ 0.69 0.57 -0.33 -0.07 0.5 0.14 1. ]]
Note the 1.0 diagonal resulting from the correlation of each variable with themselves.
Let's use our function declared above to make a graphical correlation matrix visualization.
This may improve our ability to spot patterns. It relies on the built-in correlation matrix method of pandas DataFrames and Matplotlib for plotting.
plot_corr(my_data,7) # using our correlation matrix visualization function
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()
![_images/779b947412ca0f360bb3a9f517fdd22d7c6ad98015e0cf1a6bd4a8f8fb0162da.png](_images/779b947412ca0f360bb3a9f517fdd22d7c6ad98015e0cf1a6bd4a8f8fb0162da.png)
This looks good. There is a mix of bivariate, linear correlation magnitudes. Of course, correlation coefficients only measure the degree of linear correlation.
Check Matrix Scatter Plots#
For more complete information, let's look at the matrix scatter plot from the Pandas package.
covariance and correlation are sensitive to outliers and nonlinearity
pd_plot.scatter_matrix(my_data)
the \(alpha\) allows us to use semitransparent points for easier visualization with dense scatter plots.
the \(hist_kwds\) is a set of parameters for the histograms on the diagonal elements.
pd_plot.scatter_matrix(my_data, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()
![_images/ad5210bc0e50b9cd1f9e82e2101cd6a8e22b7b46827929b73884539038ea060c.png](_images/ad5210bc0e50b9cd1f9e82e2101cd6a8e22b7b46827929b73884539038ea060c.png)
Simple Bivariate Example#
Let's simplify the problem to bivariate (2 features), porosity and the log transform of permeability, and reduce the number of wells from 1,000 to 100.
my_data_por_perm = my_data.iloc[0:100,0:2] # extract just por and logperm, 100 samples
my_data_por_perm.describe().transpose() # calculate summary statistics for the data
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
Por | 100.0 | 14.9856 | 2.823016 | 9.23 | 12.9275 | 14.720 | 16.705 | 21.00 |
LogPerm | 100.0 | 1.3947 | 0.390947 | 0.36 | 1.1475 | 1.365 | 1.650 | 2.48 |
Let's first check the univariate distributions of Por and LogPerm.
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.hist(my_data_por_perm["Por"], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20)
ax1.set_title('Porosity'); ax1.set_xlabel('Porosity (%)'); ax1.set_ylabel('Frequency'); add_grid2(ax1)
ax2.hist(my_data_por_perm["LogPerm"], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20)
ax2.set_title('Log Transformed Permeability'); ax2.set_xlabel('Log[Permeability] (log(mD))'); add_grid2(ax2)
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=0.8, wspace=0.2, hspace=0.3); plt.show()
![_images/06c2be261b8ed9903e1e372588a28bcd52cf56d411cbd6376aa5fdec5bbda07e.png](_images/06c2be261b8ed9903e1e372588a28bcd52cf56d411cbd6376aa5fdec5bbda07e.png)
The distributions may actually be Gaussian distributed; regardless, they are well behaved, and we cannot observe obvious gaps nor truncations.
Let's look at a scatter plot of porosity vs. log permeability.
This would be the basic command from matplotlib to make a scatter plot.
plt.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"])
the additional parameters are for formatting and labels
plt.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
plt.title('Log Transformed Permeability vs. Porosity'); plt.xlabel('Porosity (%)'); plt.ylabel('Log(Permeability (Log(mD))'); add_grid()
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.7, top=0.8, wspace=0.2, hspace=0.3); plt.show()
![_images/6db1439ec376e70dfe82d70d9dfae50de99df77b9c2bbc25f29d89f082850b57.png](_images/6db1439ec376e70dfe82d70d9dfae50de99df77b9c2bbc25f29d89f082850b57.png)
Calculation of Principal Components#
With the log transform of permeability we have a very nice linear relationship with porosity, so PCA should work well on this data.
We are ready to perform PCA with porosity and log of permeability.
Standardize the Features#
We must standardize our variables to have a mean equal to zero, \(\bar{x} = 0.0\), and the variance equal to one, \(\sigma^{2}_{x} = 1.0\).
Otherwise the difference between the scales of porosity and permeability would have a significant impact. Note the impact of the choice of units on the variance, e.g., darcies (D) vs. millidarcies (mD) for permeability, or fraction instead of percentage for porosity. These choices are quite arbitrary!
To remove this effect, we should always standardize, unless the variables have the same units; in that case the difference in range and variance between them is meaningful and standardization could remove important information.
features = ['Por','LogPerm']
x = my_data_por_perm.loc[:,features].values
mu = np.mean(x, axis=0)
sd = np.std(x, axis=0)
x = StandardScaler().fit_transform(x) # standardize the data features to mean = 0, var = 1.0
print("Original Mean Por", np.round(mu[0],2), ', Original Mean LogPerm = ', np.round(mu[1],2))
print("Original StDev Por", np.round(sd[0],2), ', Original StDev LogPerm = ', np.round(sd[1],2))
print('Mean Transformed Por =',np.round(np.mean(x[:,0]),2),', Mean Transformed LogPerm =',np.round(np.mean(x[:,1]),2))
print('Variance Transformed Por =',np.var(x[:,0]),', Variance Transformed LogPerm =',np.var(x[:,1]))
Original Mean Por 14.99 , Original Mean LogPerm = 1.39
Original StDev Por 2.81 , Original StDev LogPerm = 0.39
Mean Transformed Por = 0.0 , Mean Transformed LogPerm = -0.0
Variance Transformed Por = 1.0000000000000002 , Variance Transformed LogPerm = 1.0
cov = np.cov(x,rowvar = False)
cov
array([[1.01010101, 0.80087707],
[0.80087707, 1.01010101]])
"x" is a 2D ndarray from the NumPy package with the features in columns and the samples in rows.
Above we confirm that the features in the "x" 2D array are standardized.
It is not a bad idea to check the univariate and bivariate distributions of our standardized variables.
dfS = pd.DataFrame({'sPor': x[:,0], 'sLogPerm': x[:,1]})
sns.jointplot(data=dfS,x='sPor',y='sLogPerm',marginal_kws=dict(bins=30),color='darkorange',edgecolor='black')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.3); plt.show()
![_images/294c5353a5f381ec9771db8cfa867a24ca2c07c3089527accf291ed36b79524d.png](_images/294c5353a5f381ec9771db8cfa867a24ca2c07c3089527accf291ed36b79524d.png)
Everything looks fine and we are ready to apply principal components analysis.
Principal Component Analysis (PCA)#
To run PCA with the scikit-learn machine learning package in Python, we first make a PCA model with a specified number of components and then we "fit" it to our data.
n_components = 2
pca = PCA(n_components=n_components)
pca.fit(x)
As you will see later with dimensional reduction, we can use matrix math with this model and reduce our data to any dimensionality from 1 to the number of features, m. Let's run the model with the number of components equal to the number of features, m.
n_components = 2
pca = PCA(n_components=n_components).fit(x)
Component Loadings#
The first thing we should do is look at the component loadings. Let's view them and interpret our result.
print(np.round(pca.components_,3))
print('First Principal Component = ' + str(np.round(pca.components_[0,:],3)))
print('Second Principal Component = ' + str(np.round(pca.components_[1,:],3)))
[[ 0.707 0.707]
[ 0.707 -0.707]]
First Principal Component = [0.707 0.707]
Second Principal Component = [ 0.707 -0.707]
The components are listed as a 2D array (ndarray) with:
principal components on the rows
features on the columns
the rows are sorted so that the first principal component is the top row and the last principal component is the last row.
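For easier reading, the loadings can also be placed in a labeled DataFrame; a small convenience sketch using the fitted pca model and features list from above:

loadings = pd.DataFrame(np.round(pca.components_,3),columns=features,index=['PC1','PC2']) # label the loadings with the features and components
print(loadings)                                               # rows are principal components, columns are the original features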
Proportion of Variance Explained with Principal Components#
It is also important to look at the proportion of the variance described by each principal component.
print('Variance explained by PC1 and PC2 =', np.round(pca.explained_variance_ratio_,3))
print('First Principal Component explains ' + str(np.round(pca.explained_variance_ratio_[0],3)) + ' of the total variance.')
print('Second Principal Component explains ' + str(np.round(pca.explained_variance_ratio_[1],3)) + ' of the total variance.')
Variance explained by PC1 and PC2 = [0.896 0.104]
First Principal Component explains 0.896 of the total variance.
Second Principal Component explains 0.104 of the total variance.
Principal Component Scores, Forward and Reverse Projections#
We can calculate the principal component scores of the original data.
This is effectively a rotation of the data, aligned with PC1 for the direction of greatest variance.
We will calculate the principal component scores with the "transform" function built into PCA and then visualize them as a scatter plot.
Then to "close the loop" and check what we have done (and our knowledge) we will reverse the PCA, going from the principal component scores back to the standardized features.
f, (ax101, ax102, ax103) = plt.subplots(1, 3,figsize=(12,3))
f.subplots_adjust(wspace=0.7)
ax101.scatter(x[:,0],x[:,1], s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax101.set_title('Standardized LogPerm vs. Por'); ax101.set_xlabel('Standardized Por'); ax101.set_ylabel('Standardized LogPerm')
ax101.set_xlim([-3,3]); ax101.set_ylim([-3,3]); add_grid2(ax101)
x_trans = pca.transform(x) # calculate the principal component scores
ax102.scatter(x_trans[:,0],-1*x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax102.set_title('Principal Component Scores'); ax102.set_xlabel('PC1'); ax102.set_ylabel('PC2')
ax102.set_xlim([-3,3]); ax102.set_ylim([-3,3]); add_grid2(ax102)
x_reverse = pca.inverse_transform(x_trans) # reverse the principal component scores to standardized values
ax103.scatter(x_reverse[:,0],x_reverse[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax103.set_title('Reverse PCA'); ax103.set_xlabel('Standardized Por'); ax103.set_ylabel('Standardized LogPerm')
ax103.set_xlim([-3,3]); ax103.set_ylim([-3,3]); add_grid2(ax103)
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()
![_images/040728685fd4737dba466d509bb325eece330659c69d1efa7a1d9c67735a74ae.png](_images/040728685fd4737dba466d509bb325eece330659c69d1efa7a1d9c67735a74ae.png)
The standardized original and reverse PCA cross plots should look exactly the same. If so, the method is working.
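We can also confirm this round trip numerically rather than only visually; a quick sketch with NumPy:

print('Forward then reverse PCA recovers the standardized features:',np.allclose(x,pca.inverse_transform(pca.transform(x)))) # True, within floating point tolerance
print('Maximum absolute round trip error =',np.max(np.abs(x - pca.inverse_transform(pca.transform(x))))) # should be near machine precision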
Conservation of Variance#
Let's check the variances of the principal component scores, since we have calculated them now.
we calculate the variance for each of the original features
then sum to get the original total variance
we calculate the variance for each of the transformed, principal component scores
then we sum to get the transformed total variance
We note the:
the first principal component scores have larger variance than the second principal component scores
total variance is preserved over the transformation, the sum of the variances is the same for the original features and the m principal component scores
print('Variance of the 2 features:')
print(np.var(x, axis = 0))
print('\nTotal Variance from Original Features:')
print(np.sum(np.var(x, axis = 0)))
print('\nVariance of the 2 principal components:')
print(np.round(np.var(x_trans, axis = 0),2))
print('\nTotal Variance from Principal Component Scores:')
print(round(np.sum(np.var(x_trans, axis = 0)),2))
Variance of the 2 features:
[1. 1.]
Total Variance from Original Features:
2.0
Variance of the 2 principal components:
[1.79 0.21]
Total Variance from Principal Component Scores:
2.0
Independence of Principal Component Scores#
Let's check the correlations for the original features vs. our projected features.
print('\nCorrelation Matrix of the 2 original features:')
print(np.round(np.corrcoef(x, rowvar = False),2))
print('\nCorrelation Matrix of the 2 principal components\' scores:')
print(np.round(np.corrcoef(x_trans, rowvar = False),2))
Correlation Matrix of the 2 original features:
[[1. 0.79]
[0.79 1. ]]
Correlation Matrix of the 2 principal components' scores:
[[ 1. -0.]
[-0. 1.]]
We have projected our original features with high correlation to 2 new features without correlation between the new features.
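We can also check that the covariance matrix of the principal component scores is diagonal, with the variance explained by each component on the diagonal; a quick sketch:

print('Covariance matrix of the principal component scores:')
print(np.round(np.cov(x_trans,rowvar=False),2))               # off-diagonal terms are 0.0, diagonal terms are the component variances
print('Variance explained by each component:',np.round(pca.explained_variance_,2)) # matches the diagonal above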
Principal Component Analysis By-hand with Eigenvalue and Eigenvector Calculator#
Let's show PCA by-hand with the standardized features and the eigen calculation, and compare to the scikit-learn results from above.
we confirm that the results match.
from numpy.linalg import eig
eigen_values,eigen_vectors = eig(cov)
print('Eigen Vectors:\n' + str(np.round(eigen_vectors,2)))
print('First Eigen Vector: ' + str(eigen_vectors[:,0]))
print('Second Eigen Vector: ' + str(eigen_vectors[:,1]))
print('Eigen Values:\n' + str(np.round(eigen_values,2)))
PC = eigen_vectors.T.dot(x.T)
plt.subplot(121)
plt.scatter(PC[0,:],PC[1,:],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
plt.title('Principal Component Scores By-hand with numpy.linalg Eig Function'); plt.xlabel('PC1'); plt.ylabel('PC2')
plt.xlim([-3,3]); plt.ylim([-3,3]); add_grid()
plt.subplot(122)
plt.scatter(x_trans[:,0],-1*x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
plt.title('Principal Component Scores with scikit-learn PCA'); plt.xlabel('PC1'); plt.ylabel('PC2')
plt.xlim([-3,3]); plt.ylim([-3,3]); add_grid()
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.0, wspace=0.2, hspace=0.3); plt.show()
Eigen Vectors:
[[ 0.71 -0.71]
[ 0.71 0.71]]
First Eigen Vector: [0.70710678 0.70710678]
Second Eigen Vector: [-0.70710678 0.70710678]
Eigen Values:
[1.81 0.21]
![_images/792323ef0c7e641c08fbe33eefb6dc67e05dfacaee639375589ba3f7f4689f07.png](_images/792323ef0c7e641c08fbe33eefb6dc67e05dfacaee639375589ba3f7f4689f07.png)
Demonstration of Dimensional Reduction#
Now let's attempt dimensional reduction by only retaining the first principal component. We will go from the original values to predictions of the original values.
Recall we were able to explain about 90% of the variance with the first principal component, so the result should look "pretty good", right?
We will do the whole thing by hand to make it as simple/understandable as possible for this first time through. Later we will be much more compact. The steps:
start with the original porosity and permeability data
standardize such that Por and LogPerm have a mean of 0.0 and a variance of 1.0
calculate the 2 principal component model, visualize the principal component scores
remove the 2nd principal component by setting the associated component scores to 0.0
reverse the principal component by matrix multiplication of the scores with the component loadings
apply matrix math to restore the original mean and variance
nComp = 1
f, ((ax201, ax202, ax203), (ax206, ax205, ax204)) = plt.subplots(2, 3,figsize=(15,10))
#f, ((ax201, ax202), (ax203, ax204), (ax205, ax206)) = plt.subplots(3, 2,figsize=(10,15))
f.subplots_adjust(wspace=0.5,hspace = 0.3)
ax201.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax201.set_title('1. LogPerm vs. Por'); ax201.set_xlabel('Por'); ax201.set_ylabel('LogPerm')
ax201.set_xlim([8,22]); ax201.set_ylim([0,2.5]); add_grid2(ax201)
mu = np.mean(np.vstack((my_data_por_perm["Por"].values,my_data_por_perm["LogPerm"].values)), axis=1)
sd = np.std(np.vstack((my_data_por_perm["Por"].values,my_data_por_perm["LogPerm"].values)), axis=1)
x = StandardScaler().fit_transform(x) # standardize the data features to mean = 0, var = 1.0
ax202.scatter(x[:,0],x[:,1], s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax202.set_title('2. Standardized LogPerm vs. Por'); ax202.set_xlabel('Standardized Por'); ax202.set_ylabel('Standardized LogPerm')
ax202.set_xlim([-3.5,3.5]); ax202.set_ylim([-3.5,3.5]); add_grid2(ax202)
n_components = 2 # build principal component model with 2 components
pca = PCA(n_components=n_components)
pca.fit(x)
x_trans = pca.transform(x) # calculate principal component scores
ax203.scatter(x_trans[:,0],-1*x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax203.set_title('3. Principal Component Scores'); ax203.set_xlabel('PC1'); ax203.set_ylabel('PC2')
ax203.set_xlim([-3.5,3.5]); ax203.set_ylim([-3.5,3.5]); add_grid2(ax203)
x_trans[:,1] = 0.0 # zero / remove the 2nd principal component
ax204.scatter(x_trans[:,0],x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax204.set_title('4. Only 1st Principal Component Scores'); ax204.set_xlabel('PC1'); ax204.set_ylabel('PC2')
ax204.set_xlim([-3.5,3.5]); ax204.set_ylim([-3.5,3.5]); add_grid2(ax204)
xhat = pca.inverse_transform(x_trans) # reverse the principal component scores to standardized values
ax205.scatter(xhat[:,0],xhat[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax205.set_title('5. Reverse PCA'); ax205.set_xlabel('Standardized Por'); ax205.set_ylabel('Standardized LogPerm')
ax205.set_xlim([-3.5,3.5]); ax205.set_ylim([-3.5,3.5]); add_grid2(ax205)
xhat = np.dot(pca.transform(x)[:,:nComp], pca.components_[:nComp,:]) # project to scores, retain the first nComp components, and reverse to standardized features
xhat = sd*xhat + mu # remove the standardization
ax206.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="blue", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.6, linewidths=1.0, edgecolors="black")
ax206.scatter(xhat[:,0],xhat[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax206.set_title('6. De-standardized Reverse PCA'); ax206.set_xlabel('Por'); ax206.set_ylabel('LogPerm')
ax206.set_xlim([8,22]); ax206.set_ylim([0,2.5]); add_grid2(ax206)
plt.show()
![_images/99fce5c8ff4962b84e1d76599a56e25a485eb8a4462b4fa608f8d5a9af64a8dd.png](_images/99fce5c8ff4962b84e1d76599a56e25a485eb8a4462b4fa608f8d5a9af64a8dd.png)
Let's put the original data and the resulting lower dimensional model side-by-side and check the resulting variances.
f, (ax201, ax206) = plt.subplots(1, 2,figsize=(10,6))
f.subplots_adjust(wspace=0.5,hspace = 0.3)
ax201.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax201.set_title('1. LogPerm vs. Por'); ax201.set_xlabel('Por'); ax201.set_ylabel('LogPerm')
ax201.set_xlim([8,22]); ax201.set_ylim([0,2.5]); add_grid2(ax201)
ax206.scatter(xhat[:,0],xhat[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax206.set_title('6. De-standardized Reverse PCA'); ax206.set_xlabel('Por'); ax206.set_ylabel('LogPerm')
ax206.set_xlim([8,22]); ax206.set_ylim([0,2.5]); add_grid2(ax206)
plt.show()
var_por = np.var(my_data_por_perm["Por"]); var_por_hat = np.var(xhat[:,0]);
var_logperm = np.var(my_data_por_perm["LogPerm"]); var_logperm_hat = np.var(xhat[:,1]);
print('Variance Por =',np.round(var_por,3),', Variance Reduced Dimensional Por =',np.round(var_por_hat,3),'Fraction = ',np.round(var_por_hat/var_por,3))
print('Variance LogPerm =',np.round(var_logperm,3),', Variance Reduced Dimensional LogPerm =',np.round(var_logperm_hat,3),'Fraction = ',np.round(var_logperm_hat/var_logperm,3))
print('Total Variance =',np.round(var_por + var_logperm,3), ', Total Variance Reduced Dimension =',np.round(var_por_hat+var_logperm_hat,3),'Fraction = ',np.round((var_por_hat+var_logperm_hat)/(var_por+var_logperm),3))
![_images/dd23216e8863e8d206d5ad4311ffe9586147d99dbb5d2ad0581d859f68582c1d.png](_images/dd23216e8863e8d206d5ad4311ffe9586147d99dbb5d2ad0581d859f68582c1d.png)
Variance Por = 7.89 , Variance Reduced Dimensional Por = 7.073 Fraction = 0.896
Variance LogPerm = 0.151 , Variance Reduced Dimensional LogPerm = 0.136 Fraction = 0.896
Total Variance = 8.041 , Total Variance Reduced Dimension = 7.208 Fraction = 0.896
All Predictor Features#
We will go back to the original data file and this time extract all 6 predictor variables and the first 500 samples.
my_data_f6 = my_data.iloc[0:500,0:6] # extract the 6 predictors, 500 samples
It is a good idea to start with the summary statistics for our data.
my_data_f6.describe().transpose() # calculate summary statistics for the data
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
Por | 500.0 | 14.89936 | 2.985967 | 5.40 | 12.8500 | 14.900 | 17.0125 | 23.85 |
LogPerm | 500.0 | 1.40010 | 0.409616 | 0.18 | 1.1475 | 1.380 | 1.6700 | 2.58 |
AI | 500.0 | 2.99244 | 0.563674 | 1.21 | 2.5900 | 3.035 | 3.3725 | 4.70 |
Brittle | 500.0 | 49.74682 | 15.212123 | 0.00 | 39.3125 | 49.595 | 59.2075 | 93.47 |
TOC | 500.0 | 0.99800 | 0.503635 | 0.00 | 0.6400 | 0.960 | 1.3500 | 2.71 |
VR | 500.0 | 1.99260 | 0.307434 | 0.90 | 1.8200 | 2.010 | 2.1725 | 2.84 |
Let's also calculate a correlation matrix and view it.
corr_matrix = np.corrcoef(my_data_f6, rowvar = False)
print(np.around(corr_matrix,2)) # print the correlation matrix to 2 decimals
[[ 1. 0.79 -0.49 -0.25 0.71 0.12]
[ 0.79 1. -0.32 -0.13 0.48 0.04]
[-0.49 -0.32 1. 0.14 -0.53 0.47]
[-0.25 -0.13 0.14 1. -0.24 0.24]
[ 0.71 0.48 -0.53 -0.24 1. 0.35]
[ 0.12 0.04 0.47 0.24 0.35 1. ]]
We will need to standardize each variable to have a mean of zero and a variance of one. Let's do that and check the results. In the console below we print all the initial and standardized means and variances for all 6 predictors.
features = ['Por','LogPerm','AI','Brittle','TOC','VR']
x_f6 = my_data_f6.loc[:,features].values
mu_f6 = np.mean(x_f6, axis=0)
sd_f6 = np.std(x_f6, axis=0)
x_f6 = StandardScaler().fit_transform(x_f6)
print("Original Means", features[:], np.round(mu_f6[:],2))
print("Original StDevs", features[:],np.round(sd_f6[:],2))
print('Mean Transformed =',features[:],np.round(x_f6.mean(axis=0),2))
print('Variance Transformed =',features[:],np.round(x_f6.var(axis=0),2))
Original Means ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [14.9 1.4 2.99 49.75 1. 1.99]
Original StDevs ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [ 2.98 0.41 0.56 15.2 0.5 0.31]
Mean Transformed = ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [0. 0. 0. 0. 0. 0.]
Variance Transformed = ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [1. 1. 1. 1. 1. 1.]
We should also check the univariate distributions for each variable.
f, (ax6,ax7,ax8,ax9,ax10,ax11) = plt.subplots(1, 6, sharey=True, figsize=(15,2))
ax6.hist(x_f6[:,0], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax6.set_title('Std. Porosity'); ax6.set_xlim(-5,5)
ax7.hist(x_f6[:,1], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax7.set_title('Std. Log[Perm.]'); ax7.set_xlim(-5,5)
ax8.hist(x_f6[:,2], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax8.set_title('Std. Acoustic Imped.'); ax8.set_xlim(-5,5)
ax9.hist(x_f6[:,3], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax9.set_title('Std. Brittleness'); ax9.set_xlim(-5,5)
ax10.hist(x_f6[:,4], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax10.set_title('Std. Total Organic C'); ax10.set_xlim(-5,5)
ax11.hist(x_f6[:,5], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax11.set_title('Std. Vit. Reflectance'); ax11.set_xlim(-5,5)
plt.show()
![_images/723ea42f5bdf48a43d93c909876263675500438d484279bc163c899b6c5189f9.png](_images/723ea42f5bdf48a43d93c909876263675500438d484279bc163c899b6c5189f9.png)
The summary statistics and distributions look good. No obvious missing data, gaps, significant truncations, spikes or outliers. We are ready to perform principal component analysis on our 6 features.
n_components = 6
pca_f6 = PCA(n_components=n_components)
pca_f6.fit(x_f6)
print(np.round(pca_f6.components_,3)) # visualize the component loadings
[[ 0.558 0.476 -0.405 -0.211 0.504 0.01 ]
[-0.117 -0.114 -0.432 -0.323 -0.229 -0.794]
[-0.019 -0.124 0.384 -0.898 0.07 0.157]
[-0.214 -0.674 -0.424 -0.006 0.526 0.21 ]
[-0.784 0.522 -0.031 -0.046 0.331 -0.019]
[ 0.12 -0.138 0.566 0.206 0.55 -0.549]]
Let's look at the component loadings first. Each row is a component; the top row is the first principal component (PC1), the next row is the second principal component (PC2), and so on down to the last row, the sixth principal component (PC6). The columns are the features ordered from "Por", "LogPerm", "AI", "Brittle", "TOC", to "VR".
The first principal component is mainly composed of porosity, log permeability, acoustic impedance and total organic carbon, suggesting that the way they vary together is responsible for much of the variance. The next principal component is mainly composed of vitrinite reflectance. The third principal component is mainly composed of brittleness, and so on.
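To support this interpretation, the loadings can also be visualized as a labeled heat map; a sketch with seaborn using the fitted pca_f6 model and features list from above:

loadings_f6 = pd.DataFrame(pca_f6.components_,columns=features,index=['PC' + str(i+1) for i in range(6)]) # label the loadings matrix
plt.figure(figsize=(8,5))
sns.heatmap(loadings_f6,annot=True,fmt='.2f',cmap='RdBu_r',vmin=-1.0,vmax=1.0) # diverging color map centered on 0.0
plt.title('Principal Component Loadings'); plt.xlabel('Feature'); plt.ylabel('Principal Component')
plt.show()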
Scree Plots#
To assist in this interpretation we should consider the variance contributions from each principal component.
print('Variance explained by PC1 thru PC6 =', np.round(pca_f6.explained_variance_ratio_,3))
f, (ax10, ax11) = plt.subplots(1, 2,figsize=(10,6))
f.subplots_adjust(wspace=0.5,hspace = 0.3)
ax10.plot(np.arange(1,7,1),pca_f6.explained_variance_ratio_*100,color='darkorange',alpha=0.8)
ax10.scatter(np.arange(1,7,1),pca_f6.explained_variance_ratio_*100,color='darkorange',alpha=0.8,edgecolor='black')
ax10.set_xlabel('Principal Component'); ax10.set_ylabel('Variance Explained'); ax10.set_title('Variance Explained by Principal Component')
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
yticks = mtick.FormatStrFormatter(fmt); ax10.set_xlim(1,6); ax10.set_ylim(0,100.0)
ax10.yaxis.set_major_formatter(yticks); add_grid2(ax10)
ax11.plot(np.arange(1,7,1),np.cumsum(pca_f6.explained_variance_ratio_*100),color='darkorange',alpha=0.8)
ax11.scatter(np.arange(1,7,1),np.cumsum(pca_f6.explained_variance_ratio_*100),color='darkorange',alpha=0.8,edgecolor='black')
ax11.plot([1,6],[95,95], color='black',linestyle='dashed')
ax11.set_xlabel('Principal Component'); ax11.set_ylabel('Cumulative Variance Explained'); ax11.set_title('Cumulative Variance Explained by Principal Component')
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
yticks = mtick.FormatStrFormatter(fmt); ax11.set_xlim(1,6); ax11.set_ylim(0,100.0); ax11.annotate('95% variance explained',[4.05,90])
ax11.yaxis.set_major_formatter(yticks); add_grid2(ax11)
plt.show()
Variance explained by PC1 thru PC6 = [0.462 0.246 0.149 0.11 0.024 0.009]
![_images/440bc104628ed7ca4c595ebaff4fff2fa70e71b24b63daa330698e03fb7ff9fa.png](_images/440bc104628ed7ca4c595ebaff4fff2fa70e71b24b63daa330698e03fb7ff9fa.png)
We can see that about 46% of the variance is described by the 1st principal component and then about 25% is described by the 2nd principal component etc.
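A common use of the scree plot is choosing the number of components required to reach a variance threshold; a quick sketch computing this directly from the cumulative variance explained (the 95% threshold matches the dashed line above):

cum_var = np.cumsum(pca_f6.explained_variance_ratio_)         # cumulative variance explained
n_keep = int(np.argmax(cum_var >= 0.95) + 1)                  # first number of components that reaches 95% variance explained
print('Principal components required to explain 95% of the total variance =',n_keep)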
Independence of Principal Component Scores#
Let's check the pairwise feature correlations before and after the projection.
print('\nCorrelation Matrix of the 6 original features:')
print(np.round(np.corrcoef(x_f6, rowvar = False),2))
print('\nCorrelation Matrix of the 6 principal components\' scores:')
print(np.round(np.corrcoef(pca_f6.transform(x_f6), rowvar = False),2))
Correlation Matrix of the 6 original features:
[[ 1. 0.79 -0.49 -0.25 0.71 0.12]
[ 0.79 1. -0.32 -0.13 0.48 0.04]
[-0.49 -0.32 1. 0.14 -0.53 0.47]
[-0.25 -0.13 0.14 1. -0.24 0.24]
[ 0.71 0.48 -0.53 -0.24 1. 0.35]
[ 0.12 0.04 0.47 0.24 0.35 1. ]]
Correlation Matrix of the 6 principal components' scores:
[[ 1. -0. 0. -0. 0. -0.]
[-0. 1. -0. -0. 0. -0.]
[ 0. -0. 1. -0. 0. 0.]
[-0. -0. -0. 1. 0. -0.]
[ 0. 0. 0. 0. 1. 0.]
[-0. -0. 0. -0. 0. 1.]]
The new projected features (even without dimensionality reduction, \(p=m\)) all have pairwise correlations of 0.0!
all the projected features are linearly independent of each other
Reduced Dimensionality Impact on a 2 Feature Relationship#
It would be interesting to look just at the porosity vs. log permeability bivariate relationship when we retain \(1,\ldots,6\) principal components.
to do this we use matrix math to reverse the PCA and the standardization with various numbers of principal components retained, and then plot the scatter plots of log permeability vs. porosity.
nComp = 6
xhat_6d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_6d = sd_f6*xhat_6d + mu_f6
nComp = 5
xhat_5d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_5d = sd_f6*xhat_5d + mu_f6
nComp = 4
xhat_4d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_4d = sd_f6*xhat_4d + mu_f6
nComp = 3
xhat_3d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_3d = sd_f6*xhat_3d + mu_f6
nComp = 2
xhat_2d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_2d = sd_f6*xhat_2d + mu_f6
nComp = 1
xhat_1d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_1d = sd_f6*xhat_1d + mu_f6
f, (ax12, ax13, ax14, ax15, ax16, ax17, ax18) = plt.subplots(1, 7,figsize=(20,20))
f.subplots_adjust(wspace=0.7)
ax12.scatter(my_data_f6["Por"],my_data_f6["LogPerm"],s=None, c="darkorange",marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax12.set_title('Original Data'); ax12.set_xlabel('Por'); ax12.set_ylabel('LogPerm')
ax12.set_ylim(0.0,3.0); ax12.set_xlim(8,22); ax12.set_aspect(4.0);
ax13.scatter(xhat_1d[:,0],xhat_1d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax13.set_title('1 Principal Component'); ax13.set_xlabel('Por'); ax13.set_ylabel('LogPerm')
ax13.set_ylim(0.0,3.0); ax13.set_xlim(8,22); ax13.set_aspect(4.0)
ax14.scatter(xhat_2d[:,0],xhat_2d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax14.set_title('2 Principal Components'); ax14.set_xlabel('Por'); ax14.set_ylabel('LogPerm')
ax14.set_ylim(0.0,3.0); ax14.set_xlim(8,22); ax14.set_aspect(4.0)
ax15.scatter(xhat_3d[:,0],xhat_3d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax15.set_title('3 Principal Components'); ax15.set_xlabel('Por'); ax15.set_ylabel('LogPerm')
ax15.set_ylim(0.0,3.0); ax15.set_xlim(8,22); ax15.set_aspect(4.0)
ax16.scatter(xhat_4d[:,0],xhat_4d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax16.set_title('4 Principal Components'); ax16.set_xlabel('Por'); ax16.set_ylabel('LogPerm')
ax16.set_ylim(0.0,3.0); ax16.set_xlim(8,22); ax16.set_aspect(4.0)
ax17.scatter(xhat_5d[:,0],xhat_5d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax17.set_title('5 Principal Components'); ax17.set_xlabel('Por'); ax17.set_ylabel('LogPerm')
ax17.set_ylim(0.0,3.0); ax17.set_xlim(8,22); ax17.set_aspect(4.0)
ax18.scatter(xhat_6d[:,0],xhat_6d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax18.set_title('6 Principal Components'); ax18.set_xlabel('Por'); ax18.set_ylabel('LogPerm')
ax18.set_ylim(0.0,3.0); ax18.set_xlim(8,22); ax18.set_aspect(4.0)
plt.show()
![_images/422a00fd31196fe093314dad28d30e0b04af5817a25075faffb515ac32a96a72.png](_images/422a00fd31196fe093314dad28d30e0b04af5817a25075faffb515ac32a96a72.png)
It is very interesting to watch the accuracy of the bivariate relationship between log permeability and porosity improve as we include more components. Let's check the variances.
print('1 Principal Component : Variance Por =',np.round(np.var(xhat_1d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_1d[:,1])/(sd_f6[1]*sd_f6[1]),2))
print('2 Principal Components: Variance Por =',np.round(np.var(xhat_2d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_2d[:,1])/(sd_f6[1]*sd_f6[1]),2))
print('3 Principal Components: Variance Por =',np.round(np.var(xhat_3d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_3d[:,1])/(sd_f6[1]*sd_f6[1]),2))
print('4 Principal Components: Variance Por =',np.round(np.var(xhat_4d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_4d[:,1])/(sd_f6[1]*sd_f6[1]),2))
print('5 Principal Components: Variance Por =',np.round(np.var(xhat_5d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_5d[:,1])/(sd_f6[1]*sd_f6[1]),2))
print('6 Principal Components: Variance Por =',np.round(np.var(xhat_6d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_6d[:,1])/(sd_f6[1]*sd_f6[1]),2))
1 Principal Component : Variance Por = 0.86 Variance Log Perm = 0.63
2 Principal Components: Variance Por = 0.88 Variance Log Perm = 0.65
3 Principal Components: Variance Por = 0.88 Variance Log Perm = 0.66
4 Principal Components: Variance Por = 0.91 Variance Log Perm = 0.96
5 Principal Components: Variance Por = 1.0 Variance Log Perm = 1.0
6 Principal Components: Variance Por = 1.0 Variance Log Perm = 1.0
This is interesting. With the first principal component we describe 86% of the porosity variance. The next two principal components do not provide much assistance. Then there is a jump with the 4th and 5th principal components.
of course, the problem is 6 dimensional, not just porosity vs. log permeability, but it is interesting to see the relationship between the number of principal components and the variance retained for each of these 2 original features
principal components do not uniformly describe each feature, as the sketch below also shows
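This per-feature breakdown can also be computed directly from the loadings and the component variances, since the variance of standardized feature \(j\) captured by the first \(k\) components is \(\sum_{\alpha=1}^{k} v_{\alpha j}^2 \lambda_\alpha\); here is a minimal sketch using the fitted pca_f6 model and features list from above:

var_captured = np.cumsum((pca_f6.components_**2) * pca_f6.explained_variance_[:,np.newaxis],axis=0) # cumulative variance of each standardized feature captured by the first k components
print(pd.DataFrame(np.round(var_captured,2),columns=features,index=['k = ' + str(k) for k in range(1,7)])) # rows are the number of retained components, columns are the original features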
Reduced Dimensionality Impact on Matrix Scatter Plots of All Features#
Let's look at the matrix scatter plots to see all of the bivariate combinations.
first some bookkeeping, we have to put the 6D reduced dimensionality models in DataFrames (they are currently NumPy ndarrays).
df_1d = pd.DataFrame(data=xhat_1d,columns=features)
df_2d = pd.DataFrame(data=xhat_2d,columns=features)
df_3d = pd.DataFrame(data=xhat_3d,columns=features)
df_4d = pd.DataFrame(data=xhat_4d,columns=features)
df_5d = pd.DataFrame(data=xhat_5d,columns=features)
df_6d = pd.DataFrame(data=xhat_6d,columns=features)
Now we can go ahead and produce the matrix scatter plots with these DataFrames. It is very interesting to see the accuracy of the bivariate plots improve as we add principal components. Also, with only two principal components we capture some of the bivariate relationships quite well for some of the variable pairs.
fig = plt.figure()
pd_plot.scatter_matrix(my_data_f6, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('Original Data')
pd_plot.scatter_matrix(df_1d, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('1 Principal Component')
pd_plot.scatter_matrix(df_2d, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('2 Principal Components')
pd_plot.scatter_matrix(df_3d, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('3 Principal Components')
pd_plot.scatter_matrix(df_4d, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('4 Principal Components')
pd_plot.scatter_matrix(df_5d, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('5 Principal Components')
pd_plot.scatter_matrix(df_6d, alpha = 0.1, # pandas matrix scatter plot
figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('6 Principal Components')
plt.show()
<Figure size 640x480 with 0 Axes>
![_images/9037b92fcdafe9a720302d1a1992c353882de03fbb58d7636a99f3bc723cc668.png](_images/9037b92fcdafe9a720302d1a1992c353882de03fbb58d7636a99f3bc723cc668.png)
![_images/6196eb5d627c4f2fbe52fec9822e0c4480369f587058fb6b53633e6f26e21cbb.png](_images/6196eb5d627c4f2fbe52fec9822e0c4480369f587058fb6b53633e6f26e21cbb.png)
![_images/7ec799f9728d5b7cb4ae06957f30aeed9d702cb0c0e7bd56d8897cb2be4ef1cd.png](_images/7ec799f9728d5b7cb4ae06957f30aeed9d702cb0c0e7bd56d8897cb2be4ef1cd.png)
![_images/48284ddedfa43e6cf42229c8c5d4f25a72df3247dedac542cbc23664c2044fb3.png](_images/48284ddedfa43e6cf42229c8c5d4f25a72df3247dedac542cbc23664c2044fb3.png)
![_images/5b8d868cfd7077aca0e5cf9aa13f2327e85f86e516d61d6aa0103d070f6487dd.png](_images/5b8d868cfd7077aca0e5cf9aa13f2327e85f86e516d61d6aa0103d070f6487dd.png)
![_images/9f43cd1d40efe65f50f73dc0c7e33230e21b82413bfce8ebe8dc13dbfa4f6351.png](_images/9f43cd1d40efe65f50f73dc0c7e33230e21b82413bfce8ebe8dc13dbfa4f6351.png)
![_images/e5f6bf1a8c41d590c3552be8d7754a8f16f447bb52c6b155fb5efbb5b59faece.png](_images/e5f6bf1a8c41d590c3552be8d7754a8f16f447bb52c6b155fb5efbb5b59faece.png)
Want to Work Together?#
I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.
Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you!
Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
I can be reached at mpyrcz@austin.utexas.edu.
I'm always happy to discuss,
Michael
Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn
Comments#
This was a basic treatment of dimensionality reduction by principal component analysis (PCA). Much more could be done and discussed; I have many more resources. Check out my shared resource inventory and the YouTube lecture links at the start of this chapter with resource links in the videos' descriptions.
I hope this is helpful,
Michael