Principal Component Analysis#

Michael J. Pyrcz, Professor, The University of Texas at Austin

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn

Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.

Cite as: Pyrcz, M.J., 2024, Applied Machine Learning in Python: a Hands-on Guide with Code, https://geostatsguy.github.io/MachineLearningDemos_Book.

By Michael J. Pyrcz
© Copyright 2024.

This chapter is a tutorial for / demonstration of Principal Component Analysis.

YouTube Lecture: check out my lectures on:

These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael’s Story.

Motivation for Principal Component Analysis, The Curse of Dimensionality#

Working with more features / variables is harder!

  1. More difficult to visualize data and model

  2. More data are required to infer the joint probabilities

  3. Less data coverage of feature space

  4. More difficult to interrogate / check the model

  5. More likely redundant features, e.g., multicollinearity, resulting in model instability

  6. More computational effort, more computational resources and longer run times

  7. A more complicated model is more likely to overfit

  8. More professional time for model construction

We get a better model with fewer, informative features than by throwing all of our features into the model!

Inferential Machine Learning#

There are no response features, \(y\), just predictor features,

\[ X_1,\ldots,X_m \]
  • The machine learns, by mimicry, a compact representation of the data

  • Captures patterns as feature projections, group assignments, neural network latent features, etc.

  • We focus on inference of the population, the natural system, instead of prediction of response features.

Principal Component Analysis#

Principal Component Analysis is one of a variety of methods for dimensional reduction:

Dimensional reduction transforms the data to a lower dimension

  • Given features, \(X_1,\ldots,X_m\), we would require \({m \choose 2}=\frac{m \cdot (m-1)}{2}\) scatter plots just to visualize all of the two-dimensional combinations (see the sketch after this list).

  • Once we have 4 or more variables understanding our data gets very hard.

  • Recall the curse of dimensionality, which impacts inference, modeling and visualization.
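
For example, a quick count of the required scatter plots as the number of features grows; a minimal sketch (the feature counts below are just for illustration):

import math                                               # combinations for the scatter plot count

for m in [2, 4, 6, 10, 20]:                               # number of features
    print('m =',m,'features requires',math.comb(m,2),'two-dimensional scatter plots')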

One solution is to find a good lower dimensional representation, \(p\), of the original \(m\) dimensions

Benefits of Working in a Reduced Dimensional Representation:

  1. Data storage / Computational Time

  2. Easier visualization

  3. Also takes care of multicollinearity

Orthogonal Transformation#

Convert a set of observations into a set of linearly uncorrelated variables known as principal components

  • The number of principal components (\(k\)) available is \(\min(n-1, m)\)

  • Limited by the number of variables/features, \(m\), and the number of data, \(n\)

Components are ordered,

  • First component describes the largest possible variance / accounts for as much variability as possible

  • Next component describes the largest possible remaining variance

  • Up to the maximum number of principal components

Eigenvalues / Eigenvectors#

For principal component analysis we calculate the data covariance matrix, the pairwise covariances for all combinations of features. Then we calculate the eigenvectors and eigenvalues of the covariance matrix.

  • The eigenvalues are the variance explained for each component.

  • The eigenvectors of the data covariance matrix are the principal components.
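
Here is a minimal conceptual sketch of this calculation with NumPy; the toy data matrix below is only for illustration and is not part of the workflow that follows.

import numpy as np                                        # eigen decomposition of a covariance matrix

rng = np.random.default_rng(42)                           # toy, correlated bivariate data for illustration only
X = rng.multivariate_normal([0.0,0.0],[[1.0,0.8],[0.8,1.0]],size=500) # n samples x m features

C = np.cov(X,rowvar = False)                              # m x m data covariance matrix
eigen_values, eigen_vectors = np.linalg.eigh(C)           # eigh for symmetric matrices, ascending order
order = np.argsort(eigen_values)[::-1]                    # re-sort, largest variance explained first
eigen_values = eigen_values[order]; eigen_vectors = eigen_vectors[:,order]

print('Principal components (columns):\n' + str(np.round(eigen_vectors,3)))
print('Proportion of variance explained:',np.round(eigen_values/np.sum(eigen_values),3))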

Load the Required Libraries#

The following code loads the required libraries. These should have been installed with Anaconda 3.

ignore_warnings = True                                        # ignore warnings?
from sklearn.decomposition import PCA                         # PCA program from scikit learn (package for machine learning)
from sklearn.preprocessing import StandardScaler              # standardize variables to mean of 0.0 and variance of 1.0
import numpy as np                                            # ndarrays for gridded data
import pandas as pd                                           # DataFrames for tabular data
import pandas.plotting as pd_plot                             # pandas plotting functions
import copy                                                   # for deep copies
import os                                                     # set working directory, run executables
import matplotlib.pyplot as plt                               # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
import matplotlib.ticker as mtick                             # control tick label formatting
import seaborn as sns                                         # advanced plotting
plt.rc('axes', axisbelow=True)                                # plot all grids below the plot elements
if ignore_warnings == True:                                   
    import warnings
    warnings.filterwarnings('ignore')
cmap = plt.cm.inferno                                         # color map
seed = 42                                                     # random number seed

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.

Declare Functions#

Let’s define a function to streamline plotting correlation matrices. I also added convenience functions to add major and minor gridlines to improve plot interpretability.

def plot_corr(df,size=10):                                    # plots a graphical correlation matrix 
    from matplotlib.colors import ListedColormap              # make a custom colormap
    my_colormap = plt.cm.get_cmap('RdBu_r', 256)          
    newcolors = my_colormap(np.linspace(0, 1, 256))
    white = np.array([256/256, 256/256, 256/256, 1])
    newcolors[65:191, :] = white                              # mask all correlations less than abs(0.8)
    newcmp = ListedColormap(newcolors)
    m = len(df.columns)
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    im = ax.matshow(corr,vmin = -1.0, vmax = 1.0,cmap = newcmp)
    plt.xticks(range(len(corr.columns)), corr.columns);
    plt.yticks(range(len(corr.columns)), corr.columns);
    plt.colorbar(im, orientation = 'vertical')
    plt.title('Correlation Matrix')
    for i in range(0,m):
        plt.plot([i-0.5,i-0.5],[-0.5,m-0.5],color='black')
        plt.plot([-0.5,m-0.5],[i-0.5,i-0.5],color='black')
    plt.ylim([-0.5,m-0.5]); plt.xlim([-0.5,m-0.5])

def add_grid():
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks 

def add_grid2(sub_plot):
    sub_plot.grid(True, which='major',linewidth = 1.0); sub_plot.grid(True, which='minor',linewidth = 0.2) # add y grids
    sub_plot.tick_params(which='major',length=7); sub_plot.tick_params(which='minor', length=4)
    sub_plot.xaxis.set_minor_locator(AutoMinorLocator()); sub_plot.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks   

Set the Working Directory#

I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time).

#os.chdir("c:/PGE383")                                 # set the working directory

You will have to update the part in quotes with your own working directory and the format is different on a Mac (e.g. “~/PGE”).

Loading Tabular Data#

Here’s the command to load our comma delimited data file into a Pandas DataFrame object.

Let’s load the provided multivariate, spatial dataset ‘unconv_MV.csv’. This dataset has variables from 1,000 unconventional wells including:

  • well average porosity

  • log transform of permeability (to linearize the relationships with other variables)

  • acoustic impedance (kg/m^3 x m/s x 10^6)

  • brittleness ratio (%)

  • total organic carbon (%)

  • vitrinite reflectance (%)

  • initial production 90 day average (MCFPD).

Note, the dataset is synthetic.

We load it with the pandas ‘read_csv’ function into a DataFrame we called ‘my_data’ and then preview it to make sure it loaded correctly.

#my_data = pd.read_csv("unconv_MV.csv")  
my_data = pd.read_csv(r"https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/unconv_MV.csv") # load the comma delimited data file
my_data = my_data.iloc[:,1:]                              # remove the well index

Visualize the DataFrame#

Visualizing the DataFrame is a useful first check.

We can preview by slicing the DataFrame.

  • we are showing all records from 0 up to and not including 7

my_data[:7]                                               # preview the first 7 rows of the dataframe
Por LogPerm AI Brittle TOC VR Production
0 15.91 1.67 3.06 14.05 1.36 1.85 177.381958
1 15.34 1.65 2.60 31.88 1.37 1.79 1479.767778
2 20.45 2.02 3.13 63.67 1.79 2.53 4421.221583
3 11.95 1.14 3.90 58.81 0.40 2.03 1488.317629
4 19.53 1.83 2.57 43.75 1.40 2.11 5261.094919
5 19.47 2.04 2.73 54.37 1.42 2.12 5497.005506
6 12.70 1.30 3.70 43.03 0.45 1.95 1784.266285

Summary Statistics for Tabular Data#

There are a lot of efficient methods to calculate summary statistics from tabular data in DataFrames. The describe command provides count, mean, standard deviation, minimum, maximum, and quartiles all in a nice data table.

  • We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.

my_data.describe().transpose()                            # calculate summary statistics for the data
count mean std min 25% 50% 75% max
Por 1000.0 14.950460 3.029634 5.400000 12.85750 14.98500 17.080000 24.65000
LogPerm 1000.0 1.398880 0.405966 0.120000 1.13000 1.39000 1.680000 2.58000
AI 1000.0 2.982610 0.577629 0.960000 2.57750 3.01000 3.360000 4.70000
Brittle 1000.0 49.719480 15.077006 -10.500000 39.72250 49.68000 59.170000 93.47000
TOC 1000.0 1.003810 0.504978 -0.260000 0.64000 0.99500 1.360000 2.71000
VR 1000.0 1.991170 0.308194 0.900000 1.81000 2.00000 2.172500 2.90000
Production 1000.0 2247.295809 1464.256312 2.713535 1191.36956 1976.48782 3023.594214 12568.64413

It is good that we checked the summary statistics; we have some negative values for brittleness and total organic carbon, which are physically impossible.

  • The values must be in error. We know the lowest possible values are 0.0, so we will truncate at 0.0.

We use the:

df._get_numeric_data()

DataFrame member function to get a shallow copy of the data from the DataFrame.

Since it is a shallow copy, any changes we make to the copy are made to the data in the original DataFrame.

  • This allows us to apply this simple conditional statement to all the data values in the DataFrame all at once.

num = my_data._get_numeric_data()                         # get the numerical values
num[num < 0] = 0                                          # truncate negative values to 0.0
my_data.describe().transpose()                            # calculate summary statistics for the data
count mean std min 25% 50% 75% max
Por 1000.0 14.950460 3.029634 5.400000 12.85750 14.98500 17.080000 24.65000
LogPerm 1000.0 1.398880 0.405966 0.120000 1.13000 1.39000 1.680000 2.58000
AI 1000.0 2.982610 0.577629 0.960000 2.57750 3.01000 3.360000 4.70000
Brittle 1000.0 49.731480 15.033593 0.000000 39.72250 49.68000 59.170000 93.47000
TOC 1000.0 1.006170 0.499838 0.000000 0.64000 0.99500 1.360000 2.71000
VR 1000.0 1.991170 0.308194 0.900000 1.81000 2.00000 2.172500 2.90000
Production 1000.0 2247.295809 1464.256312 2.713535 1191.36956 1976.48782 3023.594214 12568.64413
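
As an aside, the same truncation can also be written with the pandas clip method; a minimal, self-contained sketch with toy values (not part of the workflow above):

import pandas as pd                                       # clip as an alternative way to truncate

df_toy = pd.DataFrame({'Brittle':[-10.5,40.2],'TOC':[-0.26,1.36]}) # toy values for illustration only
df_toy[['Brittle','TOC']] = df_toy[['Brittle','TOC']].clip(lower = 0.0) # truncate negative values to 0.0
print(df_toy)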

Calculate the Correlation Matrix#

For dimensional reduction, a good first step is data visualization.

Let’s start with the correlation matrix.

We can calculate it and view it in the console with these commands.

corr_matrix = np.corrcoef(my_data, rowvar = False)

the input data is a 2D ndarray and \(rowvar\) specifies if the variables are in the rows instead of columns.

corr_matrix = np.corrcoef(my_data, rowvar = False)
print(np.around(corr_matrix,2))                           # print the correlation matrix to 2 decimals
[[ 1.    0.81 -0.51 -0.25  0.71  0.08  0.69]
 [ 0.81  1.   -0.32 -0.15  0.51  0.05  0.57]
 [-0.51 -0.32  1.    0.17 -0.55  0.49 -0.33]
 [-0.25 -0.15  0.17  1.   -0.24  0.3  -0.07]
 [ 0.71  0.51 -0.55 -0.24  1.    0.31  0.5 ]
 [ 0.08  0.05  0.49  0.3   0.31  1.    0.14]
 [ 0.69  0.57 -0.33 -0.07  0.5   0.14  1.  ]]

Note the 1.0 diagonal resulting from the correlation of each variable with itself.

Let’s use our function declared above to make a graphical correlation matrix visualization.

  • This may improve our ability to spot strongly correlated features. It relies on the built-in correlation matrix method of Pandas DataFrames and Matplotlib for plotting.

plot_corr(my_data,7)                                      # using our correlation matrix visualization function
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()

This looks good. There is a mix of bivariate, linear correlation magnitudes. Of course, correlation coefficients only measure the degree of linear correlation.

Check Matrix Scatter Plots#

For more complete information, let’s look at the matrix scatter plot from the Pandas package.

  • covariance and correlation are sensitive to outliers and nonlinearity

pd_plot.scatter_matrix(my_data)

the \(alpha\) allows us to use semitransparent points for easier visualization with dense scatter plots.

the \(hist_kwds\) is a set of parameters for the histograms on the diagonal elements.

pd_plot.scatter_matrix(my_data, alpha = 0.1,              # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()

Simple Bivariate Example#

Let’s simplify the problem to bivariate (2 features), porosity and the log transform of permeability and reduce the number of wells from 1,000 to 100.

my_data_por_perm = my_data.iloc[0:100,0:2]                # extract just por and logperm, 100 samples
my_data_por_perm.describe().transpose()                   # calculate summary statistics for the data
count mean std min 25% 50% 75% max
Por 100.0 14.9856 2.823016 9.23 12.9275 14.720 16.705 21.00
LogPerm 100.0 1.3947 0.390947 0.36 1.1475 1.365 1.650 2.48

Let’s first check the univariate statistics of Por and LogPerm.

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.hist(my_data_por_perm["Por"], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20)
ax1.set_title('Porosity'); ax1.set_xlabel('Porosity (%)'); ax1.set_ylabel('Frequency'); add_grid2(ax1)
ax2.hist(my_data_por_perm["LogPerm"], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20)
ax2.set_title('Log Transformed Permeability'); ax2.set_xlabel('Log[Permeability] (log(mD)'); add_grid2(ax2)
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=0.8, wspace=0.2, hspace=0.3); plt.show()

The distributions may actually be Gaussian; regardless, they are well behaved, and we cannot observe obvious gaps nor truncations.

Let’s look at a scatter plot of porosity vs. log permeability.

This would be the basic command from matplotlib to make a scatter plot.

plt.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"])
  • the additional parameters are for formatting and labels

plt.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
plt.title('Log Transformed Permeability vs. Porosity'); plt.xlabel('Porosity (%)'); plt.ylabel('Log(Permeability (Log(mD))'); add_grid()
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.7, top=0.8, wspace=0.2, hspace=0.3); plt.show()

Calculation of Principal Components#

With the log transform of permeability we have a very nice linear relationship with porosity, so PCA should work well on this data.

  • We are ready to perform PCA with porosity and log of permeability.

Standardize the Features#

We must standardize our variables to have a mean equal to zero, \(\bar{x} = 0.0\), and the variance equal to one, \(\sigma^{2}_{x} = 1.0\).

  • Otherwise the difference in scale between porosity and permeability would have a significant impact. Note the impact of the choice of units on variance, e.g., darcies (D) vs. millidarcies (mD) for permeability, or fraction instead of percentage for porosity. This is quite arbitrary!

To remove this effect, we should always standardize, unless the variables share the same units; in that case the difference in range and variance between them is meaningful and standardization could remove important information.

features = ['Por','LogPerm']
x = my_data_por_perm.loc[:,features].values
mu = np.mean(x, axis=0)
sd = np.std(x, axis=0)
x = StandardScaler().fit_transform(x)                     # standardize the data features to mean = 0, var = 1.0

print("Original Mean Por", np.round(mu[0],2), ', Original Mean LogPerm = ', np.round(mu[1],2)) 
print("Original StDev Por", np.round(sd[0],2), ', Original StDev LogPerm = ', np.round(sd[1],2)) 
print('Mean Transformed Por =',np.round(np.mean(x[:,0]),2),', Mean Transformed LogPerm =',np.round(np.mean(x[:,1]),2))
print('Variance Transformed Por =',np.var(x[:,0]),', Variance Transformed LogPerm =',np.var(x[:,1]))
Original Mean Por 14.99 , Original Mean LogPerm =  1.39
Original StDev Por 2.81 , Original StDev LogPerm =  0.39
Mean Transformed Por = 0.0 , Mean Transformed LogPerm = -0.0
Variance Transformed Por = 1.0000000000000002 , Variance Transformed LogPerm = 1.0
cov = np.cov(x,rowvar = False)
cov
array([[1.01010101, 0.80087707],
       [0.80087707, 1.01010101]])

“x” is a 2D ndarray from the NumPy package with the features in the columns and the samples in the rows.

  • Above we confirm that the features in the “x” 2D array are standardized.
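
As a check on our understanding, the StandardScaler transform is just subtracting the mean and dividing by the standard deviation; a quick check, assuming my_data_por_perm, mu, sd and x from the cells above:

x_by_hand = (my_data_por_perm[['Por','LogPerm']].values - mu)/sd # standardize by-hand with mu and sd from above
print(np.allclose(x_by_hand,x))                           # expected True, matches StandardScaler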

It is not a bad idea to check the univariate and bivariate distributions of our standardized variables.

dfS = pd.DataFrame({'sPor': x[:,0], 'sLogPerm': x[:,1]})
sns.jointplot(data=dfS,x='sPor',y='sLogPerm',marginal_kws=dict(bins=30),color='darkorange',edgecolor='black')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.3); plt.show()

Everything looks fine and we are ready to apply principal components analysis.

Principal Component Analysis (PCA)#

To run PCA with the scikit-learn machine learning package in Python, we first make a PCA model object with a specified number of components and then we ‘fit’ it to our data.

n_components = 2
pca = PCA(n_components=n_components)
pca.fit(x)

As you will see later with dimensional reduction, we can use matrix math with this model and reduce our data to any dimensionality from 1 to the number of features, m. Let’s run the model with number of components equal to number of features, m.

n_components = 2
pca = PCA(n_components=n_components).fit(x)

Component Loadings#

The first thing we should do is look at the component loadings. Let’s view them and interpret our result.

print(np.round(pca.components_,3))
print('First Principal Component = ' + str(np.round(pca.components_[0,:],3)))
print('Second Principal Component = ' + str(np.round(pca.components_[1,:],3)))
[[ 0.707  0.707]
 [ 0.707 -0.707]]
First Principal Component = [0.707 0.707]
Second Principal Component = [ 0.707 -0.707]

The components are listed as a 2D array (ndarray) with:

  • principal components on the rows

  • features on the columns

  • the rows are sorted so that the first principal component is the top row and the last principal component is the last row.
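
Note, the component loading vectors are orthonormal, unit length and mutually orthogonal; a quick check, assuming the pca model fit above:

print(np.round(np.dot(pca.components_,pca.components_.T),3)) # expected the identity matrix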

Proportion of Variance Explained with Principal Components#

It is also important to look at the proportion of the variance described by each principal component.

print('Variance explained by PC1 and PC2 =', np.round(pca.explained_variance_ratio_,3))
print('First Principal Component explains ' + str(np.round(pca.explained_variance_ratio_[0],3)) + ' of the total variance.')
print('Second Principal Component explains ' + str(np.round(pca.explained_variance_ratio_[1],3)) + ' of the total variance.')
Variance explained by PC1 and PC2 = [0.896 0.104]
First Principal Component explains 0.896 of the total variance.
Second Principal Component explains 0.104 of the total variance.

Principal Component Scores, Forward and Reverse Projections#

We can calculate the principal component scores of the original data.

  • This is effectively a rotation of the data, so that PC1 is aligned with the direction of greatest variance.

  • We will calculate the principal component scores with the “transform” function built into PCA and then visualize as a scatter plot.

  • Then to “close the loop” and check what we have done (and our knowledge) we will reverse the PCA, go from the principal component scores back to the standardized features.

f, (ax101, ax102, ax103) = plt.subplots(1, 3,figsize=(12,3))
f.subplots_adjust(wspace=0.7)

ax101.scatter(x[:,0],x[:,1], s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax101.set_title('Standardized LogPerm vs. Por'); ax101.set_xlabel('Standardized Por'); ax101.set_ylabel('Standardized LogPerm')
ax101.set_xlim([-3,3]); ax101.set_ylim([-3,3]); add_grid2(ax101)

x_trans = pca.transform(x)                                # calculate the principal component scores
ax102.scatter(x_trans[:,0],-1*x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax102.set_title('Principal Component Scores'); ax102.set_xlabel('PC1'); ax102.set_ylabel('PC2')
ax102.set_xlim([-3,3]); ax102.set_ylim([-3,3]); add_grid2(ax102)

x_reverse = pca.inverse_transform(x_trans)                        # reverse the principal component scores to standardized values
ax103.scatter(x_reverse[:,0],x_reverse[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax103.set_title('Reverse PCA'); ax103.set_xlabel('Standardized Por'); ax103.set_ylabel('Standardized LogPerm')
ax103.set_xlim([-3,3]); ax103.set_ylim([-3,3]); add_grid2(ax103)

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.3); plt.show()

The standardized original and reverse PCA cross plots should look exactly the same. If so, the method is working.
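
We can also confirm this numerically; a quick check, assuming x and x_reverse from the cell above:

print(np.allclose(x,x_reverse))                           # expected True, the reverse projection recovers the standardized features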

Conservation of Variance#

Let’s check the variances of the principal component scores, since we have calculated them now.

  • we calculate the variance for each of the original features

  • then sum to get the original total variance

  • we calculate the variance for each of the transformed, principal component scores

  • then we sum to get the transformed total variance

We note the:

  • the first principal component score has larger variance than the second component scores

  • total variance is preserved over the transformation, the sum of variance is the same for original features and m principal component scores

print('Variance of the 2 features:')
print(np.var(x, axis = 0))

print('\nTotal Variance from Original Features:')
print(np.sum(np.var(x, axis = 0)))

print('\nVariance of the 2 principal components:')
print(np.round(np.var(x_trans, axis = 0),2))

print('\nTotal Variance from Principal Component Scores:')
print(round(np.sum(np.var(x_trans, axis = 0)),2))
Variance of the 2 features:
[1. 1.]

Total Variance from Original Features:
2.0

Variance of the 2 principal components:
[1.79 0.21]

Total Variance from Principal Component Scores:
2.0

Independence of Principal Component Scores#

Let’s check the correlations for the original features vs. our projected features.

print('\nCorrelation Matrix of the 2 original features:')
print(np.round(np.corrcoef(x, rowvar = False),2))

print('\nCorrelation Matrix of the 2 principal component scores:')
print(np.round(np.corrcoef(x_trans, rowvar = False),2))
Correlation Matrix of the 2 original features:
[[1.   0.79]
 [0.79 1.  ]]

Correlation Matrix of the 2 principal component scores:
[[ 1. -0.]
 [-0.  1.]]

We have projected our original features with high correlation to 2 new features without correlation between the new features.

Principal Component Analysis By-hand with Eigenvalue and Eigenvector Calculation#

Let’s show PCA by-hand with the standardized features and the eigen calculation and compare to the scikit-learn results from above.

  • we confirm that the results match.

from numpy.linalg import eig
eigen_values,eigen_vectors = eig(cov)                     # note, eig does not sort the eigenvalues; here they happen to be in descending order
print('Eigen Vectors:\n' +  str(np.round(eigen_vectors,2)))
print('First Eigen Vector: ' + str(eigen_vectors[:,0]))
print('Second Eigen Vector: ' + str(eigen_vectors[:,1]))
print('Eigen Values:\n' +  str(np.round(eigen_values,2)))
PC = eigen_vectors.T.dot(x.T)                             # project the standardized data onto the eigenvectors to get the scores
plt.subplot(121)
plt.scatter(PC[0,:],PC[1,:],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
plt.title('Principal Component Scores By-hand with numpy.linalg Eig Function'); plt.xlabel('PC1'); plt.ylabel('PC2')
plt.xlim([-3,3]); plt.ylim([-3,3]); add_grid()

plt.subplot(122)
plt.scatter(x_trans[:,0],-1*x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
plt.title('Principal Component Scores with scikit-learn PCA'); plt.xlabel('PC1'); plt.ylabel('PC2')
plt.xlim([-3,3]); plt.ylim([-3,3]); add_grid()

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.0, wspace=0.2, hspace=0.3); plt.show()
Eigen Vectors:
[[ 0.71 -0.71]
 [ 0.71  0.71]]
First Eigen Vector: [0.70710678 0.70710678]
Second Eigen Vector: [-0.70710678  0.70710678]
Eigen Values:
[1.81 0.21]
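
Note, the scikit-learn explained_variance_ attribute is calculated from the same \(n-1\) sample covariance, so it matches the eigenvalues above; a quick check, assuming the bivariate pca model fit above:

print(np.round(pca.explained_variance_,2))                # expected [1.81 0.21], matching the eigenvalues above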

Demonstration of Dimensional Reduction#

Now let’s attempt dimensional reduction by only retaining the first principal component. We will go from original values to predictions of original values.

  • Recall we were able to explain about 90% of the variance with the first principal component so the result should look ‘pretty good’, right?

We will do the whole thing by hand to make it as simple/understandable as possible for this first time through. Later we will be much more compact. The steps:

  1. start with the original porosity and permeability data

  2. standardize such that Por and LogPerm have a mean of 0.0 and a variance of 1.0

  3. calculate the 2 principal component model, visualize the principal component scores

  4. remove the 2nd principal component by setting the associated component scores to 0.0

  5. reverse the principal component by matrix multiplication of the scores with the component loadings

  6. apply matrix math to restore the original mean and variance

nComp = 1
f, ((ax201, ax202), (ax203, ax204), (ax205, ax206)) = plt.subplots(3, 2,figsize=(10,15))
f.subplots_adjust(wspace=0.5,hspace = 0.3)

ax201.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax201.set_title('1. LogPerm vs. Por'); ax201.set_xlabel('Por'); ax201.set_ylabel('LogPerm')
ax201.set_xlim([8,22]); ax201.set_ylim([0,2.5]); add_grid2(ax201)

mu = np.mean(np.vstack((my_data_por_perm["Por"].values,my_data_por_perm["LogPerm"].values)), axis=1)
sd = np.std(np.vstack((my_data_por_perm["Por"].values,my_data_por_perm["LogPerm"].values)), axis=1)
x = StandardScaler().fit_transform(my_data_por_perm[["Por","LogPerm"]].values) # standardize the data features to mean = 0, var = 1.0

ax202.scatter(x[:,0],x[:,1], s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax202.set_title('2. Standardized LogPerm vs. Por'); ax202.set_xlabel('Standardized Por'); ax202.set_ylabel('Standardized LogPerm')
ax202.set_xlim([-3.5,3.5]); ax202.set_ylim([-3.5,3.5]); add_grid2(ax202)

n_components = 2                                          # build principal component model with 2 components
pca = PCA(n_components=n_components)
pca.fit(x)

x_trans = pca.transform(x)                                # calculate principal component scores
ax203.scatter(x_trans[:,0],-1*x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax203.set_title('3. Principal Component Scores'); ax203.set_xlabel('PC1'); ax203.set_ylabel('PC2')
ax203.set_xlim([-3.5,3.5]); ax203.set_ylim([-3.5,3.5]); add_grid2(ax203)

x_trans[:,1] = 0.0                                         # zero / remove the 2nd principal component 

ax204.scatter(x_trans[:,0],x_trans[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax204.set_title('4. Only 1st Principal Component Scores'); ax204.set_xlabel('PC1'); ax204.set_ylabel('PC2')
ax204.set_xlim([-3.5,3.5]); ax204.set_ylim([-3.5,3.5]); add_grid2(ax204)

xhat = pca.inverse_transform(x_trans)                             # reverse the principal component scores to standardized values
ax205.scatter(xhat[:,0],xhat[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax205.set_title('5. Reverse PCA'); ax205.set_xlabel('Standardized Por'); ax205.set_ylabel('Standardized LogPerm')
ax205.set_xlim([-3.5,3.5]); ax205.set_ylim([-3.5,3.5]); add_grid2(ax205)

xhat = np.dot(pca.transform(x)[:,:nComp], pca.components_[:nComp,:]) # reconstruct standardized features from the first nComp scores
xhat = sd*xhat + mu                                       # remove the standardization

ax206.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="blue", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.6, linewidths=1.0, edgecolors="black")
ax206.scatter(xhat[:,0],xhat[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax206.set_title('6. De-standardized Reverse PCA'); ax206.set_xlabel('Por'); ax206.set_ylabel('LogPerm')
ax206.set_xlim([8,22]); ax206.set_ylim([0,2.5]); add_grid2(ax206)

plt.show()

Let’s put the original data and the resulting lower dimensional model side-by-side and check the resulting variances.

f, (ax201, ax206) = plt.subplots(1, 2,figsize=(10,6))
f.subplots_adjust(wspace=0.5,hspace = 0.3)

ax201.scatter(my_data_por_perm["Por"],my_data_por_perm["LogPerm"],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax201.set_title('1. LogPerm vs. Por'); ax201.set_xlabel('Por'); ax201.set_ylabel('LogPerm')
ax201.set_xlim([8,22]); ax201.set_ylim([0,2.5]); add_grid2(ax201)

ax206.scatter(xhat[:,0],xhat[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.8, linewidths=1.0, edgecolors="black")
ax206.set_title('6. De-standardized Reverse PCA'); ax206.set_xlabel('Por'); ax206.set_ylabel('LogPerm')
ax206.set_xlim([8,22]); ax206.set_ylim([0,2.5]); add_grid2(ax206)
plt.show()

var_por = np.var(my_data_por_perm["Por"]); var_por_hat = np.var(xhat[:,0]);
var_logperm = np.var(my_data_por_perm["LogPerm"]); var_logperm_hat = np.var(xhat[:,1]);
print('Variance Por =',np.round(var_por,3),', Variance Reduced Dimensional Por =',np.round(var_por_hat,3),'Fraction = ',np.round(var_por_hat/var_por,3))
print('Variance LogPerm =',np.round(var_logperm,3),', Variance Reduced Dimensional LogPerm =',np.round(var_logperm_hat,3),'Fraction = ',np.round(var_logperm_hat/var_logperm,3))
print('Total Variance =',np.round(var_por + var_logperm,3), ', Total Variance Reduced Dimension =',np.round(var_por_hat+var_logperm_hat,3),'Fraction = ',np.round((var_por_hat+var_logperm_hat)/(var_por+var_logperm),3))
Variance Por = 7.89 , Variance Reduced Dimensional Por = 7.073 Fraction =  0.896
Variance LogPerm = 0.151 , Variance Reduced Dimensional LogPerm = 0.136 Fraction =  0.896
Total Variance = 8.041 , Total Variance Reduced Dimension = 7.208 Fraction =  0.896

We got a model that explains 89.6% of the variance for both porosity and log permeability.

All Predictor Features#

We will go back to the original data file and this time extract all 6 predictor variables and the first 500 samples.

my_data_f6 = my_data.iloc[0:500,0:6]                      # extract the 6 predictors, 500 samples

It is a good idea to start with the summary statistics for our data.

my_data_f6.describe().transpose()                         # calculate summary statistics for the data
count mean std min 25% 50% 75% max
Por 500.0 14.89936 2.985967 5.40 12.8500 14.900 17.0125 23.85
LogPerm 500.0 1.40010 0.409616 0.18 1.1475 1.380 1.6700 2.58
AI 500.0 2.99244 0.563674 1.21 2.5900 3.035 3.3725 4.70
Brittle 500.0 49.74682 15.212123 0.00 39.3125 49.595 59.2075 93.47
TOC 500.0 0.99800 0.503635 0.00 0.6400 0.960 1.3500 2.71
VR 500.0 1.99260 0.307434 0.90 1.8200 2.010 2.1725 2.84

Let’s also calculate a correlation matrix and view it.

corr_matrix = np.corrcoef(my_data_f6, rowvar = False)
print(np.around(corr_matrix,2))                           # print the correlation matrix to 2 decimals
[[ 1.    0.79 -0.49 -0.25  0.71  0.12]
 [ 0.79  1.   -0.32 -0.13  0.48  0.04]
 [-0.49 -0.32  1.    0.14 -0.53  0.47]
 [-0.25 -0.13  0.14  1.   -0.24  0.24]
 [ 0.71  0.48 -0.53 -0.24  1.    0.35]
 [ 0.12  0.04  0.47  0.24  0.35  1.  ]]

We will need to standardize each variable to have a mean of zero and a variance of one. Let’s do that and check the results. In the console below we print all the initial and standardized means and variances for all 6 predictors.

features = ['Por','LogPerm','AI','Brittle','TOC','VR']
x_f6 = my_data_f6.loc[:,features].values
mu_f6 = np.mean(x_f6, axis=0)
sd_f6 = np.std(x_f6, axis=0)
x_f6 = StandardScaler().fit_transform(x_f6)
 
print("Original Means", features[:], np.round(mu_f6[:],2)) 
print("Original StDevs", features[:],np.round(sd_f6[:],2)) 
print('Mean Transformed =',features[:],np.round(x_f6.mean(axis=0),2))
print('Variance Transformed =',features[:],np.round(x_f6.var(axis=0),2))
Original Means ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [14.9   1.4   2.99 49.75  1.    1.99]
Original StDevs ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [ 2.98  0.41  0.56 15.2   0.5   0.31]
Mean Transformed = ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [0. 0. 0. 0. 0. 0.]
Variance Transformed = ['Por', 'LogPerm', 'AI', 'Brittle', 'TOC', 'VR'] [1. 1. 1. 1. 1. 1.]

We should also check the univariate distributions for each variable.

f, (ax6,ax7,ax8,ax9,ax10,ax11) = plt.subplots(1, 6, sharey=True, figsize=(15,2))
ax6.hist(x_f6[:,0], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax6.set_title('Std. Porosity'); ax6.set_xlim(-5,5)
ax7.hist(x_f6[:,1], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax7.set_title('Std. Log[Perm.]'); ax7.set_xlim(-5,5)
ax8.hist(x_f6[:,2], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax8.set_title('Std. Acoustic Imped.'); ax8.set_xlim(-5,5)
ax9.hist(x_f6[:,3], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax9.set_title('Std. Brittleness'); ax9.set_xlim(-5,5)
ax10.hist(x_f6[:,4], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax10.set_title('Std. Total Organic C'); ax10.set_xlim(-5,5)
ax11.hist(x_f6[:,5], alpha = 0.8, color = 'darkorange', edgecolor = 'black', bins=20); ax11.set_title('Std. Vit. Reflectance'); ax11.set_xlim(-5,5)
plt.show()

The summary statistics and distributions look good. No obvious missing data, gaps, significant truncations, spikes or outliers. We are ready to perform principal component analysis on our 6 features.

n_components = 6
pca_f6 = PCA(n_components=n_components)
pca_f6.fit(x_f6)

print(np.round(pca_f6.components_,3))                     # visualize the component loadings
[[ 0.558  0.476 -0.405 -0.211  0.504  0.01 ]
 [-0.117 -0.114 -0.432 -0.323 -0.229 -0.794]
 [-0.019 -0.124  0.384 -0.898  0.07   0.157]
 [-0.214 -0.674 -0.424 -0.006  0.526  0.21 ]
 [-0.784  0.522 -0.031 -0.046  0.331 -0.019]
 [ 0.12  -0.138  0.566  0.206  0.55  -0.549]]

Let’s look at the component loadings first. Each row is a component, top row is the first principal component (PC1), next row is the second principal component (PC2) up to the last row the sixth principal component (PC6). The columns are the features ordered from ‘Por’, ‘LogPerm’, ‘AI’, ‘Brittle’, ‘TOC’, to ‘VR’.

The first principal component is mainly composed of porosity, log permeability, acoustic impedance and total organic carbon, suggesting that the way they vary together is responsible for much of the variance. The next principal component is mainly composed of vitrinite reflectance. The third principal component is mainly composed of brittleness and so on.
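
To make this interpretation easier, the loadings can be wrapped in a labeled DataFrame; a minimal sketch, assuming pca_f6 and features from above:

loadings = pd.DataFrame(np.round(pca_f6.components_,3),columns = features, # features on the columns
    index = ['PC' + str(i+1) for i in range(pca_f6.n_components_)]) # principal components on the rows
print(loadings)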

Scree Plots#

To assist in this interpretation we should consider the variance contributions from each principal component.

print('Variance explained by PC1 thru PC6 =', np.round(pca_f6.explained_variance_ratio_,3))

f, (ax10, ax11) = plt.subplots(1, 2,figsize=(10,6))
f.subplots_adjust(wspace=0.5,hspace = 0.3)

ax10.plot(np.arange(1,7,1),pca_f6.explained_variance_ratio_*100,color='darkorange',alpha=0.8)
ax10.scatter(np.arange(1,7,1),pca_f6.explained_variance_ratio_*100,color='darkorange',alpha=0.8,edgecolor='black')
ax10.set_xlabel('Principal Component'); ax10.set_ylabel('Variance Explained'); ax10.set_title('Variance Explained by Principal Component')
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
yticks = mtick.FormatStrFormatter(fmt); ax10.set_xlim(1,6); ax10.set_ylim(0,100.0)
ax10.yaxis.set_major_formatter(yticks); add_grid2(ax10)

ax11.plot(np.arange(1,7,1),np.cumsum(pca_f6.explained_variance_ratio_*100),color='darkorange',alpha=0.8)
ax11.scatter(np.arange(1,7,1),np.cumsum(pca_f6.explained_variance_ratio_*100),color='darkorange',alpha=0.8,edgecolor='black')
ax11.plot([1,6],[95,95], color='black',linestyle='dashed')
ax11.set_xlabel('Principal Component'); ax11.set_ylabel('Cumulative Variance Explained'); ax11.set_title('Cumulative Variance Explained by Principal Component')
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
yticks = mtick.FormatStrFormatter(fmt); ax11.set_xlim(1,6); ax11.set_ylim(0,100.0); ax11.annotate('95% variance explained',[4.05,90])
ax11.yaxis.set_major_formatter(yticks); add_grid2(ax11)

plt.show()
Variance explained by PC1 thru PC6 = [0.462 0.246 0.149 0.11  0.024 0.009]

We can see that about 46% of the variance is described by the 1st principal component and then about 25% is described by the 2nd principal component etc.
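
A common use of the scree plot is to choose the smallest number of principal components that reaches a variance threshold; a minimal sketch, assuming pca_f6 from above:

cum_var = np.cumsum(pca_f6.explained_variance_ratio_)     # cumulative proportion of variance explained
n_keep = int(np.argmax(cum_var >= 0.95) + 1)              # first component count to reach 95%
print('Principal components to explain 95% of the variance =',n_keep) # 4 with the variance ratios above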

Independence of Principal Component Scores#

Let’s check the pairwise feature correlations before and after the projection.

print('\nCorrelation Matrix of the 6 original features:')
print(np.round(np.corrcoef(x_f6, rowvar = False),2))

print('\nCorrelation Matrix of the 6 principal component scores:')
print(np.round(np.corrcoef(pca_f6.transform(x_f6), rowvar = False),2))
Correlation Matrix of the 6 original features:
[[ 1.    0.79 -0.49 -0.25  0.71  0.12]
 [ 0.79  1.   -0.32 -0.13  0.48  0.04]
 [-0.49 -0.32  1.    0.14 -0.53  0.47]
 [-0.25 -0.13  0.14  1.   -0.24  0.24]
 [ 0.71  0.48 -0.53 -0.24  1.    0.35]
 [ 0.12  0.04  0.47  0.24  0.35  1.  ]]

Correlation Matrix of the 6 principal component scores:
[[ 1. -0.  0. -0.  0. -0.]
 [-0.  1. -0. -0.  0. -0.]
 [ 0. -0.  1. -0.  0.  0.]
 [-0. -0. -0.  1.  0. -0.]
 [ 0.  0.  0.  0.  1.  0.]
 [-0. -0.  0. -0.  0.  1.]]

The new projected features (even without dimensionality reduction, \(p=m\)) all have pairwise correlations of 0.0!

  • all the projected features are linearly independent of each other

Reduced Dimensionality Impact on a 2 Feature Relationship#

It would be interesting to look just at the porosity vs. log permeability bivariate relationship when we retain \(1,\ldots,6\) principal components.

  • to do this we use matrix math to reverse the PCA and the standardization for various numbers of retained principal components and then plot the scatter plots of log permeability vs. porosity.

nComp = 6
xhat_6d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_6d = sd_f6*xhat_6d + mu_f6

nComp = 5
xhat_5d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_5d = sd_f6*xhat_5d + mu_f6

nComp = 4
xhat_4d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_4d = sd_f6*xhat_4d + mu_f6

nComp = 3
xhat_3d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_3d = sd_f6*xhat_3d + mu_f6

nComp = 2
xhat_2d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_2d = sd_f6*xhat_2d + mu_f6

nComp = 1
xhat_1d = np.dot(pca_f6.transform(x_f6)[:,:nComp], pca_f6.components_[:nComp,:])
xhat_1d = sd_f6*xhat_1d + mu_f6

f, (ax12, ax13, ax14, ax15, ax16, ax17, ax18) = plt.subplots(1, 7,figsize=(20,20))
f.subplots_adjust(wspace=0.7)

ax12.scatter(my_data_f6["Por"],my_data_f6["LogPerm"],s=None, c="darkorange",marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax12.set_title('Original Data'); ax12.set_xlabel('Por'); ax12.set_ylabel('LogPerm')
ax12.set_ylim(0.0,3.0); ax12.set_xlim(8,22); ax12.set_aspect(4.0); 

ax13.scatter(xhat_1d[:,0],xhat_1d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax13.set_title('1 Principal Component'); ax13.set_xlabel('Por'); ax13.set_ylabel('LogPerm')
ax13.set_ylim(0.0,3.0); ax13.set_xlim(8,22); ax13.set_aspect(4.0)

ax14.scatter(xhat_2d[:,0],xhat_2d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax14.set_title('2 Principal Components'); ax14.set_xlabel('Por'); ax14.set_ylabel('LogPerm')
ax14.set_ylim(0.0,3.0); ax14.set_xlim(8,22); ax14.set_aspect(4.0)

ax15.scatter(xhat_3d[:,0],xhat_3d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax15.set_title('3 Principal Components'); ax15.set_xlabel('Por'); ax15.set_ylabel('LogPerm')
ax15.set_ylim(0.0,3.0); ax15.set_xlim(8,22); ax15.set_aspect(4.0)

ax16.scatter(xhat_4d[:,0],xhat_4d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax16.set_title('4 Principal Components'); ax16.set_xlabel('Por'); ax16.set_ylabel('LogPerm')
ax16.set_ylim(0.0,3.0); ax16.set_xlim(8,22); ax16.set_aspect(4.0)

ax17.scatter(xhat_5d[:,0],xhat_5d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax17.set_title('5 Principal Components'); ax17.set_xlabel('Por'); ax17.set_ylabel('LogPerm')
ax17.set_ylim(0.0,3.0); ax17.set_xlim(8,22); ax17.set_aspect(4.0)

ax18.scatter(xhat_6d[:,0],xhat_6d[:,1],s=None, c="darkorange", marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=0.2, linewidths=1.0, edgecolors="black")
ax18.set_title('6 Principal Components'); ax18.set_xlabel('Por'); ax18.set_ylabel('LogPerm')
ax18.set_ylim(0.0,3.0); ax18.set_xlim(8,22); ax18.set_aspect(4.0)

plt.show()

Very interesting to watch the accuracy of the bivariate relationship between log permeability and porosity improve as we include more components. Let’s check the variances.

print('1 Principal Component : Variance Por =',np.round(np.var(xhat_1d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_1d[:,1])/(sd_f6[1]*sd_f6[1]),2))

print('2 Principal Components: Variance Por =',np.round(np.var(xhat_2d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_2d[:,1])/(sd_f6[1]*sd_f6[1]),2))

print('3 Principal Components: Variance Por =',np.round(np.var(xhat_3d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_3d[:,1])/(sd_f6[1]*sd_f6[1]),2))

print('4 Principal Components: Variance Por =',np.round(np.var(xhat_4d[:,0])/(sd_f6[0]*sd_f6[0]),2),' Variance Log Perm = ',np.round(np.var(xhat_4d[:,1])/(sd_f6[1]*sd_f6[1]),2))

print('5 Principal Components: Variance Por =',np.round(np.var(xhat_5d[:,0])/(sd_f6[0]*sd_f6[0]),2),'  Variance Log Perm = ',np.round(np.var(xhat_5d[:,1])/(sd_f6[1]*sd_f6[1]),2))

print('6 Principal Components: Variance Por =',np.round(np.var(xhat_6d[:,0])/(sd_f6[0]*sd_f6[0]),2),'  Variance Log Perm = ',np.round(np.var(xhat_6d[:,1])/(sd_f6[1]*sd_f6[1]),2))
1 Principal Component : Variance Por = 0.86  Variance Log Perm =  0.63
2 Principal Components: Variance Por = 0.88  Variance Log Perm =  0.65
3 Principal Components: Variance Por = 0.88  Variance Log Perm =  0.66
4 Principal Components: Variance Por = 0.91  Variance Log Perm =  0.96
5 Principal Components: Variance Por = 1.0   Variance Log Perm =  1.0
6 Principal Components: Variance Por = 1.0   Variance Log Perm =  1.0

This is interesting. With the first principal component we describe 86% of the porosity variance. The next two principal components do not provide much assistance. Then there is a jump with the 4th and 5th principal components.

  • of course, the problem is 6 dimensional, not just porosity vs. log permeability, but it is interesting to see the relationship between the number of retained principal components and the variance retained for each of these 2 original features

  • principal components do not uniformly describe each feature
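
As an aside, the repeated reverse projection blocks above could be wrapped in a small helper function; a minimal sketch, assuming pca_f6, x_f6, mu_f6, sd_f6 and xhat_3d from above (the reverse_pca name is my own):

def reverse_pca(pca_model,x_std,mu,sd,nComp):             # reverse projection retaining the first nComp components
    scores = pca_model.transform(x_std)[:,:nComp]         # retained principal component scores
    xhat_std = np.dot(scores,pca_model.components_[:nComp,:]) # reverse the projection to standardized features
    return sd*xhat_std + mu                               # remove the standardization

print(np.allclose(reverse_pca(pca_f6,x_f6,mu_f6,sd_f6,nComp = 3),xhat_3d)) # expected True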

Reduced Dimensionality Impact on Matrix Scatter Plots of All Features#

Let’s look at the matrix scatter plots to see all of the bivariate combinations.

  • first some bookkeeping, we have to put the reduced dimensionality models in DataFrames (they are currently NumPy ndarrays).

df_1d = pd.DataFrame(data=xhat_1d,columns=features)   
df_2d = pd.DataFrame(data=xhat_2d,columns=features)
df_3d = pd.DataFrame(data=xhat_3d,columns=features)
df_4d = pd.DataFrame(data=xhat_4d,columns=features)
df_5d = pd.DataFrame(data=xhat_5d,columns=features)
df_6d = pd.DataFrame(data=xhat_6d,columns=features)

Now we can go ahead and produce the matrix scatter plots with these DataFrames. It is very interesting to see the accuracy of the bivariate plots improve as we add principal components. Also, with only two principal components we capture some of the bivariate relationships quite well for some of the variable pairs.

fig = plt.figure()

pd_plot.scatter_matrix(my_data_f6, alpha = 0.1,           # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('Original Data')

pd_plot.scatter_matrix(df_1d, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('1 Principal Component')

pd_plot.scatter_matrix(df_2d, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('2 Principal Components')

pd_plot.scatter_matrix(df_3d, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('3 Principal Components')

pd_plot.scatter_matrix(df_4d, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('4 Principal Components')

pd_plot.scatter_matrix(df_5d, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('5 Principal Components')

pd_plot.scatter_matrix(df_6d, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('6 Principal Components')

plt.show()
<Figure size 640x480 with 0 Axes>

Principal Components Analysis on Uncorrelated Data#

Let’s try one more test, principal components analysis on uncorrelated data.

  • we generate a large number of random samples (n is large) for 5 features

  • we will assume a uniform distribution

x_rand = np.random.rand(10000,5); df_x_rand = pd.DataFrame(x_rand)
print('Variance of original features: ', np.round(np.var(x_rand, axis = 0),2))
print('Proportion of variance of original features: ', np.round(np.var(x_rand, axis = 0)/np.sum(np.var(x_rand, axis = 0)),2))
print('Covariance Matrix of original features:\n'); print(np.round(np.cov(x_rand, rowvar = False),2)); print()

pd_plot.scatter_matrix(df_x_rand, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('Original Features')

pca_rand = PCA(n_components=5)
pca_rand.fit(x_rand)
print('PCA Variance Explained ', np.round(pca_rand.explained_variance_ratio_,2))  

scores_x_rand = pca_rand.transform(x_rand); df_scores_x_rand = pd.DataFrame(scores_x_rand)

print('\nCovariance Matrix of scores:\n'); print(np.round(np.cov(scores_x_rand, rowvar = False),2)); print()

pd_plot.scatter_matrix(df_scores_x_rand, alpha = 0.1,                # pandas matrix scatter plot
    figsize=(6, 6),color = 'black', hist_kwds={'color':['grey']})
plt.suptitle('Principal Component Scores')
Variance of original features:  [0.08 0.08 0.08 0.08 0.08]
Proportion of variance of original features:  [0.2 0.2 0.2 0.2 0.2]
Covariance Matrix of original features:

[[ 0.08  0.   -0.    0.   -0.  ]
 [ 0.    0.08  0.    0.   -0.  ]
 [-0.    0.    0.08 -0.   -0.  ]
 [ 0.    0.   -0.    0.08  0.  ]
 [-0.   -0.   -0.    0.    0.08]]
PCA Variance Explained  [0.2  0.2  0.2  0.2  0.19]

Covariance Matrix of scores:

[[ 0.09  0.   -0.   -0.    0.  ]
 [ 0.    0.08 -0.    0.    0.  ]
 [-0.   -0.    0.08  0.    0.  ]
 [-0.    0.    0.    0.08 -0.  ]
 [ 0.    0.    0.   -0.    0.08]]
Text(0.5, 0.98, 'Principal Component Scores')

What happens when principal component analysis is applied to uncorrelated, uniformly distributed features?

  • all the principal components describe approximately the same amount of variance

  • there is no opportunity for dimensionality reduction through feature projection

  • the linear combination of independent random variables invokes the central limit theorem; the principal component scores tend toward a Gaussian distribution (see the rounding of the point clouds in the matrix scatter plots above)

Comments#

There are so many more tests that one could attempt to gain experience with principal component analysis. I’ll end here for brevity, but I invite you to continue. Consider, on your own, applying this workflow to other datasets or attempting related methods such as kernel PCA or principal component regression. I hope you found this tutorial useful. I’m always happy to discuss geostatistics, statistical modeling, uncertainty modeling and machine learning,

Michael

The Author:#

Michael Pyrcz, Professor, The University of Texas at Austin Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin

More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn#