Multidimensional Scaling#

Michael J. Pyrcz, Professor, The University of Texas at Austin

Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn

Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.

Cite this e-Book as:

Pyrcz, M.J., 2024, Applied Machine Learning in Python: a Hands-on Guide with Code, https://geostatsguy.github.io/MachineLearningDemos_Book.

The workflows in this book and more are available here:

Cite the MachineLearningDemos GitHub Repository as:

Pyrcz, M.J., 2024, MachineLearningDemos: Python Machine Learning Demonstration Workflows Repository (0.0.1). Zenodo. DOI

By Michael J. Pyrcz
© Copyright 2024.

This chapter is a tutorial for / demonstration of Multidimensional Scaling.

YouTube Lecture: check out my lectures on:

These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael’s Story.

Motivation for Multidimensional Scaling#

Working with more features / variables is harder!

  1. More difficult to visualize data and model

  2. More data are required to infer the joint probabilities

  3. Less data coverage of feature space

  4. More difficult to interrogate / check the model

  5. More likely redundant features, e.g., multicollinearity, resulting in model instability

  6. More computational effort, more computational resources and longer run times

  7. More complicated model is more likely overfit

  8. More professional time for model construction

We get a better model with fewer, informative features than by throwing all of our features into the model!

Inferential Machine Learning#

There are no response features, \(y\), just predictor features,

\[ X_1,\ldots,X_m \]
  • the machine learns, by mimicry, a compact representation of the data

  • Captures patterns as feature projections, group assignments, neural network latent features, etc.

  • We focus on inference of the population, the natural system, instead of prediction of response features.

Multidimensional Scaling#

A powerful ordination method in inferential statistics / information visualization for exploring / visualizing the similarity (conversely the difference) between individual samples from a high dimensional dataset.

  • beyond 2 or 3 features it is difficult to visualize the relationships within our datasets

  • for 2 features we can easily visualize the relationships between data samples with a regular scatter plot

  • for 3 features we can visualize our scatter plot in 3D (e.g., with interactive rotation), include the third feature as color, or marginalize with matrix scatter plots

Multidimensional scaling projects the \(m\) dimensional data to \(p\) dimensions such that \(p \ll m\).

  • ideally we are able to project to \(p=2\) to easily explore the relationships between the samples

While principal component analysis (PCA) operates with the covariance matrix, multidimensional scaling operates with the distance / dissimilarity matrix. For multidimensional scaling:

  • you don’t need to know the actual feature values, just the distance or dissimilarity between the samples

  • as with any distance in feature space, we consider feature standardization to ensure that features with larger variance do not dominate the calculation

  • we may work with a variety of dissimilarity measures

Classical Multidimensional Scaling#

Based on Euclidean distance between samples.

The Steps:

  1. calculate the square, symmetric matrix of squared pairwise distances:

\[\begin{split} D^{(2)} = \begin{bmatrix} \delta_{1,1}^2 & \delta_{1,2}^2 & \dots & \delta_{1,n}^2 \\ \delta_{2,1}^2 & \delta_{2,2}^2 & \dots & \delta_{2,n}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \delta_{n,1}^2 & \delta_{n,2}^2 & \dots & \delta_{n,n}^2 \\ \end{bmatrix} \end{split}\]

where \(\delta_{i,j}^2\) is the squared Euclidean distance between the data samples at data indices \(i\) and \(j\).

  2. apply double centering, \(B = - \frac{1}{2} J D^{(2)} J\), where \(J = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T\) is the centering matrix

  3. solve for the eigenvalues, \(\lambda_1,\ldots,\lambda_p\), and eigenvectors of \(B\)

  4. solve for the projected coordinates, \(x^{'}_1, \dots ,x^{'}_p\)

\[ X^{'} = E_p \Lambda^{\frac{1}{2}}_p \]

where \(E_p\) is the matrix of the top \(p\) eigenvectors and \(\Lambda_p\) is the diagonal matrix of the corresponding eigenvalues.
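
These steps can be sketched in a few lines with NumPy; this is a minimal, illustrative sketch on a small, hypothetical random array (the variable names X, p and X_prime are assumptions for the demonstration), while the actual workflow below relies on scikit-learn's MDS.

import numpy as np                                            # arrays and linear algebra
from sklearn.metrics.pairwise import euclidean_distances      # pairwise Euclidean distances

np.random.seed(73073)                                         # set the random number seed for repeatability
X = np.random.rand(10, 3)                                     # 10 hypothetical samples with m = 3 features
p = 2                                                         # target dimensionality

D2 = euclidean_distances(X, squared=True)                     # n x n squared pairwise distance matrix, D^(2)
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                           # centering matrix, J = I - (1/n) 11^T
B = -0.5 * J @ D2 @ J                                         # double centering
eigenvalues, eigenvectors = np.linalg.eigh(B)                 # eigen-solve the symmetric matrix B
order = np.argsort(eigenvalues)[::-1]                         # order the eigenvalues, largest first
eigenvalues = eigenvalues[order]; eigenvectors = eigenvectors[:, order]
X_prime = eigenvectors[:, :p] * np.sqrt(eigenvalues[:p])      # X' = E_p Lambda_p^(1/2), the n x p projected coordinates
print(X_prime.shape)                                          # (10, 2)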

General comments about classical multidimensional scaling:

  • nonlinear dimensionality reduction

  • no distribution assumption

  • the transform may not be unique, and may be arbitrarily translated, rotated and reflected (note, these operations do not change the pairwise distances \(\delta_{i,j}\)).

Metric Multidimensional Scaling#

A generalization of classical multidimensional scaling with a variety of metrics and a loss function optimization.

  • formulated as an optimization problem to minimize the square difference between the original and projected pairwise distances

\[ \min_{x_1,\ldots,x_n} \sum_{i<j} \left( ||x_i - x_j|| - \delta_{i,j} \right)^2 \]

where \(||x_i - x_j||\) are the pairwise distances in the projected space (\(p\) dimensional) and \(\delta_{i,j}\) are the pairwise distances in the original feature space.

General comments about metric multidimensional scaling:

  • dissimilarity measure must be meaningful

  • dimensionality reduction is performed such that the error in the sample pairwise distance is minimized

  • there is a variant known as non-metric multidimensional scaling for ordinal features (categorical with ordering).
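
In scikit-learn both variants are available through the same MDS class; here is a minimal, illustrative sketch of the instantiations (the variable names are assumptions for the demonstration), not part of the workflow below.

from sklearn.manifold import MDS                              # multidimensional scaling

metric_mds = MDS(n_components=2, metric=True, random_state=73073)      # metric MDS (the default)
nonmetric_mds = MDS(n_components=2, metric=False, random_state=73073)  # non-metric MDS for ordinal (rank-based) dissimilarities
precomputed_mds = MDS(n_components=2, dissimilarity='precomputed', random_state=73073) # supply your own n x n dissimilarity matrix to fit_transform

The last instantiation relates to the point above that we only need the dissimilarities between the samples, not the feature values themselves.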

Checking Multidimensional Scaling Results#

The multidimensional scaling approach minimizes the squared difference between the pairwise distances of all data sample pairs in the original feature space and in the projected, lower dimensional space.

  • stress is defined as:

\[ Stress_P(x_1,\ldots,x_n) = \left( \sum_{i \ne j = 1,\ldots,n} \left( ||x_i - x_j|| - \delta_{i,j} \right)^2 \right)^{\frac{1}{2}} \]
  • it is also useful to visualize the scatterplot of projected vs. original pairwise distances
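
Here is a minimal sketch of this check, assuming arrays of original standardized features and projected coordinates (the names X, X_prime and raw_stress are assumptions for the demonstration); it evaluates the stress equation above directly.

import numpy as np                                            # arrays
from sklearn.metrics.pairwise import euclidean_distances      # pairwise Euclidean distances

def raw_stress(X, X_prime):                                   # stress from the equation above
    delta = euclidean_distances(X)                            # original pairwise distances, delta_ij
    d = euclidean_distances(X_prime)                          # projected pairwise distances, ||x_i - x_j||
    return np.sqrt(np.sum((d - delta)**2))                    # square root of the sum of squared differences over all pairs

# e.g., raw_stress(df[nXname].values, MDS_transformed) after running the workflow below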

Comparison with Principal Component Analysis#

MDS and PCA are quite different; here’s a short summary of each to help elucidate the difference:

Principal component analysis takes the covariance matrix (\(m \times m\)) between all the features and finds the orthogonal rotation such that the variance is maximized over the ordered principal components.

Multidimensional scaling takes the matrix of the pairwise distances (\(n \times n\)) between all the samples in feature space and finds the nonlinear projection such that the error in the pairwise distances is minimized.
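
To make the contrast concrete, here is a minimal, illustrative sketch (on a hypothetical random array X with n = 200 samples and m = 3 features) of the two matrices each method starts from.

import numpy as np                                            # arrays
from sklearn.metrics.pairwise import euclidean_distances      # pairwise Euclidean distances

np.random.seed(73073)                                         # set the random number seed for repeatability
X = np.random.rand(200, 3)                                    # n = 200 hypothetical samples, m = 3 features

C = np.cov(X, rowvar=False)                                   # PCA starts from the m x m covariance matrix
D = euclidean_distances(X)                                    # MDS starts from the n x n pairwise distance matrix
print(C.shape, D.shape)                                       # (3, 3) (200, 200)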

Load the Required Libraries#

The following code loads the required libraries. These should have been installed with Anaconda 3.

ignore_warnings = True                                        # ignore warnings?
import numpy as np                                            # ndarrays for gridded data
import pandas as pd                                           # DataFrames for tabular data
import copy                                                   # for deep copies
import os                                                     # set working directory, run executables
import matplotlib.pyplot as plt                               # for plotting
from matplotlib.colors import ListedColormap                  # custom color maps
from matplotlib.ticker import (MultipleLocator,AutoMinorLocator,NullLocator,FuncFormatter) # control of axes ticks
from scipy import stats                                       # summary statistics
import math                                                   # trigonometry etc.
import scipy.signal as signal                                 # kernel for moving window calculation
import random                                                 # for random numbers
import seaborn as sns                                         # for matrix scatter plots
from scipy import linalg                                      # for linear regression
from sklearn.preprocessing import StandardScaler # feature standardization
from sklearn.manifold import MDS                              # multidimensional scaling
from sklearn.metrics.pairwise import euclidean_distances
plt.rc('axes', axisbelow=True)                                # plot all grids below the plot elements
if ignore_warnings == True:                                   
    import warnings
    warnings.filterwarnings('ignore')
cmap = plt.cm.inferno                                         # color map
seed = 42                                                     # random number seed 

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.

Declare Functions#

Convenience functions to add commas to numbers and major and minor gridlines to plots.

def comma_format(x, pos):
    return f'{int(x):,}'
    
def add_grid():                                            
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add major and minor gridlines
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks 
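
For example, the comma_format function can be attached to an axis with matplotlib's FuncFormatter (imported above); a minimal usage sketch with a hypothetical plot:

plt.figure()                                                  # hypothetical plot to demonstrate the formatter
plt.plot([0,1],[0,100000],color='darkorange')                 # any plot with large y values
plt.gca().yaxis.set_major_formatter(FuncFormatter(comma_format)) # y axis tick labels now include thousands separators, e.g., 20,000
add_grid(); plt.show()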

Set the Working Directory#

I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time).

#os.chdir("c:/PGE383")                     # set the working directory

You will have to update the part in quotes with your own working directory and the format is different on a Mac (e.g. “~/PGE”).

Loading Data#

Let’s load the provided multivariate, spatial dataset unconv_MV_v4.csv available in my GeoDataSets repo. It is a comma delimited file with:

  • well index (integer)

  • porosity (%)

  • permeability (\(mD\))

  • acoustic impedance (\(\frac{kg}{m^3} \cdot \frac{m}{s} \cdot 10^6\)).

  • brittleness (%)

  • total organic carbon (%)

  • vitrinite reflectance (%)

  • initial gas production (90 day average) (MCFPD)

We load it with the pandas ‘read_csv’ function into a data frame we called ‘df’ and then preview it to make sure it loaded correctly.

Python Tip: to use functions from a package, just type the alias for the package that we declared at the beginning:

import pandas as pd

so we can access the pandas function ‘read_csv’ with the command:

pd.read_csv()

but read_csv has required input parameters. The essential one is the name of the file. For our circumstance, all the other default parameters are fine. If you want to see all the possible parameters for this function, just go to the docs here.

  • The docs are always helpful

  • There is often a lot of flexibility for Python functions, possible through various input parameters

Also, the function has an output, a pandas DataFrame loaded from the data, so we have to specify the name / variable representing that new object.

df = pd.read_csv("unconv_MV.csv")  

Let’s run this command to load the data and then this command to extract a random subset of the data.

df = df.sample(frac=.30, random_state = 73073); 
df = df.reset_index()
df = pd.read_csv(r'https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/unconv_MV_v4.csv')

index = ['Well']
Xname = ['Por','AI','TOC']                                    # specify features of interest

yname = ['Prod']
df =  df[index + Xname + yname]

Xunit = ['%',r'$kg/m^3 \cdot m/s \times 10^6$','%']; Xlabel = ['Porosity','Acoustic Impedance','Total Organic Carbon'] # specify features' metadata
Xlabelunit = [f"{s1} ({s2})" for s1, s2 in zip(Xlabel, Xunit)]
Xmin = [0.0,0.0,0.0]; Xmax = [30.0,5.0,100.0]                 # set minimums and maximums for visualization 
nXname = ['n' + feature for feature in Xname]                 # labels, units and ranges for normalized features

normalized = StandardScaler().fit_transform(df[Xname])        # standardize the features

for col_name, new_col in zip(nXname, normalized.T):           # add standardized features to the DataFrame
    df[col_name] = new_col

nXunit = ['norm[' + unit + ']' for unit in Xunit]             # specify the standardized features' metadata
nXlabel = ['Normalized ' + feature for feature in Xlabel]
nXlabelunit = [f"{s1} ({s2})" for s1, s2 in zip(nXlabel, nXunit)]
nXmin = [-3.0,-3.0,-3.0]; nXmax = [3.0,3.0,3.0]  
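
As a quick check that standardization worked as expected (a sketch to run after the cell above), each standardized feature should have a mean of approximately 0.0 and a standard deviation of approximately 1.0.

print(df[nXname].mean().round(2))                             # the means should all be ~0.0
print(df[nXname].std().round(2))                              # the standard deviations should all be ~1.0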

Visualize the Loaded DataFrame#

df.head(n=4)                                                  # we could also use this command for a table preview
Well Por AI TOC Prod nPor nAI nTOC
0 1 12.08 2.80 1.16 1695.360819 -0.982256 -0.298603 0.352948
1 2 12.38 3.22 0.89 3007.096063 -0.881032 0.444147 -0.209104
2 3 14.02 4.01 0.89 2531.938259 -0.327677 1.841224 -0.209104
3 4 17.67 2.63 1.08 5288.514854 0.903875 -0.599240 0.186414

Summary Statistics#

Let’s check the summary statistics of our data.

df.describe()
Well Por AI TOC Prod nPor nAI nTOC
count 200.000000 200.000000 200.000000 200.000000 200.000000 2.000000e+02 2.000000e+02 2.000000e+02
mean 100.500000 14.991150 2.968850 0.990450 3864.407081 2.486900e-16 4.130030e-16 3.375078e-16
std 57.879185 2.971176 0.566885 0.481588 1553.277558 1.002509e+00 1.002509e+00 1.002509e+00
min 1.000000 6.550000 1.280000 -0.190000 839.822063 -2.848142e+00 -2.986650e+00 -2.457313e+00
25% 50.750000 12.912500 2.547500 0.617500 2686.227611 -7.013606e-01 -7.451372e-01 -7.763606e-01
50% 100.500000 15.070000 2.955000 1.030000 3604.303506 2.660490e-02 -2.449306e-02 8.233024e-02
75% 150.250000 17.402500 3.345000 1.350000 4752.637555 8.136175e-01 6.652032e-01 7.484661e-01
max 200.000000 23.550000 4.630000 2.180000 8590.384044 2.887855e+00 2.937664e+00 2.476256e+00

The data looks to be in pretty good shape and for brevity we skip outlier detection. Let’s look at the distributions with a matrix scatter plot from the Seaborn package.

pairplot = sns.pairplot(df,vars=nXname,markers='o',plot_kws={'color':'grey','alpha': 0.8},)
pairplot.map_diag(sns.histplot, color='grey', kde=False)
pairplot.x_vars = Xlabelunit; pairplot.y_vars = Xlabelunit; pairplot._add_axis_labels()
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.2, top=1.2, wspace=0.3, hspace=0.2); plt.show()
_images/322a920d30e5891ae978cbd8c7837f204979312d742c54f31ab1193828b0cdd7.png

Data Preparation#

Let’s make an ordinal feature from the continuous production:

  1. low

  2. medium

  3. high

  4. very high

production rates. This will help us visualize the results as we proceed; we can look at wells with different levels of production projected into a variety of lower dimensional spaces with multidimensional scaling.

bins = [0,2500,5000,7500,10000]                               # assign the production bins (these are the fence posts)

labels = ['low', 'med', 'high', 'vhigh']                      # assign the labels
category = pd.cut(df[yname].values.reshape(-1),bins,labels=labels) # make the 1D array with the labels for our data
df['tProd'] = category                                        # add the new ordinal production feature to our DataFrames    
df.head()
dpalette = sns.color_palette("rocket_r",n_colors = 4)
palette = sns.color_palette("rocket")
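
It may also be helpful to check how many wells fall into each production level; a quick sketch assuming the DataFrame above.

print(df['tProd'].value_counts())                             # number of wells assigned to each production level
print(df.groupby('tProd')[yname[0]].mean().round(0))          # average production by level as a sanity check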

Let’s take a look at the matrix scatter plot of our 3 features and the production levels.

pairplot2 = sns.pairplot(df[nXname + ['tProd']],markers='o',hue = 'tProd', palette = dpalette,diag_kws={'edgecolor':'black'},
                    plot_kws=dict(s=50, edgecolor="black", linewidth=0.5))
pairplot2.x_vars = Xlabelunit; pairplot2.y_vars = Xlabelunit; pairplot2._add_axis_labels()
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.5, wspace=0.3, hspace=0.2); plt.show()
_images/2862000c378356e1a3b7086ca47073c0d92c817dc95b150a9d66584ac3e0515d.png

Multidimensional Scaling#

The multidimensional scaling method follows the same pattern as other scikit-learn methods; we:

  1. instantiate

  2. fit

  3. apply, transform or predict

Let’s run multidimensional scaling on our subset of features (\(m = 3\)) and project to only 2 features (\(p = 2\)).

  • we set the random_state for repeatability, so everyone gets the same result from the iterative solution

  • we use 20 random initializations and keep the best solution, to improve the likelihood of finding the global optimum rather than a local optimum

  • we use an increased max_iter to improve convergence

np.random.seed(seed)                                          # set the random number seed, so we all get the same answer
n_components = 2                                              # p, reduced dimensionality space
embedding = MDS(n_components=2,n_init = 20,max_iter = 1000,random_state = 73073) # instantiate and set the hyperparameter
MDS_transformed = embedding.fit_transform(df[nXname])
MDS_transformed.shape
(200, 2)

The output is 2 multidimensional scaling components. We have projected our 3 features to 2 features to minimize the error in pairwise distance between the samples. Let’s add the 2 components to our DataFrame.

df['MDS1'] = MDS_transformed[:,0]                             # add MDS projections to DataFrame
df['MDS2'] = MDS_transformed[:,1]
df.head()
Well Por AI TOC Prod nPor nAI nTOC tProd MDS1 MDS2
0 1 12.08 2.80 1.16 1695.360819 -0.982256 -0.298603 0.352948 low 0.619075 -0.641325
1 2 12.38 3.22 0.89 3007.096063 -0.881032 0.444147 -0.209104 med 0.872691 0.306021
2 3 14.02 4.01 0.89 2531.938259 -0.327677 1.841224 -0.209104 med 0.458548 1.846056
3 4 17.67 2.63 1.08 5288.514854 0.903875 -0.599240 0.186414 high -0.885125 -0.484969
4 5 17.52 3.18 1.51 2859.469624 0.853263 0.373409 1.081534 med -1.382408 0.413270

Let’s take a look at the samples projected into the new 2 dimensional feature space.

  • note that rotation and translation are arbitrary in this space; only the sample pairwise distances are meaningful.

plt.subplot(121)                                              # plot labeled MDS projection
pairplot = sns.scatterplot(x = df['MDS1'],y = df['MDS2'],hue = df['tProd'],markers='o',palette = dpalette,edgecolor="black")

plt.subplot(122)
pairplot = sns.scatterplot(x = df['MDS1'],y = df['MDS2'],hue = df['Well'],markers='o',edgecolor="black")

plt.subplots_adjust(left=0.0, bottom=0.0, right=2., top=1.1, wspace=0.3, hspace=0.2,); plt.show()
_images/878020c74fa0b6fc3c688626e51d63152eb408d458e31f001912e5bf1844201a.png

Some comments:

  • these are the actual samples in this space; we know their values, including the well number, to assist with interpretation

  • the general relationship between samples is preserved.

  • the general production transition from low to very high production is preserved.

Let’s check our model:

  • we will calculate the original and projected pairwise distances between all the samples

  • we will cross plot the original vs. the projected pairwise distances

  • we will plot the distribution of the ratio of projected / original pairwise distances

dists = euclidean_distances(df[nXname], squared=False).ravel() # calculate pairwise distance
nonzero = dists != 0                                          # select only non-identical samples pairs
dists = dists[nonzero]
projected_dists = euclidean_distances(MDS_transformed, squared=False).ravel()[nonzero] # calculate projected pairwise distance

plt.subplot(221)                                              # scatter plot original and projected pairwise distance
plt.scatter(dists,projected_dists,c='darkorange',alpha=0.4,edgecolor = 'black')
plt.arrow(0,0,200,200,width=0.02,color='black',head_length=0.0,head_width=0.0)
plt.xlim(0,15); plt.ylim(0,15); add_grid()
plt.xlabel("Pairwise Distance: original space"); plt.ylabel("Pairwise Distance: projected space")
plt.title("MDS Pairwise Distance: Projected to %d Components" % n_components)

ratio = projected_dists / dists                               # calculate projected distance/original distance ratio

plt.subplot(222)                                              # histogram of distance ratio
plt.hist(ratio, bins=50, range=(0.5, 1.5),color = 'darkorange',alpha = 0.8,edgecolor='k')
plt.xlabel("Distance Ratio: projected / original"); plt.ylabel("Frequency"); add_grid()
plt.title("MDS Pairwise Distance: Projected to %d Components" % n_components)
plt.annotate('Distance Ratio:',[0.55,9000])
plt.annotate('mean = ' + str(round(np.mean(ratio),2)),[0.61,8300])
plt.annotate('st.dev. = ' + str(round(np.std(ratio),2)),[0.61,7600])

plt.subplot(223)                                               # histogram of original pairwise distance
plt.hist(dists, bins=50, range=(0., 15.),color = 'darkorange',alpha = 0.8,edgecolor='k'); add_grid()
plt.xlabel("Pairwise Distance"); plt.ylabel("Frequency"); plt.title("MDS Pairwise Distance: Original Data"); plt.xlim(xmin=0)
plt.annotate('Original Pairwise Distance:',[9,4000]); plt.ylim([0,5500])
plt.annotate('mean = ' + str(round(np.mean(dists),2)),[10,3700])
plt.annotate('st.dev. = ' + str(round(np.std(dists),2)),[10,3400])

plt.subplot(224)                                               # histogram of projected pairwise distance
plt.hist(projected_dists, bins=50, range=(0., 15.),color = 'darkorange',alpha = 0.8, edgecolor='k')
plt.xlabel("Pairwise Distance"); plt.ylabel("Frequency"); add_grid(); plt.xlim(xmin=0); plt.ylim([0,5500])
plt.title("MDS Pairwise Distance: Projected to %d Components" % n_components)
plt.annotate('Original Pairwise Distance:',[9,4000])
plt.annotate('mean = ' + str(round(np.mean(projected_dists),2)),[10,3700])
plt.annotate('st.dev. = ' + str(round(np.std(projected_dists),2)),[10,3400])

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.6, top=1.7, wspace=0.2, hspace=0.3); plt.show()
_images/86303ca6a38106e6d81078053b2c24fac4ab9aed520d28d5f239bf009b7bb5d9.png

We projected to a 2 dimensional feature space and did a pretty good job preserving the pairwise distances between the samples.
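
We can also retrieve the final value of the stress loss directly from the fit model; a quick check assuming the embedding object fit above (scikit-learn reports this as the sum of squared differences between the disparities and the projected pairwise distances).

print('Final stress of the MDS solution: ' + str(round(embedding.stress_,1))) # final stress from the fit scikit-learn model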

Observing Specific Samples#

Let’s reduce the number of wells and post the well numbers to observe their spacings in the original and projected features spaces.

df_subset = df.sample(n = 30,random_state = seed+5).copy(deep = True)           # randomly sample 30 data without replacement
df_subset = df_subset.reset_index()

Let’s look at our dataset; we will visualize the 3 possible scatter plots between the three features with the sample points labeled by well number.

plt.subplot(131)
pairplot = sns.scatterplot(x = df_subset[nXname[0]],y = df_subset[nXname[1]],s=60,
        hue = df_subset['tProd'],edgecolor="black",markers='o')
plt.xlabel(nXlabelunit[0]); plt.ylabel(nXlabelunit[1]); add_grid();
plt.xlim([nXmin[0],nXmax[0]]); plt.ylim([nXmin[1],nXmax[1]]); plt.title('30 Samples, ' + str(Xlabel[1]) + ' vs. ' + str(Xlabel[0]))
for i, txt in enumerate(df_subset['Well']):
    pairplot.annotate(txt, (df_subset[nXname[0]][i]+0.1, df_subset[nXname[1]][i]+0.1))
    
plt.subplot(132)
pairplot = sns.scatterplot(x = df_subset[nXname[1]],y = df_subset[nXname[2]],s=60,
        hue = df_subset['tProd'],edgecolor="black",markers='o')
plt.xlabel(nXlabelunit[1]); plt.ylabel(nXlabelunit[2]); add_grid();
plt.xlim([nXmin[1],nXmax[1]]); plt.ylim([nXmin[2],nXmax[2]]); plt.title('30 Samples, ' + str(Xlabel[2]) + ' vs. ' + str(Xlabel[1]))
for i, txt in enumerate(df_subset['Well']):
    pairplot.annotate(txt, (df_subset[nXname[1]][i]+0.1, df_subset[nXname[2]][i]+0.1))
    
plt.subplot(133)
pairplot = sns.scatterplot(x = df_subset[nXname[0]],y = df_subset[nXname[2]],s=60,
        hue = df_subset['tProd'],edgecolor="black",markers='o')
plt.xlabel(nXlabelunit[0]); plt.ylabel(nXlabelunit[2]); add_grid();
plt.xlim([nXmin[0],nXmax[0]]); plt.ylim([nXmin[2],nXmax[2]]); plt.title('30 Samples, ' + str(Xlabel[2]) + ' vs. ' + str(Xlabel[0]))
for i, txt in enumerate(df_subset['Well']):
    pairplot.annotate(txt, (df_subset[nXname[0]][i]+0.1, df_subset[nXname[2]][i]+0.1))
    
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.1, wspace=0.3, hspace=0.2); plt.show()
_images/6e9e890f08655d5dcfe32faa16b55fc80fd973a37b63d0015962db70c23a5aa9.png

We can now see our 30 wells with their well numbers posted on each of the 3 possible bivariate plots. Let’s perform multidimensional scaling to project to 2 components.

embedding_subset = MDS(n_components=2)                        # MDS projection
MDS_transformed_subset = embedding_subset.fit_transform(df_subset[nXname])

Now let’s visualize the wells in the projected feature space.

plt.subplot(121)                                              # plot data samples in MDS space
pairplot = sns.scatterplot(x = MDS_transformed_subset[:,0],y = MDS_transformed_subset[:,1],s=60,
        hue = df_subset['tProd'],edgecolor="black",markers='o')
for i, txt in enumerate(df_subset['Well'].values):
    plt.annotate(txt, [MDS_transformed_subset[:,0][i]+0.1,MDS_transformed_subset[:,1][i]+0.1])
plt.xlim([-3.5,3.5]); plt.ylim([-3.5,3.5]); plt.title('30 Samples, MDS#2 vs. MDS#1')
pairplot.set_xlabel('MDS1'); pairplot.set_ylabel('MDS2')  
pairplot.legend(loc='lower right'); add_grid(); plt.legend(loc = 'upper right')

dists = euclidean_distances(df_subset[nXname], squared=False).ravel() # calculate all pairwise distances
nonzero = dists != 0                                          # select only non-identical samples pairs
dists = dists[nonzero]
projected_dists = euclidean_distances(MDS_transformed_subset, squared=False).ravel()[nonzero] # calculate projected pairwise distance

plt.subplot(122)                                              # scatter plot original and projected pairwise distance
plt.scatter(dists,projected_dists,c='darkorange',alpha=0.4,edgecolor = 'black')
plt.arrow(0,0,200,200,width=0.02,color='black',head_length=0.0,head_width=0.0)
plt.xlim(0,15); plt.ylim(0,15); add_grid()
plt.xlabel("Pairwise Distance: original space"); plt.ylabel("Pairwise Distance: projected space")
plt.title("MDS Pairwise Distance: Projected to %d Components" % n_components)

plt.subplots_adjust(left=0.0, bottom=0.0, right=2., top=1.1, wspace=0.3, hspace=0.2); plt.show()
_images/965ceb7c6a6d4c9f915821ebdff43af35733b767e8481c508762c7e4ae620a19.png

Some observations:

  • the transformation is nonlinear and not orthogonal, i.e., not a rotation (as in PCA)

  • the inter-sample distance is approximately preserved

\[ \delta_{i,j} = \sqrt{ \Delta Por_{i,j}^{2} + \Delta AI_{i,j}^{2} + \Delta TOC_{i,j}^{2} } \]

Applications of MDS#

The main application of multidimensional scaling is the ability to inspect high dimensional feature spaces for relationships between samples.

  • we may observe that specific wells cluster together

  • we may observe systematic transitions

For example, in the figure above we could observe:

  1. Wells 1 and 168 are in a location of medium producing wells, but are low producing wells.

  2. Well 70 is close to many low producing wells, but is a medium producing well.

  3. Wells 155 and 158 are in a location of high producing wells, but are medium producing wells.

This analysis may identify opportunities for further investigation into these specific data samples. Perhaps there are data issues or other confounding features to integrate.

Recall, MDS is a one-way-trip.

  • While we can project the data into MDS space, there is no backtransformation; therefore, we cannot take an arbitrary location in MDS space and calculate the original feature combination.

  • The solution is nonunique with respect to translation, rotation and flipping. Our interpretation is based on separation distance and groupings in MDS space.

  • The calculation is performed each time we change the data, resulting in a new solution (see the sketch below).
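
A minimal sketch to demonstrate this non-uniqueness (assuming the standardized features and imports from the workflow above); two runs with different random seeds give different coordinates, but nearly the same pairwise distances.

run1 = MDS(n_components=2,n_init=20,max_iter=1000,random_state=73073).fit_transform(df[nXname]) # first MDS run
run2 = MDS(n_components=2,n_init=20,max_iter=1000,random_state=13).fit_transform(df[nXname])    # second MDS run, different seed

print(np.allclose(run1, run2))                                # likely False, the coordinates themselves differ
print(np.corrcoef(euclidean_distances(run1).ravel(),euclidean_distances(run2).ravel())[0,1].round(3)) # correlation of pairwise distances, close to 1.0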

Comments#

I hope you found this chapter helpful. Much more could be done and discussed; I have many more resources. Check out my shared resource inventory.

Michael

The Author:#

Michael Pyrcz, Professor, The University of Texas at Austin Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin

More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn#