Distribution Transformations#

Michael Pyrcz, Professor, The University of Texas at Austin#

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | LinkedIn#

This is a tutorial for / demonstration of Distribution Transformations.

YouTube Lecture: check out my lecture on Distribution Transformations. For your convenience here’s a summary of salient points.

Here’s a simple workflow with some basic univariate distribution transformations for subsurface modeling workflows. This should help you get started with data transformations.

Data Distribution Transformations#

Why do we perform distribution transformations in geostatistical methods and workflows?

  • ensure the variable has the expected distribution shape / correct for too few data

  • a specific distribution assumption is required

  • correct for outliers

How do we perform distribution transformations?

There are a variety of transformations. In general, we are transforming the values from the cumulative distribution function (CDF), \(F_{X}\), to a new CDF, \(G_{Y}\). This can be generalized with the quantile-quantile transformation applied to all the sample data:

  • The forward transform:

\[ Y = G_{Y}^{-1}(F_{X}(X)) \]
  • The reverse transform:

\[ X = F_{X}^{-1}(G_{Y}(Y)) \]

This may be applied to any distribution, nonparametric (built from data) or parametric. We just need to be able to map from one distribution to another through percentiles, so it is a:

  • Rank preserving transform
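To make the quantile-quantile mapping concrete, here’s a minimal sketch, assuming a lognormal \(F_{X}\) and a standard normal \(G_{Y}\) (both distributions are assumptions for illustration only):

import numpy as np                                            # arrays
from scipy import stats                                       # parametric distributions

np.random.seed(73)
x = stats.lognorm.rvs(s=0.5, size=1000)                       # samples from F_X, lognormal assumed

p = stats.lognorm.cdf(x, s=0.5)                               # percentiles in F_X
y = stats.norm.ppf(p)                                         # forward: Y = G_Y^-1(F_X(X))

x_back = stats.lognorm.ppf(stats.norm.cdf(y), s=0.5)          # reverse: X = F_X^-1(G_Y(Y))
print(np.allclose(x, x_back))                                 # True, the transform is rank preserving and invertible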

We will cover two examples:

  1. Affine Correction - rescaling the distribution mean and variance without shape change

  2. Normal score transform - transformation to a Gaussian distribution with shape change

Load the required libraries#

The following code loads the required libraries.

import geostatspy.GSLIB as GSLIB                              # GSLIB utilities, visualization and wrapper
import geostatspy.geostats as geostats                        # GSLIB methods converted to Python
import geostatspy
print('GeostatsPy version: ' + str(geostatspy.__version__))        
GeostatsPy version: 0.0.71

We will also need some standard packages. These should have been installed with Anaconda 3.

import os                                                     # set working directory, run executables

from tqdm import tqdm                                         # suppress the status bar
from functools import partialmethod
tqdm.__init__ = partialmethod(tqdm.__init__, disable=True)

ignore_warnings = True                                        # ignore warnings?
import numpy as np                                            # ndarrays for gridded data
import pandas as pd                                           # DataFrames for tabular data
import matplotlib.pyplot as plt                               # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
from scipy import stats                                       # summary statistics
from statsmodels.stats.weightstats import DescrStatsW         # any weighted statistics
import math                                                   # trig etc.
import scipy.signal as signal                                 # kernel for moving window calculation
import random
plt.rc('axes', axisbelow=True)                                # plot all grids below the plot elements
if ignore_warnings == True:                                   
    import warnings
    warnings.filterwarnings('ignore')
cmap = plt.cm.inferno                                         # color map

Declare functions#

This is a convenience function to add major and minor gridlines to improve plot interpretability.

def add_grid():
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add major and minor grids
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks

Set the working directory#

I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time).

#os.chdir("c:/PGE383/Examples")                               # set the working directory

Loading Tabular Data#

Here’s the command to load our comma delimited data file into a pandas DataFrame object.

df = pd.read_csv(r'https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/sample_data_biased.csv') # load our data from my GitHub account
print('Using ' + str(len(df)) + ' samples')
df.head(n=3)                                                  # preview the data table
Using 289 samples

     X    Y  Facies  Porosity       Perm
0  100  900       1  0.115359   5.736104
1  100  800       1  0.136425  17.211462
2  100  600       1  0.135810  43.724752

It worked, we loaded our file into our DataFrame called ‘df’. But how do you really know that it worked? Visualizing the DataFrame would be useful and we already learned about these methods in this demo (https://git.io/fNgRW).

We can preview the DataFrame by printing a slice or by utilizing the ‘head’ DataFrame member function (with a nice and clean format, see below).

  • With a slice we could look at any subset of the data table, and with the head command we add the parameter ‘n=3’ to see the first 3 rows of the dataset.
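For example, a slice preview equivalent to head with n=3 (a minimal usage sketch):

df.iloc[0:3]                                                  # slice the first 3 rows of the DataFrame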

Summary Statistics for Tabular Data#

The table includes X and Y coordinates (meters), Porosity (fraction), and permeability as Perm (mDarcy).

There are a lot of efficient methods to calculate summary statistics from tabular data in DataFrames.

  • The describe command provides count, mean, standard deviation, minimum, maximum, and quartiles all in a nice data table.

  • We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.

df.describe().transpose()                                     # summary statistics

          count        mean         std       min         25%         50%         75%          max
X         289.0  475.813149  254.277530  0.000000  300.000000  430.000000  670.000000   990.000000
Y         289.0  529.692042  300.895374  9.000000  269.000000  549.000000  819.000000   999.000000
Facies    289.0    0.813149    0.390468  0.000000    1.000000    1.000000    1.000000     1.000000
Porosity  289.0    0.134744    0.037745  0.058548    0.106318    0.126167    0.154220     0.228790
Perm      289.0  207.832368  559.359350  0.075819    3.634086   14.908970   71.454424  5308.842566

Visualizing Tabular Data with Location Maps#

It is natural to set the x and y coordinate and feature ranges manually, e.g. do you want your color bar to go from 0.05887 to 0.24230 exactly? Also, let’s pick a color map for display.

  • I heard that plasma and inferno are known to be friendly to the color blind as the color and intensity vary together (hope I got that right, it was an interesting Twitter conversation started by Matt Hall from Agile if I recall correctly).

  • We will assume a study area of 0 to 1,000m in x and y and omit any data outside this area.

xmin = 0.0; xmax = 1000.0                                     # range of x values
ymin = 0.0; ymax = 1000.0                                     # range of y values
pormin = 0.05; pormax = 0.25;                                 # range of porosity values

Let’s try out locmap. This is a reimplementation of GSLIB’s locmap program that uses matplotlib. I hope you find it simpler than raw matplotlib; if you want to get more advanced and build custom plots, look at the source. If you improve it, send me the new code. Any help is appreciated. To see the parameters, just type the command name:

GSLIB.locmap
<function geostatspy.GSLIB.locmap(df, xcol, ycol, vcol, xmin, xmax, ymin, ymax, vmin, vmax, title, xlabel, ylabel, vlabel, cmap, fig_name)>

Now we can populate the plotting parameters and visualize the porosity data.

GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Well Data - Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)
add_grid(); plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
[Figure: location map of the porosity well data]

Decluster the Data#

Look carefully, and you’ll notice that the spatial samples are more dense in the high porosity regions and less dense in the low porosity regions. There is preferential sampling.

  • We cannot use the naive statistics to represent this region. We have to correct for the clustering of the samples in the high porosity regions.

  • Let’s try cell declustering. We will want to find the cell size that minimizes the declustered mean, and a cell size of between 100 - 200m is likely a good choice; this is an ‘ocular’ estimate of the largest average spacing in the sparsely sampled regions (see the sketch just below this list).
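Here’s a minimal sketch of the cell declustering idea for a single cell size and a single grid origin; the cell_decluster_weights helper is hypothetical, for illustration only, while geostats.declus below averages over multiple grid offsets and scans many cell sizes.

def cell_decluster_weights(x, y, csize):                      # hypothetical helper, single origin, single cell size
    ix = np.floor(np.asarray(x) / csize).astype(int)          # cell index in x
    iy = np.floor(np.asarray(y) / csize).astype(int)          # cell index in y
    _, inv, counts = np.unique(np.stack([ix, iy]), axis=1, return_inverse=True, return_counts=True)
    wts = 1.0 / counts[inv]                                   # weight inversely by the number of samples in the cell
    return wts * len(wts) / np.sum(wts)                       # standardize so the weights sum to the number of data

Samples crowded into the same cell split that cell’s weight, so clustered data are downweighted and isolated data are upweighted.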

Let’s check out the declus program in GeostatsPy. Of course, I have a couple complete workflows for declustering with GeostatsPy that provide more complete descriptions.

geostats.declus                                               # GeostatsPy's declustering program
<function geostatspy.geostats.declus(df, xcol, ycol, vcol, iminmax, noff, ncell, cmin, cmax)>

We can now populate the parameters. We will run a very wide range of cell sizes, from 10m to 2,000m (‘cmin’ and ‘cmax’) and take the cell size that minimizes the declustered mean (‘iminmax’ = 1 to minimize, = 0 to maximize). Multiple offsets (the number of these is ‘noff’) uses multiple grid origins and averages the results to remove sensitivity to grid position. The ‘ncell’ parameter is the number of cell sizes to consider between ‘cmin’ and ‘cmax’.

The output from this program is:

  • wts - an array with the weights for each datum (they sum to the number of data, 1 indicates nominal weight)

  • cell_sizes - an array with the considered cell sizes

  • dmeans - an array with the declustered mean for each considered cell size

wts, cell_sizes, dmeans = geostats.declus(df,'X','Y','Porosity',iminmax = 1, noff= 10, ncell=100,cmin=10,cmax=2000)
df['Porosity_Wts'] = wts                                               # add weights to the sample data DataFrame
df.head()                                                     # preview to check the sample data DataFrame
There are 289 data with:
   mean of      0.13474387540138408 
   min and max  0.058547873 and 0.228790002
   standard dev 0.03767982164385208 
     X    Y  Facies  Porosity       Perm  Porosity_Wts
0  100  900       1  0.115359   5.736104      3.064286
1  100  800       1  0.136425  17.211462      1.076608
2  100  600       1  0.135810  43.724752      0.997239
3  100  500       0  0.094414   1.609942      1.165119
4  100  100       0  0.113049  10.886001      1.224164

Let’s look at the location map of the weights.

GSLIB.locmap_st(df,'X','Y','Porosity_Wts',xmin,xmax,ymin,ymax,0.5,2.5,'Well Data Weights','X(m)','Y(m)','Weights',cmap); 
add_grid(); plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
[Figure: location map of the declustering weights]

Does it look correct? See how the weights vary with local sampling density?

Now let’s plot the declustered porosity distribution next to the porosity location map. You should see the histogram bars adjusted by the weights. Also note the change in the mean due to the weights; there is a significant change.

plt.subplot(121)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Well Data - Porosity','X(m)','Y(m)','Porosity (fraction)',cmap); add_grid()

plt.subplot(122)
GSLIB.hist_st(df['Porosity'],0.05,0.25,log=False,cumul=False,bins=30,weights=df['Porosity_Wts'],xlabel="Porosity",title="Declustered Porosity")
plt.ylim(0.0,40); add_grid()

plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
[Figure: porosity location map and declustered porosity histogram]

Now we can calculate the representative statistics:

  • declustered mean

  • declustered standard deviation

weighted_data = DescrStatsW(df['Porosity'].values, weights=df['Porosity_Wts'], ddof=0)
dmean = weighted_data.mean;  dstdev = weighted_data.std

print('Declustered, Weighted Statistics:')
print('  Mean: ' + str(round(dmean,3)))
print('  Standard Deviation: ' + str(round(dstdev,3)))
print('  Variance: ' + str(round(weighted_data.var,5)))
Declustered, Weighted Statistics:
  Mean: 0.121
  Standard Deviation: 0.032
  Variance: 0.00102

We are now ready to do some data transformations.

Distribution Rescaling / Affine Correction#

Distribution rescaling can be thought of as shifting, stretching, and squeezing a distribution. The common method is known as affine correction:

\[ y = \frac{\sigma_y}{\sigma_x}\left(x - \overline{x}\right) + \overline{y} \]

We can see that the method first centers the distribution, then rescales the dispersion based on the ratio of the new standard deviation to the original standard deviation, and then shifts the distribution to be centered on the target mean.
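As a minimal NumPy sketch of this equation (the affine_sketch helper is hypothetical; we use GeostatsPy’s own function just below):

def affine_sketch(x, tmean, tstdev):                          # hypothetical helper implementing the equation above
    x = np.asarray(x, dtype=float)
    return tstdev / np.std(x) * (x - np.mean(x)) + tmean      # center, rescale dispersion, shift to the target mean

y = affine_sketch(np.random.lognormal(0.0, 0.5, 1000), 0.12, 0.03)
print(np.mean(y).round(3), np.std(y).round(3))                # 0.12, 0.03; the distribution shape is unchanged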

We have a function in GeostatsPy to do the affine correction of the distribution.

GSLIB.affine                                                  # GeostatsPy's affine correction function
<function geostatspy.GSLIB.affine(array, tmean, tstdev)>

We just need to specify the new target mean and variance. Let’s make 2 new rescaled distributions and then plot the results.

Affine Standardization#

If we correct the mean to 0.0 and the variance (and standard deviation) to 1.0, we have standardized our feature.

por1 = GSLIB.affine(df['Porosity'].values,0.0,1.0)            # rescale the porosity to have a standard distribution
df['stand_Por'] = por1                                        # add transformed data to the DataFrame

plt.subplot(221)
GSLIB.hist_st(df['Porosity'],0.05,0.25,log=False,cumul=False,bins=30,weights=df['Porosity_Wts'],xlabel="Porosity (fraction)",title="Declustered Porosity")
plt.ylim(0.0,40); add_grid()

plt.subplot(222)
GSLIB.hist_st(df['stand_Por'],-3.0,3.0,log=False,cumul=False,bins=30,weights=None,xlabel="Porosity Standardized",title="Standardized Porosity")
plt.ylim(0.0,40); add_grid()

plt.subplot(223)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Declustered Porosity','X(m)','Y(m)','Porosity (fraction)',cmap); add_grid()

plt.subplot(224)
GSLIB.locmap_st(df,'X','Y','stand_Por',xmin,xmax,ymin,ymax,-3,3,'Standardized Porosity','X(m)','Y(m)','Porosity (fraction)',cmap); add_grid()

print('Affine Transformed Distribution Statistics:')
print('  Mean: ' + str(round(np.average(por1),3)))
print('  Standard Deviation: ' + str(round(np.std(por1),3)))
print('  Variance: ' + str(round(np.var(por1),5)))

plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=2.1, wspace=0.2, hspace=0.2); plt.show()
Affine Transformed Distribution Statistics:
  Mean: 0.0
  Standard Deviation: 1.0
  Variance: 1.0
[Figure: declustered and standardized porosity histograms and location maps]

Let’s make some observations:

  • Notice that I did not say standard normal? A standard distribution has a mean of 0.0 and standard deviation of 1.0. The rescaling does not change the distribution shape; therefore, a non-normal (non-Gaussian) distribution cannot become normal (Gaussian) shaped just by rescaling. We’ll cover the Gaussian transformation in a bit.

  • Also, notice that the shape is the same and the location maps look almost the same? By adjusting the minimum and maximum values in the histogram x-axis and the location map color bar, we made them look unchanged! There are minor differences in bars due to the precise locations of the bin boundaries.

Let’s try a minor adjustment, correcting the porosity to the more representative declustered mean and standard deviation.

target_mean = dmean; target_stdev = dstdev                    # target mean and standard deviation for affine correction

por2 = GSLIB.affine(df['Porosity'].values,target_mean,target_stdev) # rescale the porosity to the declustered mean and standard deviation
df['affine_Por'] = por2                                       # add transformed data to the DataFrame

plt.subplot(221)
GSLIB.hist_st(df['Porosity'],0.05,0.25,log=False,cumul=False,bins=30,weights=df['Porosity_Wts'],xlabel="Porosity (fraction)",title="Declustered Porosity")
plt.ylim(0.0,60)

plt.subplot(222)
GSLIB.hist_st(df['affine_Por'],0.05,0.25,log=False,cumul=False,bins=30,weights=None,xlabel="Porosity (fraction)",title="Affine Corrected Porosity")
plt.ylim(0.0,60)

plt.subplot(223)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Declustered Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)

plt.subplot(224)
GSLIB.locmap_st(df,'X','Y','affine_Por',xmin,xmax,ymin,ymax,0.05,0.25,'Affine Corrected Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)

print('Affine Transformed Distribution Statistics:')
print('  Mean: ' + str(round(np.average(por2),3)))
print('  Standard Deviation: ' + str(round(np.std(por2),3)))
print('  Variance: ' + str(round(np.var(por2),5)))

plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=2.1, wspace=0.2, hspace=0.2); plt.show()
Affine Transformed Distribution Statistics:
  Mean: 0.121
  Standard Deviation: 0.032
  Variance: 0.00102
[Figure: declustered and affine corrected porosity histograms and location maps]

Note that the distribution mean, standard deviation and variance are corrected to the more representative values.

Note, affine correction is often applied to calculate multiple scenarios of the feature distribution (more on these uncertainty methods later).

  • be careful about the potential for non-physical values if there are large changes, e.g., negative porosity (see the example below).
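For example, shifting to a lower scenario mean may push the left tail below zero; here’s a hedged sketch of guarding against this (the 0.08 target mean is an arbitrary assumption for illustration):

por_low = GSLIB.affine(df['Porosity'].values,0.08,dstdev)     # low mean scenario, 0.08 assumed for illustration
por_low = np.clip(por_low,0.0,None)                           # enforce physically valid, non-negative porosity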

Normal Score Transform / Gaussian Anamorphosis#

We showed that the correction of the mean to 0.0 and standard deviation to 1.0 with affine correction does not change the shape; therefore, it does not make a Gaussian distributed property. Many statistical / geostatistical methods require an assumption of a Gaussian distribution, so we need normal score transforms in many subsurface modeling workflows.

Let’s check out the GSLIB NSCORE program translated to Python in GeostatsPy.

geostats.nscore                                               # GeostatsPy's Gaussian transformation function
<function geostatspy.geostats.nscore(df, vcol, wcol=None, ismooth=False, dfsmooth=None, smcol=0, smwcol=0)>

The inputs are primarily the DataFrame, the variable and the data weight columns (‘df’, ‘vcol’ and ‘wcol’).

  • The remainder of the variables are for the use of a reference distribution. When would you use a reference distribution? This would be the case when you have too few data to perform a reliable transformation and use analog information to inform a more complete distribution to support the transformation.

  • As you can see, the weights column (‘wcol’) and the reference distribution inputs have defaults. You can run the function omitting these (e.g. just the DataFrame and variable column).

  • The outputs from the program include the transformed data and the transformation table (discretized values in original and associated Gaussian space). A minimal sketch of the underlying computation follows this list.
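Here’s a minimal sketch of a weighted normal score transform, assuming GSLIB-style midpoint plotting positions; the nscore_sketch helper is hypothetical and the workflow below uses geostats.nscore itself.

from scipy import stats                                       # for the standard normal inverse CDF

def nscore_sketch(values, weights=None):                      # hypothetical helper, not the GSLIB NSCORE program
    values = np.asarray(values, dtype=float)
    weights = np.ones_like(values) if weights is None else np.asarray(weights, dtype=float)
    order = np.argsort(values)                                # sort the data to build the weighted CDF
    cum = np.cumsum(weights[order]) - 0.5 * weights[order]    # midpoint cumulative weights
    ns = np.empty_like(values)
    ns[order] = stats.norm.ppf(cum / np.sum(weights))         # y = G^-1(F_X(x)) with G the standard normal CDF
    return ns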

ns_por,trans_vr,trans_ns = geostats.nscore(df,'Porosity','Porosity_Wts') # Gaussian (NSCORE) transformation
df['NPor'] = ns_por                                           # add transformed data to the DataFrame

plt.subplot(221)
GSLIB.hist_st(df['Porosity'],0.05,0.25,log=False,cumul=False,bins=30,weights=df['Porosity_Wts'],xlabel="Porosity (fraction)",title="Declustered Porosity")
plt.ylim(0.0,40)

plt.subplot(222)
GSLIB.hist_st(df['NPor'],-3.0,3.0,log=False,cumul=False,bins=30,weights=df['Porosity_Wts'],xlabel="Normal Scores Porosity",title="Gaussian Transformed Porosity")
plt.ylim(0.0,40)

plt.subplot(223)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Declustered Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)

plt.subplot(224)
GSLIB.locmap_st(df,'X','Y','NPor',xmin,xmax,ymin,ymax,-3.0,3.0,'Gaussian Transformed Porosity','X(m)','Y(m)','Normal Scores Porosity',cmap)

plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=2.1, wspace=0.2, hspace=0.2)
plt.show()
[Figure: declustered and Gaussian transformed porosity histograms and location maps]

Some observations:

  • That is interesting! Why is the new distribution not perfectly Gaussian in shape? Because it is the declustered distribution of the data transformed to Gaussian. It accounts for the spatial bias in the sampling.

  • I’m not completely satisfied with the behavior of the transformed data distribution at the tails. The NSCORE program does not have any tail extrapolation model as found with simulation methods. The transform at the tails is hard to do from the data alone. When we get into simulation methods we’ll check that out.

We should also visualize the transformation table with a Q-Q plot.

plt.subplot(111)
plt.scatter(trans_vr,trans_ns, c = "darkorange",edgecolor='black',marker='o', alpha = 0.6)
plt.xlabel('Porosity (fraction)')
plt.ylabel('Normal Score Transformed Porosity')
plt.title('Normal Score Transformed Porosity vs Untransformed Porosity Q-Q Plot')
plt.ylim(-4,4); plt.xlim(0,.30); add_grid()

plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.0, wspace=0.2, hspace=0.2)
plt.show()
[Figure: normal score transform Q-Q plot]

This is a Q-Q plot that maps the transform from our original distribution to the Gaussian distribution. Notice how the declustering weights have shifted up the lower quantiles as they received more weight.
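The transformation table also supports the reverse transform. A minimal sketch, assuming the values stay within the table limits (simulation programs add tail extrapolation beyond them):

back_por = np.interp(df['NPor'],trans_ns,trans_vr)            # reverse transform by interpolating the table
print(np.allclose(back_por,df['Porosity']))                   # should approximately recover the original porosity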

As a final step we should check out the summary statistics of all the variants of porosity from our various data transformations.

df.describe().transpose()
              count          mean         std        min         25%         50%         75%          max
X             289.0  4.758131e+02  254.277530   0.000000  300.000000  430.000000  670.000000   990.000000
Y             289.0  5.296920e+02  300.895374   9.000000  269.000000  549.000000  819.000000   999.000000
Facies        289.0  8.131488e-01    0.390468   0.000000    1.000000    1.000000    1.000000     1.000000
Porosity      289.0  1.347439e-01    0.037745   0.058548    0.106318    0.126167    0.154220     0.228790
Perm          289.0  2.078324e+02  559.359350   0.075819    3.634086   14.908970   71.454424  5308.842566
Porosity_Wts  289.0  1.000000e+00    0.639743   0.281976    0.670642    0.789486    1.174123     3.984325
stand_Por     289.0  1.475175e-16    1.001735  -2.022196   -0.754414   -0.227615    0.516880     2.495928
affine_Por    289.0  1.212482e-01    0.031916   0.056819    0.097212    0.113996    0.137716     0.200771
NPor          289.0  3.876516e-01    1.042739  -2.336338   -0.330268    0.331970    1.118825     3.157063

Comments#

This was a basic demonstration of distribution transformations with GeostatsPy. Much more can be done; I have other demonstrations for modeling workflows with GeostatsPy in the GitHub repository GeostatsPy_Demos.

I hope this is helpful,

Michael

The Author:#

Michael Pyrcz, Professor, The University of Texas at Austin#

Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | LinkedIn#

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin

More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Book | YouTube | LinkedIn#