Bootstrap for Uncertainty Modeling#

Michael Pyrcz, Professor, The University of Texas at Austin#

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | LinkedIn#

This is a tutorial for / demonstration of Bootstrap.

YouTube Lecture: check out my lecture on Bootstrap. For your convenience here’s a summary of salient points.

Bootstrap#

Uncertainty in the sample statistics

  • one source of uncertainty is the paucity of data.

  • do 200 or even fewer wells provide a precise (and accurate) estimate of the mean? standard deviation? skew? P13?

Would it be useful to know the uncertainty in these statistics due to limited sampling?

  • what is the impact of uncertainty in the mean porosity, e.g., 20% +/- 2%?

Bootstrap is a method to assess the uncertainty in a sample statistic by repeated random sampling with replacement.

Assumptions

  • sufficient, representative sampling; identical, independent samples

Limitations

  1. assumes the samples are representative

  2. assumes stationarity

  3. only accounts for uncertainty due to too few samples, i.e., no spatial uncertainty away from the data

  4. does not account for boundary of area of interest

  5. assumes the samples are independent

  6. does not account for other local information sources

The Bootstrap Approach (Efron, 1982)

Statistical resampling procedure to calculate uncertainty in a calculated statistic from the data itself.

  • Does this work? Prove it to yourself; for uncertainty in the mean, the analytical solution is the standard error:

\[ \sigma^2_\overline{x} = \frac{\sigma^2_s}{n} \]
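Here's a quick check with a minimal sketch, assuming a made-up, normally distributed sample set; the bootstrap standard deviation of the mean should closely match the analytical standard error above:

import numpy as np                                       # arrays and random sampling

rng = np.random.default_rng(seed=73073)                  # seeded generator for repeatability
x = rng.normal(loc=0.15, scale=0.03, size=100)           # made-up porosity-like samples, n = 100

L = 10000                                                # number of bootstrap realizations
boot_means = np.array([np.mean(rng.choice(x, size=len(x), replace=True)) for _ in range(L)])

print('Bootstrap st.dev. of the mean = ' + str(round(np.std(boot_means),5)))
print('Analytical standard error     = ' + str(round(np.std(x)/np.sqrt(len(x)),5)))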

Extremely powerful - we can calculate uncertainty in any statistic! e.g., P13, skew, etc.

  • It would not be possible to access uncertainty in general statistics without bootstrap.

  • Advanced forms account for spatial information and sampling strategy (e.g., game theory and Journel’s spatial bootstrap, 1993).

Steps:

  1. assemble a sample set; it must be representative, and it must be reasonable to assume independence between samples

  2. optional: build a cumulative distribution function (CDF)

    • may account for declustering weights, tail extrapolation

    • could use analogous data to support

  3. For \(\ell = 1, \ldots, L\) realizations, do the following:

    • For \(i = 1, \ldots, n\) data, do the following:

      • Draw a random sample with replacement from the sample set or Monte Carlo simulate from the CDF (if available).

  4. Calculate a realization of the summary statistic of interest from the \(n\) samples, e.g. \(m^\ell\), \(\sigma^2_{\ell}\). Return to step 3 for another realization.

  5. Compile and summarize the \(L\) realizations of the statistic of interest.
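As a preview, the steps above map to just a few lines of Python. Here's a minimal sketch with a made-up sample set (the full demonstration with real data follows below):

import numpy as np                                       # arrays and random sampling

rng = np.random.default_rng(seed=73073)                  # seeded generator for repeatability
z = np.array([0.10, 0.12, 0.13, 0.15, 0.11, 0.14])       # step 1: representative sample set (made-up values)

L = 1000                                                 # number of bootstrap realizations
m = np.zeros(L)                                          # storage for the realizations of the statistic
for l in range(L):                                       # step 3: loop over realizations
    resample = rng.choice(z, size=len(z), replace=True)  # draw n samples with replacement
    m[l] = np.mean(resample)                             # step 4: realization of the statistic
print('P10/P50/P90: ' + str(np.percentile(m, [10, 50, 90]))) # step 5: summarize the L realizations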

This is a very powerful method. Let’s try it out.

Load the required libraries#

The following code loads the required libraries.

import geostatspy.GSLIB as GSLIB                              # GSLIB utilities, visualization and wrapper
import geostatspy.geostats as geostats                        # GSLIB methods converted to Python      
import geostatspy
print('GeostatsPy version: ' + str(geostatspy.__version__))
GeostatsPy version: 0.0.71

Check Another Package#

At some point you are going to run into an issue with package versions. This is how you can check packages.

import numpy
numpy.__version__
'1.23.3'

We will also need some standard packages. These should have been installed with Anaconda 3.

import os                                                     # set working directory, run executables

from tqdm import tqdm                                         # suppress the status bar
from functools import partialmethod
tqdm.__init__ = partialmethod(tqdm.__init__, disable=True)

ignore_warnings = True                                        # ignore warnings?
import numpy as np                                            # ndarrays for gridded data
import pandas as pd                                           # DataFrames for tabular data
import matplotlib.pyplot as plt                               # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
from scipy import stats                                       # summary statistics
import math                                                   # trig etc.
import scipy.signal as signal                                 # kernel for moving window calculation
import random
plt.rc('axes', axisbelow=True)                                # plot all grids below the plot elements
if ignore_warnings == True:                                   
    import warnings
    warnings.filterwarnings('ignore')
cmap = plt.cm.inferno                                         # color map

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.

Declare functions#

These are some of the functions from GeostatsPy required by this workflow. We will declare them here and then, in the future, integrate them into the package properly.

def add_grid():
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add major and minor grids
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks   

Set the working directory#

I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time).

#os.chdir("c:/PGE383")                                        # set the working directory

Loading Tabular Data#

Here’s the command to load our comma delimited data file into a Pandas DataFrame object.

df = pd.read_csv(r'https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/sample_data_biased.csv') # load our data from my GitHub account

Let’s drop some samples so that we increase the variations in bootstrap samples for our demonstration below.

  • Warning: this is not a standard part of the bootstrap workflow; I’m doing this so my students can change the number of data and observe the impact on uncertainty.

df = df.sample(frac = 0.2)                                    # randomly sample 20% of the data to reduce the size of the dataset  
print('Using ' + str(len(df)) + ' number of samples')
Using 58 number of samples

Visualizing the DataFrame would be useful and we already learned about these methods in this demo (https://git.io/fNgRW).

We can preview the DataFrame by printing a slice or by utilizing the ‘head’ DataFrame member function (with a nice and clean format, see below). With a slice we can look at any subset of the data table, and with the head command we can add the parameter ‘n=13’ to see the first 13 rows of the dataset.
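For example (a minimal sketch, assuming the df loaded above; in a notebook only the last expression displays):

df[:13]         # slice: preview the first 13 rows of the data table
df.head(n=13)   # head member function: the same preview with a clean format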

df.head(n=3)                                                  # DataFrame preview to check
       X    Y  Facies  Porosity      Perm
235  810  589       0  0.115543  5.490790
182  370  139       1  0.123530  3.696604
88   690  559       0  0.109998  4.750459

Summary Statistics for Tabular Data#

The table includes X and Y coordinates (meters), Facies 1 and 0 (1 is sandstone and 0 interbedded sand and mudstone), Porosity (fraction), and permeability as Perm (mDarcy).

There are a lot of efficient methods to calculate summary statistics from tabular data in DataFrames. The describe command provides count, mean, standard deviation, minimum, maximum, and quartiles all in a nice data table. We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.

df.describe().transpose()                                     # summary statistics
          count        mean         std        min         25%         50%         75%          max
X          58.0  434.137931  262.665383   0.000000  260.000000  400.000000  632.500000   990.000000
Y          58.0  560.344828  270.961352  59.000000  326.500000  574.000000  792.250000   999.000000
Facies     58.0    0.844828    0.365231   0.000000    1.000000    1.000000    1.000000     1.000000
Porosity   58.0    0.134842    0.035624   0.074373    0.110108    0.124393    0.145943     0.223661
Perm       58.0  230.637558  623.228764   0.665880    3.755548   16.527409   83.089700  3178.630458

Visualizing Tabular Data with Location Maps#

It is natural to set the x and y coordinate and feature ranges manually. e.g. do you want your color bar to go from 0.05887 to 0.24230 exactly? Also, let’s pick a color map for display. I heard that plasma is known to be friendly to the color blind as the color and intensity vary together (hope I got that right, it was an interesting Twitter conversation started by Matt Hall from Agile if I recall correctly). We will assume a study area of 0 to 1,000m in x and y and omit any data outside this area.

xmin = 0.0; xmax = 1000.0                                     # range of x values
ymin = 0.0; ymax = 1000.0                                     # range of y values
pormin = 0.05; pormax = 0.25;                                 # range of porosity values
nx = 100; ny = 100; csize = 10.0                       

Let’s try out locmap. This is a reimplementation of GSLIB’s locmap program that uses matplotlib. I hope you find it simpler than matplotlib; if you want to get more advanced and build custom plots, look at the source. If you improve it, send me the new code. Any help is appreciated. To see the parameters, just type the command name:

GSLIB.locmap                                                  # GeostatsPy's 2D point plot function
<function geostatspy.GSLIB.locmap(df, xcol, ycol, vcol, xmin, xmax, ymin, ymax, vmin, vmax, title, xlabel, ylabel, vlabel, cmap, fig_name)>

Now we can populate the plotting parameters and visualize the porosity data.

plt.subplot(111)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Well Data - Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
[Figure: location map of the porosity well data]

A Very Simple Bootstrap Method in Python#

If you are new to bootstrap and Python, here’s the simplest possible code for bootstrap.

  • specify the number of bootstrap realizations, \(L\)

  • declare a list to store the bootstrap realizations of the statistic of interest

  • loop over \(L\) bootstrap realizations

    • draw \(n\) random samples with replacement (Monte Carlo simulations) for a new realization of the data

    • calculate the realization of the statistic from the realization of the data

  • summarize the resulting uncertainty model, histogram, summary statistics etc.

import random                          # import random package
L = 1000                               # set the number of bootstrap realizations   
por_avg_real = []                      # declare an empty list to store the bootstrap realizations of the statistic 
for k in range(0,L):                   # loop over the L bootstrap realizations
    samples = random.choices(df['Porosity'].values, k=len(df)) # n Monte Carlo simulations
    por_avg_real.append(np.average(samples)) # calculate the statistic of interest from the new bootstrap dataset
plt.hist(por_avg_real,color = 'darkorange',alpha = 0.8,edgecolor = 'black') # plot the distribution, could also calculate any summary statistics
plt.xlabel('Bootstrap Realizations of Average Porosity'); plt.ylabel('Frequency'); plt.title('Uncertainty Distribution for Average Porosity')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
[Figure: histogram of the bootstrap realizations of average porosity]

Now we proceed with a more complicated, robust workflow by first quantifying the spatial bias in the samples and assigning data weights to mitigate this bias.

Declustering#

Let’s calculate some declustering weights. There is a demonstration on declustering here https://git.io/fhgJl if you need more information.

wts, cell_sizes, dmeans = geostats.declus(df,'X','Y','Porosity',iminmax = 1, noff= 10, ncell=100,cmin=10,cmax=2000)
df['Wts'] = wts                            # add weights to the sample data DataFrame
df.head()                                  # preview to check the sample data DataFrame

def weighted_avg_and_std(values, weights): # function to calculate weighted mean and st. dev., from Eric O Lebigot, stack overflow,
    average = np.average(values, weights=weights)
    variance = np.average((values-average)**2, weights=weights)
    return (average, math.sqrt(variance))

sample_avg, sample_stdev = weighted_avg_and_std(df['Porosity'],df['Wts'])
print('Declustered mean = ' + str(round(sample_avg,3)) + ' and declustered standard deviation = ' + str(round(sample_stdev,3)))
There are 58 data with:
   mean of      0.1348423376724138 
   min and max  0.074372863 and 0.223660709
   standard dev 0.035315407927839874 
Declustered mean = 0.124 and declustered standard deviation = 0.032

A Couple of Bootstrap Realizations#

We will attempt bootstrap by hand and manually loop over \(L\) realizations, drawing \(n\) samples for each to calculate the summary statistics of interest, the mean and standard deviation. The choices function from the random package simplifies sampling with replacement from a set of samples with weights.

This command returns a list of k samples drawn with replacement from the ‘Porosity’ column of our DataFrame (df), accounting for the data weights in column ‘Wts’.

samples1 = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))

It is instructive to look at a couple of these realizations from the original declustered data set.

samples1 = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))
samples2 = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))

print('Bootstrap means, realization 1 = ' + str(np.average(samples1)) + ' and realization 2 = ' + str(np.average(samples2)))

plt.subplot(131)
GSLIB.hist_st(df['Porosity'],pormin,pormax,False,False,20,df['Wts'],'Porosity (fraction)','Histogram Declustered Porosity')

plt.subplot(132)
GSLIB.hist_st(samples1,pormin,pormax,False,False,20,None,'Bootstrap Sample - Realization 1','Histogram Bootstrap Porosity 1')

plt.subplot(133)
GSLIB.hist_st(samples2,pormin,pormax,False,False,20,None,'Bootstrap Sample - Realization 2','Histogram Bootstrap Porosity 2')

plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.1, wspace=0.2, hspace=0.2)
plt.show()
Bootstrap means, realization 1 = 0.12384003220689653 and realization 2 = 0.12007139553448275
[Figure: histograms of the declustered porosity data and two bootstrap realizations]

Note that the bootstrap distributions vary quite a bit from the original.

Summarizations Over Bootstrap Realizations#

Let’s make a loop to conduct \(L\) resamples and calculate the average and standard deviation for each (\(m^\ell\), \(\sigma^2_{\ell}\), for \(\ell = 0,\dots,L-1\)). We then summarize over these \(L\) realizations.

I did not find any built-in, concise functions to accomplish this, i.e. with a single line of code, so we are going to do it by hand.

To understand this code there are just a couple of Python concepts that you need to add to your Python arsenal.

  1. declaring arrays - NumPy has a lot of great array (ndarray) functionality. There are built-in functions to make an ndarray of any length (and dimension). These include ‘zeros’, ‘ones’ and ‘rand’, so when we use this code (we’re making arrays of length \(L\) pre-populated with zeros):

mean = np.zeros(L); stdev = np.zeros(L)
  2. For Loops - when we use the command below, we instruct the computer to loop over all the indented code below the command \(L\) times, for \(l = 0,1,2,\ldots,L-1\). For each loop the \(l\) variable increments, so we can use it to save each result to a different index in the arrays mean and stdev. Note, Python arrays index starting at 0 and stop at the length - 1 (we are running each bootstrap resampled realization, calculating the average and standard deviation, and storing them in the arrays that we already declared).

for l in range(0, L): 

Now, let’s write the code to perform bootstrap and calculate the uncertainty in the mean and standard deviation.

L = 1000                                   # set the number of realizations
mean = np.zeros(L); stdev = np.zeros(L)    # declare arrays to hold the realizations of the statistics
for l in range(0, L):                      # loop over realizations
    samples = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))
    mean[l] = np.average(samples)
    stdev[l] = np.std(samples)
    
plt.subplot(121)
GSLIB.hist_st(mean,0.11,0.15,False,False,50,None,'Average Porosity (fraction)','Bootstrap Uncertainty in Porosity Average')

plt.subplot(122)
GSLIB.hist_st(stdev,0.015,0.045,False,False,50,None,'Standard Deviation Porosity (fraction)','Bootstrap Uncertainty in Porosity Standard Deviation')

plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.2, wspace=0.2, hspace=0.2)
plt.show()   
    
print('Summary Statistics for Bootstrap Porosity Mean Realizations:')
print(stats.describe(mean))
print('P10: ' + str(round(np.percentile(mean,10),3)) + ', P50: ' + str(round(np.percentile(mean,50),3)) + ', P90: ' + str(round(np.percentile(mean,90),3))) 

print('\nSummary Statistics for Bootstrap Porosity Standard Deviation Realizations:')
print(stats.describe(stdev))
print('P10: ' + str(round(np.percentile(stdev,10),3)) + ', P50: ' + str(round(np.percentile(stdev,50),3)) + ', P90: ' + str(round(np.percentile(stdev,90),3))) 
[Figure: bootstrap uncertainty distributions for porosity average and standard deviation]
Summary Statistics for Bootstrap Porosity Mean Realizations:
DescribeResult(nobs=1000, minmax=(0.11342325989655172, 0.14170476337931037), mean=0.12448168073553446, variance=1.8482397702732407e-05, skewness=0.28544209487033434, kurtosis=0.34732373215605294)
P10: 0.119, P50: 0.124, P90: 0.13

Summary Statistics for Bootstrap Porosity Standard Deviation Realizations:
DescribeResult(nobs=1000, minmax=(0.019669288168454372, 0.04418584104504446), mean=0.032035099720929015, variance=1.5414175957605215e-05, skewness=-0.03807314119458973, kurtosis=-0.0046380513425461345)
P10: 0.027, P50: 0.032, P90: 0.037

Bootstrap in GeostatsPy#

We have a simple bootstrap function in GeostatsPy:

realizations_array = geostats.bootstrap(array,stat,weights=None,nreal=100)

where:

  • array - a NumPy ndarray of samples

  • stat - statistic function that can be applied to an ndarray, e.g., numpy.average

  • weights - an array of weights; if omitted or set to None, equal weighting is applied

  • nreal - the number of realizations

L = 10000
mean = geostats.bootstrap(df['Porosity'].values,stat = np.average,weights = df['Wts'].values,nreal = L)
stdev = geostats.bootstrap(df['Porosity'].values,stat = np.std,weights = df['Wts'].values,nreal = L)

plt.subplot(121)
GSLIB.hist_st(mean,0.11,0.15,False,False,50,None,'Average Porosity (fraction)','Bootstrap Uncertainty in Porosity Average')

plt.subplot(122)
GSLIB.hist_st(stdev,0.015,0.045,False,False,50,None,'Standard Deviation Porosity (fraction)','Bootstrap Uncertainty in Porosity Standard Deviation')

plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.2, wspace=0.2, hspace=0.2)
plt.show()   
    
print('Summary Statistics for Bootstrap Porosity Mean Realizations:')
print(stats.describe(mean))
print('P10: ' + str(round(np.percentile(mean,10),3)) + ', P50: ' + str(round(np.percentile(mean,50),3)) + ', P90: ' + str(round(np.percentile(mean,90),3))) 

print('\nSummary Statistics for Bootstrap Porosity Standard Deviation Realizations:')
print(stats.describe(stdev))
print('P10: ' + str(round(np.percentile(stdev,10),3)) + ', P50: ' + str(round(np.percentile(stdev,50),3)) + ', P90: ' + str(round(np.percentile(stdev,90),3))) 
[Figure: bootstrap uncertainty distributions for porosity average and standard deviation, from the GeostatsPy bootstrap function]
Summary Statistics for Bootstrap Porosity Mean Realizations:
DescribeResult(nobs=10000, minmax=(0.11051559105172412, 0.14237169118965518), mean=0.12429691235791895, variance=1.7881312164470136e-05, skewness=0.1394007278174748, kurtosis=0.04457856848319919)
P10: 0.119, P50: 0.124, P90: 0.13

Summary Statistics for Bootstrap Porosity Standard Deviation Realizations:
DescribeResult(nobs=10000, minmax=(0.017473019574900325, 0.045527537564874675), mean=0.031868221137267706, variance=1.599492477298796e-05, skewness=-0.10039604676232775, kurtosis=-0.1344892653346781)
P10: 0.027, P50: 0.032, P90: 0.037
For reference, here is the source of this GeostatsPy bootstrap function:

def bootstrap(zdata,weights,nreal,stat):
    zreal = np.zeros(nreal)               # declare an array to store the bootstrap realizations
    if weights is None:
        weights = np.ones((len(zdata)))   # default to equal weighting
    for l in range(0,nreal):              # loop over the nreal bootstrap realizations
        samples = random.choices(zdata, k=len(zdata), weights=weights) # sample with replacement
        zreal[l] = stat(samples)          # calculate the realization of the statistic
    return zreal                          # return the array of realizations of the statistic
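As an aside, the same weighted bootstrap can be written with NumPy alone. Here is a minimal sketch (bootstrap_np is a hypothetical helper, not part of GeostatsPy), using numpy.random.Generator.choice with the weights normalized to probabilities:

import numpy as np                                       # arrays and random sampling

def bootstrap_np(zdata, stat, weights=None, nreal=100, seed=73073):
    rng = np.random.default_rng(seed)                    # seeded generator for repeatability
    zdata = np.asarray(zdata)
    p = None if weights is None else np.asarray(weights)/np.sum(weights) # weights to probabilities
    zreal = np.zeros(nreal)                              # storage for the realizations of the statistic
    for l in range(nreal):                               # loop over bootstrap realizations
        samples = rng.choice(zdata, size=len(zdata), replace=True, p=p)  # sample with replacement
        zreal[l] = stat(samples)                         # realization of the statistic
    return zreal

# e.g., mean_real = bootstrap_np(df['Porosity'].values, np.average, df['Wts'].values, nreal=1000)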

Comments#

This was a basic demonstration of bootstrap for uncertainty modeling. Much more can be done; I have other demonstrations for modeling workflows with GeostatsPy in the GitHub repository GeostatsPy_Demos.

I hope this is helpful,

Michael

The Author:#

Michael Pyrcz, Professor, The University of Texas at Austin#

Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | LinkedIn#

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin

More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Book | YouTube | LinkedIn#