Bootstrap for Uncertainty Modeling#
Michael J. Pyrcz, Professor, The University of Texas at Austin
Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn
Chapter of e-book “Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy”.
Cite as: Pyrcz, M.J., 2024, Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy, https://geostatsguy.github.io/GeostatsPyDemos_Book.
By Michael J. Pyrcz
© Copyright 2024.
This chapter is a tutorial for / demonstration of Bootstrap.
YouTube Lecture: check out my lecture on Bootstrap. For your convenience here’s a summary of salient points.
Bootstrap#
Uncertainty in the sample statistics
one source of uncertainty is the paucity of data.
do 200 or even less wells provide a precise (and accurate estimate) of the mean? standard deviation? skew? P13?
Would it be useful to know the uncertainty in these statistics due to limited sampling?
what is the impact of uncertainty in the mean porosity e.g. 20%+/-2%?
Bootstrap is a method to assess the uncertainty in a sample statistic by repeated random sampling with replacement.
Assumptions
sufficient, representative sampling, identical, independent samples
Limitations
assumes the samples are representative
assumes stationarity
only accounts for uncertainty due to too few samples, e.g. no uncertainty due to changes away from data
does not account for boundary of area of interest
assumes the samples are independent
does not account for other local information sources
The Bootstrap Approach (Efron, 1982)
Statistical resampling procedure to calculate uncertainty in a calculated statistic from the data itself.
Does this work? Prove it to yourself, for uncertainty in the mean solution is standard error:
Extremely powerful - could calculate uncertainty in any statistic! e.g. P13, skew etc.
Would not be possible access general uncertainty in any statistic without bootstrap.
Advanced forms account for spatial information and sampling strategy (game theory and Journel’s spatial bootstrap (1993).
Steps:
assemble a sample set, must be representative, reasonable to assume independence between samples
optional: build a cumulative distribution function (CDF)
may account for declustering weights, tail extrapolation
could use analogous data to support
For \(\ell = 1, \ldots, L\) realizations, do the following:
For \(i = \alpha, \ldots, n\) data, do the following:
Draw a random sample with replacement from the sample set or Monte Carlo simulate from the CDF (if available).
Calculate a realization of the summary statistic of interest from the \(n\) samples, e.g. \(m^\ell\), \(\sigma^2_{\ell}\). Return to 3 for another realization.
Compile and summarize the \(L\) realizations of the statistic of interest.
This is a very powerful method. Let’s try it out.
Load the Required Libraries#
The following code loads the required libraries.
import geostatspy.GSLIB as GSLIB # GSLIB utilizes, visualization and wrapper
import geostatspy.geostats as geostats # GSLIB methods convert to Python
import geostatspy
print('GeostatsPy version: ' + str(geostatspy.__version__))
GeostatsPy version: 0.0.71
Check Another Package#
At some point you are going to run into an issue with package versions. This is how you can check packages.
import numpy
numpy.__version__
'1.23.3'
We will also need some standard packages. These should have been installed with Anaconda 3.
import os # set working directory, run executables
from tqdm import tqdm # suppress the status bar
from functools import partialmethod
tqdm.__init__ = partialmethod(tqdm.__init__, disable=True)
ignore_warnings = True # ignore warnings?
import numpy as np # ndarrays for gridded data
import pandas as pd # DataFrames for tabular data
import matplotlib.pyplot as plt # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
from scipy import stats # summary statistics
import math # trig etc.
import scipy.signal as signal # kernel for moving window calculation
import random
plt.rc('axes', axisbelow=True) # plot all grids below the plot elements
if ignore_warnings == True:
import warnings
warnings.filterwarnings('ignore')
cmap = plt.cm.inferno # color map
If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.
Declare Functions#
These are some of the functions from GeostatsPy required by the new program. We will declare them here and then in the future integrate the new indicator kriging program into the package properly.
def add_grid():
plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks
Set the Working Directory#
I always like to do this so I don’t lose files and to simplify subsequent read and writes (avoid including the full address each time).
#os.chdir("c:/PGE383") # set the working directory
Loading Tabular Data#
Here’s the command to load our comma delimited data file in to a Pandas’ DataFrame object.
df = pd.read_csv(r'https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/sample_data_biased.csv') # load our data form my GitHub account
Let’s drop some samples so that we increase the variations in bootstrap samples for our demonstration below.
Warning, this is not a standard part of the bootstrap workflow, I’m doing this so my students can change the number of data and observe the impact on uncertainty.
df = df.sample(frac = 0.2) # extract 50 random samples to reduce the size of the dataset
print('Using ' + str(len(df)) + ' number of samples')
Using 58 number of samples
Visualizing the DataFrame would be useful and we already learned about these methods in this demo (https://git.io/fNgRW).
We can preview the DataFrame by printing a slice or by utilizing the ‘head’ DataFrame member function (with a nice and clean format, see below). With the slice we could look at any subset of the data table and with the head command, add parameter ‘n=13’ to see the first 13 rows of the dataset.
df.head(n=3) # DataFrame preview to check
X | Y | Facies | Porosity | Perm | |
---|---|---|---|---|---|
98 | 470 | 759 | 1 | 0.163307 | 85.651668 |
133 | 250 | 569 | 1 | 0.120868 | 22.277121 |
216 | 900 | 849 | 1 | 0.120992 | 3.005055 |
Summary Statistics for Tabular Data#
The table includes X and Y coordinates (meters), Facies 1 and 0 (1 is sandstone and 0 interbedded sand and mudstone), Porosity (fraction), and permeability as Perm (mDarcy).
There are a lot of efficient methods to calculate summary statistics from tabular data in DataFrames. The describe command provides count, mean, minimum, maximum, and quartiles all in a nice data table. We use transpose just to flip the table so that features are on the rows and the statistics are on the columns.
df.describe().transpose() # summary statistics
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
X | 58.0 | 481.034483 | 277.707973 | 0.000000 | 292.500000 | 420.000000 | 740.000000 | 990.000000 |
Y | 58.0 | 616.603448 | 297.219804 | 49.000000 | 346.750000 | 669.000000 | 896.000000 | 999.000000 |
Facies | 58.0 | 0.793103 | 0.408619 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
Porosity | 58.0 | 0.139756 | 0.041070 | 0.058548 | 0.105038 | 0.134101 | 0.163066 | 0.223661 |
Perm | 58.0 | 233.112389 | 513.879950 | 0.075819 | 3.067836 | 19.032111 | 125.562959 | 2372.383732 |
Visualizing Tabular Data with Location Maps#
It is natural to set the x and y coordinate and feature ranges manually. e.g. do you want your color bar to go from 0.05887 to 0.24230 exactly? Also, let’s pick a color map for display. I heard that plasma is known to be friendly to the color blind as the color and intensity vary together (hope I got that right, it was an interesting Twitter conversation started by Matt Hall from Agile if I recall correctly). We will assume a study area of 0 to 1,000m in x and y and omit any data outside this area.
xmin = 0.0; xmax = 1000.0 # range of x values
ymin = 0.0; ymax = 1000.0 # range of y values
pormin = 0.05; pormax = 0.25; # range of porosity values
nx = 100; ny = 100; csize = 10.0
Let’s try out locmap. This is a reimplementation of GSLIB’s locmap program that uses matplotlib. I hope you find it simpler than matplotlib, if you want to get more advanced and build custom plots lock at the source. If you improve it, send me the new code. Any help is appreciated. To see the parameters, just type the command name:
GSLIB.locmap # GeostatsPy's 2D point plot function
<function geostatspy.GSLIB.locmap(df, xcol, ycol, vcol, xmin, xmax, ymin, ymax, vmin, vmax, title, xlabel, ylabel, vlabel, cmap, fig_name)>
Now we can populate the plotting parameters and visualize the porosity data.
plt.subplot(111)
GSLIB.locmap_st(df,'X','Y','Porosity',xmin,xmax,ymin,ymax,pormin,pormax,'Well Data - Porosity','X(m)','Y(m)','Porosity (fraction)',cmap)
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
Simple Bootstrap Method in Python#
If you are new to bootstrap and Python, here’s the most simple code possible for bootstrap.
specify the number of bootstrap realizations, \(L\)
declare a list to store the bootstrap realizations of the statistic of interest
loop over L bootstrap realizations
n MCS, random samples with replacement for a new realization of the data
calculate the realization of the statistic from the realization of the data
summarize the resulting uncertainty model, histogram, summary statistics etc.
import random # import random package
L = 1000 # set the number of bootstrap realizations
por_avg_real = [] # declare an empty list to store the bootstrap realizations of the statistic
for k in range(0,L): # loop over the L bootstrap realizations
samples = random.choices(df['Porosity'].values, k=len(df)) # n Monte Carlo simulations
por_avg_real.append(np.average(samples)) # calculate the statistic of interest from the new bootstrap dataset
plt.hist(por_avg_real,color = 'darkorange',alpha = 0.8,edgecolor = 'black') # plot the distribution, could also calculate any summary statistics
plt.xlabel('Bootstrap Realizations of Average Porosity'); plt.ylabel('Frequency'); plt.title('Uncertainty Distribution for Average Porosity')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
Now we proceed with a more complicated, robust work by first quantifying the spatial bias in the samples and assigned data weights to mitigate this bias.
Declustering#
Let’s calculate some declustering weights. There is a demonstration on declustering here https://git.io/fhgJl if you need more information.
wts, cell_sizes, dmeans = geostats.declus(df,'X','Y','Porosity',iminmax = 1, noff= 10, ncell=100,cmin=10,cmax=2000)
df['Wts'] = wts # add weights to the sample data DataFrame
df.head() # preview to check the sample data DataFrame
def weighted_avg_and_std(values, weights): # function to calculate weighted mean and st. dev., from Eric O Lebigot, stack overflow,
average = np.average(values, weights=weights)
variance = np.average((values-average)**2, weights=weights)
return (average, math.sqrt(variance))
sample_avg, sample_stdev = weighted_avg_and_std(df['Porosity'],df['Wts'])
print('Declustered mean = ' + str(round(sample_avg,3)) + ' and declustered standard deviation = ' + str(round(sample_stdev,3)))
There are 58 data with:
mean of 0.13975615374137929
min and max 0.058547873 and 0.223660709
standard dev 0.04071489330168611
Declustered mean = 0.123 and declustered standard deviation = 0.034
A Couple of Bootstrap Realizations#
We will attempt bootstrap by-hand and manually loop over \(L\) realizations and draw \(n\) samples to calculate the summary statistics of interest, mean and variance. The choice function from the random package simplifies sampling with replacement from a set of samples with weights.
This command returns a ndarray with k samples with replacement from the ‘Porosity’ column of our DataFrame (df) accounting for the data weights in column ‘Wts’.
samples1 = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))
It is instructive to look at a couple of these realizations from the original declustered data set.
samples1 = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))
samples2 = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))
print('Bootstrap means, realization 1 = ' + str(np.average(samples1)) + ' and realization 2 = ' + str(np.average(samples2)))
plt.subplot(221)
GSLIB.hist_st(df['Porosity'],pormin,pormax,False,False,20,df['Wts'],'Porosity (fraction)','Histogram Declustered Porosity')
plt.subplot(222)
GSLIB.hist_st(samples1,pormin,pormax,False,False,20,None,'Bootstrap Sample - Realization 1','Histogram Bootstrap Porosity 1')
plt.subplot(224)
GSLIB.hist_st(samples2,pormin,pormax,False,False,20,None,'Bootstrap Sample - Realization 2','Histogram Bootstrap Porosity 2')
plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=2.1, wspace=0.2, hspace=0.2); plt.show()
Bootstrap means, realization 1 = 0.12113506555172411 and realization 2 = 0.11595343589655174
Note that the bootstrap distributions vary quite a bit from the original.
Summarizations Over Bootstrap Realizations#
Let’s make a loop to conduct \(L\) resamples and calculate the average and standard deviation for each (\(m^\ell\), \(\sigma^2_{\ell}\), for \(\ell = 0,\dots,L-1\)). We then summarization over these \(L\) realizations.
I did not find any built-in, concise functions to accomplish this, i.e. with a single line of code, so we are going to do it by hand.
To understand this code there are just a couple of Python concepts that you need to add to your Python arsenal.
declaring arrays - NumPy has a lot of great array (ndarray) functionality. There are build in functions to make a ndarray of any length (and dimension). This includes ‘zeros’, ‘ones’ and ‘rand’, so when we use this code (we’re making arrays of length \(L\) pre-populated with zeros):
mean = np.zeros(L); stdev = np.zeros(L)
For Loops - when we are using the command below, we are instructing the computer to loop over all the indented code below the command for \(l = 0,1,2,\ldots,L-1\) times. For each loop the \(l\) variable increments, so we can use this to save each result to a different index in the arrays mean and stdev. Note, Python arrays index starting at 0 and stop at the length - 1 (we are running each bootstrap resampled realization, calculating the average and standard deviation and storing them in the arrays that we already declared).
for l in range(0, L):
Now, let’s write the code for performing bootstrap and calculate the uncertainty in the mean and standard deviation.
L = 1000 # set the number of realizations
mean = np.zeros(L); stdev = np.zeros(L) # declare arrays to hold the realizations of the statistics
for l in range(0, L): # loop over realizations
samples = random.choices(df['Porosity'].values, weights=df['Wts'].values, cum_weights=None, k=len(df))
mean[l] = np.average(samples)
stdev[l] = np.std(samples)
plt.subplot(121)
GSLIB.hist_st(mean,0.11,0.15,False,False,50,None,'Average Porosity (fraction)','Bootstrap Uncertainty in Porosity Average')
plt.subplot(122)
GSLIB.hist_st(stdev,0.015,0.045,False,False,50,None,'Standard Deviation Porosity (fraction)','Bootstrap Uncertainty in Porosity Standard Deviation')
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.2, wspace=0.2, hspace=0.2)
plt.show()
print('Summary Statistics for Bootstrap Porosity Mean Realizations:')
print(stats.describe(mean))
print('P10: ' + str(round(np.percentile(mean,10),3)) + ', P50: ' + str(round(np.percentile(mean,50),3)) + ', P90: ' + str(round(np.percentile(mean,90),3)))
print('\nSummary Statistics for Bootstrap Porosity Standard Deviation Realizations:')
print(stats.describe(stdev))
print('P10: ' + str(round(np.percentile(stdev,10),3)) + ', P50: ' + str(round(np.percentile(stdev,50),3)) + ', P90: ' + str(round(np.percentile(stdev,90),3)))
plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.1, wspace=0.2, hspace=0.2); plt.show()
Summary Statistics for Bootstrap Porosity Mean Realizations:
DescribeResult(nobs=1000, minmax=(0.10898530070689653, 0.13660550484482759), mean=0.1225161345117069, variance=2.0877470462299377e-05, skewness=0.058561877869793465, kurtosis=-0.046602730972200135)
P10: 0.117, P50: 0.122, P90: 0.128
Summary Statistics for Bootstrap Porosity Standard Deviation Realizations:
DescribeResult(nobs=1000, minmax=(0.02063507443238311, 0.04493071499443032), mean=0.03402645587322368, variance=1.2427911841362585e-05, skewness=-0.16205399885793326, kurtosis=0.1909400193587918)
P10: 0.03, P50: 0.034, P90: 0.038
<Figure size 640x480 with 0 Axes>
Bootstrap in GeostatsPy#
We have a simple bootstrap function in GeostatsPy
realizations_array = geostats.bootstrap(array,stat,weights=None,nreal=100)
where:
array - an numpy ndarray of samples
stat - statistic function that can be applied to an ndarray, e.g., numpy.average
weights - an array of weights, if omitted or set to None, then equal weighting is applied
nreal - the number of realizations
L = 10000
mean = geostats.bootstrap(df['Porosity'].values,stat = np.average,weights = df['Wts'].values,nreal = L)
stdev = geostats.bootstrap(df['Porosity'].values,stat = np.std,weights = df['Wts'].values,nreal = L)
plt.subplot(121)
GSLIB.hist_st(mean,0.11,0.15,False,False,50,None,'Average Porosity (fraction)','Bootstrap Uncertainty in Porosity Average')
plt.subplot(122)
GSLIB.hist_st(stdev,0.015,0.045,False,False,50,None,'Standard Deviation Porosity (fraction)','Bootstrap Uncertainty in Porosity Standard Deviation')
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.2, wspace=0.2, hspace=0.2)
plt.show()
print('Summary Statistics for Bootstrap Porosity Mean Realizations:')
print(stats.describe(mean))
print('P10: ' + str(round(np.percentile(mean,10),3)) + ', P50: ' + str(round(np.percentile(mean,50),3)) + ', P90: ' + str(round(np.percentile(mean,90),3)))
print('\nSummary Statistics for Bootstrap Porosity Standard Deviation Realizations:')
print(stats.describe(stdev))
print('P10: ' + str(round(np.percentile(stdev,10),3)) + ', P50: ' + str(round(np.percentile(stdev,50),3)) + ', P90: ' + str(round(np.percentile(stdev,90),3)))
plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=2.1, wspace=0.2, hspace=0.2); plt.show()
Summary Statistics for Bootstrap Porosity Mean Realizations:
DescribeResult(nobs=10000, minmax=(0.1079241129137931, 0.14113598706896552), mean=0.12253668303023105, variance=2.0056951482080468e-05, skewness=0.11558725910478769, kurtosis=0.014491991163210205)
P10: 0.117, P50: 0.122, P90: 0.128
Summary Statistics for Bootstrap Porosity Standard Deviation Realizations:
DescribeResult(nobs=10000, minmax=(0.02089868884187859, 0.046002150236787526), mean=0.033908531152686064, variance=1.2469009512401978e-05, skewness=-0.0998277772962186, kurtosis=-0.01875007154804953)
P10: 0.029, P50: 0.034, P90: 0.038
<Figure size 640x480 with 0 Axes>
Want to Work Together?#
I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.
Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!
Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
I can be reached at mpyrcz@austin.utexas.edu.
I’m always happy to discuss,
Michael
Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
Comments#
This was a basic demonstration of bootstrap for uncertainty modeling. Much more can be done, I have other demonstrations for modeling workflows with GeostatsPy in the GitHub repository GeostatsPy_Demos.
I hope this is helpful,
Michael