Machine Learning Glossary

Machine Learning Glossary#

Michael J. Pyrcz, Professor, The University of Texas at Austin

Chapter of e-book “Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy”.

Cite this e-Book as:

Pyrcz, M.J., 2024, Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy, https://geostatsguy.github.io/GeostatsPyDemos_Book.

The workflows in this book and more are available here:

Cite the GeostatsPyDemos GitHub Repository as:

Pyrcz, M.J., 2024, GeostatsPyDemos: GeostatsPy Python Package for Spatial Data Analytics and Geostatistics Demonstration Workflows Repository (0.0.1). Zenodo. https://zenodo.org/doi/10.5281/zenodo.12667035

By Michael J. Pyrcz
© Copyright 2024.

This chapter is a summary of essential Machine Learning Terminology.

Motivation for Machine Learning Concepts#

Firstly, why do this? I have received the request for a course glossary from the students in my Subsurface Machine Learning combined undergraduate and graduate course. While I usually dedicate a definition slide in the lecture slide decks for salient terms, some of my students have requested course glossary, list of terminology for their course review. The e-book provides a great vehicle and motivation to finally complete this.

Let me begin with a confession. There is a Machine Learning Glossary written by Google developers. For those seeking the in depth, comprehensive list of geostatistical terms please use this book!

By writing my own glossary I can limit the scope and descriptions to course content. I fear that many students would be overwhelmed by the size and mathematical notation of a standard machine learning glossary.
Also, by including a glossary in the e-book I can link from glossary entries to the chapters in the e-book for convenience. I will eventual populate all the chapters with hyperlinks to the glossary to enable moving back and forth between the chapters and the glossary.

Finally, like the rest of the book, I want the glossary to be a evergreen living document.

Adjacency Matrix (spectral clustering)#

Spectral Clustering: a matrix representing a graph with the pairwise connections between all pairwise combinations of graph nodes, samples.

the values are indicators, 0 if not connected, 1 if connected

Note, node self connections are set to 0 in the adjacency matrix

Addition Rule (probability)#

Probability Concepts: when we add probabilities (the union of outcomes), the probability of $A$ or $B$ is calculated with the probability addition rule,

\[ P(A \cup B) = P(A) + P(B) - P(A,B) \]

given mutually exclusive events we can generalize the addition rule as,

\[ P\left( \bigcup_{i=1}^k A_i \right) = \sum_{i=1}^k P(A_i) \]

Affine Correction#

Feature Transformations: a distribution rescaling that can be thought of as shifting, and stretching or squeezing of a univariate distribution (e.g., histogram). For the case of affine correction of $X$ to $Y$,

\[ y_i = \frac{\sigma_y}{\sigma_x}(x_i - \overline{x}) + \overline{y}, \quad \forall \quad i, \ldots, n \]

where $\overline{x}$ and $\sigma_x$ are the original mean and variance, and $\overline{y}$ and $\sigma_y$ are the new mean and variance.

We can see above that the affine correlation method first centers the distribution (by subtracting the original mean), then rescales the dispersion (distribution spread) by the ratio of the new standard deviation to the original standard deviation and then shifts the distribution to centered on the new mean.

there is no shape change for affine correction. For shape change consider Distribution Transformation like Gaussian Anamorphosis.

Affinity Matrix (spectral clustering)#

Spectral Clustering: a matrix representing a graph with the degree of pairwise connections between all pairwise combinations of graph nodes, samples.

values indicate the strength of the connection, unlike adjacency matrix with indicators, 0 if not connected, 1 if connected

Note, node self connections are set to 0 in the adjacency matrix

Bagging Models#

Bagging Tree and Random Forest: the application of bootstrap to obtain data realizations,

\[ Y^b, X_1^b, \dots, X_m^b, \quad b = 1, \dots, B \]

to train predictive model realizations,

$\hat{Y}^b = \hat{f}^b (X_1^b, \dots, X_m^b)$

where,

$(X_1^b, \dots, X_m^b)$ - the bootstrap predictor features in the $b^{th}$ bootstrapped dataset
$\hat{f}^b$ - the $b^{th}$ bootstrapped model
$\hat{Y}^b$ - predicted value for the model in the $b^{th}$ bootstrapped model

to calculate prediction realizations. The ensemble of prediction realizations are aggregated to reduce model variance. The aggregation includes,

regression - the average of the predictions $$ \hat{Y} = \frac{1}{B} \sum_{b=1}^{B} \hat{Y}^b $$
classification - the mode of the predictions

\[ \hat{Y} = \text{argmax}(\hat{Y}^b) \]

We can perform bagging with any prediction model, in fact the BaggingClassifier and BaggingRegressor functions in scikit-learn are wrappers that take the prediction model as an input.

Basis Expansion#

Polynomial Regression: to add flexibility to our model, for example, to capture non-linearity in our model for regression, classification, we expand the features with a set of basis functions

in mathematics basis expansion is the approach of representing a more complicated function with a linear combination of simpler basis functions that make the problem easier to solve
with basis expansion we expand the dimensionality of the problem with basis functions of the original features, but still use linear methods on the transformed features.

\[ ℎ(𝑥_𝑖 )=\left( ℎ_1(𝑥_𝑖 ),ℎ_2(𝑥_𝑖 ),\ldots,ℎ_𝑘(𝑥_𝑖 ) \right) \]

Here an example of basis expansion, the set of basis functions for polynomial basis expansion:

\[ h_{i,1}(x_i) = x_i, \quad h_{i,2}(x_i) = x_i^2, \quad h_{i,3}(x_i) = x_i^3, \quad h_{i,4}(x_i) = x_i^4, \dots, \quad h_{i,k}(x_i) = x_i^k \]

Basis Function#

Polynomial Regression: to add flexibility to our model, for example, to capture non-linearity in our model for regression, classification, we expand the features with a set of basis functions

in mathematics basis expansion is the approach of representing a more complicated function with a linear combination of simpler basis functions that make the problem easier to solve
with basis expansion we expand the dimensionality of the problem with basis functions of the original features, but still use linear methods on the transformed features.

\[ ℎ(𝑥_𝑖 )=\left( ℎ_1(𝑥_𝑖 ),ℎ_2(𝑥_𝑖 ),\ldots,ℎ_𝑘(𝑥_𝑖 ) \right) \]

were each of $h_1$, \ldots, $h_k$ are basis functions. For example, here are the basis functions for polynomial basis expansion:

\[ h_{i,1}(x_i) = x_i, \quad h_{i,2}(x_i) = x_i^2, \quad h_{i,3}(x_i) = x_i^3, \quad h_{i,4}(x_i) = x_i^4, \dots, \quad h_{i,k}(x_i) = x_i^k \]

Bayes’ Theorem (probability)#

Probability Concepts: the mathematical model central to Bayesian probability for the Bayesian updating from prior probability, with likelihood probability from new information to posterior probability.

\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \]

where $P(A)$ is the prior, $P(B|A)$ is the likelihood, $P(B)$ is the evidence term and $P(A|B)$ is the posterior. If is convenient to substitute more descriptive labels for $A$ and $B$ to better conceptualize this approach,

\[ P(\text{Model} | \text{New Data}) = \frac{P(\text{New Data} | \text{Model}) \cdot P(\text{Model})}{P(\text{New Data})} \]

demonstrating that we are updating our model with new data

Bayesian Probability#

Probability Concepts: probabilities based on a degree of belief (expert judgement and experience) in the likelihood of an event. The general approach,

start with prior probability, prior to the collection of new information
formulate a likelihood probability, based on new information alone
update prior with likelihood to calculate the updated posterior probability
continue to update as new information is available
solve probability problems that we cannot use simple frequencies, i.e., frequentist probability approach
Bayesian updating is modeled with Bayes’ Theorem

Bayesian Updating for Classification#

Naive Bayes: this is how we pose the classification prediction problem from the perspective of Bayesian updating, based on the conditional probability of a category, $k$, given $n$ features, $x_1, \dots , x_n$.

\[ P(C_k | x_1, \dots , x_n) \]

we can solve for this posterior with Bayesian updating,

\[ P(C_k | x_1, \dots , x_n) = \frac{P(x_1, \dots , x_n | C_k) P(C_k)}{P(x_1, \dots , x_n)} \]

let’s combine the likelihood and prior for the moment,

\[ P(x_1, \dots , x_n | C_k) P(C_k) = P(x_1, \dots , x_n, C_k) \]

we can expand the full joint distribution recursively as follows,

\[ P(x_1, \dots , x_n, C_k) \]

expansion of the joint with the conditional and prior,

\[ P(x_1 | x_2, \dots , x_n, C_k) P(x_2, \dots , x_n, C_k) \]

continue recursively expanding,

\[ P(x_1 | x_2, \dots , x_n, C_k) P(x_2 | x_3, \dots , x_n, C_k) P(x_3, \dots , x_n, C_k) \]

we can generalize as,

\[ P(C_k | x_1, \dots , x_n) = P(x_1 | x_2, \dots , x_n, C_k) P(x_2 | x_3, \dots , x_n, C_k) P(x_3 | x_4, \dots , x_n, C_k) \ldots P(x_{n-1} | x_n, C_k) (x_{n} | C_k) P(C_k) \]

Bayesian Linear Regression#

Bayesian Linear Regression: the frequentist formulation of the linear regression model is,

\[ y = b_1 \times x + b_0 + \sigma \]

where $x$ is the predictor feature, $b_1$ is the slope parameter, $b_0$ is the intercept parameter and $\sigma$ is the error or noise. There is an analytical form for the ordinary least squares solution to fit the available data while minimizing the $L^2$ norm of the data error vector.

For the Bayesian formulation of linear regression is we pose the model as a prediction of the distribution of the response, $Y$, now a random variable:

\[ Y \sim N(\beta^{T}X, \sigma^{2} I) \]

We estimate the model parameter distributions through Bayesian updating for inferring the model parameters from a prior and likelihood from training data.

\[ P(\beta | y, X) = \frac{P(y,X| \beta) P(\beta)}{P(y,X)} \]

In general for continuous features we are not able to directly calculate the posterior and we must use a sampling method, such as Markov chain Monte Carlo (McMC) to sample the posterior.

Big Data#

Machine Learning Concepts: you have big data if your data has a combination of these criteria:

Data Volume - many data samples and features, difficult to store, transmit and visualize
Data Velocity - high-rate collection, continuous data collection relative to decision making cycles, challenges keeping up with the new data while updating the models
Data Variety - data form various sources, with various types of data, types of information, and scales
Data Variability - data acquisition changes during the project, even for a single feature there may be multiple vintages of data with different scales, distributions, and veracity
Data Veracity - data has various levels of accuracy, the data is not certain

For common subsurface applications most, if not all, of these criteria are met. Subsurface engineering and geoscience are often working with big data!

Big Data Analytics#

Machine Learning Concepts: the process of examining large and varied data (big data) sets to discover patterns and make decisions, the application of statistics to big data.

Binary Transform (also Indicator Transform)#

Feature Transformations: indicator coding a random variable to a probability relative to a category or a threshold.

If $i(\bf{u}:z_k)$ is an indicator for a categorical variable,

what is the probability of a realization equal to a category?

\[\begin{split} i(\bf{u}; z_k) = \begin{cases} 1, & \text{if } Z(\bf{u}) = z_k \\ 0, & \text{if } Z(\bf{u}) \ne z_k \end{cases} \end{split}\]

for example,

given threshold, $z_2 = 2$, and data at $\bf{u}_1$, $z(\bf{u}_1) = 2$, then $i(bf{u}_1; z_2) = 1$
given threshold, $z_1 = 1$, and a RV away from data, $Z(\bf{u}_2)$ then is calculated as $F^{-1}_{\bf{u}_2}(z_1)$ of the RV as $i(\bf{u}_2; z_1) = 0.23$

If $I\{\bf{u}:z_k\}$ is an indicator for a continuous variable,

what is the probability of a realization less than or equal to a threshold?

\[\begin{split} i(\bf{u}; z_k) = \begin{cases} 1, & \text{if } Z(\bf{u}) \le z_k \\ 0, & \text{if } Z(\bf{u}) > z_k \end{cases} \end{split}\]

for example,

given threshold, $z_1 = 6\%$, and data at $\bf{u}_1$, $z(\bf{u}_1) = 8\%$, then $i(\bf{u}_1; z_1) = 0$
given threshold, $z_4 = 18\%$, and a RV away from data, $Z(\bf{u}_2) = N\left[\mu = 16\%,\sigma = 3\%\right]$ then $i(\bf{u}_2; z_4) = 0.75$

The indicator coding may be applied over an entire random function by indicator transform of all the random variables at each location.

Boosting Models#

Gradient Boosting: addition of multiple week learners to build a stronger learner.

a weak learner is one that offers predictions just marginally better than random selection

This is the method in words, and then with equations,

build a simple model with a high error rate, the model can be quite inaccurate, but moves in the correct direction
calculate the error from the model
fit another model to the error
calculate the error from this addition of the first and second model
repeat until the desired accuracy is obtained or some other stopping criteria

Now with equations, the general workflow for predicting $Y$ from $X_1,\ldots,X_m$ is,

build a week learner to predict $Y$ from $X_1,\ldots,X_m$, $\hat{F}_k(X)$ from the training data $x_{i,j}$.
loop over number of desired estimators, $k = 1,\ldots,K$

calculate the residuals at the training data, $h_k(x_{i}) = y_i - \hat{F}_k(x_{i})$
fit another week learner to predict $h_k$ from $X_1,\ldots,X_m$, $\hat{F}_k(X)$ from the training data $x_{i,j}$.

each model builds on the previous to improve the accuracy

The regression estimator is the summation over the $K$ simple models,

\[ \hat{Y} =\sum_{k=1}^{K} F_k(X_1,\ldots,X_m) \]

Bootstrap#

Bagging Tree and Random Forest: a statistical resampling procedure to calculate uncertainty in a calculated statistic from the sample data itself. Some general comments,

sampling with replacement - $n$ (number of data samples) Monte Carlo simulations from the dataset cumulative distribution function, this results in a new realization of the data
simulates the data collection process - the fundamental idea is to simulate the original data collection process. Instead of actually collecting new sample sets, we randomly select from the data to get data realizations
bootstrap any statistic - this approach is very flexible as we can calculate realizations of any statistics from the data realizations
computationally cheap - repeat this approach to get realizations of the statistic to build a complete distribution of uncertainty. Use a large number of realizations, $L$, for a reliable uncertainty model.
calculates the entire distribution of uncertainty - for any statistic, you calculate any summary statistic for the uncertainty model, e.g., mean, P10 and P90 of the uncertainty in the mean
bagging for machine learning - is the application of bootstrap to obtain data realizations to train predictive model realizations to aggregate predictions over ensembles of prediction models to reduce model variance

What are the limitations of bootstrap?

biased sample data will likely result in a biased bootstrapped uncertainty model, you must first debias the samples, e.g., declustering
you must have a sufficient sample size
integrates uncertainty due to sparse samples in space only
does not account for the spatial context of the data, i.e., sample data locations, volume of interest nor the spatial continuity. There is a variant of bootstrap called spatial bootstrap.

Categorical Feature#

Machine Learning Concepts: a feature that can only take one of a limited, and usually fixed, number of possible values

Categorical Nominal Feature#

Machine Learning Concepts: a categorical feature without any natural ordering, for example,

facies = {boundstone, wackystone, packstone, brecia}
minerals = {quartz, feldspar, calcite}

Categorical Ordinal Feature#

Machine Learning Concepts: a categorical feature with a natural ordering, for example,

geologic age = {Miocene, Pliocene, Pleistocene} - ordered from older to younger rock
Mohs hardness = $\{1, 2, \ldots, 10\}$ - ordered from softer to harder rock

Causation#

Multivariate Analysis: a relationship where a change in one or more feature(s) directly leads to a change in one or more other feature(s).

Some important aspects of causal relationships,

Asymmetry and temporal precedence - $A$ is caused by $B$ does not indicate that $B$ is caused by $A$
Non-spurious - not due to random effect or confounding features
Mechanism and explanation - a plausible mechanism or process is available to explain the relationship
Consistency - the relationship is observable over a range of conditions, times, locations, populations, etc.
Strength - stronger relationships increase the likelihood of causation given all the previous 1-5 hold

Establishing causation is very difficult,

in this course we typically avoid causation and causal analysis, and emphasize this with statements such as correlation does not imply causation

Cell-based Declustering#

Data Preparation: a declustering method to assign weights to spatial samples based on local sampling density, such that the weighted statistics are likely more representative of the population. Data weights are assigned such that,

samples in densely sampled areas receive less weight
samples in sparsely sampled areas receive more weight

The goal of declustering is for the sample statistics to be independent of sample locations, e.g., infill drilling or blast hole samples should not change the statistics for the area of interest due to increased local sample density.

Cell-based declustering proceeds as follows:

a cell mesh is placed over the spatial data and weights are set as proportional to the inverse of the number of samples in the cell
the cell mesh size is varied, and the cell size that minimizes the declustered mean (in the sample mean is biased high) or maximizes the declustered mean (if the sample mean is biased low) is selected
to remove the impact of cell mesh position, the cell mesh is randomly moved several times and the resulting declustering weights are averaged for each datum

The weights are calculated as:

\[ w(\bf{u}_j) = \frac{1}{n_l} \cdot \frac{n}{L_o} \]

where $n_l$ is the number of data in the current cell, $L_o$ is the number of cells with data, and $n$ is the total number of data.

Here are some highlights for cell-based declustering,

expert judgement to assign cell size based on the nominal sample spacing (e.g., data spacing before infill drilling) will improve the performance over the automated method for cell size selection based on minimum or maximum declustered mean (mentioned above)
cell-based declustering is not aware of the boundaries of the area of interest; therefore, data near the boundary of the area of interest may appear to be more sparsely sampled and receive more weight
cell-based was developed by Professor Andre Journel in 1983, []

Cognitive Biases#

Machine Learning Concepts: an automated (subconscious) thought process used by human brain to simplify information processing from large amount of personal experience and learned preferences. While these have been critical for our evolution and survival on this planet, they can lead to the following issues in data science:

Anchoring Bias, too much emphasis on the first piece of information. Studies have shown that the first piece of information could be irrelevant as we are beginning to learn about a topic, and often the earliest data in a project has the largest uncertainty. Address anchoring bias by curating all data, integrating uncertainty, fostering open discussion and debate on your project team.
Availability Heuristic, overestimate importance of easily available information, for example, grandfather smoked 3 packs a day and lived to 100 years old, i.e., relying on anecdotes. Address availability heuristic by ensuring the project team documents all available information and applies quantitative analysis to move beyond anecdotes.
Bandwagon Effect, assessed probability increases with the number of people holding the same belief. Watch out for everyone jumping on board or the loudest voice influencing all others on your project teams. Encouraging all members of the project team to contribute and even separate meetings may be helpful to address bandwagon effect.
Blind-spot Effect, fail to see your own cognitive biases. This is the hardest cognitive bias of all. One possible solution is to invite arms length review of your project team’s methods, results and decisions.
Choice-supportive Bias, probability increases after a commitment, i.e., a decision is made. For example, it was good that I bought that car supported by focusing on positive information about the car. This is a specific case of confirmation bias.
Clustering Illusion, seeing patterns in random events. Yes, this heuristic helped us stay alive when large predictors hunted us, i.e., false positives are much better than false negatives! The solution is to model uncertainty confidence intervals and test all data and results against random effect.
Confirmation Bias, only consider new information that supports current model. Choice-supportive bias is a specific case of confirmation bias. The solution to confirmation bias is to seek out people that you will likely disagree with and build skilled project teams that hold diverse technical opinions and have different expert experience. My approach is to get nervous if everyone in the room agrees with me!
Conservatism Bias, favor old data to newly collected data. Data curation and quantitative analysis are helpful.
Recency Bias, favor the most recently collected data. Ensure your team documents previous data and choices to enhance team memory. Just like conservative bias, data curation and quantitative analysis are our first line of defense.
Survivorship Bias, focus on success cases only. Check for any possible pre-selection or filters on the data available to your team.

Robust use of statistics / data analytics protects use from bias.

Complimentary Events (probability)#

Probability Concepts: the NOT operator for probability, if we define A then A compliment, $A^c$, is not A and we have this resulting closure relationship,

\[ P(A) + P(A^c) = 1.0 \]

complimentary events may be considered for beyond univariate problems, for example consider this bivariate closure,

\[ P(A|B) + P(A^c|B) = 1.0 \]

Note, the given term must be the same.

Computational Complexity#

Linear Regression: represents the computer resources for a method, we use it in machine learning to understand how our machine learning methods scale as we change the dimensionality, number of features, and the number of training data, represented by,

\[ 𝑂(𝑓(𝑛)) \]

where $𝑛$ represents size of the problem. There are 2 components of computational complexity,

time complexity - refers to computational time and the scaling of this time to the size of the problem for a given algorithm
space complexity - refers to computer memory required and the scaling of storage to the size of the problem for a given algorithm

For example, if time complexity is $O(n^3)$, where is $n$ is number of training data, then if we double the number of data the run time increases eight times.

Additional salient points about computational complexity,

default to worst-case complexity - the worst case for complexity given a specific problem size, provides an upper bound
asymptotic complexity - where $𝑛$ is large. Some algorithms have speed-up for small datasets, this is not used
assumes all steps are required, e.g., data is not presorted, etc.

Time complexity examples,

quadratic time, $𝑶(𝒏^𝟐)$ - for example, integer multiplication, bubble sort
linear time, $𝑶(𝒏)$ - for example, finding the min or max in an unsorted array
fractional power, $𝑶(𝒏^𝒄 )$ - where $[0 < c < 1]$, for example, searching in a kd-tree, $𝑂(𝑛^(\frac{1}{2}))$
exponential Time, $𝑶(𝟐^𝒏)$ - for example, traveling salesman problem with dynamic programing

Conditional Probability#

Probability Concepts: the probability of an event, given another event has occurred,

\[ P(A|B) = \frac{P(A,B)}{P(A)} \]

we read this as the probability of A given B has occurred as the joint divided by the marginal. We can extend conditional probabilities to any multivariate case by adding joints to either component. For example,

\[ P(C|B,A) = \frac{P(A,B,C)}{P(B,C)} \]

Confidence Interval#

Linear Regression: the uncertainty in a summary statistic or model parameter represented as a range, lower and upper bound, based on a specified probability interval known as the confidence level.

We communicate confidence intervals like this,

there is a 95% probability (or 19 times out of 20) that model slope is between 0.5 and 0.7.

Other salient points about confidence intervals,

calculated by analytical methods, when available, or with more general and flexible bootstrap
for Bayesian methods we refer credibility intervals

Confusion Matrix#

Naive Bayes: a matrix with frequencies of predicted (x axis) vs. actual (y axis) categories to visualize the performance of a classification model.

visualize and diagnose all the combinations of correct and misclassification with the classification model, for example, category 1 is often misclassified as category 3.
perfect accuracy is number of each class on the diagonal, category 1 is always predicted as category 1, etc.
the classification matrix is applied to calculate a single summary of categorical accuracy, for example, precision, recall, etc.

Continuous Feature#

Machine Learning Concepts: a feature that can take any value between a lower and upper bound. For example,

porosity = $\{13.01\%, 5.23\%, 24.62\%\}$
gold grade = $\{4.56 \text{ g/t}, 8.72 \text{ g/t}, 12.45 \text{ g/t} \}$

Continuous, Interval Feature#

Machine Learning Concepts: a continuous feature where the intervals between numbers are equal, for example, the difference between 1.50 and 2.50 is the same as the difference between 2.50 and 3.50, but the actual values do not have an objective, physical reality (exist on an arbitrary scale), i.e., do not have a true zero point, for example,

Celsius scale of temperature (an arbitrary scale based on water freezing at 0 and boiling at 100)
calendar year (there is no true zero year)

We can use addition and subtraction operations to compare continuous, interval features.

Continuous, Ratio Feature#

Machine Learning Concepts: a continuous feature where the intervals between numbers are equal, for example, the difference between 1.50 and 2.50 is the same as the difference between 2.50 and 3.50, but the values do have an objective reality (measure an actual physical phenomenon), i.e., do have true zero point, for example,

Kelvin scale of temperature
porosity
permeability
saturation

Since there is a true zero, continuous, ratio features can be compared with multiplication and division mathematical operations (in addition to addition and subtraction), e.g., twice as much porosity.

Continuously Differentiable#

Machine Learning Training and Tuning: a function is continuously differentiable if it satisfies two key conditions:

The function is differentiable, the derivative of the function exists at every point in its domain, i.e., the function has a well-defined slope at every possible point.
The derivative is continuous, the derivative of the function does not have any jumps, discontinuities, or abrupt changes, i.e, the derivative function itself is continuous at every point in its domain.

For a machine learning example,

the $L^2$ norm is continuously differentiable and as a result for linear and ridge regression we can apply partial derivatives to the loss function to calculate a closed form of training the model parameters
the $L^1$ norm is not continuously differentiable and as a result for LASSO regression we cannot apply partial derivatives to the loss function to calculate a closed form of training the model parameters. We must use iterative optimization to train the model parameters.

Convolution#

k-Nearest Neighbours: Integral product of two functions, after one is reversed and shifted by $\Delta$.

one interpretation is smoothing a function with weighting function, $𝑓(\Delta)$, is applied to calculate the weighted average of function, $𝑔(x)$,

\[ (f * g)(x) = \int_{-\infty}^{\infty} f(\Delta) g(x - \Delta) \, d\Delta \]

this easily extends into multidimensional

\[ (f * g)(x, y, z) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(\Delta_x, \Delta_y, \Delta_z) g(x - \Delta_x, y - \Delta_y, z - \Delta_z) \, d\Delta_x \, d\Delta_y \, d\Delta_z \]

The choice of which function is shifted before integration does not change the result, the convolution operator has commutativity.

\[ (f * g)(x) = \int_{-\infty}^{\infty} f(\Delta) g(x - \Delta) \, d\Delta \]

\[ (f * g)(x) = \int_{-\infty}^{\infty} f(x - \Delta) g(\Delta) \, d\Delta \]

if either function is reflected then convolution is equivalent to cross-correlation, measure of similarity between 2 signals as a function of displacement.

Core Data#

Machine Learning Concepts: the primary sampling method for direct measure for subsurface resources (recovered drill cuttings are also direct measures with greater uncertainty and smaller, irregular scale). Comments on core data,

expensive / time consuming to collect for oil and gas, interrupt drilling operations, sparse and selective (very biased) coverage
very common in mining (diamond drill holes) for grade control with regular patterns and tight spacing
gravity, piston, etc. coring are used to sample sediments in lakes and oceans

What do we learn from core data?

petrological features (sedimentary structures, mineral grades), petrophysical features (porosity, permeability), and mechanical features (elastic modulas, Poisson’s ratio)
stratigraphy and ore body geometry through interpolation between wells and drill holes

Core data are critical to support subsurface resource interpretations. They anchor the entire reservoir concept and framework for prediction,

for example, core data collocated with well log data are used to calibrate (ground truth) facies, porosity from well logs

Correlation#

Multivariate Analysis: the Pearson’s product-moment correlation coefficient is a measure of the degree of linear relationship,

\[ \rho_{x,y} = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{(n-1)\sigma_x \sigma_y} \]

where $\overline{x}$ and $\overline{y}$ are the means of features $x$ and $y$. The measure is bound $\[-1,1\]$.

correlation coefficient is a standardized covariance

The Person’s correlation coefficient is quite sensitive to outliers and departure from linear behavior (in the bivariate sense). We have an alternative known as the Spearman’s rank correlations coefficient,

\[ \rho_{R_x R_y} = \frac{\sum_{i=1}^{n} (R_{x_i} - \overline{R_x})(R_{y_i} - \overline{R_y})}{(n-1)\sigma_{R_x} \sigma_{R_y}}, \, -1.0 \le \rho_{xy} \le 1.0 \]

The rank correlation applies the rank transform to the data prior to calculating the correlation coefficient. To calculate the rank transform simply replace the data values with the rank $R_x = 1,\dots,n$, where $n$ is the maximum value and $1$ is the minimum value.

\[ x_\alpha, \, \forall \alpha = 1,\dots, n, \, | \, x_i \ge x_j \, \forall \, i \gt j \]

\[ R_{x_i} = i \]

Covariance#

Multivariate Analysis: a measure of how two features vary together,

\[ C_{x,y} = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{(n-1)} \]

where $\overline{x}$ and $\overline{y}$ are the means of features $x$ and $y$. The measure is bound $\[-\sigma_x \cdot \sigm_y,\sigma_x \cdot \sigm_y\]$.

correlation coefficient is a standardized covariance

The Person’s correlation coefficient is quite sensitive to outliers and departure from linear behavior (in the bivariate sense). We have an alternative known as the Spearman’s rank correlations coefficient,

\[ \rho_{R_x R_y} = \frac{\sum_{i=1}^{n} (R_{x_i} - \overline{R_x})(R_{y_i} - \overline{R_y})}{(n-1)\sigma_{R_x} \sigma_{R_y}}, \, -1.0 \le \rho_{xy} \le 1.0 \]

The rank correlation applies the rank transform to the data prior to calculating the correlation coefficient. To calculate the rank transform simply replace the data values with the rank $R_x = 1,\dots,n$, where $n$ is the maximum value and $1$ is the minimum value.

\[ x_\alpha, \, \forall \alpha = 1,\dots, n, \, | \, x_i \ge x_j \, \forall \, i \gt j \]

\[ R_{x_i} = i \]

Cross Validation#

Machine Learning Concepts: withholding a portion of the data from the model parameter training to test the ability of the model to predict for cases not used to train the model

this is typically conducted by a train and test data split, with 15% - 30% of data assigned to testing
a dress rehearsal for real-world model use, the train-test split must be fair, resulting in similar prediction difficulty to the planned use of the model
there are more complicated designs such as k-fold cross validation that allows testing over all data via k-folds each with trained model
cross validation may be applied to check model performance for estimation accuracy (most common) and uncertainty model goodness (Maldonado-Cruz and Pyrcz, 2021)

Cumulative Distribution Function (CDF)#

Univariate Analysis: the sum of a discrete PDF or the integral of a continuous PDF. Here are the important concepts,

the CDF is stated as $F_x(x)$, note the PDF is stated as $f_x(x)$
is the probability that a random sample, $X$, is less than or equal to a specific value $x$; therefore, the y axis is cumulative probability

\[ F_x(x) = P(X \le x) = \int_{-infty}^x f(u) du \]

for CDFs there is no bin assumption; therefore, bins are at the resolution of the data.
monotonically non-decreasing function, because a negative slope would indicate negative probability over an interval.

The requirements for a valid CDF include,

non-negativity constraint:

\[ F_x(x) = P(X \le x) \ge 0.0, \quad \forall x \]

valid probability:

\[ 0.0 \le F_x(x) \le 1.0, \quad \forall x \]

cannot have negative slope:

\[ \frac{dF_x(x)}{dx} \ge 0.0, \quad \forall x \]

minimum and maximum (ensuring probability closure) values:

\[ \text{min}(F_x(x)) = 0.0 \quad \text{max}(F_x(x)) = 1.0 \]

Curse of Dimensionality#

Feature Ranking: the suite of challenges associated with working with many features, i.e., high dimensional space, including,

impossible to visualize data and model in high dimensionality space
usually insufficient sampling for statistical inference in vast high dimensional space
low coverage of high dimensional predictor feature space
distorted feature space, including warped space dominated by corners and distances lose sensitivity
multicollinearity between features is more likely as the dimensionality increases

Data (data aspects)#

Feature Ranking: when describing spatial dataset these are the fundamental aspects,

Data coverage - what proportion of the population has been sampled for this?

In general, hard data has high resolution (small scale, volume support), but with poor data coverage (measure only an extremely small proportion of the population, for example,

Core coverage deepwater oil and gas - well core only sample one five hundred millionth to one five billionth of a deepwater reservoir, assuming 3 inch diameter cores with 10% core coverage in vertical wells with 500 m to 1,500 m spacing
Core coverage mining grade control - diamond drill hole cores sample one eight thousandth to one thirty thousandth of ore body, assuming HQ 63.5 mm diameter cores with 100% core coverage in vertical drill holes with 5 m to 10 m spacing

Soft data tend to have excellent (often complete) coverage, but with low resolution,

Seismic reflection surveys and gradiometric surveys - data is generally available over the entire volume of interest, but resolution is low and generally decreasing with depth

Data Scale (support size) - What is the scale or volume sampled by the individual samples? For example,

core tomography images of core samples at the pore scale, 1 - 50 $\mu m$
gamma ray well log sampled at 0.3 m intervals with 1 m penetration away from the bore hole
ground-based gravity gradiometry map with 20 m x 20 m x 100 m resolution

Data Information Type - What does the data tell us about the subsurface? For example,

grain size distribution that may be applied to calibrate permeability and saturations
fluid type to assess the location of the oil water contact
dip and continuity of important reservoir layers to access connectivity
mineral grade to map high, mid and low grade ore shells for mine planning

Data Convexity#

Density-based Clustering: a subset, $A$, of Euclidean feature space is convex if, for any two points $𝑥_1$ and $𝑥_2$ within $𝐴$, the entire line segment connecting these points is within $A$, $\left[𝑥_1,𝑥_2\right] \in A$.

DataFrame#

Machine Learning Workflow Construction and Coding: a convenient Pandas class for working with data tables with rows for each sample and columns for each feature, due to,

convenient data structure to store, access, manipulate tabular data
built-in methods to load data from a variety of file types, Python classes and even directly from Excel
built-in methods to calculate summary statistics and visualize data
built-in methods for data queries, sort, data filters
built-in methods for data manipulation, cleaning, reformatting
built-in attributes to store information about the data, e.g. size, number nulls and null value

Data Analytics#

Machine Learning Concepts: the use of statistics with visualization to support decision making.

Dr. Pyrcz says that data analytics is the same as statistics.

Data Preparation#

Machine Learning Concepts: any workflow steps to enhance, improve raw data to be model ready.

data-driven science needs data, data preparation remains essential
$\gt >80\%$ of any subsurface study is data preparation and interpretation

We continue to face a challenge with data:

data curation - format standards, version control, storage, transmission, security and documentation
large volume to manage - visualization, availability and data mining and exploration
large volumes of metadata - lack of platforms, standards and formats
engineering integration, variety of data, scale, interpretation and uncertainty

Clean databases are prerequisite to all data analytics and machine learning

must start with this foundation
garbage in, garbage out

Degree Matrix (spectral clustering)#

Spectral Clustering: a matrix representing a graph with the number of connections for each graph nodes, samples.

diagonal matrix with integer for number of connections

DBSCAN for Density-based Clustering#

Density-based Clustering: an density-based clustering algorithm, groups are seeded or grown in feature space at locations with sufficient point density determined by hyperparameters,

$\epsilon$ – the radius of the local neighbourhood in the metric of normalized features. The is the scale / resolution of the clusters. If this values is set too small, too many samples are left as outliers and if set too large, all the clusters merge to one single cluster.
$min_{Pts}$ – the minimum number of points to assign a core point, where core points are applied to initialize or grow a cluster group.

Density is quantified by number of samples over a volume, where the volume is based on a radius over all dimensions of feature space.

Automated or guided $\epsilon$ parameter estimation is available by k-distance graph (in this case is k nearest neighbor).

Calculate the nearest neighbor distance in normalized feature space for all the sample data (1,700 in this case).
Sort in ascending order and plot.
Select the distance that maximizes the positive curvature (the elbow).

Here is a summary of salient aspects for DBSCAN clustering,

DBSCAN - stands for Density-Based Spatial Clustering of Applications with Noise (Ester et al.,1996).
Advantages - include minimum domain knowledge to estimate hyperparameters, the ability to represent any arbitrary shape of cluster groups and efficient to apply for large data sets
Hierarchical Bottom-up / Agglomerative Clustering – all data samples start as their own group, called ‘unvisited’ but practically as outliers until assigned to a group, and then the cluster group grow iteratively.
Mutually Exclusive – like k-means clustering, all samples may only belong to a single cluster group.

\[ P(C_i \cap C_j | i \ne j) = 0.0 \]

Non-exhaustive – some samples may be left as unassigned and assumed as outliers for the cluster group assignment

\[ P(C_1 \cup C_2 \cup \dots C_k) \le 1.0 \]

Decision Criteria#

Machine Learning Concepts: a feature that is calculated by applying the transfer function to the subsurface model(s) to support decision making. The decision criteria represents value, health, environment and safety. For example:

contaminant recovery rate to support design of a pump and treat soil remediation project
oil-in-place resources to determine if a reservoir should be developed
Lorenz coefficient heterogeneity measure to classify a reservoir and determine mature analogs
recovery factor or production rate to schedule production and determine optimum facilities
recovered mineral grade and tonnage to determine economic ultimate pit shell

Decision Tree#

Decision Tree: a intuitive, regression and classification predictive machine learning model that devides the predictor space, $𝑋_1,…,𝑋_𝑚$, into $𝐽$ mutually exclusive, exhaustive regions, $𝑅_𝑗$.

mutually exclusive – any combination of predictors only belongs to a single region, $𝑅_𝑗$
exhaustive – all combinations of predictors belong a region, $𝑅_𝑗$, regions cover entire feature space, range of the variables being considered

The same prediction in each region, mean of training data in region, $\hat{Y}(𝑅_𝑗) = \overline{Y}(𝑅_𝑗)$

for classification the most common, mode-based or argmax operator

Other salient points about decision tree,

supervised Learning - the response feature label, $Y$, is available over the training and testing data
hierarchical, binary segmentation - of the predictor feature space, start with 1 region and sequentially divide, creating new regions
compact, interpretable model - since the classification is based on a hierarchy of binary segmentations of the feature space (one feature at a time) the model can be specified in a intuitive manner as a tree with binary branches**, hence the name decision tree. The code for the model is nested if statements, for example,

if porosity > 0.15:
    if brittleness < 20:
        initial_production = 1000
    else:
        initial_production = 7000
else:
    if brittleness < 40:
        initial_production = 500
    else:
        initial_production = 3000

The decision tree is constructed from the top down. We begin with a single region that covers the entire feature space and then proceed with a sequence of splits,

scan all possible splits - over all regions and over all features.
greedy optimization - proceeds by finding the best split in any feature that minimizes the residual sum of squares of errors over all the training data $y_i$ over all of the regions $j = 1,\ldots,J$. There is no other information shared between subsequent splits.

\[ RSS = \sum^{J}_{j=1} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \]

Hyperparameters include,

number of regions – very easy to understand, you know what the model will be
minimum reduction in RSS – could stop early, e.g., a low reduction in RSS split could lead to a subsequent split with a larger reduction in RSS
minimum number of training data in each region – related to the concept of accuracy of the region mean prediction, i.e., we need at least 𝑛 data for a reliable mean
maximum number of levels – forces symmetric trees, similar number of splits to get to each region

Declustering#

Data Preparation: various methods that assign weights to spatial samples based on local sampling density, such that the weighted statistics are likely more representative of the population. Data weights are assigned so that,

samples in densely sampled areas receive less weight
samples in sparsely sampled areas receive more weight

There are various declustering methods:

cell-based declustering
polygonal declustering
kriging-based declustering

It is important to note that no declustering method can prove that for every data set the resulting weighted statistics will improve the prediction of the population parameters, but in expectation these methods tend to reduce the bias.

Declustering (statistics)#

Data Preparation: once declustering weights are calculated for a spatial dataset, then declustered statistics are applied as input for only subsequent analysis or modeling. For example,

the declustered mean is assigned as the stationary, global mean for simple kriging
the weighted CDF from all the data with weights are applied to sequential Gaussian simulation to ensure the back-transformed realizations approach the declustered distribution

Any statistic can be weighted, including the entire CDF! Here are some examples of weighted statistics, given declustering weights, $w(\bf{u}_j)$, for all data $j=1,\ldots,n$.

weighted sample mean,

\[ \overline{x}_{wt} = \frac{\sum_{i=1}^n w(\bf{u}_j) \cdot z(\bf{u}_j)}{\sum_{i=1}^n w(\bf{u}_j)} \]

where $n$ is the number of data.

weighted sample variance,

\[ s^2_{x_{wt}} = \frac{1}{\sum_{i=1}^n w(\bf{u}_j) - 1} \cdot \sum_{i=1}^n w(\bf{u}_j) \cdot \left( x(\bf{u}_j) - \overline{x}_{wt} \right)^2 \]

where $\overline{x}_{wt}$ is the declustered mean.

weighted covariance,

\[ C_{x,y_{wt}} = \frac{1}{\sum_{i=1}^n w(\bf{u}_j) } \cdot \sum_{i=1}^n w(\bf{u}_j) \cdot \left( x(\bf{u}_j) - \overline{x}_{wt} \right) \cdot \left( y(\bf{u}_j) - \overline{y}_{wt} \right) \]

where $\overline{x}_{wt}$ and $\overline{y}_{wt}$ are the declustered means for features $X$ and $Y$.

the entire CDF,

\[ F_z(z) \approx \sum_{j=1}^{n(Z<z)} w(\bf{u}_j) \]

where $n(Z<z)$ is the number of sorted ascending data less than threshold $z$. We show this as approximative as this is simplified and at data resolution and without an interpolation model.

It is important to note that no declustering method can prove that for every data set the resulting weighted statistics will improve the prediction of the population parameters, but in expectation these methods tend to reduce the bias.

Density-Connected (DBSCAN)#

Density-based Clustering: points $A$ and $B$ are density-connected if there is a point $Z$ that is density-reachable from both points $A$ and $B$.

Density-based Cluster (DBSCAN)#

Density-based Clustering: a nonempty set where all points are density-connected to each other.

Density-Reachable (DBSCAN)#

Density-based Clustering: point $Y$ is density reachable from $A$ if $Y$ belongs to a neighborhood of a core point that can reached from $A$. This would require a chain of core points each belonging the previous core points and the last core point including point Y.

Deterministic Model#

Machine Learning Concepts: a model that assumes the system or process that is completely predictable

often-based on engineering and geoscience physics and expert judgement
for example, numerical flow simulation or stratigraphic bounding surfaces interpreted from seismic
for this course we also state that data-driven estimation models like

Advantages:

integration of physics and expert knowledge
integration of various information sources

Disadvantages:

often quite time consuming
often no assessment of uncertainty, focus on building one model

Dimensionality Reduction#

Principal Component Analysis: methods to reduce the number of features within a data science workflow. There are 2 primary methods,

features Selection – find the subset of original features that are most important for the problem
feature projection – transform the data from a higher to lower dimensional space

Known as dimension reduction or dimensionality reduction

motivated by the curse of dimensionality and multicollinearity
applied in statistics, machine learning and information theory

Directly Density Reachable (DBSCAN)#

Density-based Clustering: point $X$ is directly density reachable from $A$, if $A$ is a core point and $X$ belongs to the neighborhood, distance $le \epsilon$ from $A$.

Discrete Feature#

Machine Learning Concepts: a categorical feature or a continuous feature that is binned or grouped, for example,

porosity between 0 and 20% assigned to 10 bins = {0 - 2%, 2% - 4%, \ldots ,20%}
Mohs hardness = $\{1, 2, \ldots, 10\}$ (same at categorical feature)

Distribution Transformations#

Feature Transformations: a mapping from one distribution to another distribution through percentile values, resulting in a new histogram, PDF, and CDF. We perform distribution transformations in geostatistical methods and workflows because,

inference - to correct a feature distribution to an expected shape, for example, correcting for too few or biased data
theory - a specific distribution assumption is required for a workflow step, for example, Gaussian distribution with mean of 0.0 and variance of 1.0 is required for sequential Gaussian simulation
data preparation or cleaning - to correct for outliers, the transformation will map the outlier into the target distribution no longer as an outlier

How do we perform distribution transformations?

We transform the values from the cumulative distribution function (CDF), $F_{X}$, to a new CDF , $G_{Y}$. This can be generalized with the quantile - quantile transformation applied to all the sample data:

The forward transform:

\[ Y = G_{Y}^{-1}(F_{X}(X)) \]

The reverse transform:

\[ X = F_{X}^{-1}(G_{Y}(Y)) \]

This may be applied to any data, including parametric or nonparametric distributions. We just need to be able to map from one distribution to another through percentiles, so it is a:

rank preserving transform, for example, P25 remains P25 after distribution transformation

Eager Learning#

k-Nearest Neighbours: Model is a generalization of the training data constructed prior to queries

the model is input-independent after parameter training and hyperparameter tuning, i.e., the training data does not need to be available to make new predictions

The opposite is lazy learning.

Estimation#

Machine Learning Concepts: is process of obtaining the single best value to represent a feature at an unsampled location, or time. Some additional concepts,

local accuracy takes precedence over global spatial variability
too smooth, not appropriate for any transform function that is sensitive to heterogeneity
for example, inverse distance and kriging
many predictive machine learning models focus on estimation (e.g., k-nearest neighbours, decision tree, random forest, etc.)

f1-score (classification accuracy metric)#

Naive Bayes: a categorical classification prediction model measure of accuracy, a single summary metric for each $k$ category from the confusion matrix.

the harmonic mean of recall and precision

\[ f1-score_k = \frac{2} { \frac{1}{Precision_k} + \frac{1}{Recall_k} } \]

As a reminder,

recall - the ratio of true positives divided by all cases of the category in the testing dataset
precision - the ratio of true positives divided by all positives, true positives + false positives

\[ Recall_k = \frac{ n_{k, \text{true positives}} }{n_k} \]

Feature (also variable)#

Machine Learning Concepts: any property measured or observed in a study

for example, porosity, permeability, mineral concentrations, saturations, contaminant concentration, etc.
in data mining / machine learning this is known as a feature, statisticians call these variables
measure often requires significant analysis, interpretation, etc.
when features are modified and combined to improve our models we call this feature engineering

Feature Engineering#

Feature Transformations: using domain expertise to extract improved predictor or response features from raw data,

improve the performance, accuracy and convergency, of inferential or predictive machine learning
improve model interpretability (or may worsen interpretability if our engineered features are in unfamiliar units)
mitigate outliers & bias, consistency with assumptions such as Gaussianity, linearization, dimensional expansion

Feature transformation and feature selection are two forms of feature engineering.

Feature Importance#

Feature Ranking: a variety of machine learning methods to provide measures for feature ranking, for example decision trees summarize the reduction in mean square error through inclusion of each feature and is summarized as,

\[ FI(x) = \sum_{t \in T_f} \frac{N_t}{N} \Delta_{MSE_t} \]

where $T_f$ are all nodes with feature $x$ as the split, $N_t$ is the number of training samples reaching node $t$, $N$ is the total number of samples in the dataset and $\Delta_{MSE_t}$ is the reduction in MSE with the $t$ split.

Note, feature importance can be calculated in a similar manner to MSE above for the case of classification trees with Gini Impurity.

Feature importance is part of model-based feature ranking,

the accuracy of the feature importance depends on the accuracy of the model, i.e., an inaccurate model will likely provide incorrect feature importance

Feature Imputation#

Feature Imputation: replacing null values in the data table, samples that do not have values for all features with plausible values for 2 reasons,

enable statistical calculations and models that require complete data tables, i.e., cannot work with missing feature values
maximize model accuracy, increasing the number of reliable samples available for training and testing the model
mitigate model bias that may occur with likewise deletion in feature values are not missing at random

Feature imputation methods include,

constant value imputation - replace null values with feature mean or mode
model-based imputation - replace null values with a prediction of the missing feature with available feature values for the same sample

There are also an iterative methods that depend on convergence,

Multiple Imputation by Chained Equations (MICE) - assign random values and then iterate over the missing values predicting new values

The goal of this method is to obtain reasonable imputed values that account for the relationships between all the features and all the available and missing values

Feature Projection#

Principal Component Analysis: a transforms original $m$ features to $p$ features, where $p << m$ for dimensionality reduction

given features, $𝑋_1,\ldots,𝑋_𝑚$ we would require $\binom{m}{2} = \frac{𝑚(𝑚−1)}{2}$ scatter plots to visualize just the two-dimensional scatter plots
these representations would not capture $> 2$ dimensional structures
once we have 4 or more variables understanding our data gets very difficult. Recall the curse of dimensionality.
principal component analysis, multidimensional scaling and random projection are examples
feature selection is an alternative method for dimensionality reduction

Feature Space#

Feature Ranking: commonly feature space only refers to the predictor features and does not include the response feature(s), i.e.,

all possible combinations of predictor features for which we need to make predictions
may be referred to as predictor feature space.

Typically, we train and test our machines’ predictions over the predictor feature space.

the space is typically a hypercuboid with each axis representing a predictor feature and extending from the minimum to maximum, over the range of each predictor feature
more complicated shapes of predictor feature space are possible, e.g., we could mask or remove subsets with poor data coverage.

Feature Ranking#

Feature Ranking: part of feature engineering, feature ranking is a set of methods that assign relative importance or value to each feature with respect to information contained for inference and importance in predicting a response feature.

There are a wide variety of possible methods to accomplish this. My recommendation is a wide-array approach with multiple metric, while understanding the assumptions and limitations of each method.

Here’s the general types of metrics that we will consider for feature ranking:

Visual Inspection - including data distributions, scatter plots and violin plots
Statistical Summaries - correlation analysis, mutual information
Model-based - including model parameters, feature importance scores and global Shapley values
Recursive feature elimination - and other methods that perform trail and error to find optimum parameters sets through withheld testing data cross validation

Feature ranking is primarily motivated by the curse of dimensionality, i.e., work with the fewest, most informative predictor features.

Feature Transformations#

Feature Transformations: a type of feature engineering involving mathematical operation applied to a feature to improve the value of the feature in a workflow. For example,

feature truncation
feature normalization or standardization
feature distribution transformation

There are many reasons that we may want to perform feature transformations.

the make the features consistent for visualization and comparison
to avoid bias or impose feature weighting for methods (e.g. k nearest neighbours regression) that rely on distances calculated in predictor feature space
the method requires the variables to have a specific range or distribution:
- artificial neural networks may require all features to range from [-1,1]
- partial correlation coefficients require a Gaussian distribution.
- statistical tests may require a specific distribution
- geostatistical sequential simulation requires an indicator or Gaussian transform

Feature transformations is a common basic building blocks in many machine learning workflows.

Fourth Paradigm#

Machine Learning Concepts: the data-driven paradigm for scientific discovery building from the,

First Paradigm - empirical science - experiments and observations
Second Paradigm - theoretical science - analytical expressions
Third Paradigm - computation science - numeric simulation

We augment with new scientific paradigms, we don’t replace older paradigms. Each of the previous paradigm are supported by the previous paradigms, for example,

theoretical science is build on empirical science
numerical simulations integrate analytical expressions and calibrated equations from experiment

Frequentist Probability#

Probability Concepts: measure of the likelihood that an event will occur based on frequencies observed from an experiment. For random experiments and well-defined settings (such as coin tosses),

\[ \text{Prob}(A) = P(A) = \lim_{n \to \infty} \frac{n(A)}{n} \]

where:

$n(A)$ = number of times event $A$ occurred $n$ = number of trails

For example, possibility of drilling a dry hole for the next well, encountering sandstone at a location ($\bf{u}_{\alpha}$), exceeding a rock porosity of $15 \%$ at a location ($\bf{u}_{\alpha}$).

Gaussian Anamorphosis#

Feature Transformations: a quantile transformation to a Gaussian distribution.

Mapping feature values through their cumulative probabilities.

\[ y = G_y^{-1}\left( F_x(x)\right) \]

where $𝐹_𝑥$ is the original feature cumulative distribution function (CDF) and $𝐺_𝑦$ is the Gaussian CDF probability density function

\[ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} exp \left[-1 \frac{1}{2} \left(\frac{x-\mu}{\sigma} \right)^2 \right] \]

shorthand for a normal distribution is

\[ N[\mu,\sigma^2] \]

for example $N[0,1]$ is standard normal

much of natural variation or measurement error is Gaussian
parameterized fully by mean, variance and correlation coefficient (if multivariate)
distribution is unbounded, no min nor max, extremes are very unlikely, some type of truncation is often applied

Warning, many workflows apply univariate Gaussian anamorphosis and then assume bivariate or multivariate Gaussian, this is not correct, but it is generally too difficult to transform our data to multivariate Gaussian.

Methods that require a Gaussian distribution,

Pearson product-moment correlation coefficients completely characterize multivariate relationships when data are multivariate Gaussian
partial correlations require bivariate Gaussian
sequential simulation (geostatistics) assumes Gaussian to reproduce the global distribution
Student’s t test for difference in means
Chi-square distributions is derived from sum of squares of Gaussian distributed random variables
Gaussian naive Bayes classification assumes Gaussian conditionals

Gibbs Sampler (MCMC)#

Bayesian Linear Regression: a set of algorithms to sample from a probability distribution such that the samples match the distribution statistics, based on,

sequentially sampling from conditional distributions

Since only the conditional probability density functions are required, the system is simplified as the full joint probability density function is not needed

Here’s the basic steps of the Gibbs MCMC Sampler for a bivariate case,

Assign random values for $𝑋(0)$, $𝑌(0)$
Sample from $𝑓(𝑋|𝑌(0))$ to get $𝑋(1)$
Sample from $𝑓(𝑌|𝑋(1))$ to get $𝑌(1)$
Repeat for the next steps for samples, $\ell = 1,\ldots,𝐿$

The resulting samples will have the correct joint distribution,

\[ 𝑓(𝑋,𝑌) \]

Gradient Boosting Models#

Gradient Boosting: a prediction model that results from posing a boosting model as gradient descent problem

At each step, $k$, a model is being fit, then the error is calculated, $h_k(X_1,\ldots,X_m)$.

We can assign a loss function,

\[ L\left(y,F(X)\right) = \frac{\left(y - F(X)\right)^2}{2} \]

So we want to minimize the $\ell2$ loss function:

\[ J = \sum_{i=1}^{n} L\left(y_i, F_k(X) \right) \]

by adjusting our model result over our training data $F(x_1), F(x_2),\ldots,F(x_n)$.

We can take the partial derivative of the error vs. our model,

\[ \frac{\partial J}{\partial F(x_i)} = F(x_i) - y_i \]

We can interpret the residuals as negative gradients.

\[ y_i - F(x_i) = -1 \frac{\partial J}{\partial F(x_i)} \]

So now we have a gradient descent problem:

\[ F_{k+1}(X_i) = F_k(X_i) + h(X_i) \]

\[ F_{k+1}(X_i) = F_k(X_i) + y_i - F_k(X_i) \]

\[ F_{k+1}(X_i) = F_k(X_i) - 1 \frac{\partial J}{\partial F_k(X_i)} \]

Of the general form:

\[ \phi_{k+1} = \phi_k - \rho \frac{\partial J}{\partial \phi_k} \]

where $phi_k$ is the current state, $\rho$ is the learning rate, $J$ is the loss function, and $\phi_{k+1}$ is the next state of our estimator.

The error residual at training data is the gradient, then we are performing gradient descent,

fitting a series of models to negative gradients

By approaching the problem as a gradient decent problem we are able to apply a variety of loss functions,

$\ell2$ is our $\frac{\left(y - F(X)\right)^2}{2}$ is practical, but is not robust with outliers

\[ - 1 \frac{\partial J}{\partial F_k(X_i)} = y_i - F_k(X_i) \]

$\ell1$ is our $|y - F(X)|$ is more robust with outliers

\[ - 1 \frac{\partial J}{\partial F_k(X_i)} = sign(y_i - F_k(X_i)) \]

there are others like Huber Loss

Graph Laplacian (spectral clustering)#

Spectral Clustering: a matrix representing a graph by integrating connections between graph nodes, samples, number of connections for each graph nodes, samples. Calculated as degree matrix minus adjacency matrix. Where,

degree matrix, $𝐷$ - degree of connection for each node
adjacency matrix, $𝐴$ - specific connections between nodes

Geostatistics#

Machine Learning Concepts: a branch of applied statistics that integrates:

the spatial (geological) context
the spatial relationship
volumetric support / scale
uncertainty

I include all spatial statistics with geostatistics, some disagree with me on this. From my experience, any useful statistical method for modeling spatial phenomenon is adopted and added to the geostatistics toolkit! Geostatistics is an expanding and evolving field of study.

Gradient-based Optimization#

LASSO Regression: a method to solve for model parameters by iteratively minimizing the loss function. The steps include,

start with random model parameters
calculate the loss function for the model parameters
calculate the loss function gradient, generally don’t have an equation for the loss function, sampling with numerical calculation of the local loss function derivative,

\[ \nabla L(y_{\alpha}, F(X_{\alpha}, b_1)) = \frac{L(y_{\alpha}, F(X_{\alpha}, b_1 - \epsilon)) - L(y_{\alpha}, F(X_{\alpha}, b_1 + \epsilon))}{2\epsilon} \]

update the parameter estimate by stepping down slope / gradient,

\[ \hat{b}_{1,t+1} = \hat{b}_{1,t} - r \nabla L(y_{\alpha}, F(X_{\alpha}, b_1)) \]

where $r$ is the learning rate/step size, $\hat{b}(1,𝑡)$, is the current model parameter estimate and $\hat{b}(1,𝑡+1)$ is the updated parameter estimate.

Some important comments about gradient-based optimization,

gradient search convergence - the method will find a local or global minimum
gradient search step size - impact of step size, $r$ too small, takes too long to converge to a solution and $r$ too large, the solution may skip over/miss a global minimum or diverge
multiple model parameters - calculate and decompose the gradient over multiple model parameters, with a vector representation.

\[ \nabla L(y_{\alpha}, F(X_{\alpha}, b_1, b_2)) = \left[ \begin{matrix} \nabla L(y_{\alpha}, F(X_{\alpha}, b_1)) & \nabla L(y_{\alpha}, F(X_{\alpha}, b_2)) \end{matrix} \right] \]

exploration of parameter space - optimization for training machine learning model parameters is exploration of a high dimensional model parameter space

Graph (spectral clustering)#

Spectral Clustering: a diagram that represents data in an organized manner, each sample as a node with vertices indicating pairwise relationships between samples.

for an undirected graph, vertices are bidirectional, i.e., the connection is symmetric, both ways with the same strength

Gridded Data#

Machine Learning Workflow Construction and Coding: generally exhaustive, regularly spaced data over 2D or 3D, representing maps and models

stored as a .csv comma delimited file, with $𝑛_𝑦$ rows and $𝑛_𝑥$ columns
may also be saved/loaded as also binary for a more compact, but not human readable file.
commonly visualized directly, for example, matplotlib’s imshow function, or as contour maps

Hard Data#

Machine Learning Concepts: data that has a high degree of certainty, usually from a direct measurement from the rock

for example, well core-based and well log-based porosity and lithofacies

In general, hard data has high resolution (small scale, volume support), but with poor coverage (measure only an extremely small proportion of the population, for example,

Core coverage deepwater oil and gas - well core only sample one five hundred millionth to one five billionth of a deepwater reservoir, assuming 3 inch diameter cores with 10% core coverage in vertical wells with 500 m to 1,500 m spacing
Core coverage mining grade control - diamond drill hole cores sample one eight thousandth to one thirty thousandth of ore body, assuming HQ 63.5 mm diameter cores with 100% core coverage in vertical drill holes with 5 m to 10 m spacing

Hermite Polynomials#

Polynomial Regression: a family of orthogonal polynomials on the real number line.

Order	Hermite Polynomial $H_e(x)$
0th Order	$H_{e_0}(x) = 1$
1st Order	$H_{e_1}(x) = x$
2nd Order	$H_{e_2}(x) = x^2 - 1$
3rd Order	$H_{e_3}(x) = x^3 - 3x$
4th Order	$H_{e_4}(x) = x^4 - 6x^2 + 3$

These polynomials are orthogonal with respect to a weighting function,

\[ 𝑤(𝑥)=𝑒^{−\frac{𝑥^2}{2}} \]

this is the standard Gaussian probability density function without the scaler, $\frac{1}{\sqrt{2\pi}}$. The definition of orthogonality is stated as,

\[ \int_{-\infty}^{\infty} H_m(x) H_n(x) w(x) \, dx = 0 \]

The Hermite polynomials are orthogonal over the interval $[−\infty,\infty]$ for the standard normal probability distribution.

By applying hermite polynomials instead of regular polynomials for polynomial basis expansion in polynomial regression were remove the multicollinearity between the predictor features,

recall, independence of the predictor features is an assumption of the linear system applied in polynomial regression with the polynomial basis expansion

Heuristic Algorithm#

Cluster Analysis: a shortcut solution to solve a difficult problem a compromise of optimality and accuracy for speed and practicality.

this general approach is common in machine learning, computer science and mathematical optimization, for example, the solution for k-mean clustering a $k^n$ solution space is practically solved with an heuristic algorithm.

Hierarchical Clustering#

Cluster Analysis: all cluster group assignments are determined iteratively, as opposed to an partitional clustering method that determine cluster groups all at once. Including,

agglomerative hierarchical clustering - start with $n$ clusters, each data sample in its own cluster, and then iteratively merges clusters into larger clusters
divisive hierarchical clustering - start with all data in one cluster, and then iteratively divide off new clusters
k-means clustering is partitional clustering, while the solution heuristic to find the solution is iterative, the solution is actually all at once
difficult to update, once a series of splits or mergers are made it is difficult to go back and modify the model

Histogram#

Univariate Analysis: a representation of the univariate statistical distribution with a plot of frequency over an exhaustive set of bins over the range of possible values. These are the steps to build a histogram,

Divide the continuous feature range of possible values into $K$ equal size bins, $\delta x$:

\[ \Delta x = \left( \frac{x_{max} - x_{min}}{K} \right) \]

or use available category labels for categorical features.

Count the number of samples (frequency) in each bin, $n_k$, \quad $\forall \quad k=1,\ldots,K$.
Plot the frequency vs. the bin label (use bin centroid if continuous)

Note, histograms are typically plotted as a bar chart.

Hybrid Model#

Machine Learning Concepts: system or process that includes a combination of both deterministic model and stochastic model

most geostatistical models are hybrid models
for example, additive deterministic trend models and stochastic residual models

Independence (probability)#

Probability Concepts: events $A$ and $B$ are independent if and only if the following relations are true,

$P(A \cap B) = P(A) \cdot P(B)$
$P(A|B) = P(A)$
$P(B|A) = P(B)$

If any of these are violated we suspect that there exists some form of relationship.

Indicator Transform (also Binary Transform)#

Feature Transformations: indicator coding a random variable to a probability relative to a category or a threshold.

If $i(\bf{u}:z_k)$ is an indicator for a categorical variable,

what is the probability of a realization equal to a category?

\[\begin{split} i(\bf{u}; z_k) = \begin{cases} 1, & \text{if } Z(\bf{u}) = z_k \\ 0, & \text{if } Z(\bf{u}) \ne z_k \end{cases} \end{split}\]

for example,

given threshold, $z_2 = 2$, and data at $\bf{u}_1$, $z(\bf{u}_1) = 2$, then $i(bf{u}_1; z_2) = 1$
given threshold, $z_1 = 1$, and a RV away from data, $Z(\bf{u}_2)$ then is calculated as $F^{-1}_{\bf{u}_2}(z_1)$ of the RV as $i(\bf{u}_2; z_1) = 0.23$

If $i(\bf{u}:z_k)$ is an indicator for a continuous variable,

what is the probability of a realization less than or equal to a threshold?

\[\begin{split} i(\bf{u}; z_k) = \begin{cases} 1, & \text{if } Z(\bf{u}) \le z_k \\ 0, & \text{if } Z(\bf{u}) > z_k \end{cases} \end{split}\]

for example,

given threshold, $z_1 = 6\%$, and data at $\bf{u}_1$, $z(\bf{u}_1) = 8\%$, then $i(\bf{u}_1; z_1) = 0$
given threshold, $z_4 = 18\%$, and a RV away from data, $Z(\bf{u}_2) = N\left[\mu = 16\%,\sigma = 3\%\right]$ then $i(\bf{u}_2; z_4) = 0.75$

The indicator coding may be applied over an entire random function by indicator transform of all the random variables at each location.

Indicator Variogram#

Feature Transformations: varogram’s calculated and modelled from the indicator transform of spatial data and used for indicator kriging. The indicator variogram is,

\[ \gamma_i(\mathbf{h}; z_k) = \frac{1}{2N(\mathbf{h})} \sum_{\alpha=1}^{N(\mathbf{h})} \left[ i(\mathbf{u}_\alpha; z_k) - i(\mathbf{u}_\alpha + \mathbf{h}; z_k) \right]^2 \]

where $i(\mathbf{u}_\alpha; z_k)$ and $i(\mathbf{u}_\alpha + \mathbf{h}; z_k)$ are the indicator transforms for the $z_k$ threshold at the tail location $\mathbf{u}_\alpha$ and head location $\mathbf{u}_\alpha + \mathbf{h}$ respectively.

for hard data the indicator transform $i(\bf{u},z_k)$ is either 0 or 1, in which case the $\left[ i(\mathbf{u}_\alpha; z_k) - i(\mathbf{u}_\alpha + \mathbf{h}; z_k) \right]^2$ is equal to 0 when the values at head and tail are both $\le z_k$ (for continuous features) or $= z_k$ (for categorical features), the same relative to the threshold, or 1 when they are different.
therefore, the indicator variogram is $\frac{1}{2}$ the proportion of pairs that change! The indicator variogram can be related to probability of change over a lag distance, $h$.
the sill of an indicator variogram is the indicator variance calculated as,

\[ \sigma_i^2 = p \cdot (1 - p) \]

where $p$ is the proportion of 1’s (or zeros as the function is symmetric over proportion)

Inference, Inferential Statistics#

Machine Learning Concepts: this is a big topic, but for the course I provide this simplified, functional definition, given a random sample from a population, describe the population, for example,

given the well samples, describe the reservoir
given the drill hole samples, describe the ore body

Inlier#

a regression model accuracy metric, the proportion of testing data within a margin, $\epsilon$, of the model predictions, $\hat{y}_i$,

\[ I_R = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} I(y_i, \hat{y}_i) \]

given the indicator transform,

\[\begin{split} I(y_i, \hat{y}_i) = \begin{cases} 1, & \text{if } |y_i - \hat{y}_i| \leq \epsilon \\ 0, & \text{otherwise} \end{cases} \end{split}\]

This is a useful, intuitive measure of accuracy, the proportion of training or testing data with predictions that are good enough.

but, there is a choice of the size of the margin, $\epsilon$, that could be related to the accuracy required for the specific application

Instance-based Learning#

k-Nearest Neighbours: also known as memory-based learning, compares new prediction problems (as set of predictors, $𝑥_1,\ldots,𝑥_𝑚$) with the cases observed in the training data.

model requires access to the training data, acting as a library of observations
prediction directly from the training data
prediction complexity grows with the number of training data, $𝑛$, number of neighbors, $𝑘$, and number of features, $𝑚$.
a specific case of lazy learning

Intersection of Events (probability)#

Probability Concepts: the intersection of outcomes, the probability of $A$ and $B$ is represented as,

\[ P(A \cap B) = P(A,B) \]

under the assumption of independence of $A$ and $B$ the probability of $A$ and $B$ is,

\[ P(A,B) = P(A) \cdot P(B) \]

Irreducible Error#

Machine Learning Concepts: is error due to data limitations, including missing features and missing samples, for example, the full predictor feature space is not adequately sampled

irreducible error is not impacted by model complexity, it is a limitation of the data
one of the three components of expected test square error, including model variance, model bias and irreducible error

\[ E \left[ \left(y_0 - \hat{f}(x_1^0, \ldots, x_m,^0 \right)^2 \right] = \left(E [\hat{f}(x_1^0, \ldots, x_m,^0)] - f(x_1^0, \ldots, x_m,^0) \right)^2 + \]

\[ E \left[ \left( \hat{f} \left(x_1^0, \ldots, x_m,^0 \right) - E \left[ \hat{f}(x_1^0, \ldots, x_m,^0) \right] \right)^2 \right] + \sigma_e^2 \]

where $\sigma_e^2$ is irreducible error.

Inertia (clustering)#

Cluster Analysis: the k-means clustering loss function summarizing the difference between samples within the same group over all the groups,

\[ I = \sum_{i=1}^{K} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 \]

where $K$ is the total number of clusters, $C_i$ represents the set of samples in the $i^{th}$ cluster, $x_j$ represents a data sample in cluster, $C_i$, $mu_i$ is the prototype of cluster $C_i$,$\| x_j - \mu_i \|^2$ is the squared Euclidean distance between sample $x_j$ and the cluster prototype $\mu_i $. The samples and prototypes and distance calculations in mD space, with $1,\ldots,m$ features.

by minimizing inertia k-means clusters minimizes difference within groups while maximizing difference between groups

Joint Probability#

Probability Concepts: probability that considers more than one event occurring together, the probability of $A$ and $B$ is represented as,

\[ P(A \cap B) = P(A,B) \]

or the probability of $A$, $B$ and $C$ is represented as,

\[ P(A \cap B \cap C) = P(A,B,C) \]

under the assumption of independence of $A$, $B$ and $C$ the joint probability may be calculated as,

\[ P(A,B,C) = P(A) \cdot P(B) \cdot P(C) \]

K Bins Discretization#

Feature Transformations: bin the range of the feature into K bins, then for each sample assignment of a value of 1 if the sample is within a bin and 0 if outsize the bin

binning strategies include uniform width bins (uniform) and uniform number of data in each bin (quantile)
also known as one hot encoding

Methods that require K bins discretization,

basis expansion to work in a higher dimensional space
discretization of continuous features to categorical features for categorical methods such as naive Bayes classifier
histogram construction and Chi-square test for difference in distributions
mutual information binning

K-fold Cross Validation#

Machine Learning Concepts: partitioning the data into K folds, and looping over the folds training the model with reminder of the data and testing the model with the data in the fold. Then aggregating the testing accuracy over all the folds.

the train and test data split is based on K, for example, K = 4, is 25% testing for each fold and K = 5, is 20% testing for each fold
this is an improvement over cross validation that only applies one train and test split to build a single model. The K-fold approach allows testing of all data and the aggregation of accuracy over all the folds tends to smooth the accuracy vs. hyperparameter plot for more reliable hyperparameter tuning
k-fold cross validation may be applied to check model performance for estimation accuracy (most common) and uncertainty model goodness (Maldonado-Cruz and Pyrcz, 2021)

k-Means Clustering#

Cluster Analysis: an unsupervised machine learning method for partitional clustering, group assignment to unlabeled data, where dissimilarity within clustered groups is mini minimized. The loss function that is minimized is,

\[ I = \sum^k_{i=1} \sum_{\alpha \in C_i} || X_{\alpha} - \mu_i || \]

where $i$ is the cluster index, $\alpha$ is the data sample index, $X$ is the data sample and $\mu_i$ is the $i$ cluster prototype, $k$ is the total number of clusters, and $|| X_m - \mu_m ||$ is the Euclidean distance from a sample to the cluster prototype in $M$ dimensional space calculated as,

\[ || X_{m,\alpha} - \mu_i || = \sqrt{ \sum_m^M \left( X_{m,\alpha} - \mu_{m,i} \right)^2 } \]

Here is a summary of import aspects for k-means clustering,

k - is given as a model hyperparameter
exhaustive and mutually exclusive groups - all data assigned to a single group
prototype method - represents the training data with number of synthetic cases in the features space. For K-means clustering we assign and iteratively update $K$ prototypes.
iterative solution - the initial prototypes are assigned randomly in the feature space, the labels for each training sample are updated to the nearest prototype, then the prototypes are adjusted to the centroid of their assigned training data, repeat until there is no further update to the training data assignments.
unsupervised learning - the training data are not labeled and are assigned $K$ labels based on their proximity to the prototypes in the feature space. The idea is that similar things, proximity in feature space, should belong to the same cluster group.
feature weighting - the procedure depends on the Euclidian distance between training samples and prototypes in feature space. Distance is treated as the ‘inverse’ of similarity. If the features have significantly different magnitudes, the feature(s) with the largest magnitudes and ranges will dominate the loss function and cluster groups will become anisotropic aligned orthogonal to the high range feature(s). While the common approach is to standardize / normalize the variables, by-feature weighting may be applied through unequal variances. Note, in this demonstration we normalize the features to range from 0.0 to 1.0.

k-Nearest Neighbours#

k-Nearest Neighbours: a simple, interpretable and flexible, nonparametric predictive machine learning model based on a local weighting window applied to $k$ nearest training data

The k-nearest neighbours approach is similar to a convolution approach for spatial interpolation. Convolution is the integral product of two functions, after one is reversed and shifted by $\Delta$.

one interpretation is smoothing a function with weighting function, $𝑓(\Delta)$, is applied to calculate the weighted average of function, $𝑔(x)$,

\[ (f * g)(x) = \int_{-\infty}^{\infty} f(\Delta) g(x - \Delta) \, d\Delta \]

this easily extends into multidimensional

\[ (f * g)(x, y, z) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(\Delta_x, \Delta_y, \Delta_z) g(x - \Delta_x, y - \Delta_y, z - \Delta_z) \, d\Delta_x \, d\Delta_y \, d\Delta_z \]

The choice of which function is shifted before integration does not change the result, the convolution operator has commutativity.

\[ (f * g)(x) = \int_{-\infty}^{\infty} f(\Delta) g(x - \Delta) \, d\Delta \]

\[ (f * g)(x) = \int_{-\infty}^{\infty} f(x - \Delta) g(\Delta) \, d\Delta \]

if either function is reflected then convolution is equivalent to cross-correlation, measure of similarity between 2 signals as a function of displacement.
for k-nearest neighbours the use of $k$ results in a locally adaptive window size, different from standard convolution

K-nearest neighbours is an instance-based, lazy learning method, the model training is postponed until prediction is required, no precalculation of the model. i.e., prediction requires access to the data.

to make new predictions that training data must be available

The hyperparameters include,

k number of nearest data to utilize for prediction
data weighting, for example uniform weighting with the local training data average, or inverse distance weighting

Note, for the case of inverse distance weighting, the method is analogous to inverse distance weighted interpolation with a maximum number of local data constraint commonly applied for spatial interpolation.

inverse distance is available in GeostatsPy for spatial mapping.

Too find the k-nearest data a distance metric is needed,

training data within the predictor feature space are ranked by distance (closest to farthest)
a variety of distance metrics may be applied, including:

Euclidian distance

\begin{equation}
d_i = \sqrt{\sum_{\alpha = 1}^{m} \left(x_{\alpha,i} - x_{\alpha,0}\right)^2} \end{equation}

Minkowski Distance - a general expression for distance with well-known Manhattan and Euclidean distances are special cases,

\[ d_{(i,i')} = \left( \sum_{j=1}^{m} \left( x_{(j,i)} - x_{(j,i')} \right)^p \right)^{\frac{1}{p}} \]

when $p=2$, this becomes the Euclidean distance
when $p=1$ it becomes the Manhattan distance

Kernel Trick (support vector machines)#

Support Vector Machines: we can incorporate our basis expansion in our method without ever needing to transform the training data to this higher dimensional space,

\[ h(x) \]

We only need the inner product over the predictor features,

\[ h(x) \left( h(x') \right)^T = \langle h(x), h(x') \rangle \]

Instead of the actual values in the transformed space, we just need the ‘similarity’ between all available training data in that transformed space!

we training our support vector machines with only a similarity matrix between training data that will be projected to the higher dimensional space
we never actually need to calculate the training data values in the higher dimensional space

Kriging#

Data Preparation: spatial estimation approach that relies on linear weights that account for spatial continuity, data closeness and redundancy. The kriging estimate is,

\[ z^*(\bf{u}) = \sum_{\alpha = 1}^{n} \lambda_{\alpha} \cdot z(\bf{u}_{\alpha}) + \left( 1.0 - \sum_{\alpha=1}^n \lambda_{\alpha} \right) \cdot m_z \]

the right term is the unbiasedness constraint, one minus the sum of the weights is applied to the global mean.

In the case where the trend, $t(\bf{u})$, is removed, we now have a residual, $y(\bf{u})$,

\[ y(\bf{u}) = z(\bf{u}) - t(\bf{u}) \]

the residual mean is zero so we can simplify our kriging estimate as,

\[ y^*(\bf{u}) = \sum_{\alpha = 1}^{n} \lambda_{\alpha} \cdot y(\bf{u}_{\alpha}) \]

The simple kriging weights are calculated by solving a linear system of equations,

\[ \sum_{j=1}^n \lambda_j C(\bf{u}_i,\bf{u}_j) = C(\bf{u},\bf{u}_i), \quad i=1,\ldots,n \]

that may be represented with matrix notation as,

\[\begin{split} \begin{bmatrix} C(\bf{u}_1,\bf{u}_1) & C(\bf{u}_1,\bf{u}_2) & \dots & C(\bf{u}_1,\bf{u}_n) \\ C(\bf{u}_2,\bf{u}_1) & C(\bf{u}_2,\bf{u}_2) & \dots & C(\bf{u}_2,\bf{u}_n) \\ \vdots & \vdots & \ddots & \vdots \\ C(\bf{u}_n,\bf{u}_1) & C(\bf{u}_n,\bf{u}_2) & \dots & C(\bf{u}_n,\bf{u}_n) \\ \end{bmatrix} \cdot \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \\ \end{bmatrix} = \begin{bmatrix} C(\bf{u}_1,\bf{u}) \\ C(\bf{u}_2,\bf{u}) \\ \vdots \\ C(\bf{u}_n,\bf{u}) \\ \end{bmatrix} \end{split}\]

This system may be derived by substituting the equation for kriging estimates into the equation for estimation variance, and then setting the partial derivative with respect to the weights to zero.

we are optimizing the weights to minimize the estimation variance

this system integrates the,

spatial continuity as quantified by the variogram (and covariance function to calculate the covariance, $C$, values)
redundancy the degree of spatial continuity between all of the available data with themselves, $C(\bf{u}_i,\bf{u}_j)$
closeness the degree of spatial continuity between the available data and the estimation location, $C(\bf{u}_i,\bf{u})$

Kriging provides a measure of estimation accuracy known as kriging variance (a specific case of estimation variance).

\[ \sigma^{2}_{E}(\bf{u}) = C(0) - \sum^{n}_{\alpha = 1} \lambda_{\alpha} C(\bf{u}_0 - \bf{u}_{\alpha}) \]

Kriging estimates are best in that they minimize the above estimation variance.

Properties of kriging estimates include,

Exact interpolator - kriging estimates with the data values at the data locations
Kriging variance - a measure of uncertainty in a kriging estimate. Can be calculated before getting the sample information, as the kriging estimation variance is not dependent on the values of the data nor the kriging estimate, i.e. the kriging estimator is homoscedastic.
Spatial context - kriging takes integrates spatial continuity, closeness and redundancy; therefore, kriging accounts for the configuration of the data and structural continuity of the feature being estimated.
Scale - kriging by default assumes the estimate and data are at the same point support, i.e., mathematically represented as points in space with zero volume. Kriging may be generalized to account for the support volume of the data and estimate,
Multivariate - kriging may be generalized to account for multiple secondary data in the spatial estimate with the cokriging system. We will cover this later.
Smoothing effect - of kriging can be forecasted as the missing variance. The missing variance over local estimates is the kriging variance.

Kriging-based Declustering#

Data Preparation: a declustering method to assign weights to spatial samples based on local sampling density, such that the weighted statistics are likely more representative of the population. Data weights are assigned so that,

samples in densely sampled areas receive less weight
samples in sparsely sampled areas receive more weight

Kriging-based declustering proceeds as follows:

calculate and model the experimental variogram
apply kriging to calculate estimates over a high-resolution grid covering the area of interest
calculate the sum of the weights assigned to each data
assign data weights proportional to this sum of weights

The weights are calculated as:

\[ w(\bf{u}_j) = n \cdot \frac{\sum_{iy}^{ny} \sum_{ix}^{nx} \lambda_j}{\sum_{i=1}^n \left[ \sum_{iy}^{ny} \sum_{ix}^{nx} \lambda_{j,ix,iy} \right]} \]

where $nx$ and $ny$ are the number of cells in the grid, $n$ is the number of data, and $\lambda_{j,ix,iy}$ is the weight assigned to the $j$ data at the $ix,iy$ grid cell.

Here is an important point for kriging-based declustering,

like polygonal declustering, kriging-based declustering is sensitive to the boundaries of the area of interest; therefore, the weights assigned to the data near the boundary of the area of interest may change radically as the area of interest is expanded or contracted

Also, kriging-based declustering integrates the spatial continuity model from variogram model. Consider the following possible impacts of the variogram model on the declustering weights,

if there is 100% relative nugget effect, there is no spatial continuity and therefore, all data receives equal weight. Note for the equation above this results in a divide by 0.0 error that must be checked for in the code.
geometric anisotropy may significantly impact the weights as data aligned over specific azimuths are assessed as closer or further in terms of covariance

Kolmogorov’s 3 Probability Axioms#

Probability Concepts: these are Kolmogorov’s 3 axioms for valid probabilities,

Probability of an event is a non-negative number.

\[ P(𝐴) \ge 0 \]

Probability of the entire sample space, all possible outcomes, $\Omega$, is one (unity), also known as probability closure.

\[ P(\Omega) = 1 \]

Additivity of mutually exclusive events for unions.

\[ P\left(⋃_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \]

e.g., probability of $A_1$ and $A_2$ mutual exclusive events is, $P(A_1 + A_2) = P(A_1) + P(A_2)$

$L^1$ Norm#

Linear Regression: known as Manhattan norm or sum of absolute residual (SAR),

\[ \sum_{i=1}^n |\Delta y_i | \]

also expressed as the mean absolute error (MAE),

\[ \frac{1}{n} \sum_{i=1}^n |\Delta y_i | \]

Minimization with $L^1$ norm is known as minimum absolute difference.

$L^2$ Norm#

Linear Regression: known as sum of square residual (SSR),

\[ \sum_{i=1}^n \sqrt{\Delta y_i} \]

also expressed as the mean square error (MSE),

\[ \frac{1}{n} \sum_{i=1}^n \left( \Delta y_i \right)^2 \]

and the Euclidian norm,

\[ \sqrt{ \sum_{i=1}^n \sqrt{\Delta y_i} } \]

Minimization with $L^2$ norm is known as the method of least squares.

$L^1$ vs. $L^2$ Norm#

LASSO Regression: the choice of $L^1$ and $L^2$ norm is important in machine learning. To explain this let’s compare the performance of $L^1$ and $L^2$ norms in loss functions while training model parameters.

Property	Least Absolute Deviations (L1)	Least Squares (L2)
Robustness*	Robust	Not very robust
Solution Stability	Unstable solution	Stable solution
Number of Solutions	Possibly multiple solutions	Always one solution
Feature Selection	Built-in feature selection	No feature selection
Output Sparsity	Sparse outputs	Non-sparse outputs
Analytical Solutions	No analytical solutions	Analytical solutions

Here’s some important points,

robust - resistant to outliers
unstable - for small changes in training the trained model predictions may jump
multiple solutions - different solution have similar or the same loss, resulting in solutions jumping with small changes to the training data
output sparsity and feature selection - model parameters tend to 0.0
analytical solutions - an analytical solution is available to solve for the optimum model parameters

$L^1$ or $L^2$ Normalizer#

Feature Transformations: is performed across features over individual samples to constrain the sum

The L1 Norm has the following constraint across samples,

\[ \sum_{\alpha = 1}^m x^{\prime}_{i,\alpha} = 1.0, \quad i = 1, \ldots, n \]

The L1 normalizer transform,

\[ x^{\prime}_{i,\alpha} = \frac{x_{i,\alpha}}{\sum_{\alpha=1}^m x_{i,\alpha}} \]

The L2 Norm has the following constraint across samples,

\[ \sum_{\alpha = 1}^m \left( x^{\prime}_{i,\alpha} \right)^2 = 1.0, \quad i = 1, \ldots, n \]

The L2 normalizer transform,

\[ x^{\prime}_{i,\alpha} = \sqrt{\frac{(x_{i,\alpha})^2}{\sum_{\alpha=1}^m (x_{i,\alpha})^2}} \]

Example, applied in text classification and clustering, and L1 for compositional data (sum 1.0 constraint)

LASSO Regression#

LASSO Regression: linear regression with $L^1$ regularization term and regularization hyperparameter $\lambda$,

\[ \sum_{i=1}^n \left(y_i - \left(\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha,i} + b_0 \right) \right)^2 + \lambda \sum_{j=1}^m |b_{\alpha}| \]

As a result, LASSO regression training integrates two and often competing goals to find the model parameters,

find the model parameters that minimize the error with training data
minimize the slope parameters towards zero

The only difference between LASSO and ridge regression is:

for LASSO the shrinkage term is posed as an $\ell_1$ penalty,

\[ \lambda \sum_{\alpha=1}^m |b_{\alpha}| \]

for ridge regression the shrinkage term is posed as an $\ell_2$ penalty,

\[ \lambda \sum_{\alpha=1}^m \left(b_{\alpha}\right)^2 \]

While both ridge regression and the LASSO shrink the model parameters ($b_{\alpha}, \alpha = 1,\ldots,m$) towards zero:

LASSO parameters reach zero at different rates for each predictor feature as the lambda, $\lambda$, hyperparameter increases.
as a result LASSO provides a method for feature ranking and selection!

The lambda, $\lambda$, hyperparameter controls the degree of fit of the model and may be related to the model bias-variance trade-off.

for $\lambda \rightarrow 0$ the prediction model approaches linear regression, there is lower model bias, but the model variance is higher
as $\lambda$ increases the model variance decreases and the model bias increases
for $\lambda \rightarrow \infty$ the coefficients all become 0.0 and the model is the training data response feature mean

Lazy Learning#

k-Nearest Neighbours: model is a generalization of the training data and calculation is delayed until query is made of the model

the model is the training data and selected hyperparameters, to make new predictions the training data must be available

The opposite is eager learning.

Learning Rate (gradient boosting)#

Gradient Boosting: controls the rate of updating with each new model.

\[ f_m = f_{m-1} - \rho_m \frac{\partial L(y_\alpha, F(X_\alpha))}{\partial F(X_\alpha)} \]

where $\rho_m$ is the learning rate, \frac{\partial L(y_\alpha, F(X_\alpha))}{\partial F(X_\alpha)} is the gradient, error, $f_{m-1}$ is the previous estimate, and $f_m$ is the new estimate.

Some salient points about learning rate,

without learning rate, the boosting models learn too quickly and will have too high model variance
slow down learning for a more robust model, balanced to ensure good performance, too small rate will require very large number of models to reach convergence

Likewise Deletion (MRMR)#

Feature Ranking: removal of any sample with any missing feature values

if missing feature values are not missing at random (MAR) this may impart a bias in the data
will result in a decrease in the effective data size and increase in model uncertainty

Linear Regression#

Linear Regression: a linear, parametric prediction model,

\[ y = \sum_{\alpha = 1}^m b_{\alpha} x_{\alpha} + b_0 \]

The analytical solution for the model parameters, $b_1,\ldots,b_m,b_0$, is available for the L2 norm loss function, the errors are summed and squared known a least squares.

we minimize the error, residual sum of squares (RSS) over the training data:

\[ RSS = \sum_{i=1}^n \left(y_i - (\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha,i} + b_0) \right)^2 \]

where $y_i$ is the actual response feature values and $\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha} + b_0$ are the model predictions, over the $\alpha = 1,\ldots,n$ training data.

this may be simplified as the sum of square error over the training data,

\[ \sum_{i=1}^n (\Delta y_i)^2 \]

where $\Delta y_i$ is actual response feature observation $y_i$ minus the model prediction $\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha} + b_0$, over the $i = 1,\ldots,n$ training data.

There are important assumption with our linear regression model,

Error-free - predictor variables are error free, not random variables
Linearity - response is linear combination of feature(s)
Constant Variance - error in response is constant over predictor(s) value
Independence of Error - error in response are uncorrelated with each other
No multicollinearity - none of the features are redundant with other features

Location Map#

Loading and Plotting Data and Models: a data plot where the 2 axes are locations, e.g., $X$ and $Y$, Easting and Northing, Latitude and Longitude, etc., to show the locations and magnitudes of the spatial data.

often the data points are colored to represent the scale of feature to visualize the sampled feature over the area or volume of interest
advantage, visualize the data without any model that may bias our impression of the data
disadvantage, may be difficult to visualize large datasets and data in 3D

Loss Function#

LASSO Regression: the equation that is minimized to train the model parameters. For example, the loss function for linear regression includes residual sum of square, the $L^2$ error norm,

\[ \sum_{i=1}^n \left(y_i - (\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha,i} + b_0) \right)^2 \]

for LASSO regression the loss function includes residual sum of square, the $L^2$ error norm, plus a $L^1$ regularization term,

\[ \sum_{i=1}^n \left(y_i - \left(\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha,i} + b_0 \right) \right)^2 + \lambda \sum_{j=1}^m |b_{\alpha}| \]

for k-means clustering the loss function is,

\[ I = \sum^k_{i=1} \sum_{\alpha \in C_i} \sqrt{ \sum_{j = 1}^m X_{\alpha,m} - \mu_{i,m} } \]

The method to minimize loss functions depends on the type of norm,

with $L^2$ norms we apply differentiation to the loss function with respect to the model parameter and set it equal to zero
with $L^1$ norms in our loss functions we lose access to an analytical solution and use iterative optimization, e.g., steepest descent

Machine Learning Workflow Design#

Machine Learning Workflow Construction and Coding: is based on the following steps,

Specify the Goals - for example,

build a numerical model
evaluate different recovery processes

Specify the Data - what is available and what is missing?
Design a Set of Steps to Accomplish the Goal - common steps include,

load data
format, check and clean data
run operation, including, statistical calculation, model or visualization
transfer function

Develop Documentation - including implementation details, defense of decisions, metadata, limitations and future work
Flow - data and information flow, learning while modeling with branches and loop backs
Uncertainty - summarize all uncertainty sources, include methods to integrate uncertainty, defend the uncertainty models and aspects deemed certain

Margin (support vector machines)#

Support Vector Machines: when the training data include overlapping categories it would not be possible, nor desirable, to develop a decision boundary that perfectly separates the categories for which this condition would hold,

\[ y_i \left( x_i^T \beta + \beta_0 \right) \geq 0 \]

We need a model that allows for some misclassification.

\[ y_i \left( x_i^T \beta + \beta_0 \right) \geq M - \xi_i \]

We introduce the concept of a margin, $𝑀$, and a distance from the margin (error, 𝜉_𝑖).

\[ \underset{\beta, \beta_0}{\text{min}} \left( \frac{1}{2M^2} + C \sum_{i=1}^N \xi_i \right) \]

The loss function includes the margin term, $M$, and hence attempts to minimize margin while minimizing classification error weighted by hyperparameter, $C$.

Marginal Probability#

Probability Concepts: probability that considers only one event occurring, the probability of $A$,

\[ P(A) \]

marginal probabilities may be calculated from joint probabilities through the process of marginalization,

\[ P(A) = \int_{-\infty}^{\infty} P(A,B) dB \]

where we integrate over all cases of the other event, $B$, to remove its influence. Given discrete possible cases of event $B$ we can simply sum the probabilities over all possible cases of $B$,

\[ P(A) = \sum_{i=1}^{k_B} P(A,B) dB \]

Matrix Scatter Plots#

Multivariate Analysis: composite plot including the combinatorial of all pair-wise scatter plots for all features.

given $m$ features, there are $m \times m$ scatter plots
the scatter plots are ordered, y-axis feature from $X_1,\ldots,X_m$ over the rows and x-axis feature from $X_1,\ldots,X_m$ over the columns
the diagonal is the features plotted with themselves and are often replaced with feature histograms or probability density functions

We use matrix scatter plots to,

look for bivariate linear or nonlinear structures
look for bivariate homoscedasticity (constant conditional variance) and heteroscedasticity (conditional variance changes with value)
look for bivariate constraints, such as sum constraints with compositional data

Remember, the other features are marginalized, this is not a full m-D visualization.

Maximum Relevance Minimum Redundancy (MRMR)#

Feature Ranking: a mutual information-based approach for feature ranking that accounts for feature relevance and redundancy.

one example is a relevance minus redundancy summary,

\[ MRMR = max \left[ frac{1}{|S|} \sum_{X_i \in S} I(X_i,Y) - \frac{1}{|S|^2} \sum_{X_i \in S} \sum_{X_j, i \ne j} I(X_i,X_j) \right] \]

where $𝑆$ is the predictor feature subset and $|𝑆|$ is the number of features in the subset $𝑆$.

Metropolis-Hastings MCMC Sampler#

Bayesian Linear Regression: The basic steps of the Metropolis-Hastings MCMC Sampler:

For $\ell = 1, \ldots, L$:

Assign random values for the initial sample of model parameters, $\beta(\ell = 1) = b_1(\ell = 1)$, $b_0(\ell = 1)$ and $\sigma^2(\ell = 1)$.
Propose new model parameters based on a proposal function, $\beta^{\prime} = b_1$, $b_0$ and $\sigma^2$.
Calculate probability of acceptance of the new proposal, as the ratio of the posterior probability of the new model parameters given the data to the previous model parameters given the data multiplied by the probability of the old step given the new step divided by the probability of the new step given the old.

\[ P(\beta \rightarrow \beta^{\prime}) = min\left(\frac{P(\beta^{\prime}|y,X) }{ P(\beta | y,X)} \cdot \frac{P(\beta^{\prime}|\beta) }{ P(\beta | \beta^{\prime})},1\right) \]

Apply Monte Carlo simulation to conditionally accept the proposal, if accepted, $\ell = \ell + 1$, and sample $\beta(\ell) = \beta^{\prime}$
Go to step 2.

Minkowski Distance#

k-Nearest Neighbours: a general expression for distance with well-known Manhattan and Euclidean distances are special cases,

\[ d_{(i,i')} = \left( \sum_{j=1}^{m} \left( x_{(j,i)} - x_{(j,i')} \right)^p \right)^{\frac{1}{p}} \]

when $p=2$, this becomes the Euclidean distance
when $p=1$ it becomes the Manhattan distance

Missing Feature Values#

Feature Imputation: null values in the data table, samples that do not have values for all features

There are many causes of missing feature values, for example,

sampling cost, e.g., low permeability test takes too long
rock rheology sample filter, e.g., can’t recover the mudstone samples
sampling to reduce uncertainty and maximize profitability instead of statistical representativity, dual purpose samples for information and production

Missing data consequences, more than reducing the amount of training and testing data, missing data, if not completely at random may result in,

biased sample statistics resulting in biased model training and testing
biased models with biased predictions with potentially no indication of the bias

Missing at Random (MAR)#

Feature Imputation: missing feature values are distributed randomly, uniform coverage over the predictor feature space, i.e., all values have likelihood to be missing, and no correlation between missing feature values.

This is typically not the case as missing data often has a confounding feature, for example,

sampling cost, e.g., low permeability test takes too long
rock rheology sample filter, e.g., can’t recover the mudstone samples
sampling to reduce uncertainty and maximize profitability instead of statistical representativity, dual purpose samples for information and production

Missing data consequences, more than reducing the amount of training and testing data, missing data, if not completely at random may result in,

biased sample statistics resulting in biased model training and testing
biased models with biased predictions with potentially no indication of the bias

Model Bias#

Machine Learning Concepts: is error due to insufficient complexity and flexibility to fit the natural setting

increasing model complexity usually results in decreasing model bias
model bias variance trade-off - as complexity increases, model variance increases and model bias decreases
one of the three components of expected test square error, including model variance, model bias and irreducible error

\[ E \left[ \left(y_0 - \hat{f}(x_1^0, \ldots, x_m,^0 \right)^2 \right] = \left(E [\hat{f}(x_1^0, \ldots, x_m,^0)] - f(x_1^0, \ldots, x_m,^0) \right)^2 + \]

\[ E \left[ \left( \hat{f} \left(x_1^0, \ldots, x_m,^0 \right) - E \left[ \hat{f}(x_1^0, \ldots, x_m,^0) \right] \right)^2 \right] + \sigma_e^2 \]

where $\left(E [\hat{f}(x_1^0, \ldots, x_m,^0)] - f(x_1^0, \ldots, x_m,^0) \right)^2$ is model bias.

Model-Bias Variance Trade-off#

Machine Learning Concepts: as complexity increases, model variance increases and model bias decreases.

as model variance and model bias are both components of expected test square error, the balancing of model bias and model variance results in an optimum level of complexity to minimize the testing error

Model Checking#

Machine Learning Concepts: is a critical last step for any spatial modeling workflow. Here are the critical aspects of model checking,

Model Inputs - data and statistics integration

check the model to ensure the model inputs are honored in the models, generally checked over all the realizations, for example, the output histograms and matches the input histogram over the realizations

Accurate Spatial Estimates - ability of the model to accurately predict away from the available sample data, over a variety of configurations, with accuracy

by cross validation, withholding some of the data, check the model’s ability to predict
generally, summarized with a truth vs. predicted cross plot and measures such as mean square error

\[ MSE = \frac{1}{n} \sum_{\alpha = 1}^{n} \left(z^{*}(\bf{u}_{\alpha}) - z(\bf{u}_{\alpha}) \right)^2 \]

Accurate and Precise Uncertainty Models - uncertainty model is fair given the amount of information available and various sources of uncertainty

also checked through cross validation, withholding some of the data, but by checking the proportion of the data in specific probability intervals
summarized with a proportion of withheld data in interval vs. the probability interval
points on the 45 degree line indicate accurate and precise uncertainty model
points above the 45 degree line indicate accurate and imprecise uncertainty model, uncertainty is too wide
points below the 45 degree line indicate inaccurate uncertainty model, uncertainty is too narrow or model is biased

Model Complexity or Flexibility#

Machine Learning Concepts: the ability of a model to fit to data and to be interpreted.

A variety of concepts may be used to describe model complexity,

the number of features, predictor variables are in the model, dimensionality of the model, usually resulting in more model parameters
the number of parameters, the order applied for each term, e.g. linear, quadratic, thresholds
the format of the model, i.e., a compact equation with polynomial regression vs. nested conditional statements with decision tree vs. thousands of weights and bias model parameters for a neural network
For example, more complexity with a high order polynomial, larger decision trees etc.

In general, more complicated or flexible models are more difficult to interpret,

linear regression and the associated model parameters can be analyzed and even applied for feature ranking, while support vector machines with radial basis functions are a linear model in the nD high dimensional space

Model Generalization#

Machine Learning Concepts: the ability of a model to predict away from training data.

the model learns the structure in the data and does not just memorize the training data

Models that do not generalize well,

overfit models have high accuracy at training data and low accuracy away from training data, demonstrated with low testing accuracy
underfit models are too simple or inflexible for the natural phenomenon and have low training and testing accuracy

Model Hyperparameters#

Machine Learning Concepts: constrain the model complexity. Hyperparameters are tuned to maximize accuracy with the withheld testing data to prevent model overfit.

For a set of polynomial models from $4^{th}$ to $1^{st}$ order,

\[ y = b_4 \cdot x^4 + b_3 \cdot x^3 + b_2 \cdot x^2 + b_1 \cdot x + b_0 \]

\[ y = b_3 \cdot x^3 + b_2 \cdot x^2 + b_1 \cdot x + b_0 \]

\[ y = b_2 \cdot x^2 + b_1 \cdot x + b_0 \]

\[ y = b_1 \cdot x + b_0 \]

the choice of polynomial order is the hyperparameter, i.e., the first order model is most simple and the fourth order model is most complicated.

Model Parameters#

Machine Learning Concepts: trainable coefficients for a machine learning model that control the fit to the training data.

For a polynomial model,

\[ y = b_3 \cdot x^3 + b_2 \cdot x^2 + b_1 \cdot x + b_0 \]

$b_3$, $b_2$, $b_1$, and $b_0$ are model parameters.

training model parameters - model parameters are calculated by optimization to minimize error and regularization terms over the training data through analytical solution or iterative solution, e.g., gradient descent optimization

Model Regularization#

Ridge Regression: adding information to prevent overfit (or underfit), improve model generalization.

this information is known as a regularization term
this represents a penalty for complexity that is tuned with a regularization hyperparameter

Consider the ridge regression loss function,

\[ \sum_{i=1}^n \left(y_i - \left(\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha,i} + b_0 \right) \right)^2 + \lambda \sum_{j=1}^m b_{\alpha}^2 \]

where $\lambda \sum_{j=1}^m b_{\alpha}^2$ is the regularization term and $\lambda$ is the regularization hyperparameter.

The concept of regularization is quite general and choices in machine learning architecture, such as,

use of receptive fields for convolutional neural networks (CNNs)
the choice to limit decision trees to a maximum number of levels.

There are a couple of useful perspectives on model regularization,

Occam’s razor - regularization tunes model complexity to the simplest effective solution
Bayesian perspective - regularization is imposing a prior on the solution.

Model Variance#

Machine Learning Concepts: is error due to sensitivity to the dataset

increasing model complexity usually results in increasing model variance
ensemble machine learning, for example, model bagging reduce model variance by averaging over multiple estimators trained on bootstrap realizations of the dataset
model bias variance trade-off - as complexity increases, model variance increases and model bias decreases
one of the three components of expected test square error, including model variance, model bias and irreducible error

\[ E \left[ \left(y_0 - \hat{f}(x_1^0, \ldots, x_m,^0 \right)^2 \right] = \left(E [\hat{f}(x_1^0, \ldots, x_m,^0)] - f(x_1^0, \ldots, x_m,^0) \right)^2 + \]

\[ E \left[ \left( \hat{f} \left(x_1^0, \ldots, x_m,^0 \right) - E \left[ \hat{f}(x_1^0, \ldots, x_m,^0) \right] \right)^2 \right] + \sigma_e^2 \]

where $E \left[ \left( \hat{f} \left(x_1^0, \ldots, x_m,^0 \right) - E \left[ \hat{f}(x_1^0, \ldots, x_m,^0) \right] \right)^2 \right]$ is model variance.

Momentum (optimization)#

LASSO Regression: update the previous step with the new step, momentum, $\lambda$, is the weight applied to the previous step while $1 - \lambda$ is the weight applied to the current step,

\[ \left( \left( r \cdot \nabla L \right)_{t-1} \right)^m = \lambda \cdot r \cdot \nabla L_{t-2} + (1 - \lambda) \cdot r \cdot \nabla L_{t-1} \]

the gradients calculated from the partial derivatives of the loss function for each model parameter have noise. Momentum smooths out, reduces the impact of this noise.
momentum helps the solution proceed down the general slope of the loss function, rather than oscillating in local ravines or dimples

Markov Chain Monte Carlo (MCMC)#

Bayesian Linear Regression: a set of algorithms to sample from a probability distribution such that the samples match the distribution statistics.

Markov - screening assumption, the next sample is only dependent on the previous sample
Chain - the samples form a sequence often demonstrating a transition from burn-in chain with inaccurate statistics and equilibrium chain with accurate statistics
Monte Carlo - use of Monte Carlo simulation, random sampling from a statistical distribution

Why is this useful?

we often don’t have the target distribution, it is unknown
but we can sample with the correct frequencies with other form of information such as conditional probability density functions, Gibbs sampler, or the likelihood ratios of the candidate next sample and the current sample, Metropolis-Hastings

Metropolis-Hastings Sampling (MCMC)#

Bayesian Linear Regression: a set of algorithms to sample from a probability distribution such that the samples match the distribution statistics, based on,

the likelihood ratios of the candidate next sample and the current sample
a rejection sampler based on this likelihood ratio

Since only the ratio of likelihood is required, the system is simplified as the evidence term cancels out from the Bayesian probability

Here’s the basic steps of the Metropolis-Hastings MCMC Sampler:

For $\ell = 1, \ldots, L$:

Assign random values for the initial sample of model parameters, $\beta(\ell = 1) = b_1(\ell = 1)$, $b_0(\ell = 1)$ and $\sigma^2(\ell = 1)$.
Propose new model parameters based on a proposal function, $\beta^{\prime} = b_1$, $b_0$ and $\sigma^2$.
Calculate probability of acceptance of the new proposal, as the ratio of the posterior probability of the new model parameters given the data to the previous model parameters given the data multiplied by the probability of the old step given the new step divided by the probability of the new step given the old.

\[ P(\beta \rightarrow \beta^{\prime}) = min\left(\frac{P(\beta^{\prime}|y,X) }{ P(\beta | y,X)} \cdot \frac{P(\beta^{\prime}|\beta) }{ P(\beta | \beta^{\prime})},1\right) \]

Apply Monte Carlo simulation to conditionally accept the proposal, if accepted, $\ell = \ell + 1$, and sample $\beta(\ell) = \beta^{\prime}$
Go to step 2.

Monte Carlo Simulation (MCS)#

Bayesian Linear Regression: a random sample from a statistical distribution, random variable. The steps for MCS are:

model the feature cumulative distribution function, $F_x(x)$
draw random value from a uniform [0,1] distribution, this is a random cumulative probability value, known as a p-value, $p^{\ell}$
apply the inverse of the cumulative distribution function to calculate the associated realization

\[ x^{\ell} = F_x^{-1} (p^{\ell}) \]

repeat to calculate enough realizations for the subsequent analysis

Monte Carlo simulation is the basic building block of stochastic simulation workflows, for example,

Monte Carlo simulation workflows - apply Monte Carlo simulation many over all features to the transfer function to calculate a realization of the decision criteria, repeated for many realizations, to propagate uncertainty through a transfer function
Bootstrap - applies Monte Carlo simulation to acquire realizations of the data to calculate uncertainty in sample statistics or ensembles of prediction models for ensemble-based machine learning
Monte Carlo methods - applies Monte Carlo simulation to speed up an expensive calculation with a limited random sample that converges on the solution as the number of random samples increases

Monte Carlo Simulation Workflow#

Bayesian Linear Regression: a convenient stochastic workflow for propagating uncertainty through a transfer function through sampling with Monte Carlo Simulation (MCS). The workflow includes the following steps,

Model all the input features’ distributions, cumulative distribution functions,

\[ F_{x_1}(x_1), \quad F_{x_2}(x_2), \quad \dots \quad , F_{x_m}(x_m) \]

Monte Carlo simulate a realizations for all the inputs,

\[ x_1^{\ell}, \quad x_2^{\ell}, \quad \ldots \quad , x_m^{\ell} \]

Apply to the transfer function to get a realization of the transfer function output, often the decision criteria

\[ y^{\ell} = f \left(x_1^{\ell},x_2^{\ell}, \quad \ldots \quad, x_m^{\ell} \right) \]

Repeat steps 1-3 to calculate enough realizations to model the transfer function output distribution.

\[ F_y(y) \]

Multiplication Rule (probability)#

Probability Concepts: we can calculate the joint probability of $A$ and $B$ as the product of the conditional probability of $B$ given $A$ with the marginal probability of $A$,

\[ P(A \cup B) = P(A,B) = P(B|A) \cdot P(A) \]

The multiplication rule is derived as a simple manipulation of the definition of conditional probability, in this case,

\[ P(B|A) = \frac{P(A,B)}{P(A)} \]

Mutual Information#

Feature Ranking: a generalized approach that quantifies the mutual dependence between two features.

quantifies the amount of information gained from observing one feature about the other
avoids any assumption about the form of the relationship (e.g. no assumption of linear relationship)

units are Shannons or bits
compares the joint probabilities to the product of the marginal probabilities
- summarizes the difference between the joint $P(x,y)$ and the product of the marginals $P(x)\cdot P(y)$, integrated over all $x \in 𝑋$ and $y \in Y$,

For discrete or binned continuous features $X$ and $Y$, mutual information is calculated as:

\[ I(X;Y) = \sum_{y \in Y} \sum_{x \in X}P_{X,Y}(x,y) log \left( \frac{P_{X,Y}(x,y)}{P_X(x) \cdot P_Y(y)} \right) \]

recall, given independence between $X$ and $Y$:

\[ P_{X,Y}(x,y) = P_X(x) \cdot P_Y(y) \]

therefore if the two features are independent then the $log \left( \frac{P_{X,Y}(x,y)}{P_X(x) \cdot P_Y(y)} \right) = 0$

The joint probability $P_{X,Y}(x,y)$ is a weighting term on the sum and enforces closure.

parts of the joint distribution with greater density have greater impact on the mutual information metric

For continuous (and nonbinned) features we can applied the integral form.

\[ I(X;Y) = \int_{Y} \int_{X}P_{X,Y}(x,y) log \left( \frac{P_{X,Y}(x,y)}{P_X(x) \cdot P_Y(y)} \right) dx dy \]

Mutually Exclusive Events (probability)#

Probability Concepts: the events do not intersect, i.e., do not have any common outcomes. We represent this as,

using set notation, we state events $A$ and $B$ are mutually exclusive as,

\[ A \cap B = \{x: x \in A \text{ and } x \in B \} = \emptyset \]

and the probability for mutually exclusive as,

\[ P(A,B) = 0.0 \]

Multidimensional Scaling#

Multidimensional Scaling: a method in inferential statistics / information visualization for exploring / visualizing the similarity (conversely the difference) between individual samples from a high dimensional dataset in a low dimensional space.

Multidimensional scaling (MDS) projects the $m$ dimensional data to $p$ dimensions such that $p << m$.

while attempting to preserve the pairwise dissimilarity between the data samples
ideally we are able to project to $p=2$ to easily explore the relationships between the samples

While principal component analysis (PCA) operates with the covariance matrix, multidimensional scaling operates with the distance or dissimilarity matrix. For multidimensional scaling,

you don’t need to know the actual feature values, just the distance or dissimilarity between the samples
as with any distance in feature space, we consider feature standardization to ensure that features with larger variance do not dominate the calculation
we may work with a variety of dissimilarity measures

Comparison between multidimensional scaling and principal component analysis,

principal component analysis takes the covariance matrix ($m \times m$) between all the features and finds the linear, orthogonal rotation such that the variance is maximized over the ordered principle components
multidimensional scaling takes the matrix of the pairwise distances ($n \times n$) between all the samples in feature space and finds the nonlinear projection such that the error in the pairwise distances is minimized

Some have suggest that visualizing data or models in a multidimensional scaling space is visualizing the space of uncertainty.

Naive Bayes#

Naive Bayes: the application of the assumption of conditional independence to simplify the classification prediction problem from the perspective of Bayesian updating, based on the conditional probability of a category, $k$, given $n$ features, $x_1, \dots , x_n$,

\[ P(x_1 | x_2, \dots , x_n, C_k) P(x_2 | x_3, \dots , x_n, C_k) P(x_3 | x_4, \dots , x_n, C_k) \ldots P(x_{n-1} | x_n, C_k) (x_{n} | C_k) P(C_k) \]

The likelihood, conditional probability with the joint conditional is difficult, likely impossible to calculate. It requires information about the joint relationship between $x_1, \dots , x_n$ features. As $n$ increases this requires a lot of data to inform the joint distribution.

With the naive Bayes approach we make the ‘naive’ assumption that the features are all conditionally independent*. This entails,

\[ P(x_i | x_{i+1}, \ldots , x_n, C_k) = P(x_i | C_k) \]

for all $i = 1, \ldots, n$ features.

We can now solve for the needed conditional probability as:

\[ P(C_k | x_1, \dots , x_n) = \frac{P(C_k) \prod_{i=1}^{n} P(x_i | C_k)}{P(x_1, \dots , x_n)} \]

We only need the prior, $P(C_k)$, and a set of conditionals, $P(x_i | C_k)$, for all predictor features, $i = 1,\ldots,n$ and all categories, $k = 1,\ldots,K$.

The evidence term, $P(x_1, \dots , x_n)$, is only based on the features $x_1, \dots , x_n$; therefore, is a constant over the categories $k = 1,\ldots,n$.

it ensures closure - probabilities over all categories sum to one
we simply standardize the numerators to sum to one over the categories

The naive Bayes approach is:

simple to understand, builds on fundamental Bayesian statistics
practical even with small datasets since with the conditional independence we only need to estimate simple conditional distributions

ndarray#

Machine Learning Workflow Construction and Coding: Numpy’s convenient class for working with grids, exhaustive, regularly spaced data over 2D or 3D, representing maps and models, due to,

convenient data structure to store, access, manipulate gridded data
built in methods to load from a variety of file types, Python classes
built in methods to calculate multidimensional summary statistics
built in methods for data queries, filters
built in methods for data manipulation, cleaning, reformatting
built in attributes to store information about the nD array, for example, size and shape

Nonparametric Model#

Machine Learning Concepts: a model that makes no assumption about the functional form, shape of the natural setting.

learns the shape from the training data, more flexibility to fit a variety of shapes for natural systems
less risk that the model is a poor fit for the natural settings than with parametric models

Typically need a lot more data for an accurate estimate of nonparametric models,

nonparametric often have many trainable parameters, i.e., nonparametric models are actually parametric rich!

Norm#

Linear Regression: norm of a vector maps vector values to a summary measure $[𝟎,\infty)$, that indicates size or length.

To train our models to training data, we require a single summary measure of mismatch with the training data, training error. The error is observed at each training data location,

\[ \Delta y_i = y_i - \hat{y}_i, \quad \forall \quad i = 1,\ldots,n \]

as an error vector. We need a single value to summarize over all training data, that we can minimize!

Normalization#

Feature Transformations: a distribution rescaling that can be thought of as shifting, and stretching or squeezing of a univariate distribution (e.g., histogram) to a minimum of 0.0 and a maximum of 1.0.

this is a shift and stretch / squeeze of the original property distribution assumes no shape change, rank preserving

\[ y_i = \frac{x_i - min(x)}{max(x) - min(x)}, \quad \forall \quad i, \ldots, n \]

Methods that require standardization and min/max normalization:

k-means clustering, k-nearest neighbour regression
$\beta$ coefficient’s for feature ranking
artificial neural networks forward transform of predictor features and back transform of response features to improve activation function sensitivity

Normalized Histogram#

Univariate Analysis: is a representation of the univariate statistical distribution with a plot of probability over an exhaustive set of bins over the range of possible values. These are the steps to build a normalized histogram,

Divide the continuous feature range of possible values into $K$ equal size bins, $\delta x$:

\[ \Delta x = \left( \frac{x_{max} - x_{min}}{K} \right) \]

or use available categories for categorical features.

Count the number of samples (frequency) in each bin, $n_k$, $\forall k=1,\ldots,K$ and divide each by the total number of data, $n$, to calculate the probability of each bin,

\[ p_k = \frac{n_k}{n}, \forall \quad k = 1,\ldots,L \]

Plot the probability vs. the bin label (use bin centroid if continuous)

Note, normalized histograms are typically plotted as a bar chart.

One Hot Encoding#

Feature Transformations: bin the range of the feature into K bins, then for each sample assignment of a value of 1 if the sample is within a bin and 0 if outsize the bin

binning strategies include uniform width bins (uniform) and uniform number of data in each bin (quantile)
also known as K bins discretization

Methods that require K bins discretization,

basis expansion to work in a higher dimensional space
discretization of continuous features to categorical features for categorical methods such as naive Bayes classifier
histogram construction and Chi-square test for difference in distributions
mutual information binning

Out-of-Bag Sample#

Bagging Tree and Random Forest: with bootstrap resampling of the data, it can be shown that about $\frac{2}{3}$ of the data will be included (in expectation). For bagging-based ensemble prediction models,

therefore are $\frac{1}{3}$ of the data (in expectation) unused in training each model realization, these are know as out-of-bag observations
for every response feature observation, $y_{\alpha}$, there are $\frac{B}{3}$ out-of-bag predictions, $y^{*,b}_{\alpha}$
we can aggregate this ensemble of prediction realizations, average for regression or mode for classification, to calculate a single out-of-bag prediction, $y^{*}_{\alpha} = \sum_{\alpha = 1}^{\frac{B}{3}} y^{*,b}_{\alpha}$
from these single out-of-bag predictions over all data, the out-of-bag mean square error (MSE) is calculated as,

\[ MSE_{OOB} = \sum_{\alpha = 1}^{\frac{B}{3}} \left[ y^{*}_{\alpha} - y_{\alpha} \right]^2 \]

For bagging-based ensemble predictive machine learning, there is no need to perform training and testing splits, hyperparameter tuning can be applied with out-of-bag MSE.

this is equivalent to random train and test split that may not be fair, same difficulty as the planned use of the model
this freezes the test proportion at about $\frac{1}{3}$

Overfit Model#

Machine Learning Concepts: a machine learning model that is fit to data noise or data idiosyncrasies

increased complexity will generally decrease error with respect to the training dataset but, may result in increase error with testing data
over the region of model complexity with rising testing error and falling training error

Issues of an overfit machine learning model,

more model complexity and flexibility than can be justified with the available data, data accuracy, frequency and coverage
high accuracy in training, but low accuracy in testing representing real-world use away from training data cases, indicating poor ability of the model to generalize

Parameters (statistics)#

Machine Learning Concepts: a summary measure of a population

for example, population mean, population standard deviation

We very rarely have access to actual population parameters, in general we infer population parameters with available sample statistics

Parameters (machine learning)#

Machine Learning Concepts: trainable coefficients for a machine learning model that control the fit to the training data

model parameters are calculated by optimization to minimize error over the training data through, analytical solution, or iterative solution, e.g., gradient descent optimization

Parametric Model#

Machine Learning Concepts: a model that makes an assumption about the functional form, shape of the natural system.

we gain simplicity and advantage of only a few parameters
for is a linear model we only have $m+1$ model parameters

There is a risk that our model is quite different than the natural setting, resulting in a poor model, for example, a linear model applied to a nonlinear phenomenon.

Partial Correlation Coefficient#

Multivariate Analysis: a method to calculate the correlation between $𝑿$ and $𝒀$ after controlling for the influence of $𝒁_𝟏,\ldots,𝒁_(𝒎−𝟐)$ other features on both $𝑿$ and $𝑌$. Note, I use $m-2$ to account for $X$ and $Y$ removed.

For $\rho_(𝑋,𝑌.𝑍_1,…,𝑍_(𝑚−2) )$,

perform linear, least-squares regression to predict $𝑿$ from $𝒁_𝟏,\ldots,𝒁_{𝒎−𝟐}$. $𝑿$ is regressed on the predictors to calculate the estimate, $𝑿^∗$.
perform linear, least-squares regression to predict $𝒀$ from $𝒁_𝟏,\ldots,𝒁_{𝒎−𝟐}$. $𝒀$ is regressed on the predictors to calculate the estimate, 𝒀^∗
calculate the residuals in Step #1, $𝑿 − 𝑿^∗$, where $𝑿^∗=𝒇(𝒁_𝟏,\ldots,𝒁_{𝒎−𝟐})$, linear regression model
calculate the residuals in Step #2, $𝒀 − 𝒀^∗$, where $𝒀^∗=𝒇(𝒁_𝟏,\ldots,𝒁_{𝒎−𝟐})$, linear regression model
calculate the correlation coefficient between the residuals from Steps #3 and #4, $\rho_{𝑿 −𝑿^∗,𝒀 − 𝒀^∗}$

Assumptions of Partial Correlation, for $𝝆_(𝑿,𝒀.𝒁_𝟏,\ldots,𝒁_{𝒎−𝟐})$,

$𝑿,𝒀,𝒁_𝟏,\ldots,𝒁_{𝒎−𝟐}$ have linear relationships, i.e., all pairwise relationships are linear
no outliers for any of the univariate distributions (univariate outliers) and pairwise relationships (bivariate outliers). Partial correlation is very sensitive to outliers like regular correlation.
Gaussian distributed, univariate and pairwise bivariate distributions Gaussian distributed. Bivariate should be linearly related and homoscedastic.

Partitional Clustering#

Cluster Analysis: all cluster group assignments are determined at once, as opposed to an agglomerative hierarchical clustering method that starts with $n$ clusters and then iteratively merges clusters into larger clusters

k-means clustering is partitional clustering, while the solution heuristic to find the solution is iterative, the solution is actually all at once
easy to update, for example, by modifying the prototype locations and recalculating the group assignments

Polygonal Declustering#

Data Preparation: a declustering method to assign weights to spatial samples based on local sampling density, such that the weighted statistics are likely more representative of the population. Data weights are assigned so that,

samples in densely sampled areas receive less weight
samples in sparsely sampled areas receive more weight

Polygonal declustering proceeds as follows:

Split up the area of interest with Voronoi polygons. These are constructed by intersected perpendicular bisectors between adjacent data points. The polygons group the area of interest by nearest data point
Assign weight to each datum proportional to the area of the associated Voronoi polygon

\[ w(\bf{u}_j) = n \cdot \frac{A_j}{\sum_{j=1}^n} \]

where $w(\bf{u}_j)$ is the weight for the $j$ data. Note, the sum of the weights is $n$; therefore, $w(\bf{u}_j)$ is nominal weight of 1.0, sample density if the data were equally spaced over the area of interest.

Here are some highlights for polygonal declustering,

polygonal declustering is sensitive to the boundaries of the area of interest; therefore, the weights assigned to the data near the boundary of the area of interest may change radically as the area of interest is expanded or contracted
polygonal declustering is the same as the Theissen polygon method for calculation of precipitation averages developed by Afred H. Thiessen in 1911, []

Polynomial Regression#

Polynomial Regression: application of polynomial basis expansion to the predictor features before linear regression,

\[ y = \sum_{l=1}^{k} \sum_{j=1}^{m} \beta_{j,l} h_l (X_j) + \beta_0 \]

where the h transforms over training data, $𝑖=1,\ldots,n$,

\[ h_1(x_i) = x_i, \quad h_2(x_i) = x_i^2, \quad h_3(x_i) = x_i^3, \quad h_4(x_i) = x_i^4, \dots, h_k(x_i) = x_i^k \]

up to the specified order $𝑘$.

For example, with a single predictor feature, $𝑚 = 1$, up to the $4^{th}$ order,

\[ y = \beta_{1,1} X + \beta_{1,2} X^2 + \beta_{1,3} X^3 + \beta_{1,4} X^4 + \beta_0 \]

After the $𝒉_𝒍$, $𝑙=1,\ldots,𝑘$ transforms, over the $𝑗=1,\dots,𝑚$ predictor features we have the same linear equation and the ability to utilize the previously discussed analytical solution.

we are assuming linearity after application of our basis expansion.

Now the model parameters, $\beta_(𝒍,𝒊)$, relate to a transformed version of the initial predictor feature, $𝒉_𝒍 (𝑿_𝒋)$.

we lose the ability to interpret the coefficients, for example, what is 𝑝𝑒𝑟𝑚𝑒𝑎𝑏𝑖𝑙𝑖𝑡𝑦$^4$?
generally, significantly higher model variance, i.e., may have unstable interpolation and especially extrapolation

Polynomial regression model assumptions,

error-free - predictor features basis expansions are error free, not random variables
constant variance - error in response is constant over predictor(s) value
linearity - response is linear combination of basis features
polynomial - relationships between 𝑋 and Y is polynomial
independence of error - error in response are uncorrelated with each other
no multicollinearity* - none of the basis feature expansions are linearly redundant with other features

Population#

Probability Concepts: exhaustive, finite list of property of interest over area of interest.

for example, exhaustive set of porosity measures at every location within a reservoir

Generally, the entire population is not generally accessible and we use a limited sample to make inference concerning the population

Power Law Average#

Feature Transformations: a general form for averaging based scale up, aggregation of smaller scale measures in a larger volume into a single value representative of the larger volume

\[ \overline{x}_p = \left(\frac{1}{n}\sum_{i=1}^n x_i^p \right)^{\frac{1}{p}} \]

useful to calculate effective permeability where flow is not parallel nor perpendicular to distinct permeability layers
flow simulation may be applied to calibrate (calculate the appropriate power for power law averaging)

Precision (classification accuracy metric)#

Naive Bayes: a categorical classification prediction model measure of accuracy, a single summary metric for each $k$ category from the confusion matrix.

the ratio of true positives divided by all positives, true positives + false positives

\[ Precision_k = \frac{ n_{k,\text{true positives}} }{ n_{k,\text{true positives}} + n_{k,\text{false positives}}} = \frac{ n_{k,\text{true positives}} }{ n_{k, \text{all positives}} } \]

Prediction Interval#

Linear Regression: the uncertainty in the next prediction represented as a range, lower and upper bound, based on a specified probability interval known as the confidence level.

We communicate confidence intervals like this,

there is a 95% probability (or 19 times out of 20) that the true reservoir NTG is between 13% and 17%, given, predictor feature values, $𝑋_1=𝑥_1,\ldots,𝑋_𝑚=𝑥_𝑚$.

Is the uncertainty in our prediction, for prediction intervals we integrate,

uncertainty in the model $𝐸{\hat{𝑌}|𝑋=𝑥}$
error in the model, conditional distribution $\hat{Y}|X=x$

Prediction, Predictive Statistics#

Machine Learning Concepts: estimate the next sample(s) given assumptions about or a model of the population

for example, given our model of the reservoir, predict the next well (pre-drill assessment) sample, e.g., porosity, permeability, production rate, etc.

Predictor Feature#

Machine Learning Concepts: the input feature for a predictive machine learning model. We can generalize a predictive machine learning model as,

\[ y = \hat{f}(x_1,\ldots,x_m) + \epsilon \]

where the response feature is $y$, the predictor features are $x_1,\ldots,x_m$, and $\epsilon$ is model error

traditional statistics uses the term independent variable

Predictor Feature Space#

Feature Ranking: refers to the predictor features and does not include the response feature(s), i.e.,

all possible combinations of predictor features for which we need to make predictions
may be referred to as predictor feature space.

Typically, we train and test our machines’ predictions over the predictor feature space.

the space is typically a hypercuboid with each axis representing a predictor feature and extending from the minimum to maximum, over the range of each predictor feature
more complicated shapes of predictor feature space are possible, e.g., we could mask or remove subsets with poor data coverage.

Primary Data#

Machine Learning Concepts: data samples for the feature of interest, the target feature for building a model, for example,

porosity measures from cores and logs used to build a full 3D porosity model. Any samples of porosity are the primary data
as opposed to secondary feature, e.g., if we have facies data to help predict porosity, the facies data are secondary data

Principal Component Analysis#

Principal Component Analysis: one of a variety of methods for dimensional reduction, transform the data to a lower dimension

given features, $𝑋_1,\dots,𝑋_𝑚$ we would require ${m \choose 2}=\frac{𝑚 \cdot (𝑚−1)}{2}$ scatter plots to visualize just the two-dimensional scatter plots.
once we have 4 or more variables understanding our data gets very hard.
recall the curse of dimensionality, impact inference, modeling and visualization.

One solution, is to find a good lower dimensional, $𝑝$, representation of the original dimensions $𝑚$

Benefits of Working in a Reduced Dimensional Representation:

data storage / Computational Time
easier visualization
also takes care of multicollinearity

Salient points of principal component analysis,

orthogonal transformation - convert a set of observations into a set of linearly uncorrelated variables known as principal components, the transformation retains pairwise distance, i.e., is a rotation
number of principal components ($k$) available - is min⁡($𝑛−1,𝑚$), limited by the variables/features, $𝑚$, and the number of data

Components are ordered,

first component describes the larges possible variance / accounts for as much variability as possible
next component describes the largest possible remaining variance
up to the maximum number of principal components

Eigenvalues and eigenvectors-based,

Calculate the data covariance matrix, the pairwise covariance for the combinatorial of features and then calculate the eigenvectors and eigenvalues from the covariance matrix,

the eigenvalues are the variance explained for each component.
the eigenvectors of the data covariance matrix are the principal components.

Probability Density Function (PDF)#

Univariate Analysis: a representation of a statistical distribution with a function, $f(x)$, of probability density over the range of all possible feature values, $x$. These are the concepts for PDFs,

non-negativity constraint, the density cannot be negative,

\[ 0.0 \le f(x) \]

for continuous features the density may be > 1.0, because density is a measure of likelihood and not of probability
integrate density over a range of $x$ to calculate probability,

\[ 0 \le \int_a^b f(x) dx = P(a \le x \le b) \le 1.0 \]

probability closure, the sum of the area under the PDF curve is equal to 1.0,

\[ \int_{-infty}^{\infty} f(x) dx = 1.0 \]

Nonparametric PDFs are calculated with kernels (usual a small Gaussian distribution) that is summed over all data; therefore, there is an implicitly scale (smoothness) parameter when calculating a PDF.

To large of kernels will smooth out important information about the univariate distribution
Too narrow will result in an overly noisy PDF that is difficult to interpret.

This is analogous to the choice of bin size for a histogram or normalized histogram.

Parametric PDFs are possible but require model fitting to the data, the steps are,

Select a parametric distribution, e.g., Gaussian, log normal, etc.
Calculate the parameters for the parametric distribution based on the available data, by methods such as least squares or maximum likelihood.

Probability Non-negativity, Normalization#

Probability Concepts: fundamental constraints on probability including,

Bounded, $0.0 \le P(A) \le 1.0$
Closure, $P(\Omega) = 1.0$
Null Sets, $P(\emptyset) = 0.0$

Probability of Acceptance (MCMC)#

Bayesian Linear Regression: applied in a rejection sampler as the likelihood of a candidate sample being added to the sample.

conditional acceptance is performed by Monte Carlo simulation,
sequentially sampling from conditional distributions

The acceptance rule is,

if $𝑃(𝑎𝑐𝑐𝑒𝑝𝑡) \ge 1$, accept – accept
if $𝑃(𝑎𝑐𝑐𝑒𝑝𝑡) \lt 1$, conditionally accept, draw $𝑝 ∼ U[0,1]$, and accept if $𝑝 \le 𝛼$

Probability Operators#

Probability Concepts: common probability operators that are essential to working with probability and uncertainty problems,

Union of Events - the union of outcomes, the probability of $A$ or $B$ is calculated with the probability addition rule,

\[ P(A \cup B) = P(A) + P(B) - P(A,B) \]

Intersection of Events - the intersection of outcomes, the probability of $A$ and $B$ is represented as,

\[ P(A \cap B) = P(A,B) \]

only under the assumption of independence of $A$ and $B$ can it be calculate from the probabilities of $A$ and $B$ as,

\[ P(A,B) = P(A) \cdot P(B) \]

if there is dependence between $A$ and $B$ then we need the conditional probability, $P(A|B)$ instead of the marginal, $P(A)$,

\[ P(A,B) = P(A|B) \cdot P(B) \]

Complimentary Events - is the NOT operator for probability, if we define $A$ then $A$ compliment, $A^c$ is not $A$ and we have this resulting closure relationship,

\[ P(A) + P(A^c) = 1.0 \]

complimentary events may be considered for beyond univariate problems, for example consider this bivariate closure,

\[ P(A|B) + P(A^c|B) = 1.0 \]

Note, the given term must be the same.

Mutually Exclusive Events - the events that do not intersect or do not have any common outcomes. We represent this with set notation as,

\[ \{x: x \in A \text{ and } x \in B \} = \emptyset \]

and the joint probability of $A$ and $B$ as,

\[ P(A \cap B) = P(A,B) = 0 \]

Probability Perspectives#

Probability Concepts: the 3 primary perspectives for calculating probability:

Long-term frequencies - probability as ratio of outcomes, requires repeated observations of an experiment. The basis for frequentist probability.
Physical tendencies or propensities - probability from knowledge about or modeling the system, e.g., we could know the probability of a heads outcome from a coin toss without the experiment.
Degrees of belief - reflect our certainty about a result, very flexible, assign probability to anything, and updating with new information. The basis for Bayesian probability.

Prototype (clustering)#

Cluster Analysis: represent the sample data with set of points in the feature space.

prototypes are typically not actual samples
sample data are often assigned to the nearest (Euclidean) distance prototype

Qualitative Features#

Machine Learning Concepts: information about quantities that you cannot directly measure, require interpretation of measurement, and are described with words (not numbers), for example,

rock type = sandstone
zonation = bornite-chalcopyrite-gold higher grade copper zone

Quantitative Features#

Machine Learning Concepts: features that can be measured and represented by numbers, for example,

age = 10 Ma (millions of years)
porosity = 0.134 (fraction of volume is void space)
saturation = 80.5% (volume percentage)

Like qualitative features, there is often the requirement for interpretation, for example, total porosity may be measured but should be converted to effective porosity through interpretation or a model

$r^2$ (also coefficient of determination)#

Linear Regression: the proportion of variance explained by the model in linear regression

This works only for linear models, where:

\[ \sigma^2_{tot} = \sigma^2_{reg} + \sigma^2_{res} \]

where $\sigma^2_{tot}$ is variance of response feature training, $y_i$, $\sigma^2_{reg}$ is variance of the model predictions, $\hat{y}_i$, and $\sigma^2_{res}$ is the variance of the errors, $\Delta y_i$.

for linear regression, $r^2 = \left( \rho_{x,y} \right)^2$

For nonlinear models this will not likely hold, then $\frac{\sigma^2_{𝑟𝑒𝑔}}{\sigma^2_{𝑡𝑜𝑡}}$ may exceed $[0,1]$, for our nonlinear models regression models we will use more robust measures, e.g. mean square error (MSE)

Random Forest#

Bagging Tree and Random Forest: a ensemble prediction model that is based on the standard bagging approach, specifically,

with decision tree
with diversification of the individual trees by restricting each split to consider a $p$ random subset of the $𝑚$ available predictors

There are various methods to calculate $p$ from $m$ available features,

\[ p = \sqrt{m} \]

is common. Note, if $p = m$ then random forest is tree bagging.

More comments on the benefit of ensemble model diversification,

the reduction in model variance by ensemble estimation, as represented by standard error in the mean,

\[ \sigma_{\overline{x}}^2 = fracc{\sigma_{s}^2}{n} \]

is under the assumption that the samples are uncorrelated. One issue with tree bagging is the trees in the ensemble may be highly correlated.

this occurs when there is a dominant predictor feature as it will always be applied to the top split(s), the result is all the trees in the ensemble are very similar (i.e. correlated)
with highly correlated trees, there is significantly less reduction in model variance with the ensemble
this forces each tree in the ensemble to evolve in dissimilar, decorrelated, manner

Realization#

Machine Learning Concepts: an outcome from a random variable or a joint outcome from a random function.

an outcome from a random variable, $X$, (or joint set of outcomes from a random function)
represented with lower case, e.g., $x$
for spatial settings it is common to include a location vector, $\bf{u}$, to describe the location, e.g., $x(\bf{u})$, as $X(\bf{u})$
resulting from simulation, e.g., Monte Carlo simulation, sequential Gaussian simulation, a method to sample (jointly) from the RV (RF)
in general, we assume all realizations are equiprobable, i.e., have the same probability of occurrence

Realizations (uncertainty)#

Machine Learning Concepts: multiple spatial, subsurface models calculated by stochastic simulation by holding input parameters and model choices constant and only changing the random number seed

these models represent spatial uncertainty
for example, hold the porosity mean constant and observe changes in porosity away from the wells over multiple realizations

Reasons to Learn Some Coding#

Machine Learning Workflow Construction and Coding: Professor Pyrcz’s reasons for all scientists and engineers to learn some coding,

Transparency – no compiler accepts hand waiving! Coding forces your logic to be uncovered for any other scientist or engineer to review.
Reproducibility – run it and get an answer, hand it over to a peer, they run it and they get the same answer. This is a principle of the scientific method.
Quantification – programs need numbers and drive us from qualitative to quantitative. Feed the program and discover new ways to look at the world.
Open-source – leverage a world of brilliance. Check out packages, snippets and be amazed with what great minds have freely shared.
Break Down Barriers – don’t throw it over the fence. Sit at the table with the developers and share more of your subject matter expertise for a better deployed product.
Deployment – share your code with others and multiply your impact. Performance metrics or altruism, your good work benefits many others.
Efficiency – minimize the boring parts of the job. Build a suite of scripts for automation of common tasks and spend more time doing science and engineering!
Always Time to Do it Again! – how many times did you only do it once? It probably takes 2-4 times as long to script and automate a workflow. Usually, worth it.

Be Like Us – it will change you. Users feel limited, programmers truly harness the power of their applications and hardware.

Recall (classification accuracy metric)#

Naive Bayes: a categorical classification prediction model measure of accuracy, a single summary metric for each $k$ category from the confusion matrix.

the ratio of true positives divided by all cases of the category in the testing dataset

\[ Recall_k = \frac{ n_{k, \text{true positives}} }{n_k} \]

Recursive Feature Elimination#

Feature Ranking: a method works by recursively removing features and building a model with the remaining features.

build a model with all features, calculate a feature ranking metric, e.g., coefficient or feature importance, depending on which is available with the modeling method
remove the feature with the lowest feature importance and rebuild the model

repeat the process until only one feature remains

Any model predictive model could be used,

the method assigns rank $1,\ldots,𝑚$ for all features as reverse order of removal, i.e., last remaining feature is most important and first removed is least important

Reservoir Modeling Workflow#

Machine Learning Workflow Construction and Coding: the following is the common geostatistical reservoir modeling workflow:

Integrate all available information to build multiple subsurface scenarios and realizations to sample the uncertainty space
Apply all the models to the transfer function to sample the decision criteria
Assemble the distribution of the decision criteria
Make the optimum reservoir development decisions accounting for this uncertainty model

Response Feature#

Machine Learning Concepts: output feature for a predictive machine learning model. We can generalize a predictive machine learning model as,

\[ y = \hat{f}(x_1,\ldots,x_m) + \epsilon \]

where the response feature is $y$, the predictor features are $x_1,\ldots,x_m$, and $\epsilon$ is model error

traditional statistics uses the term dependent variable

Ridge Regression (Tikhonov Regularization)#

Ridge Regression: a linear, parametric prediction model,

\[ y = \sum_{\alpha = 1}^m b_{\alpha} x_{\alpha} + b_0 \]

The analytical solution for the model parameters, $b_1,\ldots,b_m,b_0$, is available for the L2 norm loss function, the errors are summed and squared known a least squares.

we minimize a loss function including the error, residual sum of squares (RSS) over the training data and a regularization term:

\[ \sum_{i=1}^n \left(y_i - (\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha,i} + b_0) \right)^2 + \lambda \sum_{\alpha = 1}^m b_{\alpha}^2 \]

where $y_i$ is the actual response feature values and $\sum_{\alpha = 1}^m b_{\alpha} x_{\alpha} + b_0$ are the model predictions, over the $\alpha = 1,\ldots,n$ training data, and $\lambda \sum_{\alpha = 1}^m b_{\alpha}^2$ is the shrinkage penalty.

With ridge regression we add a hyperparameter, $\lambda$, to our minimization, with a shrinkage penalty term, $\sum_{j=1}^m b_{\alpha}^2$.

As a result, ridge regression training integrates two and often competing goals to find the model parameters,

find the model parameters that minimize the error with training data
minimize the slope parameters towards zero

Note: lambda does not include the intercept, $b_0$.

The $\lambda$ is a hyperparameter that controls the degree of fit of the model and may be related to the model bias-variance trade-off.

for $\lambda \rightarrow 0$ the solution approaches linear regression, there is no bias (relative to a linear model fit), but the model variance is likely higher
as $\lambda$ increases the model variance decreases and the model bias increases, the model becomes simpler
for $\lambda \rightarrow \infty$ the model parameters $b_1,\ldots,b_m$ shrink to 0.0 and the model predictions approaches the training data response feature mean

Sample#

Machine Learning Concepts: the set of values, locations that have been measured

for example, 1,000 porosity measures from well-logs over the wells in the reservoir
or 1,000,000 acoustic impedance measurements over a 1000 x 1000 2D grid for a reservoir unit of interest

Scenarios (uncertainty)#

Machine Learning Concepts: multiple spatial, subsurface models calculated by stochastic simulation by changing the input parameters or other modeling choices to represent the uncertainty due to inference of model parameters and model choices

for example, model three porosity input distribution, porosity mean low, mid and high, and vary the input distribution to calculate new subsurface models

Secondary Data#

Machine Learning Concepts: data samples for another feature, not the feature of interest, the target feature for building a model, but are used to improve the prediction of the target feature.

requires a model of the relationship between the primary and secondary data

For example, samples in space of,

acoustic impedance (secondary data) to support calculation of a model of porosity, the feature of interest
porosity (secondary data) to support calculation of a model of permeability, the feature of interest

Seismic Data#

Machine Learning Concepts: indirect measurement with remote sensing, reflection seismic applies acoustic source(s) and receivers (geophones) to map acoustic reflections with high coverage and generally low resolution. Some more details,

seismic reflections (amplitude) data are inverted to rock properties, e.g., acoustic impedance, consistent with and positionally anchored with well sonic logs
provides framework, bounding surfaces for extents and shapes of reservoirs along with soft information on reservoir properties, e.g., porosity and facies.

Shapley Value#

Feature Ranking: model-based, local (for a single prediction) and global (over a suit of predictions) feature importance by learning contribution of each feature to the prediction.

A explainable machine learning method to support complicated models are often required but have low interpretability.

Two choices to improve model interpretability,

reduce the complexity of the models, but may also reduce model accuracy
develop improved, agnostic (for any model) model diagnostics, i.e., Shapley value

Shapley value is a cooperative game theory approach that,

for allocating resources between players based on a summarization of marginal contributions, i.e., dividing up payment between players
calculates the contribution of each predictor feature to push the response prediction away from the mean value of the response
marginal contributions and Shapley values are in units of the response feature
in the units of the response feature

Simpson’s Paradox#

Multivariate Analysis: data trend reverses or disappears when groups are combined (or separated). Often observed in correlation analysis when grouping data, for example,

groups each have a negative correlation, but the whole has a positive correlation

Soft Data#

Machine Learning Concepts: data that has a high degree of uncertainty, such that data uncertainty must be integrated into the model

for example, probability density function for local porosity calibrated from acoustic impedance

Soft data integration requires workflows like indicator kriging, indicator simulation and p-field simulation or workflows that randomize the data with standard simulation methods that assume hard data like sequential Gaussian simulation.

soft data integration is an advanced topic and a focus of ongoing research, but is often to done with standard, subsurface modeling software packages

Spatial Sampling (biased)#

Data Preparation: sample such that the sample statistics are not representative of the population parameters. For example,

the sample mean is not the same as the population mean
the sample variance is not the same as the population variance

Of course, the population parameters are not accessible, so we cannot directly calculate sampling bias, i.e., the difference between the sample statistics and the population parameters. Methods we can use to check for biased sampling,

evaluate the samples for preferential sampling, clustering, filtering, etc.
apply declustering and check the results for a major change in the summary statistics, this is using declustering diagnostically

Spatial Sampling (clustered)#

Data Preparation: spatial samples with locations preferentially selected, i.e., clustered, resulting in biased statistics,

typically spatial samples are clustered in locations with higher value samples, e.g., high porosity and permeability, good quality shale for unconventional reservoirs, low acoustic impedance indicating higher porosity, etc.

Of course, the population parameters are not accessible, so we cannot directly calculate sampling bias, i.e., the difference between the sample statistics and the population parameters. Methods we can use to check for biased sampling,

evaluate the samples for preferential sampling, clustering, filtering, etc.
apply declustering and check the results for a major change in the summary statistics, this is using declustering diagnostically

Spatial Sampling (common practice)#

Data Preparation: sample locations are selected to,

Reduce uncertainty - by answering questions, for example,

how far does the contaminant plume extend? – sample peripheries
where is the fault? – drill based on seismic interpretation
what is the highest mineral grade? – sample the best part
who far does the reservoir extend? – offset drilling

Directly maximize net present value - while collecting information, for example,

maximize production rates
maximize tonnage of mineral extracted

In other words, often our samples are dual purpose, e.g., wells that are drilled for exploration and appraisal information are subsequently utilized for production.

Spatial Sampling (representative)#

Data Preparation: if we are sampling for representativity, i.e., the sample set and resulting sample statistics are representative of the population, by sampling theory we have 2 options:

Random sampling - each potential sample from the population is equally likely to be sampled as samples are collected. This includes,

selecting a specific location has no impact on the selection of subsequent locations.
assumption that the population size that is much larger than the sample size; therefore, significant correlation between samples is not imposed due to without replacement sampling (the constraint that you can only sample a location once). Note, generally this is not an issue for the subsurface due to the sparsely sampled massive populations

Regular sampling - sampling at equal space or time intervals. While random sampling is preferred, regular sampling is robust as long as,

the regular sampling intervals do not align with natural periodicity in the data, for example, the crests are systematically samples resulting in biased high sample statistics

Spectral Clustering#

Spectral Clustering: a partitional clustering method that utilizes the spectrum, eigenvalues and eigenvectors, of a matrix that represents the pairwise relationships between the data.

dimensionality reduction from data samples pairwise relationships characterized by the graph Laplacian matrix
eigenvalues, eigenvectors are equivalent to principal component analysis dimensionality reduction by linear, orthogonal feature projection and rotation to best describe the variance

Advantages of spectral clustering,

the ability to encode pairwise relationships, integrate expert knowledge.
eigenvalues provide useful information on the number of clusters, based on the degree of ‘cutting’ required to make k clusters
lower dimensional representation for the sample data pairwise relationships
the resulting eigenvalues and eigenvectors can be interpreted, eigenvalues describe the amount of connection for each number of groups and eigenvectors are grouped to form the clusters

Standardization#

Feature Transformations: a distribution rescaling that can be thought of as shifting, and stretching or squeezing of a univariate distribution (e.g., histogram) to a mean of 0.0 and a variance of 1.0.

\[ y_i = \frac{1}{\sigma_x}(x_i - \overline{x}), \quad \forall \quad i, \ldots, n \]

where $\overline{x}$ and $\sigma_x$ are the original mean and variance.

this is a shift and stretch / squeeze of the original property distribution
assumes no shape change, rank preserving

Stochastic Gradient-based Optimization#

LASSO Regression: a method to solve for model parameters by iteratively minimizing the loss function. Stochasticity and improve computational efficiency are added to gradient descent through the use of batches,

a batch is a random subset of the training data with specified size, $n_{batch}$
resulting in stochastic approximations of the loss function gradient, that are faster to calculate
batches reduce accuracy in the gradient descent, but speed up the calculation and can perform more steps, often faster than gradient descent
increase $𝑛_{𝑏𝑎𝑡𝑐ℎ}$ for more accuracy of gradient estimation, and decrease $𝑛_{𝑏𝑎𝑡𝑐ℎ}$ to speed up the steps

Robbins-Siegmund (1971) Theorem - converge to global minimum for convex loss functions and either a global or local minimum for nonconvex loss functions.

The steps include,

start with random model parameters
select a random subset of training data, $n_{batch}$
calculate the loss function and loss function gradient for the model parameters over the random batch,

\[ \nabla L(y_{\alpha}, F(X_{\alpha}, b_1)) = \frac{L(y_{\alpha}, F(X_{\alpha}, b_1 - \epsilon)) - L(y_{\alpha}, F(X_{\alpha}, b_1 + \epsilon))}{2\epsilon} \]

update the parameter estimate by stepping down slope / gradient,

\[ \hat{b}_{1,t+1} = \hat{b}_{1,t} - r \nabla L(y_{\alpha}, F(X_{\alpha}, b_1)) \]

where $r$ is the learning rate/step size, $\hat{b}(1,𝑡)$, is the current model parameter estimate and $\hat{b}(1,𝑡+1)$ is the updated parameter estimate.

Stochastic Model#

Machine Learning Concepts: system or process that is uncertain and is represented by multiple models, realizations and scenarios constrained by statistics,

for example, data-driven models that integrate uncertainty like geostatistical simulation models

Advantages:

speed
uncertainty assessment
report significance, confidence / prediction intervals
honor many types of data
data-driven approaches

Disadvantages:

limited physics used
statistical model assumptions / simplification

For the alternative to stochastic models see deterministic models.

Statistics (practice)#

Machine Learning Concepts: the theory and practice for collecting, organizing, and interpreting data, as well as drawing conclusions and making decisions.

Statistics (measurement)#

Machine Learning Concepts: summary measure of a sample, for example,

sample mean - $\overline{x}$
sample standard deviation - $s$,

we use statistics as estimates of the model parameters that summarize the population (inference)

Statistical Distribution#

Univariate Analysis: for a feature a description of the probability of occurrence over the range of possible values. We represent the univariate statistical distribution with,

histogram
normalized histogram
probability density function (PDF)
cumulative distribution function (CDF)

What do we learn from a statistical distribution? For example,

what is the minimum and maximum?
do we have a lot of low values?
do we have a lot of high values?
do we have outliers, and any other values that don’t make sense and need explaining?

Support Vector (support vector machines)#

Support Vector Machines: training data within the margin or misclassified and update the support vector machines classification model.

with a support vector machines model, training data well within the correct region, are not support vectors, and have no impact on the model

Support Vector Machines#

Support Vector Machines: a predictive, binary classification machine learning method that is a good classification method when there is poor separation of groups.

projects the original predictor features to higher dimensional space and then applies a linear, plane or hyperplane,

\[ 𝑓(𝑥) = 𝑥^𝑇 \beta +\beta_0 \]

where $\beta$ is a vector and together with $\beta$ are the hyperplane model parameters, while $x$ is the matrix of predictor features, all are in the high dimensional space.

\[ 𝐺(𝑥)=\text{𝑠𝑖𝑔𝑛}\left( 𝑓(𝑥) \right) \]

$𝑓(𝑥)$ is proportional to the signed distance from the decision boundary, and $𝐺(𝑥)$ is the side of the decision boundary, $−$ one side and $+$ the other, $f(x) = 0$ is on the decision boundary.

We represent the constraint, all data of each category must be on the correct side of the boundary, by,

\[ y_i \left( x_i^T \beta + \beta_0 \right) \geq 0 \]

where this holds if the categories, $y_i$, are -1 or 1. We need a model that allows for some misclassification,

\[ y_i \left( x_i^T \beta + \beta_0 \right) \geq M - \xi_i \]

We introduce the concept of a margin, $𝑀$, and a distance from the margin, the error as $\xi_i$. Now we can pose our loss function as,

\[ \underset{\beta, \beta_0}{\text{min}} \left( \frac{1}{2M^2} + C \sum_{i=1}^N \xi_i \right) \]

subject to, $\xi_i \geq 0, \quad y_i \left( x_i^T \beta + \beta_0 \right) \geq M - \xi_i$.

This is the support vector machine loss function in the higher dimensional space, where 𝛽,𝛽_0 are the multilinear model parameters.

Training the support vector machine, by finding the model parameters of the plane to maximize the margin, $M$, while minimizing the error, $\sum_{i=1}^N \xi_i$

$𝑪$ hyperparameter weights the sum of errors, $xi_𝑖$, higher $𝐶$, will result in reduced margin, $M$, and lead to overfit
smaller margin, fewer data used to constrain the boundary, known as support vectors
training data well within the correct side of the boundary have no influence

Here are some key aspects of support vector machines,

known as support vector machines, and not machine, because with a new kernel you get a new machine
there are many kernels available including polynomial and radial basis functions

The primary hyperparameter is $C$, the cost of

Hyperparameters are related to the choice of kernel, for example,

polynomial - polynomial order
radial basis function - $\gamma$ inversely proportional to the distance influence of the training data

Tabular Data#

Machine Learning Workflow Construction and Coding: data table with rows for each sample and columns for each feature

Pandas’ DataFrames are a convenient class for working with tabular data, due to,

convenient data structure to store, access, manipulate tabular data
built-in methods to load data from a variety of file types, Python classes and even directly from Excel
built-in methods to calculate summary statistics and visualize data
built-in methods for data queries, sort, data filters
built-in methods for data manipulation, cleaning, reformatting
built-in attributes to store information about the data, e.g. size, number nulls and null value

Training and Testing Splits#

Machine Learning Concepts: for model cross validation, prior to predictive model training withhold a proportion of the data as testing data.

training data are applied to train the model parameters, while withheld testing data are applied to tune the model hyperparameter
hyperparameter tuning is selecting the hyperparameter combination that minimizes the error norm over the withheld testing data

The most common approach is random selection, this may not be fair testing,

the range of testing difficulty is similar to the real-world use of the model
too easy – testing cases are the same or almost the same as training cases, random sampling is often too easy
too hard – testing cases are very different from the training cases, the model is expected to severely extrapolate

Alternative methods such as k-fold cross validation provide the opportunity for testing over all available data but require,

the training k predictive machine learning models over the hyperparameter combinations
aggregation of the testing error over the k models for selection of the optimum hyperparameters, hyperparameter tuning

Also, there are alternative workflow that include, training, validation and testing subsets of the data

Transfer Function (reservoir modeling workflow)#

Machine Learning Concepts: calculation applied to the spatial, subsurface model realizations and scenarios to calculate a decision criterion, a metric that is used to support decision making representing value, and health, environment and safety. Example transfer functions include,

transport and bioattenuation - numerical simulation to model soil contaminant concentrations over time during a pump and treat operation
volumetric calculation - for total oil-in-place to calculate resource in place
heterogeneity metric - as an indicator of recovery factor to estimate reserves from resources
flow simulation - for pre-drill production forecast for a planned well
Whittle’s pit optimization - to calculate mineral resources and ultimate pit shell

Uncertainty Modeling#

Machine Learning Concepts: calculation of the range of possible values for a feature at a location or jointly over many locations at the sample time. Some considerations,

quantification of the limitation in the precision of our samples and model predictions
uncertainty is a model, there is no objective uncertainty
uncertainty is caused by our ignorance
uncertainty is caused by sparse sampling, measurement error and bias, and heterogeneity

we represent uncertainty by multiple models, scenarios and realizations:

Scenarios - multiple spatial, subsurface models calculated by stochastic simulation by changing the input parameters or other modeling choices to represent the uncertainty due to inference of model parameters and model choices
Realizations - multiple spatial, subsurface models calculated by stochastic simulation by holding input parameters and model choices constant and only changing the random number seed

Underfit Model#

Machine Learning Concepts: a machine learning model that too simple, too low complexity and flexibility, to fit the natural phenomenon resulting in very high model bias.

underfit models often approach the response feature global mean
underfit models have high error over training and testing data
increased complexity will generally decrease error with respect to the training and testing dataset
over the region of model complexity with falling training and testing error

Issues of an underfit machine learning model,

more model complexity and flexibility is insufficient given the available data, data accuracy, frequency and coverage
low accuracy in training and testing representing real-world use away from training data cases, indicating poor ability of the model to generalize

Union of Events (probability)#

Probability Concepts: the union of outcomes, the probability of $A$ or $B$ is calculated with the probability addition rule,

\[ P(A \cup B) = P(A) + P(B) - P(A,B) \]

Univariate Parameters#

Univariate Analysis: summary measures based on one feature measured over the population

Univariate Statistics#

Univariate Analysis: summary measures based on one feature measured over the samples

Unsupervised Learning#

Cluster Analysis: learning patterns in data from unlabeled data.

no response features, $𝑌$, instead only predictor features, $𝑋_1,ldots,𝑋_𝑚$
machine learns by mimicry a compact representation of the data
captures patterns as feature projections, group assignments, neural network latent features, etc.
focus on inference of the population, the natural system, instead of prediction of response features

In this course we use the terms inferential and predictive machine learning, all the covered inferential machine learning methods are unsupervised.

Variable (also feature)#

Machine Learning Concepts: any property measured or observed in a study, for example,

porosity, permeability, mineral concentrations, saturations, contaminant concentration
in data mining / machine learning this is known as a feature
measure often requires significant analysis, interpretation, etc.

Variance Inflation Factor (VIF)#

Feature Ranking: a measure of linear multicollinearity between a predictor feature ($X_i$) a nd all other predictor features ($X_j, \forall j \ne i$).

First we calculate a linear regression for a predictor feature given all the other predictor features.

\[ X_i = \sum_{j, j \ne i}^m X_j + \epsilon \]

From this model we determine the coefficient of determination, $R^2$, known as variance explained.

Then we calculate the Variance Inflation Factor as:

\[ VIF = \frac{1}{1 - R^2} \]

Volume-Variance Relations#

Feature Transformations: as the volume support (scale) increases the variance reduces

Predicting volume-variance relations is central to handling multiple scales of data and models. Some general observations and assumptions,

the mean does not change as the volume support, scale changes. Only the variance changes
there may be shape change (we will not tackle that here). Best practice is to check shape change empirically. It is common to assume no shape change (affine correction) or to use a shape change model (indirect lognormal correction).
the variance reduction in the distribution is inversely proportional to the range of spatial continuity. Variance reduces faster (over smaller volume increase) for shorter spatial continuity ranges.

Over common changes in scale this impact may be significant; therefore, it is not appropriate to ignore volume-variance relations,

we don’t do this scale up, change in volume support perfectly, and this is why it is still called the missing scale. We rarely have enough data to model this rigorously
we need a model to predict this change in variance with change in volume support

There are some change in volume support, scale models,

Empirical - build a small scale, high resolution model and scale it up numerically. For example, calculate a high resolution model of permeability, apply flow simulation to calculate effective permeability over $v$ scale blocks

Power Law Averaging - there is a flexible approach known as power law averaging.

\[ z_V = \left[ \frac{1}{n} \sum z_v^{\omega} \right] ^{\frac{1}{\omega}} \]

where $\omega$ is the power of averaging. For example:

$\omega = 1$ is a regular linear averaging
$\omega = -1$ is a harmonic averaging
$\omega = 0$ is a geometric averaging (this is proved in the limit as $\omega \rightarrow 0$)

How to calculate $\omega$?

for some cases we know from theory the correct $\omega$ value, for example, for flow orthogonal to beds we select $\omega = -1.0$ to scale up permeability
flow simulation may be applied to numerically scale up permeability and then to back-calculate a calibrated $\omega$

Model - directly adjust the statistics for change in scale. For example, under the assumption of linear averaging and a stationary variogram and variance:

\[ f = 1 - \frac{\overline{\gamma}(v,v)}{\sigma^2} \]

where $f$ is variance reduction factor,

\[ f = \frac{D^2(v,V)}{D^2(\cdot,V)} = \frac{D^2(v,V)}{\sigma^2} \]

in other words, $f$ is the ratio of the variance at scale $v$ to the variance at the original data point support scale based on,

the variogram model
the scale of the data, $\cdot$ and the scale of $v$

Venn Diagrams#

Probability Concepts: a plot, visual tool for communicating probability. What do we learn from a Venn diagram?

size of regions $\propto$ probability of occurrence
proportion of $\Omega$, all possible outcomes represented by a box, i.e., probability of $1.0$
overlap $\propto$ probability of joint occurrence

Venn diagrams are an excellent tool to visualize marginal, joint and conditional probability.

Well Log Data#

Machine Learning Concepts: as a much cheaper method to sample wells that does not interrupt drilling operations, well logs are very common over the wells. Often all wells have various well logs available. For example,

gamma ray on pilot vertical wells to assess the locations and quality of shales for targeting (landing) horizontal wells
neutron porosity to assess location high porosity reservoir sands
gamma ray in drill holes to map thorium mineralization

Well log data are critical to support subsurface resource interpretations. Once anchored by core data they provide the essential coverage and resolution to model the entire reservoir concept / framework for prediction, for example,

well log data calibrated by core data collocated with well log data are used to map the critical stratigraphic layers, including reservoir and seal units
well logs are applied to depth correct features inverted from seismic that have location imprecision due to uncertainty in the rock velocity over the volume of interest

Weak Learner#

Gradient Boosting: the prediction model performs only marginally better than random

\[ 𝑌 = \hat{f}_𝑘(𝑋_1,\ldots,𝑋_𝑚) \]

where $\hat{f}_𝑘$ is the $𝑘^{th}$ weak learner, $𝑋_1,\ldots,𝑋_𝑚$ are the predictor features, $\hat{Y}$ is the prediction of the response feature.

The term weak predictor is often used, and specifically the term weak classifier for the case of classification models.

Well Log Data, Image Logs#

Machine Learning Concepts: a special case of well logs where the well logs are repeated at various azimuthal intervals within the well bore resulting in a 2D (unwrapped) image instead of a 1D line along the well bore. For example, Fullbore formation MicroImager (FMI) with:

with 80% bore hole coverage
0.2 inch (0.5 cm) resolution vertical and horizontal
30 inch (79 cm) depth of investigation

can be applied to observe lithology change, bed dips and sedimentary structures.

Comments#

This was a basic introduction to geostatistics. If you would like more on these fundamental concepts I recommend the Introduction, Modeling Principles and Modeling Prerequisites chapters from my text book, Geostatistical Reservoir Modeling{cite}`pyrcz2014’.

I hope this is helpful,

Michael

The Author:#

Michael Pyrcz, Professor, The University of Texas at Austin Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!
Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin

Order	Hermite Polynomial \(H_e(x)\)
0th Order	\(H_{e_0}(x) = 1\)
1st Order	\(H_{e_1}(x) = x\)
2nd Order	\(H_{e_2}(x) = x^2 - 1\)
3rd Order	\(H_{e_3}(x) = x^3 - 3x\)
4th Order	\(H_{e_4}(x) = x^4 - 6x^2 + 3\)

Machine Learning Glossary

Contents

Machine Learning Glossary#

Motivation for Machine Learning Concepts#

Adjacency Matrix (spectral clustering)#

Addition Rule (probability)#

Affine Correction#

Affinity Matrix (spectral clustering)#

Bagging Models#

Basis Expansion#

Basis Function#

Bayes’ Theorem (probability)#

Bayesian Probability#

Bayesian Updating for Classification#

Bayesian Linear Regression#

Big Data#

Big Data Analytics#

Binary Transform (also Indicator Transform)#

Boosting Models#

Bootstrap#

Categorical Feature#

Categorical Nominal Feature#

Categorical Ordinal Feature#

Causation#

Cell-based Declustering#

Cognitive Biases#

Complimentary Events (probability)#

Computational Complexity#

Conditional Probability#

Confidence Interval#

Confusion Matrix#

Continuous Feature#

Continuous, Interval Feature#

Continuous, Ratio Feature#

Continuously Differentiable#

Convolution#

Core Data#

Correlation#

Covariance#

Cross Validation#

Cumulative Distribution Function (CDF)#

Curse of Dimensionality#

Data (data aspects)#

Data Convexity#

DataFrame#

Data Analytics#

Data Preparation#

Degree Matrix (spectral clustering)#

DBSCAN for Density-based Clustering#

Decision Criteria#

Decision Tree#

Declustering#

Declustering (statistics)#

Density-Connected (DBSCAN)#

Density-based Cluster (DBSCAN)#

Density-Reachable (DBSCAN)#

Deterministic Model#

Dimensionality Reduction#

Directly Density Reachable (DBSCAN)#

Discrete Feature#

Distribution Transformations#

Eager Learning#

Estimation#

f1-score (classification accuracy metric)#

Feature (also variable)#

Feature Engineering#

Feature Importance#

Feature Imputation#

Feature Projection#

Feature Space#

Feature Ranking#

Feature Transformations#

Fourth Paradigm#

Frequentist Probability#

Gaussian Anamorphosis#

Gibbs Sampler (MCMC)#

Gradient Boosting Models#

Graph Laplacian (spectral clustering)#

Geostatistics#

Gradient-based Optimization#