Machine Learning Concepts#

Michael J. Pyrcz, Professor, The University of Texas at Austin

Twitter | GitHub | Website | GoogleScholar | Book | YouTube | Applied Geostats in Python e-book | LinkedIn

Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.

Cite this e-Book as:

Pyrcz, M.J., 2024, Applied Machine Learning in Python: a Hands-on Guide with Code, https://geostatsguy.github.io/MachineLearningDemos_Book.

The workflows in this book and more are available here:

Cite the MachineLearningDemos GitHub Repository as:

Pyrcz, M.J., 2024, MachineLearningDemos: Python Machine Learning Demonstration Workflows Repository (0.0.1). Zenodo. DOI

By Michael J. Pyrcz
© Copyright 2024.

This chapter is a summary of Machine Learning Concepts including essential concepts:

  • Statistics and Data Analytics

  • Inferential Machine Learning

  • Predictive Machine Learning

  • Machine Learning Model Training and Tuning

  • Machine Learning Model Overfit

YouTube Lecture: check out my lecture on Introduction to Machine Learning. For your convenience here’s a summary of salient points.

Motivation for Machine Learning Concepts#

You could just open up a Jupyter notebook in Python and start building machine learning models. The scikit-learn docs are quite good, and for every machine learning function there is a short code example that you could copy and paste to repeat their work. Also, you could Google a question about using a specific machine learning algorithm in Python and the top results will include StackOverflow questions and responses; it is truly amazing how much experienced coders are willing to give back and share their knowledge. We truly have an amazing scientific community with the spirit of knowledge sharing and open-source development. Respect. Of course, you could learn a lot about machine learning from a large language model (LLM) like ChatGPT. Not only will ChatGPT answer your questions, but it will also provide code and help you debug it when you tell it what went wrong. One way or the other, you received and added this code to your data science workflow.

from sklearn.neighbors import KNeighborsRegressor  # k-nearest neighbours regression

n_neighbours = 10; weights = "distance"; p = 2     # model hyperparameters
neigh = KNeighborsRegressor(n_neighbors=n_neighbours, weights=weights, p=p).fit(X_train, y_train)

Et voilà, you have a trained predictive machine learning model that could be applied to make predictions for new cases.

  • But, is this a good model?

  • How good is it?

  • Could it be better?

Without knowledge about basic machine learning concepts we can’t answer these questions and build the best possible models. In general, I’m not an advocate for black box modeling, because it is:

  1. likely to lead to mistakes that may be difficult to detect and correct

  2. incompatible with the expectations for competent practice for professional engineers. I gave a talk on Applying Machine Learning as a Competent Engineer or Geoscientist

To help out, this chapter provides you with the basic knowledge to answer these questions and to make better, more reliable machine learning models. Let's start building this essential foundation with some definitions.

Big Data#

Everyone hears that machine learning needs a lot of data. In fact, so much data that it is called "big data", but how do you know if you are working with big data? The criteria for big data are these 'V's; if you answer yes to at least some of these, then you are working with big data:

  • Volume: many data samples, difficult to handle and visualize

  • Velocity: high rate collection, continuous relative to decision making cycles

  • Variety: data from various sources, with various types and scales

  • Variability: data acquisition changes during the project

  • Veracity: data has various levels of accuracy

In my experience, most subsurface engineers and geoscientists answer yes to all of these questions. So I proudly say that we in the subsurface have been working with big data long before the tech sector learned about big data. In fact, I state that we in the subsurface resource industries are the original data scientists. I'm getting ahead of myself, more on this in a bit. Don't worry if I get carried away in hubris; rest assured this e-book is written for anyone interested in learning about machine learning. You can skip the short sections on subsurface data science or read along if interested. Now that we know big data, let's talk about the big topics.

Statistics, Geostatistics and Data Analytics#

Statistics is collecting, organizing, and interpreting data, as well as drawing conclusions and making decisions. If you look up the definition of data analytics you will find criteria that include statistical analysis and data visualization to support decision making. I'm going to call it, data analytics and statistics are the same thing. Now we can append geostatistics as a branch of applied statistics that accounts for:

  1. the spatial (geological) context

  2. the spatial relationships

  3. volumetric support

  4. uncertainty

Remember all those statistics classes with the assumption of i.i.d., independent and identically distributed data? Spatial phenomena don't behave that way, so we developed a unique branch of statistics to address this. By our assumption above, we can state that geostatistics is the same as spatial data analytics. Now, let's use a Venn diagram to visualize statistics / data analytics and geostatistics / spatial data analytics:

Venn diagram for statistics and geostatistics.

Now we can add our previously discussed big data to our Venn diagram resulting in big data analytics and spatial big data analytics.

Venn diagram for statistics and geostatistics with big data added.

Data Science#

If no one else has said this to you, let me have the honor of welcoming you to the fourth paradigm for scientific discovery, data-driven scientific discovery, or just call it data science. The paradigms for scientific discovery are distinct stages or approaches for humanity to develop science. To understand the fourth paradigm let's first consider the previous three paradigms. Here are all four paradigms with dates for important developments in each:

  • 1st Paradigm, Empirical Science - Experiments, e.g., 430 BC Empedocles proved air has substance and 230 BC Eratosthenes measured Earth's diameter

  • 2nd Paradigm, Theoretical Science - Models and Laws, e.g., 1011 AD al-Haytham Book of Optics and 1687 AD Newton Principia

  • 3rd Paradigm, Computational Science - Numerical Simulation, e.g., 1942 Manhattan Project and 1980 Global Forecast System (GFS)

  • 4th Paradigm, Data Science - Learning from Data, e.g., 2009 Hey et al. Data-Intensive Book and 2015 AlphaGo beats a professional Go player

Of course, we can argue about the boundaries between the paradigms for scientific discovery, i.e., when did a specific paradigm begin? Certainly the Mesopotamians (4000 BC - 3500 BC) conducted many first paradigm experiments that supported their development of the wheel, plow, chariot, weaving loom, and irrigation, and on the other side we can trace the development of artificial neural networks to McCulloch and Pitts in 1943. Also, the adoption of a new scientific paradigm is a major societal shift that does not occur uniformly around the globe.

So what caused the fourth paradigm to start some time between 1943 and 2009? The fundamental mathematics of data-driven models has been available for a long time, considering the following mathematical and statistical developments in history:

  • Calculus - Isaac Newton and Gottfried Wilhelm Leibniz independently developed the math to find minimums and maximums during the 1600s, Newton with "Method of Fluxions" written in 1671 (published posthumously in 1736), and Leibniz with "Nova Methodus pro Maximis et Minimis" published in 1684.

  • Bayesian Probability - introduced by Reverend Thomas Bayes with Bayes’ Theorem enshrined in “An Essay Towards Solving a Problem in the Doctrine of Chances” published posthumously in 1763

  • Linear Regression - formalized by Adrien-Marie Legendre in 1805

  • Discriminant Analysis - developed by Ronald Fisher in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems"

  • Monte Carlo Simulation - pioneered by Stanislaw Ulam and John von Neumann in the 1940s as part of the Manhattan Project

Upon reflection, one could ask, why didn't the fourth paradigm start before the 1940s, or even in the 1800s? What changed? The answer is that other critical developments provided the fertile ground for data science, including cheap and available compute and big data.

Cheap and Available Compute#

Consider the following developments in computers.

  • Charles Babbage's Analytical Engine (1837) is often credited as the first computer, yet it was a mechanical device that attempted to implement many modern concepts such as arithmetic logic, control flow and memory, but it was never completed given the challenges with the available technology.

  • Konrad Zuse's Z3 (1941) and ENIAC (1945) were the first digital, programmable computers, but they were programmed through labor-intensive and time-consuming rewiring of plug boards in machine language, as there was no high-level programming language; the memory could hold only about 1,000 words, limiting program complexity; and with thousands of vacuum tubes, cooling and maintenance were a big challenge. These unreliable machines were very slow, with only about 5,000 additions per second.

  • Transistors, invented in 1947, replaced vacuum tubes, greatly improving energy efficiency, miniaturization, speed and reliability, resulting in second-generation computers such as the IBM 7090 and UNIVAC 1108.

  • Integrated Circuits, developed in the 1960s, allowed multiple transistors to be placed on a single chip, leading to third-generation computers.

  • Microprocessors, developed in 1971, integrated the functions of a computer's central processing unit (CPU), enabling smaller, cheaper computers and resulting in the personal, home computer revolution including the Apple II (1977) and IBM PC (1981). Yes, when I was in grade 1 in elementary school we had an Apple II in my classroom, and it amazed all of us with its monochrome (orange and black pixels) monitor, floppy disk loaded programs (there was no hard drive) and beeps and clicks from the speaker!

We live in a society with more compute power in our pockets (cell phones) than was used to send the first astronauts to the moon, and a SETI screen saver that uses home computers' idle time to search for extraterrestrial life! Now we are surrounded by cheap and reliable compute that our grandparents (and perhaps parents) could never have imagined.

As you will learn in this e-book, machine learning methods require a lot of compute. In fact, training most of these machine learning methods relies on a large number of iterations, matrix or parallel operations, bootstrapping models and stochastic descent, and tuning the models requires many trained models, adding yet another loop of iteration. Cheap and available compute is an essential prerequisite for the fourth paradigm. Now consider that the development of data science has been largely a crowd-sourced, open-source effort. There cannot be data science without many people having easy access to compute.

Availability of Big Data#

With small data we tend to rely on other sources of information such as physics, engineering and geoscience principles and then calibrate these models to the few available observations, as is common in applications of the second and third paradigms for scientific discovery. With big data we can learn the full range of behaviors of natural systems from the data itself. In fact, with big data we often see the limitations of the second and third paradigm models due to missing complexity in our solutions.

Therefore, big data:

  • provides sufficient sampling to support the data-driven fourth paradigm

  • and may actually preclude exclusive use of the second and third paradigms.

These days big data is everywhere, with so much data now open-source and available online. Here are some great examples:

  • satellite data with time lapse is widely available on Google Earth and other platforms. Yes, this isn't the most up-to-date or highest-resolution data, but it is certainly sufficient for many land use, geomorphology and surface evolution studies. My team of graduate students uses it.

  • river gage data over your state is generally available to the public. We use it for hydrologic analysis and also to determine great locations and times to paddle! Check out the USGS's National Water Information System

  • government databases, from the United States Census to oil and gas production data, are available to all

  • Moon Trek: online portal to visualize and download digital images from Lunar Reconnaissance Orbiter (LRO), SELENE and Clementine.

This is open, public data, but consider the fact that many industries are embracing smart, intelligent systems with enhanced connectivity, monitoring and control. The result is an explosion of data supported by faster and cheaper computation, processing and storage. We are all swimming in data that the generation before us could not have imagined.

We could also talk about improved algorithms and hardware architectures optimized for data science, but I'll leave that out of scope for this e-book. All of these developments have provided the fertile ground for the seeds of data science to impact all sectors of our economy and society.

Data Science and Subsurface Resources#

Spoiler alert, I'm going to boast a bit in this section. I often hear students say, "I can't believe this data science course is in the Hildebrand Department of Petroleum and Geosystems Engineering!" or "Why are you teaching machine learning in the Department of Earth and Planetary Sciences?" My response is,

We in the subsurface are the original data scientists!

We are the original data-driven scientists; we were working with big data long before tech learned about big data!

This may sound a bit arrogant, but let me back this up with this timeline:

Timeline of data science development from the perspective of subsurface engineering and geoscience.

Shortly after Kolmogorov developed the fundamental probability axioms, Danie Krige developed a set of statistical, spatial, i.e., data-driven tools for making estimates in space while accounting for spatial continuity and scale. These tools were formalized with theory developed by Professor Matheron during the 1960s in a new science called geostatistics. Over the 1970s - 1990s the geostatistical methods and applications expanded from mining to address oil and gas, environmental, agriculture, fisheries, etc. with many important open source developments.

Why was subsurface engineering and geoscience earlier in the development of data science? Because necessity is the mother of invention! Complicated, heterogeneous, sparsely sampled, vast systems with complicated physics and high value decisions drove us to data-driven methods. There are many other engineering fields that:

  • work with homogeneous phenomena that do not have significant spatial heterogeneity, continuity, nor uncertainty

  • have exhaustive sampling of the population relative to the modeling purpose and do not need an estimation model with uncertainty

  • have well understood physics and can model the entire system with second and third paradigm approaches

As a result, many of us subsurface engineers and geoscientists are learning, applying and teaching data science.

Machine Learning#

Now we are ready to define machine learning. The definition from the Wikipedia Machine Learning article may be summarized and dissected as follows. Machine learning is the study of a:

  1. toolbox - algorithms and mathematical models that computer systems use

  2. learning - to progressively improve their performance on a specific task with

  3. training data - machine learning algorithms train a mathematical model on sample data,

  4. general - in order to make predictions or decisions without being explicitly programmed to perform the task.

Let's highlight some important concepts. Machine learning is a large set of algorithms (a numerical toolbox) that adapt to the problem (learning); these algorithms learn from training data (training data) and a single algorithm may be trained on and applied to many problems (general).

Near the end of the article there is an important statement that many readers stop short of and miss.

Use machine learning,

Where it is infeasible to develop an algorithm of specific instructions for performing the task

In other words, if you know the engineering theory, the geoscience principles, the physics, the chemical reaction, the free body diagram, etc., use that and don't jump to use data science as a crutch instead of learning the fundamental concepts (paraphrased from personal communication with Professor Carlos Torres-Verdin).

Machine Learning Prerequisite Definitions#

To understand data science we need to first differentiate between the population and the sample.

Population#

Exhaustive, finite list of the property of interest over the area of interest. Generally the entire population is not accessible, e.g., the exhaustive set of porosity values at each location within a gas reservoir.

Sample#

The set of values and locations that have been measured, e.g., porosity data from well logs within a reservoir.

What are we measuring in our samples? Each distinct type of measure is called a variable or feature.

Variable / Feature#

Any property measured / observed in a study; in data science only the term feature is used. Note, statisticians are more accustomed to the term "variable", but it is the same thing. Here are some examples of features:

  • porosity measured from 1.5 inch diameter, 2 inch long core plugs extracted from the Miocene-aged Tahiti field in the Gulf of Mexico

  • permeability modeled from porosity (neutron density well log) and rock facies (interpreted fraction of shale logs) at 0.5 foot resolution along the well bore in the Late Devonian Leduc formation in the Western Canadian Sedimentary Basin.

  • blast hole cuttings nickel grade aggregated over 8 inch diameter 10 meter blast holes at Voisey’s Bay Mine, Proterozoic gneissic complex.

Did you see what I did? I specified what was measured, how it was measured, and over what scale it was measured. This is important because how the measure is made changes the veracity (level of certainty in the measure), and different methods may actually yield different results, so we may need to reconcile multiple measurement methods of the same thing, storing each vintage as a separate feature. Also, the scale is very important due to the volume-variance effect; with increasing support volume (sample size) the variance reduces due to volumetric averaging, resulting in regression to the mean. I have a chapter on volume-variance in my Applied Geostatistics in Python e-book for those interested in more details.

Additionally, our subsurface measures often require significant analysis, interpretation, etc. We don't just hold a tool up to the rock and get the number; we have a thick layer of engineering and geoscience interpretation to map from measurement to a usable feature. Consider this carbonate thin section from the Bureau of Economic Geology, The University of Texas at Austin, from a geoscience course by F. Jerry Lucia.

Carbonate thin section image (from this link by F. Jerry Lucia).

Note, the blue dye indicates the void space in the rock, but is porosity the blue area divided by the total area of the sample? That would be the "total porosity", and not all of it may contribute to fluid flow in the rock as it is not sufficiently connected; therefore, we need to go from total porosity to "effective porosity", requiring interpretation. Even porosity, one of the simplest features (observable in well logs and linearly averaging), doesn't escape this necessary interpretation.

Predictor and Response Features#

To understand the difference between predictor and response features let’s look at the most concise, simple expression of a machine learning model.

The fundamental predictive machine learning model that maps from predictor to response features.

The predictor (or independent) features (or variables) are the model inputs, i.e., the \(X_1,\ldots,X_m\), the response (or dependent) feature(s) (or variable(s)) are the model output, \(y\), and there is an error term, \(\epsilon\).
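Written out from the description above, the general model form is,

\[ y = f(X_1, \ldots, X_m) + \epsilon \]

where \(f\) is the true, unknown relationship between the predictor and response features.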

Machine Learning is all about estimating such models, \(\hat{f}\), for two purposes, inference or prediction.

Inference#

Inference must precede prediction, because inference is going from the limited sample to a model of the population. Inference is learning about the system, \(\hat{f}\), for example,

  • what is the relationship between the features?

  • which features are most important?

  • what are the complicated interactions between all the features?

A good example of an inferential problem: imagine I walk into the room, pull a coin out of my pocket, flip it 10 times and get 3 heads and 7 tails, and then ask you, "Is this a fair coin?"

  • you take the sample, the results from 10 coin tosses, to make an inference about the population, i.e., whether the coin is fair or not. Note, in this example the coin is the population!

Prediction#

Once we have completed our inference to model the population from the sample, we are ready to use the model of the population to predict future samples. This is prediction, going from a model of the population to estimating new samples.

  • the focus of prediction is to get the most accurate estimates of future samples.

Carrying on with our coin example, if you declare that my coin was biased towards tails, now you are able to predict the outcome for the next 10 coin tosses, before I toss the coin again. This is prediction.

How do you know if you are doing inferential or predictive machine learning? In general, inferential machine learning is conducted only with the predictor features, also known as unsupervised machine learning (because there are no response labels). This includes:

  • cluster analysis - that learns ways to split the data into multiple distinct populations to improve the subsequent models

  • dimensionality reduction - that projects the predictor features to a small number of new features that better describe the system with less noise

When you’re doing predictive machine learning you have both the predictor and the response feature(s), also known as supervised learning, and the model is trained and tuned to predict new samples.
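Here is a minimal code sketch of this distinction with scikit-learn, assuming predictor features X and a response feature y are already loaded as NumPy arrays; the specific methods and settings below are just illustrative choices, not recommendations.

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor

# inferential / unsupervised machine learning - predictor features only, no response labels
groups = KMeans(n_clusters=3, n_init=10).fit_predict(X)    # cluster analysis, assign group labels
scores = PCA(n_components=2).fit_transform(X)              # dimensionality reduction, project to 2 components

# predictive / supervised machine learning - both predictor and response features
model = KNeighborsRegressor(n_neighbors=10).fit(X, y)      # train with predictor and response features
y_hat = model.predict(X)                                   # predict the response feature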

Training and Tuning Machine Learning Models#

For predictive machine learning we apply a general model training and tuning workflow that is illustrated here and sketched in code after the steps below.

Standard predictive machine learning modeling workflow.

Let’s walk through the steps,

  1. Train and Test Split the available data into mutually exclusive, exhaustive train and test subsets. Typically 15% - 30% of the data are withheld and assigned to test and the remainder to train.

  2. Remove the Test Split from the analysis. The test data cannot be used to train the predictive model. Any information from the test data informing the model training is considered information leakage and compromises the fairness of the model testing.

  3. Train a Very Simple Model with the training data. Set the hyperparameters such that the model is as simple as reasonable and then train the model parameters with the available training data.

  4. Train Models from Simple to Complicated by repeating the previous step with the model hyperparameters incrementally increasing the model complexity. The result is a suite of models of variable complexity, each with model parameters trained to minimize misfit with the training data.

  5. Assess Performance with the Withheld Testing Data by retrieving the testing data and summarizing the error over all of these withheld data.

  6. Select the Hyperparameters that minimize the testing error by picking the model from the previous step with the minimum testing error. The associated hyperparameters are known as the tuned hyperparameters.

  7. Retrain the Model with all of the data (train and test combined) with the tuned hyperparameters.
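Here is a minimal sketch of steps 1 through 7 with scikit-learn's k-nearest neighbours regressor, assuming predictor features X and a response feature y are already loaded as NumPy arrays; the candidate hyperparameter values are arbitrary choices for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# steps 1 and 2 - withhold 20% of the data for testing and set it aside
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=73)

# steps 3 to 5 - train models from simple (many neighbours) to complicated (few neighbours)
# and assess each with the withheld testing data
candidate_k = [50, 20, 10, 5, 3, 1]                        # decreasing k increases model complexity
test_mse = []
for k in candidate_k:
    model = KNeighborsRegressor(n_neighbors=k, weights="distance", p=2)
    model.fit(X_train, y_train)                            # train the model parameters
    test_mse.append(mean_squared_error(y_test, model.predict(X_test)))

# step 6 - the tuned hyperparameter minimizes the testing error
k_tuned = candidate_k[int(np.argmin(test_mse))]

# step 7 - retrain with all of the data and the tuned hyperparameter
final_model = KNeighborsRegressor(n_neighbors=k_tuned, weights="distance", p=2).fit(X, y)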

You may have some questions, let’s anticipate and answer them to better defend and explain this workflow.

What is the outcome from the empirical approach expressed in steps 1 through 6? Just the tuned hyperparameters. We would never use the selected model in step 6 as it would omit valuable data used for testing.

Why not just train the model with all of the data? We can always minimize the training error by selecting hyperparameters that result in a very complicated model. More complicated models are more flexible, and with enough flexibility will be able to perfectly fit all of the training data, but that model would not do a very good job with making predictions for predictor feature values not used to train the model. The most complicated, overfit model would always win (more on overfit soon).

In other words, this approach is an attempt to simulate the model use, a form of dress rehearsal, to find the level of complexity that does the best job making new predictions. I’ve said model parameters and model hyperparameters a bunch of times, so I owe you their definitions.

Model Parameters and Model Hyperparameters#

Model parameters are fit during the training phase to minimize error at the training data, i.e., model parameters are trained with training data and control the model fit to the data. For the polynomial predictive machine learning model from the machine learning workflow example above, the model parameters are the polynomial coefficients, e.g., \(b_3\), \(b_2\), \(b_1\) and \(c\) (often called \(b_0\)) for the third order polynomial model.

Model parameters are adjusted to fit the model to the data, i.e., model parameters are trained to minimize error over the training data (x markers).

Model hyperparameters are very different. They do not constrain the model fit to the data directly; instead they constrain the model complexity. The model hyperparameters are selected (called tuning) to minimize error at the withheld testing data. Going back to our polynomial predictive machine learning example, the choice of polynomial order is the model hyperparameter.

Model hyperparameters are adjusted to change the model complexity / flexibility, i.e., model hyperparameters are tuned to minimize error over the withheld testing data (solid circles).
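As a minimal sketch of this distinction, using the polynomial example and NumPy, with hypothetical 1D training arrays x_train and y_train:

import numpy as np

order = 3                                                # hyperparameter - controls model complexity
coefficients = np.polyfit(x_train, y_train, deg=order)   # model parameters (b3, b2, b1, b0) fit to the training data
y_hat = np.polyval(coefficients, x_train)                # predictions from the trained model

Changing order changes the model complexity (tuned with the testing data), while the coefficients are always refit to the training data.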

Regression and Classification#

Before we proceed we need to define regression and classification.

  • Regression - a predictive machine learning model where the response feature(s) is continuous.

  • Classification - a predictive machine learning model where the response feature(s) is categorical.
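For example, with k-nearest neighbours in scikit-learn the two cases use different model classes; X, y_continuous and y_categorical are assumed, hypothetical arrays.

from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier

regressor = KNeighborsRegressor(n_neighbors=10).fit(X, y_continuous)      # continuous response, e.g., porosity
classifier = KNeighborsClassifier(n_neighbors=10).fit(X, y_categorical)   # categorical response, e.g., rock facies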

It turns out that for each of these we need to build different models and use different methods to score these models. For the remainder of this discussion we will focus on regression, but in later chapters we introduce classification models as well. Now, to better understand predictive machine learning model tuning we need to understand the sources of testing error.

Sources of Predictive Machine Learning Testing Error#

Mean square error (MSE) is a standard way to express error. Mean squared error is known as a norm because we are taking a vector of errors (over all of the data) and summarizing it with a single, non-negative value, and specifically as the L2 norm because the errors are squared before they are summed,

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

where \(y_i\) is the actual observation of the response feature and \(\hat{y}_i\) is the model estimate over data indexed, \(i = 1,\ldots,n\). The \(\hat{y}_i\) is estimated with our predictive machine learning model with a general form,

\[ \hat{y}_i = \hat{f}(x^1_i, \ldots , x^m_i) \]

where \(\hat{f}\) is our predictive machine learning model and \(x^1_i, \ldots , x^m_i\) are the predictor feature values for the \(i^{th}\) data.

Of course, MSE can be calculated over training data and this is often the loss function that we minimize for training the model parameters for regression models.

\[ MSE_{train} = \frac{1}{n_{train}} \sum_{i=1}^{n_{train}} \left( y_i - \hat{y}_i \right)^2 \]

and MSE can be calculated over the withheld testing data for hyperparameter tuning of regression models. Note, the L2 norm has a lot of nice properties including a continuous error function that can be differentiated over all values of error, but it is sensitive to data outliers that may have a disproportionate impact on the sum of the squares.
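For example, here is a quick check of the MSE formula with NumPy, using small made-up arrays for illustration:

import numpy as np

y = np.array([2.0, 3.5, 5.0, 7.5])        # observed response values
y_hat = np.array([2.3, 3.1, 5.4, 7.0])    # model estimates
mse = np.mean((y - y_hat) ** 2)           # (1/n) * sum of squared errors
print(mse)                                # 0.165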

If we take the \(MSE_{test}\),

\[ MSE_{test} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \left( y_i - \hat{f}(x^1_i, \ldots , x^m_i) \right)^2 \]

but we pose it as the expected test square error we get this form,

\[ E \left[ \left( y_0 - \hat{f}(x^1_0, \ldots , x^m_0) \right)^2 \right] \]

where we use the \(_0\) notation to indicate data samples not in the training dataset split, but in the withheld testing split. We can expand the quadratic and group the terms to get this convenient decomposition of expected test squared error into three additive sources (the derivation is available in Hastie et al., 2009),

Model error in testing with three additive components, model variance, model bias and irreducible error.
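In standard notation (following Hastie et al., 2009), this decomposition can be written as,

\[ E \left[ \left( y_0 - \hat{f}(x^1_0, \ldots , x^m_0) \right)^2 \right] = \text{Var}\left( \hat{f}(x^1_0, \ldots , x^m_0) \right) + \left[ \text{Bias}\left( \hat{f}(x^1_0, \ldots , x^m_0) \right) \right]^2 + \text{Var}(\epsilon) \]

where the first term is the model variance, the second is the (squared) model bias and the third is the irreducible error.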

Let’s explain these three additive sources of error over test data, i.e., the error we expect in the real-world use of our model to predict for cases not used to train the model,

  1. Model Variance - is error due to sensitivity to the dataset, for example a simple model like linear regression does not change much if we change the training data, but a more complicated model like a fifth order polynomial model will jump around a lot as we change the training data. More model complexity tends to increase model variance.

  2. Model Bias - is error due to using an approximate model, i.e., the model is too simple to fit the natural phenomenon. A very simple model is inflexible and will generally have higher model bias while a more complicated model is flexible enough to fit the data and will have lower model bias.

  3. Irreducible Error - is due to missing and incomplete data. There may be important features that were not sampled, or ranges of feature values that were not sampled. This is the error due to data limitations that cannot be addressed by machine learning model hyperparameter tuning for optimum complexity; therefore, irreducible error is constant over the range of model complexity.

Now we can take these three additive sources of error and produce an instructive plot,

Model error in testing vs. model complexity with three additive error components, model variance, model bias and irreducible error.

showing that hyperparameter tuning is an optimization of the model variance and bias trade-off. We select the model complexity, via the hyperparameters, that results in a model that is neither too simple (too inflexible), i.e., underfit, nor too complicated (too flexible), i.e., overfit. Now, let's talk a bit more about under- and overfit.

Underfit and Overfit Models#

Let's first visualize under- and overfit with a simple prediction problem with one predictor feature and one response feature; here's a model that is too simple beside a model that is too complicated.

Example underfit model (left), a model that is too simple / inflexible and overfit model (right), a model that is too complicated / flexible.

It is clear that the simple model is not sufficiently flexible to fit the data and is likely underfit, while the complicated model perfectly fits all of the training data and is likely overfit. To better understand the difference, let's plot the train error with the test error we showed previously.

Model error in training and testing vs. model complexity with illustration of under- and overfit model regions.

  • Train Error - as we increase model complexity it will continue to decrease and at some point reach zero when the model is flexible enough to perfectly fit the training data.

  • Test Error - as shown above is the result of model variance, model bias and irreducible error, and we expect the model variance-bias trade-off to result in an optimum model complexity that minimizes the error.

From these observations we can define underfit and overfit as follows,

  • underfit models are too simple and are in the regime where testing error continues to decrease as complexity increases. These models are too conservative and lack the flexibility to fit the natural phenomenon.

  • overfit models are too complicated; although training error continues to decrease as complexity increases, the testing error is actually increasing. These models may deceive us into believing that we know more than we actually do.

More About Training and Testing Splits#

Here are some more information on training and testing splits,

  • Proportion withheld for testing - meta studies have identified that withholding between 15% and 30% is typically optimum. To understand the underlying trade-off, imagine if we only withhold 2% of the data for test; then we will have a lot of training data to train the model as well as possible, but the model will not be well-tested over a wide range of prediction cases. Now imagine we withhold 70% of the data for test; then with only 30% of the data available for training we will do a very good job testing our somewhat poor model. It is a trade-off between building a good model and testing this model well.

  • k-fold Cross Validation - the train and test workflow above is known as a cross validation approach, but there are many other methods. With k-fold cross validation we divide the data into k mutually exclusive, exhaustive, equal-size sets (called folds) and repeat the model train and test k times, with each fold having a turn being withheld. Then we average the error norm over the folds to provide a single error for hyperparameter tuning. This method removes the sensitivity to the exact train and test split, so it is often seen as more robust (see the code sketch after this list). The proportion of testing is implicit in the choice of k; for k = 3, 33% of data are withheld for each fold and for k = 5, 20% of data are withheld for each fold.

  • Train, validate and test - there is an alternative approach with three exhaustive, mutually exclusive segments. The train data subset is the same as train above and is used to train the model parameters, validate is like test above and is used to tune the model hyperparameters, and the test data subset is applied to check the model trained on all the data with the tuned hyperparameters. The philosophy is that this final check is performed with data that was not at all involved with model construction, neither training model parameters nor tuning model hyperparameters. There are two reasons that I push back on this method. Firstly, what do we do if we don't quite like the performance in this testing phase, do we have a fourth subset for another test? Also, we can never deploy a model that is not trained with all available data; therefore, we will still have to train with the test subset.

  • Spatial Fair Train and Test Splits - Dr. Julian Salazar suggested that for spatial prediction problems a random train and test split may not be fair. He proposed a fair train and test split method for spatial prediction models that splits the data based on the difficulty of the planned use of the model. Prediction difficulty is related to the kriging variance, which accounts for spatial continuity and distance offset. If the model will be used to impute data at small offsets from available data, then construct a train and test split with train data close to test data; if the model will be used to predict at large distance offsets, then perform splits that result in large offsets between train and test data. With this method the tuned model may vary based on the planned use for the model.
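Here is a minimal sketch of k-fold cross validation for hyperparameter tuning with scikit-learn, assuming X and y are already loaded NumPy arrays; note the negative MSE scoring convention is scikit-learn's (larger is better).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

candidate_k = [50, 20, 10, 5, 3, 1]                 # candidate hyperparameter values
cv_mse = []
for k in candidate_k:
    model = KNeighborsRegressor(n_neighbors=k, weights="distance", p=2)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-np.mean(scores))                 # average MSE over the 5 folds

k_tuned = candidate_k[int(np.argmin(cv_mse))]       # tuned hyperparameter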

Ethical and Professional Practice Concerns with Data Science#

Ribeiro et al. (2016) trained a logistic regression classifier with 20 wolf and dog images to detect the difference between wolves and dogs. The input is a photo of a dog or wolf and the output is the probability of dog and the complement, the probability of wolf. The model worked well until the example here (see the left side below),

Example dog misclassified as a wolf (left) and pixels that resulted in this misclassification (right), image taken from Ribeiro et al. (2016).

where the model results in a high probability of wolf. Fortunately the authors were able to interrogate the model and determine the pixels that had the greatest influence on the model's determination of wolf (see the right side above). What happened? They had actually trained a model to check for snow in the background. As a Canadian I can assure you that many photos of wolves are taken in the snow-filled northern regions of our country, while many dogs are photographed in grassy yards. The problem with machine learning is:

  • interpretability may be low with complicated models

  • application of machine learning may become routine and trusted

This is a dangerous combination, as the machine may become a trusted, unquestioned authority. Ribeiro and others state that they developed the problem to demonstrate this, but does it actually happen? Yes. I advised a team of students that attempted to automatically segment urban vs. rural environments from time lapse satellite photographs to build models of urban development. The model looked great, but I asked for an additional check with a plot of the by-pixel classification vs. pixel color. The result was a 100% correspondence, i.e., the model, with all of its complicated convolution, activation and pooling, was only looking for grey and tan pixels often associated with roads and buildings.

I’m not going to say, ‘Skynet’, oops I just did, but just consider these thoughts:

  • New concentrations of power and wealth from rapid inference with big data as more data is being shared

  • Trade-offs that matter to society may be ignored while maximizing a machine learning objective function, compounded by low interpretability

  • Societal changes, disruptive technologies, post-labor society

  • Non-scientific results, such as the Clever Hans effect, with models that learn from tells in the data rather than really learning to perform the task, resulting in catastrophic failures

I don't want to be too negative and take this too far. Full disclosure, I'm the old-fashioned professor that thinks we should put our phones in our pockets and walk around with our heads up so we can greet each other and observe our amazing environments and societies. One thing is certain, data science is changing society in so many ways, and as Neil Postman wrote in Technopoly,

Neil Postman’s Quote for Technopoly

“Once a technology is admitted, it plays out its hand.”

Comments#

This was a basic introduction to machine learning. Much more could be done; I have other demonstrations on the basics of working with DataFrames, ndarrays and many other workflows available at GeostatsGuy/PythonNumericalDemos and GeostatsGuy/GeostatsPy.

I hope this was helpful,

Michael

The Author:#

Michael Pyrcz, Professor, The University of Texas at Austin Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers’ and geoscientists’ impact in subsurface resource development.

For more about Michael check out these links:

Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn

Want to Work Together?#

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin

More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn#