![](_static/intro/title_page.png)
Feature Imputation#
Michael J. Pyrcz, Professor, The University of Texas at Austin
Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn
Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.
Cite this e-Book as:
Pyrcz, M.J., 2024, Applied Machine Learning in Python: a Hands-on Guide with Code, https://geostatsguy.github.io/MachineLearningDemos_Book.
The workflows in this book and more are available here:
Cite the MachineLearningDemos GitHub Repository as:
Pyrcz, M.J., 2024, MachineLearningDemos: Python Machine Learning Demonstration Workflows Repository (0.0.1). Zenodo.
By Michael J. Pyrcz
© Copyright 2024.
This chapter is a tutorial for / demonstration of Feature Imputation.
YouTube Lecture: check out my lectures on:
Curse of Dimensionality, Dimensionality Reduction, Principal Component Analysis
Feature Imputation - To Be Recorded Soon
These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael’s Story.
Motivation for Feature Imputation#
Most spatial, subsurface datasets are not complete; values are missing from the database.
many data analytics and machine learning workflows require complete data, \(x_{1,i},\dots,x_{m,i}\), for each of the data samples \(i = 1,\ldots,n\).
Inferential Machine Learning - methods that require complete data, for example,
principal component analysis - requires the covariance matrix, and the covariance calculation needs all feature values
multidimensional scaling - we cannot calculate the dissimilarity matrix without all features available
cluster analysis - we cannot calculate distances in feature space without all feature values
Predictive Machine Learning - always requires all features to train and test the model.
Dealing with missing data is an essential part of feature / data engineering, a prerequisite for data analytics and machine learning.
it is important to first understand the cause and impact of the missing data.
Cause of Missing Feature Values#
Missing at random (MAR) is not common and is difficult to evaluate; in this case,
global random omission may not result in data bias or bias in the resulting models
MAR is not typically the case, as missing data is often related to a confounding feature, for example,
sampling cost - for example, a low permeability test takes too long
rock rheology or other sample survivorship biases - for example, it is not possible to recover the mudstone samples
sample design - sampling to reduce uncertainty and maximize profitability instead of statistical representativity, dual purpose samples for information and production
sampling accessibility - there are locations in the subsurface that are difficult or impossible to sample, for example, near lakes or communities, or subsalt for seismic imaging
Consequences of Missing Feature Values#
These causes result in clustering of missing values over locations and feature space.
omission of these feature values may bias global statistics, and degrade accuracy of local predictions
the use of global distributions for imputing missing values may not be reasonable
More than reducing the amount of training and testing data, missing data, if not missing completely at random, will result in:
Biased sample statistics resulting in biased model training and testing
Biased models with biased predictions with potentially no indication of the bias!
If you reread the above looking for solutions, I offer my Canadian, “I’m sorry”. Those who know us know that we say sorry a lot and have a cool pronunciation of the word.
I say all of the above as a cautionary note but,
in some cases there are gaps in practice due to our data challenges, i.e., data paucity and nonstationarity.
I could spend an entire course teaching methods to address these challenges
the solutions integrate the entire subsurface, spatial project team, i.e., domain expertise is critical
I’m going to leave this at the level of awareness
We must move beyond the commonly applied listwise deletion, the removal of all samples with any missing features.
Load the Required Libraries#
The following code loads the required libraries.
import geostatspy.GSLIB as GSLIB # GSLIB utilities, visualization and wrapper
import geostatspy.geostats as geostats # GSLIB methods convert to Python
import geostatspy
print('GeostatsPy version: ' + str(geostatspy.__version__))
GeostatsPy version: 0.0.72
We will also need some standard packages. These should have been installed with Anaconda 3.
ignore_warnings = True # ignore warnings?
import numpy as np # ndarrays for gridded data
import pandas as pd # DataFrames for tabular data
from sklearn.impute import SimpleImputer # basic imputation method
from sklearn.impute import KNNImputer # k-nearest neighbour imputation method
from sklearn.experimental import enable_iterative_imputer # required for MICE imputation
from sklearn.impute import IterativeImputer # MICE imputation
import os # set working directory, run executables
import math # basic math operations
import random # for random numbers
import matplotlib.pyplot as plt # for plotting
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks
from matplotlib.colors import ListedColormap # custom color maps
import matplotlib.ticker as mtick # control tick label formatting
import seaborn as sns # for matrix scatter plots
from scipy import stats # summary statistics
import numpy.linalg as linalg # for linear algebra
import scipy.spatial as sp # for fast nearest neighbor search
import scipy.signal as signal # kernel for moving window calculation
from numba import jit # for numerical speed up
from statsmodels.stats.weightstats import DescrStatsW
plt.rc('axes', axisbelow=True) # plot all grids below the plot elements
if ignore_warnings == True:
import warnings
warnings.filterwarnings('ignore')
cmap = plt.cm.inferno # color map
seed = 73071 # random seed
np.random.seed(seed=seed)
Declare Functions#
Here’s a function to assist with the plots:
add_grid - convenience function to add major and minor gridlines to improve plot interpretability
Here is the function:
def add_grid(): # add major and minor gridlines
plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks
Set the working directory#
I always like to do this so I don’t lose files and to simplify subsequent reads and writes (avoiding the full path each time).
#os.chdir("c:/PGE383") # set the working directory
You will have to update the part in quotes with your own working directory and the format is different on a Mac (e.g. “~/PGE”).
Loading Tabular Data#
Here’s the command to load our comma delimited data file into a Pandas DataFrame object.
Let’s load the provided multivariate, spatial dataset ‘unconv_MV_v4.csv’. This dataset has variables from 200 unconventional wells including:
well average porosity
log transform of permeability (to linearize the relationships with other variables)
acoustic impedance (kg/m^3 x m/s x 10^6)
brittleness ratio (%)
total organic carbon (%)
vitrinite reflectance (%)
initial production 90 day average (MCFPD).
Note, the dataset is synthetic.
We load it with the pandas ‘read_csv’ function into a DataFrame we called ‘my_data’ and then preview it to make sure it loaded correctly.
#df = pd.read_csv('unconv_MV_v4.csv') # load our data table
df = pd.read_csv('https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/unconv_MV_v4.csv') # load data from Dr. Pyrcz's GitHub repository
df.drop('Prod',axis=1,inplace=True)
features = df.columns.values.tolist() # store the names of the features
xmin = [6.0,0.0,1.0,10.0,0.0,0.9]; xmax = [24.0,10.0,5.0,85.0,2.2,2.9] # set the minimum and maximum values for plotting
flabel = ['Porosity (%)','Permeability (mD)','Acoustic Impedance (kg/m2s*10^6)','Brittleness Ratio (%)', # set the names for plotting
'Total Organic Carbon (%)','Vitrinite Reflectance (%)']
ftitle = ['Porosity','Permeability','Acoustic Impedance','Brittleness Ratio', # set the units for plotting
'Total Organic Carbon','Vitrinite Reflectance']
m = len(flabel)                                               # number of features
We can also establish the feature ranges for plotting. We could calculate the feature range directly from the data with code like this:
Pormin = np.min(df['Por'].values) # extract ndarray of data table column
Pormax = np.max(df['Por'].values) # and calculate min and max
but this would not result in easy-to-understand color bars and axis scales, so let’s pick convenient round numbers. We will also declare feature labels for ease of plotting.
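As a minimal sketch (using only the df, ftitle, xmin and xmax objects defined above), we could print each feature’s data-driven range next to the chosen plotting limits to confirm the round numbers are reasonable:
for i, feature in enumerate(df.columns[1:]):                  # skip the 'Well' index column
    fmin = np.min(df[feature].values); fmax = np.max(df[feature].values) # data-driven range
    print(ftitle[i] + ': data range = [' + str(round(fmin,2)) + ',' + str(round(fmax,2)) +
          '], plotting limits = [' + str(xmin[i]) + ',' + str(xmax[i]) + ']')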
Visualize the DataFrame#
Visualizing the DataFrame is a useful first check of the data.
many things can go wrong, e.g., we loaded the wrong data, all the features did not load, etc.
We can preview by utilizing the ‘head’ DataFrame member function (with a nice and clean format, see below).
add parameter ‘n=13’ to see the first 13 rows of the dataset.
df.head(n=13) # DataFrame preview
Well | Por | Perm | AI | Brittle | TOC | VR | |
---|---|---|---|---|---|---|---|
0 | 1 | 12.08 | 2.92 | 2.80 | 81.40 | 1.16 | 2.31 |
1 | 2 | 12.38 | 3.53 | 3.22 | 46.17 | 0.89 | 1.88 |
2 | 3 | 14.02 | 2.59 | 4.01 | 72.80 | 0.89 | 2.72 |
3 | 4 | 17.67 | 6.75 | 2.63 | 39.81 | 1.08 | 1.88 |
4 | 5 | 17.52 | 4.57 | 3.18 | 10.94 | 1.51 | 1.90 |
5 | 6 | 14.53 | 4.81 | 2.69 | 53.60 | 0.94 | 1.67 |
6 | 7 | 13.49 | 3.60 | 2.93 | 63.71 | 0.80 | 1.85 |
7 | 8 | 11.58 | 3.03 | 3.25 | 53.00 | 0.69 | 1.93 |
8 | 9 | 12.52 | 2.72 | 2.43 | 65.77 | 0.95 | 1.98 |
9 | 10 | 13.25 | 3.94 | 3.71 | 66.20 | 1.14 | 2.65 |
10 | 11 | 15.04 | 4.39 | 2.22 | 61.11 | 1.08 | 1.77 |
11 | 12 | 16.19 | 6.30 | 2.29 | 49.10 | 1.53 | 1.86 |
12 | 13 | 16.82 | 5.42 | 2.80 | 66.65 | 1.17 | 1.98 |
This dataset has features from 200 unconventional wells.
Note, the dataset is synthetic, but has realistic ranges and general multivariate relationships.
Remove Some Data#
Let’s select the proportion of values to set as missing (NaN),
proportion_NaN = 0.1
Then we can make a boolean array
make an ndarray of same shape (number rows and columns) as the DataFrame of uniform[0,1] distributed values
np.random.random(df.shape)
check the condition of being less than the identified proportion to make a boolean ndarray of the same size, True where the random value is less than the proportion. The result will be approximately the correct proportion of randomly placed True values.
remove = np.random.random(df.shape) < proportion_NaN
apply the mask to remove the identified values from the DataFrame
df_mask = df.mask(remove)
Full disclosure, for this demonstration our data is missing at random, MAR, and this simplifies our task.
this allows us to focus on the mechanics of feature imputation without the additional domain expertise topics. This is a good first step!
proportion_NaN = 0.1 # proportion of values in DataFrame to remove
np.random.seed(seed=seed) # ensure repeatability
remove = np.random.random(df.shape) < proportion_NaN # make the boolean array for removal
print('Fraction of removed values in mask ndarray = ' + str(round(remove.sum()/remove.size,3)) + '.')
df_mask = df.mask(remove)
print('Fraction of nan values in the DataFrame = ' + str(round(df_mask.isnull().sum().sum()/(df_mask.shape[0]*df_mask.shape[1]),3)) + '.')
Fraction of removed values in mask ndarray = 0.093.
Fraction of nan values in the DataFrame = 0.093.
We now have a new DataFrame with some missing data.
Let’s do a .head() preview to observe the NaN values scattered throughout the dataset
df_mask.head(n=13) # DataFrame preview
Well | Por | Perm | AI | Brittle | TOC | VR | |
---|---|---|---|---|---|---|---|
0 | 1.0 | 12.08 | 2.92 | 2.80 | 81.40 | 1.16 | 2.31 |
1 | 2.0 | 12.38 | 3.53 | NaN | 46.17 | 0.89 | 1.88 |
2 | NaN | 14.02 | 2.59 | 4.01 | 72.80 | 0.89 | 2.72 |
3 | 4.0 | 17.67 | 6.75 | 2.63 | 39.81 | 1.08 | 1.88 |
4 | 5.0 | 17.52 | 4.57 | 3.18 | 10.94 | 1.51 | 1.90 |
5 | 6.0 | 14.53 | 4.81 | 2.69 | 53.60 | 0.94 | 1.67 |
6 | 7.0 | 13.49 | 3.60 | NaN | 63.71 | 0.80 | 1.85 |
7 | 8.0 | 11.58 | 3.03 | NaN | 53.00 | 0.69 | 1.93 |
8 | 9.0 | NaN | 2.72 | NaN | 65.77 | 0.95 | 1.98 |
9 | 10.0 | NaN | 3.94 | 3.71 | 66.20 | 1.14 | 2.65 |
10 | 11.0 | 15.04 | 4.39 | 2.22 | NaN | 1.08 | 1.77 |
11 | NaN | 16.19 | 6.30 | 2.29 | 49.10 | 1.53 | 1.86 |
12 | 13.0 | NaN | 5.42 | 2.80 | 66.65 | 1.17 | 1.98 |
Evaluation of the Data Coverage#
Let’s calculate the amount of missing data.
df_mask.describe().transpose() # DataFrame summary statistics
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Well | 182.0 | 102.653846 | 58.078019 | 1.00 | 53.2500 | 104.000 | 153.7500 | 200.00 |
Por | 184.0 | 14.935978 | 3.002142 | 6.55 | 12.8900 | 15.055 | 17.4225 | 23.55 |
Perm | 172.0 | 4.319419 | 1.684672 | 1.13 | 3.1300 | 4.010 | 5.1850 | 9.78 |
AI | 184.0 | 2.991630 | 0.571569 | 1.28 | 2.5675 | 2.975 | 3.3950 | 4.63 |
Brittle | 186.0 | 47.793817 | 13.781815 | 10.94 | 37.7450 | 48.830 | 58.0150 | 81.40 |
TOC | 186.0 | 0.991882 | 0.481896 | -0.19 | 0.6225 | 1.020 | 1.3500 | 2.18 |
VR | 176.0 | 1.969602 | 0.293877 | 0.93 | 1.7775 | 1.970 | 2.1100 | 2.87 |
We can see the counts of available values for each feature; the counts are less than the total number of samples due to missing values.
Let’s make a plot to indicate data completeness for each feature
this is a useful summarization
plt.subplot(111) # data completeness plot
(df_mask.isnull().sum()/len(df)).plot(kind = 'bar',color='darkorange',edgecolor='black')
plt.xlabel('Feature'); plt.ylabel('Percentage of Missing Values'); plt.title('Data Completeness'); plt.ylim([0.0,1.0])
plt.plot([-0.5,df.shape[1]+0.5],[0.1,0.1],color='red',ls='--')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.2); add_grid(); plt.show()
![_images/b1bf476305b9f1199d4869c12d96c48b24d1c7a00266e4f24c7a7091c556bd66.png](_images/b1bf476305b9f1199d4869c12d96c48b24d1c7a00266e4f24c7a7091c556bd66.png)
This leads to the first data imputation method, feature selection.
Imputation Method #1 - Feature Selection#
Data completeness should be considered in feature selection.
if a feature has low data completeness, i.e., a high percentage of missing values, then the feature may be removed.
One method is to use the .drop() DataFrame function.
df_test = df_mask.drop('VR',axis = 1)
We use axis = 1 to drop a feature (as above). Below we remove the features with more than 10% of their values missing.
df_test = df_mask.drop(['Perm','VR'],axis = 1)
plt.subplot(111)
(df_test.isnull().sum()/len(df)).plot(kind = 'bar',color='darkorange',edgecolor='black') # calculate DataFrame with percentage missing by feature
plt.xlabel('Feature'); plt.ylabel('Percentage of Missing Values'); plt.title('Data Completeness'); plt.ylim([0.0,1.0])
plt.plot([-0.5,df.shape[1]+0.5],[0.1,0.1],color='red',ls='--')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.2); add_grid(); plt.show()
![_images/8a7c8d88bca723184e048fd777df3743356c4ba22ca026e0ab1841255feae7f4.png](_images/8a7c8d88bca723184e048fd777df3743356c4ba22ca026e0ab1841255feae7f4.png)
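Rather than listing the features by hand, here is a minimal sketch that automates the same rule; the threshold variable max_proportion_missing_by_feature is a hypothetical name introduced here, not part of the workflow above:
max_proportion_missing_by_feature = 0.1                       # hypothetical threshold, 10% missing values
proportion_missing = df_mask.isnull().sum()/len(df_mask)      # proportion of missing values by feature
low_coverage_features = proportion_missing[proportion_missing > max_proportion_missing_by_feature].index.tolist()
print('Features above the threshold: ' + str(low_coverage_features))
df_test = df_mask.drop(low_coverage_features,axis = 1)        # drop the low data completeness features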
Imputation Method #2 - Sample Selection#
There may be specific samples with many more missing feature values than others,
a specific vintage of data, for example, older data, or sample locations that experienced data collection problems
Let’s check the coverage by sample in the DataFrame.
we use the axis=1 parameter in the sum command to count NaN values across the columns for each row, i.e., each sample, of the DataFrame.
(df_mask.isnull().sum(axis=1)/len(df.columns)).plot(kind = 'bar',color='darkorange',edgecolor='black')
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.2, top=1.2, wspace=0.2, hspace=0.2) # plot formatting
plt.xlabel('Sample Index'); plt.ylabel('Percentage of Missing Records'); plt.title('Data Completeness')
plt.xticks(np.arange(0,len(df_mask),10),np.arange(0,len(df_mask),10))
plt.ylim([0,1.0])
plt.plot([-0.5,len(df)+0.5],[0.2,0.2],color='red',ls='--')
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=0.8, wspace=0.2, hspace=0.2); add_grid(); plt.show()
![_images/8398e6d58914d4cbe0d2855e1abf228f2f4d6c81b63e03a6c6b1faed15854842.png](_images/8398e6d58914d4cbe0d2855e1abf228f2f4d6c81b63e03a6c6b1faed15854842.png)
If we identify samples with low data completeness, i.e., a high percentage of missing feature values, then those samples may be removed.
Once again we use the .drop() DataFrame function, but this time with axis = 0 to drop a list of samples, as demonstrated below.
We need to make a list of the sample indices with too many missing feature values
(df_mask.isnull().sum(axis=1)/len(df.columns)) > max_proportion_missing_by_sample
We then apply np.where, which returns a tuple; we convert it to an ndarray and take the first element to get the 1D array of sample indices
index_low_coverage_samples = np.asarray(np.where(low_coverage_samples == True))[0]
Now we are ready to use these sample indices to remove the samples with too many missing values.
df_test2 = df_mask.drop(index = index_low_coverage_samples,axis = 0)
max_proportion_missing_by_sample = 0.2
low_coverage_samples = (df_mask.isnull().sum(axis=1)/len(df.columns)) > max_proportion_missing_by_sample
index_low_coverage_samples = np.asarray(np.where(low_coverage_samples == True))[0]
df_test2 = df_mask.drop(index = index_low_coverage_samples,axis = 0)
(df_test2.isnull().sum(axis=1)/len(df_test2.columns)).plot(kind = 'bar',color='darkorange',edgecolor='black')
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.2, top=1.2, wspace=0.2, hspace=0.2) # plot formatting
plt.xlabel('Updated Sample Index'); plt.ylabel('Percentage of Missing Records'); plt.title('Data Completeness')
plt.xticks(np.arange(0,len(df_mask),10),np.arange(0,len(df_mask),10))
plt.ylim([0,1.0])
plt.plot([-0.5,len(df)+0.5],[0.2,0.2],color='red',ls='--')
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=0.8, wspace=0.2, hspace=0.2); add_grid(); plt.show()
![_images/3724251782036a1d6e35f097f18c379bcf2c2e001e12546dfdb5d1176b113d00.png](_images/3724251782036a1d6e35f097f18c379bcf2c2e001e12546dfdb5d1176b113d00.png)
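As an aside, pandas’ dropna with the thresh parameter (the minimum count of non-missing values required to keep a sample) can sketch the same screening in one line; this is an assumed alternative, not the workflow above:
min_values_required = int(np.ceil((1.0 - max_proportion_missing_by_sample)*len(df_mask.columns))) # minimum non-missing values per sample
df_test3 = df_mask.dropna(axis = 0,thresh = min_values_required) # keep samples with enough non-missing values
print('Samples retained = ' + str(len(df_test3)) + ' of ' + str(len(df_mask)) + '.')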
Imputation Method #3 - Listwise Deletion#
This is the method of removing all samples that have any missing feature values.
this approach ensures complete data while technically avoiding the need for imputation
no need for an imputation model decision
often removes important information
maximizes data bias if information is not missing at random (MAR)
We must consider data completeness, coverage for each feature, as visualized above. Consider that,
the samples with missing values for one feature may be different from the samples with missing values for another feature
the union of missing values over all features may remove many more samples than the largest proportion of missing values for any single feature
Also, if values are missing not at random (MNAR), the sample bias is maximized
while listwise deletion is often applied, it is not recommended.
We can use the dropna() function.
with the subset parameter we can limit consideration to a list of features
how can be set to ‘any’ to drop a sample if any values are missing, or ‘all’ to drop only if all values are missing
inplace=True will overwrite the DataFrame with no output, while inplace=False will return the new DataFrame as a copy
df_likewise = df_mask.dropna(how='any',inplace=False)
sns.pairplot(df_likewise.iloc[:,:-1], plot_kws={'alpha':0.5,'s':20},corner=True)
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.5, top=0.6, wspace=0.1, hspace=0.2)
# df_likewise.head(n = 13)
![_images/3ffc071740f16e8b554a48b94676650d408dd93c4df0027c3a90e66b4e275534.png](_images/3ffc071740f16e8b554a48b94676650d408dd93c4df0027c3a90e66b4e275534.png)
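Since this is a synthetic demonstration and we kept the complete original DataFrame, here is a minimal sketch to check the cost of listwise deletion, comparing the retained sample count and the feature means before and after:
print('Samples retained after listwise deletion = ' + str(len(df_likewise)) + ' of ' + str(len(df_mask)) + '.')
compare = pd.DataFrame({'Original Mean':df.mean(numeric_only=True),              # complete, true dataset
                        'With Missing Mean':df_mask.mean(numeric_only=True),     # dataset with missing values
                        'Listwise Deletion Mean':df_likewise.mean(numeric_only=True)}) # complete cases only
print(compare.round(3))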
Modeling Methods for Imputation#
These are methods that treat feature imputation as a prediction problem, i.e., predict the missing feature value with other available data, for example,
the collocated other available feature values
the same feature values available at other sample locations
There are many prediction methods applied for feature imputation,
we start with the simplest prediction model possible, predicting with the global mean, and proceed from there to more complicated models
To help us visualize the results, let’s add a feature indicating if there are any missing feature values for a specific sample
this way we can label the samples that have had features imputed for evaluation and visualization of the feature imputation results
df_mask['Imputed'] = (df_mask.isnull().sum(axis=1)) > 0
df_mask.head()
Well | Por | Perm | AI | Brittle | TOC | VR | Imputed | |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 12.08 | 2.92 | 2.80 | 81.40 | 1.16 | 2.31 | False |
1 | 2.0 | 12.38 | 3.53 | NaN | 46.17 | 0.89 | 1.88 | True |
2 | NaN | 14.02 | 2.59 | 4.01 | 72.80 | 0.89 | 2.72 | True |
3 | 4.0 | 17.67 | 6.75 | 2.63 | 39.81 | 1.08 | 1.88 | False |
4 | 5.0 | 17.52 | 4.57 | 3.18 | 10.94 | 1.51 | 1.90 | False |
Imputation Method #4 - Replace with a Constant#
This is the method of replacing the missing values with a constant value.
here’s an example of replacing the missing feature values with a very low value
This results in bias and should not be done.
df_constant = df_mask.copy(deep=True) # make a deep copy of the DataFrame
constant_imputer = SimpleImputer(strategy='constant',fill_value = 0.01)
df_constant.iloc[:,:] = constant_imputer.fit_transform(df_constant)
sns.pairplot(df_constant.iloc[:,:], hue="Imputed", plot_kws={'alpha':0.15,'s':20}, palette = 'gnuplot', corner=True)
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.5, top=0.6, wspace=0.1, hspace=0.2)
df_constant.head(n=5)
Well | Por | Perm | AI | Brittle | TOC | VR | Imputed | |
---|---|---|---|---|---|---|---|---|
0 | 1.00 | 12.08 | 2.92 | 2.80 | 81.40 | 1.16 | 2.31 | 0.0 |
1 | 2.00 | 12.38 | 3.53 | 0.01 | 46.17 | 0.89 | 1.88 | 1.0 |
2 | 0.01 | 14.02 | 2.59 | 4.01 | 72.80 | 0.89 | 2.72 | 1.0 |
3 | 4.00 | 17.67 | 6.75 | 2.63 | 39.81 | 1.08 | 1.88 | 0.0 |
4 | 5.00 | 17.52 | 4.57 | 3.18 | 10.94 | 1.51 | 1.90 | 0.0 |
![_images/d56b6b518a192ee91424bfaead41c4fc03c8e7c477f5b895c795022cd8325255.png](_images/d56b6b518a192ee91424bfaead41c4fc03c8e7c477f5b895c795022cd8325255.png)
Imputation Method #5 - Replace with the Mean#
This is the method of replacing the missing values with the mean, arithmetic average, over the feature.
the global mean is globally unbiased, but may result in local bias, i.e., low values are overestimated and high values are underestimated
df_mean = df_mask.copy(deep=True) # make a deep copy of the DataFrame
mean_imputer = SimpleImputer(strategy='mean')
df_mean.iloc[:,:] = mean_imputer.fit_transform(df_mean)
sns.pairplot(df_mean.iloc[:,:], hue="Imputed", plot_kws={'alpha':0.15,'s':20}, palette = 'gnuplot', corner=True)
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.5, top=0.6, wspace=0.1, hspace=0.2)
df_mean.head(n=5)
Well | Por | Perm | AI | Brittle | TOC | VR | Imputed | |
---|---|---|---|---|---|---|---|---|
0 | 1.000000 | 12.08 | 2.92 | 2.80000 | 81.40 | 1.16 | 2.31 | 0.0 |
1 | 2.000000 | 12.38 | 3.53 | 2.99163 | 46.17 | 0.89 | 1.88 | 1.0 |
2 | 102.653846 | 14.02 | 2.59 | 4.01000 | 72.80 | 0.89 | 2.72 | 1.0 |
3 | 4.000000 | 17.67 | 6.75 | 2.63000 | 39.81 | 1.08 | 1.88 | 0.0 |
4 | 5.000000 | 17.52 | 4.57 | 3.18000 | 10.94 | 1.51 | 1.90 | 0.0 |
![_images/778b379253b6bede8a0f5b94aac49542e328179796b60c96bc3c802bee7f3f69.png](_images/778b379253b6bede8a0f5b94aac49542e328179796b60c96bc3c802bee7f3f69.png)
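Because the removed values are known in this synthetic demonstration, here is a minimal sketch to check the accuracy of mean imputation, the mean absolute error over only the removed values (using the remove mask and original df from above):
for feature in ['Por','Perm','AI','Brittle','TOC','VR']:
    removed = remove[:,df.columns.get_loc(feature)]           # boolean mask of removed values for this feature
    mae = np.mean(np.abs(df_mean.loc[removed,feature].values - df.loc[removed,feature].values)) # error over imputed values only
    print('Mean imputation MAE for ' + feature + ' = ' + str(round(mae,3)))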
Imputation Method #6 - Replace with the Mode#
This is the method of replacing the missing values with the most frequent value, mode, over the feature.
in the presence of outliers the mean may not be reliable. My recommendation is to first deal with outliers prior to feature imputation
df_mode = df_mask.copy(deep=True) # make a deep copy of the DataFrame
mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode.iloc[:,:] = mode_imputer.fit_transform(df_mode)
sns.pairplot(df_mode.iloc[:,:], hue="Imputed", plot_kws={'alpha':0.15,'s':20}, palette = 'gnuplot', corner=True)
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.5, top=0.6, wspace=0.1, hspace=0.2)
df_mode.head(n=5)
Well | Por | Perm | AI | Brittle | TOC | VR | Imputed | |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 12.08 | 2.92 | 2.80 | 81.40 | 1.16 | 2.31 | 0.0 |
1 | 2.0 | 12.38 | 3.53 | 2.45 | 46.17 | 0.89 | 1.88 | 1.0 |
2 | 1.0 | 14.02 | 2.59 | 4.01 | 72.80 | 0.89 | 2.72 | 1.0 |
3 | 4.0 | 17.67 | 6.75 | 2.63 | 39.81 | 1.08 | 1.88 | 0.0 |
4 | 5.0 | 17.52 | 4.57 | 3.18 | 10.94 | 1.51 | 1.90 | 0.0 |
![_images/cafcda1fe54702d29f8cec4b44ee48f39144ea70a19edc9c608ed4d284153136.png](_images/cafcda1fe54702d29f8cec4b44ee48f39144ea70a19edc9c608ed4d284153136.png)
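Given the comment above about outliers, a robust alternative worth sketching is median imputation with SimpleImputer’s ‘median’ strategy; this follows the same pattern as the mean and mode cells above:
df_median = df_mask.copy(deep=True)                           # make a deep copy of the DataFrame
median_imputer = SimpleImputer(strategy='median')             # the median is more robust to outliers than the mean
df_median.iloc[:,:] = median_imputer.fit_transform(df_median)
df_median.head(n=5)                                           # DataFrame preview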
Imputation Method #7 - Replace with k-Nearest Neighbour Estimation#
This is the method of replacing the missing values with a k-nearest neighbour prediction model based on the other available, collocated feature values.
see the k-nearest neighbour chapter in this e-book for explanation of the method, assumptions and hyperparameters
the available data is applied to predict the missing values in feature space
Since the k-nearest neighbor method is a lazy learner, imputed values are calculated in a single pass over the missing values
there is not a separate train and predict step
This method should be globally unbiased and will reduce local bias relative to global mean feature imputation
df_knn = df_mask.copy(deep=True) # make a deep copy of the DataFrame
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
df_knn.iloc[:,:] = knn_imputer.fit_transform(df_knn)
sns.pairplot(df_knn.iloc[:,:], hue="Imputed", plot_kws={'alpha':0.15,'s':20}, palette = 'gnuplot', corner=True)
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.5, top=0.6, wspace=0.1, hspace=0.2)
df_knn.head(n=5)
Well | Por | Perm | AI | Brittle | TOC | VR | Imputed | |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 12.08 | 2.92 | 2.80 | 81.40 | 1.16 | 2.31 | False |
1 | 2.0 | 12.38 | 3.53 | 2.45 | 46.17 | 0.89 | 1.88 | True |
2 | 1.0 | 14.02 | 2.59 | 4.01 | 72.80 | 0.89 | 2.72 | True |
3 | 4.0 | 17.67 | 6.75 | 2.63 | 39.81 | 1.08 | 1.88 | False |
4 | 5.0 | 17.52 | 4.57 | 3.18 | 10.94 | 1.51 | 1.90 | False |
![_images/342678de0a00eaf0afd432abdbb8412519091d1c8746408d4fcb35ed2f6c83f8.png](_images/342678de0a00eaf0afd432abdbb8412519091d1c8746408d4fcb35ed2f6c83f8.png)
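One caution, k-nearest neighbour imputation is distance-based, so features with large magnitudes (e.g., brittleness ratio in %) dominate the distance calculation; here is a minimal sketch, assuming scikit-learn’s StandardScaler, that standardizes the features before imputation and back-transforms the result:
from sklearn.preprocessing import StandardScaler              # standardize features for fair distance calculations

df_knn_std = df_mask.copy(deep=True)                          # make a deep copy of the DataFrame
scaler = StandardScaler()
scaled = scaler.fit_transform(df_knn_std)                     # standardize, NaN values are ignored in fit and preserved
imputed = KNNImputer(n_neighbors=2,weights="uniform").fit_transform(scaled) # impute in standardized feature space
df_knn_std.iloc[:,:] = scaler.inverse_transform(imputed)      # back-transform to original units
df_knn_std.head(n=5)                                          # DataFrame preview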
Imputation Method #8 - Multivariate Imputation by Chained Equations (MICE)#
This is the method of replacing the missing values with sequential, chained predictions, where each feature with missing values is predicted from the other features, known as multivariate imputation by chained equations (MICE).
Substitute random values drawn from each feature’s distribution, \(F_{X_i}(x_i)\), \(i = 1,\ldots,m\), for the missing values
Sequentially predict the missing values for each feature from the other features
Iterate until a convergence criterion is met, usually based on multivariate statistics
Repeat for multiple realizations of the dataset
The default predictor is BayesianRidge().
we can specify the maximum number of iterations. The last computed imputations are returned.
df_mice = df_mask.copy(deep=True) # make a deep copy of the DataFrame
mice_imputer = IterativeImputer()
df_mice.iloc[:,:] = mice_imputer.fit_transform(df_mice)
sns.pairplot(df_mice.iloc[:,:], hue="Imputed", plot_kws={'alpha':0.15,'s':20}, palette = 'gnuplot', corner=True)
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.5, top=0.6, wspace=0.1, hspace=0.2)
df_mice.head(n=5)
Well | Por | Perm | AI | Brittle | TOC | VR | Imputed | |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 12.08 | 2.92 | 2.80 | 81.40 | 1.16 | 2.31 | 0.0 |
1 | 2.0 | 12.38 | 3.53 | 2.45 | 46.17 | 0.89 | 1.88 | 1.0 |
2 | 1.0 | 14.02 | 2.59 | 4.01 | 72.80 | 0.89 | 2.72 | 1.0 |
3 | 4.0 | 17.67 | 6.75 | 2.63 | 39.81 | 1.08 | 1.88 | 0.0 |
4 | 5.0 | 17.52 | 4.57 | 3.18 | 10.94 | 1.51 | 1.90 | 0.0 |
![_images/18e0e0423b4b3eac989e163a6542e3ab6cf5e33b92164b4d827b23c520204c3b.png](_images/18e0e0423b4b3eac989e163a6542e3ab6cf5e33b92164b4d827b23c520204c3b.png)
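To sketch the multiple realizations step listed above, IterativeImputer can draw imputed values from the posterior predictive distribution with sample_posterior=True and different random seeds; this is a minimal sketch, not a complete multiple imputation workflow:
realizations = []                                             # multiple imputed realizations of the dataset
for real in range(3):                                         # loop over realizations
    mice_real = IterativeImputer(sample_posterior=True,max_iter=10,random_state=seed+real)
    df_real = df_mask.copy(deep=True)
    df_real.iloc[:,:] = mice_real.fit_transform(df_real)
    realizations.append(df_real)
imputed_AI = [round(r.loc[1,'AI'],2) for r in realizations]   # sample index 1 has a missing AI value (see preview above)
print('Imputed AI for sample index 1 over the realizations: ' + str(imputed_AI))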
Want to Work Together?#
I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.
Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!
Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
I can be reached at mpyrcz@austin.utexas.edu.
I’m always happy to discuss,
Michael
Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn
Comments#
This was a basic treatment of feature imputation. Much more could be done and discussed; I have many more resources. Check out my shared resource inventory and the YouTube lecture links at the start of this chapter with resource links in the videos’ descriptions.
I hope this is helpful,
Michael