
Artificial Neural Networks#
Michael J. Pyrcz, Professor, The University of Texas at Austin
Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn
Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.
Cite this e-Book as:
Pyrcz, M.J., 2024, Applied Machine Learning in Python: A Hands-on Guide with Code [e-book]. Zenodo. doi:10.5281/zenodo.15169138
The workflows in this book and more are available here:
Cite the MachineLearningDemos GitHub Repository as:
Pyrcz, M.J., 2024, MachineLearningDemos: Python Machine Learning Demonstration Workflows Repository (0.0.3) [Software]. Zenodo. DOI: 10.5281/zenodo.13835312. GitHub repository: GeostatsGuy/MachineLearningDemos
By Michael J. Pyrcz
© Copyright 2024.
This chapter is a tutorial for / demonstration of Artificial Neural Networks.
YouTube Lecture: check out my lectures on:
These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael’s Story.
Motivation#
Artificial neural networks are very powerful, nature-inspired computing based on an analogy with the brain
I suggest that they are like a reptilian brain, without planning and higher order reasoning
In addition, artificial neural networks are a building block of many other deep learning methods, for example,
convolutional neural networks
recurrent neural networks
generative adversarial networks
autoencoders
Nature-inspired computing looks to nature for inspiration to develop novel problem-solving methods,
artificial neural networks are inspired by biological neural networks
nodes - in our model are artificial neurons, simple processors
connections between nodes are artificial synapses
intelligence emerges from many connected simple processors. For the remainder of this chapter, I will use the terms nodes and connections to describe our artificial neural network.
Neural Network Concepts#
Here are some key aspects of artificial neural networks,
Basic Design - “…a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.” Caudill (1989).
Still a Prediction Model - while these models may be quite complicated with even millions of trainable model parameters, weights and biases, they are still a function that maps from predictor features to response features,
Supervised learning – we provide training data with predictor features, \(X_1,\ldots,X_m\) and response feature(s), \(Y_1,\ldots,Y_K\), with the expectation of some model prediction error, \(\epsilon\).
Nonlinearity - nonlinearity is imparted to the system through the application of nonlinear activation functions to model nonlinear relationships
Universal Function Approximator (Universal Approximation Theorem) - ANNs have the ability to learn any possible function shape of \(f\) over an interval, for arbitrary width (single hidden layer) by Cybenko (1989) and arbitrary depth by Lu and others (2017)
A Simple Network#
To get started, let’s build a neural net, single hidden layer, fully connected, feed-forward neural network,

We use this example artificial neural network in the descriptions below and as an actual example that we will train and predict with by-hand!
Now let’s label the parts of our network,

Our artificial neural network has,
3 predictor features, \(X_1\), \(X_2\) and \(X_3\)
3 input nodes, \(I_1\), \(I_2\) and \(I_3\)
2 hidden layer nodes, \(H_4\) and \(H_5\)
1 output node, \(O_6\)
1 response feature, \(Y_1\)
where all nodes are fully connected. Note, deep learning is a neural network with more than 1 hidden layer, but for brevity let's continue with our non-deep learning artificial neural network.
Description of the Network Approach#
Let’s talk about the network, the parts and how information flows through the network.
Feed-forward – all information flows from left to right. Each node sends the same signal along the connections to all the nodes in the next layer,

Input Layer - the input features are passed directly to the input nodes, in the case of continuous predictor features, there is one input node per feature and the features are,
min / max normalized to a range \(\left[ −1,1 \right]\) or \(\left[ 0,1 \right]\) to improve activation function sensitivity, to remove the influence of scale differences in predictor features, and to improve solution stability, i.e., smooth reduction in the training loss while training

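As a minimal sketch (not part of this chapter's workflow), here is min / max normalization of a continuous predictor feature in Python; the minmax_normalize helper and the example feature values are assumptions for illustration only.
import numpy as np

def minmax_normalize(x, target_min=-1.0, target_max=1.0): # assumed helper for illustration
    x = np.asarray(x, dtype=float) # continuous predictor feature values
    x_min, x_max = x.min(), x.max()
    return target_min + (x - x_min) / (x_max - x_min) * (target_max - target_min)

porosity = np.array([0.05, 0.12, 0.18, 0.25]) # assumed example predictor feature values
print(minmax_normalize(porosity, 0.0, 1.0)) # normalized to [0,1] -> [0.0, 0.35, 0.65, 1.0]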
In the case of categorical predictor features, we have one input node per category for each predictor feature, i.e., after one-hot-encoding of the feature, where each encoding is passed to a separate input node.
recall one-hot-encoding, 1 if the specific category, 0 otherwise, replaces the categorical feature with a binary vector with length equal to the number of categories.

we could also use a single input node per categorical predictor and assign thresholds to each category, for example \(\left[ 0.0, 0.5, 1.0 \right]\) for 3 categories, but this assumes an ordinal categorical feature
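To make the one-hot-encoding concrete, here is a minimal sketch; the one_hot helper and the example facies categories are assumptions for illustration, not from this chapter's workflow.
import numpy as np

def one_hot(values, categories): # assumed helper for illustration
    values = np.asarray(values)
    return (values[:, None] == np.asarray(categories)[None, :]).astype(int) # 1 if the specific category, 0 otherwise

facies = np.array(['sand', 'shale', 'sand', 'silt']) # assumed example categorical predictor feature
print(one_hot(facies, categories=['sand', 'shale', 'silt'])) # one column (input node) per category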
Hidden Layer - the input layer values \(I_1, I_2, I_3\) are weighted with learnable weights,
in the hidden layer nodes, the weighted input layer values, \(\lambda_{1,4} \cdot I_1, \lambda_{2,4} \cdot I_2, \ldots, \lambda_{3,5} \cdot I_3\) are summed with the addition of a trainable bias term in each node, \(b_4\) and \(b_5\).
the nonlinear activation is applied,
the output from the input layer nodes to all hidden layer nodes is constant (again, each node sends the same value to all nodes in the next layer)

Output Layer - for continuous response features there is one output node per normalized response feature. Once again the weighted linear combination of inputs plus a node bias is calculated,
and then activation is applied, but for a continuous response feature, typically the identity (linear) transform is applied,
backtransformation from normalized to original response feature(s) is then applied to recover the ultimate prediction
as with continuous predictor features, min / max normalization is applied to continuous response features to a range [−1,1] or [0,1] to improve activation function sensitivity

In the case of a categorical response feature, once again one-hot-encoding is applied, therefore, there is one output node per category.
the prediction is the probability of each category

Walkthrough the Network#
Now we are ready to walk through the artificial neural network.
we follow a single path to illustrate the precise calculations associated with making a prediction with an artificial neural network
The full forward pass is explained next.
Inside an Input Layer Node - input layer nodes just pass the predictor features,
normalized continuous predictor feature value
a single one-hot-encoding value [0 or 1] for categorical predictor features
into the hidden layer nodes, with general vector notation,

We can generalize over all input layer nodes with,
Inside a Hidden Layer Node
The hidden layer nodes are simple processors. They take linearly weighted combinations of inputs, add a node bias term and then nonlinearly transform the result; this transform is called the activation function, \(\alpha\).
indeed, a very simple processor!
through many interconnected nodes we gain a very flexible predictor, with an emergent ability to characterize complicated, nonlinear patterns.
Prior to activation we have,
and after activation we have,
We can express the simple processor in the node with general vector notation as,

We can generalize over all hidden layer nodes with,
and after activation, the node output is,
Inside an Output Layer Node
The output layer nodes take linearly weighted combinations of their inputs, add a node bias term and then transform the result with an activation function, \(\alpha\), the same as the hidden layer nodes,
Prior to activation we have,
and after activation, assuming identity activation we have,
We can express the simple processor in the node with general vector notation as,

and for categorical response features, softmax activation is commonly applied,

softmax activation ensures that the outputs over all the output layer nodes are valid probabilities, including,
nonnegativity - through the exponentiation
closure - probabilities sum to 1.0 through the denominator normalizing the result
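Here is a minimal Python sketch of softmax activation to demonstrate nonnegativity and closure; the pre-activation output values are assumed for illustration.
import numpy as np

def softmax(o_in): # o_in are the pre-activation values at the output layer nodes
    e = np.exp(o_in - np.max(o_in)) # shift by the maximum for numerical stability
    return e / e.sum() # normalize by the sum to ensure closure

probs = softmax(np.array([2.0, 0.5, -1.0])) # assumed pre-activation outputs for 3 categories
print(probs, probs.sum()) # all probabilities are nonnegative and sum to 1.0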
Note, for all future discussions and demonstrations, I assume a standardized continuous response feature.
Network Forward Pass#
Now that we have completed a walk-through of our network on a single path, let’s combine all the paths through our network to demonstrate a complete forward pass through our artificial neural network.
this is the calculation required to make a prediction with our network,
where the activation functions \(\sigma_{H_4}\) = \(\sigma_{H_5}\) = \(\sigma\) are sigmoid, and \(\sigma_{O_6}\) is linear (identity), so we could simplify the forward pass to,
This emphasizes that our neural network is a nested set of activated linear systems, i.e., linearly weighted averages plus bias terms applied to activation functions.
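As a minimal sketch of this nested forward pass in Python, here is our 3 x 2 x 1 network with sigmoid hidden layer activation and identity output activation; the input values, weights and biases below are assumed placeholders, not trained values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

I1, I2, I3 = 0.5, 0.2, 0.7 # assumed (normalized) predictor feature values
lam = {'14':0.1, '24':-0.2, '34':0.3, '15':0.2, '25':0.1, '35':-0.1, '46':0.4, '56':-0.3} # assumed weights
b4, b5, b6 = 0.05, -0.05, 0.1 # assumed node biases

H4 = sigmoid(lam['14']*I1 + lam['24']*I2 + lam['34']*I3 + b4) # hidden node H4, sigmoid activation
H5 = sigmoid(lam['15']*I1 + lam['25']*I2 + lam['35']*I3 + b5) # hidden node H5, sigmoid activation
y_hat = lam['46']*H4 + lam['56']*H5 + b6 # output node O6, identity activation
print(H4, H5, y_hat)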
Number of Model Parameters#
In general, there are many model parameters, \(\theta\), in an artificial neural network. First, let's clarify these definitions to describe our artificial neural network,
neural network width - the number of nodes in the layers of the neural network
neural network depth - the number of layers in the neural network, typically the input layer is not included in this calculation
Now, let’s assume the following compact notation for a 3 layer artificial neural network, input, output and 1 hidden layer, with the width of each layer as,
number of input nodes, \(p\)
number of hidden layer nodes, \(m\)
and number of output nodes, \(k\)

fully connected, so for every connection there is a weight,
with full connectivity the number of weights is
and at each hidden layer node there is a bias term,
and at every output node there is a bias term,
Therefore, the number of model parameters is,
this assumes a unique bias term at each hidden layer node and output layer node, but in some cases the same bias term may be applied over the entire layer.
For our example, with \(p = 3\), \(m = 2\) and \(k = 1\), then the number of model parameters are,
after substitution we have,
I select this as a manageable number of parameters, so we can train and visualize our model, but consider a more typical model size by increasing our artificial neural network’s width, with \(p = 10\), \(m = 20\) and \(k = 3\), then we have many more model parameters,
If we add hidden layers, increase our artificial neural network’s depth, the number of model parameters will grow very quickly.
we can generalize this calculation for any fully connected, feed forward neural network, given a \(W\) vector with the number of nodes, i.e., the width of each layer,
where,
\(l_0\) is the number of input neurons
\(l_1, \dots, l_{n-1}\) are the widths of the hidden layers
\(l_n\) is the number of output neurons
The total number of connection weights is,
the total number of node biases (there are no bias parameters in the input layer nodes, \(l_0\)) is,
the total number of trainable model parameters, connection weights and node biases, is,
Let's take an example of an artificial neural network with 4 hidden layers, with a network width by-layer vector of,
The total number of connection weights is,
and the total number of node biases is,
and finally the total number of trainable parameters is,
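To check these counts for any fully connected, feed-forward network, here is a minimal Python sketch; the count_parameters helper is an assumption for illustration, not part of this chapter's workflow.
import numpy as np

def count_parameters(widths): # widths = [l0, l1, ..., ln], the number of nodes in each layer
    widths = np.asarray(widths)
    n_weights = int(np.sum(widths[:-1] * widths[1:])) # one weight per connection between adjacent layers
    n_biases = int(np.sum(widths[1:])) # one bias per node, excluding the input layer
    return n_weights, n_biases, n_weights + n_biases

print(count_parameters([3, 2, 1])) # our simple network: 8 weights, 3 biases, 11 parameters
print(count_parameters([10, 20, 3])) # the wider example: 260 weights, 23 biases, 283 parameters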
Activation Functions#
The activation function is a transformation of the linear combination of the weighted node inputs plus the node bias term. Nonlinear activation,
introduces non-linear properties, and complexity to the network
prevents the network from collapsing
Without the nonlinear activation function the entire system collapses to linear regression.
For more information about activation functions and a demonstration of the collapse without nonlinear activation to multilinear regression see the associated chapter in this e-book, Neural Network Activation Functions.
Training Networks Steps#
Training an artificial neural network proceeds iteratively by these steps,
initialize the model parameters
forward pass to make a prediction
calculate the error derivative based on the prediction and truth over training data
backpropagate the error derivative back through the artificial neural network to calculate the derivatives of the error with respect to all the model weight and bias parameters
update the model parameters based on the derivatives and learning rates
repeat until convergence.

Here’s some details on each step,
Initializing the Model Parameters - initialize all model parameters with typically small (near zero) random values. Here are a couple of common methods,
Xavier Weight Initialization - random realizations from uniform distributions specified by \(U[\text{min}, \text{max}]\),
where \(F^{-1}_U\) is the inverse of the CDF, \(p\) is the number of inputs, and \(p^{\ell}\) is a random cumulative probability value drawn from the uniform distribution, \(U[0,1]\).
Normalized Xavier Weight Initialization - random realizations from uniform distributions specified by \(U[\text{min}, \text{max}]\),
where \(F^{-1}_U\) is the inverse of the CDF, \(p\) is the number of inputs, \(k\) is the number of outputs, and \(p^{\ell}\) is a random cumulative probability value drawn from the uniform distribution, \(U[0,1]\).
For example, if we return to our first hidden layer node,

we have \(p = 3\) and \(k = 1\), and we draw from the uniform distribution,
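As a minimal sketch (assuming the bounds used in the code later in this chapter), here is Xavier and normalized Xavier initialization with NumPy; the function names and the random seed are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(seed=13)

def xavier_uniform(p, size=1): # Xavier: U[-1/sqrt(p), 1/sqrt(p)], p = number of inputs to the node
    bound = 1.0 / np.sqrt(p)
    return rng.uniform(-bound, bound, size)

def normalized_xavier_uniform(p, k, size=1): # normalized Xavier: U[-sqrt(6)/sqrt(p+k), sqrt(6)/sqrt(p+k)]
    bound = np.sqrt(6.0) / np.sqrt(p + k)
    return rng.uniform(-bound, bound, size)

print(xavier_uniform(p=3, size=3)) # e.g., the 3 weights into hidden layer node H4
print(normalized_xavier_uniform(p=3, k=1, size=3)) # also accounting for the number of outputs, k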
Forward Pass - to make a prediction, \(\hat{y}\). Initial predictions will be random for the first iteration, but will improve over iterations. Once again for our model the forward pass is,

Calculate the Error Derivative - given a loss of,
and the error derivative, i.e., the rate of change in error given a change in the model estimate is,
For now, let’s only consider a single estimate, and we will address more than 1 training data later.
Backpropagate the Error Derivative - we shift back through the artificial neural network to calculate the derivatives of the error with respect to all the model weight and bias parameters, with the chain rule, for example the loss derivative backpropagated to the output of node \(H_4\),
Update the Model Parameters - based on the derivatives, \(\frac{\partial L}{\partial \lambda_{i,j}}\) and learning rates, \(\eta\), like this,
Repeat Until Convergence - return to step 2, the forward pass, until the error, \(L\), is reduced to an acceptable level, i.e., model convergence is the condition to stop the iterations
These are the steps, now let’s dive into the details for each, but first let’s start with the mathematical framework for backpropagation - the chain rule.
The Chain Rule#
Upon reflection, it is clear that the forward pass through our artificial neural network involves a sequence of nested operations that progressively transform the input signals as they propagate from the input nodes, through each layer, to the output nodes.
So we can represent this as a sequence of nested operations,
and now in this form to emphasize the nesting of operations,
By applying the chain rule to the nested functions \(y = h \bigl( g(f(x)) \bigr)\), we can solve for \(\frac{\partial y}{\partial x}\) as,
where we chain together the partial derivatives for all the operators to solve derivative of the output, \(y\), given the input, \(x\).
we can compute derivatives at any intermediate point in the nested functions, for example, stepping backwards one step,
and now two steps,
and all the way with three steps,
This is what we do with backpropagation, but this may be too abstract! Let’s move to a very simple feed forward neural network with only these three nodes,
\(I_1\) - input node
\(H_2 = h(I_1)\) - hidden layer node, a function of \(I_1\)
\(O_3 = o(H_2)\) - output node, a function of \(H_2\)
this is still intentionally abstract, i.e., without mention of weights and biases, to help you develop a mental framework of backpropagation with neural networks by the chain rule; we will dive into the details immediately after this discussion.
The output \(O_3\) depends on the input \(I_1\) through these nested functions:
Using the chain rule, the gradient of the output with respect to the hidden node signal, backpropagating one step, is,
and with respect to the input signal, backpropagating two steps, is,
This shows how the gradient backpropagates through the network,
\(\frac{\partial H_2}{\partial I_1}\) - is the local gradient at the hidden node
\(\frac{\partial O_3}{\partial H_2}\) - is the local gradient at the output node
By backpropagation we can calculate the derivatives with respect to all parts of the network, how the input node signal \(I_1\), or the hidden node signal \(H_2\), affects the output \(O_3\), \(\frac{\partial O_3}{\partial I_1}\) and \(\frac{\partial O_3}{\partial H_2}\) respectively.
and more importantly, how changes in the input \(I_1\), or \(H_2\), affect the change in model loss, \(\frac{\partial L}{\partial I_1}\) and \(\frac{\partial L}{\partial H_2}\) respectively.
This chain of partial derivatives, moving backwards step by step through the neural network layers, is the fundamental mechanism behind backpropagation. Next we will derive and demonstrate each of the parts of backpropagation and then finally put this together to show backpropagation over our entire network.
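Before moving on, here is a minimal numeric sketch of the chain rule on this abstract three node chain; the concrete choices of sigmoid for the hidden node and tanh for the output node are assumptions made only to check the analytic chain rule against a finite difference.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

I1 = 0.7 # assumed input signal
H2 = sigmoid(I1) # hidden node, H2 = h(I1), assumed h = sigmoid
O3 = np.tanh(H2) # output node, O3 = o(H2), assumed o = tanh

dH2_dI1 = H2 * (1.0 - H2) # local gradient at the hidden node
dO3_dH2 = 1.0 - np.tanh(H2)**2 # local gradient at the output node
dO3_dI1 = dO3_dH2 * dH2_dI1 # chain rule, product of the local gradients

eps = 1e-6 # finite difference check of the chain rule result
numeric = (np.tanh(sigmoid(I1 + eps)) - np.tanh(sigmoid(I1 - eps))) / (2 * eps)
print(dO3_dI1, numeric) # the two estimates should agree closely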
Neural Networks Backpropagation Building Blocks#
Let’s cover the numerical building blocks for backpropagation. Once you understand these backpropagation building blocks, you will be able to backpropagate our simple network and even any complicated artificial neural networks by hand,
calculating the loss derivative
backpropagation through nodes
backpropagation along connections
accounting for multiple paths
loss derivatives with respect to weights and biases
For now I demonstrate backpropagation of this loss derivative for a single training data sample, \(y\).
I address multiple samples later, \(y_i, i=1, \ldots, n\)
Let’s start with calculating the loss derivative.
Calculating the Loss Derivative#
Backpropagation is based on the concept of allocating or propagating the loss derivative backwards through the neural network,
we calculate the loss derivative and then distribute it sequentially, in reverse direction, from network output back towards the network input
it is important to know that we are working with derivatives, and that backpropagation is NOT distributing error, although as you will see it may look that way!
We start by defining the loss, given the truth, \(y\), and our prediction, \(\hat{y} = O_6\), we calculate our \(L^2\) loss as,
our choice of loss function allows us to use the prediction error as the loss derivative! We calculate the loss derivative as the partial derivative of the loss with respect to the estimate, \(\frac{\partial L}{\partial \hat{y}}\),
You see what I mean, we are backpropagating the loss derivative, but due to our formulation of the \(L^2\) loss, we only have to calculate the error at the output of our output node, but once again - it is the loss derivative.

For the example of our simple artificial neural network with the output at node, \(O_6\), our loss derivative is,
So this is our loss derivative backpropagated to the output of our output node, and now we are ready to backpropagate this loss derivative through our artificial neural network; let's talk about how we step through nodes and along connections.
Backpropagation through Output Node with Identity Activation#
Let’s backpropagate through our output node, \(O_6\), from post-activation to pre-activation. To do this we need the partial derivative our activation function.
since this is an output node with a regression artificial neural network I have selected the identity or linear activation function.

The identity activation at output node \(O_6\) is defined as:
The derivative of the identity activation at node \(O_6\) with respect to its input \(O_{6_{in}}\), i.e., crossing node \(O_6\) is,
Note, we just need \(O_6\) the output signal from the node. Now we can add this to our chain rule to backpropagate from loss derivative with respect to the node output, \(\frac{\partial \mathcal{L}}{\partial O_6}\), and to the loss derivative with respect to the node input, \(\frac{\partial \mathcal{L}}{\partial O_{6_{\text{in}}}}\),
Now that we have backpropagated through an output node, let's backpropagate along the \(H_4\) to \(O_6\) connection from the hidden layer.
Backpropagation along Connections#
Now let’s backpropagate along the connection between nodes \(O_6\) and \(H_4\).

Preactivation, the input to node \(O_6\) is calculated as,
We calculate the derivative along the connection as,
by resolving the above partial derivative, we see that we backpropagate along a connection by applying the connection weight.
Note, we just need the current connection weight \(\lambda_{4,6}\). Now we can add this to our chain rule to backpropagate along the \(H_4\) to \(O_6\) connection from loss derivative with respect to the output layer node input \(\frac{\partial \mathcal{L}}{\partial O_{6_{\text{in}}}}\), to the loss derivative with respect to the hidden layer node output \(\frac{\partial \mathcal{L}}{\partial H_4}\).
Backpropagation through Nodes with Sigmoid Activation#
Let’s backpropagate through a hidden layer node, \(H_4\), from postactivation to preactivation. To do this we need the partial derivative our activation function.
we are assuming sigmoid activation for all hidden layer nodes
for super clean logic, everyone resolves the activation derivative as a function of the output rather than, as is typical, the input,

The sigmoid activation at hidden layer node \(H_4\) is defined as:
The derivative of the sigmoid activation at node \(H_4\) with respect to its input \(H_{4_{in}}\), i.e., crossing node \(H_4\) is,
Now, for compact notation let’s set,
and substituting we have,
and by the chain rule we can extend it to,
The derivative of \(u = e^{-H_{4_{in}}}\) with respect to \(H_{4_{in}}\) is:
now we can substitute,
Express in terms of node \(H_4\) output, \(H_4 = \frac{1}{1 + u}\),
So we can backpropagate through our node, \(H_4\), from node post-activation output, \(H_4\) to node pre-activation input, \(H_{4_{in}}\), by,
Note, we just need \(H_4\) the output signal from the node. Now we can add this to our chain rule to backpropagate from loss derivative to the output of node \(H_4\) and to the input of node \(H_4\),
Now we can handle all cases of backpropagation through the nodes in our network.
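Here is a minimal numeric check of this convenient result, that the sigmoid derivative can be written in terms of the node output, \(H_4 \left(1 - H_4\right)\); the pre-activation input value is assumed for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H4_in = 0.35 # assumed pre-activation input to node H4
H4 = sigmoid(H4_in) # post-activation node output

analytic = H4 * (1.0 - H4) # derivative written in terms of the node output
eps = 1e-6
numeric = (sigmoid(H4_in + eps) - sigmoid(H4_in - eps)) / (2 * eps) # finite difference check
print(analytic, numeric) # the two should agree closely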
Backpropagation Along Another Connection#
For continuity and completeness, let’s repeat the previously described method to backpropagate along the connection \(I_1\) to \(H_4\).

Once again, preactivation the input to node \(H_4\) is calculated as,
We calculate the derivative along the connection as,
by resolving the above partial derivative, we see that we backpropagate along a connection by applying the connection weight.
Note, we just need the current connection weight \(\lambda_{1,4}\). Now we can add this to our chain rule to backpropagate along the \(I_1\) to \(H_4\) connection from the loss derivative with respect to the hidden layer node input, \(\frac{\partial \mathcal{L}}{\partial H_{4_{\text{in}}}}\), to the loss derivative with respect to the input layer node output, \(\frac{\partial \mathcal{L}}{\partial I_1}\).
Accounting for Multiple Paths#
Our loss derivative with respect to the node output \(I_1\), \(\frac{\partial \mathcal{L}}{\partial I_1}\) is not correct!
we accounted for the \(O_6\) to \(H_4\) to \(I_1\) path, but we did not account for the \(O_6\) to \(H_5\) to \(I_1\) path

To account for multiple paths we just need to sum over all the paths.
we can evaluate this as,
and then simplify by removing the 1.0 values and grouping terms as,
and now we can evaluate this simplified form as,
Backpropagation through Input Nodes with Identity Activation#
Let’s backpropagate through our input node, \(I_1\), from postactivation to preactivation. To do this we need the partial derivative our activation function.
since this is an input node I have selected the identity or linear activation function.

The identity activation at input node \(I_1\) is defined as:
The derivative of the identity activation at node \(I_1\) with respect to its input \(I_{1_{in}}\), i.e., passing through node \(I_1\) is,
Note, we just need \(I_1\) the output signal from the node. Now we can add this to our chain rule to backpropagate from loss derivative with respect to the node output, \(\frac{\partial \mathcal{L}}{\partial I_1}\), and to the loss derivative with respect to the node input, \(\frac{\partial \mathcal{L}}{\partial I_{1_{\text{in}}}}\),
we can evaluate this as,
For fun I designed this notation for maximum clarity,
But this can be simplified by removing the 1.0 values and grouping terms as,
and now we can evaluate this simplified form as,
For completeness here is the backpropagation for the other input nodes, here’s \(\frac{\partial \mathcal{L}}{\partial I_{2_{in}}}\),

For brevity I have removed the 1.0s and grouped like terms,
and now we can evaluate this simplified form as,
and here is \(\frac{\partial \mathcal{L}}{\partial I_{3_{in}}}\),

For brevity I have removed the 1.0s and grouped like terms,
and now we can evaluate this simplified form as,
Loss Derivatives with Respect to Weights and Biases#
Now we have backpropagated the loss derivative through our network.

and we have the loss derivative with respect to the input and output of each node in our network,
But what we actually need is the loss derivative with respect to each connection weight,
and node biases,
How do we backpropagate the loss derivative to a connection weight? Let’s start with the \(H_4\) to \(O_6\) connection.

Preactivation, for the input to node \(O_6\) we have,
We calculate the derivative with respect to the connection weight as,
To backpropagate from the loss derivative with respect to the next node's input to the loss derivative with respect to the connection weight, we just need the output of the node in the previous layer that is passed along the connection,
Now, for completeness, here are the equations for all of our network’s connection weights.
See the pattern, the loss derivatives with respect to connection weights are,
Now how do we backpropagate the loss derivative to a node bias? Let’s start with the \(O_6\) node.

Once again, the preactivation, input to node \(O_6\) is,
We calculate the derivative with respect to the node bias as,
so our bias loss derivative is equal to the node input loss derivative,
For completeness here are all the loss derivatives with respect to node biases,
See the pattern, the loss derivatives with respect to node biases are,
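Here is a minimal numeric sketch of both patterns, the loss derivatives with respect to a connection weight and a node bias; the node output and node input loss derivative values are assumed placeholders, not values from the trained network.
H4 = 0.62 # assumed output of hidden layer node H4, passed along the connection
dL_dO6_in = -0.15 # assumed loss derivative with respect to the input of node O6

dL_dlambda_46 = H4 * dL_dO6_in # connection weight: previous node output times next node input loss derivative
dL_db6 = dL_dO6_in # node bias: equal to the node input loss derivative
print(dL_dlambda_46, dL_db6)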
Backpropagation Example#
Let’s take the backpropagation method explained above and apply them to my interactive neural network.
Here’s the result for our first training epoch with only 1 sample,

My interactive dashboard provides all the loss derivatives with respect to the input for each node and the output signals from each node, so for example we can calculate \(\frac{\partial L}{\partial \lambda_{4,6}}\) as,
Here’s the loss derivatives with respect to connection weights for the other hidden layer to output node connection,
and now let’s get all the input to hidden layer connections,
This takes care of all of the connection weight error derivatives, now let's take care of the node bias error derivatives.
the node bias error derivatives are the same as the node pre-activation error derivatives. Now let's calculate the bias terms in the hidden layer,
Updating Model Parameters#
The loss derivatives with respect to each of the model parameters are the gradients, so we are ready to use gradient descent optimization with the addition of,
learning rate - to scale the rate of change of the model updates we assign a learning rate, \(\eta\). For our model parameter examples from above,
recall, this process of gradient calculation and model parameters, weights and biases, updating is iterated and is known as gradient descent optimization.
the goal is to explore the loss hypersurface, avoiding and escaping local minimums and ultimately finding the global minimum.
learning rate, also known as step size, is commonly set between 0.0 and 1.0; note 0.01 is the default in the Keras module of TensorFlow
Low Learning Rate – more stable, but a slower solution, may get stuck in a local minimum
High Learning Rate – may be unstable, but perhaps a faster solution, may diverge out of the global minimum
One strategy is to start with a high learning rate and then to decrease the learning rate over the iterations
Learning Rate Decay - set as > 0 to mitigate oscillations,
where \(\ell\) is the model training epoch
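The decay schedule can take several forms; here is a minimal sketch assuming one common form, inverse time decay, where the learning rate shrinks with the training epoch (the function name and parameter values are assumptions for illustration).
def decayed_learning_rate(eta0, decay, epoch): # assumed inverse time decay, one common form
    return eta0 / (1.0 + decay * epoch)

for epoch in [0, 10, 100, 1000]:
    print(epoch, decayed_learning_rate(eta0=0.1, decay=0.01, epoch=epoch)) # learning rate shrinks over epochs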
Notice that the model parameter updates above are for a single training data case. Consider this single model parameter,
we calculate the update over all samples in the batch and apply the average of the updates.
is applied to update the \(\lambda_{4,6}\) parameter as,
is dependent on the \(H_4\) node output and the error, \(L\), that are for a single sample, \(x_1,\ldots,x_m\) and \(y\); therefore, we cannot calculate a single parameter update over all our \(1,\ldots,n\) training data samples.
instead we can calculate \(1,\ldots,n\) updates and then apply the average of all the updates to our model parameters,
since the learning rate is a constant, we can move it out of the sum and now we are averaging the gradients,
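Here is a minimal sketch of this batch-averaged update for a single parameter; the per-sample derivatives and current parameter value are assumed placeholders for illustration.
import numpy as np

per_sample_derivs = np.array([0.12, -0.05, 0.08, 0.01]) # assumed per-sample backpropagated derivatives for lambda_46
eta = 0.2 # learning rate

update = eta * per_sample_derivs.mean() # learning rate moved outside the sum, average of the per-sample derivatives
lambda_46 = 0.4 # assumed current value of the parameter
lambda_46 = lambda_46 + update # single update applied per batch, following the chapter's sign convention
print(update, lambda_46)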
Training Epochs#
This is a good time to talk about stochastic gradient descent optimization, first let’s define some common terms,
Batch Gradient Descent - updates the model parameters after passing through all of the data
Stochastic Gradient Descent - updates the model parameters for each data sample
Mini-batch Gradient Descent - updates the model parameters after passing through a single mini-batch
With mini-batch gradient descent stochasticity is introduced through the use of subsets of the data, known as batches,
for example, if we divide our 100 samples into 4 batches, then we iterate over each batch separately
we speed up the individual updates, fewer data are faster to calculate, but we introduce more error
this often helps the training explore for the global minimum and avoid getting stuck in local minimums and along ridges in the loss hypersurface
Finally our last definition here,
epoch - is one pass over all of the data, so that would be 4 iterations of updating the model parameters if we have 4 mini-batches
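Here is a minimal sketch of this mini-batch bookkeeping, splitting 100 samples into 4 mini-batches so that each epoch results in 4 parameter updates; the shuffling and batch split shown are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(seed=13)
n, n_batches = 100, 4 # 100 samples split into 4 mini-batches

for epoch in range(2): # each epoch is one pass over all of the data
    batches = np.array_split(rng.permutation(n), n_batches) # shuffle, then split the sample indices
    for batch in batches: # forward pass, backpropagation and parameter update on each mini-batch
        pass # (update step omitted in this sketch)
    print(f'epoch {epoch}: {len(batches)} parameter updates')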
There are many other considerations that I will add later including,
momentum
adaptive optimization
Now let’s build the above artificial neural network by-hand and visualize the solution!
this is by-hand so that you can see every calculation. I intentionally avoided using TensorFlow or PyTorch.
Interactive Dashboard#
I built out an interactive Python dashboard with the code below for training an artificial neural network. You can step through the training iteration and observe over the training epochs,
model parameters
forward pass predictions
backpropagation of error derivatives
If you would like to see artificial neural networks in action, check out my ANN interactive Python dashboard,

Import Required Packages#
We will also need some standard packages. These should have been installed with Anaconda 3.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator, AutoLocator) # control of axes ticks
plt.rc('axes', axisbelow=True) # set axes and grids in the background for all plots
import math
seed = 13
If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.
Declare Functions#
Here’s the functions to make, train and visualize our artificial neural network.
def add_grid():
plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks
def calculate_angle_rads(x1, y1, x2, y2):
dx = x2 - x1 # Calculate the differences
dy = y2 - y1
angle_rads = math.atan2(dy, dx) # Calculate the angle in radians
#angle_degrees = math.degrees(angle_radians) # Convert the angle to degrees
return angle_rads
def offset(pto, distance, angle_deg): # modified from ChatGPT 4.o generated
angle_rads = math.radians(angle_deg) # Convert angle from degrees to radians
x_new = pto[0] + distance * math.cos(angle_rads) # Calculate the new coordinates
y_new = pto[1] + distance * math.sin(angle_rads)
return np.array((x_new, y_new))
def offsetx(xo, distance, angle_deg): # modified from ChatGPT 4.o generated
angle_rads = math.radians(angle_deg) # Convert angle from degrees to radians
x_new = xo + distance * math.cos(angle_rads) # Calculate the new coordinates
return np.array((xo, x_new))
def offset_arrx(xo, distance, angle_deg,size): # modified from ChatGPT 4.o generated
angle_rads = math.radians(angle_deg) # Convert angle from degrees to radians
x_new = xo + distance * math.cos(angle_rads) # Calculate the new coordinates
x_arr = x_new + size * math.cos(angle_rads+2.48) # Calculate the new coordinates
return np.array((x_new, x_arr))
def offsety(yo, distance, angle_deg): # modified from ChatGPT 4.o generated
angle_rads = math.radians(angle_deg) # Convert angle from degrees to radians
y_new = yo + distance * math.sin(angle_rads) # Calculate the new coordinates
return np.array((yo, y_new))
def offset_arry(yo, distance, angle_deg,size): # modified from ChatGPT 4.o generated
angle_rads = math.radians(angle_deg) # Convert angle from degrees to radians
y_new = yo + distance * math.sin(angle_rads) # Calculate the new coordinates
y_arr = y_new + size * math.sin(angle_rads+2.48) # Calculate the new coordinates
return np.array((y_new, y_arr))
def lint(x1, y1, x2, y2, t):
# Calculate the interpolated coordinates
x = x1 + t * (x2 - x1)
y = y1 + t * (y2 - y1)
return np.array((x, y))
def lintx(x1, y1, x2, y2, t):
# Calculate the interpolated coordinates
x = x1 + t * (x2 - x1)
return x
def linty(x1, y1, x2, y2, t):
# Calculate the interpolated coordinates
y = y1 + t * (y2 - y1)
return y
def lint_intx(x1, y1, x2, y2, ts, te):
# Calculate the interpolated coordinates
xs = x1 + ts * (x2 - x1)
xe = x1 + te * (x2 - x1)
return np.array((xs,xe))
def lint_inty(x1, y1, x2, y2, ts, te):
# Calculate the interpolated coordinates
ys = y1 + ts * (y2 - y1)
ye = y1 + te * (y2 - y1)
return np.array((ys,ye))
def lint_int_arrx(x1, y1, x2, y2, ts, te, size):
# Calculate the interpolated coordinates
xe = x1 + te * (x2 - x1)
line_angle_rads = calculate_angle_rads(x1, y1, x2, y2)
x_arr = xe + size * math.cos(line_angle_rads+2.48) # Calculate the new coordinates
return np.array((xe,x_arr))
def lint_int_arry(x1, y1, x2, y2, ts, te, size):
# Calculate the interpolated coordinates
ye = y1 + te * (y2 - y1)
line_angle_rads = calculate_angle_rads(x1, y1, x2, y2)
y_arr = ye + size * math.sin(line_angle_rads+2.48) # Calculate the new coordinates
return np.array((ye,y_arr))
def as_si(x, ndp): # from xnx on StackOverflow https://stackoverflow.com/questions/31453422/displaying-numbers-with-x-instead-of-e-scientific-notation-in-matplotlib
s = '{x:0.{ndp:d}e}'.format(x=x, ndp=ndp)
m, e = s.split('e')
return r'{m:s}\times 10^{{{e:d}}}'.format(m=m, e=int(e))
The Simple ANN#
I wrote this code to specify a simple ANN:
3 input nodes, 2 hidden nodes and 1 output node
and to train the ANN by iteratively performing the forward calculation and backpropagation. I calculate:
the error derivative and then backpropagate it to each node
solve for the partial derivatives of the error with respect to each weight and bias
all weights, biases and partial derivatives for all epochs are recorded in vectors for plotting
x1 = 0.5; x2 = 0.2; x3 = 0.7; y = 0.3 # training data
lr = 0.2 # learning rate
np.random.seed(seed=seed)
nepoch = 1000
y4 = np.zeros(nepoch); y5 = np.zeros(nepoch); y6 = np.zeros(nepoch)
w14 = np.zeros(nepoch); w24 = np.zeros(nepoch); w34 = np.zeros(nepoch)
w15 = np.zeros(nepoch); w25 = np.zeros(nepoch); w35 = np.zeros(nepoch)
w46 = np.zeros(nepoch); w56 = np.zeros(nepoch)
dw14 = np.zeros(nepoch); dw24 = np.zeros(nepoch); dw34 = np.zeros(nepoch)
dw15 = np.zeros(nepoch); dw25 = np.zeros(nepoch); dw35 = np.zeros(nepoch)
dw46 = np.zeros(nepoch); dw56 = np.zeros(nepoch)
db4 = np.zeros(nepoch); db5 = np.zeros(nepoch); db6 = np.zeros(nepoch)
b4 = np.zeros(nepoch); b5 = np.zeros(nepoch); b6 = np.zeros(nepoch)
y4 = np.zeros(nepoch); y5 = np.zeros(nepoch); y6 = np.zeros(nepoch)
d4 = np.zeros(nepoch); d5 = np.zeros(nepoch); d6 = np.zeros(nepoch)
# initialize the weights - Xavier Weight Initialization
lower, upper = -(1.0 / np.sqrt(3.0)), (1.0 / np.sqrt(3.0)) # lower and upper bound for the weights, uses inputs to node
#lower, upper = -(sqrt(6.0) / sqrt(3.0 + 2.0)), (sqrt(6.0) / sqrt(3.0 + 2.0)) # Normalized Xavier weights, integrates outputs also
w14[0] = lower + np.random.random() * (upper - lower);
w24[0] = lower + np.random.random() * (upper - lower);
w34[0] = lower + np.random.random() * (upper - lower);
w15[0] = lower + np.random.random() * (upper - lower);
w25[0] = lower + np.random.random() * (upper - lower);
w35[0] = lower + np.random.random() * (upper - lower);
lower, upper = -(1.0 / np.sqrt(2.0)), (1.0 / np.sqrt(2.0))
#lower, upper = -(sqrt(6.0) / sqrt(2.0 + 1.0)), (sqrt(6.0) / sqrt(2.0 + 1.0)) # Normalized Xavier weights, integrates outputs also
w46[0] = lower + np.random.random() * (upper - lower);
w56[0] = lower + np.random.random() * (upper - lower);
#b4[0] = np.random.random(); b5[0] = np.random.random(); b6[0] = np.random.random()
b4[0] = (np.random.random()-0.5)*0.5; b5[0] = (np.random.random()-0.5)*0.5; b6[0] = (np.random.random()-0.5)*0.5; # small random value
for i in range(0,nepoch):
    # forward pass of model
    y4[i] = w14[i]*x1 + w24[i]*x2 + w34[i]*x3 + b4[i];
    y4[i] = 1.0/(1 + math.exp(-1*y4[i]))
    y5[i] = w15[i]*x1 + w25[i]*x2 + w35[i]*x3 + b5[i]
    y5[i] = 1.0/(1 + math.exp(-1*y5[i]))
    y6[i] = w46[i]*y4[i] + w56[i]*y5[i] + b6[i]
    # y6[i] = 1.0/(1 + math.exp(-1*y6[i])) # sigmoid / logistic activation at o6
    # backpropagate the error derivative through the nodes
    # d6[i] = y6[i]*(1-y6[i])*(y-y6[i]) # sigmoid / logistic activation at o6
    d6[i] = (y-y6[i]) # identity activation at o6
    d5[i] = y5[i]*(1-y5[i])*w56[i]*d6[i]; d4[i] = y4[i]*(1-y4[i])*w46[i]*d6[i]
    # calculate the change in weights
    if i < nepoch - 1:
        dw14[i] = lr*d4[i]*x1; dw24[i] = lr*d4[i]*x2; dw34[i] = lr*d4[i]*x3
        dw15[i] = lr*d5[i]*x1; dw25[i] = lr*d5[i]*x2; dw35[i] = lr*d5[i]*x3
        dw46[i] = lr*d6[i]*y4[i]; dw56[i] = lr*d6[i]*y5[i]
        db4[i] = lr*d4[i]; db5[i] = lr*d5[i]; db6[i] = lr*d6[i];
        w14[i+1] = w14[i] + dw14[i]; w24[i+1] = w24[i] + dw24[i]; w34[i+1] = w34[i] + dw34[i]
        w15[i+1] = w15[i] + dw15[i]; w25[i+1] = w25[i] + dw25[i]; w35[i+1] = w35[i] + dw35[i]
        w46[i+1] = w46[i] + dw46[i]; w56[i+1] = w56[i] + dw56[i]
        b4[i+1] = b4[i] + db4[i]; b5[i+1] = b5[i] + db5[i]; b6[i+1] = b6[i] + db6[i]
Now Visualize the Network for a Specific Epoch#
I wrote a custom network visualization below, select iepoch and visualize the artificial neural network for a specific epoch.
iepoch = 1
dx = -0.19; dy = -0.09; edge = 1.0
o6x = 17; o6y =5; h5x = 10; h5y = 3.5; h4x = 10; h4y = 6.5
i1x = 3; i1y = 9.0; i2x = 3; i2y = 5; i3x = 3; i3y = 1.0; buffer = 0.5
plt.subplot(111)
plt.gca().set_axis_off()
circle_i1 = plt.Circle((i1x,i1y), 0.25, fill=False, edgecolor = 'black',lw=2,zorder=100); plt.annotate(r' $I_1$',(i1x+dx,i1y+dy),zorder=100);
circle_i1b = plt.Circle((i1x,i1y), 0.40, fill=True, facecolor = 'white',edgecolor = None,lw=1,zorder=10);
plt.gca().add_patch(circle_i1); plt.gca().add_patch(circle_i1b)
circle_i2 = plt.Circle((i2x,i2y), 0.25, fill=False, edgecolor = 'black',lw=2,zorder=100); plt.annotate(r' $I_2$',(i2x+dx,i2y+dy),zorder=100);
circle_i2b = plt.Circle((i2x,i2y), 0.40, fill=True, facecolor = 'white',edgecolor = None,lw=1,zorder=10);
plt.gca().add_patch(circle_i2); plt.gca().add_patch(circle_i2b)
circle_i3 = plt.Circle((i3x,i3y), 0.25, fill=False, edgecolor = 'black',lw=2,zorder=100); plt.annotate(r' $I_3$',(i3x+dx,i3y+dy),zorder=100);
circle_i3b = plt.Circle((i3x,i3y), 0.40, fill=True, facecolor = 'white',edgecolor = None,lw=1,zorder=10);
plt.gca().add_patch(circle_i3); plt.gca().add_patch(circle_i3b)
circle_h4 = plt.Circle((h4x,h4y), 0.25, fill=False, edgecolor = 'black',lw=2,zorder=100); plt.annotate(r'$H_4$',(h4x+dx,h4y+dy),zorder=100);
circle_h4b = plt.Circle((h4x,h4y), 0.40, fill=True, facecolor = 'white',edgecolor = None,lw=1,zorder=10);
plt.gca().add_patch(circle_h4); plt.gca().add_patch(circle_h4b)
circle_h5 = plt.Circle((h5x,h5y), 0.25, fill=False, edgecolor = 'black',lw=2,zorder=100); plt.annotate(r'$H_5$',(h5x+dx,h5y+dy),zorder=100);
circle_h5b = plt.Circle((h5x,h5y), 0.40, fill=True, facecolor = 'white',edgecolor = None,lw=1,zorder=10);
plt.gca().add_patch(circle_h5); plt.gca().add_patch(circle_h5b)
circle_o6 = plt.Circle((o6x,o6y), 0.25, fill=False, edgecolor = 'black',lw=2,zorder=100); plt.annotate(r'$O_6$',(o6x+dx,o6y+dy),zorder=100);
circle_o6b = plt.Circle((o6x,o6y), 0.40, fill=True, facecolor = 'white',edgecolor = None,lw=1,zorder=10);
plt.gca().add_patch(circle_o6); plt.gca().add_patch(circle_o6b)
plt.plot([i1x-edge,i1x],[i1y,i1y],color='grey',lw=1.0,zorder=1)
plt.plot([i2x-edge,i2x],[i2y,i2y],color='grey',lw=1.0,zorder=1)
plt.plot([i3x-edge,i3x],[i3y,i3y],color='grey',lw=1.0,zorder=1)
plt.annotate(r'$x_1$ = ' + str(np.round(x1,2)),(i1x-buffer-1.6,i1y-0.05),size=8,zorder=200,color='grey',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=0)
plt.annotate(r'$x_2$ = ' + str(np.round(x2,2)),(i2x-buffer-1.6,i2y-0.05),size=8,zorder=200,color='grey',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=0)
plt.annotate(r'$x_3$ = ' + str(np.round(x3,2)),(i3x-buffer-1.6,i3y-0.05),size=8,zorder=200,color='grey',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=0)
plt.plot([i1x,h4x],[i1y,h4y],color='lightcoral',lw=1.0,zorder=1)
plt.plot([i2x,h4x],[i2y,h4y],color='red',lw=1.0,zorder=1)
plt.plot([i3x,h4x],[i3y,h4y],color='darkred',lw=1.0,zorder=1)
plt.plot([i1x,h5x],[i1y,h5y],color='dodgerblue',lw=1.0,zorder=1)
plt.plot([i2x,h5x],[i2y,h5y],color='blue',lw=1.0,zorder=1)
plt.plot([i3x,h5x],[i3y,h5y],color='darkblue',lw=1.0,zorder=1)
plt.plot([h4x,o6x],[h4y,o6y],color='orange',lw=1.0,zorder=1)
plt.plot([h5x,o6x],[h5y,o6y],color='darkorange',lw=1.0,zorder=1)
plt.plot([o6x+edge,o6x],[o6y,o6y],color='grey',lw=1.0,zorder=1)
plt.annotate(r'$\hat{y}$ = ' + str(np.round(y6[iepoch],2)),(o6x+buffer+0.7,o6y-0.05),size=8,zorder=200,color='grey',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=0)
plt.plot(offsetx(h4x,2,-12),offsety(h4y,2,-12)+0.1,color='orange',lw=1.0,zorder=1)
plt.plot(offset_arrx(h4x,2,-12,0.2),offset_arry(h4y,2,-12,0.2)+0.1,color='orange',lw=1.0,zorder=1)
plt.annotate(r'$H_{4}$ = ' + str(np.round(y4[iepoch],2)),(lintx(h4x,h4y,o6x,o6y,0.08),linty(h4x,h4y,o6x,o6y,0.08)-0.0),size=8,zorder=200,color='orange',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-12)
plt.plot(offsetx(h5x,2,12),offsety(h5y,2,12)+0.1,color='darkorange',lw=1.0,zorder=1)
plt.plot(offset_arrx(h5x,2,12,0.2),offset_arry(h5y,2,12,0.2)+0.1,color='darkorange',lw=1.0,zorder=1)
plt.annotate(r'$H_{5}$ = ' + str(np.round(y5[iepoch],2)),(lintx(h5x,h5y,o6x,o6y,0.07),linty(h5x,h5y,o6x,o6y,0.07)+0.25),size=8,zorder=200,color='darkorange',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=12)
plt.annotate(r'$\frac{\partial P}{\partial O_{6_{in}}}$ = ' + str(np.round(d6[iepoch],2)),(o6x-0.5,o6y-0.7),size=10)
plt.annotate(r'$\frac{\partial P}{\partial H_{4_{in}}}$ = ' + str(np.round(d4[iepoch],2)),(h4x-0.5,h4y-0.7),size=10)
plt.annotate(r'$\frac{\partial P}{\partial H_{5_{in}}}$ = ' + str(np.round(d5[iepoch],2)),(h5x-0.5,h5y-0.7),size=10)
plt.annotate(r'$\frac{\partial P}{\partial \hat{y}}$ = ' + str(np.round(d6[iepoch],2)),(o6x,o6y-1.2),size=10)
plt.annotate(r'$\frac{\partial P}{\partial H_{4_{out}}}$ = ' + str(np.round(w46[iepoch]*d6[iepoch],2)),(h4x,h4y-1.2),size=10)
plt.annotate(r'$\frac{\partial P}{\partial H_{5_{out}}}$ = ' + str(np.round(w56[iepoch]*d6[iepoch],2)),(h5x,h5y-1.2),size=10)
plt.plot(lint_intx(h4x, h4y, o6x, o6y,0.4,0.6),lint_inty(h4x,h4y,o6x,o6y,0.4,0.6)-0.1,color='orange',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(o6x,o6y,h4x,h4y,0.4,0.6,0.2),lint_int_arry(o6x,o6y,h4x,h4y,0.4,0.6,0.2)-0.1,color='orange',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{4,6}}$ = ' + str(np.round(dw46[iepoch]/lr,4)),(lintx(h4x,h4y,o6x,o6y,0.5)-0.6,linty(h4x,h4y,o6x,o6y,0.5)-0.72),size=10,zorder=200,color='orange',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-11)
plt.plot(lint_intx(h5x, h5y, o6x, o6y,0.4,0.6),lint_inty(h5x,h5y,o6x,o6y,0.4,0.6)-0.1,color='darkorange',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(o6x,o6y,h5x,h5y,0.4,0.6,0.2),lint_int_arry(o6x,o6y,h5x,h5y,0.4,0.6,0.2)-0.1,color='darkorange',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{5,6}}$ = ' + str(np.round(dw56[iepoch]/lr,4)),(lintx(h5x,h5y,o6x,o6y,0.5)-0.4,linty(h5x,h5y,o6x,o6y,0.5)-0.6),size=10,zorder=200,color='darkorange',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=12)
plt.plot(offsetx(i1x,2,-20),offsety(i1y,2,-20)+0.1,color='lightcoral',lw=1.0,zorder=1)
plt.plot(offset_arrx(i1x,2,-20,0.2),offset_arry(i1y,2,-20,0.2)+0.1,color='lightcoral',lw=1.0,zorder=1)
plt.annotate(r'$I_{1}$ = ' + str(np.round(x1,2)),(lintx(i1x,i1y,h4x,h4y,0.1),linty(i1x,i1y,h4x,h4y,0.1)),size=8,zorder=200,color='lightcoral',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-20)
plt.plot(offsetx(i2x,2,12),offsety(i2y,2,12)+0.1,color='red',lw=1.0,zorder=1)
plt.plot(offset_arrx(i2x,2,12,0.2),offset_arry(i2y,2,12,0.2)+0.1,color='red',lw=1.0,zorder=1)
plt.annotate(r'$I_{2}$ = ' + str(np.round(x2,2)),(lintx(i2x,i2y,h4x,h4y,0.1),linty(i2x,i2y,h4x,h4y,0.1)+0.22),size=8,zorder=200,color='red',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=12)
plt.plot(offsetx(i3x,2,38),offsety(i3y,2,38)+0.1,color='darkred',lw=1.0,zorder=1)
plt.plot(offset_arrx(i3x,2,38,0.2),offset_arry(i3y,2,38,0.2)+0.1,color='darkred',lw=1.0,zorder=1)
plt.annotate(r'$I_{3}$ = ' + str(np.round(x3,2)),(lintx(i3x,i3y,h4x,h4y,0.08)-0.2,linty(i3x,i3y,h4x,h4y,0.08)+0.2),size=8,zorder=200,color='darkred',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=38)
plt.annotate(r'$\lambda_{1,4}$ = ' + str(np.round(w14[iepoch],2)),((i1x+h4x)*0.45,(i1y+h4y)*0.5-0.05),size=8,zorder=200,color='lightcoral',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-18)
plt.annotate(r'$\lambda_{2,4}$ = ' + str(np.round(w24[iepoch],2)),((i2x+h4x)*0.45-0.3,(i2y+h4y)*0.5-0.03),size=8,zorder=200,color='red',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=10)
plt.annotate(r'$\lambda_{3,4}$ = ' + str(np.round(w34[iepoch],2)),((i3x+h4x)*0.45-1.2,(i3y+h4y)*0.5-1.1),size=8,zorder=200,color='darkred',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=38)
plt.annotate(r'$\lambda_{1,5}$ = ' + str(np.round(w15[iepoch],2)),((i1x+h5x)*0.55-2.5,(i1y+h5y)*0.5+0.9),size=8,zorder=200,color='dodgerblue',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-36)
plt.annotate(r'$\lambda_{2,5}$ = ' + str(np.round(w25[iepoch],2)),((i2x+h5x)*0.55-1.5,(i2y+h5y)*0.5+0.05),size=8,zorder=200,color='blue',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-12)
plt.annotate(r'$\lambda_{3,5}$ = ' + str(np.round(w35[iepoch],2)),((i3x+h5x)*0.55-1.0,(i3y+h5y)*0.5+0.1),size=8,zorder=200,color='darkblue',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=20)
plt.annotate(r'$\lambda_{4,6}$ = ' + str(np.round(w46[iepoch],2)),((h4x+o6x)*0.47,(h4y+o6y)*0.47+0.39),size=8,zorder=200,color='orange',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-12)
plt.annotate(r'$\lambda_{5,6}$ = ' + str(np.round(w56[iepoch],2)),((h5x+o6x)*0.47,(h5y+o6y)*0.47+0.26),size=8,zorder=200,color='darkorange',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=12)
plt.plot(lint_intx(i1x, i1y, h4x, h4y,0.4,0.6),lint_inty(i1x,i1y,h4x,h4y,0.4,0.6)-0.1,color='lightcoral',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(h4x,h4y,i1x,i1y,0.4,0.6,0.2),lint_int_arry(h4x,h4y,i1x,i1y,0.4,0.6,0.2)-0.1,color='lightcoral',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{1,4}} =$' + r'${0:s}$'.format(as_si(dw14[iepoch]/lr,2)),(lintx(i1x,i1y,h4x,h4y,0.5)-0.6,linty(i1x,i1y,h4x,h4y,0.5)-1.0),size=8,zorder=200,color='lightcoral',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-20)
plt.plot(lint_intx(i2x, i2y, h4x, h4y,0.3,0.5),lint_inty(i2x,i2y,h4x,h4y,0.3,0.5)-0.1,color='red',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(h4x,h4y,i2x,i2y,0.5,0.7,0.2),lint_int_arry(h4x,h4y,i2x,i2y,0.5,0.7,0.2)-0.12,color='red',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{2,4}} =$' + r'${0:s}$'.format(as_si(dw24[iepoch]/lr,2)),(lintx(i2x,i2y,h4x,h4y,0.5)-1.05,linty(i2x,i2y,h4x,h4y,0.5)-0.7),size=8,zorder=200,color='red',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=12)
plt.plot(lint_intx(i3x, i3y, h4x, h4y,0.2,0.4),lint_inty(i3x,i3y,h4x,h4y,0.2,0.4)-0.1,color='darkred',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(h4x,h4y,i3x,i3y,0.5,0.8,0.2),lint_int_arry(h4x,h4y,i3x,i3y,0.5,0.8,0.2)-0.12,color='darkred',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{3,4}} =$' + r'${0:s}$'.format(as_si(dw34[iepoch]/lr,2)),(lintx(i3x,i3y,h4x,h4y,0.5)-1.7,linty(i3x,i3y,h4x,h4y,0.5)-1.7),size=8,zorder=200,color='darkred',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=38)
plt.plot(lint_intx(i3x, i3y, h5x, h5y,0.4,0.6),lint_inty(i3x,i3y,h5x,h5y,0.4,0.6)-0.1,color='darkblue',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(h5x,h5y,i3x,i3y,0.4,0.6,0.2),lint_int_arry(h5x,h5y,i3x,i3y,0.4,0.6,0.2)-0.12,color='darkblue',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{3,5}} =$' + r'${0:s}$'.format(as_si(dw35[iepoch]/lr,2)),(lintx(i3x,i3y,h5x,h5y,0.5)-0.4,linty(i3x,i3y,h5x,h5y,0.5)-0.6),size=8,zorder=200,color='darkblue',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=20)
plt.plot(lint_intx(i2x, i2y, h5x, h5y,0.3,0.5),lint_inty(i2x,i2y,h5x,h5y,0.3,0.5)-0.1,color='blue',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(h5x,h5y,i2x,i2y,0.3,0.7,0.2),lint_int_arry(h5x,h5y,i2x,i2y,0.3,0.7,0.2)-0.12,color='blue',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{2,5}} =$' + r'${0:s}$'.format(as_si(dw25[iepoch]/lr,2)),(lintx(i2x,i2y,h5x,h5y,0.5)-1.2,linty(i2x,i2y,h5x,h5y,0.5)-0.65),size=8,zorder=200,color='blue',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-12)
plt.plot(lint_intx(i1x, i1y, h5x, h5y,0.2,0.4),lint_inty(i1x,i1y,h5x,h5y,0.2,0.4)-0.1,color='dodgerblue',lw=1.0,zorder=1)
plt.plot(lint_int_arrx(h5x,h5y,i1x,i1y,0.2,0.8,0.2),lint_int_arry(h5x,h5y,i1x,i1y,0.2,0.8,0.2)-0.12,color='dodgerblue',lw=1.0,zorder=1)
plt.annotate(r'$\frac{\partial P}{\partial \lambda_{1,5}} =$' + r'${0:s}$'.format(as_si(dw15[iepoch]/lr,2)),(lintx(i1x,i1y,h5x,h5y,0.5)-2.2,linty(i1x,i1y,h4x,h4y,0.5)-1.5),size=8,zorder=200,color='dodgerblue',
bbox=dict(boxstyle="round,pad=0.0", edgecolor='white', facecolor='white', alpha=1.0),rotation=-36,xycoords = 'data')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.0, wspace=0.2, hspace=0.2); plt.show()

Check the ANN Convergence#
Now we plot the weights, biases and prediction over the epochs to check the training convergence.
plt.subplot(131)
plt.plot(np.arange(1,nepoch+1,1),y6,color='red',label=r'$\hat{y}$'); plt.xlim([1,nepoch]); plt.ylim([0,1])
plt.xlabel('Epochs'); plt.ylabel(r'$\hat{y}$'); plt.title('Simple Artificial Neural Network Prediction')
plt.plot([1,nepoch],[y,y],color='black',ls='--'); plt.vlines(400,-1.5,1.5,color='black')
add_grid(); plt.legend(loc='upper right'); plt.xscale('log')
plt.subplot(132)
plt.plot(np.arange(1,nepoch+1,1),w14,color='lightcoral',label = r'$\lambda_{1,4}$')
plt.plot(np.arange(1,nepoch+1,1),w24,color='red',label = r'$\lambda_{2,4}$')
plt.plot(np.arange(1,nepoch+1,1),w34,color='darkred',label = r'$\lambda_{3,4}$')
plt.plot(np.arange(1,nepoch+1,1),w15,color='dodgerblue',label = r'$\lambda_{1,5}$')
plt.plot(np.arange(1,nepoch+1,1),w25,color='blue',label = r'$\lambda_{2,5}$')
plt.plot(np.arange(1,nepoch+1,1),w35,color='darkblue',label = r'$\lambda_{3,5}$')
plt.plot(np.arange(1,nepoch+1,1),w46,color='orange',label = r'$\lambda_{4,6}$')
plt.plot(np.arange(1,nepoch+1,1),w56,color='darkorange',label = r'$\lambda_{5,6}$')
plt.plot([1,nepoch],[0,0],color='black',ls='--')
plt.xlim([1,nepoch]); plt.ylim([-1.5,1.5]); plt.vlines(400,-1.5,1.5,color='black')
plt.xlabel('Epochs'); plt.ylabel(r'$\hat{y}$'); plt.title('Simple Artificial Neural Network Weights')
add_grid(); plt.legend(loc='upper right'); plt.xscale('log')
plt.subplot(133)
plt.plot(np.arange(1,nepoch+1,1),b4,color='lightgreen',label = r'$\phi_{4}$')
plt.plot(np.arange(1,nepoch+1,1),b5,color='green',label = r'$\phi_{5}$')
plt.plot(np.arange(1,nepoch+1,1),b6,color='darkgreen',label = r'$\phi_{6}$')
plt.plot([1,nepoch],[0,0],color='black',ls='--')
plt.xlim([1,nepoch]); plt.ylim([-1.5,1.5]); plt.vlines(400,-1.5,1.5,color='black')
plt.xlabel('Epochs'); plt.ylabel(r'$\hat{y}$'); plt.title('Simple Artificial Neural Network Biases')
add_grid(); plt.legend(loc='upper right'); plt.xscale('log')
plt.subplots_adjust(left=0.0, bottom=0.0, right=3.0, top=1.0, wspace=0.2, hspace=0.2); plt.show()

Comments#
This was a basic treatment of artificial neural networks. Much more could be done and discussed, I have many more resources. Check out my shared resource inventory and the YouTube lecture links at the start of this chapter with resource links in the videos’ descriptions.
I hope this is helpful,
Michael
Want to Work Together?#
I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.
Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!
Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
I can be reached at mpyrcz@austin.utexas.edu.
I’m always happy to discuss,
Michael
Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
More Resources Available at: Twitter | GitHub | Website | GoogleScholar | Geostatistics Book | YouTube | Applied Geostats in Python e-book | Applied Machine Learning in Python e-book | LinkedIn
Comments on Network Nomenclature#
Just a couple more comments about my network nomenclature. My goal is to maximize simplicity and clarity,
Network Nodes and Connections - I choose to use unique numbers for all nodes, \(I_1\), \(I_2\), \(I_3\), \(H_4\), \(H_5\) and \(O_6\), instead of repeating numbers over each layer, \(I_1\), \(I_2\), \(I_3\), \(H_1\), \(H_2\), and \(O_1\) to simplify the notation for the weights; therefore, when I say \(\lambda_{1,4}\) you know exactly where this weight is applied in the network, from node \(I_1\) to node \(H_4\).
Node Outputs - I use the node label to also describe the output from the node, for example, \(O_6\) is both the output node and the signal or value output from node \(O_6\),
Pre- and Post-activation - at our nodes \(H_4\), \(H_5\), and \(O_6\), we have the node input before activation and the node output after activation, I use the notation \(H_{4_{in}}\), \(H_{5_{in}}\), and \(O_{6_{in}}\) for the pre-activation input,
\(\quad\) and \(H_4\), \(H_5\), and \(O_6\) for the post-activation node output.
It is important to have clean, clear notation because with back propagation we have to step through the nodes, going from post-activation to pre-activation.
often variables like \(z\) are applied for pre-activation in the neural network literature, but I feel this is ambiguous and may cause confusion; here we take a nuts and bolts approach, explicitly describing every equation, to show exactly how neural networks are trained and predict