Attention#

Michael J. Pyrcz, Professor, The University of Texas at Austin


Chapter of e-book “Applied Machine Learning in Python: a Hands-on Guide with Code”.

Cite this e-Book as:

Pyrcz, M.J., 2024, Applied Machine Learning in Python: A Hands-on Guide with Code [e-book]. Zenodo. doi:10.5281/zenodo.15169138

The workflows in this book and more are available here:

Cite the MachineLearningDemos GitHub Repository as:

Pyrcz, M.J., 2024, MachineLearningDemos: Python Machine Learning Demonstration Workflows Repository (0.0.3) [Software]. Zenodo. DOI: 10.5281/zenodo.13835312. GitHub repository: GeostatsGuy/MachineLearningDemos

By Michael J. Pyrcz
© Copyright 2024.

This chapter is a tutorial for / demonstration of Attention.

YouTube Lecture: check out my lectures on:

  • TBD

These lectures are all part of my Machine Learning Course on YouTube with linked well-documented Python workflows and interactive dashboards. My goal is to share accessible, actionable, and repeatable educational content. If you want to know about my motivation, check out Michael’s Story.

Motivation and Why Attention?#

In many machine learning models, we need to determine which parts of the input matter most for a prediction.

Traditional approaches often treat inputs more uniformly or rely primarily on local relationships. This can become limiting when,

  • dependencies are long-range

  • relevance depends on context

  • the system contains many interacting components

Before attention mechanisms, many machine learning workflows relied on fixed local neighborhoods or sequential processing. For example,

  • convolutional approaches focus on nearby local patterns

  • recurrent approaches process information sequentially and may struggle to retain long-range relationships

In many real-world problems, important relationships may occur between locations, observations, or events that are far apart in space, depth, or time.

Attention addresses this limitation by allowing the model to dynamically determine,

  • what information is relevant?

  • where is the relevant information located?

  • how strongly should each piece of information influence the prediction?

Instead of only using nearby information, attention allows each query to compare itself against many candidate patterns and retrieve the most relevant information through weighted similarity.

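This makes attention flexible for capturing larger-scale relationships and complex patterns. Because the comparisons reduce to matrix products, they can also be computed in parallel, although the cost grows with the number of query and key pairs.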

Attention Concepts#

The basic building blocks or concepts for attention include,

  • Query (Q): what I am looking for

  • Key (K): what each candidate piece of information offers (a descriptor of stored information)

  • Value (V): the actual information content associated with each key

Each key–value pair represents a stored piece of information, where the key describes it and the value contains the content.

Then the conceptual mechanism is,

  • similarity between Query (Q) and Keys (K) \(\rightarrow\) measures compatibility and produces relevance scores

  • scores \(\rightarrow\) normalized (via softmax) into weights representing relative importance

  • weighted sum of Values (V) \(\rightarrow\) produces a context-aware output representation

We can state this qualitatively as, attention is a soft lookup mechanism over a set of stored information.

Now let’s translate these concepts into geoscience to improve our connection to subsurface resource modeling,

  • Query (Q) — “current geological context”

  • Key (K) — “historical analog descriptors”

  • Value (V) — “stored reservoir properties (e.g., permeability)”

So attention becomes a data-driven analog retrieval system. Now we put this all together by walking through the attention workflow with words, no math yet!

Attention Workflow Conceptual Description#

Let’s walk through the fundamental attention-based information flow. Dot-product similarity measures the alignment between the Query (Q) and each Key (K) by multiplying corresponding components and summing the result. Larger dot products indicate stronger similarity and greater compatibility between patterns,

\[ \text{Query (Q)} \cdot \text{Key (K)} = \text{Similarity} \]

The similarity calculation produces relevance scores that quantify how strongly each stored pattern matches the current query,

\[ \text{Similarity} \rightarrow \text{Scores} \]

Softmax normalization enforces closure by constraining the attention weights to be non-negative and to sum to 1.0. This produces a convex weighted combination of the values, so the prediction stays within the range of the stored values,

\[ \text{Softmax / Normalization(Scores)} = \text{Weights} \]

The attention weights are applied to the values to form a weighted combination that produces the final prediction or context-aware output. Larger weights contribute more strongly to the prediction, allowing the model to dynamically retrieve and combine the most relevant stored information for the current query,

\[ \sum \text{Weights} \times \text{Values (V)} = \text{Prediction} \]
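The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the workflow used later in the chapter, and the query, keys, and values are hypothetical toy numbers chosen only to make the flow visible.

```python
import numpy as np

# hypothetical toy numbers, chosen only for illustration
query = np.array([1.0, 0.0])                  # Q: what we are looking for
keys = np.array([[1.0, 0.0],                  # K: one descriptor per stored item
                 [0.0, 1.0],
                 [0.7, 0.7]])
values = np.array([10.0, 20.0, 30.0])         # V: content paired with each key

scores = keys @ query                         # dot-product similarity, one score per key
weights = np.exp(scores - scores.max())       # softmax numerator, shifted for stability
weights = weights / weights.sum()             # normalize so the weights sum to 1.0
prediction = np.sum(weights * values)         # weighted combination of the values

print("scores:    ", scores)
print("weights:   ", np.round(weights, 3))
print("prediction:", round(float(prediction), 3))
```

The first key is identical to the query, so it receives the largest weight, but the other two values still contribute. This is what makes the lookup "soft".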

Vector and Matrix Math Reminder#

Matrix notation is written as:

\[ \mathbf{A}_{r \times c} \]

where:

  • \(r\) = number of rows

  • \(c\) = number of columns

So \(\mathbf{A}_{r \times c}\) means a matrix with \(r\) rows and \(c\) columns.

We use the same general notation for vectors. The following is a row vector of length \(c\),

\[ \mathbf{A}_{1 \times c} \]

and the following is a column vector of length, \(r\),

\[ \mathbf{A}_{r \times 1} \]

For matrix multiplication,

\[ \mathbf{A}_{m \times n} \mathbf{B}_{n \times p} = \mathbf{C}_{m \times p} \]

The inner dimensions must match,

\[ (n = n) \]

and the resulting matrix takes the outer dimensions,

\[ (m \times p) \]

For example,

\[ \mathbf{A}_{2 \times 3} \mathbf{B}_{3 \times 4} = \mathbf{C}_{2 \times 4} \]

Expanded,

\[\begin{split} \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} & b_{13} & b_{14} \\ b_{21} & b_{22} & b_{23} & b_{24} \\ b_{31} & b_{32} & b_{33} & b_{34} \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \end{bmatrix} \end{split}\]

Each element of \(\mathbf{C}\) is calculated as a dot product between a row of \(\mathbf{A}\) and a column of \(\mathbf{B}\),

\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \]

For the \(2 \times 3\) by \(3 \times 4\) example above, with \(n = 3\), the entries are:

\[ c_{11} = a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} \]
\[ c_{12} = a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} \]
\[ c_{13} = a_{11}b_{13} + a_{12}b_{23} + a_{13}b_{33} \]
\[ c_{14} = a_{11}b_{14} + a_{12}b_{24} + a_{13}b_{34} \]
\[ c_{21} = a_{21}b_{11} + a_{22}b_{21} + a_{23}b_{31} \]
\[ c_{22} = a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32} \]
\[ c_{23} = a_{21}b_{13} + a_{22}b_{23} + a_{23}b_{33} \]
\[ c_{24} = a_{21}b_{14} + a_{22}b_{24} + a_{23}b_{34} \]

With this vector and matrix notation you will be able to follow the mathematical formulation for attention.
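As a quick check of the notation, the sketch below multiplies a \(2 \times 3\) matrix by a \(3 \times 4\) matrix with NumPy and rebuilds the result entry by entry from \(c_{ij} = \sum_k a_{ik} b_{kj}\). The matrix entries are arbitrary integers chosen for illustration.

```python
import numpy as np

A = np.arange(1.0, 7.0).reshape(2, 3)         # 2 x 3 matrix with entries 1..6
B = np.arange(1.0, 13.0).reshape(3, 4)        # 3 x 4 matrix with entries 1..12
C = A @ B                                     # inner dimensions (3 and 3) match

# rebuild C entry by entry with c_ij = sum_k a_ik * b_kj
C_loop = np.zeros((2, 4))
for i in range(2):
    for j in range(4):
        C_loop[i, j] = sum(A[i, k] * B[k, j] for k in range(3))

print(C.shape)                                # outer dimensions, (2, 4)
print(np.allclose(C, C_loop))                 # both routes give the same matrix
```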

Attention Workflow Mathematical Formulation#

Now let’s revisit the same attention workflow, but this time with the full mathematical formulation and notation, including the dimensions of every vector and matrix so the structure is explicit and easy to follow.

Suppose we have,

  • \(N\) stored patterns or observations

  • each pattern or observation is a feature vector of size \(d\)

  • associated scalar or vector values to estimate or retrieve

Query Vector#

The query vector represents the current pattern or context that we are trying to match against stored information.

\[ \mathbf{Q}_{1 \times d} = \begin{bmatrix} q_1 & q_2 & \cdots & q_d \end{bmatrix} \]

where,

  • \(d\) = number of features in the query pattern

  • \(q_j\) = feature value \(j\)

In this formulation, the query represents the prediction location, while the key–value pairs, defined below, represent information from previously observed locations used for inference.

Key Matrix#

The key matrix contains all candidate stored patterns, which may be derived from data, physics-based models, analogs, etc.

\[\begin{split} \mathbf{K}_{N \times d} = \begin{bmatrix} k_{11} & k_{12} & \cdots & k_{1d} \\ k_{21} & k_{22} & \cdots & k_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ k_{N1} & k_{N2} & \cdots & k_{Nd} \end{bmatrix} \end{split}\]

where \(N\) is the number of stored patterns, row \(i\) represents stored pattern \(i\), and each row is paired with an associated value.

Value Vector#

The value vector contains the information associated with each stored key.

\[\begin{split} \mathbf{V}_{N \times 1} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_N \end{bmatrix} \end{split}\]

where \(v_i\) is the value associated with key \(i\)

Dot-Product Similarity#

Attention begins by measuring similarity between the query and every key.

For a single key,

\[ s_i = \mathbf{Q}\mathbf{K}_i^T = \sum_{j=1}^{d} q_j k_{ij} \]

where \(s_i\) is the similarity score for stored pattern \(i\), and larger values indicate stronger alignment between the query and key.

Using all keys simultaneously,

\[ \mathbf{S}_{1 \times N} = \mathbf{Q}_{1 \times d} \left( \mathbf{K}^T \right)_{d \times N} \]

In scaled dot-product attention,

\[ \mathbf{S} = \alpha \mathbf{Q}\mathbf{K}^T \]

where \(\alpha\) is the scaling factor (inverse temperature), and larger \(\alpha\) sharpens the attention weights toward the best-matching keys. A common choice in transformer models is \(\alpha = 1/\sqrt{d}\).
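The dimensions can be verified numerically. The sketch below uses random numbers as hypothetical stand-ins for a query and four keys, confirms that the matrix product matches the per-key sum, and applies \(\alpha = 1/\sqrt{d}\) as one example of a scaling factor.

```python
import numpy as np

rng = np.random.default_rng(13)               # seeded for repeatability
N, d = 4, 3                                   # hypothetical sizes
Q = rng.normal(size=(1, d))                   # query, 1 x d
K = rng.normal(size=(N, d))                   # keys, N x d

S = Q @ K.T                                   # (1 x d)(d x N) -> 1 x N scores
s_loop = np.array([np.sum(Q[0] * K[i]) for i in range(N)])  # s_i = sum_j q_j k_ij

alpha = 1.0 / np.sqrt(d)                      # one common choice of scaling factor
S_scaled = alpha * S                          # scaled dot-product scores

print(S.shape)                                # (1, 4)
print(np.allclose(S[0], s_loop))              # matrix form matches the per-key sums
```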

Similarity Scores#

The similarity operation produces a score for every stored pattern,

\[ \mathbf{S}_{1 \times N} = \begin{bmatrix} s_1 & s_2 & \cdots & s_N \end{bmatrix} \]

These scores quantify the compatibility between the current query and each stored pattern.

Softmax Normalization#

The similarity scores are transformed into attention weights using softmax normalization. For each stored pattern,

\[ w_i = \frac{\exp(s_i)} {\sum_{k=1}^{N} \exp(s_k)} \]

Collectively,

\[ \mathbf{W}_{1 \times N} = \text{softmax}(\mathbf{S}) \]

with properties,

\[ 0 \leq w_i \leq 1 \]

and,

\[ \sum_{i=1}^{N} w_i = 1 \]

Thus, the weights form a probability-like distribution that satisfies closure, once again producing a convex weighted combination of the values, so the prediction is bounded by the smallest and largest stored values.

Large similarity scores produce larger attention weights and greater influence on the final prediction.
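To see how the scaling factor interacts with softmax, the sketch below applies three values of \(\alpha\) to the same scores. The scores and the \(\alpha\) values are hypothetical and chosen only to show the trend from nearly uniform weights to nearly one-hot weights.

```python
import numpy as np

def softmax(s):                               # softmax with a shift for numerical stability
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

s = np.array([2.0, 1.0, 0.0])                 # hypothetical similarity scores
weights_by_alpha = {}
for alpha in (0.1, 1.0, 10.0):                # small alpha flattens, large alpha sharpens
    w = softmax(alpha * s)
    weights_by_alpha[alpha] = w
    print(f"alpha = {alpha:5.1f}  weights = {np.round(w, 3)}  sum = {w.sum():.3f}")
```

In every case the weights sum to 1.0; only their spread changes.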

Attention Prediction#

The final prediction is obtained through a weighted combination of the stored values,

\[ \hat{y} = \sum_{i=1}^{N} w_i v_i \]

or equivalently,

\[ \hat{y} = \mathbf{W}_{1 \times N} \mathbf{V}_{N \times 1} \]

where \(\hat{y}\) is the predicted output or context-aware representation, and larger attention weights contribute more strongly to the prediction.

Again, because the weights sum to 1.0, the prediction is a convex weighted combination of the stored values.
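A minimal sketch of the full formulation in matrix form follows, using the shapes given above. The function name `attention_predict` and the numbers are assumptions made for illustration and are not part of the chapter's function set.

```python
import numpy as np

def attention_predict(Q, K, V, alpha=1.0):    # hypothetical helper for this sketch
    S = alpha * (Q @ K.T)                     # scores, 1 x N
    S = S - S.max()                           # shift for numerical stability
    W = np.exp(S) / np.exp(S).sum()           # softmax weights, 1 x N
    y_hat = (W @ V).item()                    # (1 x N)(N x 1) -> scalar prediction
    return W, y_hat

Q = np.array([[0.5, 1.0]])                    # query, 1 x d with d = 2
K = np.array([[0.4, 0.9],                     # keys, N x d with N = 3
              [1.0, -0.5],
              [-0.3, 0.2]])
V = np.array([[2.0], [5.0], [9.0]])           # values, N x 1

W, y_hat = attention_predict(Q, K, V, alpha=2.0)
print(np.round(W, 3), round(y_hat, 3))
print(V.min() <= y_hat <= V.max())            # convex combination stays in range
```

One consequence of the convex combination is that the prediction cannot extrapolate beyond the smallest or largest stored value.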

Interpretation#

Attention dynamically retrieves and combines information according to similarity between the query and stored patterns.

Qualitatively,

  • attention acts as a soft lookup mechanism over a set of stored information.

I demonstrate attention with two examples,

  • prediction given a set of analogs in a data table - using similarity of a single value

  • prediction of a missing well log with an available common well log - using magnitude and shape

Import Required Packages#

We will also need some standard packages. These should have been installed with Anaconda 3.

suppress_warnings = True                                      # toggle to suppress warnings
import os                                                     # to set current working directory 
import math                                                   # square root operator
import numpy as np                                            # arrays and matrix math
import pandas as pd                                           # DataFrames
import matplotlib.pyplot as plt                               # for plotting
import matplotlib.patches as patches                          # draw neural network nodes
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator, AutoLocator) # control of axes ticks
plt.rc('axes', axisbelow=True)                                # grid behind plotting elements
if suppress_warnings:
    import warnings                                           # suppress any warnings for this demonstration
    warnings.filterwarnings('ignore')
seed = 13                                                     # random number seed for workflow repeatability

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing ‘python -m pip install [package-name]’. More assistance is available with the respective package docs.

Declare Functions#

Here are the fundamental functions to apply attention for both of our prediction demonstrations and to visualize the results,

  • build_keys - retrieve the keys that represent geological descriptors from a DataFrame

  • build_values - retrieve the values that represent the target property

  • build_query - set the query for the target location descriptor

  • similarity_basic - compute similarity between query and each key, simple negative squared distance

  • similarity - similarity with temperature control for more sensitivity, simple negative squared distance

  • dot_similarity - scaled dot-product similarity (attention-style) with temperature control

  • dot_similarity_norm - scaled dot-product similarity (attention-style) normalized to reduce magnitude bias with temperature control

  • softmax - softmax normalization

  • attention - attention mechanism, compute attention weights and weighted output

  • dot_attention - dot-product attention mechanism, compute attention weights and weighted output

  • dot_attention_norm - dot-product attention mechanism normalized, compute attention weights and weighted output

  • attention_entropy - calculate entropy of all attention weights for a query

  • plot_well_log_data - plot well log training and query

  • plot_attention_prediction_log - plot well log training, query and predictions

  • plot_attention_provenance - plot query window and keys with attention weights

  • plot_entropy_log - plot the well 2 query log and the attention entropy for each query

  • plot_full_attention_matrix - plot the entire attention matrix as a heat map

def build_keys(df):                                           # keys represent geological descriptors, simple 1D feature (porosity)
    return df[["Porosity"]].values

def build_values(df):                                         # values represent target property, log-permeability 
    return df["Log_Permeability"].values

def build_query(porosity_value):                              # query is the target location descriptor
    return np.array([porosity_value])

def similarity_basic(Q, K):                                   # similarity between query and each key, simple negative squared distance (geology-friendly alternative to the dot product)
    return -np.sum((K - Q) ** 2, axis=1)

def similarity(Q, K, scale=50.0):
    return scale * (-np.sum((K - Q) ** 2, axis=1))            # similarity with temperature control for more sensitivity

def dot_similarity(Q, K, scale=10.0):                         # scaled dot-product similarity (attention-style) with temperature control
    scores = K @ Q
    return scale * scores

def dot_similarity_norm(Q, K, scale=10.0):                    # dot similarity normalized to reduce magnitude bias
    Qn = Q / (np.linalg.norm(Q) + 1e-12)
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + 1e-12)
    scores = Kn @ Qn
    return scale * scores

def softmax(x):                                               # softmax normalization
    x = x - np.max(x)                                         # numerical stability
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)

def attention(Q, K, V,scale):                                 # attention mechanism, compute attention weights and weighted output.
    scores = similarity(Q, K,scale)
    weights = softmax(scores)
    output = np.sum(weights * V)
    return weights, scores, output

def dot_attention(Q, K, V, scale=10.0):                       # dot-product attention mechanism.
    scores = dot_similarity(Q, K, scale)
    weights = softmax(scores)
    output = np.sum(weights * V)
    return weights, scores, output

def dot_attention_norm(Q, K, V, scale=10.0):                  # dot-product attention mechanism normalized
    scores = dot_similarity_norm(Q, K, scale)
    weights = softmax(scores)
    output = np.sum(weights * V)
    return weights, scores, output

def attention_entropy(weights):                               # calculate entropy of all attention weights for a query
    eps = 1e-12
    w = np.array(weights)
    return -np.sum(w * np.log(w + eps))

def plot_well_log_data(well1,well2):                          # plot the well log data
    n = len(well1); depth = well1["depth_index"]
    fig, axes = plt.subplots(1, 2, figsize=(10, 6), sharey=True)
    ax1 = axes[0]
    
    ax1.plot(well1["phi"], depth, color="tab:blue",label='Porosity')
    ax1.scatter(well1["phi"], depth,color="tab:blue",edgecolor='black',s=20,zorder=10)
    ax1.set_xlabel("Porosity"); ax1.set_ylabel("Depth Index")
    ax1.set_title("Well 1 (Training)"); ax1.invert_yaxis(); plt.ylim([40,0])
    
    ax1.xaxis.set_minor_locator(AutoMinorLocator()); ax1.yaxis.set_minor_locator(AutoMinorLocator())
    ax1.grid(True, which="major", linestyle="-", linewidth=0.8, alpha=0.6)
    ax1.grid(True, which="minor", linestyle=":", linewidth=0.5, alpha=0.4)
    
    ax1b = ax1.twiny()
    ax1b.plot(well1["logk"], depth, color="tab:red",label='Log Permeability')
    ax1b.scatter(well1["logk"], depth,color="tab:red",edgecolor='black',s=20,zorder=10)
    ax1b.set_xlabel("Log-Permeability"); ax1.legend(loc='upper left')
    ax1b.xaxis.set_minor_locator(AutoMinorLocator()); ax1b.grid(False)
    
    ax2 = axes[1]
    ax2.plot(well2["phi"], depth,color="tab:blue",label='Porosity')
    ax2.scatter(well2["phi"], depth,color="tab:blue",edgecolor='black',s=20,zorder=10)
    ax2.set_xlabel("Porosity"); ax2.set_title("Well 2 (Target)")
    ax2.invert_yaxis(); plt.ylim([40,0]); ax2.legend(loc='upper left')
    ax2.xaxis.set_minor_locator(AutoMinorLocator())
    ax2.yaxis.set_minor_locator(AutoMinorLocator())
    ax2.grid(True, which="major", linestyle="-", linewidth=0.8, alpha=0.6)
    ax2.grid(True, which="minor", linestyle=":", linewidth=0.5, alpha=0.4)
    
    ax1.set_ylim([40,0]); plt.tight_layout()
    plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.5, wspace=0.2, hspace=0.6); plt.show()

def plot_attention_prediction_log(well1,well2):               # plot training, query and predictions 
    depth_full = np.arange(len(well2))  
    fig, axes = plt.subplots(1, 2, figsize=(10, 6), sharey=True)               
    ax1 = axes[0]                                             # plot training data 
    ax1.plot(well1["phi"], depth_full, color="tab:blue",label='Porosity')
    ax1.scatter(well1["phi"], depth_full,color="tab:blue",edgecolor='black',s=20,zorder=10)
    ax1.set_xlabel("Porosity"); ax1.set_ylabel("Depth Index")
    ax1.set_title("Well 1 (Training)"); ax1.invert_yaxis(); plt.ylim([40,0])
    
    ax1.xaxis.set_minor_locator(AutoMinorLocator()); ax1.yaxis.set_minor_locator(AutoMinorLocator())
    ax1.grid(True, which="major", linestyle="-", linewidth=0.8, alpha=0.6)
    ax1.grid(True, which="minor", linestyle=":", linewidth=0.5, alpha=0.4)
    
    ax1b = ax1.twiny()
    ax1b.plot(well1["logk"], depth_full, color="tab:red",label='Log Permeability')
    ax1b.scatter(well1["logk"], depth_full,color="tab:red",edgecolor='black',s=20,zorder=10)
    ax1b.set_xlabel("Log-Permeability")
    ax1b.xaxis.set_minor_locator(AutoMinorLocator()); ax1b.grid(False)
    
    handles1, labels1 = ax1.get_legend_handles_labels()
    handles2, labels2 = ax1b.get_legend_handles_labels()
    ax1.legend(handles1 + handles2,labels1 + labels2,loc="upper left")

    ax2 = axes[1]                                             # plot queries and predictions
    ax2.plot(well2["phi"], depth_full,color="tab:blue",label='Porosity')
    ax2.scatter(well2["phi"], depth_full,color="tab:blue",edgecolor='black',s=20,zorder=10)
    ax2.set_xlabel("Porosity"); ax2.set_title("Well 2 (Target)")
    ax2.invert_yaxis(); plt.ylim([40,0]); ax2.legend(loc='upper left')
    ax2.xaxis.set_minor_locator(AutoMinorLocator())
    ax2.yaxis.set_minor_locator(AutoMinorLocator())
    ax2.grid(True, which="major", linestyle="-", linewidth=0.8, alpha=0.6)
    ax2.grid(True, which="minor", linestyle=":", linewidth=0.5, alpha=0.4)
    ax2b = ax2.twiny()                                        # note: well2_attn below is read from the global scope, not passed as an argument
    ax2b.plot(well2_attn["pred_logk"], well2_attn["depth_index"],alpha=0.2,color="tab:red",label='Predicted Log Permeability')
    ax2b.scatter(well2_attn["pred_logk"], well2_attn["depth_index"],color="tab:red",edgecolor='black',s=20,zorder=10)
    
    handles1, labels1 = ax2.get_legend_handles_labels(); handles2, labels2 = ax2b.get_legend_handles_labels()
    ax2.legend(handles1 + handles2,labels1 + labels2,loc="upper left")

    plt.tight_layout()
    plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.5, wspace=0.2, hspace=0.6); plt.show()

def plot_attention_provenance(well1,well2,all_weights,Q2,depth_index,window=3): # plot query window and memory bank with attention weights
    half = window // 2
    q_idx = depth_index - half
    if q_idx < 0 or q_idx >= len(all_weights):
        raise ValueError("depth_index outside valid window range")
    weights = all_weights[q_idx];query = np.asarray(Q2[q_idx]).ravel()
    depth = np.arange(len(well1))

    fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharey=False) 

    ax0 = axes[0]                                             # plot well 2 with query window
    ax0.plot(well2["phi"].values,depth,color="tab:blue",linewidth=2,label="Well 2 Porosity")
    ax0.plot(query,get_query_depth_window(depth_index, window),marker="o",color="black",
        linewidth=3,label="Query Window",zorder=10)
    ax0.set_title(f"Well 2 Query (Depth {depth_index})"); ax0.set_xlabel("Porosity"); ax0.set_ylabel("Depth")
    ax0.set_ylim(40, 0); ax0.grid(True, alpha=0.5); ax0.legend(loc="lower right")

    ax1 = axes[1]                                             # plot well 1 (memory bank) with attention weights
    ax1.plot(well1["phi"].values,depth,color="tab:blue",linewidth=2,label="Well 1 Porosity")
    ax1.set_title("Well 1 Memory Bank"); ax1.set_xlabel("Porosity"); ax1.set_ylim(40, 0); ax1.grid(True, alpha=0.5)

    ax1b = ax1.twiny()
    ax1b.barh(np.arange(len(weights)) + window//2,weights,color="tab:red",alpha=0.35,height=0.8,label='attention weights')

    ax1b.set_xlabel("Attention Weight"); ax1b.set_xlim(0, np.max(weights) * 1.1); ax1b.set_ylim(40, 0)
    
    h1, l1 = ax1.get_legend_handles_labels(); h2, l2 = ax1b.get_legend_handles_labels()

    ax1.legend(h1 + h2, l1 + l2, loc="lower right")

    plt.tight_layout()
    plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.5, wspace=0.2, hspace=0.6); plt.show()

def plot_entropy_log(well2,all_weights):                      # plot well 2 query and attention entropy
    n = len(all_weights)
    entropy = np.array([                                      # calculate entropy per depth (windowed attention)
        attention_entropy(all_weights[i])
        for i in range(n)
    ])
    depth_attn = np.arange(n) + 1                             # calculate window centers for plotting (offset of 1 assumes window=3)
    depth_full = np.arange(len(well2))  

    fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharey=True)
    ax0 = axes[0]                                             # plot well 2 query
    ax0.plot(well2["phi"].values,depth_full,color="tab:blue",linewidth=2,label="Well 2 Porosity")
    ax0.set_title("Well 2 Porosity Log"); ax0.set_xlabel("Porosity"); ax0.set_ylabel("Depth Index")
    ax0.set_ylim(0, 40); ax0.grid(True, alpha=0.5); ax0.legend(loc="lower left")

    ax1 = axes[1]                                             # plot entropy log
    ax1.barh(depth_attn,entropy,color="red",alpha=0.4,height=0.8,label="Attention Entropy")
    ax1.set_title("Attention Entropy vs Depth"); ax1.set_xlabel("Entropy (Uncertainty)"); ax1.set_ylabel("Depth Index")
    ax1.set_ylim(0, 40); ax1.set_xlim(2.0,4.0); ax1.grid(True, alpha=0.5); ax1.legend(loc="lower right")

    plt.tight_layout()
    plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.5, wspace=0.2, hspace=0.6); plt.show()

def plot_full_attention_matrix(all_weights):                  # plot the entire attention matrix as a heat map
    attention_matrix = np.vstack(all_weights)
        
    fig, ax = plt.subplots(figsize=(8, 7))

    im = ax.imshow(attention_matrix,origin="upper",aspect="auto") # plot attention heat map
    ax.set_title("Attention Weight Matrix"); ax.set_xlabel("Well 1 Window Index (Keys)")
    ax.set_ylabel("Well 2 Window Index (Queries)")
    ax.set_xticks(np.arange(0, attention_matrix.shape[1], 5))
    ax.set_yticks(np.arange(0, attention_matrix.shape[0], 5))
    ax.set_xticks(np.arange(-0.5, attention_matrix.shape[1], 1), minor=True)
    ax.set_yticks(np.arange(-0.5, attention_matrix.shape[0], 1), minor=True)
    ax.grid(which="minor", color="white", linestyle="-", linewidth=0.2)
    ax.tick_params(which="minor", bottom=False, left=False)
    cbar = plt.colorbar(im, ax=ax)
    cbar.set_label("Attention Weight")
    
    plt.tight_layout()
    plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.8, wspace=0.2, hspace=0.6); plt.show()

def comma_format(x, pos):                                     # comma notation on axes
    return f'{int(x):,}'

def align_values(x, window=3):                                # align scalar values to window centers.
    half = window // 2
    return x[half:len(x) - half]

def center_crop(x, window):                                   # crop a 1D array to the window centers (same operation as align_values)
    half = window // 2
    return x[half:len(x) - half]

def get_query_depth_window(depth_index, window):              # return the depth indices covered by the query window.
    half = window // 2
    return np.arange(depth_index - half, depth_index + half + 1)

def build_windows(x, window=3):                               # build sliding windows of a 1D log.
    n = len(x)
    half = window // 2
    X = []
    for i in range(half, n - half):
        X.append(x[i - half:i + half + 1])
    return np.array(X)

def add_grid():                                               # add grid lines to plot
    plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids
    plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)
    plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks 

def add_grid2(sub_plot):
    sub_plot.grid(True, which='major',linewidth = 1.0); sub_plot.grid(True, which='minor',linewidth = 0.2) # add y grids
    sub_plot.tick_params(which='major',length=7); sub_plot.tick_params(which='minor', length=4)
    sub_plot.xaxis.set_minor_locator(AutoMinorLocator()); sub_plot.yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks
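Two of the helper functions above are easier to interpret with small inputs. The sketch below repeats compact forms of `attention_entropy` and `build_windows` so that it runs on its own, then checks the entropy limits (close to \(\ln N\) for uniform weights and close to 0 for one-hot weights) and the shape of the sliding windows. The inputs are hypothetical.

```python
import numpy as np

def attention_entropy(weights):               # same form as the function declared above
    eps = 1e-12
    w = np.array(weights)
    return -np.sum(w * np.log(w + eps))

def build_windows(x, window=3):               # compact equivalent of the function declared above
    half = window // 2
    return np.array([x[i - half:i + half + 1] for i in range(half, len(x) - half)])

N = 5
H_uniform = attention_entropy(np.full(N, 1.0 / N))                 # attention spread evenly
H_onehot = attention_entropy(np.array([1.0, 0.0, 0.0, 0.0, 0.0]))  # attention on one key
print(H_uniform, np.log(N))                   # uniform weights give entropy near ln(N)
print(H_onehot)                               # one-hot weights give entropy near 0

x = np.arange(6.0)                            # hypothetical 1D log with 6 samples
X = build_windows(x, window=3)
print(X.shape)                                # (4, 3): two edge samples have no full window
print(X[0], X[-1])                            # first and last windows
```

High entropy therefore indicates that no stored pattern stands out as a clear match, which is why the chapter labels it as uncertainty.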

Analog Data Table Attention Demonstration#

Let’s develop the simplest possible demonstration of attention. We consider an analog dataset consisting of 5 reservoir samples. For each sample we have:

  1. Sample index

  2. Lithofacies

  3. Depositional system

  4. Porosity

  5. Permeability

  6. Natural log-transformed permeability

To keep the model as simple as possible, we assume that at a new location we only observe porosity, and we wish to estimate the corresponding permeability.

We define,

  • Query (Q) — the porosity value at the location where we want to make a prediction

  • Keys (K) — the set of porosity values in the analog dataset (feature representations of stored samples)

  • Values (V) — the corresponding permeability values associated with each analog sample

Attention then provides a way to determine which stored analogs are most relevant to the query by comparing similarity in feature space and using this to weight their contributions to the prediction.

In this sense, attention answers the question,

  • which past geological analogs are most relevant for predicting the property at this location?

Subsurface Analog Dataset (Attention Keys & Values)#

Now we define the analog dataset as a small set of geological examples representing different depositional environments. Each sample provides a key representation (porosity feature) and an associated value (permeability property) used by the attention mechanism.

  • Keys (K) are used to compute similarity with the query, keys determine “where to look”.

  • Values (V) are used to build the final weighted output (prediction), values determine “what you retrieve”.

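| ID | Lithofacies | Depositional System | Porosity (\(\phi\)) | Permeability (\(mD\)) | Log-Permeability (\(\ln(mD)\)) |
|----|-------------|---------------------|---------------------|-----------------------|--------------------------------|
| 1 | Channel sand | High-energy fluvial channel | 0.26 | 800 | 6.68 |
| 2 | Levee sand | Overbank levee deposit | 0.20 | 200 | 5.30 |
| 3 | Shale | Low-energy mudstone | 0.08 | 5 | 1.61 |
| 4 | Fractured zone | Structural damage zone | 0.18 | 1200 | 7.09 |
| 5 | Mixed facies | Transitional depositional mix | 0.16 | 300 | 5.70 |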

In this formulation,

  • key for each sample is the porosity value (or porosity feature representation)

  • value is the associated permeability (or log-permeability)

  • each row represents a stored geological analog in the attention “memory bank”

This dataset forms the retrieval space over which attention computes similarity between a query location and stored geological examples.

data = {                                                      # build geological analog dictionary
    "ID": [1, 2, 3, 4, 5],
    "Geological_Analog": ["Channel sand","Levee sand","Shale","Fractured zone","Mixed facies"],
    "Facies_Type": ["High-energy fluvial channel","Overbank levee deposit","Low-energy mudstone",
        "Structural damage zone","Transitional depositional mix"],
    "Porosity": [0.26, 0.20, 0.08, 0.18, 0.16],
    "Permeability_mD": [800, 200, 5, 1200, 300]}
df = pd.DataFrame(data)                                       # convert dictionary into a DataFrame
df["Log_Permeability"] = np.log(df["Permeability_mD"])        # compute and add log-permeability (natural log)
df.head()                                                     # display the DataFrame
   ID Geological_Analog                    Facies_Type  Porosity  Permeability_mD  Log_Permeability
0   1      Channel sand    High-energy fluvial channel      0.26              800          6.684612
1   2        Levee sand         Overbank levee deposit      0.20              200          5.298317
2   3             Shale            Low-energy mudstone      0.08                5          1.609438
3   4    Fractured zone         Structural damage zone      0.18             1200          7.090077
4   5      Mixed facies  Transitional depositional mix      0.16              300          5.703782

Query (Target Location)#

We define a target subsurface location where we want to estimate permeability,

| Query ID | Porosity (ϕ) | Geological Interpretation |
|----------|--------------|---------------------------|
| Q | 0.19 | Ambiguous sand-rich interval with moderate reservoir quality |

Attention Interpretation#

In the attention framework:

  • Keys (K): geological descriptors of each analog (only porosity for simplicity)

  • Values (V): permeability (or log-permeability)

  • Query (Q): target location to be estimated (only a single porosity value for simplicity)

The goal is to compute,

  • which geological analogs are most similar to the query and should therefore contribute most strongly to the permeability estimate?

We will compute,

  1. Similarity between query and each analog (Key–Query comparison)

  2. Attention weights (softmax normalization)

  3. Weighted permeability estimate

This will produce a data-driven geological averaging system, where,

  • permeability is inferred as a weighted mixture of geologically similar analogs

Attention-Based Permeability Prediction Workflow Steps#

The workflow below applies a very simple attention workflow to estimate permeability from geological analogs. The workflow steps are,

  1. Build keys \(K\) from geological descriptors

  2. Build values \(V\) from permeability responses

  3. Define a query \(Q\) for the target location

  4. Compute similarity scores between \(Q\) and \(K\)

  5. Normalize similarity scores into attention weights

  6. Predict permeability from the weighted combination of values

Build Keys \(K\)#

Keys represent the geological descriptor space used to compare analog similarity.

In this simple demonstration, keys are constructed from porosity values,

\[ K_i = [\phi_i] \]

where \(K_i\) is the key vector for analog \(i = 1, \ldots, 5\), and \(\phi_i\) is the porosity of analog \(i\). For our simple problem, each key, \(K_i\), collapses to a scalar value, the porosity of analog \(i\).

Build Values \(V\)#

Values represent the quantities combined by the attention mechanism to produce the final prediction. Here, values are log-permeability,

\[ V_i = \ln(k_i) \]

where \(V_i\) is the value for analog \(i\), and \(k_i\) is the permeability of analog \(i\).

Build Query \(Q\)#

The query represents the target location where permeability will be estimated, described with the same feature(s) as the keys \(K_i\),

\[ Q = [\phi_{target}] \]

where \(Q\) is the query vector, and \(\phi_{target}\) is the target porosity.

Compute Similarity Scores#

Similarity scores measure geological affinity between the query and each analog.

In many machine learning attention models, similarity is computed with a dot product,

\[ s_i = Q \cdot K_i \]

However, for this simple geological demonstration, we instead use negative squared distance,

\[ s_i = -\alpha ||K_i - Q||^2 \]

where \(s_i\) is the similarity score for analog \(i\), \(\alpha\) is the scaling (inverse temperature) factor, and \(||K_i - Q||^2\) is the squared mismatch between the query and analog.

This formulation is convenient because it directly measures geological dissimilarity,

  • small distance \(\rightarrow\) high similarity

  • large distance \(\rightarrow\) low similarity

The negative sign converts mismatch into similarity, so analogs closer to the query receive larger attention weights after softmax normalization.

  • higher scores (less negative values) therefore indicate stronger geological similarity.

Dot Product vs. Distance-Based Similarity in the Scalar Query and Key Case#

It is instructive to compare negative squared distance with the usual dot product for similarity scores in the scalar feature case. For this simple demonstration, each key and the query is a single scalar feature (porosity only). Note that in this subsection \(k_i\) denotes the scalar key (the porosity of analog \(i\)), not permeability, and \(q\) denotes the query porosity,

\[ K_i = [k_i], \quad Q = [q] \]

In this case, the dot-product similarity reduces to,

\[ s_i = Q \cdot K_i = q \, k_i \]

So similarity is driven purely by the product of magnitudes of the query and the stored analog,

  • there is no notion of pattern shape or structure

  • similarity is determined only by co-magnitude (how large or small the values are together)

  • the result does not explicitly measure distance or proximity between geological states

As a result, dot-product similarity in one dimension is less intuitive as a measure of geological affinity, since it does not directly penalize mismatch between values.
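A small numerical check can make this concrete. The sketch below uses the five analog porosities from the table above and the query porosity of 0.19; the variable names are illustrative and are not the chapter's helper functions, and \(\alpha = 1\) is assumed for the distance score.

```python
import numpy as np

# analog porosities (scalar keys) from the table above and the query porosity
keys = np.array([0.26, 0.20, 0.08, 0.18, 0.16])
query = 0.19

dot_scores = query * keys                  # scalar dot-product similarity, q * k_i
dist_scores = -(keys - query) ** 2         # negative squared distance with alpha = 1

best_dot = int(np.argmax(dot_scores))      # analog favored by the dot product
best_dist = int(np.argmax(dist_scores))    # analog favored by distance

print("dot product favors porosity:", keys[best_dot])
print("distance favors porosity   :", keys[best_dist])
```

The dot product ranks the analogs by porosity magnitude alone, so the highest-porosity analog (0.26) always scores highest for any positive query. The distance-based score instead favors the analogs nearest the query; here porosities 0.20 and 0.18 are equidistant from 0.19.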

To obtain a more interpretable notion of geological similarity, we instead use a distance-based formulation,

\[ s_i = -\alpha \|K_i - Q\|^2 \]

In the scalar case,

\[ s_i = -\alpha (k_i - q)^2 \]

where \(s_i\) is the similarity score for analog \(i\), \(\alpha\) is a scaling (inverse temperature) parameter, and \((k_i - q)^2\) measures mismatch in feature space.

The scaling factor \(\alpha\) controls the sensitivity of the attention mechanism to differences between the query and stored keys. Larger values of \(\alpha\) sharpen the attention response, causing the model to focus more strongly on the most similar patterns, while smaller values produce smoother, more distributed attention across many analogs.

In this sense, \(\alpha\) acts similarly to an inverse temperature parameter, controlling the balance between concentrated analog retrieval and broader averaging across the memory bank.

This formulation has a clear geometric meaning,

  • small distance \(\rightarrow\) high similarity

  • large distance \(\rightarrow\) low similarity

  • the negative sign converts mismatch into similarity

After softmax normalization, nearby geological analogs receive higher attention weights, producing a more intuitive and stable notion of similarity for scalar reservoir properties.
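To see how \(\alpha\) changes the weights, the following minimal sketch sweeps a few scaling factors for the five analog porosities and a query of 0.19. The `softmax` helper and the chosen \(\alpha\) values are assumptions for illustration, not the chapter's `attention` function.

```python
import numpy as np

def softmax(scores):
    """Softmax with the maximum subtracted first for numerical stability."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

keys = np.array([0.26, 0.20, 0.08, 0.18, 0.16])   # analog porosities
query = 0.19                                      # target porosity
alphas = [1.0, 100.0, 1000.0, 10000.0]            # illustrative scaling factors

max_weights = []                                  # largest weight for each alpha
for alpha in alphas:
    scores = -alpha * (keys - query) ** 2         # negative squared distance
    weights = softmax(scores)
    max_weights.append(weights.max())
    print(f"alpha = {alpha:8.1f}  weights = {np.round(weights, 3)}")
```

For small \(\alpha\) the weights are close to uniform (about 0.2 each). As \(\alpha\) grows, the weight concentrates on the nearest analogs; because 0.20 and 0.18 are equidistant from 0.19, the weight is shared between these two rather than collapsing onto a single analog.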

Compute Attention Weights#

The similarity scores are normalized with the softmax function to produce attention weights,

\[ w_i = \frac{e^{s_i}}{\sum_{j=1}^{N} e^{s_j}} \]

where \(w_i\) is the attention weight for analog \(i\), \(N\) is the number of geological analogs.

The weights satisfy,

\[ \sum_{i=1}^{N} w_i = 1 \]

Thus, the attention weights are nonnegative and sum to one (probability closure), so they can be interpreted as a probabilistic weighting over the geological analogs.

Predict Log-Permeability#

The predicted log-permeability is computed as the weighted average of the values,

\[ \hat{V} = \sum_{i=1}^{N} w_i V_i \]

or equivalently,

\[ \widehat{\ln(k)} = \sum_{i=1}^{N} w_i \ln(k_i) \]

where \(\hat{V}\) is the predicted log-permeability.

Predict Permeability#

Finally, permeability is recovered by exponentiation,

\[ \hat{k} = e^{\hat{V}} \]

where \(\hat{k}\) is the predicted permeability.
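Because the averaging is performed in log space, the back-transformed estimate is a weighted geometric mean of the analog permeabilities. This follows directly from combining the two equations above,

\[ \hat{k} = \exp\left( \sum_{i=1}^{N} w_i \ln(k_i) \right) = \prod_{i=1}^{N} k_i^{\,w_i} \]

Since the weights are nonnegative and sum to one, \(\hat{k}\) is bounded by the smallest and largest analog permeabilities and, by the weighted arithmetic-geometric mean inequality, is never greater than the weighted arithmetic mean \(\sum_{i=1}^{N} w_i k_i\).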

Now here’s the code for our simple attention-based permeability prediction workflow.

  • to check the numerical details, please review the functions declared above

scale = 1000.0

K = build_keys(df)                                            # build Q, K, V
V = build_values(df)
Q = build_query(porosity_value=0.1999)                        # target location (query)

weights, pre_activation, pred_logK = attention(Q, K, V,scale) # run attention

pred_K = np.exp(pred_logK)                                    # convert back to permeability

labels = df["Facies_Type"].values                             # facies labels for interpretability
x = np.arange(len(labels))                                    # numeric positions
width = 0.35                                                  # bar width

fig, axes = plt.subplots(1, 2, figsize=(10, 6))
ax1 = axes[0]

bars1 = ax1.bar(x - width/2,-1 * pre_activation,width=width,color="darkorange",edgecolor="black",  # plot pre-activation (similarity space)
    label="Pre-Activation (Similarity)") 

ax1.set_ylabel("Similarity Score (Pre-Activation)", color="darkorange")
ax1.tick_params(axis='y', labelcolor="darkorange")

ax1b = ax1.twinx()                                            # plot attention weights
bars2 = ax1b.bar(x + width/2,weights,width=width,color="steelblue",edgecolor="black",label="Attention Weights")

ax1b.set_ylabel("Attention Weight (Post-Softmax)", color="steelblue")
ax1b.tick_params(axis='y', labelcolor="steelblue")
ax1.set_xticks(x); add_grid2(ax1)
ax1.set_xticklabels(labels, rotation=45, ha="right")
ax1.set_xlabel("Geological Facies (Analog)")
ax1.set_title("Attention: Pre-Activation vs Weights")

handles1, labels1 = ax1.get_legend_handles_labels()           # combined legend
handles2, labels2 = ax1b.get_legend_handles_labels()

ax1.legend(handles1 + handles2, labels1 + labels2, loc="upper right")

ax2 = axes[1]                                                 # plot prediction on analog CDF
ax2.plot(np.sort(np.exp(V)),(np.arange(0, len(V))+1) / (len(V)+1),color='darkorange')
ax2.scatter (np.sort(np.exp(V)),(np.arange(0, len(V))+1) / (len(V)+1),color='darkorange',edgecolor='black',zorder=10)
ax2.set_xlabel(r"Values, Permeability ($mD$)")
ax2.set_ylabel("Frequency"); ax2.set_xlim([0.0,1500])
ax2.set_title("Values Cumulative Distribution Function and Attention-based Prediction")
ax2.tick_params(axis='x', rotation=45); add_grid2(ax2)
ax2.set_ylim([0,1]); ax2.axvline(x=np.exp(pred_logK),color='#CC5500',linewidth=2)

plt.tight_layout()
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.6, top=0.8, wspace=0.2, hspace=0.6); plt.show()
_images/3a705034dace945606ba4a852d27ec3f9169db2218dd78cb88d08120a3993615.png

Attention-based Permeability Prediction Interpretation#

This simple attention model performs a data-driven geological averaging process,

  • geological similarity determines attention weights

  • similar analogs receive larger weights

  • permeability prediction is formed from a weighted combination of geological analog responses

The scaling parameter \(\alpha\) controls the selectivity of attention,

  • low \(\alpha\) produces diffuse averaging

  • high \(\alpha\) produces sharper analog selection

Because \(\alpha\) acts as an inverse temperature, increasing it moves the model from diffuse similarity weighting toward near-deterministic analog selection. Note that a very large scale can cause,

  • numerical saturation in softmax

  • effectively one-hot weights

  • loss of uncertainty representation

So,

  • high scale = deterministic analog model

  • low scale = probabilistic blending model
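The saturation behavior can be illustrated with a short sketch. It assumes the five analog porosities, the query porosity of 0.1999 from the workflow code, and two hypothetical helper functions (`naive_softmax`, `stable_softmax`) that are not part of the chapter's code; the largest scale is deliberately extreme.

```python
import numpy as np

def naive_softmax(scores):
    """Textbook softmax; exp() underflows to zero for very negative scores."""
    with np.errstate(invalid="ignore", divide="ignore", under="ignore"):
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()

def stable_softmax(scores):
    """Softmax with the maximum score subtracted first, avoiding 0/0."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

keys = np.array([0.26, 0.20, 0.08, 0.18, 0.16])   # analog porosities from the table
query = 0.1999                                    # query porosity used in the workflow code

results = {}                                      # alpha -> (naive weights, stable weights)
for alpha in [1.0e3, 1.0e6, 1.0e11]:              # the last value is deliberately extreme
    scores = -alpha * (keys - query) ** 2         # negative squared distance
    results[alpha] = (naive_softmax(scores), stable_softmax(scores))
    print(f"alpha = {alpha:.0e}")
    print("  naive :", np.round(results[alpha][0], 3))
    print("  stable:", np.round(results[alpha][1], 3))
```

At the intermediate scale the weights are already effectively one-hot on the nearest analog. At the extreme scale every exponential underflows to zero, so the naive form returns undefined values, while subtracting the maximum score before exponentiating keeps the calculation well defined.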

Well Log 1D Attention Demonstration#

As a starting point, we first considered a very simple attention model,

  • the query and key were scalars (single values)

  • similarity was defined using negative squared difference

In this next step, we increase both realism and complexity by moving to:

  • vector-valued queries and keys (local windows of the well log)

  • dot-product similarity between these feature vectors

We consider a well log prediction problem involving two wells:

  • Well 1 contains both porosity and permeability measured along the full interval (41 regularly spaced samples). This can be interpreted as a “memory bank,” where both input features and target properties are fully observed (e.g., from core analysis).

  • Well 2 contains porosity only, and we aim to estimate the missing permeability along the full interval (e.g., log-only data without core measurements).

Within the attention framework,

  • Keys (K): Local porosity window vectors extracted from Well 1, centered at location i

  • Values (V): Permeability (or log-permeability) values from Well 1 associated with each key location i

  • Queries (Q): Local porosity window vectors extracted from Well 2, centered at location j, where predictions are required

Synthetic Well Log Generation (Step-by-Step)#

We construct a simple but geologically meaningful synthetic dataset consisting of two wells with different levels of available information.

  1. Set up the workflow and depth coordinate. We begin by ensuring reproducibility (fixed random number seed) and defining a pseudo-depth axis,

  • define the well length, \(n = 41\)

  • define the normalized depth coordinate, \(z \in [0,1]\), \(z = \text{linspace}(0,1,n)\)

  2. Generate Well 1 porosity (composite geological signal),

  • porosity is constructed as a superposition of a constant baseline, cyclic bedding, fine layering and noise,

\[ \phi_1(z) = 0.18 + 0.06 \sin(2\pi \cdot 2z) + 0.03 \sin(2\pi \cdot 6z) + \epsilon \]

  • then we enforce physical bounds, \(\phi_1 \in [0.02, 0.32]\)

  3. Define an implicit, rule-based facies structure by classifying depositional behavior using porosity thresholds,

  • Massive sand: \(\phi > 0.22\)

  • Thin bed / shale-like: \(\phi < 0.10\)

  • Transitional: otherwise

  4. Construct the Well 1 permeability model. Permeability (in mD) is assigned conditionally on facies,

  • Massive sand, \(k \sim \mathcal{N}(800, 50^2)\)

  • Thin beds, \(k \sim \mathcal{N}(5, 2^2)\)

  • Transitional facies with a nonlinear relationship, \(k = 50 + 300(\phi - 0.10)^2 \cdot 10 + \epsilon\), where \(\epsilon\) is Gaussian noise with a standard deviation of 20

  • Finally, we enforce the physical constraint \(k \ge 0.5\) mD and then compute the natural log transform, \(\ln k\)

  5. Assemble the Well 1 dataset. We store all variables in a structured table,

\[ \text{Well 1} = \{\text{depth}, \phi, k, \ln k\} \]

  6. Generate Well 2 porosity (prediction-only well). We construct a second porosity log with similar but not identical structure,

\[ \phi_2(z) = 0.18 + 0.06 \sin(2\pi \cdot 2z) + 0.03 \sin(2\pi \cdot 6z + 4.0) + \epsilon \]

  • the key difference is a phase shift in the fine-scale layering, which introduces geological mismatch. Again we enforce bounds,

\[ \phi_2 \in [0.02, 0.32] \]

  7. Assemble the Well 2 dataset. Well 2 contains only porosity,

\[ \text{Well 2} = \{\text{depth}, \phi_2\} \]

  8. Finalize the workflow variables and visualize the logs. Both wells share the same depth index,

\[ \text{depth} = 0,1,2,\dots,n-1 \]
suppress_warnings = True                                      # toggle to suppress warnings
np.random.seed(42)                                            # for workflow repeatability set the random number seed

n = 41                                                        # set the well log length
z = np.linspace(0, 1, n)                                      # pseudo-depth coordinate for synthetic data models

phi = (                                                       # Well 1: make porosity as a simple composite model with cycles, fine bedding and noise
    0.18
    + 0.06 * np.sin(2 * np.pi * z * 2)                        # cyclic bedding
    + 0.03 * np.sin(2 * np.pi * z * 6)                        # finer layering
    + 0.01 * np.random.randn(n)                               # small noise
)
phi = np.clip(phi, 0.02, 0.32)                                # clip to physical bounds

massive_sand = phi > 0.22                                     # define facies logic (implicit, not explicit labels)
thin_bed = phi < 0.10
transitional = ~(massive_sand | thin_bed)

k = np.zeros(n)                                               # calculate permeability model (Well 1 only) from porosity
k[massive_sand] = (800 + 50 * np.random.randn(np.sum(massive_sand))) # massive sand, high perm, low variance
k[thin_bed] = (5 + 2 * np.random.randn(np.sum(thin_bed)))     # thin beds, very low perm even if phi varies
k[transitional] = (50 + 300 * (phi[transitional] - 0.10) ** 2 * 10 + 20 * np.random.randn(np.sum(transitional))) # mixed, mod. nonlinear relations
k = np.clip(k, 0.5, None)                                     # enforce physical bounds
logk = np.log(k)                                              # log-permeability (for later attention model)
well1 = pd.DataFrame({"depth_index": np.arange(n),"phi": phi,"k_mD": k,"logk": logk}) # make well DataFrame

phi2 = (                                                      # Well 2: make porosity as a simple composite model with cycles, fine bedding and noise
    0.18
    + 0.06 * np.sin(2 * np.pi * z * 2)                        # cyclic bedding
    + 0.03 * np.sin(2 * np.pi * z * 6 + 4.0)                  # finer layering
    + 0.01 * np.random.randn(n)                               # small noise
)
phi2 = np.clip(phi2, 0.02, 0.32)                              # clip to physical bounds
well2 = pd.DataFrame({"depth_index": np.arange(n),"phi": phi2}) # make well DataFrame

n = len(well1); depth = well1["depth_index"]

plot_well_log_data(well1,well2)                               # plot the well log data
_images/a4b6bb95119492b15f111d1451ebdaa98a48b2569496a658222af0bf7cdca35a.png

Well Log 1D Attention Demonstration Mathematical Formulation#

Now that we have introduced the general attention framework, we can summarize the specific workflow used for our well log permeability prediction problem.

  • this section is intentionally verbose, with the aim of describing a repeatable workflow in full detail.

The well log attention demonstration workflow steps are,

  1. Extract local porosity windows from Well 1 to form the keys (K)

For a moving window of size \(d\), centered at location \(i\) in Well 1, each key vector is:

\[ \mathbf{K}_i = \begin{bmatrix} \phi_{i-\frac{d-1}{2}} & \phi_{i-\frac{d-3}{2}} & \cdots & \phi_i & \cdots & \phi_{i+\frac{d-3}{2}} & \phi_{i+\frac{d-1}{2}} \end{bmatrix} \]

where \(\phi_i\) is the porosity at depth/location \(i\), \(d\) is the moving window size (odd, so that the window has a center sample), and \(\mathbf{K}_i\) is the local porosity pattern centered at location \(i\).

Collecting all valid windows produces the key matrix,

\[\begin{split} \mathbf{K}_{N \times d} = \begin{bmatrix} \mathbf{K}_1 \\ \mathbf{K}_2 \\ \vdots \\ \mathbf{K}_N \end{bmatrix} \end{split}\]

where \(N\) is the number of valid moving-window samples extracted from Well 1, and each \(\mathbf{K}_i\) is a row vector that represents a stored local geological pattern in the memory bank. For example, a single key vector centered at location \(i\) with a window size of \(d=3\) is:

\[ \mathbf{K}_i = \begin{bmatrix} \phi_{i-1} & \phi_i & \phi_{i+1} \end{bmatrix} \]

where \(\phi_i\) is the porosity at location/depth index \(i\), \(\phi_{i-1}\) and \(\phi_{i+1}\) provide the local geological context surrounding the center location, and \(\mathbf{K}_i\) represents a local geological pattern extracted from the well log.

\[\begin{split} \mathbf{K}_{N \times d} = \begin{bmatrix} \phi_{1-1} & \phi_{1} & \phi_{1+1} \\ \phi_{2-1} & \phi_{2} & \phi_{2+1} \\ \phi_{3-1} & \phi_{3} & \phi_{3+1} \\ \vdots & \vdots & \vdots \\ \phi_{N-1} & \phi_{N} & \phi_{N+1} \end{bmatrix} \end{split}\]

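where each row represents a local porosity window (local geological pattern) extracted from Well 1 and centered at location \(i\), and the full key matrix stacks all local geological patterns into the attention memory bank. Rows exist only for centers with a complete window, so the first and last \((d-1)/2\) samples of the well are excluded; for \(n = 41\) samples and \(d = 3\) this gives \(N = 39\) keys.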
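The window extraction can be sketched with NumPy's `sliding_window_view`. The function name `build_windows_sketch` and the placeholder porosity log are hypothetical; this is one possible implementation and not necessarily how the chapter's `build_windows` helper is written.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def build_windows_sketch(log_values, window):
    """Return an (n - window + 1, window) array of centered moving windows.

    Hypothetical stand-in for the chapter's build_windows helper; an odd
    window size is assumed so that each window has a center sample.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    values = np.asarray(log_values, dtype=float)
    return sliding_window_view(values, window).copy()   # copy to get a writable array

phi_demo = np.linspace(0.10, 0.30, 41)            # placeholder porosity log, n = 41
K_demo = build_windows_sketch(phi_demo, window=3) # one row per valid window center
print(K_demo.shape)                               # (39, 3)
```

Row 0 holds the first three samples and is centered on the second sample of the log, which is why the first and last samples of the well have no key of their own.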

  2. Pair each key window with the collocated permeability value from Well 1 to form the values (V)

For each key window centered at location \(i\), we associate the collocated permeability value from Well 1,

\[ v_i = \ln(k_i) \]

where \(k_i\) is the permeability at location \(i\), and \(v_i\) = log-permeability value associated with key \(\mathbf{K}_i\)

Collecting all values forms the value vector:

\[\begin{split} \mathbf{V}_{N \times 1} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ \vdots \\ v_N \end{bmatrix} = \begin{bmatrix} \ln(k_1) \\ \ln(k_2) \\ \ln(k_3) \\ \vdots \\ \ln(k_N) \end{bmatrix} \end{split}\]

Each value represents the target reservoir property retrieved from the memory bank during attention-based prediction.
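Pairing values with keys requires dropping the samples at the ends of the log that have no complete window. A minimal sketch, assuming a hypothetical `center_crop_sketch` function (the chapter's `center_crop` helper is not shown here) and a placeholder permeability log:

```python
import numpy as np

def center_crop_sketch(values, window):
    """Drop the first and last window // 2 samples to align with window centers."""
    half = window // 2
    values = np.asarray(values)
    return values[half: len(values) - half]

k_demo = np.linspace(5.0, 800.0, 41)              # placeholder permeability log (mD), n = 41
logk_demo = np.log(k_demo)                        # natural log-permeability
V_demo = center_crop_sketch(logk_demo, window=3)  # values aligned with the 39 key windows
print(V_demo.shape)                               # (39,)
```

With this alignment, value \(v_i\) is the log-permeability at the center sample of key window \(\mathbf{K}_i\).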

  3. Extract local porosity windows from Well 2 to form the queries (Q)

For a moving window of size \(d=3\), each query vector centered at location \(j\) in Well 2 is,

\[ \mathbf{Q}_j = \begin{bmatrix} \phi_{j-1} & \phi_j & \phi_{j+1} \end{bmatrix} \]

where \(\phi_j\) is the porosity at location \(j\) in Well 2, and \(\mathbf{Q}_j\) = local porosity pattern at the prediction location

Collecting all valid query windows forms the query matrix,

\[\begin{split} \mathbf{Q}_{M \times d} = \begin{bmatrix} \phi_{1-1} & \phi_1 & \phi_{1+1} \\ \phi_{2-1} & \phi_2 & \phi_{2+1} \\ \phi_{3-1} & \phi_3 & \phi_{3+1} \\ \vdots & \vdots & \vdots \\ \phi_{M-1} & \phi_M & \phi_{M+1} \end{bmatrix} \end{split}\]

where \(M\) is the number of valid moving-window samples extracted from Well 2, and each row represents a local geological pattern where permeability is unknown and must be predicted

  4. For each query window in Well 2, compute similarity against all key windows from Well 1 using scaled dot-product similarity

Let,

  • \(\mathbf{Q}_j \in \mathbb{R}^{1 \times d}\) be the query window at location \(j\)

  • \(\mathbf{K}_i \in \mathbb{R}^{1 \times d}\) be the key window at location \(i\)

  • \(d\) = window size

  • \(N\) = number of key windows in Well 1

The similarity between query \(j\) and key \(i\) is:

\[ s_{j,i} = \alpha \, \mathbf{Q}_j \cdot \mathbf{K}_i^{T} \]

where \(s_{j,i} \in \mathbb{R}\) is a scalar similarity score, \(\mathbf{Q}_j \in \mathbb{R}^{1 \times d}\), \(\mathbf{K}_i^{T} \in \mathbb{R}^{d \times 1}\), and \(\alpha\) is a scaling (inverse temperature) parameter

Now let’s look at this in matrix form with all keys at once. For a fixed query \(\mathbf{Q}_j\), we compute similarity against all keys,

\[ \mathbf{s}_j = \alpha \, \mathbf{K} \mathbf{Q}_j^{T} \]

where \(\mathbf{K} \in \mathbb{R}^{N \times d}\), \(\mathbf{Q}_j^{T} \in \mathbb{R}^{d \times 1}\), and \(\mathbf{s}_j \in \mathbb{R}^{N \times 1}\)

Now let’s work with the full attention score matrix with all queries vs all keys. If we compute similarities for all queries in Well 2,

\[ \mathbf{S} = \alpha \, \mathbf{Q} \mathbf{K}^{T} \]

where \(\mathbf{Q} \in \mathbb{R}^{M \times d}\) (Well 2 query windows), \(\mathbf{K} \in \mathbb{R}^{N \times d}\) (Well 1 key windows), and \(\mathbf{S} \in \mathbb{R}^{M \times N}\)

Each entry,

\[ S_{j,i} = \alpha \sum_{m=1}^{d} Q_{j,m} K_{i,m} \]

represents the similarity between query location \(j\) and memory bank location \(i\).
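The matrix form and the entry-wise sum can be checked against each other numerically. The sketch below uses small random placeholder windows and illustrative sizes (`M`, `N`, `d`, and `alpha` are assumptions, not the values used in the workflow).

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed for repeatability
M, N, d = 4, 6, 3                                 # illustrative numbers of queries, keys, window size
Q = rng.uniform(0.02, 0.32, size=(M, d))          # placeholder query windows (porosity range)
K = rng.uniform(0.02, 0.32, size=(N, d))          # placeholder key windows
alpha = 100.0                                     # scaling (inverse temperature) factor

S = alpha * Q @ K.T                               # full (M, N) score matrix

j, i = 2, 5                                       # one query and key pair to verify
s_ji = alpha * sum(Q[j, m] * K[i, m] for m in range(d))   # entry-wise definition
print(S.shape, np.isclose(S[j, i], s_ji))
```

Each row of `S` holds the scores of one query against every key, which is the layout expected by the row-wise softmax in the next step.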

  5. Convert similarity scores into normalized attention weights using softmax

For a fixed query window \(\mathbf{Q}_j\), we first compute a vector of similarity scores against all keys:

\[\begin{split} \mathbf{s}_j = \begin{bmatrix} s_{j,1} \\ s_{j,2} \\ \vdots \\ s_{j,N} \end{bmatrix} \in \mathbb{R}^{N \times 1} \end{split}\]

where:

  • \(s_{j,i} = \alpha \, \mathbf{Q}_j \cdot \mathbf{K}_i^T\)

The attention weights are obtained from the similarity scores, \(s_{j,k}\), by applying the softmax operator across all \(N\) candidate analogs:

\[ w_{j,i} = \frac{\exp(s_{j,i})} {\sum_{k=1}^{N} \exp(s_{j,k})} \quad i = 1, \dots, N \]

This can be written compactly as:

\[ \mathbf{w}_j = \text{softmax}(\mathbf{s}_j) \]

where \(\mathbf{w}_j \in \mathbb{R}^{N \times 1}\), \(\sum_{i=1}^{N} w_{j,i} = 1\), and \(w_{j,i} \ge 0 \;\; \forall i\)

The softmax operator converts raw similarity scores into a probability distribution over all stored geological analogs.
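When all queries are processed at once, the softmax is applied independently to each row of the score matrix. A minimal sketch, assuming a hypothetical `softmax_rows` function and a small hand-made score matrix:

```python
import numpy as np

def softmax_rows(S):
    """Apply softmax to each row of S, subtracting the row maximum for stability."""
    shifted = S - S.max(axis=1, keepdims=True)
    exp_S = np.exp(shifted)
    return exp_S / exp_S.sum(axis=1, keepdims=True)

S = np.array([[1.0, 2.0, 3.0],                    # moderate score contrast
              [0.0, 0.0, 0.0],                    # no contrast, uniform weights
              [500.0, 1000.0, 1500.0]])           # extreme contrast, near one-hot weights
W = softmax_rows(S)
print(np.round(W, 3))
```

Each row of `W` is a set of nonnegative weights that sums to one, so every query has its own probability distribution over the stored analogs.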

  6. Compute the permeability prediction as the weighted combination of permeability values retrieved from the most similar porosity patterns

For a fixed query location \(j\), the predicted permeability is obtained by combining all stored values using the attention weights:

\[ \hat{y}_j = \sum_{i=1}^{N} w_{j,i} \, v_i \]

where \(w_{j,i}\) is the attention weight between query \(j\) and key \(i\), \(v_i\) is the permeability (or log-permeability) associated with key \(i\), and \(N\) is the number of stored analog samples

This can be written compactly as,

\[ \hat{y}_j = \mathbf{w}_j^{T} \mathbf{v} \]

where \(\mathbf{w}_j \in \mathbb{R}^{N \times 1}\) is the attention weight vector, \(\mathbf{v} \in \mathbb{R}^{N \times 1}\) is the value vector, and \(\hat{y}_j \in \mathbb{R}\) is the predicted permeability at location \(j\).

The full matrix form (all prediction locations), for all query locations in Well 2 is,

\[ \hat{\mathbf{y}} = \mathbf{W} \mathbf{v} \]

where \(\mathbf{W} \in \mathbb{R}^{M \times N}\) is the attention weight matrix, \(\mathbf{v} \in \mathbb{R}^{N \times 1}\) is the value vector from Well 1, and \(\hat{\mathbf{y}} \in \mathbb{R}^{M \times 1}\) are the predicted permeability values for Well 2.

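Each prediction \(\hat{y}_j\) is a convex combination of the stored values from Well 1 (log-permeability in this workflow, back-transformed with the exponential to obtain permeability), where the weights are determined dynamically by similarity between porosity patterns and are largest for the most similar geological analogs.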

In this formulation, the model dynamically searches for similar porosity patterns in the reference well and retrieves permeability information from the most relevant geological analogs.
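One practical consequence of the convex combination is that predictions can never fall outside the range of the stored values. The sketch below checks this with random placeholder scores and values; all sizes and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)                    # fixed seed for repeatability
M, N = 5, 8                                       # illustrative numbers of queries and keys

scores = rng.normal(size=(M, N))                  # placeholder similarity scores
W = np.exp(scores - scores.max(axis=1, keepdims=True))
W /= W.sum(axis=1, keepdims=True)                 # row-wise softmax weights

v = np.log(rng.uniform(5.0, 800.0, size=N))       # placeholder log-permeability values
y_hat = W @ v                                     # predicted log-permeability, one per query
k_hat = np.exp(y_hat)                             # back-transform to permeability (mD)

print(np.round(k_hat, 1))
print(v.min() <= y_hat.min(), y_hat.max() <= v.max())
```

This bounded behavior means the attention model interpolates among the analogs in the memory bank and does not extrapolate beyond the permeability range observed in Well 1.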

Well Log 1D Attention Demonstration Results#

Now here’s the code for our attention-based permeability prediction from well log patterns workflow.

  • to check the numerical details, please review the functions declared above

window = 3                                                    # model parameters 
scale = 100.0

phi1 = well1["phi"].values                                    # extract well logs
logk1 = well1["logk"].values

phi2 = well2["phi"].values

K1 = build_windows(phi1, window)                              # build Well 1 (keys + values)
V1 = center_crop(logk1, window)

Q2 = build_windows(phi2, window)                              # build Well 2 (queries)

pred_logk = []                                                # run attention over Well 2
all_weights = []
all_scores = []

for i in range(len(Q2)):
    #w, s, out = dot_attention(Q2[i], K1, V1, scale=scale)    # attention without normalization
    w, s, out = dot_attention_norm(Q2[i], K1, V1, scale=scale) # attention with normalization
    pred_logk.append(out)
    all_weights.append(w)
    all_scores.append(s)

pred_logk = np.array(pred_logk)

well2_attn = well2.iloc[window//2 : len(well2) - window//2].copy() # attach predictions
well2_attn["pred_logk"] = pred_logk
well2_attn["pred_k_mD"] = np.exp(pred_logk)

plot_attention_prediction_log(well1,well2)
_images/87c33ff90d48229d68e12b06d73bb37025a4afbedf4c347d2a75981c9f521db7.png

To understand these results, let’s talk about attention regimes and dot-product similarity interpretation.

Attention Regimes in Geological Sequence Modeling#

To interpret how the scaling parameter \(\alpha\) shapes model behavior, we examine different attention regimes and how they influence similarity-based retrieval in geological sequence modeling.

| Scale (α) | Attention Behavior | Weight Distribution | Geological Interpretation | ln(k) Output Behavior |
|---|---|---|---|---|
| Low (≈ 1–10) | Diffuse attention | Nearly uniform weights | Global averaging of all analogs (weak discrimination) | Collapsed range, smooth mean-field (~global mean) |
| Medium (≈ 10–50) | Structured attention | Moderate peaks, partial selectivity | Local geological patterns influence prediction | Moderate variability, realistic smoothing |
| High (≈ 50–150) | Selective attention | Few dominant analogs per query | Analog-based geological transfer (pattern matching) | Wider range, facies-controlled structure emerges |
| Very high (≈ 150+) | Near-hard attention | Almost one-hot weights | Nearest-neighbor / MPS-like retrieval behavior | Highly variable, piecewise analog reconstruction |

Overall, increasing \(\alpha\) transitions the system from smooth global averaging toward highly selective analog retrieval, where predictions shift from mean-field behavior to facies-controlled, pattern-driven reconstruction of permeability.

Dot-Product Similarity Interpretation#

In this formulation, similarity is defined as:

\[ s_i = \alpha \, (\mathbf{Q} \cdot \mathbf{K}_i) \]

where \(\mathbf{Q} \in \mathbb{R}^{1 \times d}\) is the query window, \(\mathbf{K}_i \in \mathbb{R}^{1 \times d}\) is the \(i\)-th key window, \(\mathbf{Q} \cdot \mathbf{K}_i = \sum_{m=1}^{d} Q_m K_{i,m}\) measures alignment between local geological patterns, and \(\alpha\) is a scaling (inverse temperature) parameter

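Larger values of \(\mathbf{Q} \cdot \mathbf{K}_i\) indicate more strongly aligned porosity structures and lead to higher attention weights after softmax normalization. Note that raw porosity windows are strictly positive, so their dot product is always positive; negative values, indicating anti-correlated or mismatched patterns, can only arise after a transformation such as centering the windows (e.g., subtracting the window mean).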

This formulation contrasts with Euclidean distance-based similarity, which measures geometric separation rather than directional agreement in feature space,

  • Before (Euclidean similarity): which patterns are close in feature space?

  • Now (dot-product similarity): which patterns behave similarly in structure and trend?

This represents a subtle but important shift in interpretation:

  • Euclidean distance \(\rightarrow\) lithology similarity (absolute proximity)

  • Dot product \(\rightarrow\) pattern alignment (structural agreement)

Because the dot product is scale-sensitive, the magnitude of \(\alpha\) strongly controls model behavior,

  • small \(\alpha\) → diffuse, nearly uniform attention

  • moderate \(\alpha\) → balanced analog selection

  • large \(\alpha\) → highly selective, near-nearest-neighbor retrieval

In practice, stability is often improved by:

  • normalizing query and key windows (cosine-like behavior), or

  • keeping \(\alpha\) in a moderate range (typically \(\sim 5\) to \(50\) for well-log style signals)

This ensures attention reflects meaningful geological similarity rather than being dominated by magnitude effects.
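One possible form of window normalization is sketched here: each window is optionally centered and then scaled to unit length, so the dot product becomes a cosine similarity. The function `cosine_scores` is hypothetical and is not necessarily what the chapter's `dot_attention_norm` does; centering in particular is an assumption that removes the porosity level and keeps only the window shape.

```python
import numpy as np

def cosine_scores(query, keys, alpha=1.0, center=True, eps=1e-12):
    """Scaled cosine similarity between one query window and all key windows."""
    q = np.asarray(query, dtype=float)
    K = np.asarray(keys, dtype=float)
    if center:                                     # remove the window mean (shape only)
        q = q - q.mean()
        K = K - K.mean(axis=1, keepdims=True)
    q = q / (np.linalg.norm(q) + eps)              # unit-length query
    K = K / (np.linalg.norm(K, axis=1, keepdims=True) + eps)   # unit-length keys
    return alpha * K @ q

keys = np.array([[0.10, 0.15, 0.20],               # increasing porosity, low level
                 [0.20, 0.15, 0.10],               # decreasing porosity
                 [0.25, 0.30, 0.35]])              # increasing porosity, high level
query = np.array([0.12, 0.17, 0.22])               # increasing porosity

centered_scores = cosine_scores(query, keys, center=True)
raw_scores = cosine_scores(query, keys, center=False)
print("centered:", np.round(centered_scores, 3))
print("raw     :", np.round(raw_scores, 3))
```

With centering, both increasing windows match the query shape equally well regardless of porosity level, and the decreasing window is anti-correlated. Without centering, all scores are positive and close together, so a larger \(\alpha\) is needed to separate the analogs.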

Visualize the Attention Weights for a Single Prediction#

To make the attention mechanism tangible, we now visualize a specific prediction location in Well 2 and trace how its local porosity pattern is matched against the full Well 1 memory bank.

  • well 2 porosity log with the query window highlighted at the chosen depth index (left)

  • well 1 reference log alongside the corresponding attention weights computed at every candidate location (right)

Together, these plots illustrate how a single query activates a distributed set of geological analogs, and how the attention mechanism concentrates or disperses retrieval depending on pattern similarity across the memory bank.

depth_index = 4                                               # set the depth index and observe the attention weights

plot_attention_provenance(well1,well2,all_weights,Q2,depth_index,window=3) # plot specific query and associated attention weights 
_images/02ec792455e62e9b8623ea70c1ceaecea441c7210da5452491452a6ae8cb2e19.png

From inspection of the attention weights, it is clear that the model is not simply matching values, but retrieving analogs based on local geological structure. Higher weights are consistently assigned to key windows that exhibit,

  • similar porosity magnitudes

  • similar multi-point patterns (shape and local trend across the window)

In other words, attention is responding to pattern alignment over a spatial context, rather than pointwise similarity alone, with both level and local variability contributing to the retrieval behavior.

Visualize the Attention Entropy for All Predictions#

Next, we examine how the uncertainty or concentration of the attention mechanism changes along the full Well 2 prediction interval. We display,

  • well 2 porosity log (left)

  • attention entropy at each prediction location (right), providing a measure of how distributed or focused the attention weights are across the Well 1 memory bank.

Low entropy indicates that the model concentrates attention on a small number of highly similar geological analogs, while higher entropy suggests that multiple candidate patterns contribute to the prediction, reflecting greater ambiguity in geological similarity.
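A common way to quantify this is the Shannon entropy of the attention weights, \(H_j = -\sum_{i=1}^{N} w_{j,i} \ln(w_{j,i})\). The sketch below assumes this definition with the natural log; the internals of the chapter's `plot_entropy_log` function are not shown here, so the function `attention_entropy` is a hypothetical illustration.

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy (natural log) of one attention weight vector."""
    w = np.asarray(weights, dtype=float)
    return float(-np.sum(w * np.log(w + eps)))     # eps avoids log(0) for zero weights

N = 39                                             # number of keys for n = 41 and window = 3
uniform = np.full(N, 1.0 / N)                      # fully diffuse attention
one_hot = np.zeros(N)
one_hot[7] = 1.0                                   # fully concentrated attention

h_uniform = attention_entropy(uniform)
h_one_hot = attention_entropy(one_hot)
print("uniform entropy:", round(h_uniform, 3), " maximum ln(N):", round(np.log(N), 3))
print("one-hot entropy:", round(h_one_hot, 3))
```

The entropy ranges from zero for one-hot weights to \(\ln(N)\) for uniform weights, so dividing by \(\ln(N)\) gives a normalized measure between 0 and 1 that can be compared across memory banks of different sizes.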

plot_entropy_log(well2,all_weights)                           # plot the attention weight entropy over all predictions
_images/e5c5cd104830c01a31e686f5eb7427853bf00659fe776e3a76f9b4542a27989c.png

Locations with lower attention entropy are primarily those where the query and the memory bank exhibit strong structural disagreement.

  • in these cases, the model becomes highly selective, assigning most of the weight to a small number of analogs that are actually poor matches to the query (e.g., low vs high magnitude regimes or concave-up vs convex-up curvature).

  • in contrast, higher entropy tends to occur where multiple geological patterns partially match the query, leading to a more distributed attention response across competing analogs.

Visualize the Full Attention Weight Matrix for All Well Log Queries#

Now we step back from individual predictions and examine the full attention weight matrix,

  • where every query in Well 2 is compared against every stored analog in Well 1

This matrix provides a complete view of the retrieval process,

  • rows correspond to prediction locations (queries)

  • columns correspond to memory bank analogs (keys)

  • each entry represents the strength of geological similarity between them.

In effect, this is the full “geological matching map”, and it’s where the structure of attention really becomes visible, almost like a fingerprint of how the model organizes and reuses subsurface patterns.

  • at this point, it stops being just a calculation and starts looking like something you could print on a T-shirt!

plot_full_attention_matrix(all_weights)                       # plot the full attention matrix over all queries and keys
_images/9903d12bfd42b02c6ac30e51a32ea698aef58bd2d3985fb95d8dbda3847e15b7.png

The pre-activation (pre-softmax) scores form a relative energy landscape, where higher (less negative) values indicate stronger geological affinity.

When similarity contrast is weak, attention behaves like a smoothing kernel, producing ensemble-mean predictions. As similarity sharpens, attention transitions into a selective analog retrieval mechanism, increasing output variability and geological realism.
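
This transition from smoothing to selective retrieval can be illustrated with a toy example. This is a minimal sketch, assuming a simple multiplier on the scores to control similarity contrast; the function predict_with_contrast and the arrays toy_scores and toy_values are hypothetical and not part of the chapter's workflow.

```python
import numpy as np

def softmax(scores):
    """Softmax over a 1D array of scores."""
    scores = scores - np.max(scores)     # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

toy_scores = np.array([-0.5, -1.0, -1.5, -3.0])   # hypothetical pre-softmax scores, less negative = better match
toy_values = np.array([0.10, 0.18, 0.25, 0.30])   # hypothetical analog values, e.g., porosity

def predict_with_contrast(scale):
    """Attention weights and weighted prediction after multiplying the scores by scale."""
    weights = softmax(scale * toy_scores)
    return weights, float(weights @ toy_values)

for scale in (0.1, 1.0, 10.0):           # weak, moderate and strong similarity contrast
    weights, prediction = predict_with_contrast(scale)
    print(f"scale={scale:5.1f}  max weight={weights.max():.3f}  prediction={prediction:.3f}")
```

With a small scale the weights are nearly uniform and the prediction approaches the mean of the analog values, while a large scale concentrates the weight on the best-matching analog and the prediction approaches that analog's value.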

Comments#

In this chapter, we introduced the core ideas behind attention mechanisms through geoscience-driven analog matching and missing well log prediction examples.

  • starting from simple analog matching and progressively building toward full matrix-based retrieval, we showed how attention provides a flexible framework for linking a query location to a set of stored geological patterns (the keys).

Attention is now a foundational component of modern machine learning systems, including large language models, where it enables the learning of complex, multiscale relationships.

  • by using similarity measures such as the dot product, these models are able to capture not just local pointwise similarity, but coherent structure, trends, and dependencies across extended contexts.

In our geological formulation, this translates directly into the ability to,

  • recognize similar subsurface patterns across wells

  • integrate information from distributed analogs

  • adaptively weight their contributions to prediction.

The result is a powerful mechanism that naturally bridges local measurements and large-scale structure, making it particularly well suited for spatial and subsurface data.

Ultimately, attention can be interpreted as a learnable pattern retrieval system, one that replaces fixed local assumptions with dynamic, data-driven matching across scales.

Check out my shared resource inventory and the YouTube lecture links at the start of this chapter with resource links in the videos’ descriptions.

I hope this is helpful,

Michael

About the Author#

Professor Michael Pyrcz in his office on the 40 Acres, the campus of The University of Texas at Austin.

Michael Pyrcz is a professor in the Cockrell School of Engineering, and the Jackson School of Geosciences, at The University of Texas at Austin, where he researches and teaches subsurface, spatial data analytics, geostatistics, and machine learning. Michael is also,

  • the principal investigator of the Energy Analytics freshmen research initiative and a core faculty in the Machine Learning Laboratory in the College of Natural Sciences, The University of Texas at Austin

  • an associate editor for Computers and Geosciences, and a board member for Mathematical Geosciences, the International Association for Mathematical Geosciences.

Michael has written over 90 peer-reviewed publications and a Python package for spatial data analytics, co-authored a textbook on spatial data analytics, Geostatistical Reservoir Modeling, and authored two recently released e-books, Applied Geostatistics in Python: a Hands-on Guide with GeostatsPy and Applied Machine Learning in Python: a Hands-on Guide with Code.

All of Michael’s university lectures are available on his YouTube Channel with links to 100s of Python interactive dashboards and well-documented workflows in over 40 repositories on his GitHub account, to support any interested students and working professionals with evergreen content. To find out more about Michael’s work and shared educational resources visit his Website.

Want to Work Together?#

I hope this content is helpful to those who want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

  • Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I’d be happy to drop by and work with you!

  • Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PI is Professor John Foster)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

  • I can be reached at mpyrcz@austin.utexas.edu.

I’m always happy to discuss,

Michael

Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
