Simplified Attention Layer

1. Input

Query Vector: $q \in \mathbb{R}^{D_{Q}}$
Input Vectors: $X \in \mathbb{R}^{N_{X} \times D_{X}}$
Assumption: For the dot product to be valid, we must have $D_{Q} = D_{X}$.

2. Computation

Alignment Scores ( $e$ ): Measure the similarity between $q$ and each row $X_{i}$ of $X$.

$e = \frac{q X^{\top}}{\sqrt{D_{Q}}}$ (Shape: $1 \times N_{X}$)

Attention Weights ( $a$ ): Normalize scores into a probability distribution.

$a = \mathrm{softmax}(e)$ (Shape: $1 \times N_{X}$)

Output Vector ( $y$ ): A weighted sum of the inputs.

$y = a X = \sum_{i=1}^{N_{X}} a_{i} X_{i}$ (Shape: $1 \times D_{X}$)
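The three steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the helper names `softmax` and `simple_attention` and the random test data are assumptions made for the example, and the scores are scaled by $\sqrt{D_{Q}}$, the usual scaled dot-product convention.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtracting the max does not change the result but avoids overflow.
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def simple_attention(q, X):
    """q: (D_Q,), X: (N_X, D_X) with D_Q == D_X. Returns (y, a)."""
    D_Q = q.shape[0]
    assert X.shape[1] == D_Q, "the dot product needs D_Q == D_X"
    e = X @ q / np.sqrt(D_Q)  # alignment scores, shape (N_X,)
    a = softmax(e)            # attention weights, non-negative, sum to 1
    y = a @ X                 # weighted sum of the rows of X, shape (D_X,)
    return y, a

# Illustrative data only: 3 input vectors of dimension 4.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
X = rng.normal(size=(3, 4))
y, a = simple_attention(q, X)
print(a.shape, y.shape)
```

The vectors are stored as 1-D arrays here, so the shapes print as `(N_X,)` and `(D_X,)` rather than `1 x N_X` and `1 x D_X`.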

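Why Scale by $1/\sqrt{D_{Q}}$ ?

As $D_{Q}$ grows, the variance of the dot product grows with it: if the components of $q$ and $X_{i}$ are independent with zero mean and unit variance, then $q \cdot X_{i}$ has variance $D_{Q}$. Large scores push the softmax into regions with extremely small gradients (“saturation”), leading to vanishing gradients during backprop. Dividing by $\sqrt{D_{Q}}$ brings the variance back to about 1.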
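A quick empirical check of the variance argument, assuming the components of both vectors are independent standard normal draws (an assumption of this sketch, not something the notes specify). The sample size and the dimensions tried are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000

raw_var = {}
scaled_var = {}
for D in (4, 64, 1024):
    # Each row is one (q, x) pair with i.i.d. N(0, 1) components.
    q = rng.normal(size=(n_samples, D))
    x = rng.normal(size=(n_samples, D))
    dots = np.sum(q * x, axis=1)                 # one dot product per row
    raw_var[D] = dots.var()                      # grows roughly like D
    scaled_var[D] = (dots / np.sqrt(D)).var()    # stays close to 1
    print(f"D={D:5d}  var(q.x)={raw_var[D]:9.2f}  "
          f"var(q.x / sqrt(D))={scaled_var[D]:.2f}")
```

The unscaled variance should track the dimension, while the variance after dividing by the square root of the dimension should stay near 1 for every size.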

Standard Attention Layer

1. Introduction

In practice, we don’t just use the raw input $X$ . We project it into different spaces for “matching” (Key) and “extracting” (Value).

2. Input

Query Matrix: $Q \in \mathbb{R}^{N_{Q} \times D_{Q}}$
Input Matrix: $X \in \mathbb{R}^{N_{X} \times D_{X}}$
Weight Matrices:
- $W_{Q} \in \mathbb{R}^{D_{Q} \times d_{k}}$ (Projects query to internal dim)
- $W_{K} \in \mathbb{R}^{D_{X} \times d_{k}}$ (Projects input to key space)
- $W_{V} \in \mathbb{R}^{D_{X} \times d_{v}}$ (Projects input to value space)

3. Computation Pipeline

Linear Projections:

$Q' = Q W_{Q}, \quad K = X W_{K}, \quad V = X W_{V}$

Similarity Matrix ( $E$ ):

$E = \frac{Q' K^{\top}}{\sqrt{d_{k}}}$ (Shape: $N_{Q} \times N_{X}$)

Each $E_{i, j}$ is the score between the $i$ -th query and $j$ -th key.

Attention Weights ( $A$ ):

$A = \mathrm{softmax}(E, \mathrm{dim} = 1)$ (Shape: $N_{Q} \times N_{X}$)

Final Output ( $Y$ ):

$Y = A V$ (Shape: $N_{Q} \times d_{v}$)

This is a “row-wise” combination: each row $Y_{i}$ is a weighted sum of all rows in $V$ , i.e.,

$Y_{i} = \sum_{j} A_{i, j} V_{j}$
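The full pipeline can be sketched in NumPy as follows. The function name `attention`, the random weights (which would be learned in practice), and the concrete sizes are assumptions chosen for illustration; the scores are scaled by $\sqrt{d_{k}}$, the usual scaled dot-product convention.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtracting the max does not change the result but avoids overflow.
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def attention(Q, X, W_Q, W_K, W_V):
    """Q: (N_Q, D_Q), X: (N_X, D_X). Returns Y: (N_Q, d_v) and A: (N_Q, N_X)."""
    Q_proj = Q @ W_Q                    # (N_Q, d_k)
    K = X @ W_K                         # (N_X, d_k)
    V = X @ W_V                         # (N_X, d_v)
    d_k = K.shape[1]
    E = Q_proj @ K.T / np.sqrt(d_k)     # (N_Q, N_X): E[i, j] scores query i vs key j
    A = softmax(E, axis=1)              # normalize over keys, so each row sums to 1
    Y = A @ V                           # (N_Q, d_v): row i is a weighted sum of rows of V
    return Y, A

# Illustrative sizes only.
N_Q, N_X, D_Q, D_X, d_k, d_v = 2, 5, 3, 4, 6, 7
rng = np.random.default_rng(0)
Q = rng.normal(size=(N_Q, D_Q))
X = rng.normal(size=(N_X, D_X))
W_Q = rng.normal(size=(D_Q, d_k))
W_K = rng.normal(size=(D_X, d_k))
W_V = rng.normal(size=(D_X, d_v))

Y, A = attention(Q, X, W_Q, W_K, W_V)
V = X @ W_V
print(Y.shape, A.shape)
```

Note that $D_{Q}$ and $D_{X}$ no longer have to match: the projections map both sides into the shared dimension $d_{k}$ before the dot product.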

4. Query, Key, and Value Vectors

Both the alignment scores and the output vector are computed from the input vectors, but the inputs serve a different purpose in each:

- in the alignment scores, they are used for pattern matching against the query
- in the output vector, they carry the actual information contained in the input

Since the two purposes differ, we can use separate learnable parameters to optimize the input for each use.
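One way to see the two roles: if every projection is the identity, the standard layer collapses back to the simplified one, where $X$ does both jobs. The sketch below checks this numerically and then separates the roles with distinct projections. The random matrices stand in for learned weights, and all sizes are assumptions of the example.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtracting the max does not change the result but avoids overflow.
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, N_X = 4, 3
q = rng.normal(size=(1, D))
X = rng.normal(size=(N_X, D))

# Simplified layer: X is used both for matching and as the content.
a_simple = softmax(q @ X.T / np.sqrt(D))
y_simple = a_simple @ X

# Standard layer with identity projections: K = V = X, so nothing changes.
I = np.eye(D)
K, V = X @ I, X @ I
y_identity = softmax((q @ I) @ K.T / np.sqrt(D)) @ V

# Distinct projections: keys live in a 2-dim matching space,
# values in a 5-dim content space.
W_Q = rng.normal(size=(D, 2))
W_K = rng.normal(size=(D, 2))
W_V = rng.normal(size=(D, 5))
K, V = X @ W_K, X @ W_V
A = softmax((q @ W_Q) @ K.T / np.sqrt(K.shape[1]))
y_split = A @ V

print(np.allclose(y_simple, y_identity), K.shape, V.shape, y_split.shape)
```

With separate projections the key and value dimensions can differ, which is not possible when $X$ is used directly for both.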

Key Vector

$K = X W_{K}$

The key matrix $W_{K}$ learns which features the query vectors want to match against, and extracts those features from the original input $X$.

Value Vector

$V = X W_{V}$

The input vectors mix all the features of the image together. The value matrix $W_{V}$ learns to extract the useful features from them to pass on to the next layer.

Query Vector

It contains the current context. We can think of it as asking: “Based on the context from the first $t - 1$ steps, what part of the input $X$ should I look at to generate the next output?”

5. Deep Dive: Training vs. Inference

Training

We use teacher forcing. Since the entire target sequence is known, we can compute all queries ( $Q$ ) at once. The “dependence on the previous step” is bypassed by feeding in the ground truth instead of the model's own output.

Inference (Generation)

We must generate autoregressively. The query for step $t$ depends on the output generated at step $t - 1$ , so queries are computed one at a time. Here, parallelization only happens across the “Input” side ( $K$ and $V$ are static and computed once), not the “Query” side.
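The contrast can be sketched with a toy loop. Everything about how queries are produced here is hypothetical: `Q_train` stands in for already-projected queries derived from ground-truth targets, and the `tanh(y @ W_R)` update is an invented placeholder for whatever model turns the previous output into the next query. Only the structure (one batched call versus a sequential loop over shared $K$ and $V$) is the point.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtracting the max does not change the result but avoids overflow.
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def attend(Q_proj, K, V):
    # Q_proj holds already-projected queries, shape (num_queries, d_k).
    E = Q_proj @ K.T / np.sqrt(K.shape[1])
    return softmax(E, axis=-1) @ V

rng = np.random.default_rng(0)
N_X, D_X, d_k, d_v, T = 5, 4, 3, 3, 4
X = rng.normal(size=(N_X, D_X))
W_K = rng.normal(size=(D_X, d_k))
W_V = rng.normal(size=(D_X, d_v))
K, V = X @ W_K, X @ W_V  # computed once, reused by every query

# Training (teacher forcing): all T queries are known up front,
# so one batched call gives the same result as T separate calls.
Q_train = rng.normal(size=(T, d_k))  # hypothetical ground-truth-derived queries
Y_parallel = attend(Q_train, K, V)
Y_stepwise = np.stack([attend(Q_train[t:t + 1], K, V)[0] for t in range(T)])

# Inference: the query for step t is built from the output of step t - 1,
# so the steps have to run in order.
W_R = rng.normal(size=(d_v, d_k))    # hypothetical output-to-query map
q = np.zeros((1, d_k))               # hypothetical start query
outputs = []
for t in range(T):
    y = attend(q, K, V)              # (1, d_v)
    outputs.append(y[0])
    q = np.tanh(y @ W_R)             # next query depends on this step's output
Y_generated = np.stack(outputs)

print(np.allclose(Y_parallel, Y_stepwise), Y_generated.shape)
```

In both modes `K` and `V` are built a single time from `X`; only the query side differs in whether it can be batched.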

Chilfox
