Multi-Head Self-Attention Layer (MHSA)
1. Introduction
While a Standard Attention Layer uses a single attention distribution to aggregate values, Multi-Head Self-Attention (MHSA) allows the model to jointly attend to information from different representation subspaces at different positions.
Think of it as having multiple “Standard Attention” layers running in parallel, each focusing on a different aspect of the sequence (e.g., one head for grammar, one for vocabulary, one for long-range dependencies).
2. Input & Hyperparameters
- Input Matrix: $X \in \mathbb{R}^{n \times d_{\text{model}}}$ (sequence length $n$, model dimension $d_{\text{model}}$)
- Number of Heads: $h$ (usually 8, 12, or 16)
- Head Dimension: $d_k = d_{\text{model}} / h$
- Weight Matrices (for each head $i$): $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- Output Projection Matrix: $W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}}$
3. Computation Pipeline
Step 1: Parallel Linear Projections
For each head $i$, project the input into Query, Key, and Value subspaces:

$$Q_i = X W_i^Q, \qquad K_i = X W_i^K, \qquad V_i = X W_i^V$$

(Shape of $Q_i, K_i, V_i$: $n \times d_k$)
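The projections can be sketched in NumPy as follows; the toy dimensions ($n = 4$, $d_{\text{model}} = 16$, $h = 2$) and random weights are assumptions for illustration only:

```python
import numpy as np

# Toy hyperparameters (assumptions, not prescribed by the text)
n, d_model, h = 4, 16, 2
d_k = d_model // h  # head dimension

rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))  # input matrix X

# One set of projection weights per head i
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]

# Q_i = X W_i^Q, etc. -- each head gets its own (n, d_k) subspace
Q = [X @ W_Q[i] for i in range(h)]
K = [X @ W_K[i] for i in range(h)]
V = [X @ W_V[i] for i in range(h)]

print(Q[0].shape)  # (4, 8)
```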
Step 2: Independent Attention Heads
Compute the attention output for each head independently using the Scaled Dot-Product Attention formula (refer to Standard Attention Layer for details):

$$\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

(Shape of $\text{head}_i$: $n \times d_k$)
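A minimal sketch of one head's scaled dot-product attention; the helper names and toy shapes ($n = 4$, $d_k = 8$) are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    """Scaled dot-product attention for a single head (sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (n, n) similarity scores
    weights = softmax(scores, axis=-1) # each row sums to 1
    return weights @ V                 # (n, d_k) head output

# Toy inputs (assumptions): n = 4 positions, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
head_out = attention_head(Q, K, V)
print(head_out.shape)  # (4, 8)
```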
Step 3: Concatenation
Combine the outputs of all heads by concatenating them along the feature dimension:

$$\text{Concat}(\text{head}_1, \dots, \text{head}_h)$$

(Shape: $n \times h d_k = n \times d_{\text{model}}$)
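Concatenation is a single NumPy call along the last axis; the toy shapes below are assumptions:

```python
import numpy as np

# Sketch of Step 3: h = 2 heads, each producing an (n, d_k) = (4, 8) output
n, d_k, h = 4, 8, 2
rng = np.random.default_rng(0)
head_outputs = [rng.standard_normal((n, d_k)) for _ in range(h)]

# Concatenate along the feature axis: result is (n, h * d_k)
concat = np.concatenate(head_outputs, axis=-1)
print(concat.shape)  # (4, 16)
```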
Step 4: Final Linear Projection
Apply a final linear layer to mix the information gathered by all heads:

$$\text{MHSA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$

(Shape: $n \times d_{\text{model}}$)
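Putting Steps 1 through 4 together, here is a loop-form sketch of the whole pipeline (toy dimensions and random weights are assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(X, W_Q, W_K, W_V, W_O):
    """Multi-head self-attention, one head at a time (illustrative sketch)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # Step 1: projections
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # Step 2: scaled dot-product
        heads.append(softmax(scores) @ V)
    concat = np.concatenate(heads, axis=-1)       # Step 3: concatenation
    return concat @ W_O                           # Step 4: output projection

# Toy dimensions (assumptions)
n, d_model, h = 4, 16, 2
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))

out = mhsa(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (4, 16) -- same shape as the input X
```

Note that the output has the same shape as the input, which is what lets Transformer blocks be stacked.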
4. Multi-Head vs. Standard Attention: Key Differences
| Feature | Standard (Single-Head) | Multi-Head |
| --- | --- | --- |
| Perspective | One global weighted average. | Multiple “points of view” (subspaces). |
| Resolution | High-variance scores can dominate. | Different heads can focus on different items simultaneously. |
| Complexity | $O(n^2 \cdot d_{\text{model}})$ | $O(n^2 \cdot d_{\text{model}})$ (computationally similar, since each head works in the reduced dimension $d_k = d_{\text{model}}/h$). |
| Feature Extraction | Struggles with overlapping relationships. | Can capture syntax, semantics, and proximity in parallel. |
5. Implementation Efficiency
In practice, we don’t perform a separate matrix multiplication per head. Instead, we use a single large weight matrix to project $X$ into all Query, Key, and Value dimensions at once, then reshape and transpose the resulting tensor to separate the heads. This allows for massive parallelization on GPUs.
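The fused approach can be sketched as follows; the function name `mhsa_fused`, the stacked weight layout `W_qkv`, and the toy sizes are all assumptions for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_fused(X, W_qkv, W_O, h):
    """Fused projection: one big matmul, then reshape/transpose into heads (sketch)."""
    n, d_model = X.shape
    d_k = d_model // h
    qkv = X @ W_qkv                      # (n, 3*d_model): a single large projection
    Q, K, V = np.split(qkv, 3, axis=-1)  # each (n, d_model)
    # (n, d_model) -> (h, n, d_k): reshape and transpose to separate the heads
    Q = Q.reshape(n, h, d_k).transpose(1, 0, 2)
    K = K.reshape(n, h, d_k).transpose(1, 0, 2)
    V = V.reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n), batched over heads
    heads = softmax(scores) @ V                       # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, h * d_k)
    return concat @ W_O

# Toy dimensions (assumptions)
n, d_model, h = 4, 16, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))
W_qkv = rng.standard_normal((d_model, 3 * d_model))
W_O = rng.standard_normal((d_model, d_model))
print(mhsa_fused(X, W_qkv, W_O, h).shape)  # (4, 16)
```

All $h$ attention computations run as one batched matmul over the leading head axis, which is what GPUs parallelize well.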