Multi-Head Self-Attention Layer (MHSA)
1. Introduction
While a Standard Attention Layer uses a single attention distribution to aggregate values, Multi-Head Self-Attention (MHSA) allows the model to jointly attend to information from different representation subspaces at different positions.
Think of it as having multiple “Standard Attention” layers running in parallel, each focusing on a different aspect of the sequence (e.g., one head for grammar, one for vocabulary, one for long-range dependencies).
2. Input & Hyperparameters
- Input Matrix: $X \in \mathbb{R}^{n \times d_{\text{model}}}$ (sequence length $n$, model dimension $d_{\text{model}}$)
- Number of Heads: $h$ (usually 8, 12, or 16)
- Head Dimension: $d_k = d_{\text{model}} / h$
- Weight Matrices (for each head $i$): $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- Output Projection Matrix: $W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}}$
3. Computation Pipeline
Step 1: Parallel Linear Projections
For each head $i$, project the input into Query, Key, and Value subspaces:

$$Q_i = X W_i^Q, \qquad K_i = X W_i^K, \qquad V_i = X W_i^V$$

(Shape of $Q_i, K_i, V_i$: $n \times d_k$)
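The projections can be sketched in NumPy as follows; the toy dimensions ($n = 4$, $d_{\text{model}} = 16$, $h = 2$) and random weights are assumptions for illustration only:

```python
import numpy as np

# Toy hyperparameters (assumptions, not prescribed by the text)
n, d_model, h = 4, 16, 2
d_k = d_model // h  # head dimension

rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))  # input matrix X

# One set of projection weights per head i
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]

# Q_i = X W_i^Q, etc. -- each head gets its own (n, d_k) subspace
Q = [X @ W_Q[i] for i in range(h)]
K = [X @ W_K[i] for i in range(h)]
V = [X @ W_V[i] for i in range(h)]

print(Q[0].shape)  # (4, 8)
```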
Step 2: Independent Attention Heads
Compute the attention output for each head independently using the Scaled Dot-Product Attention formula (refer to Standard Attention Layer for details):

$$\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

(Shape of $\text{head}_i$: $n \times d_k$)
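A minimal sketch of one head's scaled dot-product attention; the helper names and toy shapes ($n = 4$, $d_k = 8$) are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    """Scaled dot-product attention for a single head (sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (n, n) similarity scores
    weights = softmax(scores, axis=-1) # each row sums to 1
    return weights @ V                 # (n, d_k) head output

# Toy inputs (assumptions): n = 4 positions, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
head_out = attention_head(Q, K, V)
print(head_out.shape)  # (4, 8)
```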
Step 3: Concatenation
Combine the outputs of all heads by concatenating them along the feature dimension:

$$\text{Concat}(\text{head}_1, \dots, \text{head}_h)$$

(Shape: $n \times h d_k = n \times d_{\text{model}}$)
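Concatenation is a single NumPy call along the last axis; the toy shapes below are assumptions:

```python
import numpy as np

# Sketch of Step 3: h = 2 heads, each producing an (n, d_k) = (4, 8) output
n, d_k, h = 4, 8, 2
rng = np.random.default_rng(0)
head_outputs = [rng.standard_normal((n, d_k)) for _ in range(h)]

# Concatenate along the feature axis: result is (n, h * d_k)
concat = np.concatenate(head_outputs, axis=-1)
print(concat.shape)  # (4, 16)
```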
Step 4: Final Linear Projection
Apply a final linear layer to mix the information gathered by all heads:

$$\text{MHSA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$

(Shape: $n \times d_{\text{model}}$)
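Putting Steps 1 through 4 together, here is a loop-form sketch of the whole pipeline (toy dimensions and random weights are assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(X, W_Q, W_K, W_V, W_O):
    """Multi-head self-attention, one head at a time (illustrative sketch)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # Step 1: projections
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # Step 2: scaled dot-product
        heads.append(softmax(scores) @ V)
    concat = np.concatenate(heads, axis=-1)       # Step 3: concatenation
    return concat @ W_O                           # Step 4: output projection

# Toy dimensions (assumptions)
n, d_model, h = 4, 16, 2
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))

out = mhsa(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (4, 16) -- same shape as the input X
```

Note that the output has the same shape as the input, which is what lets Transformer blocks be stacked.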
4. Multi-Head vs. Standard Attention: Key Differences
| Feature | Standard (Single-Head) | Multi-Head |
| --- | --- | --- |
| Perspective | One global weighted average. | Multiple “points of view” (subspaces). |
| Resolution | High-variance scores can dominate. | Different heads can focus on different items simultaneously. |
| Complexity | $O(n^2 \cdot d_{\text{model}})$ | $O(n^2 \cdot d_{\text{model}})$ (computationally similar, since each head works in the reduced dimension $d_k = d_{\text{model}}/h$). |
| Feature Extraction | Struggles with overlapping relationships. | Can capture syntax, semantics, and proximity in parallel. |
5. Implementation Efficiency
In practice, we don’t perform a separate matrix multiplication per head. Instead, we use a single large weight matrix to project $X$ into all Query, Key, and Value dimensions at once, then reshape and transpose the resulting tensor to separate the heads. This allows for massive parallelization on GPUs.
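The fused approach can be sketched as follows; the function name `mhsa_fused`, the stacked weight layout `W_qkv`, and the toy sizes are all assumptions for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_fused(X, W_qkv, W_O, h):
    """Fused projection: one big matmul, then reshape/transpose into heads (sketch)."""
    n, d_model = X.shape
    d_k = d_model // h
    qkv = X @ W_qkv                      # (n, 3*d_model): a single large projection
    Q, K, V = np.split(qkv, 3, axis=-1)  # each (n, d_model)
    # (n, d_model) -> (h, n, d_k): reshape and transpose to separate the heads
    Q = Q.reshape(n, h, d_k).transpose(1, 0, 2)
    K = K.reshape(n, h, d_k).transpose(1, 0, 2)
    V = V.reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n), batched over heads
    heads = softmax(scores) @ V                       # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, h * d_k)
    return concat @ W_O

# Toy dimensions (assumptions)
n, d_model, h = 4, 16, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))
W_qkv = rng.standard_normal((d_model, 3 * d_model))
W_O = rng.standard_normal((d_model, d_model))
print(mhsa_fused(X, W_qkv, W_O, h).shape)  # (4, 16)
```

All $h$ attention computations run as one batched matmul over the leading head axis, which is what GPUs parallelize well.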