Introduction

Idea

We normalize the outputs of a layer so they have zero mean and unit variance. We already did this to the inputs in linear classification, and it works the same way inside a neural network.

Why does it work?

When we draw the contour map of the loss function without normalization, the contours often look like an elongated oval instead of a circle. A step size that suits the flat direction is too large for the steep one, which makes it very easy to overshoot.

After we apply normalization to the input, the contours of the loss become closer to a circle, which makes optimization easier and more efficient.
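The overshooting argument can be illustrated with a toy quadratic loss. This is only a sketch under assumed values: the curvatures `[1, 100]` (oval) and `[1, 1]` (circle), the learning rate `0.5`, and the helper name `gradient_descent` are all made up for illustration.

```python
import numpy as np

def gradient_descent(curvatures, lr, steps=20):
    """Plain gradient descent on f(w) = 0.5 * sum(curvatures * w**2)."""
    w = np.ones_like(curvatures, dtype=float)  # start at (1, 1)
    for _ in range(steps):
        grad = curvatures * w  # gradient of the quadratic loss
        w = w - lr * grad
    return w

# Oval contours: one direction is 100x steeper than the other (assumed values).
w_oval = gradient_descent(np.array([1.0, 100.0]), lr=0.5)
# Circular contours: same curvature in every direction.
w_circle = gradient_descent(np.array([1.0, 1.0]), lr=0.5)

print("oval:  ", np.linalg.norm(w_oval))    # grows: each step overshoots the steep axis
print("circle:", np.linalg.norm(w_circle))  # shrinks toward the minimum at 0
```

With the same learning rate, the oval case diverges along its steep axis while the circular case converges, which is the behaviour the contour picture describes.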

Mathematical Expression

$$
\begin{aligned}
μ &= \frac{1}{N} \sum_{i=1}^{N} x_{i} \\
σ^{2} &= \frac{1}{N} \sum_{i=1}^{N} (x_{i} - μ)^{2} \\
\hat{x}_{i} &= \frac{x_{i} - μ}{\sqrt{σ^{2} + ϵ}} \\
y_{i} &= γ \hat{x}_{i} + β
\end{aligned}
$$
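The four steps above can be sketched in NumPy as follows. This is a minimal illustration, assuming the input is a 2-D array of shape `(N, D)` and that statistics are taken over the `N` axis. The function name `normalize` and the default `eps=1e-5` are assumptions, not part of the notes.

```python
import numpy as np

def normalize(x, gamma, beta, eps=1e-5):
    """Normalize each feature of x (shape (N, D)) over the N examples."""
    mu = x.mean(axis=0)                    # per-feature mean
    var = x.var(axis=0)                    # per-feature variance (1/N convention)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, roughly unit variance
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
y = normalize(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` set to ones and `beta` set to zeros, each column of `y` has mean close to 0 and variance close to 1.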

Gamma $γ$ and Beta $β$

After normalization, the output goes through an activation function (e.g., ReLU). Forcing every output to have exactly zero mean and unit variance can be too restrictive for what the activation needs as input, so we introduce a learnable scale $γ$ and shift $β$ after normalization to let the network adjust the output's distribution.
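A worked special case, added here for illustration, shows how much freedom $γ$ and $β$ give back. Assuming the normalized value is $\hat{x}_{i} = (x_{i} - μ) / \sqrt{σ^{2} + ϵ}$, the network can undo the normalization entirely if it learns the following values:

$$
γ = \sqrt{σ^{2} + ϵ}, \quad β = μ \quad \Longrightarrow \quad y_{i} = \sqrt{σ^{2} + ϵ} \cdot \frac{x_{i} - μ}{\sqrt{σ^{2} + ϵ}} + μ = x_{i}
$$

So the normalized layer can represent the identity mapping, and any other scale and shift in between.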

Test Process

After training, we may feed only a single test example into the network, and normalizing over one example is meaningless. Hence, when using the model we normalize with the average mean and variance computed during training.

With these constants, the normalization reduces to the equation below, which is a linear (affine) function of $x_{i}$.

$$
y_{i} = γ \frac{x_{i} - μ}{\sqrt{σ^{2} + ϵ}} + β
$$
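The difference between training and test time can be sketched as below. The class name `SimpleBatchNorm`, the exponential moving average with `momentum=0.1`, and the `training` flag are assumptions made for illustration. The notes only say that averaged statistics from training are reused.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy normalization layer for inputs of shape (N, D)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum  # assumed update rate for the running averages
        self.eps = eps

    def forward(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Track an exponential moving average of the batch statistics.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Test time: fixed constants, so the layer is affine in x.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=4)
for _ in range(200):  # "training": only the statistics are updated here
    bn.forward(rng.normal(loc=3.0, scale=2.0, size=(64, 4)), training=True)

x_single = rng.normal(loc=3.0, scale=2.0, size=(1, 4))
y_single = bn.forward(x_single, training=False)  # works with one example
```

At test time the layer never looks at batch statistics, so a batch of size one is handled without any special case.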

Usually we insert normalization after fully connected or convolutional layers and before the nonlinearity, i.e., the activation function.

Different Kinds of Normalization

Batch Normalization

In batch normalization, we normalize one channel at a time, using all $N$ examples and all $H \times W$ spatial positions.

That is, $μ$, $σ^{2}$, $γ$, and $β$ all have the same size $C$, the number of channels.
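For a feature map stored as `(N, C, H, W)`, the per-channel statistics can be sketched as below. The layout, the function name `batch_norm_2d`, and `eps` are assumptions for illustration. Only the training-time computation is shown.

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per channel, over N, H and W."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta hold one value per channel, shape (C,)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(4, 3, 5, 5))
y = batch_norm_2d(x, gamma=np.ones(3), beta=np.zeros(3))
```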

Layer Normalization

Normalize across all features within one sample at a time, i.e., normalize across $C \times H \times W$

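$μ$ and $σ^{2}$ have size $N$, one value per example. $γ$ and $β$ are learned parameters, so they are defined per feature rather than per example.

Layer normalization is seldom used in CNNs, since the channels are not closely related and pooling them into a single mean and variance is less meaningful.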
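A minimal sketch of layer normalization on an `(N, C, H, W)` array is shown below. The function name `layer_norm` is hypothetical, and giving `gamma` and `beta` the shape `(C, H, W)` is an assumption that follows common practice rather than anything stated in the notes.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per example, over C, H and W."""
    mu = x.mean(axis=(1, 2, 3), keepdims=True)  # shape (N, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)  # shape (N, 1, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # gamma, beta broadcast over the N axis

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(4, 3, 5, 5))
y = layer_norm(x, gamma=np.ones((3, 5, 5)), beta=np.zeros((3, 5, 5)))
```

Because the statistics depend on one example only, the same code runs unchanged at training and test time.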

Instance Normalization

Normalize each channel of each example separately, i.e., normalize across $H \times W$.

$μ$ and $σ^{2}$ have size $N \times C$, one value per example and channel. $γ$ and $β$ are learned parameters, so they do not depend on the example.
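Instance normalization only changes the axes over which the statistics are taken. The sketch below assumes an `(N, C, H, W)` layout and per-channel `gamma` and `beta` of shape `(C,)`. The function name `instance_norm` is hypothetical.

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per example and channel, over H and W."""
    mu = x.mean(axis=(2, 3), keepdims=True)  # shape (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)  # shape (N, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(4, 3, 5, 5))
y = instance_norm(x, gamma=np.ones(3), beta=np.zeros(3))
```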

Group Normalization

Group normalization divides the channels into groups and normalizes within each group. Instead of normalizing each channel separately (like instance norm) or all channels together (like layer norm), it normalizes a few channels at a time. Each normalized region has size $G \times H \times W$, where $G$ here is the number of channels in one group.
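The grouping can be sketched with a reshape. The parameter `num_groups` (how many groups the channels are split into) and the function name `group_norm` are assumptions for illustration. Each normalized region then covers the channels of one group times $H \times W$.

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within groups of channels."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must split evenly into groups"
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = grouped.mean(axis=(2, 3, 4), keepdims=True)  # one mean per example and group
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((grouped - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return gamma.reshape(1, c, 1, 1) * x_hat + beta.reshape(1, c, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(2, 6, 4, 4))
y = group_norm(x, num_groups=3, gamma=np.ones(6), beta=np.zeros(6))
```

In this sketch, `num_groups=1` takes statistics over all channels of an example, as layer norm does, and `num_groups` equal to the number of channels takes them per channel, as instance norm does.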
