Introduction

The VGG network consists of 5 stages; each stage is made up of several convolutional layers with 3×3 filters followed by a max-pooling layer.
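The stage structure can be summarized as a flat configuration list, a convention popularized by common VGG implementations. The sketch below is illustrative (the channel counts match VGG-16, but the list format itself is an assumption, not the original code):

```python
# Illustrative VGG-16 configuration: numbers are the output channels of
# 3x3 conv layers, "M" marks a 2x2 max-pooling layer that ends a stage.
VGG16_CFG = [
    64, 64, "M",          # stage 1
    128, 128, "M",        # stage 2
    256, 256, 256, "M",   # stage 3
    512, 512, 512, "M",   # stage 4
    512, 512, 512, "M",   # stage 5
]

num_stages = VGG16_CFG.count("M")
num_convs = len(VGG16_CFG) - num_stages
print(num_stages, num_convs)  # 5 stages, 13 conv layers
```

Note how the channel count doubles after each of the first three poolings, which is exactly the second design principle discussed below.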
Importance
Before VGG, designing CNN architectures was largely guesswork. Researchers experimented with different layer combinations without clear guidelines, making it difficult to understand why certain designs worked better.
VGG changed this by introducing systematic design principles. Instead of random experimentation, VGG demonstrated that CNN architecture could follow logical, predictable rules.
Design Principles in VGG
1. All Convolutional Layers are 3×3, Stride 1, Pad 1
Statement
Every convolutional layer in the CNN should use 3×3 filters with stride 1 and padding 1, so that convolutions preserve the spatial size of the feature map.
Reason
| Layer | Parameters | FLOPs |
|---|---|---|
| Conv(5×5, C→C) | 25C² | 25C²HW |
| Conv(3×3, C→C) + Conv(3×3, C→C) | 18C² | 18C²HW |
If we stack two 3×3 convs together, their combined receptive field is 5×5, the same as a single 5×5 conv. From the above comparison we can observe that stacking multiple 3×3 convs is better than using one larger conv in several ways:
- Use less parameters
- Require less FLOPs
- We can insert a ReLU between the convs, which adds nonlinearity and makes the network deeper; deeper networks with more nonlinearity have repeatedly been shown to perform better
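The parameter and FLOP counts in the table can be checked with simple arithmetic. A short sketch (the channel count and feature-map size are arbitrary example values, and bias terms are ignored):

```python
def conv_params(k, c_in, c_out):
    """Weights of a single k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def conv_flops(k, c_in, c_out, h, w):
    """Multiply-accumulates to produce a c_out x h x w output."""
    return k * k * c_in * c_out * h * w

C, H, W = 64, 56, 56  # example channel count and feature-map size

# One 5x5 conv vs. two stacked 3x3 convs (same 5x5 receptive field)
p5  = conv_params(5, C, C)       # 25 * C^2
p33 = 2 * conv_params(3, C, C)   # 18 * C^2
f5  = conv_flops(5, C, C, H, W)  # 25 * C^2 * H * W
f33 = 2 * conv_flops(3, C, C, H, W)  # 18 * C^2 * H * W

print(p5, p33)  # 102400 73728 -> the stacked pair uses 28% fewer weights
print(f5 > f33)  # True: fewer FLOPs as well
```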
2. All Max Pooling is Stride 2, and Channels Double After Pooling
Statement
Each max pooling cuts the spatial size H and W in half, and the subsequent conv then doubles the number of output channels.
Reason
| Input | Conv | Memory | Parameters | FLOPs |
|---|---|---|---|---|
| C×H×W | Conv(3×3, C→C) | CHW | 9C² | 9C²HW |
| 2C×(H/2)×(W/2) | Conv(3×3, 2C→2C) | CHW/2 | 36C² | 9C²HW |
The goal of the pooling operation is to extract the important information from the image, so that slightly shifting or perturbing the image does not change the prediction much.
We also want every convolutional layer in the network to cost the same number of FLOPs. As the table above shows, doubling the channels after each max pooling keeps the FLOPs constant across stages: halving H and W cuts the output positions by 4×, while doubling the channels multiplies the per-position cost by 4×.
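This balance can be verified numerically. A minimal sketch, assuming an input of 64 channels at 224×224 (example values, not from the original text):

```python
def conv_flops(k, c, h, w):
    # Multiply-accumulates for a k x k conv, c -> c channels, h x w output
    return k * k * c * c * h * w

# Halve H and W, double C at each stage; FLOPs should stay constant.
C, H, W = 64, 224, 224
stage_flops = []
for _ in range(4):
    stage_flops.append(conv_flops(3, C, H, W))
    C, H, W = 2 * C, H // 2, W // 2

print(stage_flops)
assert len(set(stage_flops)) == 1  # identical FLOPs at every stage
```

The 4× loss in output positions is exactly cancelled by the 4× gain from squaring the doubled channel count, which is why the rule uses a factor of 2 and not some other number.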
Computation Resource Usage
