What are Convolutional Layers

Definition

Convolutional layers use small learnable filters (kernels) that slide across input data to detect local patterns and features.

Each neuron in a CNN contributes exactly one value to the output feature map, computed from a local patch of the input

How It Works

The Process:

  1. Create Filter: Make a 3×5×5 kernel whose depth matches the depth of the input image (here, a 3×32×32 image)

  2. Slide Across Image: Move the filter left-to-right, top-to-bottom over the entire image. At each position, calculate the dot product between filter and image patch.

  3. Generate Output: Each dot product becomes one value in the activation map. All values together form the final 28×28 output (32 − 5 + 1 = 28 in each dimension).
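The three steps above can be sketched directly with nested loops. This is a minimal NumPy illustration (random data, not an efficient implementation): a 3×5×5 filter slides over a 3×32×32 image and each dot product fills one cell of a 28×28 activation map.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 32, 32))   # depth x height x width
kernel = rng.standard_normal((3, 5, 5))    # depth matches image depth

C, H, W = image.shape
_, K, _ = kernel.shape
out = np.empty((H - K + 1, W - K + 1))     # 28x28 activation map

# Slide top-to-bottom, left-to-right; each dot product is one output value.
for i in range(H - K + 1):
    for j in range(W - K + 1):
        patch = image[:, i:i + K, j:j + K]
        out[i, j] = np.sum(patch * kernel)  # dot product of filter and patch

print(out.shape)  # (28, 28)
```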

Generalization

Parameters

| Parameter | Explanation |
| --- | --- |
| N | Number of images in a batch |
| C_in | The number of channels in the input image |
| C_out | The number of filters in the convolution layer, which is also the number of channels in the output |
| K_w, K_h | The width and height of the filters |

Explanation

Input: The input has N images, each of size H×W with C_in channels

Convolutional Layer: A convolutional layer can have multiple filters. The number of filters is C_out, which also determines the number of channels in the output

Every filter has only one bias value: I-20250722-Bias_in_ConvLayer

Output (Activation Map): We’ll have N outputs, each with C_out channels; the width and height are determined by both the input image size and the filter size
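Putting the parameters together, a full layer can be sketched as below (a slow reference implementation with hypothetical sizes, not how frameworks compute it): N images with C_in channels go in, C_out filters each with one bias are applied, and N outputs with C_out channels come out.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_in, H, W = 2, 3, 8, 8      # batch of 2 images, 3 channels, 8x8
C_out, K = 4, 3                 # 4 filters, each 3x3 (depth C_in)

x = rng.standard_normal((N, C_in, H, W))
filters = rng.standard_normal((C_out, C_in, K, K))
bias = rng.standard_normal(C_out)          # one bias value per filter

out = np.empty((N, C_out, H - K + 1, W - K + 1))
for n in range(N):
    for f in range(C_out):
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                patch = x[n, :, i:i + K, j:j + K]
                out[n, f, i, j] = np.sum(patch * filters[f]) + bias[f]

print(out.shape)  # (2, 4, 6, 6): N outputs, each with C_out channels
```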

Stacking Convolutional Layers

If we stack two convolutional layers directly, the composition is still a single convolution, just as stacking two linear classifiers yields another linear classifier.

Thus, we put an activation function after each convolutional layer to keep the stacked layers from collapsing into one
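The collapse, and how an activation prevents it, can be seen in a tiny 1D sketch using `np.convolve` (full convolution, which is associative): two stacked convolutions equal one convolution with the combined kernel, but inserting a ReLU between them breaks the equality.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k1 = np.array([1., -1.])
k2 = np.array([0.5, 0.5])

# Two stacked convolutions without a nonlinearity...
two_layers = np.convolve(np.convolve(x, k1), k2)
# ...collapse into a single convolution with the combined kernel.
one_layer = np.convolve(x, np.convolve(k1, k2))
print(np.allclose(two_layers, one_layer))  # True

# A ReLU between the layers breaks this collapse.
relu = lambda v: np.maximum(v, 0.0)
with_relu = np.convolve(relu(np.convolve(x, k1)), k2)
print(np.allclose(with_relu, one_layer))  # False
```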

What do Convolutional Layers Learn

The filters in the early layers often learn patterns like oriented edges and opposing colors. Later layers then learn larger, more complex patterns

This kind of conv layer is called "2D convolution". There is also 1D convolution, which can be used for NLP, and 3D convolution, which can be used for 3D computer vision


Padding

Introduction

In the previous section, we learned that the width and height of the convolutional layer’s output can be calculated as W′ = W − K + 1 (for input size W and kernel size K).

This suggests that feature maps “shrink” with each layer, which limits the number of layers we can have in the neural network

The concept of “padding” aims to deal with this problem

How Padding Works

Before sending the feature map into the convolutional layer, we add extra rows and columns around its border, whose values are often set to 0

There are also other ways of filling the additional values added by padding

Then, since we’ve increased the size of the feature map, the activation map will have a larger width and height based on our choice of padding: W′ = W − K + 1 + 2P for padding P

A common choice of padding is P = (K − 1) / 2, since this makes the output the same size as the input. This specific padding is called "same padding"
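A small NumPy sketch of same padding (random data, assumed sizes): with K = 3 and P = (K − 1)/2 = 1, zero-padding the input before the convolution keeps the output the same size as the input.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))        # single-channel 6x6 feature map
K = 3
P = (K - 1) // 2                       # same padding: P = 1
kernel = rng.standard_normal((K, K))

xp = np.pad(x, P)                      # zeros on all four sides -> 8x8
H, W = xp.shape
out = np.empty((H - K + 1, W - K + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(xp[i:i + K, j:j + K] * kernel)

print(out.shape == x.shape)  # True: 6x6 in, 6x6 out
```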


Receptive Field

For a convolution with kernel size K, each element in the output depends on a K × K receptive field in the input

Each successive convolution adds K − 1 elements to the receptive field (in each dimension)

With L layers, the receptive field in the input is 1 + L(K − 1)
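This formula is easy to check as a tiny helper (function name is mine, for illustration): starting from a single element, each of the L layers adds K − 1.

```python
# Receptive field after stacking num_layers convolutions of kernel_size:
# RF = 1 + num_layers * (kernel_size - 1)
def receptive_field(num_layers: int, kernel_size: int) -> int:
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(1, 3))  # 3: one 3x3 conv sees a 3x3 patch
print(receptive_field(4, 3))  # 9: four stacked 3x3 convs see a 9x9 patch
```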


Strided Convolution

Motivation

For large images, i.e., images with many pixels, we need many layers for each output to see the entire image (since each layer only adds K − 1 to the receptive field)

Hence, we introduce the concept of “strided convolution”

How Does It Work?

Strided convolution is a technique where the filter moves across the input in steps larger than one pixel. This not only decreases the spatial dimensions of the output feature map, but also increases the growth of the receptive field per layer

With this technique applied, the output dimension can be calculated as W′ = (W − K + 2P) / S + 1, where S is the stride
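The output-size formula as a small helper (function name is mine): with same padding, stride 1 preserves the size while stride 2 halves it.

```python
# Output width for input W, kernel K, padding P, stride S:
# W' = (W - K + 2P) // S + 1
def conv_output_size(w: int, k: int, p: int, s: int) -> int:
    return (w - k + 2 * p) // s + 1

print(conv_output_size(32, 3, 1, 1))  # 32: same padding, stride 1 keeps the size
print(conv_output_size(32, 3, 1, 2))  # 16: stride 2 halves the size
```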

We call the process of reducing the spatial resolution of a feature map while preserving important information "downsampling"


Common Settings of Convolutional Layers

General Rules

  1. Square kernels: K_w = K_h = K
  2. Padding: P = (K − 1) / 2 ("same" padding)
  3. Channels: the number of channels is usually a power of 2 (e.g., 32, 64, 128)

Standard Configurations

| K | P | S | Purpose |
| --- | --- | --- | --- |
| 3 | 1 | 1 | Standard conv |
| 5 | 2 | 1 | Large receptive field |
| 1 | 0 | 1 | Channel mixing |
| 3 | 1 | 2 | Downsample |

Channel Mixing (1×1 CONV)

What it does: Takes all channel values at each pixel location and combines them using learned weights.

Think of it as: Running a mini fully-connected network independently at every pixel position.

Example:

  • Input: 3 channels at a pixel → values (r, g, b)
  • Output: 1 channel → a learned combination like w₁r + w₂g + w₃b + bias
  • This happens simultaneously for all pixels in the image

Unlike regular convolution, which mixes spatial neighbors, a 1×1 conv only mixes channels at the same spatial location
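A short NumPy sketch of this view (random data, assumed sizes): a 1×1 conv mapping 3 channels to 2 is exactly the same matrix multiply applied at every pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))     # 3 channels, 5x5 image
w = rng.standard_normal((2, 3))        # 2 output channels from 3 inputs

# 1x1 conv: a weighted combination of the channels at each location.
out = np.einsum('oc,chw->ohw', w, x)

# The same thing as a per-pixel matrix multiply ("mini FC net" per pixel).
per_pixel = (w @ x.reshape(3, -1)).reshape(2, 5, 5)
print(np.allclose(out, per_pixel))  # True
print(out.shape)                    # (2, 5, 5): spatial size unchanged
```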

Note that a 1×1 convolution leaves the spatial dimensions unchanged; it is the stride-2 configuration (K = 3, P = 1, S = 2) in the table above that reduces the spatial dimensions by half