Sigmoid Function

Introduction

$$σ(x) = \frac{1}{1 + e^{-x}}$$

Squashes numbers into the range $(0, 1)$
Historically popular because its output has a nice interpretation as the “firing rate” of a neuron

3 Problems

1. Saturated neurons “kill” the gradients

The Problem:

When inputs to sigmoid have large magnitude (very positive or very negative), the sigmoid function becomes nearly flat, causing its derivative to approach zero.

The Consequence:

During backpropagation, gradients flowing through saturated neurons become extremely small (near zero), effectively blocking gradient flow to earlier layers.

Why This Breaks Training:

Layers before the saturated neuron receive virtually no gradient signal, so their weights barely update. This “kills” learning in those earlier layers, leaving parts of the network unable to improve regardless of how many training iterations you run.
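The saturation effect can be checked numerically. The sketch below uses the identity $σ'(x) = σ(x)(1 - σ(x))$ and only the standard library; the sample inputs and the `depth` value are illustrative assumptions, not values from the lecture.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Local gradient: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The local gradient peaks at 0.25 for x = 0 and collapses as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")

# Backpropagating through several saturated sigmoids multiplies these
# small factors together (depth is a hypothetical layer count).
depth = 5
print("product of local gradients at x=10:", sigmoid_grad(10.0) ** depth)
```

Even in the best case the local gradient is only 0.25, so a deep stack of sigmoids shrinks the gradient at every layer, saturated or not.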

2. Sigmoid outputs are not zero-centered

The Problem:

In a neural network layer:

$$h_{i}^{(ℓ)} = \sum_{j} w_{i,j}^{(ℓ)} σ(h_{j}^{(ℓ-1)}) + b_{i}^{(ℓ)}$$

The local gradient with respect to each weight is

$$\partial h_{i}^{(ℓ)} / \partial w_{i,j}^{(ℓ)} = σ(h_{j}^{(ℓ-1)})$$

Since sigmoid always outputs positive values, $σ(x) > 0$ for all $x$, every one of these local gradients is positive.

The Consequence:

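During backpropagation, the chain rule gives $\partial L / \partial w_{i,j}^{(ℓ)} = (\partial L / \partial h_{i}^{(ℓ)}) \, σ(h_{j}^{(ℓ-1)})$. Because the second factor is always positive, all weight gradients of neuron $i$ have the same sign as the upstream gradient $\partial L / \partial h_{i}^{(ℓ)}$:

If the upstream gradient is positive → ALL weight gradients of that neuron are positive
If the upstream gradient is negative → ALL weight gradients of that neuron are negative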

Why This Limits Optimization:

In weight space, a single gradient step can only move a neuron's weights in directions where they all change the same way (all increase or all decrease together). We cannot move diagonally, with some weights increasing while others decrease.

This is like being allowed to move only northeast or southwest: reaching a target in any other direction requires an inefficient zig-zag path.
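A small numeric check of the same-sign property, as a sketch: for one neuron $i$, each weight gradient is the upstream gradient times a sigmoid output. The pre-activation values and the upstream gradients $\pm 0.7$ below are made-up numbers for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def weight_grads(upstream: float, prev_pre_acts: list) -> list:
    """dL/dw_ij = (dL/dh_i) * sigmoid(h_j) for a single neuron i."""
    return [upstream * sigmoid(h) for h in prev_pre_acts]

# Hypothetical pre-activations h_j of the previous layer, mixed in sign.
prev = [-3.0, -0.5, 0.0, 1.2, 4.0]

pos = weight_grads(+0.7, prev)  # positive upstream gradient
neg = weight_grads(-0.7, prev)  # negative upstream gradient

print("upstream > 0:", [round(g, 4) for g in pos])
print("upstream < 0:", [round(g, 4) for g in neg])
```

The inputs to the previous layer differ in sign, yet the weight gradients never do, because the sigmoid maps all of them to positive values.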

3. exp() is somewhat expensive to compute

Compared with other activation functions such as ReLU, which needs only a comparison, sigmoid is relatively expensive to evaluate

Tanh

Introduction

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Squashes numbers into the range $(-1, 1)$

Pros and Cons

Output can be negative, zero, or positive (zero-centered)
Still kills the gradient when saturated
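Both properties can be seen in a few lines. This is a minimal sketch using `math.tanh` and the identity $\tanh'(x) = 1 - \tanh^{2}(x)$; the input values are arbitrary.

```python
import math

def tanh_grad(x: float) -> float:
    """Local gradient: tanh'(x) = 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

inputs = [-3.0, -0.5, 0.0, 1.2, 4.0]  # arbitrary sample inputs
outputs = [math.tanh(x) for x in inputs]

# Zero-centered: outputs take both signs, unlike sigmoid.
print("outputs:", [round(o, 4) for o in outputs])

# Saturation: the gradient is 1 at x = 0 but vanishes for large |x|.
for x in (0.0, 2.0, 10.0):
    print(f"x={x:5.1f}  grad={tanh_grad(x):.3e}")
```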

ReLU

Introduction

$$\mathrm{ReLU}(x) = \max(0, x)$$

Problems

1. Not zero-centered output

As with sigmoid, the outputs are never negative, so the same-sign gradient problem applies

2. When the input $x < 0$, the neuron can die

Problem: When the ReLU input $x < 0$:

ReLU outputs 0
Gradient $d\,\mathrm{ReLU} / dx = 0$

Consequence: Zero gradient blocks backpropagation:

The weights feeding this neuron (and everything before them) get no gradient signal through it
No weight updates occur
Weights remain unchanged

Result: Dead neuron cycle:

Same weights → same negative input → same zero gradient
If the input is negative for every training example, the neuron permanently “dies” and never recovers
Its computational capacity is lost

A way to prevent it

We can initialize the biases with a slightly positive value instead of 0, which gives the neuron a chance to start outside the “dead region”
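The dead-neuron cycle and the positive-bias remedy can be sketched with a single ReLU neuron trained by gradient descent. Everything here is an illustrative assumption: the toy data, the squared-error loss, the starting weight `w0`, and the bias value `0.1`. A positive bias improves the odds of starting in the active region but is not a guarantee.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def relu_grad(z: float) -> float:
    # By convention we use gradient 0 at z == 0.
    return 1.0 if z > 0 else 0.0

def loss(w: float, b: float, data) -> float:
    """Mean squared error (halved) of y = relu(w * x + b)."""
    return sum((relu(w * x + b) - t) ** 2 for x, t in data) / (2 * len(data))

def train(w: float, b: float, data, lr: float = 0.1, steps: int = 100):
    """Full-batch gradient descent on the loss above."""
    for _ in range(steps):
        gw = gb = 0.0
        for x, t in data:
            z = w * x + b
            dz = (relu(z) - t) * relu_grad(z)  # zero whenever z <= 0
            gw += dz * x
            gb += dz
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

# Hypothetical toy data (x, target) with all inputs positive.
data = [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]
w0 = -0.05  # slightly negative weight: w0 * x < 0 for every example

dead_w, dead_b = train(w0, 0.0, data)  # zero bias: neuron starts dead
live_w, live_b = train(w0, 0.1, data)  # small positive bias: some inputs active

print("zero bias:    ", (dead_w, dead_b), "loss", loss(dead_w, dead_b, data))
print("positive bias:", (live_w, live_b), "loss", loss(live_w, live_b, data))
```

With zero bias the parameters come back unchanged after 100 steps, because every gradient is exactly zero. With the small positive bias, some examples produce a positive pre-activation, gradients flow, and the loss decreases.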

Leaky ReLU

Introduction

$$f(x) = \max(αx, x)$$

where $α$ is a hyperparameter that is often set to $α = 0.1$

Pros and Cons

Pros:

Does not saturate
Computationally efficient
Converges much faster than sigmoid/tanh
Will not “die” (because the gradient is never 0)

Cons:

Not differentiable at $x = 0$
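A minimal sketch of Leaky ReLU and its gradient, assuming $0 < α < 1$ (which is what makes the $\max$ form valid). Returning `alpha` at exactly $x = 0$ is a convention chosen here, since the function is not differentiable at that point.

```python
def leaky_relu(x: float, alpha: float = 0.1) -> float:
    """f(x) = max(alpha * x, x), valid for 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_grad(x: float, alpha: float = 0.1) -> float:
    """Slope is 1 for x > 0 and alpha otherwise (convention at x == 0)."""
    return 1.0 if x > 0 else alpha

for x in (-5.0, -2.0, 0.0, 3.0):
    print(f"x={x:5.1f}  f(x)={leaky_relu(x):6.2f}  grad={leaky_relu_grad(x):.2f}")
```

The gradient on the negative side is small but nonzero, so a neuron with a negative input still receives a learning signal.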

Exponential Linear Unit (ELU)

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ α(e^{x} - 1) & \text{if } x \leq 0 \end{cases}$$

$α$ is a hyperparameter and is usually set to $1$

Pros and Cons

Pros:

All benefits of Leaky ReLU
Differentiable at $x = 0$ (when $α = 1$, since the left slope $α$ then matches the right slope $1$)

Cons:

Requires exp() computation
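A short sketch of ELU and its derivative using only the standard library. `math.expm1` computes $e^{x} - 1$ accurately for small $x$; the test inputs are arbitrary.

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """f(x) = x for x > 0, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Slope is 1 for x > 0 and alpha * e^x for x <= 0."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# The negative side saturates smoothly toward -alpha instead of growing linearly.
for x in (-100.0, -1.0, 0.0, 2.0):
    print(f"x={x:7.1f}  f(x)={elu(x):9.5f}  grad={elu_grad(x):.5f}")

# One-sided slopes at x = 0: the right slope is 1, the left slope is alpha,
# so they agree only when alpha == 1.
print("left slope, alpha=1.0:", elu_grad(0.0, alpha=1.0))
print("left slope, alpha=0.5:", elu_grad(0.0, alpha=0.5))
```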

Summary

Don’t think too hard, just use ReLU
Try Leaky ReLU / ELU / SELU / GELU if you need to squeeze out the last $0.1\%$
Don’t use sigmoid or tanh

Chilfox
