Recurrent Convolutional Networks (RCN)

Problem

We want to process long videos, but the added temporal dimension makes this computationally expensive with the models introduced earlier.

Two Approaches

CNN + RNN Pipeline

Video → Short Clips → CNN Features → RNN → Output

Separate spatial and temporal processing
CNN compresses spatial info into 1D vectors
RNN processes temporal sequence
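
The pipeline above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: `clip_features` is a hypothetical stand-in for a real CNN encoder (it just averages pixels per image row), and `rnn_step` uses scalar weights instead of learned weight matrices. The point is the data flow: each clip is flattened to a 1D vector before any temporal processing happens.

```python
import math

def split_into_clips(video, clip_len):
    """Cut the frame sequence into consecutive short clips."""
    return [video[i:i + clip_len] for i in range(0, len(video), clip_len)]

def clip_features(clip):
    """Stand-in for a CNN encoder (hypothetical): one number per image row.

    Averaging over frames and columns flattens each clip to a 1D vector,
    so the 2D layout is no longer available to the temporal stage.
    """
    n_rows = len(clip[0])
    feats = []
    for r in range(n_rows):
        values = [v for frame in clip for v in frame[r]]
        feats.append(sum(values) / len(values))
    return feats

def rnn_step(h_prev, x, w_h=0.5, w_x=0.5):
    """Simplified RNN update with scalar weights: h = tanh(w_h*h_prev + w_x*x)."""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(h_prev, x)]

def cnn_rnn_pipeline(video, clip_len=2):
    """Video -> short clips -> 1D features -> RNN -> final hidden state."""
    h = [0.0] * len(video[0])              # one hidden unit per feature
    for clip in split_into_clips(video, clip_len):
        x = clip_features(clip)            # spatial stage: 2D -> 1D
        h = rnn_step(h, x)                 # temporal stage on 1D vectors
    return h

# Toy video: 5 frames, each 3 rows x 4 columns
video = [[[0.1 * (t + r + c) for c in range(4)] for r in range(3)] for t in range(5)]
state = cnn_rnn_pipeline(video)
print(len(state))  # 3
```

Note that the two stages only communicate through the 1D vector `x`; nothing about where in the frame a feature occurred reaches the RNN.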

Multi-Layer RCN

Key Idea: Integrate convolutions directly into RNN structure
Maintains 2D feature maps throughout processing

Multi-Layer RCN Architecture

Two Recurrent Connections

Each hidden state $h_{j}^{i}$ (layer $i$, time step $j$) depends on:

Temporal: $h_{j - 1}^{i}$ (same layer, previous time)
Hierarchical: $h_{j}^{i - 1}$ (previous layer, same time)
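
One way to write these two dependencies as a single update, assuming a plain $\tanh$ recurrence (the notes do not fix the exact form, and gated variants are also possible). The kernels $W^{i}$ and $U^{i}$ and the bias $b^{i}$ are symbols introduced here for illustration:

$$h_{j}^{i} = \tanh\left( W^{i} * h_{j}^{i - 1} + U^{i} * h_{j - 1}^{i} + b^{i} \right)$$

Here $*$ denotes 2D convolution, $W^{i}$ acts on the hierarchical input $h_{j}^{i - 1}$, and $U^{i}$ acts on the temporal input $h_{j - 1}^{i}$. Under this convention, $h_{j}^{0}$ would be the input at time $j$ and $h_{0}^{i}$ an all-zero initial state.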

Key Difference

RNN: Uses weight matrices on 1D vectors
RCN: Uses convolutions on 2D feature maps
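
The contrast can be made concrete with a small standard-library sketch. It assumes a plain $\tanh$ update, single-channel feature maps, and odd-sized kernels with zero padding; all function names are hypothetical. In `rcn_step`, `x` plays the role of the hierarchical input (previous layer, same time) and `h_prev` the temporal input (same layer, previous time).

```python
import math

def matvec(W, x):
    """Matrix-vector product on plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(h_prev, x, W_x, W_h):
    """Vanilla RNN on 1D vectors: h = tanh(W_x x + W_h h_prev)."""
    a, b = matvec(W_x, x), matvec(W_h, h_prev)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def conv2d_same(fmap, kernel):
    """Single-channel 2D cross-correlation with zero padding.

    Assumes an odd-sized kernel; the output has the same size as the input.
    """
    H, W = len(fmap), len(fmap[0])
    kH, kW = len(kernel), len(kernel[0])
    pH, pW = kH // 2, kW // 2
    out = [[0.0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            acc = 0.0
            for i in range(kH):
                for j in range(kW):
                    rr, cc = r + i - pH, c + j - pW
                    if 0 <= rr < H and 0 <= cc < W:
                        acc += kernel[i][j] * fmap[rr][cc]
            out[r][c] = acc
    return out

def rcn_step(h_prev, x, K_x, K_h):
    """Recurrent convolutional update on 2D maps: h = tanh(K_x * x + K_h * h_prev)."""
    a, b = conv2d_same(x, K_x), conv2d_same(h_prev, K_h)
    return [[math.tanh(ai + bi) for ai, bi in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy 3 x 3 input map, zero initial state, and a 3 x 3 averaging kernel
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h0 = [[0.0] * 3 for _ in range(3)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
h1 = rcn_step(h0, x, box, box)
print(len(h1), len(h1[0]))  # 3 3
```

The two step functions have the same shape; only the linear operation changes. Because the convolution is size-preserving, the hidden state `h1` is still a 3 x 3 map, which is what "maintains 2D feature maps" amounts to in this toy setting.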

Comparison

Aspect	CNN + RNN	Multi-Layer RCN
Processing	Sequential	Integrated
Spatial Info	1D compressed	2D preserved
Architecture	Two networks	Single network

Result: RCN treats video as unified spatial-temporal data rather than “images + time”