Deep Learning

Convolutional Neural Networks

Why weight sharing and local receptive fields make CNNs the right inductive bias for images, and where ViTs took over.

A fully connected layer over a 224x224 RGB image needs about 150k weights per output unit (224 * 224 * 3 = 150,528) and has no notion that nearby pixels are related. CNNs replace it with a small filter slid across the image: the same weights see every position, so translating the input translates the output. That single architectural choice did much of the work of making deep learning for vision tractable.

Convolution as weight sharing

A 2D conv layer with a k x k kernel and C_in -> C_out channels has k * k * C_in * C_out weights regardless of input size. The same kernel is convolved at every spatial location, which gives two properties for free:

  • Translation equivariance. Shift the input by (dx, dy) and the output shifts by (dx, dy), up to border effects (and, with stride s, only for shifts that are multiples of s).
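  • Parameter efficiency. A 3x3 kernel mapping 64 -> 64 channels has about 37k weights (3 * 3 * 64 * 64 = 36,864). A dense layer reading a 56x56x64 feature map needs about 200k weights per output unit, or roughly 4 x 10^10 to produce an output of the same size.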
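Both properties can be checked with a few lines of plain Python. The sketch below is illustrative only: the function names are made up for this example, biases are ignored, and the equivariance check uses a 1D signal with wrap-around (circular) padding so that shifts are exact at the borders, which real conv layers with zero padding only approximate.

```python
def conv_weights(k, c_in, c_out):
    """Weights in a k x k conv layer (biases ignored). Independent of input size."""
    return k * k * c_in * c_out


def dense_weights(h, w, c_in, out_units):
    """Weights in a dense layer reading an h x w x c_in input (biases ignored)."""
    return h * w * c_in * out_units


def conv1d_circular(signal, kernel):
    """Slide one shared kernel over the signal, wrapping around at the ends."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, d):
    """Circularly shift a list to the right by d positions."""
    d %= len(signal)
    return signal[-d:] + signal[:-d]


# Parameter counts from the text.
per_unit_on_image = dense_weights(224, 224, 3, out_units=1)       # 150,528
conv_3x3_64_to_64 = conv_weights(3, 64, 64)                        # 36,864
dense_same_size = dense_weights(56, 56, 64, out_units=56 * 56 * 64)

# Equivariance: convolving a shifted input equals shifting the convolved output.
x = [1, 4, 2, 8, 5, 7, 3, 6]
kernel = [1, 0, -1]
shift_then_conv = conv1d_circular(shift(x, 2), kernel)
conv_then_shift = shift(conv1d_circular(x, kernel), 2)
```

Here `shift_then_conv` and `conv_then_shift` come out identical, and `conv_weights` never takes the image size as an argument, which is the whole point of weight sharing.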

This is an inductive bias - a built-in assumption about what kind of function the network should prefer. The bias matches natural images well, which is why CNNs can often be trained on far less data than vision transformers, which typically need large-scale pretraining to catch up.

Pooling and the receptive field

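Max-pooling or a stride-2 conv halves the spatial resolution and doubles the amount by which each subsequent kernel widens the receptive field. After five such reductions a 224x224 input is down to 7x7, and in a reasonably deep network a single output unit "sees" the whole image. Starting from RF_0 = 1, with k_L the kernel size and s_i the stride of layer i, the theoretical receptive field at layer L grows as:

RF_L = RF_{L-1} + (k_L - 1) * prod_{i=1}^{L-1} s_i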
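The recurrence is easy to evaluate layer by layer. The following is a minimal sketch, assuming each layer is described by a hypothetical `(kernel_size, stride)` pair and that the input pixel itself has a receptive field of 1; dilation and padding are ignored.

```python
def receptive_fields(layers):
    """Return the theoretical receptive field after each layer.

    layers: list of (kernel_size, stride) tuples, in order from the input.
    """
    rf = 1      # RF_0: a single input pixel
    jump = 1    # product of the strides of all earlier layers
    result = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        result.append(rf)
    return result


# Three 3x3 convs with stride 1: the receptive field grows by 2 per layer.
no_pool = receptive_fields([(3, 1), (3, 1), (3, 1)])

# Same convs, but with a 2x2 stride-2 pool after the first one:
# every later 3x3 conv now adds 4 pixels instead of 2.
with_pool = receptive_fields([(3, 1), (2, 2), (3, 1), (3, 1)])
```

Comparing `no_pool` with `with_pool` shows why downsampling matters: the layers after the pool widen the receptive field twice as fast.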

The effective receptive field is usually smaller than the theoretical one - an output unit's gradient with respect to the input concentrates around the centre of its receptive field, so the network behaves as if it sees a roughly Gaussian-weighted window.

The residual connection

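Stacking more layers stopped helping past ~20 layers. The problem was not vanishing gradients (normalised initialisation and BatchNorm had largely fixed those) but optimisation: very deep plain nets had higher training error than shallower ones. Even the identity mapping, which would let a deeper net simply match a shallower one, was hard for a stack of nonlinear layers to learn.

ResNet (He et al, 2015) added a skip connection around every block of two or three conv layers: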

y = F(x, W) + x

The block now has to learn the residual F(x) = y - x rather than the full mapping. Zero is a sensible default, so layers that should be near-identity learn small F. With this fix 152-layer nets train cleanly, and the follow-up pre-activation variant trains stably at 1001 layers. Nearly every modern deep architecture - transformers included - uses residual connections of the kind this paper popularised.
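The "zero is a sensible default" argument can be made concrete. The sketch below is a toy, not the ResNet block itself: it swaps the convolutions for small dense layers on a plain Python list and leaves out BatchNorm, and the function names are invented for this example. With all weights at zero, the residual block passes its input through unchanged, while the plain block without the skip wipes it out.

```python
def linear(x, weights):
    """Matrix-vector product: out[i] = sum_j weights[i][j] * x[j]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def plain_block(x, w1, w2):
    """Two stacked layers with no skip connection: y = F(x, W)."""
    return linear(relu(linear(x, w1)), w2)


def residual_block(x, w1, w2):
    """Same two layers plus the identity skip: y = F(x, W) + x."""
    f = plain_block(x, w1, w2)
    return [fi + xi for fi, xi in zip(f, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(x, zeros, zeros)        # the input is lost
residual_out = residual_block(x, zeros, zeros)  # the input survives as-is
```

A plain block would have to learn specific nonzero weights to approximate the identity; the residual block gets it by doing nothing, so extra depth starts out harmless.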

A short timeline

Year  Model                Why it mattered
1989  LeNet (LeCun et al)  Backprop through convolutions, handwritten zip-code digit recognition (cheque reading followed with LeNet-5 in 1998)
2012  AlexNet              First CNN to win ImageNet, kicked off the deep learning decade
2014  VGG, GoogLeNet       Depth as the lever, inception modules
2015  ResNet               Skip connections, training nets 10x deeper
2017  DenseNet, ResNeXt    Architectural refinements, diminishing returns visible
2020  Vision Transformer   Pure attention matches CNNs given enough data

What changed in 2020-2026

ViT showed that with ~300M images of pretraining, a pure transformer matches or beats the best CNNs on ImageNet. The CNN inductive bias is a prior - useful when data is scarce, dead weight when data is abundant. Frontier vision-language models (CLIP, SigLIP, Gemini vision) rely mainly on transformer image encoders.

CNNs did not die. They moved to where data and compute are tight:

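  • Edge inference. MobileNet, EfficientNet, and their successors remain common for phone-scale image classification: a depthwise-separable 3x3 conv needs roughly 8-9x fewer multiply-adds than a standard 3x3 conv, and its cost grows linearly with the number of pixels rather than quadratically, as global self-attention does.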
  • Dense prediction. Segmentation and detection backbones often stay convolutional (ConvNeXt v2) because the spatial structure is right there in the architecture.
  • Hybrid stacks. Real production vision systems mix conv stems (cheap downsampling) with transformer trunks (global reasoning).
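The saving from depthwise-separable convolutions can be estimated by counting multiply-adds. The sketch below is a back-of-the-envelope count, assuming stride 1 and "same" padding so the output keeps the input's spatial size; the function names are illustrative, and the comparison is against a standard convolution only, not against attention.

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """Multiply-adds for a standard k x k conv on an h x w feature map."""
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h, w, k, c_in, c_out):
    """Multiply-adds for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * k * k * c_in   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


standard = standard_conv_macs(56, 56, 3, 64, 64)
separable = depthwise_separable_macs(56, 56, 3, 64, 64)
saving = standard / separable  # analytically 1 / (1/c_out + 1/k**2)
```

For a 3x3 kernel the ratio approaches 9 as the number of output channels grows; at 64 channels it is a little under 8. Both counts scale linearly with `h * w`, so the ratio does not depend on resolution.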

Further reading