Deep Learning
Convolutional Neural Networks
Why weight sharing and local receptive fields make CNNs the right inductive bias for images, and where ViTs took over.
A fully connected layer over a 224x224 RGB image needs about 150k weights per output unit (224 x 224 x 3 = 150,528) and has no notion that nearby pixels are related. CNNs replace it with a small filter slid across the image: the same weights see every position, and a translation of the input produces a matching translation of the output. That single architectural choice is what made deep learning for vision tractable.
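The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch: the function names are illustrative, biases are ignored, and the 3x3 filter size is an assumption chosen as a typical small kernel.

```python
def dense_weights_per_unit(height, width, channels):
    """Weights feeding one output unit of a fully connected layer (bias excluded)."""
    return height * width * channels


def conv_weights_per_filter(kernel, channels):
    """Weights in one kernel x kernel filter spanning all input channels (bias excluded)."""
    return kernel * kernel * channels


dense = dense_weights_per_unit(224, 224, 3)
conv = conv_weights_per_filter(3, 3)

print(dense)          # 150528
print(conv)           # 27
print(dense // conv)  # 5575
```

The dense unit needs thousands of times more weights than the filter, and the filter's count does not change if the image gets larger.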
Convolution as weight sharing
A 2D conv layer with a k x k kernel and C_in -> C_out channels has k * k * C_in * C_out weights regardless of input size. The same kernel is convolved at every spatial location, which gives two properties for free:
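- Translation equivariance. Shift the input by (dx, dy) and the output shifts by (dx, dy), assuming stride 1 and ignoring border effects.
- Parameter efficiency. A 3x3 kernel mapping 64 -> 64 channels has about 37k weights (3 * 3 * 64 * 64 = 36,864), versus about 200k weights for every single output unit of a dense layer on a 56x56x64 feature map (roughly 4 x 10^10 to produce an output of the same size).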
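Both properties can be checked with a small pure-Python sketch. The helper names are illustrative, and the wrap-around (circular) border is an assumption made so that equivariance holds exactly; with the zero padding used in most real conv layers it holds only away from the borders.

```python
def conv_weight_count(k, c_in, c_out):
    """Weights in a k x k conv layer with c_in -> c_out channels (bias excluded)."""
    return k * k * c_in * c_out


def correlate2d_circular(image, kernel):
    """Slide `kernel` over `image` with wrap-around borders; output has the input's size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                kernel[i][j] * image[(y + i) % h][(x + j) % w]
                for i in range(kh)
                for j in range(kw)
            )
    return out


def shift(image, dy, dx):
    """Circularly shift a 2D list down by dy rows and right by dx columns."""
    h, w = len(image), len(image[0])
    return [[image[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


# Toy 5x5 single-channel image and a Sobel-like 3x3 kernel.
image = [[(3 * y + 5 * x) % 7 for x in range(5)] for y in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

shifted_then_conv = correlate2d_circular(shift(image, 1, 2), kernel)
conv_then_shifted = shift(correlate2d_circular(image, kernel), 1, 2)

print(shifted_then_conv == conv_then_shifted)  # True
print(conv_weight_count(3, 64, 64))            # 36864
```

Shifting before or after the filter gives the same result because the same nine weights are applied at every position.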
This is an inductive bias - a built-in assumption about what kind of function the network should prefer. The bias matches natural images well, which is why CNNs can learn useful features from thousands of labelled examples, while vision transformers trained from scratch typically need far more data to catch up.
Pooling and the receptive field
Each stride-2 max-pool or strided conv halves the spatial resolution and doubles the rate at which the receptive field grows in every later layer. After five such reductions a 224x224 input is down to a 7x7 feature map, and with a few convs per stage a single output unit "sees" most or all of the image. The receptive field at layer L grows as:
RF_L = RF_{L-1} + (k_L - 1) * prod(strides up to L-1)
The effective receptive field is usually smaller than the theoretical one - gradients concentrate around the centre of the receptive field, so the network behaves as if it sees a roughly Gaussian-weighted window.
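The recurrence above is easy to evaluate directly. The sketch below computes the theoretical receptive field layer by layer; the function name and the VGG-16-style layout (3x3 convs with stride 1, 2x2 pools with stride 2) are assumptions chosen for illustration.

```python
def receptive_fields(layers):
    """Return the theoretical receptive field after each layer.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    """
    rf = 1    # a raw input pixel sees only itself
    jump = 1  # product of the strides of all earlier layers
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# VGG-16-style stack: five stages of 2, 2, 3, 3, 3 convs, each followed by a pool.
vgg16_like = []
for convs in (2, 2, 3, 3, 3):
    vgg16_like += [(3, 1)] * convs + [(2, 2)]

sizes = receptive_fields(vgg16_like)
print(sizes[:3])  # [3, 5, 6]
print(sizes[-1])  # 212
```

After the fifth pool each unit's theoretical window is 212 pixels wide, close to the full 224-pixel input. With only one conv per stage the same five reductions would reach 94, so the number of convs between reductions matters as much as the reductions themselves.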
The residual connection
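In plain networks, stacking more layers stopped helping past roughly 20 layers. The main problem was not vanishing gradients (normalised initialisation and BatchNorm had largely addressed those) but optimisation: very deep plain nets had higher training error than shallower ones, so overfitting was not the cause either. A deeper net could in principle copy the shallower one and set its extra layers to the identity, but in practice identity mappings were hard to learn.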
ResNet (He et al, 2015) added a skip connection around each small stack of conv layers (two in ResNet-18/34, three in the bottleneck blocks of ResNet-50 and deeper):
y = F(x, W) + x
The block now has to learn the residual F(x) = y - x rather than the full mapping. Zero is a sensible default, so layers that should be near-identity learn small F. With this fix 152-layer nets train cleanly on ImageNet, and 1001-layer variants (using the pre-activation design from the 2016 follow-up paper) still train on CIFAR. Nearly every modern deep architecture - transformers included - inherits residual connections from this paper.
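The "zero is a sensible default" point can be shown with a toy block. This is a sketch under simplifying assumptions, not the ResNet implementation: small dense layers stand in for the convs, and BatchNorm and the ReLU after the addition are omitted. All names are illustrative.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def plain_block(x, w1, w2):
    """Two stacked layers with no skip connection: y = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def residual_block(x, w1, w2):
    """Same two layers wrapped in a skip connection: y = F(x, W) + x."""
    fx = plain_block(x, w1, w2)
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero the residual block is exactly the identity,
# while the plain block throws the input away.
print(residual_block(x, zeros, zeros))  # [1.0, -2.0, 3.0]
print(plain_block(x, zeros, zeros))     # [0.0, 0.0, 0.0]
```

A residual block starts from "do nothing" and only has to learn a correction. A plain block has to learn to reproduce its input through a nonlinearity, which is the harder optimisation problem described above.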
A short timeline
| Year | Model | Why it mattered |
|---|---|---|
| 1989 | LeCun's convolutional net (precursor to LeNet) | Backprop through convolutions, handwritten zip-code recognition; LeNet-5 (1998) went on to read real cheques |
| 2012 | AlexNet | First CNN to win ImageNet, kicked off the deep learning decade |
| 2014 | VGG, GoogLeNet | Depth as the lever, inception modules |
| 2015 | ResNet | Skip connections, training nets 10x deeper |
| 2017 | DenseNet, ResNeXt | Architectural refinements, diminishing returns visible |
| 2020 | Vision Transformer | Pure attention matches CNNs given enough data |
What changed in 2020-2026
ViT showed that with ~300M images of pretraining (the JFT-300M dataset), a pure transformer matches or beats the best CNNs on ImageNet. The CNN inductive bias is a prior - useful when data is scarce, dead weight when data is abundant. Frontier vision-language models (CLIP, SigLIP, Gemini vision) are predominantly transformer-based.
CNNs did not die. They moved to where data and compute are tight:
- Edge inference. MobileNet, EfficientNet, and their successors remain strong choices for phone-scale image classification because depthwise-separable convs need up to about 9x fewer multiply-adds than standard 3x3 convs, which keeps them cheap next to attention on constrained hardware at low resolutions.
- Dense prediction. Segmentation and detection backbones often stay convolutional (ConvNeXt v2) because the spatial structure is right there in the architecture.
- Hybrid stacks. Real production vision systems mix conv stems (cheap downsampling) with transformer trunks (global reasoning).
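The cost argument for depthwise-separable convs in the edge-inference point can be made concrete by counting multiply-accumulates (MACs). This is a back-of-the-envelope sketch: the function names are illustrative, the 56x56 map with 64 channels is an assumed example, and real latency also depends on memory traffic and hardware support.

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    """MACs for a standard k x k conv producing an h x w x c_out output."""
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h, w, k, c_in, c_out):
    """MACs for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * k * k * c_in  # one k x k filter per input channel
    pointwise = h * w * c_in * c_out  # 1x1 conv mixes information across channels
    return depthwise + pointwise


std = standard_conv_macs(56, 56, 3, 64, 64)
sep = depthwise_separable_macs(56, 56, 3, 64, 64)

print(std)                  # 115605504
print(sep)                  # 14651392
print(round(std / sep, 2))  # 7.89
```

The ratio works out to 1 / (1/c_out + 1/k^2), so for 3x3 kernels it approaches 9 as the number of output channels grows.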
Further reading
- Deep Residual Learning for Image Recognition - the ResNet paper, He et al 2015.
- ImageNet Classification with Deep Convolutional Neural Networks - AlexNet, Krizhevsky, Sutskever, Hinton 2012.
- An Image is Worth 16x16 Words - the Vision Transformer paper, Dosovitskiy et al 2020.
- Feature Visualization - Olah, Mordvintsev, Schubert; what conv layers actually learn to detect.