Protein Language Models

A protein is a string of amino acids drawn from an alphabet of twenty letters. That framing is the whole trick. If you squint, a protein sequence looks much like a sentence in a very strange language: a linear chain of discrete tokens whose meaning depends on long-range context. So the obvious question, once masked language modelling had worked for English, was whether the same recipe would work for proteins. Rives et al. answered it in 2021: train a BERT-style transformer on 250 million protein sequences, mask 15% of the residues, ask the model to fill them in, and the internal representations spontaneously encode secondary structure, tertiary contacts, and mutational effects that nobody trained the model to produce. The transformer that learns to predict masked words also learns biology.

The same objective, different data

There is almost nothing architecturally novel here, and that is the point. The ESM ("Evolutionary Scale Modelling") family is a standard transformer encoder trained with the masked-language-model objective. Take a sequence, corrupt a fraction of the tokens, predict the originals from bidirectional context:

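$$\mathcal{L}(\theta) = -\sum_{i \in M} \log p_\theta\left(x_i \mid x_{\text{masked}}\right)$$

Here M is the set of masked positions and x_masked is the corrupted sequence.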

The tokens are amino acids (plus a few special symbols for gaps, unknowns, and sequence boundaries) rather than word-pieces. The training corpus is UniRef, a clustered database of protein sequences observed across the tree of life. Everything else, the attention mechanism, the layer norm, the positional handling, is the machinery an LLM engineer already knows.
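
The objective above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the ESM training code: the `MASK` token id, the helper names, and the example sequence are assumptions made for the sketch, and a table of uniform logits stands in for the transformer encoder.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard residues
MASK = len(AMINO_ACIDS)                   # hypothetical id for a <mask> token
VOCAB_SIZE = len(AMINO_ACIDS) + 1
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Map an amino-acid string to integer token ids."""
    return np.array([TOKEN_ID[aa] for aa in seq], dtype=np.int64)

def mask_tokens(tokens, rng, mask_frac=0.15):
    """Replace a random fraction of positions with MASK.

    Returns the corrupted copy and the sorted indices that were masked.
    """
    n_mask = max(1, round(mask_frac * len(tokens)))
    positions = np.sort(rng.choice(len(tokens), size=n_mask, replace=False))
    corrupted = tokens.copy()
    corrupted[positions] = MASK
    return corrupted, positions

def mlm_loss(logits, targets, positions):
    """Summed negative log-likelihood of the original tokens at masked positions."""
    z = logits[positions]
    z = z - z.max(axis=-1, keepdims=True)            # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(positions)), targets[positions]].sum()

rng = np.random.default_rng(0)
tokens = encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")   # arbitrary example sequence
corrupted, positions = mask_tokens(tokens, rng)

# Stand-in for the encoder: uniform logits over the vocabulary at every position.
logits = np.zeros((len(tokens), VOCAB_SIZE))
loss = mlm_loss(logits, tokens, positions)
```

With uniform logits the loss is just the number of masked positions times log 21, the cost of guessing blindly; training consists of pushing it below that baseline. BERT-style recipes usually also leave some selected positions unchanged or swap in a random token instead of always masking, which this sketch omits.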

Why does this produce biology? Because evolution has already done a colossal amount of labelling for free. Residues that sit close together in the folded 3D structure co-evolve: a mutation on one side of a contact is compensated by a mutation on the other, or the protein breaks and the lineage dies. Across hundreds of millions of sequences those co-evolutionary statistics are overwhelming, and a model forced to predict a masked residue from its neighbours has every incentive to learn them. Masked-language modelling over evolutionary data is, in effect, implicit contact prediction.
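
To make "co-evolutionary statistics" concrete, here is a small sketch that measures mutual information between two columns of an alignment. It is classical covariation analysis on an explicit alignment, not what the language model computes internally, and the four-sequence alignment is hand-made toy data chosen so the answer is easy to check.

```python
from collections import Counter
from math import log2

def column_mutual_information(alignment, i, j):
    """Mutual information in bits between columns i and j.

    `alignment` is a list of equal-length strings, one per sequence.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    i_counts = Counter(seq[i] for seq in alignment)
    j_counts = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (i_counts[a] * j_counts[b]))
    return mi

# Toy alignment (hypothetical data). Columns 1 and 4 covary: K always
# pairs with E, and R always pairs with D. Column 3 varies independently
# of column 1.
toy_alignment = [
    "AKGLEV",
    "ARGLDV",
    "AKGIEV",
    "ARGIDV",
]

mi_coupled = column_mutual_information(toy_alignment, 1, 4)
mi_independent = column_mutual_information(toy_alignment, 1, 3)
```

The coupled pair carries one full bit of information about each other, the independent pair carries none. A masked-residue predictor that sees column 4 can exploit exactly this kind of dependency to guess column 1.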

What emerges in the representations

The striking result is how much structure falls out without any structural supervision:

Contact maps in the attention. Specific attention heads light up precisely on pairs of residues that are in physical contact in the folded protein. You can read an approximate contact map straight off the attention weights, and a small linear probe on the hidden states predicts contacts well.
Secondary structure and burial. Whether a residue sits in a helix, a strand, or a loop, and whether it is buried in the core or exposed on the surface, is linearly decodable from the embeddings.
Function and family. Sequences from the same protein family cluster in embedding space, so nearest-neighbour search over ESM embeddings is a usable remote-homology detector.
Variant effect prediction. The model's per-residue probabilities give a zero-shot score for how damaging a mutation is: if the wild-type amino acid is far more probable than the mutant under the model, the mutation is likely deleterious. This tracks deep-mutational-scanning experiments without the model ever seeing a fitness label.
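
The zero-shot variant score in the last item can be sketched as a log-odds ratio. The sign convention (mutant minus wild type, so negative means the model prefers the wild type) is one common choice and is an assumption here, as are the function name and the made-up probability table standing in for real model output.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def variant_score(log_probs, position, wild_type, mutant):
    """log p(mutant) - log p(wild type) at one position.

    `log_probs` has shape (sequence_length, 20), one row of log-probabilities
    per residue. Strongly negative scores suggest a deleterious substitution.
    """
    wt = AMINO_ACIDS.index(wild_type)
    mt = AMINO_ACIDS.index(mutant)
    return float(log_probs[position, mt] - log_probs[position, wt])

# Hypothetical model output for a 3-residue protein: positions 0 and 2 are
# uninformative (uniform), position 1 strongly prefers leucine.
probs = np.full((3, 20), 0.05)
probs[1] = 0.01
probs[1, AMINO_ACIDS.index("L")] = 0.81      # 0.81 + 19 * 0.01 = 1.0
log_probs = np.log(probs)

score_l_to_p = variant_score(log_probs, 1, "L", "P")   # at the constrained site
score_a_to_g = variant_score(log_probs, 0, "A", "G")   # at an unconstrained site
```

The substitution at the constrained site scores far below zero, while the one at the uniform site scores zero. In practice the probabilities would come from the language model, typically with the position of interest masked, and the scores would be compared against experimental data before being trusted.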

The scaling behaviour is the LLM story repeated. Rives et al. trained models up to roughly 650 million parameters; the follow-up ESM-2 pushed to 15 billion, and the quality of the emergent structural signal improved steadily with scale, much as loss-versus-parameters curves in language modelling would lead you to expect.

ESMFold: structure without the alignment search

The headline application is single-sequence structure prediction. AlphaFold2 (Jumper et al., 2021) brought structure prediction to near-experimental accuracy on many targets, but it leans hard on a multiple sequence alignment (MSA): before folding a target, it searches large databases for evolutionarily related sequences and feeds the alignment in as input. The MSA is where AlphaFold2 gets its co-evolutionary signal, and building it is slow, often the dominant cost of a prediction.

ESMFold (Lin et al., Science 2023) makes a different bet. The co-evolutionary statistics an MSA carries were already absorbed into the language model's weights during pretraining, so you can fold from a single sequence: run the sequence through ESM-2, take the representations, and pass them to a folding head that outputs 3D coordinates. No search, no alignment.

The trade is explicit and worth stating plainly:

Property	AlphaFold2	ESMFold
Co-evolution source	Explicit MSA, built per target at query time	Implicit, stored in ESM-2 weights
Accuracy on well-covered targets	Higher	Slightly lower
Speed	Slower (MSA search dominates)	Roughly an order of magnitude faster
Behaviour on orphan sequences	Degrades when few homologues exist	No search to fail, but signal may simply be absent

That speed is not a marginal convenience. It is what let the ESM team fold a metagenomic catalogue, predicting structures for over 600 million sequences from environmental samples where MSA-based folding at that scale would have been prohibitive. When you need coverage of hundreds of millions of proteins rather than the best possible answer on one, dropping the alignment search changes what is feasible.

When it falls down

The failure modes follow directly from what the model is and is not given.

Hard and orphan targets. On proteins with few or no detectable homologues, or on genuinely novel folds, single-sequence models lose accuracy relative to MSA-based methods. The co-evolutionary signal ESMFold relies on was learned from families that exist in the training data; for a sequence with no evolutionary neighbours there is less signal to have learned, and the confident-but-wrong prediction is the dangerous case. Always read the model's own per-residue confidence estimate (pLDDT) rather than taking a single predicted structure at face value.
Representation quality varies by family. Well-sampled families (kinases, immunoglobulins, common enzymes) are richly represented in UniRef and their embeddings are excellent. Rare, fast-evolving, intrinsically disordered, or under-sampled families get thinner representations, so downstream probes and variant scores are less reliable exactly where you most want help.
Sequence-only means context-blind. The model sees an amino-acid string and nothing else. It does not see bound ligands, ions, cofactors, post-translational modifications, pH, temperature, or binding partners; it predicts a single static conformation for molecules that are often flexible and multi-state in reality. Anything driven by dynamics, allostery, or experimental conditions sits outside what a sequence-only model can express.
Prediction is not measurement. A high-confidence predicted structure is a hypothesis, not a crystal structure. For anything consequential (drug design, mechanistic claims) the prediction narrows the search; experiment still adjudicates.
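
Reading the confidence estimate can be as simple as parsing the predicted PDB file. The sketch below assumes the predictor writes per-residue pLDDT on a 0-100 scale into the B-factor column, as AlphaFold2 and ESMFold PDB outputs conventionally do; check the scale for your tool. The cutoff of 70 is a commonly used rule of thumb rather than a hard boundary, and `toy_atom_line` is a hypothetical helper that only fabricates fixed-width test records.

```python
def per_residue_plddt(pdb_text):
    """Return {residue number: pLDDT} read from the B-factor of each CA atom.

    Assumes a single chain; residue numbers would collide across chains.
    """
    scores = {}
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16, residue number
        # in columns 23-26, B-factor in columns 61-66 (1-indexed).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores

def low_confidence_residues(scores, threshold=70.0):
    """Residue numbers whose pLDDT falls below the threshold."""
    return sorted(res for res, plddt in scores.items() if plddt < threshold)

def toy_atom_line(serial, res_seq, plddt):
    """Hypothetical helper: build one correctly aligned CA record for testing."""
    return "ATOM  %5d  CA  ALA A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, res_seq, 0.0, 0.0, 0.0, 1.0, plddt)

# Made-up confidence values for a three-residue toy structure.
toy_pdb = "\n".join(
    toy_atom_line(i, i, plddt) for i, plddt in enumerate([91.5, 88.0, 42.3], start=1)
)

scores = per_residue_plddt(toy_pdb)
flagged = low_confidence_residues(scores)
mean_plddt = sum(scores.values()) / len(scores)
```

A high mean can hide a badly predicted loop or domain, which is why the per-residue view matters more than the single summary number.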

The mental model to carry: a protein language model is an LLM that learned biology from evolution instead of learning language from the web, and the same intuitions transfer. Scale helps. Emergent capability is real. And the model is only ever as good as the statistics of its training distribution, which is why orphan proteins and dynamic behaviour remain the frontier.

Further reading