Text Extraction and Boilerplate Removal
Converting raw HTML from web crawls into clean, main-content text is a lossy signal-recovery problem, and the choices made here propagate irreversibly through every downstream filtering and training stage.
Roughly 40-50% of the tokens in a naively extracted Common Crawl dump are navigation menus, cookie banners, footer links, boilerplate legal text, and advertisement copy. A model trained on that noise does not learn to generate prose; it learns to recite the structural debris of the web. Text extraction is where you decide which 50% to keep.
The pipeline before extraction: WARC and WET files
Common Crawl stores its petabyte-scale archive in three file types. WARC files hold raw HTTP responses including headers and full HTML. WAT files hold pre-computed JSON metadata (links, HTTP status codes). WET files hold plaintext that Common Crawl extracted in-house using a simple heuristic stripper.
Most serious pretraining pipelines start from WARC, not WET. The reason is quality control: Common Crawl's WET extractor is optimised for speed and coverage, not precision, so it retains large amounts of navigation and sidebar text. Pipelines such as FineWeb (15 trillion tokens from 96 CC snapshots, released by Hugging Face in 2024) re-extract from WARC to apply a higher-quality parser before any filtering step begins.
The processing loop at minimum looks like this:
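# Conceptual pseudocode - read_warc and extract_main_content are placeholders, not real library calls
def iter_documents(path):
    for warc_record in read_warc(path):
        # Only response records carry page bodies; skip requests and metadata
        if warc_record.type != "response":
            continue
        # Skip non-HTML payloads; the header may be missing entirely
        if "text/html" not in (warc_record.content_type or ""):
            continue
        raw_html = warc_record.content
        text = extract_main_content(raw_html, url=warc_record.target_uri)
        if text:  # empty or None means no main content was found
            yield {"url": warc_record.target_uri, "text": text}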
The bottleneck is extract_main_content. Everything else is I/O.
How main-content extraction actually works
The core challenge is that HTML mixes content and presentation. A <div> containing a 1200-word article and a <div> containing a navigation bar are syntactically identical. Three broad strategies exist.
Heuristic density scoring. The classic approach, embodied in tools like Boilerpipe (2010), scores each block of text by the ratio of text characters to HTML tag characters. Navigation links are mostly tags with short text; body paragraphs are mostly text with sparse tags. Blocks exceeding a density threshold are kept; others are discarded. This works surprisingly well on news-style pages but struggles with modern JavaScript-rendered layouts where the DOM is sparse.
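A minimal sketch of block-level density scoring, using only the standard library. It uses word count and link density per block as stand-ins for the text-to-markup ratio described above. The tag sets, the thresholds (`min_words`, `max_link_density`), and the names `BlockCollector` and `extract_dense_blocks` are illustrative assumptions, not Boilerpipe's actual features or values.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "section", "article", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters seen inside <a> for this block
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of <script>/<style>

    def flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()       # a block-level tag starts a new block
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_dense_blocks(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        link_density = min(1.0, link_chars / len(text))
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)


page = """
<div><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact</a></div>
<p>Text extraction decides which parts of a crawled page are worth keeping, and the
choice affects every later filtering stage in the pipeline.</p>
<div><a href="/privacy">Privacy policy</a></div>
"""
main_text = extract_dense_blocks(page)
```

On the sample page the two link-only blocks fail both tests, while the paragraph passes. A sparse, script-populated DOM leaves very few blocks to score, which is the failure mode noted above.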
Tree-based scoring. Tools like Trafilatura (the extractor FineWeb ran through Hugging Face's DataTrove library) build a content tree from the DOM, assign weights based on element type and position, and then prune low-weight subtrees. Article-type elements (<article>, <main>), paragraphs (<p>), and headings score high; <nav>, <footer>, and <aside> score low. Where semantic HTML is used correctly, this is very reliable.
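The pruning idea can be sketched with the standard library alone. This is a deliberately simplified illustration, not Trafilatura's implementation: weights are reduced to a binary keep/prune decision, and the tag sets and the names `PruningExtractor` and `prune_and_extract` are assumptions made for the example.

```python
from html.parser import HTMLParser

LOW_WEIGHT = {"nav", "footer", "aside", "script", "style"}   # pruned subtrees
CONTENT_TAGS = {"p", "h1", "h2", "h3", "h4", "li", "blockquote", "pre"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}      # never closed


class PruningExtractor(HTMLParser):
    """Collect text from content elements that have no low-weight ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, outermost first
        self.chunks = []       # extracted text, one entry per content element
        self._buf = None       # text buffer of the open content element
        self._buf_depth = 0    # stack depth at which that element was opened

    def _pruned(self):
        return any(tag in LOW_WEIGHT for tag in self.stack)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in CONTENT_TAGS and self._buf is None and not self._pruned():
            self._buf, self._buf_depth = [], len(self.stack)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return             # stray closing tag
        while self.stack.pop() != tag:
            pass               # tolerate unclosed children
        if self._buf is not None and len(self.stack) < self._buf_depth:
            text = " ".join("".join(self._buf).split())
            if text:
                self.chunks.append(text)
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None and not self._pruned():
            self._buf.append(data)


def prune_and_extract(html):
    parser = PruningExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """<html><body>
<nav><ul><li><a href="/">Home</a></li><li><a href="/blog">Blog</a></li></ul></nav>
<main><article><h1>Attention</h1><p>Attention lets a model <b>weight</b> its inputs.</p></article></main>
<aside><p>Related posts</p></aside>
<footer><p>Copyright notice</p></footer>
</body></html>"""
main_text = prune_and_extract(page)
```

The sketch depends entirely on semantic tags being present. A page that wraps its navigation in a plain <div> gives it nothing to prune, which is why real extractors combine element type with position and text statistics.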
Readability-style algorithms. Mozilla's Readability (also the basis for Firefox Reader Mode) converts the DOM to a scored candidate set, picks the highest-scoring container, and recursively strips it of non-content children. It was designed for human readability rather than corpus extraction, so it tends toward conservative recall - it would rather omit a sentence than include a nav link.
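A compact sketch of candidate scoring, loosely modelled on the original Readability heuristic: each paragraph earns points for length and commas, and passes its score to its parent (in full) and grandparent (at half weight). The constants and the names `Node`, `TreeBuilder`, and `readability_extract` are assumptions for illustration. The recursive stripping of non-content children is omitted, so this is not Mozilla's implementation.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.text = []      # text pieces, collected for <p> nodes only
        self.score = 0.0


class TreeBuilder(HTMLParser):
    """Build a minimal parent-linked tree and remember every <p> node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.cur = Node(tag, self.cur)
        if tag == "p":
            self.paragraphs.append(self.cur)

    def handle_endtag(self, tag):
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent          # tolerate unclosed children
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        node = self.cur
        while node is not self.root and node.tag != "p":
            node = node.parent
        if node.tag == "p":             # text belongs to the enclosing <p>
            node.text.append(data)


def _text(p):
    return " ".join("".join(p.text).split())


def _inside(node, ancestor):
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False


def readability_extract(html, min_chars=25):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = {}
    for p in builder.paragraphs:
        text = _text(p)
        if len(text) < min_chars:
            continue                    # too short to count as evidence
        score = 1 + text.count(",") + min(len(text) // 100, 3)
        for node, weight in ((p.parent, 1.0), (p.parent.parent, 0.5)):
            if node is not None:
                node.score += score * weight
                candidates[id(node)] = node
    if not candidates:
        return None
    best = max(candidates.values(), key=lambda n: n.score)
    kept = [_text(p) for p in builder.paragraphs if _inside(p, best)]
    return "\n".join(t for t in kept if t)


page = """<body>
<div id="sidebar"><p>Subscribe to our newsletter for weekly updates.</p></div>
<div id="post">
<p>Extraction is lossy, irreversible, and early, so mistakes made here reach every later stage.</p>
<p>Readability-style scoring rewards containers that hold long, comma-rich paragraphs.</p>
</div>
</body>"""
main_text = readability_extract(page)
```

Because only the single best container is kept, content split across sibling containers is lost. That is one source of the conservative recall described above.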
A comparison of their trade-offs:
| Approach | Precision | Recall | Handles JS-rendered? | Speed |
|---|---|---|---|---|
| Heuristic density (Boilerpipe) | Medium | High | No | Very fast |
| Tree-based scoring (Trafilatura) | High | Medium-high | Partially | Fast |
| Readability (mozilla/readability) | Very high | Medium | No | Medium |
For pretraining corpora the usual preference is Trafilatura or a similar tree-based approach: high enough precision to avoid feeding noise into downstream filters, high enough recall that you do not throw away millions of valid pages.
In code, the call is straightforward:
import trafilatura

# fetch_html is a placeholder for your own downloader or WARC reader
html = fetch_html(url)  # bytes or str
text = trafilatura.extract(
    html,
    favor_precision=True,    # fewer false positives
    include_tables=False,    # tables add structure noise
    include_comments=False,  # comment sections are low quality
)
# text is None if the page appears to lack extractable content
Setting favor_precision=True is common for pretraining use cases where you have billions of pages and can afford to discard ambiguous ones.
Language and encoding issues
Before extraction even begins, character encoding must be resolved. The origin server declares an encoding in the HTTP Content-Type header (preserved in the WARC record), but the HTML <meta charset> tag sometimes disagrees, and either declaration can contradict a byte-order mark or the actual bytes of the page. Most parsers (lxml, html5lib) attempt to resolve the conflict, but silent misdetections produce garbled tokens that can survive all subsequent filters.
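One way to make the resolution order explicit is sketched below, using only the standard library. The precedence (byte-order mark, then HTTP header, then <meta> tag, then a UTF-8 fallback) roughly follows the HTML standard's sniffing order. The function name `decode_html`, the 4096-byte prescan window, and the regular expressions are assumptions for illustration, and UTF-32 is not handled.

```python
import codecs
import re
from typing import Optional, Tuple

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)
HEADER_RE = re.compile(r'charset\s*=\s*"?([\w-]+)', re.IGNORECASE)


def _known(name: str) -> bool:
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def decode_html(raw: bytes, content_type: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used) for a raw HTML payload."""
    # 1. A byte-order mark wins over any declaration.
    for bom, enc in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(enc, errors="replace"), enc
    # 2. Collect declared encodings: HTTP header first, then <meta>.
    declared = []
    if content_type:
        match = HEADER_RE.search(content_type)
        if match:
            declared.append(match.group(1).lower())
    match = META_RE.search(raw[:4096])
    if match:
        declared.append(match.group(1).decode("ascii").lower())
    # 3. Accept the first declaration the bytes do not contradict.
    for enc in declared:
        if not _known(enc):
            continue
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # 4. Fall back to UTF-8, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"


raw_page = "<html><head><meta charset='utf-8'></head><body>café</body></html>".encode("utf-8")
text, used = decode_html(raw_page, "text/html; charset=us-ascii")
```

Strict decoding only catches declarations the bytes contradict. Single-byte encodings such as Latin-1 accept any byte sequence, so a wrong Latin-1 declaration passes silently, which is the kind of misdetection described above.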
After extraction, language identification is typically the next step. FastText's language identification model (lid.176.bin) can classify a document in under a millisecond. Filtering to a target language at this stage, rather than later in the pipeline, reduces the volume of text that needs expensive downstream processing by an order of magnitude for non-English pipelines.
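A sketch of the language filter, written so that the classifier is passed in as a function. The callable is expected to behave like fastText's `model.predict` (returning labels such as `__label__en` plus probabilities). The name `keep_if_language`, the 0.65 confidence threshold, and the `fake_predict` stand-in are assumptions for illustration.

```python
def keep_if_language(text, predict, target="en", min_confidence=0.65):
    """Return True if the top predicted language is `target` with enough confidence.

    `predict(text, k=1)` must return (labels, probabilities) in the style of
    fastText's model.predict.
    """
    # fastText's predict works on a single line, so collapse all whitespace.
    flat = " ".join(text.split())
    if not flat:
        return False
    labels, probs = predict(flat, k=1)
    lang = labels[0].replace("__label__", "")
    return lang == target and float(probs[0]) >= min_confidence


def fake_predict(text, k=1):
    """Stand-in classifier for the example; not a real language model."""
    if " the " in f" {text} ":
        return ("__label__en",), [0.93]
    return ("__label__de",), [0.88]


english_kept = keep_if_language("the cat sat\non the mat", fake_predict)
german_kept = keep_if_language("der Hund schläft", fake_predict, target="de")

# With the real model (assumes the fasttext package and a local lid.176.bin):
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   keep_if_language(doc_text, model.predict, target="de")
```

Passing the predictor in as an argument keeps the filter testable without the model file and makes it easy to swap classifiers later.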
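Unicode normalisation (NFC vs NFKC) also matters. NFC only composes canonically equivalent sequences (a base letter followed by a combining accent becomes one precomposed character). NFKC additionally folds compatibility characters: the full-width Latin letters common in East Asian web text, typographic ligatures, vulgar fractions, and circled numbers all map to their compatibility equivalents. This reduces spurious token variety and makes downstream exact-duplicate detection more effective, at the cost of erasing some distinctions (superscript digits become plain digits, for example).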
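The difference between the two forms is easy to check with Python's standard `unicodedata` module. The sample strings and the helper name `normalise` below are chosen for illustration.

```python
import unicodedata

samples = {
    "full-width Latin": "ＡＢＣ",
    "circled digit": "①",
    "fi ligature": "ﬁ",
    "vulgar fraction": "½",
    "superscript": "x²",
    "combining accent": "e\u0301",   # 'e' followed by a combining acute accent
}

for name, s in samples.items():
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    print(f"{name:18} NFC={nfc!r:10} NFKC={nfkc!r}")


def normalise(text: str) -> str:
    """Apply NFKC, the more aggressive form discussed above."""
    return unicodedata.normalize("NFKC", text)
```

NFC changes only the last sample (composing it into a single "é"), while NFKC also rewrites the other five. The superscript case shows the lossy side: "x²" becomes "x2", which may matter for mathematical or scientific text.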
Metadata preservation
Extraction tools typically recover structured metadata alongside the main text: URL, publication date, language, author, title. This metadata is worth preserving in the intermediate corpus even if it never enters the model's training text directly. It enables post-hoc filtering (removing content published before a certain date, filtering by domain allow-lists, excluding press-release heavy domains), and it supports decontamination (checking whether a page's URL was included in any benchmark's source data).
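As a sketch of what preserved metadata makes possible, the function below applies a date cut-off, a domain block-list, and a URL-based decontamination check to one record. The field names follow the record layout used in this article. The function name, the block-list contents, and the benchmark URL set are hypothetical.

```python
from datetime import date
from urllib.parse import urlsplit


def passes_metadata_filters(record, min_date, blocked_domains, benchmark_urls):
    """Return True if the record survives all three metadata checks."""
    host = (urlsplit(record["url"]).hostname or "").lower()
    # Domain filter: match the domain itself and any subdomain.
    if any(host == d or host.endswith("." + d) for d in blocked_domains):
        return False
    # Decontamination: drop pages known to be benchmark sources.
    if record["url"] in benchmark_urls:
        return False
    # Date filter: records without a date are kept rather than guessed at.
    published = record.get("date")
    if published and date.fromisoformat(published) < min_date:
        return False
    return True


record = {
    "url": "https://news.example.com/article/123",
    "date": "2023-11-14",
    "text": "Attention mechanisms allow models to weight ...",
}
kept = passes_metadata_filters(
    record,
    min_date=date(2020, 1, 1),
    blocked_domains={"prwire.example"},   # hypothetical press-release domain
    benchmark_urls=set(),
)
```

An allow-list works the same way with the domain test inverted. None of these checks is possible after the fact if the extraction stage discards the URL and date.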
A typical intermediate record after extraction:
{
  "url": "https://example.com/article/123",
  "date": "2023-11-14",
  "title": "Understanding Attention Mechanisms",
  "language": "en",
  "text": "Attention mechanisms allow models to weight ...",
  "token_count": 412
}
Storing token count at this stage (a cheap BPE pass over the extracted text) lets you apply length-based filters downstream without re-tokenising.
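A small sketch of that pattern: count once at extraction time, then filter on the stored field. The whitespace split stands in for a real BPE tokenizer, and the function names and thresholds are assumptions for illustration.

```python
def add_token_count(record, tokenize=str.split):
    """Attach a token count; `tokenize` is a stand-in for a real BPE tokenizer."""
    return {**record, "token_count": len(tokenize(record["text"]))}


def within_length(record, min_tokens=5, max_tokens=100_000):
    """Length filter that reads the stored count instead of re-tokenising."""
    return min_tokens <= record["token_count"] <= max_tokens


record = add_token_count({
    "url": "https://example.com/article/123",
    "text": "Attention mechanisms allow models to weight their inputs",
})
keep = within_length(record)
```

The stored count is only valid for the tokenizer that produced it, so it is worth recording which tokenizer was used if the corpus may later serve a model with a different vocabulary.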
When it falls down
JavaScript-rendered pages. Most extraction tools operate on the raw HTML byte stream, not the rendered DOM. Single-page applications that populate content via JavaScript will appear as near-empty HTML to any static extractor. For pretraining pipelines this is generally acceptable - you simply lose those pages. For specialised corpora (e.g., technical documentation sites built with modern JS frameworks) it can be a significant coverage gap.
Template-heavy pages. E-commerce product listings, job boards, and forum threads are structurally content, but their text is highly templated ("Free shipping on orders over $50. SKU: B0027XQ."). Tree-based extractors keep this text because it appears in <p> tags inside <main> elements. It passes extraction cleanly and requires content-quality filters (perplexity, n-gram overlap, repetition ratio) to catch downstream.
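Two simple repetition signals of the kind such downstream filters rely on can be sketched as follows. The function names are illustrative, and any threshold applied to these values would need tuning on real data.

```python
from collections import Counter


def duplicate_line_fraction(text):
    """Fraction of non-empty lines whose content occurs more than once."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    return sum(c for c in counts.values() if c > 1) / len(lines)


def top_ngram_fraction(text, n=3):
    """Approximate share of words covered by the most frequent repeated n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    _, count = grams.most_common(1)[0]
    if count < 2:
        return 0.0          # nothing repeats
    return min(1.0, count * n / len(words))


templated = "\n".join([
    "Free shipping on orders over $50.", "SKU: A1",
    "Free shipping on orders over $50.", "SKU: A2",
    "Free shipping on orders over $50.", "SKU: A3",
])
prose = "Extraction is lossy.\nEvery later stage inherits its mistakes."

templated_dup = duplicate_line_fraction(templated)
prose_dup = duplicate_line_fraction(prose)
```

These signals catch repetition within a single document. A product page whose template text appears only once per page needs cross-document statistics or a model-based quality score instead.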
Precision-recall tension at scale. A small false-positive rate in extraction becomes a large absolute count when applied to billions of pages. If 1% of extracted documents contain significant boilerplate and the corpus is 10 billion documents, that is 100 million noisy training examples. The interaction with deduplication is particularly subtle: boilerplate footers shared across thousands of pages can be removed by line-level or substring-level deduplication (document-level deduplication leaves them in place, because the surrounding pages differ), but unique-but-low-quality templated content will not be caught by either.
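The deduplication interaction can be illustrated with a toy line-level pass: lines that occur in many documents are dropped, while lines unique to one document stay regardless of quality. The function name `strip_shared_lines` and the `min_docs` threshold are assumptions, and a production pipeline would use hashing rather than an in-memory counter.

```python
from collections import Counter


def strip_shared_lines(docs, min_docs=3):
    """Remove lines that appear in at least `min_docs` different documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each distinct line once per document.
        doc_freq.update({line.strip() for line in doc.splitlines() if line.strip()})
    cleaned = []
    for doc in docs:
        kept = [
            line for line in doc.splitlines()
            if line.strip() and doc_freq[line.strip()] < min_docs
        ]
        cleaned.append("\n".join(kept))
    return cleaned


footer = "© Example Corp. All rights reserved."
docs = [
    "How attention works in transformers.\n" + footer,
    "SKU: B0027XQ. Free shipping on orders over $50.\n" + footer,
    "A guide to WARC files.\n" + footer,
]
cleaned = strip_shared_lines(docs)
```

The shared footer disappears from all three documents, but the templated SKU line survives because it occurs only once in the corpus.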
Encoding misdetection compounding. A document whose encoding was misdetected during extraction produces garbled text that is unlikely to match any deduplication fingerprint, will confuse language ID, and will survive into the final corpus as corrupt noise. This is hard to catch because the corruption is per-character and not statistically obvious.
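Two cheap checks catch a subset of these cases, sketched below. The first measures the share of U+FFFD replacement characters. The second tests whether the text looks like UTF-8 bytes that were decoded as Latin-1, one common misdetection. Both function names are illustrative, and many other corruption patterns pass through undetected.

```python
def replacement_ratio(text):
    """Share of characters that are U+FFFD, the decoder's 'unknown byte' marker."""
    return text.count("\ufffd") / max(len(text), 1)


def repair_utf8_as_latin1(text):
    """Return the repaired string if `text` looks like UTF-8 read as Latin-1, else None."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None          # not representable this way, so not this failure mode
    return repaired if repaired != text else None


# Simulate the misdetection: UTF-8 bytes decoded with the wrong codec.
garbled = "café naïve".encode("utf-8").decode("latin-1")
fixed = repair_utf8_as_latin1(garbled)
```

Correctly decoded text such as "café" returns None, because its Latin-1 bytes are not valid UTF-8, and plain ASCII returns None because the round trip changes nothing.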
Domain-specific content degradation. PDF-converted pages, scanned-document OCR outputs, and academic preprints served as HTML tend to have unusual DOM structures. Extractors tuned on general news sites may perform poorly on these, either keeping spurious structure (running headers and footers from the PDF rendering layer) or failing to extract anything.
Further reading
- Barbaresi, A. (2021), "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction." https://trafilatura.readthedocs.io/en/latest/
- Penedo, G. et al. (2024), "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." https://arxiv.org/abs/2406.17557
- Common Crawl Foundation, "Get Started with Common Crawl Data (WARC/WET/WAT formats)." https://commoncrawl.org/the-data/get-started/
- Gao, L. et al. (2021), "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." https://arxiv.org/abs/2101.00027