Overview

What is CommonLID?

CommonLID is a landmark evaluation dataset produced by a global community of native speakers and NLP researchers to benchmark language identification in real-world web text.

Language identification (LID), the task of determining which language a piece of text is written in, is a fundamental first step in building multilingual corpora. Yet today's LID models have largely been evaluated on clean, professionally translated text such as legal documents, religious texts, and news articles, not on the chaotic, informal, mixed-script content that fills the actual web.

CommonLID was created to fill that gap. Starting from raw web pages sampled from two recent Common Crawl snapshots (CC-MAIN-2024-22 and CC-MAIN-2025-05) and from the MADLAD-400 corpus, the team recruited over 80 native-speaker annotators, primarily from the NLP community, to label web text line by line in their languages.

The result is a uniquely challenging benchmark that reveals a systematic gap between performance reported on existing datasets and what models actually achieve on web data. In nearly every tested language, models score lower on CommonLID than on prior benchmarks, meaning the field has been overestimating its own progress.

CommonLID was developed as the shared task of the Workshop on Multilingual Data Quality Signals (WMDQS) at COLM 2025, in partnership with Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins University.

109 Language Varieties

Spanning many previously under-served languages and dialects, including African, South-East Asian, and indigenous varieties. 78 languages have at least 100 annotated lines.

Human-Annotated by Native Speakers

Over 80 annotators, primarily NLP professionals, provided line-level labels, validated by an expert researcher familiar with diverse writing systems.

Real Web Data

Sourced directly from Common Crawl, capturing the informal, noisy, and mixed-script text found in practice, not idealized or professionally translated samples.

Comprehensive Model Comparison

Eight widely used LID systems evaluated across six datasets, providing the most complete state-of-the-art comparison currently available.

Why CommonLID

Current Benchmarks Overestimate Performance

Existing LID evaluation datasets consist largely of clean, formal text. CommonLID demonstrates that this creates a false sense of progress: models perform significantly worse on real web content.

Below are real examples from Common Crawl that existing LID systems misclassify. This is genuine web content: the messy, mixed, informal language that any real pipeline must handle.

Onyeakagbu, Adaobi. "See how all the 36 Nigerian states got their names". Pulse.ng. Retrieved 25 December 2021.

Correct: English

System said: Dagbani

Proprete: Confort: Accueil du proprietaire: Rapport qualite/prix: Randonneurs

Correct: French

System said: Maltese

Blog de titine807 - ~~~~~~~~~~ cOuCoU ToUs lE MoNdE BiEnVeNuE DaNs mOn tI BlOg ~~~~~~~~~~

Correct: French

System said: Non-Linguistic

'^7.00 7.01 ... 楊南郡、王素娥. 《玉山國家公園八通關越嶺古道西段調查研究報告》（中文（臺灣））.

Correct: Chinese

System said: Hebrew

These errors compound: mislabeled data contaminates multilingual corpora used to train large language models, causing under-resourced languages to be filtered out or merged with unrelated ones. CommonLID enables researchers to diagnose and correct these systematic failures.
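
One simple way to start diagnosing failures like these is to tally which correct language gets mistaken for which predicted one. The sketch below does that for the four examples above; the helper names `confusion_pairs` and `errors_by_gold` are illustrative and are not part of any CommonLID tooling.

```python
from collections import Counter

# The four (correct, predicted) label pairs from the examples above.
examples = [
    ("English", "Dagbani"),
    ("French", "Maltese"),
    ("French", "Non-Linguistic"),
    ("Chinese", "Hebrew"),
]


def confusion_pairs(pairs):
    """Count how often each correct label is mistaken for each predicted label."""
    return Counter((gold, pred) for gold, pred in pairs if gold != pred)


def errors_by_gold(pairs):
    """Count misclassified lines per correct language."""
    return Counter(gold for gold, pred in pairs if gold != pred)


confusions = confusion_pairs(examples)
per_language = errors_by_gold(examples)

# List the confusions, most frequent first.
for (gold, pred), count in confusions.most_common():
    print(f"{gold} -> {pred}: {count}")
```

On a full evaluation run, the most frequent pairs point to the languages a system systematically merges or drops.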

Dataset

Dataset Details

CommonLID is released for evaluation purposes under the Common Crawl Terms of Use. Detailed per-language line counts and mean line lengths are in Appendix A of the paper.

Property	Value
Language varieties	109
Languages with 100+ lines	78
Total annotated lines	350,000+
Format	`CSV`
Size category	100K-1M rows
License	Common Crawl Terms of Use
Intended use	Evaluation only
Hugging Face ID	`commoncrawl/CommonLID`
arXiv	`2601.18026`
Task	Text Classification (LID)
Compatible libraries	Datasets, pandas, Polars

Language Coverage

CommonLID covers 109 language varieties spanning multiple families and scripts. Many included languages have been historically under-served in NLP research, including African, South-East Asian, Caribbean Creole, and indigenous languages.

Niger-Congo · Austronesian · Sino-Tibetan · Afro-Asiatic · Indo-European · Dravidian · Creoles · Turkic

Full per-language statistics are in Table 4 of the paper. The dataset intentionally includes imbalanced language coverage, reflecting web data realities; the paper provides guidance on fair cross-model comparisons.

Evaluation Only

Do not use CommonLID to train LID models or other AI models, and do not re-host it in locations accessible to web crawlers, to preserve its integrity as a benchmark.

Data Sources

CC-MAIN-2024-22: Common Crawl web snapshot, May 2024. Raw, unfiltered web pages capturing real-world text diversity.
CC-MAIN-2025-05: Common Crawl web snapshot, January 2025. Additional data expanding language and domain coverage.
MADLAD-400: Multilingual And Document-Level Large Audited Dataset, derived from Common Crawl (Google / Allen AI).

Evaluation

Models Evaluated

The paper evaluates eight widely used LID systems across six datasets, reporting macro-averaged F1 scores over all languages and over the model-covered subset. A separate guide explains how to add new models to the CommonLID leaderboard.

AfroLID

Adebara et al., 2022 · Transformer · 517 languages · ~100M curated sentences

Transformer · Africa-focused

CLD2

Sites et al., 2013 · Naive Bayes · 158 languages · Web pages

Naive Bayes · Google

CLD3

LSTM-based · Web data · Google · Proprietary training data

LSTM · Google

fastText (NLLB)

NLLB Team, 2024 · FastText · 218 languages · Public datasets

FastText · Meta/NLLB

FUN-LangID

Caswell, 2024 · Common sub-strings · 1,634 languages

Sub-string · Wide coverage

GlotLID

FastText-based · Wide language coverage · Best performer on CommonLID

FastText · Top result

OpenLID-v2

Burchell et al., 2023 · Multilingual curated sources · Open source

Open source · Curated

pyFranc

Wormer, 2023 · Trigram-based · Lightweight, fast inference

Trigram · Lightweight

Key finding: CommonLID is a significantly more challenging benchmark than existing LID evaluation sets. Most models perform worse on CommonLID than on datasets like FLORES, UDHR, or Bible data, suggesting that reported F1 scores on those datasets overestimate real-world LID performance on web text. GlotLID achieved the highest macro-averaged F1. Evaluation is reported for the full dataset ("all") and for the covered subset ("cov."), with language counts in parentheses. Full results tables and raw scores are in the paper.
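
To make the "all" and "cov." figures concrete, here is a minimal sketch of macro-averaged F1 computed both ways on toy data. It assumes the covered subset keeps only the lines whose correct label is a language the model supports and recomputes F1 on that subset; the paper's exact procedure may differ, and the language codes and function names are illustrative. For reported results, the CommonLID Python package remains the reference.

```python
from collections import defaultdict


def per_language_f1(gold, pred):
    """Return {language: F1} for every language that appears in the gold labels."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the correct language was missed
            fp[p] += 1  # the predicted language was wrongly claimed
    scores = {}
    for lang in set(gold):
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores[lang] = 2 * tp[lang] / denom if denom else 0.0
    return scores


def macro_average(scores):
    """Unweighted mean of per-language F1, so every language counts equally."""
    return sum(scores.values()) / len(scores) if scores else 0.0


def evaluate(gold, pred, supported):
    """Return {"all": (macro F1, n languages), "cov.": (macro F1, n languages)}."""
    all_scores = per_language_f1(gold, pred)
    # Assumption: "covered" means the gold language is supported by the model.
    kept = [(g, p) for g, p in zip(gold, pred) if g in supported]
    cov_scores = per_language_f1([g for g, _ in kept], [p for _, p in kept])
    return {
        "all": (macro_average(all_scores), len(all_scores)),
        "cov.": (macro_average(cov_scores), len(cov_scores)),
    }


# Toy data: the hypothetical model supports French, English, and Maltese, not Dagbani.
gold = ["fra", "fra", "eng", "eng", "dag", "dag"]
pred = ["fra", "mlt", "eng", "eng", "eng", "eng"]
results = evaluate(gold, pred, supported={"fra", "eng", "mlt"})

for name, (f1, n_langs) in results.items():
    print(f"{name}: {f1:.3f} ({n_langs})")
```

In this toy run the unsupported language pulls the "all" score down in two ways: its own F1 is zero, and its lines count as false positives for the language the model falls back to. Reporting both figures with their language counts keeps that effect visible.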

How it was built

Dataset Construction

CommonLID was nearly two years in the making, combining a custom annotation platform, community hackathons, and expert validation.

Data Collection

Multilingual web text was sampled from two recent Common Crawl snapshots and from MADLAD-400, targeting a broad range of languages and web domains.

Platform Development

A custom annotation interface was built in partnership with MLCommons and Factored AI, allowing annotators to view and label individual lines of web text at scale.

Community Hackathons

Multiple hackathons were hosted with EleutherAI and language community organizations including Masakhane (African languages) and SEACrowd (South-East Asian languages).

Expert Validation

All annotations were reviewed by an expert NLP researcher familiar with diverse writing systems, ensuring quality across scripts and language families.

Curation and Release

The final dataset was curated and released on Hugging Face. Annotators who contributed 100 or more documents were invited to co-author the paper, resulting in 97 co-authors.

Model Evaluation

Eight popular LID models were evaluated across six benchmark datasets, with all analysis code and raw scores released to support reproducibility.

Limitations

Known Limitations

The authors are transparent about CommonLID's constraints and provide guidance on conducting fair evaluation within them.

Partial Language Coverage

CommonLID covers 109 of the world's roughly 7,000 languages. Many languages remain absent. The authors intend to expand coverage through continued community annotation.

Imbalanced Sample Sizes

Annotated line counts vary significantly across languages, from over a thousand to fewer than ten for some varieties. The paper provides guidance on fair cross-model comparisons.
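
As a rough illustration of working with this imbalance, the sketch below counts lines per language and separates languages at a minimum line count. The threshold of 100 echoes the "100+ lines" figure given for the dataset; using it as a cut-off for analysis is an assumption here, not a recommendation from the paper, and the labels are toy data.

```python
from collections import Counter

MIN_LINES = 100  # assumption: reuse the "100+ lines" figure as a cut-off


def split_by_support(labels, min_lines=MIN_LINES):
    """Split languages into those with at least min_lines lines and the rest."""
    counts = Counter(labels)
    well_supported = {lang: n for lang, n in counts.items() if n >= min_lines}
    sparse = {lang: n for lang, n in counts.items() if n < min_lines}
    return well_supported, sparse


# Toy label column standing in for the dataset's language labels.
labels = ["fra"] * 150 + ["dag"] * 8 + ["eng"] * 100
well_supported, sparse = split_by_support(labels)

print("enough lines:", well_supported)
print("treat with caution:", sparse)
```

Scores for languages in the sparse group rest on very few lines, so small differences between models there are unlikely to be meaningful.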

Potentially Harmful Content

Data is sourced from unfiltered web crawls and may contain offensive, harmful, or NSFW content. Researchers should be aware of this when building annotation interfaces.

Evaluation Only

CommonLID must not be used to train LID or other AI models. Using it for training would contaminate the benchmark. It must not be re-hosted in crawler-accessible locations.

Single Domain

All data comes from web crawl sources. Performance here should not be generalized to other domains such as social media, speech transcripts, or literary text.

Single-Label Annotation

Each line is assigned one language label, which does not capture code-switching within a line. Mixed-language lines may be ambiguously labeled.

How to Use

Using CommonLID

Access requires accepting the conditions on Hugging Face. Once approved, load the dataset with the standard Hugging Face Datasets library or with pandas/Polars.

# pip install datasets
from datasets import load_dataset

# Accept conditions on Hugging Face first
dataset = load_dataset(
    "commoncrawl/CommonLID",
    split="test"
)

print(dataset)
print(dataset[0])

# Or load with pandas
import pandas as pd

df = pd.read_csv(
    "hf://datasets/commoncrawl/CommonLID/data.csv"
)
print(df.head())
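
Once the rows are loaded, evaluation amounts to running a LID system over each line and pairing its output with the annotated label. The following is a minimal sketch of that loop using in-memory toy rows. The column names `text` and `label`, the language codes, and the `toy_predict` stand-in are all assumptions for illustration; check the dataset card for the actual schema and substitute a real model.

```python
def collect_predictions(rows, predict, text_key="text", label_key="label"):
    """Run `predict` on every row and return parallel gold and predicted label lists."""
    gold, pred = [], []
    for row in rows:
        gold.append(row[label_key])
        pred.append(predict(row[text_key]))
    return gold, pred


def toy_predict(text):
    """Stand-in for a real LID system: guesses French if a French article appears."""
    tokens = set(text.lower().split())
    return "fra" if tokens & {"le", "la", "les"} else "eng"


# Toy rows; the keys are hypothetical and may not match the real column names.
rows = [
    {"text": "Bienvenue dans le blog", "label": "fra"},
    {"text": "Welcome to the blog", "label": "eng"},
    {"text": "Accueil du proprietaire", "label": "fra"},
]

gold, pred = collect_predictions(rows, toy_predict)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy: {accuracy:.3f}")
```

Keeping the prediction step behind a plain callable makes it easy to swap in any LID system without changing the loop.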

Before You Start

CommonLID is a gated dataset. You must log in to Hugging Face and accept the usage conditions before downloading. This protects the dataset's integrity as an evaluation resource.

Evaluation Best Practices

The authors recommend:

Using the CommonLID Python package for standardized evaluation.
Reporting results separately for the full dataset (all 109 languages) and the covered subset (languages supported by your model).
Always reporting the count of covered languages alongside F1 scores to enable fair cross-model comparison.
Consulting Table 4 of the paper for per-language line counts before drawing conclusions about low-resource languages.
Using macro-averaged F1 as the primary metric, consistent with the paper.

Questions and Community

For questions about the dataset, open a discussion on the Hugging Face dataset page. For broader questions about Common Crawl's multilingual work, visit the Common Crawl Foundation.

Acknowledgments

Partners and Collaborators

CommonLID was a community endeavor spanning research institutions, open-source organizations, and language communities across the globe.

Common Crawl Foundation

Lead organization; web data and research infrastructure

MLCommons

Annotation platform co-development; shared task coordination

EleutherAI

Hackathon co-organization; open AI research community

Johns Hopkins University

Research partnership; Human Language Technologies CoE

Masakhane

African language community; hackathon annotation

SEACrowd

South-East Asian language community; annotation

Factored AI

Annotation platform co-development and engineering

97 Co-Authors

Native-speaker annotators from the global NLP community

Citation

Cite CommonLID

If you use CommonLID in your research, please cite the paper below. Citations help sustain the community work that made this dataset possible.

@article{ortiz-suarez-burchell-arnett-etal-2026-commonlid,
  title = {CommonLID: Re-evaluating State-of-the-Art Language Identification
    Performance on Web Data},
  author = {Pedro Ortiz Suarez and Laurie Burchell and Catherine Arnett and Rafael
    Mosquera-Gómez and Sara Hincapie-Monsalve and Thom Vaughan and Damian
    Stewart and Malte Ostendorff and Idris Abdulmumin and Vukosi Marivate and
    Shamsuddeen Hassan Muhammad and Atnafu Lambebo Tonja and Hend Al-Khalifa
    and Nadia Ghezaiel Hammouda and Verrah Otiende and Tack Hwa Wong and
    Jakhongir Saydaliev and Melika Nobakhtian and Muhammad Ravi Shulthan Habibi
    and Chalamalasetti Kranti and Carol Muchemi and Khang Nguyen and Faisal
    Muhammad Adam and Luis Frentzen Salim and Reem Alqifari and Cynthia Amol
    and Joseph Marvin Imperial and Ilker Kesen and Ahmad Mustafid and Pavel
    Stepachev and Leshem Choshen and David Anugraha and Hamada Nayel and Seid
    Muhie Yimam and Vallerie Alexandra Putra and My Chiffon Nguyen and Azmine
    Toushik Wasi and Gouthami Vadithya and Rob van der Goot and Lanwenn ar
    C'horr and Karan Dua and Andrew Yates and Mithil Bangera and Yeshil Bangera
    and Hitesh Laxmichand Patel and Shu Okabe and Fenal Ashokbhai Ilasariya and
    Dmitry Gaynullin and Genta Indra Winata and Yiyuan Li and Juan Pablo
    Martínez and Amit Agarwal and Ikhlasul Akmal Hanif and Raia Abu Ahmad and
    Esther Adenuga and Filbert Aurelian Tjiaranata and Weerayut Buaphet and
    Michael Anugraha and Sowmya Vajjala and Benjamin Rice and Azril Hafizi
    Amirudin and Jesujoba O. Alabi and Srikant Panda and Yassine Toughrai and
    Bruhan Kyomuhendo and Daniel Ruffinelli and Akshata A and Manuel Goulão and
    Ej Zhou and Ingrid Gabriela Franco Ramirez and Cristina Aggazzotti and
    Konstantin Dobler and Jun Kevin and Quentin Pagès and Nicholas Andrews and
    Nuhu Ibrahim and Mattes Ruckdeschel and Amr Keleg and Mike Zhang and Casper
    Muziri and Saron Samuel and Sotaro Takeshita and Kun Kerdthaisong and Luca
    Foppiano and Rasul Dent and Tommaso Green and Ahmad Mustapha Wali and
    Kamohelo Makaaka and Vicky Feliren and Inshirah Idris and Hande Celikkanat
    and Abdulhamid Abubakar and Jean Maillard and Benoît Sagot and Thibault
    Clérice and Kenton Murray and Sarah Luger},
  year = 2026,
  url = {https://arxiv.org/abs/2601.18026},
  eprint = {2601.18026},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL}
}


CommonLID Language Identification for the Real Web