CommonLID
Language Identification for the Real Web
A community-driven, human-annotated benchmark for language identification in noisy, heterogeneous web data. Covering 109 language varieties, CommonLID reveals where current models fall short and sets a higher standard for multilingual evaluation.
Overview
What is CommonLID?
CommonLID is a landmark evaluation dataset produced by a global community of native speakers and NLP researchers to benchmark language identification in real-world web text.
Language identification (LID), the task of determining which language a piece of text is written in, is a fundamental first step in building multilingual corpora. Yet today's LID models have largely been evaluated on clean, professionally translated text such as legal documents, religious texts, and news articles, not on the chaotic, informal, mixed-script content that fills the actual web.
CommonLID was created to fill that gap. Starting from raw web pages sampled from two recent Common Crawl snapshots (CC-MAIN-2024-22 and CC-MAIN-2025-05) and from the MADLAD-400 corpus, the team recruited over 80 native-speaker annotators, primarily from the NLP community, to label web text line by line in their languages.
The result is a uniquely challenging benchmark that reveals a systematic gap between performance reported on existing datasets and what models actually achieve on web data. In nearly every tested language, models score lower on CommonLID than on prior benchmarks, meaning the field has been overestimating its own progress.
CommonLID was developed as the shared task of the Workshop on Multilingual Data Quality Signals (WMDQS) at COLM 2025, in partnership with the Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins University.
Why CommonLID
Current Benchmarks Overestimate Performance
Existing LID evaluation datasets consist largely of clean, formal text. CommonLID demonstrates that this creates a false sense of progress: models perform significantly worse on real web content.
Below are real examples from Common Crawl that existing LID systems misclassify. This is genuine web content: the messy, mixed, informal language that any real pipeline must handle.
Proprete: Confort: Accueil du proprietaire: Rapport qualite/prix: Randonneurs
Blog de titine807 - ~~~~~~~~~~ cOuCoU ToUs lE MoNdE BiEnVeNuE DaNs mOn tI BlOg ~~~~~~~~~~
'^7.00 7.01 ... 楊南郡、王素娥. 《玉山國家公園八通關越嶺古道西段調查研究報告》 (中文(臺灣)).
These errors compound: mislabeled data contaminates multilingual corpora used to train large language models, causing under-resourced languages to be filtered out or merged with unrelated ones. CommonLID enables researchers to diagnose and correct these systematic failures.
Dataset
Dataset Details
CommonLID is released for evaluation purposes under the Common Crawl Terms of Use. Detailed per-language line counts and mean line lengths are in Appendix A of the paper.
| Property | Value |
|---|---|
| Language varieties | 109 |
| Languages with 100+ lines | 78 |
| Total annotated lines | 350,000+ |
| Format | CSV |
| Size category | 100K-1M rows |
| License | Common Crawl Terms of Use |
| Intended use | Evaluation only |
| Hugging Face ID | commoncrawl/CommonLID |
| arXiv | 2601.18026 |
| Task | Text Classification (LID) |
| Compatible libraries | Datasets, pandas, Polars |
Language Coverage
CommonLID covers 109 language varieties spanning multiple families and scripts. Many included languages have been historically under-served in NLP research, including African, South-East Asian, Caribbean Creole, and indigenous languages.
Full per-language statistics are in Table 4 of the paper. The dataset intentionally includes imbalanced language coverage, reflecting web data realities; the paper provides guidance on fair cross-model comparisons.
Evaluation Only
Do not use CommonLID to train LID models or other AI models, and do not re-host it in locations accessible to web crawlers, to preserve its integrity as a benchmark.
Data Sources
- CC-MAIN-2024-22: Common Crawl web snapshot, May 2024. Raw, unfiltered web pages capturing real-world text diversity.
- CC-MAIN-2025-05: Common Crawl web snapshot, January 2025. Additional data expanding language and domain coverage.
- MADLAD-400: Multilingual And Document-Level Large Audited Dataset, derived from Common Crawl (Google / Allen AI).
Evaluation
Models Evaluated
The paper evaluates eight widely used LID systems across six datasets, reporting macro-averaged F1 scores over all languages and over the model-covered subset.
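For readers unfamiliar with the metric, the standard textbook definition of macro-averaged F1 is sketched below. The symbols are introduced here for illustration and are not taken from the paper: $\mathcal{L}$ is the set of languages being scored, and $P_\ell$ and $R_\ell$ are the precision and recall for language $\ell$. The paper's exact conventions, for example how predictions outside the label set are handled, may differ.

```latex
\[
F_1(\ell) = \frac{2\,P_\ell\,R_\ell}{P_\ell + R_\ell},
\qquad
\text{Macro-}F_1(\mathcal{L}) = \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}} F_1(\ell)
\]
```

Here $F_1(\ell)$ is taken to be 0 when $P_\ell + R_\ell = 0$. Scoring over the model-covered subset amounts to replacing $\mathcal{L}$ with the languages that the model supports. Because every language carries equal weight in the average, a language with a handful of lines influences the score as much as one with thousands.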
How it was built
Dataset Construction
CommonLID was nearly two years in the making, combining a custom annotation platform, community hackathons, and expert validation.
Multilingual web text was sampled from two recent Common Crawl snapshots and from MADLAD-400, targeting a broad range of languages and web domains.
A custom annotation interface was built in partnership with MLCommons and Factored AI, allowing annotators to view and label individual lines of web text at scale.
Multiple hackathons were hosted with EleutherAI and language community organizations including Masakhane (African languages) and SEACrowd (South-East Asian languages).
All annotations were reviewed by an expert NLP researcher familiar with diverse writing systems, ensuring quality across scripts and language families.
The final dataset was curated and released on Hugging Face. Annotators who contributed 100 or more documents were invited to co-author the paper, resulting in 97 co-authors.
Eight popular LID models were evaluated across six benchmark datasets, with all analysis code and raw scores released to support reproducibility.
Limitations
Known Limitations
The authors are transparent about CommonLID's constraints and provide guidance on conducting fair evaluation within them.
CommonLID covers 109 of the world's roughly 7,000 languages. Many languages remain absent. The authors intend to expand coverage through continued community annotation.
Annotated line counts vary significantly across languages, from over a thousand to fewer than ten for some varieties. The paper provides guidance on fair cross-model comparisons.
Data is sourced from unfiltered web crawls and may contain offensive, harmful, or NSFW content. Researchers should be aware of this when building annotation interfaces.
CommonLID must not be used to train LID or other AI models. Using it for training would contaminate the benchmark. It must not be re-hosted in crawler-accessible locations.
All data comes from web crawl sources. Performance here should not be generalized to other domains such as social media, speech transcripts, or literary text.
Each line is assigned one language label, which does not capture code-switching within a line. Mixed-language lines may be ambiguously labeled.
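The imbalance in annotated line counts noted above can be checked before interpreting per-language scores. The following is a minimal sketch that uses only the standard library and toy labels. The language codes, the `split_by_support` helper, and the cutoff of 100 lines are assumptions chosen for illustration; the dataset's actual label scheme and column names should be taken from its Hugging Face page.

```python
from collections import Counter


def split_by_support(labels, min_lines=100):
    """Split languages into those with at least `min_lines` lines and the rest.

    `labels` is any iterable of per-line language labels. The default cutoff
    of 100 is an illustrative assumption, not a recommendation from the paper.
    """
    counts = Counter(labels)
    well_supported = {lang: n for lang, n in counts.items() if n >= min_lines}
    sparse = {lang: n for lang, n in counts.items() if n < min_lines}
    return well_supported, sparse


# Toy labels standing in for the dataset's label column (hypothetical codes).
labels = ["fra"] * 150 + ["hau"] * 100 + ["yor"] * 7

well_supported, sparse = split_by_support(labels)
print("Well supported:", sorted(well_supported))
print("Sparse:", sparse)
```

Scores for languages that land in the sparse group rest on very few lines, so they are better read as indicative than as precise estimates.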
How to Use
Using CommonLID
Access requires accepting the conditions on Hugging Face. Once approved, load the dataset with the standard Hugging Face Datasets library or with pandas/Polars.
# pip install datasets
from datasets import load_dataset

# Accept conditions on Hugging Face first
dataset = load_dataset(
    "commoncrawl/CommonLID",
    split="test"
)
print(dataset)
print(dataset[0])

# Or load with pandas
import pandas as pd

df = pd.read_csv(
    "hf://datasets/commoncrawl/CommonLID/data.csv"
)
print(df.head())
Before You Start
CommonLID is a gated dataset. You must log in to Hugging Face and accept the usage conditions before downloading. This protects the dataset's integrity as an evaluation resource.
Evaluation Best Practices
The authors recommend:
- Reporting results separately for the full dataset (all 109 languages) and the covered subset (languages supported by your model).
- Always reporting the count of covered languages alongside F1 scores to enable fair cross-model comparison.
- Consulting Table 4 of the paper for per-language line counts before drawing conclusions about low-resource languages.
- Using macro-averaged F1 as the primary metric, consistent with the paper.
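The recommendations above can be sketched in a few lines of plain Python. This is an illustrative sketch and not the paper's evaluation code: the toy labels, the `covered` set, and the helper names are hypothetical. It also assumes that true and false positives are counted over all lines, and that the covered-subset score averages only over languages the model supports; the paper may handle the subset differently.

```python
from collections import Counter


def per_language_f1(gold, pred):
    """Return {language: F1} for every language that appears in `gold`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the gold language was missed
            fp[p] += 1  # the predicted language was wrongly assigned
    scores = {}
    for lang in set(gold):
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is nonzero here
        # because each gold language has at least one TP or FN.
        scores[lang] = 2 * tp[lang] / (2 * tp[lang] + fp[lang] + fn[lang])
    return scores


def macro_f1(scores, languages):
    """Unweighted mean of per-language F1 over `languages` present in `scores`."""
    langs = [lang for lang in languages if lang in scores]
    return sum(scores[lang] for lang in langs) / len(langs) if langs else 0.0


# Toy gold labels and predictions (hypothetical language codes).
gold = ["fra", "fra", "hau", "hau", "yor", "yor"]
pred = ["fra", "fra", "hau", "fra", "hau", "fra"]
covered = {"fra", "hau"}  # languages a hypothetical model supports

scores = per_language_f1(gold, pred)
full = macro_f1(scores, set(gold))
subset = macro_f1(scores, covered)
n_covered = len(covered & set(gold))

print(f"Macro-F1 (all {len(set(gold))} languages): {full:.3f}")
print(f"Macro-F1 (covered subset, {n_covered} languages): {subset:.3f}")
```

Printing the covered-language count next to each score keeps the two numbers comparable across models that support different language sets.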
Questions and Community
For questions about the dataset, open a discussion on the Hugging Face dataset page. For broader questions about Common Crawl's multilingual work, visit the Common Crawl Foundation.
Acknowledgments
Partners and Collaborators
CommonLID was a community endeavor spanning research institutions, open-source organizations, and language communities across the globe.
Citation
Cite CommonLID
If you use CommonLID in your research, please cite the paper below. Citations help sustain the community work that made this dataset possible.
@article{ortiz-suarez-burchell-arnett-etal-2026-commonlid,
title = {CommonLID: Re-evaluating State-of-the-Art Language Identification
Performance on Web Data},
author = {Pedro Ortiz Suarez and Laurie Burchell and Catherine Arnett and Rafael
Mosquera-Gómez and Sara Hincapie-Monsalve and Thom Vaughan and Damian
Stewart and Malte Ostendorff and Idris Abdulmumin and Vukosi Marivate and
Shamsuddeen Hassan Muhammad and Atnafu Lambebo Tonja and Hend Al-Khalifa
and Nadia Ghezaiel Hammouda and Verrah Otiende and Tack Hwa Wong and
Jakhongir Saydaliev and Melika Nobakhtian and Muhammad Ravi Shulthan Habibi
and Chalamalasetti Kranti and Carol Muchemi and Khang Nguyen and Faisal
Muhammad Adam and Luis Frentzen Salim and Reem Alqifari and Cynthia Amol
and Joseph Marvin Imperial and Ilker Kesen and Ahmad Mustafid and Pavel
Stepachev and Leshem Choshen and David Anugraha and Hamada Nayel and Seid
Muhie Yimam and Vallerie Alexandra Putra and My Chiffon Nguyen and Azmine
Toushik Wasi and Gouthami Vadithya and Rob van der Goot and Lanwenn ar
C'horr and Karan Dua and Andrew Yates and Mithil Bangera and Yeshil Bangera
and Hitesh Laxmichand Patel and Shu Okabe and Fenal Ashokbhai Ilasariya and
Dmitry Gaynullin and Genta Indra Winata and Yiyuan Li and Juan Pablo
Martínez and Amit Agarwal and Ikhlasul Akmal Hanif and Raia Abu Ahmad and
Esther Adenuga and Filbert Aurelian Tjiaranata and Weerayut Buaphet and
Michael Anugraha and Sowmya Vajjala and Benjamin Rice and Azril Hafizi
Amirudin and Jesujoba O. Alabi and Srikant Panda and Yassine Toughrai and
Bruhan Kyomuhendo and Daniel Ruffinelli and Akshata A and Manuel Goulão and
Ej Zhou and Ingrid Gabriela Franco Ramirez and Cristina Aggazzotti and
Konstantin Dobler and Jun Kevin and Quentin Pagès and Nicholas Andrews and
Nuhu Ibrahim and Mattes Ruckdeschel and Amr Keleg and Mike Zhang and Casper
Muziri and Saron Samuel and Sotaro Takeshita and Kun Kerdthaisong and Luca
Foppiano and Rasul Dent and Tommaso Green and Ahmad Mustapha Wali and
Kamohelo Makaaka and Vicky Feliren and Inshirah Idris and Hande Celikkanat
and Abdulhamid Abubakar and Jean Maillard and Benoît Sagot and Thibault
Clérice and Kenton Murray and Sarah Luger},
year = 2026,
url = {https://arxiv.org/abs/2601.18026},
eprint = {2601.18026},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}