Deprecated: Dataset "mc4" is deprecated and will be deleted. Use "allenai/c4" instead.

Dataset Card for mC4

Dataset Summary

A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.

This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4

108 languages are available and are reported in the table below.

Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.

language code	language name
af	Afrikaans
am	Amharic
ar	Arabic
az	Azerbaijani
be	Belarusian
bg	Bulgarian
bg-Latn	Bulgarian (Latin)
bn	Bangla
ca	Catalan
ceb	Cebuano
co	Corsican
cs	Czech
cy	Welsh
da	Danish
de	German
el	Greek
el-Latn	Greek (Latin)
en	English
eo	Esperanto
es	Spanish
et	Estonian
eu	Basque
fa	Persian
fi	Finnish
fil	Filipino
fr	French
fy	Western Frisian
ga	Irish
gd	Scottish Gaelic
gl	Galician
gu	Gujarati
ha	Hausa
haw	Hawaiian
hi	Hindi
hi-Latn	Hindi (Latin)
hmn	Hmong, Mong
ht	Haitian
hu	Hungarian
hy	Armenian
id	Indonesian
ig	Igbo
is	Icelandic
it	Italian
iw	Hebrew (former language code)
ja	Japanese
ja-Latn	Japanese (Latin)
jv	Javanese
ka	Georgian
kk	Kazakh
km	Khmer
kn	Kannada
ko	Korean
ku	Kurdish
ky	Kyrgyz
la	Latin
lb	Luxembourgish
lo	Lao
lt	Lithuanian
lv	Latvian
mg	Malagasy
mi	Maori
mk	Macedonian
ml	Malayalam
mn	Mongolian
mr	Marathi
ms	Malay
mt	Maltese
my	Burmese
ne	Nepali
nl	Dutch
no	Norwegian
ny	Nyanja
pa	Punjabi
pl	Polish
ps	Pashto
pt	Portuguese
ro	Romanian
ru	Russian
ru-Latn	Russian (Latin)
sd	Sindhi
si	Sinhala
sk	Slovak
sl	Slovenian
sm	Samoan
sn	Shona
so	Somali
sq	Albanian
sr	Serbian
st	Southern Sotho
su	Sundanese
sv	Swedish
sw	Swahili
ta	Tamil
te	Telugu
tg	Tajik
th	Thai
tr	Turkish
uk	Ukrainian
und	Unknown language
ur	Urdu
uz	Uzbek
vi	Vietnamese
xh	Xhosa
yi	Yiddish
yo	Yoruba
zh	Chinese
zh-Latn	Chinese (Latin)
zu	Zulu
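
The "-Latn" suffix convention noted before the table can be handled with a little string processing. The following is a minimal sketch, assuming every config code is either a bare language code or a language code followed by a hyphen and a script name; `split_config` is a hypothetical helper introduced for illustration, not part of the `datasets` library.

```python
from typing import Optional, Tuple


def split_config(code: str) -> Tuple[str, Optional[str]]:
    """Split an mC4 config code into (language, script).

    Hypothetical helper: the script part is None when the code has no suffix.
    """
    language, sep, script = code.partition("-")
    return language, (script if sep else None)


# A few codes copied from the table above.
sample = ["hi", "hi-Latn", "zh-Latn", "en"]

# Keep only the romanized variants.
romanized = [code for code in sample if split_config(code)[1] == "Latn"]
print(romanized)  # ['hi-Latn', 'zh-Latn']
```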

You can load the mC4 subset of any language like this:

from datasets import load_dataset

en_mc4 = load_dataset("mc4", "en")

You can even specify a list of languages:

from datasets import load_dataset

mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
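
It can be convenient to check the requested codes against the language table before starting a download. The sketch below is one way to do that. `SUPPORTED`, `check_languages`, and `load_subset` are hypothetical names introduced here. The `load_dataset` call simply mirrors the one shown above, and it assumes the `datasets` package, network access, and a library version that can still load this deprecated dataset.

```python
from typing import Iterable, List

# The 108 config codes listed in the language table of this card.
SUPPORTED = frozenset(
    """
    af am ar az be bg bg-Latn bn ca ceb co cs cy da de el el-Latn en eo es
    et eu fa fi fil fr fy ga gd gl gu ha haw hi hi-Latn hmn ht hu hy id ig
    is it iw ja ja-Latn jv ka kk km kn ko ku ky la lb lo lt lv mg mi mk ml
    mn mr ms mt my ne nl no ny pa pl ps pt ro ru ru-Latn sd si sk sl sm sn
    so sq sr st su sv sw ta te tg th tr uk und ur uz vi xh yi yo zh zh-Latn
    zu
    """.split()
)


def check_languages(requested: Iterable[str]) -> List[str]:
    """Return the requested codes, de-duplicated in order; raise on unknown codes."""
    codes = list(dict.fromkeys(requested))
    unknown = [code for code in codes if code not in SUPPORTED]
    if unknown:
        raise ValueError(f"Unknown mC4 language codes: {unknown}")
    return codes


def load_subset(requested: Iterable[str]):
    """Load several languages at once (not called here: needs network access)."""
    from datasets import load_dataset  # imported lazily so the checks run offline

    return load_dataset("mc4", languages=check_languages(requested))


print(check_languages(["en", "fr", "en", "zh"]))  # ['en', 'fr', 'zh']
```

Calling `load_subset(["en", "fr", "es", "de", "zh"])` would then perform the same load as the example above, after the codes have been validated.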

Supported Tasks and Leaderboards

mC4 is mainly intended to pretrain language models and word representations.

Languages

The dataset supports 108 languages.

Dataset Structure

Data Instances

An example from the en config is:

{'timestamp': '2018-06-24T01:32:39Z',
 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}

Data Fields

The data have several fields:

url: URL of the source page, as a string
text: text content of the page, as a string
timestamp: timestamp of the page, as a string
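
Since all three fields arrive as strings, downstream code may want a typed record. The following is a minimal sketch, assuming every timestamp follows the `YYYY-MM-DDTHH:MM:SSZ` shape of the example instance above (the card does not state this explicitly). `Mc4Record` and `parse_record` are hypothetical names, and the example text is shortened for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class Mc4Record:
    url: str
    text: str
    timestamp: datetime


def parse_record(raw: Dict[str, str]) -> Mc4Record:
    """Convert a raw mC4 row into a typed record with a UTC-aware timestamp."""
    # Assumption: timestamps look like '2018-06-24T01:32:39Z' (trailing Z = UTC).
    parsed = datetime.strptime(raw["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return Mc4Record(
        url=raw["url"],
        text=raw["text"],
        timestamp=parsed.replace(tzinfo=timezone.utc),
    )


# Values taken from the example instance in this card (text shortened).
example = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}

record = parse_record(example)
print(record.timestamp.isoformat())  # 2018-06-24T01:32:39+00:00
```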

Data Splits

To build mC4, the authors used CLD3 to identify over 100 languages. The table below lists the resulting mC4 config for each language; the train and validation split sizes have not been filled in and are shown as "?".

config	train	validation
af	?	?
am	?	?
ar	?	?
az	?	?
be	?	?
bg	?	?
bg-Latn	?	?
bn	?	?
ca	?	?
ceb	?	?
co	?	?
cs	?	?
cy	?	?
da	?	?
de	?	?
el	?	?
el-Latn	?	?
en	?	?
eo	?	?
es	?	?
et	?	?
eu	?	?
fa	?	?
fi	?	?
fil	?	?
fr	?	?
fy	?	?
ga	?	?
gd	?	?
gl	?	?
gu	?	?
ha	?	?
haw	?	?
hi	?	?
hi-Latn	?	?
hmn	?	?
ht	?	?
hu	?	?
hy	?	?
id	?	?
ig	?	?
is	?	?
it	?	?
iw	?	?
ja	?	?
ja-Latn	?	?
jv	?	?
ka	?	?
kk	?	?
km	?	?
kn	?	?
ko	?	?
ku	?	?
ky	?	?
la	?	?
lb	?	?
lo	?	?
lt	?	?
lv	?	?
mg	?	?
mi	?	?
mk	?	?
ml	?	?
mn	?	?
mr	?	?
ms	?	?
mt	?	?
my	?	?
ne	?	?
nl	?	?
no	?	?
ny	?	?
pa	?	?
pl	?	?
ps	?	?
pt	?	?
ro	?	?
ru	?	?
ru-Latn	?	?
sd	?	?
si	?	?
sk	?	?
sl	?	?
sm	?	?
sn	?	?
so	?	?
sq	?	?
sr	?	?
st	?	?
su	?	?
sv	?	?
sw	?	?
ta	?	?
te	?	?
tg	?	?
th	?	?
tr	?	?
uk	?	?
und	?	?
ur	?	?
uz	?	?
vi	?	?
xh	?	?
yi	?	?
yo	?	?
zh	?	?
zh-Latn	?	?
zu	?	?

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

AllenAI are releasing this dataset under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

Citation Information

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}

Contributions

Thanks to @dirkgr and @lhoestq for adding this dataset.
