
This page lists all corpora that have resulted from or been used in our research. Their availability to people outside Webis is as follows: (1) corpora officially released by Webis and (2) corpora of the PAN series can be downloaded here; (3) internal Webis corpora (which will be officially released in the future) are supplied upon request; (4) other corpora can be downloaded from their original publisher or creator. Most of our released corpora are hosted at Zenodo and indexed in Google Dataset Search; a few larger corpora are available in the Internet Archive; some corpora are accessible via the Hugging Face (Huggingface) and IR datasets (ir_datasets) libraries. In the Access column, "browser" indicates a browsing facility for the respective corpus.
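The corpora listed below range from a few kilobytes to many terabytes, so it can help to turn the Size [bytes] column into a number before planning a download. The following is a minimal sketch, not part of the Webis tooling: the helper names `size_to_bytes` and `largest_first` are hypothetical, and the decimal multipliers (1 GB = 10^9 bytes) are an assumption, since the page does not say whether sizes are decimal or binary.

```python
import re

# Assumption: decimal (SI) multipliers; the page does not state
# whether the listed sizes are decimal or binary.
UNIT_BYTES = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}


def size_to_bytes(text):
    """Convert a size such as '44 GB' or '2.2 GB' to bytes.

    Returns None when the size is missing (listed as '-') or malformed.
    """
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB|PB)\s*", text)
    if match is None:
        return None
    value, unit = match.groups()
    return int(round(float(value) * UNIT_BYTES[unit]))


def largest_first(sizes):
    """Return the corpus names with a known size, largest first."""
    known = {name: size_to_bytes(s) for name, s in sizes.items()}
    known = {name: b for name, b in known.items() if b is not None}
    return sorted(known, key=known.get, reverse=True)


# A few rows copied from the tables on this page.
sizes = {
    "Archive Query Log 2022": "44 GB",
    "Webis Chatnoir-Copycat 2021": "91 TB",
    "Webis-Violence-in-Fan-Fiction-21": "2.2 GB",
    "IR Benchmarks": "-",
}

ranked = largest_first(sizes)
```

Here `ranked` holds only the three corpora with a stated size; entries listed as "-" are skipped rather than treated as zero, so an unknown size is never mistaken for a small one.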

Released Webis Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Archive Query Log 2022 Webis Group 2023 44 GB 357M queries Query Log Analysis browser Google Dataset Search
Arg-Microtexts Synthesis Benchmark Webis Group 2018 4 MB 260 arguments Computational Argumentation Zenodo
args.me corpus Webis Group 2019 876 MB 388K arguments Computational Argumentation ir_datasets Huggingface Zenodo Google Dataset Search
ArguAna Counterargs Webis Group 2018 106 MB 7K arguments Computational Argumentation Zenodo
ArguAna TripAdvisor Webis Group & FG Engels 2014 283 MB 2K reviews Sentiment Analysis Zenodo
BuzzFeed-Webis Fake News Corpus 16 Webis Group 2018 5 GB 1K articles News analysis Zenodo Google Dataset Search
CauseNet-20 Webis Group & Data Science Group 2020 2 GB 12M relations Causal Relation Analysis Zenodo
CommonCrawl News Articles by Political Orientation Webis Group 2022 4 GB - - Media Bias detection, Social Bias detection Zenodo
CompArg: Comparative Sentences 2019 Universität Hamburg 2019 3 MB - - Comparative Sentences Classification Zenodo Google Dataset Search
Dagstuhl-15512-ArgQuality Dagstuhl-15512 Quality breakout group 2017 1 MB 304 arguments Computational Argumentation Zenodo
Genre-KI-04 Webis Group 2004 11 MB 1K documents Web Genre Analysis Zenodo Google Dataset Search
IR Benchmarks Webis Group 2023 - 2K runs Leaderboards
LFA-11 Webis Group & FG Engels 2011 5 MB - - Genre and Sentiment Analysis Zenodo Google Dataset Search
Paderborn Genre Analysis Corpus 2012 Baumann, Lettmann, Stein 2012 20 MB - - Web Genre Analysis Zenodo Google Dataset Search
SCAI-QReCC-21 Webis Group 2023 244 MB 14K conversations Conversational Analysis (written) Zenodo
SMAuC – The Scientific Multi-Authorship Corpus Webis Group 2023 51 GB 22K documents Authorship Analytics Zenodo
TexBiG Tschirschwitz, Klemstein, Stein, Rodehorst 2022 15 GB 52K images Document Layout Analysis Zenodo Google Dataset Search
WDVC-15 FG Engels & Webis Group 2015 5 GB 24M revisions Vandalism Detection Zenodo Google Dataset Search
WDVC-16 FG Engels & Webis Group 2016 30 GB 83M revisions Vandalism Detection Zenodo Google Dataset Search
Webis Chatnoir-Copycat 2021 Webis Group 2021 91 TB 7B documents Duplicate Detection
Webis MS MARCO Anchor Text 2022 Webis Group 2022 4 GB 7M documents Anchor Text Huggingface Zenodo
Webis-Ambient-15 Webis Group 2015 114 MB 6K documents Clustering/Cluster Labeling Zenodo Google Dataset Search
Webis-ArgImages-21 Webis Group 2021 1 MB 3K images Computational Argumentation Zenodo Google Dataset Search
Webis-ArgKB-20 Webis Group 2020 1 MB 5K argumentative relations Computational Argumentation Zenodo
Webis-ArgQuality-20 Webis Group 2020 3 MB 1K arguments Computational Argumentation Zenodo
Webis-ArgRank-17 Webis Group 2017 13 MB 18K arguments Computational Argumentation Zenodo
Webis-Argument-Attributes Webis Group & DRL Potsdam 2020 1 KB 20 attributes Computational Argumentation browser
Webis-Argument-Framing-19 Webis Group 2019 7 MB 12K arguments Computational Argumentation and Framing Zenodo Google Dataset Search
Webis-ArgValues-22 Webis Group 2022 1 MB 5K arguments Human Value Detection Zenodo Google Dataset Search
Webis-Bias-Flipper-18 Webis Group 2018 13 MB 6K documents Natural Language Generation Zenodo Google Dataset Search
Webis-CausalQA-22 Webis Group 2022 17 GB 1M question-answer pairs Causal Question Answering Zenodo Google Dataset Search
Webis-Clickbait-16 Webis Group 2016 255 MB 3K tweets Clickbait Detection Zenodo Google Dataset Search
Webis-Clickbait-17 Webis Group 2017 - 20K tweets Clickbait Detection Zenodo Google Dataset Search
Webis-Clickbait-22 Webis Group 2022 10 MB 5K posts Clickbait Spoiling Zenodo Google Dataset Search
Webis-CLS-10 Webis Group 2010 530 MB 800K documents Cross-Language Text Classification Zenodo Google Dataset Search
Webis-CMV-20 Webis Group 2020 3 GB - argument pairs Computational Argumentation Zenodo
Webis-CompQuestions-20 Webis Group 2020 1 MB 15K questions Comparative Question Classification Zenodo Google Dataset Search
Webis-CompQuestions-22 Webis Group 2022 5 MB 31K questions Comparative Question Classification Zenodo Google Dataset Search
Webis-ConcluGen-21 Webis Group 2021 225 MB 136K argument-conclusion pairs Informative Conclusion Generation, Text Summarization Huggingface Zenodo Google Dataset Search
Webis-Context-SciSumm-2023 Webis Group 2023 10 GB 4.6M document-summary pairs Contextualized Summarization Zenodo
Webis-Context-sensitive-Word-Search-Queries-2022 Webis Group 2022 489 MB 24M queries Context-sensitive Word Search Zenodo
Webis-Conversational-Query-Reformulations-21 Webis Group 2021 193 KB 3K messages Query classification Zenodo Google Dataset Search
Webis-CPC-11 Webis Group 2011 19 MB 8K paraphrases Plagiarism Detection Zenodo Google Dataset Search
Webis-Dataset-Reviews-21 Webis Group 2021 43 MB 539K dataset mentions Dataset Search Zenodo
Webis-Debate-16 Webis Group 2016 908 KB 27K text segments Computational Argumentation Zenodo Google Dataset Search
Webis-Editorial-Quality-18 Webis Group 2018 3 MB 1K documents Computational Argumentation Zenodo Google Dataset Search
Webis-Editorials-16 Webis Group 2016 5 MB 300 documents Computational Argumentation Zenodo Google Dataset Search
Webis-EditorialSum-20 Webis Group 2020 10 MB 1K editorials Text Summarization Zenodo Google Dataset Search
Webis-Exhibition-Questions-21 Webis Group 2021 34 MB 849 questions Conversational Analysis (written) browser Zenodo Google Dataset Search
Webis-Follow-Up-Questions-24 Webis Group 2024 20 MB 19K turns User Simulation Zenodo Google Dataset Search
Webis-Generated-Game-Art-23 Webis Group 2023 117 MB 110 images Image Generation Zenodo Google Dataset Search
Webis-Gmane-19 Webis Group 2019 160 GB 153M emails Dialog Analysis Zenodo Google Dataset Search Internet Archive
Webis-Health-CauseNet-22 Webis Group 2022 1 GB 8M sentences Health Causal Relation Analysis Zenodo
Webis-Health-Misbeliefs-21 Webis Group 2021 200 KB - terms Query Analysis Zenodo
Webis-KIQC-13 Webis Group 2013 1 MB 3K questions Known-Item Search Zenodo Google Dataset Search
Webis-Mnemonics-17 Webis Group 2017 2 MB 1K mnemonics Password analysis Zenodo Google Dataset Search
Webis-News-Bias-20 Webis Group 2020 14 MB 7K articles News analysis, Media Bias detection Zenodo Google Dataset Search
Webis-NIL-21 Webis Group 2021 392 KB 37K log entries Query identification Zenodo Google Dataset Search
Webis-Nudged-Questions-23 Webis Group 2023 125 MB 9K questions Conversational Analysis Zenodo Google Dataset Search
Webis-ODP-10 Webis Group 2010 113 MB 5M documents Clustering/Cluster Labeling Zenodo Google Dataset Search
Webis-PC-08 Webis Group 2008 298 MB - - Plagiarism Detection Zenodo Google Dataset Search
Webis-Persuasive-Debaters-on-Reddit-CMV-2022 Webis Group 2022 492 MB 4K debaters Persuasiveness Analysis Zenodo
Webis-PRA-12 Webis Group 2012 884 KB 14K company names Spelling Error Detection Zenodo Google Dataset Search
Webis-PSERP-24 Webis Group 2024 887 MB 511K SERPs SEO Spam Detection Zenodo
Webis-QInC-22 Webis Group 2022 79 MB 13 MB queries Query Interpretation Zenodo Google Dataset Search
Webis-QSeC-10 Webis Group 2010 2 MB - - Query Segmentation Zenodo Google Dataset Search
Webis-QSpell-17 Webis Group 2017 1 MB - - Query Spelling Correction Zenodo Google Dataset Search
Webis-QTM-19 Webis Group 2019 2 MB 200K queries Query-task mapping Zenodo Google Dataset Search
Webis-Revenue-10 FG Engels & Webis Group 2010 6 MB 1K documents Entity and Relation Extraction Zenodo Google Dataset Search
Webis-SameSentiment-21 Webis Group 2021 43 MB 704K sentiment pair ids Sentiment Analysis Zenodo
Webis-SameSide-19 Webis Group 2020 63 MB 125K argument pairs Computational Argumentation Zenodo
Webis-SameSide-21 Webis Group 2021 150 MB - argument pairs Computational Argumentation Zenodo
Webis-SameSideAdversarial-21 Webis Group 2021 50 KB 175 argument pairs Computational Argumentation Zenodo
Webis-SCSmeta-21 Webis Group 2021 25 KB 1K turns Conversational Analysis (spoken) Zenodo Google Dataset Search
Webis-SDMbridge-12 Webis Group 2012 58 MB 15K models Simulation Data Mining Zenodo Google Dataset Search
Webis-Sentences-17 Webis Group 2017 200 GB 3B sentences Text statistics Zenodo Google Dataset Search
Webis-SMC-12 Webis Group 2012 123 KB - - Search Mission Detection Zenodo Google Dataset Search
Webis-Snippet-20 Webis Group 2020 11 GB 10M snippet-webpage pairs Abstractive Snippet Generation, Text Summarization Zenodo Google Dataset Search
Webis-STEREO-21 Webis Group 2021 8 GB 91M cases Text Reuse Detection Zenodo
Webis-TLDR-17 Webis Group 2017 2 GB 4M content-summary pairs Text Summarization Zenodo Google Dataset Search
Webis-Topic-Ontologies Webis Group 2023 2 GB 9M units Argument Mining, Argument Generation, Argument Retrieval Zenodo
Webis-TRC-12 Webis Group 2012 120 MB 150 interaction logs Text Reuse Detection, Paraphrasing, and Exploratory Search Zenodo Google Dataset Search
Webis-Trigger-Warning-Corpus-22 Webis Group 2023 54 GB 1M documents Multi Label Document Classification Zenodo
Webis-Tripad-13-Sentiment Webis Group 2013 3 MB 2K reviews Sentiment Analysis Zenodo Google Dataset Search
Webis-Tripad-14 Webis Group 2014 61 MB 266K reviews Sentiment Analysis and Author Profiling Zenodo Google Dataset Search
Webis-Violence-in-Fan-Fiction-21 Webis Group 2023 2.2 GB 30K documents Document Classification Zenodo
Webis-Voice-based-and-Conversational-Argument-Search-20 Webis Group 2020 350 KB 500 participants Conversational Analysis (spoken) Zenodo Google Dataset Search
Webis-Web-Archive-17 Webis Group 2017 94 GB 10K documents Web Analysis browser Zenodo Google Dataset Search
Webis-Web-Archive-Quality-22 Webis Group 2022 18 GB 7K documents Web Analysis Zenodo Google Dataset Search
Webis-Web-Errors-19 Webis Group 2019 1 MB 10K documents Web Analysis browser Zenodo Google Dataset Search
Webis-WebSeg-20 Webis Group 2020 12 GB 8K documents Web Page Segmentation Zenodo Google Dataset Search
Webis-WebSeg-20-Algorithm-Segmentations Webis Group 2021 7 GB 246K segmentations Web Page Segmentation Zenodo Google Dataset Search
Webis-WikiDebate-18 Webis Group 2018 78 MB 6M discussions Computational Argumentation Zenodo Google Dataset Search
Webis-WikiDiscussions-18 Webis Group 2018 4 GB 6M discussions Computational Argumentation Zenodo Google Dataset Search
Webis-Wikipedia-IPC-23 Webis Group 2023 52 MB 916K paraphrase pairs Paraphrasing Zenodo Google Dataset Search
Webis-Wikipedia-Text-Reuse-18 Webis Group 2018 - - text segments Text Reuse Analysis Zenodo Google Dataset Search
Webis-WikiSciTech-23 Webis Group 2023 26 MB 2904 articles Micro-Notability Analytics Zenodo Google Dataset Search
Webis-WVC-07 Webis Group 2007 12 KB 1K documents Vandalism Detection Zenodo Google Dataset Search
Webis-YouTube8MA-18 Webis Group 2018 169 GB 6M documents Video Retrieval Zenodo Google Dataset Search
PAN Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Alvi15-Text-Alignment-en-fa Webis Group 2015 2 MB 200 documents Originality Zenodo Google Dataset Search
C10-Attribution Webis Group 2015 4 MB - - Author Identification Zenodo Google Dataset Search
C50-Attribution Webis Group 2015 17 MB - - Author Identification Zenodo Google Dataset Search
Cheema15-Text-Alignment-en Webis Group 2015 4 MB - - Originality Zenodo Google Dataset Search
FIRE14-SOurce-COde-Re-use PAN 2014 16 MB - - Originality Zenodo
Hanfi15-Text-Alignment-en-ur Webis Group 2015 3 MB - - Originality Zenodo Google Dataset Search
Khoshnavataher15-Text-Alignment-fa Webis Group 2015 16 MB - - Originality Zenodo Google Dataset Search
Kong15-Text-Alignment-zh Webis Group 2015 3 MB - - Originality Zenodo Google Dataset Search
Mohtaij15-Text-Alignment-en Webis Group 2015 57 MB - - Originality Zenodo Google Dataset Search
Palkovskii15-Text-Alignment-en Webis Group 2015 26 MB - - Originality Zenodo Google Dataset Search
PAN-PC-09 Webis Group 2009 2 GB 41K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-PC-10 Webis Group 2010 2 GB 27K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-PC-11 Webis Group 2011 2 GB 27K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-SemEval-Hyperpartisan-News-Detection-19 Webis & Factmata 2018 1 GB 751K articles Hyperpartisan News Detection Zenodo Google Dataset Search
PAN-WQF-12 Webis Group 2012 4 GB 2M documents Quality Flaw Prediction Zenodo Google Dataset Search
PAN-WVC-10 Webis Group 2010 439 MB 32K documents Vandalism Detection Zenodo Google Dataset Search
PAN-WVC-11 Webis Group 2011 371 MB 24K documents Vandalism Detection Zenodo Google Dataset Search
PAN11-Attribution Webis Group 2011 3 MB - - Author Identification Zenodo Google Dataset Search
PAN12-Attribution Webis Group 2012 9 MB - - Author Identification Zenodo Google Dataset Search
PAN12-Sexual-Predator-Identification Webis Group 2012 92 MB - - Deception Detection Zenodo Google Dataset Search
PAN12-Source-Retrieval Webis Group 2012 1 MB - - Originality Zenodo Google Dataset Search
PAN12-Text-Alignment Webis Group 2012 783 MB - - Originality Zenodo Google Dataset Search
PAN13-Author-Profiling Webis Group 2013 713 MB - - Author Profiling Zenodo Google Dataset Search
PAN13-Source-Retrieval Webis Group 2013 3 MB - - Originality Zenodo Google Dataset Search
PAN13-Text-Alignment Webis Group 2013 35 MB - - Originality Zenodo Google Dataset Search
PAN13-Verification Webis Group 2013 1 MB - - Author Identification Zenodo Google Dataset Search
PAN14-Author-Profiling Webis Group 2014 205 MB - - Author Profiling Zenodo Google Dataset Search
PAN14-Source-Retrieval Webis Group 2014 7 MB - - Originality Zenodo Google Dataset Search
PAN14-Text-Alignment Webis Group 2014 22 MB - - Originality Zenodo Google Dataset Search
PAN14-Verification Webis Group 2014 9 MB - - Author Identification Zenodo Google Dataset Search
PAN15-Author-Profiling Webis Group 2015 2 MB - - Author Profiling Zenodo Google Dataset Search
PAN15-Source-Retrieval Webis Group 2015 7 MB - - Originality Zenodo Google Dataset Search
PAN15-Verification Webis Group 2015 3 MB - - Author Identification Zenodo Google Dataset Search
PAN16-Author-Masking PAN 2016 2 MB 205 cases Author Obfuscation browser Zenodo Google Dataset Search
PAN16-Author-Profiling Webis Group 2016 2 MB - - Author Profiling Zenodo Google Dataset Search
PAN16-Clustering Webis Group 2016 3 MB - - Author Identification Zenodo Google Dataset Search
PAN17-Author-Profiling Webis Group 2017 254 MB - - Author Profiling Zenodo Google Dataset Search
PAN17-Clustering Webis Group 2017 1 MB - - Author Identification Zenodo Google Dataset Search
PAN17-Style-Change-Detection Webis Group 2017 8 MB - - Multi-Author Analysis Zenodo Google Dataset Search
PAN18-Attribution Webis Group 2018 4 MB 2K cases Author Identification Zenodo Google Dataset Search
PAN18-Author-Profiling PAN 2018 7 GB 8K cases Author Profiling browser Zenodo Google Dataset Search
PAN18-Style-Change-Detection Webis Group 2018 8 MB 3K cases Multi-Author Analysis browser Zenodo Google Dataset Search
PAN19-Attribution Webis Group 2019 13 MB - - Author Identification Zenodo Google Dataset Search
PAN19-Bots-and-Gender-Profiling Webis Group 2019 38 MB - - Author Profiling Zenodo Google Dataset Search
PAN19-Celebrity-Profiling Webis Group 2019 3 GB - - Author Profiling Zenodo Google Dataset Search
PAN19-Style-Change-Detection Webis Group 2019 10 MB - - Multi-Author Analysis Zenodo Google Dataset Search
PAN20-Authorship-Verification Webis Group 2020 838 MB - - Authorship Verification Zenodo Google Dataset Search
PAN20-Authorship-Verification (Large) Webis Group 2020 4 GB - - Authorship Verification Zenodo Google Dataset Search
PAN20-Celebrity-Profiling Webis Group 2020 7 GB - - Author Profiling Zenodo Google Dataset Search
PAN20-Profiling-Fake-News-Spreaders-in-Twitter Webis Group 2020 8 MB - - Author Profiling Zenodo Google Dataset Search
PAN20-Style-Change-Detection Webis Group 2020 98 MB - - Multi-Author Analysis Zenodo Google Dataset Search
PAN21-Authorship-Verification Webis Group 2021 322 MB - - Authorship Verification Zenodo Google Dataset Search
PAN21-Profiling-Hate-Speech-Spreaders-on-Twitter Webis Group 2021 3 MB - - Author Profiling Zenodo
PAN21-Style-Change-Detection Webis Group 2021 19 MB - - Multi-Author Analysis Zenodo
PAN22-Authorship-Verification Webis Group 2022 23 MB - - Authorship Verification Zenodo Google Dataset Search
PAN22-Profiling-Irony-and-Stereotype-Spreaders-on-Twitter Webis Group 2022 6 MB - - Author Profiling Zenodo
PAN22-Style-Change-Detection Webis Group 2022 28 MB - - Multi-Author Analysis Zenodo
PAN23-Multi-Author-Writing-Style-Analysis Webis Group 2023 26 MB - Reddit comments Multi-Author Analysis Zenodo
PAN23-Profiling-Cryptocurrency-Influencers-with-Few-shot-Learning Symanto Research 2023 202 KB - Tweets Author Profiling Zenodo
PAN23-Trigger-Detection Webis Group 2023 2 GB 341K fanworks Trigger Detection Zenodo
PAN24-Generative-AI-Authorship-Verification Webis Group 2024 12.4 MB 15K news articles Generative AI Authorship Verification Zenodo
PAN24-Multi-Author-Writing-Style-Analysis Webis Group 2024 26 MB - Reddit comments Multi-Author Analysis Zenodo
PAN24-Multilingual-Text-Detoxification Webis Group 2024 - - - Multilingual Text Detoxification Zenodo
PAN24-Oppositional-Thinking-Analysis Webis Group 2024 8 MB 8,000 posts Oppositional Thinking Analysis Zenodo
Scientific Author's Writing Style Corpus 2017 Rexha, Kröll, Ziak, Kern 2017 - 66 cases Authorship Attribution Zenodo Google Dataset Search
Touché Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Touché20-Argument-Retrieval-for-Comparative-Questions Webis Group 2020 3 MB 50 topics Argument search ir_datasets Zenodo
Touché20-Argument-Retrieval-for-Controversial-Questions Webis Group 2020 9 MB 50 topics Argument search ir_datasets Zenodo
Touché21-Argument-Retrieval-for-Comparative-Questions Webis Group 2021 200 KB 50 topics Argument search ir_datasets Zenodo
Touché21-Argument-Retrieval-for-Controversial-Questions Webis Group 2021 1 MB 50 topics Argument search ir_datasets Zenodo Google Dataset Search
Touché22-Argument-Retrieval-for-Comparative-Questions Webis Group 2022 700 MB 50 topics Argument search ir_datasets Zenodo
Touché22-Argument-Retrieval-for-Controversial-Questions Webis Group 2022 2 GB 50 topics Argument search ir_datasets Zenodo
Touché22-Image-Retrieval-for-Arguments Webis Group 2022 169 GB 24K images Image search ir_datasets Zenodo
Touché23-Argument-Retrieval-for-Controversial-Questions Webis Group 2023 1 MB 50 topics Argument search Zenodo
Touché23-Evidence-Retrieval-for-Causal-Questions Webis Group 2023 1 MB 50 topics Causal retrieval Zenodo
Touché23-Image-Retrieval-for-Arguments Webis Group 2023 1 TB 56K images Image search Zenodo Google Dataset Search
Touché23-ValueEval Webis Group 2023 1 MB 9K arguments Human Value Detection Huggingface Zenodo Google Dataset Search
Touché24-Image-Retrieval-and-Generation-for-Arguments Webis Group 2024 53 GB 7K images Image search Zenodo Google Dataset Search
Touché24-ValueEval Webis Group 2024 4 MB 3K texts Human Value Detection Zenodo Google Dataset Search
Internal Webis Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Arxiv Webis Group - 674 MB 550 documents -
Bauphysik Webis Group 2010 70 MB - - Vertical Search
Converter Testfiles Webis Group - 2 GB - - -
Genre Corpus (2008) Webis Group 2008 26 MB 2K documents Web Genre Analysis
German Newsgroups Webis Group - 54 MB 27K documents Cluster Analysis
Google News Crawl Webis Group - 404 MB 35K documents -
Gutenberg Wordcount Webis Group - 4 MB - - -
Netspeak Dictionary Webis Group - 3 GB - - -
ODP Cluster Labeling Webis Group 2010 - 6K documents Cluster Labeling
Slashdot Webis Group - 3 GB - - -
TLDP Crawl Webis Group - 366 MB 15K documents -
Twitter Movie Sentiments Webis Group 2010 1 GB - - Sentiment Analysis
Webdiversity Webis Group - 225 MB - - -
Webis-CSP-15 Webis Group 2015 90 GB 30K documents Clustering/Cluster Labeling
Wikipedia Editwars Webis Group 2008 919 MB - - Editwar Detection
Yandex Question Queries Webis Group 2012 200 GB 2B queries -
Youtube Comments Webis Group - 2 GB 324K documents -
Other Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
20 Newsgroups Carnegie Mellon University 1999 18 MB 20K documents Text Classification, Text Clustering
7Sectors-WebKB CMU World Wide Knowledge Base 2001 6 MB 5K documents Text Classification, Text Clustering
A Corpus of Plagiarised Short Answers University of Sheffield 2009 80 KB 100 documents Plagiarism Detection
ABCD (Agreement By Create Debaters) Sara Rosenthal 2015 42 MB 10K dialogues Conversation Analysis (written, human-human)
AgreeSum New York University 2021 12 MB 18K multiple articles-summary pairs Text Summarization, Multi-document
All The News Kaggle 2020 3 GB 3M news articles Text Summarization, Text Analysis
Annotated Customer Reviews Simon Fraser University Burnaby 2004 870 KB - - Sentiment Analysis
Any-Aspect Summarization Carnegie Mellon University 2020 2 GB 280K article-summary pairs Text Summarization
AOL Query Log AOL 2006 2 GB 112M queries Query Log Analysis
Araucaria Argumentation Corpus University of Dundee 2014 9 MB 664 examples Computational Argumentation
Arguing Subjectivity Corpus University of Pittsburgh 2012 732 KB 84 documents Computational Argumentation
Argument Annotated Essays, v1 TU Darmstadt 2014 5 MB 90 essays Computational Argumentation
Argument Annotated Essays, v2 TU Darmstadt 2016 2 MB 402 essays Computational Argumentation
Argument Aspect Corpus Leibniz-Institute for Media Research / Hans-Bredow Institute 2022 2 MB - arguments,chunks Computational Argumentation Zenodo
Arxiv-PubMed Corpus Georgetown University 2018 4 GB 350K article-abstract pairs Text Summarization, Scientific Document Summarization
AWTP (Agreement in Wikipedia Talk Pages) Sara Rosenthal 2012 235 KB 822 dialogues Conversation Analysis (written, human-human)
Bergsma-Wang-Corpus 2007 S. Bergsma and Q. I. Wang 2007 2 MB 2K queries Web Search Analysis
BigPatent Summarization Corpus Khoury College of Computer Sciences 2019 6 GB 1M article-summary pairs (US patents) Text Summarization
Bill Summarization Corpus FiscalNote Research 2019 64 MB 22K article-summary pairs (US bills) Text Summarization
BLOGS06 test collection University of Glasgow 2006 - 4M documents Link Analysis
BNC Writing Errors J. Wagner et al. 2007 274 MB - - Writing Error Detection
British National Corpus (XML) BNC Consortium 2007 5 GB 4K texts Text Analysis (English)
Brown Corpus Brown University 2011 22 MB 500 documents Text Analysis (English)
Burrows Authorship Corpora Steven Burrows, RMIT University 2010 8 MB - - Source Code Authorship Attribution
CEEAUS 2010 Beta Edition Kobe University 2010 - 2K documents Cross-Language Analysis
Change My View Modes Columbia University 2017 - 78 discussion threads Computational Argumentation
CLEANEVAL 2007 University of Trento and University of Leeds 2007 15 MB 1K documents Main Content Extraction
CLEF-IP 2009 Information Retrieval Facility Society (IRF) 2009 14 GB 2M documents Patent Retrieval
CLEF-IP 2010 Information Retrieval Facility Society (IRF) 2010 9 GB 3M documents Patent Retrieval
ClueWeb09 Carnegie Mellon University 2009 4 TB 1B web pages Web Mining
ClueWeb12 Carnegie Mellon University 2012 5 TB 733M web pages Web Mining
CNN-DailyMail IBM 2016 1 GB 200K article-summary pairs Text Summarization
Common Crawl Common Crawl organization 2009-2021 (+) 2 PB 3M WARC files Web Analysis
CoNLL-2003 University of Antwerpen 2003 12 MB - - Named Entity Recognition
ConvoSumm Corpus Yale University 2021 650 MB 500 comments-summary pairs Text Summarization, Dialogue Summarization
CoPhIR Consiglio Nazionale delle Ricerche (ISTI-CNR) 2003 54 GB 106M images Image Retrieval
CORE The Open University 2018 330 GB 123M documents Data Mining
DBLP University of Massachusetts Amherst 2006 910 MB - - Network Analysis
Dbpedia 3.5 DBpedia 2010 8 GB - - Data Mining
DialogSum Corpus Zhejiang University 2021 4 MB 13K dialogue-summary pairs with topics Text Summarization, Dialogue Summarization
DMOZ Open Directory Project 2010 11 GB - - Clustering, Cluster Labeling, and Data Mining
DoQA Ixa 2020 4 MB 2437 dialogues Conversation Analysis (written, human-human)
ECML PKDD Discovery Challenge 2008 ECML 2008 304 MB 17M lines Collaborative Filtering and Spam Detection
ESL 123 Mass Noun Examples Microsoft Corporation 2006 204 KB 123 sentences Cross-Language Analysis
Essay Argument Strength UT Dallas 2015 30 KB 1K scores Essay scoring
Essay Organization UT Dallas 2010 30 KB 1K scores Essay scoring
Essay Prompt Adherence UT Dallas 2014 38 KB 830 scores Essay scoring
Essay Thesis Clarity UT Dallas 2013 6 MB 830 scores Essay scoring
Europarl (v1 & v3) University of Edinburgh 2007 3 GB - - Machine Translation
European Corpus Initiative Multilingual Corpus I European Corpus Initiative 1994 824 MB 49M words Text Analysis (Multilingual)
Falko Essaykorpus L2 V2 Institut für deutsche Sprache und Linguistik 2005 5 MB 248 documents Interlanguage Analysis
Finegrained Sentiment Uppsala University 2011 4 MB 294 reviews Sentiment Analysis
General Inquirer Dictionary Harvard University 1966 4 MB 182 categories Sentiment Analysis
Google Books N-Gram 20090715 Google 2009 898 GB - - Data Mining
Google Web 1T 5-gram Version 1 Google 2006 55 GB 5B n-grams Text Analysis (English)
IBM Debater- Claim Sentences Search IBM 2018 600 MB 2M topic conclusion pairs Argument Search
IBM Debater- Claim Stance Dataset IBM 2017 8 MB 2K topic conclusion Stance Classification
IBM Debater- Claims and Evidence, ACL-14 IBM 2014 3 MB 1K topic argument pairs Argument Mining
IBM Debater- Claims and Evidence, EMNLP-2015 IBM 2015 8 MB 5K topic argument pairs Argument Mining
IBM Debater- Evidence Sentences IBM 2018 3 MB 6K topic premise pairs Argument Search
IBM Debater- Mention Detection Benchmark IBM 2018 2 MB 3K sentences Mention Detection
IBM Debater- Recorded Debating Dataset IBM 2018 2 MB 60 discussions Computational Argumentation
IBM Debater- Sentiment Composition Lexicon IBM 2018 10 MB 66K words Sentiment Analysis
IBM Debater- Sentiment Lexicon of Idiomatic Expressions IBM 2018 3 MB 5K phrases Sentiment Analysis
IBM Debater- TR9856 IBM 2015 2 MB 10K phrase pairs Semantic Relatedness
IBM Debater- Wikipedia Category Stance IBM 2018 1 MB 5K wikipedia category Stance Classification
IBM Debater- Word IBM 2018 4 MB 19K wikipedia concept pairs Semantic Relatedness
ICWSM 2009 Data Challenge ICWSM 2009 37 GB - - Network Analysis
imat2009 dataset Yandex 2009 650 MB - - Machine-learned Ranking
Intelligence Squared Debates (IQ2) Zhang et al. 2016 4 MB 108 dialogues Conversation Analysis (spoken, human-human)
International Corpus of Learner English v2 Center for English Corpus Linguistics 2009 92 MB 6K documents Language Analysis
Internet Archive Internet Archive organization - 350 TB 800K WARC files Web Analysis
Internet Argument Corpus v2 NLDS@UC Santa Cruz 2016 3 GB 11K dialogues Conversation Analysis (written, human-human)
IP2Location LITE databases 2016-20 IP2Location 2016-2019 5 GB 5 years IP-geolocation and proxies
Key-value Retrieval Dataset Stanford University 2017 1 MB 3K dialogues Conversation Analysis (written, human-wizard)
Koppel Authorship Corpus M. Koppel and J. Schler 2004 4 MB - - Authorship Verification
Learning To Rank 3 Microsoft 2008 8 GB - - Machine-learned Ranking
Lee 50 Documents M. D. Lee et al. 2005 130 KB 50 documents Text Similarity Analysis
Maluuba Frames Maluuba (Microsoft) 2017 4 MB 1K dialogues Conversation Analysis (written, human-wizard)
MANtIS Lambda-Lab at TU Delft 2019 6 GB 80K dialogues Conversation Analysis (written, human-human)
MediaSum Corpus Microsoft Cognitive Services Research Group 2021 2 GB 463K interview transcript-summary pairs Text Summarization, Dialogue Summarization
MEDLINE-PubMed Corpus University of Zürich 2018 7 GB 5M article-abstract & abstract-title pairs Text Summarization, Scientific Document Summarization
METER Corpus Department of Journalism and Department of Computer Science at Sheffield University 2002 10 MB - - Text Reuse
MIR Flickr 2008 LIACS Medialab at Leiden University, Netherlands 2008 3 GB 25K documents Image Retrieval
MISC Microsoft 2017 23 GB 110 dialogues Conversation Analysis (spoken, human-human)
Montclair Electronic Language Database Montclair State University 2001 56 KB 33 documents Cross-Language Analysis
Movie Review Data Cornell University 2004-2005 219 MB 12K reviews Sentiment Analysis
Movielens University of Minnesota 1998-2009 74 MB 11M ratings Collaborative Filtering
MPC (Multi-Party Chat) Shaikh et al. 2010 2 MB 14 dialogues Conversation Analysis (written, human-human)
MSMARCO Conversational Search Microsoft 2019 1 GB 2M synthetic search sessions Next Query Prediction
Multi Domain Sentiment Dataset (Processed ACL) Johns Hopkins University 2007 29 MB - - Sentiment Analysis
Multi-Aspect Summarization Amazon Research 2019 946 MB 280K article-summary pairs Text Summarization
Multi-News Yale University 2019 676 MB 54K multiple articles-summary pairs Text Summarization, Multi-document
Multi-XScience Mila 2020 61 MB 40K article-summary pairs Text Summarization, Scientific Document Summarization
Multilingual Amazon Reviews P. Keung et al. 2020 640 MB 1M reviews Text Classification (Multilingual)
MultiWOZ 2.1 M. Eric et al. 2020 19 MB 10K dialogues Conversation Analysis (written, human-wizard)
NBC 2016 Russian Troll Tweets NBC 2018 34 MB 267K tweets Propaganda detection
Netflix Challenge (Partial) Netflix 2006 2 GB - - Collaborative Filtering
New York Times Corpus New York Times 2008 3 GB 2M articles Text Mining
Newsroom Cornell University 2018 5 GB 1M article-summary pairs Text Summarization
ODP239 C. Carpineto and G. Romano 2009 5 MB - - Subtopic Information Retrieval
OHSUMED Test Collection Oregon Health & Science University 1994 461 MB - - Text Clustering
OpenWebText Corpus Brown University 2019 40 GB 8M documents Language Modeling, Text Synthesis
OPUS (Europarl3_0b and EMEA0) Jörg Tiedemann 2009 9 GB 22 languages Machine Translation
OR-QuAC C. Qu et al. 2020 10 GB 6K dialogues Conversation Analysis (written, human-wizard), Question Answering
PRESTO Google 2022 397 MB 550K dialogues Conversation Analysis (written, human-system)
QuAC E. Choi et al. 2018 75 MB 14K dialogues Conversation Analysis (written, human-wizard), Question Answering
RadioTalk Laboratory for Social Machines, MIT Media Lab 2019 9 GB 3B words Language Analysis
Reason Identification and Classification Dataset UT Dallas 2014 4 MB - - Computational Argumentation
Reddit TIFU corpus Seoul National University 2019 640 MB 123K content-summary pairs Text Summarization
Request For Comments Collections (to 4501) RFC Editor 2008 55 MB 4K documents Data Mining
Reuters 21578 (22173) Reuters, David D. Lewis 1996 8 MB 22K articles Text Clustering
Reuters RCV1 Reuters, David D. Lewis 2000 1 GB 365 documents Text Clustering
Reuters RCV1 - CCAT split Reuters, David D. Lewis 2002 2 GB - - Machine Learning
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection National Research Council of Canada 2009 166 MB - - Cross-Language Categorization
Rovereto Twitter N-Gram Corpus University of Trento, Italy 2011 5 GB 75M tweets Social Network Analysis
ScisummNet Corpus Yale University 2019 15 MB 1K scientific paper-summary pairs (with citation networks) Text Summarization, Scientific Document Summarization
SILS Learner Corpus of English Waseda University 2007 16 MB - - Cross-Language Analysis
SMS Spam Collection v T. A. Almeida and J. M. G. Hidalgo 2011 210 KB 6K messages Spam Identification
Spoken Conversational Search Data Set J.R. Trippas et al. 2017 260 KB 101 dialogues Conversation Analysis (written, human-human)
Spotify Podcasts Dataset Clifton et al. 2020 2 TB 50K hours Conversation Analysis (spoken, human-human)
SumPubMed Corpus University of Utah 2021 608 MB 33K scientific paper-summary pairs Text Summarization, Scientific Document Summarization
TED-LIUM Release 3 Ubiqus and LIUM 2018 50 GB 452 hours Speech Recognition
The JRC-Acquis Multilingual Parallel Corpus (3) European Commission's Office for Official Publications (OPOCE) 2009 2 GB - - Cross-Language Research
TIPSTER Complete Advanced Research Projects Agency 1993 1 MB - - Information Retrieval
Topical Chat Dataset Amazon 2019 76 MB 11K dialogues Conversation Analysis (written, human-human)
TREC vol4 National Institute of Standards and Technology (NIST) 1996 436 MB 295K documents Data Mining
TREC vol5 National Institute of Standards and Technology (NIST) 1997 389 MB 260K documents Data Mining
TREC web National Institute of Standards and Technology (NIST) 1999-2004 90 GB - - Data Mining
TripAdvisor Data Set University of Illinois at Urbana-Champaign 2010 220 MB - - Opinion Mining
Tswana Learner English Corpus Center for Text Technology 2006 2 MB - - Cross-Language Analysis
Twitter tweets Yang and Leskovec 2011 26 GB 467M tweets Social Network Analysis
Twitter tweets (RecSys Challenge) Twitter 2020 76 GB 160M tweets Social Network Analysis
UKPConvArg1 TU Darmstadt 2016 21 MB 16K argument pairs Computational Argumentation
UKPConvArg2 TU Darmstadt 2016 23 MB 9K argument pairs Computational Argumentation
Uppsala Student English Uppsala University 2001 3 MB 2K documents Cross-Language Analysis
USPTO Patents from 2001 to 2010 U.S. Patent & Trademark Office 2010 10 TB - - Patent Analysis
VQuAnDa Kacupaj et al. 2020 2 MB 5K question-answer-SPARQL query triplets Answer Verbalization
WaCKy: deWaC Web-As-Corpus Kool Yinitiative 2009 26 GB 2B words Text Analysis (German)
WaCKy: frWaC Web-As-Corpus Kool Yinitiative 2009 5 GB 2B words Text Analysis (French)
WaCKy: itWaC Web-As-Corpus Kool Yinitiative 2009 31 GB 2B words Text Analysis (Italian)
WaCKy: sdeWaC Web-As-Corpus Kool Yinitiative 2009 20 GB 1B words Text Analysis (German)
WaCKy: ukWaC Web-As-Corpus Kool Yinitiative 2009 15 GB 2B words Text Analysis (English)
WaCKy: WaCkypedia_EN Web-As-Corpus Kool Yinitiative 2009 6 GB 1B words Text Analysis (English)
WCEP MDS Dataset: Wikipedia Current Events Portal Aylien Ltd., Dublin, Ireland 2020 2 GB 2M document clusters with one human-written summary per cluster Text Summarization, Multi-document Summarization
Web People Search Corpus (WePS-1) NLP Group (UNED), Proteus Project (NYU) 2007 295 MB 2K web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-2) NLP Group (UNED), Proteus Project (NYU) 2009 328 MB 3K web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-3) NLP Group (UNED), Proteus Project (NYU) 2010 571 MB 50K web pages Person Disambiguation, Text Clustering
WikiHow Summarization Corpus University of California 2018 2 GB 230K article-summary, paragraph-summary pairs Text Summarization
Wikipedia Full Dump Wikimedia Foundation 2011 5 TB - - Data Mining
Wikipedia History Snapshots Wikimedia Foundation 2006-2012 32 GB - - Data Mining
Wikipedia Participation Challenge Wikimedia Foundation 2011 976 MB - - User Behaviour Prediction
Wikipedia Revision Dump Wikimedia Foundation 2006 46 GB - - Data Mining
Wikipedia Revision Dump Wikimedia Foundation 2008 133 GB - - Data Mining
Wikipedia Snapshots Wikimedia Foundation 2006-2012 280 GB - - Data Mining
WikiSum Corpus Amazon 2021 115 MB 40K article-summary pairs Text Summarization
Wordsim353 L. Finkelstein et al. 2002 60 KB 353 word pairs Word Similarities
Wortschatz Leipzig Universität Leipzig 2006 8 GB 15 languages Text Analysis (Multilingual)
XL-Sum Corpus Bangladesh University of Engineering and Technology 2021 1 GB 1M article-summary pairs Text Summarization, Multilingual Text Summarization
XSum Corpus University of Edinburgh 2018 240 MB 214K article-summary pairs Text Summarization
Yahoo Learning To Rank Challenge 2010 Yahoo 2010 421 MB - - Document Ranking
Yahoo N-Grams Yahoo 2006 13 GB - - Text Analysis (English)