Published April 20, 2024 | Version v1
Dataset Open

SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Black-Box Machine-Generated Text Detection

  • 1. Mohamed bin Zayed University of Artificial Intelligence
  • 2. Institute of Information Science and Technology, Italy
  • 3. TU Darmstadt
  • 4. University of Cambridge
  • 5. New York University Abu Dhabi

Description

Large language models (LLMs) are becoming mainstream and easily accessible, ushering in an explosion of machine-generated content across various channels, such as news, social media, question-answering forums, educational settings, and even academic contexts. Recent LLMs, such as ChatGPT and GPT-4, generate remarkably fluent responses to a wide variety of user queries. The articulate nature of such generated texts makes LLMs attractive for replacing human labor in many scenarios. However, this has also raised concerns about their potential misuse, such as spreading misinformation and causing disruptions in the education system. Since humans perform only slightly better than chance when classifying machine-generated vs. human-written text, there is a need to develop automatic systems that identify machine-generated text, with the goal of mitigating its potential misuse.

We offer three subtasks over two paradigms of text generation: (1) full text, where a text is either entirely written by a human or entirely generated by a machine; and (2) mixed text, where a machine-generated text is refined by a human or a human-written text is paraphrased by a machine.

Files

SemEval2024-Task8-code.zip

Files (420.7 MB)

Checksum                                Size
md5:7ea2a43f0b410e1bcdcdf25299934d83    1.2 MB
md5:a126d3369b55931ac43da585aededef6    419.5 MB
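
The MD5 checksums in the file listing above can be used to confirm that a download arrived intact. The following is a minimal Python sketch of that check, not part of the dataset's own tooling. The helper names `md5_of_file` and `matches_listing` are hypothetical, and because the listing does not say which checksum belongs to which archive, the sketch only tests whether a file matches either of them.

```python
import hashlib

# Checksums copied from the file listing above. The listing does not name
# the file behind each checksum, so they are keyed by the listed size only.
LISTED_MD5 = {
    "1.2 MB": "7ea2a43f0b410e1bcdcdf25299934d83",
    "419.5 MB": "a126d3369b55931ac43da585aededef6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 equals any checksum in the listing."""
    return md5_of_file(path) in LISTED_MD5.values()
```

For example, `matches_listing("SemEval2024-Task8-code.zip")` would check a local copy of that archive, assuming it was saved under that name in the current directory. Reading in chunks keeps memory use flat even for the larger archive.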

Additional details