arXiv:2304.07987 (cs)

[Submitted on 17 Apr 2023 (v1), last revised 25 Apr 2023 (this version, v4)]

Title: Chinese Open Instruction Generalist: A Preliminary Release

Authors: Ge Zhang, Yemin Shi, Ruibo Liu, Ruibin Yuan, Yizhi Li, Siwei Dong, Yu Shu, Zhaoqun Li, Zekun Wang, Chenghua Lin, Wenhao Huang, Jie Fu

Abstract: Instruction tuning is widely recognized as a key technique for building generalist language models, and it has attracted the attention of researchers and the public since the release of InstructGPT (Ouyang et al., 2022) and ChatGPT. Despite impressive progress in English-oriented large language models (LLMs), it remains under-explored whether English-based foundation LLMs, given well-designed instruction tuning, can perform as well on multilingual tasks as on English tasks, and how to construct the corpora such tuning requires. To address this gap, we propose this project as an attempt to create a Chinese instruction dataset using various methods adapted to the intrinsic characteristics of four sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora. The resulting Chinese Open Instruction Generalist (COIG) corpora are available on Hugging Face and GitHub, and will be continuously updated.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2304.07987 [cs.CL]
	(or arXiv:2304.07987v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2304.07987

Submission history

From: Ruibin Yuan
[v1] Mon, 17 Apr 2023 04:45:06 UTC (2,238 KB)
[v2] Tue, 18 Apr 2023 04:46:57 UTC (2,238 KB)
[v3] Fri, 21 Apr 2023 03:16:13 UTC (2,238 KB)
[v4] Tue, 25 Apr 2023 01:50:19 UTC (2,238 KB)
