Computer Science > Computation and Language

arXiv:2308.03281 (cs)

[Submitted on 7 Aug 2023]

Title: Towards General Text Embeddings with Multi-stage Contrastive Learning

Authors: Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang

Abstract: We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the amount of training data during both the unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and even surpasses text embedding models 10x larger on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms the previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.
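
The abstract names contrastive learning as the training objective but does not spell out the loss. As a rough illustration only, the sketch below shows a generic InfoNCE-style loss with in-batch negatives, a common formulation for embedding models. It is not taken from the paper: the function names, the use of cosine similarity, and the temperature value of 0.05 are all assumptions made for this example, and the authors' actual objective may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not the paper's code).

    documents[i] is treated as the positive for queries[i]; every other
    document in the batch serves as a negative. The temperature is an
    assumed value chosen for illustration.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the positive document.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings: each query points along one axis.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # positives close to their queries
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives far from their queries

print(info_nce_loss(queries, aligned) < info_nce_loss(queries, swapped))
```

In this toy batch the loss is lower when each query's paired document is the most similar one in the batch, which is the behavior a contrastive objective rewards.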

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2308.03281 [cs.CL]
	(or arXiv:2308.03281v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2308.03281

Submission history

From: Dingkun Long
[v1] Mon, 7 Aug 2023 03:52:59 UTC (334 KB)
