Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2309.08730 (eess)

[Submitted on 15 Sep 2023 (v1), last revised 2 Apr 2024 (this version, v3)]

Title: MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Authors: Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos

Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate MusiLingo's competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.
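
The alignment step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the dimensions, the random initialisation, and the helper names (`make_projection`, `project`, `build_llm_input`) are hypothetical, and plain Python lists stand in for the MERT features and LLM token embeddings. It only shows the general idea of one linear map carrying audio features into the LLM's embedding space, with both surrounding models left frozen.

```python
import random

# Toy sizes chosen for illustration only; the abstract does not state
# the real MERT feature size or LLM embedding size.
D_AUDIO = 4   # assumed width of one frozen-encoder audio frame
D_LLM = 6     # assumed width of one LLM token embedding


def make_projection(d_in, d_out, seed=0):
    """Create the weights of a single linear layer (the only trainable part here)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(frames, weight, bias):
    """Apply y = W x + b to every audio frame."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def build_llm_input(audio_frames, text_embeddings, weight, bias):
    """Prepend projected audio frames to the text embeddings as one sequence."""
    return project(audio_frames, weight, bias) + text_embeddings


# Stand-ins for frozen-model outputs: 3 audio frames and 2 prompt tokens.
audio_frames = [[0.1 * (i + j) for j in range(D_AUDIO)] for i in range(3)]
text_embeddings = [[0.0] * D_LLM for _ in range(2)]

weight, bias = make_projection(D_AUDIO, D_LLM)
llm_input = build_llm_input(audio_frames, text_embeddings, weight, bias)
```

In this sketch the combined sequence has one vector per audio frame followed by one per text token, all of the LLM's embedding width, so a frozen language model could consume it without changes to its own weights.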

Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Cite as:	arXiv:2309.08730 [eess.AS]
	(or arXiv:2309.08730v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2309.08730
Journal reference:	2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Submission history

From: Yinghao Ma
[v1] Fri, 15 Sep 2023 19:31:40 UTC (905 KB)
[v2] Thu, 12 Oct 2023 21:28:02 UTC (921 KB)
[v3] Tue, 2 Apr 2024 13:35:59 UTC (502 KB)
