Computer Science > Sound

arXiv:2406.13275 (cs)

[Submitted on 19 Jun 2024 (v1), last revised 25 Jun 2024 (this version, v2)]

Title: Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Abstract: Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
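
The abstract states that both the audio encoder and the text decoder are optimized with LoRA. As a rough illustration of the general technique only (not the authors' configuration), the sketch below adds a trainable low-rank update B·A, scaled by alpha/rank, to a frozen weight matrix. The class name `LoRALinear`, the helper `matmul`, and the values of `rank` and `alpha` are hypothetical choices made for this example; the abstract does not specify them.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration; rank and alpha are assumed values.
    """

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weight, kept frozen
        # A (rank x d_in) starts small and random; B (d_out x rank) starts at zero,
        # so the adapted layer initially behaves exactly like the frozen layer.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.b, self.a)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to x (n x d_in); returns n x d_out."""
        w_t = [list(col) for col in zip(*self.effective_weight())]
        return matmul(x, w_t)

    def num_trainable(self):
        """Count only the parameters in A and B; W is frozen."""
        return sum(len(r) for r in self.a) + sum(len(r) for r in self.b)


layer = LoRALinear([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]], rank=1)
initial = layer.forward([[1.0, 2.0, 3.0]])
```

Only A and B would be updated during fine-tuning, which is why the trainable parameter count scales with the rank rather than with the full weight matrix.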

Comments:	Accepted by Interspeech 2024
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2406.13275 [cs.SD]
	(or arXiv:2406.13275v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2406.13275

Submission history

From: Gang Li
[v1] Wed, 19 Jun 2024 07:09:46 UTC (144 KB)
[v2] Tue, 25 Jun 2024 08:07:36 UTC (144 KB)
