Abstract
Audio–text retrieval is the task of aligning natural language and audio so that instances from different modalities can be compared with a similarity metric. Contrastive representation learning is a popular way to address this problem by finetuning a dual-encoder backbone. However, finetuning incurs a heavy computational burden and typically requires multiple graphics processing units. The main objective of this study is therefore to improve audio–text retrieval performance at a lower computational cost. To this end, a series of plug-and-play modules is introduced to elicit knowledge from pretrained audio and text encoders. Starting from a competitive baseline built with early and late fusion, the study improves performance further by adapting the deep metric Circle and Ranked List losses. Manifold Mixup serves as a strong regularizer, and a novel post-processing re-ranking step refines the results further. Together, these components outperform other models without the need to finetune the encoders. In particular, an improvement of 1–3% is obtained on the AudioCaps dataset, establishing a new state of the art. The results on the Clotho dataset remain competitive with finetuning approaches that rely on far heavier resources, and the model is the best among frozen-encoder models across all metrics. Moreover, the proposed modules are modality agnostic and hold great potential for retrieval tasks beyond the audio–text domain. Overall, this study establishes a strong and competitive baseline for future approaches to audio–text retrieval.
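As context for the frozen-encoder recipe summarized above, the following is a minimal PyTorch sketch of the overall setup: pretrained encoder embeddings stay frozen, lightweight projection heads map them into a shared space, a symmetric contrastive loss aligns the modalities, and Manifold Mixup interpolates representations as a regularizer. The head architecture, embedding dimensions, temperature, and the InfoNCE-style loss are illustrative assumptions, not the paper's exact configuration (which adapts Circle and Ranked List losses).

```python
import torch
import torch.nn.functional as F
from torch import nn

class ProjectionHead(nn.Module):
    """Maps frozen encoder embeddings into a shared retrieval space."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm outputs so dot products equal cosine similarity.
        return F.normalize(self.net(x), dim=-1)

def manifold_mixup(h: torch.Tensor, alpha: float = 0.4):
    """Interpolates representations within a batch (Verma et al., 2019)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(h.size(0))
    return lam * h + (1.0 - lam) * h[perm], perm, lam

def info_nce(audio: torch.Tensor, text: torch.Tensor, tau: float = 0.07):
    """Symmetric contrastive loss over the batch similarity matrix."""
    logits = audio @ text.t() / tau
    targets = torch.arange(audio.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Stand-ins for frozen encoder outputs (the real encoders stay frozen;
# only the lightweight projection heads receive gradients).
audio_emb, text_emb = torch.randn(32, 1024), torch.randn(32, 1024)
audio_head, text_head = ProjectionHead(1024), ProjectionHead(1024)
a, t = audio_head(audio_emb), text_head(text_emb)

# Manifold Mixup as a regularizer: mix the audio projections and mix the
# contrastive targets with the same coefficient.
a_mix, perm, lam = manifold_mixup(a)
logits = a_mix @ t.t() / 0.07
targets = torch.arange(a.size(0))
loss = (info_nce(a, t)
        + lam * F.cross_entropy(logits, targets)
        + (1 - lam) * F.cross_entropy(logits, targets[perm]))
loss.backward()  # gradients flow only into the projection heads
```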
Code availability
The code is available at https://github.com/Vedanshi-Shah/YAATRA.
Notes
Hugging Face model: https://huggingface.co/openai/whisper-large.
Hugging Face: https://huggingface.co/BAAI/bge-large-en-v1.5.
AudioCaps dataset is obtained from https://github.com/XinhaoMei/ACT.
For 2000, 4000, 8000, and 12,000 audio and text samples, plain inference took 0.01, 0.02, 0.04, and 0.31 s, respectively; in the same setting, re-ranking required 0.94, 2.11, 4.26, and 6.99 s (see the sketch below).
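The article's exact re-ranking procedure is not reproduced here; as an illustration of why a re-ranking pass costs noticeably more than plain similarity search, the PyTorch sketch below fuses each text query's forward similarities with reciprocal ranks computed in the reverse (audio-to-text) direction. The function name and parameters (rerank, k, weight) are assumptions for this sketch, not names from the paper.

```python
import torch
import torch.nn.functional as F

def rerank(sim: torch.Tensor, k: int = 10, weight: float = 0.5) -> torch.Tensor:
    """sim[i, j]: similarity of text query i to audio clip j.
    Re-orders each query's top-k candidates using the reverse direction."""
    # reverse_rank[j, i] = rank of query i among all queries for audio j.
    reverse_rank = sim.t().argsort(dim=1, descending=True).argsort(dim=1)
    out = []
    for q in range(sim.size(0)):
        cand = sim[q].topk(k).indices                  # forward top-k audio clips
        reciprocal = 1.0 / (1.0 + reverse_rank[cand, q].float())
        fused = (1 - weight) * sim[q, cand] + weight * reciprocal
        out.append(cand[fused.argsort(descending=True)])
    return torch.stack(out)

# Toy usage on random unit-norm embeddings; the extra reverse-direction
# bookkeeping is what makes re-ranking slower than plain retrieval.
a = F.normalize(torch.randn(2000, 512), dim=-1)
t = F.normalize(torch.randn(2000, 512), dim=-1)
ranked = rerank(t @ a.t())
print(ranked.shape)  # torch.Size([2000, 10])
```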
Funding
The authors received no funding for the experimentation or the writing of this manuscript.
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare.
Ethics approval
This study does not involve, and does not violate, any moral or ethical considerations.
Consent for publication
All authors have read and approved the final manuscript and agree to its publication.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Suryawanshi, Y., Shah, V., Randar, S. et al. Audio meets text: a loss-enhanced journey with manifold mixup and re-ranking. Knowl Inf Syst 67, 2195–2231 (2025). https://doi.org/10.1007/s10115-024-02283-4