Audio meets text: a loss-enhanced journey with manifold mixup and re-ranking

  • Regular Paper
  • Published:
Knowledge and Information Systems

Abstract

Audio–text retrieval is the task of aligning natural language and audio so that instances from different modalities can be compared with a similarity metric. Contrastive representation learning is a popular way to address this problem by finetuning a dual-encoder backbone; however, finetuning incurs a heavy computational burden and requires multiple graphics processing units. The main objective of this study is therefore to improve retrieval performance in audio–text retrieval while using less computation and fewer resources. To this end, a series of plug-and-play modules is introduced to elicit knowledge from pretrained audio and text encoders. Starting from a competitive baseline built with early and late fusion, the study improves performance further by adapting the deep metric Circle and Ranked List losses. It also employs Manifold Mixup as a strong regularizer and introduces a novel re-ranking post-processing step to refine the results. These components work in concert to deliver performance superior to other models without the need to finetune the encoders. In particular, an improvement of 1–3% is obtained on the AudioCaps dataset, establishing a new state of the art. The results on the Clotho dataset remain competitive with finetuning approaches that rely on heavy resources, and the model emerges as the best among frozen-encoder models across all metrics. Moreover, the proposed modules are modality agnostic and hold great potential for retrieval tasks beyond the audio–text domain. Overall, this study establishes a strong and competitive baseline for future approaches in audio–text retrieval.
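As a concrete illustration of the pipeline described above, the following is a minimal PyTorch sketch: small trainable projection heads on top of frozen encoder embeddings, trained with a Circle loss and regularized with Manifold Mixup. All names, dimensions, and hyperparameters here (ProjectionHead, m = 0.25, gamma = 64, the Beta(0.4, 0.4) mixing coefficient) are illustrative assumptions, not the authors' implementation; the actual code is in the repository linked below.

# Hedged sketch of the frozen-encoder setup: trainable heads over precomputed
# audio/text embeddings, Circle loss, and Manifold Mixup. Names and
# hyperparameters are assumptions for exposition, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Trainable adapter mapping frozen-encoder features to a shared space."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit norm -> cosine similarity

def circle_loss(sim: torch.Tensor, m: float = 0.25, gamma: float = 64.0):
    """Circle loss (Sun et al., 2020): diagonal = positives, rest = negatives."""
    n = sim.size(0)
    pos = torch.eye(n, dtype=torch.bool, device=sim.device)
    sp, sn = sim[pos], sim[~pos]
    ap = torch.relu(1.0 + m - sp.detach())        # adaptive positive weighting
    an = torch.relu(sn.detach() + m)              # adaptive negative weighting
    logit_p = -gamma * ap * (sp - (1.0 - m))      # optimum delta_p = 1 - m
    logit_n = gamma * an * (sn - m)               # optimum delta_n = m
    return F.softplus(torch.logsumexp(logit_n, 0) + torch.logsumexp(logit_p, 0))

def manifold_mixup(h: torch.Tensor, alpha: float = 0.4):
    """Interpolate hidden representations within a batch (Verma et al., 2019)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(h.size(0), device=h.device)
    return lam * h + (1.0 - lam) * h[perm], perm, lam

# Toy batch: random tensors stand in for frozen encoder outputs.
audio_emb, text_emb = torch.randn(32, 768), torch.randn(32, 1024)
audio_head, text_head = ProjectionHead(768), ProjectionHead(1024)
az_mix, perm, lam = manifold_mixup(audio_head(audio_emb))
sim = az_mix @ text_head(text_emb).t()
# Mixup criterion: interpolate the loss between original and permuted pairings.
loss = lam * circle_loss(sim) + (1.0 - lam) * circle_loss(sim[:, perm])
loss.backward()

The final two lines apply the standard mixup criterion: the loss is interpolated between the original and permuted pairings so that the targets stay consistent with the mixed inputs.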


Code availability

The code is available at https://github.com/Vedanshi-Shah/YAATRA.

Notes

  1. https://github.com/RetroCirce/HTS-Audio-Transformer/tree/main/model.

  2. Hugging Face model: https://huggingface.co/openai/whisper-large.

  3. https://github.com/microsoft/unilm/tree/master/beats.

  4. Hugging Face: https://huggingface.co/sentence-transformers/all-mpnet-base-v2.

  5. Hugging Face: https://huggingface.co/BAAI/bge-large-en-v1.5.

  6. https://github.com/Wangt-CN/MTFN-RR-PyTorch-Code.

  7. The AudioCaps dataset is obtained from https://github.com/XinhaoMei/ACT.

  8. For 2000, 4000, 8000, and 12,000 audio and text samples each, plain inference took 0.01, 0.02, 0.04, and 0.31 s; with re-ranking, the same settings required 0.94, 2.11, 4.26, and 6.99 s. A simplified sketch of such a re-ranking pass follows this list.
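
To make the cost noted above concrete, the following is a minimal rank-fusion sketch over a full audio–text similarity matrix. It is a generic illustration, not the MTFN-RR re-ranking procedure the paper adapts (see note 6); the function name rerank and the equal-weight fusion scheme are assumptions for exposition.

# Hedged sketch of bidirectional re-ranking over a similarity matrix.
# This is a generic rank-fusion heuristic, not the paper's exact method.
import numpy as np

def rerank(sim: np.ndarray, weight: float = 0.5) -> np.ndarray:
    """Fuse audio->text and text->audio ranks for every (audio, text) pair.

    sim[i, j] is the similarity between audio clip i and caption j.
    Returns a refined score matrix in which higher values are better.
    """
    # Rank of each caption per audio query (rows) and of each audio per caption.
    a2t_rank = (-sim).argsort(axis=1).argsort(axis=1)
    t2a_rank = (-sim).argsort(axis=0).argsort(axis=0)
    # Lower fused rank = better match; negate so higher scores win.
    return -(weight * a2t_rank + (1.0 - weight) * t2a_rank)

sim = np.random.rand(4, 4)        # placeholder similarity matrix
refined = rerank(sim)
print(refined.argmax(axis=1))     # re-ranked best caption for each audio clip

Both argsort passes touch the entire similarity matrix in each direction, which is why re-ranking costs noticeably more than a single argmax per query in the timings reported above.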


Funding

The authors did not receive any funding support for the experimentation or the writing of this manuscript.

Author information


Corresponding author

Correspondence to Yash Suryawanshi.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Ethics approval

This study does not violate, and does not involve, any moral or ethical concerns.

Consent for publication

All authors have read and approved the final manuscript and agreed to its publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Suryawanshi, Y., Shah, V., Randar, S. et al. Audio meets text: a loss-enhanced journey with manifold mixup and re-ranking. Knowl Inf Syst 67, 2195–2231 (2025). https://doi.org/10.1007/s10115-024-02283-4


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-024-02283-4

Keywords