Abstract
Attention-based convolutional neural network models are increasingly adopted for language recognition tasks. In this paper, we address language recognition with a self-attention mechanism that captures rich contextual dependencies. To do so, we propose a new multi-domain feature fusion network that adaptively integrates local features and their global dependencies. Specifically, we attach three attention modules to each stage of ResNet, which model semantic dependencies in the time, frequency, and channel domains, respectively. The time attention module aggregates the features of all time locations through a weighted sum of the features from the time feature map and the original features. Correspondingly, the frequency/channel attention module aggregates the features of all frequency/channel locations through a weighted sum of the features from the frequency/channel feature map and the original features. We then aggregate the outputs of the three attention modules in three ways (addition, average, and maximum) to further improve the feature representation. Experiments are conducted on the APSIPA 2017 Oriental Language Recognition (AP17-OLR) dataset and the AP20-OLR-noisy-task dataset, and on both of them our proposed method achieves state-of-the-art results.
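The three aggregation strategies named above (addition, average, and maximum) can be sketched as follows. This is a minimal illustration on flattened feature vectors, not the authors' implementation; the function name and signature are hypothetical.

```python
def fuse_attention_outputs(time_out, freq_out, chan_out, mode="average"):
    """Fuse the outputs of the time, frequency, and channel attention
    branches element-wise, mirroring the paper's three fusion modes."""
    if mode == "addition":
        return [t + f + c for t, f, c in zip(time_out, freq_out, chan_out)]
    if mode == "average":
        return [(t + f + c) / 3.0 for t, f, c in zip(time_out, freq_out, chan_out)]
    if mode == "maximum":
        return [max(t, f, c) for t, f, c in zip(time_out, freq_out, chan_out)]
    raise ValueError(f"unknown fusion mode: {mode}")

# Example: fusing three single-element attention outputs.
print(fuse_attention_outputs([1.0], [2.0], [3.0], "addition"))  # [6.0]
print(fuse_attention_outputs([1.0], [2.0], [3.0], "maximum"))   # [3.0]
```

In the full network these would be multi-dimensional feature maps of identical shape; the same element-wise reductions apply along the stacked branch axis.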
Abbreviations
- LR: Language recognition
- OLR: Oriental language recognition
- DET: Detection error trade-off
- IDR: Identification rate
- SDA: Single-domain attention
- SE-Net: Squeeze-excitation network
- MDAF: Multi-domain attention fusion
- MDAF-Net: Multi-domain attention fusion network
- SE: Squeeze-excitation
- FC: Fully connected
- ASP: Attentive statistics pooling
- AAM-Softmax: Additive angular margin softmax
Acknowledgements
Not applicable.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities (grant number 2021ZY87).
Contributions
The first author mainly performed the experiments and wrote the paper, and the other authors reviewed and edited the manuscript. All of the authors discussed the final results. All of the authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Ju, M., Xu, Y., Ke, D., et al. Multi-domain attention fusion network for language recognition. SN Comput Sci. 2023;4:39. https://doi.org/10.1007/s42979-022-01447-9