Abstract
Attention-based convolutional neural network models are increasingly adopted for language recognition tasks. In this paper, we address language recognition with a self-attention mechanism that captures rich contextual dependencies. To do so, we propose a new multi-domain feature fusion network that adaptively integrates local features and their global dependencies. Specifically, we attach three attention modules to each stage of ResNet, which model semantic dependencies in the time, frequency, and channel domains, respectively. The time attention module aggregates the features of all time locations through a weighted sum of the features from the time feature map and the original features. Correspondingly, the frequency/channel attention module aggregates the features of all frequency/channel locations through a weighted sum of the features from the frequency/channel feature map and the original features. We then aggregate the outputs of the three attention modules in three ways (addition, average, and maximum) to further improve the feature representation. Experiments are conducted on the APSIPA 2017 Oriental Language Recognition (AP17-OLR) dataset and the AP20-OLR-noisy-task dataset, and on both of them our proposed method achieves state-of-the-art results.
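The three aggregation strategies named above (addition, average, and maximum) can be sketched as follows. This is a minimal illustration on flattened feature vectors, not the authors' implementation; the function name and signature are hypothetical.

```python
def fuse_attention_outputs(time_out, freq_out, chan_out, mode="average"):
    """Fuse the outputs of the time, frequency, and channel attention
    branches element-wise, mirroring the paper's three fusion modes."""
    if mode == "addition":
        return [t + f + c for t, f, c in zip(time_out, freq_out, chan_out)]
    if mode == "average":
        return [(t + f + c) / 3.0 for t, f, c in zip(time_out, freq_out, chan_out)]
    if mode == "maximum":
        return [max(t, f, c) for t, f, c in zip(time_out, freq_out, chan_out)]
    raise ValueError(f"unknown fusion mode: {mode}")

# Example: fusing three single-element attention outputs.
print(fuse_attention_outputs([1.0], [2.0], [3.0], "addition"))  # [6.0]
print(fuse_attention_outputs([1.0], [2.0], [3.0], "maximum"))   # [3.0]
```

In the full network these would be multi-dimensional feature maps of identical shape; the same element-wise reductions apply along the stacked branch axis.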
Abbreviations
- LR: Language recognition
- OLR: Oriental language recognition
- DET: Detection error trade-off
- IDR: Identification rate
- SDA: Single-domain attention
- SE-Net: Squeeze-excitation network
- MDAF: Multi-domain attention fusion
- MDAF-Net: Multi-domain attention fusion network
- SE: Squeeze-excitation
- FC: Fully connected
- ASP: Attentive statistics pooling
- AAM-Softmax: Additive angular margin softmax
Acknowledgements
Not applicable.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities (grant number 2021ZY87).
Contributions
The first author mainly performed the experiments and wrote the paper, and the other authors reviewed and edited the manuscript. All of the authors discussed the final results. All of the authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Ju, M., Xu, Y., Ke, D., et al. Multi-domain attention fusion network for language recognition. SN Comput Sci. 2023;4:39. https://doi.org/10.1007/s42979-022-01447-9