Attention-based convolutional neural network models are increasingly adopted for language recognition tasks. In this paper, we use the self-attention mechanism to capture rich contextual dependencies for language recognition. To this end, we propose a new multi-domain feature fusion network that adaptively integrates local features with their global dependencies. Specifically, we attach three attention modules to each stage of ResNet, which model semantic dependencies in the time, frequency, and channel domains, respectively. The time attention module aggregates the features at all time locations through a weighted sum of the features from the time feature map and the original features. Correspondingly, the frequency and channel attention modules aggregate the features at all frequency or channel locations through a weighted sum of the features from the frequency or channel feature map and the original features. We then fuse the outputs of the three attention modules in three ways (addition, averaging, and maximum) to further improve the feature representation. Experiments on the APSIPA 2017 Oriental Language Recognition (AP17-OLR) dataset and the AP20-OLR-noisy-task dataset show that the proposed method achieves state-of-the-art results on both.
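As an illustration of the aggregation and fusion steps summarized above, the following is a minimal pure-Python sketch and not the network's actual implementation. It assumes plain dot-product similarity with no learned projections, and the names `axis_attention`, `fuse`, and the scale `gamma` are hypothetical. The same function could be applied along the time, frequency, or channel axis by passing the feature map as rows along that axis.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def axis_attention(x, gamma=1.0):
    """Self-attention over the rows of x (one feature vector per location).

    Each output row is gamma times the attention-weighted sum of all rows,
    plus the original row. Assumption: similarity is a plain dot product;
    gamma is a hypothetical scale that would be learned in a real model.
    """
    n, dim = len(x), len(x[0])
    scores = [[sum(a * b for a, b in zip(xi, xj)) for xj in x] for xi in x]
    weights = [softmax(r) for r in scores]
    out = []
    for i in range(n):
        agg = [sum(weights[i][j] * x[j][d] for j in range(n)) for d in range(dim)]
        out.append([gamma * agg[d] + x[i][d] for d in range(dim)])
    return out


def fuse(outputs, mode):
    """Element-wise fusion of equally shaped module outputs.

    mode is one of "add", "avg", or "max".
    """
    ops = {"add": sum, "avg": lambda v: sum(v) / len(v), "max": max}
    op = ops[mode]
    return [[op(vals) for vals in zip(*rows)] for rows in zip(*outputs)]
```

For example, `fuse([time_out, freq_out, chan_out], "avg")` would average three equally shaped module outputs element by element, where the three argument names stand for the outputs of the respective attention modules.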
- LR:
Language recognition
- OLR:
Oriental language recognition
- DET:
Detection error trade-off
- IDR:
Identification rate
- SDA:
Single domain attention
- SE-Net:
Squeeze-excitation network
- MDAF:
Multi-domain attention fusion
- MDAF-Net:
Multi-domain attention fusion network
- SE:
Squeeze-excitation
- FC:
Fully connected
- ASP:
Attentive statistics pooling
- AAM-Softmax:
Additive angular margin softmax
