Convolutional Recurrent Neural Networks For Small-Footprint Keyword Spotting
Sercan Ö. Arık1,*, Markus Kliegl1,*, Rewon Child1, Joel Hestness1, Andrew Gibiansky1, Chris Fougner1, Ryan Prenger1, Adam Coates1

1 Baidu Silicon Valley Artificial Intelligence Lab, 1195 Bordeaux Dr., Sunnyvale, CA 94089, USA
* Equal contribution

sercanarik@baidu.com, klieglmarkus@baidu.com
It is desired to limit the model size given the resource constraints for inference latency, memory, and power consumption. Following [7], we choose the size limit as 250k parameters (more than 6 times smaller than the architecture with CTC loss in [11]). For the rest of the paper, the default architecture is the set of parameters highlighted in bold, which also corresponds to a near-optimal point in the model size vs. performance trade-off.

We compare the performance with a CNN architecture based on [7]. Given the discrepancy in input dimensionality and training data, we reoptimize the model hyperparameters for the best performance while upper-bounding the number of parameters at 250k for a fair comparison. For the same development set with 5 dB SNR, the best CNN architecture achieves 4.31% FRR at 1 FA/hour and 5.73% FRR at 0.5 FA/hour. Both metrics are ~51% higher than the FRR values of the chosen CRNN model with 229k parameters. Interestingly, the performance gap is smaller for higher SNR values. We elaborate on this in Section 3.4.

Recall that the model is bidirectional and runs on overlapping 1.5-second windows at a 100 ms stride. However, thanks to the small model size and the large time stride of 8 in the initial convolution layer, we are able to run inference comfortably faster than real time. The inference computational complexity of the chosen CRNN-based KWS model with 229k parameters is roughly ~30M floating point operations (FLOPs) when implemented on processors of modern consumer devices (without special function units to implement nonlinear operations). Even when implemented on modern smartphones without any approximations or special function units, our KWS model achieves an inference time much shorter than the human reaction time to auditory stimuli, which is ~280 ms [18].

3.3. Impact of the amount of training data

Figure 2: FRR at 0.5 FA/hour vs. number of unique training keywords for the test set with 5 dB SNR.
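The FRR-at-fixed-FA-rate operating points quoted above can be computed directly from raw detector scores. A minimal sketch follows; the synthetic score distributions, the one-window-per-second negative stream, and the threshold sweep are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def frr_at_fa_rate(pos_scores, neg_scores, neg_hours, target_fa_per_hour):
    """Find the smallest threshold whose false-alarm rate on the negative
    stream is <= target_fa_per_hour, then report the false-reject rate
    (FRR) of the positive samples at that threshold."""
    # Sweep candidate thresholds over the observed negative scores
    # in ascending order; FA rate decreases monotonically with threshold.
    for t in np.sort(np.unique(neg_scores)):
        fa_per_hour = np.sum(neg_scores >= t) / neg_hours
        if fa_per_hour <= target_fa_per_hour:
            frr = np.mean(pos_scores < t)
            return t, frr
    return None, 1.0  # no threshold meets the FA budget

# Toy example with synthetic scores (illustrative only).
rng = np.random.default_rng(0)
pos = rng.normal(0.8, 0.1, 1000)   # scores on keyword windows
neg = rng.normal(0.2, 0.1, 36000)  # 1 negative window/second over 10 hours
thr, frr = frr_at_fa_rate(pos, neg, neg_hours=10.0, target_fa_per_hour=0.5)
```

A lower FRR at the same FA budget corresponds to a better operating point on the detection-error trade-off curve.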
Given the representation capacity limit imposed by the architecture size, increasing the amount of positive samples in the training data has a limited effect on performance. Fig. 2 shows the FRR at 0.5 FA/hour (for the test set with 5 dB SNR) vs. the number of unique "TalkType" samples used in training. Performance saturates faster than in applications with a similar type of data but with large-scale models, e.g., [14].
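A data-scaling sweep like the one in Fig. 2 only requires subsampling the unique positive samples while keeping all negatives fixed. A minimal sketch, where the function name, the `(utterance_id, label)` pair format, and the toy data are our own illustrative assumptions:

```python
import random

def subsample_positives(samples, n_unique_pos, seed=0):
    """Keep at most n_unique_pos unique positive samples (label 1)
    and all negative samples (label 0), for a data-scaling experiment."""
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    rng = random.Random(seed)  # fixed seed so each sweep point is reproducible
    kept = rng.sample(positives, min(n_unique_pos, len(positives)))
    return kept + negatives

# Toy dataset: 100 unique positives, 400 negatives.
data = [(f"pos_{i}", 1) for i in range(100)] + [(f"neg_{i}", 0) for i in range(400)]
subset = subsample_positives(data, n_unique_pos=25)
```

Training one model per subset size and plotting FRR against `n_unique_pos` then traces out a saturation curve of the kind shown in Fig. 2.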
Besides increasing the amount of positive samples, we observe a performance improvement from increasing the diversity of relevant negative samples, obtained by hard mining. We mine negative samples by running the pre-converged model on a very large public video dataset (not used in the training, development, or test sets). Then, training is continued with the mined negative samples until convergence. As shown in Fig. 2, hard negative mining yields a decrease in FRR on the test set.

Figure 4: FRR at 1 FA/hour vs. additional distance for far-field test sets with varying SNR values. Solid: baseline performance; dashed: with far-field augmented training.

3.4. Noise robustness
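The hard-mining step described above amounts to keeping the keyword-free clips that the pre-converged model scores most confidently. A minimal sketch; the function name, the stand-in scoring function, and the threshold value are illustrative assumptions, not the paper's pipeline:

```python
def mine_hard_negatives(clips, score_fn, threshold):
    """Keep clips from a keyword-free stream that the pre-converged
    model scores above a confidence threshold: these near-misses are
    the 'hard' negatives added back into the training set."""
    return [c for c in clips if score_fn(c) >= threshold]

# Toy stand-in for the model: each clip's score is just a stored float here.
clips = [("clip_a", 0.91), ("clip_b", 0.12), ("clip_c", 0.55), ("clip_d", 0.78)]
hard = mine_hard_negatives(clips, score_fn=lambda c: c[1], threshold=0.7)
# Training would then continue on the original data plus `hard`
# until convergence, as described above.
```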