DOI: 10.5555/3324320.3324339

Article

ARASID: Artificial Reverberation-Adjusted Indoor Speaker Identification Dealing with Variable Distances

Published: 15 March 2019

Abstract

Indoor speaker identification systems have been researched for a long time and are widely used in acoustic monitoring systems for human interaction. Many works have focused on improving accuracy under realistic conditions, including noise and varying distances from the microphone. However, these works either require significant extra effort, such as measuring room types and dimensions or obtaining many samples per speaker, or require expensive hardware, such as microphone arrays, and complex deployment settings. In this paper, we introduce a complete speaker identification solution that uses an artificial reverberation generator with different parameters to adjust the original close-distance speech samples, so that each speaker has several distinct artificial voice samples. Because these artificial samples closely approximate different environments, samples recorded in different environments are not required. Two kinds of models, GMM-UBM and i-vector, are evaluated. The models are trained on all samples separately, and a test utterance is scored against all of them in parallel. A score fusion approach with two thresholds, a minimum value and a minimum difference, is applied to the scores to produce the final result. Several standard acoustic pre-processing routines, including a voice activity detection algorithm and an overlapped speech remover, are also included to make the system fully deployable. Finally, to assess the improvement from the reverberation adjustment, we evaluate our system on two speech databases from the literature, one with 251 speakers and one covering four kinds of emotions, and we perform an in-lab speaking experiment. The evaluation results show that our system achieves more than 90% accuracy in identifying speakers within 6 meters when the emotion is neutral, and a 10% improvement over no reverberation adjustment when speakers have non-neutral emotions.
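To make the two central mechanisms of the abstract concrete, here are two minimal Python sketches. Neither is the authors' implementation; all function names, threshold values, and the exponentially decaying noise impulse response are illustrative assumptions. The first sketch approximates the reverberation adjustment: a close-distance recording is convolved with synthetic room impulse responses at several decay settings, producing one artificial training copy per setting.

```python
# A minimal sketch of reverberation-adjusted data augmentation.
# The simple decaying-noise RIR below is an assumption standing in
# for the paper's artificial reverberation generator.
import numpy as np
from scipy.signal import fftconvolve

def synthetic_rir(rt60: float, sr: int = 16000, seed: int = 0) -> np.ndarray:
    """Exponentially decaying Gaussian noise as a crude room impulse
    response; the decay rate is set by the target RT60 in seconds."""
    n = int(rt60 * sr)
    t = np.arange(n) / sr
    envelope = 10.0 ** (-3.0 * t / rt60)  # decays by -60 dB after rt60 seconds
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n) * envelope

def reverb_adjusted_copies(speech: np.ndarray, rt60s, sr: int = 16000):
    """Return one artificial training copy per reverberation setting."""
    copies = []
    for rt60 in rt60s:
        wet = fftconvolve(speech, synthetic_rir(rt60, sr))[: len(speech)]
        copies.append(wet / (np.max(np.abs(wet)) + 1e-9))  # peak-normalize
    return copies
```

The second sketch shows one plausible reading of the score fusion: every speaker's reverberation-adjusted models are scored in parallel, the per-speaker scores are fused (here by taking the maximum, one possible choice the abstract does not specify), and the winning score must clear both a minimum-value threshold and a minimum-difference threshold or the decision is rejected.

```python
from typing import Dict, List, Optional

def identify_speaker(scores: Dict[str, List[float]],
                     min_score: float = 0.5,      # assumed threshold value
                     min_margin: float = 0.1) -> Optional[str]:  # assumed
    """Fuse per-model scores into one score per speaker, then apply the
    two thresholds; returns None when the decision is rejected."""
    fused = {spk: max(s) for spk, s in scores.items()}
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    best_spk, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if best < min_score:               # minimum-value threshold
        return None
    if best - runner_up < min_margin:  # minimum-difference threshold
        return None
    return best_spk

# Example: two speakers, three reverberation-adjusted models each.
print(identify_speaker({"alice": [0.42, 0.71, 0.65],
                        "bob": [0.38, 0.44, 0.51]}))  # -> alice
```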


Cited By

• (2021) Emotion Recognition Robust to Indoor Environmental Distortions and Non-targeted Emotions Using Out-of-distribution Detection. ACM Transactions on Computing for Healthcare, 3(2), 1–22. DOI: 10.1145/3492300. Online publication date: 20 December 2021.

      Information & Contributors

      Information

      Published In

      cover image ACM Other conferences
      EWSN '19: Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks
      February 2019
      436 pages
      ISBN:9780994988638

      Sponsors

      • EWSN: International Conference on Embedded Wireless Systems and Networks

      In-Cooperation

      Publisher

      Junction Publishing

      United States

      Publication History

      Published: 15 March 2019

      Check for updates

      Author Tags

      1. distance
      2. reverberation
      3. speaker identification

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate 81 of 195 submissions, 42%
