Hello, Is It Me You're Looking For?: Differentiating Between Human and Electronic Speakers for Voice Interface Security

Authors:

Patrick Traynor

WiSec '18: Proceedings of the 11th ACM Conference on Security & Privacy in Wireless and Mobile Networks

Pages 123 - 133

https://doi.org/10.1145/3212480.3212505

Published: 18 June 2018

Abstract

Voice interfaces are increasingly integrated into a variety of Internet of Things (IoT) devices. Such systems can dramatically simplify interactions between users and devices with limited displays. Unfortunately, voice interfaces also create new opportunities for exploitation. Specifically, any sound-emitting device within range of the system implementing the voice interface (e.g., a smart television or an Internet-connected appliance) can potentially cause these systems to perform operations against the desires of their owners (e.g., unlock doors or make unauthorized purchases). We address this problem by developing a technique to recognize fundamental differences in audio created by humans and electronic speakers. We identify sub-bass over-excitation, or the presence of significant low-frequency signals that are outside of the range of human voices but inherent to the design of modern speakers, as a strong differentiator between these two sources. After identifying this phenomenon, we demonstrate its use in preventing adversarial requests, replayed audio, and hidden commands with a 100%/1.72% TPR/FPR in quiet environments. In so doing, we demonstrate that commands injected via nearby audio devices can be effectively removed by voice interfaces.
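
The idea of sub-bass over-excitation can be illustrated with a rough sketch that measures how much of a recording's spectral energy falls in a low-frequency band. This is not the authors' detector: the function names are hypothetical, and the 20 to 80 Hz band and the 0.1 threshold are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np


def sub_bass_energy_ratio(samples, sample_rate, low_hz=20.0, high_hz=80.0):
    """Return the fraction of spectral energy between low_hz and high_hz.

    The band edges are assumed placeholders, not values from the paper.
    """
    samples = np.asarray(samples, dtype=np.float64)
    # Power spectrum of the real-valued signal.
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sample_rate)
    total = power.sum()
    if total == 0.0:
        return 0.0  # silent input carries no energy in any band
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return float(power[band].sum() / total)


def looks_like_electronic_speaker(samples, sample_rate, threshold=0.1):
    """Flag audio whose sub-bass share exceeds an assumed threshold."""
    return sub_bass_energy_ratio(samples, sample_rate) > threshold
```

In practice a threshold like this would have to be calibrated on recordings of real voices and real speakers in the target environment; the abstract reports its results for quiet environments only.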

Index Terms

1. Security and privacy
  1. Security services
    1. Access control
    2. Authentication
      1. Biometrics

Published In

WiSec '18: Proceedings of the 11th ACM Conference on Security & Privacy in Wireless and Mobile Networks

June 2018

317 pages

ISBN:9781450357319

DOI:10.1145/3212480

General Chair: Panos Papadimitratos (KTH, Sweden)

Program Chairs: Kevin Butler (University of Florida, USA), Christina Pöpper (New York University Abu Dhabi, UAE)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

In-Cooperation

SIGMOBILE: ACM Special Interest Group on Mobility of Systems, Users, Data and Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

WiSec '18

Sponsor: SIGSAC

WiSec '18: 11th ACM Conference on Security & Privacy in Wireless and Mobile Networks

June 18 - 20, 2018

Stockholm, Sweden

Acceptance Rates

Overall Acceptance Rate 98 of 338 submissions, 29%

Article Metrics

Total Citations: 42
Total Downloads: 895
Downloads (Last 12 months): 193
Downloads (Last 6 weeks): 22

Reflects downloads up to 30 Sep 2024

Cited By

Chhaglani B, Zakaria C, Peltier R, Gummeson J, Shenoy P (2024). AeroSense: Sensing Aerosol Emissions from Indoor Human Activities. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8:2 (1-30). Online publication date: 15-May-2024
https://dl.acm.org/doi/10.1145/3659593
Wu C, Zeng Q (2024). Turning Noises to Fingerprint-Free “Credentials”: Secure and Usable Drone Authentication. IEEE Transactions on Mobile Computing 23:10 (10161-10174). Online publication date: Oct-2024
https://doi.org/10.1109/TMC.2024.3373503
Ba Z, Gong B, Wang Y, Liu Y, Cheng P, Lin F, Lu L, Ren K (2024). Indelible “Footprints” of Inaudible Command Injection. IEEE Transactions on Information Forensics and Security 19 (8485-8499). Online publication date: 2024
https://doi.org/10.1109/TIFS.2024.3459728
Yang Q, Cui K, Zheng Y (2024). Room-scale Voice Liveness Detection for Smart Devices. IEEE Transactions on Dependable and Secure Computing (1-14). Online publication date: 2024
https://doi.org/10.1109/TDSC.2024.3367269
Li X, Ze J, Yan C, Cheng Y, Ji X, Xu W (2024). Enrollment-Stage Backdoor Attacks on Speaker Recognition Systems via Adversarial Ultrasound. IEEE Internet of Things Journal 11:8 (13108-13124). Online publication date: 15-Apr-2024
https://doi.org/10.1109/JIOT.2023.3328253
Gurowiec I, Nissim N (2024). Speech emotion recognition systems and their security aspects. Artificial Intelligence Review 57:6. Online publication date: 21-May-2024
https://doi.org/10.1007/s10462-024-10760-z
Shang J (2024). Security and Privacy of Augmented Reality Systems. Network Security Empowered by Artificial Intelligence (305-330). Online publication date: 24-Feb-2024
https://doi.org/10.1007/978-3-031-53510-9_11
Georgiou N, Ramnauth R, Adeniran E, Lee M, Selin L, Scassellati B (2023). Is Someone There or Is That the TV? Detecting Social Presence Using Sound. ACM Transactions on Human-Robot Interaction 12:4 (1-33). Online publication date: 13-Dec-2023
https://dl.acm.org/doi/10.1145/3611658
He Q, Fang S (2023). Phantom-CSI Attacks against Wireless Liveness Detection. Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses (440-454). Online publication date: 16-Oct-2023
https://dl.acm.org/doi/10.1145/3607199.3607245
Walker P, Zhang T, Shi C, Saxena N, Chen Y, Boureanu I, Schneider S, Reaves B, Tippenhauer N (2023). BarrierBypass: Out-of-Sight Clean Voice Command Injection Attacks through Physical Barriers. Proceedings of the 16th ACM Conference on Security and Privacy in Wireless and Mobile Networks (203-214). Online publication date: 29-May-2023
https://dl.acm.org/doi/10.1145/3558482.3581772
