
ProxiTalk: Activate Speech Input by Bringing Smartphone to the Mouth

Published: 09 September 2019

Abstract

Speech input, such as voice assistants and voice messaging, is an attractive interaction option for mobile users today. Despite its popularity, however, smartphone speech input has a usage limitation: users must press a button or say a wake word to activate it, which is inconvenient. To address this, we match the motion of bringing the phone to the mouth with the user's intention to use voice input. In this paper, we present ProxiTalk, an interaction technique that allows users to activate smartphone speech input simply by moving the phone close to their mouths. We study how users perform ProxiTalk and systematically investigate the recognition abilities of various data sources (e.g., using the front camera to detect facial features, or using two microphones to estimate the distance between phone and mouth). Results show that the smartphone's built-in sensors are sufficient to detect ProxiTalk use and classify its gestures. An evaluation study shows that users can quickly learn ProxiTalk and are willing to use it. In conclusion, our work provides empirical support that ProxiTalk is a practical and promising option for activating smartphone speech input, one that can coexist with current trigger mechanisms.
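
To make the triggering idea concrete, below is a minimal Android sketch of the bring-to-mouth activation path. It is not the paper's implementation (ProxiTalk fuses IMU data, front-camera facial features, and dual-microphone distance estimates with learned classifiers); the sketch only wires the smartphone's built-in proximity and linear-acceleration sensors to a speech-input callback. The class name, the 3 m/s² lift threshold, and the 800 ms window are illustrative assumptions.

```kotlin
import android.content.Context
import android.hardware.Sensor
import android.hardware.SensorEvent
import android.hardware.SensorEventListener
import android.hardware.SensorManager
import kotlin.math.sqrt

// Hypothetical helper, not from the paper: fires onActivate when a brisk
// lift is followed shortly by a proximity-sensor "near" reading.
class RaiseToTalkDetector(
    context: Context,
    private val onActivate: () -> Unit
) : SensorEventListener {

    private val sensorManager =
        context.getSystemService(Context.SENSOR_SERVICE) as SensorManager
    private var lastLiftTimeMs = 0L

    fun start() {
        sensorManager.getDefaultSensor(Sensor.TYPE_LINEAR_ACCELERATION)?.let {
            sensorManager.registerListener(this, it, SensorManager.SENSOR_DELAY_GAME)
        }
        sensorManager.getDefaultSensor(Sensor.TYPE_PROXIMITY)?.let {
            sensorManager.registerListener(this, it, SensorManager.SENSOR_DELAY_NORMAL)
        }
    }

    fun stop() = sensorManager.unregisterListener(this)

    override fun onSensorChanged(event: SensorEvent) {
        when (event.sensor.type) {
            Sensor.TYPE_LINEAR_ACCELERATION -> {
                // Record when the phone was last lifted briskly
                // (illustrative 3 m/s^2 threshold on the magnitude).
                val (x, y, z) = event.values
                if (sqrt(x * x + y * y + z * z) > 3f) {
                    lastLiftTimeMs = System.currentTimeMillis()
                }
            }
            Sensor.TYPE_PROXIMITY -> {
                // "Near" shortly after a lift: treat it as bringing the
                // phone to the mouth and activate speech input.
                val near = event.values[0] < event.sensor.maximumRange
                if (near && System.currentTimeMillis() - lastLiftTimeMs < 800) {
                    onActivate()
                }
            }
        }
    }

    override fun onAccuracyChanged(sensor: Sensor?, accuracy: Int) = Unit
}
```

A caller could, for example, pass an onActivate that launches Android's stock recognizer via an Intent with RecognizerIntent.ACTION_RECOGNIZE_SPEECH; the paper's actual pipeline additionally uses the front camera and two microphones to reject false positives.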

Supplementary Material

yang (yang.zip)
Supplemental movie, appendix, image, and software files for ProxiTalk: Activate Speech Input by Bringing Smartphone to the Mouth





Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 3, Issue 3
September 2019
1415 pages
EISSN: 2474-9567
DOI: 10.1145/3361560
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 September 2019
Published in IMWUT Volume 3, Issue 3


Author Tags

  1. activity recognition
  2. inertial sensors
  3. mobile interaction
  4. smartphone
  5. voice input

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Key Research and Development Plan
  • Natural Science Foundation of China


Article Metrics

  • Downloads (last 12 months): 41
  • Downloads (last 6 weeks): 3
Reflects downloads up to 14 Nov 2024


Cited By

  • (2024) EmoWear: Exploring Emotional Teasers for Voice Message Interaction on Smartwatches. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-16. https://doi.org/10.1145/3613904.3642101. Online publication date: 11-May-2024.
  • (2024) A Systematic Review of Human Activity Recognition Based on Mobile Devices: Overview, Progress and Trends. IEEE Communications Surveys & Tutorials, 26(2), 890-929. https://doi.org/10.1109/COMST.2024.3357591. Online publication date: Oct-2025.
  • (2024) HCI Research and Innovation in China: A 10-Year Perspective. International Journal of Human–Computer Interaction, 40(8), 1799-1831. https://doi.org/10.1080/10447318.2024.2323858. Online publication date: 22-Mar-2024.
  • (2023) Brave New GES World: A Systematic Literature Review of Gestures and Referents in Gesture Elicitation Studies. ACM Computing Surveys, 56(5), 1-55. https://doi.org/10.1145/3636458. Online publication date: 7-Dec-2023.
  • (2023) Phone Sleight of Hand: Finger-Based Dexterous Gestures for Physical Interaction with Mobile Phones. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-19. https://doi.org/10.1145/3544548.3581121. Online publication date: 19-Apr-2023.
  • (2023) Enabling Voice-Accompanying Hand-to-Face Gesture Recognition with Cross-Device Sensing. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-17. https://doi.org/10.1145/3544548.3581008. Online publication date: 19-Apr-2023.
  • (2023) Selecting Real-World Objects via User-Perspective Phone Occlusion. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-13. https://doi.org/10.1145/3544548.3580696. Online publication date: 19-Apr-2023.
  • (2023) Content-Aware Adaptive Device–Cloud Collaborative Inference for Object Detection. IEEE Internet of Things Journal, 10(21), 19087-19101. https://doi.org/10.1109/JIOT.2023.3279579. Online publication date: 1-Nov-2023.
  • (2022) Demonstrating Finger-Based Dexterous Phone Gestures. Adjunct Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1-3. https://doi.org/10.1145/3526114.3558645. Online publication date: 29-Oct-2022.
  • (2022) Designing Gestures for Digital Musical Instruments: Gesture Elicitation Study with Deaf and Hard of Hearing People. Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility, 1-8. https://doi.org/10.1145/3517428.3544828. Online publication date: 23-Oct-2022.
