Multimodal interfaces

Alex Waibel¹,², Minh Tue Vo¹, Paul Duchnowski² & Stefan Manke²

Abstract

In this paper, we present an overview of research in our laboratories on multimodal human-computer interfaces. The goal of such interfaces is to free human-computer interaction from the limitations and acceptance barriers imposed by rigid operating commands and by keyboards as the only or main I/O device. Instead, we aim to involve all available human communication modalities: speech, gesture and pointing, eye gaze, lip motion and facial expression, handwriting, face recognition, face tracking, and sound localization.

Author information

Authors and Affiliations

¹ School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890, U.S.A.
Alex Waibel & Minh Tue Vo
² Computer Science Department, ILKD, University of Karlsruhe, Am Fasanengarten 5, 76128 Karlsruhe, Germany
Alex Waibel, Paul Duchnowski & Stefan Manke

About this article

Cite this article

Waibel, A., Vo, M.T., Duchnowski, P. et al. Multimodal interfaces. Artif Intell Rev 10, 299–319 (1996). https://doi.org/10.1007/BF00127684

Issue Date: August 1996
DOI: https://doi.org/10.1007/BF00127684
