Abstract
In this paper, we present an overview of research in our laboratories on Multimodal Human-Computer Interfaces. The goal of such interfaces is to free human-computer interaction from the limitations and acceptance barriers imposed by rigid operating commands and by keyboards as the only or main I/O device. Instead, we aim to involve all available human communication modalities. These modalities include Speech, Gesture and Pointing, Eye-Gaze, Lip Motion and Facial Expression, Handwriting, Face Recognition, Face Tracking, and Sound Localization.
Cite this article
Waibel, A., Vo, M.T., Duchnowski, P. et al. Multimodal interfaces. Artif Intell Rev 10, 299–319 (1996). https://doi.org/10.1007/BF00127684
DOI: https://doi.org/10.1007/BF00127684