Abstract
Multimodal interactive systems that enable the combination of natural modalities such as speech, touch, and gesture make it easier and more effective for users to interact with applications and services, whether on mobile devices, in smart homes, or in cars. However, building these systems remains a complex and highly specialized task, in part because of the need to integrate multiple disparate and distributed system components. The task is further complicated by the proprietary representations used for input and output by different types of modality processing components, such as speech recognizers, gesture recognizers, natural language understanding components, and dialog managers. The W3C EMMA standard addresses this challenge and simplifies multimodal application authoring by providing a common representation language for capturing the interpretation of user inputs and system outputs, along with associated metadata. In this chapter, we describe the EMMA markup language and demonstrate its capabilities through a series of illustrative examples.
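To ground the discussion before the detailed examples later in the chapter, the sketch below shows what a minimal EMMA 1.0 document for a single spoken input might look like. The utterance, confidence value, timestamps, and the flightQuery application payload are illustrative assumptions rather than examples from the chapter; the emma namespace and the annotation attributes (emma:medium, emma:mode, emma:confidence, emma:tokens, emma:start, emma:end) are those defined in the EMMA 1.0 recommendation.

<!-- Minimal sketch of an EMMA 1.0 result for one spoken input.
     Utterance, score, timestamps, and application payload are invented for illustration. -->
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="interp1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:function="dialog"
      emma:confidence="0.82"
      emma:tokens="flights from boston to denver"
      emma:start="1487412480000"
      emma:end="1487412482500">
    <flightQuery>
      <origin>Boston</origin>
      <destination>Denver</destination>
    </flightQuery>
  </emma:interpretation>
</emma:emma>

A consumer such as a dialog manager can read the application semantics from the body of the interpretation while using the emma: annotations for metadata such as the modality, timing, and confidence of the input.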
Notes
1. The W3C recommendation EMMA 1.0 addresses only inputs; proposals for EMMA 2.0 extend the standard to represent output processing as well.
2. The EMMA language does not require the confidence score to be a probability, and there is no expectation or requirement that confidence values are comparable across different producers of EMMA, other than that values closer to 1 indicate higher confidence and values closer to 0 indicate lower confidence (see the sketch following these notes).
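As a minimal sketch of how such confidence values are carried in practice (the recognition hypotheses and scores below are invented for illustration, not taken from the chapter), an N-best list from a speech recognizer can be represented as an emma:one-of container whose member interpretations each carry their own emma:confidence annotation:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- N-best recognition result; hypotheses and scores are illustrative assumptions. -->
  <emma:one-of id="nbest1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
        emma:tokens="flights to boston">
      <flightQuery><destination>Boston</destination></flightQuery>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.40"
        emma:tokens="flights to austin">
      <flightQuery><destination>Austin</destination></flightQuery>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

Here the two scores order the hypotheses relative to one another within a single result; as the note states, they need not be probabilities and need not be comparable to scores produced by a different recognizer.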
Acknowledgements
I would like to acknowledge the many contributors to the EMMA standard from around the world including Deborah Dahl, Kazuyuki Ashimura, Paolo Baggia, Roberto Pieraccini, Dan Burnett, Dave Raggett, Stephen Potter, Nagesh Kharidi, Raj Tumuluri, Jerry Carter, Wu Chou, Gerry McCobb, Tim Denney, Max Froumentin, Katrina Halonen, Jin Liu, Massimo Romanelli, T. V. Raman, and Yuan Shao.
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Johnston, M. (2017). Extensible Multimodal Annotation for Intelligent Interactive Systems. In: Dahl, D. (Ed.), Multimodal Interaction with W3C Standards. Springer, Cham. https://doi.org/10.1007/978-3-319-42816-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42814-7
Online ISBN: 978-3-319-42816-1