Abstract
Multimodal interactive systems that enable the combination of natural modalities such as speech, touch, and gesture make it easier and more effective for users to interact with applications and services, whether on mobile devices, in smart homes, or in cars. However, building these systems remains a complex and highly specialized task, in part because of the need to integrate multiple disparate and distributed system components. The task is further complicated by the proprietary representations used for input and output by different types of modality processing components, such as speech recognizers, gesture recognizers, natural language understanding components, and dialog managers. The W3C EMMA standard addresses this challenge and simplifies multimodal application authoring by providing a common representation language for capturing the interpretation of user inputs and system outputs, along with associated metadata. In this chapter, we describe the EMMA markup language and demonstrate its capabilities through a series of illustrative examples.
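To ground the discussion before the detailed examples later in the chapter, the sketch below shows what a minimal EMMA 1.0 document for a single spoken input might look like. The utterance, confidence value, timestamps, and the flightQuery application payload are illustrative assumptions rather than examples from the chapter; the emma namespace and the annotation attributes (emma:medium, emma:mode, emma:confidence, emma:tokens, emma:start, emma:end) are those defined in the EMMA 1.0 recommendation.

<!-- Minimal sketch of an EMMA 1.0 result for one spoken input.
     Utterance, score, timestamps, and application payload are invented for illustration. -->
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="interp1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:function="dialog"
      emma:confidence="0.82"
      emma:tokens="flights from boston to denver"
      emma:start="1487412480000"
      emma:end="1487412482500">
    <flightQuery>
      <origin>Boston</origin>
      <destination>Denver</destination>
    </flightQuery>
  </emma:interpretation>
</emma:emma>

A consumer such as a dialog manager can read the application semantics from the body of the interpretation while using the emma: annotations for metadata such as the modality, timing, and confidence of the input.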
Notes
1. The W3C recommendation EMMA 1.0 addresses only inputs; proposals for EMMA 2.0 extend the standard to represent output processing as well.
2. The EMMA language does not require the confidence score to be a probability, and there is no expectation or requirement that confidence values are comparable across different producers of EMMA, other than that values closer to 1 indicate higher confidence and values closer to 0 indicate lower confidence (see the sketch following these notes).
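As a minimal sketch of how such confidence values are carried in practice (the recognition hypotheses and scores below are invented for illustration, not taken from the chapter), an N-best list from a speech recognizer can be represented as an emma:one-of container whose member interpretations each carry their own emma:confidence annotation:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- N-best recognition result; hypotheses and scores are illustrative assumptions. -->
  <emma:one-of id="nbest1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
        emma:tokens="flights to boston">
      <flightQuery><destination>Boston</destination></flightQuery>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.40"
        emma:tokens="flights to austin">
      <flightQuery><destination>Austin</destination></flightQuery>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

Here the two scores order the hypotheses relative to one another within a single result; as the note states, they need not be probabilities and need not be comparable to scores produced by a different recognizer.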
Acknowledgements
I would like to acknowledge the many contributors to the EMMA standard from around the world including Deborah Dahl, Kazuyuki Ashimura, Paolo Baggia, Roberto Pieraccini, Dan Burnett, Dave Raggett, Stephen Potter, Nagesh Kharidi, Raj Tumuluri, Jerry Carter, Wu Chou, Gerry McCobb, Tim Denney, Max Froumentin, Katrina Halonen, Jin Liu, Massimo Romanelli, T. V. Raman, and Yuan Shao.
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Johnston, M. (2017). Extensible Multimodal Annotation for Intelligent Interactive Systems. In: Dahl, D. (Ed.), Multimodal Interaction with W3C Standards. Springer, Cham. https://doi.org/10.1007/978-3-319-42816-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42814-7
Online ISBN: 978-3-319-42816-1