Tutorial 4: Multimodal Language Processing

Michael Johnston and Srinivas Bangalore

The ongoing convergence of the web with telephony, through technologies such as Voice over IP, high-speed mobile data networks, and handheld computers and smartphones, enables the creation of natural and highly effective multimodal interfaces for human-human communication and human-machine interaction with automated services. These interfaces allow user input and system output to be optimally distributed over multiple modes such as speech, pen, and graphical displays. Research on the computational processing of language has primarily focussed on linear sequences of speech or text where the primitive elements are phonemes, morphemes, or words. Multimodal language, by contrast, can be distributed over two or three spatial dimensions as well as the temporal dimension, and can involve additional primitive elements such as gestures, drawings, tables, and charts. This tutorial provides an overview of the problem of multimodal language processing and detailed examples showing how representations and techniques from natural language and dialog processing can be extended and applied to the parsing, integration, and understanding of multimodal inputs and to the planning, generation, and presentation of multimodal outputs.

This tutorial is intended for students, researchers, and practitioners in natural language and speech processing who want to see how many of the grammar and corpus-based techniques developed within the community can be applied to the creation of real-world multimodal interactive systems. It is introductory in nature, and no special knowledge or background is required. The tutorial will also provide an overview of emerging standards that support multimodal interaction and will finish with a presentation of how multimodal integration, dialog management, and generation all work together in a sample multimodal application.


  1. Introduction
    • Definition and motivation for multimodal user interfaces
    • Examples of multimodal user interfaces: Video demonstrations
    • Language processing architectures for multimodal interfaces
  2. Unification-based multimodal integration and parsing
    • Multimodal integration as unification
    • Unification-based multimodal grammars
    • Multidimensional parsing
  3. Finite-state methods for multimodal understanding
    • Representation of input streams
    • Multimodal grammars
    • Implementation using finite-state methods
    • Integration of multimodal grammars with recognition
  4. Robust multimodal input processing
    • Robustness in spoken and multimodal language processing
    • Edit machines
    • Multimodal understanding as classification
    • Learning edit machines using machine translation
  5. Multimodal dialog management
    • Representation of multimodal dialog context
    • Clarification in multimodal dialog
    • Mode-independent dialog management
  6. Multimodal output generation
    • Multimodal content planning
    • Media synchronization
    • Generation of non-verbal behaviors
  7. Standards for multimodal interfaces
    • Speech GUI Integration: X+V and SALT
    • EMMA: Extensible MultiModal Annotation
  8. Multimodal applications and challenges
    • Sample prototype multimodal application
    • Incrementality and adaptivity

MICHAEL JOHNSTON is a Senior Technical Specialist in the IP and Voice-enabled services research lab of AT&T Labs - Research. His research interests span natural language processing, spoken and multimodal interactive systems, and human-computer interaction. For the last ten years, his work has focussed on the extension of language and dialog processing technologies to support multimodal interaction. In 1999, Dr. Johnston was awarded an NSF CAREER award for research on multimodal language processing for natural interfaces. He is also active in the creation of standards supporting spoken and multimodal interface development and serves as editor-in-chief of the World Wide Web Consortium specification EMMA: Extensible MultiModal Annotation. Dr. Johnston is a member of the IEEE Speech and Language Technical Committee (2006-2008), was an area chair for ACL 2004, and has served as a program committee member and reviewer for numerous international conferences, journals, and workshops.

SRINIVAS BANGALORE is a Senior Technical Specialist in the IP and Voice-enabled services research lab of AT&T Labs - Research. His research areas include speech and language processing topics related to parsing, machine translation, multimodal integration, and finite-state methods. His dissertation was on a robust parsing approach called Supertagging that combines the strengths of statistical and linguistic models of language processing. During the past ten years, the topics he has worked on have included the tight coupling of speech recognition and language translation using finite-state speech translation approaches, a supertag-based surface realizer for natural language generation, and finite-state-based multimodal integration and understanding. Dr. Bangalore has been on the editorial board of Computational Linguistics Journal (2001-2003), was the workshop chair for ACL 2004, is a member of the IEEE Speech Technical Committee (2006-2008), and has served as a program committee member for a number of ACL and IEEE conferences and workshops.