Aravind Joshi, Rashmi Prasad and Bonnie Webber
Advances in NLP are facilitated through richly annotated corpora. This
holds for advances in both theory and technology. In discourse, richly
annotated corpora can support theoretical understanding of discourse
phenomena, as well as applications such as question answering,
summarization, machine translation and generation.
In discourse, parts of a text are related to one another via what are
called "cohesive devices". These include the well-known devices of
pronominal coreference, zero-anaphor and clitic coreference, bridging
anaphors, and other forms of anaphoric reference. Another cohesive
device that has attracted substantial interest in the last decade goes
under the label of "discourse relations".
In this tutorial, we will focus on topics related to the description
and annotation of discourse relations. The tutorial is divided into
two parts. In the first part, we will address several descriptive
issues that are critical to annotating discourse relations, such as:
- What is it that discourse relations relate?
- What types of discourse relations are there?
- What triggers a discourse relation?
and we will show how different discourse annotation projects such as
the RST TreeBank, Discourse GraphBank, LDM, SDRT, and the PDTB answer
these questions, highlighting their similarities and differences and
how the answers shape the resulting annotation. One notable
distinction to be addressed is whether or not discourse relations and
their annotations are anchored on lexical items in the text. We will
compare the different approaches with regard to this issue and discuss
the advantages and disadvantages of lexicalized and non-lexicalized
approaches.
In the second part of the tutorial, we will discuss the lexicalized
approach towards annotating discourse relations in the Penn Discourse
TreeBank Project (PDTB). The PDTB annotates both (1) discourse
relations that are anchored by an explicit connective in the text and
(2) discourse relations that hold between adjacent sentences whose
abstract object interpretations are not related by an explicit
connective. (The latter we call, for short, "implicit connectives".)
We will describe the methodologies we use for annotating explicit and
implicit connectives, along with their arguments. Other features
associated with connectives and arguments will also be discussed,
such as sense annotation on connectives (allowing for disambiguation
of polysemous connectives), and attribution annotation on both
connectives and their arguments (allowing a proposition or belief to
be attributed to the individual who "owns" it). The tutorial
will be useful not only for those who want to use the PDTB corpus but
also for others who are using or intend to use related annotated
corpora, as many of the issues to be discussed apply to these related
efforts as well.
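As a rough illustration of what one annotated relation records, the
following minimal Python sketch models the elements just described: a
connective (explicit or implicit), its arguments, an optional sense,
and an optional attribution. The class name, the field names, the
sense labels, and the example sentences are all illustrative
assumptions; they are not the PDTB file format or its official label
set.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Relation:
    """One discourse relation; every field name here is a hypothetical choice."""
    connective: str   # connective word(s); for an implicit relation, assumed
                      # here to be a connective supplied by the annotator
    explicit: bool    # True if the connective actually appears in the text
    arg1: str         # text span of the first argument
    arg2: str         # text span of the second argument
    sense: Optional[str] = None        # label disambiguating the connective
    attribution: Optional[str] = None  # who "owns" the relation or argument


def explicit_relations(relations: List[Relation]) -> List[Relation]:
    """Return only the relations anchored by a connective present in the text."""
    return [r for r in relations if r.explicit]


# Made-up example sentences, not taken from the corpus.
examples = [
    Relation("but", True, "The firm cut costs", "profits still fell",
             sense="contrast", attribution="writer"),
    Relation("because", False, "The plant closed", "Demand had collapsed",
             sense="cause"),
]

for r in explicit_relations(examples):
    print(r.connective, "|", r.arg1, "|", r.arg2)
```

Keeping the explicit/implicit distinction as a field on a single record
type, rather than as two separate types, is only one possible design;
it makes it easy to treat both kinds of relation uniformly when
counting or filtering.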
Participants in the tutorial will each be provided with the complete
annotated corpus and a detailed manual of the annotation
guidelines. The corpus consists of explicit and implicit discourse
connectives annotated over the 1 million words of the Wall Street
Journal (WSJ) text corpus. The WSJ texts are also the basis for the
syntactic annotation of the Penn TreeBank [6] and the semantic
annotation of the PropBank [4]. In this regard, we will discuss the
utility of a corpus with multiple layers of annotation. After
describing the annotation methodology and guidelines of PDTB, we will
give a demonstration of the PDTB annotation on some sample texts,
during which participants will get direct experience with the
annotation. Finally, we will discuss how the corpus can be used for
experiments as well as some natural language processing applications.
TUTORIAL OUTLINE
- Part 1
  - Introduction to discourse annotation
    - RST (Rhetorical Structure Theory) TreeBank [2]
    - Discourse GraphBank [14]
    - Linguistic Discourse Model (LDM) [9]
    - SDRT-based annotation [1]
    - Penn Discourse TreeBank (PDTB) [7,10,13]
  - Arguments to discourse relations
    - Syntactic properties
    - Semantic properties
    - Locality
  - Types of discourse relations
    - Semantic domains and features [5]
    - Pragmatic domains and features
  - Triggers for discourse relations
    - Non-lexicalized approaches
    - Lexicalized approaches
    - Comparison of the two approaches
  - Attribution of discourse relations and their arguments
- Part 2
  - Introduction to Penn Discourse TreeBank
    - Scale of the corpus
    - Underlying framework [12]
  - Description of corpus and material distributed to participants
  - Description of annotation methodology
    - Annotation of explicit connectives and their arguments
    - Annotation of implicit connectives and their arguments
    - Sense annotation
    - Attribution annotation
  - Experiments with PDTB
    - Previous experiments [3,8,10,11]
    - Proposed experiments
  - Relevance of PDTB to NLP applications
  - Conclusions
Aravind K. Joshi is the Henry Salvatori Professor of Computer and
Cognitive Science. He is a member of (1) the Department of Computer
and Information Science, (2) Department of Linguistics, and (3)
Institute for Research in Cognitive Science (its former Co-Director),
all at the University of Pennsylvania. He is known for his work on
formal and computational modeling of various aspects of syntax,
semantics, and discourse, as well as cognitive (processing) aspects of
syntax and discourse. With respect to this tutorial, his relevant
research pertains to various aspects of local coherence of discourse,
centering theory, and original work on discourse lexicalized TAG
(jointly with Bonnie Webber).
Rashmi Prasad is a Research Associate at the Institute for
Research in Cognitive Science, University of Pennsylvania. Her primary
research interest lies in formal and computational modeling of
discourse-level interpretive and generative processes, including
interfaces with syntax and semantics. She is known for her work on
pronominal reference (PhD thesis, 2003) and discourse parsing (with
TAG), both at the University of Pennsylvania, and on
rhetorically-motivated sentence plan generation and dialogue-act
tagging, at AT&T Labs, Research.
Bonnie Webber is a Professor and Deputy Head of the School of
Informatics at the University of Edinburgh. She is best known for her
research on Question Answering (starting with BBN's LUNAR system in
the early 70's and including work with Aravind Joshi and students on
Cooperative Question Answering) and discourse phenomena (starting with
her 1978 PhD thesis on discourse anaphora). She has also been involved
in research on animation from instructions, medical decision support
systems and biomedical text processing. She served as President of ACL
in 1980, Tutorial Chair of the 1992 ACL Conference, and General Chair
of the 2001 ACL/EACL Conference.
BIBLIOGRAPHY
[1] Jason Baldridge and Alex Lascarides. 2005. Annotating Discourse
    Structures for Robust Semantic Interpretation. In Proceedings of
    the Sixth International Workshop on Computational Semantics
    (IWCS-6), Tilburg.
[2] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003.
    Building a Discourse-Tagged Corpus in the Framework of Rhetorical
    Structure Theory. In Current Directions in Discourse and Dialogue,
    Jan van Kuppevelt and Ronnie Smith, eds., Kluwer Academic
    Publishers.
[3] Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Rashmi Prasad, Aravind
    Joshi, and Bonnie Webber. 2005. Attribution and the
    (Non)-Alignment of Syntactic and Discourse Arguments of
    Connectives. In Proceedings of the ACL Workshop on Frontiers in
    Corpus Annotation II: Pie in the Sky. Ann Arbor, Michigan.
[4] Paul Kingsbury and Martha Palmer. 2002. From Treebank to PropBank.
    In Proceedings of the 3rd International Conference on Language
    Resources and Evaluation (LREC-2002), Las Palmas, Spain.
[5] Alistair Knott. 1996. A Data-Driven Methodology for Motivating a
    Set of Coherence Relations. PhD thesis, Department of Artificial
    Intelligence, University of Edinburgh.
[6] Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz.
    1993. Building a Large Annotated Corpus of English: The Penn
    Treebank. In Computational Linguistics, 19.
[7] Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber.
    2004. Annotating Discourse Connectives and their Arguments. In
    Proceedings of the HLT/NAACL Workshop on Frontiers in Corpus
    Annotation. Boston, MA.
[8] Eleni Miltsakaki, Nikhil Dinesh, Rashmi Prasad, Aravind Joshi, and
    Bonnie Webber. 2005. Experiments on Sense Annotations and Sense
    Disambiguation of Discourse Connectives. In Proceedings of the
    Fourth Workshop on Treebanks and Linguistic Theories (TLT2005),
    Barcelona, Spain.
[9] Livia Polanyi, Chris Culy, Martin van den Berg, Gian Lorenzo
    Thione, and David Ahn. 2004. A Rule Based Approach to Discourse
    Parsing. In Proceedings of SIGDIAL'04. Boston, MA.
[10] Rashmi Prasad, Eleni Miltsakaki, Aravind Joshi, and Bonnie
    Webber. 2004. Annotation and Data Mining of the Penn Discourse
    TreeBank. In Proceedings of the ACL Workshop on Discourse
    Annotation. Barcelona, Spain.
[11] Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, Alan Lee, Eleni
    Miltsakaki, and Bonnie Webber. 2005. The Penn Discourse TreeBank
    as a Resource for Natural Language Generation. In Proceedings of
    the Corpus Linguistics Workshop on Using Corpora for Natural
    Language Generation, Birmingham, U.K.
[12] Bonnie Webber, Aravind Joshi, Matthew Stone, and Alistair Knott.
    2003. Anaphora and Discourse Structure. In Computational
    Linguistics, 29(4).
[13] Bonnie Webber, Aravind Joshi, Eleni Miltsakaki, Rashmi Prasad,
    Nikhil Dinesh, Alan Lee, and Kate Forbes. 2005. A Short
    Introduction to the Penn Discourse TreeBank. In Copenhagen Working
    Papers in Language and Speech Processing.
[14] Florian Wolf and Edward Gibson. 2005. Representing Discourse
    Coherence: A Corpus-Based Analysis. In Computational Linguistics,
    31(2).