Combinatory Categorial Grammar

From ACL Wiki
Revision as of 02:25, 12 August 2008 by Ioan (talk | contribs) (Introduction)



Combinatory Categorial Grammar (CCG) is an efficiently parseable yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and the underlying semantic representation, including predicate-argument structure, quantification and information structure. CCG relies on combinatory logic, which has the same expressive power as the lambda calculus but builds its expressions differently. The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi. More recent prominent proponents of the approach include Pauline Jacobson and Jason Baldridge. One of the key publications on CCG is The Syntactic Process by Mark Steedman. Several efficient parsers are available for CCG.
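
As a rough illustration of how combinators build a derivation, the sketch below implements four standard CCG rules (forward and backward application, forward composition, and forward type-raising) over plain category objects. The class and function names (Atom, Functor, forward_apply and so on) and the toy lexicon are assumptions made for this example; they do not come from any of the parsers listed on this page, and modalities, features and semantics are left out.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as S or NP."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Functor:
    """A complex category: X/Y seeks Y to its right, X\\Y seeks Y to its left."""
    result: "Category"
    slash: str  # "/" or "\\"
    arg: "Category"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


Category = Union[Atom, Functor]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """Forward application:  X/Y  Y  =>  X"""
    if isinstance(left, Functor) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Category, right: Category) -> Optional[Category]:
    """Backward application:  Y  X\\Y  =>  X"""
    if isinstance(right, Functor) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def forward_compose(left: Category, right: Category) -> Optional[Category]:
    """Forward composition:  X/Y  Y/Z  =>  X/Z"""
    if (isinstance(left, Functor) and left.slash == "/"
            and isinstance(right, Functor) and right.slash == "/"
            and left.arg == right.result):
        return Functor(left.result, "/", right.arg)
    return None


def forward_type_raise(cat: Category, target: Category) -> Category:
    """Forward type-raising:  X  =>  T/(T\\X)"""
    return Functor(target, "/", Functor(target, "\\", cat))


# Toy lexicon (illustrative): "Marcel" := NP, "proved" := (S\NP)/NP,
# "completeness" := NP.
S, NP = Atom("S"), Atom("NP")
proved = Functor(Functor(S, "\\", NP), "/", NP)

# Standard derivation: the verb takes its object, then its subject.
vp = forward_apply(proved, NP)        # S\NP
sentence = backward_apply(NP, vp)     # S

# Alternative derivation: type-raise the subject and compose it with the
# verb, giving a constituent "Marcel proved" of category S/NP.
raised = forward_type_raise(NP, S)    # S/(S\NP)
subj_verb = forward_compose(raised, proved)
print(vp, sentence, subj_verb)
```

The second derivation shows why composition and type-raising matter: they give strings such as "Marcel proved" a category of their own, which is what CCG analyses of coordination and extraction rely on.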


OpenCCG: The OpenNLP library

OpenCCG, the OpenNLP CCG Library, is an open-source natural language processing library written in Java. It provides parsing and realization services based on Mark Steedman's Combinatory Categorial Grammar (CCG) formalism. The library makes use of multi-modal extensions to CCG developed by Jason Baldridge as part of the Grok system (the precursor to OpenCCG). Current development efforts, led by Michael White, are focused on making the realizer practical to use in dialogue systems. For the latest news about OpenCCG, see the SourceForge project page.

The C&C Parser and Supertagger

The C&C CCG parser and supertagger form part of the language processing tools developed by James Curran and Stephen Clark. The tools are written in C++ and have been designed to be efficient enough for large-scale NLP tasks.


StatCCG is a statistical CCG parser, trained on CCGbank, written by Julia Hockenmaier. Executables are available for download.


Boxer, developed by Johan Bos, generates formal semantic representations from CCG derivations. It takes CCG derivations as input and produces DRSs (Discourse Representation Structures, from Hans Kamp's Discourse Representation Theory) as output. It is distributed with the C&C tools. Boxer produces standard DRS syntax, uses a neo-Davidsonian analysis for events (with thematic roles from VerbNet), incorporates Van der Sandt's algorithm for presupposition, is 100% compatible with first-order logic (FOL), and normalises cardinal and date expressions. DRSs can be generated in various output formats: resolved or underspecified, in Prolog or XML, as flattened or recursive structures, with discourse referents represented by Prolog atoms or variables, and with or without pretty-printing. It is also possible to output FOL formulas translated from the DRSs.
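
To make the relation between DRSs and first-order logic more concrete, here is a minimal sketch of the textbook DRS-to-FOL translation: the discourse referents of a DRS become existential quantifiers over the conjunction of its conditions, and an implication between two DRSs universally quantifies over the referents of its antecedent. This is not Boxer's code or its output syntax. The data structures, the formula notation (exists, all, ~, &, ->) and the role name "agent" are all assumptions chosen for illustration, and presupposition and other Boxer features are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Pred:
    """A basic condition such as man(x) or agent(e,x)."""
    name: str
    args: Tuple[str, ...]


@dataclass
class Not:
    """A negated sub-DRS."""
    body: "DRS"


@dataclass
class Imp:
    """An implication between two DRSs (used for 'every', 'if ... then')."""
    antecedent: "DRS"
    consequent: "DRS"


Condition = Union[Pred, Not, Imp]


@dataclass
class DRS:
    referents: List[str]
    conditions: List[Condition]


def conjoin(parts: List[str]) -> str:
    """Join formulas with '&', adding brackets only when needed."""
    if not parts:
        return "true"
    if len(parts) == 1:
        return parts[0]
    return "(" + " & ".join(parts) + ")"


def cond_to_fol(cond: Condition) -> str:
    if isinstance(cond, Pred):
        return f"{cond.name}({','.join(cond.args)})"
    if isinstance(cond, Not):
        return f"~({drs_to_fol(cond.body)})"
    # Implication: antecedent referents are universally quantified.
    ante = conjoin([cond_to_fol(c) for c in cond.antecedent.conditions])
    formula = f"({ante} -> {drs_to_fol(cond.consequent)})"
    for ref in reversed(cond.antecedent.referents):
        formula = f"all {ref}.{formula}"
    return formula


def drs_to_fol(drs: DRS) -> str:
    """Referents become existential quantifiers over the conditions."""
    formula = conjoin([cond_to_fol(c) for c in drs.conditions])
    for ref in reversed(drs.referents):
        formula = f"exists {ref}.{formula}"
    return formula


# "A man walks", with a neo-Davidsonian event variable e.
a_man_walks = DRS(["x", "e"], [Pred("man", ("x",)),
                               Pred("walk", ("e",)),
                               Pred("agent", ("e", "x"))])

# "Every man walks" (event variable omitted to keep the example short).
every_man_walks = DRS([], [Imp(DRS(["x"], [Pred("man", ("x",))]),
                               DRS([], [Pred("walk", ("x",))]))])

print(drs_to_fol(a_man_walks))
print(drs_to_fol(every_man_walks))
```

The translation is purely structural, which is the sense in which a DRS language of this kind can be said to be compatible with first-order logic.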


CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations, created by Julia Hockenmaier and Mark Steedman. It is available from the Linguistic Data Consortium (LDC). A demo of the HTML version included in the LDC distribution is also available.

CCGbank pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure. The translation process and linguistic analyses are explained in the manual. CCGbank contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies and errors in the original annotation.

The LDC distribution also contains machine-readable versions of the data, with the syntactic derivations and the corresponding lists of word-word dependencies, as well as a file that can be searched with Doug Rohde's TGrep2 (version 1.15).

In all versions, the file structure corresponds exactly to that of the original Treebank.