Training the C&C Parser

From ACL Wiki
Revision as of 05:27, 21 April 2015 by KEvang

The C&C Parser is an advanced statistical parser based on Combinatory Categorial Grammar (CCG). It is easy to use with the pre-trained models, but training your own models is harder. The software is distributed with many scripts that are meant to make training easy, but differences between systems and dependencies on various libraries make it daunting to get the training code to work. The following step-by-step instructions replicate, almost exactly, the figures reported in Clark and Curran (2007) [1] on a single 64-bit Ubuntu 12.04 machine, which should have multiple cores and at least about 40 GB of main memory. The steps on other recent Linux distributions should be very similar.
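
The hardware requirements above can be checked before starting. The following is a minimal sketch, assuming a Linux system that provides /proc/meminfo and the coreutils nproc command. The names kb_to_gb, enough_memory, preflight and MIN_MEM_GB are illustrative and not part of the C&C distribution.

```shell
# Hypothetical pre-flight check (not part of the C&C tools).
# The threshold follows the "about 40 GB" figure given in the text.
MIN_MEM_GB=40

# kb_to_gb KB: convert kB (the unit used in /proc/meminfo) to whole GB,
# rounding down.
kb_to_gb() {
    echo $(( $1 / 1024 / 1024 ))
}

# enough_memory KB: succeed if KB amounts to at least MIN_MEM_GB.
enough_memory() {
    [ "$(kb_to_gb "$1")" -ge "$MIN_MEM_GB" ]
}

# preflight: print architecture, core count and memory size, and warn if
# the memory is below the threshold.
preflight() {
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "architecture: $(uname -m)"
    echo "cores:        $(nproc)"
    echo "memory:       $(kb_to_gb "$mem_kb") GB"
    if ! enough_memory "$mem_kb"; then
        echo "warning: less than $MIN_MEM_GB GB of main memory" >&2
    fi
}
```

Calling preflight prints the three values; a 64-bit Intel/AMD machine reports x86_64 as its architecture. MemTotal is usually slightly below the installed RAM, so a warning on a machine with nominally 40 GB should be read as a hint only.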

Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by Kilian Evang based on instructions from Tim Dawborn; thanks are due to Tim and also to Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.

# Customize these variables:
export CANDC_PREFIX=$HOME
export CCGBANK=$HOME/data/CCGbank1.2
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem
export NUMNODES=32
export LIB=/usr/lib

# Some variables for use below:
export CANDC=$CANDC_PREFIX/candc
export SCRIPTS=$CANDC/src/scripts/ccg
export EXT=$CANDC/ext

# Package dependencies:
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion

# Check out the C&C tools.
# You need credentials for that, see
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion
cd $CANDC_PREFIX
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400

# Some patches to fix various problems with the scripts provided:

# Use a temp directory different from /tmp since that often doesn't have enough
# space (and make sure the directory exists):
mkdir -p $TMPDIR
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*

# Replace /bin/env by /usr/bin/env
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \
        $SCRIPTS/count_features

# Work around non-portable sed -f shebang
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \
        $SCRIPTS/create_data

# TODO patches to make the scripts work with the LDC version of CCGbank should
# go here.

# Make ext directory (-p: no error if it already exists)
mkdir -p $EXT

# Install Boost library (Ubuntu doesn't seem to have a version that is compiled
# against MPICH2).
# Boost's build script won't build the MPI library without this line in
# user-config.jam for some reason (note: this overwrites an existing
# ~/user-config.jam):
echo 'using mpi ;' > ~/user-config.jam
mkdir -p $EXT/install
cd $EXT/install
# Or get the tarball from Sourceforge:
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz
tar -xzf boost_1_53_0.tar.gz
cd boost_1_53_0
./bootstrap.sh --with-libraries=mpi --prefix=$EXT
./b2 install

# Install the old MR-MPI version that C&C depends on
cd $EXT/install
# If this link is dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2
tar jxf mrmpi-22Apr09.tbz2
cd mrmpi-22Apr09/src
make -f Makefile.unix clean
make -f Makefile.unix
cp *.h $EXT/include
cp libmrmpi.a $EXT/lib

# Build C&C
cd $CANDC
make -f Makefile.unix all train bin/generate

# Create data
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank
$SCRIPTS/create_data $CCGBANK $NUMNODES working/

# Train the POS tagger and the supertagger:
$SCRIPTS/train_taggers working/

# Evaluate the supertagger model to ensure its results are sane:
$SCRIPTS/cl07_table4 working/

# Create the model_hybrid directory and empty config file:
mkdir working/model_hybrid
touch working/model_hybrid/config

# Train a hybrid model:
export LD_LIBRARY_PATH=$EXT/lib:$LIB
$SCRIPTS/create_model_hybrid $(pwd) $NUMNODES working/
$SCRIPTS/train_model_hybrid $(pwd) $NUMNODES working/

# Evaluate the parser model:
$SCRIPTS/cl07_table7 working/
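
The data creation, training and evaluation steps above each depend on the previous one, so it can help to log every step and stop at the first failure. The following is a minimal sketch of such a wrapper. The function run_step and the variable LOGDIR are hypothetical and not part of the C&C distribution.

```shell
# Hypothetical helper (not part of the C&C tools).
# Directory for the log files; override by setting LOGDIR beforehand.
LOGDIR=${LOGDIR:-$PWD/logs}

# run_step NAME COMMAND [ARGS...]
# Runs COMMAND with ARGS, writes its stdout and stderr to $LOGDIR/NAME.log,
# reports start and end on stderr, and returns COMMAND's exit status.
run_step() {
    step_name=$1
    shift
    mkdir -p "$LOGDIR" || return 1
    echo "[$(date)] starting $step_name" >&2
    if "$@" >"$LOGDIR/$step_name.log" 2>&1; then
        step_status=0
        echo "[$(date)] finished $step_name" >&2
    else
        step_status=$?
        echo "[$(date)] $step_name failed (exit status $step_status);" \
             "see $LOGDIR/$step_name.log" >&2
    fi
    return "$step_status"
}
```

For example, run_step train_taggers $SCRIPTS/train_taggers working/ would write the output of that step to $LOGDIR/train_taggers.log. Chaining several run_step calls with && stops at the first step that fails.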

References

  1. Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In Computational Linguistics 33(4), http://aclweb.org/anthology-new/J/J07/J07-4004.pdf