2009Q3 Reports: ACL Anthology
ACL ANTHOLOGY Report July 2009, Min-Yen Kan
The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. It includes the Computational Linguistics journal, and proceedings of many conferences and workshops including: ACL, EACL, NAACL, ANLP, TINLAP, COLING, HLT, MUC, and Tipster. Conference proceedings are published in the anthology around the same time as the conference. CL articles are published in the anthology roughly one year in arrears (but individual subscribers can access recent issues electronically via the MIT Press website).
The anthology now contains over 15,900 papers (up from 13,600 papers twelve months ago), along with full-text search (provided by Google's Custom Search API). Most of the papers are also indexed by Citeseer and Google Scholar, helping the citation counts of ACL authors. The ACM Digital Library is creating rich metadata and doing full citation linking for all anthology materials.
ADDITIONS OVER LAST 12 MONTHS: As promised, we have finished the ingestion of IJCNLP 2005 and 2008 and have incorporated the listing of LREC 2008 papers. We have also worked with ACL conference organizers to list the conference and associated workshop papers on the dates of the conference. Similarly, CL journal issues are usually listed and anthologized within 48 hours of release from the MIT Press website.
MAILING LIST: Membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 176 members (up from 94 at the last report). This is an announcement-only list, on which we notify members when new materials are listed online.
FUTURE MATERIALS: We are looking to incorporate additional materials from other CL-related conferences. ROCling's conference materials have been uploaded for our processing and will be incorporated pending edits to provide English translations of author names and titles. We are in the planning stage for incorporating the upcoming RANLP 2009 conference, as well as past LRECs when they are made available.
ONGOING WORK: While we incorporate more materials for CL, we are working with START (through connections courtesy of Steven Bird) to support the ACL Anthology XML format directly, so that future events using the aclpub package are easier to incorporate into the Anthology.
Another large part of our current work is providing a uniform level of service and metadata for past work. We are tackling a number of issues:
- Correct XML representation of each article: author names (with diacritics for European names), split into first, "von" and last name portions
- BibTeX representations for all articles
- One PDF file per article. This is especially problematic for the J79 series, in which each PDF file largely holds one whole issue.
- Text for all PDF files. Some articles (e.g., EACL 2003) only exist in image form in the Anthology, rendering indexing (and hence subsequent citation) of these articles problematic.
- PDF metadata fixing for all articles. Crucially, Google Scholar uses this information but it is not always correctly generated.
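As a rough illustration of the name-splitting and BibTeX items above, the sketch below splits an author name into first, "von" and last portions and builds a BibTeX entry from them. It is not the Anthology's actual tooling: the function names, the particle list, and the sample key and title are all assumptions made for illustration, and real names would need a more careful, curated treatment.

```python
# Minimal sketch; all names here are hypothetical, not Anthology code.
# Assumed (incomplete) list of lowercase name particles.
VON_PARTICLES = {"von", "van", "de", "der", "den", "del", "di", "da", "la", "le"}

def split_name(full_name):
    """Split 'First von Last' into (first, von, last) portions."""
    tokens = full_name.split()
    if not tokens:
        return ("", "", "")
    # Look for the first particle, never treating the final token as one.
    start = next((i for i, t in enumerate(tokens[:-1]) if t in VON_PARTICLES), None)
    if start is None:
        return (" ".join(tokens[:-1]), "", tokens[-1])
    # Extend over a run of consecutive particles, e.g. "von der".
    end = start
    while end < len(tokens) - 1 and tokens[end] in VON_PARTICLES:
        end += 1
    return (" ".join(tokens[:start]), " ".join(tokens[start:end]), " ".join(tokens[end:]))

def bibtex_author(full_name):
    """Format one author in BibTeX's 'von Last, First' style."""
    first, von, last = split_name(full_name)
    family = f"{von} {last}".strip()
    return f"{family}, {first}" if first else family

def bibtex_entry(key, title, authors, booktitle, year):
    """Build an @inproceedings entry as a string."""
    fields = [
        ("title", title),
        ("author", " and ".join(bibtex_author(a) for a in authors)),
        ("booktitle", booktitle),
        ("year", str(year)),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@inproceedings{{{key},\n{body}\n}}"

if __name__ == "__main__":
    # Placeholder key and title, for illustration only.
    print(bibtex_entry("example-key", "An Example Paper",
                       ["Ludwig van Beethoven", "Min-Yen Kan"],
                       "Proceedings of an Example Conference", 2009))
```

Since Python strings are Unicode, diacritics pass through unchanged; whether they should be escaped for BibTeX is a separate decision that this sketch does not make.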
With these issues fixed, we can proceed to wikify the Anthology, giving members and outsiders more flexibility to directly suggest additions and corrections to the metadata. Ali Hakim is currently working on some initial test phases for this.
RELATED WORK: Min is also working on creating an LDC release of the ACL Anthology Reference Corpus, a separate project sanctioned earlier by the ACL. This project will incorporate OCR-dumped text for all pages in the reference corpus (a subset of all articles up to 2006) and PNG images for each page in the corpus. This work should be ready by the end of the year.
FOR DISCUSSION: