2010Q3 Reports: ACL Anthology
ACL ANTHOLOGY Report June 2010, Min-Yen Kan
(please see the items for discussion at the end of the report)
In Progress -- Do not take as factual yet.
The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. Conference proceedings are published in the Anthology around the same time as the conference (subject to the general/program chairs' discretion). CL articles are added within a few days of their publication on the MIT Press website (now that CL is open access).
The Anthology now contains over BUG papers (up from 15,900 articles twelve months ago). All of the papers up to 2008 are also now indexed by the ACM Portal and should have Digital Object Identifiers (DOIs) assigned to them per the ACL Anthology - ACM agreement.
CHANGES OVER LAST 12 MONTHS: As promised, we have been busy reaching out to sister CL/NLP-related societies and have also been ingesting and hosting their materials. We have incorporated the ROCLING, PACLIC and ALTA forums into the Anthology, now listed under Other Events. We expect to finalize the incorporation of RANLP soon.
We have also been busy ensuring that DBLP and the ACM Portal cover the Anthology materials. We understand that the ACM Digital Library has finished ingesting and assigning DOIs to over 100 venues (mostly workshops) that had been missing from its records. DBLP is known to ingest information from the ACM Digital Library Portal, so its coverage should be completed soon.
MAILING LIST: The membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown, and it now consists of XX members (up from 176 at the last report). This is an announcement-only list, where we notify members of newly released materials listed online.
ONGOING WORK: While we incorporate more materials for CL, we are working with START (courtesy of connections from Steven Bird) so that it directly supports the ACL Anthology XML format. This should make future events that use the aclpub package easier to incorporate directly into the Anthology.
Another large part of our current work is on providing a uniform level of service and metadata for past work. We are tackling a number of issues:
- Correct XML representation of each article: author names (with diacritics for European names) split into first, "von" and last name portions
- BibTeX representations for all articles
- One PDF file per article. This is especially problematic for the J79 series, where each PDF file largely holds an entire issue.
- Text for all PDF files. Some articles (e.g., EACL 2003) only exist in image form in the Anthology, rendering indexing (and hence subsequent citation) of these articles problematic.
- PDF metadata fixing for all articles. Crucially, Google Scholar uses this information but it is not always correctly generated.
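As an illustration of the first two items in the list above, the following is a minimal sketch of splitting an author name into first, "von" and last portions and formatting it for a BibTeX author field. It assumes names arrive in "First von Last" order and treats lowercase-initial tokens as "von" particles, which follows the usual BibTeX convention but will not handle every name. The function names `split_name` and `to_bibtex_author` are hypothetical and are not part of the Anthology's tooling.

```python
def split_name(full_name):
    """Split a "First von Last" name into its three portions.

    Heuristic (an assumption, modelled on the BibTeX convention):
    lowercase-initial tokens before the final token form the "von" part.
    """
    tokens = full_name.split()
    if not tokens:
        raise ValueError("empty name")
    # The final token always belongs to the last name.
    body, last = tokens[:-1], [tokens[-1]]
    # Positions of lowercase-initial tokens (candidate "von" particles).
    lower = [i for i, tok in enumerate(body) if tok[0].islower()]
    if not lower:
        return {"first": " ".join(body), "von": "", "last": " ".join(last)}
    start, end = lower[0], lower[-1]
    return {
        "first": " ".join(body[:start]),
        "von": " ".join(body[start:end + 1]),
        # Capitalised tokens after the last particle join the last name.
        "last": " ".join(body[end + 1:] + last),
    }


def to_bibtex_author(parts):
    """Format the portions as "von Last, First" for a BibTeX author field."""
    family = " ".join(p for p in (parts["von"], parts["last"]) if p)
    return f"{family}, {parts['first']}" if parts["first"] else family


if __name__ == "__main__":
    for name in ["Min-Yen Kan", "Ludwig van Beethoven"]:
        print(to_bibtex_author(split_name(name)))
```

Because Python strings are Unicode, names with diacritics pass through unchanged; names that do not follow the "First von Last" pattern would still need manual correction.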
With these issues fixed, we can proceed to wikify the Anthology, allowing more flexibility in adding and correcting metadata through direct suggestions from members or outsiders. Currently, Ali Hakim is working on some initial test phases to do this.
RELATED WORK: The ACL Anthology Reference Corpus, a separate project sanctioned earlier by the ACL, was released in Dec 2009 (via LDC and via the corpus website). This project incorporated OCR-dumped text for all pages in the reference corpus (a subset of all articles up to 2006) and PNG images for each page in the corpus.
- One aspect where this comes up is in listing venues on the front page. The vast majority of users may never use the front page (reaching the Anthology's details from search engines instead), but currently we list ACL events at the top and other venues in a separate space. While fair and consistent, this marginalizes ACL events that are mainly sponsored by SIGs (notably EMNLP, SemEval, CoNLL), as they are relegated to the pages for SIGs. Do we want to change this policy and, if so, what policy should be implemented?