2013Q3 Reports: ACL Anthology

From Admin Wiki
Revision as of 01:30, 31 July 2013 by Knmnyn

[ Link to 2013 Q1 report ] [ Link to 2012 Q3 report ] [ Link to 2011 Q3 report ] [ Link to 2010 Q3 report ] [ Link to 2009 Q3 report ]

The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all.

This year, we have started indexing TACL articles, and we hope to make the videos from NAACL available soon as well. We have also started indexing JEP-TALN-RECITAL, a French-language conference series on CL/NLP.

The Anthology now contains over 23,000 papers (up from 20,200 a year ago).

The new ACL Anthology went live in Feb 2012. Unfortunately, due to a number of maintainer problems, the system has not stayed live for very long. An important part of the Anthology work for 2013 is to keep the new version stable and constantly up and running.

Mailing List. Membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 426 members (up from 363 a year ago). This is an announcement-only list, on which we notify members of newly released materials listed online.

Plans

A key thrust this year is becoming a DOI assignee as part of the CrossRef publishers' cooperative. This will allow us to register our own DOIs for publications, which will route to the ACL Anthology or TACL pages (see the other report from the Information Office). Currently we have an agreement with the ACM to assign DOIs through them, but this costs us pageviews and the opportunity to control where we want the information to go.

A second thrust will be to better handle the other forms of scientific knowledge that we are interested in archiving. These include software, datasets and video. The procedures for integrating these with START and the submission process need to be worked out, and the space requirements for these services need to be assessed. For the time being, we will concentrate on videos (as NAACL is supposed to be making these available).

A third thrust for this year will be to incorporate the results of the R50 workshop into the Anthology, and to allow third-party applications to automatically annotate articles and papers in the Anthology with new metadata as they become available. Such an API will raise the visibility of the Anthology as an object of study, complementing our earlier work to make the Anthology's text a corpus.

The closely related work of the Information Office (of which the Anthology is a part) is also available as a 2013 Q3 report here.

We have long-term plans to work on the following problems, which are less urgent:

  • Collaboration with START and aclpub (may also involve the work of the Conference Officer, Jian)
  • PDF metadata fixing for all articles. Crucially, Google Scholar uses this information but it is not always correctly generated.
  • One PDF file per article. This is especially problematic for the J79 series, where most PDF files contain an entire issue.
  • Incorporation of accepted TACL articles into the Anthology. One current difficulty is that TACL submissions can also appear as an ACL publication. Most likely, we will just list both publications as (unlinked) records for now.

Late-Breaking Plans

(This section added 31 Jul; may not be reflected in hardcopy)

ELRA has contacted us (via Nicoletta and Khalid) to ask for some joint initiatives between ACL/ELRA and other sister CL organizations. I report this in the IO report, but one item is relevant to the Anthology office. ELRA has back issues of some of their conference materials and inquired about how we scan and digitize our materials for the Anthology. The bulk of the ACL legacy materials were scanned prior to 2004 by a third party, under Steven Bird, so we only do one-off digitization for now. I may voluntarily assist ELRA with some of their materials as they need. With respect to the items that the IO/CO roles wish to work on with ELRA, this is considered a low priority for now.