2013Q3 Reports: ACL Anthology

From Admin Wiki
Revision as of 08:04, 12 July 2013 by Knmnyn (talk | contribs)

[ Link to 2013 Q1 report ] [ Link to 2012 Q3 report ] [ Link to 2011 Q3 report ] [ Link to 2010 Q3 report ] [ Link to 2009 Q3 report ]

The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community and freely available to all. Conference proceedings are published in the Anthology around the same time as the conference (subject to the general/program chairs' discretion). CL articles are published within a few days of their publication on the MIT Press website, now that CL is open access. With TACL going into circulation soon, this venue will also need to be incorporated into the Anthology this coming year. NAACL is also planning video recordings, so working out how to integrate and archive such recordings will be part of our ongoing work.

The Anthology now contains over 21,900 papers (up from 20,200 six months ago).

The new ACL Anthology went live in Feb 2012. Unfortunately, due to a number of maintainer problems, the system has not stayed live for very long. An important part of the Anthology work for 2013 is to keep the new version stable and running continuously.

We have also gotten the Exec's approval to use a separate domain (aclanthology.info) and hosting company (Amazon EC2) for the service. This service is not yet active.

With respect to materials, we continue to integrate other CL-related venues into the Anthology, both to increase the prestige of ACL and to make the Anthology even more useful. Over the last period we have integrated TALN (in French) and RANLP (in English).

Mailing List. Membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 394 members (up from 363 six months ago). This is an announcement-only list, on which we notify members of newly released materials.

Plans

A key thrust for this year will be to incorporate the results of the R50 workshop into the Anthology, and to allow third-party applications to automatically annotate articles and papers in the Anthology with new metadata as it becomes available. Such an API will raise the visibility of the Anthology as an object of study, complementing our earlier work to make the Anthology's text a corpus.

A second thrust will be to work out how best to handle the other forms of scientific knowledge that we are interested in archiving. These include software, datasets and video. The procedures for integrating these with START and the submission process need to be worked out, and the space requirements for these services assessed.

A third thrust will be to re-investigate whether to become our own DOI assignee. Currently we have an agreement with the ACM to assign DOIs through them, but this costs us pageviews and the opportunity to control where we want the information to go. A major drawback of becoming a DOI provider is that it adds to our administrative burden and that assigning DOIs costs money.

We also plan to work on the following problems, which are less urgent:

  • Collaboration with START and aclpub.
  • PDF metadata fixing for all articles. Crucially, Google Scholar uses this information, but it is not always generated correctly.
  • One PDF file per article. This is especially problematic for the J79 series, where each PDF file largely represents a whole issue.
  • Incorporation of TACL accepted articles into the Anthology. Currently one difficulty is that TACL submissions can also appear as an ACL publication. Most likely, we will just list both publications as (unlinked) records for now.