2016Q3 Reports: ACL Anthology
The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. We employ a Creative Commons Attribution Non-Commercial, Share-Alike license for materials published by ACL. This makes our content usable by the general public with attribution to the ACL (although it is not mandatory for any user to inform us of their use of our materials). Dual licensing for a fee is presumably possible (although not exercised currently).
The Anthology now contains over 37,000 papers (up from 34,800 papers in the last report in Q3). The new ACL Anthology is now active and will become the primary Anthology site around ACL this year, as we have had some time to sort out problems with the site. However, we know that a portion of our membership will still want to use the older version, so we will maintain both sites at least until the end of 2016.
We have begun assigning DOIs to our own materials this year, starting with the 2014 materials and, we hope, working forward to the current year, before ingesting the earlier materials that predate the migration away from the courtesy DOI assignment by ACM (for pre-2012 materials). For reference, the ACL decided to create its own DOIs so that we could control where the DOIs resolve. Previously, ACM "owned" the DOI redirect, routing traffic from ACL to its ACM Portal digital library in exchange for the cost of DOI assignment (US$1 per paper). With our current practice of assigning DOIs to all materials, our costs are likely to escalate to at least US$2K, as we digitally publish at least that many scholarly articles.
We now have partially updated statistics on the most accessed papers and authors in the Anthology. We have begun to automate the collection of this information and to propagate it to the pages for papers and authors, to provide additional data with which authors can argue for their impact. We have preserved the web log data for the new Anthology so that we can run other analytics when interested members of our community wish to use the logs to create better services for the community.
While the new Anthology is live, it is hosted on a university virtual machine in Singapore and is unlikely to scale to provide adequate bandwidth when faced with full access from the ACL membership and the general public. We are investigating which service to migrate to; it will likely require a VPS account, as we need to install certain software and libraries that usually require root privileges. We hope to carry out this migration soon.
Finally, we recognise that the ACL Anthology has become a significant asset for the ACL, manifesting its central role in the NLP/CL research communities. It is of too much import to have a single editor be responsible for the policymaking of the Anthology. We hope the Exec will endorse the call for a steering committee to provide the necessary oversight for the Anthology. This is to be discussed during the executive meeting.
Mailing List. The membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown and now stands at 555 members (up from 479 a year ago, and 533 at the last report 6 months ago). This is an announcement-only list, where we notify members of newly released materials listed online.
Plans. Aside from completing DOI assignment, a second thrust concerns other forms of scientific knowledge that we are interested in archiving. These include software, datasets and videos. The procedures for integrating these with START and the submission process need to be worked out, and the space requirements for these services need to be assessed. For the time being, we will concentrate on videos.
A third thrust will be to incorporate the results of the R50 workshop into the Anthology, and to allow third-party applications to automatically annotate papers in the Anthology with new metadata as they become available. Such an API will raise the visibility of the Anthology as an object of study, complementing our earlier work to make the Anthology's text a corpus.
We have long-term plans to work on the following other issues, which are smaller in scope than the major thrusts above:
A previous discussion (with Ken Church) proposed that we create a single bibtex file for all Anthology materials. The beta Anthology can generate such information fairly easily with its database backing; we plan to have this file available during the ACL 2015 conference.
To create an XML representation of all of the metadata that is used to create the Anthology site.
[low priority] Collaboration with START and aclpub (which may also involve the Conference Officer's work) to integrate users of their system and to obtain LaTeX and abstracts for indexing and preservation.
[low priority] Collaboration with ELRA with respect to use of the LRE Map and ISLRNs, and voluntarily helping them with scanning backlog archives into digital form.