2018Q3 Reports: ACL Anthology
The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. As of 2016, we employ a Creative Commons Attribution license for materials published by ACL. This makes our content usable by the general public with attribution to the ACL (although it is not mandatory for any user to inform us of their use of our materials). Dual licensing for a fee is presumably possible (although not exercised currently).
The Anthology now contains over 44,600 papers (up from 42,100 papers in the last report in Q3). As many may know, there are two versions of the Anthology: a legacy version at http://www.aclweb.org/anthology and a current one at http://aclanthology.info . Due to constraints on the Editor's time, and because we have run both Anthologies in parallel for over 2 years as a transition period, we have stopped maintaining the legacy site. Work still remains to port all of the accepted functionality over to the current site (e.g., handling errata, full text search, redirections from the legacy site to the current one, and archiving access statistics, among others). However, the development of new features is heavily constrained by the Editor's available time: as the popularity of NLP/CL publications continues to grow, so does the Editor's workload for all tasks, routine and exceptional. The current Anthology is physically hosted at Universität des Saarlandes, in a well-resourced virtual machine, but may need to find a new home soon because volunteers are ending their terms with the hosting group.
Mailing List. The membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 895 members (up from 763 a year ago). This is an announcement-only list, where we notify members of newly released materials.
Auxiliary Ingestion. The Anthology now has ingestion workflows for software, datasets, general attachments, slides and posters that are hosted with the ACL Anthology. One important problem is keeping certain media updated in the publication workflows for conferences: supplemental materials, videos and posters are ingestible and have processes within the Anthology Editor's workflows, but these processes seem less well formed for individual conferences. We recommend that the Conference chair establish a clear interface between authors, vendors and the Anthology.
Digital Object Identifiers. We have assigned DOIs to all ACL materials in 2015 and to all current material, except TACL. With our current practice of assigning DOIs to all materials, our costs are likely to escalate to at least US$ 3K, as we digitally publish at least that number of scholarly articles.
ACL Anthology Reference Corpus version 3 (ACL ARC 3). We have received the LDC's blessing to distribute a new reference corpus. This is a priority in the development of the Anthology as a scientific corpus in its own right, but it needs to be delayed until the current Editor can step down and concentrate on this development.
Work Queue. The current state of ingestion and development of the ACL Anthology is publicly available via the ACL Anthology's footer. New in this report is the incorporation of historical ingestion logs as well. https://docs.google.com/spreadsheets/d/166W-eIJX2rzCACbjpQYOaruJda7bTZrY7MBw_oa7B2E/pubhtml
Anthology Steering Committee. The ASC currently consists of Jing-Shin Chang, Min-Yen Kan and Paola Merlo. The ASC met virtually on 14 Jul 2016. The ASC also discussed supplementary materials in the ACL Anthology; other authority networks (for papers, authors, etc.) that may be in use or have been proposed by other institutions (e.g., MIT Press for the CL journal); and revisions to, and better placement of, the contributors' instructions for ingesting new material into the Anthology. Due to Min-Yen's involvement in ACL 2017, the ASC has yet to hold its next meeting but may be able to meet in person soon.
Accomplishments for this year
- Negotiating with Google Scholar to index aclanthology.info. Although this is done, it is a temporary fix until the Anthology can find a home within www.aclweb.org as required by Google Scholar.
- Ending support for the legacy Anthology
- Containerization of the Anthology. We need volunteers to create a network of mirrors of the Anthology.
- Publication of a short paper in the NLP-OSS workshop with the volunteer group
- Moving the Anthology (-PDFs) from the under-resourced NUS virtual machine to Universität des Saarlandes
- Massive CL/NLP BibTeX file. We regenerate a single BibTeX file covering all Anthology materials after each period of ingestion.
- Incorporation of historical ingestion logs in the public Google Doc
- Refinement of outstanding issues in the open-source GitHub codebase
- Migration of the GitHub codebase to the central ACL GitHub account
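As an illustration of the single BibTeX file mentioned above, the following is a minimal sketch of how per-volume BibTeX files could be merged into one file. It is not the Anthology's actual script: the function names, the citation keys, and the rule that the first occurrence of a duplicated key wins are all assumptions made for the example. It also assumes every entry starts with "@" at the beginning of a line.

```python
import re

# Matches the start of an entry such as "@inproceedings{some-key," and
# captures the citation key. Assumes entries begin at the start of a line.
ENTRY_START = re.compile(r"^@\w+\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)


def split_entries(bibtex):
    """Return a list of (citation_key, entry_text) pairs found in one BibTeX string."""
    matches = list(ENTRY_START.finditer(bibtex))
    entries = []
    for i, match in enumerate(matches):
        # An entry runs until the next entry starts, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(bibtex)
        entries.append((match.group(1), bibtex[match.start():end].strip()))
    return entries


def merge_bibtex(chunks):
    """Merge several BibTeX strings into one, keeping the first entry seen per key."""
    merged = {}  # dicts preserve insertion order, so output order is stable
    for chunk in chunks:
        for key, text in split_entries(chunk):
            merged.setdefault(key, text)
    return "\n\n".join(merged.values()) + "\n"


if __name__ == "__main__":
    # Hypothetical volumes with made-up keys; the second repeats one entry.
    volume_a = "@inproceedings{example-2018-a,\n  title = {A},\n}\n"
    volume_b = volume_a + "\n@article{example-2017-b,\n  title = {B},\n}\n"
    print(merge_bibtex([volume_a, volume_b]))
```

Keying on the citation key keeps the merge idempotent, so rerunning it after a later ingestion does not duplicate entries that were already present.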
Plans, Prioritized
- Min-Yen Kan, current Anthology Editor, will relinquish editorship of the ACL Anthology in 2018. It is time to begin searching for qualified individuals to fulfil this important, voluntary role, both to ensure that service to the community is not interrupted and to gain the benefits of having new leadership rejuvenate the Anthology with new ideas. The ASC concurs that this process needs to start and recommends that the ACL Exec come up with a selection process. The existing Nominating Committee could be asked to help with the process, but the NC should be enlarged to add at least the current Anthology Editor and/or a member who could advise on the technical expertise of candidates. The ASC additionally recommends that the NC:
  - Insist that prospective Editors have access to volunteers (i.e., students) who have the technical ability to help with the infrastructural maintenance work.
  - Conduct an open call for (self-) nominations that might dovetail with a call for general volunteers.
- We are capturing abstracts (albeit with some noise) from START, but have yet to successfully integrate them into the ACL Anthology's index.
- For long-term preservation, create an XML representation of all of the metadata used to create the Anthology. This is similar in nature to the XML dump of DBLP or Wikipedia. It allows a clean separation of the underlying data in the Anthology from the code used to present it.
- Collaboration with START (which may also involve the Conference Officer's work) to integrate user accounts in their system. This would allow START to have authority records for authors, so that new paper submissions might start with correct, canonical forms of author names. The ASC is aware of ORCIDs and other name authority systems that might also be useful in this process.
- Collaboration with ELRA to allow the categorization of papers against the LRE Map and ISLRNs.
- Allow third-party applications to automatically annotate existing papers with new metadata via an API. Such an API would be a production API, allowing third parties to add auto-analyzed materials to the Anthology (e.g., auto-extracted keywords, summaries). This will raise the visibility of the Anthology as an object of study, complementing work on the ACL ARC.
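To make the planned XML representation of the metadata more concrete, here is a minimal sketch of what such a dump and its round trip might look like. The element names (anthology, paper, title, author, year), the record fields, and the paper ID are hypothetical and chosen only for illustration; they are not the Anthology's actual schema.

```python
import xml.etree.ElementTree as ET


def papers_to_xml(papers):
    """Serialize a list of paper records (dicts) to an XML string."""
    root = ET.Element("anthology")  # hypothetical root element
    for paper in papers:
        node = ET.SubElement(root, "paper", id=paper["id"])
        ET.SubElement(node, "title").text = paper["title"]
        for name in paper["authors"]:
            ET.SubElement(node, "author").text = name
        ET.SubElement(node, "year").text = str(paper["year"])
    return ET.tostring(root, encoding="unicode")


def xml_to_papers(xml_text):
    """Parse the XML produced by papers_to_xml back into paper records."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": node.get("id"),
            "title": node.findtext("title"),
            "authors": [author.text for author in node.findall("author")],
            "year": int(node.findtext("year")),
        }
        for node in root.findall("paper")
    ]


if __name__ == "__main__":
    # A made-up record; the ID does not refer to a real Anthology paper.
    sample = [
        {
            "id": "X00-0001",
            "title": "An Example Paper",
            "authors": ["A. Author", "B. Author"],
            "year": 2018,
        }
    ]
    dump = papers_to_xml(sample)
    print(dump)
    print(xml_to_papers(dump) == sample)
```

Because the records survive the round trip unchanged, any presentation layer (the website, a BibTeX export, a mirror) could in principle be rebuilt from the dump alone, which is the separation of data from code that the plan describes.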