Difference between revisions of "2011Q3 Reports: ACL Anthology"

Latest revision as of 14:45, 30 May 2011

[ Link to 2008 Q3 report ] [ Link to 2009 Q3 report ] [ Link to 2010 Q3 report ]

ACL ANTHOLOGY Report June 2011, Min-Yen Kan

EXECUTIVE SUMMARY: In the past year, we have fixed most of the problems with the ACM's ingestion of our data and have published a prototype ACL Anthology site working with a canonicalized data model. Both actions will need to be approved by the ACL Exec before being finalized into production.

INTRO The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. Conference proceedings are published in the anthology around the same time as the conference (subject to general/program chairs' discretion). CL articles are published within a few days of publication on the MIT Press website, now that CL is open access.

The anthology now contains over 19,200 papers (up from 18,000 twelve months ago). All papers prior to this conference that belong to the ACL (N.B., not sister organizations) are also indexed by the ACM Portal as per the ACL Anthology - ACM agreement. This agreement allows the ACM to assign DOIs for ACL materials, in exchange for the right to determine where the DOIs resolve (currently, the ACM Portal).

CHANGES OVER LAST 12 MONTHS:

With the help of Praveen Bysani at NUS, we have completed a new prototype of the ACL Anthology (http://aclanthology.heroku.com), which features faceted navigation, search and an underlying data model. Technically, it is built using Ruby on Rails with a Project Blacklight plug-in and features OAI-PMH integration to allow third parties to ingest and list article metadata from the Anthology, and offsite Lucene indices to allow faceted search. In creating the prototype, we have unified the metadata of all articles in the ACL Anthology, a non-trivial task since the original Anthology metadata was not of uniform quality. Currently, minor changes are being made to the prototype to ensure that all functionality of the current Anthology is intact in the prototype. Once these are finished, we will seek the ACL Exec's approval to launch the prototype as our production Anthology, which will need to be hosted by a (commercial) third party.
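As an illustration of what OAI-PMH ingestion by a third party might look like, the following is a minimal sketch in Python. The endpoint URL, the sample response and the function names are assumptions made for illustration; the report does not state the prototype's actual OAI-PMH address or which metadata formats it serves. Only the `ListRecords` verb, the `oai_dc` metadata prefix, the `resumptionToken` mechanism and the XML namespaces come from the OAI-PMH 2.0 protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint: the report does not give the OAI-PMH URL.
BASE_URL = "http://example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # In the protocol, resumptionToken is an exclusive argument:
        # it may not be combined with metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, next_token) parsed from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text or "" for c in rec.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing token means the list is complete.
    return records, (token or None)


# Made-up response used only to exercise the parser offline.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:X00-0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Made-Up Paper Title</dc:title>
          <dc:creator>Author, First</dc:creator>
          <dc:creator>Author, Second</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    records, token = parse_records(SAMPLE_RESPONSE)
    print(records, token)
```

A real harvester would fetch each URL over HTTP and keep requesting with the returned token until none is given; the sketch omits the network step so that it can be run without access to any particular server.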

We have also finished our work to ensure that DBLP and the ACM Portal accurately cover the Anthology materials; however, some of these changes may not yet have been finalized by our counterparts at DBLP and the ACM Portal. With assistance from Praveen Bysani, the ACM now has a complete list of proceedings from ACL and should finalize DOI assignments for legacy materials (particularly workshops) this year and provide this information back to the ACL Anthology for our records.

On a related note, we have also been working with the ACM to fix how our materials appear on the ACM website. Previously, the materials were freely accessible, but only after registration of a free ACM account. Based on ACL members' feedback, we considered the registration to be an unacceptable barrier to access, and the ACM has since changed its layout for our materials to make it easier to 1) access the PDF of the paper and 2) access the ACL Anthology's page for the paper.

Finally, we have finished our major push to incorporate our sister organizations' proceedings (ROCLING, PACLIC, ALTA, RANLP), although new forums may be added in the coming year. In particular, we expect to get past proceedings from RANLP and from LREC, as and when these organizations can make their proceedings available to us in the ACL Anthology ingestion format.

MAILING LIST: The membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 312 members (up from 259 in the last report). This is an announcement-only list, where we notify members of newly released materials online.

ONGOING WORK: The new ACL Anthology prototype (once minor edits are finished and the subsequent version approved by the ACL Exec) will enable more Web 2.0-style interaction with our materials and guarantee a uniform level of service for all works. A key thrust for this year will be to allow third-party applications to automatically annotate articles with new metadata in the Anthology. Such an API will raise the visibility of the Anthology as an object of study, complementing our earlier work to make the Anthology's text a corpus. In rough order of priority, we plan to finish:

XML import and export of single or multiple papers and volumes
Import and export of single or multiple papers and volumes in BibTeX and other bibliographic formats
Addition of custom fields (e.g., OLAC language subject codes)
Suggestion of corrections to metadata or added fields by public (to be moderated by the Anthology editor)

Once these are completed, we plan to work with START and aclpub (courtesy of connections from Steven Bird) to directly support the ACL Anthology XML format, so that future events using the aclpub package are easier to incorporate directly into the Anthology.
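To make the planned BibTeX export of single papers and whole volumes more concrete, here is a minimal sketch in Python. The record layout (`id`, `title`, `authors`, `booktitle`, `year`), the function names and the sample paper are all hypothetical; the report does not describe the Anthology's actual data model or export code, so this only shows the general shape such an export could take.

```python
# Characters that need a backslash escape inside BibTeX field values.
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


def escape(value):
    """Escape LaTeX special characters in a field value."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in value)


def to_bibtex(paper):
    """Render one paper record (hypothetical dict layout) as a BibTeX entry."""
    fields = [
        ("title", paper["title"]),
        # BibTeX separates multiple authors with the word "and".
        ("author", " and ".join(paper["authors"])),
        ("booktitle", paper["booktitle"]),
        ("year", str(paper["year"])),
    ]
    body = ",\n".join(f"  {key} = {{{escape(val)}}}" for key, val in fields)
    return f"@inproceedings{{{paper['id']},\n{body}\n}}"


def export_volume(papers):
    """Render several papers (for example, one volume) as a single .bib text."""
    return "\n\n".join(to_bibtex(p) for p in papers)


# Made-up record used only to exercise the functions above.
SAMPLE_PAPER = {
    "id": "X00-0001",
    "title": "Parsing & Tagging",
    "authors": ["Doe, Jane", "Roe, Richard"],
    "booktitle": "Proceedings of a Hypothetical Workshop",
    "year": 2011,
}

if __name__ == "__main__":
    print(export_volume([SAMPLE_PAPER]))
```

Exporting a single paper and exporting a volume share the same per-entry function, which is one way to keep the single and multiple cases in the list above consistent with each other.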

Other work left from previous years is still queued:

Text for all PDF files. Some articles (e.g., EACL 2003) only exist in image form in the Anthology, rendering indexing (and hence subsequent citation) of these articles problematic.
PDF metadata fixing for all articles. Crucially, Google Scholar uses this information but it is not always correctly generated.
One PDF file per article. This is especially problematic for the J79 series, which largely represents one issue per PDF file.

@@ Line 1: / Line 1: @@
 [ [[2008Q3_Reports:_Anthology|Link to 2008 Q3 report]] ]
 [ [[2009Q3_Reports:_ACL_Anthology|Link to 2009 Q3 report]] ]
+[ [[2010Q3_Reports:_ACL_Anthology|Link to 2010 Q3 report]] ]
 '''ACL ANTHOLOGY Report'''
-June 2010, Min-Yen Kan
+June 2011, Min-Yen Kan
-'''EXECUTIVE SUMMARY''': We have added sister CL organizations' metadata in the past year, and are working on revamping the Anthology's underlying data model and conversion to a database format with the help of 3K USD from the ACL Exec.
+'''EXECUTIVE SUMMARY''': In the past year, we have fixed most of the problems with the ACM's ingestion of our data and have published a prototype ACL Anthology site working with a canonicalized data model.  Both actions will need to be approved by the ACL Exec before being finalized into production.
+'''INTRO''' The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all.  Conference proceedings are published in the anthology around the same time as the conference (subject to general/program chairs' discretion). CL articles are published within a few days of publication on the MIT Press website, now that CL is open access.
+The anthology now contains over 19,200 papers (up from 18,000 articles from twelve months ago). All papers prior to this conference that belong to the ACL (N.B., not sister organizations) are also indexed by the ACM Portal as per the ACL Anthology - ACM agreement.  This agreement allows the ACM to assign DOIs for ACL materials, in exchange for being able to dictate where the DOI resolution goes to (currently, the ACM Portal)
+'''CHANGES OVER LAST 12 MONTHS''':
-'''INTRO''' The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all.  Conference proceedings are published in the anthology around the same time as the conference (subject to general/program chairs' discretion). CL articles are published within a few days of publication on the MIT Press website, now that CL is open access.
+With the help of Praveen Bysani at NUS, we have completed a new prototype of the ACL Anthology (http://aclanthology.heroku.com) which features faceted navigation, search and an underlying data model.  Technically, it is built using Ruby on Rails with a Project Blacklight plug-in and features OAI-PMH integration to allow third parties to ingest and list article metadata from the Anthology, and offsite Lucene indices to allow faceted search.  In creating the prototype, we have unified the metadata of all articles in the ACL Anthology, a non-trivial task since the original Anthology metadata was not of uniform quality.  Currently, minor changes to the prototype are being done to ensure that the functionality of the current Anthology are all intact in the prototype.  Once finished, we will seek the ACL Exec's approval to launch the prototype as our production Anthology, which will need to be hosted by a (commercial) third party.
-The anthology now contains over 18,000 papers (up from 15,900 articles from twelve months ago). All of the papers up to 2008 are also now indexed by the ACM Portal and should have Digital Object Identifiers (DOIs) assigned to them per the ACL Anthology - ACM agreement.
+We have also also finished our work to ensure DBLP and ACM Portal accurately cover the Anthology materials; however, some of these changes may have not yet finalized by the opposing party at DBLP and ACM Portal. With assistance from Praveen Bysani, ACM now has a complete list of proceedings from ACL and should finalize DOI assignments for legacy materials (particularly workshops) this year and provide this information back to the ACL Anthology for our records.
-'''CHANGES OVER LAST 12 MONTHS''': As promised, we have been busy reaching out to sister CL/NLP related societies and have been also ingesting and hosting their materials.  We have incorporated ROCLING, PACLIC and ALTA forums into the Anthology, now listed under Other Events.  We are finalizing the incorporation of RANLP soon; although this may be delayed 1-2 months due to catastrophic disk failure of our preview and development copy of the Anthology at NUS.
+On a related note, we have also been working with the ACM to fix how our materials appear on the ACM website.  Previously, the materials were freely accessible but only after registration of a free ACM account.  Thanks to ACL members' feedback, we considered the registration to be an unacceptable barrier to access, and ACM has since changed their layout with respect to our materials to make it easier to 1) access the PDF of the paper 2) access the ACL Anthology's page for the paper.
-We have also been busy ensuring the coverage of DBLP and ACM Portal cover the Anthology materials.  We understand that the ACM Digital Library has finished ingestion and DOI assignation of over 100 venues (mostly workshops) that had been missing from their records.
+Finally, we have finished our major push to incorporate our sister organizations' proceedings (ROCLING, PACLIC, ALTA, RANLP), although new forums may be undertaken in this coming year.  In particular, we expect to get past proceedings from RANLP and from LREC, as and when these organizations can make their proceedings available to us in the ACL Anthology ingestion format.
-DBLP is known to digest information from the ACM Digital Library Portal, and has recently completed ingestion of most of the ACL materials at this point.  Min is working with Drago's help to ensure that the information is up-to-date.
-We have also gotten approval to use 3600 USD of the ACL's budget to upgrade the Anthology.  We are using these funds to requisition external consultant work to code a new version of the underlying ACL software, to upgrade the storage, curation of the metadata and a better faceted navigation user interface that will allow the filtering of publications by custom filters.  Currently, the new Anthology model is built using Ruby on Rails and features OAI-PMH integration to allow third parties to ingest and list article metadata from the Anthology.
+'''MAILING LIST''': The Anthology mailing list's (http://groups.google.com/group/acl-anthology) membership pool has grown, now consisting of 312 members (up from 259 from last report).  This is an announcement-only list, where we notify members of newly listed released materials online.
-'''MAILING LIST''': The Anthology mailing list's (http://groups.google.com/group/acl-anthology) membership pool has grown, now consisting of 259 members (up from 176 from last report).  This is an announcement-only list, where we notify members of newly listed released materials online.
+'''ONGOING WORK''': The new ACL Anthology prototype (once minor edits are finished and the subsequent version approved by the ACL Exec) will enable more Web 2.0 style of interaction with our materials and guarantee a uniform level of service for all works.  A key thrust for this year will be to allow third-party applications automatically annotate articles with new metadata in the Anthology.  Such an API will raise the visibility of the Anthology as a object of study, complementing our earlier work to make the Anthology's text a corpus.  In rough order of priority, we plan to finish:
-'''ONGOING WORK''': While we incorporate more materials for CL, our next big project is ensuring that ACM lists our publications with appropriate rights and linkage to our Anthology copies.  ACL members rightfully have complained that the ACM doesn't make it obvious that the publications are from ACL and that they can be obtained free without ACM registration.
+* XML import and export of single, multiple papers and volumes
+* BibTeX and other bibliographic formats import and export of single, multiple papers and volumes
+* Addition of custom fields (e.g., OLAC language subject codes)
+* Suggestion of corrections to metadata or added fields by public (to be moderated by the Anthology editor)
-Once completed, we will plan to work with START and aclpub (courtesy connections from Steven Bird) to directly support the ACL Anthology XML format to make future events using the aclpub package easier to incorporate directly into the Anthology, and to incorporate further categorization of submission by OLAC codes (language subject matter).
+Once completed, we will plan to work with START and aclpub (courtesy connections from Steven Bird) to directly support the ACL Anthology XML format to make future events using the aclpub package easier to incorporate directly into the Anthology,
-Other work left from last year is still queued. These are to provide a uniform level of service and metadata for past work.  There are a number of issues that are being tackled:
+Other work left from previous years are still queued:
-* Correct XML representation of each article: names of authors (with diacritics for European names), first, "von" and last name portions
-* BibTeX representations for all articles
-* One PDF file per article.  This is especially problematic for the J79 series, which largely represents one issue per PDF file.
 * Text for all PDF files.  Some articles (e.g., EACL 2003) only exist in image form in the Anthology, rendering indexing (and hence subsequent citation) of these articles problematic.
 * PDF metadata fixing for all articles.  Crucially, Google Scholar uses this information but it is not always correctly generated.
-* Wikification of articles so that registered ACLs users will be able to edit their contributions to add errata and other metadata, multimedia.
+* One PDF file per article.  This is especially problematic for the J79 series, which largely represents one issue per PDF file.
