Difference between revisions of "2010Q3 Reports: ACL Anthology"

 
(4 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
June 2010, Min-Yen Kan
 
  
(please see the items for discussion at the end of the report)
+
'''EXECUTIVE SUMMARY''': We have added sister CL organizations' metadata in the past year, and are working on revamping the Anthology's underlying data model and converting it to a database format with the help of 3K USD from the ACL Exec.
  
  
== In Progress -- Do not take as factual yet. ==
 
  
The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all.  Conference proceedings are published in the anthology around the same time as the conference (subject to general/program chairs' discretion). CL articles are published within a few days of publication on the MIT Press website (now that CL is open access).
 
  
The anthology now contains over BUG papers (up from XX articles from twelve months ago). Most of the papers are also indexed by Citeseer and Google Scholar, helping the citation counts of ACL authors.  The ACM Digital Library is creating rich metadata and doing full citation linking for all Anthology materials.  
+
'''INTRO''' The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all.  Conference proceedings are published in the anthology around the same time as the conference (subject to general/program chairs' discretion). CL articles are published within a few days of publication on the MIT Press website, now that CL is open access.
  
'''CHANGES OVER LAST 12 MONTHS''': As promised, we have finished the ingestion of IJCNLP 2005 and 2008 and have incorporated the listing of LREC 2008 papers.  We have also worked with ACL conference organizers to list the conference and associated workshop papers on the dates of the conference.  Similarly, CL journal issues are usually listed and anthologized within 48 hours of release from the MIT Press website.
+
The anthology now contains over 18,000 papers (up from 15,900 twelve months ago). All of the papers up to 2008 are also now indexed by the ACM Portal and should have Digital Object Identifiers (DOIs) assigned to them per the ACL Anthology - ACM agreement.  
  
We have also been very busy making sure that DBLP and the ACM Portal cover the Anthology materials more authoritatively.  We have sent updates to both providers, in their required ingestion format, listing 100+ events that need to be updated.  This corresponds to several thousand papers that were not listed by one or both databases.  This should help your citation counts and coverage.  However, we cannot guarantee whether or when these updates will take effect -- this is conditional on the corresponding target database's policy and schedule.
+
'''CHANGES OVER LAST 12 MONTHS''': As promised, we have been reaching out to sister CL/NLP-related societies, and have been ingesting and hosting their materials.  We have incorporated the ROCLING, PACLIC and ALTA forums into the Anthology, now listed under Other Events.  We are finalizing the incorporation of RANLP, although this may be delayed by 1-2 months due to a catastrophic disk failure of our preview and development copy of the Anthology at NUS.
  
'''MAILING LIST''': The Anthology mailing list's (http://groups.google.com/group/acl-anthology) membership pool has grown, now consisting of 176 members (up from 94 from last report).  This is an announcement-only list, where we notify members of newly listed released materials online.
+
We have also been busy ensuring that DBLP and the ACM Portal cover the Anthology materials. We understand that the ACM Digital Library has finished ingestion and DOI assignment for over 100 venues (mostly workshops) that had been missing from its records.
 +
DBLP is known to ingest information from the ACM Digital Library Portal, and has by now completed ingestion of most of the ACL materials.  Min is working, with Drago's help, to ensure that the information is up to date.
  
'''FUTURE MATERIALS''': We are looking to further incorporate additional materials from other CL related conferences.  ROCling's set of conference materials has been uploaded for our processing, and will be incorporated, pending edits to provide English translations of both author names and titles.  We are in the planning stage for incorporating past LRECs when they are made available and the upcoming RANLP 2009 conference.
+
We have also received approval to use 3600 USD of the ACL's budget to upgrade the Anthology.  We are using these funds to engage an external consultant to code a new version of the underlying ACL Anthology software, upgrading the storage and curation of the metadata and providing a faceted navigation user interface that lets users filter publications by custom criteria.  Currently, the new Anthology model is built using Ruby on Rails and features OAI-PMH integration to allow third parties to ingest and list article metadata from the Anthology.  
  
'''ONGOING WORK''': While we incorporate more materials for CL, we are working with START (courtesy connections from Steven Bird) to directly support the ACL Anthology XML format to make future events using the aclpub package easier to incorporate directly into the Anthology.
+
'''MAILING LIST''': The membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 259 members (up from 176 at the last report).  This is an announcement-only list, where we notify members of newly listed materials.
  
Another large tract of our current work is on providing a uniform level of service and metadata for past work.  There are a number of issues that are being tackled:
+
'''ONGOING WORK''': While we incorporate more materials for CL, our next big project is ensuring that ACM lists our publications with appropriate rights and linkage to our Anthology copies.  ACL members have rightly complained that the ACM doesn't make it obvious that the publications are from ACL and that they can be obtained for free without ACM registration.
 +
 
 +
Once this is completed, we plan to work with START and aclpub (courtesy of connections from Steven Bird) to directly support the ACL Anthology XML format, making future events that use the aclpub package easier to incorporate directly into the Anthology, and to add further categorization of submissions by OLAC codes (language subject matter).
 +
 
 +
Other work left over from last year is still queued. It aims to provide a uniform level of service and metadata for past work.  A number of issues are being tackled:
  
 
* Correct XML representation of each article: names of authors (with diacritics for European names), first, "von" and last name portions
 
Line 31: Line 34:
 
* Text for all PDF files.  Some articles (e.g., EACL 2003) only exist in image form in the Anthology, rendering indexing (and hence subsequent citation) of these articles problematic.
 
 
* PDF metadata fixing for all articles.  Crucially, Google Scholar uses this information but it is not always correctly generated.
 
 
+
* Wikification of articles so that registered ACL users will be able to edit their contributions to add errata, other metadata and multimedia.
With these issues fixed, we can proceed to wikify the Anthology, to allow more flexibility in adding, correcting metadata that members or outsiders can directly suggest.  Currently, Ali Hakim is working on some initial test phases to do this.
 
 
 
'''RELATED WORK''': Min is also working on creating an LDC release of the ACL Anthology Reference Corpus, which is a separate project, sanctioned earlier by the ACL.  This project will incorporate OCR dumped text for all pages in the reference corpus (a subset of all articles up to 2006) and PNG images for each page in the corpus.  This work should be ready by the end of the year.
 
 
 
----
 
 
 
'''FOR DISCUSSION''':
 
 
 
* The mission of the Anthology is eternally up for debate.  I have tried to lean less heavily on editorializing, and to concentrate instead on providing more content -- whatever its quality -- to the ACL Anthology audience, as long as it is related to CL/NLP and cleared for listing by the appropriate authorities.  A problem arises when we need to decide whether to ingest new materials (for example, literature not in English, or only peripherally related to CL/NLP).  Currently, we implement a permissive policy on this aspect.
 
* One aspect where this comes up is in listing venues on the front page.  While the vast majority of users may never use the front page (but get to the Anthology's details from search engines), currently we list ACL events at the top and other venues in a separate space.  While fair and consistent, this marginalizes ACL events that are mainly sponsored by SIGs (notably EMNLP, SemEval, CoNLL) as they are relegated to the pages for SIGs. Would we want to change this policy and if so, what policy should be implemented?
 

Latest revision as of 15:29, 8 June 2010

[ Link to 2008 Q3 report ] [ Link to 2009 Q3 report ]

ACL ANTHOLOGY Report June 2010, Min-Yen Kan

EXECUTIVE SUMMARY: We have added sister CL organizations' metadata in the past year, and are working on revamping the Anthology's underlying data model and converting it to a database format with the help of 3K USD from the ACL Exec.



INTRO The ACL Anthology is a digital archive of research papers in computational linguistics, sponsored by the CL community, and freely available to all. Conference proceedings are published in the anthology around the same time as the conference (subject to general/program chairs' discretion). CL articles are published within a few days of publication on the MIT Press website, now that CL is open access.

The anthology now contains over 18,000 papers (up from 15,900 twelve months ago). All of the papers up to 2008 are also now indexed by the ACM Portal and should have Digital Object Identifiers (DOIs) assigned to them per the ACL Anthology - ACM agreement.

CHANGES OVER LAST 12 MONTHS: As promised, we have been reaching out to sister CL/NLP-related societies, and have been ingesting and hosting their materials. We have incorporated the ROCLING, PACLIC and ALTA forums into the Anthology, now listed under Other Events. We are finalizing the incorporation of RANLP, although this may be delayed by 1-2 months due to a catastrophic disk failure of our preview and development copy of the Anthology at NUS.

We have also been busy ensuring that DBLP and the ACM Portal cover the Anthology materials. We understand that the ACM Digital Library has finished ingestion and DOI assignment for over 100 venues (mostly workshops) that had been missing from its records. DBLP is known to ingest information from the ACM Digital Library Portal, and has by now completed ingestion of most of the ACL materials. Min is working, with Drago's help, to ensure that the information is up to date.

We have also received approval to use 3600 USD of the ACL's budget to upgrade the Anthology. We are using these funds to engage an external consultant to code a new version of the underlying ACL Anthology software, upgrading the storage and curation of the metadata and providing a faceted navigation user interface that lets users filter publications by custom criteria. Currently, the new Anthology model is built using Ruby on Rails and features OAI-PMH integration to allow third parties to ingest and list article metadata from the Anthology.

MAILING LIST: The membership of the Anthology mailing list (http://groups.google.com/group/acl-anthology) has grown to 259 members (up from 176 at the last report). This is an announcement-only list, where we notify members of newly listed materials.

ONGOING WORK: While we incorporate more materials for CL, our next big project is ensuring that ACM lists our publications with appropriate rights and linkage to our Anthology copies. ACL members have rightly complained that the ACM doesn't make it obvious that the publications are from ACL and that they can be obtained for free without ACM registration.

Once this is completed, we plan to work with START and aclpub (courtesy of connections from Steven Bird) to directly support the ACL Anthology XML format, making future events that use the aclpub package easier to incorporate directly into the Anthology, and to add further categorization of submissions by OLAC codes (language subject matter).

Other work left over from last year is still queued. It aims to provide a uniform level of service and metadata for past work. A number of issues are being tackled:

  • Correct XML representation of each article: names of authors (with diacritics for European names), first, "von" and last name portions
  • BibTeX representations for all articles
  • One PDF file per article. This is especially problematic for the J79 series, which largely represents one issue per PDF file.
  • Text for all PDF files. Some articles (e.g., EACL 2003) only exist in image form in the Anthology, rendering indexing (and hence subsequent citation) of these articles problematic.
  • PDF metadata fixing for all articles. Crucially, Google Scholar uses this information but it is not always correctly generated.
  • Wikification of articles so that registered ACL users will be able to edit their contributions to add errata, other metadata and multimedia.
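
To illustrate the first item above (author names split into first, "von" and last portions), here is a minimal sketch in Python. It is not the Anthology's actual code or schema: the element names `author`, `first`, `von` and `last` and the helper names `split_name` and `author_element` are assumptions made for illustration, and the splitting rule is a simplified BibTeX-style heuristic (the "von" part is taken to be the run of lowercase-initial words before the last name).

```python
import xml.etree.ElementTree as ET


def split_name(full_name):
    """Split 'First von Last' into (first, von, last).

    Simplified BibTeX-style heuristic (an assumption, not the Anthology's rule):
    the final word always belongs to the last name, and the 'von' part spans
    from the first to the last lowercase-initial word before it.
    """
    tokens = full_name.split()
    if not tokens:
        return "", "", ""
    body = tokens[:-1]  # every word except the final one
    lower = [i for i, tok in enumerate(body) if tok[0].islower()]
    if not lower:
        # No lowercase particle: everything before the final word is the first name.
        return " ".join(body), "", tokens[-1]
    start, end = lower[0], lower[-1]
    return (
        " ".join(tokens[:start]),
        " ".join(tokens[start:end + 1]),
        " ".join(tokens[end + 1:]),
    )


def author_element(full_name):
    """Build a hypothetical <author> element with first/von/last children."""
    author = ET.Element("author")
    for tag, value in zip(("first", "von", "last"), split_name(full_name)):
        if value:  # omit empty parts, e.g. no <von> for "Min-Yen Kan"
            ET.SubElement(author, tag).text = value
    return author


if __name__ == "__main__":
    # Diacritics are kept as-is when serializing to a Unicode string.
    for name in ("Ludwig van Beethoven", "Min-Yen Kan",
                 "Charles Louis Xavier Joseph de la Vallée Poussin"):
        print(ET.tostring(author_element(name), encoding="unicode"))
```

A heuristic like this can only propose a split. Names whose particles are capitalized, or whose family name comes first, would still need manual correction, which is part of why this item remains queued work.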