2019Q1 Reports: Anthology Director

From Admin Wiki

Summary

The main points of this document are as follows:

  • The Anthology infrastructure. The Saarland server problem has been fixed. We have rebuilt the site with static generation tools, so that it is much faster and should now be indexed correctly by Google Scholar. This work—which I had estimated would cost $40k of paid developer time—was done for free to extremely high standards by Marcel Bollmann, an NLP postdoc at the University of Copenhagen. The new Anthology is live at http://aclweb.org/anthology. We are in the process of tying up loose ends.
  • Google Scholar integration. We believe the static site, hosted entirely under aclweb.org, will address issues with Google and Google Scholar search results. We will have more certainty on this in the coming month.
  • Data quality. David Chiang and Dan Gildea, together with Marcel, have put in many hours correcting data in the Anthology and improving the representation and the ingestion scripts. We now have an official ACLPUB repo and are coordinating on it with Softconf.
  • Video ingestion. Many videos are missing from the Anthology, which seems to be a contractual issue with the company we have paid to record videos. This is a serious issue given ACL’s investment, and I don’t have time to deal with it; we should hire someone to straighten all of this out.
  • Volunteers. There has been a good influx of volunteers who are making critical contributions to the site.

Things are going well. It’s been a lot of work but there is a great team and I am enjoying it.

I have been limiting my time to 5–10 hours a week and have a taste of the load Min worked under for so long. We have been busy with big projects which will come to an end, but I have not had time to handle everything that comes into my inbox, and this is before having had to handle a major conference ingestion. There are many smaller but important issues that I didn’t have time to report on in detail in the two hours I allotted for this (EI indexing for Chinese academics, copyright issues, CL / TACL auto-ingestion, MT archive ingestion, backups, Softconf interface, a preprint service, DOIs). This is not to complain but to inform. At some point I am going to want paid help with the more mundane details such as ingestions, even as I write documentation and hope to smooth out processes and make these things less time-consuming.

Looking forward to the next quarter, I want to let the dust settle on the static rewrite and experience my first ingest of a major conference (NAACL). At that point, I would like to start thinking about longer-term projects, in collaboration with Nitin and members of the community who’ve emailed in, on topics such as:

  • A research API to the Anthology that would allow easy access to the Anthology author and citation networks as well as the structured text of papers
  • *ACL paper formats and ingestion of raw paper data
  • The potential for the Anthology to provide a preprint service

The Anthology Infrastructure

As of the first day of 2019, the Anthology was facing many problems. The Ruby on Rails infrastructure was aging, difficult to upgrade and use, and slow. Pages on the Anthology were generated in a dynamic fashion in a long pipeline of steps involving many format conversions into and out of a SQL database (see this paper). The actual Anthology was split between aclweb.org (hosting the PDFs and original BibTeX files) and aclanthology.info (hosting paper, author, and venue data and metadata). This split caused problems with Google Scholar. The aclanthology.info URL exposed the hostname of the Saarland server it was running on, a minor but annoying detail.

I believe the choice to build the Anthology on a dynamic platform years ago was a sound one. The paper, venue, and author pages, along with all the citation formats, tally into hundreds of thousands of pages, the majority of which will never be accessed. But over the years the system had accumulated a lot of technical debt that we did not have the person-power to sustain. Viewed as an administrator, the Github repository page reports (via a service offered by Github) that our project has a large number of dependencies with critical security flaws:

[Image: Dependencies.png, Github's dependency security alerts for the repository]

The website was also very slow, despite being hosted on a powerful server at Saarland. Many of the pages are large and there was a noticeable delay as they were dynamically generated. This issue was likely exacerbated by our inability to keep up with upgrades that might have ameliorated this.

My first priority was therefore to rebuild the site using static generation tools. Marcel Bollmann, a postdoc, answered our volunteer call and, after researching the options, decided to build the site using Python 3 and Hugo, a static site generation (SSG) tool written in the Go programming language and known for being fast and having known dependencies. Hugo has a very active developer and user base and all indicators suggest it will be around for some time to come.

In any case, future upgrades to this software are less important because the site functionality is much simpler. The 385,532 files of the website are now built from scratch from the raw data in the Github repository in about ten minutes, and are then rsync’d to our production server.
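The build-then-sync flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Anthology's actual deployment script: the source directory, build directory, and remote target are hypothetical placeholders, and any Python preprocessing that runs before Hugo is left out. The sketch only assembles and prints the two commands rather than running them.

```python
import shlex


def deploy_commands(site_src, build_dir, remote):
    """Return the commands for a build-then-sync deployment.

    All three arguments are hypothetical placeholders, not the
    Anthology's real paths or hosts.
    """
    # Step 1: let Hugo render the whole static site into build_dir.
    build = ["hugo", "--source", site_src, "--destination", build_dir]
    # Step 2: mirror the rendered files to the production server.
    # The trailing slash makes rsync copy the directory's contents,
    # and --delete removes remote files that no longer exist locally.
    sync = ["rsync", "-az", "--delete", build_dir.rstrip("/") + "/", remote]
    return [build, sync]


commands = deploy_commands(
    "hugo",                                  # assumed site source directory
    "/tmp/anthology-build",                  # assumed local build directory
    "user@example.org:/var/www/anthology/",  # assumed remote target
)
for cmd in commands:
    # Print instead of executing; subprocess.run(cmd, check=True)
    # would execute each step in order.
    print(shlex.join(cmd))
```

Because the site is regenerated from scratch each time, a deployment of this shape keeps no state on the server beyond the synced files.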

We have lost our customized search, and are currently using a Custom Google Search instead. I think this solution is working well. We currently have ads but I have an application in to be exempted under Google’s non-profit program and hope to have that resolved soon. We plan to use this for the time being but may explore an in-house customized search again sometime in the future. In general, though, I think it is good for us to rely on existing tools as much as possible, as Nitin proposed in his report.

There is still some work to do:

  • We need to add 301 redirects from pages on the old site (aclanthology.info) and close down that site. We will retain it for development purposes and future dynamic content.
  • Names that appear in different forms in the Anthology are not yet merged together (for example, Marti Hearst and Marti A. Hearst).
  • The complete list of active issues can be found on our Github issues page, where all of this is managed.
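To make the name-merging item above concrete, here is a minimal sketch of one way to surface candidate variants. It is not the approach the Anthology uses or plans to use. It assumes that two written forms are variants when they share a first and last name and differ only in middle names or initials. That heuristic will also pair up distinct people who happen to share both names, so its output is a list for human review and not an automatic merge. The function names are hypothetical.

```python
from collections import defaultdict


def name_key(full_name):
    """Return a (first, last) key that ignores middle names and initials."""
    parts = full_name.replace(".", " ").split()
    if not parts:
        return ("", "")
    return (parts[0].lower(), parts[-1].lower())


def group_variants(names):
    """Group written forms that share a key; keep groups with 2+ forms."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return {
        key: sorted(forms)
        for key, forms in groups.items()
        if len(forms) > 1
    }


# The first two names come from the example in the list above.
candidates = group_variants(["Marti Hearst", "Marti A. Hearst", "Dan Gildea"])
for key, forms in candidates.items():
    print(key, forms)
```

Here the two forms of Marti Hearst's name land in one group, while Dan Gildea, with a single written form, is not reported.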

Google Scholar Integration

Google Scholar does not index the Anthology well. Anthology papers are often not returned at all, or are buried under submissions at the arXiv, and so on. Past correspondence has suggested the split between page metadata (on aclanthology.info) and the canonical PDF URL (under aclweb.org) was part of the reason. Now that these are on the same host, we hope the problem will be resolved. We have submitted our site to be indexed and Min has just put me in touch with our Google Scholar contact, Darcy Dapra (darcyd@google.com).

David Vilar has expressed interest in turning our command-line Anthology search tool, bibsearch, into an official tool for the Anthology. In the future, this could serve as the base for a new round of custom search engines and for a research API into the Anthology.

Video Ingestion

There are some serious problems with the videos that ACL has paid to record of conference talks:

  • We have videos recorded for many conferences, but they are not all listed in the Anthology (ingested). According to Min-Yen, Weyond is contractually obligated to provide us with the mapping between Anthology IDs and video links, but has not done this for some time.
  • Many of our videos are hosted on techtalks.tv, which appears to have shut down or to be shutting down. The videos have not been loading for at least a week, and the site admins are not responding to emails. These videos should be moved.

This information is summarized in the table below. I only know about videos that (a) I found on techtalks.tv or (b) were ingested, so there are likely other videos I don’t know about.

Conference        Ingested   Hosted at      Working
NAACL 2013        yes        techtalks.tv   no
ACL 2014          no         techtalks.tv   no
CoNLL 2015        no         techtalks.tv   no
ACL-IJCNLP 2015   yes        techtalks.tv   no
NAACL 2015        no         techtalks.tv   no
EMNLP 2016        no         techtalks.tv   no
CoNLL 2016        no         techtalks.tv   no
NAACL 2016        no         techtalks.tv   no
ACL 2016          no         techtalks.tv   no
NAACL 2018        yes        vimeo.com      yes
EMNLP 2018        yes        vimeo.com      yes

I do not know the relationship between these hosting services (techtalks.tv and vimeo) and Weyond, and furthermore do not yet know which video service is responsible for providing this information. I have not had the bandwidth to process it.

I have emailed Rajnish Kumar (rajnish@weyond.com), our purported contact at Weyond, but have not received any response in over a week. At the very least, we need to move these videos away from techtalks.tv—perhaps to vimeo or even Youtube—and I suggest that we pay someone to do this.

Volunteers

Martín Villalba and Christoph Teichmann have been volunteers for quite some time, and have been helpful with the transition. I have received a handful of emails from people interested in volunteering. The following folks have contributed so far in some capacity.

  • Marcel Bollmann has done an amazing job rebuilding the site. I can’t believe he’s doing this for free, and I don’t know where we’d be without him.
  • Martín Villalba has continued to run the aclanthology.info server. He has been busy lately, but contributed a lot of ideas to the static rewrite, jumps in on Github issues, and provides good ideas and occasional critical perspective.
  • Min-Yen Kan is very supportive and responsive. He often chimes in on Github issues with bits of arcana and advice, in addition to walking through many steps of this process with me.
  • David Chiang came out of nowhere and started doing intensive cleanup of the Anthology and formalization of the data schema. We have encountered many data issues, and everything is now in much better shape. He has also taken on interactions with Rich Gerber at Softconf and led the extraction and canonicalization of our current ACLPUB repo.
  • Dan Gildea has also been doing a lot of back-end scripting, fixing up lots of annoying issues with LaTeX imports and exports.
  • Chenliang Li has volunteered to help get *ACL proceedings listed in Elsevier’s Engineering Village, an important listing for Chinese academics in our area to have their contributions count towards career advancement.
  • Nitin and Marti have been hugely helpful and supportive with ideas and feedback.
  • This bullet point is for the person(s) who I will feel terrible for having forgotten after I send in this report.