Difference between revisions of "Author List Clean-up Code"
Revision as of 14:07, 21 July 2011
Media:author_name_normalization.tar.gz
A big challenge in automatically creating an anthology from publications is correcting author names. Many different versions of author names are found in different publications.
For example, the ACL Anthology contains 5 different versions of the author name "Rosé, Carolyn Penstein", as shown below.
Rose, Carolyn P.
Rosé, CarolynPenstein
Rosé, Carolyn P.
PensteinRosé, Carolyn P.
Rosé, Carolyn
To resolve this, we have created a semi-automatically cleaned list of all author names in the ACL Anthology. This "master list" contains 13,692 distinct authors. In addition to the master list, we provide code for the following tasks:
1. Finding the canonical version of an author name in the field of computational linguistics, if it exists in the master list (available as part of the package), using a number of heuristics.
2. Automatically changing the different versions of a name to the suggested canonical name, incorporating any manual corrections made by the user.
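The two tasks can be illustrated with a minimal Python sketch. It is not the code shipped in the package: the function names (strip_accents, match_key, build_index, find_canonical, normalize_names) and the single heuristic used here (compare the accent-free surname and the first initial of the given name) are assumptions made for illustration only, and the package itself applies many more heuristics. Names are assumed to be in "Last, First" form.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, so that 'Rosé' and 'Rose' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(name):
    """Reduce a 'Last, First' name to (accent-free surname, first initial).

    Illustrative heuristic only, not the package's actual matching rule.
    """
    last, _, given = name.partition(",")
    surname = strip_accents(last).strip().lower()
    initial = strip_accents(given).strip().lower()[:1]
    return (surname, initial)


def build_index(master_list):
    """Map each match key to the canonical names that share it."""
    index = {}
    for canonical in master_list:
        index.setdefault(match_key(canonical), []).append(canonical)
    return index


def find_canonical(name, index):
    """Task 1: return the canonical name, or None if absent or ambiguous."""
    candidates = index.get(match_key(name), [])
    return candidates[0] if len(candidates) == 1 else None


def normalize_names(names, master_list, manual=None):
    """Task 2: map every name to its canonical form.

    Manual corrections take priority; names with no match are left as-is.
    """
    manual = manual or {}
    index = build_index(master_list)
    return {
        name: manual.get(name) or find_canonical(name, index) or name
        for name in names
    }


# Toy master list for illustration; the real one has 13,692 authors.
master = ["Rosé, Carolyn Penstein"]
variants = ["Rose, Carolyn P.", "Rosé, Carolyn", "PensteinRosé, Carolyn P."]
# The fused surname defeats this simple heuristic, so it is fixed by hand.
manual = {"PensteinRosé, Carolyn P.": "Rosé, Carolyn Penstein"}
cleaned = normalize_names(variants, master, manual)
```

Returning None for ambiguous or unmatched names, instead of guessing, leaves those cases for the user to resolve by hand, which mirrors the semi-automatic workflow described above.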