Clustering by Committee

From ACL Wiki

Latest revision as of 06:44, 8 January 2008

CBC (Clustering by Committee) is both a clustering algorithm and a resulting knowledge collection created by Patrick Pantel and Dekang Lin at the University of Alberta. The algorithm is a general-purpose partitioning clustering algorithm. The authors have used it more specifically for automatically clustering documents and for automatically inducing concepts and word senses.

The CBC knowledge collection consists of concepts. Each concept is a cluster of instances, like the three shown below, together with a template of typical grammatical contexts (lexical co-occurrence vectors) extracted from a text corpus:

(A) multiple sclerosis, diabetes, osteoporosis, cardiovascular disease, Parkinson's, rheumatoid arthritis, heart disease, asthma, cancer, hypertension, lupus, high blood pressure, arthritis, emphysema, epilepsy, cystic fibrosis, leukemia, hemophilia, Alzheimer, myeloma, glaucoma, schizophrenia, ...
(B) Mike Richter, Tommy Salo, John Vanbiesbrouck, Curtis Joseph, Chris Osgood, Steve Shields, Tom Barrasso, Guy Hebert, Arturs Irbe, Byron Dafoe, Patrick Roy, Bill Ranford, Ed Belfour, Grant Fuhr, Dominik Hasek, Martin Brodeur, Mike Vernon, Ron Tugnutt, Sean Burke, Zach Thornton, Jocelyn Thibault, Kevin Hartman, Felix Potvin, ...
(C) pink, red, turquoise, blue, purple, green, yellow, beige, orange, taupe, white, lavender, fuchsia, brown, gray, black, mauve, royal blue, violet, chartreuse, teal, gold, burgundy, lilac, crimson, garnet, coral, grey, silver, olive green, cobalt blue, scarlet, tan, amber, ...

Using sets of representative elements, called committees, CBC discovers concept signatures that unambiguously describe the members of a possible concept (e.g. diseases, hockey goalies, and colors). Concept signatures are templates of grammatical relations (lexical co-occurrence vectors) that apply to most instances of the concept. The algorithm first discovers committees that are well scattered in the similarity space. It then assigns each word to its most similar committees, each of which represents a final cluster. After assigning a word to a committee, CBC removes from the word the features (syntactic co-occurrences) that overlap with that committee before assigning the word to another committee. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses.
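The assignment step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cosine similarity measure, the threshold value, the function names (cosine, assign_senses), and the toy feature vectors are all assumptions chosen for illustration, and the committee discovery phase is skipped entirely (committees are simply given as centroid vectors).

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_senses(word_vec, committees, threshold=0.3):
    """Assign a word to committees one at a time (hypothetical helper).

    After each assignment, the features the word shares with the chosen
    committee are removed, so that weaker senses can surface next.
    The threshold of 0.3 is an arbitrary illustrative value.
    """
    remaining = dict(word_vec)      # features not yet explained by a sense
    candidates = dict(committees)   # committees not yet assigned
    senses = []
    while remaining and candidates:
        best = max(candidates, key=lambda name: cosine(remaining, candidates[name]))
        if cosine(remaining, candidates[best]) < threshold:
            break
        senses.append(best)
        centroid = candidates.pop(best)
        # Drop the features that overlap with the assigned committee.
        remaining = {f: w for f, w in remaining.items() if f not in centroid}
    return senses

# Toy data (invented): feature names mimic grammatical contexts.
committees = {
    "colors": {"mod-of:shirt": 1.0, "obj-of:paint": 1.0},
    "metals": {"obj-of:mine": 1.0, "mod-of:bar": 1.0},
    "diseases": {"obj-of:cure": 1.0, "subj-of:afflict": 1.0},
}
gold = {"mod-of:shirt": 3.0, "obj-of:paint": 2.0, "obj-of:mine": 1.0, "mod-of:bar": 0.5}

senses = assign_senses(gold, committees)
print(senses)  # ['colors', 'metals']
```

In this toy example the similarity between "gold" and the "metals" committee starts below the threshold (about 0.28) because the color features dominate the vector. Once the features shared with "colors" are removed, the similarity rises to about 0.95 and the less frequent sense is found.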

On the task of recovering the concepts and word senses in WordNet, CBC achieved 61% precision and 51% recall. CBC outputs a flat list of concepts (i.e., there is no hierarchical information).

Acquiring the Resource

Both an implementation of the CBC algorithm and the CBC knowledge collection are available for research purposes by contacting the authors.



Please cite either of the following publications when using this resource:

  • Patrick Pantel. 2003. Clustering by Committee. Ph.D. Dissertation. Department of Computing Science, University of Alberta.
  • Patrick Pantel and Dekang Lin. 2002. Discovering Word Senses from Text. In Proceedings of ACM Conference on Knowledge Discovery and Data Mining (KDD-02). pp. 613-619. Edmonton, Canada.


Patrick Pantel

Dekang Lin