Difference between revisions of "Paraphrase Identification (State of the art)"

From ACL Wiki

Revision as of 14:08, 24 March 2009

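  • source: Microsoft Research Paraphrase Corpus (MSRP), http://research.microsoft.com/en-us/downloads/607D14D9-20CD-47E3-85BC-A2F65CD28042/default.aspx
  • task: given a pair of sentences, classify the pair as a paraphrase or not a paraphrase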
  • see: Dolan et al. (2004)
  • train: 4,076 sentence pairs (2,753 positive: 67.5%)
  • test: 1,725 sentence pairs (1,147 positive: 66.5%)


Sample data

  • Sentence 1: Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.
  • Sentence 2: Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.
  • Class: 1 (true paraphrase)
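
To make the task concrete, the following is a minimal sketch of a toy classifier applied to the sample pair above. It is not one of the algorithms in the table of results: the word-overlap (Jaccard) score, the tokenizer, and the 0.5 threshold are all assumptions chosen only for illustration.

```python
import re


def tokens(sentence: str) -> set:
    """Lowercase a sentence and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: |intersection| / |union| of the token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_paraphrase(a: str, b: str, threshold: float = 0.5) -> int:
    """Return 1 (paraphrase) or 0 (not paraphrase); the threshold is hypothetical."""
    return int(jaccard(a, b) >= threshold)


s1 = ('Amrozi accused his brother, whom he called "the witness", '
      'of deliberately distorting his evidence.')
s2 = ('Referring to him as only "the witness", Amrozi accused his brother '
      'of deliberately distorting his evidence.')

print(round(jaccard(s1, s2), 3))  # 10 shared words out of 18 distinct words
print(is_paraphrase(s1, s2))
```

The two sentences share 10 of their 18 distinct words, so this toy rule assigns class 1, matching the gold label for this pair. A fixed overlap threshold is only a baseline idea; the systems in the table use richer similarity measures or learned features.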


Table of results

Algorithm | Reference | Description | Accuracy | F-measure
MCS | Mihalcea et al. (2006) | unsupervised combination of several word similarity measures | 70.3% | 81.3%
WDDP | Wan et al. (2006) | supervised dependency-based features | 75.0% | 73.0%
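
As a point of comparison for the scores above, the sketch below computes accuracy and F for a trivial baseline that labels every test pair as a paraphrase. It assumes that F is the harmonic mean of precision and recall on the positive class, which the page does not state explicitly; the function name `accuracy_and_f` is hypothetical. The counts come from the test split described above (1,725 pairs, 1,147 positive).

```python
def accuracy_and_f(tp: int, fp: int, fn: int, tn: int):
    """Return (accuracy, F) from confusion-matrix counts.

    F is assumed to be the balanced F-measure (F1) on the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f = 2 * precision * recall / (precision + recall)
    return accuracy, f


# All-positive baseline on the test split: every one of the 1,725 pairs is
# predicted as a paraphrase, so the 1,147 true paraphrases are true positives
# and the remaining 578 pairs are false positives.
acc, f = accuracy_and_f(tp=1147, fp=1725 - 1147, fn=0, tn=0)
print(f"accuracy = {acc:.1%}, F = {f:.1%}")  # accuracy = 66.5%, F = 79.9%
```

Because about two thirds of the pairs are positive, this baseline already reaches a high F, which is worth keeping in mind when reading the F column.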

References

Dolan, B., Quirk, C., and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources, Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, pp. 350-356.

Mihalcea, R., Corley, C., and Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity, Proceedings of the National Conference on Artificial Intelligence (AAAI 2006), Boston, Massachusetts, pp. 775-780.

Wan, S., Dras, M., Dale, R., and Paris, C. (2006). Using dependency-based features to take the "para-farce" out of paraphrase, Proceedings of the Australasian Language Technology Workshop (ALTW 2006), pp. 131-138.


See also