Difference between revisions of "Multiword Expressions"

From ACL Wiki
m (Reverted edits by Creek (talk) to last revision by Michells)
 
(39 intermediate revisions by 4 users not shown)
Line 1: Line 1:
There is a growing awareness in the NLP community of the problems that Multiword Expressions (MWEs) pose and the need for their robust handling. MWEs include a large range of linguistic phenomena, such as phrasal verbs (e.g. "add up"), nominal compounds (e.g. "telephone box"), and institutionalized phrases (e.g. "salt and pepper"). These expressions, which can be syntactically and/or semantically idiosyncratic in nature, are used frequently in everyday language, usually to express precisely ideas and concepts that cannot be compressed into a single word.
+
Multiword expressions (MWEs) are expressions that consist of at least two words and that can be syntactically and/or semantically idiosyncratic. Moreover, they act as a single unit at some level of linguistic analysis. Sag et al.<ref>Sag et al. (2002), p. 1.</ref> roughly define MWEs as "idiosyncratic interpretations that cross word boundaries".
 +
MWEs can be regarded as lying at the interface of grammar and lexicon: they are usually instances of productive syntactic patterns but nevertheless show peculiar lexical behaviour.<ref>Calzolari et al. (2002), p. 1934.</ref>
 +
 
 +
MWEs are also common in every area of language use: Jackendoff<ref>cf. Jackendoff (1997).</ref> estimates the number of MWEs in a speaker's lexicon to be comparable to the number of single words. Examples of MWEs are idioms such as "kick the bucket", compound nouns such as "telephone box" and "post office", verb-particle constructions such as "look sth. up", and proper names such as "San Francisco". Because MWEs are so frequent, there is a growing awareness in the NLP ([http://en.wikipedia.org/wiki/Natural_language_processing Natural Language Processing]) community of the problems they pose.
  
 
{{stub}}
 
{{stub}}
 +
 +
 +
== Classification of MWEs ==
 +
 +
 +
MWEs can be divided into '''lexicalized phrases''', which have at least partly idiosyncratic syntax or pragmatics, and '''institutionalized phrases''', which are syntactically and semantically compositional. Lexicalized phrases can be further subclassified into '''fixed expressions''', '''semi-fixed expressions''' and '''syntactically flexible expressions'''.<ref>The article follows the classification scheme proposed in Sag et al. (2002). Most of the examples are taken from their article as well.</ref>
 +
 +
 +
'''1.1 Fixed expressions'''
 +
 +
Fixed expressions are fully lexicalized and allow neither morphosyntactic variation nor internal modification. Examples are ''in short'', ''by and large'', and ''every which way''. They are fixed in that one cannot say ''in shorter'' or ''in very short.''
 +
 +
 +
'''1.2 Semi-fixed expressions
 +
'''
 +
 +
In semi-fixed expressions, word order and composition are strictly invariable, while inflection, variation in reflexive form and determiner selection are possible.
 +
 +
In '''non-decomposable idioms''' (i.e. idioms whose meaning cannot be assigned to the parts of the MWE) such as ''kick the bucket'', the verb can be inflected to fit the context: ''he kicks'' ''the bucket''. On the other hand, non-decomposable idioms do not undergo syntactic variation. For example, a passive sentence such as ''the bucket was kicked'' is not possible (or at least does not have the same meaning).
 +
 +
Another type of semi-fixed expression is the '''compound nominal''', such as ''car park'' or ''peanut butter''. Compound nominals are syntactically unalterable but can inflect for number: ''2 car parks''.
 +
 +
'''Proper names''' are semi-fixed expressions as well, since they can occur in different forms. For example, the name of the U.S. sports team ''the San Francisco 49ers'' can occur as ''the 49ers'' or as a modifier in a compound noun such as ''a 49ers player''.
 +
 +
 +
 +
'''1.3 Syntactically-Flexible Expressions'''
 +
 +
Syntactically-flexible expressions have a wider range of syntactic variability than semi-fixed expressions. They occur in the form of '''decomposable idioms''', '''verb-particle constructions''' and '''light verbs'''.
 +
 +
Decomposable idioms are likely to be syntactically flexible to some degree. Examples are ''let the cat out of the bag'' and ''sweep under the rug''. Yet it is hard to predict which kinds of syntactic variation a given idiom can undergo.
 +
 +
Verb-particle constructions, such as ''write up'' and ''look up'', are made up of a verb and one or more particles. They are either semantically idiosyncratic, such as ''brush up on'', or compositional, such as ''break up'' in ''the meteorite broke up in the earth's atmosphere''. In some transitive verb-particle constructions, such as ''call s.o. up'', an NP argument can occur either between or following the verb and particle(s): ''call Kim up'' or ''call up Kim'', respectively. In addition, adverbs can often be inserted between the verb and particle, as in ''fight bravely on''.
 +
 +
For light verb constructions such as ''make a mistake'' and ''give a demo'', it is difficult to predict which light verb combines with a given noun. Though they are highly idiosyncratic, they have to be distinguished from idioms: "the noun is used in a normal sense, and the verb meaning appears to be bleached, rather than idiomatic."<ref>Sag et al. (2002), p. 7.</ref>
 +
 +
 +
 +
'''1.4 Institutionalized Phrases'''
 +
 +
Institutionalized phrases are conventionalized phrases, such as ''salt and pepper'', ''traffic light'' and ''to kindle excitement''. They are semantically and syntactically compositional, but statistically idiosyncratic. In the phrase ''traffic light'', for example, ''traffic'' and ''light'' both retain their simplex senses and combine constructionally to produce a compositional reading.
 +
 +
== Problems for NLP ==
 +
 +
One problem that arises in NLP when MWEs are treated by general, compositional methods of linguistic analysis is the '''overgeneration''' problem. A system could derive from attested expressions other, putatively possible expressions that are equivalent in meaning but do not exist because they are not institutionalized. "A generation system that is uninformed about both the patterns of compounding and the particular collocational frequency of the relevant dialect would correctly generate ''telephone booth'' (American) or ''telephone box'' (British/Australian), but might also generate such perfectly compositional, but unacceptable examples as ''telephone cabinet'', ''telephone closet'', etc."<ref>Sag et al. (2002), p. 2.</ref>
 +
 +
Another problem is the '''idiomaticity''' problem. It is difficult to predict the meaning of an expression like ''kick the bucket'', since its meaning is not related to the meanings of ''kick'', ''the'', and ''bucket'', even though the expression appears to conform to the grammar of English verb phrases.
 +
 +
 +
== References ==
 +
<references />
 +
 +
 +
Nicoletta Calzolari et al.: [http://gandalf.aksis.uib.no/lrec2002/pdf/259.pdf ''Towards Best Practice for Multiword Expressions in Computational Lexicons''] (2002) in: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 1934–40.
 +
 +
Ray Jackendoff: ''The Architecture of the Language Faculty'' (1997), Cambridge, MA: MIT Press.
 +
 +
Ivan A. Sag et al.: [http://www.springerlink.com/content/978-3-540-43219-7/#section=653450&page=1&locus=0 ''Multiword Expressions: A Pain in the Neck for NLP''] (2002) in: Lecture Notes in Computer Science, Vol. 2276, pp. 1-15.
 +
 +
 +
== Further Literature ==
 +
 +
Timothy Baldwin et al.: [http://acl.ldc.upenn.edu/acl2003/mwexp/pdfs/Baldwin.pdf. ''An Empirical Model of Multiword Expression Decomposability''] (2003) in: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 89-96.
 +
 +
 +
Eric Wehrli: [http://www.springerlink.com/content/wkeg2pxg1kha5uq9/ ''Parsing and Collocations''] (2000) in: Lecture Notes in Computer Science, Vol. 1835, pp. 272-282.
 +
 +
 +
== External links ==
 +
 +
[http://en.wikipedia.org/wiki/Multiword_expression Wikipedia article on MWE ]
 +
 +
 +
[[Category:Research]]

Latest revision as of 04:15, 25 June 2012

Multiword expressions (MWEs) are expressions that consist of at least two words and that can be syntactically and/or semantically idiosyncratic. Moreover, they act as a single unit at some level of linguistic analysis. Sag et al.[1] roughly define MWEs as "idiosyncratic interpretations that cross word boundaries". MWEs can be regarded as lying at the interface of grammar and lexicon: they are usually instances of productive syntactic patterns but nevertheless show peculiar lexical behaviour.[2]

MWEs are also common in every area of language use: Jackendoff[3] estimates the number of MWEs in a speaker's lexicon to be comparable to the number of single words. Examples of MWEs are idioms such as "kick the bucket", compound nouns such as "telephone box" and "post office", verb-particle constructions such as "look sth. up", and proper names such as "San Francisco". Because MWEs are so frequent, there is a growing awareness in the NLP (Natural Language Processing) community of the problems they pose.


Classification of MWEs

MWEs can be divided into lexicalized phrases, which have at least partly idiosyncratic syntax or pragmatics, and institutionalized phrases, which are syntactically and semantically compositional. Lexicalized phrases can be further subclassified into fixed expressions, semi-fixed expressions and syntactically flexible expressions.[4]


1.1 Fixed expressions

Fixed expressions are fully lexicalized and allow neither morphosyntactic variation nor internal modification. Examples are in short, by and large, and every which way. They are fixed in that one cannot say in shorter or in very short.


1.2 Semi-fixed expressions

In semi-fixed expressions, word order and composition are strictly invariable, while inflection, variation in reflexive form and determiner selection are possible.

In non-decomposable idioms (i.e. idioms whose meaning cannot be assigned to the parts of the MWE) such as kick the bucket, the verb can be inflected to fit the context: he kicks the bucket. On the other hand, non-decomposable idioms do not undergo syntactic variation. For example, a passive sentence such as the bucket was kicked is not possible (or at least does not have the same meaning).

Another type of semi-fixed expression is the compound nominal, such as car park or peanut butter. Compound nominals are syntactically unalterable but can inflect for number: 2 car parks.

Proper names are semi-fixed expressions as well, since they can occur in different forms. For example, the name of the U.S. sports team the San Francisco 49ers can occur as the 49ers or as a modifier in a compound noun such as a 49ers player.
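
The behaviour of semi-fixed expressions described in this section (fixed word order, limited inflection) can be illustrated with a small matcher. The following is only a minimal sketch: the lexicon SEMI_FIXED, its hand-listed inflected forms and the function find_semi_fixed are hypothetical names introduced here for illustration and are not taken from Sag et al. (2002).

```python
# Hypothetical mini-lexicon: each semi-fixed MWE is a sequence of slots,
# and each slot lists the surface forms it accepts.
SEMI_FIXED = {
    "kick the bucket": [
        {"kick", "kicks", "kicked", "kicking"},  # verb slot inflects
        {"the"},                                  # determiner is fixed
        {"bucket"},                               # noun is fixed
    ],
    "car park": [
        {"car"},
        {"park", "parks"},  # head noun inflects for number
    ],
}


def find_semi_fixed(tokens):
    """Return (start, end, entry) for each contiguous match in tokens."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for entry, slots in SEMI_FIXED.items():
        width = len(slots)
        for start in range(len(lowered) - width + 1):
            window = lowered[start:start + width]
            # Word order is fixed: slot i must match token i of the window.
            if all(tok in slot for tok, slot in zip(window, slots)):
                matches.append((start, start + width, entry))
    return matches
```

With this sketch, he kicks the bucket and 2 car parks are matched, while the passive the bucket was kicked is not, because the slots must occur contiguously and in their listed order.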


1.3 Syntactically-Flexible Expressions

Syntactically-flexible expressions have a wider range of syntactic variability than semi-fixed expressions. They occur in the form of decomposable idioms, verb-particle constructions and light verbs.

Decomposable idioms are likely to be syntactically flexible to some degree. Examples are let the cat out of the bag and sweep under the rug. Yet it is hard to predict which kinds of syntactic variation a given idiom can undergo.

Verb-particle constructions, such as write up and look up, are made up of a verb and one or more particles. They are either semantically idiosyncratic, such as brush up on, or compositional, such as break up in the meteorite broke up in the earth's atmosphere. In some transitive verb-particle constructions, such as call s.o. up, an NP argument can occur either between or following the verb and particle(s): call Kim up or call up Kim, respectively. In addition, adverbs can often be inserted between the verb and particle, as in fight bravely on.
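
The two word orders of transitive verb-particle constructions mean that a matcher cannot require the verb and particle to be adjacent. The sketch below shows one simple way to allow for this; the lexicon VPCS, the gap limit MAX_GAP and the function find_vpcs are assumptions made for illustration only.

```python
# Hypothetical lexicon of transitive verb-particle constructions:
# lemma -> (accepted verb forms, particle).
VPCS = {
    "call up": ({"call", "calls", "called", "calling"}, "up"),
    "look up": ({"look", "looks", "looked", "looking"}, "up"),
}

MAX_GAP = 3  # assumed limit on tokens between verb and particle


def find_vpcs(tokens):
    """Return (lemma, verb_index, particle_index) for each match."""
    lowered = [t.lower() for t in tokens]
    found = []
    for i, tok in enumerate(lowered):
        for lemma, (verb_forms, particle) in VPCS.items():
            if tok not in verb_forms:
                continue
            # Accept the particle directly after the verb or after a short gap.
            for j in range(i + 1, min(i + 2 + MAX_GAP, len(lowered))):
                if lowered[j] == particle:
                    found.append((lemma, i, j))
                    break
    return found
```

Both call Kim up and call up Kim are found by this sketch. Because it works on surface tokens only, it will also match sequences in which up is a preposition rather than a particle, so a real system would need syntactic information to tell the two apart.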

For light verb constructions such as make a mistake and give a demo, it is difficult to predict which light verb combines with a given noun. Though they are highly idiosyncratic, they have to be distinguished from idioms: "the noun is used in a normal sense, and the verb meaning appears to be bleached, rather than idiomatic."[5]


1.4 Institutionalized Phrases

Institutionalized phrases are conventionalized phrases, such as salt and pepper, traffic light and to kindle excitement. They are semantically and syntactically compositional, but statistically idiosyncratic. In the phrase traffic light, for example, traffic and light both retain their simplex senses and combine constructionally to produce a compositional reading.

Problems for NLP

One problem that arises in NLP when MWEs are treated by general, compositional methods of linguistic analysis is the overgeneration problem. A system could derive from attested expressions other, putatively possible expressions that are equivalent in meaning but do not exist because they are not institutionalized. "A generation system that is uninformed about both the patterns of compounding and the particular collocational frequency of the relevant dialect would correctly generate telephone booth (American) or telephone box (British/Australian), but might also generate such perfectly compositional, but unacceptable examples as telephone cabinet, telephone closet, etc."[6]
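
The overgeneration problem can be made concrete with a toy generator. The sketch below is an illustration under stated assumptions, not a description of any actual generation system: the word lists HEADS and MODIFIERS and the per-dialect table INSTITUTIONALIZED are invented stand-ins for a thesaurus and for corpus evidence.

```python
from itertools import product

# Assumed toy word lists; a real system would draw these from a thesaurus.
HEADS = ["booth", "box", "cabinet", "closet"]
MODIFIERS = ["telephone"]

# Assumed attested compounds per dialect, standing in for corpus evidence.
INSTITUTIONALIZED = {
    "American": {"telephone booth"},
    "British/Australian": {"telephone box"},
}


def generate_compounds():
    """Naive compositional generation: every modifier with every head."""
    return [f"{m} {h}" for m, h in product(MODIFIERS, HEADS)]


def filter_by_dialect(candidates, dialect):
    """Keep only candidates that are institutionalized in the dialect."""
    attested = INSTITUTIONALIZED[dialect]
    return [c for c in candidates if c in attested]
```

Here generate_compounds() produces telephone cabinet and telephone closet alongside the acceptable forms, and only a filter informed by dialect-specific usage, such as filter_by_dialect, removes them.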

Another problem is the idiomaticity problem. It is difficult to predict the meaning of an expression like kick the bucket, since its meaning is not related to the meanings of kick, the, and bucket, even though the expression appears to conform to the grammar of English verb phrases.


References

  1. Sag et al. (2002), p. 1.
  2. Calzolari et al. (2002), p. 1934.
  3. cf. Jackendoff (1997).
  4. The article follows the classification scheme proposed in Sag et al. (2002). Most of the examples are taken from their article as well.
  5. Sag et al. (2002), p. 7.
  6. Sag et al. (2002), p. 2.


Nicoletta Calzolari et al.: Towards Best Practice for Multiword Expressions in Computational Lexicons (2002) in: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 1934–40.

Ray Jackendoff: The Architecture of the Language Faculty (1997), Cambridge, MA: MIT Press.

Ivan A. Sag et al.: Multiword Expressions: A Pain in the Neck for NLP (2002) in: Lecture Notes in Computer Science, Vol. 2276, pp. 1-15.


Further Literature

Timothy Baldwin et al.: An Empirical Model of Multiword Expression Decomposability (2003) in: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 89-96.


Eric Wehrli: Parsing and Collocations (2000) in: Lecture Notes in Computer Science, Vol. 1835, pp. 272-282.


External links

Wikipedia article on MWE