Difference between revisions of "Computational Lexicology"

Latest revision as of 04:14, 25 June 2012

Computational Lexicology is the use of computers in the study of the lexicon. It has been described more narrowly (Amsler, 1980) as the use of computers in the study of machine-readable dictionaries. It is distinguished from Computational Lexicography, which is more properly the use of computers in the construction of dictionaries, though some researchers have used the two terms synonymously.

History

It was the appearance of machine-readable dictionaries (MRDs) that gave Computational Lexicology its start as a separate discipline within Computational Linguistics. The first widely distributed MRDs were the Merriam-Webster Seventh Collegiate (W7) and the Merriam-Webster New Pocket Dictionary (MPD). Both were produced by a government-funded project at System Development Corporation under the direction of John Olney. They were keyboarded manually because no typesetting tapes of either book were available. Originally each was distributed on multiple reels of magnetic tape as card images, with each word of each definition on a separate punch card together with numerous special codes indicating the details of its usage in the printed dictionary. Olney outlined a grand plan for the analysis of the definitions in the dictionary, but his project expired before the analysis could be carried out. Robert Amsler at the University of Texas at Austin resumed the analysis and completed a taxonomic description of the Pocket Dictionary under NSF funding; however, his project expired before the taxonomic data could be distributed. Roy Byrd et al. at IBM Yorktown Heights resumed analysis of the Webster's Seventh Collegiate following Amsler's work. Finally, in the 1980s, starting with initial support from Bellcore and later funded by NSF, ARDA, DARPA, DTO, and REFLEX, George Miller and Christiane Fellbaum at Princeton University completed the creation and wide distribution of a dictionary and its taxonomy in the WordNet project, which today stands as the most widely distributed computational lexicology resource.

Computational lexicology has contributed to our understanding of the content and limitations of print dictionaries for computational purposes. Almost every portion of a print dictionary entry has been studied, including:

what constitutes a headword - used to generate spelling correction lists;
what variants and inflections the headword forms - used to empirically understand morphology;
how the headword is delimited into syllables;
how the headword is pronounced - used in speech generation systems;
the parts of speech the headword takes on - used for POS taggers;
any special subject or usage codes assigned to the headword - used to identify text document subject matter;
the headword's definitions and their syntax - used as an aid to disambiguation of words in context;
the etymology of the headword - used to characterize text vocabulary by its languages of origin;
the example sentences;
the run-ons (additional words and multi-word expressions that are formed from the headword); and
related words such as synonyms and antonyms.
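
The portions of an entry listed above can be pictured as fields of a record. The following is a minimal sketch, not the format of W7, MPD, or any other actual machine-readable dictionary: the class names, field names, and the two sample entries are all assumptions made up for illustration. It shows how two of the uses mentioned in the list (a spelling list and a part-of-speech lexicon) could be derived from such records.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """One numbered definition of a headword (hypothetical layout)."""
    definition: str
    examples: list[str] = field(default_factory=list)
    subject_codes: list[str] = field(default_factory=list)


@dataclass
class Entry:
    """One dictionary entry; every field name here is an assumption."""
    headword: str
    variants: list[str] = field(default_factory=list)      # inflections, variant spellings
    syllables: list[str] = field(default_factory=list)
    pronunciation: str = ""
    parts_of_speech: list[str] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)
    etymology: str = ""
    run_ons: list[str] = field(default_factory=list)
    synonyms: list[str] = field(default_factory=list)
    antonyms: list[str] = field(default_factory=list)


# Illustrative entries only; not copied from any published dictionary.
ENTRIES = [
    Entry(
        headword="run",
        variants=["runs", "ran", "running"],
        syllables=["run"],
        parts_of_speech=["verb", "noun"],
        senses=[Sense("to go faster than a walk",
                      examples=["she runs every morning"])],
        run_ons=["runner"],
        synonyms=["sprint"],
        antonyms=["walk"],
    ),
    Entry(
        headword="lexicon",
        variants=["lexicons", "lexica"],
        syllables=["lex", "i", "con"],
        parts_of_speech=["noun"],
        senses=[Sense("the vocabulary of a language",
                      subject_codes=["linguistics"])],
        etymology="Greek",
    ),
]


def spelling_list(entries):
    """Collect headwords and their variants into a sorted, de-duplicated list."""
    words = set()
    for entry in entries:
        words.add(entry.headword)
        words.update(entry.variants)
    return sorted(words)


def pos_lexicon(entries):
    """Map each headword to the set of parts of speech it can take."""
    return {entry.headword: set(entry.parts_of_speech) for entry in entries}
```

A spelling corrector would consult `spelling_list(ENTRIES)`, while a POS tagger could use `pos_lexicon(ENTRIES)` to restrict the candidate tags for a word.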

Many computational linguists were disenchanted with print dictionaries as a resource for computational linguistics because they lacked sufficient syntactic and semantic information for computer programs. The work on computational lexicology quickly led to efforts in two additional directions.

Successors to Computational Lexicology

First, collaborative activities between computational linguists and lexicographers led to an understanding of the role that corpora played in creating dictionaries. Most computational lexicologists moved on to build large corpora to gather the basic data that lexicographers had used to create dictionaries. The ACL/DCI (Data Collection Initiative) and the LDC (Linguistic Data Consortium) went down this path. The advent of markup languages led to the creation of tagged corpora that could be more easily analyzed to create computational linguistic systems. Part-of-speech tagged corpora and semantically tagged corpora were created in order to test and develop POS taggers and word sense disambiguation technology.
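
How a tagged corpus is used to develop and test a POS tagger can be sketched in a few lines. The example below assumes a simple `word/TAG` notation, a tiny made-up tag set, and invented sentences; it is not the markup of any ACL/DCI or LDC corpus. The tagger is a deliberately naive baseline that assigns each word the tag it most often carried in the training data.

```python
from collections import Counter, defaultdict


def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]


def train_most_frequent_tag(sentences):
    """Learn, for each lowercased word, the tag it most often carries."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(model, sentences, default_tag="NOUN"):
    """Fraction of tokens whose predicted tag matches the corpus tag.

    Words never seen in training receive default_tag (an assumption).
    """
    total = 0
    correct = 0
    for sentence in sentences:
        for word, tag in sentence:
            total += 1
            if model.get(word.lower(), default_tag) == tag:
                correct += 1
    return correct / total if total else 0.0


# Invented training and test data in the assumed word/TAG notation.
TRAIN = [parse_tagged(s) for s in (
    "the/DET dog/NOUN runs/VERB",
    "the/DET run/NOUN ends/VERB",
)]
TEST = [parse_tagged(s) for s in (
    "the/DET dog/NOUN ends/VERB",
    "a/DET cat/NOUN runs/VERB",
)]

MODEL = train_most_frequent_tag(TRAIN)
SCORE = accuracy(MODEL, TEST)
```

Holding out part of the tagged corpus for testing, as `TEST` does here, is what lets a tagger be scored against human-assigned tags.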

The second direction was toward the creation of Lexical Knowledge Bases (LKBs). A Lexical Knowledge Base was deemed to be what a dictionary should be for computational linguistic purposes, especially for computational lexical semantic purposes. It was to have the same information as a print dictionary, but with the meanings of the words and the appropriate links between senses made fully explicit. Many researchers began creating the resources they wished dictionaries had been, had they been designed for computational analysis. WordNet can be considered to be such a development, as can the newer efforts at describing syntactic and semantic information such as the FrameNet work of Fillmore. Outside of computational linguistics, the ontology work of artificial intelligence can be seen as an evolutionary effort to build a lexical knowledge base for AI applications.
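
The idea of explicit links between senses can be illustrated with a toy knowledge base. The sketch below is an assumption-laden illustration, not WordNet's or FrameNet's data format: the sense identifiers, glosses, and the single-parent hypernym ("is a kind of") links are invented. It shows how a taxonomy stored as links can be walked by a program, something a print definition only implies.

```python
from dataclasses import dataclass, field


@dataclass
class SenseNode:
    """A word sense with explicit links to more general senses (hypothetical)."""
    sense_id: str
    gloss: str
    hypernyms: list[str] = field(default_factory=list)


# Toy knowledge base; identifiers and glosses are made up for illustration.
LKB = {
    "dog.n.1": SenseNode("dog.n.1", "a domesticated canine", ["canine.n.1"]),
    "cat.n.1": SenseNode("cat.n.1", "a small domesticated feline", ["feline.n.1"]),
    "canine.n.1": SenseNode("canine.n.1", "a dog-like mammal", ["mammal.n.1"]),
    "feline.n.1": SenseNode("feline.n.1", "a cat-like mammal", ["mammal.n.1"]),
    "mammal.n.1": SenseNode("mammal.n.1", "a warm-blooded vertebrate", ["animal.n.1"]),
    "animal.n.1": SenseNode("animal.n.1", "a living organism that moves"),
}


def hypernym_chain(lkb, sense_id):
    """Follow the first hypernym link upward until the top of the taxonomy."""
    chain = []
    seen = {sense_id}
    current = sense_id
    while lkb[current].hypernyms:
        parent = lkb[current].hypernyms[0]
        if parent in seen:  # guard against accidental cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


def lowest_common_hypernym(lkb, sense_a, sense_b):
    """Return the most specific sense that both arguments fall under, if any."""
    ancestors_b = {sense_b, *hypernym_chain(lkb, sense_b)}
    for candidate in [sense_a, *hypernym_chain(lkb, sense_a)]:
        if candidate in ancestors_b:
            return candidate
    return None
```

For instance, `hypernym_chain(LKB, "dog.n.1")` climbs from the dog sense to the animal sense, and `lowest_common_hypernym(LKB, "dog.n.1", "cat.n.1")` finds the mammal sense that both share.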

External links

FrameNet
MindNet - a semantic network made from a dictionary, by Microsoft Research, with an overview
Wikipedia: Lexicology
Wikipedia: Lexicography

References

Amsler, R.A. (1980). The Structure of the Merriam-Webster Pocket Dictionary, Doctoral Dissertation, TR-164, University of Texas, Austin.

@@ Line 3: / Line 3: @@
 ==History==
-In any case, it was the appearance of machine-readable dictionaries (MRDs) that gave Computational Lexicology its start as a separate discipline within Computational Linguistics. The first widely distributed MRDs were the Merriam-Webster Seventh Collegiate (W7) and the Merriam-Webster New Pocket Dictionary (MPD). Both were produced by a government-funded project at Systems Development Corporation under the direction of John Olney. They were manually keyboarded as no typesetting tapes of either book were available. Originally each was distributed on multiple reels of magnetic tape as card images with each separate word of each definition on a separate punch card with numerous special codes indicating the details of its usage in the printed dictionary. Olney outlined a grand plan for the analysis of the definitions in the dictionary, but his project expired before the anslysis could be carried out. Robert Amsler at the University of Texas at Austin resumed the analysis and completed a taxonomic description of the Pocket Dictionary under NSF funding, however his project expired before the taxonomic data could be distributed. Roy Byrd et al. at IBM Yorktown Heights resumed analysis of the Webster's Seventh Collegiate following Amsler's work. Finally, in the 1980s at Bellcore, George Miller and Christiene Fellbaum completed the creation and wide distribution of a dictionary's in the WordNet project, which today stands as among the most widely distributed computational lexicology resource.
+In any case, it was the appearance of machine-readable dictionaries (MRDs) that gave Computational Lexicology its start as a separate discipline within Computational Linguistics. The first widely distributed MRDs were the ''Merriam-Webster Seventh Collegiate'' (W7) and the ''Merriam-Webster New Pocket Dictionary'' (MPD). Both were produced by a government-funded project at Systems Development Corporation under the direction of John Olney. They were manually keyboarded as no typesetting tapes of either book were available. Originally each was distributed on multiple reels of magnetic tape as card images with each separate word of each definition on a separate punch card with numerous special codes indicating the details of its usage in the printed dictionary. Olney outlined a grand plan for the analysis of the definitions in the dictionary, but his project expired before the analysis could be carried out. Robert Amsler at the University of Texas at Austin resumed the analysis and completed a taxonomic description of the Pocket Dictionary under NSF funding, however his project expired before the taxonomic data could be distributed. Roy Byrd et al. at IBM Yorktown Heights resumed analysis of the ''Webster's Seventh Collegiate'' following Amsler's work. Finally, in the 1980s starting with initial support from Bellcore and later funded by NSF, ARDA, DARPA, DTO, and REFLEX, George Miller and Christiene Fellbaum at Princeton University completed the creation and wide distribution of a dictionary and its taxonomy in the [[WordNet]] project, which today stands as the most widely distributed computational lexicology resource.
-Computational lexicology has contributed to our understanding of the content and limitations of print dictionaries for computational purposes. Basically, almost every portion of a print dictionary entry has been studied ranging from what constitutes a headword, what variants and inflections it forms, how it is delimited into syllables, how it is pronunced, the parts of speech it takes on, any special subject or usage codes assigned to the headword, the headword's definitions and their syntax, the etymology and its use to characterize vocabulary by languages of origin, the example sentences, the run-ons (additional words and multi-word expressions that are formed from the headword), and related words such as synonyms and antonyms.
+Computational lexicology has contributed to our understanding of the content and limitations of print dictionaries for computational purposes. Basically, almost every portion of a print dictionary entry has been studied ranging from:
+* what constitutes a headword - used to generate spelling correction lists;
+* what variants and inflections the headword forms - use to empirically understand morphology;
+* how the headword is delimited into syllables;
+* how the headword is pronounced - used in speech generation systems;
+* the parts of speech the headword takes on - used for POS taggers;
+* any special subject or usage codes assigned to the headword - used to identify text document subject matter;
+* the headword's definitions and their syntax - used as an aid to disambiguation of word in context;
+* the etymology of the headword and its use to characterize vocabulary by languages of origin - used to characterize text vocabulary as to its languages of origin;
+* the example sentences;
+* the run-ons (additional words and multi-word expressions that are formed from the headword); and
+* related words such as synonyms and antonyms.
-Many computational linguists were disenchanted with print dictionaries as a resource for computational linguistics because they lacked sufficient syntactic and semantic information for computer programs.
+Many computational linguists were disenchanted with print dictionaries as a resource for computational linguistics because they lacked sufficient syntactic and semantic information for computer programs. The work on computational lexicology quickly led to efforts in two additional directions.
-===Successor Fields to Computational Lexicology===
+===Successors to Computational Lexicology===
-The work on computational lexicology quickly led to efforts in two additional directions. First, collaborative activities between computational linguists and lexicographers led to an understanding of the role that corpora played in creating dictionaries. Most computational lexicologists moved on to build large corpora to gather the basic data that lexicographers used to create dictionaries. The ACL/DCI (Data Collection Initiative) and the LDC (Linguistic Data Consortium) went down this path. The second direction was toward the creation of Lexical Knowledge Bases (LKBs). A Lexical Knowledge Base was deemed to be what a dictionary should be for computational linguistic purposes, especially for computational lexical semantic purposes. It was to have the same information as in a print dictionary, but totally explicated as to the meanings of the words and the appropriate links between senses.
+First, collaborative activities between computational linguists and lexicographers led to an understanding of the role that '''corpora played in creating dictionaries'''. Most computational lexicologists moved on to build large corpora to gather the basic data that lexicographers had used to create dictionaries. The ACL/DCI (Data Collection Initiative) and the LDC (Linguistic Data Consortium) went down this path. The advent of markup languages led to the creation of tagged corpora that could be more easily analyzed to create computational linguistic systems. Part-of-speech tagged corpora and semantically tagged corpora were created in order to test and develop POS taggers and word semantic disambiguation technology.
-Speech researchers looked at the use of the pronunciations in machine-readable dictionaries for a source of spoken language. Following the work on English language machine-readable dictionaries, researchers looked at bilingual dictionaries and pairing of multiple dictionaries to assist in machine translation.
+The second direction was toward the creation of [[Lexical Knowledge Bases]] (LKBs). A Lexical Knowledge Base was deemed to be what a dictionary should be for computational linguistic purposes, especially for computational lexical semantic purposes. It was to have the same information as in a print dictionary, but totally explicated as to the meanings of the words and the appropriate links between senses. Many began creating the resources they wished dictionaries were, if they had been created for use in computational analysis.  [[WordNet]] can be considered to be such a development, as can the newer efforts at describing syntactic and semantic information such as the [http://framenet.icsi.berkeley.edu/ FrameNet] work of Fillmore. Outside of computational linguistics, the Ontology work of artificial intelligence can be seen as an evolutionary effort to build a lexical knowledge base for AI applications.
-Speech generation systems did make use of pronunciations from machine-readable dictionaries and text content anslysis systems were built that used the subject codes of the Longman Dictionary of Contemporary English (LDOCE) to analysis document subject content. [There were also pre-computational lexicology projects such as the General Inquirer and others that performed content analysis of texts using hand-crafted subject tags associated with words in text, but as these didn't derive from machine-readable copies of general print dictionaries, they are not 'computational lexicology' as defined here]. The computational linguistic community has undertaken to create its own dictionary resources through projects such as the FRAMENET work of Fillmore.
+==See also==
+* [[Dictionaries]]
+* [[Lexicons]]
+* [[WordNet]]
+==External links==
+* [http://framenet.icsi.berkeley.edu/ FrameNet]
+* [http://atom.research.microsoft.com/mnex/ MindNet] - a semantic network made from a dictionary, by Microsoft Research, with an [http://research.microsoft.com/nlp/Projects/MindNet.aspx overview]
+* [http://en.wikipedia.org/wiki/Lexicology Wikipedia: Lexicology]
+* [http://en.wikipedia.org/wiki/Lexicography Wikipedia: Lexicography]
+==References==
+* Amsler, R.A. (1980). ''[ftp://ftp.cs.utexas.edu/pub/techreports/tr80-164a.pdf The Structure of the Merriam-Webster Pocket Dictionary]'', Doctoral Dissertation, TR-164, University of Texas, Austin.
+[[Category:Research]]
