Proposal:Proposed Work (II): Lexical Application



Next: Proposed Work (III): Discourse Application

Up: Description of Proposed Work

Previous: Proposed Work (I): Extending Methodology


2.1 Introduction

From a practical point of view, many tasks in NLP require word-sense disambiguation. As early as 1960, Bar-Hillel[5] argued that lexical ambiguity was the major impediment to machine translation. In information retrieval, documents and queries are typically represented by the words (i.e., terms) that they contain. Such representations would certainly be improved if unambiguous lexical items were used as opposed to words. In general, corpus analysis and lexicography could be automated further if the words in the corpora were disambiguated. It could be said that the accuracy of text processing, as a whole, is hampered by unresolved word senses.

The classification of an ambiguous word in natural language must be decided by the context in which it is used. The question becomes what aspects of context are most indicative of the classification, and what kind of knowledge is required to evaluate those contextual features. Systems that attempt to duplicate a human's ability to understand the context in terms of accumulated world and language knowledge rely on knowledge-bases that are manually created to capture human judgment in the structure of the information contained. Other researchers have proposed to ``understand'' language in terms of its statistical properties.

The difference between the information that people use to resolve ambiguity and the information that is used in most statistical classifiers is clear: people often make use of global organizational structures, while the typical probabilistic classifier uses only the immediate local structure. Such global organizational structures embody world knowledge that is generally considered to be too complex to extract from training data using statistical techniques. This knowledge is usually expressed as relations or constraints among the objects to be classified. A major difficulty in basing a classifier solely on constraints of this type is that such structures are rarely, if ever, complete and consistent enough to form the basis of a large-scale classifier. Therefore, most word-sense classifiers based on this type of knowledge are little more than demonstration systems.

We propose to develop a word-sense classifier that integrates information gathered empirically from training data with analytically derived domain knowledge. The classifier will disambiguate a large vocabulary of words with respect to fine-grained sense distinctions made in a standard dictionary. We will use the statistical techniques described in section 1.2 to automatically formulate a probabilistic model describing the relationship between each ambiguous word and a select set of unambiguous contextual features, without requiring a large amount of disambiguated corpora. The individual models, formulated from training data, will then be interconnected by a constraint structure specifying the semantic relationships that exist among the senses of ambiguous words. Fortunately, there are existing concept taxonomies that describe the semantic relationships among word senses. Examples of such theories are the ``subject code'' and ``box code'' hierarchies in the Longman Dictionary of Contemporary English [54], the hyponymy and meronymy taxonomies in WordNet [51], and the various taxonomies derived from machine-readable dictionaries[41],[14],[63],[19],[50],[2]. While the information expressed in these theories is perhaps too complex to derive from training data using statistical techniques, it is not too complex to express in the form of a statistical model. In the remainder of this section, we first describe a method for combining empirically and analytically derived knowledge to form a single probabilistic model, and then present a specific approach to producing a large-scale probabilistic classifier, along with a description of how word-sense disambiguation will be performed by that classifier.

2.2 Combining Empirical and Analytical Constraints

In this section, we describe methods for (1) representing, as conditional independence relationships, constraints that are formulated in propositional logic, (2) assigning probabilities to these relationships, and (3) combining such relationships with those derived from training data, to produce a graphical model. The methods outlined are specific to the type of logical theories used in our intended applications. These applications are existing concept taxonomies, such as WordNet[51], that specify interdependencies among the senses of the ambiguous words in a sentence.

Formulating a probabilistic model involves three major sub-tasks: defining the random variables in the model, identifying the dependencies among those variables, and quantifying those dependency relationships. In keeping with previous approaches[3], propositions are, in general, mapped to discrete random variables. An exception to this mapping is made for propositions corresponding to word senses. Each ambiguous word will map to a single random variable, with values corresponding to the possible senses of that word. (The possible values for each word could also include the null value corresponding to the absence of that word.) This treatment is necessary to capture the implicit relationships that exist between a word and its senses, i.e., word --> wordSense_1 || wordSense_2 || ... || wordSense_n, as well as the mutual exclusion relationship that exists among the senses of a word, i.e., wordSense_1 --> !wordSense_2 && !wordSense_3 && ... && !wordSense_n.
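As a minimal sketch of this mapping, the following shows how a single multi-valued variable makes both relationships hold by construction. The helper names (sense_variable, as_propositions) and the use of None as the null value are assumptions made for illustration only.

```python
NULL = None  # assumed marker for "the word is absent"

def sense_variable(word, n_senses, allow_null=True):
    """Return the domain of the single random variable for an ambiguous word."""
    domain = [f"{word}Sense_{i}" for i in range(1, n_senses + 1)]
    if allow_null:
        domain.append(NULL)
    return domain

def as_propositions(domain, value):
    """Read one value of the variable back as truth values of sense propositions."""
    return {sense: (sense == value) for sense in domain if sense is not NULL}

dog_domain = sense_variable("dog", 3)
props = as_propositions(dog_domain, "dogSense_2")
absent = as_propositions(dog_domain, NULL)
```

Because the variable takes exactly one value, at most one sense proposition is true at a time (mutual exclusion), and exactly one is true whenever the word is present.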

In logic, connectives, such as conjunction and implication, are used to describe relationships among propositions; in a probabilistic model, these relationships are expressed as statements of conditional independence among the variables. We require that the axioms for each proposition be covering, and that the set of axioms, as a whole, be acyclic. Given that these properties hold, all propositions (random variables) in a single axiom can be treated as interacting, while those propositions not together in any single axiom can be treated as not interacting. For example, an axiom of the form human --> animate, where human and animate refer to type primitives, would be viewed as specifying an interdependency between the variables corresponding to human and animate. If there is also an axiom of the form animal --> animate, with animal referring to a type primitive, then there would also exist an interdependency between the variables corresponding to animal and animate. But based on this information alone, no dependency between animal and human would be established.
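The rule that variables interact only when they appear together in some axiom can be sketched as follows. Representing each axiom as a tuple of the variable names it mentions, and the function name interaction_edges, are assumptions of this sketch.

```python
from itertools import combinations

def interaction_edges(axioms):
    """Link every pair of variables that appear together in at least one axiom."""
    edges = set()
    for variables in axioms:
        for a, b in combinations(sorted(set(variables)), 2):
            edges.add((a, b))
    return edges

# human --> animate and animal --> animate, as in the example above.
axioms = [("human", "animate"), ("animal", "animate")]
edges = interaction_edges(axioms)
```

Here edges links animate to both human and animal, but contains no edge between animal and human.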

Once the dependencies among variables have been identified, they must be quantified. The theory of Markov fields[53],[39],[21] provides a method for assigning complete and consistent probabilities given the dependency structure of the model and numerical measures of the compatibility of the values of interdependent variables, called ``compatibility measures.'' As described in Pearl[53], the method is referred to as ``Gibbs' potential.'' The advantages of this method are that the compatibility measures need only quantify the local interactions within the sets of completely interdependent variables, and that the numerical values assigned to these local interactions need not be globally consistent. Compatibility measures are assigned in accordance with the semantics of propositional logic. Returning to the previous example of human --> animate, a high compatibility measure would be assigned to the co-occurrence of human=true and animate=true, human=false and animate=false, as well as human=false and animate=true, while a low compatibility measure would be assigned to the pair human=true and animate=false.
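To make the construction concrete, the sketch below multiplies local compatibility measures and normalizes the result into a joint distribution, using the human --> animate and animal --> animate axioms. The numerical values HIGH and LOW are assumed for illustration, and the exhaustive enumeration is practical only for a toy model of this size.

```python
from itertools import product

HIGH, LOW = 0.95, 0.05  # assumed compatibility values, not taken from the proposal

def implication_compat(antecedent, consequent):
    """Compatibility for antecedent --> consequent: low only when it is violated."""
    return LOW if (antecedent and not consequent) else HIGH

def joint_distribution(variables, cliques):
    """Normalize the product of local compatibility measures over all assignments."""
    weights = {}
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = 1.0
        for names, compat in cliques:
            weight *= compat(*(assignment[name] for name in names))
        weights[values] = weight
    total = sum(weights.values())
    return {values: weight / total for values, weight in weights.items()}

variables = ("human", "animal", "animate")
cliques = [
    (("human", "animate"), implication_compat),
    (("animal", "animate"), implication_compat),
]
joint = joint_distribution(variables, cliques)
```

The compatibility values themselves need not sum to one; normalization over the whole model yields a consistent distribution in which assignments violating an axiom receive low, but nonzero, probability.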

The assignment of compatibility measures is less straightforward when the logical theory is a partially disambiguated taxonomy such as WordNet. The relationships expressed in such a taxonomy are of the form dogSense_6 --> canine, where the word dog has been disambiguated but canine has not. In order to assign compatibility measures to the possible combinations of the values of the variables such as dog and canine, it becomes necessary to augment the information provided in the theory. Specifically, it is necessary to assign a compatibility measure to the co-occurrence of dogSense_6 with canineSense_1 or canineSense_2 when all that is known is that dogSense_6 is related to one or more of the senses of canine. When the relationships among the alternative values of variables are not fully specified in the theory, we propose to assign uninformative compatibility measures to the unconstrained relationships. An uninformative compatibility measure corresponds to a uniform distribution of possible outcomes. Returning to the example of the relationship between dogSense_6 and canine, the co-occurrence of dogSense_6 with each sense of canine would receive the same (uninformative) compatibility measure.
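A minimal sketch of the uninformative assignment, assuming the uniform measure is simply one over the number of candidate senses (the function name and that scaling are illustrative choices):

```python
def uninformative_compat(source_sense, target_senses):
    """Give each unconstrained pairing the same compatibility measure."""
    share = 1.0 / len(target_senses)
    return {(source_sense, target): share for target in target_senses}

table = uninformative_compat("dogSense_6", ["canineSense_1", "canineSense_2"])
```

With this table the taxonomy alone expresses no preference between the senses of canine; any preference must come from the other constraints or from the training data.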

The Gibbs' potential can also be used to merge the model derived from logical constraints with those developed from the training data. In merging multiple models, the compatibility measures are formulated from the parameters of the models to be merged.

2.3 Proposed Implementation

The proposed word-sense classifier will be large-scale, capable of disambiguating close to 600 content words. Disambiguation will be with respect to a known standard such as WordNet synonym sets (``synsets'') [51] or standard dictionary definitions. The classifier will disambiguate all targeted content words in a single sentence simultaneously via the method of probabilistic constraint satisfaction described in section 1.2.3. In this section, we describe, in greater detail, the steps required to construct such a classifier.

The first phase of construction will be to formulate, from training data, a decomposable model for each word to be disambiguated. The candidate non-classification variables will include the following (more precisely, these are interpretations of such variables): the morphology of the ambiguous word, the part-of-speech (POS) categories of the immediately surrounding words, specific collocations, sentence position, and aspects of the phrase structure of the sentence. The preprocessing necessary to identify these features will be subsumed by that required for the discourse processor; our plans for addressing them are given in section 3.3.

Once a set of contextual features has been chosen as described in section 1.1.2, the form of the model for each word is selected from the class of decomposable models as described in section 1.2.1. Estimates of the model parameters are then obtained from untagged data as described in section 1.2.2. As described in section 1.2.1, selection of the form of a model requires a small amount of tagged data indicating the senses of the ambiguous word. It may be possible to further reduce the total amount of sense-tagged data needed for this phase of construction by identifying a parametric model or models applicable to a wide range of content words, as described in our previous work[12]. In the second phase of construction, the statistical models for the individual words are interconnected by a constraint structure specifying the semantic relationships that exist among the senses of words, using the approach described in section 2.2. Several of the concept taxonomies cited in section 2.1 appear to be applicable, and it should be possible to include multiple theories in a single model, provided that they are based on the same sense distinctions. The selection of specific domain theories to be used in constructing the proposed word-sense classifier will be made after a preliminary analysis of the relational structures. The central concerns of this analysis will be completeness and consistency.

2.4 Resolution of Ambiguity Through Simultaneous Satisfaction of All Constraints

Using the model constructed as described above, the disambiguation of all ambiguous words in a sentence will be accomplished via a stochastic simulation of that model, as described in section 1.2.3. The tag sequence identified for each sentence via the stochastic simulation will be the maximum a posteriori probability (MAP) solution to the satisfaction of all constraints.
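The following sketch illustrates the general idea with a plain Gibbs sampler that records the highest-weight assignment it visits. It is not the procedure of section 1.2.3, which is not reproduced here; the word senses, the potential values, and the choice to read off the MAP estimate as the best visited state are all assumptions of the sketch.

```python
import random

def weight(state, potentials):
    """Unnormalized probability: product of all local potentials."""
    w = 1.0
    for names, table in potentials:
        w *= table[tuple(state[name] for name in names)]
    return w

def gibbs_map(domains, potentials, sweeps=500, seed=0):
    """Run a Gibbs sampler and return the highest-weight state visited."""
    rng = random.Random(seed)
    state = {var: rng.choice(values) for var, values in domains.items()}
    best, best_w = dict(state), weight(state, potentials)
    for _ in range(sweeps):
        for var, values in domains.items():
            # Resample one variable from its conditional given all the others.
            cond = [weight({**state, var: v}, potentials) for v in values]
            state[var] = rng.choices(values, weights=cond)[0]
            w = weight(state, potentials)
            if w > best_w:
                best, best_w = dict(state), w
    return best

# Hypothetical two-word model: one unary potential and one pairwise constraint.
domains = {"bank": ["river", "money"], "deposit": ["sediment", "payment"]}
potentials = [
    (("bank",), {("river",): 0.3, ("money",): 0.7}),
    (("bank", "deposit"), {
        ("river", "sediment"): 0.9, ("river", "payment"): 0.1,
        ("money", "sediment"): 0.1, ("money", "payment"): 0.9,
    }),
]
result = gibbs_map(domains, potentials)
```

All potentials in this toy model are strictly positive, so every assignment remains reachable by the sampler.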

There is one restriction on the application of the stochastic simulation technique described in section 1.2.3 to models formulated from logical constraints. In order for the proof of convergence to hold, the joint distribution of all variables must be strictly positive. (Technically, the requirement is that the Gibbs sampler be irreducible[75], but the simplest way of assuring irreducibility is to assure that the joint distribution is strictly positive.) This means that logical constraints may not be assigned absolute certainty (i.e., given a probability of 1.0), so that alternative values do not have a zero probability. And because the rate of convergence is affected by small probabilities[17], the probabilities assigned to logical constraints may have to be adjusted to speed convergence.
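One simple way to satisfy this restriction, sketched below, is to replace the 0/1 truth table of a logical constraint with values epsilon and 1 - epsilon. The function name soften and the default epsilon of 0.01 are assumptions; the proposal does not fix a particular value.

```python
def soften(hard_table, epsilon=0.01):
    """Turn a True/False constraint table into strictly positive compatibilities."""
    if not 0.0 < epsilon < 0.5:
        raise ValueError("epsilon must lie strictly between 0 and 0.5")
    return {
        assignment: (1.0 - epsilon if allowed else epsilon)
        for assignment, allowed in hard_table.items()
    }

# Truth table of human --> animate, keyed by (human, animate).
hard = {
    (True, True): True,
    (True, False): False,
    (False, True): True,
    (False, False): True,
}
soft = soften(hard)
```

A larger epsilon moves the smallest probabilities away from zero, which is the kind of adjustment that may be needed to speed convergence, at the cost of enforcing the constraint less strictly.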

The last issue to be addressed in this section is the size of the model and the corresponding complexity of the stochastic simulation. As described so far, the model would contain nodes corresponding to one or more concept taxonomies, in addition to nodes corresponding to approximately 600 words and their contextual features. But as stated in section 1.2.3, the time required for the stochastic simulation is proportional to the number of nodes plus the number of edges in the network. In order to assure that the time requirements for the stochastic simulation are not extreme, we propose to limit the model to only the nodes needed to disambiguate the current input sentence. We propose to do this by dynamically creating a network corresponding to each input sentence. This will be done using ``marker-passing,'' as described in Charniak and Goldman[18], to select the portions of the complete model needed to process each specific input sentence.
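As a rough illustration of selecting a sentence-specific subnetwork, the sketch below spreads markers outward from the nodes for the words in the input sentence and keeps only the nodes reached within a fixed number of steps. This bounded breadth-first search is a simplified stand-in, not the marker-passing algorithm of Charniak and Goldman[18]; the adjacency structure, node names, and depth limit are hypothetical.

```python
from collections import deque

def select_subnetwork(adjacency, sentence_nodes, max_depth=2):
    """Keep the nodes reachable from the sentence's words within max_depth steps."""
    depth = {node: 0 for node in sentence_nodes if node in adjacency}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # markers stop spreading at the depth limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)

# Hypothetical fragment of the complete model.
adjacency = {
    "dog": ["canine"],
    "canine": ["dog", "carnivore"],
    "carnivore": ["canine", "animal"],
    "animal": ["carnivore", "animate"],
    "animate": ["animal", "human"],
    "human": ["animate"],
    "bank": ["institution"],
    "institution": ["bank"],
}
selected = select_subnetwork(adjacency, ["dog"], max_depth=2)
```

Only the selected nodes and the edges among them would be passed to the stochastic simulation, so its cost depends on the sentence rather than on the size of the complete model.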


