Up: Description of Proposed Work
Previous: Proposed Work (I): Extending
Methodology
From a practical point of view, many tasks in NLP require word-sense
disambiguation. As early as 1960, Bar-Hillel[5]
argued that lexical ambiguity was the major impediment to machine
translation. In information retrieval, documents and queries are
typically represented by the words (i.e., terms) that they contain.
Such representations would certainly be improved if unambiguous
lexical items were used as opposed to words. In general, corpus
analysis and lexicography could be automated more if the words in the
corpora were disambiguated. It could be said that the accuracy of
text processing, as a whole, is hampered by unresolved word senses.
The classification of an ambiguous word in natural language must be
decided by the context in which it is used. The question becomes what
aspects of context are most indicative of the classification, and what
kind of knowledge is required to evaluate those contextual features.
Systems that attempt to duplicate a human's ability to understand the
context in terms of accumulated world and language knowledge rely on
knowledge-bases that are manually created to capture human judgment in
the structure of the information contained. Other researchers have
proposed to ``understand'' language in terms of its statistical
properties.
The difference between the information that people use to resolve
ambiguity and the information that is used in most statistical
classifiers is clear: people often make use of global organizational
structures, while the typical probabilistic classifier uses only the
immediate local structure. Such global organizational structures
embody world knowledge that is generally considered to be too complex
to extract from training data using statistical techniques. This
knowledge is usually expressed as relations or constraints among the
objects to be classified. A major difficulty in basing a classifier
solely on constraints of this type is that such structures are rarely,
if ever, complete and consistent enough to form the basis of a
large-scale classifier. Therefore, most word-sense classifiers based
on this type of knowledge are little more than demonstration systems.
We propose to develop a word-sense classifier that integrates
information gathered empirically from training data with analytically
derived domain knowledge. The classifier will disambiguate a large
vocabulary of words with respect to fine-grained sense
distinctions made in a standard dictionary. We will use the
statistical techniques described in section 1.2 to
automatically formulate a probabilistic model describing the
relationship between each ambiguous word and a select set of
unambiguous contextual features, without requiring a large amount of
disambiguated text. The individual models, formulated from
training data, will then be interconnected by a constraint structure
specifying the semantic relationships that exist among the senses of
ambiguous words. Fortunately, there are existing concept taxonomies
that describe the semantic relationships among word senses. Examples
of such theories include the ``subject code'' and ``box code''
hierarchies in the Longman Dictionary of Contemporary English
[54], the hyponymy and meronymy taxonomies in WordNet
[51], and the various taxonomies derived from machine-readable
dictionaries [41],[14],[63],[19],[50],[2]. While
the information expressed in these theories is perhaps too complex to
derive from training data using statistical techniques, it is not too
complex to express in the form of a statistical model. In the
remainder of this section, we first describe a method for
combining empirically and analytically derived knowledge to form a
single probabilistic model, and then present a specific approach
to producing a large-scale probabilistic classifier, along with a
description of how word-sense disambiguation will be performed by that
classifier.
In this section, we describe methods for (1) representing, as
conditional independence relationships, constraints that are
formulated in propositional logic, (2) assigning probabilities to
these relationships, and (3) combining such relationships with those
derived from training data, to produce a graphical model. The methods
outlined are specific to the type of logical theories used in our
intended applications. These applications are existing concept
taxonomies, such as WordNet[51], that specify
interdependencies among the senses of the ambiguous words in a
sentence.
Formulating a probabilistic model involves three major sub-tasks:
defining the random variables in the model, identifying the
dependencies among those variables, and quantifying those dependency
relationships. In keeping with previous approaches[3],
propositions are, in general, mapped to discrete random variables. An
exception to this mapping is made for propositions corresponding to
word senses. Each ambiguous word will map to a single random variable,
with values corresponding to the possible senses of that word. (The
possible values for each word could also include the null value,
corresponding to the absence of that word.) This
treatment is necessary to capture the implicit relationship that
exists between a word and its senses, i.e., word --> wordSense_1 ||
wordSense_2 || ... || wordSense_n, as well as the
mutual exclusion relationship that exists among the senses of a
word, i.e., wordSense_1 --> !wordSense_2 && !wordSense_3 && ... && !wordSense_n.
In logic, connectives, such as conjunction and implication, are used
to describe relationships among propositions; in a probabilistic
model, these relationships are expressed as statements of conditional
independence among the variables.
We require that the axioms for each proposition
be covering, and that the set of axioms, as a whole, be acyclic.
Given that these properties hold, all propositions (random variables)
in a single axiom can be treated as interacting, while those
propositions not together in any single axiom can be treated as not
interacting. For example, an axiom of the form human --> animate,
where human and animate refer to type primitives, would
be viewed as specifying an interdependency between the variables
corresponding to human and animate. If there is also an axiom of
the form animal --> animate, with animal referring to a
type primitive, then there would also exist an interdependency between
the variables corresponding to animal and animate. But based on
this information alone, no dependency between animal and human
would be established.
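The rule above can be illustrated with a minimal sketch. It assumes each axiom is supplied simply as the collection of proposition names it mentions; the function name dependency_edges and this input format are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def dependency_edges(axioms):
    """Return the undirected edges implied by a list of axioms.

    Each axiom is given as the collection of proposition names it mentions
    (an assumed input format). Propositions that appear together in one
    axiom are treated as interacting; all other pairs are left unconnected.
    """
    edges = set()
    for props in axioms:
        for a, b in combinations(sorted(set(props)), 2):
            edges.add((a, b))
    return edges


# human --> animate and animal --> animate, written as the propositions
# each axiom mentions.
axioms = [("human", "animate"), ("animal", "animate")]
edges = dependency_edges(axioms)
```

With these two axioms, animate is linked to both human and animal, while no edge is created between animal and human.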
Once the dependencies among variables have been identified, they must
be quantified. The theory of Markov fields [53],[39],[21] provides a
method for assigning complete and consistent probabilities given the
dependency structure of the model and numerical measures of the
compatibility of the values of interdependent variables, called
``compatibility measures.'' Pearl [53] refers to this method as
``Gibbs' potential.'' The advantages of this method are that the
compatibility measures need only quantify the local interactions
within the sets of completely interdependent variables, and that the
numerical values assigned to these local interactions need not be
globally consistent. Compatibility measures are assigned in
accordance with the semantics of propositional logic. Returning to
the previous example of human --> animate, a high compatibility
measure would be assigned to each of the three value pairs that
satisfy the axiom: (human=true, animate=true), (human=false,
animate=false), and (human=false, animate=true). A low compatibility
measure would be assigned to the one pair that violates it:
(human=true, animate=false).
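As a rough illustration of how such compatibility measures yield a joint distribution, the sketch below multiplies the measures for the two example axioms (human --> animate and animal --> animate) and normalizes the result. The numerical values HIGH and LOW are hypothetical; the text does not specify them.

```python
from itertools import product

# Hypothetical compatibility values. LOW is kept above zero so that the
# resulting joint distribution is strictly positive.
HIGH, LOW = 1.0, 0.01


def implication_compat(antecedent, consequent):
    """Compatibility measure for an axiom antecedent --> consequent."""
    return LOW if (antecedent and not consequent) else HIGH


def joint_distribution():
    """Normalize the product of compatibility measures over all assignments."""
    weights = {}
    for human, animal, animate in product([True, False], repeat=3):
        weights[(human, animal, animate)] = (
            implication_compat(human, animate)
            * implication_compat(animal, animate)
        )
    z = sum(weights.values())  # normalizing constant
    return {state: w / z for state, w in weights.items()}


# Keys are (human, animal, animate) triples.
joint = joint_distribution()
```

Assignments that violate an axiom, such as human=true with animate=false, receive a small but nonzero probability.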
The assignment of compatibility measures is less straightforward when
the logical theory is a partially disambiguated taxonomy such as
WordNet. The relationships expressed in such a taxonomy are of the
form dogSense_6 --> canine, where the word dog has been
disambiguated but canine has not. In order to assign compatibility
measures to the possible combinations of the values of the variables
such as dog and canine, it becomes necessary to augment the
information provided in the theory. Specifically, it is necessary to
assign a compatibility measure to the co-occurrence of dogSense_6
with canineSense_1 or canineSense_2 when all that is known is that
dogSense_6 is related to one or more of the senses of canine. When
the relationships among the alternative values of variables are not
fully specified in the theory, we propose to assign uninformative
compatibility measures to the unconstrained relationships. An
uninformative compatibility measure corresponds to a uniform
distribution of possible outcomes. Returning to the example of the
relationship between dogSense_6 and canine, the co-occurrence of
dogSense_6 with each sense of canine would receive the same
(uninformative) compatibility measure.
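A minimal sketch of this assignment follows. It builds a compatibility table for an axiom such as dogSense_6 --> canine, giving every unconstrained pair the same value. Several details are assumptions made only for illustration: the numerical values, the second dog sense (dogSense_1), and the treatment of the null value of canine (represented by None) as violating the axiom.

```python
def taxonomy_compat(source_sense, source_values, target_senses,
                    uninformative=1.0, low=0.01):
    """Compatibility table for an axiom source_sense --> target word.

    Every pair not constrained by the axiom receives the same
    (uninformative) value. The pair of source_sense with the null value
    of the target word (None) is treated as violating the axiom; this
    treatment is an assumption of the sketch.
    """
    table = {}
    for s in source_values:
        for t in list(target_senses) + [None]:
            violated = (s == source_sense and t is None)
            table[(s, t)] = low if violated else uninformative
    return table


def normalized_row(table, source_value):
    """Normalize one row of the table in isolation, for inspection only."""
    row = {t: w for (s, t), w in table.items() if s == source_value}
    z = sum(row.values())
    return {t: w / z for t, w in row.items()}


table = taxonomy_compat("dogSense_6",
                        ["dogSense_1", "dogSense_6"],
                        ["canineSense_1", "canineSense_2"])
given_dog6 = normalized_row(table, "dogSense_6")
```

In this table, dogSense_6 is equally compatible with canineSense_1 and canineSense_2, so the axiom alone expresses no preference between them.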
The Gibbs' potential can also be used to merge the model derived from
logical constraints with those developed from the training data. In
merging multiple models, the compatibility measures are formulated
from the parameters of the models to be merged.
The first phase of construction will be to formulate, from training
data, a decomposable model for each word to be disambiguated. The
candidate non-classification variables will include the following
(more precisely, these are
interpretations of such variables): the morphology of
the ambiguous word, the part-of-speech (POS) categories of the
immediately surrounding words, specific collocations, sentence
position, and aspects of the phrase structure of the sentence. The
preprocessing necessary to identify these features will be subsumed by
that required for the discourse processor; our plans for this
preprocessing are given in section 3.3.
Once a set of contextual features has been chosen as described in
section 1.1.2, the form of the model for each word is
selected from the class of decomposable models as described in section
1.2.1. Estimates of the model parameters are then obtained from
untagged data as described in section 1.2.2. As described in
section 1.2.1, selection of the form of a model requires a small
amount of tagged data indicating the senses of the ambiguous word. It
may be possible to further reduce the total amount of sense-tagged
data needed for this phase of construction by identifying a parametric
model or models applicable to a wide range of content words, as
described in our previous work[12].
In the second phase of construction, the statistical models for the
individual words are interconnected by a constraint structure
specifying the semantic relationships that exist among the senses of
words, using the approach described in section 2.2. Several
of the concept taxonomies cited in section 2.1 appear to
be applicable, and it should be possible to include multiple theories
in a single model, provided that they are based on the same sense
distinctions. The selection of specific domain theories to be used in
constructing the proposed word-sense classifier will be made after a
preliminary analysis of the relational structures. The central
concerns of this analysis will be completeness and consistency.
Using the model constructed as described above, the disambiguation of
all ambiguous words in a sentence will be accomplished via a stochastic
simulation of that model, as described in section 1.2.3. The tag
sequence identified for each sentence via the stochastic simulation
will be the maximum a posteriori probability (MAP) solution to the
satisfaction of all constraints.
There is one restriction on the application of the stochastic
simulation technique described in section 1.2.3 to models
formulated from logical constraints. In order for the proof of
convergence to hold, the joint distribution of all variables must be
strictly positive. (Technically, the requirement is that the
Gibbs sampler be irreducible[75], but the simplest way of
assuring irreducibility is to assure that the joint distribution is
strictly positive.) This means that logical constraints may not be
assigned absolute certainty (i.e., given a probability of 1.0), because
the alternative values would then have zero probability. And because the
rate of convergence is affected by small probabilities[17], the
probabilities assigned to logical constraints
may have to be adjusted to speed convergence.
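The sketch below illustrates the positivity restriction with a small Gibbs sampler over three binary variables. It is only an illustration under stated assumptions, not the simulation procedure of section 1.2.3: the soft-constraint value SOFT, the evidence weights, and the use of the most frequently visited state as an approximation to the MAP solution are all hypothetical choices.

```python
import random
from collections import Counter

VARS = ("human", "animal", "animate")
SOFT = 0.01  # hypothetical weight for a violated constraint; never zero


def weight(state):
    """Unnormalized, strictly positive weight of a full assignment."""
    w = 1.0
    if state["human"] and not state["animate"]:
        w *= SOFT  # soft version of human --> animate
    if state["animal"] and not state["animate"]:
        w *= SOFT  # soft version of animal --> animate
    # Hypothetical evidence terms standing in for the empirical models.
    w *= 0.8 if state["human"] else 0.2
    w *= 0.3 if state["animal"] else 0.7
    return w


def gibbs_map(n_sweeps=5000, seed=0):
    """Run a Gibbs sampler and return the most frequently visited state."""
    rng = random.Random(seed)
    state = {v: False for v in VARS}
    counts = Counter()
    for _ in range(n_sweeps):
        for v in VARS:
            # Resample v from its conditional given the other variables.
            w_true = weight({**state, v: True})
            w_false = weight({**state, v: False})
            state[v] = rng.random() < w_true / (w_true + w_false)
        counts[tuple(state[v] for v in VARS)] += 1
    return counts.most_common(1)[0][0], counts


map_state, counts = gibbs_map()
```

Because SOFT is positive, every assignment can be reached from every other. If SOFT were set to zero, states violating a constraint would have zero probability, which is the situation the restriction above rules out.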
The last issue to be addressed in this section is the size of the model
and the corresponding complexity of the stochastic simulation. As
described so far, the model would contain nodes corresponding to one
or more concept taxonomies, in addition to nodes corresponding to
approximately 600 words and their contextual features. But as
stated in section 1.2.3, the time required for the stochastic
simulation is proportional to the number of nodes plus the number of
edges in the network. In order to assure that the time requirements
for the stochastic simulation are not extreme, we propose to limit the
model to only the nodes needed to disambiguate the current input
sentence. We propose to do this by dynamically creating a network
corresponding to each input sentence. This will be done using
``marker-passing,'' as described in Charniak and Goldman[18], to
select the portions of the complete model
needed to process each specific input sentence.
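The selection step might be sketched as follows. This is a simplified stand-in for marker-passing and is not the algorithm of Charniak and Goldman[18]: markers spread breadth-first from each word in the sentence up to a fixed depth, and only the nodes on paths where markers from two different words meet are kept. The node names, the depth limit, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def undirected(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def pass_markers(graph, source, max_depth):
    """Spread markers breadth-first; return {node: parent} for reached nodes."""
    parent = {source: None}
    queue = deque([(source, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nbr in graph.get(node, ()):
            if nbr not in parent:
                parent[nbr] = node
                queue.append((nbr, depth + 1))
    return parent


def select_subnetwork(graph, sentence_words, max_depth=2):
    """Keep the sentence words plus nodes on paths where markers meet."""
    reached = {w: pass_markers(graph, w, max_depth) for w in sentence_words}
    keep = set(sentence_words)
    for a, b in combinations(sentence_words, 2):
        for node in reached[a].keys() & reached[b].keys():
            for parent in (reached[a], reached[b]):
                step = node
                while step is not None:  # walk back to the source word
                    keep.add(step)
                    step = parent[step]
    return {n: sorted(m for m in graph.get(n, ()) if m in keep) for n in keep}


full_model = undirected([("dog", "canine"), ("canine", "animal"),
                         ("cat", "feline"), ("feline", "animal"),
                         ("bank", "institution")])
sub = select_subnetwork(full_model, ["dog", "cat"])
```

For the sentence words dog and cat, the selected network contains the path through animal and omits the unrelated nodes bank and institution.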
Next: Proposed Work (III): Discourse Application
2.1 Introduction
2.2 Combining Empirical and Analytical Constraints
2.3 Proposed Implementation
The proposed word-sense classifier will be large-scale, capable of
disambiguating close to 600 content words. Disambiguation will be
with respect to a known standard such as WordNet ``synsets''
[51] or standard dictionary definitions. The classifier
will disambiguate all targeted content words in a single sentence
simultaneously via the method of probabilistic constraint satisfaction
described in section 1.2.3. In this section, we describe, in
greater detail, the steps required to construct such a classifier.
2.4 Resolution of Ambiguity Through Simultaneous Satisfaction of all Constraints