Many researchers have avoided characterizing the interactions among
multiple contextual features by considering only one feature in
determining the sense of an ambiguous word. Techniques for
identifying the optimum feature to use in disambiguating a word are
presented in [22], [73], and [10].
Other works consider multiple contextual features in performing
disambiguation without formally characterizing the relationships among
the features. The majority of these efforts [34],[76]
weight each feature in predicting the sense of an
ambiguous word in accordance with frequency information, without
considering the extent to which the features co-occur with one
another. Gale, Church and Yarowsky[28] and Yarowsky[74] formally
characterize the interactions that they
consider in their model, but they simply assume that their model
fits the data.
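To make concrete what it means to weight each feature by frequency while ignoring how features co-occur, here is a minimal sketch. It is not the method of any work cited above; it is a generic naive-Bayes-style scorer. The toy data, the function names, and the add-one smoothing are all assumptions made for illustration. Only features that are present in the context contribute, which is a further simplification.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sense-tagged contexts for the ambiguous word "bank".
EXAMPLES = [
    ("finance", {"money", "loan"}),
    ("finance", {"money", "deposit"}),
    ("river", {"water", "shore"}),
    ("river", {"water", "fishing"}),
]

def train(examples):
    """Count how often each sense and each (sense, feature) pair occurs."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for sense, features in examples:
        sense_counts[sense] += 1
        for feature in features:
            feature_counts[sense][feature] += 1
            vocab.add(feature)
    return sense_counts, feature_counts, vocab

def score(sense, features, model):
    """Log-score of a sense: prior plus one independent term per feature.

    Each feature contributes on its own; no term depends on which other
    features are present, so co-occurrence among features is ignored.
    """
    sense_counts, feature_counts, vocab = model
    total = sum(sense_counts.values())
    log_p = math.log(sense_counts[sense] / total)
    for feature in features:
        if feature not in vocab:
            continue  # unseen features carry no evidence in this sketch
        # Add-one smoothing (an assumption) avoids log(0).
        p = (feature_counts[sense][feature] + 1) / (sense_counts[sense] + 2)
        log_p += math.log(p)
    return log_p

def disambiguate(features, model):
    """Return the sense with the highest independent-feature score."""
    sense_counts = model[0]
    return max(sense_counts, key=lambda s: score(s, features, model))

model = train(EXAMPLES)
```

Because every feature adds its own term to the score, two features that always appear together are counted as two separate pieces of evidence. That is the kind of uncharacterized interaction the paragraph above refers to.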
Other approaches to systematically combining information from multiple
contextual features have been proposed and used in word-sense
disambiguation. Schütze[57] derived contextual features
from a singular value decomposition of a matrix of letter four-gram
co-occurrence frequencies, thereby ensuring the independence of all
features. Further, the method that he used does not require
sense-tagged data. Unfortunately, interpreting a contextual feature
that is a weighted combination of letter four-grams is difficult.
In addition, the clustering procedure used to assign word meaning based on
these features is such that the resulting sense clusters do not have
known statistical properties. This makes it impossible to generalize
the results to other data sets.
Black[8] used decision trees[9] to
define the relationships among a number of prespecified contextual
features, which he called "contextual categories," and the sense
tags of an ambiguous word. The tree construction process
partitions the data according to the values of one contextual
feature before considering the values of the next, thereby treating
all features in each branch of the tree as interdependent. The tree
construction process requires a large amount of tagged data. The
method presented here is more flexible because it removes the
need to treat features as interdependent. Further, it requires only a
small amount of tagged data.
Although they have never been applied to word-sense disambiguation,
n-gram or Markov chain models are the most popular statistical models
in NLP. These
models are members of the class of graphical models, and methods have
been developed for both estimating the parameters of these models from
untagged data[6],[40] and resolving multiple
ambiguities expressed in terms of these models[27].
Unfortunately, Markov chain models are capable of expressing only a
restricted set of dependencies among variables. The most general and
therefore the most expressive class of models used in NLP is the class
of maximum entropy models [55],[56],[45],[46]. While this class is
more general than the class of graphical models, the only method
currently used in
formulating these models provides maximum likelihood parameter
estimates from tagged data. Although, in theory, the techniques
proposed here may be applicable to maximum entropy models, their
application to models in that class as a whole would be
computationally infeasible.
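As an illustration of the restriction noted above for Markov chain models, consider the first-order (bigram) case. The symbols x_1, ..., x_n for the sequence of variables are assumptions introduced here, not notation from the cited works. The chain rule places no restriction on the joint distribution, whereas the Markov chain keeps only the dependence of each variable on its immediate predecessor:

```latex
% Unrestricted joint distribution (chain rule):
P(x_1, \ldots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_1, \ldots, x_{i-1})

% First-order Markov chain: each variable depends only on its predecessor
P(x_1, \ldots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1})
```

Any dependency between x_i and a variable other than x_{i-1} can be expressed only indirectly, through the chain of intermediate variables.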
The approaches to word-sense disambiguation reviewed thus far have
used a statistical model derived from frequency information in the
training data. We propose to make use of hand-crafted concept
taxonomies in formulating our classifier, as well. While this is the
first word-sense disambiguation application that combines these two types
of knowledge within the formalism of a probabilistic model, it is not
the first application of concept taxonomies to word-sense
disambiguation. Such knowledge structures have played an important
role in most previous work done in word-sense disambiguation[31],
[20],[64],[37],[49].
There has even been previous work on representing such structures as
statistical models, where the models used are similar in form to the
models described here[25].
There has been recent growth in empirically oriented work in discourse
processing. Several researchers have addressed evaluation,
investigating the degree to which human subjects agree with one
another on discourse tasks [52],[35],[33]. Others have used frequency
information to evaluate
algorithms. Passonneau and Litman [52] tagged a corpus
with classes and features and tested algorithms hypothesized from the
literature. They consider just one feature at a time, and so do not
address interactions among multiple features.
Some researchers have derived algorithms and models on the basis of
frequency information, but have not used systematic, formally
characterized methods for developing their models. Aone and
McKee [1], who
address anaphor resolution, train their discourse module on a
corpus simply by trying it out with different argument settings.
Hirschberg and Litman [36], who
derive a model for cue-phrase disambiguation
on the basis of intonational and orthographic features, formally
characterize the interactions among the variables that
they consider, but their
method for deriving their model was manual and not formally
characterized. Hearst [32], who addresses a broad segmentation
problem similar to the one addressed here, uses frequency of lexical
terms as the criterion by which segmentation is performed. Her method
is specific to using this one type of contextual feature.
The most formal approach to developing models that has been applied to
discourse problems makes use of decision trees [58], [60], [48].
Above, we cited the advantages of our proposed method over this approach.