Proposal: Main Relations to Other Work



Next: References

Up: Description of Proposed Work

Previous: Evaluation


Many researchers have avoided characterizing the interactions among multiple contextual features by considering only one feature in determining the sense of an ambiguous word. Techniques for identifying the optimum feature to use in disambiguating a word are presented in [22],[73] and [10]. Other works consider multiple contextual features in performing disambiguation without formally characterizing the relationships among the features. The majority of these efforts [34],[76] weight each feature in predicting the sense of an ambiguous word in accordance with frequency information, without considering the extent to which the features co-occur with one another. Gale, Church and Yarowsky[28] and Yarowsky[74] formally characterize the interactions that they consider in their model, but they simply assume that their model fits the data.
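To make the concern about ignored co-occurrence concrete, the following is a minimal sketch of a classifier that weights each feature independently from frequency counts. It is not the method of any of the cited works; the function names, the toy data for the word "bank", and the add-alpha smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Collect counts from (sense, set_of_features) pairs."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)  # sense -> feature -> count
    for sense, features in examples:
        sense_counts[sense] += 1
        for f in features:
            feature_counts[sense][f] += 1
    return sense_counts, feature_counts

def score(sense, features, sense_counts, feature_counts, alpha=1.0):
    """Log-score built by adding one independent weight per feature.

    Add-alpha smoothing is an assumption of this sketch (avoids log(0)).
    """
    n = sense_counts[sense]
    total = sum(sense_counts.values())
    s = math.log(n / total)
    for f in features:
        # Each feature contributes on its own; co-occurrence is never consulted.
        s += math.log((feature_counts[sense][f] + alpha) / (n + 2 * alpha))
    return s

def classify(features, sense_counts, feature_counts):
    """Return the sense with the highest independent-weight score."""
    return max(sense_counts,
               key=lambda s: score(s, features, sense_counts, feature_counts))

# Hypothetical sense-tagged contexts for "bank".
examples = [
    ("money", {"deposit", "account"}),
    ("money", {"deposit", "account"}),
    ("money", {"loan"}),
    ("river", {"water"}),
    ("river", {"water", "deposit"}),
]
sense_counts, feature_counts = train(examples)

def gap(features):
    """Score margin of 'money' over 'river' for a given context."""
    return (score("money", features, sense_counts, feature_counts)
            - score("river", features, sense_counts, feature_counts))

gap_one = gap({"deposit"})
gap_two = gap({"deposit", "account"})
predicted = classify({"deposit", "account"}, sense_counts, feature_counts)
```

In this toy data "account" never appears without "deposit" in the "money" contexts, yet adding it still widens the margin between the two senses. That double counting is the kind of effect that goes unexamined when relationships among features are not characterized.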

Other approaches to systematically combining information from multiple contextual features have been proposed and used in word-sense disambiguation. Schutze[57] derived contextual features from a singular value decomposition of a matrix of letter four-gram co-occurrence frequencies, thereby ensuring the independence of all features. Further, the method that he used does not require sense-tagged data. Unfortunately, interpreting a contextual feature that is a weighted combination of letter four-grams is difficult. In addition, the clustering procedure used to assign word meaning based on these features yields sense clusters that have no known statistical properties. This makes it impossible to generalize the results to other data sets.

Black[8] used decision trees[9] to define the relationships among a number of prespecified contextual features, which he called ``contextual categories,'' and the sense tags of an ambiguous word. The tree construction process partitions the data according to the values of one contextual feature before considering the values of the next, thereby treating all features in each branch of the tree as interdependent. Because the data are partitioned repeatedly, the tree construction process requires a large amount of tagged data. The method presented here is more flexible because it removes the need to treat features as interdependent. Further, it requires only a small amount of tagged data.

The most popular statistical models in NLP are n-gram or Markov chain models, although they have never been applied to word-sense disambiguation. These models are members of the class of graphical models, and methods have been developed for both estimating the parameters of these models from untagged data[6],[40] and resolving multiple ambiguities expressed in terms of these models[27]. Unfortunately, Markov chain models are capable of expressing only a restricted set of dependencies among variables. The most general and therefore the most expressive class of models used in NLP is the class of maximum entropy models [55],[56],[45],[46]. While this class is more general than the class of graphical models, the only method currently used in formulating these models provides maximum likelihood parameter estimates from tagged data. Although, in theory, the techniques proposed here may be applicable to maximum entropy models, their application to models in that class as a whole would be computationally infeasible.
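The restricted dependency structure of a Markov chain model can be seen in a minimal bigram sketch, in which each token is conditioned only on the token immediately before it. This example estimates parameters by relative frequency from fully observed sequences; it does not illustrate the estimation from untagged data cited above. The function names, boundary markers, and toy sentences are assumptions made for illustration.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers

def train_bigram(sentences):
    """Count adjacent token pairs; counts[prev][cur] is the pair frequency."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = [START] + sent + [END]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def prob(counts, prev, cur):
    """Relative-frequency estimate of P(cur | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

def sentence_prob(counts, sent):
    """Probability of a sentence as a product of one-step transitions."""
    tokens = [START] + sent + [END]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        # The only dependency the model can express: cur given prev.
        p *= prob(counts, prev, cur)
    return p

sentences = [["the", "bank", "closed"], ["the", "river", "bank"]]
counts = train_bigram(sentences)
p_bank_after_the = prob(counts, "the", "bank")
p_sentence = sentence_prob(counts, ["the", "bank", "closed"])
p_unseen_start = sentence_prob(counts, ["bank", "the"])
```

Any dependency between non-adjacent tokens, such as one between "river" and a word two positions later, has no parameter in this model. That is the sense in which the set of expressible dependencies is restricted.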

The approaches to word-sense disambiguation reviewed thus far have used a statistical model derived from frequency information in the training data. We propose to make use of hand-crafted concept taxonomies in formulating our classifier, as well. While this is the first word-sense disambiguation application that combines these two types of knowledge within the formalism of a probabilistic model, it is not the first application of concept taxonomies to word-sense disambiguation. Such knowledge structures have played an important role in most previous work done in word-sense disambiguation[31],[20],[64],[37],[49]. There has even been previous work on representing such structures as statistical models, where the models used are similar in form to the models described here[25].

There has been recent growth in empirically oriented work in discourse processing. Several researchers have addressed evaluation, investigating the degree to which human subjects agree with one another on discourse tasks [52],[35],[33]. Others have used frequency information to evaluate algorithms. Passonneau and Litman [52] tagged a corpus with classes and features and tested algorithms hypothesized from the literature. They consider just one feature at a time, and so do not address interactions among multiple features.

Some researchers have derived algorithms and models on the basis of frequency information, but have not used systematic, formally characterized methods for developing their models. Aone and McKee [1], who address anaphor resolution, train their discourse module on a corpus simply by trying it out with different argument settings. Hirschberg and Litman [36], who derive a model for cue-phrase disambiguation on the basis of intonational and orthographic features, formally characterize the interactions among the variables that they consider, but their method for deriving their model was manual and not formally characterized. Hearst [32], who addresses a broad segmentation problem similar to the one addressed here, uses frequency of lexical terms as the criterion by which segmentation is performed. Her method is specific to using this one type of contextual feature.

The most formal approach to developing models that has been applied to discourse problems makes use of decision trees [58],[60],[48]. The advantages of our proposed method over this approach were cited above, in the discussion of decision trees for word-sense disambiguation.


