Proposal: Contributions to the objectives of ONR



next up previous
Next: Description of Proposed Work Up: Abstract Previous: Abstract

This work is intended to advance the technology of natural language processing, a stated objective of the Artificial Intelligence subdivision of the ONR. We will apply statistical techniques to two challenging and diverse NLP tasks, one in the area of word-sense disambiguation and the other in the area of discourse processing. In particular, we will develop probabilistic classifiers --- systems that perform disambiguation by assigning, out of a set of classes, the one that is most probable according to a probabilistic model. The model expresses the relationships among the classification variable and variables that correspond to properties of the ambiguous object and the context in which it occurs (the non-classification variables).

The systems developed in this work will be fully automatic. Further, because the approach is statistical, knowledge needed to perform disambiguation will be acquired automatically, the systems will be designed to run over unlimited text, i.e., will not be tuned to specific examples, and existing, well-defined metrics such as perplexity [44] are available for evaluating system performance. This kind of approach is also valuable as a formal empirical study that provides insights into correlations that exist in the data. Similar statistical techniques have been developed in the areas of expert systems, computer vision, and speech recognition. Thus, the applicability of these techniques spans these various areas, potentially facilitating integration of modules for various AI tasks.

Statistical approaches to NLP are typically limited to simple models that include only a small number of immediately surrounding non-classification variables. The proposed work will extend our previous work in addressing these limitations[11],[12],[13]. Exploiting recent developments in applied statistics, we will use a richer class of statistical models than is typically used in NLP, along with a set of tools for: (1) fitting such models to the data, (2) estimating the parameters of the chosen model from untagged data, and (3) resolving interdependent ambiguities. Together, these techniques will make it computationally feasible to automatically develop and apply probabilistic models that express a complex set of relationships among a diverse set of variables. This work will advance statistical NLP toward expressing and using the types of knowledge typically thought to be necessary for high-level NLP tasks, such as the two we address in this work.

One of the tasks we address is to disambiguate a large vocabulary of words with respect to full sets of sense distinctions from a published source such as Longman's on-line dictionary[54]. Statistical techniques require a training corpus in which all targeted ambiguities have been resolved. Because many ambiguities are targeted in this application, the amount of tagged data needed for each must be limited. The statistical techniques we propose to use will enable us to accomplish this while still providing the ability to identify models that are good descriptions of the data.

Another important feature of this application is that existing, analytically derived, domain knowledge will be combined with statistically derived, empirical knowledge. In particular, both kinds of knowledge will be combined into a single probabilistic model by expressing both in terms of conditional independence. The domain knowledge to be used, concept hierarchies, will provide semantic constraints among word senses, and statistical techniques are presented for exploiting these constraints to simultaneously disambiguate all words in a sentence.

The second application is a discourse processing task that involves segmentation, reference resolution, and belief. Specifically, the problem is to segment a text into blocks that express the beliefs and opinions of a single agent, and to identify noun phrases that refer to that agent. In our method for developing classifiers, rather than making assumptions about which variables to use and how they are related, statistical techniques are used to explore these questions empirically. Further, the types of models used in this work can express complex relationships among diverse sets of variables. These advantages are particularly important for high-level discourse tasks, for which a tremendous number of features and interactions among them seem potentially relevant.

We will draw upon an existing algorithm[66] as a source of potentially relevant non-classification variables. This algorithm is based on regularities, i.e., combinations of various features, observed in naturally-occurring text. A number of these features are syntactic and semantic distinctions that can feasibly be approximated automatically.





next up previous
Next: Description of Proposed Work Up: Abstract Previous: Abstract