This work is intended to advance the technology of
natural language processing, a stated objective of the
Artificial Intelligence subdivision of the ONR.
We will apply statistical techniques to two challenging and diverse
NLP tasks, one in the area of word-sense disambiguation and the other
in the area of discourse processing.
In particular, we will develop
probabilistic classifiers --- systems that perform
disambiguation by assigning, out of a set of classes, the one that is
most probable according to a probabilistic model. The model
expresses the relationships among the classification variable and
variables that correspond to properties of the ambiguous object and
the context in which it occurs (the non-classification
variables).
The systems developed in this work will be fully automatic. Further,
because the approach is statistical, knowledge needed to perform
disambiguation will be acquired automatically, the systems will be
designed to run over unlimited text, i.e., will not be tuned to specific
examples, and existing, well-defined metrics such as perplexity
[44] are available for evaluating system performance.
This kind of approach is also valuable as a formal empirical study
that provides insights into correlations that exist in the data.
Similar statistical techniques
have been developed
in the areas of expert systems, computer
vision, and speech recognition. Thus, the applicability
of these techniques spans these various areas, potentially
facilitating integration of modules for various AI tasks.
Statistical approaches to NLP are typically limited to simple models
that include only a small number of immediately surrounding
non-classification variables. The proposed work will extend our
previous work in addressing these limitations[11],[12],[13].
Exploiting recent developments in applied statistics, we will use a
richer class of statistical models than is typically used in NLP,
along with a set of tools for: (1) fitting such models to the data,
(2) estimating the parameters of the chosen model from untagged data,
and (3) resolving interdependent ambiguities. Together, these
techniques will make it computationally feasible to automatically
develop and apply probabilistic models that express a complex set of
relationships among a diverse set of variables. This work will advance
statistical NLP toward expressing and using the types of knowledge
typically thought to be necessary for high-level NLP tasks, such as
the two we address in this work.
One of the tasks we address is to disambiguate a large vocabulary of
words with respect to full sets of sense distinctions from a published
source such as Longman's on-line dictionary[54].
Statistical techniques
require a training corpus in which all targeted ambiguities have been
resolved. Because many ambiguities are targeted in this application,
the amount of tagged data needed for each must be limited. The
statistical techniques we propose to use will enable us to accomplish
this while still providing the ability to identify models that are
good descriptions of the data.
Another important feature of this application is that existing,
analytically derived, domain knowledge will be combined with
statistically derived, empirical knowledge. In particular, both kinds
of knowledge will be combined into a single probabilistic model by
expressing both in terms of conditional independence. The domain
knowledge to be used, concept hierarchies, will provide semantic
constraints among word senses, and statistical techniques are presented for
exploiting these constraints to simultaneously disambiguate all words
in a sentence.
The second application is a discourse processing task that involves
segmentation, reference resolution, and belief. Specifically, the
problem is to segment a text into blocks that express the beliefs and
opinions of a single agent, and to identify noun phrases that refer to
that agent. In our method for developing
classifiers, rather than making assumptions about which variables to
use and how they are related, statistical techniques are used to
explore these questions empirically. Further, the types of models
used in this work can express complex relationships among diverse sets
of variables. These advantages are particularly important for
high-level discourse tasks, for which a tremendous number of features
and interactions among them seem potentially relevant.
We will draw upon an existing algorithm[66] as a source
of potentially relevant non-classification variables. This algorithm
is based on regularities, i.e., combinations of various features,
observed in naturally-occurring text. A number of these features are
syntactic and semantic distinctions that can feasibly be approximated
automatically.
Next: Description of Proposed Work
Up: Abstract
Previous: Abstract
Next: Description of
Proposed Work
Up: Abstract
Previous: Abstract