Up: Description of Proposed Work
Previous: Description of Proposed Work
We propose to extend our previous work in developing probabilistic
classifiers in three important ways. The first two involve
improvements to make large-scale applications feasible.
In particular, we will improve the methods for selecting the form of
the model and estimating model parameters so that the amount of
hand-tagged data required is drastically reduced. The third extension
will enhance the capabilities of the method, enabling us to resolve
interdependent ambiguities.
We first provide background, describing the classes of models used in
this work (in section 1.1.1) and describing our previous
work in developing probabilistic classifiers [11],[12],[13]
(in section 1.1.2). Then we describe our proposed
extensions to the method (in section 1.2).
In this work, we propose to make use of a more general class of
models than previously used in most NLP applications (graphical
models, [65]). Such models are capable of characterizing
a rich set of relationships among a large number of variables, where
these variables may be of different types as well as spatially
separated.
Graphical models are a subset of the class of hierarchical log-linear
models and, as such, are well-suited to characterizing and studying
the structure of data, that is, the interactions among
variables as evidenced by the frequency with which the values of the
variables co-occur[7].
A log-linear model expresses the probability
distribution of a set of variables as a function of the
interdependencies that exist among those variables. Formulating such
a model involves specifying that certain variables do not interact
while allowing the remaining interactions among variables to be
arbitrary and unknown.
The probability distribution of a graphical model corresponds to a
Markov field. This is because a graphical model is a probability
model for multivariate random observations whose dependency structure
has a valid graphical representation, a dependency graph. The
dependency graph of a model is formed by mapping each variable in the
model to a node in the graph, and then drawing undirected edges
between the nodes corresponding to variables that are interdependent.
A valid dependency graph has the property that all variables that are
not directly connected in the graph are conditionally independent
given the values of the variables mapping to nodes connecting them.
For example, if node A separates node B from node C in a
dependency graph, then the variable mapping to node B is
conditionally independent of the variable mapping to node C, given
the value of the variable mapping to node A. Not only does a
dependency graph provide a clear interpretation of the interactions
among variables, it also has been shown[53],[47]
to form the basis of a distributed computational
architecture for probabilistic constraint propagation.
An important subset of graphical models is the class of decomposable
models. In addition to having a graphical representation,
decomposable models possess other desirable properties, two of which
are: (1) there are closed-form expressions for the parameter
estimates, and (2) the joint distribution can be expressed as a
product of lower-order marginal distributions, thereby reducing the
number of parameters that need to be estimated. In total,
decomposable models combine a richness in modeling, clarity in
interpretation, and ease of analysis previously unrealized in other
classes of statistical models.
Most previous efforts have not attempted to systematically identify
the interdependencies among the variables involved in the
classification. The assumption is often made that all of the
non-classification variables are either conditionally independent
given the classification variable or fully independent. When such
untested assumptions do not hold, the performance of the classifier
will be negatively affected. Of course, all non-classification could
be treated as interdependent, but, if there are several of them, such
a model could have too many parameters to estimate in practice.
Rather than making assumptions about how the variables are related,
we explore this empirically. In particular, in our previous work
[13],[12],
we formulate a decomposable model that describes the relationships
among all variables in terms of only the most important
interdependencies, that is, the simplest model of this class that fits
the data well. The fit of a model is the degree to which the
data is approximated by that model. Below, we describe the method.
The first step in the process of formulating a probabilistic model is
to select, out of all available non-classification variables, a subset
to include in the model. We identify those that are most strongly
correlated with the classification variable; these are the only
non-classification variables included in the model. A variable is
considered to be correlated with the classification variable if the
model for independence between them fits the data poorly according to
the goodness-of-fit test described below. The worse the fit, the more
correlated the variables are judged to be.
Having selected a set of variables that are highly correlated with
the classification variable, in order to use them in concert to
perform classification, the interdependencies among them must also be
identified. In our method, this is accomplished via a process of
hypothesis testing: patterns of interdependencies among variables are
expressed by decomposable models, and then these models are evaluated
as to how well they fit the training data. The likelihood ratio
statistic, G^2, is used as the measure of the goodness-of-fit of a
model. It is distributed asymptotically as chi^2 with degrees of
freedom corresponding to the number of interactions
omitted from (unconstrained in) the model. The models that are judged
to fit are those whose reference chi^2 values are statistically
significant. Accessing the fit of a model in terms of the
significance of its G^2 statistic gives preference to models with
the fewest number of interdependencies, thereby assuring the selection
of a model specifying only the most systematic variable interactions.
Once the form of the model has been established, maximum likelihood
estimates are used for the parameters of the model. In fact, in this
method, parameter estimation is an integral part of the model selection
process.
Approximating the joint distribution of all variables with a model
containing only the most important systematic interactions among
variables limits the number of parameters to be estimated, supports
computational efficiency, and provides an understanding of the data.
The biggest limitation of the method is that it requires large amounts
of hand-tagged data. The validity of the results obtained using this
approach are compromised when it is applied to a small data sample,
first because maximum likelihood estimates are used for the
parameters, and secondly because asymptotic approximations are used
for the distribution of G^2.
We propose to extend the method described above to better facilitate
developing large-scale applications that involve many features and
interactions among them, such as the applications we will address in
this work. Our approach to developing these applications will be as
follows. Relevant non-classification variables will be selected in
the same way as in the original method. Also, as in the original
method, we will select a decomposable model to express the
interdependencies among variables, but, to meet the demands of
large-scale applications, we will use different methods both for
selecting the form of the model and estimating the parameters of that
model in order to minimize the amount of hand-tagged data required.
These two extensions to the original method are described in sections
1.2.1 and 1.2.2, respectively.
Both applications will involve constraints among ambiguous objects.
In the lexical application, such constraints will arise from domain
knowledge that specifies semantic relations between the words in a
sentence (how such knowledge will be incorporated in the proposed work
is described in a later section, section 2). In the discourse
application, such constraints arise from relationships among the sentences
in a coherent discourse. The third extension to the original method is
a technique for resolving multiple, mutually-constraining ambiguities
in graphical models; this extension is described in section 1.2.3.
Recall that in the original method, large-sample approximations for
the distributions of both the parameter estimators and the likelihood
ratio statistic were used. While our new method for selecting the
form of a model still uses the likelihood ratio statistic (G^2) to
measure the fit of a model to the data, it eliminates (1) the need for
parameter estimation to select the form of the model, and (2) the need
to use large-sample approximations of the distribution of G^2. This
is accomplished by generating the exact conditional distribution of
G^2 in order to determine the significance of the G^2 value
describing the fit of a model. The exact conditional distribution of
G^2 for each test is the distribution of G^2 values that would be
observed for comparable data samples randomly generated from the model
being tested. A comparable data sample is one having the same
sufficient statistics as the actual training data, where the
sufficient statistics are the marginal distributions of the variables
that are specified as interdependent in the model. Using the
conditional distribution of G^2 values, that is, the distribution of
G^2 values calculated for data having the same sufficient statistics
as the training data, eliminates the need to estimate the parameters
of a model when evaluating its fit. Intuitively, when conditioning on
the sufficient statistics, one assumes the dependencies that are
specified in
the model, and tests for the presence of higher-order
interdependencies.
In order to reduce the computational complexity of using exact
conditional tests to evaluate the fit of a model, we will introduce
two simplifications to the procedure. First, the exact significance
of the G^2 value for each model will only be approximated by
sampling from the conditional distribution of G^2 values as
opposed to generating the complete distribution. The significance
assigned to the model G^2 value is then equal to the portion of the
sampled G^2 values that are at least as large as that of the model.
The second simplification restricts the form of the model that is
evaluated in each test. As described above, it is the distribution of
G^2 values conditioned on the sufficient statistics of the model
being tested that is required to define the significance of a test;
unfortunately, it is difficult to generate samples of G^2 values
that are conditioned on the sufficient statistics of
arbitrarily-complex models. For this reason, model evaluation will
be
conducted as a series of tests, where each test evaluates a single
conditional independence between two variables. The complete model
will be
acceptable if all conditional independence relationships expressed in
the model are found to be acceptable.
The method just described for selecting the form of a model requires
a small amount of sense-tagged data (approximately two hundred tagged
instances) in
order to characterize the interdependencies that exist between the
various non-classification variables and the classification variable.
Testing will be required to establish more exact specifications of
the amount of tagged training data required.
Given the form of the model, it should be possible to obtain reliable
estimates of the model parameters from untagged text. The idea is to
treat the missing tags as ``missing data'' and draw on the wealth of
statistical techniques for augmenting incomplete data given some
knowledge of the form of the complete data density.
The scheme we propose to use is a stochastic simulation technique
referred to as The Data Augmentation Algorithm[61]. The
procedure is a Bayesian approach to data augmentation that is very similar in spirit to the EM algorithm[24]. The basic idea is straightforward. The
observed data, y, is augmented by z, the missing data (i.e., the
missing tags). It is assumed that if both y and z were known,
then the posterior density P(theta|y,z), and thus the posterior
estimates of the model parameters, could be easily calculated (where
theta is the vector of model parameters). But the posterior
density that we want is P(theta|y), which may be difficult to
calculate. If, however, one could generate multiple values of z
from $P(z|y)$, then $P(\theta |y)$ could be approximated by
P(theta|y,z) averaged over all values of z. Because generating
values of z from P(z|y) is often hard, what is done is to draw
multiple values of z_i from the distribution
P(z|y,theta_i-1) and then use these values of z_i to form the new
estimates of the model parameters, theta_i. The process iterates
between drawing z_i from P(z|y, theta_i-1) (the ``imputation
step'') and updating the estimate, theta_i, based on sampling from
P(theta|y,z_i) (the ``posterior step''), until the parameter
estimates converge. Tanner and Wong[61] show that
this procedure generally converges to P(theta|y).
For decomposable models, Dawid and Lauritzen[23]
have demonstrated how conjugate prior distributions can be set up so
that both the imputation and posterior steps are straightforward.
In this section, we propose a method for classification when the
classes of different objects are interdependent. Stated another way,
we propose to find the set of classifications that, when taken as a
whole, most completely satisfy the constraints defined by all
interdependencies among variables. Methods for doing this have been
developed for various classes of models. The Viterbi algorithm
[27] can be used to find the classification sequence that
maximizes the probability of a Markov chain. In this work, we are not
restricted to models that are Markov chains and therefore draw on
techniques that are applicable to the broader class of graphical
models. The exact solution of an arbitrary conditional distribution
is difficult, so stochastic simulation techniques, such the ``Gibbs
sampler'' [29], will be used instead to approximate the
desired distribution. The Gibbs sampler is a data augmentation
technique related to the data augmentation algorithm presented in
section 1.2.2.
To see the relationship between the Gibbs sampler and the data
augmentation algorithm, view the correct classifications of the
ambiguous objects as ``missing data,'' as in section 1.2.2. The
difference is that, now, we know the model parameters and are only
interested in estimating the values of the missing data; also, the
model might not be a decomposable model. Thus, we make the following
changes to the algorithm. First, we do not assume that it will be
easy to draw samples of the missing data, z, from the distribution
P(z|y, theta ). Instead, we partition the missing data as z = (z_1
, z_2 , ... , z_n ) to make the drawing of each part of z (i.e.,
z_i) easy given all the other parts of z as well as the observed
data, y, and the model parameters, theta. Specifically, we
sample z_1 from P(z_1 | z_2 , ... , z_n , y, theta ), z_2 from
P(z_2 | z_1 ,z_3 , ... , z_n , y, theta ), and so on up to z_n.
We also do not perform the posterior step of the data augmentation
algorithm (i.e., the sampling of theta from P( theta |z,y)).
The Gibbs sampler uses some starting value for z and then generates
a new value for z_1 through z_n repeatedly, each time using the
updated values of all other partitions of z in drawing z_i. Geman
and Geman[29] have shown that, under mild regularity
conditions, the process converges to P(z|y, theta).
The process of stochastic simulation is potentially slow. Hundreds of
cycles may be required to achieve reasonable accuracy, with each cycle
requiring time on the order of the sum of the number of variables and
the number of interdependencies between variables[53].
Fortunately, the algorithm can be easily executed in parallel by
concurrent processors. A further reduction in computation time can be
realized by incorporating into the algorithm the simulated annealing
heuristic of Kirkpatrick, Gelatt and Vecchi[42]
as demonstrated by Geman and Geman[29].
Up: Description of Proposed Work
Previous: Description of Proposed Work
Next: Proposed Work (II): Lexical Application
1.1 Background
1.1.1 Graphical Models
1.1.2 Previous Method for Developing Probabilistic Classifiers
1.2 The Extensions
1.2.1 Selecting the Form of the Model
1.2.2 Estimating Parameters from Untagged Data
1.2.3 Resolution of Interdependent Ambiguities
Next: Proposed Work (II): Lexical Application