Proposal: Proposed Work (I) Extending Methodology



next up previous
Next: Proposed Work (II): Lexical Application

Up: Description of Proposed Work

Previous: Description of Proposed Work


We propose to extend our previous work in developing probabilistic classifiers in three important ways. The first two involve improvements to make large-scale applications feasible. In particular, we will improve the methods for selecting the form of the model and estimating model parameters so that the amount of hand-tagged data required is drastically reduced. The third extension will enhance the capabilities of the method, enabling us to resolve interdependent ambiguities.

We first provide background, describing the classes of models used in this work (in section 1.1.1) and describing our previous work in developing probabilistic classifiers [11],[12],[13] (in section 1.1.2). Then we describe our proposed extensions to the method (in section 1.2).

1.1 Background

1.1.1 Graphical Models

In this work, we propose to make use of a more general class of models than previously used in most NLP applications (graphical models, [65]). Such models are capable of characterizing a rich set of relationships among a large number of variables, where these variables may be of different types as well as spatially separated.

Graphical models are a subset of the class of hierarchical log-linear models and, as such, are well-suited to characterizing and studying the structure of data, that is, the interactions among variables as evidenced by the frequency with which the values of the variables co-occur[7]. A log-linear model expresses the probability distribution of a set of variables as a function of the interdependencies that exist among those variables. Formulating such a model involves specifying that certain variables do not interact while allowing the remaining interactions among variables to be arbitrary and unknown.

The probability distribution of a graphical model corresponds to a Markov field. This is because a graphical model is a probability model for multivariate random observations whose dependency structure has a valid graphical representation, a dependency graph. The dependency graph of a model is formed by mapping each variable in the model to a node in the graph, and then drawing undirected edges between the nodes corresponding to variables that are interdependent. A valid dependency graph has the property that all variables that are not directly connected in the graph are conditionally independent given the values of the variables mapping to nodes connecting them. For example, if node A separates node B from node C in a dependency graph, then the variable mapping to node B is conditionally independent of the variable mapping to node C, given the value of the variable mapping to node A. Not only does a dependency graph provide a clear interpretation of the interactions among variables, it also has been shown[53],[47] to form the basis of a distributed computational architecture for probabilistic constraint propagation.

An important subset of graphical models is the class of decomposable models. In addition to having a graphical representation, decomposable models possess other desirable properties, two of which are: (1) there are closed-form expressions for the parameter estimates, and (2) the joint distribution can be expressed as a product of lower-order marginal distributions, thereby reducing the number of parameters that need to be estimated. In total, decomposable models combine a richness in modeling, clarity in interpretation, and ease of analysis previously unrealized in other classes of statistical models.

1.1.2 Previous Method for Developing Probabilistic Classifiers

Most previous efforts have not attempted to systematically identify the interdependencies among the variables involved in the classification. The assumption is often made that all of the non-classification variables are either conditionally independent given the classification variable or fully independent. When such untested assumptions do not hold, the performance of the classifier will be negatively affected. Of course, all non-classification could be treated as interdependent, but, if there are several of them, such a model could have too many parameters to estimate in practice.

Rather than making assumptions about how the variables are related, we explore this empirically. In particular, in our previous work [13],[12], we formulate a decomposable model that describes the relationships among all variables in terms of only the most important interdependencies, that is, the simplest model of this class that fits the data well. The fit of a model is the degree to which the data is approximated by that model. Below, we describe the method.

The first step in the process of formulating a probabilistic model is to select, out of all available non-classification variables, a subset to include in the model. We identify those that are most strongly correlated with the classification variable; these are the only non-classification variables included in the model. A variable is considered to be correlated with the classification variable if the model for independence between them fits the data poorly according to the goodness-of-fit test described below. The worse the fit, the more correlated the variables are judged to be.

Having selected a set of variables that are highly correlated with the classification variable, in order to use them in concert to perform classification, the interdependencies among them must also be identified. In our method, this is accomplished via a process of hypothesis testing: patterns of interdependencies among variables are expressed by decomposable models, and then these models are evaluated as to how well they fit the training data. The likelihood ratio statistic, G^2, is used as the measure of the goodness-of-fit of a model. It is distributed asymptotically as chi^2 with degrees of freedom corresponding to the number of interactions omitted from (unconstrained in) the model. The models that are judged to fit are those whose reference chi^2 values are statistically significant. Accessing the fit of a model in terms of the significance of its G^2 statistic gives preference to models with the fewest number of interdependencies, thereby assuring the selection of a model specifying only the most systematic variable interactions.

Once the form of the model has been established, maximum likelihood estimates are used for the parameters of the model. In fact, in this method, parameter estimation is an integral part of the model selection process.

Approximating the joint distribution of all variables with a model containing only the most important systematic interactions among variables limits the number of parameters to be estimated, supports computational efficiency, and provides an understanding of the data. The biggest limitation of the method is that it requires large amounts of hand-tagged data. The validity of the results obtained using this approach are compromised when it is applied to a small data sample, first because maximum likelihood estimates are used for the parameters, and secondly because asymptotic approximations are used for the distribution of G^2.

1.2 The Extensions

We propose to extend the method described above to better facilitate developing large-scale applications that involve many features and interactions among them, such as the applications we will address in this work. Our approach to developing these applications will be as follows. Relevant non-classification variables will be selected in the same way as in the original method. Also, as in the original method, we will select a decomposable model to express the interdependencies among variables, but, to meet the demands of large-scale applications, we will use different methods both for selecting the form of the model and estimating the parameters of that model in order to minimize the amount of hand-tagged data required. These two extensions to the original method are described in sections 1.2.1 and 1.2.2, respectively.

Both applications will involve constraints among ambiguous objects. In the lexical application, such constraints will arise from domain knowledge that specifies semantic relations between the words in a sentence (how such knowledge will be incorporated in the proposed work is described in a later section, section 2). In the discourse application, such constraints arise from relationships among the sentences in a coherent discourse. The third extension to the original method is a technique for resolving multiple, mutually-constraining ambiguities in graphical models; this extension is described in section 1.2.3.

1.2.1 Selecting the Form of the Model

Recall that in the original method, large-sample approximations for the distributions of both the parameter estimators and the likelihood ratio statistic were used. While our new method for selecting the form of a model still uses the likelihood ratio statistic (G^2) to measure the fit of a model to the data, it eliminates (1) the need for parameter estimation to select the form of the model, and (2) the need to use large-sample approximations of the distribution of G^2. This is accomplished by generating the exact conditional distribution of G^2 in order to determine the significance of the G^2 value describing the fit of a model. The exact conditional distribution of G^2 for each test is the distribution of G^2 values that would be observed for comparable data samples randomly generated from the model being tested. A comparable data sample is one having the same sufficient statistics as the actual training data, where the sufficient statistics are the marginal distributions of the variables that are specified as interdependent in the model. Using the conditional distribution of G^2 values, that is, the distribution of G^2 values calculated for data having the same sufficient statistics as the training data, eliminates the need to estimate the parameters of a model when evaluating its fit. Intuitively, when conditioning on the sufficient statistics, one assumes the dependencies that are specified in the model, and tests for the presence of higher-order interdependencies.

In order to reduce the computational complexity of using exact conditional tests to evaluate the fit of a model, we will introduce two simplifications to the procedure. First, the exact significance of the G^2 value for each model will only be approximated by sampling from the conditional distribution of G^2 values as opposed to generating the complete distribution. The significance assigned to the model G^2 value is then equal to the portion of the sampled G^2 values that are at least as large as that of the model.

The second simplification restricts the form of the model that is evaluated in each test. As described above, it is the distribution of G^2 values conditioned on the sufficient statistics of the model being tested that is required to define the significance of a test; unfortunately, it is difficult to generate samples of G^2 values that are conditioned on the sufficient statistics of arbitrarily-complex models. For this reason, model evaluation will be conducted as a series of tests, where each test evaluates a single conditional independence between two variables. The complete model will be acceptable if all conditional independence relationships expressed in the model are found to be acceptable.

The method just described for selecting the form of a model requires a small amount of sense-tagged data (approximately two hundred tagged instances) in order to characterize the interdependencies that exist between the various non-classification variables and the classification variable. Testing will be required to establish more exact specifications of the amount of tagged training data required.

1.2.2 Estimating Parameters from Untagged Data

Given the form of the model, it should be possible to obtain reliable estimates of the model parameters from untagged text. The idea is to treat the missing tags as ``missing data'' and draw on the wealth of statistical techniques for augmenting incomplete data given some knowledge of the form of the complete data density.

The scheme we propose to use is a stochastic simulation technique referred to as The Data Augmentation Algorithm[61]. The procedure is a Bayesian approach to data augmentation that is very similar in spirit to the EM algorithm[24]. The basic idea is straightforward. The observed data, y, is augmented by z, the missing data (i.e., the missing tags). It is assumed that if both y and z were known, then the posterior density P(theta|y,z), and thus the posterior estimates of the model parameters, could be easily calculated (where theta is the vector of model parameters). But the posterior density that we want is P(theta|y), which may be difficult to calculate. If, however, one could generate multiple values of z from $P(z|y)$, then $P(\theta |y)$ could be approximated by P(theta|y,z) averaged over all values of z. Because generating values of z from P(z|y) is often hard, what is done is to draw multiple values of z_i from the distribution P(z|y,theta_i-1) and then use these values of z_i to form the new estimates of the model parameters, theta_i. The process iterates between drawing z_i from P(z|y, theta_i-1) (the ``imputation step'') and updating the estimate, theta_i, based on sampling from P(theta|y,z_i) (the ``posterior step''), until the parameter estimates converge. Tanner and Wong[61] show that this procedure generally converges to P(theta|y).

For decomposable models, Dawid and Lauritzen[23] have demonstrated how conjugate prior distributions can be set up so that both the imputation and posterior steps are straightforward.

1.2.3 Resolution of Interdependent Ambiguities

In this section, we propose a method for classification when the classes of different objects are interdependent. Stated another way, we propose to find the set of classifications that, when taken as a whole, most completely satisfy the constraints defined by all interdependencies among variables. Methods for doing this have been developed for various classes of models. The Viterbi algorithm [27] can be used to find the classification sequence that maximizes the probability of a Markov chain. In this work, we are not restricted to models that are Markov chains and therefore draw on techniques that are applicable to the broader class of graphical models. The exact solution of an arbitrary conditional distribution is difficult, so stochastic simulation techniques, such the ``Gibbs sampler'' [29], will be used instead to approximate the desired distribution. The Gibbs sampler is a data augmentation technique related to the data augmentation algorithm presented in section 1.2.2.

To see the relationship between the Gibbs sampler and the data augmentation algorithm, view the correct classifications of the ambiguous objects as ``missing data,'' as in section 1.2.2. The difference is that, now, we know the model parameters and are only interested in estimating the values of the missing data; also, the model might not be a decomposable model. Thus, we make the following changes to the algorithm. First, we do not assume that it will be easy to draw samples of the missing data, z, from the distribution P(z|y, theta ). Instead, we partition the missing data as z = (z_1 , z_2 , ... , z_n ) to make the drawing of each part of z (i.e., z_i) easy given all the other parts of z as well as the observed data, y, and the model parameters, theta. Specifically, we sample z_1 from P(z_1 | z_2 , ... , z_n , y, theta ), z_2 from P(z_2 | z_1 ,z_3 , ... , z_n , y, theta ), and so on up to z_n. We also do not perform the posterior step of the data augmentation algorithm (i.e., the sampling of theta from P( theta |z,y)).

The Gibbs sampler uses some starting value for z and then generates a new value for z_1 through z_n repeatedly, each time using the updated values of all other partitions of z in drawing z_i. Geman and Geman[29] have shown that, under mild regularity conditions, the process converges to P(z|y, theta).

The process of stochastic simulation is potentially slow. Hundreds of cycles may be required to achieve reasonable accuracy, with each cycle requiring time on the order of the sum of the number of variables and the number of interdependencies between variables[53]. Fortunately, the algorithm can be easily executed in parallel by concurrent processors. A further reduction in computation time can be realized by incorporating into the algorithm the simulated annealing heuristic of Kirkpatrick, Gelatt and Vecchi[42] as demonstrated by Geman and Geman[29].



next up previous
Next: Proposed Work (II): Lexical Application

Up: Description of Proposed Work

Previous: Description of Proposed Work