Universal Grammar and Lexis for Quick Ramp-Up of MT Systems

Sergei Nirenburg and Victor Raskin

Computing Research Laboratory
New Mexico State University
Las Cruces N.M. 88003, USA
{sergei, raskin}@crl.nmsu.edu
Abstract

Introduction: The Boas Project

This paper introduces Boas, a semi-automatic knowledge elicitation system that guides a team of two people through the process of developing the static knowledge sources for a moderate-quality, broad-coverage MT system from any "low-density"1 language into English in about six months. Boas contains knowledge about human language and the means of realization of its phenomena in a number of specific languages and is thus a kind of "linguist in the box" that helps non-professional acquirers with a task whose complexity is legendary. The knowledge about language elicited by Boas from the acquirers aims to support MT output quality roughly commensurate with that of the better commercial systems, such as Systran. These relatively modest expectations are dictated by the amount of language work that can be carried out, given the resources available. The rules of the game specifically exclude linguists and MT developers from the acquisition team. Under such conditions, the only sensible course of action is to attempt to collect as much knowledge about as many languages as possible in advance and include it in the elicitation system itself.

The rest of the paper is structured as follows. Section 2 briefly discusses the computational architecture in which Boas operates. Section 3 is devoted to defining the format of the descriptive language knowledge to be elicited from the acquirers through Boas. The descriptive language knowledge addressed in this paper is later, in the course of Boas operation, converted into operational knowledge capable of supporting the processes of source language analysis and source-target transfer. In Section 4, we discuss how work on ontological semantics in MT can contribute to Boas in a situation of a single target language, English. In Section 5, we address the procedure for descriptive language knowledge acquisition in Boas, both in terms of resources created and reused and in terms of the actual elicitation techniques, differentiating between the acquisition of grammatical and lexical parameters.

The Ecology of Boas

Machine-aided linguistic knowledge acquisition is a complex task even for seasoned MT developers. Experience shows that at the beginning of any project involving sizeable knowledge acquisition, staff members must be trained in: The rules of the game in our project do not allow for this initial training period. One must, therefore, incorporate training into the Boas environment.
Boas must produce the same knowledge that human grammar and lexicon acquirers produce for an MT system. For each SL, this will include, at the coarsest grain size of description: When people are acquiring the above information, they have at their disposal, in addition to linguistic knowledge, reference grammars, text corpora and mono- and multilingual dictionaries for the language(s) in question. People typically decide on the methodology of the work themselves or are assisted by their supervisors. With Boas, the methodological initiative rests with the system: it is the system that leads the acquirer, ordering the interactions (questions) and keeping in mind the coverage needs and the nature of the output. While the acquirers will still have access to printed (or online) descriptive grammars, dictionaries and other reference materials, the responsibility for quality and coverage of the output now rests with Boas.
The overall MT (development and runtime) environment, called Expedition, includes, in addition to Boas, two other major system components: a configuration and control system (CCS) and the run-time MT system. The relationship among the three systems is illustrated in Figure 1. CCS helps Boas by rounding up (and presenting to Boas users) online resources available for a particular source language. CCS also compiles the knowledge recorded through Boas into a format suitable for processing by the resident MT system and manages the file systems and the databases to support the MT engine, whose operation does not change with new source languages. The MT engine itself has been developed under the Corelli project at NMSU CRL (e.g., Zajac, 1996).

Defining Parameters for Boas

The descriptive knowledge about the source language is a set of statements about morphological, syntactic, and lexical properties (parameters) of a language, listed together with their values and realization options. Data about each parameter includes the language, the name of the parameter, the list of entities to which this parameter applies (its domain) and the list of parameter values (its range). Moreover, parameter values have an associated set of realization options in each language. For instance, the parameter of gender in Ukrainian is described as follows: For comparison, the Hebrew gender is described differently: Instead of discovering parameters from scratch for each language, it is preferable, in order to ensure uniformity and systematicity of Boas operation, to come up with a complete list of all possible parameters in natural languages, with complete lists of their possible values attached. The attainability of such a resource becomes then a central issue.
The terms 'parameter' and 'value' are used in our task in the same sense as in the school of theoretical syntactic thought consecutively known as government and binding (Chomsky 1981), principles and parameters (Chomsky 1986) and the minimalist position (Chomsky 1995). The theory postulates a small number of general principles defining the innate human language faculty and a larger number of language parameters, which implement these principles by selecting concrete values for particular languages. The complete set of such parameters and values constitutes a universal grammar (UG); see also Culicover (1997), Lightfoot (1991) and Webelhuth (1992).

Unfortunately, work within this approach has not stressed the descriptive task of creating a comprehensive inventory of universal grammar parameters or even those for particular languages or language families. For Project Boas, this means that both the nature of the parameters it would be using and their inventory have to be developed in-house.

In order to define a set of parameters for Boas, it is essential to distinguish among the language phenomena that should be accorded the status of parameter and those that should be understood as parameter values or their realizations. Still other phenomena may remain, at least for the task at hand, outside the parameter system. We believe, with Dorr (1993), that parameters may be understood as building blocks of an interlingua in MT. We reserve judgment about whether every component of an interlingua is by definition parametric 2.

Thus, the parameter "lexical category" has a range of values {V, N, Adj, Adv, ...}. Any of these values may itself be considered a parameter. If viewed within a single language, their values are, ultimately, all words in the language which belong to the respective lexical categories. The realizations of these values are the specific forms of these words, which appear in text decorated with realizations of appropriate values of such morphological parameters as number, gender, case, etc.

An example of a syntactic parameter is head-modifier dependency, whose values include such pairs as "head: noun; modifier: adjective," "head: verb; modifier: adverb," "head: noun; modifier: relative clause" and others. Realization options for these values involve word or constituent order rules (for instance, post- or pre-posing) and agreement rules. Lexical parameters are viewed as language-independent lexical meanings (ontological concepts), such as table-furniture. The values of this parameter are the word senses corresponding to this ontological concept across the inventory of languages. The realizations for these values are the words or phrases that express this meaning in each language, with a possibility of a lexical gap (a null value) included.
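The parameter format described in this section (language, name, domain, range, and realization options per value) can be sketched as a small record type. This is only an illustrative sketch: the class and field names, the `realize` helper, the `domain` entries, and the English ordering options are assumptions introduced here, not the notation used in Boas. The values and the Hebrew word are taken from the examples in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Parameter:
    """One descriptive statement about a language (field names are hypothetical)."""
    language: str            # language the statement describes
    name: str                # parameter name
    domain: List[str]        # entities to which the parameter applies
    range: List[str]         # possible values of the parameter
    # value -> realization options recorded for this language
    realizations: Dict[str, List[str]] = field(default_factory=dict)


def realize(parameter: Parameter, value: str) -> Optional[List[str]]:
    """Return the realization options for a value, or None if nothing is
    recorded (not yet elicited, or a lexical gap)."""
    if value not in parameter.range:
        raise ValueError(f"{value!r} is not a value of {parameter.name!r}")
    options = parameter.realizations.get(value)
    return options if options else None


# Syntactic parameter; the ordering options are illustrative assumptions.
head_modifier_en = Parameter(
    language="English",
    name="head-modifier dependency",
    domain=["phrase"],
    range=[
        "head: noun; modifier: adjective",
        "head: verb; modifier: adverb",
        "head: noun; modifier: relative clause",
    ],
    realizations={
        "head: noun; modifier: adjective": ["modifier pre-posed"],
        "head: noun; modifier: relative clause": ["modifier post-posed"],
    },
)

# Lexical parameter: the concept table-furniture as realized in Hebrew.
table_furniture_he = Parameter(
    language="Hebrew",
    name="table-furniture",
    domain=["noun"],
    range=["shulxan-n1"],
    realizations={"shulxan-n1": ["shulxan"]},
)

recorded = realize(table_furniture_he, "shulxan-n1")
missing = realize(head_modifier_en, "head: verb; modifier: adverb")
```

Keeping realizations separate from the range makes it possible to list a value for a language before any realization has been elicited for it.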

While Boas needs a complete list of all parameters in all languages, it is clear that each individual language uses a subset of that list. For instance, the parameter of gender used by Ukrainian, Hebrew and many other languages is not utilized by English. Indeed, there is no inflection for gender in any word of English (see, e.g., Quirk et al., 1985, pp. 314ff). The pronominal forms he, she, his, her and hers do stand for the English nouns which incorporate the meaning of maleness or femaleness, but this agreement is due to deictic anaphoric rules pertaining also to person and number parameter values.

So, it may appear that if English is a target language of an MT system, the gender parameter should be activated only for translating pronouns. Is it true then that in any particular MT situation only the parameter subset utilized by the target language is important? Why bother with determining the gender of a source language noun if there is no trace of this parameter value in the translation?

In fact, if the parameter of gender is active in the source language, it can be used not only for the purpose of transfer in MT but also to support the analysis of the source text before the bilingual step in machine translation. Thus, the gender of the antecedent of the Russian pronoun kotoryy in (1a,b) determines not only the form of the pronoun itself (which would not be reflected in an English translation) but, importantly, guides the dependency structure of the relative clause, resulting in quite different translations (2 a,b, respectively).

In other words, it is still necessary in the framework of a Russian-English translation system to activate the parameter of gender. Thus, the set of parameters activated for an MT system is not determined by the target language alone. Rather, one should revive the thirty-year-old CETA hypothesis (see, for instance, Vauquois 1969; Veillon 1968) that the syntactic interlingua ("pivot language") is determined by a specific SL-TL pair, extending it to cover the set of activated parameters in an MT system.

We do not expect to be able to cover all the languages of the world. Instead, we strive to compile as many of the parameter sets as possible, with as many attested realization options as possible, for a realistically large number of languages. A straightforward methodology for such an effort requires a parametrical exploration of a large number of potential target and source languages. In the next section, we explore how having only one target language and reusing computational resources developed in other projects simplifies this methodology and, in fact, makes it feasible.

Translation Environment Supported by Boas

The single-target-language (English) environment which Boas serves allows for simplification of both system implementation and the acquisition process compared to the case of multiple SLs and TLs. First, only one text synthesis module needs to be built. Second, many fewer transfer components (bilingual lexicons, transduction tables for closed-class lexical items, feature and structure transfer tables) are needed. In fact, this situation almost licenses the transfer approach, as the combinatorial argument for interlingual MT is weaker here than in the case of multiple TLs (see, however, below and fn. 3). Third, it appears that knowledge acquisition for a new SL may be aided by the presence of a number of resources already developed for the TL.

These resources include a) the vocabulary of the generation lexicon, which can serve as the list of lexical parameters for compiling the bilingual dictionary; b) a world model (ontology) providing the terms in which the senses of the English words and phrases are expressed (Boas uses the ontology from the Mikrokosmos project at NMSU CRL; see Mahesh and Nirenburg 1995); c) the structure and term definitions from the text meaning representation in Mikrokosmos (see, for instance, Onyshkevych and Nirenburg 1995), to help guide parameter elicitation; d) the set of English closed-class lexical items and morphemes; e) the English grammar used in text synthesis, which provides the TL side of structural transfer rules in the runtime MT system (see Figure 1 above); and f) a set of "ecological" parameters and their realizations for English. While a complete description of the use of all of the above resources is beyond the scope of this paper, we will give a few brief illustrations.

The list of English word senses seeds the acquisition of the SL lexicon. The acquirer first simply translates all the word senses into SL and then adds SL features to the corresponding entries as needed. The result is an SL-TL transfer dictionary which also serves as the lexicon for SL analysis. The acquirer gets a lemma with all its senses:


Entry: table-n1
POS: noun
Sense: furniture

Entry: table-n2
POS: noun
Sense: diagram

and produces the following SL lexicon entries (the example is in Hebrew):

Entry: shulxan-n1
POS: noun
Gender: m (plural -ot)
Sense: table-n1

Entry: tavla-n1
POS: noun
Gender: f
Sense: table-n2
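The seeding procedure illustrated by these entries can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the class names, the `seed_lexicon` function, and the decision to copy the part of speech from the English sense are hypothetical choices made here for illustration. Only the entry data comes from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EnglishSense:
    entry: str   # English sense identifier, e.g. "table-n1"
    pos: str     # part of speech
    gloss: str   # short English explanation of the sense


@dataclass
class SLEntry:
    entry: str   # SL entry head, e.g. "shulxan-n1"
    pos: str
    sense: str   # the English sense that this SL entry translates
    features: Dict[str, str] = field(default_factory=dict)


def seed_lexicon(senses: List[EnglishSense],
                 translations: Dict[str, str],
                 features: Dict[str, Dict[str, str]]) -> List[SLEntry]:
    """Build SL entries from English senses plus the acquirer's answers.

    Assumption: the SL entry inherits the part of speech of the English
    sense, as in the example; a real acquirer could override it.
    """
    lexicon = []
    for sense in senses:
        head = translations.get(sense.entry)
        if head is None:          # not translated yet: leave for a later pass
            continue
        lexicon.append(
            SLEntry(head, sense.pos, sense.entry, dict(features.get(head, {})))
        )
    return lexicon


senses = [
    EnglishSense("table-n1", "noun", "furniture"),
    EnglishSense("table-n2", "noun", "diagram"),
]
# Step 1: the acquirer translates each sense into the SL.
translations = {"table-n1": "shulxan-n1", "table-n2": "tavla-n1"}
# Step 2: the acquirer adds SL features where needed.
features = {
    "shulxan-n1": {"gender": "m", "plural": "-ot"},
    "tavla-n1": {"gender": "f"},
}

lexicon = seed_lexicon(senses, translations, features)
# The same entries double as the SL-TL transfer dictionary.
transfer = {entry.entry: entry.sense for entry in lexicon}
```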

In the examples, the senses are conveniently explained not in any specially designed lexicon/ontology notation, but rather through translation into English. Because each English translation is the entry head for a sense which is already explained in an ontology-based semantic metalanguage in the existing Mikrokosmos lexicon, Expedition can benefit from richer semantic information than that acquired using Boas. We use the Mikrokosmos ontology as a search space to support word sense disambiguation. The method (suggested by Jim Cowie) depends on the bilingual dictionary of the kind illustrated above. Coarse grain-size lexical mappings of TL word senses to ontological concepts are established (for instance, Chihuahua and Poodle may both be linked to the ontological concept dog). The system thus knows that both Chihuahuas and Poodles have four legs, are carnivorous, domesticated, etc.

The disambiguation method uses such ontological constraints by computing a distance in the ontological space between ambiguous word senses, on the one hand, and the senses of other words in their context, on the other. SL syntactic information helps to guide the disambiguation process by providing additional constraints. Thus, closeness between senses of words belonging to the same syntactic unit is weighted more heavily than closeness across unit boundaries.

The acquisition of the complete list of parameters in the single-TL environment is facilitated not only by the availability of the initial set of lexical parameters but also by the prominence of the syntactic and morphological parameters activated in English. Thus, for morphology and syntax, the existence of such comprehensive grammars of English as Quirk et al. (1985) allows a quick round-up of the major parameters. One cannot always limit oneself, however, to TL-induced acquisition, as we demonstrated in the previous section with the example of gender in English.
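The distance-based disambiguation method described in this section can be illustrated with a toy example. The sketch below is not the Mikrokosmos implementation: the miniature ontology, the path-length distance through the nearest common ancestor, the weight of 2.0 for words in the same syntactic unit, and the simplification that context words are already unambiguous are all assumptions made here for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical miniature ontology: concept -> parent concept.
PARENT: Dict[str, Optional[str]] = {
    "object": None,
    "artifact": "object",
    "furniture": "artifact",
    "table-furniture": "furniture",
    "chair": "furniture",
    "representation": "object",
    "diagram": "representation",
    "table-diagram": "diagram",
    "chart": "diagram",
}


def ancestors(concept: str) -> List[str]:
    """Return the chain from a concept up to the root, concept first."""
    chain = []
    node: Optional[str] = concept
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def distance(a: str, b: str) -> int:
    """Number of links on the path between a and b via their nearest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("concepts are not in the same ontology")


def disambiguate(candidates: List[str],
                 context: List[Tuple[str, bool]],
                 same_unit_weight: float = 2.0) -> str:
    """Pick the candidate sense with the smallest weighted distance to the context.

    Each context item is (concept, in_same_syntactic_unit); distances to words
    in the same syntactic unit count more heavily.
    """
    def score(candidate: str) -> float:
        return sum(
            (same_unit_weight if same_unit else 1.0) * distance(candidate, concept)
            for concept, same_unit in context
        )
    return min(candidates, key=score)


# "table" is ambiguous; "chair" is in the same syntactic unit, "chart" is not.
context = [("chair", True), ("chart", False)]
chosen = disambiguate(["table-furniture", "table-diagram"], context)
```

With equal weights the two candidate senses would tie in this example; the heavier weight for the same syntactic unit is what tips the choice toward the furniture sense.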

Source Language Knowledge Acquisition

In Boas, acquisition of descriptive knowledge about a language consists of a set of elicitation "episodes." No matter what particular form the actual elicitation techniques take, it is clear that the content which needs to be elicited ultimately constitutes the necessary set of lexical meanings and semantic structure dependencies characterizing the source language. The difference from the interlingua (knowledge-based) translation approaches is in the way the elicited knowledge is recorded. In Expedition, it will be recorded as bilingual, SL-English correspondences rather than in an abstract metalanguage. Thus, instead of representing data for computer programs, we concentrate on preparation of knowledge necessary to elicit information about a specific source language from a human user. The knowledge which needs to be prepared is largely parametric.

Our work on descriptive knowledge acquisition in Boas is divided into three related parts: a) compiling the complete inventory of all parameters and their values, b) developing the elicitation episodes corresponding to this list, and c) implementing the elicitation techniques to acquire all realizations of parameter values in a given SL and all mappings of SL realizations into TL realizations.

A significant portion of the inventory of parameters has already been acquired in the property branch of the Mikrokosmos ontology: most of the "grammatical" meanings (as in grammatical semantics; cf. Frawley 1992, Raskin 1994) are already recorded and systematized there. This inventory is in the process of being expanded. Sources for this expansion include information about languages not used in the development of the Mikrokosmos ontology and literature on field linguistics. This expanded information includes, for instance, the honorific mode, as in Korean or Japanese, itself a value of the mode parameter, but having a range of values of its own that require considerable modifications in the English translations.

The inventory of parameter values is at this point much less complete. This inventory must include every grammatical meaning, for instance, each nominal case meaning which would include phenomena such as ergativity or the French partitive case, with de l'eau translating as "some water" and l'eau as "water" or "the water."

As far as elicitation techniques are concerned, some methodology has been adapted from field linguistics (see, for instance, Samarin 1967, Bouquiaux and Thomas 1992, Payne 1997). As the native speaker's input must be interpreted by a computational system, not a human, the field linguistics methodology is not applicable directly. Thus, in Comrie and Smith (1977), which is essentially a checklist of parameters for a field linguist, the existence and actual listing of the lexical categories in an SL is taken for granted, and the membership criteria are never explored, a luxury that Boas cannot afford. Some recent attempts at automatic language knowledge acquisition (see, for instance, Knight 1996, Knight et al. 1995) are also of some relevance to our task. However, in the cited work the source of language information is not a native speaker, the scope of inquiry is more constrained, and the response range is more limited.

Boas experiments with new elicitation techniques. One of them is asking the acquirers to consult a descriptive grammar of a language (if one is available) to derive comprehensive lists of phenomena, for instance, all the forms of the noun declension paradigm on a typical example with its translations into English. This avoids the need for a lengthy and tediously repetitive elicitation episode.

The inventory of parameters and values and the elicitation techniques in Boas are used and put to a test in the process of actual acquisition of the realizations of each parameter value in the SL. Thus, to return to the example of nominal case values, one has to a) elicit the noun inflection paradigms (if any); b) elicit prepositions (if any); c) combine prepositions and cases; d) elicit prepositional meanings; e) elicit meanings of preposition-case combinations (e.g., the Russian s dereva "from the tree," s derevom "with the tree," s derevo "the size of a tree"; see Nirenburg 1980); f) juxtapose these combinations with their parameter values. In the process of knowledge elicitation, the meanings can be expressed by the native speaker in a number of ways: ontologically, as English phrases, using pictures, diagrams, examples, etc. Multimodal representation, if made possible, improves the quality of acquisition by, among other things, breaking the tedium of the long sessions.
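Steps c) and e) above, combining prepositions with cases and recording the meaning of each combination, can be sketched as follows. The case labels (genitive, instrumental, accusative, dative) are annotations added here for illustration; the text itself gives only the Russian forms and their glosses. The `pending` helper and the data layout are likewise hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Combination = Tuple[str, str]  # (preposition, case)

# Steps a) and b): cases and prepositions elicited so far (case labels assumed).
prepositions = ["s"]
cases = ["genitive", "instrumental", "accusative", "dative"]

# Step c): every preposition-case pair is a candidate to ask about.
combinations: List[Combination] = list(product(prepositions, cases))

# Step e): answers recorded so far, as (example form, English gloss).
elicited: Dict[Combination, Tuple[str, str]] = {
    ("s", "genitive"): ("s dereva", "from the tree"),
    ("s", "instrumental"): ("s derevom", "with the tree"),
    ("s", "accusative"): ("s derevo", "the size of a tree"),
}


def pending(all_pairs: List[Combination],
            answers: Dict[Combination, Tuple[str, str]]) -> List[Combination]:
    """Combinations the acquirer still has to gloss or reject as non-occurring."""
    return [pair for pair in all_pairs if pair not in answers]


todo = pending(combinations, elicited)
```

Enumerating the full cross-product first means that a combination which does not occur in the language is rejected explicitly by the acquirer rather than silently omitted.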

The most difficult issues in acquisition involve the transcategorial realization of values, such as the signalling of a noun case in the verb or non-standard clitics, or the lexical realizations in SL of grammatical parameters in TL, such as the possible absence of continuous tenses in an SL and the choice of a grammatical realization of such lexical values as "right now" in the SL as the present continuous marker on the corresponding verb.

Lexical acquisition proceeds as described in Section 4, aided by a special resource created for Boas/Expedition. Continuing our work on significantly reducing the number of different senses in a lexicon entry by combining related senses in machine-readable dictionaries (MRDs; see Nirenburg et al. 1995) and, more rarely, deleting the marginal ones, we have manually reduced a combined English lexicon (drawn from Mikrokosmos and other sources) of about 28,000 words to about 40,000 word senses, each of which serves as a lexical parameter for SL acquisition. In addition, frequency analyses of SL corpora will identify lexical parameters that need to be added from the SL side, not just the TL side.
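One simple way to picture the frequency analysis mentioned above is to count SL corpus tokens and flag frequent ones that are missing from the SL lexicon built so far. The sketch below is an assumption about how such a pass might look, not a description of the Boas procedure: the function name, the threshold, the sample tokens, and the absence of any morphological normalization are all illustrative.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def candidate_lexical_parameters(tokens: Iterable[str],
                                 lexicon_heads: Set[str],
                                 min_count: int = 2) -> List[Tuple[str, int]]:
    """Return frequent SL words not yet covered by the lexicon, most frequent first.

    Assumption: tokens are already reduced to citation forms; a real system
    would need SL morphological analysis before counting.
    """
    counts = Counter(tokens)
    return [
        (word, count)
        for word, count in counts.most_common()
        if count >= min_count and word not in lexicon_heads
    ]


# Illustrative transliterated tokens standing in for an SL corpus.
corpus = ["shulxan", "kise", "kise", "tavla", "kise", "sefer", "sefer"]
covered = {"shulxan", "tavla"}  # heads already acquired from the English seed list

candidates = candidate_lexical_parameters(corpus, covered)
```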

Conclusion: Computational Field Linguistics?

Boas exemplifies the broad-coverage descriptive approach to NLP (see, for instance, Nirenburg and Raskin 1996) and adds to it a complementary new commitment to automating field-linguistic methodology. This goes hand in hand with the evolving reorientation of theoretical linguistics from selective theorizing in terms of atomistic rule postulation and testing back to the primary goal of linguistics, which is a theory-based language description.

A full evaluation of Boas, that is, the development of the first actual SL to English MT system over a six-month time interval, will take place within the next two years.


Acknowledgments

The research reported in this paper was supported by Contract MDA904-92-C-5189 with the U.S. Department of Defense. Victor Raskin is grateful to Purdue University for permitting him to consult CRL/NMSU.

References

Bouquiaux, L., and J. M. C. Thomas 1992. Studying and Describing Unwritten Languages. Dallas, TX: Summer Institute of Linguistics Press.
Chomsky, N. 1981. Lectures on Government and Binding. Dordrecht: Foris.
Chomsky, N. 1986. Knowledge of Language: Its Nature, Origin, and Use. New York: Praeger.
Chomsky, N. 1995. The Minimalist Program. Cambridge, MA: MIT Press.
Comrie, B., and N. Smith 1977. Lingua Descriptive Studies: Questionnaire. Lingua 42:1, pp. 1-72.
Culicover, P. W. 1997. Principles and Parameters: An Introduction to Syntactic Theory. Oxford University Press.
Dorr, B. 1993. Interlingual Machine Translation: A Parametrized Approach. Artificial Intelligence 63, 429-92.
Frawley, W. 1992. Linguistic Semantics. Hillsdale, N.J.: Erlbaum.
Knight, K. 1996. Learning Word Meanings by Instruction. AAAI '96.
Knight, K., I. Chander, M. Haines, V. Hatzivassiloglou, E. Hovy, M. Iida, S. K. Luk, R. Whitney, and K. Yamada 1995. Filling Knowledge Gaps in a Broad-Coverage Machine Translation System. IJCAI '95.
Lightfoot, D. 1991. How to Set Parameters. Cambridge, MA: MIT Press.
Mahesh, K., and S. Nirenburg 1995. Semantic Classification for Practical Natural Language Processing. In: Proceedings of the Sixth ASIS SIG/CR Classification Research Workshop: An Interdisciplinary Meeting. Chicago, IL.
Nirenburg, S. 1980. Application of Semantic Methods in Description of Russian Morphology. Ph.D. Thesis, Hebrew University, Jerusalem.
Nirenburg, S., and V. Raskin 1996. Ten Choices in Lexical Semantics. MCCS-96-304, Las Cruces, N.M.: NMSU CRL.
Nirenburg, S., V. Raskin, and B. Onyshkevych 1995. Apologiae Ontologiae. TMI '95, Leuven.
Onyshkevych, B., and S. Nirenburg 1995. A Lexicon for Knowledge-Based MT. Machine Translation 10:1-2, pp. 5-57.
Payne, T. E. 1997. Describing Morphosyntax. A Guide for Field Linguists. Cambridge: Cambridge University Press.
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik 1985. A Comprehensive Grammar of the English Language. London: Longman.
Raskin, V. 1994. Frawley: Linguistic Semantics. Language 70:3, pp. 552-556.
Samarin, W. J. 1967. Field Linguistics. New York: Holt, Rinehart and Winston.
Vauquois, B. 1969. Traduction automatique et traitement automatique des langues. Automatisme 14, 368-72.
Veillon, G. 1968. Description du langage pivot du système de traduction automatique C.E.T.A. T.A. Informations 1, 8-17.
Webelhuth, G. 1992. Principles and Parameters of Syntactic Saturation. New York and Oxford: Oxford University Press.
Zajac, R. 1996. A Multilingual Translator's Workstation for Information Access. Proceedings of NLP+IA 96, Moncton, New Brunswick, Canada.