Unfortunately, work within this approach has not stressed the descriptive task of creating a
comprehensive inventory of universal grammar parameters or even those for particular languages or
language families. For Project Boas, this means that both the nature of the parameters it uses
and their inventory have to be developed in-house.
In order to define a set of parameters for Boas, it is essential to distinguish among the language
phenomena that should be accorded the status of parameter and those that should be understood as
parameter values or their realizations. Still other phenomena may remain, at least for the task at
hand, outside the parameter system. We believe, with Dorr (1993), that parameters may be understood
as building blocks of an interlingua in MT. We reserve judgment about whether every component of an
interlingua is by definition parametric (fn. 2).
Thus, the parameter "lexical category" has a range of values {V, N, Adj, Adv, ...}. Any of these
values may itself be considered a parameter. If viewed within a single language, their values are,
ultimately, all words in the language which belong to the respective lexical categories. The
realizations of these values are the specific forms of these words, which appear in text decorated
with realizations of appropriate values of such morphological parameters as number, gender, case,
etc.
An example of a syntactic parameter is head-modifier dependency, whose values include such pairs as
"head: noun; modifier: adjective;" "head: verb; modifier: adverb," "head: noun; modifier: relative
clause" and others. Realization options for these values involve word or constituent order rules
(for instance, post or pre-posing) and agreement rules.
Lexical parameters are viewed as language-independent lexical meanings (ontological concepts), such
as table-furniture. The values of this parameter are the word senses corresponding to this
ontological concept across the inventory of languages. The realizations for these values are the
words or phrases that express this meaning in each language, with a possibility of a lexical gap (a
null value) included.
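The three-way distinction between a lexical parameter, its values, and their realizations can be pictured as a small data structure. The following is only a minimal sketch: the class names, the concept label "table-furniture", and the language "LangX" are illustrative assumptions, not part of Boas; the English and Hebrew entries follow the sample lexicon entries given in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Value:
    """A value of a lexical parameter: one word sense in one language."""
    language: str
    sense: Optional[str]        # sense identifier, e.g. "table-n1"; None for a gap
    realization: Optional[str]  # surface word or phrase; None marks a lexical gap


@dataclass
class LexicalParameter:
    """A language-independent meaning (ontological concept)."""
    concept: str
    values: List[Value] = field(default_factory=list)

    def realization(self, language: str) -> Optional[str]:
        """Return the word expressing the concept in `language`, or None."""
        for value in self.values:
            if value.language == language:
                return value.realization
        return None

    def has_gap(self, language: str) -> bool:
        """True if the language is recorded but has no word for the concept."""
        return any(v.language == language and v.realization is None
                   for v in self.values)


table = LexicalParameter("table-furniture", [
    Value("English", "table-n1", "table"),
    Value("Hebrew", "shulxan-n1", "shulxan"),
    Value("LangX", None, None),  # hypothetical language with a lexical gap
])
```

Recording the gap explicitly, rather than omitting the language, keeps "no word exists" distinct from "not yet elicited".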
While Boas needs a complete list of all parameters in all languages, it is clear that each
individual language uses a subset of that list. For instance, the parameter of gender used by
Ukrainian, Hebrew and many other languages is not utilized by English. Indeed, there is no
inflection for gender in any word of English (see, e.g., Quirk et al., 1985, pp. 314ff). The
pronominal forms he, she, his, her and hers do stand for English nouns which incorporate the
meaning of maleness or femaleness, but this agreement is due to deictic anaphoric rules pertaining
also to person and number parameter values.
So, it may appear that if English is a target language of an MT system, the gender parameter should
be activated only for translating pronouns. Is it true then that in any particular MT situation only
the parameter subset utilized by the target language is important? Why bother with determining the
gender of a source language noun if there is no trace of this parameter value in the translation?
In fact, if the parameter of gender is active in the source language, it can be used not only for
the purpose of transfer in MT but also to support the analysis of the source text before the
bilingual step in machine translation. Thus, the gender of the antecedent of the Russian pronoun
kotoryy in (1a,b) determines not only the form of the pronoun itself (which would not be reflected
in an English translation) but, importantly, guides the dependency structure of the relative clause,
resulting in quite different translations (2a,b, respectively).
We do not expect to end up being able to cover all the languages of the world. Instead, we strive
to compile as many of the parameter sets as possible, with as many attested realization
options as possible, for a realistically large number of languages. A straightforward methodology for such an
effort requires a parametrical exploration of a large number of potential target and source
languages. In the next section, we explore how having only one target language and reusing
computational resources developed in other projects simplifies this methodology and, in fact, makes
it feasible.
These resources include a) the vocabulary of the generation lexicon, which can serve as the list of
lexical parameters for compiling the bilingual dictionary; b) a world model (ontology) providing the
terms in which the senses of the English words and phrases are expressed (Boas uses the ontology
from the Mikrokosmos project at NMSU CRL; see Mahesh and Nirenburg 1995); c) the structure and term
definitions from the text meaning representation in Mikrokosmos (see, for instance, Onyshkevych and
Nirenburg 1995), to help guide parameter elicitation; d) the set of English closed-class lexical
items and morphemes; e) the English grammar used in text synthesis, which provides the TL side of
structural transfer rules in the runtime MT system (see Figure 1 above); and f) a set of
"ecological" parameters and their realizations for English. While a complete description of the use
of all of the above resources is beyond the scope of this paper, we will give a few brief
illustrations.
The list of English word senses seeds the acquisition of the SL lexicon. The acquirer first simply
translates all the word senses into SL and then adds SL features to the corresponding entries as
needed. The result is an SL-TL transfer dictionary which also serves as the lexicon for SL analysis.
The acquirer gets a lemma with all its senses:
In the examples, the senses are conveniently explained not in any specially designed
lexicon/ontology notation, but rather through translation into English. Because each English
translation is the entry head for a sense which is already explained in an ontology-based semantic
metalanguage in the already existing Mikrokosmos lexicon, Expedition can benefit from richer
semantic information than that acquired using Boas. We use the Mikrokosmos ontology as a search space
to support word sense disambiguation. The method (suggested by Jim Cowie) depends on the bilingual
dictionary of the kind illustrated above. Coarse-grained lexical mappings of TL word senses to
ontological concepts are established (for instance, Chihuahua and Poodle may both be linked to the
ontological concept dog). The system thus knows that both Chihuahuas and Poodles have four legs,
are carnivorous, domesticated, etc.
The disambiguation method uses such ontological constraints by computing a distance in the
ontological space between ambiguous word senses on the one hand and the senses of other words in
their context on the other. SL syntactic information helps to guide the disambiguation process by providing
additional constraints. Thus, closeness between senses of words belonging to the same syntactic unit
is weighed more heavily than that across unit boundaries.
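The weighting scheme just described can be sketched as follows. This is an illustration under stated assumptions, not the method as implemented: the toy ontology, the path-length distance, and the weight of 2.0 for words in the same syntactic unit are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Toy ontology as a child -> parent map (hypothetical; not the Mikrokosmos ontology).
PARENT: Dict[str, str] = {
    "physical-object": "all",
    "abstract-object": "all",
    "artifact": "physical-object",
    "furniture": "artifact",
    "table-furniture": "furniture",
    "chair": "furniture",
    "representation": "abstract-object",
    "table-diagram": "representation",
    "number": "abstract-object",
}


def path_to_root(concept: str) -> List[str]:
    """List the concept followed by all of its ancestors up to the root."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def distance(a: str, b: str) -> int:
    """Number of links between two concepts via their lowest common ancestor."""
    depth_from_b = {c: i for i, c in enumerate(path_to_root(b))}
    for i, c in enumerate(path_to_root(a)):
        if c in depth_from_b:
            return i + depth_from_b[c]
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def disambiguate(candidates: List[str],
                 context: List[Tuple[str, bool]],
                 same_unit_weight: float = 2.0) -> str:
    """Pick the candidate sense with the lowest weighted distance to the context.

    Each context item is (concept, in_same_syntactic_unit); distances to
    concepts inside the same unit count `same_unit_weight` times as much.
    """
    def cost(sense: str) -> float:
        return sum((same_unit_weight if same_unit else 1.0) * distance(sense, other)
                   for other, same_unit in context)
    return min(candidates, key=cost)


SENSES = ["table-furniture", "table-diagram"]
near_chair = disambiguate(SENSES, [("chair", True), ("number", False)])
near_number = disambiguate(SENSES, [("number", True), ("chair", False)])
```

In this toy setting `near_chair` comes out as the furniture sense and `near_number` as the diagram sense; with `same_unit_weight=1.0` the second context would also select the furniture sense, which is why the syntactic constraint matters.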
The acquisition of the complete list of parameters in the single-TL environment is facilitated not
only by the availability of the initial set of lexical parameters but also by the prominence of the
syntactic and morphological parameters activated in English. Thus, for morphology and syntax, the
existence of such comprehensive grammars of English as Quirk et al. (1985) allows a quick round-up
of the major parameters. One cannot always limit oneself, however, to TL-induced acquisition, as we
have demonstrated in the previous section with the example of gender in English.
In Boas, acquisition of descriptive knowledge about a language consists of a set of elicitation
"episodes." Whatever particular form the actual elicitation techniques take, it is clear
that the content which needs to be elicited ultimately constitutes the necessary set of lexical
meanings and semantic structure dependencies characterizing the source language. The difference from the
interlingua (knowledge-based) translation approaches is in the way the elicited knowledge is
recorded. In Expedition, it will be recorded as bilingual SL-English correspondences rather than in
an abstract metalanguage. Thus, instead of representing data for computer programs, we concentrate
on preparation of knowledge necessary to elicit information about a specific source language from a
human user. The knowledge which needs to be prepared is largely parametric.
Our work on descriptive knowledge acquisition in Boas is divided into three related parts: a)
compiling the complete inventory of all parameters and their values, b) developing the elicitation
episodes corresponding to this list, and c) implementing the elicitation techniques to acquire all
realizations of parameter values in a given SL and all mappings of SL realizations into TL
realizations.
A significant portion of the inventory of parameters has already been acquired in the property
branch of the Mikrokosmos ontology: most of the "grammatical" meanings (as in grammatical
semantics; cf. Frawley 1992, Raskin 1994) are already recorded and systematized there. This inventory
is in the process of being expanded. Sources for this expansion include information about languages
not used in the development of the Mikrokosmos ontology and literature on field linguistics. This
expanded information includes, for instance, the honorific mode, as in Korean or Japanese, itself
a value of the mode parameter, but having a range of values of its own that require considerable
modifications in the English translations.
The inventory of parameter values is at this point much less complete. This inventory must include
every grammatical meaning, for instance, each nominal case meaning, which would include phenomena
such as ergativity or the French partitive case, with de l'eau translating as "some water" and l'eau
as "water" or "the water."
As far as elicitation techniques are concerned, some methodology has been adapted from field
linguistics (see, for instance, Samarin 1967, Bouquiaux and Thomas 1992, Payne 1997). As the native
speaker's input must be interpreted by a computational system, not a human, the field linguistics
methodology is not applicable directly. Thus, in Comrie and Smith (1977), which is essentially a
checklist of parameters for a field linguist, the existence and actual listing of the lexical
categories in an SL is taken for granted, and the membership criteria are never explored, a luxury that
Boas cannot afford. Some recent attempts at automatic language knowledge acquisition (see, for
instance, Knight 1996, Knight et al. 1995) are also of some relevance to our task. However, in the
cited work the source of language information is not a native speaker, the scope of inquiry is more
constrained, and the response range is more limited.
Boas experiments with new elicitation techniques, including asking the acquirers to consult a
descriptive grammar of a language (if it is available) to derive comprehensive lists of phenomena,
for instance, all the forms of the noun declension paradigm on a typical example with its
translations into English, thus avoiding the need for a lengthy and tediously repetitive elicitation
episode.
The inventory of parameters and values and the elicitation techniques in Boas are used and put to a
test in the process of actual acquisition of the realizations of each parameter value in the SL.
Thus, to return to the example of nominal case values, one has to a) elicit the noun inflection
paradigms (if any); b) elicit prepositions (if any); c) combine prepositions and cases; d) elicit
prepositional meanings; e) elicit meanings of preposition-case combinations (e.g., the Russian s
dereva "from the tree," s derevom "with the tree," s derevo "the size of a tree"; see Nirenburg
1980); f) juxtapose these combinations with their parameter values. In the process of knowledge
elicitation, the meanings can be expressed by the native speaker in a number of ways: ontologically,
as English phrases, using pictures, diagrams, examples, etc. Multimodal representation, if made
possible, improves the quality of acquisition by, among other things, breaking the tedium of the
long sessions.
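Steps c) and e) of the nominal case example, combining prepositions with cases and recording the meanings of the attested combinations, can be sketched as below. This is a hedged illustration only: the function names are hypothetical, the case labels are our reading of the Russian forms cited above, the dative form is added merely to show a rejected combination, and the native speaker's answers are simulated by a fixed table.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Steps a) and b): an elicited fragment of a noun paradigm and a preposition list.
# Case labels are assumptions made for this sketch.
NOUN_FORMS: Dict[str, str] = {
    "genitive": "dereva",
    "dative": "derevu",
    "accusative": "derevo",
    "instrumental": "derevom",
}
PREPOSITIONS: List[str] = ["s"]

Combo = Tuple[str, str]  # (preposition, case)


def candidate_phrases(prepositions: List[str],
                      noun_forms: Dict[str, str]) -> Dict[Combo, str]:
    """Step c): pair every preposition with every case form."""
    return {(prep, case): f"{prep} {form}"
            for prep, (case, form) in product(prepositions, noun_forms.items())}


# Simulated answers: an English gloss, or None if the phrase is rejected.
JUDGMENTS: Dict[str, Optional[str]] = {
    "s dereva": "from the tree",
    "s derevom": "with the tree",
    "s derevo": "the size of a tree",
    "s derevu": None,
}


def attested_combinations(candidates: Dict[Combo, str],
                          judgments: Dict[str, Optional[str]]) -> Dict[Combo, str]:
    """Step e): keep only the combinations the speaker accepts, with their meanings."""
    return {combo: judgments[phrase]
            for combo, phrase in candidates.items()
            if judgments.get(phrase) is not None}


candidates = candidate_phrases(PREPOSITIONS, NOUN_FORMS)
attested = attested_combinations(candidates, JUDGMENTS)
```

Generating the full cross-product first and pruning it with the speaker's judgments keeps the episode exhaustive; step f) would then map each surviving (preposition, case) pair onto a parameter value.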
The most difficult issues in acquisition involve the transcategorial realization of values, such as
the signalling of a noun case in the verb or non-standard clitics, or the lexical realizations in SL
of grammatical parameters in TL, such as the possible absence of continuous tenses in an SL and the
choice of a grammatical realization of such lexical values as "right now" in the SL as the present
continuous marker on the corresponding verb.
Lexical acquisition proceeds as described in Section 4, aided by a special resource created for
Boas/Expedition: continuing our work on significantly reducing the number of different senses in a
lexicon entry by combining related senses in MRDs (see Nirenburg et al. 1995) and, more rarely,
deleting the marginal ones, we have manually reduced a combined (Mikrokosmos and other sources)
English lexicon of about 28,000 words to about 40,000 word senses, each of which serves as a lexical
parameter for SL acquisition. In addition, frequency analyses of SL corpora will provide the
requirements for adding lexical parameters from SL, not just TL.
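One simple way to picture the corpus-driven step is to count SL word forms and report the frequent ones that the TL-seeded lexicon does not yet cover. The sketch below is an assumption about how such an analysis might look, not a description of the Boas implementation: it presumes the corpus is already tokenized and lemmatized, and the function name, the threshold, and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def missing_lexical_parameters(tokens: Iterable[str],
                               sl_lexicon_lemmas: Set[str],
                               min_count: int = 2) -> List[Tuple[str, int]]:
    """Return frequent SL lemmas absent from the lexicon, most frequent first."""
    counts = Counter(tokens)
    return [(lemma, n) for lemma, n in counts.most_common()
            if n >= min_count and lemma not in sl_lexicon_lemmas]


# Toy lemmatized corpus; "sefer" and "kise" stand in for uncovered SL words.
corpus = ["shulxan", "sefer", "tavla", "sefer", "sefer", "kise", "shulxan"]
covered = {"shulxan", "tavla"}  # lemmas already acquired from TL senses
gaps = missing_lexical_parameters(corpus, covered)
```

Each lemma in `gaps` is a candidate for a new lexical parameter originating on the SL side.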
Boas exemplifies the broad-coverage descriptive approach to NLP (see, for instance, Nirenburg and
Raskin 1996) and adds to it a complementary new commitment to automating field-linguistic
methodology. This goes hand in hand with the evolving reorientation of theoretical linguistics from
selective theorizing in terms of atomistic rule postulation and testing back to the primary goal of
linguistics, which is a theory-based language description.
A full evaluation of Boas, that is, the development of the first actual SL-to-English MT system over
a six-month time interval, will take place within the next two years.
In other words, it is still necessary in the framework of a Russian-English translation system to
activate the parameter of gender. Thus, the set of parameters activated for an MT system is not
determined by the target language alone. Rather, one should revive the thirty-year-old CETA
hypothesis (see, for instance, Vauquois 1969; Veillon 1968) that the syntactic interlingua ("pivot
language") is determined by a specific SL-TL pair, extending it to cover the set of activated
parameters in an MT system.
Translation Environment Supported by Boas
The single-target-language (English) environment which Boas serves allows for simplification of both
system implementation and the acquisition process compared to the case of multiple SLs and TLs.
First, only one text synthesis module needs to be built. Second, many fewer transfer components
(bilingual lexicons, transduction tables for closed-class lexical items, feature and structure
transfer tables) are needed. In fact, this situation almost licenses the transfer approach, as the
combinatorial argument for interlingual MT is weaker here than in the case of multiple TLs (see,
however, below and fn. 3). Third, it appears that knowledge acquisition for a new SL may be aided by
the presence of a number of resources already developed for the TL.
Entry: table-n1
POS: noun
Sense: furniture
Entry: table-n2
POS: noun
Sense: diagram
and produces the following SL lexicon entries (the example is in Hebrew):
Entry: shulxan-n1
POS: noun
Gender: m (plural -ot)
Sense: table-n1
Entry: tavla-n1
POS: noun
Gender: f
Sense: table-n2.
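The seeding procedure behind the entries above can be sketched in a few lines. This is a minimal illustration under assumptions: the dictionary layout, the function name, and the entry-identifier scheme (lemma, first letter of the part of speech, running number) are hypothetical; the data are the English and Hebrew entries shown above.

```python
from typing import Dict, List

# TL (English) senses that seed the acquisition.
TL_SENSES: Dict[str, dict] = {
    "table-n1": {"pos": "noun", "sense": "furniture"},
    "table-n2": {"pos": "noun", "sense": "diagram"},
}

# The acquirer's answers: an SL translation plus SL features for each TL sense.
ANSWERS: List[dict] = [
    {"tl_sense": "table-n1", "lemma": "shulxan", "pos": "noun",
     "features": {"gender": "m", "plural": "-ot"}},
    {"tl_sense": "table-n2", "lemma": "tavla", "pos": "noun",
     "features": {"gender": "f"}},
]


def build_sl_lexicon(tl_senses: Dict[str, dict],
                     answers: List[dict]) -> Dict[str, dict]:
    """Turn the acquirer's answers into SL entries that point at TL senses."""
    lexicon: Dict[str, dict] = {}
    counts: Dict[tuple, int] = {}
    for answer in answers:
        if answer["tl_sense"] not in tl_senses:
            raise KeyError(f"unknown TL sense: {answer['tl_sense']}")
        key = (answer["lemma"], answer["pos"])
        counts[key] = counts.get(key, 0) + 1
        # Hypothetical id scheme: lemma, POS initial, running sense number.
        entry_id = f"{answer['lemma']}-{answer['pos'][0]}{counts[key]}"
        lexicon[entry_id] = {
            "pos": answer["pos"],
            "features": dict(answer["features"]),
            "sense": answer["tl_sense"],  # the SL-to-TL transfer link
        }
    return lexicon


sl_lexicon = build_sl_lexicon(TL_SENSES, ANSWERS)
```

Because every SL entry stores its TL sense, the same table serves both as the SL analysis lexicon and as the transfer dictionary.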
Source Language Knowledge Acquisition
Conclusion: Computational Field Linguistics?
Acknowledgments
The research reported in this paper was supported by Contract MDA904-92-C-5189 with the U.S.
Department of Defense. Victor Raskin is grateful to Purdue University for permitting him to consult
CRL/NMSU.
References
Bouquiaux, L., and J. M. C. Thomas 1992. Studying and Describing Unwritten Languages. Dallas, TX:
Summer Institute of Linguistics Press.
Chomsky, N. 1981. Lectures on Government and Binding. Dordrecht: Foris.
Chomsky, N. 1986. Knowledge of Language: Its Nature, Origin, and Use. New York: Praeger.
Chomsky, N. 1995. The Minimalist Program. Cambridge, MA: MIT Press.
Comrie, B., and N. Smith 1977. Lingua Descriptive Studies: Questionnaire. Lingua 42:1, pp. 1-72.
Culicover, P. W. 1997. Principles and Parameters: An Introduction to Syntactic Theory. Oxford
University Press.
Dorr, B. 1993. Interlingual Machine Translation: A Parametrized Approach. Artificial Intelligence
63, 429-92.
Frawley, W. 1992. Linguistic Semantics. Hillsdale, N.J.: Erlbaum.
Knight, K. 1996. Learning Word Meanings by Instruction. AAAI '96.
Knight, K., I. Chander, M. Haines, V. Hatzivassiloglou, E. Hovy, M. Iida, S. K. Luk, R. Whitney, and
K. Yamada 1995. Filling Knowledge Gaps in a Broad-Coverage Machine Translation System. IJCAI '95.
Lightfoot, D. 1991. How to Set Parameters. Cambridge, MA: MIT Press.
Mahesh, K., and S. Nirenburg 1995. Semantic Classification for Practical Natural Language
Processing. In: Proceedings of the Sixth ASIS SIG/CR Classification Research Workshop: An
Interdisciplinary Meeting. Chicago, IL.
Onyshkevych, B., and S. Nirenburg 1995. A Lexicon for Knowledge-Based MT. Machine Translation
10:1-2, pp. 5-57.
Nirenburg, S. 1980. Application of Semantic Methods in Description of Russian Morphology. Ph.D.
Thesis, Hebrew University, Jerusalem.
Nirenburg, S., and V. Raskin 1996. Ten Choices in Lexical Semantics. MCCS-96-304, Las Cruces, N.M.:
NMSU CRL.
Nirenburg, S., V. Raskin, and B. Onyshkevych 1995. Apologiae Ontologiae. TMI '95, Leuven.
Payne, T. E. 1997. Describing Morphosyntax. A Guide for Field Linguists. Cambridge: Cambridge
University Press.
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik 1985. A Comprehensive Grammar of the English
Language. London: Longman.
Raskin, V. 1994. Frawley: Linguistic Semantics. Language 70:3, pp. 552-556.
Samarin, W. J. 1967. Field Linguistics. New York: Holt, Rinehart and Winston.
Vauquois, B. 1969. Traduction automatique et traitement automatique des langues. Automatisme 14,
368-72.
Veillon, G. 1968. Description du langage pivot du système de traduction automatique C.E.T.A. T.A.
Informations 1, 8-17.
Webelhuth, G. 1992. Principles and Parameters of Syntactic Saturation. New York and Oxford: Oxford
University Press.
Zajac, R. 1996. A Multilingual Translator's Workstation for Information Access. Proceedings of
NLP+IA 96, Moncton, New Brunswick, Canada.