Authors: Keith J. Miller (MITRE Corporation and Georgetown University)
David M. Zajic (MITRE Corporation and University of Maryland)
Subject area keywords: Interlingual Machine Translation, LCS, Machine Translation
Keith J. Miller & David M. Zajic
keith@mitre.org dzajic@mitre.org
Our interest in this workshop stems from practical concerns. We are currently involved in a Machine Translation project, CyberTrans, which provides translation services to a wide community of users. By providing a common interface to two large commercial transfer-based systems and several smaller translation systems, CyberTrans gives this community easy access to a considerable suite of Machine Translation tools. The maintenance of separate domain-specific lexicons for each of these systems, however, is a large task. In an effort to make this task more manageable, we have undertaken a project known as the Lexicon Service Bureau (LSB). As described in (Miller & Zajic, 1998), the goal of the LSB project is to provide a unified lexical resource from which the system-specific lexicons can be generated for the transfer-based systems. The design for the LSB calls for a single lexicon for each language handled by the translation systems, with links between monolingual lexicons providing the information necessary to generate the system-specific bilingual lexicons. Since source language (SL) to target language (TL) mappings are not all one-to-one, the design calls for links to be made by means of a pivot language, or Interlingua (IL). Thus, we are interested in determining what such a pivot must entail. Furthermore, we are interested in discovering which types of MT problems are particularly suited to an IL approach. That is, in which areas could an IL approach to MT solve problems currently encountered by the state-of-the-art transfer-based MT systems that we provide to our users?
Since we are concerned with the potential benefits to be gained in an operational setting in which analysts are interested primarily in viewing English translations of foreign language documents, our approach to this workshop was likely somewhat different from that of other participants. We began with the text from the UNESCO Courier that served as a basis of the workshop (Otero, 1997). This text was available in 13 languages, of which we chose to focus on the English, French, Spanish, Russian, and German versions, principally for practical reasons. We also had access to two transfer-based MT systems (Systran and an older version of Globalink), a system that provided English glosses of the foreign language texts, and, in some cases, human translations of the texts. To identify areas in which the transfer systems needed improvement, we submitted each of the four foreign language texts to the two transfer-based systems for translation into English. We then attempted to locate errors and disfluencies in the output of the transfer-based MT systems. From the list of errors, we developed an observational classification of the error types, and partitioned these classes into two groups: those containing errors that an IL approach would correct, and those containing errors that would not be ameliorated by use of an IL. This grouping of error classes will be presented in the next section of this paper. Furthermore, we compared the various language versions of the document with each other and with the English translation or gloss of the document. The identification of the error classes and the cross-lingual comparison of the documents pointed to many interesting cases in which an Interlingual approach (or at the very least, an approach with a deeper semantic representation) would improve the capability of the MT system to provide a fluent English translation. Finally, we chose several examples from the large pool we had collected.
For the selected examples, we propose an Interlingual representation, based on Dorr's (1993) LCS, to be discussed at the workshop.
The errors were entered into a grid that had, for each segment, the following slots:
the original source language text
the output of each of the MT systems (For each segment containing an error by either MT system, the output of both systems was entered into the grid. It is interesting to note that the two systems often made errors on the same segment; sometimes the errors were the same, but often they were not.)
the human translation or gloss of the segment
the corresponding segment from the English version of the document
a classification of the error.
Classification of the errors was then done based on all of the factors listed above, as well as the authors' knowledge of the systems in question. The result is the observational classification of errors below. We do not claim that this is a complete classification of errors made by transfer-based MT systems, or even that this is the correct classification. The purpose of the classification is merely to separate those problems that are amenable to an IL treatment from those that are not.
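The grid and the two-way grouping described above can be pictured as a small record type plus a partition step. The following is only a sketch: the field names, the class labels ("word sense", "source language analysis") and the function `partition_by_il` are hypothetical names chosen for illustration, and the two rows are segments quoted elsewhere in this paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GridEntry:
    """One row of the error grid. Field names are illustrative assumptions."""
    source_text: str                # original source language segment
    mt_outputs: Dict[str, str]      # MT system name -> its output
    human_translation: str          # human translation or gloss
    english_version: Optional[str]  # segment from the English version, if known
    error_class: str                # observational classification label


def partition_by_il(entries, il_amenable_classes):
    """Split entries into (helped by an IL, not helped by an IL)."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.error_class in il_amenable_classes].append(entry)
    return groups[True], groups[False]


entries = [
    GridEntry(
        source_text="En cinco años BancoSol se ha convertido ...",
        mt_outputs={"Systran": "Into five BancoSol years one has become ...."},
        human_translation="In five years BancoSol has become ...",
        english_version=None,  # not quoted in the text
        error_class="source language analysis",  # hypothetical label
    ),
    GridEntry(
        source_text="Un succès éclatant",
        mt_outputs={"Globalink": "A Vivid Success"},
        human_translation="A Stunning Success",
        english_version=None,  # not quoted in the text
        error_class="word sense",  # hypothetical label
    ),
]

helped, not_helped = partition_by_il(entries, {"word sense"})
```

Keeping the class label as a plain string reflects the observational nature of the classification: labels can be revised without changing the grid itself.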
We identified four classes of errors that would not be remedied by an IL approach:
We witnessed two distinct subclasses of this class. First, there were untranslated words due to errors in the source text. The Spanish text, for example, contained hyphenated words in the middle of a line of running text (e.g. infe - rior for inferior), presumably an artifact of the conversion of the document from a previous format. Second, there were, not surprisingly, gaps in the MT systems' lexicons (unknown words). The effects of this ranged from the system's attempting to decompose the unknown word and translate its constituent parts (e.g. micro-entreprise in the French document was split into micro entreprise and then translated as "microphone company") to the system's leaving the word in the source language (e.g. unterstüzt in the German document).
In certain cases, there were problems with the analysis of the source language. Based on the data available to us in the grid mentioned in the previous section, these errors were due to an inability to recognize and handle proper nouns (in various documents, the author's name, María Otero, was variously translated as Marrying Otero, Maria Hill, and Maria Knoll), to inadequate morphological processing, or to incomplete coverage of the grammar. As an example of the last of these, consider the following:
Sp: En cinco años BancoSol se ha convertido ...
Gls: In five years BancoSol itself has converted
Human: In five years BancoSol has become ...
Systran: Into five BancoSol years one has become ....
We can assume that Systran analyzed BancoSol as a modifier of años, and then, left with no subject for the following verb, inserted the generic pronoun "one". An Interlingual representation would not help in cases in which the syntactic analysis has similarly gone askew.
Failure to recognize Phrasal Units
This is similar to the first class in that it is a problem that points to gaps in the MT systems' lexicons. For instance, Systran translated the French phrase à but non-lucratif (to/with goal non-lucrative) as "with nonlucrative goal", rather than as "non-profit". An Interlingua could possibly handle gaps in the phrasal lexicon for fixed phrases whose meaning is compositional; however, an Interlingua is a powerful tool that is better suited to more complex problems.
This class is similar to the second class in that the correct semantic analysis hinges on the ability to resolve surface-level ambiguity; however, this class refers more to the level of lexical ambiguity. It is truly a borderline call as to whether this type of problem can be handled by using an Interlingual approach. The determination will depend on how strong one assumes the analysis modules to be. If the assumption is that all ambiguity would be removed during analysis, thus arriving at an unambiguous IL representation, then an appropriate translation would be produced. Consider, though, the following example from the German text. On the morphological level, Darlehen is ambiguous with respect to number. There is nothing else in the source language sentence that indicates whether it is to be singular or plural; a human is able to disambiguate based only on world knowledge. If this is to be handled by an IL system, it would be by virtue of access to an extensive knowledge base and the ability to draw the relevant inferences from it.
More interestingly, we discovered that six of our classes would benefit from an Interlingual treatment. They were the following.
Naturally, since an IL representation of a source language sentence is intended to be unambiguous and meaning-preserving, cases involving some problem with word sense would benefit from an IL treatment. This is a vast class, and includes issues such as polysemy (more specifically, problems with prepositional meaning) and fine-grained sense distinctions. For example, the French document contains the subheading "Un succès éclatant" (lit. "a success bursting"), which was translated as "A Vivid Success", "A Bright Success", and "A Stunning Success" by Globalink, Systran, and a human translator, respectively. None of these is necessarily incorrect, but perhaps one would capture the meaning of the original more precisely than the others. As a more mundane example, the noun portefeuille has the multiple senses "wallet" and "portfolio", both of which could potentially be valid in a financial domain. A clear (IL) meaning representation of the source sentence would enable the generation of the correct form in the target language.
Similar to the example of pescado (fish prepared to be eaten) versus pez (live fish), provided in Dorr (1993), it is common that a word in the source language will contain more or less information than its counterpart in the target language. Thus, in English, we have the three lexical items "who", "that", and "which", which are marked for [+human], [+restrictive], and [-restrictive], respectively. In the French document, on the other hand, we find relative clauses subordinated with qui, which is unmarked for [human] or [restrictive]. Without access to these features in a deeper meaning representation, the syntactic-transfer-based systems continually make errors concerning the choice of the correct term in English.
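To make the role of these features concrete, here is a minimal sketch of how a generator might choose the English relativizer once the deeper representation supplies [human] and [restrictive]. The function name `english_relativizer` is hypothetical, and giving [+human] priority over [restrictive] is an assumption made for illustration, not a claim about any of the systems discussed.

```python
def english_relativizer(human: bool, restrictive: bool) -> str:
    """Pick an English relative pronoun from two semantic features.

    Assumption: [+human] takes priority; otherwise [restrictive] decides
    between "that" and "which".
    """
    if human:
        return "who"
    return "that" if restrictive else "which"


# French "qui" carries neither feature, so a transfer system must guess;
# with the features available, the choice becomes a simple lookup.
choices = {
    (h, r): english_relativizer(h, r)
    for h in (True, False)
    for r in (True, False)
}
```

The point of the sketch is only that the decision is trivial once the features are present; the difficulty lies in recovering them during analysis.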
Lexical Incorporation / Conflation
In her list of divergence types, Dorr (1993) lists conflational divergence, or cases in which a single word or concept in a language is represented by more than one word in another language. This type of problem is particularly suited to IL treatment. An example from the French text is found in the phrase (qui) étaient au départ, which could be rendered (as Systran did) "(which) were at the start", but which was much more fluently rendered in the version originally authored in English as "(which) started ... as".
Among the sentence-level phenomena that we encountered was verbal aspect. In order to handle this particular phenomenon, sentence-level semantics must be taken into account. This is beyond the capabilities of most current systems.
Although we have not (yet) encountered any discourse-level translation problems in the text in question, this class is a natural progression from the sentence-level phenomena. Given a large enough body of text, discourse-level phenomena would certainly present themselves. For a discussion of IL treatment of discourse-level phenomena, see (LuperFoy & Miller, 1997).
There is some question and debate as to whether even an Interlingua that is maximally expressive and capable of producing unambiguous representations of all natural language sentences would be able to deal with the subtleties of language processing that humans take for granted. Among these are humor and stylistic variation, both of which are found in the workshop text. For example, titles in the English and Russian versions contain puns on the business name "ACCION", which are not present in the Romance language versions. Furthermore, there are other stylistic differences between the documents, including differences in the semantic content of headings and differences in the segmentation of thoughts into sentences. It is debatable whether an MT system will be able to produce such subtle differences in the foreseeable future.
An Interlingua: Lexical Conceptual Structures
We will represent some sentences from the source text in the style of lexical conceptual structures (LCS). A sentence is represented by a composed LCS (CLCS) and a lexical entry is represented by a root LCS (RLCS). At present, verbs, prepositions and nouns which denote events or states are assigned lexical entries; nouns which do not represent events or states are treated as atomic entities.
The core insight which allows CLCSs to represent a wide range of sentences is that the semantics of motion and location serve as an analogy for many other areas of human understanding [Gruber 1965]. For instance, by regarding time as a one-dimensional space, events can be positioned in time (sentence 1), just as things or events are positioned in space (sentence 2).
The meeting is in Dr. Dorr's office.
The distinction here is among semantic fields, temporal and locational respectively. A semantic field can be described as an analogy between physical motion and a motion analog and between physical position and a position analog. Semantic fields can be described by three parameters [Jackendoff 83]:
what type of entities can assume the role of theme, i.e., undergo the motion analog or be located by the position analog,
what type of entities can serve as reference objects to define the path of the motion analog or location of the position analog,
what concept is represented by analogy to motion and location.
In order to understand the different semantic fields, we must first know what values can fill in the parameters described above. The types of entities, or semantic types, which will serve to restrict themes and reference objects with respect to fields are as follows. Things include all physical objects (e.g. table, telephone, James Garfield) as well as non-physical human concepts (e.g. idea, hope, intelligence, Calculus, "Romeo and Juliet") and human conceptual organizations applied to physical objects (e.g. the faculty of the Computer Science department, MITRE Corporation). States are sets of conditions which apply to some universe over time. Events are changes in those conditions: in an event, something happens. Times are points or intervals along the one-dimensional timeline. Properties are characteristics which can apply to things.
In [Dorr 93] nine semantic fields are identified: locational, possessional, identificational, perceptual, existential, circumstantial, intentional, instrumental and temporal. In addition, some causative events may exist without any state. Instrumental and intentional fields do not apply to events and states, but are used in adding additional information about an event or state to a CLCS. Table 1 shows the restrictions on semantic types as they serve as parameters of themes.
For the locational field, the relevant issues are where a thing is located, where an event takes place, and where a state holds. If a reference object is a thing, the location of the theme is given relative to the location of the reference object. If the reference object is an event, the location of the theme is given relative to where the event takes place. In the possessional field, the theme is possessed by the reference object. More recent work has downplayed the use of the temporal field in favor of representing temporal information in modifiers. Themes of the identificational field are instances of or identical to a reference object, or bear a property. In the existential field, the theme is understood to exist for things, to have happened for events, or to hold true for states. In the circumstantial field, the theme is a character in a reference situation (event or state), as in "Ludwig continues to compose string quartets" or "Xavier avoids being rude." Lastly, the theme of the perceptual field perceives a reference thing or situation.
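The three parameters of a semantic field, together with the semantic types listed earlier, can be pictured as a small record. The sketch below is an illustration only: the names `SemType`, `SemanticField` and `allows` are hypothetical, and the type sets for the two example fields are read off the prose above rather than taken from Table 1, so they may be incomplete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class SemType(Enum):
    THING = "thing"
    STATE = "state"
    EVENT = "event"
    TIME = "time"
    PROPERTY = "property"


@dataclass(frozen=True)
class SemanticField:
    name: str
    theme_types: FrozenSet[SemType]      # what may undergo the motion/position analog
    reference_types: FrozenSet[SemType]  # what may define the path or location analog
    concept: str                         # what is represented by analogy

    def allows(self, theme: SemType, reference: SemType) -> bool:
        """True if this theme/reference pairing fits the field's restrictions."""
        return theme in self.theme_types and reference in self.reference_types


# Assumption: type sets below follow the prose description, not Table 1.
LOCATIONAL = SemanticField(
    name="locational",
    theme_types=frozenset({SemType.THING, SemType.EVENT, SemType.STATE}),
    reference_types=frozenset({SemType.THING, SemType.EVENT}),
    concept="physical location",
)
TEMPORAL = SemanticField(
    name="temporal",
    theme_types=frozenset({SemType.EVENT}),
    reference_types=frozenset({SemType.TIME}),
    concept="position in time",
)
```

Under these assumed restrictions, `LOCATIONAL.allows(SemType.THING, SemType.THING)` holds, while `TEMPORAL.allows(SemType.THING, SemType.TIME)` does not, since only events are given as temporal themes in the prose.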
A place is a prepositional relation to one or more reference objects. The reference object can be any of the semantic types, and the prepositional relation describes the position of a theme with respect to the reference object(s) (e.g., above, at, between). A prepositional relation may make use of zero or more reference objects, as in
"between World War I and World War II"
A path is a path operation applied to one or more places.
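As a rough illustration of these two definitions, a place can be modelled as a relation over reference objects and a path as an operation over places. The class names, the `short_form` function and the parenthesized rendering are assumptions made for this sketch; the primitives "toward" and "at" are used only as representative examples.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Place:
    relation: str                # prepositional relation, e.g. "between"
    references: Tuple[str, ...]  # zero or more reference objects


@dataclass(frozen=True)
class Path:
    operation: str               # path operation, e.g. "toward"
    places: Tuple[Place, ...]    # one or more places

    def __post_init__(self):
        if not self.places:
            raise ValueError("a path must apply to one or more places")


def short_form(node: Union[Place, Path]) -> str:
    """Render a place or path in an informal parenthesized notation."""
    if isinstance(node, Place):
        args = " ".join(f"({ref})" for ref in node.references)
        return f"({node.relation} {args})" if args else f"({node.relation})"
    inner = " ".join(short_form(place) for place in node.places)
    return f"({node.operation} {inner})"


between_wars = Place("between", ("World War I", "World War II"))
toward_office = Path("toward", (Place("at", ("office",)),))
```

Here `short_form(between_wars)` gives `(between (World War I) (World War II))`, showing a single prepositional relation that takes two reference objects.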
We will not attempt to enumerate the primitives of the LCS system here; however, we will make use of representative primitives in our examples.
CLCSs take the form of trees with each node of the tree an instance of a common node structure. In our examples we will make use of an informal adaptation of the LCS short-form notation. Moreover, the process of encoding sentences as CLCSs is, at present, something of a craft, and quite a non-trivial craft for complex sentences, such as those found in the article by Otero.
Applying the Craft: Mapping Sentences into Composed Lexical Conceptual Structures
In this section we will examine three sentences and their representations as CLCSs. Two of the sentences are from the English version of [Otero 97] and one is from the Spanish version. The Spanish sentence corresponds to the second of the English sentences.
The LCS model is oriented towards representation of verbs, and likewise the process of constructing an LCS representation of a sentence begins with a root LCS, or lexical entry, for a verb. For instance, the Spanish verb, "administrar," meaning "manage", is represented by this structure, somewhat simplified here:
This may be interpreted as saying that thing 2, the manager, and thing 6, the managed entity, are together in a managing relationship. This LCS corresponds to the top level of Example 3.
"ACCION International is a U.S.-based private non-profit organization that currently provides technical assistance to a network of institutions in thirteen countries in Latin America and six cities in the United States."
(be ident (ACCION International)
(at ident (ACCION International)
(go poss (assistance (technical))
To clarify the sense of this representation, we offer a rough human translation of it. ACCION International is identificationally an organization. The organization has the characteristics of being U.S.-based, private, non-profit, and of participating as agent in a cause event. In the representation of the cause event it is referred to as *head*, meaning that the object modified participates in the IL which modifies it. In this case, the organization causes technical assistance to go possessionally to a network. The network is composed of institutions, which are in the conjunction of countries and cities. The countries are further modified as being thirteen and in Latin America; the cities are six and in the United States. The organization causes this to happen in a manner described as "providingly" and the entire event is further modified as happening "currently."
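To show how such a tree can be held and printed, the following sketch builds a small fragment modelled on the prose description above (the organization, as *head*, causes technical assistance to go possessionally to a network). It is a hypothetical illustration, not the CLCS used in this paper: the `Node` class, the `render` and `parent_head` functions, and the choice of "toward" as the path primitive are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    head: str                          # primitive or atomic concept
    field: Optional[str] = None        # semantic field, e.g. "poss", "ident"
    children: Tuple["Node", ...] = ()  # arguments and modifiers


def render(node: Node) -> str:
    """Print a node in an informal parenthesized short form."""
    parts = [node.head]
    if node.field:
        parts.append(node.field)
    parts.extend(render(child) for child in node.children)
    return "(" + " ".join(parts) + ")"


def parent_head(root: Node, target: str) -> Optional[str]:
    """Return the head of the node that directly dominates `target`."""
    for child in root.children:
        if child.head == target:
            return root.head
        found = parent_head(child, target)
        if found is not None:
            return found
    return None


# Hypothetical fragment; "toward" as the path primitive is an assumption.
fragment = Node("cause", children=(
    Node("*head*"),
    Node("go", "poss", (
        Node("assistance", children=(Node("technical"),)),
        Node("toward", "poss", (Node("network"),)),
    )),
))
```

Calling `render(fragment)` yields `(cause (*head*) (go poss (assistance (technical)) (toward poss (network))))`, and `parent_head(fragment, "technical")` returns `"assistance"`, which is the sense in which the tree fixes what modifies what.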
Note that there is no confusion in this representation about what modifies what, owing to the tree structure. However, there is some confusion about what the modifiers mean. It should be understood that the atomic words appearing in the LCS stand for concepts rather than lexical items; however, how this is done is not clear. For instance, that the "countries" are "thirteen" could mean that they are thirteen in number, that they are thirteen years old, or that they are number thirteen in some context-specific ordering. Obviously, the handling of nominals in this system is still an issue. In the future a system of qualia based on [Pustejovsky 95] may be incorporated into the LCS IL.
Next we consider a pair of translation equivalent texts. Note that the English text is broken into two sentences, and thus represented by two CLCSs.
"Banco Solidario, S.A., or BancoSol, grew out of a non-profit joint venture created in 1986 by prominent members of the Bolivian business community and ACCION International. The latter brought with them leadership and seed capital, while the former provided technology and methodology."
(be ident (*head*) (BancoSol)))
(toward ident (Banco Solidario)
(at exist (*head*) (EXIST)))))
(cause (community (business) (Bolivian))
(go poss (conj technology methodology))
"En sus comienzos, en 1986, BancoSol era una asociación sin fines de lucro adminstrada conjuntamente por ACCIÓN Internacional - que se encargaba de la gestión y proporcionaba el capital incial - y por representantes de los círculos bolivianos de los negocios, que suministraban apoyo logístico y su conocimiento del terreno."
(co loc (*head*) (management?))))
(toward ident (nil? BancoSol?)
(community (business) (Bolivian)))
(toward ident (nil? BancoSol?)
One might expect that the LCS representations of translation equivalent texts should be quite similar. The vast divergence between these IL representations should not be construed as a failure of the representation, however. These are mappings into IL of sentences that are equivalent in the mind of a human translator, who performed many semantic level transformations on the content before generating the final form.
Creating CLCS representations for sentences is still a difficult and unclear process, even for humans. We will continue to model sentences from source texts as CLCSs and examine the properties of these representations, both to discover ways in which the representation may be strengthened and to better understand the process, so that we can ultimately design systems which can automatically create CLCSs from text. In addition, further study of divergent CLCSs representing semantically close sentences, both between two languages and within one language, will aid in the process of discovering constructs which truly represent semantic content divorced from the details of surface form.
Dorr, Bonnie, Machine Translation: a View from the Lexicon, MIT Press, 1993.
Gruber, Jeffrey S. Studies in Lexical Relations. Doctoral Dissertation, MIT, Cambridge MA, 1965
Jackendoff, Ray S. Semantics and Cognition. MIT Press, Cambridge, MA, 1983
LuperFoy, Susann and Miller, Keith, The Use of Pegs Computational Discourse Framework as an Interlingua Representation, AMTA SIG-IL First Workshop on Interlinguas (held at MT Summit VI), San Diego, California, 1997.
Miller, Keith and David Zajic, Lexicons as Gold: Mining, Embellishment, and Reuse, AMTA-98 (to appear).
Otero, María, "Latin America: ACCION Speaks Louder Than Words", UNESCO Courier, January, 1997.
Pustejovsky, James, The Generative Lexicon, MIT Press, 1995.
Reeder, Florence and Dan Loehr, Finding the Right Words, AMTA '98 (to appear).