An interlingua aiming at communication on the Web: How language independent can it be?
by
Ronaldo Texeira Martins, Lucia Helena Machado Rino, Maria das Graças Volpe Nunes, Gisele Montilha, Osvaldo Novais de Oliveira Jr.

Critique by Eduard Hovy
USC Information Sciences Institute


This paper provides a clear and honest description of one UNL effort to build an interlingua, together with several parsers and generators, for the purpose of making web pages universally translatable.

UNL can be discussed from various perspectives. Since the effort is still at an early stage, it is not possible to talk about actual working results. One is therefore forced to comment on the general approach being taken, both technically and administratively, and compare that to similar efforts.

Administrative comment. Unfortunately, UNL is not an open enterprise--one has to pay money to join. That immediately limits the amount of participation by experts in Interlinguas and MT. This is a pity, for the project loses a vast source of experience, and is hence doomed to rediscovering some well-known problems. I suspect that if UNL were to institute regular report/oversight meetings with presentations to a panel of experts in MT, Interlinguas, etc., it (or some of its component projects) would benefit a lot.

Technical comments. Now to the technical aspects of the paper. It is interesting to read the description of UNL in Section 3 and the problems encountered in Section 5, because much of Section 3 is exactly the experience of previous attempts to define Interlinguas or semantic representation schemes (see, say, Conceptual Graphs (Sowa), MIKROKOSMOS (Nirenburg et al.), and others), while much of Section 5 has been discussed in the work on Dorr's LCS structures, Nirenburg et al.'s KBMT, and others. It is true that neither Interlingua design nor an Interlingua proper has ever been carried beyond a very limited-domain scale, so there is still a lot of room for experimentation with the general case. However, it is also true that the valuable lessons reflected in the abovementioned work should not simply be ignored. This paper is missing a number of important references.

Universal Words. How many UWs are there (today)? How many should there be, in the ideal case? What is the policy for creating new UWs? Can one simply define a term as soon as one identifies a concept with a slightly different shade of meaning? That is, are Forest, Jungle, and Woods all different? What makes them different? How will one explain the difference to a non-English speaker, so that he or she can make the appropriate links from his or her lexicon to these UWs? As Graeme Hirst showed in a paper in the early 1990s, semantic plesionyms (near-synonyms) are usually qualified in dictionaries by descriptive footnotes, but these are hard to formalize. Going beyond semantics, are Terrorist, Freedom Fighter, and Guerrilla different? Here the difference lies in the interpersonal/pragmatic connotations--see the work on linguistic metafunctions by Halliday (1977) or the study of language generation with appropriate pragmatic connotations by Hovy (1988).

To overcome the difficulty of mapping words with nearly-identical meanings across various languages, the EuroWordNet project (Vossen et al., late 1990s) allowed researchers to build their own computational lexicons and then link the ones that truly were identical using the Interlingual Index (ILI). One thus has little 'clouds' of similar words in each language, with their anchor points mapped to the ILI and thereby providing a link to other languages. This loose linkup made it easier to find close matches across languages.
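The 'cloud plus anchor' arrangement described above can be illustrated with a minimal sketch. It follows only the description given in this review, not the actual EuroWordNet data format; the word lists, the record identifier ILI-1, and the function name close_matches are all hypothetical toy values chosen for illustration.

```python
# Toy illustration only: every entry below is hypothetical, not EuroWordNet data.

# Each (language, word) belongs to a 'cloud' represented by one anchor word.
CLOUD_ANCHOR = {
    ("en", "forest"): "forest",
    ("en", "woods"): "forest",
    ("pt", "floresta"): "floresta",
    ("pt", "mata"): "floresta",
}

# Only the anchors are linked to an (invented) Interlingual Index record.
ANCHOR_TO_ILI = {
    ("en", "forest"): "ILI-1",
    ("pt", "floresta"): "ILI-1",
}


def close_matches(lang, word, target_lang):
    """Return target-language words whose cloud shares an ILI record with `word`."""
    anchor = CLOUD_ANCHOR.get((lang, word))
    if anchor is None:
        return []
    ili = ANCHOR_TO_ILI.get((lang, anchor))
    if ili is None:
        return []
    return sorted(
        w
        for (l, w), a in CLOUD_ANCHOR.items()
        if l == target_lang and ANCHOR_TO_ILI.get((l, a)) == ili
    )
```

A word that is not itself linked to the index (here "woods") still reaches the other language through its anchor, which is the sense in which the linkup is weak: it yields close candidates rather than asserting exact equivalence.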

I saw no discussion in the paper of how UNL plans to address the problem of close but not identical matches of words to Interlingua terms. I recommend reading the authors cited here.

Relation Labels. The literature has seen many proposals for the core set of relations, often called case roles, and for the additional relations one needs. Fillmore's famous paper (The Case for Case, 1968) set the stage by defining about 10 of them (Agent, Patient, Beneficiary, etc.); later work by Schank (mid-70s), Jackendoff (approx. 1980), the Penman NL generation project (Matthiessen and Bateman 1992), and others, for example Sowa (book on ontologies, late 1990s), all provide lists. In general, one tends to find that people define either about 5 very basic relations, or about 20, or about 100--the latter usually organized in a taxonomy that supports increasing differentiation of subtypes. I prefer the last solution, because it allows one to handle the unusual but ever-present cases that always seem to make trouble in real language; for example, the CoAgent relation (X and Y get married to each other; X and Y make a legal agreement; etc.), or what one might perhaps call the InstrumentalAgent (the bomb exploding and killing 20 people). Often speakers of different languages have different intuitions about the appropriate RL to use, which leads to some of the observations in Section 5.

I would recommend that the authors study the abovementioned work and then create their own taxonomy of RLs, in collaboration with other UNL projects, trying to make as clear as possible the definition of each RL.
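To make the idea of a taxonomy of relation labels concrete, here is a minimal sketch. The labels Agent, Patient, Beneficiary, CoAgent, and InstrumentalAgent come from the discussion above; the root label Participant, the placement of CoAgent and InstrumentalAgent under Agent, and the helper names are assumptions made purely for illustration, not a proposal for the actual UNL inventory.

```python
# Hypothetical taxonomy: child label -> parent label (None marks the root).
PARENT = {
    "Participant": None,          # assumed root, not from the paper
    "Agent": "Participant",
    "Patient": "Participant",
    "Beneficiary": "Participant",
    "CoAgent": "Agent",           # assumed placement
    "InstrumentalAgent": "Agent", # assumed placement
}


def ancestors(label):
    """List the labels above `label`, nearest first."""
    chain = []
    node = PARENT[label]
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def back_off(label, supported):
    """Return the most specific supported label that is `label` or an ancestor of it."""
    for candidate in [label] + ancestors(label):
        if candidate in supported:
            return candidate
    return None
```

One advantage of the taxonomic organization is visible in back_off: a generator that only knows the coarse relations can still use a fine-grained label such as InstrumentalAgent by falling back to Agent, so differing intuitions about the most specific label need not block processing.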

Attribute Labels. I find the discussion of ALs very problematic. The examples given are almost all syntactic, not semantic, notions--tense and aspect have been studied by grammarians for a long time. The fact that one language has tenses X, Y, and Z does not mean that an Interlingua should support tenses! According to Reichenbach's (1947) theory, the tenses found in (most) languages can be decomposed into a system of three time points and the relations among them (when Eventtime precedes Speakingtime and is the same as Perspectivetime, one gets simple past tense, etc.). Here again the authors could fruitfully read the MIKROKOSMOS work and antecedent literature. Another example: determinateness (from determiners such as "the" and "a") seems to be a natural part of the Interlingua if one speaks only western European languages. But as soon as one goes to Slavic languages one realizes the emptiness of the idea, and the fact that determiners contain essentially no information (if you blank out all determiners in a text, a reader can put them back with over 96% accuracy). When the (syntactic) phenomena of determiners, aspect, tense, mood, etc., are studied across languages, the semantic import (to the extent there is any) will become clear, and can then be properly formulated and put into the Interlingua.
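The decomposition of tense into three time points mentioned above can be sketched as follows. Only the simple past case is spelled out in this critique; the remaining table rows follow the mapping commonly cited for English and should be read as assumptions, as should the function name and the use of plain numbers for time points.

```python
def reichenbach_tense(event, perspective, speaking):
    """Name the tense implied by the order of three time points.

    Arguments are any comparable numbers standing for Eventtime,
    Perspectivetime, and Speakingtime. Only a few common English
    configurations are listed; anything else returns "other".
    """

    def rel(a, b):
        return "<" if a < b else ">" if a > b else "="

    # Key: (Eventtime vs Perspectivetime, Perspectivetime vs Speakingtime)
    table = {
        ("=", "<"): "simple past",      # the case given in the text
        ("<", "<"): "past perfect",     # assumed, common English mapping
        ("<", "="): "present perfect",  # assumed
        ("=", "="): "simple present",   # assumed
        ("=", ">"): "simple future",    # assumed
    }
    return table.get((rel(event, perspective), rel(perspective, speaking)), "other")
```

The point of the sketch is that an Interlingua would store the ordering of the three points, and each generator would map that ordering onto whatever tense forms its own language happens to offer.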

Missing. A major component I find missing from the description of UNL (it may be there in the grand design, but it is not captured in this paper) is the treatment of the nonsemantic aspects of communication--the pragmatic, interpersonal, and medium-related aspects. The difference between "the man ran" and "the guy skedaddled" or "may I help you?" and "whatdaya want?" is primarily stylistic/interpersonal. To be a true reflection of the meaning of the communication, the Interlingua has to capture this aspect as well. Similarly, the difference between "There are three matters of importance. X... Y... Z..." and "There are three matters of importance: - X... - Y... - Z..." is medium-related/presentational, and may have important effects on the overall communication as well. Especially for web pages, an adequate Interlingua needs to represent these aspects, so that they can also be translated using styles and presentation formats that convey the equivalent meanings in the target language.

Conclusion. If I may, I would like to suggest a new paradigm for the project. Instead of using simple sentences such as "the cat is on the mat", the project should take from some parallel corpus--even the Bible--a few real sentences in several languages, and try to represent them (all their aspects) first as expressed in the source languages, by native speakers. Then, when the various candidate Interlingua expressions have been made, they should be brought together and compared, so that a merged Interlingua expression can be created for each sentence.

When enough such examples have been handled, the project should start taking whole paragraphs, and pay attention to the differences in cross-sentence phenomena, such as reference, sentence splitting, thematization, etc. Only after enough of these examples have been handled should all the various Interlingua fragments be assembled and synthesized into the outline of the theory of UNL. At the same time, researchers should pay particular attention to their policy regarding the creation of new UWs and RLs, and any internal organization (perhaps taxonomization into an ontology).

Overall, I think this paper describes a serious and well-intentioned effort that has the potential, if it is properly aware of previous work, to make a good contribution to the slowly-growing understanding of Interlinguas.


Last Updated: March 30, 2000

Copyright 2000 Computing Research Lab.