The Universal Networking Language (UNL) project by Martins et al. (2000) addresses the problem of developing an interlingua to be used in a Web environment, for embedded Knowledge-Based Machine Translation (KBMT). The UNL is an interlingual representation composed of three components:
* Primitive concepts (Universal Words)
* Roles (Relation Labels)
* Attribute Labels (Features)
The primary goal of this representation is to provide the basis for intercommunication over the Web among speakers of different languages. The UNL-based MT system converts a source-language document into a target-language document, using the UNL as an intermediate pivot language for generating any of 13 target languages.
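To make the three components concrete, here is a minimal sketch of how such a representation might be held in memory. It is not the authors' format: the class names, the relation labels (`agt`, `plc`), and the attribute label (`@past`) are illustrative assumptions, not taken from the paper under review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UniversalWord:
    """A primitive concept plus its attribute labels (names are hypothetical)."""
    head: str
    attributes: frozenset = frozenset()

@dataclass(frozen=True)
class Relation:
    """A labeled role linking two Universal Words (labels are hypothetical)."""
    label: str
    source: UniversalWord
    target: UniversalWord

def concepts(relations):
    """Collect the distinct Universal Words used in a list of relations, in order."""
    seen = []
    for rel in relations:
        for uw in (rel.source, rel.target):
            if uw not in seen:
                seen.append(uw)
    return seen

# Illustrative encoding of a simple sentence about a cat sitting on a mat.
sit = UniversalWord("sit", frozenset({"@past"}))
cat = UniversalWord("cat")
mat = UniversalWord("mat")
sentence = [Relation("agt", sit, cat), Relation("plc", sit, mat)]

print([uw.head for uw in concepts(sentence)])
```

Under these assumptions a sentence is just a set of labeled binary relations over concepts, which is also how most existing interlingual representations could be sketched.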
While the endeavor of this group is indeed noble (incorporating 13 languages across several different linguistic groups), there seems to be relatively little awareness of the key issues associated with this approach. I will focus on five areas concerning the UNL representation and its usage for machine translation.
Little is said about how this approach differs from earlier approaches, other than that UNL is at a much earlier stage and is "more ambitious". Given that the three UNL components described above are standard in most existing interlingual representations, it is difficult to see where the true differences lie. Could the authors comment, more specifically, on how the UNL differs from the representations of other existing large-scale MT efforts, e.g., the TMRs of the approaches adopted in Pangloss and Mikrokosmos (Mahesh and Nirenburg, 1995), or from earlier work on primitive-based approaches (Schank, 1973) and thematic-role approaches (Dorr, 1993; Gruber, 1965; Jackendoff, 1983)? The authors also reuse the name KBMT without mentioning or citing the work of Nirenburg et al. (1992).
From the scant description (and only one English example), it is difficult to assess the value of this primitive-based representation. The authors state, without justification, that it is a "consistent and complete" meaning representation, yet they later reveal that components of meaning are "lost" during the translation. More should be said about which meaning components are lost and where the translation falls short as a result of this loss. Also, the authors criticize other MT approaches for their "language-dependence", yet it is unclear whether the English-specific UNLs provide the sought-after language-independence that appears in the title of the paper. The UNL is referred to as a "unique semantic representation", which presumably means that some disambiguation is necessary; if so, how is disambiguation done?
In addition to these issues, the approach will be subject to the same questions associated with other primitive-based representations:
* How many primitives are there?
* How will the primitive-based representations be acquired on a large scale?
* How will the issue of exhaustive decomposition be handled?
* How will you evaluate coverage?
The authors should comment on the difference between this ontology and other word-based ontologies such as WordNet (Miller et al., 1990). Also, there is no discussion about how this ontology is generated; is there any automation in this process? Finally, can the authors say something about the complexity of using the ontology to encode morphology? For example, "gato" and "gata" are different concepts, which seems potentially expensive. Are all masculine and feminine concepts encoded independently? What about languages that have a third gender (neuter)? What about other morphological features (number, tense, etc.)? Taken together, these feature possibilities could cause a combinatorial explosion. The authors should discuss whether it would be possible to abstract away from these sorts of distinctions and just have a general concept for "cat".
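The cost at stake can be illustrated with a rough count. The sketch below assumes a hypothetical feature inventory and lexicon size (neither comes from the paper) and compares encoding every feature combination as its own concept against keeping one general concept per lemma and factoring the features out as attribute labels.

```python
from itertools import product

# Hypothetical feature inventories; the paper does not specify these.
FEATURES = {
    "gender": ["masculine", "feminine", "neuter"],
    "number": ["singular", "dual", "plural"],
}

def concepts_per_lemma(features):
    """Entries needed per lemma if every feature combination is its own concept."""
    return len(list(product(*features.values())))

def attributes_needed(features):
    """Attribute labels needed if features are factored out of the concept."""
    return sum(len(values) for values in features.values())

LEMMAS = 1000  # assumed lexicon size, for illustration only

print(LEMMAS * concepts_per_lemma(FEATURES))  # separate concepts: 9000
print(LEMMAS + attributes_needed(FEATURES))   # general concepts plus labels: 1006
```

Each additional feature multiplies the first figure but only adds to the second, which is why a single general concept for "cat" with attribute labels seems the cheaper design.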
The UNL system converts sentential strings into their corresponding translations, but the process of mapping to and from the interlingua is not described. Is there a parser? Or does the analyzer operate directly off the string? In addition, the issue of mismatches needs further elaboration, e.g., cases such as "mat" (English) and "capacho" (Portuguese), where there isn't an exact translation. How does the "cross-linguistic lexical matching" work for this example? Is some sort of fuzzy match scheme used? The representation for "mat" is given, but none is included for "capacho". No real translation examples are given anywhere in the paper.
The authors make two claims: (1) It is possible to codify any NL sentence as a UNL representation; and (2) All languages can be encoded in this representation. What metrics are used to confirm that the mapping in (1) has been done accurately? How is (2) verified? With a test suite? (How many sentences?) Finally, since this is supposed to be an MT system for the Web, could more be said about how the system will be used and assessed in the context of a real Web-based application? (Virtually no mention of the Web is made in the entire paper.)
Final comments: While the approach is rather mysterious in its creation, theoretical underpinnings, and processing methods, it is clearly a large-scale, multinational effort that is deserving of attention from the MT/IL community.
REFERENCES.
Bonnie J. Dorr. Machine Translation: A View from the Lexicon. The MIT Press, Cambridge, MA, 1993.
Jeffrey S. Gruber. Studies in Lexical Relations. Doctoral Dissertation, MIT, Cambridge, MA, 1965.
Ray Jackendoff. Semantics and Cognition. The MIT Press, Cambridge, MA, 1983.
Kavi Mahesh and Sergei Nirenburg. A Situated Ontology for Practical NLP. In Proceedings of IJCAI-95 Workshop on Basic Ontological Issues in Knowledge Sharing, 1995.
G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Five papers on WordNet. CSL Report 43, Cognitive Science Laboratory, Princeton University, 1990.
Sergei Nirenburg, Jaime Carbonell, Masaru Tomita, and Kenneth Goodman. Machine Translation: A Knowledge-Based Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
Sergei Nirenburg, David Farwell, and Yorick Wilks. Multi-Engine, Adaptive MT. In Proceedings of Fifteenth International Conference on Computational Linguistics, 1994.
Roger Schank. Identification of Conceptualizations Underlying Natural Language. In Computer Models of Thought and Language, Freeman, San Francisco, CA, 114-151, 1973.
Copyright 2000 Computing Research Lab.