Mikrokosmos (μK) is a knowledge-based machine translation (KBMT)
system under development at the Computing Research Laboratory (CRL) of
New Mexico State University (Onyshkevych and Nirenburg, 1994; Mahesh
and Nirenburg, 1995; Beale, Nirenburg, and Mahesh, 1995).
Unlike previous research in interlingual machine translation (MT), this
project is building a large-scale, practical MT system.
μK already has several thousand Spanish words in its lexicon as well as
several thousand concepts in its ontology (or world knowledge base). By the
end of the year, a lexicon of approximately 7000 Spanish words
supported by an ontology of over 5000 concepts will be in place.
High-quality meaning representations of up to 10 article-length
Spanish texts from the domain of company mergers and acquisitions will
have been produced by the μK system. In the coming years, μK will be
extended to other languages such as Arabic, Japanese, Russian,
and Thai.
A comprehensive study of the computational treatment of texts is a
multifaceted endeavor covering a wide range of linguistic and
pragmatic phenomena. Because the various facets of this knowledge are
complex in their own right, study of any individual phenomenon is
often conducted in relative isolation from the study of other related
phenomena. However, in a KBMT application, knowledge about a large
number of interrelated linguistic and language use phenomena is
required. A natural way of combining the diverse knowledge required of
such a system into a unified whole is for the various phenomena to be
treated by separate computational linguistic ``microtheories'' united
through a system's control architecture and knowledge representation
conventions.
In the Mikrokosmos project, a comprehensive study of a variety of
microtheories central to the support of KBMT systems is being carried
out with the ultimate objective of defining a methodology for
representing the meaning of natural language texts in a
language-neutral interlingual format called a text meaning
representation (TMR). The TMR represents the result of analysis of a
given input text and serves as input to the target language generator.
The meaning of the input text is represented in the TMR as
instantiated elements of an independently motivated model of the world
(or ontology). The link between the ontology and the TMR is provided
by the lexicon, where the meanings of most open class lexical items
are defined in terms of their mappings into ontological concepts and
their resulting contributions to TMR structure. The ontology and the
lexicon are the two main knowledge sources in the μK system.
Information about the nonpropositional components of text
meaning such as speech acts, speaker attitudes and intentions,
relations among text units, coreferences, etc. is also derived
from the lexicon with inputs from other microtheories and becomes
part of the TMR. The figure below illustrates the μK architecture for
analyzing input texts. The workings of this architecture are illustrated
below through an example.
Figure: The Mikrokosmos NLP architecture.
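As a rough illustration of the relationship described above, in which the lexicon maps open-class words to ontological concepts and the TMR consists of instantiated concepts, the following minimal Python sketch may help. It is not the μK implementation: the concept names, the lexicon entries, and the helpers `is_a` and `build_tmr` are all hypothetical, and the sketch omits syntax, selectional constraints, and the nonpropositional components of meaning.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical toy ontology: concept -> parent concept (None marks the root).
ONTOLOGY = {
    "ALL": None,
    "EVENT": "ALL",
    "OBJECT": "ALL",
    "ACQUIRE": "EVENT",
    "ORGANIZATION": "OBJECT",
    "CORPORATION": "ORGANIZATION",
}

# Hypothetical toy lexicon: lemmatized Spanish open-class word -> concept.
LEXICON = {
    "adquirir": "ACQUIRE",
    "empresa": "CORPORATION",
    "compañía": "CORPORATION",
}


@dataclass
class Frame:
    """One instantiated concept in a toy text meaning representation."""
    name: str         # instance name, e.g. "ACQUIRE-1"
    concept: str      # ontological concept being instantiated
    source_word: str  # lexical item that triggered the instantiation


def is_a(concept, ancestor):
    """Return True if `concept` equals or descends from `ancestor`."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY[concept]
    return False


def build_tmr(lemmas):
    """Instantiate one frame per word that the toy lexicon knows about."""
    seen = Counter()
    frames = []
    for lemma in lemmas:
        concept = LEXICON.get(lemma.lower())
        if concept is None:
            continue  # closed-class or unknown words are skipped in this toy
        seen[concept] += 1
        frames.append(Frame(f"{concept}-{seen[concept]}", concept, lemma))
    return frames


# Input is assumed to be already tokenized and lemmatized.
tmr = build_tmr(["la", "empresa", "adquirir", "compañía"])
for frame in tmr:
    print(frame.name, "<-", frame.source_word)
```

Numbering instances per concept keeps two mentions of the same concept distinct in the representation, which a later step such as coreference resolution could then merge or keep apart.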
Kavi Mahesh