Collaborative Research: Interlingual Annotation of Multilingual Text Corpora
This project is a collaboration among six research institutions: CRL New
Mexico State University, ISI University of Southern California, UMIACS
University of Maryland, LTI Carnegie Mellon University, Columbia
University, and The MITRE Corporation. The research aims to provide a
well-defined, motivated, and practical semantic level of representation
that captures information from natural language text. We refer to this
level of representation as an "interlingual representation". The novelty
of the research comes not only from the interlingual representation
itself, but also from an improved methodology for designing and
evaluating such representations.
The research has four aspects. First, we will compile a collection of
texts in six or seven non-English languages, each coupled with at least
three translations into English. Second, we will develop an interlingual
representation framework based on the careful study of these parallel
text corpora. The framework will include a formal definition of the
representation language along with coding manuals for the main components
of meaning (e.g., event, time, aspect, and modalities). Third, we will
annotate these bilingual corpora using the agreed-upon interlingual
representation. This effort will also allow those corpora to be extended
in a straightforward way, without further research. Fourth, we will
develop metrics for evaluating interlingual representations and for
choosing a grain size of meaning representation that is appropriate for a
given task. The metrics are based on inter-coder reliability, the growth
rate of the interlingual representation, and the quality of the target
language text that can be generated from the interlingua.
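
As an illustration of the inter-coder reliability component, the sketch
below computes Cohen's kappa for two annotators who label the same items.
The project summary does not say which agreement statistic is used, so
the choice of kappa is an assumption, and the function name and the
EVENT/STATE labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders on the same items.

    Assumption: Cohen's kappa stands in for the project's unspecified
    inter-coder reliability measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: share of items on which the coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each coder assigned labels independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four annotated items.
coder_a = ["EVENT", "STATE", "EVENT", "EVENT"]
coder_b = ["EVENT", "STATE", "STATE", "EVENT"]
print(cohen_kappa(coder_a, coder_b))
```

A value near 1 indicates agreement well above chance, and a value near 0
indicates agreement no better than chance. Comparing such scores across
coarser and finer label sets is one way the grain-size question could be
explored.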
For further information about this project, contact Dr. Steve Helmreich
or Dr. David Farwell.
IL Annotation project's website