Natural Language Processing and AI
Mark W. Davis
Synopsis paper from the SIGIR97 Workshop on Translingual Text
Retrieval in Philadelphia, July, 1997
Mark W. Davis
QUILT (Query User Interface with Light Translations) is a prototype
implementation of a complete cross-language text retrieval system that
takes English queries and produces English gloss translations of
Spanish documents. The system indexes the Spanish documents in
Spanish, but converts the English query into a Spanish equivalent set
through a novel combination of lexical methods and parallel-corpus
disambiguation. Similar methods are applied to the returned document
to produce a simple translation that can be examined by non-Spanish
speakers to gauge the relevance of the document to the original
English query. The system integrates traditional, glossary-based
machine translation technology with information retrieval approaches
and demonstrates that relatively simple term substitution and
disambiguation approaches can be viable for cross-language text
retrieval. Components of QUILT have been used to build a CLTR
interface to WWW-based search services.
Mark W. Davis
In Cross-Language Text Retrieval, queries in one language retrieve
documents in other languages. Query translation is the least expensive
approach to the retrieval task when compared to full document
translation. The simple combinatorial properties of vector-based text
retrieval systems simplify the translation task enormously, reducing
most translation to the correct substitution of equivalents from a
bilingual lexicon or corpus. New experiments are presented on methods
for selecting among potential equivalents from a bilingual lexicon,
including one fully-automatic method that achieves 73.5% of the
performance of a monolingual system operating on the same retrieval task.
In Proceedings of the Fifth Text Retrieval Evaluation Conference
(TREC-5)
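The substitution of equivalents from a bilingual lexicon described in the abstract above can be illustrated with a minimal sketch. The toy lexicon, the function names and the `first_sense` selection rule are all assumptions made for illustration; they are not the selection methods evaluated in the paper.

```python
from collections import Counter

# Hypothetical toy English-to-Spanish lexicon; entries are illustrative only.
LEXICON = {
    "bank": ["banco", "orilla"],
    "interest": ["interés"],
    "rate": ["tasa", "tipo"],
}


def translate_query(query, lexicon, select=None):
    """Map a bag of source terms to a bag of target terms.

    select(term, candidates) returns the equivalents to keep; by default
    every candidate is kept. Word order is ignored, as in a vector-based
    retrieval model. Terms missing from the lexicon are passed through.
    """
    target = Counter()
    for term in query.lower().split():
        candidates = lexicon.get(term, [term])
        chosen = select(term, candidates) if select else candidates
        target.update(chosen)
    return target


def first_sense(term, candidates):
    """Assumed selection rule: keep only the first listed equivalent."""
    return candidates[:1]


all_equivalents = translate_query("bank interest rate", LEXICON)
first_only = translate_query("bank interest rate", LEXICON, first_sense)
```

Keeping every equivalent adds spurious terms such as "orilla" (river bank) to the query, which is why the choice of selection rule matters.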
Mark W. Davis and Ted E. Dunning
Multilingual information retrieval (IR) systems
apply queries in one language to a document
collection in several different languages with
the goal of retrieving only those documents
relevant to the query. At first glance, deep linguistic
analysis and translation of the query
appears necessary before retrievals can be performed.
IR systems are unique in natural language
processing, however, because a pattern
of term occurrences in a document generally
suffices to determine the subject matter; word
order is largely irrelevant. Translated queries
are therefore primarily derived by a mapping
from a word set in the query language to a
word set in the language of the derived query.
This paper follows upon the work reported in
Davis and Dunning (1995b). Evolutionary Programming
is again applied to optimize the
performance of translated queries, but the initial
queries are derived from simple corpus
statistics, rather than from a prepared lexicon.
Results indicate substantial performance gains
can be made in many cases by a simple strategy
of eliminating and repeating terms from
the queries.
In Proceedings of the Fifth Conference on
Evolutionary Programming, March 1996
Mark W. Davis and Ted E. Dunning
In a Multi-lingual Text Retrieval (MLTR) system, queries in one language
are used to retrieve documents in several languages. Although all of the collection
documents could be translated to a single language, a more efficient
approach is to simply translate the queries into each of the document languages.
We have investigated five methods for query translation that rely on
lexical-transfer and corpus-based methods for creating multi-lingual queries.
The resulting queries produced by these systems were then used in a competitive
information-retrieval environment and the results evaluated by the
TREC evaluation group.
In Proceedings of the Fourth Text Retrieval Evaluation Conference,
NIST, November 1995
Mark W. Davis, Ted E. Dunning and William C. Ogden
Alignment methods based on byte-length
comparisons of alignment blocks have been
remarkably successful for aligning good
translations from legislative transcriptions.
For noisy translations in which the parallel
text of a document has significant structural
differences, byte-alignment methods often
do not perform well. The Pan American
Health Organization (PAHO) corpus is a
series of articles that were first translated by
machine methods and then improved by professional
translators. Many of the Spanish
PAHO texts do not share formatting conventions
with the corresponding English documents,
refer to tables in stylistically different
ways and contain extraneous information. A
method based on a dynamic programming
framework, but using a decision criterion
derived from a combination of byte-length
ratio measures, hard matching of numbers,
string comparisons and n-gram co-occurrence
matching substantially improves the
performance of the alignment process.
In Proceedings of the Seventh European Conference
of the ACL, Dublin, Ireland, March 1995
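The dynamic programming framework with a combined decision criterion, as described in the abstract above, might look roughly like the following sketch. It covers only two of the named signals (byte-length ratio and hard matching of numbers); string comparison and n-gram co-occurrence are omitted. The penalty constants, the cost formula and the example blocks are assumptions, not values from the paper.

```python
import math
import re

SKIP_COST = 1.5       # assumed penalty for leaving a block unaligned
NUMBER_PENALTY = 2.0  # assumed penalty when the numbers in two blocks differ


def match_cost(a, b):
    """Cost of aligning block a with block b (lower is better)."""
    len_a = len(a.encode("utf-8"))
    len_b = len(b.encode("utf-8"))
    ratio_cost = abs(math.log((len_b + 1) / (len_a + 1)))
    same_numbers = sorted(re.findall(r"\d+", a)) == sorted(re.findall(r"\d+", b))
    return ratio_cost + (0.0 if same_numbers else NUMBER_PENALTY)


def align(src, tgt):
    """Return (src_index, tgt_index) pairs on the cheapest alignment path."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i > 0 and j > 0:  # align src[i-1] with tgt[j-1]
                moves.append((cost[i - 1][j - 1] + match_cost(src[i - 1], tgt[j - 1]), (i - 1, j - 1)))
            if i > 0:            # skip a source block (extraneous material)
                moves.append((cost[i - 1][j] + SKIP_COST, (i - 1, j)))
            if j > 0:            # skip a target block
                moves.append((cost[i][j - 1] + SKIP_COST, (i, j - 1)))
            for c, prev in moves:
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, prev
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):      # trace the best path back to the origin
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    return pairs[::-1]


english = [
    "Table 3 lists 120 cases.",
    "Source: internal archive, not for citation.",
    "Mortality fell in 1994.",
]
spanish = [
    "El cuadro 3 presenta 120 casos.",
    "La mortalidad disminuyó en 1994.",
]
pairs = align(english, spanish)
```

In this toy input the second English block has no counterpart, so the cheapest path skips it and pairs the remaining blocks by their matching numbers.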
Mark W. Davis and Ted E. Dunning
Multi-lingual information retrieval (IR) systems apply queries in one
language to a document collection in several different languages with the
goal of retrieving only those documents relevant to the query. At
first glance, deep linguistic analysis and translation of the query appears
necessary before retrievals can be performed. IR systems are unique in
natural language processing, however, because a pattern of term occurrences in
a document generally suffices to determine the subject matter; word
order is largely irrelevant. Translated queries are therefore primarily derived
by a mapping from a word set in the query language to a word set in
the language of the derived query. Large parallel text collections with
sentence-level alignments can provide a baseline for evaluating the correctness
of a query translation, but the determination of members of the query
translation remains problematic. Constructing a query from machine-readable,
bilingual dictionaries and assigning term weights by the evolutionary
optimization of a population of potential weighting schemes presents a
solution to the difficulties of generating translated queries. In this
approach, differences in the rank statistics on the comparative recall
results for a query against its native language and its translation
against its native language determine the fitness of a tentative query translation.
In Proceedings of the Fourth Annual Conference on Evolutionary Programming,
San Diego, CA, March 1995
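The idea of evolving term weights, with fitness measured by how closely the translated query reproduces the native query's document ranking on aligned text, can be sketched as follows. The toy corpus, the candidate term list, the dot-product scoring, the mutation scheme and all parameter values are assumptions for illustration and are not taken from the paper.

```python
import random

# Hypothetical aligned corpus: ENGLISH[i] and SPANISH[i] are translations.
ENGLISH = [
    "bank raises interest rate",
    "river bank floods village",
    "football team wins final",
]
SPANISH = [
    "banco sube tasa de interés",
    "orilla del río inunda pueblo",
    "equipo de fútbol gana final",
]
NATIVE_QUERY = {"bank": 1.0, "interest": 1.0, "rate": 1.0}
CANDIDATES = ["banco", "orilla", "interés", "tasa", "tipo"]  # from a dictionary


def ranks(query, docs):
    """Rank position of each document (0 = best) under a dot-product score."""
    scores = [sum(query.get(w, 0.0) for w in d.split()) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: -scores[i])
    position = [0] * len(docs)
    for rank, i in enumerate(order):
        position[i] = rank
    return position


TARGET = ranks(NATIVE_QUERY, ENGLISH)


def fitness(weights):
    """Negated sum of rank differences; 0 means identical rankings."""
    got = ranks(dict(zip(CANDIDATES, weights)), SPANISH)
    return -sum(abs(a - b) for a, b in zip(TARGET, got))


def evolve(generations=30, size=8, sigma=0.3, seed=0):
    """Mutation-only evolution with elitist selection over parents and children."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in CANDIDATES] for _ in range(size)]
    for _ in range(generations):
        children = [
            [max(0.0, w + rng.gauss(0.0, sigma)) for w in parent]
            for parent in population
        ]
        population = sorted(population + children, key=fitness, reverse=True)[:size]
    return population[0]


best = evolve()
```

Because parents compete with their children for survival, the best fitness in the population never decreases from one generation to the next.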
Ted E. Dunning and Mark W. Davis
We have designed a fully multi-lingual information retrieval system
and tested crucial parts. This system can accept a query in one
language and find documents in others. Furthermore, relevance
feedback can be used in a fully multi-lingual fashion.
Our system is based on the availability of parallel and aligned texts.
We use these texts to derive a linear approximation of the translation
process, and then use this linear transformation to implement a
conventional vector based information retrieval system. We describe
three possible techniques for deriving this translation matrix, one of
which we have implemented and tested on a moderately sized
training corpus. Our method appears to be very efficient in terms of
the size of the necessary training corpus.
Since our solution for the translation matrix is incremental in
nature, additional parallel texts can be used to augment the system at
any time.
We also present estimates of how much the multi-lingual aspect of the
system degrades performance relative to mono-lingual use.
Memoranda in Cognitive and Computer Science, Computing Research
Laboratory, NMSU, 1993
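As a rough illustration of a linear translation map that is built from aligned texts and can be augmented incrementally, the sketch below accumulates term co-occurrence counts and applies the row-normalized counts to a query vector. The abstract does not say which three derivation techniques were considered, so this co-occurrence estimate is a stand-in assumption, and the class and method names are hypothetical.

```python
from collections import defaultdict


class TranslationMatrix:
    """Sparse source-term by target-term matrix estimated from aligned text."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(float))

    def add_pair(self, source_text, target_text):
        """Incrementally fold one aligned text pair into the counts."""
        for s in source_text.split():
            for t in target_text.split():
                self.counts[s][t] += 1.0

    def translate(self, query_vector):
        """Apply the linear map: sum of weight * normalized row per source term."""
        target = defaultdict(float)
        for s, weight in query_vector.items():
            row = self.counts.get(s, {})
            total = sum(row.values())
            for t, c in row.items():
                target[t] += weight * c / total
        return dict(target)


matrix = TranslationMatrix()
matrix.add_pair("the house", "la casa")
matrix.add_pair("the dog", "el perro")
matrix.add_pair("white house", "casa blanca")

translated = matrix.translate({"house": 1.0})
```

The translated vector can then be scored against target-language document vectors in the usual way, and further aligned pairs can be added with `add_pair` at any time without recomputing earlier ones.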