Natural Language Processing and AI

Graphical Models and Networks for Monolingual, Multilingual and Translingual Text Retrieval and Visualization

Mark W. Davis

Synopsis paper from the SIGIR97 Workshop on Translingual Text Retrieval in Philadelphia, July, 1997

Implementing Cross-Language Text Retrieval Systems for Large-scale Text Collections and the World Wide Web

Mark W. Davis

QUILT (Query User Interface with Light Translations) is a prototype implementation of a complete cross-language text retrieval system that takes English queries and produces English gloss translations of Spanish documents. The system indexes the Spanish documents in Spanish, but converts the English query into a Spanish equivalent set through a novel combination of lexical methods and parallel-corpus disambiguation. Similar methods are applied to the returned document to produce a simple translation that can be examined by non-Spanish speakers to gauge the relevance of the document to the original English query. The system integrates traditional, glossary-based machine translation technology with information retrieval approaches and demonstrates that relatively simple term substitution and disambiguation approaches can be viable for cross-language text retrieval. Components of QUILT have been used to build a CLTR interface to WWW-based search services.

New Experiments in Cross-Language Text Retrieval at NMSU's Computing Research Lab

Mark W. Davis

In Cross-Language Text Retrieval, queries in one language retrieve documents in other languages. Query translation is a less expensive approach to the retrieval task than full document translation. The simple combinatorial properties of vector-based text retrieval systems simplify the translation task enormously, reducing most translation to the correct substitution of equivalents from a bilingual lexicon or corpus. New experiments are presented on methods for selecting among potential equivalents from a bilingual lexicon, including one fully automatic method that achieves 73.5% of the performance of a monolingual system operating on the same retrieval task.

In Proceedings of the Fifth Text REtrieval Conference (TREC-5)
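
The substitution step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which selection methods were tested, so the rule used here (keep the equivalents that occur most often in a target-language corpus) is only an assumption, and the function name `translate_query`, the toy lexicon and the toy corpus are all invented for illustration.

```python
from collections import Counter


def translate_query(query_terms, lexicon, target_term_counts, keep=1):
    """Replace each source term with its `keep` most frequent target equivalents.

    Terms missing from the lexicon are dropped. The frequency-based choice is an
    assumed heuristic, not a method taken from the paper.
    """
    translated = []
    for term in query_terms:
        candidates = lexicon.get(term, [])
        # sorted() is stable, so candidates with equal counts keep lexicon order.
        ranked = sorted(candidates, key=lambda c: -target_term_counts.get(c, 0))
        translated.extend(ranked[:keep])
    return translated


# Toy bilingual lexicon and target-language corpus (invented data).
lexicon = {
    "bank": ["banco", "orilla"],
    "loan": ["préstamo"],
    "rate": ["tasa", "ritmo"],
}
corpus = "el banco aprobó el préstamo con una tasa baja y el banco cobró la tasa".split()
counts = Counter(corpus)

print(translate_query(["bank", "loan", "rate", "unknown"], lexicon, counts))
# prints ['banco', 'préstamo', 'tasa']
```

Because a vector-based retrieval system treats the query as a bag of terms, the output needs no word order or grammar; a larger `keep` value simply retains more of the candidate equivalents.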

Query Translation Using Evolutionary Programming for Multilingual Information Retrieval II

Mark W. Davis and Ted E. Dunning

Multilingual information retrieval (IR) systems apply queries in one language to a document collection in several different languages with the goal of retrieving only those documents relevant to the query. At first glance, deep linguistic analysis and translation of the query appear necessary before retrievals can be performed. IR systems are unique in natural language processing, however, because a pattern of term occurrences in a document generally suffices to determine the subject matter; word order is largely irrelevant. Translated queries are therefore primarily derived by a mapping from a word set in the query language to a word set in the language of the derived query. This paper follows upon the work reported in Davis and Dunning (1995b). Evolutionary Programming is again applied to optimize the performance of translated queries, but the initial queries are derived from simple corpus statistics, rather than from a prepared lexicon. Results indicate substantial performance gains can be made in many cases by a simple strategy of eliminating and repeating terms from the queries.

In Proceedings of the Fifth Conference on Evolutionary Programming, March 1996

A TREC Evaluation of Query Translation Methods for Multi-lingual Text Retrieval

Mark W. Davis and Ted E. Dunning

In a Multi-lingual Text Retrieval (MLTR) system, queries in one language are used to retrieve documents in several languages. Although all of the collection documents could be translated to a single language, a more efficient approach is to simply translate the queries into each of the document languages. We have investigated five methods for query translation that rely on lexical-transfer and corpus-based methods for creating multi-lingual queries. The queries produced by these methods were then used in a competitive information-retrieval environment and the results evaluated by the TREC evaluation group.

In Proceedings of the Fourth Text REtrieval Conference (TREC-4), NIST, November 1995

Text Alignment in the Real World: Improving Alignments of Noisy Translations Using Common Lexical Features, String Matching Strategies and N-Gram Comparisons

Mark W. Davis, Ted E. Dunning and William C. Ogden

Alignment methods based on byte-length comparisons of alignment blocks have been remarkably successful for aligning good translations from legislative transcriptions. For noisy translations in which the parallel text of a document has significant structural differences, byte-alignment methods often do not perform well. The Pan American Health Organization (PAHO) corpus is a series of articles that were first translated by machine methods and then improved by professional translators. Many of the Spanish PAHO texts do not share formatting conventions with the corresponding English documents, refer to tables in stylistically different ways and contain extraneous information. A method based on a dynamic programming framework, but using a decision criterion derived from a combination of byte-length ratio measures, hard matching of numbers, string comparisons and n-gram co-occurrence matching substantially improves the performance of the alignment process.

In Proceedings of the Seventh European Conference of the ACL, Dublin, Ireland, March 1995
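
As a rough illustration of the kind of alignment procedure this abstract describes, the sketch below runs a dynamic program over two lists of text blocks. It combines only two of the four cues named in the abstract (a byte-length ratio and hard matching of numbers) and leaves out string comparison and n-gram matching. The cost formula, the constants `NUMBER_BONUS` and `SKIP_COST`, and the sample blocks are assumptions chosen for the example, not values from the paper.

```python
import math
import re

NUMBER_BONUS = 1.0  # assumed reward for each number shared by two blocks
SKIP_COST = 1.5     # assumed penalty for leaving a block unaligned


def block_cost(src, tgt):
    """Lower is better: byte-length mismatch minus a bonus for shared numbers."""
    src_len = len(src.encode("utf-8")) + 1
    tgt_len = len(tgt.encode("utf-8")) + 1
    length_penalty = abs(math.log(src_len / tgt_len))
    shared = set(re.findall(r"\d+", src)) & set(re.findall(r"\d+", tgt))
    return length_penalty - NUMBER_BONUS * len(shared)


def align(src_blocks, tgt_blocks):
    """Return the 1-to-1 block pairs on the cheapest monotonic path."""
    n, m = len(src_blocks), len(tgt_blocks)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i and j:  # pair source block i-1 with target block j-1
                pair = block_cost(src_blocks[i - 1], tgt_blocks[j - 1])
                steps.append((cost[i - 1][j - 1] + pair, (i - 1, j - 1)))
            if i:        # leave source block i-1 unaligned
                steps.append((cost[i - 1][j] + SKIP_COST, (i - 1, j)))
            if j:        # leave target block j-1 unaligned
                steps.append((cost[i][j - 1] + SKIP_COST, (i, j - 1)))
            for total, prev in steps:
                if total < cost[i][j]:
                    cost[i][j], back[i][j] = total, prev
    # Walk back from the end and keep only the diagonal (pairing) steps.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    return pairs[::-1]


english = ["Table 3 shows 120 cases in 1990.", "Page header", "Mortality fell."]
spanish = ["El cuadro 3 muestra 120 casos en 1990.", "La mortalidad disminuyó."]
print(align(english, spanish))
```

In this toy input the extraneous English block "Page header" has no Spanish counterpart, so the cheapest path skips it rather than forcing a pairing, which is the behaviour a pure length-based aligner tends to get wrong on noisy text.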

Query Translation Using Evolutionary Programming for Multi-lingual Information Retrieval

Mark W. Davis and Ted E. Dunning

Multi-lingual information retrieval (IR) systems apply queries in one language to a document collection in several different languages with the goal of retrieving only those documents relevant to the query. At first glance, deep linguistic analysis and translation of the query appear necessary before retrievals can be performed. IR systems are unique in natural language processing, however, because a pattern of term occurrences in a document generally suffices to determine the subject matter; word order is largely irrelevant. Translated queries are therefore primarily derived by a mapping from a word set in the query language to a word set in the language of the derived query. Large parallel text collections with sentence-level alignments can provide a baseline for evaluating the correctness of a query translation, but the determination of members of the query translation remains problematic. Constructing a query from machine-readable, bilingual dictionaries and assigning term weights by the evolutionary optimization of a population of potential weighting schemes presents a solution to the difficulties of generating translated queries. In this approach, the fitness of a tentative query translation is determined by differences in the rank statistics of two sets of recall results: those of the original query run against text in its own language, and those of the translated query run against text in its own language.

In Proceedings of the Fourth Annual Conference on Evolutionary Programming, San Diego, CA, March 1995
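
The weighting scheme outlined in this abstract can be sketched in simplified form. The code below assumes a small aligned corpus, scores documents by weighted term counts, and uses the negative sum of squared rank differences as a stand-in for the paper's rank statistic. Truncation selection, the mutation width, the function names and the toy data are all assumptions made for the example.

```python
import random


def scores(weights, docs):
    """Score each document (a token list) by the weighted count of query terms."""
    return [sum(w * doc.count(t) for t, w in weights.items()) for doc in docs]


def ranks(values):
    """Rank position of each document (0 = best); ties keep document order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result


def fitness(weights, target_docs, reference_ranks):
    """Negative sum of squared rank differences; 0 means identical rankings."""
    got = ranks(scores(weights, target_docs))
    return -sum((a - b) ** 2 for a, b in zip(got, reference_ranks))


def evolve(terms, target_docs, reference_ranks, generations=30, pop_size=10, seed=0):
    """Mutate term weights with Gaussian noise and keep the fittest candidates."""
    rng = random.Random(seed)
    population = [{t: 1.0 for t in terms}]  # start from uniform weights
    population += [{t: rng.random() for t in terms} for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Each parent yields one mutated child; weights are clipped at zero.
        children = [
            {t: max(0.0, w + rng.gauss(0, 0.3)) for t, w in parent.items()}
            for parent in population
        ]
        pool = population + children
        pool.sort(key=lambda w: fitness(w, target_docs, reference_ranks), reverse=True)
        population = pool[:pop_size]  # parents survive unless a child is fitter
    return population[0]


# Aligned toy corpus: document k in one list translates document k in the other.
english_docs = [d.split() for d in ["bank loan loan", "bank", "river", "loan"]]
spanish_docs = [d.split() for d in ["banco préstamo préstamo", "banco", "orilla río", "préstamo"]]

# Ranking produced by the original query on its own language side.
reference = ranks(scores({"bank": 1.0, "loan": 1.0}, english_docs))

# Dictionary lookup yields candidate Spanish terms; evolution sets their weights.
candidates = ["banco", "orilla", "préstamo"]
uniform = {t: 1.0 for t in candidates}
best = evolve(candidates, spanish_docs, reference)
print(fitness(uniform, spanish_docs, reference), fitness(best, spanish_docs, reference))
```

Because parents compete with their children and the uniform weighting is part of the initial population, the returned weighting is never worse than the uniform baseline under this fitness measure.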

Multi-lingual Information Retrieval

Ted E. Dunning and Mark W. Davis

We have designed a fully multi-lingual information retrieval system and tested crucial parts. This system can accept a query in one language and find documents in others. Furthermore, relevance feedback can be used in a fully multi-lingual fashion.

Our system is based on the availability of parallel and aligned texts. We use these texts to derive a linear approximation of the translation process, and then use this linear transformation to implement a conventional vector-based information retrieval system. We describe three possible techniques for deriving this translation matrix, one of which we have implemented and tested on a moderately sized training corpus. Our method appears to be very efficient in terms of the size of the necessary training corpus.

Since our solution for the translation matrix is incremental in nature, additional parallel texts can be used to augment the system at any time.

We also present estimates of how much the multi-lingual aspect of the system degrades performance relative to mono-lingual use.

Memoranda in Cognitive and Computer Science, Computing Research Laboratory, NMSU, 1993
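
The idea of a translation matrix derived from aligned texts can be illustrated with a small sketch. The abstract does not state which of the three techniques was implemented, so the incremental least-mean-squares update used here is an assumption, as are the function names, the learning rate and the toy vocabulary. The matrix `M` maps a source-language term-count vector to an approximate target-language one.

```python
def bag(tokens, vocab):
    """Term-count vector for a token list over a fixed vocabulary."""
    return [float(tokens.count(term)) for term in vocab]


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def update(M, s, t, rate=0.1):
    """One least-mean-squares step that moves M @ s toward t, in place."""
    predicted = matvec(M, s)
    for i, row in enumerate(M):
        error = t[i] - predicted[i]
        for j in range(len(row)):
            row[j] += rate * error * s[j]


def train(pairs, src_vocab, tgt_vocab, epochs=500, rate=0.1):
    """Fit M from aligned text pairs; further pairs can be fed to update() later."""
    M = [[0.0] * len(src_vocab) for _ in tgt_vocab]
    for _ in range(epochs):
        for src_text, tgt_text in pairs:
            s = bag(src_text.split(), src_vocab)
            t = bag(tgt_text.split(), tgt_vocab)
            update(M, s, t, rate)
    return M


def top_translation(M, term, src_vocab, tgt_vocab):
    """Target term with the largest weight in the column for a source term."""
    j = src_vocab.index(term)
    column = [row[j] for row in M]
    return tgt_vocab[column.index(max(column))]


src_vocab = ["house", "red", "dog"]
tgt_vocab = ["casa", "rojo", "perro"]
pairs = [("dog", "perro"), ("house", "casa"), ("red dog", "perro rojo")]

M = train(pairs, src_vocab, tgt_vocab)
for term in src_vocab:
    print(term, "->", top_translation(M, term, src_vocab, tgt_vocab))
```

Note that "red" never appears alone in the training pairs; its mapping is recovered from the pair "red dog" / "perro rojo" once "dog" has been accounted for. A translated query vector, `matvec(M, bag(query_tokens, src_vocab))`, can then be compared with target-language document vectors in the usual vector-space way.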