Temple project

A Framework for Fast Deployment of
Multilingual Machine Translation Systems
New Web sites in foreign languages appear every day all over the globe, and
language barriers threaten to split the World Wide Web into closed linguistic
communities. The Temple project has developed an open multilingual architecture and
software support for rapid development of extensible Machine Translation
functionalities. The targeted languages are those for which Natural Language Processing
and human resources are scarce or difficult to obtain. The goal is to deliver Machine
Translation functionalities in a very short time with limited resources.
Currently, the Temple prototype provides automatic raw English
translations of documents in four languages (Spanish,
Arabic, Japanese and Russian). Translations are produced using a
Glossary-Based Machine Translation engine. Analysts and translators
can edit the raw translation using a multilingual editor. Source
documents and their translations are managed using the Tipster
Document Manager developed at CRL, which also serves as the
architectural basis for integrating the system's components. One
important outcome of the Temple project is the development of an
architecture to support reuse of NLP tools and resources:
- Tools that are acquired from an external source, such as morphological analyzers,
generators, or taggers, can be integrated in the system with a minimum of programming
effort.
- Heterogeneous linguistic resources are parsed and mapped to a common multilingual
representation.
The major components of the Temple prototype include:
- A Glossary-Based Machine Translation (GBMT) engine which provides an automatic translation for each language pair;
- Morphological analyzers, bilingual dictionaries, and bilingual glossaries for Spanish,
Arabic, Japanese and Russian, and an English morphological generator;
- A multilingual document editor (the Tipster Editor for Documents developed at CRL
under the Norm project) used to browse documents and their translations;
- A multilingual dictionary and glossary editor and utilities to parse and load flat
dictionary (Machine-Readable Dictionaries) and glossary files into the system's lexical
database;
- Corpus-based utilities to automate the acquisition of bilingual glossaries.
A brief history of the project
Glossary-Based Machine-Translation (GBMT) was first developed at CMU as part of
the Pangloss project [Nirenburg 95; Cohen et al., 93; Nirenburg et al., 93; Frederking et
al., 93], and a sizeable Spanish-English GBMT system was implemented. The Temple project
has built upon this experience and extended the GBMT approach to other languages:
Japanese, Arabic, and Russian. This experience with other languages has provided
significant insights for the development of a versatile GBMT engine and for the use of
off-the-shelf components for building a complete Machine-Translation System. Building a
generic platform for integrating various Machine-Translation Systems in a single flexible
user environment, built upon the Tipster document architecture [Grishman 95], has also
been a valuable experience for developing generic Natural Language Processing support
systems.
Planned as a two-year project, the Temple project started at the beginning
of 1995 and ended in May 1996. The first part of the project was devoted to the
acquisition of corpora, dictionaries and glossaries and to the implementation of basic
capabilities using CRL's Tipster Document Manager. After some initial experiments, the
final architecture and the GBMT engine were implemented while work on lexical acquisition
was (and is) still progressing. The project succeeded in achieving its major goal: fast
development of MT functionalities integrated in an analyst workstation. A new project,
Corelli, was defined to consolidate the results and explore new avenues that were
identified during the course of the Temple project.
The user perspective
The Temple Analyst's Workstation is built into a Tipster document management
architecture. It lets both translator/analysts and monolingual analysts use the
machine-translation function to assess the relevance of a translated document or
to use its information in other information-processing tasks. Translators can also
use the output as a rough draft for a translation, which they then refine with
specific post-editing functions.
The Temple Analyst's Workstation design is original in that it combines the strengths
of two competing alternatives while avoiding their weaknesses. On the one hand, like
word-based glossers, it puts the user in control by allowing all core linguistic
components used by the glossary-based engine to be accessed, modified and developed by the
Analyst. On the other hand, like advanced MT systems, it uses reliable morphological
processors and taggers, components which are relatively inexpensive, require little or no
maintenance, and greatly enhance output quality.
The user (translator or analyst) can:
- Browse a collection of documents managed by a Tipster Document Manager using a
Collection/Document browser,
- View and edit foreign language documents or their English translations using the
multilingual Tipster Editor for Documents,
- Translate foreign documents using the generic translation function,
- Browse and edit lexical resources (bilingual dictionaries and glossaries),
- Get help on the system using context-sensitive help.
Although the translation provided by the system is only a phrase-for-phrase (or
word-for-word) gloss of the original, the system is entirely under the control of the user
who can modify any essential part of it, i.e. the dictionaries and glossaries. From a
user's point of view, the system is predictable, responsive, affordable and easy to use
and maintain.
The MT developer perspective
A Multilingual Architecture.
The Temple architecture uses the multilingual text library developed
at CRL to support multilingual text processing. This library,
available for Unix systems, handles a large number of character
codesets and provides multilingual string processing functionalities
and character code conversion between them. Also supported are a
multilingual Motif text widget that can be embedded in higher-level
applications (such as the Lexical Editor) and a simple multilingual
text editor. This library proved to be a major asset for the project
since few comparable functionalities for the range of languages
processed in the Temple project are available in the Unix environment.
Although full Unicode support was a goal of the project, this could not be achieved
entirely. So far, only the Arabic morphological analyzer has built-in Unicode support (as
well as other codesets). However, a full Unicode library should be available for use in
the Corelli project.
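Character code conversion of the kind described above can be illustrated with a minimal sketch. It uses Python's standard codecs with Unicode as the pivot representation; it is not the CRL library, whose interface is not described here. The function name `convert_codeset` and the choice of KOI8-R and ISO 8859-5 as example codesets are assumptions made for illustration only.

```python
# Illustrative sketch only: Python's built-in codecs stand in for the
# CRL multilingual text library, whose API is not documented here.

def convert_codeset(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one codeset to another, using Unicode as the pivot."""
    text = data.decode(source)   # source codeset -> Unicode string
    return text.encode(target)   # Unicode string -> target codeset


if __name__ == "__main__":
    russian = "перевод"                      # "translation"
    koi8 = russian.encode("koi8_r")          # one byte per Cyrillic letter
    iso = convert_codeset(koi8, "koi8_r", "iso8859_5")
    # The byte values differ between the two codesets, but the text is the same.
    print(iso.decode("iso8859_5") == russian)
```

Going through a single pivot means that each new codeset needs only one pair of conversions rather than one per existing codeset.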
Reuse of Machine-Readable Dictionaries.
Bilingual dictionaries are processed versions of various Machine-Readable Dictionaries
(MRD), for example the Collins Spanish-English dictionary, or of other MT dictionaries
that have been restructured to conform to Temple's lexical format. Since morphological
analyzers and dictionaries may come from different sources, they may have incompatible
lexical representations, as happened with the Japanese-English dictionary and the
Japanese morphological analyzer. In such cases, integration is achieved by mapping the
dictionary, including for example part-of-speech information, to a standardized format,
and by developing a filter that maps the morphological analyzer result to that structure.
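The filtering step described above can be sketched as follows. This is a minimal illustration, not Temple's actual code: the tag names, the `Analysis` record, and the dictionary layout are all hypothetical, and the romanized Japanese tags are not meant to reproduce the real output format of any analyzer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical mapping from an analyzer's native part-of-speech tags
# to the standardized tags used by the dictionary.
ANALYZER_TO_STANDARD: Dict[str, str] = {
    "meishi": "noun",
    "doushi": "verb",
    "keiyoushi": "adjective",
}


@dataclass(frozen=True)
class Analysis:
    """Standardized result of morphological analysis for one word."""
    surface: str    # form found in the text
    citation: str   # dictionary (citation) form
    pos: str        # standardized part of speech


def normalize(raw: Dict[str, str]) -> Analysis:
    """Filter: map one raw analyzer record to the standardized structure."""
    pos = ANALYZER_TO_STANDARD.get(raw["tag"], "unknown")
    return Analysis(raw["surface"], raw["base"], pos)


def lookup(dictionary: Dict[Tuple[str, str], List[str]],
           analysis: Analysis) -> List[str]:
    """Look up translations by (citation form, standardized part of speech)."""
    return dictionary.get((analysis.citation, analysis.pos), [])


if __name__ == "__main__":
    dictionary = {("taberu", "verb"): ["eat"]}
    raw = {"surface": "tabeta", "base": "taberu", "tag": "doushi"}
    print(lookup(dictionary, normalize(raw)))   # ['eat']
```

Because both the dictionary and the analyzer output are mapped to the same tag set, either one can be replaced without touching the other.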
Reuse of Natural Language Processing components.
An important decision in the Temple project was to use available NLP components and
resources whenever possible. This led to the definition of an open architecture that
provides support for integrating external tools. Some Temple components, such as the
Arabic and Russian morphological analyzers, were developed as part of the project but
were integrated using the same methodology as other tools.
Essentially, the Temple architecture is built around a uniform internal canonical
linguistic representation for all languages: all components read or map their results to
this representation, which uses Tipster annotations. All NLP tools are encapsulated using
Tcl wrappers for mapping the tool representation to the Temple representation.
Morphological information is transferred from the source to the target language using
morphological transfer tables that map categories and features from a source lexical item
to the equivalent English lexical item. The GBMT engine itself is fully generic and is
parameterized by a bilingual glossary and a morphological transfer table.
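To make the parameterization concrete, here is a minimal sketch of a glossing loop driven by a glossary, a dictionary and a morphological transfer table. The matching strategy (greedy longest match on citation forms), the data layouts, the feature names, and the toy English generator are all assumptions for illustration; the text above does not specify how the Temple engine matches glossary entries.

```python
from typing import Dict, List, Tuple

Features = Dict[str, str]
Token = Tuple[str, Features]   # (citation form, source morphological features)
TransferTable = Dict[Tuple[str, str], Tuple[str, str]]


def transfer_features(features: Features, table: TransferTable) -> Features:
    """Map source (feature, value) pairs to English ones; drop unmapped pairs."""
    out: Features = {}
    for pair in features.items():
        if pair in table:
            name, value = table[pair]
            out[name] = value
    return out


def generate(citation: str, features: Features) -> str:
    """Toy English generator: handles regular plurals only."""
    if features.get("number") == "plural":
        return citation + "s"
    return citation


def gloss(tokens: List[Token],
          glossary: Dict[Tuple[str, ...], str],
          dictionary: Dict[str, str],
          table: TransferTable,
          max_len: int = 4) -> str:
    """Phrase-for-phrase gloss, falling back to word-for-word translation."""
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest glossary phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(citation for citation, _ in tokens[i:i + n])
            if key in glossary:
                output.append(glossary[key])
                i += n
                break
        else:
            # No glossary phrase matched: translate a single word.
            citation, features = tokens[i]
            english = dictionary.get(citation, citation)  # unknown words pass through
            output.append(generate(english, transfer_features(features, table)))
            i += 1
    return " ".join(output)


if __name__ == "__main__":
    glossary = {("casa", "blanco"): "White House"}
    dictionary = {"el": "the", "comprar": "buy", "libro": "book"}
    table = {("num", "pl"): ("number", "plural")}
    tokens = [("el", {}), ("casa", {"num": "sg"}), ("blanco", {"num": "sg"}),
              ("comprar", {}), ("libro", {"num": "pl"})]
    print(gloss(tokens, glossary, dictionary, table))   # the White House buy books
```

Nothing in `gloss` is specific to Spanish: swapping in another glossary, dictionary and transfer table is enough to target another source language, which is the sense in which such an engine is generic.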
Semi-automatic development of glossaries.
Small glossaries (between 2,000 and 20,000 entries depending on the language) have been
developed for each language. The acquisition process is as follows:
- An n-gram extraction program is used to collect recurrent word patterns in a given corpus;
- This set of patterns is loaded in the Lexical Database as partial glossary entries;
- The translation of each entry is added manually.
The development cost of such glossaries remains relatively low since the structure and
the information encoded in a glossary entry are very simple.
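The first two acquisition steps can be sketched as follows. This is an illustrative approximation only: whitespace tokenization, the frequency threshold, and the entry layout (`source`, `translation`, `count`) are assumptions, not a description of the Temple extraction program or of its Lexical Database.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

NGram = Tuple[str, ...]


def recurrent_ngrams(sentences: Iterable[str],
                     min_n: int = 2,
                     max_n: int = 3,
                     min_count: int = 2) -> Dict[NGram, int]:
    """Count word n-grams and keep those seen at least min_count times."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sentence.lower().split()   # naive whitespace tokenization
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def partial_entries(ngrams: Dict[NGram, int]) -> List[dict]:
    """Turn n-grams into partial glossary entries with no translation yet."""
    ordered = sorted(ngrams.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"source": " ".join(gram), "translation": None, "count": count}
            for gram, count in ordered]


if __name__ == "__main__":
    corpus = [
        "el banco central anunció",
        "el banco central subió las tasas",
    ]
    for entry in partial_entries(recurrent_ngrams(corpus)):
        print(entry)
```

The `translation` field is left empty on purpose: it corresponds to the third step, where a person adds the translation of each entry by hand.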
Languages
- English (the target language): English is the single target language of the Temple
prototype and only one component deals primarily with English: the morphological
generator, which is the one used in the Penman system. This generator accepts as input a
citation form and a set of morphological features to be realized. This set of
morphological features is produced by the GBMT engine using morphological transfer tables.
- Arabic: All Arabic components have been developed within the Temple project: the
morphological analyzer, the Arabic-English dictionary (45,000 stems and almost as many
irregular entries for a total of 72,000 entries) and the Arabic-English glossary
(11,000 entries).
- Japanese: The Japanese morphological analyzer is Juman, developed at Kyoto University.
The Japanese-English dictionary is a processed and augmented version of a Kenkyusha CD-ROM
dictionary (40,000 entries). The Japanese-English glossary has been developed at CRL
(5,000 entries).
- Russian: The Russian morphological analyzer, the dictionary and the glossary have been
developed at CRL. The dictionary and the glossary are so far very modest in size, with
6,700 and 1,600 entries respectively.
- Spanish: The Spanish morphological analyzer is the SPOST tagger developed at CRL in the
Pangloss project. The dictionary is a processed version of the Collins Spanish-English
dictionary (with approximately 60,000 entries). A subset of the glossary developed in the
Pangloss project has been converted to the Temple format (20,000 entries).
Team
Project manager
Staff
- Mark Casper was responsible for the
GBMT engine and for overseeing all programming tasks of the project.
- Svetlana Sheremetyeva developed the
Russian morphological analyzer.
- Susumu "Duke" Yasuda was responsible for the Japanese language part.
Consultants
- Tim Buckwalter designed the Arabic morphological analyzer and
provided the Arabic-English dictionary. A native of Argentina, Tim
Buckwalter taught English for seven years in the Middle East
(Bethlehem, Nablus, and Cairo). He has an M.A. in Arabic from Indiana
University (1982). He has been involved in Arabic MT since 1989.
Students
- Vanishree Mahesh implemented an
initial version of the GBMT engine and the Lexical Editor.
- Ahmed Malki developed the Arabic-English glossary.
- Abdussalam Mohammed implemented the Arabic morphological analyzer.
- Nick Ourusoff implemented the
Russian morphological analyzer.
- Marek Telgarsky implemented the initial version of the Lexical Editor.
- Daniel Wood implemented sorting
and comparison functions for the Unicode library.
This list would not be complete without mentioning Nigel Sharples,
who implemented a new version of the Tipster Document Manager and GUI
tools and also helped with the interface for the Tipster Document
Manager, and Mark Leisher, who developed the Multilingual
Multi-attributed Text Library and helped on many parts of the project
dealing with multilingual character set processing.
Publications
- Rémi Zajac and Mark Casper. "The Temple Web Translator". 1997
AAAI Spring Symposium on Natural Language Processing for the World
Wide Web, March 24-26, 1997, Stanford University.
- Michelle Vanni and Rémi Zajac. "Glossary-Based MT Engines in a Multilingual
Analyst's Workstation for Information Processing". To appear in Machine
Translation, Special Issue on New Tools for Human Translators.
- Rémi Zajac. 1996. "Towards a Multilingual Analyst's
Workstation: Temple". In Expanding MT Horizons - Proceedings of the
2nd Conference of the Association for Machine Translation in the
Americas, AMTA-96. 2-5 October 1996, Montréal,
Canada. pp. 280-284.
- Bill Caid, Jamie Callan, Jim Conley, Harold Robin, Jim Cowie, Kathy
DiBella, Ted Dunning, Joe Dzikiewicz, Louise Guthrie, Jerry Hobbs,
Clint Hyde, Mark Ilgen, Paul Jacobs, Matt Mettler, Bill Ogden, Peggy
Otsubo, Bev Schwartz, Ira Sider, Ralph Weischedel and Rémi
Zajac. "Tipster Text Phase II Architecture Design,
Version 2.1". Proceedings of the Tipster-II 24-month Workshop, Tysons
Corner, VA, 7-10 May, 1996. pp. 249-305.
- Rémi Zajac. "A Multilingual Translator's Workstation for Information
Access". Proceedings of the International Conference on Natural Language Processing
and Industrial Applications, NLP+IA 96, Moncton, New Brunswick, Canada, June 4-6,
1996.
- Michelle Vanni and Rémi Zajac. "The Temple Translator's Workstation
Project". Proceedings of the Tipster-II 24-month Workshop, Tysons Corner, VA, 7-10 May,
1996.
References
- [Church & Hovy 93] Church, K. and Eduard Hovy. "Good Applications for Crummy Machine
Translation". Machine Translation, Vol. 8, No. 4, 1993, pp. 239-258.
- [Callan et al.] Callan J.P., Croft W.B., and Harding S.M. "The INQUERY Retrieval
System". Proceedings of the 3rd International Conference on Database and Expert Systems
Applications.
- [Cohen et al. 93] Cohen, A., P. Cousseau, R. Frederking, D. Grannes, S. Khanna,
C. McNeilly, S. Nirenburg, P. Shell and D. Waeltermann. "Translator's Workstation User
Document". Center for Machine Translation, Carnegie Mellon University, 1993.
- [Davis et al. 95] Mark Davis, Ted Dunning, Bill Ogden. "String Matching Strategies and
N-Gram Comparisons". Proceedings of the 7th Conference of the European Chapter of the
Association for Computational Linguistics, 27-31 March 1995, University College Dublin,
Belfield, Dublin, Ireland.
- [Frederking et al. 93] Frederking, R., D. Grannes, P. Cousseau, and S. Nirenburg. "An MAT
Tool and Its Effectiveness". Proceedings of the DARPA Human Language Technology Workshop,
Princeton, NJ, 1993.
- [Grishman 95] Grishman, Ralph, editor. Tipster Phase II Architecture Design Document
Version 1.52, July 1995. (HTTP://cs.nyu.edu/tipster)
- [Guthrie et al. 93a] Guthrie, Louise, Guthrie, Joe, Wilks, Yorick, Cowie, Jim, Farwell,
David, Slator, Brian, and Bruce, Rebecca. "A research program on machine-tractable
dictionaries and their application to text analysis". CRL Technical Report
MCCS-92-249. 1993.
- [Guthrie et al. 93b] Guthrie, Louise, Rauls, Venus, Luo, Tao, Bruce,
Rebecca. "LEXI-CAD/CAM, A Tool for Lexicon Builders". CRL Technical Report
MCCS-93-259. 1993.
- [Johnson & Whitelock 87] Johnson, R.L. and P. Whitelock. "Machine Translation as an Expert
Task". In S. Nirenburg (ed.), Theoretical and Methodological Issues in Machine
Translation, Cambridge, Cambridge University Press, 1987, pp. 136-144.
- [Kay 80] Kay, M. "The Proper Place of Men and Machines in Machine Translation". Xerox PARC
Technical Report CSL-80-11. October 1980.
- [Macklovitch 89] Macklovitch, Eliot. "An Off-the-Shelf Workstation for
Translators". Proc. of the 30th American Translators Conference, Washington D.C. 1989.
- [Matsumoto et al.] Yuji Matsumoto, Sadao Kurohashi, Yutaka Nyoki, Hitoshi Shinho, and
Makoto Nagao. "User's Guide for the Juman System, a User-Extensible Morphological
Analyzer for Japanese. Version 0.5". Kyoto University. (in Japanese)
- [Nagao & Mori 94] Makoto Nagao, Shinsuke Mori. "A New Method of N-gram Statistics for
Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of
Japanese". Proceedings of the 15th International Conference on Computational Linguistics,
Coling'94, Kyoto, Japan, August 5-9, 1994. pp. 611-615.
- [Nirenburg et al. 93] Nirenburg, S., P. Shell, A. Cohen, P. Cousseau, D. Grannes,
C. McNeilly. "Multi-purpose Development and Operations Environments for Natural Language
Applications". Proc. of the 3rd Conference on Applied Natural Language Processing
(ANLP-93), Trento, Italy.
- [Nirenburg 95] Nirenburg, Sergei, editor. "The PANGLOSS Mark III Machine-Translation
System". CMU-CMT-95-145. A Joint Technical Report by NMSU CRL, USC ISI and CMU CMT. April
1995.
- [Penman 88] The Penman Primer, User Guide, and Reference Manual. 1988. Unpublished USC/ISI
documentation.
- [Smadja 93] Smadja, Frank. 1993. "Retrieving Collocations from Text:
Xtract". Computational Linguistics, Vol. 19, No. 1, pp. 143-175.
- [Stein et al. 93] Stein, Gees C., Lin, Fang, Bruce, Rebecca, Weng, Fuliang, and Guthrie,
Louise. "The Development of an Application Independent Lexicon: LexBase". CRL Technical
Report MCCS-92-247. 1993.
- [Tipster 94] Proceedings of the TIPSTER Text Program (Phase I). San Francisco, CA: Morgan
Kaufmann, 1994.
- [Unicode 91] The Unicode Consortium. The Unicode Standard, Worldwide Character Encoding.
Addison-Wesley Publishing Company, 1991.