Temple project

A Framework for Fast Deployment of
Multilingual Machine Translation Systems



The Temple project has developed an open multilingual architecture and software support for rapid development of extensible Machine Translation functionalities. The targeted languages are those for which Natural Language Processing and human resources are scarce or difficult to obtain. The goal is to support rapid development of Machine Translation functionalities in a very short time with limited resources.

New Web sites in foreign languages are appearing every day all over the globe, and language barriers threaten to atomize the World Wide Web into closed linguistic communities.

Currently, the Temple prototype provides automatic raw English translations of documents in several languages (Spanish, Arabic, Japanese and Russian). Translations are produced using a Glossary-Based Machine Translation engine. Analysts and translators can edit the raw translation using a multilingual editor. Source documents and their translations are managed using the Tipster Document Manager developed at CRL, which is also used as the architectural basis for integrating the system's components. One important outcome of the Temple project is the development of an architecture to support reuse of NLP tools and resources.

The major components of the Temple prototype include:

A brief history of the project

Glossary-Based Machine-Translation (GBMT) was first developed at CMU as part of the Pangloss project [Nirenburg 95; Cohen et al. 93; Nirenburg et al. 93; Frederking et al. 93], and a sizeable Spanish-English GBMT system was implemented. The Temple project has built upon this experience and extended the GBMT approach to other languages: Japanese, Arabic, and Russian. This experience with other languages has provided significant insights for the development of a versatile GBMT engine and for the use of off-the-shelf components for building a complete Machine-Translation System. Building a generic platform for integrating various Machine-Translation Systems in a single flexible user environment, built upon the Tipster document architecture [Grishman 95], has also been a valuable experience for developing generic Natural Language Processing support systems.

Planned as a two-year project, the Temple project actually started at the beginning of 1995 and ended in May 1996. The first part of the project was devoted to the acquisition of corpora, dictionaries and glossaries and to the implementation of basic capabilities using CRL's Tipster Document Manager. After some initial experiments, the final architecture and the GBMT engine were implemented while work on lexical acquisition was (and is) still progressing. The project succeeded in achieving its major goal: fast development of MT functionalities integrated in an analyst workstation. A new project, Corelli, was defined to consolidate the results and explore new avenues identified during the course of the Temple project.

The user perspective

The Temple Analyst's Workstation is incorporated into a Tipster document management architecture. It allows both translator/analysts and monolingual analysts to use machine translation to assess the relevance of a document, or to use the document's information in other information-processing tasks. Translators can also use the output as a rough draft from which to produce a translation, refining it with specific post-editing functions.

The Temple Analyst's Workstation design is original in that it combines the best features and eliminates the weaknesses of competing alternatives. On the one hand, like word-based glossers, it puts the user in control by allowing all core linguistic components used by the glossary-based engine to be accessed, modified and developed by the Analyst. On the other hand, like advanced MT systems, it uses reliable morphological processors and taggers, components which are relatively inexpensive, require little or no maintenance, and greatly enhance output quality.

The user (translator or analyst) can:

Although the translation provided by the system is only a phrase-for-phrase (or word-for-word) gloss of the original, the system is entirely under the control of the user who can modify any essential part of it, i.e. the dictionaries and glossaries. From a user's point of view, the system is predictable, responsive, affordable and easy to use and maintain.

The MT developer perspective

A Multilingual Architecture. The Temple architecture uses the multilingual text library developed at CRL to support multilingual text processing. This library, available for Unix systems, handles a large number of character codesets and provides multilingual string processing and character code conversion between them. It also provides a multilingual Motif text widget that can be embedded in higher-level applications (such as the Lexical Editor) and a simple multilingual text editor. This library proved to be a major asset for the project, since few comparable functionalities for the range of languages processed in the Temple project are available in the Unix environment.

Although full Unicode support was a goal of the project, this could not be achieved entirely. So far, only the Arabic morphological analyzer has built-in Unicode support (as well as other codesets). However, a full Unicode library should be available for use in the Corelli project.

Reuse of Machine-Readable Dictionaries. Bilingual dictionaries are processed versions of various Machine-Readable Dictionaries (MRD), for example the Collins Spanish-English dictionary, or of other MT dictionaries that have been restructured to conform to Temple's lexical format. Since morphological analyzers and dictionaries may come from different sources, they may have incompatible lexical representations, as happened with the Japanese-English dictionary and the Japanese morphological analyzer. In such cases, integration is achieved by mapping the dictionary, including for example part-of-speech information, to a standardized format, and by developing a filter that maps the morphological analyzer result to that structure.
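The integration step described above can be sketched as two mappings onto one standardized tagset. This is only a minimal illustration in Python, not Temple's implementation: the tag names, the two mapping tables and the `(stem, tag)` analyzer output format are all hypothetical.

```python
# Hypothetical tagsets: neither the dictionary codes nor the analyzer
# labels below are taken from Temple, Juman or any real dictionary.
DICT_POS_TO_STANDARD = {"n": "NOUN", "vt": "VERB", "vi": "VERB", "adj": "ADJ"}
ANALYZER_POS_TO_STANDARD = {"meishi": "NOUN", "doushi": "VERB", "keiyoushi": "ADJ"}


def normalize_dictionary_entry(entry):
    """Return a copy of a dictionary entry with its POS in the standard tagset."""
    normalized = dict(entry)
    normalized["pos"] = DICT_POS_TO_STANDARD.get(entry["pos"], "UNKNOWN")
    return normalized


def filter_analyzer_output(analysis):
    """Map one analyzer result (stem, tag) to the same standardized structure."""
    stem, tag = analysis
    return {"stem": stem, "pos": ANALYZER_POS_TO_STANDARD.get(tag, "UNKNOWN")}


def lookup(dictionary, analysis):
    """Find dictionary entries matching an analyzed word on stem and standard POS."""
    item = filter_analyzer_output(analysis)
    return [e for e in dictionary
            if e["stem"] == item["stem"] and e["pos"] == item["pos"]]


raw_dictionary = [
    {"stem": "hon", "pos": "n", "gloss": "book"},
    {"stem": "yomu", "pos": "vt", "gloss": "read"},
]
dictionary = [normalize_dictionary_entry(e) for e in raw_dictionary]
matches = lookup(dictionary, ("yomu", "doushi"))
```

Once both sides use the same tagset, lookup reduces to a plain comparison, and replacing either the dictionary or the analyzer only requires a new mapping table.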

Reuse of Natural Language Processing components. An important decision in the Temple project was to use available NLP components and resources whenever possible. This led to the definition of an open architecture that provides support for integrating external tools. Some Temple components have been developed as part of the project (such as the Arabic and Russian morphological analyzers) but have been integrated using the same methodology as other tools.

Essentially, the Temple architecture is built around a uniform internal canonical linguistic representation for all languages: all components read or map their results to this representation, which uses Tipster annotations. All NLP tools are encapsulated using Tcl wrappers for mapping the tool representation to the Temple representation.
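As a rough illustration of what such a wrapper does, the sketch below maps the output of an external tool to span-based annotations over the document text. It is written in Python rather than Tcl, and the `Annotation` class, the tab-separated tool output format and the attribute names are assumptions made for the example; they simplify the Tipster annotation model and do not reproduce Temple's actual wrappers.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A simplified Tipster-style annotation: a typed span with attributes."""
    type: str
    start: int  # character offset of the first character
    end: int    # character offset one past the last character
    attributes: dict = field(default_factory=dict)


def wrap_analyzer_output(text, tool_output):
    """Map tool-specific lines of the form 'surface<TAB>lemma<TAB>pos'
    to annotations over `text`, searching left to right."""
    annotations = []
    cursor = 0
    for line in tool_output.splitlines():
        surface, lemma, pos = line.split("\t")
        start = text.find(surface, cursor)
        if start < 0:
            continue  # token not found in the document; skip it
        end = start + len(surface)
        annotations.append(
            Annotation("token", start, end, {"lemma": lemma, "pos": pos})
        )
        cursor = end
    return annotations


text = "los libros rojos"
tool_output = "los\tel\tDET\nlibros\tlibro\tNOUN\nrojos\trojo\tADJ"
annotations = wrap_analyzer_output(text, tool_output)
```

Because the annotations refer to the text by offsets and never modify it, several tools can annotate the same document independently, and downstream components only need to understand the shared representation.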

Morphological information is transferred from the source to the target language using morphological transfer tables that map categories and features from a source lexical item to the equivalent English lexical item. The GBMT engine itself is fully generic and is parameterized by a bilingual glossary and a morphological transfer table.
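The parameterization described above can be pictured with a small sketch: the engine code contains no language-specific knowledge, and both the glossary and the morphological transfer table are passed in as data. Everything here is hypothetical, including the toy Spanish-English entries, the feature names, the longest-match lookup strategy, and the `generate_english` function, which is only a stand-in for a real English morphological generator.

```python
# Hypothetical data: a tiny glossary keyed by sequences of source lemmas,
# and a transfer table mapping source (category, feature) pairs to English ones.
GLOSSARY = {
    ("casa", "blanco"): "white house",
    ("casa",): "house",
    ("libro",): "book",
}
TRANSFER_TABLE = {
    ("NOUN", "plural"): ("N", "PL"),
    ("NOUN", "singular"): ("N", "SG"),
}


def generate_english(citation_form, features):
    """Toy stand-in for an English morphological generator."""
    if features == ("N", "PL"):
        return citation_form + "s"
    return citation_form


def translate(tokens, glossary, transfer_table, max_len=3):
    """Gloss a list of (lemma, category, feature) tokens phrase by phrase.

    The longest glossary match starting at each position wins (an assumed
    strategy); words with no entry are passed through unchanged.
    """
    output = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lemma for lemma, _, _ in tokens[i:i + n])
            if key in glossary:
                # Transfer the features of the first word of the phrase.
                _, category, feature = tokens[i]
                english_features = transfer_table.get((category, feature))
                output.append(generate_english(glossary[key], english_features))
                i += n
                break
        else:
            output.append(tokens[i][0])  # no entry: keep the source lemma
            i += 1
    return " ".join(output)


tokens = [
    ("libro", "NOUN", "plural"),
    ("de", "PREP", None),
    ("casa", "NOUN", "plural"),
    ("blanco", "ADJ", "plural"),
]
result = translate(tokens, GLOSSARY, TRANSFER_TABLE)
```

Supporting another source language in this sketch means supplying a different glossary and transfer table; the `translate` function itself stays unchanged.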

Semi-automatic development of glossaries. Small glossaries (between 2,000 and 20,000 entries depending on the language) have been developed for each language. The acquisition process is as follows:

The development cost of such glossaries remains relatively low since the structure and the information encoded in a glossary entry are very simple.

Languages

The target language: English
English is the single target language of the Temple prototype and only one component deals primarily with English: the morphological generator, which is the one used in the Penman system. This generator accepts as input a citation form and a set of morphological features to be realized. This set of morphological features is produced by the GBMT engine using morphological transfer tables.
Arabic
All Arabic components have been developed within the Temple project: the morphological analyzer, the Arabic-English dictionary (45,000 stems and almost as many irregular entries for a total of 72,000 entries) and the Arabic-English glossary (11,000 entries).
Japanese
The Japanese morphological analyzer is Juman, developed at Kyoto University. The Japanese-English dictionary is a processed and augmented version of a Kenkyusha CD-ROM dictionary (40,000 entries). The Japanese-English glossary has been developed at CRL (5,000 entries).
Russian
The Russian morphological analyzer has been developed at CRL. The size of both the dictionary and the glossary is so far very modest, 6,700 and 1,600 entries respectively. The dictionary and the glossary have been developed at CRL.

Spanish
The Spanish morphological analyzer is the SPOST tagger developed at CRL in the Pangloss project. The dictionary is a processed version of the Collins Spanish-English dictionary (with approximately 60,000 entries). A subset of the glossary developed in the Pangloss project has been converted to the Temple format (20,000 entries).

Team

Project manager
Staff
Consultants
Students

This list would not be complete without mentioning Nigel Sharples, who implemented a new version of the Tipster Document Manager and GUI tools and also helped with the interface for the Tipster Document Manager, and Mark Leisher, who developed the Multilingual Multi-attributed Text Library and helped on many parts of the project dealing with multilingual character set processing.

Publications

  1. Rémi Zajac and Mark Casper. "The Temple Web Translator". 1997 AAAI Spring Symposium on Natural Language Processing for the World Wide Web, March 24-26, 1997, Stanford University.
  2. Michelle Vanni and Rémi Zajac. "Glossary-Based MT Engines in a Multilingual Analyst's Workstation for Information Processing". To appear in Machine Translation, Special Issue on New Tools for Human Translators.
  3. Rémi Zajac. 1996. "Towards a Multilingual Analyst's Workstation: Temple". In Expanding MT Horizons - Proceedings of the 2nd Conference of the Association for Machine Translation in the Americas, AMTA-96. 2-5 October 1996, Montréal, Canada. pp. 280-284.
  4. Bill Caid, Jamie Callan, Jim Conley, Harold Robin, Jim Cowie, Kathy DiBella, Ted Dunning, Joe Dzikiewicz, Louise Guthrie, Jerry Hobbs, Clint Hyde, Mark Ilgen, Paul Jacobs, Matt Mettler, Bill Ogden, Peggy Otsubo, Bev Schwartz, Ira Sider, Ralph Weischedel and Rémi Zajac. "Tipster Text Phase II Architecture Design, Version 2.1". Proceedings of the Tipster-II 24-month Workshop, Tysons Corner, VA, 7-10 May, 1996. pp. 249-305.
  5. Rémi Zajac. "A Multilingual Translator's Workstation for Information Access". Proceedings of the International Conference on Natural Language Processing and Industrial Applications, NLP+IA 96, Moncton, New-Brunswick, Canada, June 4-6, 1996.
  6. Michelle Vanni and Rémi Zajac. "The Temple Translator's Workstation Project". Proceedings of the Tipster-II 24-month Workshop, Tysons Corner, VA, 7-10 May, 1996.

References

  1. [Church & Hovy 93] Church, K. and Eduard Hovy. "Good Applications for Crummy Machine Translation". Machine Translation, Vol. 8, No. 4, 1993, pp. 239-258.
  2. [Callan et al.] Callan J.P., Croft W.B., and Harding S.M. "The INQUERY Retrieval System". Proceedings of the 3rd International Conference on Database and Expert Systems Applications.
  3. [Cohen et al. 93] Cohen, A., P. Cousseau, R. Frederking, D. Grannes, S. Khanna, C. McNeilly, S. Nirenburg, P. Shell and D. Waeltermann. "Translator's Workstation User Document". Center for Machine Translation, Carnegie Mellon University, 1993.
  4. [Davis et al. 95] Mark Davis, Ted Dunning, Bill Ogden. "String Matching Strategies and N-Gram Comparisons". Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, 27-31 March 1995, University College Dublin, Belfield, Dublin, Ireland.
  5. [Frederking et al. 93] Frederking, R., D. Grannes, P. Cousseau, and S. Nirenburg. "An MAT Tool and Its Effectiveness". Proceedings of the DARPA Human Language Technology Workshop, Princeton, NJ, 1993.
  6. [Grishman 95] Grishman, Ralph, editor. Tipster Phase II Architecture Design Document Version 1.52, July 1995. (HTTP://cs.nyu.edu/tipster)
  7. [Guthrie et al. 93a] Guthrie, Louise, Guthrie, Joe, Wilks, Yorick, Cowie, Jim, Farwell, David, Slator, Brian, and Bruce, Rebecca. "A research program on machine-tractable dictionaries and their application to text analysis". CRL Technical Report MCCS-92-249. 1993.
  8. [Guthrie et al. 93b] Guthrie, Louise, Rauls, Venus, Luo, Tao, Bruce, Rebecca. "LEXI-CAD/CAM, A Tool for Lexicon Builders". CRL Technical Report MCCS-93-259. 1993.
  9. [Johnson & Whitelock 87] Johnson, R.L. and P. Whitelock. "Machine Translation as an Expert Task". In S. Nirenburg, (ed.), Theoretical and Methodological Issues in Machine Translation, Cambridge, Cambridge University Press, 1987, pp. 136-144.
  10. [Kay 80] Kay, M. "The Proper Place of Men and Machines in Machine Translation". Xerox PARC Technical Report CSL-80-11. October 1980.
  11. [Macklovitch 89] Macklovitch, Eliot. "An Off-the-Shelf Workstation for Translators". Proc. of the 30th American Translators Conference, Washington D.C. 1989.
  12. [Matsumoto et al.] Yuji Matsumoto, Sadao Kurohashi, Yutaka Nyoki, Hitoshi Shinho, and Makoto Nagao. "User's Guide for the Juman System, a User-Extensible Morphological Analyzer for Japanese. Version 0.5", Kyoto University. (in Japanese)
  13. [Nagao & Mori 94] Makoto Nagao, Shinsuke Mori. "A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese". Proceedings of the 15th International Conference on Computational Linguistics, Coling'94, Kyoto, Japan, August 5-9, 1994. pp. 611-615.
  14. [Nirenburg et al. 93] Nirenburg, S., P. Shell, A. Cohen, P. Cousseau, D. Grannes, C. McNeilly. "Multi-purpose Development and Operations Environments for Natural Language Applications". Proc. of the 3rd Conference on Applied Natural Language Processing (ANLP-93), Trento, Italy.
  15. [Nirenburg 95] Nirenburg, Sergei, editor. "The PANGLOSS Mark III Machine-Translation System". CMU-CMT-95-145. A Joint Technical Report by NMSU CRL, USC ISI and CMU CMT. April 1995.
  16. [Penman 88] The Penman Primer, User Guide, and Reference Manual. 1988. Unpublished USC/ISI documentation.
  17. [Smadja 93] Smadja, Frank. 1993. "Retrieving Collocations from Text: Xtract". Computational Linguistics, Vol. 19, No. 1, pp. 143-175.
  18. [Stein et al. 93] Stein, Gees C., Lin, Fang, Bruce, Rebecca, Weng, Fuliang, and Guthrie, Louise. "The Development of an Application Independent Lexicon: LexBase". CRL Technical Report MCCS-92-247. 1993.
  19. [Tipster 94] Proceedings of the TIPSTER Text Program (Phase I). San Francisco, CA: Morgan Kaufmann, 1994.
  20. [Unicode 91] The Unicode Consortium. The Unicode Standard, Worldwide Character Encoding. Addison-Wesley Publishing Company, 1991.