Recent activities at the Computing Research Laboratory (CRL) concentrate on methods for developing robust large-scale natural language processing (NLP) systems for machine-translation tasks, lexicon development, belief modeling, text analysis and workbench projects for linguists and translators. This work is funded through contracts and grants from organizations whose objectives require that solutions to these problems be approached from many perspectives.
What follows are brief synopses of current projects that have large-scale development support at CRL. The range is indicative of the breadth and depth of the Laboratory's interests and expertise. More detailed information on these projects can be obtained by following links to each project's home page.
Artwork addresses the machine translation of spoken dialogue. The project investigates approaches to robustness that exploit models of the task domain and of conversational interaction to generate relevant expectations against which the input can be interpreted.
This project concerns reasoning about mental states, with a focus on mental states that are described in natural language discourse. A special sub-focus is mental states that are described metaphorically in discourse. (Metaphorical description of mental states is common in both text and speech.) A reasoning system called ATT-Meta has been implemented, and performs both ordinary and metaphor-based reasoning about mental states. The system does not currently take natural language input. Rather, it is given logical expressions that are simplified representations of the contents of small, hypothetical discourse fragments.
Aside from metaphorical considerations, the central approach in ATT-Meta to reasoning about mental states is "simulative" reasoning. This is fully combined with non-simulative reasoning about mental states. All the reasoning is done within a general-purpose framework for uncertain reasoning, which has no particular orientation towards mental states or metaphor. The framework is rule-based. Uncertainty is expressed by means of qualitative certainty levels.
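As a rough illustration of rule-based reasoning with qualitative certainty levels, the following Python sketch forward-chains over a small fact base. It is not the ATT-Meta implementation: the level names, the "weakest link" combination policy, and the function name forward_chain are all assumptions made for this example.

```python
from enum import IntEnum


class Certainty(IntEnum):
    # Hypothetical qualitative levels, ordered from weakest to strongest.
    POSSIBLE = 1
    SUGGESTED = 2
    PRESUMED = 3
    CERTAIN = 4


def forward_chain(facts, rules):
    """Derive new propositions from facts using rules.

    facts: dict mapping a proposition (str) to its Certainty.
    rules: list of (premises, conclusion, rule_level) tuples.
    Assumed policy: a conclusion is only as certain as the weakest of
    the rule and its premises; the strongest derivation found is kept.
    """
    known = dict(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion, rule_level in rules:
            if all(p in known for p in premises):
                level = min([rule_level] + [known[p] for p in premises])
                if known.get(conclusion, 0) < level:
                    known[conclusion] = level
                    changed = True
    return known
```

Because levels can only rise and there are finitely many of them, the loop terminates. A numeric scheme (probabilities, for instance) could replace the enumeration without changing the control structure.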
This line of research encompasses various attempts to derive syntactic grammars automatically from text. We have focussed on statistical techniques that do not require supervised learning (hand-coded training texts). We work with languages with minimal on-line resources, primarily text corpora and an on-line dictionary.
The objective of the Corelli project is to extend the capabilities of the Pangloss and Temple Translator's Workstation (TWS) from English and Spanish to include Arabic, Russian and Japanese. In particular, Corelli expands available on-line tools such as dictionary, user glossary and corpus access to include a multilingual substrate for non-Latin character sets. Native character and script editors for Japanese and Arabic will be added. Both the constituent-oriented machine-aided translation capability and advanced post-editing tools will be extended to cover the new language pairs.
GATE is a joint project of CRL and the University of Sheffield. The project emphasizes building large-scale natural language processing (NLP) systems by reusing NLP components. GATE researchers work from the position that full-scale NLP systems are not merely scaled-up research systems, but represent a qualitatively different set of problems that require new tools and new ideas. The GATE solution is to reuse and integrate heterogeneous components in a new distributed architecture for natural language processing. Applications include multilingual information retrieval and extraction, machine translation, and speech.
This work involves developing probabilistic classifiers for two challenging and diverse NLP tasks using a common set of techniques. One classifier will be capable of disambiguating a large vocabulary of words with respect to a full set of sense distinctions from a published source, such as Longman's on-line dictionary. The second will perform a discourse processing task that involves segmentation, reference resolution, and belief: segmenting a text into blocks that express the beliefs and opinions of a single agent, and identifying noun phrases that refer to that agent. Both systems will be fully automatic.
Mikrokosmos is devoted to a study of the computational semantics of English, Spanish and Japanese. Methodologically, Mikrokosmos is based on the recognition that computational treatment of text requires the study of a wide variety of language, language use and world phenomena, and that a single all-encompassing theory of computational linguistics is not feasible, at least in the near future. The only realistic hope of building a comprehensive NLP application grounded on a sound theoretical basis is to devise a computational architecture that allows partial treatment of a variety of language phenomena, i.e., microtheories, to be integrated. Integration is carried out through a formalized and uniform knowledge representation and a flexible application system control architecture.
MINDS (Multi-lingual Interactive Document Summarization) is a project to develop analysts' tools that use multiple types of summarization techniques to produce both document and cross-document summaries. Four languages are supported: English, Spanish, Japanese, and Russian. MINDS will allow document linking, summarization, translation, and retrieval for all four languages.
New computer software technology can fail for a number of reasons. The new technology may not be delivered in a conveniently usable form, or it may not provide useful function for accomplishing the professional's task. CRL has been employing a usability-oriented empirical approach for applying natural language technology to the design of interface software called Cibola that supports professional translators. This design approach is an iterative, user-centered approach based on an analysis of the translation task derived from observational data (user protocols) collected from experienced translators. The interface provides a multi-windowed side-by-side presentation of source and target language texts with editing capabilities in both. Further, the prototype allows access to a number of on-line resources such as dictionaries and other databases. Information retrieval (IR) techniques have been implemented to allow the full indexing of texts, morphological processing, and contextual vector-based searching of this material. The user-centered empirical approach continues to guide the design in ways that best support the translation task.
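The vector-based searching mentioned above can be illustrated with a minimal Python sketch of vector-space retrieval. This is not Cibola's implementation: it assumes plain term-frequency vectors and cosine similarity, omits morphological processing, and the function names (tokenize, vectorize, cosine, search) are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphabetic tokens (no morphological analysis).
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    # Represent a text as a sparse term-frequency vector.
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents):
    # Return indices of matching documents, most similar first.
    q = vectorize(query)
    scored = [(cosine(q, vectorize(d)), i) for i, d in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0]
```

A production system would typically add stemming or lemmatization, term weighting such as TF-IDF, and a precomputed index rather than vectorizing every document per query.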
Oleada is an extension of Cibola technology, its resource base, and graphical user interface design to the task of language instruction. Intended both for language instructors and learners, Oleada will enhance the instructor's ability to create and modify language instruction modules, and will help improve the learner's proficiency through self-study and evaluation.
The broad objectives of this research are to establish a theoretical framework for investigating the pragmatic aspects of the translation process and to implement a computational platform for carrying out systematic experiments on the pragmatics of translation.
The first step towards this has consisted of analysis of multiple translations of the same texts. We examine differences in translations that reflect differences of belief and/or differences in reasoning about the topics and participants in the text and the translation process. This analysis provides us with a model of the elements of context and the pragmatic inferencing involved in the translation process.
Tabula Rasa is a project that attempts to reduce two of the major bottlenecks of information extraction: defining text extraction tasks and developing tools to aid in producing structured data or templates. The Tabula Rasa toolkit is a 'meta-tool' that analysts can use to build tools that help with template-filling tasks.
This research continues the DARPA-sponsored TIPSTER Text Project on information retrieval. It is a joint effort among many sites to develop a system that integrates information retrieval and information extraction. At the core of the effort is a joint government/contractor committee that is specifying an architecture for this kind of system. Three CRL principal investigators are members of this committee. CRL's work on this project includes further development of the Diderot information extraction system developed in phase I of this project, and development of a variety of specialized software subsystems that support the architecture development. The subsystems being developed are: a document manager that provides multi-source document compatibility using SGML; a translation subsystem that supports retrieval of documents in many languages on the basis of a query in one language; libraries of procedures for user interface support with embedded functions for information retrieval and information extraction; and advanced Motif-based multilingual user interface capabilities, supporting Chinese, Japanese, Korean, Arabic and other writing systems.
The Unicode Retrieval System Architecture (URSA) is an attempt to make detection, retrieval and collection visualization completely transparent to query and document language issues. Ongoing work involves prototypes of translingual (cross-language) information retrieval systems, the development of Unicode IR technologies, and close integration with the TIPSTER document management architecture.