New Mexico State University

Abstracts

Meaning Oriented Question Answering
Dr. Stephen Helmreich

MOQA is CRL's ARDA-funded grant under the AQUAINT project, in collaboration with CoGenTex in Ithaca, NY, and ILIT (Institute for Language and Information Technologies) at UMBC (University of Maryland, Baltimore County). The goal of the project is to bring to bear CRL's ONTOSEM ontology and TMR (Text Meaning Representation) to enhance both the accuracy and user-friendliness of question-answering systems. The talk will cover the scope and structure of the project, with a focus on recent developments in ontology-based multilingual information retrieval.

Machine Learning of Verbal Meaning
Dr. Thomas Landauer

Latent Semantic Analysis (LSA) is a learning machine that computes relations between the meanings of all the words and passages in a representative collection of text. Measured by how well it simulates a wide variety of linguistic phenomena and human judgments and behaviors that depend on verbal meaning, it is surprisingly successful. The major surprise lies in the fact that LSA does as well as it does while ignoring word order within sentences and passages. I will attempt to explain why this should not have been such a surprise.

Cross-Language Text Retrieval at CRL
Dr. Bill Ogden
NMSU Computing Research Laboratory

In this talk I will review the state of the art in cross-language text retrieval. This is concerned with the design of automatic techniques to retrieve texts in languages other than the language of the query. I will focus on recent experiments conducted for TREC-8 and our plans for TREC-9.

Pattern Matching and Parsing with Charts
Dr. Rémi Zajac, Dr. Jan Amtrup
NMSU Computing Research Laboratory

We present a unified approach to parsing and pattern matching. Pattern matching traditionally refers to matching regular expressions in strings. Cascaded pattern matchers have been used for chunking and for partial parsing (Harris, 58). Parsing, on the other hand, refers to the analysis of a string using a context-free grammar or grammars at least as powerful as context-free grammars. The most efficient parsing algorithms are chart-parsing algorithms. We introduce a uniform notation for patterns and rules. We show through examples how pattern matching on strings can be extended to pattern matching on charts without losing the inherent efficiency of finite-state matching. This leads to a unified system for pattern matching and parsing where matchers and parsers can be mixed in a unified cascaded parsing architecture.

Norms and Exploitations: Towards a Theory of Linguistic Performance
Patrick Hanks
Chief Editor, Current English Dictionaries, Oxford University Press

In these two seminars, we shall look at some of the issues involved in linking meaning and use, i.e. mapping syntax onto semantics and vice versa, in the light of corpus evidence. Corpus linguistics is at the forefront of the resurgence of empiricism, prompting redefinition of some of our most basic assumptions about the nature of language, to account for observed data. The task confronting the linguistic analyst is seen as being to identify the norms of usage--or rather sets of overlapping norms--and the linguistic principles which govern the ways in which norms are exploited.

Dictionaries list many meanings, but they do not normally tell the user how to distinguish one meaning of a word from another. What's worse still, large dictionaries such as OED make little attempt to distinguish between normal usage and exploitation, often recording variations in phraseology and exploitations as if they were new norms. With the advent of large corpora, we can begin to see that meaning distinctions in real text are associated with differences in patterns of phraseology. This work is only just starting.

Some examples are discussed, drawing on Halliday's notion of "lexis as a linguistic level," Sinclair's work on lexical analysis and phraseology, Wilks's preference semantics, Fillmore's frame semantics, and Lakoff's work on metaphor and prototype theory. At the most general level, following Fillmore, we can make distinctions such as:

[PERSON] climb [PHYSOBJ] = go up

[PERSON] climb [PP] = clamber

At a much more delicate level, the default interpretation of "climb [MOUNTAIN]" involves using all four limbs and then some, while the default interpretation of "climb steps" involves using only the legs, not the arms. Are semantic norms phraseologically determined?

We shall also examine issues such as the distinction between literal and metaphorical meaning, including the nonliteral meaning of "literal" and the distinction between conventionalized and ad-hoc metaphors.

The Georgian Language: An Outline Grammatical Description
Dr. Oleg Kapanadze
Tbilisi State University
Republic of Georgia

The presentation will provide a brief description of Georgian, the principal member of the Kartvelian (South Caucasian) language group. It will cover the main features of the language, with particular emphasis on identifying general patterns in the complex verb system.

Statistical Morphological Disambiguation for Agglutinative Languages
by Kemal Oflazer
Computer Engineering Department
Bilkent University, Ankara, Turkey
(joint work with Dilek Hakkani-Tur and Gokhan Tur)

We present statistical models for morphological disambiguation in Turkish. Turkish presents an interesting problem for statistical models since the potential tag set size, with tags making the relevant morphosyntactic and semantic distinctions, is very large because of the productive derivational morphology. We propose to handle this by breaking up the morphosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag by considering statistics over the individual inflectional groups in a trigram model. Among the three models that we have developed and tested, the simplest model, which ignores the local morphotactics within words, performs the best. Our best trigram model achieves 93.95% accuracy on our test data, getting all the morphosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy rises to 95.07%.

Automatic Identification of Classified Documents
by Judy Hochberg
Computer Research and Applications Group (CIC-3)
Los Alamos National Laboratory

How can one automatically identify classified documents? This is a vital question for the Department of Energy (DOE), which is reviewing millions of classified documents for possible declassification, and for Los Alamos National Laboratory (LANL), which is checking its unclassified computing storage systems for the presence of classified documents.

After developing an expert rule system for automatic classification, the DOE provided me with a small set of documents with which to explore a statistical classifier as an alternative. I represented documents as vectors of character trigram frequencies (using a chi-square statistic to select the optimal trigrams), and trained a linear classifier using the Pocket algorithm. Results ranged from 60% accuracy (for identification of specific classified topics) to 87% accuracy (for identification of classified versus unclassified text). Training set size was a significant factor in the results. In contrast, my work for LANL started "from scratch" and needed to be moved rapidly into large-scale production. I therefore implemented an expert system tailored to LANL's needs, and have focused heavily on the practical issues that arose in canvassing large numbers of files in a variety of formats.
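
The statistical approach described above can be illustrated with a minimal sketch. It is not the author's implementation: the chi-square feature selection step is omitted, the toy documents and labels are invented, and the names `trigram_counts`, `pocket_train`, and `predict` are hypothetical. It assumes the usual form of the Pocket algorithm, in which perceptron updates are applied to randomly chosen examples and the weight vector with the fewest training errors seen so far is kept "in the pocket".

```python
import random
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams in a string."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def predict(weights, bias, feats):
    """Linear decision rule: +1 if the weighted sum is positive, else -1."""
    total = bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())
    return 1 if total > 0 else -1


def count_errors(weights, bias, data):
    """Number of training examples the current weights misclassify."""
    return sum(1 for feats, y in data if predict(weights, bias, feats) != y)


def pocket_train(data, steps=200, seed=0):
    """Perceptron updates, keeping the best weights seen so far in the pocket."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    pocket, pocket_errors = (dict(weights), bias), count_errors(weights, bias, data)
    for _ in range(steps):
        feats, y = rng.choice(data)
        if predict(weights, bias, feats) != y:
            # Standard perceptron correction toward the true label.
            for t, c in feats.items():
                weights[t] = weights.get(t, 0.0) + y * c
            bias += y
            errs = count_errors(weights, bias, data)
            if errs < pocket_errors:
                pocket, pocket_errors = (dict(weights), bias), errs
    return pocket, pocket_errors


# Invented toy data: +1 and -1 stand in for the two document classes.
docs = [
    ("reactor core design notes", 1),
    ("isotope enrichment schedule", 1),
    ("cafeteria lunch menu", -1),
    ("parking permit renewal", -1),
]
data = [(trigram_counts(text), label) for text, label in docs]
(weights, bias), train_errors = pocket_train(data)
```

Because the pocket only ever changes when the training error count drops, the returned weights are never worse on the training set than the initial all-zero weights.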

The Architecture of Meat
by Jan Amtrup, CRL

Meat, the "Multilingual Environment for Advanced Translations," is an MT framework designed following four principles:

- Integratedness. All hypotheses about various aspects of the translation process are stored in a chart that is accessible from all components.

- Uniformity. The descriptions of all linguistic objects are typed feature structures.

- Multilinguality. Foremost, this means that we use Unicode to represent language data.

- Configurability. Meat is not just a translation system, but is highly configurable using a simple application definition language.

We will present each of these principles and the application of Meat to a translation problem, using the Persian-English machine translation system Shiraz as an example.

Using a Target Language Model for Domain Independent Lexical Disambiguation
Presented by Yevgeny Ludovik
Authors: Jim Cowie, Yevgeny Ludovik, Sergei Nirenburg

The lexical disambiguation algorithm is based on a statistical language model. The input data to the algorithm is a set of possible translations. The algorithm selects the most probable translation. The statistical language model was trained on a corpus of American English newspaper texts. Its performance was tested using output from a transfer-based translation system between Turkish and English. The method is source-language independent, and can be used for systems translating from any language into English.
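
The selection step can be sketched as follows. This is an illustration only: the abstract does not say what kind of language model was used, so a bigram model with add-one smoothing is assumed here, and the tiny training corpus, the candidate translations, and the function names are all invented.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context and bigram counts from whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for s in sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        vocab.update(toks[1:])
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, v = model
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(toks, toks[1:])
    )


def pick_translation(candidates, model):
    """Return the candidate the target language model finds most probable."""
    return max(candidates, key=lambda c: log_prob(c, model))


# Invented stand-in for a target-language newspaper corpus.
corpus = [
    "the bank raised interest rates",
    "the river bank was muddy",
    "interest rates fell sharply",
]
model = train_bigram(corpus)

# Two candidate outputs differing in one lexical choice.
candidates = ["the shore raised interest rates", "the bank raised interest rates"]
best = pick_translation(candidates, model)
```

Only target-language statistics are consulted, which is why a method of this kind does not depend on the source language.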

Elicitation Knowledge About Syntax in the Boas System
by Svetlana Sheremetyeva, CRL

Boas is a large system designed for the elicitation of the linguistic knowledge necessary for the construction of a machine translation system.

This presentation is of work in progress on the development of the subsystem of Boas for the elicitation of syntactic information about the source language.

Following a brief presentation of the current status of this subsystem, there will be a discussion, soliciting ideas and suggestions for the further elaboration of the subsystem.

Lingvistica '98
by Michael Blekhman, Lingvistica '98, Inc.

This talk will describe the programs available from Lingvistica '98 for automatic translation between the following language pairs: English-Russian, English-Ukrainian, Russian-Ukrainian. Systems for English-German and English-Norwegian are under construction.

The translation methodology and the lexicons used will be described and actual translation results presented.

Lingvistica '98 has also provided specialized dictionaries in Ukrainian and Russian to SYSTRAN, and is involved with a number of other translation-related products and technologies.

Applications of Latent Semantic Analysis: Cross-Language Retrieval and Automated Essay Scoring
by Peter Foltz, NMSU Psychology Dept.

Latent Semantic Analysis is a statistical technique for deriving measures of semantic similarity between pieces of textual information based on an analysis of a large corpus. In this talk, I will discuss how LSA has been applied to cross-language retrieval and to automated essay scoring. For the essay scoring, LSA is able to grade essays as accurately as human graders as well as provide feedback to students. I'll also describe a web-based application used in an undergraduate course which provides automatic grading of essays with content-based feedback.

A Paradigm for Extracting Information from Medical Text
by Spencer Koehler, CRL

This presentation details the architecture of an Information Extraction system based on Natural Language Understanding Techniques. Various knowledge representation models were used and effectively combined for processing medical reports in SymText, a symbolic text understanding system. This system was used to extract coded information from dictated chest x-ray reports (discussed in this presentation) and free text admitting diagnoses. The novel techniques employed by this system were (1) the use of Bayesian networks as a model for context and (2) the method for integrating meaning (semantics) with structure (syntax).

Within the applied radiological domain, three contextual models were created and tuned through two developmental iterations. The system responded to the training with an overall increase in coding accuracy on an independent data set from 48.1% to 86.1% in recall and from 70.3% to 85.6% in precision. This evaluation showed that the applied techniques for understanding natural language (in a restricted domain) generalized well and encourages continued development and integration of the techniques with other natural language understanding paradigms.

Answering Questions
by Jim Cowie, CRL

This talk will describe the Text Retrieval Evaluation Conference's "Question and Answer" task and the system developed at CRL in response to this problem. Although several components were incomplete at the time of the evaluation, the system appears to produce correct output in many cases.

The CRL Q & A system uses the Mikrokosmos ontology, name and quantity recognition developed for Tipster Phase I, and phrase recognition. Questions are automatically structured both to retrieve documents using a Boolean retrieval engine and to match up with a specially developed question lexicon. The question lexicon is used to find the type of answer required and to structure the information in the question. This structure is then used to perform extraction on each sentence in each document retrieved, and the best 5 matches are provided as answers.

The system was designed and developed by Svetlana Sheremetyeva, Jim Cowie, Eugene Ludovik, Hugo Molino-Salgado, and Sergei Nirenburg.

Connections Between Human and Machine Translation:
A case study

by Anna Aguliar, Barcelona Autonomous University

In recent years Translation Studies has been trying to achieve the status of an interdisciplinary science and theory rather than an anecdotal art dependent on linguistics. This theory has several lines of research focusing on the process and the results. Computational tools are involved in research, training and professional activities. Beyond pure Linguistics, the translator is interested in expressing differences between cultures (Anthropology), in the cognitive aspects of processing knowledge (Psychology and Cognitive Science) and in the organization of information in the text. When dealing with a text the translator attempts to preserve the functional content of the communication act, even if this means sacrificing the semantic content.

Although there are many different schools and approaches, we won't discuss them directly. Instead we'll consider how a translator's decisions are affected by differing sociolinguistic points of view: an international language like English, a major language such as Spanish and a minority (or minoritized) language such as Catalan.

Those differences affect the translation market and also the linguistic policy and attitude towards the defense of the conceptual system and its subsystems. 90% of the translation activity in non-English-speaking countries concerns technical and scientific texts. This is one of the reasons why terminology has become a key tool in the production as well as translation of domain-oriented texts. The acquisition of expert knowledge is also crucial to give transparency to that terminology.

BACO (BAse de COneixement - Knowledge Base) is a large terminology database (7,000 concepts and more than 30,000 forms) built by the highest level of the Translation and Interpretation Department of the Autonomous University of Barcelona. BACO is an example of the kind of knowledge representation and storage which professional translators (as well as Machine Translation systems) involved in technical translation need. Nor is Machine-Aided Translation possible without a terminology tool like this (BACO was built with MultiTerm, one of the tools involved in Translator's Work Bench, a translation memory software by TRADOS). 

This data base was built using onomasiological criteria and takes into account the display of the sub-concepts (features) and related concepts involved in a concept definition while specifying them in an atomized way. The database building process can be improved by using information extraction tools in a domain-oriented corpora. Tools to find, identify and store the information should make the work of the terminologist and the translator much easier.

BACO is based on the philosophy that any of the languages involved can be used to make decisions regarding the conceptual split in a specific field.

Once represented, selecting one word among variants is a social and political decision. Does terminological homogeneity work against the survival of minoritized languages? The answer to this question can bring us to another: should Machine Translation try to take account of discrepancies rather than the semantic common space?

Ambiguities in Automatic Translation to Braille
by Christopher Weaver

English Braille and technical braille codes appear to be easy targets for translation from source documents. That is, they should only involve regular expression replacement or simple parsed filtration.

However, the authors of the braille codes have introduced several context dependencies. Some of these are syntactic context dependencies which can be resolved by keeping the context for reference as the translation proceeds. Others, though, are dependent on phonetic or even semantic context and thus require more sophisticated linguistic processing.

This talk will encompass a general discussion of the rules for braille and possible techniques for efficient translation.

Methods for Probabilistic Classification in Natural Language
Processing Applied to Identifying Subjective Sentences

by Janyce Wiebe

The first part of this talk will give an overview of a framework for developing probabilistic classifiers in Natural Language Processing (NLP) (Bruce & Wiebe 1999). A probabilistic classifier assigns the most probable class to an object, based on a probability model of the interdependencies among the class and a set of input features. This framework focuses on formulating a model that captures the most important interdependencies, to avoid over-fitting the data while also characterizing the data well. The class of probability models and the associated inference techniques were developed in mathematical statistics, and are widely used in artificial intelligence and applied statistics. However, these techniques have not been widely used in NLP. The class of models (decomposable models) is large and expressive, yet there are computationally feasible model search procedures defined for them. The formality of the method supports evaluation: the talk will briefly describe how the three determinants of classifier performance (the features, the form of the model, and the parameter estimates) can be separately evaluated.

The second part of the talk will describe an empirical investigation of a natural language disambiguation task. In many text processing applications, such as information extraction, summarization, text categorization, and information retrieval, it is important to distinguish "objective sentences," which are used to present factual information, from "subjective sentences," which are used to present beliefs and evaluations (Wiebe 1994; Wiebe, Bruce, & O'Hara, 1999). Whether a sentence is subjective or objective depends not only on semantics, but also on the context in which the sentence appears. Using the model search procedure described above, we developed a probabilistic classifier for identifying subjective sentences. Using only shallow features, the classifier achieves an average accuracy 21 percentage points higher than the baseline, in 10-fold cross validation experiments. In order to develop the classifier, a gold-standard data set was needed for training and testing. In a two-phase study, a data set was annotated by multiple annotators, resulting in high intercoder agreement for sentences that could be tagged with certainty. The classifier also performs better on sentences the judges tagged with certainty, showing consistency between the classifier and the human judges.

References

Bruce, Rebecca & Wiebe, Janyce (1999). Decomposable modeling in natural language processing. To appear in Computational Linguistics 25(2).

Wiebe, Janyce (1994). Tracking point of view in narrative. Computational Linguistics 20 (2): 233-287.

Wiebe, Janyce, Bruce, Rebecca, & O'Hara, Thomas (1999). Development and use of a gold standard data set for subjectivity classifications. To appear in Proc. 37th Annual Meeting of the Assoc. for Computational Linguistics (ACL-99).

Interactive Cross Language Text Retrieval:
Can we expect people to be able to get information from texts in languages they cannot read?

by Bill Ogden

In this talk I will review two relevant lines of research bearing on this issue, including work at CRL, and will show how our results are being used in the design of a new WEB interface for cross-language text search. One line of research, "Interactive IR", is concerned with the user interface issues for information retrieval systems such as how best to display the results of a text search. I will review current research, including our own on "document thumbnail" visualizations, and discuss current WEB conventions, practices and folklore. The other area of research, "Cross-Language Text Retrieval", is concerned with the design of automatic techniques, including Machine Translation, to retrieve texts in languages other than the language of the query. I will review work done at CRL in the URSA project concerning query translation and in the MINDS project concerning multilingual text summarization.

I will discuss how our new demonstration project, Keizai, uses and extends the results of these lines of research into an end-to-end web-based cross-language text retrieval system. Beginning with an English query, the system will search Japanese and Korean web data and display English summaries of the top ranking documents. A user should be able to accurately judge which foreign language documents are relevant to their query and either glean necessary information from the translation or schedule specific documents for human translation and subsequent analysis.

This work will be set in the context of the evaluations planned for TREC-8, the upcoming Text Retrieval Evaluation Conference.

Input Methods and Script-specific Writing Rules
by Malek Boualem

Unicode as a universal character set solves encoding problems of multilingual texts. It provides abstract character codes but does not offer methods for rendering text on screen or paper. Abstract characters can have different visual representations (called shapes or glyphs) on screen or paper, depending on context. Different scripts which are part of Unicode require writing rules for rendering glyphs and also composite characters, ligatures, and other script-specific features.

In this talk we present a general approach to encoding script-specific writing rules based on the Unicode character set and using finite-state transducers defined in the Salsa architecture, applied here to the specification of input methods. This approach is modular, Unicode-compatible and accessible to the users.

Multilingual Inheritance-Based Lexical Representation
by Carole Tiberius, University of Brighton

Most work on multilingual lexicons so far has assumed monolingual lexicons linked only at the level of semantics.

Cahill and Gazdar (1999) argue that this approach might be appropriate for unrelated languages, but that it makes it impossible to capture useful generalizations about related languages.

Closely related languages exhibit many similarities at all levels of linguistic description - morphology, phonology, orthography, syntax, etc. - not just semantics. Compare, for example, the forms of the verb "sing" in Dutch, English, and German:

    sing - sang - sung (English)
    zing - zong - gezongen (Dutch)
    sing - sang - gesungen (German)

Such similarities, if captured, can help to produce more robust natural language processing systems for such languages.

Cahill and Gazdar describe an architecture which aims to encode and exploit lexical similarities between closely related languages. They applied this architecture in the PolyLex project to define a trilingual hierarchical lexicon for Dutch, English, and German sharing morphological and phonological information between these languages.

The aim of my Ph.D. research is to look at the methodological and theoretical issues raised by the construction of such multilingual inheritance-based lexicons and to develop a framework for Multilingual Lexical Representation.

In this talk, I will first look at the language sampling problem. For a framework to be generally valid - covering all variants encountered in natural language - it is necessary to use a language sample that explores as much as possible the full range of forms and constructions that can occur. I will show how a representative subset of languages can be selected.

Secondly, I will focus on how to structure such a multilingual inheritance-based lexicon. I will describe two models, the structure-sharing model and the meta-features model. I will discuss the advantages and disadvantages of both models with reference to two sample lexical fragments of Danish, Dutch, and English. I will conclude with some suggestions for further research.

References
Cahill, L. and G. Gazdar. 1999. The POLYLEX architecture: multilingual lexicons for related languages, In _Traitement Automatique des Langues_, 38:1.

Keizai and the Expedition Configuration and Control System
by Jim Cowie

Keizai is a prototype system integrating cross-language retrieval, summarization and machine translation. The languages being processed are Japanese, Korean, and English. The main goal is to find presentation methods that allow Keizai to support useful information analysis.

The Expedition Configuration and Control System supports the development of complex sets of software and data. At the moment it is being designed to support a programmer in the construction of the data and programs needed to do machine translation from a new language to English. The CCS contains features which control order of development, tutorial material, and automatic execution of support tasks. It is accessed through a web browser and its control mechanisms include an extended form of HTML which permits on-the-fly substitution of variables and execution of processes on the server.

Prolegomena to the Philosophy of Linguistics
by Sergei Nirenburg and Victor Raskin

Building large and comprehensive computational linguistic applications involves making many theoretical and methodological choices. These choices are made by all language processing system developers. In many cases, the developers are, unfortunately, not aware of having made them. This is because the fields of computational linguistics and natural language processing do not tend to dwell on their foundations, or on creating resources and tools that would help researchers and developers to view the space of theoretical and methodological choices available to them and to figure out the corollaries of their theoretical and methodological decisions. We report on one step towards generating and analyzing such choice spaces. Issues of this kind typically belong to the philosophy of a branch of science, hence the title.

Research Activities in the KLE Laboratory in Pohang, Korea
by Professor Jong-Hyeok Lee

The Knowledge and Language Engineering Laboratory (KLE), at Pohang University of Science and Technology, Korea, has been involved in various aspects of natural language processing (computational linguistics) since 1991, with special emphasis on computer processing of the Korean language.

Currently KLE's research efforts are concentrated in three major areas. The first one is the core technology of Korean language processing, which is fundamental to the development of computer systems that recognize, understand, and generate the Korean language. The second one is machine translation (MT) between Korean and other foreign languages such as Japanese, Chinese, and English. Finally, information retrieval (IR) is also a main concern, focusing on Web-based applications.

In this talk, I'll introduce our activities while displaying some demo systems through the Web. And then I'll make a technical presentation about CLTR-J/K (Cross-Language Text Retrieval for Japanese through Korean).

UNICODE the Third
by Mark Leisher

Unicode is expanding once again, this time to include scripts that were left out in earlier versions and more Han characters. The first half of the talk will be a short survey of the scripts and other characters being added for Unicode 3.0, and the second half will be a short and simple introduction to Unicode for those who feel they need it.

Word Frequency Count and Point of View/Introduction to Chu Spaces
by Stephen Helmreich

This is a two-part presentation. The first part will report on applying simple counting techniques to ascertain the point of view of the author of an article.

The second part will present a short introduction to Chu spaces, which are very general mathematical objects that can model topological, algebraic, and logical structures. A Chu space consists of three sets and a function from the cross product of the first two into the third. In most cases, the third set is {0,1}.
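
The definition above can be made concrete with a small sketch. The class and attribute names are illustrative assumptions ("points" and "states" are common names for the first two sets but do not come from this abstract), and the example, which encodes a two-element set by letting the function record membership in each subset, is invented for illustration.

```python
from itertools import product


class ChuSpace:
    """Three sets plus a function r from points x states into values."""

    def __init__(self, points, states, values, r):
        self.points = list(points)
        self.states = list(states)
        self.values = set(values)
        # Tabulate r on the full cross product of the first two sets.
        self.table = {(a, x): r(a, x) for a, x in product(self.points, self.states)}
        if not set(self.table.values()) <= self.values:
            raise ValueError("r must map into the third set")

    def row(self, a):
        """Values of r for one point across all states."""
        return tuple(self.table[(a, x)] for x in self.states)

    def dual(self):
        """Swap the roles of the first two sets (transpose the table)."""
        return ChuSpace(self.states, self.points, self.values,
                        lambda x, a: self.table[(a, x)])


# Illustrative example over {0, 1}: r(a, x) = 1 exactly when a is in subset x.
elements = ["a", "b"]
subsets = [frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")]
space = ChuSpace(elements, subsets, {0, 1}, lambda a, x: int(a in x))
```

Viewed as a table, the space is a 0/1 matrix with one row per point and one column per state; `dual()` is the transpose of that matrix.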

Research in Computer Science at the Universidad Complutense de Madrid
by Antonio Vaquero

Antonio Vaquero is Full Professor of Computer Science and former Head of the Department of Computer Science at the Universidad Complutense de Madrid. As a part of his sabbatical year, Prof. Vaquero is currently a Visiting Research Scholar at the Computing Research Laboratory. To date, he has carried out research in several different areas of Computer Science and educational applications. His current interests include computer-based learning and instruction, information retrieval, natural language processing and Spanish as an object language for computing. He will present a summary of his research activities with an emphasis on those which are most directly related to NLP.

 

Bootstrapping a Morphological Analyzer
by Kemal Oflazer

This talk addresses the problem of bootstrapping a morphological analyzer for a language, given a list of roots (possibly annotated with additional tags, such as POS), and some text. In contrast with recent approaches that use minimum description length and similar methods, our approach uses approximate matching between roots and words to identify potential roots even when the roots have been deformed by morphographemic processes. Once potential roots have been identified, a list of affixes is generated, words are segmented, and the most suitable segmentations are selected. Human intervention at this step is possible to fix any minor discrepancies. The segmented words along with their original forms are then used by a transformation-based learner which induces morphographemic rules to segment and analyze words; these rules can then be used to build a finite state morphological analyzer for the language in question. Very preliminary investigations on Slovenian, Bulgarian and English have provided some quite interesting results.
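
As a rough illustration of the root identification step, the sketch below uses edit distance between a root and the best-matching prefix of a word. This is an assumption made for the example: the abstract does not specify which approximate matching technique is used, and the function names, the threshold, and the English word list are hypothetical.

```python
def prefix_edit_distance(root, word):
    """Smallest edit distance between root and any prefix of word."""
    # prev[j] holds the distance between the root read so far and word[:j].
    prev = list(range(len(word) + 1))
    for i, rc in enumerate(root, 1):
        cur = [i]
        for j, wc in enumerate(word, 1):
            cur.append(min(prev[j] + 1,                 # drop a root character
                           cur[j - 1] + 1,              # insert a word character
                           prev[j - 1] + (rc != wc)))   # match or substitute
        prev = cur
    return min(prev)


def candidate_roots(word, roots, max_dist=1):
    """Roots that match some prefix of the word within max_dist edits."""
    return [r for r in roots if prefix_edit_distance(r, word) <= max_dist]


roots = ["carry", "walk", "sing"]
matches = candidate_roots("carried", roots)
```

Here "carry" is still proposed for "carried" even though the final "y" of the root has changed, which is the kind of deformation exact prefix matching would miss.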

Dependency Parsing With an Extended Finite State Approach
by Kemal Oflazer

The talk presents an approach to dependency parsing using an extended finite state model. The finite state approach augments the input representation with "channels" so that links representing syntactic dependency relations among words (or rather "inflectional groups", a term more appropriate in the context of Turkish, to which we apply our approach) can be encoded. Intermediate configurations violating various constraints of dependency representations, such as planarity (projectivity), no unlinked items except the sentential head, etc., are filtered via finite state filters. The extended nature of the approach is due to the fact that the parser has to iterate on the input (typically 3-4 times) to arrive at a fixed point, much as in the approaches of Roche (1996) and Abney (1996). The parser takes in morphologically analyzed and disambiguated text, and produces an output that encodes the syntactic relations. It is possible to refine the parser so that labeled links and limited nonplanar constructs can be handled, and morphological disambiguation can be done during parsing.
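
The two constraints named above can be stated directly over a list of dependency links. The sketch below checks them in ordinary Python rather than with finite state filters, so it illustrates what is being filtered, not how the parser does it. Representing links as (dependent, head) position pairs is an assumption made for the example, as are the function names.

```python
from itertools import combinations


def crosses(link_a, link_b):
    """True if two links cross when drawn as arcs above the sentence."""
    (i, j), (k, l) = sorted(link_a), sorted(link_b)
    return i < k < j < l or k < i < l < j


def is_planar(links):
    """Planarity constraint: no pair of links may cross."""
    return not any(crosses(a, b) for a, b in combinations(links, 2))


def all_linked(n_items, links, head):
    """Every item except the sentential head must be linked as a dependent."""
    dependents = {d for d, _ in links}
    return dependents == set(range(n_items)) - {head}


# Three items with item 1 as sentential head: 0 -> 1 and 2 -> 1.
good = [(0, 1), (2, 1)]
# Four items whose two links cross: 0 -> 2 and 1 -> 3.
bad = [(0, 2), (1, 3)]
```

A configuration such as `bad` fails the planarity check; under the approach described above, configurations like it are the ones removed by the filters.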

The Un-MT: Cool and Refreshing
by Stephen Beale

Syntax without grammars! Exemplified with the Turkish-English MT system recently developed at CRL. Includes phrase and clause chunking and transfer using simplified patterns and a few heuristics.


Pragmatics-based Machine Translation
by David Farwell

This presentation is a summary of work in pragmatics-based machine translation which has been carried out over the last few years. Underlying the approach is the assumption that translation is inextricably linked to the translator's beliefs about the topics under discussion, about the author and addressees of the source language interaction and about the addressees of the translation. We begin with an introduction to a pragmatics-based approach. Next, we present some background terminology and potential apparatus for implementing such an approach. We then review one case study in which differing beliefs about the source language interaction result in different translations and one case study in which differing beliefs about the target language interaction result in different translations. We conclude with a few observations about the implications of this work for translation, machine translation and MT evaluation.

Document thumbnail visualizations for rapid relevance judgments: When do they pay off?
by Bill Ogden

This talk will be presented at the upcoming TREC conference and is the joint work of William Ogden, Mark Davis, and Sean Rice of New Mexico State University.

In a preliminary experiment, Kaugars [1] showed that people were faster and better at making relevance judgments for a fixed set of retrieved documents when using a system that simultaneously presented small graphical representations (thumbnails) of each whole document in comparison to using a system representing documents as lists of short titles. Document thumbnails presented color-coded highlighting indicating positions and identities of keywords in the document, and a document viewer showing a multiply focused fish-eye view of sentences holding keywords.

We attempted to replicate this result using an interactive Web version of a thumbnail document set viewer and the prescribed TREC-7 interactive track methodology. The system (named J24 after the July 24th deadline for its completion) was built using the Unicode Retrieval System Architecture (URSA) text retrieval software library developed at the Computing Research Lab at New Mexico State University. The system provides a Web interface for entering search terms and displaying results with thumbnails for 10 documents at a time. The J24 system was compared to a control retrieval system, ZPRISE.

We failed to replicate any advantage for the J24 thumbnail displays. Overall, there are no differences between J24 and ZPRISE in the time taken or the number of documents saved (i.e. documents judged to have relevant aspects). Task and user variability seem to overwhelm any advantage the J24 system may have had.

We will use a task analysis model of the TREC-7 interactive task to illustrate the problems this evaluation methodology has in determining the effects of any particular interface feature.

We will then describe a follow-up study we designed based on this analysis, and explain why it too shows no advantage for the thumbnail visualizations.

Along the way we will highlight the features of URSA and the J24 interface.

REFERENCE
[1] Kaugars, Karlis J. 1998. A Hierarchical Approach to Detail + Context Views. Unpublished Doctoral Dissertation. New Mexico State University.


Machine Interpreting with Layered Charts
by Jan Amtrup

Human natural language understanding works incrementally: we start understanding words and utterances before the speaker has completed them. The incorporation of incremental techniques into systems for speech processing has become increasingly desirable in recent years, since certain types of applications (e.g. sophisticated dialog systems and simultaneous interpreting) are only feasible this way. The main obstacle to using incrementality is the absence of global optimization criteria, which usually yields a drastic decrease in performance.

We present an experimental machine interpreting system which provides incremental translations for spontaneously spoken language in the domain of appointment scheduling. The design of the system is based on layered charts, an extension of the well-known parsing data structure. An efficient formalism (typed feature structures implemented on the basis of an abstract machine interpretation) is used to represent linguistic objects throughout the system, ranging from idiomatic expressions to generated English expressions. Using efficient algorithms for graph processing and search we show that incremental systems are able to almost reach the performance of conventional systems.

Aspects of GETA's MT methodology applied to high-quality personal networking communication and pragmatic speech translation
by Christian Boitet

Ariane-G5 is an integrated environment initially designed to facilitate the development of multilingual MT systems for revisors (MT-R), where output quality is obtained by using the heuristic programming facilities of its 5 rule-based languages to specialize the lingware components to the sublanguage at hand. It can support many MT architectures and linguistic methodologies and accepts whole paragraphs or pages as units of translation rather than separate sentences. For MT-R, B. Vauquois' multilevel transfer approach has given excellent results on a large number of MT-R mockups and prototypes, as well as two large-scale operational systems. As neither the computer tools nor the linguistic methodology embody a particular theory, they are quite easy to adapt to new problems. In the last few years, they have actually been revised and further developed in the framework of new research on high-quality MT for monolingual authors (MT-A), relying on a disambiguation dialogue with the author (DBMT), following an all-paths analysis.

The evolution of Ariane-G5/LIDIA and associated linguistic methodologies is now motivated by two projects of different aims and requirements. The UNL project of personal multilingual high-quality communication over the Internet requires the construction of a large lexical database from which coherent dictionaries for MT and for interactive disambiguation will be generated. For the C-STAR Speech Translation (ST) project, it becomes necessary to start from phonetic lattices output by a speech recognizer, and to transmit some linguistically useful memory from one dialogue turn to the next.

Annotated bi-text as productive language resource: DTD-driven bilingual document generation
by Arantza Casillas

Among different annotation schemes, TEI conformant SGML markup appeared to be a good option to explore. Therefore, the segmentation and alignment of a large Basque-Spanish bitext has been carried out in the form of standard widespread annotation conventions based on TEI P3 guidelines.

By virtue of the markup, the annotated and aligned bi-text becomes a rich and productive language resource in itself, containing three different types of translation memories as well as glossaries for proper names and terminology. We have projected these data outside the corpus into classic database formats, resulting in four independent linguistic databases: (i) an aligned sentence-based translation memory, (ii) a second translation memory based on variable translation units, (iii) a bi-text document-base, and (iv) proper name and terminological glossaries.

In addition to these resources, paired DTDs (Document Type Definitions) have also been derived. DTDs have been abstracted from the SGML tagged documents, defining the structural elements of these documents. What is most interesting about these paired DTDs is that they can be put to work in the process of generating new bilingual documents. The translation memory based on variable translation units is used for that purpose. We will present an experimental editing environment that tries to demonstrate the practicality of this approach to machine translation.

Proper names, terms, and other multi-word collocations: bitext segmentation and alignment
by Raquel Martinez

Different strategies exist for approaching the exploitation of a big language resource, such as a large bi-text. In the tradition of humanities computing, the prevailing approach has been to annotate the corpus so that implicit linguistic information becomes explicit. The information that can be made explicit runs from general pragmatic and discourse features to concrete morphological and phonetic features, as well as lexical, semantic and syntactic features. Some are easier to grasp than others, but in any case, a large literature has been produced at all levels of analysis.

Our talk will describe our efforts to bring out some of the linguistic features present in the bi-text, particularly those features that have been explored in connection with aligned parallel corpora and translation memories: segmentation into translation units and their alignment.

Segmentation has been attained at different levels, from general discourse segments (such as text divisions) to sentence and intra-sentential elements. In our talk, we will concentrate on the segmentation and alignment of the latter, including proper names, as well as multi-word lexicological and terminological units.

Processing inflectional head-initial vs. agglutinative head-final languages in a large Spanish-Basque bi-text
by Joseba Abaitua

Spanish and Basque are two languages that have coexisted since Spanish became a language, differentiating itself from its close Romance relatives (Portuguese, Catalan, French and Italian). All these languages are quite similar with respect to their main linguistic features. In addition to a largely shared lexicon, their grammars are nearly the same. All are SVO languages with rather strict head-initial behaviour, most clearly seen within noun phrases. This is an important issue for NLP as word order remains highly homogeneous within Romance languages.

In contrast, Basque, which is a pre-Romance (indeed, pre-Indo-European) language, displays almost exactly the opposite properties. It is an SOV language with very strict head-final behaviour, not only within NPs but also in embedded clauses. For those who work with Japanese, the grammar of Basque will seem very familiar.

Our project concerns the exploitation of a very large corpus of bilingual texts in Spanish and Basque. We regard our bi-text as a huge source of linguistic information and we want to talk about the strategies we have adopted to get the most out of it.


A Brief Introduction to Persian
by Karine Megerdoomian

This talk gives an overview of the Persian language covering its history, writing system, morphology and syntax. I will emphasize the more unusual aspects of the language, and I will discuss difficulties that arise in a computational analysis of Persian.

On Parsimony and Induction
by Mark Davis

The notions of "simplicity," "parsimony," "elegance" and "economy" have been used since Aristotle to attempt to describe the desirability of one theory versus another in scientific discovery. In the last 40 years, the idea that compact theories lead to better theories in inductive inference and prediction has been bolstered by simultaneous developments in information, coding and computation theory. In this presentation, I will briefly describe the history of thinking on inductive inference ("the scandal of philosophy"), and will develop the Minimum Description Length heuristic from Bayesian principles and from the Chaitin-Kolmogorov complexity of algorithms. I will then show how this heuristic can be used to formulate algorithms for the unsupervised learning of ambiguous term relationships and phrase structure from text, and how a cognitive model can be constructed to explain human performance in learning artificial grammars.

Unicode and How it Got That Way
by Mark Davis

Most people working in environments that require multilingual text quickly encounter the problem of representing a mixture of text in two or more languages in one document. Various temporary solutions were found to support this requirement, but a price had to be paid at the application level. Eventually it was recognized that the cost of using these temporary solutions was too high and a new character set was designed, Unicode. This presentation provides an introduction to Unicode and some important concepts about Unicode.
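
As a small illustration of the underlying idea (one character set covering many scripts), the Python sketch below holds Latin, Cyrillic and Japanese text in a single string and lists the code points involved. The helper names and the sample text are assumptions for this example. Reading a script label from the first word of a character's Unicode name is only a rough heuristic, not the Unicode Script property.

```python
import unicodedata


def code_points(text):
    """Format each character as a U+XXXX code point label."""
    return [f"U+{ord(ch):04X}" for ch in text]


def scripts_used(text):
    """Rough script labels: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


# Hypothetical sample: words in Turkish, Russian and Japanese in one string.
mixed = "ev дом いえ"

print(scripts_used(mixed))
print(code_points("дом"))

# One encoding (here UTF-8) round-trips the whole mixed string.
assert mixed.encode("utf-8").decode("utf-8") == mixed
```

No per-language switching or application-level bookkeeping is needed to keep the three scripts apart; each character simply has its own code point.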

Morphological Disambiguation of Turkish Using Voting Constraints
by Kemal Oflazer


Implementing a Turkish Morphological Analyzer with Xerox Finite State Tools
by Kemal Oflazer

This talk describes the implementation of a morphological analyzer for Turkish using the two-level morphology approach augmented with additional levels of representation and finite state operations. Both morphographemic and morphotactic aspects of Turkish will be discussed, with emphasis on processing real-world text.

The Corelli Document Manager
by Remi Zajac

We describe a new flexible annotation scheme, the Corelli Document Manager, which extends the Tipster Document Architecture in several innovative directions. Annotation types are defined using typed feature structure definitions; the annotations themselves are instances of these types. Annotations are stored in an object-oriented database system; the corpora files can be stored on a file system (local or remote) or in the database itself; the Document Manager maintains the relations between annotations and the documents.
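
The general idea of annotations kept separately from the document text can be sketched as follows. This is a minimal Python illustration of stand-off annotation only; it does not reproduce the Corelli Document Manager or the Tipster API, and it uses plain dictionaries where the talk describes typed feature structures. All class, method and feature names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed annotation over a character span, with free-form features."""
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Holds the raw text and maintains the annotations that refer to it."""
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, type, start, end, **features):
        ann = Annotation(type, start, end, features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]

    def of_type(self, type):
        return [a for a in self.annotations if a.type == type]


doc = Document("Kemal Oflazer visited CRL.")
doc.annotate("token", 0, 5, pos="NNP")
name = doc.annotate("name", 0, 13, kind="person")

print(doc.covered_text(name))
print([a.features for a in doc.of_type("token")])
```

Because the text itself is never modified, the same document can carry any number of overlapping annotation layers.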

Some Current Research in Text Retrieval at Computing Research Lab
by Mark Davis and Bill Ogden

We will provide an overview of the state-of-the-art in modern text retrieval systems--from monolingual to cross-language and interactive approaches--as well as provide an introduction to evaluation methods and metrics in the text retrieval community. After the introduction, we will describe the development of QUILT, a cross-language retrieval system developed at CRL for English access to a Spanish text database, as well as recent efforts in designing experiments to test the effectiveness of retrieval systems for end users.

MINDS: A Multilingual Information Retrieval and Text Summarization System
by Jim Cowie

CRL's work on Multilingual Interactive Document Summarization for TIPSTER Phase III combines research in automatic and interactive summarization with an integration of new and current methods in a multi-engine prototype summarization tool. We are building on the CRL Core Summarization Engine to provide robust multilingual summarization capabilities designed to aid fast and interactive document filtering, even in the absence of certain lexical or other resources for a language. The initial languages supported will be English, Spanish, Japanese and Russian. In addition to providing stand-alone summarization and translation capabilities, the system will build personal profiles, in English, of selected personages by using a combination of techniques.

Breaking Down Barriers: The Mikrokosmos Generator
by Stephen Beale

We argue that modularization of text generation into separate tasks, as currently practiced, sets up unneeded barriers to the generation task. We propose a new modularity based on natural linguistic phenomena and overview how it is implemented in the Mikrokosmos text generator.

User-Friendly Machine Translation: Alternate Translations Based on Differing Beliefs
by David Farwell

A notion of "user-friendly" translation is presented and a method for achieving it within a pragmatics-based approach to machine translation is described. The approach relies on modeling the beliefs of the participants in the translation process: the source language speaker and addressee, the translator and the target language addressee. Translation choices may vary according to how beliefs are ascribed to the various participants and, in particular, "user-friendly" choices are based on the beliefs ascribed to the TL addressee.

An Empirical Approach to Temporal Reference Resolution
by Janyce Wiebe

This talk presents the results of an empirical investigation of temporal reference resolution in scheduling dialogs. The algorithm adopted is primarily a linear-recency based approach that does not include a model of global focus. A fully automatic system has been developed and evaluated on unseen test data with good results. This paper presents the results of an intercoder reliability study, a model of temporal reference resolution that supports linear recency and has very good coverage, the results of the system evaluated on unseen test data, and a detailed analysis of the dialogs assessing the viability of the approach.

Evaluating Natural Language Interfaces: Review and Implications for Information Retrieval
by Bill Ogden

Natural language interfaces were introduced in the 70's and 80's primarily as user interfaces for relational databases. A comprehensive review of the empirical studies of the usefulness of these interfaces yields discouraging results. I will summarize these studies within a framework for modeling the cognitive processing involved in database querying, which will be used to highlight why natural language was the wrong interface for relational databases. However, the same analysis suggests a natural language interface may be a better choice for text retrieval.

Probabilistic Event Categorization
by Janyce Wiebe

This talk describes the automation of a new text categorization task. The categories assigned in this task are more syntactically, semantically, and contextually complex than those typically assigned by fully automatic systems that process unseen test data. Our system for assigning these categories uses a probabilistic classifier, developed with a recent method for formulating a probabilistic model from a predefined set of potential features (Bruce 1995, Bruce and Wiebe 1994, Pedersen et al. 1996). This paper focuses on feature selection. It presents various types of properties experimented with in this work. We identify and evaluate various approaches to organizing the collocational properties into features. With the more complex features we define, there is an organization that yields the best results; but the same organization with less complex features yields inferior results. The results suggest a way to take advantage of properties that are low frequency but strongly indicative of a class. The problems of recognizing and organizing the various kinds of contextual information required to perform a linguistically complex categorization task have rarely been systematically investigated in NLP.

Minimum Description Length for Inferring Statistical Language Models
by Mark Davis

Statistical language models have presented some compelling and useful, if often linguistically-naive, results. This talk will present ongoing work on using an approach to infer language models that combines considerations of the performance of a model on a data set with considerations of the cost of specifying the parameters of the model. Minimum Description Length approaches provide a framework for accounting for model complexity while simultaneously optimizing the performance of a language model. Early applications to be presented include a demo of a multigram model for text segmentation that is completely language independent, while ongoing work is focused on stochastic grammar induction and semantic modeling.
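
The trade-off described here, between the cost of stating a model and the cost of encoding the data under it, can be illustrated with a toy two-part code length. The Python sketch below is an assumption-laden illustration, not the multigram model from the talk: it uses a unigram model over tokens, charges a flat, made-up 5 bits per character to spell out each lexicon entry, and compares two segmentations of the same text.

```python
import math
from collections import Counter


def description_length(tokens, bits_per_char=5.0):
    """Two-part code length in bits: model cost plus data cost.

    Model cost: spell out every distinct token plus an end marker.
    Data cost: encode the token sequence under the unigram
    distribution estimated from the tokens themselves.
    """
    counts = Counter(tokens)
    total = len(tokens)
    model_bits = sum((len(tok) + 1) * bits_per_char for tok in counts)
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return model_bits + data_bits


text = "thedog" * 20
as_chars = list(text)            # lexicon of 6 single letters
as_words = ["the", "dog"] * 20   # lexicon of 2 longer units

print(description_length(as_chars))
print(description_length(as_words))
```

Under these assumptions the word-level segmentation gives the shorter total description, even though each of its lexicon entries costs more to state, which is the kind of comparison an MDL criterion uses to choose among candidate models.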

Natural Language Processing for the World Wide Web: Hypertext Summary Extraction for Fast Document Browsing
by Kavi Mahesh

Two kinds of summarization programs have been built before: those that throw away everything but the summary and those that highlight pieces of the summary throughout the document. The former kind is good if there is a correct algorithm for summarization (e.g., if we already know what the document contains). The latter provides no facility to navigate to different parts of the document and is good only for short documents. Neither kind summarizes the parts that weren't included in the summary.

I present a new kind of summarization tool, called HyperGen, that exploits Hypertext technology to automatically generate Hypertext structure from a plain or Hypertext document. Every part of the document is summarized. None is thrown away. The different summaries are linked together in a Hypertext structure where each hyperlink is labeled meaningfully. HyperGen has implemented preliminary ideas for generating meaningful labels by identifying key topics and rhetorical types.

The presentation will include a "show and tell" and a comparison between HyperGen and Microsoft's AutoSummarize tool. Also, all of the examples are new!

Software Infrastructure for Language Engineering
by Hamish Cunningham

This talk reviews and classifies the currently available design strategies for software infrastructure for NLP and presents an implementation of a system called GATE - a General Architecture for Text Engineering. By *infrastructure* is meant what has been variously referred to in the literature as: software architecture; software support tools; language engineering platforms; development environments. The argument is that when NLP is applied to constructing large-scale systems with predictable performance levels and robust behaviour, software infrastructure for NLP can make a significant contribution, and that integration overheads associated with collaborative research and code reuse can be reduced. GATE is being used at a number of sites in Europe; the principal current application is Information Extraction, and a set of modules to do MUC-6 style IE is bundled with the system, but the architecture is intended to be sufficiently general to support any NLP application. GATE is based on the TIPSTER architecture, and is currently being integrated with the Corelli document processing architecture at CRL. There are HTML versions of the GATE documentation at file:/home/hamish/gate/gate_docs.html with pointers to postscript.

Artwork: Discourse Processing in Machine Translation of Dialog
by Janyce Wiebe

This talk will provide an overview of the Artwork project, which targets temporal reference resolution and speech-act resolution in scheduling dialogs.

Onto-WordNet Mapper: Lexical Acquisition with WordNet and a KBMT Ontology
by Tom O'Hara

The Onto-WordNet Mapper project investigated using WordNet as a means to automate portions of the English lexicon development for Mikrokosmos. Mikrokosmos is a knowledge-based machine translation system developed at NMSU's Computing Research Laboratory. Since the project has concentrated on the development of the Spanish and Japanese lexicons, there isn't a suitable English lexicon to support generation of English texts from those in the other languages. The basic idea was to find WordNet synonym sets (synsets) corresponding to Mikrokosmos concepts, ranking alternatives through matching heuristics, which include both symbolic and statistical approaches. The result is a lexicon acquisition tool that produces plausible lexical mappings.

Corpus-Based Disambiguation of Transfer Equivalents for Cross-Language Text Retrieval
by Mark Davis

Large parallel text corpora are of great potential utility for cross-language text retrieval, although precisely how to make effective use of them has remained unclear. Experimental evidence is presented for an effective method of disambiguating translation terminology against parallel documents. Domain matching between training texts and retrieval documents appears to be less problematic than often thought, with substantial improvements in cross-language retrieval performance possible without unduly complex calculations or complex textual representations.

webmaster@crl.nmsu.edu
Page last edited on February 2003