Abstracts
Meaning Oriented Question Answering MOQA is CRL's ARDA-funded grant under the AQUAINT project, in collaboration with CoGenTex in Ithaca, NY, and ILIT (Institute for Language and Information Technologies) at UMBC (University of Maryland, Baltimore County). The goal of the project is to bring to bear CRL's ONTOSEM ontology and TMR (Text Meaning Representation) to enhance both the accuracy and user-friendliness of question/answering systems. The talk will cover the scope and structure of the project, with a focus on recent developments in ontology-based multilingual information retrieval. Machine
Learning of Verbal Meaning Latent Semantic Analysis (LSA) is a learning machine that computes relations between the meanings of all the words and passages in a representative collection of text. Measured by how well it simulates a wide variety of linguistic phenomena and human judgments and behaviors that depend on verbal meaning, it is surprisingly successful. The major surprise lies in the fact that LSA does as well as it does while ignoring word order within sentences and passages. I will attempt to explain why this should not have been such a surprise. Cross-Language
Text Retrieval at CRL In this talk I will review the state of the art in cross-language text retrieval. This is concerned with the design of automatic techniques to retrieve texts in languages other than the language of the query. I will focus on recent experiments conducted for TREC-8 and our plans for TREC-9. Pattern
Matching and Parsing with Charts We present a unified approach to parsing and pattern matching. Pattern matching traditionally refers to matching regular expressions in strings. Cascaded pattern matchers have been used for chunking and for partial parsing (Harris, 58). Parsing, on the other hand, refers to the analysis of a string using a context-free grammar or grammars at least as powerful as context-free grammars. The most efficient parsing algorithms are chart-parsing algorithms. We introduce a uniform notation for patterns and rules. We show by example how pattern matching on strings can be extended to pattern matching on charts without losing the inherent efficiency of finite-state matching. This leads to a unified system for pattern matching and parsing where matchers and parsers can be mixed in a unified cascaded parsing architecture. Norms and
Exploitations: Towards a Theory of Linguistic Performance In these two seminars, we shall look at some of the issues involved in linking meaning and use, i.e. mapping syntax onto semantics and vice versa, in the light of corpus evidence. Corpus linguistics is at the forefront of the resurgence of empiricism, prompting redefinition of some of our most basic assumptions about the nature of language, to account for observed data. The task confronting the linguistic analyst is seen as being to identify the norms of usage--or rather sets of overlapping norms--and the linguistic principles which govern the ways in which norms are exploited. Dictionaries list many meanings, but they do not normally tell the user how to distinguish one meaning of a word from another. What's worse still, large dictionaries such as OED make little attempt to distinguish between normal usage and exploitation, often recording variations in phraseology and exploitations as if they were new norms. With the advent of large corpora, we can begin to see that meaning distinctions in real text are associated with differences in patterns of phraseology. This work is only just starting. Some examples are discussed, drawing on Halliday's notion of "lexis as a linguistic level," Sinclair's work on lexical analysis and phraseology, Wilks's preference semantics, Fillmore's frame semantics, and Lakoff's work on metaphor and prototype theory. At the most general level, following Fillmore, we can make distinctions such as: [PERSON] climb [PHYSOBJ] = go up; [PERSON] climb [PP] = clamber. At a much more delicate level, the default interpretation of "climb [MOUNTAIN]" involves using all four limbs and then some, while the default interpretation of "climb steps" involves using only the legs, not the arms. Are semantic norms phraseologically determined?
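The two general patterns for "climb" given above can be sketched as a small lookup that picks a gloss from the shape of the verb's complement. This is only an illustrative sketch: the semantic-type lexicon, the preposition list, and the function name `interpret_climb` are assumptions made for the example, not part of the seminar material.

```python
# Hypothetical semantic-type lexicon; the entries are illustrative only.
SEMANTIC_TYPE = {
    "mountain": "PHYSOBJ",
    "ladder": "PHYSOBJ",
    "tree": "PHYSOBJ",
}

# Hypothetical list of prepositions that can introduce a [PP] complement.
PREPOSITIONS = {"into", "over", "through", "out", "down"}


def interpret_climb(complement_words):
    """Return a gloss for 'climb' given the words that follow the verb.

    Returns None when neither pattern applies, which is where a finer
    analysis (or an exploitation of a norm) would have to take over.
    """
    if not complement_words:
        return None
    first = complement_words[0].lower()
    # Pattern: [PERSON] climb [PP] = clamber
    if first in PREPOSITIONS:
        return "clamber"
    # Pattern: [PERSON] climb [PHYSOBJ] = go up
    head = complement_words[-1].lower()
    if SEMANTIC_TYPE.get(head) == "PHYSOBJ":
        return "go up"
    return None
```

For instance, `interpret_climb(["the", "mountain"])` selects the "go up" reading and `interpret_climb(["into", "the", "car"])` selects "clamber". The more delicate distinctions (all four limbs versus legs only) would need finer-grained types than this toy lexicon has.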
We shall also examine issues such as the distinction between literal and metaphorical meaning, including the nonliteral meaning of "literal" and the distinction between conventionalized and ad-hoc metaphors. The Georgian Language:
An Outline Grammatical Description The presentation will provide a brief description of Georgian, the principal member of the Kartvelian (South Caucasian) Language group. It will cover main features of the language, but particular emphasis is placed on identifying general patterns in the complex verb system. Statistical
Morphological Disambiguation for Agglutinative Languages We present statistical models for morphological disambiguation in Turkish. Turkish presents an interesting problem for statistical models since the potential tag set size, with tags making the relevant morphosyntactic and semantic distinctions, is very large because of the productive derivational morphology. We propose to handle this by breaking up the morphosyntactic tags into inflectional groups, each of which contains the inflectional features for each (intermediate) derived form. Our statistical models score the probability of each morphosyntactic tag by considering statistics over the individual inflectional groups in a trigram model. Among the three models that we have developed and tested, the simplest model, which ignores the local morphotactics within words, performs the best. Our best trigram model performs with 93.95% accuracy on our test data, getting all the morphosyntactic and semantic features correct. If we are just interested in syntactically relevant features and ignore a very small set of semantic features, then the accuracy rises to 95.07%. Automatic
Identification of Classified Documents How can one automatically identify classified documents? This is a vital question for the Department of Energy (DOE), which is reviewing millions of classified documents for possible declassification, and for Los Alamos National Laboratory (LANL), which is checking its unclassified computing storage systems for the presence of classified documents. After developing an expert rule system for automatic classification, the DOE provided me with a small set of documents with which to explore a statistical classifier as an alternative. I represented documents as vectors of character trigram frequencies (using a chi-square statistic to select the optimal trigrams), and trained a linear classifier using the Pocket algorithm. Results ranged from 60% accuracy (for identification of specific classified topics) to 87% accuracy (for identification of classified versus unclassified text). Training set size was a significant factor in the results. In contrast, my work for LANL started "from scratch" and needed to be moved rapidly into large-scale production. I therefore implemented an expert system tailored to LANL's needs, and have focused heavily on the practical issues that arose in canvassing large numbers of files in a variety of formats. The Architecture of
Meat Meat, the "Multilingual Environment for Advanced Translations", is an MT framework designed following four principles: - Integratedness. All hypotheses about various aspects of the translation process are stored in a chart that is accessible from all components. - Uniformity. The descriptions of all linguistic objects are typed feature structures. - Multilinguality. Foremost, this means that we use Unicode to represent language data. - Configurability. Meat is not just a translation system, but is highly configurable using a simple application definition language. We will present each of these principles and the application of Meat to a translation problem, using the Persian-English machine translation system Shiraz as an example. Using a Target
Language Model for Domain Independent Lexical Disambiguation The lexical disambiguation algorithm is based on a statistical language model. The input data to the algorithm is a set of possible translations. The algorithm selects the most probable translation. The statistical language model was trained on a corpus of American English newspaper texts. Its performance was tested using output from a transfer-based translation system between Turkish and English. The method is source language independent, and can be used for systems translating from any language into English. Elicitation Knowledge
About Syntax in the Boas System Boas is a large system designed for the elicitation of the linguistic knowledge necessary for the construction of a machine translation system. This presentation is of work in progress on the development of the subsystem of Boas for the elicitation of syntactic information about the source language. Following a brief presentation of the current status of this subsystem, there will be a discussion, soliciting ideas and suggestions for the further elaboration of the subsystem. Lingvistica '98 This talk will describe the programs available from Lingvistica '98 for automatic translation between the following language pairs: English-Russian, English-Ukrainian, Russian-Ukrainian. Systems for English-German and English-Norwegian are under construction. The translation methodology and the lexicons used will be described and actual translation results presented. Lingvistica '98 has also provided specialized dictionaries in Ukrainian and Russian to SYSTRAN, and is involved with a number of other translation-related products and technologies. Applications of Latent
Semantic Analysis: Cross-Language Retrieval and Automated Essay Scoring Latent Semantic Analysis is a statistical technique for deriving measures of semantic similarity between pieces of textual information based on an analysis of a large corpus. In this talk, I will discuss how LSA has been applied to cross-language retrieval and to automated essay scoring. For the essay scoring, LSA is able to grade essays as accurately as human graders as well as provide feedback to students. I'll also describe a web-based application used in an undergraduate course which provides automatic grading of essays with content-based feedback. A Paradigm for
Extracting Information from Medical Text This presentation details the architecture of an Information Extraction system based on Natural Language Understanding Techniques. Various knowledge representation models were used and effectively combined for processing medical reports in SymText, a symbolic text understanding system. This system was used to extract coded information from dictated chest x-ray reports (discussed in this presentation) and free text admitting diagnoses. The novel techniques employed by this system were (1) the use of Bayesian networks as a model for context and (2) the method for integrating meaning (semantics) with structure (syntax). Within the applied radiological domain, three contextual models were created and tuned through two developmental iterations. The system responded to the training with an overall increase in coding accuracy on an independent data set from 48.1% to 86.1% in recall and from 70.3% to 85.6% in precision. This evaluation showed that the applied techniques for understanding natural language (in a restricted domain) generalized well and encourages continued development and integration of the techniques with other natural language understanding paradigms. Answering Questions This talk will describe the Text Retrieval Evaluation Conference's "Question and Answer" task and the system developed at CRL in response to this problem. Although several components were incomplete at the time of the evaluation, the system appears to produce correct output in many cases. The CRL Q & A system uses the Mikrokosmos ontology, name and quantity recognition developed for Tipster Phase I, and phrase recognition. Questions are automatically structured both to retrieve documents using a Boolean retrieval engine and to match up with a specially developed question lexicon. The question lexicon is used to find the type of answer required and to structure the information in the question.
This structure is then used to perform extraction on each sentence in each document retrieved and the best 5 matches are provided as answers. The system was designed and developed by Svetlana Sheremetyeva, Jim Cowie, Eugene Ludovik, Hugo Molino-Salgado, and Sergei Nirenburg. Connections Between
Human and Machine Translation: In recent years Translation Studies has been trying to achieve the status of an interdisciplinary science and theory rather than an anecdotal art dependent on linguistics. This theory has several lines of research focusing on the process and the results. Computational tools are involved in research, training and professional activities. Beyond pure Linguistics, the translator is interested in expressing differences between cultures (Anthropology), in the cognitive aspects of processing knowledge (Psychology and Cognitive Science) and in the organization of information in the text. When dealing with a text the translator attempts to preserve the functional content of the communication act, even if this means sacrificing the semantic content. Although there are a lot of different schools and approaches, we won't directly discuss them. Instead we'll consider how a translator's decisions are affected by those differing sociolinguistic points-of-view: an international language like English, a major language such as Spanish and a minority (or minoritized) language such as Catalan. Those differences affect the translation market and also the linguistic policy and attitude towards the defense of the conceptual system and its subsystems. 90% of the translation activity in non-English-speaking countries concerns technical and scientific texts. This is one of the reasons why terminology has become a key tool in the production as well as translation of domain-oriented texts. The acquisition of expert knowledge is also crucial to give transparency to that terminology. BACO (BAse de COneixement - Knowledge Base) is a large terminology database (7,000 concepts and more than 30,000 forms) built by the highest level of the Translation and Interpretation Department of the Autonomous University of Barcelona.
BACO is an example of the kind of knowledge representation and storage which professional translators (as well as Machine Translation systems) involved in technical translation need. Nor is Machine-Aided Translation possible without a terminology tool like this (BACO was built with MultiTerm, one of the tools involved in Translator's Work Bench, a translation memory software by TRADOS). This database was built using onomasiological criteria and takes into account the display of the sub-concepts (features) and related concepts involved in a concept definition while specifying them in an atomized way. The database building process can be improved by using information extraction tools on domain-oriented corpora. Tools to find, identify and store the information should make the work of the terminologist and the translator much easier. BACO is based on the philosophy that any of the languages involved can be used to make decisions regarding the conceptual split in a specific field. Once the concepts are represented, selecting one word among variants is a social and political decision. Does terminological homogeneity work against the survival of minoritized languages? The answer to this question can bring us to another: should Machine Translation try to take account of discrepancies rather than the semantic common space? Ambiguities in
Automatic Translation to Braille English Braille and technical braille codes appear to be easy targets for translation from source documents. That is, they should only involve regular expression replacement or simple parsed filtration. However, the authors of the braille codes have introduced several context dependencies. Some of these are syntactic context dependencies which can be resolved by keeping the context for reference as the translation proceeds. Others, though, are dependent on phonetic or even semantic context and thus require more sophisticated linguistic processing. This talk will encompass a general discussion of the rules for braille and possible techniques for efficient translation. Methods for
Probabilistic Classification in Natural Language The first part of this talk will give an overview of a framework for developing probabilistic classifiers in Natural Language Processing (NLP) (Bruce & Wiebe 1999). A probabilistic classifier assigns the most probable class to an object, based on a probability model of the interdependencies among the class and a set of input features. This framework focuses on formulating a model that captures the most important interdependencies, to avoid over-fitting the data while also characterizing the data well. The class of probability models and the associated inference techniques were developed in mathematical statistics, and are widely used in artificial intelligence and applied statistics. However, these techniques have not been widely used in NLP. The class of models (decomposable models) is large and expressive, yet there are computationally feasible model search procedures defined for them. The formality of the method supports evaluation: the talk will briefly describe how the three determinants of classifier performance (the features, the form of the model, and the parameter estimates) can be separately evaluated. The second part of the talk will describe an empirical investigation of a natural language disambiguation task. In many text processing applications, such as information extraction, summarization, text categorization, and information retrieval, it is important to distinguish "objective sentences," which are used to present factual information, from "subjective sentences," which are used to present beliefs and evaluations (Wiebe 1994; Wiebe, Bruce, & O'Hara, 1999). Whether a sentence is subjective or objective depends not only on semantics, but also on the context in which the sentence appears. Using the model search procedure described above, we developed a probabilistic classifier for identifying subjective sentences.
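To make "assigns the most probable class to an object" concrete, here is a minimal sketch using naive Bayes, which is one of the simplest decomposable models (every feature depends only on the class). It is a stand-in for illustration and does not reproduce the model search procedure of the talk. The two binary features and the toy training data are hypothetical; the abstract does not list the shallow features actually used.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect counts from a list of (feature_tuple, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (position, label) -> value counts
    values = defaultdict(set)              # position -> observed values
    for features, label in examples:
        class_counts[label] += 1
        for i, v in enumerate(features):
            feature_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def classify(model, features):
    """Return the label that maximizes log P(label) + sum_i log P(f_i | label)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = math.log(n / total)
        for i, v in enumerate(features):
            count = feature_counts[(i, label)][v]
            # Add-one smoothing, with one extra slot reserved for unseen values.
            score += math.log((count + 1) / (n + len(values[i]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (has_first_person_pronoun, has_evaluative_adjective)
TOY_DATA = [
    ((1, 1), "subjective"), ((1, 0), "subjective"), ((1, 1), "subjective"),
    ((0, 0), "objective"), ((0, 0), "objective"), ((0, 1), "objective"),
]
MODEL = train(TOY_DATA)
```

Calling `classify(MODEL, (1, 1))` picks "subjective" and `classify(MODEL, (0, 0))` picks "objective" on this toy data. A richer decomposable model would add selected dependencies among the features themselves instead of assuming them independent given the class.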
Using only shallow features, the classifier achieves an average accuracy 21 percentage points higher than the baseline, in 10-fold cross validation experiments. In order to develop the classifier, a gold-standard data set was needed for training and testing. In a two-phase study, a data set was annotated by multiple annotators, resulting in high intercoder agreement for sentences that could be tagged with certainty. The classifier also performs better on sentences the judges tagged with certainty, showing consistency between the classifier and the human judges. References Bruce, Rebecca & Wiebe, Janyce (1999). Decomposable modeling in natural language processing. To appear in Computational Linguistics 25(2). Wiebe, Janyce (1994). Tracking point of view in narrative. Computational Linguistics 20 (2): 233-287. Wiebe, Janyce, Bruce, Rebecca, & O'Hara, Thomas (1999). Development and use of a gold standard data set for subjectivity classifications. To appear in Proc. 37th Annual Meeting of the Assoc. for Computational Linguistics (ACL-99). Interactive Cross
Language Text Retrieval: In this talk I will review two relevant lines of research bearing on this issue, including work at CRL, and will show how our results are being used in the design of a new Web interface for cross-language text search. One line of research, "Interactive IR", is concerned with the user interface issues for information retrieval systems such as how best to display the results of a text search. I will review current research, including our own on "document thumbnail" visualizations, and discuss current Web conventions, practices and folklore. The other area of research, "Cross-Language Text Retrieval", is concerned with the design of automatic techniques, including Machine Translation, to retrieve texts in languages other than the language of the query. I will review work done at CRL in the URSA project concerning query translation and in the MINDS project concerning multilingual text summarization. I will discuss how our new demonstration project, Keizai, uses and extends the results of these lines of research into an end-to-end web-based cross-language text retrieval system. Beginning with an English query, the system will search Japanese and Korean web data and display English summaries of the top ranking documents. A user should be able to accurately judge which foreign language documents are relevant to their query and either glean necessary information from the translation or schedule specific documents for human translation and subsequent analysis. This work will be set in the context of the evaluations planned for TREC-8, the upcoming Text Retrieval Evaluation Conference. Input Methods and
Script-specific Writing Rules Unicode as a universal character set solves encoding problems of multilingual texts. It provides abstract character codes but does not offer methods for rendering text on screen or paper. Abstract characters can have different visual representations (called shapes or glyphs) on screen or paper, depending on context. Different scripts which are part of Unicode require writing rules for rendering glyphs and also composite characters, ligatures, and other script-specific features. In this talk we present a general approach to encoding script-specific writing rules based on the Unicode character set and using finite-state transducers defined in the Salsa architecture as an application to specify input methods. This approach is modular, Unicode-compatible and accessible to the users. Multilingual
Inheritance-Based Lexical Representation Most work on multilingual lexicons so far has assumed monolingual lexicons linked only at the level of semantics. Cahill and Gazdar (1999) argue that this approach might be appropriate for unrelated languages, but that it makes it impossible to capture useful generalizations about related languages. Closely related languages exhibit many similarities at all levels of linguistic description - morphology, phonology, orthography, syntax, etc. - not just semantics. Compare, for example, the forms of the verb "sing" in Dutch, English, and German: sing - sang - sung
(English) Such similarities, if captured, can help to produce more robust natural language processing systems for such languages. Cahill and Gazdar describe an architecture which aims to encode and exploit lexical similarities between closely related languages. They applied this architecture in the PolyLex project to define a trilingual hierarchical lexicon for Dutch, English, and German sharing morphological and phonological information between these languages. The aim of my Ph.D. research is to look at the methodological and theoretical issues raised by the construction of such multilingual inheritance-based lexicons and to develop a framework for Multilingual Lexical Representation. In this talk, I will first look at the language sampling problem. For a framework to be generally valid - covering all variants encountered in natural language - it is necessary to use a language sample that explores as much as possible the full range of forms and constructions that can occur. I will show how a representative subset of languages can be selected. Secondly, I will focus on how to structure such a multilingual inheritance-based lexicon. I will describe two models, the structure-sharing model and the meta-features model. I will discuss the advantages and disadvantages of both models with reference to two sample lexical fragments of Danish, Dutch, and English. I will conclude with some suggestions for further research. Keizai and the
Expedition Configuration and Control System Keizai is a prototype system integrating cross-language retrieval, summarization and machine translation. The languages being processed are Japanese, Korean, and English. The main goal is to find presentation methods that allow Keizai to support useful information analysis. The Expedition Configuration and Control System supports the development of complex sets of software and data. At the moment it is being designed to support a programmer in the construction of the data and programs needed to do machine translation from a new language to English. The CCS contains features which control order of development, tutorial material, and automatic execution of support tasks. It is accessed through a web browser and its control mechanisms include an extended form of HTML which permits on-the-fly substitution of variables and execution of processes on the server. Prolegomena to the
Philosophy of Linguistics Building large and comprehensive computational linguistic applications involves making many theoretical and methodological choices. These choices are made by all language processing system developers. In many cases, the developers are, unfortunately, not aware of having made them. This is because the fields of computational linguistics and natural language processing do not tend to dwell on their foundations, or on creating resources and tools that would help researchers and developers to view the space of theoretical and methodological choices available to them and to figure out the corollaries of their theoretical and methodological decisions. We report on one step towards generating and analyzing such choice spaces. Issues of this kind typically belong to the philosophy of a branch of science, hence the title. Research Activities
in the KLE Laboratory in Pohang, Korea The Knowledge and Language Engineering Laboratory (KLE), at Pohang University of Science and Technology, Korea, has been involved in various aspects of natural language processing (computational linguistics) since 1991, with special emphasis on computer processing of the Korean language. Currently KLE's research efforts are concentrated in three major areas. The first one is the core technology of Korean language processing, which is fundamental to the development of computer systems that recognize, understand, and generate the Korean language. The second one is machine translation (MT) between Korean and other foreign languages such as Japanese, Chinese, and English. Finally, information retrieval (IR) is also a main concern, focusing on Web-based applications. In this talk, I'll introduce our activities while displaying some demo systems through the Web. And then I'll make a technical presentation about CLTR-J/K (Cross-Language Text Retrieval for Japanese through Korean). UNICODE the Third Unicode is expanding once again, this time to include scripts that were left out in earlier versions and more Han characters. The first half of the talk will be a short survey of the scripts and other characters being added for Unicode 3.0, and the second half will be a short and simple introduction to Unicode for those who feel they need it. Word Frequency
Count and Point of View/Introduction to Chu Spaces This is a two-part presentation. The first part will report on applying simple counting techniques to ascertain the point of view of the author of an article. The second part will present a short introduction to Chu spaces, which are very general mathematical objects which can model topological, algebraic, and logical structures. A Chu space consists of three sets and a function from the cross product of the first two into the third. In most cases, the third set is {0,1}. Research in Computer
Science at the Universidad Complutense de Madrid Antonio Vaquero is Full Professor of Computer Science and former Head of the Department of Computer Science at the Universidad Complutense de Madrid. As a part of his sabbatical year, Prof. Vaquero is currently a Visiting Research Scholar at the Computing Research Laboratory. To date, he has carried out research in several different areas of Computer Science and educational applications. His current interests include computer-based learning and instruction, information retrieval, natural language processing and Spanish as an object language for computing. He will present a summary of his research activities with an emphasis on those which are most directly related to NLP.
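Returning to the Chu spaces talk listed above: its definition (three sets and a function from the cross product of the first two into the third) translates almost directly into a data structure. The sketch below is only an illustration of that definition. The field names `points`, `states`, and `values`, and the toy example, are labels chosen here and not taken from the talk.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, FrozenSet, Hashable


@dataclass(frozen=True)
class ChuSpace:
    points: FrozenSet[Hashable]   # first set
    states: FrozenSet[Hashable]   # second set
    values: FrozenSet[Hashable]   # third set, {0, 1} in most cases
    r: Callable[[Hashable, Hashable], Hashable]  # points x states -> values

    def is_well_defined(self) -> bool:
        """Check that r maps every (point, state) pair into the third set."""
        return all(
            self.r(a, x) in self.values
            for a, x in product(self.points, self.states)
        )

    def matrix(self):
        """Tabulate r as rows (sorted points) by columns (sorted states)."""
        rows = sorted(self.points)
        cols = sorted(self.states)
        return [[self.r(a, x) for x in cols] for a in rows]


# Toy example over {0, 1}: two points, two states, r given by a lookup table.
TABLE = {("a", "x"): 0, ("a", "y"): 1, ("b", "x"): 1, ("b", "y"): 1}
SPACE = ChuSpace(
    points=frozenset("ab"),
    states=frozenset("xy"),
    values=frozenset({0, 1}),
    r=lambda a, x: TABLE[(a, x)],
)
```

Here `SPACE.matrix()` gives the rows `[0, 1]` and `[1, 1]`. Viewing the function as a matrix like this is a convenient way to inspect small examples.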
Bootstrapping a
Morphological Analyzer This talk addresses the problem of
bootstrapping a morphological analyzer for a language, given a list of roots (possibly
annotated with additional tags, such as POS), and some text. In contrast with recent
approaches that use minimum description length and similar methods, our approach uses
approximate matching between roots and words to identify potential roots even when the
roots have been deformed by morphographemic processes. Once potential roots have been
identified, a list of affixes is generated, words are segmented and most suitable
segmentations are selected. Human intervention at this step is possible to fix any minor
discrepancies. The segmented words along with their original forms are then used by a
transformation-based learner which induces morphographemic rules to segment and analyze
words; these rules can then be used to build a finite state morphological analyzer for the language in
question. Very preliminary investigations on Slovenian, Bulgarian and English have
provided some quite interesting results. Dependency Parsing
With an Extended Finite State Approach The talk presents an approach to dependency parsing using an extended finite state model. The finite state approach augments the input representation with "channels" so that links representing syntactic dependency relations among words (or rather "inflectional groups", a term more appropriate in the context of Turkish, to which we apply our approach) can be encoded. Intermediate configurations violating various constraints of dependency representations such as planarity (projectivity), no unlinked items except sentential head, etc, are filtered via finite state filters. The extended nature of the approach is due to the fact that the parser has to iterate on the input (typically 3-4 times) to arrive at a fixed point, much as in the approaches of Roche (1996) and Abney (1996). The parser takes in morphologically analyzed and disambiguated text, and produces an output that encodes the syntactic relations. It is possible to refine the parser so that labeled links and limited nonplanar constructs can be handled, and morphological disambiguation can be done during parsing. The Un-MT: Cool and
Refreshing Syntax without grammars! Exemplified with the Turkish-English MT system recently developed at CRL. Includes phrase and clause chunking and transfer using simplified patterns and a few heuristics.
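Chunking with simplified patterns, as mentioned in the abstract above, can be sketched as a regular expression run over a sequence of part-of-speech tags. This is a minimal sketch under stated assumptions: the tag set (DET, ADJ, NOUN, VERB), the noun-phrase pattern, and the English toy sentence are all hypothetical and are not the patterns used in the CRL Turkish-English system.

```python
import re

# Hypothetical noun-phrase pattern over space-terminated tags.
# The lookbehind ensures a match can only start at a tag boundary.
NP_PATTERN = re.compile(r"(?<!\S)(?:DET )?(?:ADJ )*(?:NOUN )+")


def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into noun-phrase chunks without a grammar.

    Returns a list of chunks, each a list of the words it covers.
    """
    # Encode the tag sequence as one string, one trailing space per tag.
    tag_string = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # The number of spaces before the match equals the token index.
        start = tag_string[:match.start()].count(" ")
        end = start + match.group().count(" ")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


EXAMPLE = [
    ("the", "DET"), ("old", "ADJ"), ("man", "NOUN"),
    ("reads", "VERB"), ("a", "DET"), ("book", "NOUN"),
]
```

On `EXAMPLE`, `chunk_noun_phrases` yields the two chunks "the old man" and "a book". Clause chunking and transfer could be layered on top by running further patterns over the chunk labels, in the cascaded style the abstract hints at.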
This presentation is a summary of work in pragmatics-based machine translation which has been carried out over the last few years. Underlying the approach is the assumption that translation is inextricably linked to the translator's beliefs about the topics under discussion, about the author and addressees of the source language interaction and about the addressees of the translation. We begin with an introduction to a pragmatics-based approach. Next, we present some background terminology and potential apparatus for implementing such an approach. We then review one case study in which differing beliefs about the source language interaction result in different translations and one case study in which differing beliefs about the target language interaction result in different translations. We conclude with a few observations about the implications of this work for translation, machine translation and MT evaluation. Document thumbnail
visualizations for rapid relevance judgments: When do they pay off? This talk will be presented at the upcoming
TREC conference and is the joint work of William Ogden, Mark Davis, and Sean Rice of New
Mexico State University.
Human natural language understanding works
incrementally: we start understanding words and utterances before the speaker has
completed them. The incorporation of incremental techniques into systems for speech
processing has become increasingly desirable in recent years, since certain types of
applications (e.g. sophisticated dialog systems and simultaneous interpreting) are only
feasible this way. The main obstacle for using incrementality is the absence of global
optimization criteria, which usually yields a drastic decrease in performance. Aspects of GETA's MT
methodology applied to high-quality personal Ariane-G5 is an integrated environment
initially designed to facilitate the development of multilingual MT systems for revisors
(MT-R), where output quality is obtained by using the heuristic programming facilities of
its 5 rule-based languages to specialize the lingware components to the sublanguage at
hand. It can support many MT architectures and linguistic methodologies and accepts whole
paragraphs or pages as units of translation rather than separate sentences. For MT-R,
B. Vauquois' multilevel transfer approach has given excellent results on a large number of
MT-R mockups and prototypes, as well as two large-scale operational systems. As both the
computer tools and the linguistic methodology are not embodiments of a particular theory,
they are quite easy to adapt to new problems. In the last few years, they have actually
been revised and further developed in the framework of new research on high quality MT for
monolingual authors (MT-A), relying on a disambiguation dialogue with the author (DBMT),
following an all-paths analysis. Annotated bi-text as
productive language resource: DTD-driven bilingual document generation Among different annotation schemes, TEI-conformant
SGML markup appeared to be a good option to explore. Therefore, the
segmentation and alignment of a large Basque-Spanish bitext have been carried out
using widely adopted annotation conventions based on the TEI P3 guidelines. Proper names, terms,
and other multi-word collocations: bitext segmentation and alignment Different strategies exist for approaching
the exploitation of a big language resource, such as a large bi-text. In the tradition of
humanities computing, the prevailing approach has been to annotate the corpus so that
implicit linguistic information becomes explicit. The information that can be made
explicit runs from general pragmatic and discourse features to concrete morphological and
phonetic features, as well as lexical, semantic and syntactic features. Some are easier to
grasp than others, but in any case, a large literature has been produced at all levels of
analysis. Processing
inflectional head-initial vs. agglutinative head-final languages in a large Spanish-Basque
bi-text Spanish and Basque are two languages that
have coexisted since Spanish became a language, differentiating itself from its close
Romance relatives (Portuguese, Catalan, French and Italian). All these languages are quite
similar with respect to their main linguistic features. In addition to a largely shared
lexicon, their grammars are
This talk gives an overview of the Persian language covering its history, writing system, morphology and syntax. I will emphasize the more unusual aspects of the language, and I will discuss difficulties that arise in a computational analysis of Persian. On Parsimony and
Induction The notions of "simplicity," "parsimony," "elegance" and "economy" have been used since Aristotle to attempt to describe the desirability of one theory versus another in scientific discovery. In the last 40 years, the idea that compact theories lead to better theories in inductive inference and prediction has been bolstered by simultaneous developments in information, coding and computation theory. In this presentation, I will briefly describe the history of thinking on inductive inference ("the scandal of philosophy"), and will develop the Minimum Description Length heuristic from Bayesian principles and from the Chaitin-Kolmogorov complexity of algorithms. I will then show how this heuristic can be used to formulate algorithms for the unsupervised learning of ambiguous term relationships and phrase structure from text, and how a cognitive model can be constructed to explain human performance in learning artificial grammars. Unicode and How it
Got That Way Most people working in environments that require multilingual text quickly encounter the problem of representing a mixture of text in two or more languages in one document. Various temporary solutions were found to support this requirement, but a price had to be paid at the application level. Eventually it was recognized that the cost of using these temporary solutions was too high and a new character set was designed, Unicode. This presentation provides an introduction to Unicode and some important concepts about Unicode. Morphological
Disambiguation of Turkish Using Voting Constraints
Implementing a
Turkish Morphological Analyzer with Xerox Finite State Tools This talk describes the implementation of a morphological analyzer for Turkish using the two-level morphology approach augmented with additional levels of representation and finite state operations. Both morphographemic and morphotactic aspects of Turkish will be discussed, with emphasis on processing real-world text. The Corelli Document
Manager We describe a new flexible annotation scheme, the Corelli Document Manager, which extends the Tipster Document Architecture in several innovative directions. Annotation types are defined using typed feature structure definitions; the annotations themselves are instances of these types. Annotations are stored in an object-oriented database system; the corpora files can be stored on a file system (local or remote) or in the database itself; the Document Manager maintains the relations between annotations and the documents. Some Current
Research in Text Retrieval at Computing Research Lab We will provide an overview of the state of the art in modern text retrieval systems, from monolingual to cross-language and interactive approaches, as well as an introduction to evaluation methods and metrics in the text retrieval community. After the introduction, we will describe the development of QUILT, a cross-language retrieval system developed at CRL for English access to a Spanish text database, and recent efforts in designing experiments to test the effectiveness of retrieval systems for end users. MINDS: A
Multilingual Information Retrieval and Text Summarization System CRL's work on Multilingual Interactive
Document Summarization for TIPSTER Phase III combines research in automatic and
interactive summarization with an integration of new and current methods in a multi-engine
prototype summarization tool. We are building on the CRL Core Summarization Engine to
provide robust multi-lingual summarization capabilities designed to aid fast and
interactive document filtering, even in the Breaking Down
Barriers: The Mikrokosmos Generator We argue that modularization of text generation into separate tasks, as currently practiced, sets up unneeded barriers to the generation task. We propose a new modularity based on natural linguistic phenomena and overview how it is implemented in the Mikrokosmos text generator. User-Friendly
Machine Translation: Alternate Translations Based on Differing Beliefs A notion of "user-friendly" translation is presented and a method for achieving it within a pragmatics-based approach to machine translation is described. The approach relies on modeling the beliefs of the participants in the translation process: the source language speaker and addressee, the translator and the target language addressee. Translation choices may vary according to how beliefs are ascribed to the various participants and, in particular, "user-friendly" choices are based on the beliefs ascribed to the TL addressee. An Empirical Approach
to Temporal Reference Resolution This talk presents the results of an empirical investigation of temporal reference resolution in scheduling dialogs. The algorithm adopted is primarily a linear-recency based approach that does not include a model of global focus. A fully automatic system has been developed and evaluated on unseen test data with good results. This paper presents the results of an intercoder reliability study, a model of temporal reference resolution that supports linear recency and has very good coverage, the results of the system evaluated on unseen test data, and a detailed analysis of the dialogs assessing the viability of the approach. Evaluating Natural
Language Interfaces: Review and Implications for Information Retrieval Natural language interfaces were introduced in the 1970s and 1980s primarily as user interfaces for relational databases. A comprehensive review of the empirical studies of the usefulness of these interfaces yields discouraging results. I will summarize these studies within a framework for modeling the cognitive processing involved in database querying, which will be used to highlight why natural language was the wrong interface for relational databases. However, the same analysis suggests that a natural language interface may be a better choice for text retrieval. Probabilistic Event
Categorization This talk describes the automation of a new
text categorization task. The categories assigned in this task are more syntactically,
semantically, and contextually complex than those typically assigned by fully automatic
systems that process unseen test data. Our system for assigning these categories uses a
probabilistic classifier, developed with a recent method for formulating a probabilistic
model from a predefined set of potential features (Bruce 1995, Bruce and Wiebe 1994,
Pedersen et al. 1996). This paper focuses on feature selection. It presents various types
of properties experimented with in this work. We identify and evaluate various approaches
to organizing the collocational properties into features. Minimum Description
Length for Inferring Statistical Language Models Statistical language models have presented
some compelling and useful, if often linguistically naive, results. This talk will present
ongoing work on using an approach to infer language models that combines considerations of
the performance of a model on a data set with considerations of the cost of specifying the
parameters of the model. Minimum Description Length approaches provide a framework for
accounting Natural Language
Processing for the World Wide Web: Two kinds of summarization programs have
been built before: those that throw away everything but the summary and those that
highlight pieces of the summary throughout the document. The former kind is good if there
is a correct algorithm for summarization (e.g., if we already know what the document
contains). The latter provides no facility to navigate to different parts of the document
and is good only for short documents. Neither kind summarizes the parts that weren't
included in the summary. Software
Infrastructure for Language Engineering This talk reviews and classifies the currently available design strategies for software infrastructure for NLP and presents an implementation of a system called GATE, a General Architecture for Text Engineering. By *infrastructure* is meant what has been variously referred to in the literature as: software architecture; software support tools; language engineering platforms; development environments. The argument is that when NLP is applied to constructing large-scale systems with predictable performance levels and robust behaviour, software infrastructure for NLP can make a significant contribution, and that integration overheads associated with collaborative research and code reuse can be reduced. GATE is being used at a number of sites in Europe; the principal current application is Information Extraction, and a set of modules to do MUC-6 style IE is bundled with the system, but the architecture is intended to be sufficiently general to support any NLP application. GATE is based on the TIPSTER architecture, and is currently being integrated with the Corelli document processing architecture at CRL. There are HTML versions of the GATE documentation at file:/home/hamish/gate/gate_docs.html with pointers to PostScript. Artwork: Discourse
Processing in Machine Translation of Dialog This talk will provide an overview of the Artwork project, which targets temporal reference resolution and speech-act resolution in scheduling dialogs. Onto-WordNet Mapper:
Lexical Acquisition with WordNet and a KBMT Ontology The Onto-WordNet Mapper project investigated using WordNet as a means to automate portions of the English lexicon development for Mikrokosmos. Mikrokosmos is a knowledge-based machine translation system developed at NMSU's Computing Research Laboratory. Since the project has concentrated on the development of the Spanish and Japanese lexicons, there isn't a suitable English lexicon to support generation of English texts from those in the other languages. The basic idea was to find WordNet synonym sets (synsets) corresponding to Mikrokosmos concepts, ranking alternatives through matching heuristics that include both symbolic and statistical approaches. The result is a lexicon acquisition tool that produces plausible lexical mappings. Corpus-Based
Disambiguation of Transfer Equivalents for Cross-Language Text Retrieval Large parallel text corpora are of great
potential utility for cross-language text retrieval, although precisely how to make
effective use of them has remained unclear. Experimental evidence is presented for an
effective method of disambiguating translation terminology against parallel documents.
Domain matching between training texts and retrieval documents appears to be less
problematic than often thought, with substantial improvements in cross-language retrieval
performance possible without unduly complex calculations or complex textual
representations.