COLING-2000 Workshop:
Using Toolsets and Architectures To Build NLP Systems
Centre Universitaire, Luxembourg, Saturday 5 August 2000
Call for Participation
Background
The purpose of the workshop is to present the state of the art in
NLP toolsets and workbenches that can be used to develop multilingual
and/or multi-application NLP components and systems. Many toolsets
have been developed to support the implementation of single NLP
components (taggers, parsers, generators, dictionaries) or complete
Natural Language Processing applications (Information Extraction
systems, Machine Translation systems). These tools aim to facilitate
the building of NLP systems and to lower its cost. Since the
tools themselves are often complex pieces of software, they require a
significant amount of effort to be developed and maintained in the
first place. Is this effort worth the trouble? Note that
NLP toolsets have often been developed originally to implement a
single component or application. In that case, why not build
the NLP system using a general programming language such as Lisp or
Prolog? There are at least two answers. First, for reasons of pure
efficiency (speed and space), it is often preferable to build a
parameterized algorithm operating on a uniform data structure (e.g., a
phrase-structure parser). Second, it is harder, and often impossible,
to develop, debug and maintain a large NLP system directly written in
a general programming language.
It has been the experience of many users that a given toolset is
quite often unusable outside its environment: the toolset can be too
restricted in its purpose (e.g. an MT toolset that cannot be used for
building a grammar checker), too complex to use, or even too difficult
to install. Particularly in the US under the Tipster program, there
have instead been efforts to promote common architectures for a given
set of applications (primarily IR and IE in Tipster; see also the
Galaxy architecture of the DARPA Communicator
project). Several software environments have been built around
this flexible concept, which is closer to current trends in
mainstream software engineering.
The workshop aims at providing a picture of the current problems
faced by developers and users of toolsets, and future directions for
the development and use of NLP toolsets. It includes reports of actual
experiences in the use of toolsets as well as presentations of
toolsets.
Audience
Researchers and practitioners in Language Engineering, users and
developers of tools and toolsets. Please note that workshop
participants are required to register at http://www.coling.org/reg.html.
Program
This one-day workshop includes ten presentation periods, each
consisting of a 20-minute presentation followed by 10 minutes
reserved for exchanges. We encourage the authors to focus on the
salient points of their presentation and to identify possibly
controversial positions. There will be ample time set aside for
informal and panel discussions and audience participation.
| 9:30 - 9:45 | Opening | |
| 9:45 - 10:15 | Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Yorick Wilks | Experience Using GATE for NLP R&D |
| 10:15 - 10:45 | Fredrik Olsson, Björn Gambäck | Composing a General-Purpose Toolbox for Swedish |
| 10:45 - 11:15 | Kalina Bontcheva, Hennie Brugman, Hamish Cunningham, Albert Russel, Peter Wittenburg | An Experiment in Unifying Audio-Visual and Textual Infrastructures for Language Processing Research and Development |
| 11:15 - 11:30 | Coffee break | |
| 11:30 - 12:00 | Jan Amtrup, Rémi Zajac | A Modular Toolkit for Machine Translation Based on Layered Charts |
| 12:00 - 12:30 | Jan Daciuk | Finite State Tools for Natural Language Processing |
| 12:30 - 14:00 | Lunch | |
| 14:00 - 14:30 | Nancy Ide | The XML Framework and Its Implications for the Development of Natural Language Processing Tools |
| 14:30 - 15:00 | Jill Burstein, Daniel Marcu | Benefits of Modularity in an Automated Essay Scoring System |
| 15:00 - 15:30 | Vincent Pautret | A Rational Agent for the Modeling of a Semantic Model |
| 15:30 - 15:45 | Coffee break | |
| 15:45 - 16:15 | Matthias Denecke | An Integrated Development Environment for Spoken Dialogue Systems |
| 16:15 - 16:45 | Anke Kölzer | Diamod - a Tool for Modeling Dialogue Applications |
Abstracts
- Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin
Tablan, Yorick Wilks.
"Experience Using GATE for NLP R&D".
GATE, a General Architecture for Text Engineering, aims to provide
a software infrastructure for researchers and developers working
in NLP. GATE has now been widely available for four years. In this
paper we review the objectives which motivated the creation of
GATE and the functionality and design of the current system. We
discuss the strengths and weaknesses of the current system and
identify areas for improvement.
- Fredrik Olsson, Björn Gambäck.
"Composing a General-Purpose Toolbox for Swedish".
The paper discusses the lessons we have learned
from the work on building a reusable toolset for Swedish within the
framework of GATE, the General Architecture for Text Engineering
from the University of Sheffield. We describe our toolbox SVENSK
and the reasons behind the choices made in the design, as well as
the overall conclusions for language processing toolbox design
which can be drawn.
- Kalina Bontcheva, Hennie Brugman, Hamish Cunningham, Albert
Russel, Peter Wittenburg.
"An Experiment in Unifying Audio-Visual and Textual
Infrastructures for Language Processing Research and Development".
This paper describes an experimental integration of two
infrastructures (Eudico and GATE) which were developed
independently of each other, for different media (video/speech
vs. text) and applications. The integration resulted in
an in-depth understanding of the functionality and operation of
each of the two systems in isolation, and of the benefits of their
combined use. It also highlighted some issues (e.g., distributed
access) which need to be addressed in future work. The experiment
also showed clearly the advantages of modularity and generality
adopted in both systems.
- Jan W. Amtrup, Rémi Zajac.
"A Modular Toolkit for Machine Translation Based on Layered Charts".
We present a freely available toolkit for building machine
translation systems for a large variety of languages. The toolkit
provides a standard linguistic data representation based on charts and
typed feature structures; a modular open architecture based on
standardized interfaces and processing architecture, enabling the
addition of external language processing components and the
configuration of new applications (plug-and-play); and an open library
of basic parameterizable language processing components including a
morphological finite-state processor, dictionary components, an
island chart parser, a chart generator, and a chart-based transfer
engine (for MT systems). It is open source (the C++ source code is
available) and portable (the target systems are Unix and Windows).
- Jan Daciuk.
"Finite State Tools for Natural Language Processing".
We describe a set of tools using deterministic, acyclic,
finite-state automata for natural language processing
applications. The core of the tool set consists of two programs
constructing finite-state automata (using two different, but
related algorithms). Other programs from the set interpret the
content of those automata. Preprocessing scripts and user
interfaces complete the set. The tools are available for research
purposes in source form on the Internet.
- Nancy Ide.
"The XML Framework and Its Implications for the Development of
Natural Language Processing Tools".
The eXtensible Markup Language (XML) (Bray et al. 98) is the
emerging standard for data representation and exchange on the
World Wide Web. The XML Framework includes very powerful
mechanisms for accessing and manipulating XML documents that are
likely to significantly impact the development of tools for
processing natural language and annotated corpora.
- Jill Burstein, Daniel Marcu.
"Benefits of Modularity in an Automated Essay Scoring System".
E-rater is an operational automated essay scoring
application. The system combines several NLP tools that identify
linguistic features in essays for the purpose of evaluating the
quality of essay text. The application currently identifies a
variety of syntactic, discourse, and topical analysis features. We
have maintained two clear visions of e-rater's
development. First, new linguistically-based features would be
added to strengthen connections between human scoring guide
criteria and e-rater scores. Secondly, e-rater would be
adapted to automatically provide explanatory feedback about writing
quality. This paper provides two examples of the flexibility of
e-rater's modular architecture for continued application
development towards these goals. Specifically, we discuss a) how
additional features from rhetorical parse trees were integrated
into e-rater, and b) how the salience of automatically
generated discourse-based essay summaries was evaluated for use as
instructional feedback through the re-use of e-rater's
topical analysis module.
- Vincent Pautret.
"A Rational Agent for the Modeling of a Semantic Model".
This paper presents a methodology that aims at building knowledge
models from a natural language description of a domain. Our
methodology is based on the establishment of a dialogue with the
knowledge engineer of an application. This dialogue is motivated by
the Semantic Differentiation Process, which solves problems
related to acquisition and modeling. Moreover, the dialogue can be
naturally formalized within a theory of communicating rational
agents. We can thus consider a more complete automation of the
modeling process and show how to integrate our methodology into
this type of theory.
- Matthias Denecke.
"An Integrated Development Environment for Spoken Dialogue Systems".
Development environments for spoken dialogue processing systems
are of particular interest because the turn-around time for
a dialogue system is high while at the same time a considerable
number of components can be reused with little or no
modification. We describe an Integrated Development Environment
(IDE) for spoken dialogue systems. The IDE allows application
designers to interactively specify reusable building blocks called
dialog packages for dialog systems. Each dialog package
consists of an assembly of data sources, including an object-oriented
domain model, a task model and grammars. We show how the dialog packages
can be specified through a graphical user interface with the help
of a wizard.
- Anke Kölzer.
"Diamod - a Tool for Modeling Dialogue Applications".
Speech dialog systems are currently becoming state-of-the-art for
different kinds of applications, but they are still weak in the
support of spontaneous speech and correct interpretation of what
was said. One reason for the lack of good interactive dialogue
systems is their complexity. To develop a system which is able to
handle more than simple commands and phrases requires a lot of
time and experience. To accelerate and improve this
process, we are currently working on methods and tools which
support this development. A new method called Dialogue
Statecharts was defined for the graphical specification of
complex dialogues. It is capable of representing parallel
dialogue steps, which is necessary, for example, for mixed-initiative
dialogues. Our tool system named Diamod provides editors
for different dialog concepts, such as dialogue structures,
grammars and parameters. The modeling is supported by graphical
editors for Dialogue Statecharts and Task
Hierarchies. Diamod is able to check for the
completeness and consistency of dialogue models. One goal when
developing Diamod was to provide specification models
which are universal enough to be interpreted within different
dialogue systems, i.e. different implementations of generic
conversational systems. With the help of a uniform representation
of data, a transformation between different models and different
dialogue description languages (DDLs) such as VoiceXML (ATT) and
some in-house DDLs, such as Temic-DDL and Dialogue-Prolog, will be
possible.
Organizing Committee
- Rémi Zajac (Chair), CRL, New Mexico State University, USA:
zajac@crl.nmsu.edu.
- Jan Amtrup, CRL, New Mexico State University, USA:
jamtrup@crl.nmsu.edu.
- Stephan Busemann, DFKI, Saarbrücken:
busemann@dfki.de.
- Hamish Cunningham, University of Sheffield:
hamish@dcs.shef.ac.uk.
- Guenther Goerz, IMMD VIII, University of Erlangen:
goerz@immd8.informatik.uni-erlangen.de.
- Gertjan van Noord, University of Groningen:
vannoord@let.rug.nl.
- Fabio Pianesi, IRST, Trento:
pianesi@irst.itc.it.
Of Related Interest