Calypso Document Architecture: Overview


The Architecture

The Calypso Document Manager, developed at CRL, is implemented in Java. Its interface is defined by a set of Java interfaces grouped into services, with the services organized as Java packages.

The document model generalizes the Tipster model: documents can be grouped into collections and sub-collections but can also be accessed individually. Documents can be located anywhere on a wide-area network. Currently, the basic document class uses local file access, FTP and HTTP to access a document's content. However, document properties and annotations are managed locally by the document server using an object-oriented database back-end.
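As an illustration of content access by location, here is a minimal sketch that assumes a document location is expressed as a URI and relies on the protocol handlers built into java.net.URL. The class name ContentFetcher is hypothetical and is not part of the Calypso API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the Calypso API.
final class ContentFetcher {
    /** Reads the raw bytes behind a document location (file:, http:, ftp:). */
    static byte[] fetch(URI location) throws IOException {
        // URL.openStream() dispatches on the URI scheme to a protocol handler.
        try (InputStream in = location.toURL().openStream()) {
            return in.readAllBytes();   // requires Java 9 or later
        }
    }
}

class ContentFetcherDemo {
    public static void main(String[] args) throws IOException {
        // A local file stands in for a remote document in this demo.
        Path tmp = Files.createTempFile("calypso-doc", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        byte[] content = ContentFetcher.fetch(tmp.toUri());
        System.out.println(new String(content, StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

The same call would accept an http: or ftp: location; only the content is fetched this way, while properties and annotations stay with the document server.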

A previous CORBA implementation was developed and tested: see the description of this experiment.

Services

The Calypso Architecture follows the CORBA model: the Document Manager interface is divided into four services, three of which are fully compliant with the CORBA definition (the Naming, Property and Collection services). The fourth, the Document Management Service, provides functions to manipulate documents and collections.

Database functionality is provided directly by the underlying OODBMS (ObjectStore).

Key Design Decisions

Naming is clearly distinguished from object identity. Objects are named relative to an explicitly defined context, and names are independent of the state of the objects they denote. The architecture does not require the concept of object identity, and this notion has been avoided because of the distributed nature of the architecture. The Naming Service provides the persistent equivalent of a directory service for documents and collections.
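The idea of context-relative names can be sketched as follows. This is a simplified illustration only: the class NamingContext, its method names and the "/" separator are assumptions made for the example, and they do not reproduce the CORBA Naming Service signatures used by the Document Manager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real Naming Service follows the CORBA definition.
final class NamingContext {
    private final Map<String, Object> bindings = new HashMap<>();

    /** Binds a name to an object within this context only. */
    void bind(String name, Object target) {
        if (bindings.containsKey(name)) {
            throw new IllegalStateException("already bound: " + name);
        }
        bindings.put(name, target);
    }

    /** Creates a nested context and binds it under the given name. */
    NamingContext bindNewContext(String name) {
        NamingContext child = new NamingContext();
        bind(name, child);
        return child;
    }

    /** Resolves a compound name such as "news/doc1" one component at a time. */
    Object resolve(String compoundName) {
        Object current = this;
        for (String part : compoundName.split("/")) {
            if (!(current instanceof NamingContext)) {
                throw new IllegalArgumentException("not a context before: " + part);
            }
            current = ((NamingContext) current).bindings.get(part);
            if (current == null) {
                throw new IllegalArgumentException("unbound name: " + part);
            }
        }
        return current;
    }
}

class NamingDemo {
    public static void main(String[] args) {
        NamingContext root = new NamingContext();
        NamingContext news = root.bindNewContext("news");
        StringBuilder doc = new StringBuilder("draft");
        news.bind("doc1", doc);
        root.bind("latest", doc);   // a second name, in another context
        doc.append(" v2");          // the state changes, both names stay valid
        System.out.println(root.resolve("news/doc1"));
        System.out.println(root.resolve("latest"));
    }
}
```

The same object is reachable under two names in two contexts, and neither name depends on the object's state.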

Issues such as performance, scalability and portability have influenced major decisions: the separation between interfaces and implementation, the structuring of the architecture into a modular and extensible set of services, and the choice of Java as the primary programming language for the initial implementation.

The architecture tries to strike a balance between complete flexibility for the client and total type safety. Most of the objects of the architecture, and especially documents and collections, support the PropertySet interface, in which properties can be defined dynamically. However, the type of a property value can be fixed by sub-classing the application object, which allows both typing and an efficient implementation of complex application objects.
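A minimal sketch of this trade-off follows. The PropertySet interface shown here is a simplified stand-in whose method names are assumptions; it is not the CORBA Property Service interface, and BasicDocument and DatedDocument are hypothetical classes.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the PropertySet interface; method names are assumptions.
interface PropertySet {
    void defineProperty(String name, Object value);
    Object getPropertyValue(String name);
}

// Fully dynamic: any name can be bound to any Java object.
class BasicDocument implements PropertySet {
    private final Map<String, Object> properties = new HashMap<>();

    @Override
    public void defineProperty(String name, Object value) {
        properties.put(name, value);
    }

    @Override
    public Object getPropertyValue(String name) {
        return properties.get(name);
    }
}

// Sub-class that fixes the type of one property and stores it in a field.
class DatedDocument extends BasicDocument {
    private LocalDate date;

    @Override
    public void defineProperty(String name, Object value) {
        if ("date".equals(name)) {
            if (!(value instanceof LocalDate)) {
                throw new IllegalArgumentException("date must be a LocalDate");
            }
            date = (LocalDate) value;
        } else {
            super.defineProperty(name, value);
        }
    }

    @Override
    public Object getPropertyValue(String name) {
        return "date".equals(name) ? date : super.getPropertyValue(name);
    }

    /** Typed accessor: no cast and no map lookup for the client. */
    LocalDate getDate() {
        return date;
    }
}
```

Clients that only know PropertySet still work with DatedDocument, while clients that know the sub-class get a typed accessor and a checked value.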

Implementation

The current distribution of the Document Manager is written in Java and depends on two external Java packages: the JGL package from ObjectSpace, and the PSE database from ObjectDesign. Both packages can be freely obtained from these sites.

Services

CORBA services and ODMG interfaces are not documented in the distribution of the Document Manager. For complete documentation, see the following references:

Naming Service

Collection Service

Property Service

Document Management Service

The Document Management Service (DMS) provides operations for

Differences from the Tipster architecture

Collections

The current Tipster model is directly derived from a model in which a collection of documents is a single file. Adapting this model to handle other types of collections and documents has proved difficult. Calypso provides document and collection classes that can also support HTML documents distributed across the net; a collection can simply be a set of references (URLs) to documents.

A collection provides a view across a set of documents: as such, neither documents nor collections have parents. Collections can be created and destroyed without modifying any of the documents they contain. Documents and collections can be sub-classed. A given document may be a member of more than one collection.
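The view semantics described above can be sketched as follows. All names in this example (Member, DocumentRef, DocumentCollection, HtmlCollection) are hypothetical and chosen for illustration; they are not the Calypso classes.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// All names below are hypothetical; they are not the Calypso classes.
interface Member { }

final class DocumentRef implements Member {
    final String url;   // where the content lives; note there is no parent field

    DocumentRef(String url) {
        this.url = url;
    }
}

class DocumentCollection implements Member {
    // A collection only holds references to documents or sub-collections.
    private final Set<Member> members = new LinkedHashSet<>();

    void add(Member m) { members.add(m); }
    boolean contains(Member m) { return members.contains(m); }
    int size() { return members.size(); }

    /** Destroys the view: references are dropped, members are left untouched. */
    void destroy() { members.clear(); }
}

// Collections can be sub-classed, for example to constrain membership.
final class HtmlCollection extends DocumentCollection {
    @Override
    void add(Member m) {
        if (m instanceof DocumentRef && !((DocumentRef) m).url.endsWith(".html")) {
            throw new IllegalArgumentException("not an HTML document");
        }
        super.add(m);
    }
}
```

Because a document holds no back-pointer to a collection, adding it to a second collection or destroying one of its collections never touches the document itself.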

Collections and documents contain properties: dynamic sets of name/value pairs. A value can be any Java object. This provides flexibility in the types of objects that can be stored in the Document Manager.

No pre-defined properties (such as 'date') are specified in the architecture: the client can add them dynamically.

Documents

Documents can exist without being part of a collection. To be stored persistently in the Document Manager, however, they need to be named (using the Document Manager Naming Service).

A document has content (a byte array) and a graph of annotations.

Annotations

In a document, the set of annotations is structured as a graph in which each annotation is an edge. Each annotation has content (a Tango feature structure), a tag (typically, the name of the annotator), a set of spans, and a start node and an end node in the graph. The Document Manager does not maintain any relation between spans and nodes: the management of spans is left to the client.

The graph itself is directed and can be traversed in only one direction. The graph can consist of several unconnected components. If each component represents a unit of processing, it is the responsibility of the client to manage these components.
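To make the structure concrete, here is a minimal sketch of an annotation graph with annotations as edges and forward-only traversal. It assumes integer node identifiers and byte-offset spans, and it uses a plain Object where Calypso uses a Tango feature structure. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Span {
    final int start, end;   // offsets into the document content (assumed)

    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

// An annotation is an edge: it links a start node to an end node.
final class Annotation {
    final int from, to;
    final String tag;         // typically the name of the annotator
    final Object content;     // a feature structure in Calypso; any object here
    final List<Span> spans;   // not related to nodes; managed by the client

    Annotation(int from, int to, String tag, Object content, List<Span> spans) {
        this.from = from;
        this.to = to;
        this.tag = tag;
        this.content = content;
        this.spans = spans;
    }
}

final class AnnotationGraph {
    // Only outgoing edges are indexed, so traversal is forward-only.
    private final Map<Integer, List<Annotation>> outgoing = new HashMap<>();

    void add(Annotation a) {
        outgoing.computeIfAbsent(a.from, k -> new ArrayList<>()).add(a);
    }

    List<Annotation> leaving(int node) {
        return outgoing.getOrDefault(node, List.of());
    }

    /** Nodes reachable from the given node by following edges forward. */
    Set<Integer> reachableFrom(int start) {
        Set<Integer> seen = new LinkedHashSet<>();
        Deque<Integer> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            int node = todo.pop();
            if (seen.add(node)) {
                for (Annotation a : leaving(node)) {
                    todo.push(a.to);
                }
            }
        }
        return seen;
    }
}
```

With one root node per component, a client can keep its own list of roots and call reachableFrom on each to process the components separately.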

For example, in the Corelli MT architecture, each rooted sub-graph represents a processing unit, e.g. a sentence in the document. Such a graph can, for example, represent a word lattice as produced by a speech recognizer or a morphological analyzer. Furthermore, an annotation graph can be used directly by a parser such as a chart parser, whose input and working data structure (the chart) can be represented directly as an annotation graph. The figure below represents an annotation graph where each sub-graph is a projection of the whole graph relative to a given tag (the graph class supports several traversal algorithms based on tags).
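A projection by tag over a small word lattice can be sketched as follows. The classes Edge and Lattice, the tags "asr" and "pos", and the path enumeration are illustrative assumptions; they are not the traversal algorithms of the Calypso graph class, and the sketch assumes an acyclic graph.

```java
import java.util.ArrayList;
import java.util.List;

final class Edge {
    final int from, to;
    final String tag, label;

    Edge(int from, int to, String tag, String label) {
        this.from = from;
        this.to = to;
        this.tag = tag;
        this.label = label;
    }
}

final class Lattice {
    private final List<Edge> edges = new ArrayList<>();

    void add(int from, int to, String tag, String label) {
        edges.add(new Edge(from, to, tag, label));
    }

    /** Projection: the sub-graph containing only edges that carry the tag. */
    Lattice project(String tag) {
        Lattice projected = new Lattice();
        for (Edge e : edges) {
            if (e.tag.equals(tag)) {
                projected.edges.add(e);
            }
        }
        return projected;
    }

    /** Label sequences along every forward path between two nodes. */
    List<List<String>> paths(int from, int to) {
        List<List<String>> out = new ArrayList<>();
        walk(from, to, new ArrayList<>(), out);
        return out;
    }

    // Depth-first walk; terminates only if the graph has no cycles.
    private void walk(int node, int goal, List<String> prefix, List<List<String>> out) {
        if (node == goal) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        for (Edge e : edges) {
            if (e.from == node) {
                prefix.add(e.label);
                walk(e.to, goal, prefix, out);
                prefix.remove(prefix.size() - 1);
            }
        }
    }
}

class LatticeDemo {
    public static void main(String[] args) {
        Lattice g = new Lattice();
        g.add(0, 1, "asr", "I");           // two competing recognizer hypotheses
        g.add(1, 2, "asr", "scream");
        g.add(0, 2, "asr", "ice-cream");
        g.add(0, 1, "pos", "PRON");        // an edge from another annotator
        System.out.println(g.project("asr").paths(0, 2));
    }
}
```

Each path through the projected graph is one hypothesis of the recognizer; edges from other annotators share the same nodes but do not appear in the projection.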

In some NLP applications, annotations are complex objects such as trees or feature structures. The current Tipster model of annotations forces such complex objects to be mapped onto the Tipster attribute structure, which is sometimes very difficult to do and always computationally inefficient. In Calypso, the content of an annotation is a (typed) feature structure. This design enables a more direct interface to existing NLP components such as unification-based parsing systems and provides a superset of the functionality provided by the attributes of a Tipster annotation.
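To show why feature structures suit unification-based components, here is a deliberately simplified sketch of unification. It assumes untyped feature structures represented as nested maps with string atoms and no shared (reentrant) values. Tango feature structures are typed and are not reproduced here, and the class FeatureStructures is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Simplified and untyped: a feature structure is either an atom (any object
// compared with equals) or a Map from feature names to feature structures.
final class FeatureStructures {
    private FeatureStructures() { }

    /** Returns the unification of a and b, or empty if they are incompatible. */
    @SuppressWarnings("unchecked")
    static Optional<Object> unify(Object a, Object b) {
        if (a instanceof Map && b instanceof Map) {
            Map<String, Object> result = new LinkedHashMap<>((Map<String, Object>) a);
            for (Map.Entry<String, Object> e : ((Map<String, Object>) b).entrySet()) {
                Object existing = result.get(e.getKey());
                if (existing == null) {
                    // Feature present on one side only: copy it over.
                    result.put(e.getKey(), e.getValue());
                } else {
                    // Feature present on both sides: values must unify.
                    Optional<Object> merged = unify(existing, e.getValue());
                    if (!merged.isPresent()) {
                        return Optional.empty();
                    }
                    result.put(e.getKey(), merged.get());
                }
            }
            return Optional.of(result);
        }
        // Atoms, or an atom against a map: compatible only if equal.
        return a.equals(b) ? Optional.of(a) : Optional.empty();
    }
}
```

For example, unifying a noun phrase marked singular with a constraint requiring third person yields a structure carrying both features, while unifying singular with plural fails. A flat attribute list cannot express the nested agreement feature directly, which is the mapping difficulty mentioned above.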

Future Plans

We are currently defining an Application Framework that will help to integrate a variety of tools for building an application, including annotators, static resources such as dictionaries, codeset converters, etc. This is essentially an extension of Sheffield's GATE annotator model.

We are also experimenting with Java-RMI and CORBA for implementing distribution of components.