
Tabula Rasa


Turning Text into Data

Tabula Rasa is an attempt to reduce two of the major bottlenecks of information extraction: defining text extraction tasks, and developing tools to aid in producing structured data or templates. Tabula Rasa is a `meta-tool' that analysts can use to build tools that help with template filling tasks. One Tabula Rasa component, tredit, enables designers of information extraction tasks, like those used in ARPA's Message Understanding Conferences (MUC), to create and edit template definitions. Another Tabula Rasa component, the runtime tool-builder, uses these definitions to automatically generate Graphical User Interface (GUI) tools that analysts can use to create filled templates.

Current automatic information extraction systems are usually inaccurate and have long development times. Even when the accuracy of the technology is adequate, completed keys (filled templates) are still needed to train automatic systems and to allow system performance to be tested objectively. To produce these keys, a human analyst must first carry out the template filling task. Tabula Rasa facilitates the production of keys for new domains by helping analysts create new template definitions. Tabula Rasa can then use these new definitions to produce machine-assisted information extraction tools. This capability is intended to help analysts define extraction tasks more rapidly, and to integrate automatic extraction techniques into the tools used by human analysts.

Tabula Rasa has the following functionality:

The User's Manual

This manual contains two sections on building and running an interface tool using Tabula Rasa. The first section contains basic information about the windows and options available for building tools. This is intended for users who are familiar with editing windows and using software of this type, and who have had experience with extraction tools. The second section includes general information about extraction tools.

Manual Conventions

Different fonts are used to indicate various interface features:

Buttons, fields, and boxes: Helvetica

Menus and menu options: Courier

Entries typed by the user: AvantGarde

Mouse Conventions

When the text instructs the user to "click on" a button, menu, or box, the user should position the cursor on that item and press the left mouse button. Unless otherwise specified, the left mouse button is always used. "Middle mouse button drag" refers to selecting an item by moving the cursor to it, holding the middle mouse button down, dragging the item to the desired location, and releasing the mouse button.

Glossary of Terms

Template

A filled template contains structured elements called template objects.

Template Object

A related set of slots which are filled with information from a text, e.g., person, organization, etc.

Template Definition

A formal specification, written in a BNF syntax, that defines the form of template objects and, by extension, the form of template slots.

Slot

A labeled field to be filled with information from a text.

Fill

Information that is in (or goes in) a slot.

Valence

Describes how many legal fills a particular slot may have.

Link/Annotation

A reference to the source text that specifies where the information for the slot is located.

Alternative Fills

Slots may have more than one acceptable fill value. Alternative fills allow answer keys to be built with alternative acceptable fills.

Set Fill

A slot filled by selection from a prespecified list of categories. In Tabula Rasa (TR) this is defined as a `Select Field' and is most useful when there are fewer than 10 categories. For more than 10 categories, a text field with fill options may be more appropriate.

String Fill

An exact copy of the text from an article that maps into a text field (literal).

Normalized Fill

Text from an article that has been reduced to canonical form in TR's Text Field.

Index Fill

A slot filled with an index to an object. In TR this translates into a pointer field.
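
As an illustration only, the glossary terms above can be sketched as a small Python data model. This is not Tabula Rasa's implementation or file format. Every name here (FillKind, Link, Fill, Slot, TemplateObject, max_fills) is hypothetical, and modelling valence as a simple upper bound on the number of fills is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class FillKind(Enum):
    """The four fill types named in the glossary (hypothetical encoding)."""
    SET = "set"                # chosen from a prespecified list of categories
    STRING = "string"          # exact copy of text from the article
    NORMALIZED = "normalized"  # article text reduced to a canonical form
    INDEX = "index"            # pointer to another template object


@dataclass
class Link:
    """Link/annotation: character offsets of the supporting source text."""
    start: int
    end: int


@dataclass
class Fill:
    """Information that is in (or goes in) a slot."""
    value: str
    link: Optional[Link] = None


@dataclass
class Slot:
    """A labeled field to be filled with information from a text."""
    label: str
    kind: FillKind
    max_fills: Optional[int] = 1      # valence (assumed: None means no limit)
    categories: Tuple[str, ...] = ()  # only meaningful for set fills
    fills: List[Fill] = field(default_factory=list)

    def add_fill(self, fill: Fill) -> None:
        # Enforce the valence before accepting another fill.
        if self.max_fills is not None and len(self.fills) >= self.max_fills:
            raise ValueError(f"slot {self.label!r} allows at most {self.max_fills} fill(s)")
        # A set fill must be one of the prespecified categories.
        if self.kind is FillKind.SET and fill.value not in self.categories:
            raise ValueError(f"{fill.value!r} is not a category of slot {self.label!r}")
        self.fills.append(fill)


@dataclass
class TemplateObject:
    """A related set of slots, e.g. a person or an organization."""
    type_name: str
    object_id: str
    slots: List[Slot] = field(default_factory=list)


# Usage: fill a PERSON object from an invented example sentence.
text = "Jane Doe joined Acme Corp. as president."
name = Slot("NAME", FillKind.STRING)
name.add_fill(Fill(text[0:8], Link(0, 8)))          # string fill linked to its source span
title = Slot("TITLE", FillKind.SET, categories=("president", "chairman", "other"))
title.add_fill(Fill("president", Link(30, 39)))     # set fill chosen from the category list
person = TemplateObject("PERSON", "PERSON-1", [name, title])
```

In this sketch an index fill would store the object_id of another TemplateObject as its value, which mirrors the glossary's description of a pointer field.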




webmaster@crl.nmsu.edu
Copyright © 1997, The Computing Research Laboratory
New Mexico State University. All rights reserved.