Introduction to the Lexicon

MIKROKOSMOS



In the model of NLP adopted in a Knowledge-Based Machine Translation paradigm, the lexicon becomes the key locus and source of knowledge. Compared with many other computational lexicons, this lexicon carries a substantial amount of information, which is either located directly in the lexicon or is indexed or referenced through it.

Organization

The lexicon for a given language is organized into superentries, each identified by the "dictionary" form of the word. Within a superentry, individual lexemes are represented in a frame-based language (FRAMEKIT in the LISP version, FRAMEPAC in the C++ version). A variety of inheritance mechanisms is used to minimize redundancy within the lexicon. Inheritance can be used "vertically", for example, to indicate that a particular verb is of syntactic CLASS basic-bitransitive and thus inherits the local syntactic specification or syntactic features of that class. Inheritance may also operate "horizontally", for example, to indicate that the third and fourth verb senses of eat share the same SYN-STRUC, or that all verb senses share the same PHONology and MORPHology.

The format for lexeme names is the character "+", followed by the dictionary form of the lexeme, followed by "-" and an indication of the sense of the dictionary form. The sense indication consists of a representation of the syntactic category (e.g., v, n, adj) and a number indicating which sense of that syntactic category is meant. For example, +eat-v2 is the lexeme label for the second sense of the verb eat.
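The naming convention above can be illustrated with a small parser. This is only a sketch in Python, not part of the lexicon software: the names LexemeName and parse_lexeme_name are hypothetical, and the sketch assumes that the category is written with letters only and that the sense number is the trailing run of digits.

```python
import re
from typing import NamedTuple


class LexemeName(NamedTuple):
    """Parts of a lexeme label such as +eat-v2 (hypothetical helper type)."""
    form: str       # dictionary form, e.g. "eat"
    category: str   # syntactic category, e.g. "v"
    sense: int      # sense number within that category


# "+", dictionary form, "-", category letters, sense number.
# The form is matched greedily, so a hyphen inside the dictionary form
# is kept and only the last "-" separates the sense indication.
LEXEME_RE = re.compile(r"^\+(?P<form>.+)-(?P<cat>[A-Za-z]+)(?P<sense>\d+)$")


def parse_lexeme_name(label: str) -> LexemeName:
    """Split a lexeme label into form, category, and sense number."""
    match = LEXEME_RE.match(label)
    if match is None:
        raise ValueError(f"not a lexeme label: {label!r}")
    return LexemeName(match["form"], match["cat"].lower(), int(match["sense"]))


if __name__ == "__main__":
    print(parse_lexeme_name("+eat-v2"))
    print(parse_lexeme_name("+Paris-n1"))
```

The category is lowercased so that labels written in upper case in parse output (for instance +THE-DET1) and labels written in lower case in the lexicon compare equal on that field; the dictionary form is left untouched.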

A proper name which is to be entered into the lexicon would reference an Onomasticon entry in the lexical semantics (defined below), but would otherwise appear just as any other lexicon entry. For example, +Paris-n1 might be the label for the lexical item Paris which names the city Paris, France. This arrangement allows language-independent world knowledge to be maintained independently of language-specific nomenclature (which, in turn, affects its phonology, morphology, syntactic behavior, etc.).

Structure

An entry in the lexicon comprises a number of zones (each possibly having multiple fields), integrating various levels of lexical information. The zones are CAT (syntactic category), ORTH (orthography -- abbreviations and variants), PHON (phonology), MORPH (morphological irregular forms or class information), SYN (syntactic features such as attributive), SYN-STRUC (indication of sentence- or phrase-level syntactic inter-dependencies, including subcategorization), SEM (lexical semantics, meaning representation), LEXICAL-RELATIONS (collocations, etc.), PRAGM (pragmatics hooks, for example for deictics, and stylistic factors), and ANNOTATIONS (user, lexicographer, and administrative information, such as modification audit trail, example sentences, definition in English, etc.).

The CAT, MORPH, SYN, and SYN-STRUC zones are all referenced during syntactic parsing (including segmentation, tokenization, and morphological analysis) which, in practice, precedes the invocation of the semantic analysis (locally) in the processing model described below. The SEM zone is a focus of interest because it is the locus of interaction with the ontology (and onomasticon) knowledge base, and thus the source of many of the building blocks of the eventual meaning representation in TMR. Other sections in this document, as well as some of the papers, discuss the formalism used in the lexical semantic specification in the SEM zone; the utilization of that specification is discussed in subsequent sections.

Syntactic Information

The lexicon serves both as a source of syntactic information (the local syntactic specification found in SYN-STRUC zones) and as the location of the syntax-semantics interface. In the following subsections, the general syntactic paradigm is briefly sketched out first, followed by a sketch of the notation used in the specification of local syntactic information in a lexical entry.

The syntactic parse structure used in the system described here is a modification of a Lexical-Functional Grammar (LFG) f-structure level parse representation (although it shall be referred to as an f-structure anyway). In the syntactic parses, the traditional LFG f-structure is augmented by a ROOT identifier (akin to the labelling of a node in a tree structure); at each level of the structure, the ROOT identifier is followed by the wordsense identifier for the relevant word in the parse. The representation can be thought of as a list representation of a (possibly recursive) feature structure, where each attribute name is followed by either a symbol value or another (embedded) f-structure. For example, the f-structure below is the preferred parse of the sentence The old man ate a doughnut in the shop.

((ROOT +EAT-V1)
 (MOOD DECL) (VOICE ACTIVE) (NUMBER S3)
 (CAT V) (TENSE PAST) (FORM FINITE)
 (SUBJ ((ROOT +MAN-N1)
        (NUMBER S3) (CAT N)
        (PROPER -) (COUNT +) (CASE NOM)
        (DET ((ROOT +THE-DET1) (CAT DET)))
        (MODS ((ROOT +OLD-ADJ1) (CAT ADJ)
               (ATTRIBUTIVE + -)))))
 (OBJ ((ROOT +DONUT-N1)
       (NUMBER S3) (CAT N) (PROPER -) (COUNT +)
       (DET ((ROOT +A-DET1) (CAT DET)))))
 (PP-ADJUNCT ((ROOT +IN-PREP1)
              (CAT PREP)
              (OBJ ((ROOT +SHOP-N1)
                    (NUMBER S3) (CAT N)
                    (PROPER -) (COUNT +)
                    (DET ((ROOT +THE-DET1)
                          (CAT DET))))))))

The same structure may also be viewed in the (perhaps more familiar) typed feature structure matrix shown in Figure 4A. [Figure 4A not available.]
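Since the list representation is simply a nested attribute-value list, it can be read mechanically into a recursive data structure. The following Python sketch is one possible reader, not the parser used in the system; the function names are hypothetical, and it assumes that an attribute followed by several symbols (as in ATTRIBUTIVE + -) should be kept as a list of symbols.

```python
def tokenize(text):
    """Split a parenthesized f-structure into "(", ")" and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens):
    """Read one expression from the front of the token list."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == ")":
        raise ValueError("unexpected ')'")
    if token != "(":
        return token                      # a plain symbol
    items = []
    while tokens and tokens[0] != ")":
        items.append(read(tokens))
    if not tokens:
        raise ValueError("missing ')'")
    tokens.pop(0)                         # drop the closing ")"
    return items


def to_fstructure(pairs):
    """Turn a list of (ATTR value ...) lists into a nested dict."""
    fs = {}
    for attr, *values in pairs:
        if len(values) == 1 and isinstance(values[0], list):
            fs[attr] = to_fstructure(values[0])   # embedded f-structure
        elif len(values) == 1:
            fs[attr] = values[0]                  # single symbol value
        else:
            fs[attr] = values                     # several symbols
    return fs


def parse_fstructure(text):
    """Parse the list notation into a nested dict, checking balance."""
    tokens = tokenize(text)
    expr = read(tokens)
    if tokens:
        raise ValueError("trailing input after f-structure")
    return to_fstructure(expr)


EXAMPLE = """
((ROOT +EAT-V1) (CAT V) (TENSE PAST)
 (SUBJ ((ROOT +MAN-N1) (CAT N)
        (MODS ((ROOT +OLD-ADJ1) (CAT ADJ)
               (ATTRIBUTIVE + -))))))
"""

if __name__ == "__main__":
    fs = parse_fstructure(EXAMPLE)
    print(fs["ROOT"], fs["SUBJ"]["ROOT"], fs["SUBJ"]["MODS"]["ATTRIBUTIVE"])
```

A reader of this kind also reports unbalanced parentheses, which are easy to introduce when f-structures are edited by hand.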

Lexical Syntactic Specification

The contents of the SYN-STRUC zone of a lexicon entry indicate where the lexeme may fit into f-structure parses of sentences. In addition, this zone provides the basis of the syntax-semantics interface. The information contained in this zone essentially amounts to an underspecified piece of an f-structure parse of a typical sentence using the lexeme; this piece, called an fs-pattern, contains the lexeme in question, and may include information from any number of embedded levels (but typically not more than two) above or below the current lexeme. The information included in an fs-pattern reflects those levels and elements of the f-structure which the current lexeme syntactically selects for; in the current model, verbs select for all their arguments, modifiers select for their heads, prepositions select for their objects as well as for their node of attachment, etc. Fs-patterns thus determine such things as subcategorization (optionality is indicated; otherwise obligatoriness is assumed), complements allowed, etc.

Since f-structures do not indicate linear order, the fs-pattern merely indicates a piece large enough to establish all necessary dependencies. Thus, in the simple case, the fs-pattern for a verb will indicate the arguments which the verb subcategorizes for. In LFG f-structures, all arguments (including subjects) are immediate children of the verb node, so the selection in the fs-pattern is for elements which are descendants of the current lexeme in the f-structure tree. However, we also use the same mechanism for syntactic relationships other than arguments. So adjectives and prepositions, for example, select (in their respective fs-patterns) for the syntactic head which they modify; in addition, prepositions select for their arguments.

In the fs-patterns, we place variables at the ROOT positions selected for by the lexeme in question. The lexeme itself is identified by the variable $var0; this allows the fs-patterns to be inherited (using the CLASS mechanism described above). Subsequently numbered variables ($var1, $var2, ...) identify other nodes in the f-structure with which the current lexeme has syntactic or semantic dependencies. For example, the fs-pattern below is appropriate for any regular monotransitive verb:

((root $var0)
 (subj ((root $var1) (cat n)))
 (obj ((root $var2) (cat n))))

Or, viewed as a feature structure:

[figure missing]
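To make the role of the variables concrete, the sketch below matches the monotransitive fs-pattern against a parse and collects the variable bindings. It is an illustration in Python only, under several assumptions that are not from the system itself: f-structures and fs-patterns are held as nested dicts, attribute names and values are normalized to lower case, and the function name match is hypothetical. Attributes in the parse that the pattern does not mention (such as adjuncts) are ignored, in keeping with the underspecified nature of fs-patterns.

```python
def match(pattern, fstructure, bindings=None):
    """Return variable bindings if the pattern fits the f-structure, else None."""
    bindings = dict(bindings or {})
    for attr, expected in pattern.items():
        if attr not in fstructure:
            return None                   # an obligatory element is absent
        actual = fstructure[attr]
        if isinstance(expected, dict):
            if not isinstance(actual, dict):
                return None
            bindings = match(expected, actual, bindings)
            if bindings is None:
                return None
        elif isinstance(expected, str) and expected.startswith("$var"):
            # A variable binds to whatever it finds, but must bind consistently.
            if bindings.get(expected, actual) != actual:
                return None
            bindings[expected] = actual
        elif expected != actual:
            return None                   # literal values must agree
    return bindings


# The monotransitive fs-pattern, written as a nested dict.
MONOTRANSITIVE = {
    "root": "$var0",
    "subj": {"root": "$var1", "cat": "n"},
    "obj": {"root": "$var2", "cat": "n"},
}

# A reduced version of the parse of "The old man ate a doughnut in the shop".
PARSE = {
    "root": "+eat-v1", "cat": "v", "tense": "past",
    "subj": {"root": "+man-n1", "cat": "n"},
    "obj": {"root": "+donut-n1", "cat": "n"},
    "pp-adjunct": {"root": "+in-prep1", "cat": "prep"},
}

if __name__ == "__main__":
    print(match(MONOTRANSITIVE, PARSE))
```

Because literal values are compared directly, the same routine would also accept a pattern whose ROOT is a specific lexeme rather than a variable.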

The exact syntactic relationship of words in a sentence may vary by syntactic transformations, valency changes, or movement rules; for this reason, we introduce this level of indirection (the variables) in the fs-patterns. Additional advantages of this mechanism include the ability to inherit fs-patterns from a hierarchy, as well as reducing the work in assigning lexical-function <==> case role correspondences.

In lexicon entries for idioms, verbs with particles, non-compositional collocations, etc., the ROOT attribute in an fs-pattern may be followed by a specific lexeme rather than by a variable. For example, the special sense of kick which defines the idiom kick the bucket will select for an OBJect with ROOT +bucket-n1, where +bucket-n1 is a lexeme identifier for a standard sense of the word bucket. Additionally, in the fs-pattern, the attribute-value pair will be followed by the symbol null-sem, as follows: (ROOT +bucket-n1 null-sem), to indicate that this sense of bucket does not contribute to the semantics of the idiom. For idioms with internal semantic structure, such as spill the beans, spill will select for an OBJect which specifies (ROOT +beans-n3), meaning that this special sense of beans (meaning information) does contribute its meaning, as an idiom chunk, to the entire idiom. In both of these cases, the root specified is obligatory, so in analysis the special sense in question will fail the syntactic parse if the selected-for root does not appear in the utterance. In generation, this special sense will be selected in the lexical selection process only if the meaning is appropriate.
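The treatment of fixed roots and null-sem can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the system's implementation: ROOT values in patterns are written as tuples mirroring the list notation, for example ("+bucket-n1", "null-sem"); f-structures are nested dicts; the function name fixed_roots and the sense labels +kick-v2, +spill-v3, and +ball-n1 are hypothetical.

```python
NULL_SEM = "null-sem"


def fixed_roots(pattern, fstructure):
    """Check literal ROOT requirements of an fs-pattern.

    Returns None if a required lexeme is absent from the parse; otherwise
    returns a dict mapping each required lexeme to True if it contributes
    to the meaning of the idiom, or False if it is marked null-sem.
    """
    found = {}
    for attr, expected in pattern.items():
        actual = fstructure.get(attr)
        if isinstance(expected, dict):
            if not isinstance(actual, dict):
                return None
            inner = fixed_roots(expected, actual)
            if inner is None:
                return None
            found.update(inner)
        elif attr == "root" and not expected[0].startswith("$var"):
            lexeme, *flags = expected
            if actual != lexeme:
                return None               # the obligatory root is missing
            found[lexeme] = NULL_SEM not in flags
    return found


# (ROOT +bucket-n1 null-sem): bucket adds nothing to the idiom's meaning.
KICK_THE_BUCKET = {"root": ("$var0",),
                   "obj": {"root": ("+bucket-n1", NULL_SEM)}}
# (ROOT +beans-n3): this sense of beans does contribute its meaning.
SPILL_THE_BEANS = {"root": ("$var0",),
                   "obj": {"root": ("+beans-n3",)}}

if __name__ == "__main__":
    idiom = {"root": "+kick-v2", "obj": {"root": "+bucket-n1"}}
    literal = {"root": "+kick-v2", "obj": {"root": "+ball-n1"}}
    print(fixed_roots(KICK_THE_BUCKET, idiom))
    print(fixed_roots(KICK_THE_BUCKET, literal))
```

Only literal roots are examined here; variables and other attributes are left to the general pattern-matching step.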

The SYN-STRUC zone has two facets. If the word is syntactically regular, non-idiomatic, having no particles, etc., then the CLASS facet is used to indicate which fs-pattern to inherit from the class hierarchy. If none of the class fs-patterns is appropriate for the lexeme in question, an fs-pattern may be locally specified in the LOCAL facet; in fact, both class and local information may be specified, in which case the two fs-patterns are unified.
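The unification of a CLASS fs-pattern with a LOCAL one can be pictured with a simplified unifier over nested dicts. This Python sketch is an assumption-laden illustration rather than the FRAMEKIT or FRAMEPAC mechanism: variables are treated as plain symbols that must be identical, there is no structure sharing, and the class name basic-monotransitive and the local COUNT restriction are hypothetical examples.

```python
def unify(a, b):
    """Unify two feature structures; return None if they clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for attr, value in b.items():
            if attr in result:
                merged = unify(result[attr], value)
                if merged is None:
                    return None           # conflicting values for attr
                result[attr] = merged
            else:
                result[attr] = value      # information only b supplies
        return result
    return a if a == b else None          # atoms must be identical


# Hypothetical class hierarchy entry holding the monotransitive fs-pattern.
CLASS_PATTERNS = {
    "basic-monotransitive": {
        "root": "$var0",
        "subj": {"root": "$var1", "cat": "n"},
        "obj": {"root": "$var2", "cat": "n"},
    },
}

# Hypothetical LOCAL facet: this sense further requires a count-noun object.
LOCAL = {"obj": {"count": "+"}}

if __name__ == "__main__":
    combined = unify(CLASS_PATTERNS["basic-monotransitive"], LOCAL)
    print(combined["obj"])
```

Unification only adds compatible information, so a LOCAL facet can narrow an inherited pattern but cannot contradict it; a contradiction makes the result None in this sketch.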

In addition to specifying syntactic dependency structure, the fs-pattern also indicates an interaction with the meaning pattern from the SEM zone, in that certain portions of the meaning pattern for a phrase or clause are regularly and compositionally determined by the semantics of the components (Principle of Compositionality); the structure of the resulting meaning pattern is determined not only by the semantic meaning patterns of each of the components, but also by their syntactic relationship in the f-structure.