User's Guide for the Juman System - Version 0.5

Yuji Matsumoto, Sadao Kurohashi, Yutaka Nyoki, Hitoshi Shinho, and Makoto Nagao
Translated by Jim Barnett (Translator's comments in italics)

1. The Morphological Grammar of Japanese

1.1 Part of Speech and Subpart of Speech

Morphemes that perform a grammatically similar role are classified together. We call the resulting classes parts of speech (keitaihinsi). The current system allows the part of speech to be further sub-classified. These sub-classes are called subparts of speech (keitaihinsisaibunrui).

1.2 Paradigm and Desinence

Like the "verb", "adjective", and "auxiliary verb" described in school grammars, some morphemes change their form according to the morphemes that occur before and after them. This change of form is called "conjugation." The part of the surface form that does not change is called the root, and the part that does change is called the suffix. Most conjugation is regular. Categorizing morphemes according to this regularity, we call the resulting classes paradigms. The combined surface forms that actually occur are called desinences.
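
As a rough illustration of the root/suffix split, the sketch below derives a root from a base form and attaches suffixes to it. The paradigm name, the romanized suffix table, and the function names are hypothetical and are not taken from the Juman dictionaries. The `*' marker for a suffix that does not appear in the surface form follows the convention described in Section 2.1.

```python
# Hypothetical paradigm table: paradigm name -> {desinence name: suffix}.
# "*" marks a desinence whose suffix does not appear in the surface form.
PARADIGMS = {
    "vowel-stem verb": {
        "base form": "ru",
        "mizenkei": "*",
        "conditional": "reba",
    },
}


def root_of(base_form, paradigm):
    """Strip the base-form suffix from the base form to obtain the root."""
    suffix = PARADIGMS[paradigm]["base form"]
    if suffix == "*":
        return base_form
    if not base_form.endswith(suffix):
        raise ValueError(f"{base_form!r} does not end in {suffix!r}")
    return base_form[: len(base_form) - len(suffix)]


def inflect(base_form, paradigm, desinence):
    """Return the surface form for one desinence: root plus suffix."""
    suffix = PARADIGMS[paradigm][desinence]
    root = root_of(base_form, paradigm)
    return root if suffix == "*" else root + suffix
```

With this table, the root of the invented entry "taberu" is "tabe", and the same root serves for every desinence in the paradigm.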

1.3 Morpheme Structure

If morpheme m has part of speech H1, subpart of speech H2, paradigm K1, desinence K2 and surface form M, we call the list (H1, H2, K1, K2, M) the "morpheme structure of morpheme m". Elements H1, H2, K1, and K2 are filled with the names of the part of speech, the subpart of speech, the paradigm and the desinence. Furthermore, M is filled with a surface form expressed in kana or kanji. This notation is used to describe the connection rules. A morpheme structure may be underspecified, in which case it stands for the set of morphemes that unify with it.

1.4 Connection Relations and Connection Rules

The fact that all the members of the sets of morphemes A1 and A2 can be connected (can occur next to each other in a sentence) is expressed as: (A1, A2). This is called a Connection Rule. A set of morphemes is expressed as a list of morpheme structures. Given connection rule (A1, A2), every morpheme in every morpheme structure alpha1 in A1 can be followed by any morpheme in any morpheme structure alpha2 in A2. The set of Connection Rules is called the Connection Relations.

If A1 is a list of morpheme structures (a11,... a1n) and similarly A2 = (a21, ... a2k), then (A1, A2) means that any morpheme that unifies with any of the descriptions (a11,... a1n) can be followed by any one that unifies with one of (a21,... a2k).

2. Dictionary Definitions and Data Structures

Here we discuss the internal data structures of the dictionary (Figure 1). The distinctive feature of this dictionary system is that it allows the user to freely define parts of speech and Connection Relations.

We call the dictionary that has been defined by the user the User Dictionary. The User Dictionary is divided into the Grammar Dictionary and the Morpheme Dictionary. The Grammar Dictionary, briefly described (I'm not sure what "issetsu de nobeta" means), is used to describe the morphological grammar of Japanese. It is composed of the Subpart of Speech Dictionary, the Conjugation Dictionary, the Connection Rule Dictionary, and the Conjugation Connection Dictionary (Section 2.1). The Morpheme Dictionary contains information about individual morphemes (Section 2.2).

The dictionary that is used for morphological analysis is called the System Dictionary. The System Dictionary consists of: the Connection Table and Connection Matrix, which are generated from the Connection Rule Dictionary and the Conjugation Connection Dictionary; the Tree Structure Dictionary, which contains the information in the Morpheme Dictionary in symbolic form and is constructed by referring to the Grammar Dictionary; the (Lexical) Key Dictionary; the Readings (yomi) Dictionary; and the Meaning Dictionary.

The dictionaries described here must be placed in the directory held in the Unix environment variable JUMANPATH. (I don't understand the next sentence: "hon sistemu wa, JUMANPATH wa `/juman/dic' o sansho siteiru")

Furthermore, this system comes equipped with a standard User Dictionary. It is introduced in Section 2.5.

2.1 Definition of the Grammar Dictionary

The Grammar Dictionary is set up in the following way. Each dictionary is defined in S-expression format.

The Morpheme Subpart of Speech Dictionary (cf. JUMAN.grammar)
Defines the names of the parts and subparts of speech that the system uses. A single part of speech or subpart of speech is represented by a single S-expression (list structure). The list's first element contains the part of speech. If there are subparts of speech, elements 2 and beyond of the list each contain a subpart of speech in list format.
    Individual parts of speech or subparts of speech are represented as single element lists, but if the morphemes belonging to the part or subpart of speech are inflected, the symbol `%' is added as the second element of the list. If a part of speech has the symbol `%', all morphemes belonging to any of its subparts of speech are inflected.
Conjugation Dictionary (cf. JUMAN.katuyou)
A table of the paradigms of inflectable morphemes and the definitions of the desinences contained in the paradigms. The list's first element contains the paradigm, the second element is a list containing lists of desinences and Suffixes. The base form of the morpheme cannot be left out as a desinence, since for inflectable words it is the base form which is entered in the Morpheme Dictionary.
    The word root is what remains after the removal of the base form suffix from the form that is entered in the Morpheme Dictionary (i.e., from the base form). If the suffix does not appear in the surface form, the Suffix entry for a desinence is marked with `*'.
Connection Rule Dictionary (cf. JUMAN.connect)
The Connection Rule Dictionary is the set of connection rules. A connection rule is a pair of sets of morpheme structures (cf. Section 1.3) and is expressed as a two-element list. A set of morpheme structures can also be expressed as a list.
    (A rule is a 2-element list, where each element is either a morpheme structure or a list of morpheme structures. In the latter case, the rule can be viewed as shorthand for a number of simple rules. Alternatively, we can view each element of the rule as denoting a set of morphemes, where a morpheme structure denotes the set of elements that unify it, and a list of morpheme structures denotes the union of the sets its elements denote.)
    All morphemes contained in the first element of a connection rule can be connected to all the elements contained in the second element. A single morpheme may be contained in multiple morpheme structures within a rule.
    A special symbol `*' can be used in any morpheme structure. It denotes a "don't care" value (unifies with anything). For example, the morpheme structure
            alpha1 = (*)
can be viewed as an arbitrary set of morphemes. In the same manner, the morpheme structure
           alpha2 = (noun)
expresses the set of all morphemes classified as nouns. Further, the morpheme structure
           alpha3 = (* * * mizenkei)
denotes the set of all morphemes that can take the mizenkei desinence. (N.B. In this example, the following relations hold: alpha1 >= alpha2, alpha1 >= alpha3.)
    (A morpheme structure stands for all the morphemes that unify with it, where `*' unifies with anything. In addition, the following convention holds: each morpheme structure is 5 elements long, and shorter structures are filled with `*' to the right. Thus, `(noun)' really stands for `(noun * * * *)' and `(* * * mizenkei)' stands for `(* * * mizenkei *)'.)
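
The padding convention, unification with `*', and the reading of a connection rule can be sketched as follows. This is only an illustration under stated assumptions: morphemes are modelled as fully specified 5-tuples, each side of a rule is assumed to be a list of morpheme structures, and the function names and sample morphemes are invented for the example.

```python
WILDCARD = "*"
ARITY = 5  # (part of speech, subpart of speech, paradigm, desinence, surface form)


def pad(structure):
    """Fill a short morpheme structure with '*' on the right, up to 5 elements."""
    items = tuple(structure)
    if len(items) > ARITY:
        raise ValueError("a morpheme structure has at most 5 elements")
    return items + (WILDCARD,) * (ARITY - len(items))


def unifies(structure, morpheme):
    """True if every non-wildcard slot of the structure equals the morpheme's slot."""
    return all(s == WILDCARD or s == m for s, m in zip(pad(structure), morpheme))


def subsumes(general, specific):
    """True if every morpheme matching `specific` also matches `general`."""
    return all(g == WILDCARD or g == s for g, s in zip(pad(general), pad(specific)))


def can_connect(rule, left, right):
    """Apply one connection rule (A1, A2) to a pair of adjacent morphemes."""
    a1, a2 = rule
    return any(unifies(s, left) for s in a1) and any(unifies(s, right) for s in a2)


# Invented sample data; the rule is illustrative, not a real Juman rule.
NOUN = ("noun", "common noun", "*", "*", "neko")
VERB = ("verb", "*", "vowel-stem verb", "mizenkei", "tabe")
RULE = ([("noun",)], [("*", "*", "*", "mizenkei")])
```

Here subsumes(("*",), ("noun",)) corresponds to the relation alpha1 >= alpha2 noted above.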

2.2 Definition of the Morpheme Dictionary

The Morpheme Dictionary is defined in list form. It is stored in files with extension '.dic'. It may be divided into multiple files. Here is the BNF for the dictionary.

<morpheme definition> ::=
(<part of speech> <morpheme info list>) | (<part of speech> (<subpart of speech> <morpheme info list>))
<morpheme info list> ::=
<morpheme info> <morpheme info list> | Nil
<morpheme info> ::=
(<keyword info> <reading info> <conjugation info> <meaning info>)
<keyword info> ::=
(keyword <keyword notation>)
the word "keyword" followed by the appropriate form
<reading info> ::=
(reading <reading notation>)
the word "reading" (yomi) followed by the appropriate form
<conjugation info> ::=
(paradigm <desinence name>) | Nil
<meaning info> ::=
(meaning information <meaning>) | Nil
<morpheme info>
I don't understand the first sentence
The part of speech and subpart of speech must be defined in the Morpheme Part of Speech Dictionary.
<conjugation info>
This cannot be left out if the part of speech or subpart of speech have been defined to be inflectable.
<keyword>
This should be the surface form of the word. In the case of declinable morphemes, this should be the base form.
<reading>
This contains the morpheme's reading. In addition to an arbitrary sequence of kana, it is possible to store other information here.
<meaning>
This contains semantic information. It consists of arbitrary text. There is no restriction on length.
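
To make the shape of a morpheme definition concrete, here is a minimal sketch that reads one S-expression and walks it according to the BNF above. The field names (keyword, reading) are the English stand-ins used in this translation, the sample entry is invented, and the real dictionary files may spell the fields differently.

```python
import re

TOKEN = re.compile(r"[()]|[^\s()]+")


def read_sexpr(text):
    """Parse one S-expression into nested Python lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(parse())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume ")"
            return items
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the expression")
    return result


def morpheme_infos(definition):
    """Yield (part of speech, subpart or None, fields) for each morpheme info."""
    pos_name, *rest = definition
    # Second alternative: (<part of speech> (<subpart of speech> <info> ...))
    if len(rest) == 1 and rest[0] and isinstance(rest[0][0], str):
        subpart, *infos = rest[0]
    else:
        subpart, infos = None, rest
    for info in infos:
        yield pos_name, subpart, {field[0]: field[1:] for field in info}


# Invented entry in the shape of the second BNF alternative.
SAMPLE = "(noun (common ((keyword neko) (reading neko))))"
```

Calling list(morpheme_infos(read_sexpr(SAMPLE))) yields one record for the noun, with its subpart of speech and its keyword and reading fields.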

2.3 Construction of the System Dictionary

Building the system dictionary from the user dictionary requires converting the Connection Rule Dictionary and then performing a two-step conversion of the dictionary data (Figure 1). Before the conversions, JUMANPATH must be set to the name of the directory in which the dictionary is stored.

2.3.1 Conversion of the Connection Rule Dictionary

This is done by executing `makemat'. `makemat' doesn't take any arguments. At this step, the connection table (JUMANTREE.table) and the connection matrix (JUMANTREE.matrix) are generated from the Connection Rule Dictionary (JUMAN.connect) and the Conjugation Connection Dictionary (JUMAN.kankei).

The connection table contains an entry for each part or subpart of speech. The entry is a pointer into a row and a column of the connection matrix (the row entry contains information about what can follow the morpheme and the column entry contains information about what can precede it). In the case of inflectable parts of speech, the offset for the desinence in question is added to the entry for the part of speech and the resulting entry is consulted. This allows us to define entries in a uniform manner for arbitrary morpheme structures. The connection matrix records whether a pair of morpheme structures can be connected, that is, whether they can occur adjacent to each other. Morpheme structures that can combine with the same things to their right share a row in the table. Those which have the same possibilities for leftward combination share a column.
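
The row and column sharing described above can be sketched as follows. This is not the makemat algorithm, whose details are not given here; it is a small illustration in which the class names and allowed pairs are invented, and the desinence offsets for inflectable parts of speech are left out.

```python
def build_connection_tables(classes, allowed_pairs):
    """Return (table, matrix), where table maps class -> (row, column).

    Classes that allow the same followers share a row; classes that allow
    the same predecessors share a column.
    """
    allowed = set(allowed_pairs)
    followers = {c: frozenset(r for l, r in allowed if l == c) for c in classes}
    preceders = {c: frozenset(l for l, r in allowed if r == c) for c in classes}

    row_ids, col_ids, table = {}, {}, {}
    for c in classes:
        row = row_ids.setdefault(followers[c], len(row_ids))
        col = col_ids.setdefault(preceders[c], len(col_ids))
        table[c] = (row, col)

    matrix = [[False] * len(col_ids) for _ in range(len(row_ids))]
    for left, right in allowed:
        matrix[table[left][0]][table[right][1]] = True
    return table, matrix


def connectable(table, matrix, left, right):
    """Look up whether `right` may directly follow `left`."""
    return matrix[table[left][0]][table[right][1]]


# Invented example data.
CLASSES = ["noun", "proper noun", "particle", "verb"]
ALLOWED = {
    ("noun", "particle"),
    ("proper noun", "particle"),
    ("particle", "verb"),
    ("particle", "noun"),
}
```

In this example "noun" and "proper noun" are followed by the same things and so share a row, while "noun" and "verb" are preceded by the same things and so share a column.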

2.3.2 Conversion to the Intermediate Dictionary

The information in the Morpheme Dictionary is converted into the Intermediate Dictionary. (I'm not sure what "ittan" means in this sentence; maybe "partially".) This processing is invoked via `makeint'. `makeint' should be passed a file with extension `.dic' as an argument. To convert all the dictionary files in JUMANPATH or the appropriate directory, use

makeint *.dic

At this stage, the following processing takes place:

  1. By consultation with the Morpheme Part of Speech Dictionary and the Conjugation Dictionary, the part of speech, subpart of speech, paradigm, and desinence are converted into single byte symbolic values.
  2. By consultation with the connection table, each morpheme is assigned an entry number, represented as a 4-byte integer.
  3. For inflectable words, consulting the Conjugation Dictionary, only the word root is entered into the Intermediate Dictionary. In the case of words whose entire surface form changes due to inflection, all desinences are entered into the Intermediate Dictionary.

The symbolization in steps 1 and 2 makes it convenient to process fixed-length structures and restricts the size of the system dictionary files. The information from before the symbolization can be recovered from the Grammar Dictionary. One Intermediate Dictionary file is created for each Morpheme Dictionary file. The files have the same name as the original files, except that they have the extension `.int'.
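
Steps 1 and 2 amount to replacing names with small fixed-width codes. The sketch below shows one way such a fixed-length record could be packed and later decoded. The code tables, the field order, and the byte order are assumptions made for the example; the actual layout written by makeint is not documented here.

```python
import struct

# Hypothetical code tables; in the real system the mapping is derived
# from the Grammar Dictionary.
POS_CODES = {"noun": 1, "verb": 2}
SUBPART_CODES = {None: 0, "common noun": 1}
PARADIGM_CODES = {None: 0, "vowel-stem verb": 1}
DESINENCE_CODES = {None: 0, "base form": 1}

# Four single-byte symbols followed by a 4-byte entry number.
# Little-endian byte order is an assumption.
RECORD = struct.Struct("<4Bi")


def encode(pos, subpart, paradigm, desinence, entry_number):
    """Pack names into a fixed-length record of single-byte codes."""
    return RECORD.pack(
        POS_CODES[pos],
        SUBPART_CODES[subpart],
        PARADIGM_CODES[paradigm],
        DESINENCE_CODES[desinence],
        entry_number,
    )


def _name(table, code):
    """Reverse lookup: recover the name for a code from its table."""
    for name, value in table.items():
        if value == code:
            return name
    raise KeyError(code)


def decode(record):
    """Recover the original names from a packed record."""
    h1, h2, k1, k2, entry_number = RECORD.unpack(record)
    return (
        _name(POS_CODES, h1),
        _name(SUBPART_CODES, h2),
        _name(PARADIGM_CODES, k1),
        _name(DESINENCE_CODES, k2),
        entry_number,
    )
```

Every record has the same size, and decode shows how the names can be recovered as long as the code tables are available.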

2.3.3 Conversion to the System Dictionary

The system dictionary is constructed from the intermediate dictionary. This processing is performed by `maketree', and the Tree Structure Dictionary (JUMANTREE.main), the Keyword Dictionary (JUMANTREE.mida), the Readings Dictionary (JUMANTREE.yomi) and the Semantic Dictionary (JUMANTREE.imis) are generated. `maketree' must be passed files with extension `.int' as arguments. To convert all intermediate dictionary files, type:

maketree *.int

If there is a file JUMANTREE.main in the directory specified by JUMANPATH, the morphemes in the intermediate dictionary are added to it. Otherwise the file JUMANTREE.main is created.

Multiple entries are not created for morphemes with the same part of speech, subpart of speech, keyword, and reading. During the conversion various sorts of information are recorded in the file maketree.log.
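
The order of the three conversion steps in Section 2.3 can be summarized in a short sketch. It only assembles the command lines and does not run anything. The helper names and the sample file names are hypothetical, and JUMANPATH is assumed to have been set beforehand.

```python
from pathlib import PurePath


def intermediate_name(dic_file):
    """Map a Morpheme Dictionary file name to its Intermediate Dictionary name."""
    path = PurePath(dic_file)
    if path.suffix != ".dic":
        raise ValueError(f"expected a .dic file, got {dic_file!r}")
    return str(path.with_suffix(".int"))


def build_plan(dic_files):
    """Return the commands to run, in order, as argument lists (not executed)."""
    int_files = [intermediate_name(f) for f in dic_files]
    return [
        ["makemat"],                # connection table and matrix
        ["makeint", *dic_files],    # .dic -> .int (consults the connection table)
        ["maketree", *int_files],   # .int -> JUMANTREE.* files
    ]
```

makemat comes first because makeint consults the connection table when it assigns entry numbers.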

2.4 The Dictionary's Internal Data Structures

(I am very unsure of the translation of this section.)

The dictionary is represented as a set of B-Trees [2]. The B-Trees are ordered by using the initial character of the Keyword as a key. That is, all the morphemes in a given B-Tree have keywords that start with the same character.

The B-Trees' internal data structure is given in Figure 2. The search key is the Keyword. Among the information that is in the morpheme dictionary, the variable length items Keyword, Reading and Meaning are stored in the Keyword Dictionary, the Reading Dictionary and the Semantic Dictionary. In each case, the B-Tree contains an absolute offset (pointer) from the head of the dictionary file. The following 8 fields contain data:

H1:
symbolized part of speech
H2:
symbolized subpart of speech
K1:
symbolized paradigm
K2:
symbolized desinence
contbl:
connection table entry number
ptr_midasi:
pointer into the Keyword Dictionary
ptr_yomi:
pointer into the Readings Dictionary
ptr_imit:
pointer into the Semantic Dictionary

Morphemes with the same Keyword index are stored in a linear list at the nodes of the B-Tree. To make this possible, the field ptr_next contains either a pointer to the next entry which shares the same Keyword, or Nil.
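
The record layout and the chaining through ptr_next can be pictured with the following sketch. It models the fields listed above as a Python dataclass and uses object references where the real files use offsets; the on-disk representation itself is not reproduced, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Entry:
    h1: int          # symbolized part of speech
    h2: int          # symbolized subpart of speech
    k1: int          # symbolized paradigm
    k2: int          # symbolized desinence
    contbl: int      # connection table entry number
    ptr_midasi: int  # offset into the Keyword Dictionary
    ptr_yomi: int    # offset into the Readings Dictionary
    ptr_imit: int    # offset into the Semantic Dictionary
    ptr_next: Optional["Entry"] = None  # next entry with the same Keyword, or None


def entries_for_keyword(head: Optional[Entry]) -> Iterator[Entry]:
    """Walk the linear list of entries that share one Keyword."""
    node = head
    while node is not None:
        yield node
        node = node.ptr_next


# Two invented homographs stored under the same Keyword.
second = Entry(h1=2, h2=0, k1=1, k2=1, contbl=40, ptr_midasi=0, ptr_yomi=16, ptr_imit=0)
first = Entry(h1=1, h2=1, k1=0, k2=0, contbl=12, ptr_midasi=0, ptr_yomi=8, ptr_imit=0,
              ptr_next=second)
```

A lookup that reaches the node for a Keyword would then iterate over entries_for_keyword(first) to collect every candidate morpheme.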

2.5 The Standard System Grammar

A standard grammar has been prepared as the system's user dictionary. This is called the Standard System Grammar. The Part of Speech Dictionary, Conjugation Dictionary, and Connection Dictionary were built as extensions of the analysis contained in the Masuoka and Takubo grammar [1].

Part of Speech Dictionary:
We defined 14 parts of speech, adding the class "special" (punctuation, symbols, parentheses, etc.) to Masuoka and Takubo's system, and dividing affixes into Prefixes and Suffixes. (cf. JUMAN.grammar)
Conjugation Dictionary:
We defined 21 standard paradigms plus 6 special paradigms, extending the Masuoka and Takubo grammar in order to deal with literary language (bungo), colloquial language, and polite language. (cf. JUMAN.katuyou)
Connection Dictionary:
This was designed from scratch, consulting Masuoka and Takubo.
Conjugation Connection Dictionary:
A table of inflectable morpheme structures plus a table of the desinences they can take. (cf. JUMAN.kankei)

3. Morphological Analysis

The environment variable JUMANPATH defines the absolute path for the dictionary directory. There are C and Prolog versions of the morphology analysis program.

3.1 C Version

Morphological analysis is invoked by

juman -[b|m|p] -[f|e|c]

Input is read a line at a time from standard input.

Contents of the analysis: The system searches for the analysis with the fewest unknown words, morphemes and independent words. The results are displayed according to the following options:

If the analysis is ambiguous:

-b
Display a single analysis with the longest matching suffix. (I don't know what gohoosaichooichi means, but compositionally it would appear to mean "longest matching suffix".)
-m
Display all possible morphemes in ambiguous parts of the input (but not duplicating unambiguous parts).
-p
Display all analyses (duplicating unambiguous parts).

For each morpheme:

-f
Display arranging the karamu (I don't know what "karamu" means).
-e
Display all morpheme info in character format (i.e., spelled out in kana and kanji).
-c
Display all morpheme info in code format.

3.2 Prolog Version

Processing Environment

Juman is designed for SICStus Prolog 0.7 #4.

Invocation

Move to the directory juman/juman_pl, start Prolog, and `consult' `juman.pl'. Processing can proceed a sentence at a time or a file at a time.

1. | ?- juman()
2. | ?- juman.
 Input file name? 
 Output file name? user.

Contents of the Analysis

The system seeks the analysis with the fewest unknown words, morphemes, and independent words. In case of ambiguity, output is expressed in a lattice structure. Ambiguous parts of the analysis are indented.
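
The selection criterion can be illustrated with a small sketch. The text does not say how the three counts are combined, so comparing them lexicographically in the order listed is an assumption, as are the type and function names. The sketch ranks candidate analyses that are already segmented; it does not perform dictionary lookup or check connection rules.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str
    unknown: bool = False      # not found in the dictionary
    independent: bool = False  # independent word


def cost(analysis):
    """(unknown words, morphemes, independent words); smaller is better."""
    return (
        sum(m.unknown for m in analysis),
        len(analysis),
        sum(m.independent for m in analysis),
    )


def best_analyses(candidates):
    """Return every candidate that attains the minimum cost."""
    if not candidates:
        return []
    best = min(cost(a) for a in candidates)
    return [a for a in candidates if cost(a) == best]
```

More than one candidate may be returned, which corresponds to the ambiguous case in which the output contains several analyses.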

References

[1] Takashi Masuoka and Yukinori Takubo. "Basic Japanese Grammar". Kurosio Publishers. 1989.

[2] Knuth, D.E. "The Art of Computer Programming", vol. 3: Sorting and Searching. Addison-Wesley. 1973.

Figures

Figure 1. "Generation of the Dictionaries"

The boxes on the left are titled (from top to bottom)

The line connecting "makeint" to "maketree" is labeled Intermediate Structure

The boxes on the right are labeled (from top to bottom)

Figure 2. "Dictionary Data Structures"

At the right of the diagram there are 5 boxes with Japanese labels. "ptr_midasi" points to the top one, and "contbl" points to the bottom one.

From top to bottom, the labels are: