Morphemes that perform a grammatically similar role are classified together. We call the resulting classes parts of speech (keitaihinsi). The current system allows the part of speech to be further sub-classified. These sub-classes are called subparts of speech (keitaihinsisaibunrui).
Like the "verb", "adjective", and "auxiliary verb" described in school grammars, some morphemes change their form according to the morphemes that occur before and after them. This change of form is called "conjugation." The part of the surface form that does not change is called the root, and the part that does change is called the suffix. Most conjugation is regular. Categorizing morphemes according to this regularity, we call the resulting classes paradigms. The combined surface forms that actually occur are called desinences.
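The relation between root, suffix, paradigm, and desinence can be illustrated with a minimal Python sketch. The paradigm name "k-row-verb" and the desinence names below are hypothetical labels chosen for this example; they are not taken from the system's Conjugation Dictionary.

```python
# Illustrative only: the paradigm and desinence names are hypothetical labels,
# not entries from the system's Conjugation Dictionary.
PARADIGMS = {
    "k-row-verb": {
        "irrealis": "か",
        "continuative": "き",
        "terminal": "く",
        "hypothetical": "け",
        "volitional": "こ",
    },
}


def conjugate(root, paradigm, desinence):
    """Return the surface form: the unchanging root plus the suffix for the desinence."""
    return root + PARADIGMS[paradigm][desinence]


def all_forms(root, paradigm):
    """Return every surface form of a root under a paradigm, keyed by desinence name."""
    return {name: root + suffix for name, suffix in PARADIGMS[paradigm].items()}


if __name__ == "__main__":
    for name, form in all_forms("書", "k-row-verb").items():
        print(name, form)
```

Because the suffix table belongs to the paradigm rather than to the individual morpheme, every root assigned to the same paradigm shares one table.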
If morpheme m has part of speech H1, subpart of speech H2, paradigm K1, desinence K2 and surface form M, we call the list (H1, H2, K1, K2, M) the "morpheme structure of morpheme m". Elements H1, H2, K1, and K2 are filled with the names of the part of speech, the subpart of speech, the paradigm, and the desinence. M is filled with a surface form expressed in kana or kanji. This notation is used to describe the connection rules. A morpheme structure may be underspecified, in which case it stands for the set of morphemes that unify with it.
The fact that all the members of the sets of morphemes A1 and A2 can be connected (can occur next to each other in a sentence) is expressed as (A1, A2). This is called a Connection Rule. A set of morphemes is expressed as a list of morpheme structures. Given connection rule (A1, A2), every morpheme in every morpheme structure alpha1 in A1 can be followed by any morpheme in any morpheme structure alpha2 in A2. The set of Connection Rules is called the Connection Relations.
If A1 is a list of morpheme structures (a11,... a1n) and similarly A2 = (a21, ... a2k), then (A1, A2) means that any morpheme that unifies with any of the descriptions (a11,... a1n) can be followed by any one that unifies with one of (a21,... a2k).
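A minimal Python sketch of this reading of connection rules follows. It assumes that an unspecified element of a morpheme structure is represented by None and that unification is plain field-by-field equality. The field names and the part-of-speech labels in the usage note are hypothetical.

```python
from typing import NamedTuple, Optional


class MorphemeStructure(NamedTuple):
    """The list (H1, H2, K1, K2, M); None marks an unspecified element."""
    pos: Optional[str] = None        # H1: part of speech
    subpos: Optional[str] = None     # H2: subpart of speech
    paradigm: Optional[str] = None   # K1
    desinence: Optional[str] = None  # K2
    surface: Optional[str] = None    # M


def unifies(morpheme, pattern):
    """True if every specified element of the pattern equals the morpheme's element."""
    return all(p is None or p == m for m, p in zip(morpheme, pattern))


def can_follow(left, right, rules):
    """True if some rule (A1, A2) has left unifying with A1 and right with A2."""
    return any(
        any(unifies(left, a) for a in a1) and any(unifies(right, a) for a in a2)
        for a1, a2 in rules
    )
```

For example, with the hypothetical rule `([MorphemeStructure(pos="noun")], [MorphemeStructure(pos="particle")])`, any noun may be followed by any particle, but the rule says nothing about a particle followed by a noun.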
Here we discuss the internal data structures of the dictionary (Figure 1). The distinguishing feature of this dictionary system is that it allows the user to freely define parts of speech and Connection Relations.
We call the dictionary that has been defined by the user the User Dictionary. The User Dictionary is divided into the Grammar Dictionary and the Morpheme Dictionary. The Grammar Dictionary is used to describe the morphological grammar of Japanese presented in Section 1. It is composed of the Subpart of Speech Dictionary, the Conjugation Dictionary, the Connection Rule Dictionary, and the Conjugation Relation Dictionary (Section 2.1). The Morpheme Dictionary contains information about individual morphemes (Section 2.2).
The dictionary that is used for morphological analysis is called the System Dictionary. The System Dictionary consists of the following: the Connection Table and Connection Matrix, which are generated from the Connection Rule Dictionary and the Conjugation Connection Dictionary; the Tree Structure Dictionary, which contains the information in the Morpheme Dictionary in symbolic form and is constructed by referring to the Grammar Dictionary; the Keyword Dictionary; the Readings (yomi) Dictionary; and the Semantic Dictionary.
The dictionaries described here must be placed in the directory held in the Unix environment variable JUMANPATH. In this system, JUMANPATH refers to `/juman/dic'.
Furthermore, this system comes equipped with a standard User Dictionary. It is introduced in Section 2.5.
The Grammar Dictionary is set up in the following way. Each dictionary is defined in S-expression format.
The Morpheme Dictionary is defined in list form. It is stored in files with extension '.dic'. It may be divided into multiple files. Here is the BNF for the dictionary.
To build the System Dictionary from the User Dictionary, a two-step conversion of the dictionary data is necessary, along with the conversion of the Connection Rule Dictionary (Figure 1). Before the conversions, JUMANPATH must be set to the name of the directory in which the dictionary is stored.
The Connection Rule Dictionary is converted by executing `makemat', which takes no arguments. At this step, the connection table (JUMANTREE.table) and the connection matrix (JUMANTREE.matrix) are generated from the Connection Rule Dictionary (JUMAN.connect) and the Conjugation Connection Dictionary (JUMAN.kankei).
The connection table contains an entry for each part or subpart of speech.
The information in the Morpheme Dictionary is first converted into the Intermediate Dictionary. This
processing is invoked via `makeint'. `makeint' should be passed a file with extension `.dic' as an
argument. To convert all the dictionary files in JUMANPATH or the appropriate directory, use
makeint *.dic
At this stage, the following processing takes place:
The symbolization in steps 1 and 2 makes it convenient to process fixed-length structures and
restricts the size of the system dictionary files. The information from before the
symbolization can be recovered from the Grammar Dictionary. One Intermediate Dictionary file is
created for each Morpheme Dictionary file. The files have the same name as the original files,
except that they have the extension `.int'.
The system dictionary is constructed from the intermediate dictionary. This processing is
performed by `maketree', which generates the Tree Structure Dictionary (JUMANTREE.main), the
Keyword Dictionary (JUMANTREE.mida), the Readings Dictionary (JUMANTREE.yomi) and the Semantic
Dictionary (JUMANTREE.imis). `maketree' must be passed files with extension `.int' as
arguments. To convert all intermediate dictionary files, type:
maketree *.int
If there is a file JUMANTREE.main in the directory specified by JUMANPATH, the morphemes in the
intermediate dictionary are added to it. Otherwise the file JUMANTREE.main is created.
Multiple entries are not created for morphemes with the same part of speech, subpart of speech,
keyword, and reading. During the conversion various sorts of information are recorded in the file
maketree.log.
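The conversion steps above can be sketched as a small Python driver. This is only an illustration under stated assumptions: the three tools are on the PATH, they are run from inside the dictionary directory, and JUMANPATH is set to that same directory. The function names `plan_build` and `run_build` are hypothetical; only the tool names and file extensions come from the text.

```python
import os
import subprocess
from pathlib import Path


def plan_build(dic_dir):
    """Return the commands, in order, that convert the User Dictionary.

    The .int names are derived from the .dic names because makeint creates
    one intermediate file per Morpheme Dictionary file.
    """
    dic_files = sorted(p.name for p in Path(dic_dir).glob("*.dic"))
    int_files = [Path(name).with_suffix(".int").name for name in dic_files]
    return [
        ["makemat"],               # connection table and connection matrix
        ["makeint", *dic_files],   # Morpheme Dictionary -> Intermediate Dictionary
        ["maketree", *int_files],  # Intermediate Dictionary -> System Dictionary
    ]


def run_build(dic_dir):
    """Run the planned commands with JUMANPATH pointing at dic_dir (an assumption)."""
    env = dict(os.environ, JUMANPATH=str(Path(dic_dir).resolve()))
    for command in plan_build(dic_dir):
        subprocess.run(command, cwd=dic_dir, env=env, check=True)
```

Separating the plan from its execution makes it possible to inspect the command list before anything is run.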
(I am very unsure of the translation of this section.)
The dictionary is represented as a set of B-Trees [2]. The B-Trees are ordered by using the
initial character of the Keyword as a key. That is, all the morphemes in a given B-Tree have
keywords that start with the same character.
The B-Trees' internal data structure is given in Figure 2. The search key is the Keyword. Among
the information that is in the morpheme dictionary, the variable length items Keyword, Reading and
Meaning are stored in the Keyword Dictionary, the Reading Dictionary and the Semantic Dictionary. In
each case, the B-Tree contains an absolute offset (pointer) from the head of the dictionary
file. The following 8 fields contain data:
Morphemes with the same Keyword index are stored in a linear list at the nodes of the B-Tree. To
make this possible, the field ptr_next contains either a pointer to the next entry which shares the
same Keyword, or Nil.
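The organization described above can be approximated in Python as follows. This is a simplified sketch, not the on-disk format: a plain dict stands in for each B-Tree, objects stand in for file offsets, and only a few fields are modeled. Apart from `ptr_next`, the field and class names are hypothetical. The sketch also applies the rule that no second entry is created for a morpheme with the same part of speech, subpart of speech, keyword, and reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    pos: str
    subpos: str
    keyword: str
    reading: str
    ptr_next: Optional["Entry"] = None  # next entry with the same keyword, or None (Nil)


class Dictionary:
    def __init__(self):
        # One index per initial character; a plain dict stands in for each B-Tree.
        self.trees = {}

    def add(self, entry):
        """Insert an entry; return False if an identical morpheme already exists."""
        tree = self.trees.setdefault(entry.keyword[0], {})
        node = tree.get(entry.keyword)
        if node is None:
            tree[entry.keyword] = entry
            return True
        while True:
            same = (node.pos, node.subpos, node.reading) == (
                entry.pos, entry.subpos, entry.reading)
            if same:
                return False  # duplicate: no second entry is created
            if node.ptr_next is None:
                node.ptr_next = entry  # append to the linear list for this keyword
                return True
            node = node.ptr_next

    def lookup(self, keyword):
        """Return all entries sharing the keyword by following ptr_next."""
        node = self.trees.get(keyword[0], {}).get(keyword) if keyword else None
        found = []
        while node is not None:
            found.append(node)
            node = node.ptr_next
        return found
```

A lookup first selects the tree by the keyword's initial character, then finds the keyword within that tree, and finally walks the `ptr_next` chain to collect every morpheme that shares the keyword.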
A standard grammar has been prepared as the system's user dictionary. This is called the Standard
System Grammar. The Part of Speech Dictionary, Conjugation Dictionary, and Connection Dictionary
were built as extensions of the analysis contained in the Masuoka and Takubo grammar[1].
The environment variable JUMANPATH defines the absolute path for the dictionary directory. There
are C and Prolog versions of the morphological analysis program.
Morphological analysis is invoked by
juman -[b|m|p] -[f|e|c]
Input is read a line at a time from standard input.
Contents of the analysis: The system searches for the analysis with the fewest unknown words,
morphemes and independent words. The results are displayed according to the following options:
If the analysis is ambiguous:
For each morpheme:
Juman is designed for SICStus Prolog 0.7 #4.
Move to the directory juman/juman_pl, start Prolog, and `consult' `juman.pl'. Processing can
proceed a sentence at a time or a file at a time.
The system seeks the analysis with the fewest unknown words, morphemes, and independent
words. In case of ambiguity, output is expressed in a lattice structure. Ambiguous parts of the
analysis are indented.
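The selection criterion can be sketched in Python. The text names three counts but does not say how they are combined, so this sketch assumes they are compared in the order listed (unknown words first, then morphemes, then independent words). The representation of a candidate analysis as a list of tuples is also an assumption made for illustration.

```python
def cost(analysis):
    """Cost of one candidate analysis.

    analysis: list of (surface, is_unknown, is_independent) tuples.
    Returns (unknown words, morphemes, independent words); tuples compare
    lexicographically, which encodes the assumed priority order.
    """
    unknown = sum(1 for _, is_unknown, _ in analysis if is_unknown)
    independent = sum(1 for _, _, is_independent in analysis if is_independent)
    return (unknown, len(analysis), independent)


def best_analyses(candidates):
    """Return every candidate with minimal cost; more than one means ambiguity."""
    if not candidates:
        return []
    best = min(cost(c) for c in candidates)
    return [c for c in candidates if cost(c) == best]
```

When `best_analyses` returns more than one candidate, the analysis is ambiguous in the sense described above.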
[1] Takashi Masuoka and Yukinori Takubo. "Basic Japanese Grammar". Kurosio Publishers. 1989.
[2] Donald E. Knuth. "The Art of Computer Programming", vol. 3: Sorting and Searching.
Addison-Wesley. 1973.
The boxes on the left are titled (from top to bottom)
The line connecting "makeint" to "maketree" is labeled Intermediate Structure.
The boxes on the right are labeled (from top to bottom)
At the right of the diagram there are 5 boxes with Japanese labels. "prt_midasi" points to the
top one, and "contbl" points to the bottom one.
From top to bottom, the labels are:
2.3.2 Conversion to the Intermediate Dictionary
2.3.3 Conversion to the System Dictionary
2.4 The Dictionary's Internal Data Structures
2.5 The Standard System Grammar
3. Morphological Analysis
3.1 C Version
3.2 Prolog Version
Processing Environment
Invocation
1. | ?- juman()
2. | ?- juman.
Input file name?
Contents of the Analysis
References
Figures
Figure 1. "Generation of the Dictionaries"
Figure 2. "Dictionary Data Structures"