Arabic-English in Temple
Arabic morphology is based on the Semitic root-and-pattern scheme
of forming word stems, as well as the concatenation of stem and affixes. To
illustrate, the Arabic word wasayaktubuunahaa can be
analyzed as follows: the root morpheme ktb (``to write'')
combines with the verb pattern morpheme CCuC (present/future
tense) to form the stem ktub, to which are attached the
prefixes wa (``and''), sa (future tense),
ya (3rd person), and the suffixes uuna (masc. pl.)
and haa (``it''). The single Arabic word wasayaktubuunahaa thus maps to a complete English sentence: ``and they
(masc. pl.) will write it.'' In addition to having a relatively
intricate morphology, Arabic is normally written without short vowels
and other diacritics that mark gemination, zero vowel, and various
inflectional case endings. The word cited in the above example is
normally written wsyktbuunhaa.
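
As a rough illustration of the derivation above, the following Python sketch fills a pattern with a root, attaches the affixes, and then drops the short vowels to obtain the usual written form. The function names and the romanization convention (doubled letters for long vowels, single a/i/u for short vowels) are assumptions made for this example; they are not taken from the analyzer described here.

```python
import re

def apply_pattern(root, pattern):
    """Fill each 'C' slot of the pattern with the next root consonant."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

def strip_short_vowels(word):
    """Drop single a/i/u but keep doubled (long) vowels in this romanization."""
    return re.sub(r"(aa|ii|uu)|[aiu]", lambda m: m.group(1) or "", word)

# Root ktb ("to write") combined with the pattern CCuC gives the stem.
stem = apply_pattern("ktb", "CCuC")

# Prefixes wa + sa + ya, then the stem, then suffixes uuna + haa.
full = "".join(["wa", "sa", "ya", stem, "uuna", "haa"])

# The form as it would normally be written, without short vowels.
written = strip_short_vowels(full)

print(stem, full, written)
```

The sketch only covers the short vowels; the other diacritics mentioned above (gemination, zero vowel, case endings) are not modeled.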
For the purpose of computer analysis, Arabic words were treated as
having three elements: prefix, stem, and suffix. Using this approach,
the word wsyktbuunhaa was segmented as wsy-ktb-uunhaa.
All valid prefix and suffix concatenations were stored in respective lexicons. Likewise, all valid stems (i.e. combinations of root and pattern morphemes) were stored in a lexicon of stems. Three truth tables were compiled in order to act as filters for each postulated prefix-stem-suffix analysis, with a true/false value given to the various prefix-stem, stem-suffix, and prefix-suffix combinations.
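
The lookup-and-filter procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the lexicon contents, the category labels (PREF_FUT, STEM_IMPERF, and so on), and the function name analyze are all hypothetical, and the three tables are represented as sets of permitted category pairs, where membership stands for a "true" entry.

```python
# Toy lexicons mapping each surface form to a hypothetical category label.
PREFIXES = {"": "PREF_NONE", "wsy": "PREF_FUT"}
STEMS = {"ktb": "STEM_IMPERF"}
SUFFIXES = {"": "SUFF_NONE", "uunhaa": "SUFF_PL_OBJ"}

# Toy compatibility tables: a pair is "true" if it is in the set.
PREFIX_STEM = {("PREF_FUT", "STEM_IMPERF")}
STEM_SUFFIX = {("STEM_IMPERF", "SUFF_PL_OBJ"), ("STEM_IMPERF", "SUFF_NONE")}
PREFIX_SUFFIX = {("PREF_FUT", "SUFF_PL_OBJ"), ("PREF_FUT", "SUFF_NONE")}

def analyze(word):
    """Return every prefix-stem-suffix split that passes all three filters."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i + 1, len(word) + 1):  # the stem must be non-empty
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            # Step 1: each part must be present in its lexicon.
            if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
                continue
            p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
            # Step 2: each pairing must be marked true in its table.
            if (p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX and (p, x) in PREFIX_SUFFIX:
                analyses.append((prefix, stem, suffix))
    return analyses

print(analyze("wsyktbuunhaa"))
print(analyze("ktbuunhaa"))
```

In this toy data the second call finds all three parts in the lexicons (empty prefix, ktb, uunhaa) but the analysis is rejected because the prefix-stem table has no true entry for that pairing, which is the filtering role the tables play.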
The morphological analyzer is parameterized to handle regional variations in spelling (e.g. Egyptian orthography) as well as various codesets, including Unicode.
Each entry in the lexicons of prefixes, stems, and suffixes includes:
The Arabic lexicon has approximately 45,000 stem entries, with almost as many additional entries covering irregularities in morphology (e.g. weak verbs) and orthography (e.g. the spelling of the Arabic glottal stop).
A bilingual glossary of approximately 11,000 phrases was built from an Arabic corpus consisting mainly of news articles and technical documentation.