Arabic-English in Temple

Arabic Morphology

Arabic morphology is based on the Semitic root-and-pattern scheme of forming word stems, as well as the concatenation of stem and affixes. To illustrate, the Arabic word wasayaktubuunahaa can be analyzed as follows: the root morpheme ktb ("to write") combines with the verb pattern morpheme CCuC (present/future tense) to form the stem ktub, to which are attached the prefixes wa ("and"), sa (future tense), ya (3rd person), and the suffixes uuna (masc. pl.) and haa ("it"). The Arabic word wasayaktubuunahaa maps to a complete English sentence: "and they (masc. pl.) will write it." In addition to having a relatively intricate morphology, Arabic is normally written without short vowels and other diacritics that mark gemination, zero vowel, and various inflectional case endings. The word cited in the above example is normally written wsyktbuunhaa.
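
The derivation of wasayaktubuunahaa described above can be sketched in a few lines of Python. This is only an illustration of root-and-pattern stem formation followed by affix concatenation; the function name apply_pattern and the convention that each "C" in a pattern marks a consonant slot are assumptions made for this sketch, not part of the system described here.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill each 'C' slot of the pattern with the next root consonant.

    Hypothetical helper: 'C' marks a consonant slot, every other
    character of the pattern is copied through unchanged.
    """
    if pattern.count("C") != len(root):
        raise ValueError("root length must match the number of C slots")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# Root ktb ("to write") + pattern CCuC (present/future tense) -> stem
stem = apply_pattern("ktb", "CCuC")

# Affixes as listed in the text, in the order they attach
prefixes = ["wa", "sa", "ya"]   # "and", future tense, 3rd person
suffixes = ["uuna", "haa"]      # masc. pl., "it"

word = "".join(prefixes) + stem + "".join(suffixes)
print(stem, word)
```

The sketch produces the fully vocalized transliteration; it does not model the normal written form, in which short vowels are omitted.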

Computer Analysis of Arabic Morphology

For the purpose of computer analysis, Arabic words were treated as having three elements: prefix, stem, and suffix. Using this approach, the word wsyktbuunhaa was segmented as wsy-ktb-uunhaa.

All valid prefix and suffix concatenations were stored in respective lexicons. Likewise, all valid stems (i.e. combinations of root and pattern morphemes) were stored in a lexicon of stems. Three truth tables were compiled in order to act as filters for each postulated prefix-stem-suffix analysis, with a true/false value given to the various prefix-stem, stem-suffix, and prefix-suffix combinations.
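
The lookup-and-filter procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analyzer's actual code: the category labels (PREF_IMPERF and so on), the toy lexicon contents, and the choice to store each truth table as a set of compatible category pairs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy lexicons mapping each surface form to a morphological category.
# All category labels are hypothetical placeholders.
PREFIXES: Dict[str, str] = {"": "PREF_NONE", "wsy": "PREF_IMPERF"}
STEMS: Dict[str, str] = {"ktb": "STEM_IMPERF"}
SUFFIXES: Dict[str, str] = {"": "SUFF_NONE", "uunhaa": "SUFF_IMPERF_PL_OBJ"}

# The three truth tables, stored as sets of compatible category pairs.
# A pair that is absent from a set counts as "false".
PREFIX_STEM: Set[Tuple[str, str]] = {("PREF_IMPERF", "STEM_IMPERF")}
STEM_SUFFIX: Set[Tuple[str, str]] = {
    ("STEM_IMPERF", "SUFF_NONE"),
    ("STEM_IMPERF", "SUFF_IMPERF_PL_OBJ"),
}
PREFIX_SUFFIX: Set[Tuple[str, str]] = {
    ("PREF_IMPERF", "SUFF_NONE"),
    ("PREF_IMPERF", "SUFF_IMPERF_PL_OBJ"),
}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every prefix-stem-suffix split that passes all filters."""
    analyses = []
    for i in range(len(word)):                 # prefix = word[:i], may be empty
        for j in range(i + 1, len(word) + 1):  # stem = word[i:j], never empty
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            # Each element must be listed in its lexicon.
            if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
                continue
            p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
            # All three pairwise truth tables must return "true".
            if (p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX and (p, x) in PREFIX_SUFFIX:
                analyses.append((prefix, stem, suffix))
    return analyses


print(analyze("wsyktbuunhaa"))
print(analyze("ktbuunhaa"))
```

With these toy tables, the second call finds a split whose three parts are all in the lexicons, but the analysis is rejected because the (hypothetical) prefix-stem table does not allow this stem category without a prefix.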

The morphological analyzer is parameterized to handle regional variations in spelling (e.g. Egyptian orthography) as well as various codesets, including Unicode.

Lexicon Features

Each entry in the lexicons of prefixes, stems, and suffixes includes:

  1. the item itself, as normally spelled,
  2. the item's morphological category, which is used by the truth tables in determining the validity of each postulated analysis, and
  3. the corresponding English gloss.
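
The three fields listed above map naturally onto a small record type. The following is a sketch only: the class name LexiconEntry, the field names, the category label, and the decision to index entries by spelling with a list per form (so that several entries can share one unvocalized spelling) are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LexiconEntry:
    form: str      # 1. the item itself, as normally spelled
    category: str  # 2. morphological category, consulted by the truth tables
    gloss: str     # 3. the corresponding English gloss


def build_index(entries: Iterable[LexiconEntry]) -> Dict[str, List[LexiconEntry]]:
    """Group entries by spelling so that one form can carry several entries."""
    index: Dict[str, List[LexiconEntry]] = defaultdict(list)
    for entry in entries:
        index[entry.form].append(entry)
    return dict(index)


# Hypothetical sample entry based on the example word in the text
stems = build_index([LexiconEntry("ktb", "STEM_IMPERF", "write")])
print(stems["ktb"][0].gloss)
```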

The Arabic lexicon has approximately 45,000 stem entries, with almost as many additional entries covering irregularities in morphology (e.g. weak verbs) and orthography (e.g. the spelling of the Arabic glottal stop).

Bilingual Glossary

A bilingual glossary of approximately 11,000 phrases was built from an Arabic corpus consisting mainly of news articles and technical documentation.