Processing Persian Text: Tokenization in the Shiraz Project
Karine Megerdoomian and Rémi Zajac
Abstract
Prior to morphological analysis or syntactic parsing, a text needs to
undergo tokenization, in order to determine sentence and word
boundaries. This report describes the tokenizer used in the Shiraz
Persian-English machine translation project at the Computing Research
Laboratory. The Persian writing system and the methods that can be
used in recognizing token boundaries in written text are
presented. The system uses a low-level language-independent tokenizer,
which outputs an unambiguous sequence of basic tokens. Difficulties
arise in the analysis of Persian text because certain detachable morphemes
must be reattached to the word before morphological analysis takes
place. In addition, words are often concatenated in written
form. These pre-processing tasks are accomplished by a post-tokenizer
that contains language-specific information.
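
The two-stage design described above can be illustrated with a minimal sketch. It is not the Shiraz implementation: the function names, the regular expression, and the two-entry morpheme lists (the commonly detached imperfective prefix mi- and plural suffix -hâ) are assumptions chosen only for illustration. Reattachment is done by plain concatenation, and splitting of concatenated words is not attempted.

```python
import re

# Hypothetical, illustrative morpheme lists (not the Shiraz inventory).
DETACHED_PREFIXES = {"می"}  # imperfective prefix mi-
DETACHED_SUFFIXES = {"ها"}  # plural suffix -hâ


def is_word(token):
    """True if the token consists only of word characters."""
    return re.fullmatch(r"\w+", token) is not None


def basic_tokenize(text):
    """Language-independent pass: runs of word characters, or single
    non-space symbols such as punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def post_tokenize(tokens):
    """Language-specific pass: reattach detached morphemes to their host word.
    Joining by plain concatenation is a simplifying assumption."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        has_next_word = i + 1 < len(tokens) and is_word(tokens[i + 1])
        if tok in DETACHED_PREFIXES and has_next_word:
            # Prefix written separately: glue it onto the following word.
            out.append(tok + tokens[i + 1])
            i += 2
        elif tok in DETACHED_SUFFIXES and out and is_word(out[-1]):
            # Suffix written separately: glue it onto the preceding word.
            out[-1] += tok
            i += 1
        else:
            out.append(tok)
            i += 1
    return out


if __name__ == "__main__":
    sentence = "کتاب ها را می خوانم."  # "I read the books."
    basic = basic_tokenize(sentence)
    print(basic)
    print(post_tokenize(basic))
```

Keeping the language-specific knowledge in the second pass means the first pass can stay unchanged for other languages, which is the separation the abstract describes.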