Processing Persian Text: Tokenization in the Shiraz Project

Karine Megerdoomian and Rémi Zajac

Abstract

Prior to morphological analysis or syntactic parsing, a text must be tokenized to determine sentence and word boundaries. This report describes the tokenizer used in the Shiraz Persian-English machine translation project at the Computing Research Laboratory. It presents the Persian writing system and the methods that can be used to recognize token boundaries in written text. The system uses a low-level, language-independent tokenizer, which outputs an unambiguous sequence of basic tokens. Difficulties arise in the analysis of Persian text because certain detachable morphemes need to be reattached to the word before morphological analysis takes place. In addition, words are often concatenated in written form. These pre-processing tasks are accomplished by a post-tokenizer that contains language-specific information.
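
The two-stage design described above can be illustrated with a minimal Python sketch. It is not the Shiraz implementation: the function names, the small affix inventories, and the choice to join reattached morphemes with a zero-width non-joiner are all assumptions made for illustration. The sketch covers only the reattachment of detached morphemes; splitting concatenated words would additionally require a lexicon and is not shown.

```python
import re

# Hypothetical affix inventories, for illustration only (not from the Shiraz system).
DETACHED_SUFFIXES = {"ها", "های"}  # plural markers often written apart from the noun
DETACHED_PREFIXES = {"می", "نمی"}  # verbal prefixes often written apart from the stem

ZWNJ = "\u200c"  # zero-width non-joiner; using it as the joining character is an assumption


def basic_tokenize(text):
    """Language-independent pass: runs of word characters, or single punctuation marks."""
    return re.findall(r"[\w\u200c]+|[^\w\s]", text)


def is_word(token):
    """True if the token starts with a letter or digit (i.e. it is not punctuation)."""
    return token[0].isalnum()


def post_tokenize(tokens):
    """Language-specific pass: reattach detached prefixes and suffixes to their word."""
    out = []
    pending = None  # a detached prefix waiting for the following stem
    for tok in tokens:
        if pending is not None:
            if is_word(tok):
                tok = pending + ZWNJ + tok  # attach the prefix to the stem
            else:
                out.append(pending)  # no stem followed; keep the prefix as its own token
            pending = None
        if tok in DETACHED_PREFIXES:
            pending = tok
        elif tok in DETACHED_SUFFIXES and out and is_word(out[-1]):
            out[-1] += ZWNJ + tok  # attach the suffix to the preceding word
        else:
            out.append(tok)
    if pending is not None:
        out.append(pending)
    return out


# Example: the plural marker and the verbal prefix are written as separate words.
basic = basic_tokenize("کتاب ها را می خواند.")
merged = post_tokenize(basic)
```

In this example the basic tokenizer yields six tokens and the post-tokenizer merges them into four: the noun with its plural marker, the object marker, the verb with its prefix, and the final period. Keeping the first pass free of language-specific rules means only the second pass has to change for another language.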
