New Mexico State University

Persian Resources
Persian, also known as Farsi, is the official language of Iran. It is also one of the two main languages spoken in Afghanistan and the main language of Tajikistan in Central Asia. The Persian spoken in each of these three countries has been influenced by its local environment. This is especially true in Tajikistan, which was isolated from the other Persian-speaking countries during the Soviet era: the Persian spoken there has many Russian borrowings and is written in the Cyrillic (Russian) alphabet. The language resources provided here are mainly based on the Persian spoken in Iran.

Persian is derived from Indo-Iranian, one of the branches of the Indo-European languages. Indo-Iranian split into the Iranian languages and the Indo-Aryan (Indic) languages, from which most languages of India are derived. This split is estimated to have taken place around 1500 BC. The major Iranian languages are Persian, Kurdish, Pashto and Baluchi.

Persian Letters
Handwriting sample
Sample Persian text
For further information about Persian (Farsi), please visit the Ethnologue website.
Persian-English Dictionary (Including Proper Names):
The Shiraz Persian to English dictionary consists of approximately 50,000 entries including single words, phrases and proper names. The dictionary entries consist of several fields: the stem used in dictionary lookup, the part of speech of the entries and their English translations. It also contains morphological features for irregular entries. This information is stored as feature structures describing each dictionary entry.
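To make the entry layout concrete, here is a minimal Python sketch of a dictionary entry stored as a feature structure. It is an illustration only, not the Shiraz file format: the field names (stem, pos, translations, features), the feature names, and the Latin transliterations of the Persian stems are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DictEntry:
    """One dictionary entry as a small feature structure (hypothetical layout)."""
    stem: str                     # key used for dictionary lookup
    pos: str                      # part of speech, e.g. "N" or "V"
    translations: List[str]       # English translations
    # Morphological features, filled in only for irregular entries.
    features: Dict[str, str] = field(default_factory=dict)


def build_index(entries: List[DictEntry]) -> Dict[str, List[DictEntry]]:
    """Group entries by stem so that a lookup returns every homograph."""
    index: Dict[str, List[DictEntry]] = {}
    for entry in entries:
        index.setdefault(entry.stem, []).append(entry)
    return index


# Toy entries in an ad hoc Latin transliteration (not taken from the dictionary).
entries = [
    DictEntry("ketab", "N", ["book"]),
    DictEntry("shir", "N", ["milk"]),
    DictEntry("shir", "N", ["lion"]),
    # Irregular verb: the past stem "raft" has the present stem "rav".
    DictEntry("raft", "V", ["went"], {"tense": "past", "present_stem": "rav"}),
]

index = build_index(entries)
```

Looking up index["shir"] returns both noun entries, while index["raft"][0].features carries the extra morphological information that a regular entry such as "ketab" does not need.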

     

Persian Morphological Analyzer
CRL's Morphological Analyzer generates analyses for texts in Arabic, Persian and Urdu. The analyzer is written in ANSI C and has been tested on Unix, Windows and Linux.

The analyzer uses a validation table to generate the valid morphemes. Three validation rules are implemented: prefix-suffix, prefix-root, and root-suffix. These rules are used to validate the possible concatenations of a prefix, a root and a suffix.

The analyzer outputs all possible valid combinations, with the appropriate part of speech (POS) and other features of the prefix and the suffix. For efficiency, the rules are hard-coded in the analyzer, since they are fixed and can be enumerated.
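The validation scheme described above can be sketched roughly as follows. This is a toy Python illustration, not the CRL analyzer (which is written in C with hard-coded rules): the affix inventory, the transliterated roots, the feature labels, and the choice to key the prefix-root and root-suffix tables by the root's part of speech are all assumptions made for the example.

```python
# Toy inventories (hypothetical, in an ad hoc Latin transliteration).
PREFIXES = {"": "none", "mi": "durative"}
SUFFIXES = {"": "none", "am": "1sg", "ha": "plural"}
ROOTS = {"rav": "V", "ketab": "N"}  # root -> part of speech

# Validation tables: each set lists the pairings that are allowed.
PREFIX_ROOT = {("", "V"), ("", "N"), ("mi", "V")}
ROOT_SUFFIX = {("V", ""), ("V", "am"), ("N", ""), ("N", "ha")}
PREFIX_SUFFIX = {("", ""), ("", "am"), ("", "ha"), ("mi", ""), ("mi", "am")}


def analyze(word):
    """Return every prefix + root + suffix split that passes all three checks."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            prefix, root, suffix = word[:i], word[i:j], word[j:]
            # Each piece must be a known morpheme.
            if prefix not in PREFIXES or root not in ROOTS or suffix not in SUFFIXES:
                continue
            pos = ROOTS[root]
            # Apply the prefix-root, root-suffix, and prefix-suffix rules.
            if ((prefix, pos) in PREFIX_ROOT
                    and (pos, suffix) in ROOT_SUFFIX
                    and (prefix, suffix) in PREFIX_SUFFIX):
                analyses.append({
                    "prefix": prefix,
                    "root": root,
                    "suffix": suffix,
                    "pos": pos,
                    "prefix_feature": PREFIXES[prefix],
                    "suffix_feature": SUFFIXES[suffix],
                })
    return analyses
```

With these toy tables, analyze("miravam") yields a single analysis (mi + rav + am, a verb), while analyze("miketab") yields none because the pairing of the prefix "mi" with a noun root is not listed in PREFIX_ROOT. Using table lookups here keeps the sketch short; the source states that the real analyzer hard-codes its rules for efficiency.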

The package contains the source code for the analyzer, along with test examples in Arabic, Persian and Urdu and their outputs. A readme file gives details on compiling and running the analyzer on each platform.

A description of the analyzer and tables of features recognized for each language can be found here.



Persian Machine Translation System
Please visit the Shiraz project's website for a description of the Persian-English machine translation system.