Tools for acquiring Uyghur text


Uyghur Encodings

There is an official Chinese encoding of Uyghur, GB 12050-89, but we do not know whether it is actually in use, and little information about it is available on the web. Most Uyghur text on the Internet appears to be encoded in Unicode and is written in the Latin, Cyrillic, or Arabic script.
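Because collected text may be in any of the three scripts, it can help to sort files by script before further processing. The following is a minimal sketch, not part of any tool described here; the function name dominant_script is hypothetical, and the approach (looking at the first word of each letter's Unicode character name) is only a rough heuristic.

```python
import unicodedata
from collections import Counter

# Script labels as they appear at the start of Unicode character names.
SCRIPTS = ("ARABIC", "CYRILLIC", "LATIN")

def dominant_script(text):
    """Return the most frequent script among the letters of `text`,
    or None if the text contains no Arabic, Cyrillic, or Latin letters."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else None
```

Calling dominant_script on a line of text returns "ARABIC", "CYRILLIC", "LATIN", or None, so mixed collections can be split into per-script sets.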

Converting Uyghur text for processing

To get certain letters to display properly in web browsers, Uyghur text in the Arabic script has to be used with special fonts and has to be encoded with characters from the Arabic Presentation Forms blocks of Unicode rather than the regular Arabic block. A Perl script that converts between the Arabic Presentation Forms blocks and the Arabic block of Unicode is available at:

http://crl.nmsu.edu/say/tools/uniuygh.pl

The command line is:

uniuygh.pl [-a|-p] [-o outfile] [file1 file2 ...]

-a            Convert from Arabic block text to Presentation Form block text.
-p            Convert from Presentation Form block text to Arabic block text.
-o outfile    Use the named output file instead of STDOUT.
file1 ...     If no filenames are provided for conversion, input is expected from STDIN. All files are expected to be UTF-8 encoded.
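
For readers without Perl, the -p direction (Presentation Form block text to Arabic block text) can be approximated with Unicode compatibility normalization. The sketch below is an assumption about how such a conversion could be done, not the method uniuygh.pl uses, and its output may differ from the script's. The function names are hypothetical. The reverse direction (-a) needs contextual shaping rules and is not covered.

```python
import unicodedata

# Code point ranges of Arabic Presentation Forms-A and Forms-B.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))

def is_presentation_form(ch):
    """True if `ch` lies in one of the Arabic Presentation Forms blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES)

def presentation_to_arabic(text):
    """Map presentation-form characters to their Arabic block equivalents.

    Only characters inside the two presentation blocks are normalized
    (NFKC); everything else, including Latin and Cyrillic text, passes
    through unchanged.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )
```

Normalizing character by character, rather than the whole string, keeps the change limited to the Arabic presentation forms; applying NFKC to the entire text would also rewrite unrelated compatibility characters such as Latin ligatures.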

Typing Uyghur text on Windows 2000/XP

We do not know of any system-level Uyghur input methods that are available yet. A couple of Keyman packages for the Arabic and Cyrillic scripts are currently in development.

Uyghur Fonts

To display Uyghur properly, special fonts with the necessary glyphs are needed. Any of the fonts at http://www.ukij.org/fonts/ should work properly.