Russian-English in Temple
by Nicholas Ourusoff nourusof@crl.nmsu.edu
Overview
The Russian component of the Temple project has five goals:
- 1. To provide a Russian corpus for building the Russian glossary and for demonstrating Temple's Russian glossary-based translation capabilities;
- 2. To provide a Russian morphological analyzer;
- 3. To provide a general Russian lexicon (Russian-English word dictionary);
- 4. To provide a Russian glossary (Russian-English phrase dictionary);
- 5. To provide a dictionary of proper names and places.
Bilingual dictionary
The Russian lexicon is a bilingual Russian-English word dictionary that can be used by a program (for example, one that annotates a Russian text with the English translation of each word) and can also be consulted on-line by the user. The dictionary is based on a list of words with meanings generated by Yuri Mordovskoi.
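A lexical entry consists of the headword, a $C line giving the part of speech, and a $I line giving the English meaning; a sample entry appears in the Appendix.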
A program written in Perl ensures that the dictionary is free of typographical errors that may have been introduced during its creation, and that the dictionary for a language conforms to the standard parts of speech used in that language's dictionary. (To date, there is no standardization of parts of speech across languages.) The Russian, Spanish and Japanese dictionaries have been edited using this program.
The dictionary is converted to the lexicon format using the program convert_to_lbformat.c. The lexicon index file, which is generated by the program ken.indx, speeds up the dictionary search.
The dictionary contains 6650 entries. A word with several different senses but the same part of speech is presented as a single entry with multiple meanings. However, if a word has several distinct parts of speech, it has a separate entry for each part of speech.
Morphological analyzer
The Russian Morphology Analyzer (RMA) is based on an algorithm developed by Svetlana Sheremetyeva and Sergei Nirenburg (CRL/NMSU) that performs Russian morphological analysis with a minimum of acquisition effort. No comprehensive stem dictionary is needed. Instead, the algorithm relies on a unit of description, the quasi-root (the string of 3 characters preceding the word ending), and on a series of short tables or lexicons (some 5000 entries in all) to determine morphological information. Another feature of the algorithm is its efficiency in resolving the part of speech of an input string: it first tests whether the input is a noun, and if it is, it need not consider participial or adjectival candidates; it next tests whether the input is a participle or superlative adjective, and if it is, it need not consider other adjectival or verb candidates.
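The two ideas above, the quasi-root and the ordered part-of-speech tests, can be illustrated with a minimal Python sketch. This is not the RMA code (which is written in Allegro Common Lisp): the ending list, the longest-match rule, the predicates, and the names split_quasi_root and resolve_pos are all assumptions made for illustration. The order of the adjective and verb tests after the first two is likewise assumed.

```python
# Toy ending list: an assumption for illustration, not RMA's actual tables.
TOY_ENDINGS = ("ыми", "ая", "ый", "ть", "а", "ы")


def split_quasi_root(word, endings=TOY_ENDINGS, width=3):
    """Return (quasi_root, ending) for `word`.

    The quasi-root is taken as the `width` characters that precede the
    ending. Choosing the longest matching ending is an assumption.
    """
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            return stem[-width:], ending
    # No listed ending matched: treat the word as having an empty ending.
    return word[-width:], ""


def resolve_pos(word, ordered_tests):
    """Run (label, predicate) tests in order and stop at the first hit,
    so later candidates are never examined once an earlier test succeeds."""
    for label, predicate in ordered_tests:
        if predicate(word):
            return label
    return None


# Hypothetical predicates standing in for RMA's table lookups.
tests = [
    ("noun", lambda w: split_quasi_root(w)[1] in ("а", "ы")),
    ("participle or superlative adjective", lambda w: False),
    ("adjective", lambda w: split_quasi_root(w)[1] in ("ая", "ый", "ыми")),
    ("verb", lambda w: split_quasi_root(w)[1] == "ть"),
]

print(split_quasi_root("абсолютная"))
print(resolve_pos("абсолютная", tests))
```

In a real analyzer each predicate would consult the short tables keyed by quasi-root and ending; the sketch only shows how the ordering lets the analysis stop early.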
The Russian Morphology Analyzer (RMA) program, implemented in Allegro Common Lisp, reads an input file in KOI-8 Cyrillic format and produces an output file that is input to the Temple morphological analyzer. In addition, RMA optionally produces a short and long display of each input word's morphological information.
To date, RMA has been tested on lists of nouns, participles, adjectives and verbs, and on some selected texts. The results are promising; further testing on a large corpus is underway.
Bilingual glossary
The Russian glossary was built using a Russian corpus that includes
aerospace and computer science corpora provided by Yuri Mordovskoi,
and a fictional source (Zhukov) downloaded by Ted Dunning. Frequency
analysis of 2-, 3-, 4- and 5-grams, using scripts written by
Krishna Rajput, formed the basis for the selection of phrases by native
Russian speakers, Natasha Broido and Nicholas Ourusoff.
The Russian glossary contains approximately 1650 phrases (94,705 characters).
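The kind of frequency analysis described above can be sketched in a few lines of Python. This is not a reconstruction of the original scripts: the function name ngram_counts, the whitespace tokenization, and the tiny sample text are assumptions chosen only to show how 2- to 5-word sequences might be counted and ranked before a native speaker reviews them.

```python
from collections import Counter


def ngram_counts(tokens, sizes=(2, 3, 4, 5)):
    """Count every contiguous n-gram of the given sizes in a token list."""
    counts = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Whitespace tokenization is an assumption; real text needs punctuation handling.
text = "высшее учебное заведение и высшее учебное заведение"
counts = ngram_counts(text.split())

# List candidate phrases that occur more than once, most frequent first.
for ngram, freq in counts.most_common():
    if freq > 1:
        print(freq, " ".join(ngram))
```

Frequent n-grams are only candidates; as the text notes, the final choice of glossary phrases was made by native speakers.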
Appendix
If the Russian text below does not appear in Cyrillic, then you need
to do the following (it looks long, but is quite simple):
1. In the Netscape menu bar, select General Preferences from the Options
menu item. Click on Fonts. Click on the button associated with "For
the Encoding:" and select User-Defined. Click on the button
associated with "Use proportional Font:" and select Cyrillic (Glonti
Samarin, koi8-1). Click on the button associated with "Use the Fixed
Font:" and select Fixed (Misc, koi8-1). Then click on OK.
2. In the Netscape menu bar, select Document Encoding from the
Options menu, and select User-Defined. The Cyrillic text should
appear as Cyrillic.
dictionary entry
A sample dictionary (lexicon) entry:
абсолютная
$C adj
$I absolute;
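A program that consults the dictionary needs to read entries of this shape. The following Python sketch parses the sample entry above. It assumes that $C carries the part of speech, that $I carries the English meanings, and that meanings are separated by semicolons; the function name parse_entry and the dictionary layout it returns are hypothetical.

```python
def parse_entry(lines):
    """Parse one lexicon entry given as a list of text lines.

    Assumed layout: headword on the first line, then tagged lines where
    "$C" holds the part of speech and "$I" holds semicolon-separated meanings.
    """
    entry = {"headword": lines[0].strip(), "pos": None, "meanings": []}
    for line in lines[1:]:
        tag, _, value = line.strip().partition(" ")
        if tag == "$C":
            entry["pos"] = value.strip()
        elif tag == "$I":
            entry["meanings"] = [m.strip() for m in value.split(";") if m.strip()]
    return entry


sample = ["абсолютная", "$C adj", "$I absolute;"]
print(parse_entry(sample))
```

Because a word with several parts of speech has one entry per part of speech, a lookup table built from such entries would naturally be keyed on the pair (headword, part of speech).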
glossary
Sample glossary entries:
прямо сказать<:1>
say<:1> frankly
say<:1> bluntly
высшее учебное заведение<:1>
higher educational institution<:1>
institution<:1> of higher learning