Arabic Tools Page



The Arabic IR :
The Arabic IR system supports querying an Arabic resource and retrieving Arabic documents using the URSA Unicode-based retrieval engine developed at CRL (see http://crl.nmsu.edu/Research/Projects/tipster/ursa/ for papers, technical documentation and download of the URSA engine). The system has been demonstrated using an archive of 5 months of the Al-Rayah daily newspaper from Qatar.  The system can run in full-word mode or with a morphological analyzer.  In the latter mode, documents are indexed by stem, and the query is also processed by the morphological analyzer. See a web demo at http://crl.nmsu.edu/~ahmed/
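
The difference between the two indexing modes can be illustrated with a minimal sketch. This is not the URSA engine or the CRL morphological analyzer: the function names, the toy stemmer (which only strips a leading "Al" article from a transliterated token) and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

def full_word(token):
    """Full-word mode: index the surface form unchanged."""
    return token

def toy_stem(token):
    """Hypothetical stand-in for a morphological analyzer.

    It only strips a leading definite article 'Al' from a
    transliterated token; a real analyzer does far more.
    """
    if token.startswith("Al") and len(token) > 4:
        return token[2:]
    return token

def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query, normalize):
    """Return ids of documents that contain every query term."""
    postings = [index.get(normalize(t), set()) for t in query.split()]
    if not postings:
        return set()
    return set.intersection(*postings)

# Illustrative transliterated documents (not taken from the Al-Rayah archive).
docs = {1: "AlktAb jdyd", 2: "ktAb qdym"}
word_index = build_index(docs, full_word)
stem_index = build_index(docs, toy_stem)

print(search(word_index, "ktAb", full_word))  # only the exact surface form matches
print(search(stem_index, "ktAb", toy_stem))   # both inflected variants match
```

The same normalizer is applied to documents and to the query, which mirrors the description above: in stem mode the query goes through the analyzer as well.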

The Arabic-English CLIR :
The Arabic-English CLIR system makes it possible to query the Arabic text using an English query.  The English query is processed interactively and the user can improve the translated query.  The translated (Arabic) query is then sent to the Arabic IR system to retrieve relevant Arabic documents. The user has a full set of tools to display and browse the results and to translate the retrieved documents back into English.
See a web demo at http://crl.nmsu.edu/~ahmed/test/news/index1.html

 

The Arabic corpora :
The Arabic corpora collected at CRL include a collection of more than 60MB of news from the AFP news agency. All the documents in the collection are listed by date and tagged in SGML format using a template that records the story number, date, headlines and a footer.
We also have several collections of on-line newspaper articles from eleven countries (Algeria, Iraq, Jordan, Kuwait, Lebanon, Mauritania, Morocco, Oman, Qatar, Saudi Arabia and Syria). These news articles were collected between February and November 2000. The collection has been tagged, indexed and used in information retrieval experiments, and it can be queried using the tools developed at CRL. The indexes are built in two different ways: (i) based on the surface form of the word, and (ii) using the morphology. The system also lets the user choose whether to stem the query, and whether to search a single country's newspaper or the whole collection. The collection is accessible from our web site http://crl.nmsu.edu/~ahmed/test/news/ .

The Arabic-English dictionary:
The Arabic-English dictionary is essentially a morphological dictionary with English translations. It contains neither the usual part-of-speech information nor proper citation forms. Instead, an entry key (field {$H} below) is a morphological stem, typically a sub-string of an inflected word. All stem variants for the same word are listed. Each entry contains a morphological category (the number of the inflectional paradigm for that stem, field {$C}). English translations are listed in field {$I}. The dictionary contains approximately 72,000 (stem, category) pairs and about 43,000 unique stems.  An entry looks like:

 $H dxAry
 $C R001
 $I savings;storage;
 $$
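
A minimal sketch of how entries in this format might be read is given below. The layout rules it relies on (one field per line, "$$" closing an entry, semicolon-separated translations) are inferred from the single sample entry above, and the function and key names are hypothetical.

```python
def parse_entries(text):
    """Parse $H/$C/$I records terminated by '$$' into a list of dicts.

    Assumed layout (inferred from one sample entry):
      $H <stem>, $C <category>, $I <translations separated by ';'>, $$
    """
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "$$":
            # End of record: keep it only if it has content.
            if current:
                entries.append(current)
            current = {}
            continue
        tag, _, value = line.partition(" ")
        if tag == "$H":
            current["stem"] = value.strip()
        elif tag == "$C":
            current["category"] = value.strip()
        elif tag == "$I":
            current["translations"] = [
                t.strip() for t in value.split(";") if t.strip()
            ]
    return entries

sample = """
 $H dxAry
 $C R001
 $I savings;storage;
 $$
"""
entries = parse_entries(sample)
print(entries)
```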

The Arabic proper names dictionary:
The Arabic proper names dictionary was developed to serve as the main resource for proper nouns and to capture some of their specific characteristics and categories (first name, last name, place, ...). The dictionary contains 1694 entries, mainly Arabic personal names, places and organizations. Along with the Arabic citation form, the dictionary also includes the English translation or, in some cases, a transliteration.

English-Arabic Onomasticon:
The English-Arabic Onomasticon is a large collection (204,606 entries) of names of persons and companies; it also includes geographical places, countries and cities around the world. The names were extracted mainly from English newspapers and journals. The Arabic side of the Onomasticon is an Arabic transliteration of the Western names.
The dictionary is arranged in the same template format used in the Temple project.

English-Arabic Dictionary:
The English-Arabic dictionary contains 113,208 entries. Each entry contains the part of speech (POS) of the English word and one or more translations, depending on the meaning and the context of use of the English word (see Figure 3 on page 3).

Arabic-English phrasal dictionary (glossary):
The machine translation system uses an Arabic-English phrasal dictionary ("glossary") containing approximately 12,000 phrases. This glossary was built by automatically extracting phrasal patterns from an Arabic corpus of news articles and technical documentation.  Translations were added manually.

Number and Date Tagger:
The tagger works on CP-1256 Arabic text and produces tagged text in which dates and date expressions and numbers (ordinals or digits) are marked up. Each tag has a value attribute that contains the decimal value of the Arabic number or date, for example:
<date value=130193>13 ynAyr 1993</date>
<number value=2>2</number>
<ordinal value=1>Al>wl</ordinal>
<number value=1993>1993</number>
<date value=110193>Al>vnyn</date>
The tagger can be run on the command line using the script numberDateTagger as:
          numberDateTagger  <inputfile> <outputfile> 
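
A short sketch of how the tagged output shown above could be post-processed follows. It does not call the tagger itself. The regular expression covers only the three tag names that appear in the sample (date, number, ordinal), and the reading of date values as DDMMYY is an inference from "13 ynAyr 1993" being tagged 130193, not a documented format; all function names are hypothetical.

```python
import re

# Matches e.g. <date value=130193>13 ynAyr 1993</date>; attribute values
# are unquoted digits, as in the sample output.
TAG_RE = re.compile(r"<(date|number|ordinal) value=(\d+)>(.*?)</\1>")

def extract_tags(text):
    """Return (tag, value, surface text) triples in document order.

    The value is kept as a string so leading zeros are not lost.
    """
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(text)]

def split_date_value(value):
    """Split a six-digit date value into (day, month, two-digit year).

    Assumes a DDMMYY layout, inferred from the sample output only.
    """
    return int(value[0:2]), int(value[2:4]), int(value[4:6])

sample = """<date value=130193>13 ynAyr 1993</date>
<number value=2>2</number>
<ordinal value=1>Al>wl</ordinal>
<number value=1993>1993</number>
<date value=110193>Al>vnyn</date>"""

tags = extract_tags(sample)
for tag, value, surface in tags:
    print(tag, value, surface)
```

The non-greedy match up to the closing tag keeps transliterated text such as "Al>wl", which itself contains a ">" character, intact.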
 
  
  
 
