The Arabic IR :
The Arabic IR system supports querying an Arabic resource and
retrieving Arabic documents using the URSA Unicode-based retrieval engine
developed at CRL (see http://crl.nmsu.edu/Research/Projects/tipster/ursa/
for papers, technical documentation and download of the URSA
engine). The system has been demonstrated using an archive of 5 months of the
Al-Rayah daily newspaper from Qatar. The system can run either in
full-word mode or with a morphological analyzer. In the latter mode,
documents are indexed by stem, and the query is also processed by the
morphological analyzer. See a web demo at http://crl.nmsu.edu/~ahmed/
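The difference between the two modes can be illustrated with a toy inverted index. This is only a sketch and does not use the URSA engine: `toy_stem` is a hypothetical stand-in for the morphological analyzer (it merely strips a leading "Al" from transliterated tokens), and the two sample documents are invented for illustration.

```python
from collections import defaultdict


def toy_stem(token):
    """HYPOTHETICAL placeholder for the morphological analyzer:
    strips a leading 'Al' (definite article in transliteration)."""
    if token.startswith("Al") and len(token) > 4:
        return token[2:]
    return token


def build_index(docs, use_stems):
    """Map each index key (full word or stem) to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            key = toy_stem(token) if use_stems else token
            index[key].add(doc_id)
    return index


def search(index, query, use_stems):
    """Return ids of documents containing every query term.
    The query is normalized the same way as the indexed documents."""
    keys = [toy_stem(t) if use_stems else t for t in query.split()]
    if not keys:
        return set()
    result = set(index.get(keys[0], set()))
    for key in keys[1:]:
        result &= index.get(key, set())
    return result


# Invented sample documents in transliterated form.
docs = {1: "AlktAb jdyd", 2: "ktAb qdym"}
word_index = build_index(docs, use_stems=False)
stem_index = build_index(docs, use_stems=True)

print(sorted(search(word_index, "ktAb", use_stems=False)))
print(sorted(search(stem_index, "ktAb", use_stems=True)))
```

In full-word mode the query matches only the document containing the exact form, while in stem mode both inflected forms map to the same index key.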
The Arabic-English CLIR :
The Arabic-English CLIR system offers the possibility of
querying the Arabic text using an English query. The English query is
processed interactively and the user can improve the translated query. The
translated (Arabic) query will be sent to the Arabic IR system to retrieve
relevant Arabic documents. The user has a full set of tools to display and
browse the results, and to translate the retrieved documents back into English.
See a web demo at http://crl.nmsu.edu/~ahmed/test/news/index1.html
The Arabic corpora :
The Arabic corpora collected at CRL include a collection of more than
60MB of news from the AFP news agency. All the documents in the collection
are listed by date and tagged in SGML using a purpose-designed template
that records the story number, date, headline and a footer.
We also have
several collections of on-line newspaper articles from eleven countries
(Algeria, Iraq, Jordan, Kuwait, Lebanon, Mauritania, Morocco, Oman, Qatar, Saudi
Arabia and Syria). These news articles were collected between February and
November 2000. This collection has been tagged, indexed and used in some
information retrieval experiments, and it can
be queried using the tools developed at CRL. The indexes are built in two
different ways: (i) based on the surface form of the word, and (ii) using the morphology. The
system also allows the user to choose whether to stem the query, and to search the
collection either by a single country (newspaper) or across the whole collection.
The collection is accessible from our web site at http://crl.nmsu.edu/~ahmed/test/news/
The Arabic-English dictionary:
The Arabic-English dictionary is essentially a morphological
dictionary with English translations. It contains neither the usual part-of-speech information nor proper citation forms. Instead, an entry
key (field $H below) is a morphological stem, typically a sub-string
of an inflected word. All stem variants for the same word are listed. Each
entry contains a morphological category (the number of the inflectional paradigm
for that stem, field $C). English translations are listed in field $I.
The dictionary contains approximately 72,000 (stem, category) pairs
and about 43,000 unique stems. An entry looks like:
$H dxAry
$C R001
$I savings;storage;
$$
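As an illustration of how entries in this format could be read, the following minimal sketch parses records terminated by a `$$` line into a stem lookup table. The function names and the assumption that translations in `$I` are separated by semicolons are inferred from the sample entry above; they are not part of the distributed dictionary tools.

```python
from collections import defaultdict

# The sample entry shown above.
SAMPLE = """\
$H dxAry
$C R001
$I savings;storage;
$$
"""


def parse_entries(text):
    """Yield one dict per record; a record ends at a '$$' line."""
    entry = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "$$":
            if entry:
                yield entry
            entry = {}
            continue
        field, _, value = line.partition(" ")
        if field == "$H":
            entry["stem"] = value.strip()
        elif field == "$C":
            entry["category"] = value.strip()
        elif field == "$I":
            # ASSUMPTION: translations are ';'-separated, as in the sample.
            entry["translations"] = [t.strip() for t in value.split(";") if t.strip()]
    if entry:  # tolerate a final record without a closing '$$'
        yield entry


def build_lookup(entries):
    """Group (category, translations) pairs under each stem."""
    lookup = defaultdict(list)
    for e in entries:
        lookup[e.get("stem")].append((e.get("category"), e.get("translations", [])))
    return lookup


lookup = build_lookup(parse_entries(SAMPLE))
print(lookup["dxAry"])
```

Keeping a list per stem reflects the fact that the dictionary has more (stem, category) pairs than unique stems, so one stem may carry several categories.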
The Arabic proper names dictionary:
The Arabic proper names dictionary was developed to serve as
a main resource for proper nouns and to capture some of their specific characteristics and categories (first name, last name, place, ...).
The dictionary contains 1694 entries, which are mainly
Arabic proper names, places and organizations. Along with the Arabic
citation form, the dictionary also includes the English translation or, in some cases, a transliteration.
English-Arabic Onomasticon:
The English-Arabic Onomasticon is a large collection (204,606 entries) of names of persons and companies; it also includes
geographical places, countries and cities around the world. The names were extracted mainly
from English newspapers and journals. The Arabic side of the Onomasticon
is an Arabic transliteration of the Western names.
The dictionary is arranged using the same template format as
the Temple project.
English-Arabic Dictionary:
The English-Arabic dictionary contains 113,208 entries. Each entry
contains the part of speech (POS) of the English word and one or more translations
based on the meaning and the context of usage of the English word (see
Figure 3 on page 3).
Arabic-English phrasal dictionary (glossary):
The machine
translation system uses an Arabic-English phrasal dictionary ("glossary")
containing approximately 12,000 phrases. This glossary was built by automatically extracting phrasal
patterns from an Arabic corpus of news articles and technical documentation.
Translations were added manually.
Number and Date Tagger:
The tagger works on
CP-1256 Arabic texts and produces a tagged text with date, number and ordinal
tags covering dates, date expressions and numbers (ordinal or digit). Each tag has a value attribute which
contains the decimal value of the Arabic number or date.
<date value=130193>13 ynAyr 1993</date>
<number value=2>2</number>
<ordinal value=1>Al>wl</ordinal>
<number value=1993>1993</number>
<date value=110193>Al>vnyn</date>
The tagger can be run on the command
line using the script numberDateTagger as:
numberDateTagger <inputfile> <outputfile>
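
The tagged output can then be post-processed with ordinary text tools. The sketch below extracts the tags with a regular expression; it is an illustration rather than part of the tagger distribution. The helper names are hypothetical, and the reading of a date value as day, month, two-digit year is an assumption inferred only from the example value 130193 for "13 ynAyr 1993".

```python
import re

# Unquoted attribute values and a literal '>' inside the transliterated
# text (e.g. "Al>wl") rule out a strict XML parser, so a regex is used.
TAG_RE = re.compile(r"<(date|number|ordinal)\s+value=(\d+)>(.*?)</\1>", re.DOTALL)

# A fragment in the output format shown above.
SAMPLE = (
    "<date value=130193>13 ynAyr 1993</date> "
    "<number value=2>2</number> "
    "<ordinal value=1>Al>wl</ordinal>"
)


def extract_tags(text):
    """Return (tag name, value attribute, tagged text) for every tag found."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG_RE.finditer(text)]


def split_date_value(value):
    """ASSUMPTION: a date value is laid out as DDMMYY (inferred from one
    example, not from the tagger documentation)."""
    padded = value.zfill(6)
    return int(padded[0:2]), int(padded[2:4]), int(padded[4:6])


for name, value, text in extract_tags(SAMPLE):
    print(name, value, text)
```

The value attribute is kept as a string so that any leading zeros in a date value are preserved before it is split.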