Contributions and inquiries to: lexical@nmsu.edu OR lexical@nmsu.bitnet
FTP address for accessing materials: clr.nmsu.edu [128.123.1.12].
Introduction
The Consortium for Lexical Research is designed to serve as a repository for software and resources of importance to the computational linguistics and natural language processing research community. CLR's objective is to help alleviate the repeated re-creation of basic software tools, and to assist in making essential data sources more generally available. For more information on the Consortium, please ftp to the site (address above) to obtain a copy of the CLR catalog. In the directory CLR/ it is available in a plain ascii file as catalog or in a postscript version, catalog.ps. Any questions about the archives or on becoming a member of CLR can be directed to lexical@nmsu.edu. This newsletter is also distributed from the ftp site. The directory is CLR/newsletter and the files to get are news12.txt, or news12.ps for a postscript version.
*************************************************
http://crl.nmsu.edu/home.html
The Home Page has links for downloading a copy of the CLR catalog, or you can view it on-line. When you are viewing the catalog, it is possible to select an item held at the ftp site and actually download that item to your machine. For security reasons, this option is not available for those resources held in members-only.
The Home Page also has the most recent newsletters on-line, and the
beginnings of a collection of "sample" files for lexicons and
dictionaries. Right now, sample files for the Harper Collins bilingual
German and bilingual Spanish dictionaries show what these electronic
editions look like in their raw typesetter's form. Very soon examples of the
Collins tagged format will be available. In a section called "Other
On-line Lexical Resources" CLR has begun to build a selection of ftp,
gopher, and WWW sites which are of interest to lexical researchers.
SYNTACTICA is a software application tool designed to be used in introductory linguistics classes, or classes with a syntax component. The program provides a simple graphical interface for creating grammars, for viewing the structures they assign to natural language sentences, and for transforming those structures by movement, deletion, copying, etc.
In SYNTACTICA, grammars consist of a set of context-free phrase structure rules and a lexicon. Sets of phrase structure rules are created in a Rule Window, using a rule template. The screen shot on the next page shows a Rule Window (the window says "Rules" at the top, and you can see the rules which have been entered: S --> NP VP, NP --> Det N, NP --> N, etc.). The rule template allows the user to specify familiar category information for PS-rules and to mark which nodes in a rule are heads. The choice of heads determines the path by which features are passed in a phrase-marker.
Lexicons are created in a Lexicon Window, according to a template for lexical items. The screen shot shows the Lexicon Window and the lexical items: John, Mary, cake, a, baked, for. The lexicon window is displaying information about the lexical item bake. The template permits the user to enter basic information about words, including category, feature and subcategorization information, and whether the lexical item is audible or phonetically null.
Once a rule-set and a lexicon are created, they can be used to generate phrase-markers for sentences. Rules and lexicon are first loaded into the Tree Viewer Window (see screenshot). The user enters a sentence and presses the Build Tree button, and SYNTACTICA generates a phrase-marker using the grammar that has been loaded. When more than one structure is possible, SYNTACTICA computes all phrase markers and displays the range in the Parse field. By clicking on Parse 0 or Parse 1, the user views the alternative structures. Various operations can be performed on the phrase markers: right and left adjunction, substitution, deletion, copying and indexing of constituents.
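For readers who would like to see the idea in miniature, here is a short Python sketch of how a set of context-free rules plus a lexicon can yield every phrase-marker for a sentence. It is not SYNTACTICA's code (SYNTACTICA runs on a Prolog engine). The VP and PP rules, the rule NP --> NP PP, the category labels given to the six words, and the names RULES, LEXICON, parse_all and bracket are all assumptions made for illustration.

```python
from functools import lru_cache

# Toy grammar. S, NP --> Det N and NP --> N come from the newsletter text;
# the remaining rules are assumed so that an ambiguous sentence exists.
RULES = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("N",), ("NP", "PP")],
    "VP": [("V", "NP"), ("V", "NP", "PP")],
    "PP": [("P", "NP")],
}

# Assumed category for each lexical item.
LEXICON = {
    "John": "N", "Mary": "N", "cake": "N",
    "a": "Det", "baked": "V", "for": "P",
}


def parse_all(words, start="S"):
    """Return every tree the grammar assigns to the whole word sequence."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def trees(cat, i, j):
        # All trees of category `cat` spanning words[i:j].
        found = []
        if j - i == 1 and LEXICON.get(words[i]) == cat:
            found.append((cat, words[i]))
        for rhs in RULES.get(cat, []):
            for kids in expand(rhs, i, j):
                found.append((cat,) + kids)
        return tuple(found)

    def expand(rhs, i, j):
        # All ways to split words[i:j] among the symbols of `rhs`.
        if len(rhs) == 1:
            return [(t,) for t in trees(rhs[0], i, j)]
        out = []
        # Leave at least one word for each remaining symbol.
        for k in range(i + 1, j - len(rhs) + 2):
            for left in trees(rhs[0], i, k):
                for rest in expand(rhs[1:], k, j):
                    out.append((left,) + rest)
        return out

    return list(trees(start, 0, len(words)))


def bracket(tree):
    """Render a tree as a labelled bracketing."""
    if isinstance(tree[1], str):
        return f"[{tree[0]} {tree[1]}]"
    return "[" + tree[0] + " " + " ".join(bracket(t) for t in tree[1:]) + "]"


for n, tree in enumerate(parse_all("John baked a cake for Mary".split())):
    print(f"Parse {n}: {bracket(tree)}")
```

With these assumed rules, "John baked a cake for Mary" receives two structures, one with the PP attached to the VP and one with it inside the object NP. This is the kind of case in which SYNTACTICA would offer Parse 0 and Parse 1.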
SYNTACTICA has extensive on-line help which can also be printed as a reference manual. An accompanying program, SEMANTICA, will be available in the early part of next year. SYNTACTICA and SEMANTICA were developed by Dr. Richard Larson and the SUNY-Stony Brook Semantics Lab under a grant from the National Science Foundation. SYNTACTICA runs under the NeXTSTEP operating system and uses an underlying Prolog engine, XSB, developed at the Dept. of Computer Science, SUNY-SB.
Directory: members-only/tools/ling-analysis/syntax/
INTEX is a corpus processor. It includes large-coverage dictionaries and several grammars which can be represented by graphs, and it allows the user to build her/his own dictionary or grammar. At this time English, French and German dictionaries are included with INTEX.
INTEX automatically identifies words and morpho-syntactic patterns in large texts. The user can:
- Build a lexicon of words from the text. Terms may be simple words (e.g.: table), compounds (e.g.: word processor), or expressions (e.g.: to kick the bucket).
- Locate in the text all occurrences of a given word, even if it is inflected, or a given category, such as all feminine plural adjectives, or a morpho-syntactic pattern.
- Apply grammars, represented by graphs, to the text. It is possible to build indexes or concordances for all occurrences of the previous patterns.
- Use local grammars to remove word ambiguities in the text, or to detect errors or deviant sequences. The user can create her/his own grammar or edit one of the built-in grammars.
A user begins work by uploading a corpus and selecting the language. INTEX counts the number of tokens and the number of different tokens, and sorts them by frequency. The user then selects linguistic tools to parse the text. The tools are either dictionaries or finite state transducers (FSTs). INTEX is based on two large-coverage dictionaries. The DELAF dictionary contains 700,000 simple words; each entry is accompanied by its canonical form, part of speech, and inflectional information. The DELACF dictionary contains over 100,000 compound terms, mostly nouns. The FSTs are entered into INTEX either by editing regular expressions, or by drawing recursive graphs. Basically, the "input" part of an FST is used to identify occurrences of words in texts; the "output" part is used to associate each identified occurrence with information. By applying dictionaries and FSTs to a text, the user builds a lexicon of that text.
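The first step described above (count the tokens, count the different tokens, sort by frequency) can be sketched in a few lines of Python. This is only an illustration and has nothing to do with how INTEX itself is implemented. The definition of a token (a run of letters, optionally joined by an apostrophe or hyphen), the lower-casing, and the names TOKEN and token_profile are assumptions.

```python
import re
from collections import Counter

# Assumed, simplified notion of a token: letters, optionally joined
# by an apostrophe or a hyphen.
TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def token_profile(text):
    """Return (token count, distinct token count, tokens sorted by frequency)."""
    tokens = TOKEN.findall(text.lower())
    counts = Counter(tokens)
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return len(tokens), len(counts), by_frequency


total, distinct, by_frequency = token_profile(
    "The table is set. The word processor is on the table."
)
print(total, "tokens,", distinct, "different tokens")
for word, n in by_frequency[:3]:
    print(word, n)
```

A real corpus processor would go on to look each token up in its dictionaries; this sketch stops at the frequency list.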
From the above lexicon, the user can locate morpho-syntactic patterns
in the corpus, and index or build a concordance of these patterns. A
pattern could be a syntactic pattern represented by a regular
expression, such as:

INTEX was developed by Professor Max Silberztein, at the Laboratoire
d'Automatique Documentaire et Linguistique, Universite Paris 7.
Licensing through CLR.
The Pan American Health Organization, Conferences and General Services
Division, has given permission to CLR to distribute about 200
documents of parallel Spanish and English text. The text was
translated by the PAHO translation staff using their SPANAM machine
translation system with post editing by human translators. Most text
was originally in Spanish, but this is not consistently the case. The
documents are memos, letters, reports, conference proceedings, etc.,
on a wide variety of topics in the domains of Public Health and Latin
America. There are about 180 pairs of text, 360 individual files,
which amount to approximately 8 Mb of data. The Spanish documents do
contain the Spanish character encoding. Other formatting commands,
such as tabs, centering, bold, etc., have been removed.
Special thanks to Dr. Marjorie Leon for her assistance in making these
texts available to the NLP research community.
Directory: members-only/lexica/PAHO/
A list of Italian wordforms, with about 30,000 entries, has been
included in the CLR archives. This lexicon was created by Professor
Rodolfo Delmonte, at the Istituto di Linguistica e Didattica delle
Lingue, Universita degli Studi, in Venice, Italy. The Italian Wordform
List (IWL) was derived from a 500,000 word corpus. An attempt was made
to use broad-based texts and cull a vocabulary which represents the
most frequently used Italian words in written text today. The original
corpus was composed of: the novel "La coscienza di Zeno", by Italo
Svevo; magazines with a popular focus; newspapers; monthly magazines
on science and computing; and political documents. The corpus was
automatically tagged with a part of speech tagger called IMMORTALE,
which will also be available through CLR. Only a few corrections were
made to the automatic output, and these were limited to a small subset
of "hard to tag" terms.
The wordform list is "encumbered" and there is a fee of $300.00 for
academic research use and $500.00 for corporate research use of this
item. The license agreement form and explanation of payment of the
fees can be found in:
Directory: members-only/lexica/IWL/
An example randomly excerpted from the wordform list is shown below.
The tagset list is provided in the README file which is also in the
above listed directory. The format of the list is: word, tags,
frequency, and a letter which designates which type of corpus it was
found in. The files are ASCII, tab delimited, and available in DOS
zipped, Mac sea.hqx, and Unix compressed formats.
-------------------------
Harper Collins Publishers allows the Collins College Edition Bilingual
dictionaries to be made available for research use through CLR. The
dictionaries in this series are smaller than the large bilingual
editions which are also available electronically (see Newsletter 11).
The College Bilinguals contain approximately 80,000 references, 40,000
per side. Collins charges 1000 pounds sterling for academic research
use and requires a signed license agreement. Additional information
and application forms can be obtained from CLR. The languages
available are:

The above dictionaries are available in a tagged format. The tagging
is similar to SGML, but is proprietary to Collins. The tagging system
is not designed for linguistic analysis, but rather for the offset
printing of the paper dictionaries. It does however lend itself to
work in Computational Linguistics. Sample files excerpted from 3 of
the dictionaries, and their accompanying documents explaining the tags
that were used, are housed at the CLR ftp site. Read the file in
Info:/ called COLLINS.college and look in the directory:
members-only/lexica/COLLINS.DICT.samples/college_bilingual/
Below is a brief excerpt from the French - Italian bilingual sample
file, followed by a portion of the file which explains the tags. These
are brief examples; please get the complete sample files from the CLR
site.
Excerpt from the documentation from Collins describing the tags shown
above (excerpt - not the entire file):
B FORMAT FRENCH-ITALIAN DICTIONARY 08-07-93
(HWME) Main entry headword positioned full out in HEADWORD BOLD.
(HWAD) Headword add on ending to be positioned after the headword or
alternative form, separated by a comma and a character space. To be
set in HEADWORD BOLD.
(PRON) Phonetics surrounded by square brackets, following the headword
string, (HW..), to which they belong preceded by a character space.
(MAIN) Main entry for grouping purposes NOT FOR OUTPUT. The contents
of this tag should not be output.
(MNHN) Main entry homonym number NOT FOR OUTPUT. The contents of this
tag should not be output.
(BFORMAT) This tag is used for grouping purposes and is NOT FOR
OUTPUT.
(COMMON) This tag is used for grouping purposes and is NOT FOR OUTPUT.
(POSP) Part of speech marker should be output in ITALIC, generally
preceded and followed by a character space. Where there is more than
one occurrence of this tag in succession, the intervening punctuation
should be a comma.
(TRAN) Translation to be output in ROMAN. Will normally be preceded by
a character space and followed by either a comma, if the following
tag, excluding any (TRAD)/(TRSB)/(TRCF), (TGGR) or (TL..) tags, is
(TRAN)/(TREQ)/(TRGL), a semi-colon, if an indicator (LB..) tag follows
or full stop if it is the end of the entry.
(LBIN) Indicator - general to be output in SLOPED ROMAN within round
brackets, preceded and followed by a character space, unless following
a (T...) or (X...) tag where it would be preceded by a semi-colon and
character space.
(RFVB) Reflexive verb to be output in SECONDARY BOLD and separated
from the preceding item by a semi-colon and a character space, except
when the preceding item is (POSP), in which case it should be
separated by a comma and a character space.
(PHRS) Phrase to be output in SECONDARY BOLD and separated from the
preceding item by a semi-colon and a character space.
The Collins COBUILD English Language Dictionary, which we described in
the last newsletter, is also available in a tagged format. COBUILD was
developed from the Collins Bank of English, and has over 70,000
references. Currently, the Collins English Dictionary (CED), Third
Edition, is only available in typesetter's format as described in our
last newsletter. However, it will be out in a tagged format sometime
later this summer. The price for both dictionaries will be the same,
2000 pounds sterling for academic research use.
Dr. Robert Krovetz at the University of Massachusetts, Dept. of
Computer Science, has prepared a very useful guide to the errors found
in the machine-readable version of the Longman Dictionary of
Contemporary English (LDOCE). The guide refers to the first edition of
the dictionary (1978), and in particular the "lisp" version. The guide
has a section on translation errors, those errors which resulted from
the conversion of the original tape into Lisp s-expressions. Another
section has listings of errors found in particular fields:
part-of-speech, subject codes, selectional restrictions, definitions,
and run-ons. The pronunciation field and the grammar codes are not
examined, nor are errors identified with the box codes on morphology.
Dr. Krovetz will be releasing a revised edition of the guide shortly
which will also be available through CLR. More Info: info/LDOCE.guide
Ftp Directory: CLR/resources/
An electronic version of Alexander MacBain's "An Etymological
Dictionary of the Gaelic Language" is available thanks to Kevin
Donnelly who typed it in. The file is in ASCII, so all the typographic
representations are in ASCII. The dictionary includes some grammatical
information, and complete definitions.
Ftp Directory: CLR/multiling/gaelic/
The ARI German->English On-Line Military Dictionary for MS Windows was
developed by Jonathan Kaplan, the Army Research Institute, Alexandria,
Virginia. This is a German-English glossary in either of two
electronic formats: a MS Windows Write word processor document or a
plain ascii file. When used under Windows as a Write application,
paging and searching functions are available. The glossary has
approximately 6,200 German words. The terminology is largely military
terms although a basic German vocabulary is also included. Each entry
is one line of text, beginning with the head word followed by a simple
part of speech, then a brief definition or equivalent in English.
Standard German orthography, i.e., the umlaut and "es-tset" (ß), is
retained throughout. A special thank you to Dr. Melissa Holland for
making ARI resources available.
Ftp Directory: members-only/lexica/German.dict.mil/
This is just a reminder that CLR houses a variety of wordlists, and
lists of proper names, countries, currencies, etc.
Countries and Currencies: there is a collection of files
listing country names, currencies, and country and currency codes.
They are in the directory: members-only/lexica/country.currency.codes.
Included are the SWIFT currency codes, and a list of singular and
plural currency names extracted from the CIA World Factbook. There are
files of ISO and FIPS country listings, with codes. These files were
provided to CLR by the Department of Defense.
The Gazetteer: The Tipster Project Gazetteer is a
compilation of gazetteer information from a variety of sources,
primarily the CIA's RWDB2, the US Geological Survey, the CIA World
Factbook, and the Board of Geographic Names namelists. Version 4.0 of
the gazetteer has over 240,000 place names. It is in:
CLR/members-only/lexica/wordlists/gazetteer/.
Proper Names - People and Corporations: the personal names
lists were produced from various sources, including the combined
student directories of Cornell, UNC Chapel Hill and NMSU, for the
Tipster Project. There are lists of first names, last names, and a
short list of titles. These are in members-only/lexica/wordlists. The
subdirectory corporations/ contains a file with over 50,000
corporations listed with their countries of origin, along with a
glossary of corporate abbreviations and designators. The corporation
names were compiled by Eric Iverson at New Mexico State University.
German Wordlist: This is a German wordlist for ispell made
available by Geoff Kuenning. It is more helpful than most wordlists
because it is split into a number of separate files, including
abbreviations and acronyms, geographic names and names, technical
terms, adjectives, conjugated verbs, compound words, and then the file
"all other words". It is in the directory:
CLR/lexica/wordlists/german.dict/
Wordlists from the Fifth Message Understanding Conference: Files
gathered for MUC-5 were later released to all members of CLR. Files
include a nationalities list for 216 countries, with both noun and
adjective form of the nationality. Also there are two files on
organization names, one listing UN organizations and the other 187
international organizations. BBN Corporation deposited files of
corporation names in Japanese and English, and Japanese human names
and place names. These files are in the directory:
members-only/lexica/MUC5.wordlists/.
Ftp Directory: multiling/fonts.iso.latin/
Set of fonts for ISO - LATIN 8859; 1 through 9, plus cyrillic, greek,
and hebrew. It looks like these fonts will cover all of the Baltic and
Eastern European languages in addition to the Western European ones.
Ftp Directory: members-only/tools/text-analysis/Geta_Run/
Geta_Run is an experimental multilingual system for text understanding
which was developed by Professor Rodolfo Delmonte at the Universita
Degli Studi di Venezia, Istituto di Linguistica e Didattica delle
Lingue. Geta_Run represents a linguistically based approach to text
understanding that addresses the need to restrict access to
extralinguistic knowledge of the world by contextual reasoning; i.e.
reasoning from linguistically available cues. It is intended to show
how linguistic knowledge can be put to use and external knowledge of
the world accessed "only when needed", parsimoniously and
independently, by the system. At present Italian, English and German
are implemented and all three languages have limited but updatable
lexicons. The parsing system is based on the LFG theoretical
framework. Basic grammatical representation modules are the lexicons,
and C-structure and F-structure, which are internally represented as
graphs. The parser is a DCG which exploits the properties of Prolog in
its general parsing strategy. Geta_Run is written in Prolog for the
Macintosh platform, and has accompanying documentation. Please be sure
to sign the license agreement if you wish to experiment with this
software. More Info: info/GETA.RUN.
Ftp Directory: members-only/tools/ling-analysis/syntax/ETL/
The ETL parser is a parsing engine for augmented context-free grammar
(CFG). It is a completely parallel parsing system which uses the Earley
algorithm and is designed for languages such as Japanese which are not
word delimited. The ETL parser treats each word entry as a grammar
rule, and each character used in words is written in a dictionary. A
sample grammar and dictionary for the analysis of Japanese verbal
phrases are distributed with the parsing engine. The ETL parser has 2
versions: one for 1-byte coded character strings, and the other for
2-byte coded which can handle sentences written in Kanji or Kana. The
software was developed by Dr. Hitoshi Isahara, at the Electrotechnical
Laboratory (ETL) of the Agency of Industrial Science and Technology,
Ministry of International Trade and Industry in Tsukuba, Japan. A
license agreement for the use of the software is housed in the
directory listed above, and must be signed and returned to ETL. A
paper describing the techniques used by the ETL parser is also
included. More Info: info/ETL.parser
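To give a flavour of Earley parsing over text that is not word delimited, here is a small Python recognizer in which every terminal is a single character and each word entry is written as a grammar rule over characters. It is a sequential textbook-style sketch, not the ETL engine: it is not parallel, it handles no rule augmentations, and it assumes that no rule has an empty right-hand side. The toy romanized grammar (GRAMMAR, with its Stem and Ending categories) and the function name earley_recognize are made up for illustration.

```python
def earley_recognize(grammar, start, text):
    """Return True if `text` (a string of characters) derives from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples).
    Any symbol that is not a key of `grammar` is treated as a terminal
    character. Assumes no empty right-hand sides.
    """
    # chart[i] holds items (lhs, rhs, dot, origin) ending at position i.
    chart = [set() for _ in range(len(text) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(len(text) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predictor: expand the nonterminal after the dot.
                    for production in grammar[symbol]:
                        item = (symbol, production, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
                elif i < len(text) and text[i] == symbol:
                    # Scanner: the next character matches the terminal.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance every item that was waiting for `lhs`.
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(text)]
    )


# Hypothetical toy grammar: each "word" is a rule whose right-hand side
# is a sequence of characters.
GRAMMAR = {
    "VP": [("Stem", "Ending")],
    "Stem": [tuple("tabe"), tuple("nomi")],
    "Ending": [tuple("masu"), tuple("mashita")],
}

for phrase in ("tabemashita", "nomimasu", "tabemas"):
    print(phrase, earley_recognize(GRAMMAR, "VP", phrase))
```

Because the chart is indexed by character position, no separate word segmentation step is needed; word boundaries fall out of whichever rules complete.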
The members-only area of the CLR archives is rapidly increasing its
volume with valuable materials and software which are available only
to members of the Consortium. If your interests lie in lexical
research, computational linguistics, or natural language processing,
CLR encourages your organization to become a member. Membership not only
provides your organization with resources, but allows this ftp site
and its services to be maintained and to grow.
Welcome to new CLR member organizations and their contact staff:
Dr. Christian Boitet, of the Groupe d'Etude pour la Traduction
Automatique, Institut IMAG, Domaine Universitaire, Grenoble, France.
Dr. Enrique Daltabuit, Director, and Randall Sharp, Computational
Linguist, of the Direccion General de Servicios de Computo Academico,
Universidad Nacional Autonoma de Mexico, Mexico.
Dr. Rodolfo Delmonte, of the Istituto di Linguistica e Didattica delle
Lingue, Universita degli Studi di Venezia, Venice, Italy.
Dr. Abolfazl Fatholahzadeh and Dr. Claude Lhermitte, SUPELEC, Ecole
Superieure d'Electricite, Metz, France.
Dr. Michael Hess of the Fachbereich Informatik, Institut fur
Computerlinguistik, Universitat Koblenz-Landau, Koblenz, Germany.
Dr. Hwee Tou Ng and Dr. Khee Yin How of the Defense Science
Organization, Computer Research Division, Republic of Singapore.
Dr. James Pustejovsky, of the Computer Science Department, Brandeis
University, Waltham, Massachusetts.
Dr. Penelope Sibun of Fuji Xerox Company, Ltd., Palo Alto,
California.
Dr. Stan Szpakowicz, of the Knowledge Acquisition Laboratory,
Department of Computer Science, University of Ottawa, Ottawa, Ontario,
Canada.
Dr. Yorick Wilks, of the Institute of Language, Speech and Hearing,
the University of Sheffield, Sheffield, England.
Dr. Dekai Wu, of the Computer Science Department, Hong Kong University
of Science and Technology, Hong Kong.
Spanish-English Parallel Text from PAHO
Italian Wordform List (IWL)
Electronic Dictionaries and WordLists: Tagged Format Collins
Harper Collins College Bilingual Dictionaries in Tagged Format
***********************************************************
(COMMON)
*
(HWME) eau-de-vie
(PRON) odvi
(MAIN) eau
(MNHN) 1
(HWIF) ~x-~-~
(IFGR) pl
(POSP) nf
(TRAN) acquavite $
(TGGR) f
************************************************************
(COMMON)
*
(HWME) echauffer
(PRON) e$ofe
(POSP) vt
(LBIN) aussi fig
(TRAN) scaldare
*
(RFVB) s'echauffer
(POSP) vr
(LBSF) SPORT
(TRAN) riscaldarsi
*
(LBIN) dans la discussion
(TRAN) scaldarsi
(TRAN) accalorarsi
***********************************************************
(COMMON)
*
(HWME) ecrire
(PRON) ekRiR
(POSP) vt, vi
(TRAN) scrivere
*
(RFVB) s'ecrire
(POSP) vr
(LBIN) reciproque
(TRAN) scriversi
*
(PHRS) ca s'ecrit comment?
(TRAN) come si scrive?
*
(BFORMAT)
*
(PHRS) ~ qn (que)
(TRAN) scrivere a qn (di)
***********************************************************
(COMMON)
*
(HWME) ecrit
(HWAD) e
(PRON) ekRi, it
(MAIN) ecrire
*
(BFORMAT)
*
(POSP) pp
(XROF) ecrire
*
(COMMON)
*
(POSP) adj
(HWXT) bien ~
(TRAN) ben scritto*
(TRSB) a
*
(PHRS) mal ~
(MAIN) ecrire
(TRAN) scritto* male
(TRSB) a
*
(POSP) nm
(TRAN) scritto
*
(PHRS) par ~
(MAIN) ecrire
(TRAN) per iscritto
*******************
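Since each line of the tagged sample has the shape "(TAG) content", the excerpt above can be read into a simple structure with a few lines of Python. This is a rough sketch based only on the excerpt, not on the full Collins documentation. It assumes that a row of several asterisks separates entries and that a lone asterisk separates groups of tags within an entry. The names TAG_LINE, parse_entries, values and SAMPLE are hypothetical.

```python
import re

# A tag in round brackets, then optional content.
TAG_LINE = re.compile(r"^\((\w+)\)\s*(.*)$")


def parse_entries(text):
    """Split tagged text into entries; each entry is a list of groups,
    and each group is a list of (tag, content) pairs."""
    entries, entry, group = [], [], []

    def close_group():
        nonlocal group
        if group:
            entry.append(group)
            group = []

    def close_entry():
        nonlocal entry
        close_group()
        if entry:
            entries.append(entry)
            entry = []

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if set(line) == {"*"}:
            # Assumed: a row of asterisks ends an entry, a lone one ends a group.
            if len(line) > 1:
                close_entry()
            else:
                close_group()
            continue
        match = TAG_LINE.match(line)
        if match:
            group.append((match.group(1), match.group(2).strip()))
    close_entry()
    return entries


def values(entry, tag):
    """All contents recorded under `tag` anywhere in the entry."""
    return [content for grp in entry for t, content in grp if t == tag]


# Two entries copied from the sample excerpt.
SAMPLE = """\
**********
(COMMON)
*
(HWME) eau-de-vie
(PRON) odvi
(MAIN) eau
(MNHN) 1
(HWIF) ~x-~-~
(IFGR) pl
(POSP) nf
(TRAN) acquavite $
(TGGR) f
**********
(COMMON)
*
(HWME) echauffer
(PRON) e$ofe
(POSP) vt
(LBIN) aussi fig
(TRAN) scaldare
*
(RFVB) s'echauffer
(POSP) vr
(LBSF) SPORT
(TRAN) riscaldarsi
*
(LBIN) dans la discussion
(TRAN) scaldarsi
(TRAN) accalorarsi
**********
"""

for e in parse_entries(SAMPLE):
    print(values(e, "HWME"), values(e, "POSP"), values(e, "TRAN"))
```

Keeping the groups separate preserves the link between, for example, a reflexive verb (RFVB) and the translations that follow it, which a flat list of tags would lose.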
Other Collins Dictionaries in Tagged Format
Longman's Dictionary Error Guide
Gaelic Dictionary
German Military Terms Dictionary
Wordlists in CLR
Recent Acquisitions: A Quick List
Fonts ISO - LATIN 8859
Geta_Run
ETL Parser
CLR Membership and New Members