Consortium For Lexical Research - Catalog

Catalog Instructions

This is the CLR catalog: a Table of Contents is followed by the catalog itself, which gives short, paragraph-length descriptions of the materials available in the CLR archives.

When you are in the catalog you may select an item and:

Below is the Table of Contents listing the CLR materials alphabetically. Each is linked to its section of the catalog. You may use the Table of Contents or go directly to the Catalog.




Acronym Dictionary


Ftp Directory

An ASCII text file containing a comprehensive list of acronyms, with over
3,300 entries.  A wide variety of domains is covered, including business,
science, medicine, government, and more. More info: here.

A brief sample: 
NAS	National Academy of Sciences
NAS	National Advanced Systems
NASA	National (US) Aeronautics and Space Administration [Space]
NASDA	NAtional (Japan) Space Development Agency [Space]
NASM	National (US) Air and Space Museum [Space]
NASP	National (US) AeroSpace Plane [Space]
NATO	North Atlantic Treaty Organization
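
The sample suggests a simple layout: an acronym, a tab, an expansion, and
an optional bracketed domain tag.  A minimal Python sketch for loading
such lines follows.  The tab separator and the tag syntax are inferred
from the sample above, and the name parse_acronyms is hypothetical; the
full file may contain variations not handled here.

```python
import re
from collections import defaultdict

# Optional trailing "[Domain]" tag, as in "... Space Administration [Space]".
_DOMAIN = re.compile(r"\s*\[([^\]]+)\]\s*$")

def parse_acronyms(lines):
    """Map each acronym to a list of (expansion, domain-or-None) pairs."""
    table = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        acronym, expansion = line.split("\t", 1)
        match = _DOMAIN.search(expansion)
        domain = match.group(1) if match else None
        if match:
            expansion = expansion[:match.start()]
        table[acronym.strip()].append((expansion.strip(), domain))
    return dict(table)

sample = [
    "NAS\tNational Academy of Sciences",
    "NAS\tNational Advanced Systems",
    "NASA\tNational (US) Aeronautics and Space Administration [Space]",
]
acronyms = parse_acronyms(sample)
```

Keeping a list per acronym preserves entries such as NAS that have more
than one expansion.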


AFGREP


Ftp Directory

Afgrep is a variant of the mgrep algorithm from the agrep package.  It
provides a high speed multi-string search of a file in the same manner
as the fgrep program.  It also has fewer arbitrary limitations. More
Info: here.

AGREP


Ftp Directory

Agrep is a tool for high-speed text searching that allows for errors.
Agrep is similar to fgrep, egrep, and grep, but is more general and
usually much faster.  The three most significant features of agrep are:

1. The ability to search for approximate patterns
2. Agrep is record oriented, not just line oriented
3. Multiple patterns can be specified with logical operators (AND and
   OR) for queries. More Info: here.
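
To make "approximate patterns" and "record oriented" concrete, here is a
small Python sketch.  It uses a plain dynamic-programming edit distance
over substrings, which is NOT the algorithm agrep itself uses and is far
slower; the function names, the blank-line record delimiter, and the
sample text are all assumptions made for illustration.

```python
def approx_find(pattern, text, max_errors):
    """True if some substring of text is within max_errors edits of pattern."""
    m = len(pattern)
    # prev[i]: fewest edits to match pattern[:i] against a substring of
    # text that ends at the current position.
    prev = list(range(m + 1))
    if prev[m] <= max_errors:
        return True
    for ch in text:
        curr = [0]  # a match may start anywhere, so an empty prefix is free
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in text
                            curr[i - 1] + 1))    # pattern character missing
        if curr[m] <= max_errors:
            return True
        prev = curr
    return False

def matching_records(text, pattern, max_errors, delimiter="\n\n"):
    """Return whole records (not just lines) that contain an approximate match."""
    return [record for record in text.split(delimiter)
            if approx_find(pattern, record, max_errors)]

hits = matching_records("fast serch tool\nline two\n\nunrelated record",
                        "search", 1)
```

Here the first record is returned in full, including its second line,
because the misspelling "serch" is one edit away from "search".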

ArabTeX


Ftp Directory

ArabTeX is a package extending the capabilities of TeX/LaTeX to generate
Arabic writing from an ASCII transliteration for texts in several
languages using the Arabic script.

It consists of a TeX macro package and an Arabic font in several sizes,
presently only available in the Naskhi style. ArabTeX will run with Plain
TeX and also with LaTeX; other additions to TeX have not been tried.

ArabTeX is primarily intended for generating Arabic writing, but the
scientific transcription can also be easily generated. For other languages
using the Arabic script, limited support is available.

This package also has the option of typesetting fully vocalized text
(with vowel diacritics, as in the Qur'an), transcriptions, or both.
More Info: here.

ARCSGML 1.0


Ftp Directory

ARCSGML is a set of tools for setting up and working with text that 
is tagged with your own specialized tags in SGML format.  The tags
permit you to label text structures (such as part-of-speech, syntactic,
morphological, semantic, or discourse structures).  The parser is
for selectively pulling out corresponding tagged pieces of text.
The ARCSGML toolkit is for use in developing conforming SGML parsers,
systems, and applications.  A validator (for checking your tagged text)
is supplied.  It supports the standard SGML reference concrete syntax 
(beginning from 1983) in all features except LINK, CONCUR, and SUBDOC 
(although some hooks are in place to get you started on these).  [The
package was originally written to validate the 1983 working draft of 
the SGML standard, and was subsequently maintained to track the standard 
through its final phases of development, culminating in the amendment.]

Executables and source code are provided for PC and Unix versions (MSDOS
REXX, MSDOS C, and Unix C). More Info: here.

ATeX


Ftp Directory

ATeX is a simple extension to LaTeX allowing the user to typeset, edit
and print documents in Arabic, or in a combination of English and
Arabic.  Simple updates to one of the files will allow one to get ATeX
to do Farsi, Urdu, etc., since the extra-Arabic characters for these
are present in the font used.  Also supplied with this package is the
MSDOS executable for using it on PCs. More Info: here.

AV Parser


Ftp Directory

The Attribute Value Parser provides a general tool for investigating
unification-based theories of grammar, runs on Apple Macintosh
computers, and was developed by Mark Johnson.  It works with a
user-defined grammar, specified in a file or constructed using the
editor included, and constructs parse trees and feature structures
from input sentences. Clicking on the nodes in the parse tree causes
their associated feature structures to be displayed.  There are two
versions of the parser, corresponding to the two versions of Apple's
CommonLisp environment that were used to create them.  The 1.32
version was created with MACL version 1.32, and the 2.0p2 version was
created with MCL 2.0p2. More Info: here.

Bamboo Helper


Ftp Directory

Bamboo Helper is a shareware program by Carlos McEvilly that
transliterates Chinese text files into Pinyin, Wade-Giles, Yale or
Zhuyinfuhao formats.  The program segments Chinese text to add breaks
between words, identifies the correct pronunciation for characters
with more than one pronunciation, and contains a dictionary.  Intended
for students of Chinese, the program outputs vocabulary lists and
flash cards.  Bamboo Helper extracts Chinese text strings and ASCII
text strings from binary files.  With no editing functions, it is best
used as a supplement to a Chinese display system word processing
environment.  More Info:/BAMBOO.


CG Parser


Ftp Directory

CGParser is an implementation of a linear parser of Conceptual Graphs
as described in John Sowa's book _Conceptual Graphs_.  It was
written using the YACC general-purpose language utility. Some simple
and more interesting examples are provided for testing purposes. 
More Info: here.

Character Encoding Converter Generator Package


Ftp Directory

This package, developed by Kosta Kostis, generates a converter based on
character encoding description files, one for the source encoding, and
one for the destination encoding.  The description files are pure
ASCII files using ISO 10646 names.  The package includes 74 character
encoding description files covering character encodings from: Adobe,
Apple, Atari, DEC, EBCDIC, HP, ISO 646*, ISO 8859*, IBM Codepages,
Microsoft Windows Codepages, NeXT.  This is an extremely useful
package for researchers in multilingual text processing.  There is
also a subdirectory called "extras" with additional files of interest
to people working with ISO character sets. More Info: here.
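
The idea of deriving a converter from two description files can be
sketched in Python as a join on character names.  The on-disk format of
the description files is not given here, so the tables below are plain
dictionaries from byte value to ISO 10646 character name; the function
name, the fallback byte, and the two tiny tables are illustrative
assumptions, not part of the package.

```python
def build_converter(source_table, dest_table, fallback=0x3F):
    """Build a bytes-to-bytes converter by joining two tables on character name.

    Each table maps a byte value to an ISO 10646 character name.
    Bytes with no counterpart in the destination become `fallback` ("?").
    """
    name_to_dest = {name: byte for byte, name in dest_table.items()}
    mapping = {byte: name_to_dest[name]
               for byte, name in source_table.items()
               if name in name_to_dest}

    def convert(data):
        return bytes(mapping.get(b, fallback) for b in data)

    return convert

# Two-entry excerpts: IBM code page 437 and ISO 8859-1 both contain these
# characters, but place the u-umlaut at different byte values.
cp437 = {0x41: "LATIN CAPITAL LETTER A",
         0x81: "LATIN SMALL LETTER U WITH DIAERESIS"}
latin1 = {0x41: "LATIN CAPITAL LETTER A",
          0xFC: "LATIN SMALL LETTER U WITH DIAERESIS"}

to_latin1 = build_converter(cp437, latin1)
converted = to_latin1(b"A\x81")
```

Using character names as the join key means each encoding needs only one
description, rather than one table per pair of encodings.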

CMU-Dictionary


Ftp Directory

This pronunciation dictionary was generated at Carnegie Mellon
University for the purpose of tuning and developing speech
understanding systems.  The dictionary contains approximately 100k
words and their transcriptions.  Several independent sources were used
in its development, such as the UCLA Shoup dictionary, a subset of
the Dragon dictionary, and various other dictionaries that were
hand-built, synthesizer-generated, or generated with Orator and
Mitalk. Robert Weide and Peter Jansen from CMU developed this
dictionary, which may be used for any purpose, commercial or otherwise.
Versions 0.1 and 0.2 are both in this directory.  More Info: here.

COGNATE


Ftp Directory

COGNATE is the implementation of a prototype algorithm for identifying
related words across languages.  Given the same list of words in two
different languages, COGNATE will determine which words are likely to
be regularly derivable from each other, and which are not.  COGNATE is
only available as an MSDOS executable and comes with Dutch, English,
and German word lists. More Info: here.

Collins English Dictionary Prolog Fact Base


Ftp Directory

This "dictionary" is a set of Prolog facts derived from the first
published edition of the Collins English Dictionary. It was originally
created by Dr. Ed Fox and Dr. Robert Vance at Virginia Tech for the
CODER lexicon Project. The factbase consists of 20 files, one for each
relation identified in the structure of the Collins Dictionary. Each
relation file consists of ground facts in Edinburgh standard Prolog
syntax, one fact per line. The headword relation file has no
accompanying data. In the rest of the files, the facts are of the
form: name (where name identifies the relation); descriptor (where
descriptor specifies both the associated entry and the depth within it
at which the fact is bound); and data (data represents the data stored
for that information).  More info: here.

Collins English Dictionary: Third Edition (Sample)


Ftp Directory

The Collins English Dictionary, Third Edition, published in 1991,
contains 180,000 references, 190,000 numbered definitions, 14,000 new
or updated entries from the last edition, and 16,000 biographical and
geographic entries.  Harper Collins makes the CED3 available to
researchers for a fee; please contact CLR for more information.  This
directory contains an electronic sample which shows what the ASCII
typesetter's tape format is like.


Collins Large Bilingual Dictionaries (Samples)


Ftp Directory

Harper Collins Publishers have a line of monolingual and bilingual
machine readable dictionaries available through CLR.  For complete
information on product and pricing, please contact CLR.  The files in
this directory are electronic samples from two of the large bilingual
dictionaries; the large German - English and the large Spanish -
English.  The large German - English dictionary contains over 280,000
references and 460,000 translations.  The large Spanish - English
contains over 230,000 references and 440,000 translations. The
dictionaries are available on tape, in typesetter's coded format. The
purpose of the samples is to demonstrate the types of linguistic data
available, and the format of the tapes.  More Info:
here.

Collins College Bilingual Dictionaries (Samples)


Ftp Directory

Harper Collins also makes available its medium-sized bilingual
dictionaries, the Collins College Edition Bilingual Series.  These
have approximately 80,000 references, 40,000 per side.  They come in a
'tagged' format, with tagging somewhat similar to SGML.  The tagging
is done for printing purposes, not for linguistic analysis purposes.
But it does lend itself to computational linguistic work.  Some
samples excerpted from dictionaries which show the tagging output,
along with documents explaining the tags used, are available in the
directory listed above. More Info: here.

Conc


Ftp Directory

Conc produces concordances of texts. A concordance consists 
of a list of the words in the text with a short section of the 
context that precedes and follows each word. Conc also produces 
an index, consisting of a list of the distinct words in the text, 
each with the number of times it occurs and a list of the places 
where it occurs. Conc displays the original text, the 
concordance, and the index each in its own window. Clicking on a 
word in any one of the three windows causes the other two windows 
to display the entries for the same word. More Info: here.
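
As a rough illustration of the two structures described above, the
following Python sketch builds an index (word to positions) and a
keyword-in-context concordance from it.  It is not Conc's own code; the
word pattern, the use of character offsets as "places", and the function
names are assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict

def build_index(text):
    """Map each distinct word (lowercased) to the offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z']+", text):
        index[match.group().lower()].append(match.start())
    return dict(index)

def concordance(text, word, width=20):
    """Return one keyword-in-context line per occurrence of word."""
    lines = []
    for start in build_index(text).get(word.lower(), []):
        end = start + len(word)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{text[start:end]}] {right}")
    return lines

text = "The cat sat. The dog sat too."
index = build_index(text)
for line in concordance(text, "sat", width=8):
    print(line)
```

The number of occurrences of a word is simply the length of its entry in
the index.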

Countries and Currencies Lists


Ftp Directory

This is a collection of text file lists of countries, country codes,
currencies, and currency codes.  Included are the SWIFT currency codes
with files arranged alphabetically by country, and alphabetically by
currency code.  There is also a file of singular and plural currency
names which was extracted from the CIA World Factbook. In addition
there are files of ISO and FIPS country listings with codes.  This
collection was provided by the Department of Defense in connection
with the MUC-5 conference. More Info: here.

CRL-Proper Name Tagging (CRL/NMSU)


Ftp Directory

Collection of hand-tagged English and Japanese texts and an
accompanying evaluation program for comparison of machine tagging.
Texts were hand tagged for human names and organization names.
Scoring software allows evaluation of machine proper name tagging of
same texts.  Scoring report gives a value for recall and precision.
More Info: here.
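
For readers unfamiliar with the two scores, a minimal Python sketch of
recall and precision over tagged name spans follows.  It counts only
exact matches of (start, end, type) triples; how the CRL scoring
software treats partial matches is not stated here, and the span format
and function name are assumptions.

```python
def precision_recall(gold, predicted):
    """Score machine tags against hand tags using exact span matches.

    Each tag is a (start, end, type) triple.
    precision = correct / number of machine tags
    recall    = correct / number of hand tags
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

hand_tags = [(0, 10, "PERSON"), (15, 28, "ORG"), (40, 52, "PERSON")]
machine_tags = [(0, 10, "PERSON"), (15, 28, "PERSON"),
                (40, 52, "PERSON"), (60, 65, "ORG")]
precision, recall = precision_recall(hand_tags, machine_tags)
```

In this example two machine tags are correct, so precision is 2 of 4 and
recall is 2 of 3; the span at 15-28 counts as wrong because its type
differs.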

CRL-Text Tools (CRL/NMSU)


Ftp Directory

Collection of tools to segment text and count the pieces.  Programs
are included that segment English text into sentences, words and
word-level n-grams.  In addition programs to count strings are
provided, as well as programs which can statistically analyse the
results of these counts.  These utilities can be used with other
natural languages if a word extraction program is available for use as
a pre-processor.  These programs were used to produce the data
described in the paper ``Accurate methods for the statistics of
surprise and coincidence'' which appeared in the March 1993 issue of
Computational Linguistics.  All programs are written in C and have
been run under SunOS 4.1.1, but should be portable to other
environments which support long integers as default and large arrays.
More Info: here.
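
The kind of processing described above can be sketched in a few lines of
Python (the CRL programs themselves are in C).  The sketch forms
word-level n-grams, counts them, and computes the generic
log-likelihood ratio statistic for a 2x2 table of counts; that the CRL
tools compute exactly this form is an assumption, as are the function
names and the whitespace tokenization.

```python
import math
from collections import Counter

def word_ngrams(words, n):
    """Return the word-level n-grams of a token list, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts.

    For a bigram (a, b): k11 = count(a b), k12 = count(a, not b),
    k21 = count(not a, b), k22 = count(neither).
    """
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    g2 = 0.0
    for count, row, col in ((k11, row1, col1), (k12, row1, col2),
                            (k21, row2, col1), (k22, row2, col2)):
        if count > 0:  # a zero cell contributes nothing
            g2 += count * math.log(count * total / (row * col))
    return 2.0 * g2

words = "the cat sat on the mat and the cat slept".split()
bigram_counts = Counter(word_ngrams(words, 2))
```

A table whose cells are exactly what independence predicts scores 0;
larger values indicate a stronger association between the two words.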

CRL-Word Lists (CRL/NMSU)


Ftp Directory

These word lists were produced from various sources including the
combined student directories of Cornell, UNC Chapel Hill, and NMSU, for
the TIPSTER project. Lists of first names, last names,
and personal titles are provided. In the corporations directory is a
list of over 50,000 corporations and their countries of origin, along
with a glossary of corporate abbreviations and designators. 
More Info: here.

DIMAP-2 (demo version)


Ftp Directory

DIMAP-2 is a set of PC dictionary creation and maintenance utilities
that are modeled loosely and more flexibly on the utilities developed
by Tom Ahlswede and described at ACL 85.  DIMAP-2 also comes with a
linked machine-readable dictionary (Merriam-Webster Concise Electronic
Dictionary, with 80,000 entries).  The design of the software is
intended to facilitate the development of NLP lexicons in any
formalism (following Allen, HPSG, lexical conceptual structures, ECD;
using Lisp, Prolog, ASCII), which then belong to the developer.
The software itself is obtainable for single users at $125 and for
academic institutions at $500.  Commercial copies are available for
$2,400 for one copy and $6,000 for a site license.  A demonstration
version is also available.  The software is currently being ported to
the UNIX environment.

An MSDOS executable demo is provided and some of the DIMAP-2 features
are restricted for the demo. More info: here.
Dimap-2 is also available for the Sun-4. More Info: here.

EDICTJ


Ftp Directory

EDICTJ is a small public-domain Japanese/English Dictionary
(dual-language glossary) in machine-readable form authored by Jim
Breen.  Having the neat form of a dual-language, dual-script glossary,
it can readily be used in any number of applications.  [It was
initially intended for use with MOKE (Mark's Own Kanji Editor) and
related software such as JDIC.] More Info: here.

ENGLEX


Ftp Directory

Englex is a basic lexicon for morphological analysis of English text.
It uses the standard orthography for English. It is intended for use
with PC-KIMMO (or programs that use the PC-KIMMO parser, such as
KTEXT).  With such software and Englex, you can produce sets of
records of the morphological constituents in English texts.  Practical
applications include morphological preprocessing of text for a
syntactic parser and producing morphologically tagged text.  Englex
can also be used to explore English morphological structure. 
More Info: here.

English - Chinese Dictionary


Ftp Directory

The Eng-Chi Dictionary, Version 1.0, by John Rittinghouse allows users
to quickly find Chinese definitions for English entries.  It allows
on-line lookup of English words, with display in Chinese characters,
Pinyin, Wade-Giles, and Yale romanization.  The full dictionary consists
of 107,750 entries. This is an MSDOS or Windows program.  More info:
here.

ETL Parser


Ftp Directory

The ETL parser is a parsing engine for augmented context-free grammar
(CFG). It is a completely parallel parsing system which uses the Earley
algorithm and is designed for languages such as Japanese which are
not word delimited.  The ETL parser treats each word entry as a
grammar rule, and each character used in words is written in a
dictionary.  A sample grammar and dictionary for the analysis of
Japanese verbal phrases are distributed with the parsing engine. The
ETL parser can parse sentences written in EUC or Mule internal
code. The software was developed by the Electrotechnical Laboratory
(ETL) of the Agency of Industrial Science and Technology, Ministry of
International Trade and Industry in Tsukuba, Japan.
A license agreement for the use of the software is
housed in the directory listed above, and must be signed and returned
to ETL. A paper describing the techniques used by the ETL parser is
also included.  More Info: here.

FLEX


Ftp Directory

FLEX, Fast Lexical Analyzer Generator, was developed by Vern Paxson at
the Lawrence Berkeley Laboratory in Berkeley, CA.  Flex is a tool for
generating programs which recognize lexical patterns in text.  Flex
reads the given input files for a description of a scanner to
generate. The description is in the form of pairs of regular
expressions and C code - these are called rules. Flex generates as
output a C source file, lex.yy.c, which defines a routine yylex().
This file is compiled and linked with the library to produce an
executable.  When the executable is run, it analyzes its input for
occurrences of the regular expressions.  Whenever it finds one, it
executes the corresponding C code.  More info: here.
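
Flex itself reads a rules file and emits C, but the rule model can be
illustrated with a short Python sketch: each rule pairs a regular
expression with an action, the longest match wins, and the earlier rule
wins ties.  The names make_scanner and scan and the three sample rules
are hypothetical, and unmatched characters are simply skipped here,
whereas flex's default rule copies them to the output.

```python
import re

def make_scanner(rules):
    """Build a scanner from (pattern, action) pairs, in priority order."""
    compiled = [(re.compile(pattern), action) for pattern, action in rules]

    def scan(text):
        pos, results = 0, []
        while pos < len(text):
            best_len, best_action = 0, None
            for regex, action in compiled:
                match = regex.match(text, pos)
                # Strict ">" keeps the earlier rule when lengths tie.
                if match and len(match.group()) > best_len:
                    best_len, best_action = len(match.group()), action
            if best_action is None:
                pos += 1  # no rule matched: skip one character
                continue
            result = best_action(text[pos:pos + best_len])
            if result is not None:
                results.append(result)
            pos += best_len
        return results

    return scan

scan = make_scanner([
    (r"[0-9]+", lambda s: ("NUMBER", s)),
    (r"[A-Za-z]+", lambda s: ("WORD", s)),
    (r"\s+", lambda s: None),  # whitespace: match but emit nothing
])
tokens = scan("abc 42 x")
```

Returning None from an action plays the role of a flex rule with an
empty action, which consumes input without producing a token.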

FONOL


Ftp Directory

Fonol is a programming language for writing out and applying TG-style
phonological rules (modelled on Chomsky and Halle 'Sound Pattern of
English' and Schane 'Generative Phonology') to see their effect.  It
also incorporates the input and output filters (conditions) which came
into common use about the same time.  It is intended to aid students
of phonology to grasp the ideas behind phonological rules and to help
phonologists manage large complex bodies of rules in the theory of
their choice.  (Notation style modified for writing on IBM PCs.)
More Info: here.

Fonts: ISO - LATIN 8859


Ftp Directory

Set of fonts for ISO - LATIN 8859 1 through 9.  Also, Cyrillic, Greek,
and Hebrew.  These fonts appear to cover all of the Baltic
and Eastern European languages.


French Plus


Ftp Directory

French Plus! (authored by Gene Hayworth) is a tutorial and testing
program, divided into three sections: Vocabulary Review, Vocabulary
Exercises, and Verb Conjugation Exercises.  The demo version,
available for evaluation purposes only, contains approximately 35
words; the full version includes a combination of more than 800 nouns,
adjectives, and commonly used verbs, with their conjugations in four
tenses.  Accent marks have not been included in order to make this
program compatible with as many systems as possible. The program will
run from either a floppy disk or a hard drive when installed in a
single directory. More Info: here.

FUF and SURGE


Ftp Directory

FUF 5.2 and SURGE 1.2 were developed by Michael Elhadad, currently at
Ben Gurion University of the Negev. FUF is an extended implementation
of the formalism of functional unification grammars (FUGs) introduced
by Martin Kay, specialized to the task of natural language generation.
SURGE is a large syntactic realization grammar of English, written in
FUF.  SURGE was developed to serve as a "black box" syntactic
generation component that encapsulates a rich knowledge of English
syntax within a larger generation system.  SURGE can also be used as a
platform for exploration of grammar writing with a generation
perspective.  More info: here.

Gaelic Dictionary


Ftp Directory

This dictionary is an electronic version of Alexander McBain's "An
Etymological Dictionary of the Gaelic Language", from 1911.  It was
typed in by Kevin Donnelly.  The file is in ASCII, so all the
typographic representations are in ASCII.  For an excerpt, see the
More Info file.  More Info: here.

Gazetteer


Ftp Directory

The TIPSTER Gazetteer is a compilation and reformulation of gazetteer
information from a number of sources, primarily CIA's RWDB2, the US
Geological Survey, the CIA World Factbook (version 11), and the
Board of Geographic Names namelists. Coverage varies drastically,
based on the degree of completion for the region or attribute. Version
4.0 has over 240,000 place names. Depending on the source, multiple
entries may exist for the same geographic entity, under various
spellings. More Info: here.

German Military Terms Dictionary


Ftp Directory

The ARI German-English On-Line Military Dictionary for MS Windows was
developed by Jonathan Kaplan of the Army Research Institute,
Alexandria, Virginia. This is a German-English glossary in either of
two electronic formats: a MS Windows Write word processor document or
a plain ascii file.  When used under Windows as a Write application,
paging and searching functions are available. The glossary has
approximately 6,200 German words.  The terminology is largely military
terms although a basic German vocabulary is also included.  Each entry
is one line of text, beginning with the head word followed by a simple
part of speech, then a brief definition or equivalent in English.
Standard German orthography, i.e., the umlaut and "es-tset" (ß), is
retained throughout. A special thank you to Dr. Melissa Holland for
making ARI resources available. More Info: here.

German Wordlist


Ftp Directory

This is a German dictionary for ispell, originally created by Martin
Schulz and made available by Geoff Kuenning.  It contains a number of
separate wordlists, including abbreviations and acronyms, geographic
names and names, technical terms, adjectives, conjugated verbs,
compound words, and "all other words".  More info: here.

German Stemmer


Ftp Directory

This German Stemmer, designed by Daniel Stieger from Institut fuer
Informationssysteme, Zuerich, is available in Modula-2.  Its design is
based on the "Porter Algorithm". It represents only one semester's work
and is, therefore, unfinished. The program uses an automatically
generated dictionary of 215,000 German words for the decomposition
task. The dictionary is available along with the software.  A report
written in German is available in hard copy. 
More Info: here.

GETA_RUN


Ftp Directory

Geta_Run is an experimental multilingual system for text understanding
which was developed by Professor Rodolfo Delmonte at the Universita
Degli Studi di Venezia, Instituto di Linguistica e Didattica delle
Lingue. Geta_Run represents a linguistically based approach to text
understanding that addresses the need to restrict access to
extralinguistic knowledge of the world by contextual reasoning, i.e.,
reasoning from linguistically available cues. It is intended to show
how linguistic knowledge can be put to use and external knowledge of
the world accessed "only when needed", parsimoniously and
independently, by the system. At present Italian, English and German
are implemented and all three languages have limited but updatable
lexicons. The parsing system is based on the LFG theoretical
framework.  Basic grammatical representation modules are the lexicons,
and C-structure and F-structure, which are internally represented as
graphs. The parser is a DCG which exploits the properties of Prolog in
its general parsing strategy. Geta_Run is written in Prolog for the
Macintosh platform, and has accompanying documentation.



GRAMPT411


Ftp Directory
	
MSDOS executables and data for the Grammar Transformation System GRAMTSY,
which interprets the more or less linguistically familiar notation of
transformational grammars and applies the grammars so that linguists may
analyze and interpret the implications of intricate TG-rule or
TG-grammar networks.  More Info: here.

GREEK


Ftp Directory

"Greek" is a (di)troff filter that takes Greek text written with Latin
characters using the typewriter letter correspondence, and converts it
into the corresponding character sequences for the Greek letters.
"Greek" follows the ``monotoniko'' (single-accent) system.  Text
may have intermixed Greek and Roman characters. More Info: here.

Half-Uncial fonts


Ftp Directory

These are the half-Uncial (Irish typeface) fonts pre-generated at
300dpi for use with (La)TeX. More Info: here.

Homophones


Ftp Directory

This is a list of homophones in "General American English", based on 
the book HANDBOOK OF HOMOPHONES by William Cameron Townsend, 1975. 
The list contains words that sound the same (or very nearly the same)
but are spelled differently. It occasionally includes spelling
variants of the same word when there  is another word in the same
entry; the only difference between "homophones" and "spelling variant"
is whether or not the words are lexically "the same". The list also
contains a few common proper names.

This list of homophones was provided by Evan Antworth from the Summer
Institute of Linguistics. More Info: here.

Hum


Ftp Directory

This is the "hum" concordance and textual analysis package done by
Bill Tuthill when he was at Berkeley (1981).  A package of programs
for literary and linguistic computing, emphasizing the preparation of
concordances and supporting documents.  Both keyword in context and
keyword and line generators are provided, as well as exclusion
routines, a reverse concordance module, formatting programs, a
dictionary maker, and lemmatization facilities.  There are also word,
character, and digraph frequency counting programs, word length
tabulation routines, a cross reference generator, and other related
utilities.  The programs are written in the C programming language. 
More Info: here.

InterBASE


Ftp Directory

InterBase is a natural language front end for relational databases in
DBase compatible format. The program was a finalist in the "Software
in Europe" contest at CeBIT 93 in Hanover, Germany.  This demo
contains both an English and a Russian version of the InterBase
system; additionally there are several demo linguistic processors for
various applied software systems. More Info: here.

The Interlinear Text Processor


Ftp Directory

The IT ('eye-tee' for Interlinear Text) software is a set of tools
from SIL [in executable MSDOS binary code] that are for developing a
corpus of annotated interlinear text -- for what linguists, literary
scholars, translators, and anthropologists call the "glossing" of text.
Primary among these tools is `itp'.  The interlinear text file produced
using itp is a clean ASCII file which is accessible by other text
processing software for purposes such as concordancing, indexing, or
display formatting. In addition to itp, the IT package includes a
collection of other software tools which support the conversion of
conventional texts to interlinear text format and which support the
maintenance of the auxiliary lexical database files.  IT views text as
a sequence of text units, each of which contains a text line plus a
multidimensional set of annotations entered according to a model
provided by the analyst.  In addition to word and morpheme level
annotations, the IT system supports freeform annotations of the whole
text unit, such as translations. More Info: here.

Italian Plus


Ftp Directory

Italian Plus! (authored by Gene Hayworth) is a tutorial and testing
program, divided into three sections: Vocabulary Review, Vocabulary
Exercises, and Verb Conjugation Exercises.  The demo version,
available for evaluation purposes only, contains approximately 35
words; the full version includes a combination of more than 800 nouns,
adjectives, and commonly used verbs, with their conjugations in four
tenses.  Accent marks have not been included in order to make this
program compatible with as many systems as possible. The program will
run from either a floppy disk or a hard drive when installed in a
single directory. More Info: here.

Italian WordForm List (IWL)


Ftp Directory

The IWL is a lexicon of part-of-speech-tagged Italian words, created
by Professor Rodolfo Delmonte at the Universita degli Studi di
Venezia, Instituto di Linguistica e Didattica delle Lingue, at the
Laboratorio di Linguistica Computazionale.  The IWL was derived from a
500,000 word corpus.  Broad based texts were used, and a vocabulary
was culled which represents the most frequently used Italian words in
written text today.  The lexicon is encumbered, and there is a fee of
$300.00 for academic use and $500.00 for corporate research use of
this item.  The license agreement, a sample excerpted from the list,
and a readme file that lists the tags being used can all be found in
the directory listed above. More Info:  here.

ITRANS3.20


Ftp Directory

This is a package for printing Indian language script text. It only
does the transliteration mapping--each letter in an Indian language is
assigned an English equivalent, but the actual printout is in an
Indian language script. Font support is available for two versions of
the Devanagari script (one of which was developed by Frans Velthuis),
as well as for Tamil, Telugu and Bengali. The preferred input
interface is TeX, though a dumb textual interface is available for the
PostScript Devanagari font. More Info: here.

Japanese/German Dictionary


Ftp Directory

This is a Japanese/German dictionary entered by Helmut Goldenstein
from the book "Langenscheidts Lehrbuch und Lexikon der Jap. Schrift",
author Wolfgang Hadamitzky.  It contains 11627 Japanese words and
22000 German translations.  This dictionary may not be available after
December of 1992, so please make SURE you read the documentation file
that comes with it. More Info: here.

Japanese Morphological Dictionary


Ftp Directory

This is a Japanese morphological dictionary with a search program
included for accessing it.  The documentation is all in Japanese. 
More Info: here.

Japanese Vocabulary (wordlist)


Ftp Directory

These vocabulary lists were copied from the Vocabulary Summary at the
back of Mangajin magazine by Lars Huttar.  The words listed have three
fields; the Kanji written form, the Hiragana pronunciation, and the
English definition.  There are 22 lists of about 100 words each. The
lists were originally created to be used with vocabulary drilling
software.  More Info: here.

JKWIC


Ftp Directory

Jkwic is a simple Key Word In Context program that has the capability
of working with EUC Japanese.  It allows both simple keyword searches
and limited regular expression specification of the keyword. 
More Info: here.

JUMAN-MCC


Ftp Directory

Juman is a program which segments Japanese into words and tags
these words with parts of speech.  It was produced at Kyoto
University and then heavily modified by researchers at MCC.
The tables used for tagging are generated by a prolog program,
but the program which actually does the tokenizing and tagging
is written in C, so that users do not need to have a working 
prolog implementation if they just want to use Juman. 
More Info: here.

JUMAN-1.0


Ftp Directory

Juman-1.0 is the newest version of the segmenter-tagger for Japanese.
It includes the JUMAN-MCC version.  In addition, it provides a 120K
dictionary.  Other minor changes have been made so that the software
is more convenient to use. More Info: here 


JWP


Ftp Directory

JWP is a Japanese word processor for MS Windows, release V1.1, by
Stephen Chung. It supports all Windows printers with fully scalable
fonts (i.e., TrueType).  The program comes complete with source code, and
the ability to use the Japanese - English freeware dictionary EDICT.
Among its features are dynamic and user-defined kana->kanji
conversion, kanji information such as stroke count and bushu, entry of
JIS characters through JIS code or a table, and most of the standard
Windows word processor features.  More Info: here.

Kanji Frequency


Ftp Directory

These are frequency-sorted lists of kanji appearing in a sample of
about 30,000 articles from the USENET group "fj".  One file is for all
kanji, and lists frequency rank, JIS code hexadecimal, the kanji,
number of times it appeared in sample, and percentage of the kanji
covered by this and all previously listed kanji.  The second file
sorts frequency of kanji compounds and lists the top 1000 most
frequent kanji strings and the top 500 2-character Kanji compounds.
The software to process text and extract Kanji frequencies is
included.  These files were created by Tim Burness, and the work was
supported by TWICS Co., Ltd.  More Info:/KANJI.freq.
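
The columns of the first file (rank, hexadecimal JIS code, kanji, count, cumulative percentage) can be reproduced roughly as follows. This is a sketch under stated assumptions, not the software included in the directory: `kanji_table`, `is_kanji`, and `jis_hex` are names invented here, "kanji" is approximated by the CJK Unified Ideographs block, and the JIS code is derived by clearing the high bit of each byte of the two-byte EUC-JP encoding.

```python
from collections import Counter

def is_kanji(ch):
    # Assumption: treat the CJK Unified Ideographs block as "kanji".
    return "\u4e00" <= ch <= "\u9fff"

def jis_hex(ch):
    """Hexadecimal JIS X 0208 code, or None if the character has none."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    if len(raw) != 2:
        return None
    # Two-byte EUC-JP is the JIS code with the high bit set on both bytes.
    return "".join(f"{b & 0x7F:02X}" for b in raw)

def kanji_table(text):
    """Rows of (rank, JIS hex, kanji, count, cumulative percent)."""
    counts = Counter(ch for ch in text if is_kanji(ch))
    total = sum(counts.values())
    rows, running = [], 0
    for rank, (kanji, n) in enumerate(counts.most_common(), start=1):
        running += n
        rows.append((rank, jis_hex(kanji), kanji, n, 100.0 * running / total))
    return rows

for row in kanji_table("日本日本語日です"):
    print(row)
```

The last column is the running share of all kanji occurrences covered by the current kanji and every more frequent one, which is what makes such a list useful for deciding how many kanji to learn or support first.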


The KGEN Program


Ftp Directory

KGEN is an auxiliary program for PC-KIMMO. PC-KIMMO is a program for
doing computational phonology and morphology. KGEN is typically used
to build morphological parsers for natural language processing
systems.  The KGEN program which this document describes will be of
very little use to you without the PC-KIMMO program and book.  The
PC-KIMMO software is available for MS-DOS (IBM PCs and compatibles),
Macintosh, and UNIX.  The book is "PC-KIMMO: a two-level processor for
morphological analysis" by Evan L. Antworth, published by the Summer
Institute of Linguistics (1990). More Info: here. The book
(including software) is available from: International Academic Bookstore, 7500 W. Camp
Wisdom Road, Dallas TX, 75236 U.S.A. (phone 214/709-2404, fax
214/709-2433).


KIT-FAST


Ftp Directory

KIT-FAST is an experimental German -> English Machine translation
system.  It was developed in Quintus Prolog 3.1.4, under the Sun OS.
The MT system is made available together with linguistic data for
German and English (grammars and lexicons).  The Prolog source code
is included.  There is an on-line documentation system; there is also
an installation and user's manual in German. Within KIT-FAST there is a
knowledge representation system called BACK, and tools for linguistic
data development called CPSG tools and CPSG parser (for CPS grammars).
KIT-FAST was developed by Technical University of Berlin, Department
of Software and Theoretical Computer Science, Berlin, Germany, under
Professor Wilhelm Weisweber.  More Info: here.

KN-Parser


Ftp Directory

This is a Japanese parser which detects either a dependency or a case
structure.  In the case of the dependency structure, a set of
heuristic rules is used to detect the unique structure of a sentence.
In the case of a case structure, the analysis is performed on a sample
set of sentences found in the case-frame dictionary. The current
case-frame dictionary contains fewer than 1,000 verbs.  This parser
makes use of gmake, gcc, and juman.

The authors of KN-parser are Profs. Sadao Kurohashi and Makoto Nagao,
Kyoto University. More Info: here.

Korean 21word


Ftp Directory

21word is a Korean word processor running on IBM PC clones. According to
the original poster, it supports VGA, SVGA (Trident and ET3000? only),
and Hercules mono graphics. Warning: never specify a directory that
contains files as the swap directory, since the program wipes out
everything in the directory specified as the swap directory. It should be comparable to
those commercially available.  More Info: here.

The KTEXT program

 
Ftp Directory

KTEXT is a text processing program that uses the PC-KIMMO parser (see
info about PC-KIMMO). KTEXT reads a text from a disk file, parses
each word, and writes the results to a new disk file. This new file is
in the form of a structured text file where each word of the original
text is represented as a database record composed of several fields. 
More Info: here.

The LDB demo program


Ftp Directory

The Linguistic DataBase (LDB) is a database program created by the
TOSCA corpus linguistics group at Nijmegen University for the storage
and exploration of syntactically analysed texts. It features a tree
viewer and an extensive query language.  It was designed on the basis
of the Nijmegen 130,000 word English corpus, also available.  This
MSDOS LDB demo program demonstrates some of the features of the
Linguistic DataBase program. More Info: here.

LDOCE Error Guide


Ftp Directory

This guide was prepared by Dr. Bob Krovetz at the University of
Massachusetts, Dept. of Computer Science. This is a guide to the
errors which Dr. Krovetz found in the machine-readable version of the
Longman Dictionary of Contemporary English (LDOCE).  The guide refers
to the first edition of the dictionary, and in particular the "lisp"
version.  The guide has a section on translation errors, errors which
resulted from the conversion of the original tape into Lisp
s-expressions.  Another section has listings of errors in particular
fields: part-of-speech, subject codes, selectional restrictions,
definitions, and run-ons.  The pronunciation field and the grammar
codes are not examined, nor are errors identified with the box codes
on morphology.  More Info: here.

LHIP Parser


Ftp Directory

The LHIP parser (Left-Head corner Island Parser) was developed by
Afzal Ballim, at ISSCO, the University of Geneva. LHIP is a system for
incremental grammar development using an extended DCG formalism.  The
system uses a robust island-based parsing method controlled by
user-defined performance thresholds which allows it to analyse what it
can from the input, thus presenting the grammar developer with results
at an early stage.  The rules themselves are an extended version of the
DCG rules, allowing optional constituents, negation, disjunction, the
specification of adjacency, and the ability to mark multiple heads in
a rule body. The latest version is 1.1.  The LHIP system requires an
Edinburgh-style Prolog. More Info: here.

LINK


Ftp Directory

This directory contains a system for parsing English. The parser is
based on Link Grammar, a context-free formalism for the description of
natural language also designed by LINK's authors (Daniel Sleator,
Carnegie Mellon University, and Davy Temperley, Columbia University).
It is a lexical system, where each word has a combinatorial formula
representing all the ways in which that word can be correctly used in
a sentence.  A sentence is grammatical if links can be drawn above the
words in such a way that (1) each word's combinatorial requirements
are satisfied, (2) the links do not cross, and (3) the graph of links
and words is connected.  The system comprises a parser which
reads in a link grammar (words and their corresponding formulas), and
parses sentences according to the given grammar.  This system also
includes a Link Grammar for English.  This grammar has roughly 700
definitions and 25000 words, and captures many phenomena of English
grammar, such as noun-verb agreement, questions, imperatives, complex
and irregular verbs, different types of nouns, past or present
participles in noun phrases, commas, a variety of adjective types,
prepositions, adverbs, relative clauses, possessives, etc. 
More Info: here.
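
Conditions (2) and (3) above are purely structural and can be checked without a dictionary. The following Python sketch is not taken from the LINK distribution: the function names are invented, words are represented by their positions in the sentence, and a linkage is a list of position pairs. Condition (1) is deliberately left out, since it depends on the combinatorial formulas in the grammar.

```python
def links_cross(a, b):
    """True if links a and b, drawn above the words, would cross."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j

def is_planar(links):
    """Condition (2): no two links cross."""
    return not any(
        links_cross(a, b)
        for n, a in enumerate(links)
        for b in links[n + 1:]
    )

def is_connected(n_words, links):
    """Condition (3): every word is reachable from the first word."""
    if n_words == 0:
        return True
    adjacent = {w: set() for w in range(n_words)}
    for i, j in links:
        adjacent[i].add(j)
        adjacent[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        word = stack.pop()
        for neighbour in adjacent[word] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == n_words

# Hypothetical linkage for "the cat chased a mouse" (positions 0..4).
linkage = [(0, 1), (1, 2), (2, 4), (3, 4)]
print(is_planar(linkage), is_connected(5, linkage))
```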

The LQ-TEXT package


Ftp Directory

LQ-TEXT searches previously indexed text for phrases.
A browser and a program to generate keyword-in-context style lists are
also included.  The software is primarily designed for Unix systems.
The necessary indexing program (lqaddfile) is enclosed.  Indexes are
usually less than the size of the data, and sometimes half that.
There is a browser (lqtext) for System V, and a shell script (lq) for
any Unix system.  There is also a program (lqkwik) that turns the
output of lqphrase or "lqword -l" into a keyword-in-context style
list. More Info: here.


Male and Female Names List

Ftp Directory

A file of almost 3000 male names and 4967 female names compiled by
Mark Kantrowitz.  Copyright is contained in the names.README.Z file. 
More Info: here.

MacLex


Ftp Directory

MacLex is a program for field linguists that manages lexicon/dictionary
files of a specified format. It supports editing, find/change, user-
defined sorting order, and reversals. It is written and supported by
Bruce Waters of SIL (Summer Institute of Linguistics). 
More Info: here.

MAP3.1


Ftp Directory

This is a dictionary system suitable for use in a natural language
parsing system.  In addition to the basic dictionary, a morphological
analysis system can be used to cope with inflections and derivations
of words in the lexicon.  This system allows the user to write
dictionaries and analyzers in the language of their choice, but the
system includes an example of British English.

This system was designed by Graeme Ritchie, Steve Pulman, Graham
Russell, and Alan Black. More Info: here.

The MORFOGEN demo program


Ftp Directory

MORFOGEN is a morphological rule compiler and dictionary interface
tool which consists of a finite state compiler (that converts inflectional
and derivational paradigms into a finite state machine) and a
recognizer, which accepts inflected forms as input and returns base
forms (constrained by inflection class information in the lexicon) as
well as any morphemes that matched during analysis.  MORFOGEN can
handle concatenative as well as non-concatenative morphology, and can
be customized, for use on languages of inflecting as well as agglutinating
types.

The demo program provides executables for Sun4 OS 4.1.1. 
More Info: here.

MRC Psycholinguistic Database


Ftp Directory

This second version of the MRC Psycholinguistic Database is a computer
usable dictionary (MRD) created from a large online database
originally used in psycholinguistic research. The text comes complete
with a suite of UNIX/C retrieval tools, but can be processed on any
machine under any operating system.  The file contains 150837 words
and provides information about 26 different linguistic properties,
although it is not the case that information about every property is
available for every word. No semantic information is included.
Linguistic properties include: number of letters, phonemes, and
syllables; measures of frequency; pronunciation; measures of
familiarity; part of speech; etc.  The original compilation research
was conducted by Professor Max Coltheart under a Medical Research
Council grant, and the resulting dictionary was documented and
corrected by Mike Wilson. More Info: here.

MTRAN

Ftp Directory

MTran is an MS project by Douglas Witmer (University of Texas at
Arlington), that analyzes 5 different linguistic theories with regard
to machine translation.  The MTran program was written to implement a
proposed fixed-text machine translation technique, arguing two
primary points: much of machine translation work can be done by a
person fluent in only one language, and analysis of text only needs to
be done once during the translation of text into multiple languages.

MTran is written in the ICON language and ICON interpreters for 8088
and 80386 architectures are provided. More Info: here.

MUC-5 Word Lists (Fifth Message Understanding Conference)


Ftp Directory

These files were gathered for the Fifth Message Understanding
Conference participants, and later released to all CLR members. Files
include a nationalities file of 216 countries with noun and adjective
forms of the nationalities, and two files on organization names, one
listing UN organizations and the other listing 187 international
organizations.  The BBN corporation provided corporation names in English
and Japanese, Japanese human names and place names, and a lexicon of
Japanese words in the business domain. More Info: here.

MY Russian Translation program


Ftp Directory

MY Russian translation program translates in a fully automatic mode
or a semi-automatic mode which allows interactive correction for new
words. It allows updating of the dictionaries; "teaching" the system
new words and context sensitive translation; and editing of text while
in the process of translating.  The program was originally designed
for college students.  The MY Russian translation program was written by
Yuri Yulaev. More Info: here.

NJStar


Ftp Directory

NJStar is a Japanese word processor for PCs with a WordPerfect-like
feel.  It supports the input, display and printing of Japanese
characters in JIS, EUC, and NEC-JIS.  NJStar has pull-down menus, cut
and paste, macros, built-in printing and drivers, multiple file
editing, and configurable keys.  Version 3.j is still offered as
shareware, but please honor the copyright and pay for a registered
version.  NJStar was created by Hongbo Ni, and comes complete with
extras like fonts and dictionaries.  More Info: here.

NIST SGML parser/validator program


Ftp Directory

The NIST NCSL OSE SGML package is an SGML parser and validation suite
for text into which tags in SGML format have been inserted.
Currently no documentation other than comments in the code is
available.  The C source code is supplied in this package, which is
designed primarily for Unix systems. More Info: here.

OED2 package


Ftp Directory

Oed2/Ox2 is a package of utilities used to manage network access to
the Oxford English Dictionary at the Waterloo Center for OED research
for online lookup of words, definitions, examples, or other patterns.
[Oed2/Ox2 is a front-end to the Pat pattern searching applied to the
Oxford English Dictionary Version 2.  Version 2 of the OED comprises
the merged Version 1 dictionary and the Supplement.]  Pat combines
very fast search capabilities over a very large text file with an
awkward user interface.  The oed2/ox2 program knows about the
structure of the dictionary file.  At the request of Oxford University
Press, this program should be installed as ox2 (not oed2) at non-UIUC
sites.  Because oed2/ox2 is a network resource, the oed2/ox2 program
can be compiled to use a remote Pat server.  In addition, a modified
telnet server is supplied for remote network access. More Info:
here 


Free On-Line Dictionary of Computing


Ftp Directory

The On-line Dictionary of Computing is a dictionary of programming
languages, architectures, operating systems, networking, theory,
mathematics, telecoms, acronyms, jargon, projects, history, in fact
anything to do with computing.  It was compiled by Dennis Howe, from
the Theory and Formal Methods section of the Department of Computing
at Imperial College of Science, Technology and Medicine, London. More
Info: here.

OALDCE: Mitton's Version

Oxford Advanced Learners Dictionary of Current English

Ftp Directory

The OALDCE is a well known dictionary of over 35,000 headwords.  This
version was prepared and documented by Roger Mitton of Birkbeck
College, University of London, and is often referred to as the
"computer-usable" version.  The dictionary contains no definitions;
the spelling, pronunciation, and syntactic information of the original
are used.  In addition to the headwords and subentries of the
original, this version is extended to include about 2,500 proper
names, and a section of over 68,000 derived inflected forms.  More
Info: here.

Parallel Text in English and Spanish

Pan American Health Organization 

Ftp Directory

The Pan American Health Organization (PAHO), Conferences and General
Services Division, has kindly allowed this group of sample parallel
texts to be released for NLP research purposes.  There are 180 pairs
of text, 360 individual files, which amount to about 8 Mb of data.
The documents cover the general domains of Public Health and Latin
America, but vary greatly in content and in length.  Some are short
memos or letters, most are longer reports and conference proceedings.
The Spanish documents do contain the Spanish character encoding.
Other formatting commands, such as tabs, centering, italicizing, etc.
have been removed.  Special thanks to Dr. Marjorie Leon for her
assistance in making these texts available.


The PC-KIMMO program


Ftp Directory

PC-KIMMO is a new implementation for microcomputers of a program
dubbed KIMMO after its inventor Kimmo Koskenniemi (see Koskenniemi
1983). It is of interest to computational linguists, descriptive
linguists, and those developing natural language processing systems.
The program is designed to generate (produce) and/or recognize (parse)
words using a two-level model of word structure, in which a word is
represented as a correspondence between its lexical-level form
(components) and its surface-level form (the way it is written). More
Info: here 


PLEUK grammar development system


Ftp Directory

Pleuk, published by the University of Edinburgh's Centre
for Cognitive Science, is a shell for grammar
development.  Many different grammatical formalisms can be embedded
within it, including Cfg, HPSG-PL, Mike, SLE, and Term.  Sample
grammars are provided for these formalisms.  Pleuk is being made
available by its authors in the hope that it will provide a set of
facilities for the production of new grammar formalisms. Pleuk
provides both an operating system in which to develop grammars as well
as a uniform environment in which grammar writers and testers can
situate implementations of a variety of grammar formalisms.  The
system offers a way of describing the modules of a grammar
formalism and defining operations over files and the objects described
in them, such as compilation, editing and display. Pleuk uses a
standard Prolog environment and takes advantage of functionalities
like high quality graphs or menu-based input when these are available.
More Info: here.

The Poor Man's TeX system


Ftp Directory

PM-TeX provides a simple macro-based approach to typeset Chinese,
Japanese, and Korean text using either LaTeX or TeX.  It comes with
programs that generate MetaFont fonts for these scripts from existing
bitmap fonts.  A number of conversion programs to convert the Chinese,
Japanese, or Korean text into a form that PM-TeX can use are included.
PM-TeX works with almost any (La)TeX system on almost any platform.
More Info: here.

The POPX Prolog code archive


Ftp Directory

These are some of the libraries and programs from the DEC-10 Prolog
Library as well as some other interesting Prolog programs.  The list
is long, so please consult the info page. More Info: here.

PrepCOB.awk


Ftp Directory

This is a program (designed by Archibald Michiels and Jacques Noel,
Universite de Liege) that reads the Collins CoBuild English Language
Dictionary (Harper-Collins) and renders it so that its information is
more accessible.  The information that is kept includes: the lexeme
with the reading number, the headword with morphological variants,
grammar information, the definition, and examples. The result is a
list of awk records, each separated by a blank line. More Info:
here.

Proper Names Wordlist


Ftp Directory

This is a list of around 114,000 words taken from the 170 million word
Cobuild "Bank of English" Corpus.  The list was created from words
which appear predominantly in uppercase, and are therefore good
candidates for proper nouns. The file has three columns: the word, its
frequency with initial capital letter, and its frequency with
lowercase initial letter.  The files were prepared by Jem Clear, and
are freely available except for commercial uses. More Info:
here.
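
The three-column layout lends itself to a simple filter. The sketch below assumes, without having checked the files, that the columns are separated by whitespace; the function name `capitalized_share` and the sample lines are invented for illustration.

```python
def capitalized_share(line):
    """Return (word, share of occurrences that have an initial capital)."""
    # Assumed layout: word, capitalized frequency, lowercase frequency.
    word, cap, low = line.split()
    cap, low = int(cap), int(low)
    total = cap + low
    return word, (cap / total if total else 0.0)

# Invented sample lines, not taken from the actual wordlist.
sample_lines = ["London 950 50", "Bill 600 400"]
for line in sample_lines:
    word, share = capitalized_share(line)
    if share >= 0.9:  # hypothetical threshold for a "likely proper noun"
        print(word, round(share, 2))
```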

The Roget 1911 thesaurus


Ftp Directory

This thesaurus is an electronic version of the edition of Roget's
classic Thesaurus published in 1911 by the Crowell company.  The large
number of English words is subdivided by sense into a series of large
antonym groups and into subgroups, indexed by subjects whose tags
appear in the outline.  (Alphabetical indexing is not provided as it
remains under copyright; most of the original printer's effects such
as italics for borrowings appear.) More Info: here.

ROOK


Ftp Directory


Rook is a system for authoring descriptive grammars in HyperCard for
the Macintosh.  It is a tool for interactively and incrementally
developing a grammar description based on an interlinear text corpus
(as produced by SIL's IT program). The resulting on-line descriptive
grammar exploits the capacity of the computer to provide instant
access to cross-referenced topics, text examples, explanations of
morpheme glosses, and so on.  It was designed by J. Randolph
Valentine. More Info: here.

RUSSIAN ENGLISH On-Line Dictionary


Ftp Directory

Russian English On-Line Dictionary is a memory-resident program for
the MS DOS operating system developed by Leon Ungier. Version 1.25 is
freely available. You cannot add items to the main dictionary, but a
Personal Dictionary Manager allows the creation of your own
dictionary files. More Info: here.

SAX Syntactic Analyzer (synana)


Ftp Directory

SAX (Sequential Analyzer for syntaX and semantics) is a syntactic
analyzer based on logic programming. SAX employs a bottom-up and
breadth-first parsing algorithm. The SAX grammar rules are basically
written in Definite Clause Grammar (DCG). The SAX grammar rules are
translated into a parsing program written in Prolog. SAX is
implemented in SICStus Prolog Ver 0.7.

Included with this system is a Japanese grammar and some sample
Japanese data. More Info: here.

Sense Tagged Text of the word "Interest"


Ftp Directory

These files were made available by Rebecca Bruce, at New Mexico State
University, from work done by herself and Dr. Jan Wiebe.  The data
file is composed of sentences containing the noun "interest" or
"interests" that were automatically extracted from the Penn Treebank
Wall Street Journal corpus.  The file includes the part-of-speech tags
and phrase bracketing provided in the original corpus. Each sentence
in the data file contains one sense-tagged occurrence of the word
"interest" (or "interests").  The sense tags correspond to the six
non-idiomatic noun senses of "interest" defined in the first edition
of Longman's Dictionary of Contemporary English.  In total, there are
2,369 sentences. More Info: here.

SGML2LATEX


Ftp Directory

The sgml2latex system is a set of SGML document type definitions for
the LaTeX document styles (articles, books, reports, letters, slides),
for BibTeX bibliographies and for Unix manual pages, a set of programs
for doing the translation from SGML to LaTeX or troff/nroff, and a
program for extracting source code from documentation, providing a
simple "literate programming" facility.

To use the 'qwertz' documentation system, the 'sgmls' SGML parser is
required (also available from the CLR). More Info: here 


SGMLS


Ftp Directory

Sgmls is an SGML parser derived from the ARCSGML parser materials
which were written by Charles Goldfarb.  It works on Unix, MS-DOS and
VAX/VMS.  It should be straightforward to port to most systems that
provide ANSI C and use an ASCII-based character set.  It outputs a
simple, easily parsed, line oriented, ASCII representation of an SGML
document's Element Structure Information Set (see pp 588-593 of ``The
SGML Handbook'').  It is intended to be used as the front end for
structure-controlled SGML applications.  For compatibility with the
Amsterdam SGML Parser (ASP), there is also a filter that translates
the output of sgmls using an ASP replacement file. More Info:
here.

SHAKE AND BAKE Demo


Ftp Directory

This is a Spanish-English translation system demo written by John L.
Beaven, at the University of Edinburgh.  Other contributors are Guy
Barry, Robin Cooper, Mark Johnson, and Chris Mellish.  The system
exploits recent advances in lexicalist unification-based grammar
theories.  The system provides greater modularity of the monolingual
components.  The approach is demonstrated by presenting very different
Unification Categorial Grammars for small fragments of English and
Spanish; the grammars contain linguistically interesting phenomena
such as word order variation and clitic placement.  The monolingual
grammars are put into correspondence by means of a bilingual lexicon.
More Info: here.

The SHOEBOX program


Ftp Directory

SHOEBOX is a database management program, designed expressly to meet
the needs of the field linguist. Using SHOEBOX, the linguist can
easily enter, edit, and analyze lexical, textual, anthropological and
other types of data in multiple datafiles. For example, with SHOEBOX, you can:

+ Maintain a simple dictionary, or a more complex lexicon,
+ Interlinearize text, where new words are automatically entered 
  into the dictionary,
+ Do grammatical filing and analysis of text data,
+ Enter and file cultural notes,
+ Maintain nonlinguistic types of databases, such as address lists
  or library catalogs. More Info: here.

The Standard Industrial Classification Manual (SICM)


Ftp Directory

The SICM is a manual which defines the classification of economic
activities for the production of Federal economic statistics. Economic
activities are classed under 99 major headings, each with a potential 99
subheadings. Thus the manufacture of bulletproof vests is in class 
3842 -- Orthopedic, Prosthetic, and Surgical Appliances and Supplies
which is in group 384 -- Surgical, Medical, and Dental Instruments and
Supplies which, in turn, is in major group 38 -- Measuring, Analyzing, and
Controlling Instruments; Photography, Medical and Optical Goods; Watches
and Clocks. The manual is used to provide classifications for a
variety of computer applications. For example, companies maintaining
mailing lists may classify the organizations on the list by SIC code and
use this to target or specialize the type of mail sent to the
organization.

The manual is available in machine readable form and is an interesting
lexical resource in its own right. More Info: here.
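
Because each level of the hierarchy is a prefix of the four-digit code, the chain of groups in the bulletproof-vest example can be recovered by slicing. The Python sketch below is only an illustration: the function name `sic_levels` is invented, and the `TITLES` table holds just the three titles quoted above rather than the contents of the manual.

```python
# Titles copied from the example in the description; not the full manual.
TITLES = {
    "38": "Measuring, Analyzing, and Controlling Instruments; "
          "Photography, Medical and Optical Goods; Watches and Clocks",
    "384": "Surgical, Medical, and Dental Instruments and Supplies",
    "3842": "Orthopedic, Prosthetic, and Surgical Appliances and Supplies",
}

def sic_levels(code):
    """Split a four-digit SIC code into major group, group, and class."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a four-digit code, got {code!r}")
    return {"major_group": code[:2], "group": code[:3], "class": code}

for level, prefix in sic_levels("3842").items():
    print(level, prefix, TITLES.get(prefix, "(title not in this sketch)"))
```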

SUSANNE


Ftp Directory

The SUSANNE Corpus comprises an approximately 128000-word subset of
the Brown Corpus of American English, annotated in accordance with the
SUSANNE scheme.

The SUSANNE scheme attempts to provide a method of representing all
aspects of English grammar which are sufficiently definite to be
susceptible of formal annotation, with the categories and boundaries
between categories specified in sufficient detail that, ideally, two
analysts independently annotating the same text and referring to the
same scheme must produce the same structural analysis. More Info:
here.

SYNONAME


Ftp Directory

Names are a key entry point for researching and indexing historical
information.  However, the format and spelling of personal names vary
greatly from one institution to the next, reflecting traditional
differences in practice.  If historical information is to be compiled
or shared, matching different versions of personal names is a
necessity.  The computer program SYNONAME automatically matches many
possible forms of a single personal name by using an ordered sequence
of twelve algorithms for pattern matching that include both character-
and word-matching techniques.  The matched pairs of names are
considered to be "candidate matches" until confirmed by a human
name-authority editor.  Run against a merged file of artists' names
from museum collections data, the program performed with an accuracy
rate of 97.4% and an optimum efficiency rate of 90.8%.  Accuracy can
increase to nearly 99% at the expense of some efficiency.  The
concepts behind the algorithms and their implementation may be useful
to others merging data in different contexts. More Info: here.

SYNTACTICA


Ftp Directory

SYNTACTICA is a software application tool designed for use in
introductory syntax classes, or introductory linguistics classes with
a syntax component.  SYNTACTICA presents a simple graphical interface
for creating grammars and for viewing and transforming the structures
that they assign to natural language sentences.  Using SYNTACTICA, it
is possible to construct a grammar consisting of a set of context-free
phrase structure rules and (typically) a lexicon.  This grammar is
loaded into a TreeViewer window, which generates phrase-markers for
input sentences on the basis of the grammar that has been loaded.
SYNTACTICA is a production of the SUNY-Stony Brook Semantics Lab, and
was developed by Richard Larson under a grant from The National
Science Foundation.  It runs on the NextStep operating system.  More
Info: here.

TACT


Ftp Directory

TACT is an interactive full-text retrieval system for MS-DOS with
a number of analytical tools.  Like others of its kind, TACT
retrieves segments of text according to specified word forms. In
addition, it can find words or character-strings that match
criteria the user specifies. TACT generates simple graphs to show
the distribution of forms throughout an entire text, or within
various structural divisions determined by the user. TACT also
allows retrieval by metatextual `categories'.

TACT was designed by John Bradley and Lidio Presutti at the University
of Toronto. More Info: here.

TAGGER.v1.01

TAGGER.v1.12

Ftp Directory

This is a part-of-speech tagger designed by Eric Brill at MIT.  The
tagger can be trained to tag, or an already trained tagger for English
can be used. The trainer uses a two-stage process: in the first stage
the tagger learns rules for tagging unknown words; in the second, the
rules learned involve the use of contextual cues to improve tagging
accuracy. The trained tagger also uses a two-stage process.  It
assigns most likely tags to every word in isolation. In the second
stage, contextual transformations are used to improve accuracy. More
Info: here.
This tagger has been updated to version 1.12 and is found in the 
same ftp directory as is its original version.
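
The two-stage tagging process described above can be illustrated with a toy Python sketch. Nothing here comes from the TAGGER distribution: the lexicon, the default tag for unknown words, the single rule, and the rule format (change one tag to another when the previous tag matches) are all assumptions made for the example.

```python
# Toy data, invented for illustration only.
LEXICON = {"the": "DT", "can": "MD", "rusted": "VBD"}
DEFAULT_TAG = "NN"            # assumed fallback for unknown words
RULES = [("MD", "NN", "DT")]  # change MD to NN if the previous tag is DT

def initial_tags(words):
    """Stage one: most likely tag for each word in isolation."""
    return [LEXICON.get(w.lower(), DEFAULT_TAG) for w in words]

def apply_rules(tags, rules):
    """Stage two: apply each contextual transformation in order."""
    tags = list(tags)
    for old, new, previous in rules:
        for i in range(1, len(tags)):
            if tags[i] == old and tags[i - 1] == previous:
                tags[i] = new
    return tags

def tag(words):
    return list(zip(words, apply_rules(initial_tags(words), RULES)))

print(tag(["The", "can", "rusted"]))
```

In this toy case "can" starts out with its most likely tag in isolation (a modal) and is retagged as a noun because it follows a determiner.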

TAMIL

Ftp Directory

PCTAMIL is a transliteration tool cum previewer which transliterates
Tamil documents written in Roman text into Tamil script; it is
compatible with LaTeX. Included is a font driver that produces the
bilingual ASCII setup for both English and Tamil fonts.  PULAVAN
(Tamil Pundit) is a set of programs for learning Tamil and includes a
program for verb conjugations in Tamil. TAGTAMIL is a part of speech
tagger; it is a morphological processor which can output the root form
of an inputted word and provide suitable tags for affixes. The
dictionary has a lexicon of 1000 words at this time.  The programs were
developed by Vasu Renganathan at the University of Washington. More
info: here.

TIMIT


Ftp Directory

TIMIT is a database of 6,100 English words with their most likely
pronunciation. Each entry is made up of two lines: the first line has
the word number followed by the spelling of the word, and the second
contains the transcription of the word using the set of 61 TIMIT
phones. This word list was provided by Chuck Wooters from ICSI at
Berkeley.  More Info: here.
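
A reader for the two-line entry format might look like the Python sketch below. The exact file layout has not been checked: whitespace-separated fields are assumed, blank lines are skipped, and the function name `parse_entries` and the sample entries are invented for illustration.

```python
def parse_entries(lines):
    """Pair up lines: 'number word' followed by a phone transcription."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    entries = []
    for head, phones in zip(rows[0::2], rows[1::2]):
        number, word = head.split(maxsplit=1)
        entries.append(
            {"number": int(number), "word": word, "phones": phones.split()}
        )
    return entries

# Invented sample in the assumed layout, not copied from the database.
sample = ["1 she", "sh iy", "2 had", "hh ae d"]
for entry in parse_entries(sample):
    print(entry)
```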

TULIP


Ftp Directory

This is a two-level phonological analyzer based on the system
described in S.G. Pulman and M.R. Hepple, "A Feature Based Formalism
for Two Level Phonology: a description and implementation."  More
Info: here.

Turkish Text


Ftp Directory

These texts were kindly given to CLR by Dr. Kemal Oflazer of Bilkent
University; some of them will be available in the European Corpus
Initiative project's CD. One file contains a news feed from the
Anatolian News Agency from September of 1992.  The other file has
miscellaneous pieces from popular publications.  More Info:
here.

UBS (Unifikations Basierte Sprache)


Ftp Directory

UBS is a formal language, which allows users to specify HPSG grammars.
It is an extension of SEPIA Prolog that was developed to accommodate
those aspects of the grammar formalism HPSG which cannot be
implemented in regular Prolog in a straightforward manner. UBS is able
to process typed feature structures over type hierarchy trees,
unification, disjunction, negation, functional dependent values
(relations) and sets (unification of sets). The parser/generator
includes as data the HPSG grammar for English.  UBS is by Frieder
Stolzenburg from the Universitaet Koblenz-Landau.  More info:
here.

UNIFIED MEDICAL LANGUAGE SYSTEM


Ftp Directory
		  lexica/UMLS/samples

The Unified Medical Language System project is sponsored by the
National Library of Medicine at the Department of Health and Human
Services.  UMLS has developed two machine readable "Knowledge Sources"
of medical lexicons: a Metathesaurus and a Semantic Network. The CLR
archive does not house the data and software (currently over 500 Mb of
information); rather this directory contains descriptive fact sheets
on the project, sample files, complete copies of the program
documentation, and instructions for the license agreement letter. The
entire database is still available for no charge at this time.  The
info file contains a few citations by linguists who have worked with
the data.  More Info: here.

University of Pennsylvania Morphological Analyzer


Ftp Directory

The morphology package contains a large morphological database, an
X-window based maintenance program, and C and Lisp hooks for
interfacing the database to other software programs.  The database
itself consists of approximately 316,000 inflected items, along with
their root forms and inflectional information (such as case, number,
mode).  There are 13 parts of speech: Noun, Proper Noun, Pronoun,
Verb, Verb Particle, Adverb, Adjective, Preposition, Complementizer,
Determiner, Conjunction, Interjection, and Noun/Verb Contraction.
Nouns and Verbs are the largest categories, with approximately 213,000
and 46,500 inflected forms, respectively.  The access time for a given
inflected entry is 0.6 msec.  The maintenance program runs under the
X-window interface, and allows the user to customize the database to
their needs.  An inexperienced user can easily add, delete, or modify
entries in the existing database, and a person passingly familiar with
X windows and C array structures can customize the package for a
different language, building their own database with different parts
of speech and/or inflectional information.  There are C and Lisp
functions that provide hooks to allow developers to incorporate the
database into existing research projects.  The entire package requires
about 25-30 MB of space. More Info: here.
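
The lookup described above, from an inflected form to its root and
inflectional information, can be sketched as follows. This is only an
illustration in Python: the package itself provides C and Lisp hooks,
and the names Analysis, DATABASE, analyze and add_entry, as well as
the toy entries, are assumptions made for this example rather than
part of the distribution.

```python
from typing import Dict, List, NamedTuple, Tuple


class Analysis(NamedTuple):
    """One reading of an inflected form (hypothetical record layout)."""
    root: str
    pos: str
    features: Tuple[str, ...]


# Toy entries for illustration only; the real database is described as
# holding approximately 316,000 inflected items.
DATABASE: Dict[str, List[Analysis]] = {
    "ran": [Analysis("run", "Verb", ("past",))],
    "geese": [Analysis("goose", "Noun", ("plural",))],
    "saw": [
        Analysis("see", "Verb", ("past",)),
        Analysis("saw", "Noun", ("singular",)),
    ],
}


def analyze(form: str) -> List[Analysis]:
    """Return every stored analysis of an inflected form (may be empty)."""
    return list(DATABASE.get(form.lower(), []))


def add_entry(form: str, root: str, pos: str, *features: str) -> None:
    """Add a new inflected form, as a user customizing the database might."""
    DATABASE.setdefault(form.lower(), []).append(
        Analysis(root, pos, tuple(features))
    )


if __name__ == "__main__":
    for reading in analyze("saw"):
        print(reading.root, reading.pos, reading.features)
```

An ambiguous form such as "saw" returns one analysis per reading, which
is why each form maps to a list rather than to a single record.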

VERBALIST


Ftp Directory

Verbalist is a program to demonstrate English verb forms, written by
John and Muriel Higgins. It is an MS DOS, or MS Windows, application
which conjugates English verbs. It contains a dictionary with all the
irregular verbs in English, and a sampling (300) of the other verbs.
The dictionary is extensible and the verb forms covered are extensive.
More Info: here.


Vietnamese Text Processing Software


Ftp Directory

These software packages are produced by the Vietnamese Professionals
Society. VPSedit is a Vietnamese text editor for the PC, and VPSwin is
the Windows version.  VPSwin has a spell checker and a hoi/nga lookup
table.  There are also three font files included; VPSfont1 is a True
Type font set for Windows with 8 True Type fonts. More Info:
here.

Vietnamese Text Tools


Ftp Directory

This directory includes a variety of tools and information for
processing Vietnamese text. The tools are fonts, a text converter, and a
text printer.  A proposal for the Vietnamese Standard for Information
Interchange is also included. More Info: here.

WLIST


Ftp Directory

This program is authored and copyright protected by Ari Hovila and
Jari Perkiömäki, University of Vaasa. WLIST is a language independent
word length and word frequency counter, intended as a statistical tool
for any language user. The program can recognize all words in an ASCII
file as well as count their occurrences. WLIST counts the lengths of
all unique words as well as the average length of all words. Sorting
is determined by the alphabet of the
language you are working in. Sample sorting files are included for
Finnish, Swedish, Norwegian, Danish, English, French and German.  The
user can create their own as needed. WLIST was compiled to run under
DOS.  More Info: here.
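
The kind of counting and alphabet-driven sorting described above can
be sketched in Python as follows. This is not WLIST itself, which is a
DOS program; the function names and the ALPHABET string (a Swedish
ordering, a-z followed by å, ä, ö) are assumptions made for this
example, and WLIST's own sorting file format is not shown here.

```python
import re
from collections import Counter

# Hypothetical sort alphabet: Swedish order, where å, ä, ö follow z.
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def word_stats(text):
    """Return (frequency table, length of each unique word, average length)."""
    # Runs of letters count as words; digits and punctuation are skipped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    freq = Counter(words)
    lengths = {word: len(word) for word in freq}
    average = sum(len(w) for w in words) / len(words) if words else 0.0
    return freq, lengths, average


def sort_key(word):
    """Rank each letter by its position in ALPHABET; unknown letters go last."""
    return [RANK.get(ch, len(ALPHABET)) for ch in word]


def sorted_words(freq):
    """List the unique words in the order given by ALPHABET."""
    return sorted(freq, key=sort_key)


if __name__ == "__main__":
    freq, lengths, average = word_stats("Öl apa zebra apa åka ära")
    for word in sorted_words(freq):
        print(word, freq[word], lengths[word])
    print("average word length:", average)
```

Sorting by an explicit alphabet matters because plain character-code
order would place "ära" before "åka", whereas Swedish order places å
before ä.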

The Word Lists


Ftp Directory

These word lists are copies of a number of word lists that are freely
available from a number of sites in Europe and North America.  The
origins of some of them are currently unknown, but are being checked.
Current languages are English, Dutch, English (shorter list), German,
Norwegian, Italian, and Swedish.  Several word lists containing names
are also available. More Info: here.

The OTA word lists


Ftp Directory

The Oxford Text Archive word lists contain word lists from the
following categories. More Info: here.

Australian
Chinese (only a list of the HanYu PinYin)
Computer (various stuff including common passwords, domains, etc.)
Danish
Dutch
Finnish
French
German
Italian
Japanese (List of words in Romaji - see edictj reference in here).
Literature (including various authors and genres)
Movies&TV (including Monty Python and Star Trek word lists)
Names (includes names in a number of languages and others)
Norwegian
Place Names (including colleges, world factbook, zip codes, etc.)
Random (includes various random sorts of word lists)
Religion (includes Qur'an and King James Bible word lists)
Science (includes asteroids and biology lists)
Spanish
Swedish
Yiddish

WORDSURV

Ftp Directory

A typical language survey may involve activities like determining
linguistic relationships through the comparison of word lists, testing
dialect intelligibility by playing back tape-recorded texts, and
studying sociolinguistic aspects of language use and language
attitudes in multilingual situations. WORDSURV is designed to aid the
first of these areas--the collection and analysis of word lists. It
functions in three main areas: (1) data entry and maintenance, (2)
data analysis, and (3) data output. WORDSURV also supports specialized
kinds of analysis, including lexicostatistics, phonostatistics, and
comparative reconstruction. More Info: here.
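
As a rough illustration of the lexicostatistic comparison mentioned
above, the Python sketch below computes the percentage of shared
cognates between pairs of word lists. It is not WORDSURV's own
procedure or file format: the function names, the variety names and
the cognate-set labels are all hypothetical, and the sketch assumes
that a linguist has already grouped the words for each gloss into
cognate sets.

```python
from itertools import combinations


def percent_cognate(list_a, list_b):
    """Percentage of glosses present in both lists whose cognate sets match."""
    shared = [gloss for gloss in list_a if gloss in list_b]
    if not shared:
        return 0.0
    matches = sum(1 for gloss in shared if list_a[gloss] == list_b[gloss])
    return 100.0 * matches / len(shared)


def similarity_table(word_lists):
    """Pairwise percentages for every pair of varieties."""
    return {
        (a, b): percent_cognate(word_lists[a], word_lists[b])
        for a, b in combinations(sorted(word_lists), 2)
    }


if __name__ == "__main__":
    # Hypothetical data: variety -> {gloss: cognate-set label}.
    word_lists = {
        "A": {"water": 1, "fire": 1, "stone": 1, "dog": 1},
        "B": {"water": 1, "fire": 2, "stone": 1, "dog": 1},
        "C": {"water": 2, "fire": 2, "stone": 1},
    }
    for pair, score in similarity_table(word_lists).items():
        print(pair, round(score, 1))
```

Only glosses elicited in both lists are compared, so a missing item
(such as "dog" in variety C) lowers the number of comparisons rather
than counting as a mismatch.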

XEROX Part-of-speech Tagger


Ftp Directory

This part-of-speech tagger, designed by Doug Cutting and Jan Pedersen
at Xerox, was written in ANSI Common Lisp. Its development was done
in Franz Allegro Common Lisp version 4.1 on SunOS 4.x and Macintosh
Common Lisp 2.0p2. The following are provided:  source code, a
tokenizer for plain ASCII English, an English lexicon induced from the
Brown corpus, a table mapping word suffixes to likely
ambiguity classes, and an HMM trained on the odd-numbered sentences in
the Brown corpus. More Info: here.
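
The data preparation described above can be illustrated with a small
Python sketch. It is not the Xerox code, which is written in Common
Lisp; the function names and the toy tagged sentences are assumptions
made for this example. An ambiguity class is taken here to be the set
of tags observed for a word, and the suffix table is simplified to the
union of the classes of all words sharing a fixed-length suffix.

```python
from collections import defaultdict


def split_odd_even(sentences):
    """Split into odd- and even-numbered sentences, numbering from 1."""
    return sentences[0::2], sentences[1::2]


def induce_lexicon(tagged_sentences):
    """Map each word to its ambiguity class (the set of tags seen for it)."""
    lexicon = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word.lower()].add(tag)
    return {word: frozenset(tags) for word, tags in lexicon.items()}


def suffix_classes(lexicon, suffix_len=3):
    """Map each suffix to the tags of known words ending in it (simplified)."""
    table = defaultdict(set)
    for word, tags in lexicon.items():
        if len(word) > suffix_len:
            table[word[-suffix_len:]].update(tags)
    return {suffix: frozenset(tags) for suffix, tags in table.items()}


if __name__ == "__main__":
    # Hypothetical tagged sentences standing in for a tagged corpus.
    corpus = [
        [("The", "DT"), ("walking", "VBG")],
        [("A", "DT"), ("dog", "NN")],
        [("Walking", "NN"), ("helps", "VBZ")],
    ]
    training, held_out = split_odd_even(corpus)
    lexicon = induce_lexicon(training)
    print(sorted(lexicon["walking"]))
    print(sorted(suffix_classes(lexicon)["ing"]))
```

Training on the odd-numbered sentences leaves the even-numbered ones
untouched, so they can serve as held-out material for evaluation.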

YUGOSLAV CORPUS


Ftp Directory

This is a Serbo-Croatian corpus consisting of approximately 700,000
words.  The texts are taken from modern Yugoslav fiction and all
Serbo-Croatian-speaking areas-- Serbia, Croatia, Montenegro, and
Bosnia-Hercegovina-- are represented.  The texts are in ASCII format
and the Latin alphabet is used. A list of ASCII values of the special
Serbo-Croatian characters is provided in info file 0092. More Info:
here.