Starting from a part-of-speech tagged corpus, we apply an implementation of the Minimum Description Length (MDL) algorithm. This generates a set of possible rules, which are screened by native speakers and classified as to type (noun phrase, verb phrase, and so forth). If this technique produces promising results in one language, we expect the method to be applicable to other languages as well.
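The idea of scoring candidate rules by description length can be illustrated with a minimal sketch. This is not the project's implementation. It assumes a simple two-part code: the grammar costs a fixed number of bits per rule symbol, and the corpus is charged its Shannon code length under the unigram distribution of the rewritten tag sequences. The function names, the cost model, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def apply_rule(seq, rhs, lhs):
    """Replace each non-overlapping occurrence of the tag n-gram `rhs` by `lhs`."""
    out, i, n = [], 0, len(rhs)
    while i < len(seq):
        if tuple(seq[i:i + n]) == rhs:
            out.append(lhs)
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out


def description_length(corpus, rules):
    """Two-part code length in bits: grammar cost plus corpus cost.

    Assumed cost model (illustrative only):
      - corpus: Shannon code length under the empirical unigram
        distribution of the rewritten sequences;
      - grammar: each symbol of each rule costs log2(alphabet size) bits.
    """
    rewritten = corpus
    for lhs, rhs in rules:
        rewritten = [apply_rule(s, rhs, lhs) for s in rewritten]

    counts = Counter(sym for s in rewritten for sym in s)
    total = sum(counts.values())
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())

    alphabet = set(counts) | {sym for lhs, rhs in rules for sym in (lhs, *rhs)}
    bits_per_symbol = math.log2(len(alphabet)) if len(alphabet) > 1 else 1.0
    grammar_bits = sum((1 + len(rhs)) * bits_per_symbol for _, rhs in rules)
    return grammar_bits + corpus_bits


# Toy tagged corpus (hypothetical): each sentence is a list of POS tags.
corpus = [["DT", "NN", "VB", "DT", "NN"], ["DT", "NN", "VB"]]
baseline = description_length(corpus, [])
with_np = description_length(corpus, [("NP", ("DT", "NN"))])
print(baseline, with_np)
```

Under this cost model a candidate rule is worth keeping when the total with the rule is smaller than the total without it. Rules that pass such a test would still go to the native-speaker screening described above.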
Funding for this research has been provided by the IDEAS project under the title, "Computation Grammars From Text" (MDA904-96-C-0353). Summary reports of the work done under this project are included below.
Personnel involved in this research include Dr. Stephen Helmreich, project administrator, and Dr. Yorick Wilks (Sheffield University), consultant. Mark Davis at CRL has implemented a version of the MDL algorithm using genetic programming and has contributed the "MDL game," which lets the player pick which "rules" to choose and then evaluates their effect on the Description Length. Chunyu Kit, formerly of Sheffield and now in Hong Kong, provided the implementation currently being used in the project, assisted by Alex Krotov in Sheffield. Ted Dunning, formerly of CRL, has also been influential in this work.
Report 1: March - April, 1996
Summary: The work in this period resulted in significant revision of plans for the project, outlined below. A major effort was mounted to obtain resources of various sorts for the project.
I. Revision of project goals and tasks
The first task for this project was the re-thinking of the goals of the project in light of the facts listed below:
1) The departure from CRL of Ted Dunning, one of the key personnel for the project.
2) The shift in focus of the project from major experimentation in English first and then a second language, to fairly immediate work on a second and then a third language (Russian and Serbo-Croatian).
3) The revision of the budget to cover slightly more of the time of the key personnel and thus much less work for graduate assistants.
The consequences of these facts are several. First, in terms of purely mathematical and statistical approaches to natural language processing, there is no natural replacement for Ted Dunning. Dr. David Farwell was chosen to replace Ted. David brings his excellent linguistic, managerial, and intellectual skills to the task. Both Helmreich and Farwell are well suited to carry out research in this area, but the research will be geared more to using linguistic universals than to purely mathematical approaches.
Second, the shift in focus to immediate experimentation on languages other than English means that much of the schedule of work outlined in the original IDEAS proposal must now be rethought. That schedule anticipated working on English during the first 3/4 of the project. The techniques perfected on English would then be moved directly to the second language, allowing nearly nine months to develop the resources (dictionary, tagger, corpus, parsed text) needed in the second language. Now the development of these resources must be moved to a fast track.
The result of this rethinking was to concentrate first on Russian, using a morphological analyzer based on an algorithm created by Lana Scheremetyeva and programmed by Nick Ourusoff. This algorithm does not require a large dictionary and it is possible that it can be adapted to Serbo-Croatian with minimal changes.
One advantage of this change was that it brought the resource goals of the project much closer to those of the Corelli project (a follow-on to the TEMPLE project). Sharing resource development thus became a real possibility, and a particularly useful one given the change in the project's budget.
This shift in budgeting priorities meant that the budget no longer contained substantial funding for resource development by purchase or by creation. The likelihood of being able to share programs and resources with the Corelli project was a real plus, though the slightly later start date of the Corelli project would mean some delay. The project, however, is without skilled programmers or personnel with great expertise in statistical approaches to language analysis.
It thus became clear that the project would need to rely more heavily on resources and programs that were available off-the-shelf or from other lab projects. The resources brought to the project through the cooperation with Dr. Wilks at Sheffield will also be more important for the success of the project.
The use of a native speaker informant became both more vital, given the need for more rapid development of Serbo-Croatian resources, and less feasible, given the lower budget figures for research assistants.
II. Resource acquisition
Much of the project time for the first two months was spent in searching for resources in Serbo-Croatian: corpora, dictionaries, and taggers. This involved checking for Serbo-Croatian grammars and dictionaries available in print (via InterLibrary Loan), looking for native speakers locally and for linguistically-trained native speakers elsewhere, subscribing to Serbo-Croatian mailing lists, surfing the web, advertising in appropriate lists, and looking for on-line dictionaries.
The result of these searches was a considerable catalog of hard-copy grammars and dictionaries, a small corpus of Serbian literature on-line, and leads towards an on-line Serbian-Bosnian-Croatian-English set of bilingual dictionaries.
Additional work was also done by Mark Leisher in preparing an editor that could handle Serbo-Croatian codesets and display Cyrillic fonts. Wanying Jin also used hard-copy grammars and dictionaries to begin preparations of lists of endings for various grammatical categories.
III. Plans for the next reporting period (May-June, 1996)
The plans for the next reporting period include continued work on acquisition of resources, in particular the Serbo-Croatian on-line dictionary for which we are currently negotiating. Once this is obtained, a tagger for Serbo-Croatian can be developed. Acquisition of appropriate statistical tools should also be undertaken, relying on packages being developed at Sheffield. Other approaches (such as neural nets and genetic algorithms) should also be investigated.