Next: Intra-Language Matching Up: A Full-Text Experiment in Previous: Example-Based MT

Initial Cleaning and Alignment of Corpus

The UN Spanish-English corpus was delivered to us as two sets of identically named files with the suffixes .eng (for English) and .spa (for Spanish). The texts in identically named files were supposed to be translations of one another. In reality, the widely diverging sizes of ostensibly aligned files showed that mistakes had occurred in the preparation of the original corpus. In addition, there were discrepancies in the formatting of titles, tables, headers, footnotes and footnote markers; inconsistent marking of page boundaries; and inconsistent use of "decorative" delimiters of sections and documents (lines of asterisks, dashes, underscores, dots, etc.). A more complex problem had to do with differences in alphabetization: many documents had sections titled with the names of countries, which often differ between the two languages, so alphabetically arranged sections and paragraphs appeared in a different order in each language. Ancillary information, such as the date, the document number, the language of the document and statements about the original language of the report, had to be deleted. Footnotes that included UN resolution numbers also caused alignment problems and had to be deleted.

The simplest approach would be to delete all "non-textual" material, but this would lead to the loss of many clues used in the alignment process. Our aim, therefore, was to delete as much non-textual material as possible while retaining the clues that support alignment.
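As an illustration of this kind of selective cleaning, the following minimal sketch removes only the "decorative" delimiter lines and leaves everything else, including paragraph numbers, in place. It is not the procedure actually used on the corpus; the pattern, the three-character threshold and the function name strip_decorative_lines are assumptions made for the example.

```python
import re

# Assumed pattern (not from the original work): a line consisting only of
# three or more asterisks, dashes, underscores or dots, optionally spaced.
DECORATIVE = re.compile(r"^\s*[*\-_.](?:\s*[*\-_.]){2,}\s*$")


def strip_decorative_lines(text):
    """Drop delimiter-only lines; keep all other lines, including blank ones."""
    kept = [line for line in text.splitlines() if not DECORATIVE.match(line)]
    return "\n".join(kept)
```

Blank lines and numbered paragraphs are deliberately preserved here, since paragraph boundaries and paragraph numbers are among the clues an aligner can exploit.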

Many existing alignment algorithms (Brown et al., 1991; Gale and Church, 1991) are designed to handle asymmetrical alignments of the 1-0, 1-2, 2-3 or 1-3 kind. For simplicity, we chose to use only symmetrical alignment, in which sentences in the source language corpus correspond to sentences in the target language corpus in a one-to-one fashion. As a result of this decision, about 5% of the corpus was lost; for our purposes this proved immaterial. Our alignment algorithm was therefore extremely simple: take the texts paragraph by paragraph (we were often able to make use of the fact that many paragraphs in the UN corpus are numbered, which simplified paragraph-level alignment further), discard paragraphs with an unequal number of sentences, and align the sentences of the remaining paragraphs one-to-one, in order.
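The procedure can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original implementation: it assumes that paragraphs are separated by blank lines and already correspond by position, and it uses a naive punctuation-based sentence splitter. The function names are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on blank lines (an assumption about the cleaned file layout)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]


def align_one_to_one(source_text, target_text):
    """Pair sentences in order, skipping paragraphs whose counts differ."""
    pairs = []
    paragraphs = zip(split_paragraphs(source_text), split_paragraphs(target_text))
    for src_par, tgt_par in paragraphs:
        src_sents = split_sentences(src_par)
        tgt_sents = split_sentences(tgt_par)
        if len(src_sents) != len(tgt_sents):
            continue  # unequal sentence counts: discard the whole paragraph
        pairs.extend(zip(src_sents, tgt_sents))
    return pairs
```

In this sketch a paragraph pair with two English sentences and one Spanish sentence is dropped entirely, which is the source of the corpus loss mentioned above. A real splitter would also need to handle abbreviations and numbered headings.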






Steve Beale
Tue Oct 1 12:14:38 MDT 1996