
Intra-Language Matching

The input to this step is a Spanish sentence which is submitted for translation and a corpus of Spanish text. The output is a list of the form

((input-substring-1 
     ((corpus-string-1-1 score-1-1)
      (corpus-string-1-2 score-1-2)
       ...
      (corpus-string-1-10 score-1-10)))

 (input-substring-2 
     ((corpus-string-2-1 score-2-1)
      (corpus-string-2-2 score-2-2)
       ...
      (corpus-string-2-10 score-2-10)))
...

 (input-substring-10 
     ((corpus-string-10-1 score-10-1)
      (corpus-string-10-2 score-10-2)
       ...
      (corpus-string-10-10 score-10-10)))
)
in which no input substring fully includes any other input substring and the list of ten corpus substrings constitutes the procedure's choice of the ten ``best'' matches of the input substring by strings from the corpus.

Each word in corpus-string-i (with the exception of the members of a special list of ``common'' or ``frequent'' words) is referenced through three indices:

 
(file-index-i sent-index-j word-index-k)

where word-index-k is the position of the word (actually, any member of the inflectional paradigm for that word) in the corpus sentence; sent-index-j refers to the position of the sentence in the corpus file; and file-index-i references the corpus file itself.

If a word does not appear in the corpus, an empty list is returned. If a word belongs to a special list of ``frequent'' words, a special symbol is returned since the corpus was, for efficiency reasons, not indexed for words that are too frequent.
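The lookup behaviour described above can be sketched as a small inverted index. This is only an illustration under stated assumptions: the function names, the `FREQUENT` marker and the use of lower-casing as a stand-in for mapping a word to its inflectional paradigm are all hypothetical, not taken from the system itself.

```python
from collections import defaultdict

# Hypothetical marker returned for words on the "frequent" list.
FREQUENT = "*FREQUENT*"


def build_index(corpus, frequent_words, lemma=str.lower):
    """Index every non-frequent word by (file, sentence, word) position.

    corpus: list of files; each file is a list of sentences;
            each sentence is a list of word strings.
    lemma:  stand-in for mapping a word to its inflectional paradigm
            (assumption: plain lower-casing here).
    """
    index = defaultdict(list)
    for f, sentences in enumerate(corpus):
        for s, words in enumerate(sentences):
            for w, word in enumerate(words):
                key = lemma(word)
                if key not in frequent_words:
                    index[key].append((f, s, w))
    return index


def lookup(index, word, frequent_words, lemma=str.lower):
    """Return index triples, [] for unseen words, FREQUENT for frequent ones."""
    key = lemma(word)
    if key in frequent_words:
        return FREQUENT
    return index.get(key, [])


corpus = [[["La", "casa", "es", "grande"], ["Una", "casa", "blanca"]]]
frequent = {"la", "es", "una"}
idx = build_index(corpus, frequent)
casa_hits = lookup(idx, "casa", frequent)
```

Here `casa_hits` holds one triple per occurrence of "casa", while a frequent word such as "la" yields the marker and an unseen word yields an empty list.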

The search for match candidates proceeds as follows. A sentence is broken into segments at punctuation marks or unknown words, and a list of all contiguous substrings (``chunks'') of a segment is produced. For every input chunk we look for sentences in the Spanish corpus that contain a matching substring. We assume a relaxed definition of matching in that we allow not only complete matches but also matches in which:

For each inexact match, we calculate a penalty (see Nirenburg et al., 1993, for an earlier version of this approach).
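The segmentation and chunk enumeration steps of the candidate search can be sketched as follows. The tokenizer, the `known_words` set and both function names are assumptions made for illustration; the actual system may tokenize and detect unknown words differently.

```python
import re


def segments(sentence, known_words):
    """Split a sentence into word lists, breaking at punctuation marks
    and at words that are not in known_words (assumed vocabulary)."""
    result, current = [], []
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        if re.fullmatch(r"\w+", token) and token.lower() in known_words:
            current.append(token)
        elif current:
            # Punctuation or an unknown word closes the current segment.
            result.append(current)
            current = []
    if current:
        result.append(current)
    return result


def chunks(segment):
    """Return every contiguous substring (chunk) of a segment."""
    n = len(segment)
    return [segment[i:j] for i in range(n) for j in range(i + 1, n + 1)]


known = {"la", "casa", "blanca", "el", "perro", "corre"}
segs = segments("la casa blanca, el perro xyz corre", known)
first_chunks = chunks(segs[0])
```

A segment of n words yields n(n+1)/2 chunks, so the three-word segment above produces six.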

The output of the candidate-finding procedure is filtered to retain only matches whose match scores are above a threshold. Match scores are first calculated separately for each of the kinds of incomplete matches listed above; a cumulative score is then produced. In our current system the cutoff is set at the 10 best matches. This can be changed to include, for instance, all candidate matches with ratings above a certain threshold. For efficiency reasons, we added a preliminary stage to the selection process that filters out matches we considered clearly unacceptable: we did not allow any gaps of length 5 or higher. We expect to improve and modify this early filtering in the future.
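A minimal sketch of this two-stage selection is given below. It assumes that each candidate carries a cumulative penalty, that a lower penalty means a better match, and that the longest gap in the match is already known; the tuple layout and the function name are hypothetical.

```python
import heapq


def select_matches(candidates, n_best=10, max_gap=4):
    """Keep the n_best candidates after the early gap filter.

    candidates: iterable of (corpus_string, penalty, longest_gap) tuples.
    Assumption: lower penalty is better; gaps of length 5 or more
    are rejected before ranking.
    """
    plausible = [c for c in candidates if c[2] <= max_gap]
    return heapq.nsmallest(n_best, plausible, key=lambda c: c[1])


candidates = [
    ("a b c", 12, 0),
    ("a x b c", 5, 1),
    ("a ... c", 0, 5),   # rejected: gap of length 5
    ("c b a", 30, 0),
]
best_two = select_matches(candidates, n_best=2)
```

Switching to a score threshold instead of a fixed count would only require replacing the `heapq.nsmallest` call with a filter on the penalty.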

The following heuristics guide the scoring process for the matches:

Unmatched Words
Higher priority is given to sentences containing all the words in an input chunk. The penalty for unmatched words is presently set to 10. This penalty is applied for each instance of an unmatched word in the input.

Noise
Higher priority is given to corpus sentences which have fewer extra words. The penalty for extra words in the corpus sentence is presently set to 5. This penalty is applied for each instance of an extra word in the corpus sentence.

Order
Higher priority is given to sentences containing input words in the order which is closer to their order in the input chunk. The penalty for misordering is presently set to 15. This penalty is applied for each instance of word inversion.

Morphology
Higher priority is given to sentences in which words match exactly rather than against morphological variants. If words match identically, no penalty is presently applied. If words match on morphological variants, we consider whether the word is a content word or a frequently occurring function word. The penalty for morphological matches of content words is presently set to 2 and the penalty for morphological matches of function words is presently set to 1. The appropriate penalty is applied for each word match in the chunk. Note the possibility of false positives, as in the case of the English nominal plural and verbal third person singular; these would be exact matches and would not be detectable unless the corpus and the input are tagged by parts of speech.
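The four heuristics above amount to a weighted count of mismatches. The sketch below uses the penalty values stated in the text; the function name, the argument names and the assumption that the per-heuristic penalties are simply summed are illustrative only.

```python
# Penalty values as stated in the heuristics above.
UNMATCHED_PENALTY = 10       # per input word missing from the corpus sentence
NOISE_PENALTY = 5            # per extra word in the corpus sentence
ORDER_PENALTY = 15           # per word inversion
MORPH_CONTENT_PENALTY = 2    # per content word matched on a variant
MORPH_FUNCTION_PENALTY = 1   # per function word matched on a variant


def heuristic_penalty(unmatched, extra, inversions,
                      morph_content, morph_function):
    """Sum the per-instance penalties (assumed additive) for one match."""
    return (UNMATCHED_PENALTY * unmatched
            + NOISE_PENALTY * extra
            + ORDER_PENALTY * inversions
            + MORPH_CONTENT_PENALTY * morph_content
            + MORPH_FUNCTION_PENALTY * morph_function)


# One unmatched word, two extra words, no inversions,
# one content-word and one function-word morphological match.
example = heuristic_penalty(1, 2, 0, 1, 1)   # 10 + 10 + 0 + 2 + 1 = 23
exact = heuristic_penalty(0, 0, 0, 0, 0)     # an exact match costs nothing
```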

Table 1 summarizes our current approach to intra-language matching.

Our initial combination of the results of intra-language matching tests was, therefore, done according to the following formula:

overall-chunk-score = 10I + 5G + 15O + 2M + 10P + 20R + V + 5S + 10T

which reflects our intuition that, for the quality of a match, it matters most that the words retain their original order and that all input chunk words are involved. We expect to improve the above function both through statistical analysis with feedback and through further honing of heuristics.

                Table 1






Steve Beale
Tue Oct 1 12:14:38 MDT 1996