The input to this step is a Spanish sentence submitted for translation, together with a corpus of Spanish text. The output is a list of the form
((input-substring-1
   ((corpus-string-1-1 score-1-1)
    (corpus-string-1-2 score-1-2)
    ...
    (corpus-string-1-10 score-1-10)))
 (input-substring-2
   ((corpus-string-2-1 score-2-1)
    (corpus-string-2-2 score-2-2)
    ...
    (corpus-string-2-10 score-2-10)))
 ...
 (input-substring-10
   ((corpus-string-10-1 score-10-1)
    (corpus-string-10-2 score-10-2)
    ...
    (corpus-string-10-10 score-10-10))))
in which no input substring fully includes any other input substring, and each
list of ten corpus strings constitutes the procedure's choice of the ten
``best'' matches for the corresponding input substring among strings from the corpus.
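The shape of this output, and the condition that no input substring fully includes another, can be illustrated with a minimal Python sketch. The type names and the helper functions `contains` and `drop_included` are hypothetical, the score values used with it are arbitrary placeholders, and the sketch shows only one possible way to enforce the inclusion condition, not necessarily the one used in the system.

```python
from typing import List, Tuple

# (corpus string, match score) pair; names are illustrative only
Match = Tuple[str, float]
# one entry per input substring, each with its list of best corpus matches
MatchList = List[Tuple[str, List[Match]]]


def contains(outer: List[str], inner: List[str]) -> bool:
    """True if the word sequence `inner` occurs contiguously inside `outer`."""
    n, m = len(outer), len(inner)
    return any(outer[i:i + m] == inner for i in range(n - m + 1))


def drop_included(result: MatchList) -> MatchList:
    """Keep only input substrings that are not fully included in another one."""
    kept = []
    for chunk, matches in result:
        words = chunk.split()
        included = any(
            other != chunk and contains(other.split(), words)
            for other, _ in result
        )
        if not included:
            kept.append((chunk, matches))
    return kept
```

For example, given entries for "la casa blanca", "casa blanca" and "es grande", the sketch keeps the first and the third, since "casa blanca" is fully included in "la casa blanca".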
Each word in corpus-string-i (with the exception of the members of a special list of ``common'' or ``frequent'' words) is referenced through three indices:
(file-index-i sent-index-j word-index-k)
where word-index-k is the position of the word (actually, any member of the inflectional paradigm for that word) in the corpus sentence, sent-index-j is the position of the sentence in the corpus file, and file-index-i references the corpus file itself.
If a word does not appear in the corpus, an empty list is returned. If a word belongs to the special list of ``frequent'' words, a special symbol is returned instead, because, for efficiency reasons, the corpus was not indexed for words that are too frequent.
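The lookup behaviour described above can be sketched as a small inverted index. This is a simplified illustration under several assumptions: the marker `FREQUENT`, the contents of `FREQUENT_WORDS`, and the function names are hypothetical; indices are zero-based; and the sketch indexes surface forms only, whereas the system described here indexes any member of a word's inflectional paradigm.

```python
from collections import defaultdict

FREQUENT = "*frequent*"                          # hypothetical special symbol
FREQUENT_WORDS = {"de", "la", "el", "que", "y"}  # hypothetical frequent-word list


def build_index(corpus):
    """Map each non-frequent word to its (file, sentence, word) positions.

    `corpus` is a list of files; each file is a list of sentences;
    each sentence is a list of words.
    """
    index = defaultdict(list)
    for f, sentences in enumerate(corpus):
        for s, words in enumerate(sentences):
            for w, word in enumerate(words):
                if word not in FREQUENT_WORDS:  # frequent words are not indexed
                    index[word].append((f, s, w))
    return dict(index)


def lookup(index, word):
    """Return index triples, the special symbol, or an empty list."""
    if word in FREQUENT_WORDS:
        return FREQUENT
    return index.get(word, [])
```

With a two-file toy corpus in which "casa" is the second word of the first sentence of each file, `lookup(index, "casa")` returns `[(0, 0, 1), (1, 0, 1)]`.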
The search for match candidates proceeds as follows. A sentence is broken into segments at punctuation marks or unknown words, and a list of all contiguous substrings (``chunks'') of each segment is produced. For every input chunk we look for sentences in the Spanish corpus that contain a matching substring. We assume a relaxed definition of matching in that we allow not only complete matches but also matches in which:
For each inexact match, we calculate a penalty (see Nirenburg et al., 1993 for an earlier version of this approach).
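The segmentation and chunk enumeration that precede matching can be sketched as follows. The function names, the regular-expression tokenizer, and the `known_words` set are assumptions made for illustration; in particular, the sketch assumes that the punctuation mark or unknown word at which a segment is broken is itself dropped.

```python
import re


def segments(sentence, known_words):
    """Break a sentence into word lists at punctuation marks and unknown words."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    result, current = [], []
    for tok in tokens:
        if re.fullmatch(r"\w+", tok) and tok in known_words:
            current.append(tok)
        else:
            # punctuation or unknown word: close the current segment
            if current:
                result.append(current)
            current = []
    if current:
        result.append(current)
    return result


def chunks(segment):
    """All contiguous, non-empty substrings of a segment, as word lists."""
    n = len(segment)
    return [segment[i:j] for i in range(n) for j in range(i + 1, n + 1)]
```

A segment of n words yields n(n+1)/2 chunks, so a three-word segment produces six candidate chunks to be matched against the corpus.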
The output of the candidate-finding procedure is filtered to retain only matches whose match scores are above a threshold. Match scores are first calculated separately for each of the kinds of incomplete matches listed above; then a cumulative score is produced. In our current system, we retain the 10 best matches. Note that this can be changed to include, for instance, all candidate matches with ratings above a certain threshold. For efficiency reasons, we added a preliminary stage to the selection process that filters out matches we considered clearly unacceptable: we did not allow any gaps of length 5 or more. We expect to improve and modify this early filtering in the future.
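The two-stage selection described above (an early filter on gap length, followed by retention of the best-scoring matches) can be sketched as follows. The candidate representation as a `(corpus_string, score, gap_lengths)` triple and the function name are hypothetical, and the sketch assumes that a higher cumulative score means a better match.

```python
import heapq

MAX_GAP = 5   # gaps of this length or more are rejected outright
N_BEST = 10   # number of best matches retained per input chunk


def select_matches(candidates, n_best=N_BEST, max_gap=MAX_GAP):
    """Filter out candidates with overlong gaps, then keep the n best by score.

    Each candidate is a (corpus_string, score, gap_lengths) triple.
    """
    survivors = [c for c in candidates if all(g < max_gap for g in c[2])]
    return heapq.nlargest(n_best, survivors, key=lambda c: c[1])
```

Replacing the `heapq.nlargest` call with a comparison of each score against a fixed value would give the alternative mentioned above, in which all candidates rated above a certain threshold are kept.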
The following heuristics guide the scoring process for the matches:
Table 1 summarizes our current approach to intra-language matching.
Our initial combination of the results of the intra-language matching tests was therefore done according to the following formula:
overall-chunk-score = 10I + 5G + 15O + 2M + 10P + 20R + V + 5S + 10T
which reflects our intuition that it is more important for the quality of a match that the words retain their original order and that all input chunk words are involved. We expect to improve the above function both through statistical analysis with feedback and through further honing of the heuristics.
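The formula above is a plain weighted sum and can be written out as a short Python sketch. The weights are taken directly from the formula; the single-letter keys stand for the individual test results summarized in Table 1, whose definitions are not restated here, and the function name and dictionary representation are assumptions made for illustration.

```python
# Weights copied from the overall-chunk-score formula; V has an implicit weight of 1.
WEIGHTS = {"I": 10, "G": 5, "O": 15, "M": 2, "P": 10, "R": 20, "V": 1, "S": 5, "T": 10}


def overall_chunk_score(components):
    """Combine the per-test results into a single score as a weighted sum.

    `components` maps each letter in WEIGHTS to a numeric test result.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```

Keeping the weights in one table makes it straightforward to re-estimate them later, for instance through the statistical analysis with feedback mentioned above.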

Table 1