The input to this stage of the KBMT engine is a set of quintuples {IC, CC, SLCS, TLCS, CFS}, where
The following table illustrates one such quintuple.

The goal of inter-language matching is, for each input quintuple, to extract from TLCS that part which is the translation of CC. For the example above, the answer should be:
represent every country
We use the word correspondence method described below for two reasons: it is simple and it gives good results, given the restrictions of the EBMT methodology. EBMT is meant to be a "quick-and-dirty" MT process, one step above a word-for-word method. We are resigned to the fact that EBMT simply will not work in some situations, i.e. non-contiguous translations such as:
source = A B C D target = A' B' x y z C' D'
In fact, EBMT only works well when there is a definite contiguous alignment
from the source phrase to the target phrase. Our method can be used to find
such alignments in a fairly straighforward way. Our results are dependent only
on the following sources of information:
The word correspondence method tends to break down in the following circumstances:
A statistical-based approach could theoretically be used to address the above problems. It is unclear, however, if such an approach could do as well as the word correspondence approach in the general case. More importantly, statistical approaches are much more complicated and processing-time intensive, which defeats the whole purpose of the EBMT approach. Our decision is to use the simpler method described below which yields acceptable results over a wide range of inputs. This approach will be combined with a glossary-based approach which can identify idioms, thus minimizing the effects of problem 1, and a full knowledge-based approach, which can minimize the effects of 2 and 3.
Inter-language matching is performed in several stages. The major components of this process are:
First, all possible English translations of each word in SLCS were obtained
from the Collins Spanish-English MRD and stored in a list
(``eng-trans-of-spanish-words'').
The result of this operation on the sample input quintuple above is as follows:
eng-trans-of-spanish-words = NIL nation country unite unity NIL represent every all nation country
All possible root forms for each of the words in TLCS are obtained using the PC-KIMMO program, version 1.0 and stored in a list (``eng-root-forms''). We only need the root forms because the Collins dictionary contains root forms of translations of the Spanish words. ``Common'' words are not analyzed.
eng-root-words = NIL
unite united
nation
represent
every
country
The function used for finding cross-linguistic word correspondences is:
find-correspondence (SLCS eng-trans-of-spanish-words eng-root-words TLCS)
In our example, the arguments are as follows:
The results are stored in a list of pairs (spa-eng-word-correspondences).
spa-word-correspondences = naciones nations naciones country unidas united representado represent todas every naciones nations naciones country
For easy access, results of finding word correpondences are stored in a two (temporary) hash table, indexed by source and target language words, respectively.
(gethash spanish-word *spanish-corr-hash*)
-> list of english words
(gethash english-word *eng-corr-hash*)
-> list of spanish words
Thus,
(gethash "unidas" *spanish-corr-hash*)
-> ("united")
(gethash "every" *english-corr-hash*)
-> ("todas")
Also,
(gethash "naciones" *spanish-corr-hash*) ->
("nations" "country" "nations" "country")
During sub-sentential alignment we need to know if there is more than one possible alignment for a given word; this is easily calculated by looking at the length of the list returned from the hash table. It is inconsequential that a target language word was repeated twice in the last example above.
At this point the procedure looks for the longest possible substring of the TLCS that could translate the CC.
The rationale behind this is that we can first find an upper bound to the translation using known word correspondences, and then use scoring matrices to rate substrings of this outer bound until we find the best. An alternative approach would be to find the lower bound (the smallest possible translation - see below), and then use heuristics or scoring matrices to expand this boundary outward one word at a time. The problem with the latter approach is that a "local maximum" may be hit, stopping the outward expansion, when, in fact, if additional words were added a "global maximum" would be achieved. Such a situation could occur when an unknown word occurs between the smallest possible translation and words that have ambiguous word correspondences (and thus were not added to smallest possible translation) but would increase the total score if added.
longest-possible-trans =
get-longest-possible-trans
(CC
TLCS
spa-corr-hash
eng-corr-hash)
But first, we find the ``smallest-possible-trans.'' This is the smallest SL substring which contains all the words that unambiguously correspond to words in CC. We need to perform this step to give us a ``hook'' which we can then expand into the largest possible translation. If we cannot find any words that unambiguously correspond to words in the CC, then we can't even be sure where to start looking. Our approach is to give up at this point and try another input chunk, rather than ``blindly'' trying to score combinations of words that may or may not correspond to words in the CC. In practice, this usually only occurs because no correspondences were found for any of the input CC words (not because they were all ambiguous). In such a case, the only way to find the correct alignment would be by the process of elimination (i.e. figure out what translates the rest of the input sentence, then whatever is left must translate the CC). This was deemed to be too costly for an EBMT application.
rightmost-pos = 0
leftmost-pos = 10000
for each word in CC
begin
eng-trans =
gethash(word spa-corr-hash)
unless length(eng-trans) > 1
or
length(eng-trans) = 0
begin
position-eng-word =
find-pos(eng-trans TLCS)
if position-eng-word <
leftmost-pos
then leftmost-pos =
position-of-english-word
if position-eng-word >
rightmost-pos
then rightmost-pos =
position-of-english-word
end
end
if leftmost-pos = 10000 then
return NIL ;; failure
smallest-possible-trans =
get-substring (TLCS
leftmost-pos
rightmost-pos)
At this stage, we add to the smallest possible substring all the TLCS words to
the left and right for which we have not determined which SLCS word they
translate, including the cases when a word occurs more than once in a sentence
--- once in a given CC and once outside it. We stop adding on words when we
come to a TLCS word that we know translates a SLCS word which is not in CC.
for i := leftmost-pos downto 1
begin
english-word = get-element(TLCS i)
spa-word-corrs =
gethash(english-word
eng-corr-hash)
if spa-word-corrs = nil
or
string-intersection(
spa-word-corrs CC)
then
leftmost-pos = i
end
for i := rightmost-pos to length(TLCS)
begin
english-word = get-element(TLCS i)
spa-word-corrs =
gethash(english-word
eng-corr-hash)
if spa-word-corrs = nil
or
string-intersection(
spa-word-corrs CC)
then
rightmost-pos = i
end
return get-substring(TLCS
leftmost-pos
rightmost-pos)
We assume that the translation of CC is a substring of longest-possible-trans.
At this point, we examine all substrings of longest-possible-trans, evaluate
each according to a special preference metric and pick the best one(s).
One may think that only substrings that are greater-than-or-equal-to the smallest possible translation need be considered. In practice, though, the best alignment is sometimes smaller than the smallest possible translation. This occurs when the target translation is non-contiguous:
CC = A B C smallest-possible-trans = A' x y z B' C'
In a case like this, we believe it is preferable to return:
TRANS = B' C'
than the whole smallest-possible-trans. Further research is needed here, however, to try and identify such cases separately, instead of forcing the needless evaluation of substrings smaller than smallest-possible-trans for every chunk.
best-score = 10000 ;a bad match score
best-translations = NIL
for each substring in
longest-possible-trans
begin
score = score-match(substring
CC
spa-corr-hash
eng-corr-hash)
+ CFS
if score < best-score then
begin
best-score = score
best-translations = substring
end
else if score = best-score then
begin
best-translations =
append(best-translations
substring)
end
end
return (best-score best-translations)
The rating of how well a TLCS substring translates a CC is based on minimizing the estimated number of keystrokes needed to ``reduce'' each candidate translation to an ``optimum'' translation (presumably, determined by humans). Thus, the lower the score the better the translation.
The score-match function uses the results of seven sperate heuristic tests. The seven values are then combined in a function whose coefficients (weights) are determined with the help of statistical training. The length of CCs also plays a role in determining the weights.
score =
get-weight(0,len) ;; a constant
+ test1-score * get-weight(1,len)
+ test2-score * get-weight(2,len)
+ test3-score * get-weight(3,len)
+ test4-score * get-weight(4,len)
+ test5-score * get-weight(5,len)
+ test6-score * get-weight(6,len)
+ test7-score * get-weight(7,len)
+ test8-score * get-weight(8,len)
The tests used in scoring cross-linguistic correspondences are as follows:
SLCS = (a b c d e f g h i) CC = (a b c d) TLCS = (z a' b' c' d' m n o p) TLCS substring = (z a' b' c' d')
Even though z does not correspond to anything in SLCS, it quite probably may be a part of the translation of CC.
In the future, if syntactic processing is practical, we will use better defined syntactic features. For example, knowing the boundaries of constituents in all the strings could help cross-linguistic alignment.
The results of the EBMT engine are then submitted to the same chart mechanism used for integrating results from a variety of MT engines. The chart, thus, can contain more than one candidate from each of the engines in the system.