next up previous
Next: A Detailed Example Up: A Full-Text Experiment in Previous: Intra-Language Matching

Inter-Language Matching

The input to this stage of the KBMT engine is a set of quintuples {IC, CC, SLCS, TLCS, CFS}, where

IC
stands for Input Chunk,

SLCS
stands for Source Language Corpus Sentence,

CC
stands for Corpus Chunk (the phrase in SLCS which matches IC),

TLCS
stands for Target Language Corpus Sentence and

CFS
is the score of the match, produced by intra-language candidate. finder

The following table illustrates one such quintuple.

The goal of inter-language matching is, for each input quintuple, to extract from TLCS that part which is the translation of CC. For the example above, the answer should be:

 represent every country

We use the word correspondence method described below for two reasons: it is simple and it gives good results, given the restrictions of the EBMT methodology. EBMT is meant to be a "quick-and-dirty" MT process, one step above a word-for-word method. We are resigned to the fact that EBMT simply will not work in some situations, i.e. non-contiguous translations such as:

 
  source = A B C D
  target = A' B' x y z C' D'

In fact, EBMT only works well when there is a definite contiguous alignment from the source phrase to the target phrase. Our method can be used to find such alignments in a fairly straighforward way. Our results are dependent only on the following sources of information:gif

  1. machine-readable dictionary (MRD) from the source to the target language

  2. morphological analyzer for source and target language

The word correspondence method tends to break down in the following circumstances:

  1. Idioms are used in the source or target phrase. In such a case, there will be no literal word-for-word correspondences.

  2. The same content word is used multiple times in a sentence. This creates ambiguous word correspondences, which, if frequent enough and exist almost to the exclusion of non-ambiguous correspondences, can confuse the aligner. In our experience to-date this case has been rare.

  3. Incorrect word senses are connected. For instance, if the Spanish word ``correr,'' ``to run (i.e., on foot)'' is used in the source phrase and the target sentence has the word ``run'' used in a different sense (i.e., ``a run in a stocking''), then an incorrect word correspondence will be marked, which will adversely affect the output.

A statistical-based approach could theoretically be used to address the above problems. It is unclear, however, if such an approach could do as well as the word correspondence approach in the general case. More importantly, statistical approaches are much more complicated and processing-time intensive, which defeats the whole purpose of the EBMT approach. Our decision is to use the simpler method described below which yields acceptable results over a wide range of inputs. This approach will be combined with a glossary-based approach which can identify idioms, thus minimizing the effects of problem 1, and a full knowledge-based approach, which can minimize the effects of 2 and 3.

Inter-language matching is performed in several stages. The major components of this process are:

First, all possible English translations of each word in SLCS were obtained from the Collins Spanish-English MRD and stored in a list (``eng-trans-of-spanish-words'').gif

The result of this operation on the sample input quintuple above is as follows:

 
eng-trans-of-spanish-words = 
  NIL 
  nation country 
  unite unity
  NIL 
  represent
  every all
  nation country

All possible root forms for each of the words in TLCS are obtained using the PC-KIMMO program, version 1.0 and stored in a list (``eng-root-forms''). We only need the root forms because the Collins dictionary contains root forms of translations of the Spanish words. ``Common'' words are not analyzed.

 
eng-root-words = NIL
                 unite united
                 nation
                 represent 
                 every
                 country

The function used for finding cross-linguistic word correspondences is:

 
find-correspondence
  (SLCS                       
   eng-trans-of-spanish-words 
   eng-root-words             
   TLCS)

In our example, the arguments are as follows:

SLCS
las naciones unidas han representado todas naciones

eng-trans-of-spanish-words
(nation country) (unite unity) NIL represent (every all) (nation country)

eng-root-words
NIL (unite united) nation represent every country

TLCS
the united nations represent every country

The results are stored in a list of pairs (spa-eng-word-correspondences).

 
spa-word-correspondences = 

naciones	 nations
naciones	 country 
unidas		 united
representado	 represent
todas		 every 
naciones	 nations
naciones	 country

For easy access, results of finding word correpondences are stored in a two (temporary) hash table, indexed by source and target language words, respectively.

 
(gethash spanish-word *spanish-corr-hash*) 
    -> list of english words
(gethash english-word *eng-corr-hash*) 
    -> list of spanish words

Thus,

 
(gethash "unidas" *spanish-corr-hash*) 
    -> ("united")
(gethash "every" *english-corr-hash*) 
    -> ("todas")

Also,

 
(gethash "naciones" *spanish-corr-hash*) ->
  ("nations" "country" "nations" "country")

During sub-sentential alignment we need to know if there is more than one possible alignment for a given word; this is easily calculated by looking at the length of the list returned from the hash table. It is inconsequential that a target language word was repeated twice in the last example above.

At this point the procedure looks for the longest possible substring of the TLCS that could translate the CC.

The rationale behind this is that we can first find an upper bound to the translation using known word correspondences, and then use scoring matrices to rate substrings of this outer bound until we find the best. An alternative approach would be to find the lower bound (the smallest possible translation - see below), and then use heuristics or scoring matrices to expand this boundary outward one word at a time. The problem with the latter approach is that a "local maximum" may be hit, stopping the outward expansion, when, in fact, if additional words were added a "global maximum" would be achieved. Such a situation could occur when an unknown word occurs between the smallest possible translation and words that have ambiguous word correspondences (and thus were not added to smallest possible translation) but would increase the total score if added.

 
longest-possible-trans = 
     get-longest-possible-trans 
        (CC 
         TLCS
         spa-corr-hash
         eng-corr-hash)

But first, we find the ``smallest-possible-trans.'' This is the smallest SL substring which contains all the words that unambiguously correspond to words in CC. We need to perform this step to give us a ``hook'' which we can then expand into the largest possible translation. If we cannot find any words that unambiguously correspond to words in the CC, then we can't even be sure where to start looking. Our approach is to give up at this point and try another input chunk, rather than ``blindly'' trying to score combinations of words that may or may not correspond to words in the CC. In practice, this usually only occurs because no correspondences were found for any of the input CC words (not because they were all ambiguous). In such a case, the only way to find the correct alignment would be by the process of elimination (i.e. figure out what translates the rest of the input sentence, then whatever is left must translate the CC). This was deemed to be too costly for an EBMT application.

    rightmost-pos = 0
    leftmost-pos = 10000

    for each word in CC
      begin
      eng-trans = 
        gethash(word spa-corr-hash)
      unless length(eng-trans) > 1 
             or 
             length(eng-trans) = 0
        begin
        position-eng-word = 
            find-pos(eng-trans TLCS)
        if position-eng-word < 
            leftmost-pos
          then leftmost-pos = 
               position-of-english-word
        if position-eng-word > 
            rightmost-pos
          then rightmost-pos = 
               position-of-english-word
        end
      end
    if leftmost-pos = 10000 then 
      return NIL  ;; failure

    smallest-possible-trans = 
      get-substring (TLCS 
                     leftmost-pos
                     rightmost-pos)
At this stage, we add to the smallest possible substring all the TLCS words to the left and right for which we have not determined which SLCS word they translate, including the cases when a word occurs more than once in a sentence --- once in a given CC and once outside it. We stop adding on words when we come to a TLCS word that we know translates a SLCS word which is not in CC.

 
    for i := leftmost-pos downto 1
      begin
       english-word = get-element(TLCS i)
       spa-word-corrs = 
         gethash(english-word 
                 eng-corr-hash) 
       if spa-word-corrs = nil 
          or
          string-intersection(
            spa-word-corrs CC)
       then
         leftmost-pos = i
      end
    for i := rightmost-pos to length(TLCS) 
      begin
       english-word = get-element(TLCS i)
       spa-word-corrs = 
         gethash(english-word 
                 eng-corr-hash) 
       if spa-word-corrs = nil 
          or
          string-intersection(
            spa-word-corrs CC)
       then
         rightmost-pos = i
      end

    return get-substring(TLCS 
                         leftmost-pos
                         rightmost-pos)
We assume that the translation of CC is a substring of longest-possible-trans. At this point, we examine all substrings of longest-possible-trans, evaluate each according to a special preference metric and pick the best one(s).

One may think that only substrings that are greater-than-or-equal-to the smallest possible translation need be considered. In practice, though, the best alignment is sometimes smaller than the smallest possible translation. This occurs when the target translation is non-contiguous:

 
  CC = A B C
  smallest-possible-trans = A' x y z B' C'

In a case like this, we believe it is preferable to return:

 
  TRANS = B' C'

than the whole smallest-possible-trans. Further research is needed here, however, to try and identify such cases separately, instead of forcing the needless evaluation of substrings smaller than smallest-possible-trans for every chunk.

 
   best-score = 10000     ;a bad match score
   best-translations = NIL
   for each substring in 
            longest-possible-trans
     begin
     score = score-match(substring 
                         CC
                         spa-corr-hash
                         eng-corr-hash) 
             + CFS
     if score < best-score then
       begin
       best-score = score
       best-translations = substring
       end
     else if score = best-score then
       begin
       best-translations = 
             append(best-translations
                    substring)
       end
     end
   return (best-score best-translations)

The rating of how well a TLCS substring translates a CC is based on minimizing the estimated number of keystrokes needed to ``reduce'' each candidate translation to an ``optimum'' translation (presumably, determined by humans). Thus, the lower the score the better the translation.

The score-match function uses the results of seven sperate heuristic tests. The seven values are then combined in a function whose coefficients (weights) are determined with the help of statistical training. The length of CCs also plays a role in determining the weights.

 
 score =
     get-weight(0,len)     ;; a constant
     + test1-score * get-weight(1,len)
     + test2-score * get-weight(2,len)
     + test3-score * get-weight(3,len)
     + test4-score * get-weight(4,len)
     + test5-score * get-weight(5,len)
     + test6-score * get-weight(6,len)
     + test7-score * get-weight(7,len)
     + test8-score * get-weight(8,len)

The tests used in scoring cross-linguistic correspondences are as follows:

Test 1
returns the number of words in CC that correspond to a word in the TLCS substring. The more correspondences, the more confident we can be of the translation.

Test 2
returns the number of ``content'' words in CC that do not correspond to any words in the TLCS substring. Each such case lowers our confidence in the translation. We need to ignore common words here because they are not recorded in the correlation hash tables.

Test 3
returns the number of ``content'' words in the TLCS substring that do not correspond to any words in CC.

Test 4
returns the smaller of test2-score or test3-score. The motivation is that if there are unmatched CC words (detected in test2) and unmatched TLCS substring words (detected in test3) then perhaps some of these match each other.

Test 5
attempts to take into account the ``commmon'' words. (The other tests were not able to do this since the word correlation data excluded common words.) It would be nice if we could return the number of common words that correlated exactly (i.e., el with the). However, as a simplification, we just assume each common word in CC will match one common word in the TLCS substring

Test 6
returns the number of ``content'' words in CC that do not correspond to any words in the TLCS substring, but do correspond to words in TLCS. This indicates that some sort of mis-alignment has occurred. Each such occurrence lowers our confidence in the translation.

Test 7
is a harbinger of using syntactic information to help the alignment process. So far, all we do is check whether the TLCS substring is either at the beginning or the end of the sentence. This type of test helps us to gather up ``stray'' words at the end that do not correspod to any SLCS element but should probably be included in the translation nonetheless. For example, consider the following:

 
SLCS             = (a b c d e f g h i)
CC               = (a b c d)
TLCS             = (z a' b' c' d' m n o p)
TLCS substring   = (z a' b' c' d')

Even though z does not correspond to anything in SLCS, it quite probably may be a part of the translation of CC.

In the future, if syntactic processing is practical, we will use better defined syntactic features. For example, knowing the boundaries of constituents in all the strings could help cross-linguistic alignment.

Test 8
returns the number of common words that do not match. Again, we simplified this by assuming each common words in CC will match one common word in the TLCS substring. The number of common words left over is then reported by this test. This test was added because, without it, common words are added on to the end indiscriminately. This test will tend to cut off this process when the number of common words in the translation grows larger than the number of common words in the CC.

The results of the EBMT engine are then submitted to the same chart mechanism used for integrating results from a variety of MT engines. The chart, thus, can contain more than one candidate from each of the engines in the system.



next up previous
Next: A Detailed Example Up: A Full-Text Experiment in Previous: Intra-Language Matching



Steve Beale
Tue Oct 1 12:14:38 MDT 1996