
Results of Initial Analyses

A line-by-line study of the first dialog was carried out, including comments on translation difficulties and on the forms of discourse markers and their functions. At about this time, information on the Janus project was received. It was noted that the protocol for the Janus dialogs, while fundamentally similar to Artwork's, did vary in a number of ways. A study of the Janus dialogs showed that the two data sets differed in structure: there was more variation in the sub-dialogs in the Artwork data. It was this set of observations that led to the study of the impact of protocol on various types of dialog, discussed below.

The first analyses attempted to establish higher-level discourse structures; this was done in order to reduce the number of discourse variants that might occur in future discourses to be analyzed. Additionally, ``fine-grained'' analyses of the text were carried out. These included the identification of the discourse markers employed by the two consultants, and of the functions of these markers. The aim of this work was to establish working categories which would then be used to mark up data. The following is a summary of the categories established.

  1. DISFLUENCIES. We suggest two broad categories of disfluencies. One is ``grammatical disfluencies''; by this we mean those constructions which are well formed but which serve some discourse-internal function such as filling a gap in a dialog. The other is ``non-grammatical disfluencies''; these are the ``uhs'', ``ahhs'' and ``umms'' which fill gaps. We also include in this group sentence fragments, slips of the tongue, and other ill-formed utterances.
    1. Grammatical
      1. Common Knowledge Disfluency. ``Ella está trabajando en la escuela y yo en la otra.'' ``She's working at school and I'm at the other.'' The definite article in Spanish (``la'' escuela, ``la'' otra) presupposes knowledge of a particular school which is not provided in the text; it relies on the shared world knowledge of the speakers.
      2. Logical Disfluency. ``A menos de que quieres ir a comer con mi esposa, porque también de doce a una voy a comer con ella.'' ``Unless you want to go eat with my wife, because I'm going to have lunch with her too from twelve to one.'' A strange construction in Spanish which, no matter what the translation, will sound funny in English. The context of the discourse ameliorates the ``strangeness'' of such an utterance, but this is something that would not occur in a written text.
      3. Filler Disfluency. ``Necesitamos, este, discutir, eh, uh, algo del proyecto.'' ``We need, um, to talk about, eh, uh, something about the project.'' In this context, ``este'' is a marker which signals reservation on the speaker's part about a Territorial Breach illocution, and cannot be construed as a determiner.
      4. Agreement Disfluency. Lack of pronoun agreement, e.g. ``'los' doce'' (twelve of something) instead of ``'las' doce'' (time reference). Both are grammatically possible, which could lead to ambiguity.
      5. Varietal Disfluency. ``Ah no te dijiste que después de las cinco, ¿verdad?'' ``Ah you said after five, right?'' Some regional variant not common in other Spanish dialects. In this case, ``no te dijiste'' is literally ``you didn't tell yourself''. The pronoun ``te'' may be a disfluency, while the ``no'' is probably some type of inchoative particular to this dialect.
      6. Politeness Disfluency. ``¿A qué hora le hablará a comer?'' ``When will you call her to go eat?'' A change in the level of formality, in this case from ``tú'' to ``usted''. 2S is marked for person, while 3S is not, resulting in ambiguity.
    2. Non-grammatical.
      1. Syntactic Disfluency. ``...y a mi esposa tiene que ir a...'' ``...and my wife has to go to...'' Some problem with syntax probably created by changing a thought in mid-utterance, a slip of the tongue, etc. In this case, we find the dative marker ``a'' introducing a nominative phrase.
      2. Fragments. Incomplete utterances, caused by interruptions, changing thoughts in mid-utterance, etc.
  2. DIALECTAL VARIATION. Arises from the fundamental differences between written text and speech. Try as speakers might to reproduce some written standard in their speech, some variation will inevitably occur because the spoken word cannot be ``corrected''. Examples are calques such as ``tener buen tiempo'' from the English ``to have a good time'' or ``escuela de graduados'' from ``graduate school''. Code-switching may occur, even in the speech of monolinguals, e.g. if they are trying to sound ``sophisticated''. There is also the problem of variation of semantic content between dialects; ``correr'' can mean ``to run'' or ``to fire someone (from a job)''.
  3. ANAPHORA. In Spanish, there is a variety of grammatical elements, such as demonstrative, relative and indefinite pronouns, among others, whose previous referent must be identified in order to produce an adequate translation. For example, the genitive ``su'' (problema) can mean ``his'', ``her'', ``its'', ``their'', ``your (singular)'' or ``your (plural)'' (problem). This is found both in written texts and in speech.

  4. ELLIPSIS. The deletion of some element or elements, in either Spanish or English. In some cases, some forms or syntactic structures may be commonly ellipted in Spanish and not in English; in this case, the ellipted material needs to be generated. An example is subject pronouns, especially 3S and 3P. Spanish marks person in the verb morphology, and subject pronouns are commonly deleted, both in written text and in speech. 3S and 3P are not overtly marked, and thus can lead to ambiguity. English does not permit this deletion (at least in written texts), and so the subject pronouns must be generated. Conversely, some elements may be commonly ellipted in English and not in Spanish, creating a need to ``get rid of'' some of the input language.

  5. VERBAL INFLECTION. Spanish relies more heavily on synthetic verbal inflection than English does. For example, Tense and Aspect tend to be inflected in Spanish and not in English. The subjunctive mood corresponds to a complex use of modal auxiliaries in English. And there are some subtle usages, such as the simple present indicative in Spanish, which sometimes demands the use of a modal in English. These differences occur in both written texts and speech, with the possibility that speech will be more difficult to translate because the process of grammaticization gives rise to new uses for old forms.

  6. SYNTAX. The differences here are simply too many to be listed. Suffice it to say that there is a much greater dependence on word order in English than in Spanish. These differences occur in both written texts and in speech.

  7. DISCOURSE STRATEGIES. Discourse markers and their internal and external illocutionary functions were noted. These are what speakers use to initiate, maintain and terminate a dialog, as well as to introduce and discuss shared and private knowledge. The classification of the different types of markers was based on Edmondson's (1981) and House's (1984) categories. Extra categories were added as needed. This work was motivated by the possibility that the task of disambiguating certain forms and structures will be facilitated by identifying what type of discourse functions they fulfill. For example, opening gambits need not be translated on the basis of the semantic content of the lexemes, but rather on the basis of their formulaic nature. Date and time parameters may be identified by the discourse markers that introduce them.
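
Since the categories above were intended for marking up data, it may help to see how such a scheme could be encoded. The following is only a minimal sketch in Python, not the format used in the project: the names ``Category'', ``Disfluency'' and ``Annotation'' and all field names are hypothetical, and only the category labels and the grammatical/non-grammatical split are taken from the summary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    """Top-level working categories, numbered as in the summary above."""
    DISFLUENCY = 1
    DIALECTAL_VARIATION = 2
    ANAPHORA = 3
    ELLIPSIS = 4
    VERBAL_INFLECTION = 5
    SYNTAX = 6
    DISCOURSE_STRATEGY = 7


class Disfluency(Enum):
    """Disfluency subtypes; each value is (is_grammatical, label)."""
    COMMON_KNOWLEDGE = (True, "Common Knowledge")
    LOGICAL = (True, "Logical")
    FILLER = (True, "Filler")
    AGREEMENT = (True, "Agreement")
    VARIETAL = (True, "Varietal")
    POLITENESS = (True, "Politeness")
    SYNTACTIC = (False, "Syntactic")
    FRAGMENT = (False, "Fragment")

    @property
    def grammatical(self) -> bool:
        return self.value[0]


@dataclass(frozen=True)
class Annotation:
    """One marked-up utterance (hypothetical record layout)."""
    utterance: str
    category: Category
    subtype: Optional[Disfluency] = None
    note: str = ""

    def __post_init__(self) -> None:
        # Only category 1 has subtypes in the summary above.
        if self.subtype is not None and self.category is not Category.DISFLUENCY:
            raise ValueError("only DISFLUENCY annotations take a subtype")


# The Filler Disfluency example from the summary, encoded as a record.
example = Annotation(
    utterance="Necesitamos, este, discutir, eh, uh, algo del proyecto.",
    category=Category.DISFLUENCY,
    subtype=Disfluency.FILLER,
    note="'este' read as a hesitation marker, not as a demonstrative",
)

# Names of the subtypes listed under "Grammatical".
grammatical = [d.name for d in Disfluency if d.grammatical]
```

Keeping the grammatical/non-grammatical distinction as a property of the subtype, rather than as a separate field on each record, is one possible design choice; it prevents a record from contradicting the taxonomy.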





Computing Research Laboratory
Wed Jun 7 19:21:15 MDT 1995