
Disfluencies

All sixteen Spanish dialogs were marked up to indicate the boundaries of several types of disfluencies (pauses, filler words, repetitions, repairs and grammatical ill-formedness). Once operational definitions for each of the subcategories had been established, the markup took between 10 minutes (for, say, pauses) and half an hour (for, say, repairs) per text. A subcategory for non sequiturs was also recognized, but because of the subtlety of analysis required to identify them and the limits on the research group's time, they were neither counted nor marked. The sixteen English translations of these dialogs were counted and marked up for the same categories, except for grammatical ill-formedness, which was postponed due to lack of time.

In regard to the 20 Spanish texts and their translations, only a couple of cases of grammatical ill-formedness were found, and there were no examples of any of the other subcategories of disfluencies. Since these few cases, which were almost certainly due to printer's errors, were not marked, none of the texts or their translations can be taken as marked up.

The operational definition of disfluency in general that we have settled on is ``any expression uttered or decided pause which breaks the on-going flow of the conversational exchange at that point and yet does not appear to be intended to do so''. This is fairly vague, and in practice, the analysis was based on the following more specific criteria associated with each of the several subcategories.

The operational definition of pause we have adopted is ``a sufficiently long silence to be noted by the transcriber''. In the transcriptions, such pauses are marked by ``...''.

Aquí, ya estoy corrigiendo exámenes. Ya... casi por terminar.
Here, I'm correcting tests. Almost... ready to finish.

A subclassification was developed on the basis of whether the pause was utterance-internal, such as the one above, or utterance-final (at the end of a turn or between utterances in an utterance sequence), as below.

Claro que sí. ¿Qué tipo de organización, estatal...
Of course. What type of organization, state...

Utterance-final pauses marked in the transcription by ``.'', and brief but conventional pauses (for example, for taking breath or setting off emphasized constituents) marked in the transcription by ``,'', were not counted as disfluencies of this type.
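To make the transcription convention concrete, the following minimal sketch shows how ``...'' markers could be sorted into the two subclasses mechanically. It is an illustration only, not the procedure used in the study. The function name `classify_pauses` is hypothetical, and the sketch assumes that the input has already been segmented so that nothing follows an utterance-final pause, and that pauses are written as three ASCII periods.

```python
import re

def classify_pauses(utterance: str) -> dict:
    """Count '...' pause markers as utterance-internal or utterance-final.

    Assumption: the string ends where the utterance (or turn) ends, so a
    marker followed only by whitespace is treated as utterance-final.
    """
    text = utterance.strip()
    counts = {"internal": 0, "final": 0}
    for match in re.finditer(r"\.\.\.", text):
        if text[match.end():].strip() == "":
            counts["final"] += 1      # nothing follows the pause
        else:
            counts["internal"] += 1   # the speaker continues after the pause
    return counts

print(classify_pauses("Aquí, ya estoy corrigiendo exámenes. Ya... casi por terminar."))
print(classify_pauses("Claro que sí. ¿Qué tipo de organización, estatal..."))
```

Single ``.'' and ``,'' marks are deliberately ignored, in line with the decision not to count conventional pauses.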

The operational definition of filler word we have adopted is ``an expression uttered without semantic content in the context of the discourse but rather in order to keep the conversation flowing while the speaker thinks of what to say next''. Examples of such expressions include ``pues'', ``pos'', ``bueno'', ``pero'', and so on.

Necesitamos, este, discutir eh, uh, algo del proyecto.
We need to, ahh, talk about, ahh, uh, something about the project.

No special subcategories were identified.
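A lexical lookup can only propose candidates for this category, since the definition depends on the expression lacking semantic content in context (``pero'' and ``este'', for instance, also have ordinary contentful uses). The sketch below, which is not part of the original analysis, flags such candidates for later inspection. The word set is assembled from the examples in this section and is an assumption, not the project's actual inventory; the name `filler_candidates` is hypothetical.

```python
import re

# Illustrative candidate set taken from the examples in the text;
# the real inventory used in the study is not given.
FILLER_CANDIDATES = {"pues", "pos", "bueno", "pero", "este", "eh", "uh"}

def filler_candidates(utterance: str) -> list:
    """Return the tokens of an utterance that match a candidate filler.

    Matching is purely lexical, so every hit still needs a human
    judgment about whether the word carries content in context.
    """
    tokens = re.findall(r"\w+", utterance.lower())
    return [token for token in tokens if token in FILLER_CANDIDATES]

print(filler_candidates("Necesitamos, este, discutir eh, uh, algo del proyecto."))
```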

The operational definition of repetition is ``any sequence of two expressions or subexpressions in which the second is a verbatim repetition of the first''. The repeated units may be morphemes, words, phrases or clauses. The repetition of sounds (as in the case of stuttering) would have been counted as well; however, since we were working from orthographic transcriptions, none were found in the data.

Para que veas, para que veas ya que tú no
So that you'll see, so that you'll see since you don't

te acuerdas de mí, yo vengo a verte.
remember me, I came to see you.

No, no, no venta de... cosas que ya no usan.
No, no, not a sale... of things that they don't use anymore.

There were no subclassifications of repetitions beyond this.
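Because the definition requires verbatim repetition, word-level and phrase-level cases lend themselves to a simple mechanical check. The following sketch is an illustration under stated assumptions rather than the method used in the study: it looks for a sequence of words immediately followed by an identical sequence, ignoring case and punctuation, and prefers the longest match. The name `find_repetitions` and the `max_len` limit are hypothetical, morpheme-level repetition is not handled, and a run of three or more identical items is reported only once.

```python
import re

def find_repetitions(utterance: str, max_len: int = 5) -> list:
    """Return (token_index, phrase) pairs for immediately repeated word sequences."""
    tokens = re.findall(r"\w+", utterance.lower())
    covered = [False] * len(tokens)   # tokens already assigned to a longer repeat
    found = []
    # Try longer sequences first so a repeated phrase is not split into words.
    for n in range(min(max_len, len(tokens) // 2), 0, -1):
        i = 0
        while i + 2 * n <= len(tokens):
            first = tokens[i:i + n]
            second = tokens[i + n:i + 2 * n]
            if first == second and not any(covered[i:i + 2 * n]):
                found.append((i, " ".join(first)))
                for k in range(i, i + 2 * n):
                    covered[k] = True
                i += 2 * n
            else:
                i += 1
    return sorted(found)

print(find_repetitions("Para que veas, para que veas ya que tú no"))
print(find_repetitions("No, no, no venta de... cosas que ya no usan."))
```

In the second example only the first two instances of ``no'' are paired; whether the third belongs to the repetition or to the following phrase is a judgment the sketch does not attempt.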

The operational definition of repair we are following is ``any sequence of two utterances in which the second either restates the first in a different way or simply starts a new sequence of utterances without completing the first.'' There are two basic subcategories: those that are uttered in order to correct spoken errors, such as:

porque yo mira a la un de la una a las seis yo trabajo...
because I, look, at one from one to six I work...

and those that are uttered in order to establish an alternative line of thought (to recover from false starts).

No, éste es, ah, sí, pos mañana.
No, that is, uh, yes, tomorrow.

No further subclassification has been considered.

The operational definition of ungrammaticality we have adopted is ``any expression which deviates from standard norms of linguistic form''. For instance, in:

ah no te dijiste después de las cinco ¿verdad?
oh no, you said (to you) after five, right?

the indirect object pronoun ``te'' (to you) has been gratuitously inserted. No further subclassification has been considered.

The operational definition of non sequitur we are using is ``any sequence of utterances whose propositional content or order of presentation support a contradiction and therefore form a difficult context for assigning an interpretation''. For example, in the following contribution:

Porque mira, yo puedo (juntarme contigo) antes de las ocho
Because look, I can (meet with you) before eight ...

... entre diez y once ... y después de las cinco.
...between ten and eleven ... and after five.

a menos de que quieres ir a comer con mi esposa porque
unless you want to go and eat with my wife because

también de doce a una voy a comer con ella.
also from twelve to one I'm going to eat with her.

Si quieres nos acompañas y allá platicamos.
If you want you can accompany us and we can talk there.

There are four utterances which at first blush appear to reflect a faulty reasoning process. The speaker begins by saying he can meet at certain times of the day unless the addressee wants to have lunch with the speaker's wife. It is not obvious how that would solve the problem. The speaker then quickly adds that he is going to have lunch with her too, which causes one to wonder who else is having lunch with her. Finally, the speaker invites the addressee to join his wife and him so that they can carry out their business. The interpretation, once established, is clear. The speaker can meet with the addressee at certain times of the day or, if the addressee wanted to join the speaker and his wife for lunch, they could also meet then. But because of the choice of connectors, the position of ``también'' (also), and the ordering of the presentation of the information, the interpretation is not easily assigned.

There was no attempt to identify subcategories of non sequiturs and indeed, a complete count and marking up of the data was never carried out.

As mentioned above, the 6 Spanish dialogs contained 666 utterances and 360 turns. The 11 Spanish news articles contained 133 sentences with 304 clauses.

As for the quantitative results of the analysis, in the 6 dialogs we found 254 disfluencies, of which 102 were pauses, 64 were filler expressions, 24 were repetitions, 19 were repairs, and 45 were cases of grammatical ill-formedness. Thus, on average, about .37 of the utterances contained a disfluency. In the 11 news articles we found no cases of pauses, filler words, repetitions, or repairs and only a few cases of grammatical ill-formedness, which appeared to be due to printer's errors. It is clear from these results that disfluencies are extremely common in dialog and will have to be handled effectively and efficiently by dialog translation systems.
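The dialog figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts reported in this section; the variable and function names are illustrative. Note that it computes disfluencies per utterance, which is an upper bound on the share of utterances containing at least one disfluency, since a single utterance may contain several.

```python
# Counts for the 6 Spanish dialogs, as reported in the text.
DISFLUENCY_COUNTS = {
    "pauses": 102,
    "filler expressions": 64,
    "repetitions": 24,
    "repairs": 19,
    "grammatical ill-formedness": 45,
}
UTTERANCES = 666

def summarize(counts: dict, utterances: int) -> dict:
    """Return the total, the rate per utterance, and each category's share."""
    total = sum(counts.values())
    return {
        "total": total,
        "per_utterance": total / utterances,
        "shares": {name: n / total for name, n in counts.items()},
    }

summary = summarize(DISFLUENCY_COUNTS, UTTERANCES)
print(summary["total"])
print(round(summary["per_utterance"], 2))
for name, share in summary["shares"].items():
    print(f"{name}: {share:.1%}")
```

The category counts sum to the reported total of 254, and the rate of roughly 0.38 disfluencies per utterance is compatible with the reported figure of about .37 of utterances containing a disfluency.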






Computing Research Laboratory
Wed Jun 7 19:21:15 MDT 1995