A line-by-line study of the first dialog was carried out,
including comments on translation difficulties and on the forms
of discourse markers and their functions. At about this time
information on the Janus project was received. It was noted that
the protocol for the Janus dialogs, while fundamentally similar
to that of Artwork, did vary in a number of ways. A study of
Janus dialogs showed that the structure of the Janus data
differed from that of Artwork in that there was more variation in the
sub-dialogs in the Artwork data. It was this set of observations
that led to the study of the impact of protocol on various types
of dialog, discussed below.
The first analyses attempted to establish higher level
discourse structures; this was done in order to reduce the number
of discourse variants that might occur in future discourses to be
analyzed. Additionally, ``fine-grained'' analyses of the text were
carried out. These included the identification of the discourse
markers employed by the two consultants, and the functions of
these markers. The aim of this work was to establish working
categories which would then be used to mark up data. The
following is a summary of the categories established.
- DISFLUENCIES.
We suggest two broad categories of disfluencies. One is
``grammatical disfluencies''; by this we mean those constructions
which are well formed but which serve some discourse internal
function such as filling a gap in a dialog. The other is
``non-grammatical disfluencies''; these are the ``uhs'', ``ahhs'' and
``umms'' which fill gaps. We also include in this group sentence
fragments, ill-formed utterances, slips of the tongue, and other
such ill-formed constructions.
- Grammatical.
- Common Knowledge Disfluency.
``Ella está trabajando en la escuela y yo en la otra.''
``She's working at school and I'm at the other.''
The Spanish definite article in ``la'' escuela and ``la'' otra
presupposes knowledge of ``a certain'' school that is not
provided in the text; it rests on the shared world
knowledge of the speakers.
- Logical Disfluency.
``A menos de que quieres ir a comer con mi esposa, porque
también de doce a una voy a comer con ella.''
``Unless you want to go eat with my wife, because I'm going
to have lunch with her too from twelve to one.''
A strange construction in Spanish which, no matter what the
translation, will sound funny in English. The context of the
discourse ameliorates the ``strangeness'' of such an
utterance, but this is something that would not occur in a
written text.
- Filler Disfluency.
``Necesitamos, este, discutir, eh, uh, algo del proyecto.''
``We need, um, to talk about, eh, uh, something about the
project.''
In this context, ``este'' is a marker which signals
reservation on the speaker's part about a Territorial Breach
illocution, and cannot be construed as a determiner.
- Agreement Disfluency.
Lack of gender agreement in the article, e.g. ``'los' doce'' (twelve of
something) instead of ``'las' doce'' (time reference). Both
are grammatically possible, which could lead to ambiguity.
- Varietal Disfluency.
``Ah no te dijiste que después de las cinco, ¿verdad?''
``Ah you said after five, right?''
Some regional variant not common in other Spanish dialects.
In this case, ``no te dijiste'' is literally ``you didn't tell
yourself''. The pronoun ``te'' may be a disfluency, while the
``no'' is probably some type of inchoative particular to this
dialect.
- Politeness Disfluency.
``¿A qué hora le hablará a comer?''
``When will you call her to go eat?''
A change in the level of formality, in this case from ``tú''
to ``usted''. 2S is marked for person, while 3S is not,
resulting in ambiguity.
- Non-grammatical.
- Syntactic Disfluency.
``...y a mi esposa tiene que ir a...''
``...and my wife has to go to...''
Some problem with syntax probably created by changing a
thought in mid-utterance, a slip of the tongue, etc.
In this case, we find the dative marker ``a'' introducing a
nominative phrase.
- Fragments.
Incomplete utterances, caused by interruptions, changing
thoughts in mid-utterance, etc.
- DIALECTAL VARIATION.
Arises from the fundamental differences between written text
and speech. Try as speakers might to reproduce some written
standard in their speech, there will inevitably occur some
variation due to the impossibility of ``correcting'' the spoken
word. Examples are calques such as ``tener buen tiempo'' from
the English ``to have a good time'' or ``escuela de graduados''
from ``graduate school''. Code-switching may occur, even in
the speech of monolinguals, e.g. if they are trying to sound
``sophisticated''. Also there is the problem of variation of
semantic content between dialects; ``correr'' can mean ``to
run'' or ``to fire someone (from a job)''.
- ANAPHORA.
In Spanish, there is a variety of grammatical elements, such
as demonstrative, relative and indefinite pronouns, among
others, whose adequate translation depends on identifying some
previous referent. For example, the genitive
``su'' (problema) can mean ``his'', ``her'', ``its'', ``their'', ``your
(singular)'' or ``your (plural)'' (problem). This is found both
in written texts and in speech.
- ELLIPSIS.
The deletion of some element or elements, in either Spanish
or English. In some cases, some forms or syntactic
structures may be commonly ellipted in Spanish and not in
English; in this case, the ellipted material needs to be
generated. Examples are subject pronouns, especially 3S
and 3P. Spanish marks person in the verb morphology, and
subject pronouns are commonly deleted, both in written text
and in speech. 3S and 3P are not overtly marked, and thus
can lead to ambiguity. English does not permit this (at
least in written texts), and so the subject pronouns must be
generated. Conversely, some elements may be commonly
ellipted in English and not in Spanish, creating a need to
``get rid of'' some of the input language.
- VERBAL INFLECTION.
Spanish has a relatively greater reliance on synthetic
verbal inflection than English does. For example, Tense and
Aspect tend to be inflected in Spanish and not in English.
The subjunctive mood corresponds to a complex use of modal
auxiliaries in English. And there are some subtle usages,
such as the simple present indicative in Spanish which
sometimes demands the use of a modal in English. These
differences occur in both written texts and in speech, with
the possibility that speech will be more difficult to
translate because new uses for old forms appear through
the process of grammaticization.
- SYNTAX.
The differences here are simply too many to be listed.
Suffice it to say that there is a much greater dependence on
word order in English than in Spanish. These differences
occur in both written texts and in speech.
- DISCOURSE STRATEGIES.
Discourse markers and their internal and external illocutionary
functions were noted. These are what speakers use to initiate,
maintain and terminate a dialog, as well as to introduce and
discuss shared and private knowledge. The classification of
the different types of markers was based on Edmondson's (1981)
and House's (1984) categories. Extra categories were added as
needed. This work was motivated by the possibility that the task
of disambiguating certain forms and structures will be
facilitated by identifying what type of discourse functions they
fulfill. For example, opening gambits need not be translated on
the basis of the semantic content of the lexemes, but rather by
their formulaic nature. Date and time parameters may be
identified by the discourse markers that introduce them.
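
Since the categories above were meant to be used to mark up data, it
may help to see how such an inventory could be held in a program. The
following is only a minimal sketch in Python and is not the project's
actual markup scheme: the names Category, Annotation, FILLERS,
SU_READINGS and tag_fillers are all hypothetical, and the filler
inventory is assumed from the few examples quoted in the text. It
records the seven top-level categories, lists the possible readings of
``su'' given under ANAPHORA, and tags only the non-grammatical fillers
in an utterance.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Top-level categories from the summary (identifiers are illustrative)."""
    DISFLUENCY = "disfluency"
    DIALECTAL_VARIATION = "dialectal variation"
    ANAPHORA = "anaphora"
    ELLIPSIS = "ellipsis"
    VERBAL_INFLECTION = "verbal inflection"
    SYNTAX = "syntax"
    DISCOURSE_STRATEGY = "discourse strategy"


@dataclass(frozen=True)
class Annotation:
    """One marked-up span of an utterance (hypothetical record layout)."""
    start: int          # character offset where the span begins
    end: int            # character offset just past the span
    category: Category
    subtype: str        # e.g. "non-grammatical"
    text: str           # the matched surface form


# Assumption: filler inventory limited to the forms quoted in the text.
FILLERS = ("uh", "eh", "um", "umm", "ahh")
_FILLER_RE = re.compile(r"\b(?:" + "|".join(FILLERS) + r")\b", re.IGNORECASE)

# Candidate English readings of "su", as listed under ANAPHORA.
SU_READINGS = ("his", "her", "its", "their",
               "your (singular)", "your (plural)")


def tag_fillers(utterance: str) -> list[Annotation]:
    """Return one Annotation per non-grammatical filler in the utterance."""
    return [
        Annotation(m.start(), m.end(), Category.DISFLUENCY,
                   "non-grammatical", m.group(0))
        for m in _FILLER_RE.finditer(utterance)
    ]


if __name__ == "__main__":
    example = "Necesitamos, este, discutir, eh, uh, algo del proyecto."
    for ann in tag_fillers(example):
        print(ann.start, ann.end, ann.subtype, ann.text)
```

Note that ``este'' is deliberately left untagged in this sketch. The
summary classes it as a grammatical filler whose function depends on
context, so a plain word list cannot separate it from the
demonstrative.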
Computing Research Laboratory
Wed Jun 7 19:21:15 MDT 1995