Next: Fragments. Up: Introduction Previous: Results of Initial

Tasks 2 and 3: Comparison of Dialogs and Texts and their Translations.

The analysis of dialogs and texts compared 22 Spanish newspaper articles and 16 Spanish dialogs with respect to a number of linguistic characteristics, both qualitatively and quantitatively. The work comprised three general tasks: a small trial comparison of texts and dialogs over a range of candidate characteristics, to identify which would serve as informative criteria; the comparison of texts and dialogs with respect to those informative criteria; and an analysis of the translation problems related to those criteria.

As a preliminary to the comparison of Spanish texts and Spanish dialogs, some basic information about the corpora was gathered. This included word counts and word frequency counts, the numbers of clauses and sentences in the text corpus, and the numbers of utterances and turns in the corpus of dialog transcriptions. These counts provided a basis, albeit an imperfect one, for selecting corpora of comparable size.
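As an illustration only, word counts and word frequency counts of the kind described above could be gathered along the following lines. This is a minimal Python sketch, not the procedure actually used in the project: the tokenizer (lowercased runs of letters), the function names tokenize and corpus_stats, and the two sample utterances are all assumptions introduced here.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters,
    # including accented Spanish letters; digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())

def corpus_stats(documents):
    # Return the total word count and a word frequency table for a corpus
    # given as a list of strings (one string per utterance or article).
    tokens = [tok for doc in documents for tok in tokenize(doc)]
    return len(tokens), Counter(tokens)

# Hypothetical sample utterances, not taken from the project corpora.
sample_dialog = ["Bueno, ¿nos vemos el lunes?", "Sí, el lunes está bien."]
total, freq = corpus_stats(sample_dialog)
print(total)                # total number of word tokens
print(freq.most_common(3))  # the three most frequent word forms
```

Running the same function over the text corpus and the dialog corpus gives totals that can be compared when choosing subsets of roughly equal size. Counting clauses, sentences, utterances, and turns would require segmentation decisions that this sketch does not attempt.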

The comparison was initially carried out on 2 dialogs and 2 texts with respect to 11 categories:

  1. sentence fragments,
  2. ellipted utterances,
  3. referring expressions,
  4. disfluencies (pauses, repetitions, repairs, grammatical errors, non sequiturs),
  5. interjections and exclamations,
  6. discourse markers,
  7. patterns of topic and focus,
  8. conversational moves,
  9. foreign words,
  10. casual (vs. formal) locutions,
  11. dialectal (vs. standard) locutions.

The goal of this initial trial comparison was, on the one hand, to identify which of the above categories would act as practical and interesting criteria for comparison and, on the other, to sharpen our operational definitions of the categories themselves. The two corpora used consisted of two four-minute dialogs (approximately 1500 words) and two newspaper articles (approximately 800 words). As a result, we selected categories (1) through (6) as initial criteria for comparison, since they appeared both to lead to interesting results over the small trial corpus and to be relatively easy to apply in terms of effort.

The use of categories (7) and (8) as a basis for comparison was postponed. Both yielded interesting results over the small trial corpus, but applying them on a large scale required more effort than we felt the budget would allow. We hope to compare texts and dialogs on the basis of (7) during the final phase of the first year, if time permits, and we plan to carry out a comparison on the basis of (8) during the second year of the project.

With regard to categories (9) through (11), we decided to discontinue their use as a basis for comparison. With respect to (9), we found the data set to be too small to obtain useful results. Discounting proper nouns, there were only half a dozen "non-standard" borrowings in the 16 dialogs and even fewer in the 22 texts. Thus, as an indicator of differences between spoken and written language, the number and types of foreign words would not have proven useful. With respect to (10) and (11), we found that the two forms of communication differed too sharply in the goals and settings of the communicative activity to allow useful comparisons. The dialogs were between speakers of the same dialect, in an informal setting, with a rather mundane purpose: planning a meeting. The newspaper articles, on the other hand, were written in a more formal setting with the purpose of reporting situations and events, possibly by speakers of differing dialects who were nonetheless making a conscious effort to approach a standard literary style. In the end there was simply too little "commonality" for a useful comparison to be carried out.

The analysis itself was carried out on two sets of four corpora. The first consisted of 6 Spanish dialogs and their English translations (approximately 4000 words each) and 11 Spanish news articles and their translations (approximately 4000 words each). The 6 Spanish dialogs, for which the results are given below, contained 666 utterances and 360 turns. The 11 Spanish news articles contained 133 sentences with 304 clauses. The second set of four corpora consisted of 12 Spanish dialogs and their English translations (approximately 7000 words each) and 20 Spanish news articles and their translations (approximately 7000 words each). Briefly, the results of the comparison with respect to each of the six criteria were as follows.








Computing Research Laboratory
Wed Jun 7 19:21:15 MDT 1995