Corpus statistics

A 20-minute whirlwind tour

Objectives:

 

Basic counters and statistical tools

 

Kosovo

count: 8

Total words in document: 350

Frequency = count/total = 8/350 = 2.29%
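
The count-and-divide calculation above can be sketched in Python. The tokenizer (lowercase, runs of letters and apostrophes) and the names tokenize and relative_frequency are assumptions made for illustration; real tools may define a word differently.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a "word" is a run of letters or apostrophes, case-folded.
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(word, text):
    """Return (count, total, count/total) for `word` in `text`."""
    tokens = tokenize(text)
    count = Counter(tokens)[word.lower()]
    total = len(tokens)
    return count, total, (count / total if total else 0.0)

# The figures from the slide: 8 occurrences in a 350-word document.
print(f"{8 / 350:.2%}")  # 2.29%
```

How the tokenizer treats case, punctuation, and hyphens changes both the count and the total, so the same text can yield different frequencies in different tools.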

 

 

A few examples of such tools:

 

web-based:

 

Cathy Ball’s word counter

(http://www.georgetown.edu/cball/webtools/web_freqs.html)

 

CRL Java word counter

(http://crl.nmsu.edu/~raz/webtools/wordcount/WebWordCount.html)

 

standard tools:

 

wcount tool in Oleada

 

cwcount in /home/language_tools/algorithms

What is an occurrence?

Collocations: what words occur together?

Ice cream

Mineral water

Natural gas

Health insurance

Mutual Information

I(x,y) = log2 [p(x,y)/(p(x)p(y))]

The simplest estimates of the probabilities, the maximum likelihood estimates, are given by

p(x) = count(x)/N

p(y) = count(y)/N

p(x,y) = count(x,y)/N

 

(N is the number of words in the corpus)

(Remember: log2(x) = log10(x) / 0.30103)

log [p(x,y)/(p(x)p(y))] = log p(x,y) - log p(x) - log p(y)
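
A minimal sketch of the mutual information calculation with maximum likelihood estimates. It assumes that "occurring together" means being adjacent in the text, and it returns negative infinity for pairs that never occur; the function name pmi and the toy token list are made up for illustration.

```python
import math
from collections import Counter

def pmi(tokens, x, y):
    """Mutual information I(x,y) in bits for the adjacent pair (x, y)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
    if not (unigrams[x] and unigrams[y] and bigrams[(x, y)]):
        return float("-inf")  # pair never observed: the estimate is log2(0)
    p_x = unigrams[x] / n
    p_y = unigrams[y] / n
    p_xy = bigrams[(x, y)] / n
    # log of a quotient: subtract the log of each factor in the denominator
    return math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)

# Toy data, invented for this example.
tokens = "ice cream and ice water and cream soda".split()
print(pmi(tokens, "ice", "cream"))  # 1.0
```

Here p(ice) = p(cream) = 2/8 and p(ice, cream) = 1/8, so the pair occurs twice as often as independence would predict and I = log2(2) = 1 bit.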

 

Resources

Philip Resnik. Notes from a Short Course on Statistical Methods in NLP.

(http://www.umiacs.umd.edu/users/resnik/nlstat_tutorial_summer1998/index.html)

Dan Jurafsky and Jim Martin, Speech and Language Processing: An Introduction to Speech Recognition, Natural Language Processing, and Computational Linguistics, draft in progress (version of 10 June 1998).

(http://www.cs.colorado.edu/~martin/slp.html)

Brigitte Krenn and Christer Samuelsson, The Linguist's Guide to Statistics.

(http://www.coli.uni-sb.de/~christer/stat_cl.ps.gz.uu)

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1):61-74.

Eugene Charniak. 1993. Statistical Language Learning. Cambridge: MIT Press.

 

Probability

Probability: a mathematical statement about the likelihood of some event occurring.

 

 

Examples:

    1. the cat in the hat
    2. the cat on the mat
    3. the moon on the mat
    4. the cat saw the moon

 

Viewing this as one document:

count of cat: 3

total words: 20

P(cat) = 3/20

Viewing this as four separate documents:

What’s the probability that cat will be in a document?

P(cat) = 3/4
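
The two views can be computed side by side. This sketch assumes whitespace tokenization and uses exact fractions so the results match the slide; the variable names are illustrative.

```python
from fractions import Fraction

docs = [
    "the cat in the hat",
    "the cat on the mat",
    "the moon on the mat",
    "the cat saw the moon",
]

# One document: probability that a randomly chosen token is "cat".
tokens = " ".join(docs).split()
p_token = Fraction(tokens.count("cat"), len(tokens))

# Four documents: probability that a randomly chosen document contains "cat".
p_doc = Fraction(sum("cat" in d.split() for d in docs), len(docs))

print(p_token, p_doc)  # 3/20 3/4
```

The same counts give different probabilities because the event being counted differs: a token in the first case, a document in the second.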

 

 

Independence:

Two events are said to be independent iff

P(A ∩ B) = P(A) P(B)

(The probability of A and B occurring simultaneously can be determined directly from the individual probabilities of A and B)

Example:

The word Monica occurs in 8 of 100 documents and the word Lewinsky occurs in 5 of 100 documents. The two words occur together in 4 of 100 documents. Are the words independent?

P(Monica ∩ Lewinsky) = 4/100

P(Monica) P(Lewinsky) = (8/100)(5/100) = 4/1000

4/100 ≠ 4/1000, so the two words are not independent.
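
The check in this example can be reproduced with exact fractions. A sketch only; the variable names are illustrative.

```python
from fractions import Fraction

n_docs = 100
p_monica = Fraction(8, n_docs)
p_lewinsky = Fraction(5, n_docs)
p_both = Fraction(4, n_docs)

expected_if_independent = p_monica * p_lewinsky
print(expected_if_independent)            # 1/250, i.e. 4/1000
print(p_both == expected_if_independent)  # False
print(p_both / expected_if_independent)   # 10
```

The words co-occur ten times as often as independence would predict.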

Conditional Probabilities

P(A|B) = P(A ∩ B) / P(B)

 

    1. the cat in the hat
    2. the cat on the mat
    3. the moon on the mat
    4. the cat saw the moon

P(hat|cat) = P(hat ∩ cat) / P(cat) = (1/4) / (3/4) = 1/3
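
A sketch of the same conditional probability computed from the four example documents, treating each document as one event. The helper names p and p_given are assumptions for illustration.

```python
from fractions import Fraction

docs = [
    "the cat in the hat",
    "the cat on the mat",
    "the moon on the mat",
    "the cat saw the moon",
]
doc_sets = [set(d.split()) for d in docs]

def p(*words):
    """Fraction of documents that contain all of `words`."""
    hits = sum(all(w in s for w in words) for s in doc_sets)
    return Fraction(hits, len(doc_sets))

def p_given(a, b):
    """P(a | b) = P(a and b) / P(b); raises ZeroDivisionError if P(b) = 0."""
    return p(a, b) / p(b)

print(p_given("hat", "cat"))  # 1/3
```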

 

 

Bayes’ Theorem

P(A|B) = P(B|A) P(A) / P(B)

 

P(cat|hat) = P(hat|cat) P(cat) / P(hat) = (1/3)(3/4) / (1/4) = 1
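
The arithmetic can be checked with a one-line function. The name bayes is illustrative; the inputs are the values from the cat/hat example.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

# A = cat, B = hat
p_cat_given_hat = bayes(Fraction(1, 3), Fraction(3, 4), Fraction(1, 4))
print(p_cat_given_hat)  # 1
```

The result is 1 because the only document containing hat also contains cat.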

 

One application: speech recognition

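max over word of P(word | acoustic evidence): choose the word that is most probable given the acoustic evidence

Another: part-of-speech tagging. Find the part-of-speech tag given the word and the previous part-of-speech tag.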

 

Chi-square

χ² = Σ (O_i - E_i)² / E_i, where O_i is the observed count and E_i the expected count in cell i

Are words uniformly distributed in a set of documents?

Does the topic of a text influence the occurrence of specific words?

Are two words collocates?

 

Example

Suppose we want to determine whether a certain word w occurs with the same frequency in 5 documents that vary in topic.

The counts of the word in the 5 documents are as follows:

document    1     2     3     4     5     total
observed    43    29    52    34    48    206

If the distribution were uniform we would expect 206/5, or about 41 occurrences of w per document:

document    1     2     3     4     5     total
expected    41.2  41.2  41.2  41.2  41.2  206

Plugging these numbers into the equation we get

χ² = 8.903

Checking a χ² table with df = 4, this is sufficient to reject the null hypothesis at the 10% level (critical value 7.779) but not at the 5% level (critical value 9.488).
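
The worked example can be reproduced in plain Python. This sketch assumes the usual Pearson statistic, the sum over documents of (observed - expected)² / expected, and takes the df = 4 critical values from a standard chi-square table; the variable names are illustrative.

```python
observed = [43, 29, 52, 34, 48]
expected_each = sum(observed) / len(observed)  # 41.2 under the uniform hypothesis

chi_square = sum((o - expected_each) ** 2 / expected_each for o in observed)
df = len(observed) - 1

# Critical values for df = 4, copied from a standard chi-square table.
CRITICAL = {0.10: 7.779, 0.05: 9.488}

print(round(chi_square, 3))  # 8.903
for alpha, cutoff in CRITICAL.items():
    print(alpha, "reject" if chi_square > cutoff else "do not reject")
```

The statistic falls between the two cutoffs, so uniformity is rejected at the 10% level but not at the 5% level.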