Corpus statistics
A 20-minute whirlwind tour
Objectives:
Basic counters and statistical tools
Kosovo
count: 8
Total words in document: 350
Frequency = count/total = 2.29%
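The Kosovo calculation above can be sketched in a few lines of Python. This is a minimal illustration, not one of the tools named in this talk: the function name word_frequency and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions, and a different definition of a word would give different counts. The 350-word document is a hypothetical stand-in built to match the figures above.

```python
import re
from collections import Counter


def word_frequency(text, word):
    """Return (count, total, frequency) for `word` in `text`.

    Assumption: a word is a run of letters or apostrophes, compared
    case-insensitively. Other tokenization rules give other counts.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    count = Counter(tokens)[word.lower()]
    total = len(tokens)
    frequency = count / total if total else 0.0
    return count, total, frequency


# Hypothetical 350-word document in which "Kosovo" occurs 8 times.
document = " ".join(["Kosovo"] * 8 + ["filler"] * 342)
count, total, frequency = word_frequency(document, "Kosovo")
print(f"count: {count}, total: {total}, frequency: {frequency:.2%}")
```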
A few examples of such tools:
web-based:
Cathy Ball’s word counter
(http://www.georgetown.edu/cball/webtools/web_freqs.html)
CRL Java word counter
standard tools:
wcount tool in Oleada
cwcount in /home/language_tools/algorithms
What is an occurrence?
Collocations: what words occur together?
Ice cream
Mineral water
Natural gas
Health insurance
Mutual Information
I(x,y) = log2 [p(x,y)/(p(x)p(y))]
The simplest estimates of the probabilities, the maximum likelihood estimates, are given by
p(x) = count(x)/N
p(y) = count(y)/N
p(x,y) = count(x,y)/N
(N is the number of words in the corpus)
(remember log2 (x) = log10 (x) / .30103)
log [p(x,y)/(p(x)p(y))] = log p(x,y) - log p(x) - log p(y)
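As a rough sketch, the mutual information score can be computed directly from raw counts with the maximum likelihood estimates. The function names and the counts below are hypothetical, chosen only for illustration. The second function computes the same quantity by subtracting the logs of both individual probabilities from the log of the joint probability, working in base 10 and converting to base 2 at the end.

```python
import math


def mutual_information(count_x, count_y, count_xy, n):
    """I(x,y) in bits from raw counts, using maximum likelihood estimates."""
    p_x = count_x / n
    p_y = count_y / n
    p_xy = count_xy / n
    return math.log2(p_xy / (p_x * p_y))


def mutual_information_expanded(count_x, count_y, count_xy, n):
    """Same quantity via log p(x,y) - log p(x) - log p(y).

    Uses base-10 logs and converts to base 2 by dividing by
    log10(2), which is approximately .30103.
    """
    log10_ratio = (math.log10(count_xy / n)
                   - math.log10(count_x / n)
                   - math.log10(count_y / n))
    return log10_ratio / math.log10(2)


# Hypothetical counts, for illustration only (not from any real corpus).
direct = mutual_information(count_x=100, count_y=50, count_xy=25, n=100_000)
expanded = mutual_information_expanded(100, 50, 25, 100_000)
print(round(direct, 3), round(expanded, 3))
```

A score well above zero suggests the pair occurs together more often than chance would predict; a score near zero suggests no association.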
Resources
Philip Resnik. Notes from a Short Course on Statistical Methods in NLP.
(http://www.umiacs.umd.edu/users/resnik/nlstat_tutorial_summer1998/index.html)
Dan Jurafsky and Jim Martin. Speech and Language Processing: An Introduction to Speech Recognition, Natural Language Processing, and Computational Linguistics. Draft in progress (version of 10 June 1998).
(http://www.cs.colorado.edu/~martin/slp.html)
Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics.
(http://www.coli.uni-sb.de/~christer/stat_cl.ps.gz.uu)
Ted Dunning. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1):61-74.
Eugene Charniak. 1993. Statistical Language Learning. Cambridge: MIT Press.
Probability
Probability: a mathematical statement about the likelihood of some event occurring.
Examples:
Viewing this as one document:
count of cat: 3
total words: 20
P(cat) = 3/20
Viewing this as four separate documents:
What’s the probability that cat will be in a document?
P(cat) = 3/4
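The two views of P(cat) can be contrasted in code. The four documents below are hypothetical stand-ins, chosen only so that they match the counts on this slide (20 words in total, cat occurring 3 times, in 3 of the 4 documents); they are not the original example text, and the function names are likewise made up for this sketch.

```python
from fractions import Fraction

# Hypothetical documents chosen to match the counts above.
documents = [
    "the cat sat down quietly",
    "a cat wore a hat",
    "my cat likes warm milk",
    "the dog chased a ball",
]


def token_probability(docs, word):
    """P(word) when all documents are pooled: occurrences / total words."""
    tokens = [t for doc in docs for t in doc.split()]
    return Fraction(tokens.count(word), len(tokens))


def document_probability(docs, word):
    """P(word) as the fraction of documents containing the word."""
    containing = sum(1 for doc in docs if word in doc.split())
    return Fraction(containing, len(docs))


print(token_probability(documents, "cat"))     # 3/20
print(document_probability(documents, "cat"))  # 3/4
```

The same word gets a different probability depending on what counts as an event: a single word position or a whole document.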
Independence:
Two events are said to be independent iff
P(A ∩ B) = P(A) P(B)
(The probability of A and B occurring simultaneously can be determined directly from the individual probabilities of A and B)
Example:
The word Monica occurs in 8 of 100 documents and the word Lewinsky occurs in 5 of 100 documents. The two words occur together in 4 of 100 documents. Are the words independent?
P(Monica ∩ Lewinsky) = 4/100
P(Monica) P(Lewinsky) = 8/100 × 5/100 = 4/1000
4/100 ≠ 4/1000, so the words are not independent.
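The independence check in this example can be sketched with exact fractions. The function name independence_check is an assumption made for this illustration. Note that it tests exact equality, which estimated probabilities from a real corpus will rarely satisfy; the point here is only to reproduce the arithmetic of the example.

```python
from fractions import Fraction


def independence_check(docs_a, docs_b, docs_both, total_docs):
    """Return (P(A and B), P(A) * P(B), whether the two are equal)."""
    p_joint = Fraction(docs_both, total_docs)
    p_product = Fraction(docs_a, total_docs) * Fraction(docs_b, total_docs)
    return p_joint, p_product, p_joint == p_product


# Monica: 8 of 100 documents, Lewinsky: 5 of 100, together: 4 of 100.
p_joint, p_product, independent = independence_check(8, 5, 4, 100)
print(p_joint, p_product, independent)  # 1/25 1/250 False
```

The joint probability (4/100 = 1/25) is ten times the product of the individual probabilities (4/1000 = 1/250).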
Conditional Probabilities
P(A|B) = P(A ∩ B) / P(B)
P(hat|cat) = P(hat ∩ cat) / P(cat) = (1/4) / (3/4) = 1/3
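A minimal sketch of the conditional probability calculation, treating each document as one event. The four documents are hypothetical, constructed so that cat occurs in 3 of 4 documents and hat and cat occur together in 1 of 4, as in the figures above; the function name is also an assumption.

```python
from fractions import Fraction

# Hypothetical documents (as sets of words) matching the figures above.
documents = [
    {"the", "cat", "sat", "down"},
    {"a", "cat", "wore", "hat"},
    {"my", "cat", "likes", "milk"},
    {"the", "dog", "chased", "ball"},
]


def conditional_probability(docs, a, b):
    """P(a | b) = P(a and b) / P(b), with documents as the events.

    Raises ZeroDivisionError if b occurs in no document, since the
    conditional probability is undefined in that case.
    """
    n = len(docs)
    p_b = Fraction(sum(1 for d in docs if b in d), n)
    p_a_and_b = Fraction(sum(1 for d in docs if a in d and b in d), n)
    return p_a_and_b / p_b


print(conditional_probability(documents, "hat", "cat"))  # 1/3
```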
Bayes’ Theorem
P(A|B) = P(B|A) P(A) / P(B)
P(cat|hat) = P(hat|cat) P(cat) / P(hat) = (1/3)(3/4) / (1/4) = 1
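The Bayes' theorem example can be checked with a one-line function over exact fractions. The name bayes is an assumption for this sketch; the three input values are the ones used in the example above.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# A = cat, B = hat: P(hat|cat) = 1/3, P(cat) = 3/4, P(hat) = 1/4.
p_cat_given_hat = bayes(Fraction(1, 3), Fraction(3, 4), Fraction(1, 4))
print(p_cat_given_hat)  # 1
```

A result of 1 means that every document containing hat also contains cat.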
One application: speech recognition
Choose the word that maximizes P(word | acoustic evidence)
Another application: part-of-speech tagging. Find the part-of-speech tag given the word and the previous part-of-speech tag
Chi-square
χ² = Σ (observed - expected)² / expected
Are words uniformly distributed in a set of documents?
Does the topic of a text influence the occurrence of specific words?
Are two words collocates?
Example
Suppose we want to determine whether a certain word w occurs with the same frequency in 5 documents that vary in topic.
The counts of the word in the 5 documents are as follows:
observed
document |  1 |  2 |  3 |  4 |  5 | total
count    | 43 | 29 | 52 | 34 | 48 |   206
If the distribution were uniform we would expect 206/5, or about 41 occurrences of w per document:
expected
document |    1 |    2 |    3 |    4 |    5 | total
count    | 41.2 | 41.2 | 41.2 | 41.2 | 41.2 |   206
Plugging these numbers into the equation we get
χ² = 8.903
Checking a χ² table with df = 4, this is sufficient to reject the null hypothesis at the 10% level but not at the 5% level.
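The worked example can be reproduced with a short sketch that sums (observed - expected)² / expected over the five documents, with the expected count taken as the total divided by the number of documents. The function name chi_square_uniform is an assumption; the two critical values for df = 4 are the usual entries from a standard chi-square table and are hard-coded rather than computed.

```python
def chi_square_uniform(observed):
    """Chi-square statistic against a uniform expected distribution.

    Returns (statistic, expected count per cell, degrees of freedom).
    """
    total = sum(observed)
    expected = total / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, expected, len(observed) - 1


observed = [43, 29, 52, 34, 48]
statistic, expected, df = chi_square_uniform(observed)

# Critical values for df = 4, copied from a standard chi-square table.
CRITICAL_10_PERCENT = 7.779
CRITICAL_5_PERCENT = 9.488

print(f"chi-square = {statistic:.3f}, expected = {expected}, df = {df}")
print("reject at 10%:", statistic > CRITICAL_10_PERCENT)
print("reject at 5%:", statistic > CRITICAL_5_PERCENT)
```

The statistic falls between the two critical values, which is why the null hypothesis of a uniform distribution is rejected at the 10% level but not at the 5% level.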