Mar 7, 2016

Giving an IDF Effect to Probability Count of Words

The probability of an n-gram or word tells you how important that word is to a document. Unfortunately, since a document almost always has important words intermingled with conjunctions, determiners, and other types of noise, using the raw probabilities may not be a good idea. One thing you can do is normalize these probabilities using a larger collection. So let us say you have 3 documents and you want to find the most important words in each one. Assume that your documents look like this:

Doc1 { the cat and dog and mouse ate cheese }
Doc2 { the mouse in the house was lovely }
Doc3 { the mouse in the jungle was dead }
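
(If you want to follow along in code, here is a minimal sketch of how such counts can be produced in Python. Whitespace tokenization and all the variable names are just assumptions for this toy example.)

from collections import Counter

# The three toy documents, tokenized by splitting on whitespace
docs = {
    "Doc1": "the cat and dog and mouse ate cheese".split(),
    "Doc2": "the mouse in the house was lovely".split(),
    "Doc3": "the mouse in the jungle was dead".split(),
}

# Per-document counts, and counts pooled over the entire collection
doc_counts = {name: Counter(tokens) for name, tokens in docs.items()}
collection_count = sum(doc_counts.values(), Counter())

print(collection_count["the"])      # 5
print(doc_counts["Doc1"]["mouse"])  # 1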

Now, let us look at the word counts.

Entire Collection (i.e. Doc1 + Doc2 + Doc3):

{
  "results": [
    " the:5",
    " mouse:3",
    " in:2",
    " and:2",
    " was:2",
    " house:1",
    " jungle:1",
    " cat:1",
    " lovely:1",
    " ate:1",
    " cheese:1",
    " dead:1",
    " dog:1"
  ]
}

Counts for Doc1:

Doc 1 {
  "results": [
    " and:2",
    " mouse:1",
    " cat:1",
    " ate:1",
    " the:1",
    " dog:1",
    " cheese:1"
  ]
}

Counts for Doc2: 

Doc 2 {
  "results": [
    " the:2",
    " house:1",
    " in:1",
    " mouse:1",
    " lovely:1",
    " was:1"
  ]
}

Counts for Doc3: 

Doc 3 {
  "results": [
    " the:2",
    " in:1",
    " jungle:1",
    " mouse:1",
    " dead:1",
    " was:1"
  ]
}

We will refer to the counts of the entire collection as CollectionCount and to the counts of an individual document as DocumentCount. Now, if you look at the probability of the word 'the' in Doc1, it is 1/8 = 0.125, which was obtained as follows:

P('the' | Doc1) = DocumentCount('the') / total words in Doc1 = 1/8 = 0.125

The word 'the' appeared once in Doc1, and Doc1 has an overall word count of 8. Now, the probability of 'mouse' in Doc1 is also 1/8 = 0.125, obtained as follows:

P('mouse' | Doc1) = DocumentCount('mouse') / total words in Doc1 = 1/8 = 0.125
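
In code, the within-document probability is just the word's count divided by the document's total token count. A self-contained Python sketch (the function name is made up for illustration):

from collections import Counter

doc1_tokens = "the cat and dog and mouse ate cheese".split()
doc1_counts = Counter(doc1_tokens)

def doc_probability(word):
    # P(word | Doc1) = DocumentCount(word) / total tokens in Doc1
    return doc1_counts[word] / len(doc1_tokens)

print(doc_probability("the"))    # 0.125
print(doc_probability("mouse"))  # 0.125
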
By just looking at Doc1, even though we know that the word 'mouse' is more important to Doc1 than the word 'the' ('the', as we know, is used very frequently in English), the probabilities do not reflect this. A system would think these two words are equally important, because 'mouse' and 'the' have identical counts and probabilities. One way to address this is to give an IDF effect to the probabilities: words that appear very frequently across the entire collection should be penalized more heavily than words that appear in fewer documents overall. You can do this by normalizing each word's within-document probability by its probability in the entire collection, computed from the CollectionCount. So the updated scores of 'the' and 'mouse' will be computed as follows with this normalization strategy:

Score('the', Doc1) = P('the' | Doc1) / P('the' | Collection) = (1/8) / (5/22) ≈ 0.55
Score('mouse', Doc1) = P('mouse' | Doc1) / P('mouse' | Collection) = (1/8) / (3/22) ≈ 0.92
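
Here is the same normalization as a small self-contained Python sketch (again, the names are illustrative only):

from collections import Counter

doc1 = Counter("the cat and dog and mouse ate cheese".split())
collection = Counter(
    "the cat and dog and mouse ate cheese".split()
    + "the mouse in the house was lovely".split()
    + "the mouse in the jungle was dead".split()
)

def probability(word, counts):
    return counts[word] / sum(counts.values())

def normalized_score(word):
    # Within-document probability divided by collection probability;
    # words that are frequent everywhere (like 'the') get pushed down
    return probability(word, doc1) / probability(word, collection)

print(round(normalized_score("the"), 2))    # 0.55
print(round(normalized_score("mouse"), 2))  # 0.92
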
There you go! Now the word 'the' has a lower score than 'mouse' even though the within-document probabilities are the same. Note that this example is based on a very small sample, so the same theory may not hold for some of the other words in the example documents. However, when you take a large sample, the numbers will accurately reflect the typical distribution of words in a document and in a collection of documents. This type of normalization is powerful when you rank the words/n-grams in descending order of scores. Also, take note that when your sample is really large, the probabilities get really small, so it is better to work with log probabilities, which makes them easier to analyze. For example, you can do the following:

Score('the', Doc1) = log(P('the' | Doc1)) - log(P('the' | Collection)) = log(1/8) - log(5/22) ≈ -0.60
Score('mouse', Doc1) = log(P('mouse' | Doc1)) - log(P('mouse' | Collection)) = log(1/8) - log(3/22) ≈ -0.09

Don't get intimidated by the negative signs; when you rank the words in descending order of scores, it will all make sense.
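
To tie it all together, here is a sketch that ranks every word in Doc1 by its log-normalized score (same toy assumptions as before):

import math
from collections import Counter

docs = [
    "the cat and dog and mouse ate cheese".split(),
    "the mouse in the house was lovely".split(),
    "the mouse in the jungle was dead".split(),
]
collection = Counter(word for tokens in docs for word in tokens)
collection_total = sum(collection.values())

doc1 = Counter(docs[0])
doc1_total = sum(doc1.values())

def log_score(word):
    # log(P(word | Doc1)) - log(P(word | Collection))
    return math.log(doc1[word] / doc1_total) - math.log(collection[word] / collection_total)

# Rank Doc1's words in descending order of score
for word in sorted(doc1, key=log_score, reverse=True):
    print(word, round(log_score(word), 2))
# 'mouse' (-0.09) lands above 'the' (-0.60), despite their equal raw counts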