Oct 7, 2014

Shell script to generate counts of words in descending order of term frequency

To get word counts from a text file, you don't really need to write a Java or Python program. You can do it with shell commands, which are available by default on Linux. On Windows you can still use them by installing Cygwin, which provides a Unix-like command line for Windows. Here is an easy way of doing it:

Assuming you start with a file called my_text_file, we first convert its contents to lowercase (my_text_file.lowercase), then split the text so that there is one word per line (my_text_file.onewordperline). Next we sort the words and count how often each one occurs (my_text_file.count), and finally sort the counts in descending order of term frequency (my_text_file.countsorted). Here is the step-by-step guide:

1. First convert all capital letters to lowercase.
$ tr '[A-Z]' '[a-z]' < my_text_file > my_text_file.lowercase

2. Split each line so that every word ends up on a line of its own. Words are split on whitespace, so punctuation stays attached to the word it touches.
$ awk '{for (i=1;i<=NF;i++) print $i;}' my_text_file.lowercase > my_text_file.onewordperline

3. Sort all the words and then count the number of occurrences of each word.
$ sort my_text_file.onewordperline | uniq -c > my_text_file.count 

4. Sort the words in descending order of count so that the high-frequency words come first.
$ sort -rn -k1 my_text_file.count > my_text_file.countsorted
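
To see what the four steps produce, here is a minimal sketch that runs them on a tiny input. The two sample sentences and the scratch directory are assumptions made up for illustration; the commands are the ones from the steps above, with the redirections written out in step 1.

```shell
# Work in a scratch directory so that no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample input (not from the original post).
printf 'The cat saw the dog\nThe dog ran\n' > my_text_file

# Step 1: lowercase everything.
tr '[A-Z]' '[a-z]' < my_text_file > my_text_file.lowercase
# Step 2: one word per line.
awk '{for (i=1;i<=NF;i++) print $i;}' my_text_file.lowercase > my_text_file.onewordperline
# Step 3: sort, then count each distinct word.
sort my_text_file.onewordperline | uniq -c > my_text_file.count
# Step 4: highest counts first.
sort -rn -k1 my_text_file.count > my_text_file.countsorted

# Show the two most frequent words: "the" with a count of 3, then "dog" with 2.
# The amount of leading whitespace before each count depends on your uniq.
head -n 2 my_text_file.countsorted
```

The remaining words each occur once, so they follow with a count of 1.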

All of the steps above combined into one command:
$ tr '[A-Z]' '[a-z]' < my_text_file | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c | sort -rn -k1 > my_text_file.countsorted
The "|" character is called a pipe; it sends the output of the previous command to the next command as its input. The ">" symbol writes the output to the file named on its right. You don't have to create this file beforehand; it is created automatically, and if it already exists its contents are overwritten.
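
If you run this often, the pipeline can be wrapped in a shell function. The following is only a sketch: the function name word_counts, the sample file, and the scratch directory are hypothetical and not part of the original steps. It also shows the difference between piping the result onward and redirecting it to a file.

```shell
# word_counts is a hypothetical helper name; it prints the sorted counts
# for the file given as its first argument.
word_counts() {
    tr '[A-Z]' '[a-z]' < "$1" | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c | sort -rn -k1
}

# Hypothetical sample input in a scratch directory.
demo_dir=$(mktemp -d)
printf 'b a\nB c b\n' > "$demo_dir/sample.txt"

# No ">" here: the result is piped into head, which prints only the top 2 lines.
word_counts "$demo_dir/sample.txt" | head -n 2

# With ">": the same result is written to a file, which the shell creates.
word_counts "$demo_dir/sample.txt" > "$demo_dir/sample.countsorted"
```

In this sample "b" occurs three times (once as "B"), so it is the first line of both the printed output and the saved file.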
