Jan 26, 2017

What is ROUGE and how it works for evaluation of summaries?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially of a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced).

Let us say, we have the following system and reference summaries:

System Summary (what the machine produced):

    the cat was found under the bed
Reference Summary (gold standard - usually by humans) :

    the cat was under the bed

If we consider just the individual words, the number of overlapping words between the system summary and reference summary is 6. This however, does not tell you much as a metric. To get a good quantitative value, we can actually compute the precision and recall using the overlap.

Precision and Recall in the Context of ROUGE

Simplistically put, Recall in the context of ROUGE simply means how much of the reference summary is the system summary recovering or capturing? If we are just considering the individual words, it can be computed as:

In this example, the Recall would thus be:

This means that all the words in the reference summary has been captured by the system summary, which indeed is the case for this example. Whoala! this looks really good for a text summarization system. However, it does not tell you the other side of the story. A machine generated summary (system summary) can be extremely long, capturing all words in the reference summary. But, much of the words in the system summary may be useless, making the summary unnecessarily verbose. This is where precision comes into play. In terms of precision, what you are essentially measuring is, how much of the system summary was in fact relevant or needed? Precision is measured as:

In this example, the Precision would thus be:

This simply means that 6 out of the 7 words in the system summary were in fact relevant or needed. If we had the following system summary, as opposed to the example above:

System Summary 2:

    the tiny little cat was found under the big funny bed
The Precision now becomes:

Now, this doesn't look so good, does it? That is because we have quite a few unnecessary words in the summary. The precision aspect becomes really crucial when you are trying to generate summaries that are concise in nature. Therefore, it is always best to compute both the Precision and Recall and then report the F-Measure. If your summaries are in some way forced to be concise through some constraints, then you could consider using just the Recall since precision is of less concern in this scenario.


ROUGE-N, ROUGE-S and ROUGE-L can be thought of as the granularity of texts being compared between the system summaries and reference summaries. For example, ROUGE-1 refers to overlap of unigrams between the system summary and reference summary. ROUGE-2 refers to the overlap of bigrams between the system and reference summaries. Let's take the example from above. Let us say we want to compute the ROUGE-2 precision and recall scores. 

System Summary :
    the cat was found under the bed
Reference Summary :
    the cat was under the bed

System Summary Bigrams:
the cat, 
cat was, 
was found, 
found under, 
under the, 
the bed

Reference Summary Bigrams:
the cat, 
cat was, 
was under, 
under the, 
the bed

Based on the bigrams above, the ROUGE-2 recall is as follows:

Essentially, the system summary has recovered 4 bigrams out of 5 bigrams from the reference summary which is pretty good! Now the ROUGE-2 precision is as follows:

The precision here tells us that out of all the system summary bigrams, there is a 67% overlap with the reference summary.  This is not too bad either. Note that as the summaries (both system and reference summaries) get longer and longer, there will be fewer overlapping bigrams especially in the case of abstractive summarization where you are not directly re-using sentences for summarization.

The reason one would use ROUGE-1 over or in conjunction with ROUGE-2 (or other finer granularity ROUGE measures), is to also show the fluency of the summaries or translation. The intuition is that if you more closely follow the word orderings of the reference summary, then your summary is actually more fluent.

Short Explanation of a few Different ROUGE measures

  • ROUGE-N - measures unigram, bigram, trigram and higher order n-gram overlap
  • ROUGE-L -  measures longest matching sequence of words using LCS. An advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence level word order. Since it automatically includes longest in-sequence common n-grams, you don't need a predefined n-gram length.
  • ROUGE-S - Is any pair of word in a sentence in order, allowing for arbitrary gaps. This can also be called skip-gram coocurrence. For example, skip-bigram measures the overlap of word pairs that can have a maximum of two gaps in between words. As an example, for the phrase "cat in the hat" the skip-bigrams would be "cat in, cat the, cat hat, in the, in hat, the hat". 
For more in-depth information about these evaluation metrics you can refer to Lin's paper. Which measure to use depends on the specific task that you are trying to evaluate. If you are working on extractive summarization with fairly verbose system and reference summaries, then it may make sense to use ROUGE-1 and ROUGE-L. For very concise summaries, ROUGE-1 alone may suffice especially if you are also applying stemming and stop word removal. 

ROUGE Evaluation Packages

Papers to Read

Mar 7, 2016

Giving an IDF Effect to Probability Count of Words

The probability of an n-gram  or word tells you how important a word is to a document. Unfortunately, since a document almost always has important words inter-mingled with conjunctions and determiners and other types of noise, using the raw probabilities may not be a good idea. One thing that you can do, is to normalize these probabilities using a larger collection.  So let us say, you have 3 documents and you want to find the most important  words in a document. Assume that your document looks like this:

Doc1 { the cat and dog and mouse ate cheese }
Doc2 { the mouse in the house was lovely }
Doc3 { the mouse in the jungle was dead }

Now, let us look at the word counts.

Entire Collection (i.e. Doc1 + Doc2 + Doc3):

  "results": [
    " the:5",
    " mouse:3",
    " in:2",
    " and:2",
    " was:2",
    " house:1",
    " jungle:1",
    " cat:1",
    " lovely:1",
    " ate:1",
    " cheese:1",
    " dead:1",
    " dog:1"

Counts for Doc1:

Doc 1 {
  "results": [
    " and:2",
    " mouse:1",
    " cat:1",
    " ate:1",
    " the:1",
    " dog:1",
    " cheese:1"

Counts for Doc2: 

Doc 2 {
  "results": [
    " the:2",
    " house:1",
    " in:1",
    " mouse:1",
    " lovely:1",
    " was:1"

Counts for Doc3: 

Doc 3 {
  "results": [
    " the:2",
    " in:1",
    " jungle:1",
    " mouse:1",
    " dead:1",
    " was:1"

We will refer to the counts of the entire collection as CollectionCount and for the individual document as DocumentCount. Now, if you look at the probability of the word 'the' in Doc1 , it is 1/7 =0.143 which was obtained as follows:

The word 'the' appeared once in Doc1 and Doc1 has an overall word count of  7. Now, the probability of 'mouse' in Doc1 is also 1/7 =0.143 obtained as follows:

By just looking at Doc1, even though we know that the word 'mouse' is more important to Doc1 than the word 'the' (`the' as we know is used very frequently in English), the probabilities do not accurately reflect this. A system would think that these two words are equally important because the word 'mouse' and 'the' have identical counts and probabilities. One way to address this is to, give an IDF effect to the probabilities. Meaning, words that appear very frequently across the entire collection should be penalized more than words that appear in fewer documents on the whole. You can actually do this by normalizing the probability of words using probabilities from the entire Collection by using the CollectionCount.  So the updated scores of 'mouse' and 'the' will be computed as follows with this normalization strategy:

There you go! Now,  the word 'the' has a lower score than mouse even though the 'within document' probabilities are the same. Note that this example is based on a very very small sample, so the same theory may not hold for some of the other words in the example documents. However, when you actually take a large sample,  the numbers will accurately reflect the typical distribution of words in a document and a collection of documents. This type of normalization is powerful when you rank the words/n-grams in descending order of scores. Also, take note that when your sample is really large, the probabilities will get really small, so it is always better to utilize log of probabilities as it makes it easier for analysis. For example, you can do the following:

Don't get intimidated by the negative sign, when you rank the words in descending order of scores, it will all make sense.  

Mar 3, 2016

Untold Truths About Text Mining and Web Mining in Practice

Truth #1: There is no such thing as "THE TOOL" for text mining

People often ask me, "So what tool do you use for your text analysis projects"...This is how it typically goes..I stare at them not sure whether I should laugh or cry, they stare at me thinking I am crazy and there is a few seconds of uncomfortable silence. I usually end up with a single tear in one eye and a partial half-hearted smile...then the confused other person clarifies him/her self by saying..."actually, what I meant was do you use NLTK or GATE". Now the tears start to pour. Well, the truth is, it all depends on the problem!! There is no magic tool or tools. If I am scraping the Web, I might use something like Nutch or write a custom crawler (no, not in python!) that does crawling and scraping at the same time. If I am building a text classifier I may just use Mallet with simple stemming where needed. If I am doing entity linking I might just use some similarity measures like Jaccard or Cosine. It also depends on the programming language I am working on. If I am using Python, I may choose to use scikit learn for text classification instead of Mallet. So, there really is no such thing as THE TOOL for text analysis - it always depends on what you are trying to solve and in which development environment. The idea of "THE TOOL" probably comes from some irresponsible advertising where you are promised that you can solve the world's text mining problems with this one tool.

Truth #2: Text Mining is 20% engineering, 40% algorithms and 40% science/statistics

If you think text mining or web mining is just pure statistics or science where you can for example, apply black box Machine Learning and solve the entire problem, you are in for a big surprise. Many text mining projects start with intelligent algorithms (from focused crawling, to compact text representation to graph search) with much of the code written from scratch coupled with some statistical modeling to solve very specific parts of the problem (e.g. CRF based web page segmentation). Developing the custom algorithms and tying the pieces together requires some level of engineering. This is why text mining and analytics is so appealing to some, as there is always going to be new challenges in every new problem you work on - whether its engineering problems or types of algorithms to use - its always a fun and challenging journey. Even a change in domain, can significantly change the scope of the problem.

Truth # 3: Not all data scientists can work on text mining / web mining / NLP projects

Data science is a very broad term and it encapsulates BI analysts, NLP Experts and ML Experts. The way data scientists are being trained today (outside academic programs), is that most of them get hands on experience working on structured data using R or Python and are mostly trained in descriptive statistics and supervised machine learning. Text and language is a whole other ball game. The language expert type of data scientist must have significant NLP knowledge as well as some knowledge in information retrieval, data mining, breadth of AI, as well as creative algorithm development capabilities.

Now you might be thinking so what is the whole point of GATE and NLTK and that Alchemy API for gods sake! GATE and NLTK and Alchemy type NLP APIs are just some tools to help you get started with text mining type projects. These tools provide some core functionality such as stemming, sentencing, chunking, etc. You are not required to use these tools. If you find other off the shelf tools that are easy to integrate into your code, go ahead and use it! If you are already working with python, then NLTK is always an option for you to use. The bottom line is, pick and choose and use *only* what you REALLY need, not something that looks or sounds cool as this just adds unneeded overhead and makes debugging a living nightmare.

Jan 19, 2016

How to Get Access to the UMLS Metathesaurus?

The UMLS Metathesaurus is 
a repository containing millions of biomedical and health related concepts, their synonymous names, and their relationships. You can think of it as a meta-dictionary (dictionary over over dictionaries), pulling in classification systems and dictionaries such as ICD-10, ICD-9, SNOMED, RxNORM, Mesh and many many more. The Metathesaurus has over 150 dictionaries/classification systems.

To gain access to the UMLS Metathesaurus you have to:

1. Fill up an application form here: https://uts.nlm.nih.gov//license.html . This is reviewed by someone so, make sure you fill this form up correctly.

2. You will receive a confirmation email and after several days, and you will be given access if approved. 

3. You can then sign-in and you will have access to the entire database. Note that you will have to renew your license annually

Once you have access, to start exploration I would recommend that you start with the SNOMED CT browser (which is a subset) and then move on to the UMLS Metathesaurus browser (which is basically the meta-dictionary).

Nov 8, 2015

What is Text Similarity?

When talking about text similarity, different people have a slightly different notion on what text similarity means. In essence, the goal is to compute how 'close' two pieces of text are in (1) meaning or (2) surface closeness. The first is referred to as semantic similarity and the latter is referred to as lexical similarityAlthough the methods for lexical similarity are often used to achieve semantic similarity (to a certain extent), achieving true semantic similarity is often much more involved. In this article, I mainly focus on lexical similarity as it has the most use from a practical stand-point and then I briefly introduce semantic similarity.

Lexical or Word Level Similarity

For the most part, when referring to text similarity, people actually refer to how similar two pieces of text are at the surface level. For example, how similar are the phrases "the cat ate the mouse" with "the mouse ate the cat food" by just looking at the words?  On the surface, if you consider only word level similarity, these two phrases (with determiners disregarded) appear very similar as 3 of the 4 unique words are an exact overlap.

This notion of similarity is often referred to as lexical similarity. It typically does not take into account the actual meaning behind words or the entire phrase in context. While the actual meaning of the phrases is often disregarded this does not mean that computing similarity in this way is ineffective. You can actually come up with creative ways of expanding the scope of each word or a set of related words (lexical chains) to improve the similarity between two pieces of text being compared. For instance, if you are comparing similarity between phrases from newspaper articles. You could potentially use N non-determiner words to the left and to the right of the current word in the phrase for a simple scope expansion. Instead of doing a word for word comparison,  you are essentially providing more context. This is analogous to expanding a search query. Imagine if you were to compare the similarity between your search query, and all the documents on the Web so that you can get the best results matching your query. How would you expand your query? The same thought process can be applied in improving lexical level similarity measures.


Another point to note is that lexical similarity can be computed at various granularity. You can compute lexical similarity at the character level, word level (as shown earlier) or at a phrase  level (or lexical chain level) where you break a piece of text into a group of related words prior to computing similarity. Character level similarity is also known as string similarity/matching and is commonly used to determine how close two strings are. For example how close are the names 'Kavita Ganesan' and 'Kavita A Ganesan' ? Pretty close! You can use the common metrics outlined below for string similarity or you can use edit distance  type of measures to quantify how dissimilar two strings are. In essence, you are trying to compute the minimum number of operations required to transform one string into the other.

Common Metrics

Some of the most common metrics for computing similarity between two pieces of text are the Jaccard coefficient, Dice and Cosine similarity all of which have been around for a very long time. Jaccard and Dice are actually really simple as you are just dealing with sets. Here is how you can compute Jaccard:

Simply put, this is the intersection of the sets divided by the union of the sets. Your resulting value will be between [0,1], so you can set a threshold as needed. I have found that a threshold of 0.6 and above is pretty effective in detecting phrases that are similar (maximum length of phrase is 10 words). For longer texts this value could be smaller or if you only care for marginal overlap, again this value could be much smaller. The more you normalize, pre-process and filter your text (e.g. stem, remove noise, remove stop words), the better the outcome of your text similarity measure using simple measures such as Jaccard.

Where is lexical similarity used?

Clustering - if you want to group similar texts together how can you tell if two groups of text are even similar?

Redundancy removal - if two pieces of texts are so similar, why do you need both? You can always eliminate the redundant one. Think of duplicate product listings, or the same person in your database, with slight variation in the name or even html pages that are near duplicates.

Information Retrieval - you could use the more established information retrieval measures like BM25, PL2, etc. But you could also use a measure like cosine (for longer texts) or jaccard and dice for (shorter texts).

Semantic Similarity

So far, we have talked about lexical similarity. Another notion of similarity mostly explored by the NLP research community is how similar in meaning are any two phrases?  If we look at the phrases, "the cat ate the mouse" and "the mouse ate the cat food", we know that while the words significantly overlap, these two phrases actually have different meaning. Getting the meaning out of the phrases is often a more difficult task as it requires deeper level of analysis. In this example, we can actually look at simple aspects like order of words: "cat==>ate==>mouse" and "mouse==>ate==>cat food". Although the words overlap in this case, the order of occurrence is different and from that we can tell that these two phrases actually have different meaning. This is just one simple example. Most people use syntactic parsing to help with semantic similarity. Let's look at the parse trees for these two phrases. What can you get from it?

You can get phrases out of the parse (e.g. "cat food"), dependency structure (e.g. mouse is the object of ate in the first case and food is the object of ate in the second case) as well as parts of speech (nouns, verbs, adjectives and etc.)  - all of which can be used in different ways to estimate semantic similarity. Semantic similarity is often used to address NLP tasks such as paraphrase identification and automatic question answering. To get a better understanding of semantic similarity and paraphrasing you can refer to some of the articles below.

Text Similarity Tools and APIs

Related Articles:

Jun 25, 2015

Getting Started with Spark on Windows

This article talks about how to get started with the Spark shell on Windows. Based on the documentation on Spark's website it seemed like a breeze to get started. But, there were several mistakes I made which took me longer to get started than I had expected. This is what worked for me on a Windows machine:
  1. First, download Spark with a hadoop distribution: http://spark.apache.org/downloads.html

  2. Next, go to Windows=>Run and type cmd to get the DOS command prompt.

    *Note: Take note that Cygwin may not work. You will have to use the DOS prompt.

  3. Change directory into the Spark installation directory (home directory)

  4. Next, at the command prompt, type
  5. You should see something like this:

  6. Once Spark has started, you will see a prompt "scala>".

  7. If Spark correctly initialized, if you type:
    val k=5+5  
    at the command prompt, you should get back:
    k: Int = 10
    If you don't then Spark did not start correctly.

  8. Another check to do is to go to your Web browser and type http://localhost:port_that_spark_started_on. This value can be found in the start up screen. It is usually 4040, but it can be some other value if Spark had issues binding to that specific port.

Mar 3, 2015

Scheduling Automatic Restart of Windows 7

One of the lessons I have learnt with server administration is that after a while of running tomcat servers, the system can get really sluggish from all the "living objects". One of the solutions for this is to do a daily scheduled reboot of the server to keep the system fresh and performing at its best (of course you also need to fix the reason as to why objects are not being pushed into garbage collection).

So, here is how you create a scheduled restart for Windows machines (This was tested for Windows 7):
  1. Start Notepad,
  2. type c:\windows\system32\shutdown.exe -F -R
  3. Save the file as "AutoRestart.bat" somewhere on disk
  4. Now, Launch Task Scheduler.
  5. Click Action and select Create Basic task.
  6. Type AutoRestart (or others you want) in the Name box and click Next.
  7. Select Daily and click Next
  8. Type the time you want to restart the computer and click Next.
  9. Select Start a program and click Next.
  10. Click Browse and navigate to where you saved AutoRestart.bat, select the AutoRestart.bat file and click Next.
  11. Click Finish.
  12. To test if your auto-restart works, go back to task scheduler and when you see the list of scheduled tasks, right click on AutoRestart and select run.

Feb 13, 2015

What is the Notion of NLP in Computer Science vs. Biomedical Informatics vs. Information Systems (MIS)?

Computer Science: In computer science research, when we talk about NLP methods we are referring to specific methods that use some linguistics properties or structural properties of text often coupled with statistics to solve targetted language and text mining problems. The NLP parts are those "customized methods" that we use to solve specific problems. Thus, the notion of NLP here is rather broad as any novel approach that does not just re-use existing tools could be considered NLP or Computational Linguistics.

Biomedical Informatics: In the pure biomedical informatics world however, I often notice that the idea of what NLP is, is quite narrow. People tend to refer to usage of tools such as NegEx Detection, POS Taggers, Parsers and Stemming as NLP. While technically, the underlying technology in these tools can involve lightweight to heavy NLP, the applications of these tools in specific problems are not necessarily "NLP". It is more of applications of NLP tools or text-processing using NLP tools. To be technically correct, only when these NLP tools are used in combination with additional learning methods or customized text mining algorithms, this would probably qualify as actual NLP, solving a specific problem.

Information Systems (MIS): The concept of what qualifies as NLP (and text mining) in the information systems (MIS) world tends to be a lot more shallow and ill-defined than biomedical informatics or computer science where even counting word frequencies seems to be considered a form of NLP. In fact, even getting a binary classifier to work on text is considered fairly involved NLP. This may seem slightly exaggerated, but when you start reviewing papers from some of the MIS departments through TOIS or TKDE you will start to understand how ill-defined NLP can really be. 

The point here is that there are different flavors of NLP both in research and in practice and the notion of NLP varies from department to department. So when you are publishing obvious methods to conferences where the audience primarily consists of core computer scientists, you will get dinged really hard for the lack of originality in your proposed methods. At the same time, if you submit a really effective algorithm that does not use existing NLP tools to a biomedical informatics journal, again you may get dinged but this time for not using existing tools or as a friend of mine once told me "the reviewer had  philosophical issues with my submission". Also, be warned that if a physician comes and tells you hey, I am also doing NLP (heard quite a few of those), don't be surprised that what he or she means is that he is running some NLP tools on some patient data. And if someone from the IS department of a business school talks to you about offering a text mining course, it usually means they are teaching you how to use R or SaS to do some text processing and querying :D, not to mention they will also refer to this as "Big Data". 

Jan 22, 2015

'Something' Analytics the New Buzz Word?

From Big Data to Data Science to Machine Learning to Natural Language Processing, and now the next new Buzz Word I've been hearing lately is "Something" Analytics.  I've always heard of Google Analytics, Healthcare Analytics and Text Analytics, but here's a list of some new analytics buzz words. Some - you've already heard of, some might surprise you:

Natural Language Analytics - referring to the use of NLP

Review Analytics - analysis capability for user reviews.

Sentiment Analytics - well, basically sentiment analysis

Sports Analytics - statistics + sports

Onco Analytics - analysis of cancer treatment and success rates

Pinterest Analytics analysis of pins

Twitter Analytics -  analysis of your Tweet activities

Advertising Analytics -  analysis of how your ads worked across media and sales channels

Strategy Analytics - analysis and recommendation of business strategies

Academic Analytics process of evaluating and analysing organisational data received from university systems for reporting and decision making

Prayas Analytics - quantifies the offline customer experience for brick-and-mortar retailers (company).

Aging Analytics - research in biotechnology and regenerative medicine

Learning Analytics - analysis and reporting of data about learners and their contexts, for optimizing learning and the environments in which it occurs

'Something Analytics' refers to the use of predictive models? 

No, absolutely not! The term analytics is actually overloaded. Some companies and institution use the term 'analytics' to refer to the use of predictive models as the underlying technology. In many cases though, it just refers to the data-driven analysis, reporting and decision making capability in a specific domain. Bottom line - when you see the term 'analytics' in use it usually means some form of data analysis in plain traditional terms. 

Jan 11, 2015

Micropinions vs. Micro-reviews

I was recently asked "What's the difference between micropinions and micro-reviews?" based on my paper "Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions". The two concepts are actually fairly synonymous. Micropinion refers to a set of short phrases expressing opinions on a specific topic or entity. Here is an example of a micropinion summary of a restaurant:
As you can see, the size of micropinions are small enough that they would easily fit within the display of a hand-held device. Constraints can be imposed to limit the size of the overall generated micropinions. Micro-reviews on the other hand typically refer to reviews posted on social media websites like Twitter and Four Squares where the goal is to provide a review of a product or some service (e.g. restaurant) within the character limits imposed by these social media websites. Here are a few example of micro-reviews from Twitter:
So in essence, the terms micro-reviews and micropinions refer the same concept: Very concise opinions about an entity or topic often adhering to some constraints (e.g. character limit or word limit ). Hope this explains!