Oct 16, 2014

How do we evaluate machine-generated textual summaries?

Text summarization is a hard problem! Evaluating textual summaries and summarization systems is just as hard. It is not as straightforward as computing precision and recall as in supervised learning problems. I have worked on several summarization papers, and the number one comment I get is: we don't really know if your summarization method is truly effective, even though the ROUGE scores say the summaries are actually not bad. It sometimes boils down to the question of what exactly makes a summary beneficial to readers. Is it supposed to inform readers about the most essential 'event' or 'topic', or the most important opinions on a topic?

So in general, there are two questions: utility and accuracy. When I say accuracy in this context, I mean how readable, well-formed, and on-topic the generated summaries are in comparison to a gold standard. Topic relevance can usually be measured based on agreement with human-composed summaries. Well-formedness, cohesion, and readability usually require a human assessor to judge the quality of the generated summaries with regard to these aspects.

Utility of summaries is mainly judged from the standpoint of the user. Did the summary fulfill its goal? This would require a few users to read the unsummarized documents and then the summaries to decide how well the summarization system performed. You will have to come up with a creative evaluation metric for this manual assessment. For example, in one of my abstractive summarization papers, we wanted to measure the readability aspect. So, we mixed the human-composed summary phrases with the system-generated summary phrases for each summarization task. We then asked a human assessor to pick out the summaries that seemed to be generated by the system (based on grammatical errors, incompleteness, etc.). We call this the readability score, and this is how it is computed:

# of correctly picked system-generated phrases / total # of system-generated phrases
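Here is a minimal sketch of that computation in Python. The phrase lists and the assessor's picks below are made-up inputs just to illustrate the formula; in the actual study they came from mixing human and system phrases for each summarization task.

```python
# Sketch of the readability-score computation described above.
def readability_score(system_phrases, assessor_picks):
    """Fraction of system-generated phrases the assessor correctly
    flagged as machine-generated (lower = harder to tell apart)."""
    system_set = set(system_phrases)
    correctly_picked = sum(1 for p in assessor_picks if p in system_set)
    return correctly_picked / len(system_set)

# Hypothetical example:
system_phrases = ["good battery life overall", "screen very bright and clear"]
assessor_picks = ["good battery life overall", "the display is crisp"]

print(readability_score(system_phrases, assessor_picks))  # 0.5 -> only half spotted
```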

Surprisingly, the assessor could not tell the difference in many of the cases, which meant the system summaries were fairly well-formed, making it hard to distinguish human-composed phrases from machine-generated ones. More details on this readability test are available in the following paper:

Ganesan, Kavita, ChengXiang Zhai, and Jiawei Han. "Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions." Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.

The most common automatic evaluation for summarization systems is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which is an adaptation of IBM's BLEU. ROUGE works by measuring n-gram overlap between gold-standard summaries and system summaries. With this it can measure topic relevance and, in some cases, fluency, especially when you use higher-order ROUGE-N. I don't really 'trust' the fluency part because there are many ways to express the same information. So use ROUGE-N > 1 with caution.
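To make the n-gram overlap idea concrete, here is a minimal sketch of ROUGE-N recall. This is not the official ROUGE toolkit, just a bare-bones illustration on whitespace tokens with no stemming or stopword handling, and the example sentences are made up:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N recall: overlapping n-grams / n-grams in the reference.
    Counts are clipped so repeated n-grams are not over-credited."""
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts =Ounter = Counter(ngrams(candidate.lower().split(), n))
    overlap = sum(min(cnt, cand_counts[g]) for g, cnt in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the battery life is very good"
candidate = "battery life is good overall"
print(rouge_n_recall(reference, candidate, n=1))  # unigram overlap, fairly forgiving
print(rouge_n_recall(reference, candidate, n=2))  # bigram overlap, much stricter
```

Notice how the score drops as n grows: higher-order ROUGE-N rewards exact phrasing, which is exactly why paraphrased but perfectly fluent summaries can score poorly.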