Nov 8, 2015

What is Text Similarity?

When talking about text similarity, different people have a slightly different notion on what text similarity means. In essence, the goal is to compute how 'close' two pieces of text are in (1) meaning or (2) surface closeness. The first is referred to as semantic similarity and the latter is referred to as lexical similarityAlthough the methods for lexical similarity are often used to achieve semantic similarity (to a certain extent), achieving true semantic similarity is often much more involved. In this article, I mainly focus on lexical similarity as it has the most use from a practical stand-point and then I briefly introduce semantic similarity.

Lexical or Word Level Similarity

For the most part, when referring to text similarity, people actually refer to how similar two pieces of text are at the surface level. For example, how similar are the phrases "the cat ate the mouse" with "the mouse ate the cat food" by just looking at the words?  On the surface, if you consider only word level similarity, these two phrases (with determiners disregarded) appear very similar as 3 of the 4 unique words are an exact overlap.

$Overlap=\ 'cat\ ate\ mouse'\ \cap\ 'mouse\ ate\ cat\ food\ '=3$

This notion of similarity is often referred to as lexical similarity. It typically does not take into account the actual meaning behind words or the entire phrase in context. While the actual meaning of the phrases is often disregarded this does not mean that computing similarity in this way is ineffective. You can actually come up with creative ways of expanding the scope of each word or a set of related words (lexical chains) to improve the similarity between two pieces of text being compared. For instance, if you are comparing similarity between phrases from newspaper articles. You could potentially use N non-determiner words to the left and to the right of the current word in the phrase for a simple scope expansion. Instead of doing a word for word comparison,  you are essentially providing more context. This is analogous to expanding a search query. Imagine if you were to compare the similarity between your search query, and all the documents on the Web so that you can get the best results matching your query. How would you expand your query? The same thought process can be applied in improving lexical level similarity measures.

Granularity

Another point to note is that lexical similarity can be computed at various granularity. You can compute lexical similarity at the character level, word level (as shown earlier) or at a phrase  level (or lexical chain level) where you break a piece of text into a group of related words prior to computing similarity. Character level similarity is also known as string similarity/matching and is commonly used to determine how close two strings are. For example how close are the names 'Kavita Ganesan' and 'Kavita A Ganesan' ? Pretty close! You can use the common metrics outlined below for string similarity or you can use edit distance  type of measures to quantify how dissimilar two strings are. In essence, you are trying to compute the minimum number of operations required to transform one string into the other.

Common Metrics

Some of the most common metrics for computing similarity between two pieces of text are the Jaccard coefficient, Dice and Cosine similarity all of which have been around for a very long time. Jaccard and Dice are actually really simple as you are just dealing with sets. Here is how you can compute Jaccard:

Simply put, this is the intersection of the sets divided by the union of the sets. Your resulting value will be between [0,1], so you can set a threshold as needed. I have found that a threshold of 0.6 and above is pretty effective in detecting phrases that are similar (maximum length of phrase is 10 words). For longer texts this value could be smaller or if you only care for marginal overlap, again this value could be much smaller. The more you normalize, pre-process and filter your text (e.g. stem, remove noise, remove stop words), the better the outcome of your text similarity measure using simple measures such as Jaccard.

Where is lexical similarity used?

Clustering - if you want to group similar texts together how can you tell if two groups of text are even similar?

Redundancy removal - if two pieces of texts are so similar, why do you need both? You can always eliminate the redundant one. Think of duplicate product listings, or the same person in your database, with slight variation in the name or even html pages that are near duplicates.

Information Retrieval - you could use the more established information retrieval measures like BM25, PL2, etc. But you could also use a measure like cosine (for longer texts) or jaccard and dice for (shorter texts).

Semantic Similarity

So far, we have talked about lexical similarity. Another notion of similarity mostly explored by the NLP research community is how similar in meaning are any two phrases?  If we look at the phrases, "the cat ate the mouse" and "the mouse ate the cat food", we know that while the words significantly overlap, these two phrases actually have different meaning. Getting the meaning out of the phrases is often a more difficult task as it requires deeper level of analysis. In this example, we can actually look at simple aspects like order of words: "cat==>ate==>mouse" and "mouse==>ate==>cat food". Although the words overlap in this case, the order of occurrence is different and from that we can tell that these two phrases actually have different meaning. This is just one simple example. Most people use syntactic parsing to help with semantic similarity. Let's look at the parse trees for these two phrases. What can you get from it?

You can get phrases out of the parse (e.g. "cat food"), dependency structure (e.g. mouse is the object of ate in the first case and food is the object of ate in the second case) as well as parts of speech (nouns, verbs, adjectives and etc.)  - all of which can be used in different ways to estimate semantic similarity. Semantic similarity is often used to address NLP tasks such as paraphrase identification and automatic question answering. To get a better understanding of semantic similarity and paraphrasing you can refer to some of the articles below.