### Giving an IDF Effect to Probability Count of Words

The probability of an n-gram or word tells you how important that word is to a document. Unfortunately, since a document almost always has important words inter-mingled with conjunctions, determiners and other types of noise, using the raw probabilities may not be a good idea. One thing you can do is normalize these probabilities using a larger collection. So let us say you have 3 documents and you want to find the most important words in each document. Assume that your documents look like this:

Doc1 { the cat and dog and mouse ate cheese }
Doc2 { the mouse in the house was lovely }
Doc3 { the mouse in the jungle was dead }


Now, let us look at the word counts.

#### Entire Collection (i.e. Doc1 + Doc2 + Doc3):

{
"results": [
"the:5",
"mouse:3",
"in:2",
"and:2",
"was:2",
"house:1",
"jungle:1",
"cat:1",
"lovely:1",
"ate:1",
"cheese:1",
"dog:1",
"dead:1"
]
}

#### Counts for Doc1:

Doc 1 {
"results": [
"and:2",
"mouse:1",
"cat:1",
"ate:1",
"the:1",
"dog:1",
"cheese:1"
]
}



#### Counts for Doc2:

Doc 2 {
"results": [
"the:2",
"house:1",
"in:1",
"mouse:1",
"lovely:1",
"was:1"
]
}



#### Counts for Doc3:

Doc 3 {
"results": [
"the:2",
"in:1",
"jungle:1",
"mouse:1",
"was:1",
"dead:1"
]
}


We will refer to the counts over the entire collection as CollectionCount and the counts for an individual document as DocumentCount. Now, if you look at the probability of the word 'the' in Doc1, it is 1/8 = 0.125, which was obtained as follows:

$P(the)_{Doc1}=\frac{DocumentCount_{Doc1}(the)}{OverallWordCount_{Doc1}}$

The word 'the' appeared once in Doc1, and Doc1 has an overall word count of 8. Now, the probability of 'mouse' in Doc1 is also 1/8 = 0.125, obtained as follows:

$P(mouse)_{Doc1}=\frac{DocumentCount_{Doc1}(mouse)}{OverallWordCount_{Doc1}}$

Just looking at Doc1, even though we know that the word 'mouse' is more important to Doc1 than the word 'the' ('the', as we know, is used very frequently in English), the probabilities do not reflect this. A system would think that these two words are equally important, because 'mouse' and 'the' have identical counts and probabilities. One way to address this is to give an IDF effect to the probabilities. Meaning, words that appear very frequently across the entire collection should be penalized more than words that appear in fewer documents on the whole. You can do this by normalizing the probability of each word with its probability in the entire collection, computed from the CollectionCount. The collection has 22 words in total, so $P(mouse)_{EntireCollection}=3/22=0.136$ and $P(the)_{EntireCollection}=5/22=0.227$. The updated scores of 'mouse' and 'the' will be computed as follows with this normalization strategy:

$score(mouse)_{Doc1}=\frac{P(mouse)_{Doc1}}{P(mouse)_{EntireCollection}}=\frac{0.125}{0.136}=0.92$

$score(the)_{Doc1}=\frac{P(the)_{Doc1}}{P(the)_{EntireCollection}}=\frac{0.125}{0.227}=0.55$

There you go! Now the word 'the' has a lower score than 'mouse' even though the within-document probabilities are the same. Note that this example is based on a very small sample, so the same effect may not hold for some of the other words in the example documents. However, when you take a large sample, the numbers will accurately reflect the typical distribution of words in a document and in a collection of documents. This type of normalization is powerful when you rank the words/n-grams in descending order of scores. Also, take note that when your sample is really large, the probabilities get really small, so it is better to work with the log of the probabilities, which makes analysis easier. For example, you can do the following:

$score(the)_{Doc1}=\log_2\left(\frac{P(the)_{Doc1}}{P(the)_{EntireCollection}}\right)=\log_2\left(\frac{0.125}{0.227}\right)=-0.86$

Don't be intimidated by the negative sign; when you rank the words in descending order of scores, it will all make sense.
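To make the arithmetic concrete, here is a minimal Python sketch of the whole computation over the three example documents (the function and variable names are just illustrative):

```python
import math
from collections import Counter

# The three example documents from above
docs = {
    "Doc1": "the cat and dog and mouse ate cheese".split(),
    "Doc2": "the mouse in the house was lovely".split(),
    "Doc3": "the mouse in the jungle was dead".split(),
}

# Per-document counts (DocumentCount) and collection counts (CollectionCount)
doc_counts = {name: Counter(words) for name, words in docs.items()}
collection_counts = Counter(w for words in docs.values() for w in words)
collection_size = sum(collection_counts.values())

def score(word, doc_name):
    """log2( P(word|doc) / P(word|collection) ) -- the IDF-style score."""
    p_doc = doc_counts[doc_name][word] / sum(doc_counts[doc_name].values())
    p_coll = collection_counts[word] / collection_size
    return math.log2(p_doc / p_coll)

# Rank Doc1's words by score, descending; 'the' drops to the bottom
ranked = sorted(doc_counts["Doc1"], key=lambda w: score(w, "Doc1"), reverse=True)
```

Ranking a document's words by this score pushes frequent function words like 'the' to the bottom of the list, which is exactly what the normalization is meant to achieve.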

## Truth #1: There is no such thing as "THE TOOL" for text mining

People often ask me, "So what tool do you use for your text analysis projects?"... This is how it typically goes: I stare at them, not sure whether I should laugh or cry; they stare at me thinking I am crazy; and there are a few seconds of uncomfortable silence. I usually end up with a single tear in one eye and a partial, half-hearted smile... then the confused other person clarifies by saying, "actually, what I meant was, do you use NLTK or GATE?" Now the tears start to pour. Well, the truth is, it all depends on the problem!! There is no magic tool or tools. If I am scraping the Web, I might use something like Nutch or write a custom crawler (no, not in Python!) that does crawling and scraping at the same time. If I am building a text classifier I may just use Mallet with simple stemming where needed. If I am doing entity linking I might just use similarity measures like Jaccard or Cosine. It also depends on the programming language I am working in. If I am using Python, I may choose scikit-learn for text classification instead of Mallet. So, there really is no such thing as THE TOOL for text analysis - it always depends on what you are trying to solve and in which development environment. The idea of "THE TOOL" probably comes from some irresponsible advertising that promises you can solve the world's text mining problems with this one tool.

## Truth #2: Text Mining is 20% engineering, 40% algorithms and 40% science/statistics

If you think text mining or web mining is just pure statistics or science where you can, for example, apply black-box Machine Learning and solve the entire problem, you are in for a big surprise. Many text mining projects start with intelligent algorithms (from focused crawling, to compact text representation, to graph search), with much of the code written from scratch, coupled with some statistical modeling to solve very specific parts of the problem (e.g. CRF-based web page segmentation). Developing the custom algorithms and tying the pieces together requires some level of engineering. This is why text mining and analytics is so appealing to some: there are always going to be new challenges in every new problem you work on - whether it's engineering problems or the types of algorithms to use - it's always a fun and challenging journey. Even a change in domain can significantly change the scope of the problem.

## Truth #3: Not all data scientists can work on text mining / web mining / NLP projects

Data science is a very broad term, and it encapsulates BI analysts, NLP experts and ML experts. The way data scientists are being trained today (outside academic programs), most of them get hands-on experience working on structured data using R or Python and are mostly trained in descriptive statistics and supervised machine learning. Text and language is a whole other ball game. The language-expert type of data scientist must have significant NLP knowledge, as well as some knowledge in information retrieval, data mining and the breadth of AI, along with creative algorithm development capabilities.

Now you might be thinking: so what is the whole point of GATE and NLTK and that Alchemy API, for God's sake? GATE, NLTK and Alchemy-type NLP APIs are just some tools to help you get started with text mining type projects. These tools provide some core functionality such as stemming, sentence splitting, chunking, etc. You are not required to use these tools. If you find other off-the-shelf tools that are easy to integrate into your code, go ahead and use them! If you are already working with Python, then NLTK is always an option. The bottom line is: pick, choose and use *only* what you REALLY need, not something that merely looks or sounds cool, as that just adds unneeded overhead and makes debugging a living nightmare.

The UMLS Metathesaurus is a repository containing millions of biomedical and health-related concepts, their synonymous names, and their relationships. You can think of it as a meta-dictionary (a dictionary over dictionaries), pulling in classification systems and dictionaries such as ICD-10, ICD-9, SNOMED CT, RxNorm, MeSH and many more. The Metathesaurus covers over 150 dictionaries/classification systems.

1. Fill out an application form here: https://uts.nlm.nih.gov//license.html . This is reviewed by a person, so make sure you fill the form out correctly.

2. You will receive a confirmation email, and after several days you will be given access if approved.

3. You can then sign in, and you will have access to the entire database. Note that you will have to renew your license annually.

Once you have access, I would recommend that you start your exploration with the SNOMED CT browser (which covers a subset) and then move on to the UMLS Metathesaurus browser (which is basically the meta-dictionary).

### What is Text Similarity?

When talking about text similarity, different people have a slightly different notion of what text similarity means. In essence, the goal is to compute how 'close' two pieces of text are in (1) meaning or (2) surface closeness. The first is referred to as semantic similarity and the latter as lexical similarity. Although the methods for lexical similarity are often used to achieve semantic similarity (to a certain extent), achieving true semantic similarity is often much more involved. In this article, I mainly focus on lexical similarity, as it has the most use from a practical standpoint, and then I briefly introduce semantic similarity.

#### Lexical or Word Level Similarity

For the most part, when referring to text similarity, people actually refer to how similar two pieces of text are at the surface level. For example, how similar are the phrases "the cat ate the mouse" and "the mouse ate the cat food" by just looking at the words? On the surface, if you consider only word-level similarity, these two phrases (with determiners disregarded) appear very similar, as 3 of the 4 unique words are an exact overlap.

$Overlap=|\{cat,\ ate,\ mouse\}\cap\{mouse,\ ate,\ cat,\ food\}|=3$

This notion of similarity is often referred to as lexical similarity. It typically does not take into account the actual meaning behind the words or the entire phrase in context. While the actual meaning of the phrases is disregarded, this does not mean that computing similarity in this way is ineffective. You can come up with creative ways of expanding the scope of each word or a set of related words (lexical chains) to improve the similarity between the two pieces of text being compared. For instance, if you are comparing similarity between phrases from newspaper articles, you could potentially use the N non-determiner words to the left and to the right of the current word in the phrase for a simple scope expansion. Instead of doing a word-for-word comparison, you are essentially providing more context. This is analogous to expanding a search query. Imagine you were to compare the similarity between your search query and all the documents on the Web so that you can get the best results matching your query. How would you expand your query? The same thought process can be applied to improving lexical-level similarity measures.
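As a quick sketch of the word-level overlap described above (the stop word list here is deliberately tiny and purely illustrative):

```python
# A deliberately tiny stop word list covering just the determiners in the
# example phrases; a real system would use a fuller list.
STOP_WORDS = {"the", "a", "an"}

def content_words(text):
    """Lowercased word set with determiners removed."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

a = content_words("the cat ate the mouse")       # {'cat', 'ate', 'mouse'}
b = content_words("the mouse ate the cat food")  # {'mouse', 'ate', 'cat', 'food'}
overlap = a & b  # 3 of the 4 unique words overlap
```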

### Granularity

Another point to note is that lexical similarity can be computed at various granularities. You can compute lexical similarity at the character level, the word level (as shown earlier) or at the phrase level (or lexical chain level), where you break a piece of text into a group of related words prior to computing similarity. Character-level similarity is also known as string similarity/matching and is commonly used to determine how close two strings are. For example, how close are the names 'Kavita Ganesan' and 'Kavita A Ganesan'? Pretty close! You can use the common metrics outlined below for string similarity, or you can use edit distance type measures to quantify how dissimilar two strings are. In essence, edit distance computes the minimum number of operations required to transform one string into the other.
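To make edit distance concrete, here is a minimal sketch of the classic Levenshtein dynamic program (a textbook implementation, not tied to any particular library):

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (Levenshtein distance)."""
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete cs from s
                            curr[j - 1] + 1,             # insert ct into s
                            prev[j - 1] + (cs != ct)))   # substitute cs -> ct
        prev = curr
    return prev[-1]

# 'Kavita Ganesan' vs 'Kavita A Ganesan': just two insertions ('A' and a space)
edit_distance("Kavita Ganesan", "Kavita A Ganesan")
```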

### Common Metrics

Some of the most common metrics for computing similarity between two pieces of text are the Jaccard coefficient, Dice and Cosine similarity, all of which have been around for a very long time. Jaccard and Dice are actually really simple, as you are just dealing with sets. Here is how you can compute Jaccard:

$Jaccard(A,B)=\frac{|A\cap B|}{|A\cup B|}$

Simply put, this is the intersection of the sets divided by the union of the sets. The resulting value will be in [0, 1], so you can set a threshold as needed. I have found that a threshold of 0.6 and above is pretty effective in detecting phrases that are similar (with a maximum phrase length of 10 words). For longer texts this value could be smaller, or if you only care about marginal overlap, again this value could be much smaller. The more you normalize, pre-process and filter your text (e.g. stem, remove noise, remove stop words), the better the outcome of your text similarity measure using simple measures such as Jaccard.
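As a minimal sketch, Jaccard and Dice over word sets take only a few lines (tokenization here is naive whitespace splitting, so in practice you would pre-process the text first, as noted above):

```python
def jaccard(text_a, text_b):
    """|A ∩ B| / |A ∪ B| over the word sets of the two texts."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)

def dice(text_a, text_b):
    """2|A ∩ B| / (|A| + |B|) over the word sets of the two texts."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return 2 * len(a & b) / (len(a) + len(b))

# intersection = {the, cat, ate, mouse} = 4 words, union = 5 words -> 4/5
sim = jaccard("the cat ate the mouse", "the mouse ate the cat food")
```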

### Where is lexical similarity used?

Clustering - if you want to group similar texts together, how can you tell whether two pieces of text are even similar?

Redundancy removal - if two pieces of text are that similar, why do you need both? You can always eliminate the redundant one. Think of duplicate product listings, the same person in your database with slight variations in the name, or HTML pages that are near-duplicates.

Information Retrieval - you could use the more established information retrieval measures like BM25, PL2, etc. But you could also use a measure like Cosine (for longer texts) or Jaccard and Dice (for shorter texts).
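For completeness, here is a sketch of Cosine similarity over raw term-frequency vectors (a real retrieval system would typically apply TF-IDF weighting first):

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Unlike Jaccard, Cosine accounts for how often each word occurs, which is why it tends to behave better on longer texts.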

#### Semantic Similarity

So far, we have talked about lexical similarity. Another notion of similarity, mostly explored by the NLP research community, is how similar in meaning any two phrases are. If we look at the phrases "the cat ate the mouse" and "the mouse ate the cat food", we know that while the words significantly overlap, these two phrases actually have different meanings. Getting the meaning out of the phrases is often a more difficult task, as it requires a deeper level of analysis. In this example, we can actually look at simple aspects like the order of words: "cat==>ate==>mouse" and "mouse==>ate==>cat food". Although the words overlap, the order of occurrence is different, and from that we can tell that these two phrases actually have different meanings. This is just one simple example. Most people use syntactic parsing to help with semantic similarity. Let's look at the parse trees for these two phrases. What can you get from them?

You can get phrases out of the parse (e.g. "cat food"), dependency structure (e.g. mouse is the object of ate in the first case and food is the object of ate in the second case), as well as parts of speech (nouns, verbs, adjectives, etc.) - all of which can be used in different ways to estimate semantic similarity. Semantic similarity is often used to address NLP tasks such as paraphrase identification and automatic question answering. To get a better understanding of semantic similarity and paraphrasing, you can refer to some of the articles below.
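As a toy illustration of how word order hints at meaning (this is only a heuristic, not a real semantic similarity measure), comparing ordered word bigrams already separates the two phrases even though their word sets barely differ:

```python
def word_bigrams(text):
    """Set of ordered word pairs from the text."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

a = word_bigrams("the cat ate the mouse")
b = word_bigrams("the mouse ate the cat food")
# ('cat', 'ate') appears only in the first phrase and ('mouse', 'ate') only
# in the second: who ate whom differs, even though the word sets overlap
# almost completely.
```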

### Getting Started with Spark on Windows

This article talks about how to get started with the Spark shell on Windows. Based on the documentation on Spark's website, it seemed like a breeze to get started. But there were several mistakes I made, which meant it took me longer to get started than I had expected. This is what worked for me on a Windows machine:

1. Download a pre-built Spark package from the Spark website and extract it; the extracted folder becomes your Spark installation directory.

2. Next, go to Windows=>Run and type cmd to get the DOS command prompt.

*Note: Cygwin may not work. You will have to use the DOS prompt.

3. Change directory into the Spark installation directory (home directory)

4. Next, at the command prompt, type
bin\spark-shell.cmd


5. You should see something like this:

6. Once Spark has started, you will see a prompt "scala>".

7. To verify that Spark initialized correctly, type
val k=5+5
at the scala> prompt. You should get back:
k: Int = 10
If you don't, then Spark did not start correctly.

8. Another check is to go to your Web browser and type http://localhost:port_that_spark_started_on. The port value can be found in the start-up screen. It is usually 4040, but it can be some other value if Spark had issues binding to that specific port.

### Scheduling Automatic Restart of Windows 7

One of the lessons I have learnt with server administration is that after a while of running Tomcat servers, the system can get really sluggish from all the "living objects". One of the solutions for this is to do a daily scheduled reboot of the server to keep the system fresh and performing at its best (of course, you also need to fix the reason why objects are not being garbage collected).

So, here is how you create a scheduled restart for Windows machines (this was tested on Windows 7):
1. Open Notepad (or any plain text editor).
2. Type c:\windows\system32\shutdown.exe -F -R
3. Save the file as "AutoRestart.bat" somewhere on disk.
4. Open the Task Scheduler (Start => search for "Task Scheduler").
5. Click Action and select Create Basic task.
6. Type AutoRestart (or any name you want) in the Name box and click Next.
7. Select Daily and click Next.
8. Type the time you want to restart the computer and click Next.
9. Select Start a program and click Next.
10. Click Browse and navigate to where you saved AutoRestart.bat, select the AutoRestart.bat file and click Next.
11. Click Finish.
12. To test whether your auto-restart works, go back to Task Scheduler, and when you see the list of scheduled tasks, right-click on AutoRestart and select Run.

### What is the Notion of NLP in Computer Science vs. Biomedical Informatics vs. Information Systems (MIS)?

Computer Science: In computer science research, when we talk about NLP methods we are referring to specific methods that use linguistic properties or structural properties of text, often coupled with statistics, to solve targeted language and text mining problems. The NLP parts are those "customized methods" that we use to solve specific problems. Thus, the notion of NLP here is rather broad, as any novel approach that does not just re-use existing tools could be considered NLP or Computational Linguistics.

Biomedical Informatics: In the pure biomedical informatics world, however, I often notice that the idea of what NLP is, is quite narrow. People tend to refer to the usage of tools such as NegEx (negation detection), POS taggers, parsers and stemmers as NLP. While technically the underlying technology in these tools can involve lightweight to heavy NLP, the applications of these tools to specific problems are not necessarily "NLP". It is more an application of NLP tools, or text processing using NLP tools. To be technically correct, only when these NLP tools are used in combination with additional learning methods or customized text mining algorithms would this qualify as actual NLP solving a specific problem.

Information Systems (MIS): The concept of what qualifies as NLP (and text mining) in the information systems (MIS) world tends to be a lot more shallow and ill-defined than in biomedical informatics or computer science; here, even counting word frequencies seems to be considered a form of NLP. In fact, even getting a binary classifier to work on text is considered fairly involved NLP. This may seem slightly exaggerated, but when you start reviewing papers from some of the MIS departments in TOIS or TKDE, you will start to understand how ill-defined NLP can really be.

The point here is that there are different flavors of NLP, both in research and in practice, and the notion of NLP varies from department to department. So when you publish obvious methods to conferences where the audience primarily consists of core computer scientists, you will get dinged really hard for the lack of originality in your proposed methods. At the same time, if you submit a really effective algorithm that does not use existing NLP tools to a biomedical informatics journal, you may again get dinged, but this time for not using existing tools, or, as a friend of mine once told me, "the reviewer had philosophical issues with my submission". Also, be warned that if a physician comes and tells you "hey, I am also doing NLP" (I've heard quite a few of those), don't be surprised if what he or she means is running some NLP tools on some patient data. And if someone from the IS department of a business school talks to you about offering a text mining course, it usually means they are teaching you how to use R or SAS to do some text processing and querying :D, not to mention they will also refer to this as "Big Data".

### 'Something' Analytics: the New Buzz Word?

From Big Data to Data Science to Machine Learning to Natural Language Processing, the next new buzz word I've been hearing lately is "Something" Analytics. I've long heard of Google Analytics, Healthcare Analytics and Text Analytics, but here's a list of some newer analytics buzz words. Some you've already heard of; some might surprise you:

## Does 'Something Analytics' refer to the use of predictive models?

No, absolutely not! The term 'analytics' is actually overloaded. Some companies and institutions use the term to mean that predictive models are the underlying technology. In many cases, though, it just refers to data-driven analysis, reporting and decision-making capability in a specific domain. Bottom line: when you see the term 'analytics' in use, it usually means some form of data analysis in plain traditional terms.

### Micropinions vs. Micro-reviews

I was recently asked "What's the difference between micropinions and micro-reviews?" based on my paper "Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions". The two concepts are actually fairly synonymous. A micropinion is a set of short phrases expressing opinions on a specific topic or entity. Here is an example of a micropinion summary of a restaurant:
As you can see, micropinions are small enough that they would easily fit within the display of a hand-held device, and constraints can be imposed to limit the overall size of the generated micropinions. Micro-reviews, on the other hand, typically refer to reviews posted on social media websites like Twitter and Foursquare, where the goal is to provide a review of a product or some service (e.g. a restaurant) within the character limits imposed by these websites. Here are a few examples of micro-reviews from Twitter:
So in essence, the terms micro-reviews and micropinions refer to the same concept: very concise opinions about an entity or topic, often adhering to some constraints (e.g. a character limit or word limit). Hope this explains it!

### What is an Opinion-Driven Decision Support System?

An Opinion-Driven Decision Support System (ODSS) refers to the use of large amounts of online opinions to facilitate business and consumer decision making. The idea is to combine the strengths of search technologies with opinion mining and analysis tools to provide a synergistic decision-making platform.

The research and engineering problems related to developing such a system include:
(1) opinion acquisition
(2) opinion based search
(3) opinion summarization
(4) presentation of results

Opinions in this case can be an aggregation of user reviews, blog comments, Facebook status updates, Tweets and so on. Essentially, any opinion-containing text on specific topics or entities qualifies as a candidate for building an ODSS platform. Here's a description of some of the research and engineering problems in developing an ODSS platform:

### 1. Search Capabilities Based on Opinions

The goal of opinion-based search is to help users find entities of interest based on their key requirements. Since a user is often interested in choosing an entity based on opinions about that entity, a system that ranks entities based on a user's personal preferences would provide more direct support for the user's decision-making task. For example, in the case of finding hotels at a destination, a user may only want to consider hotels that other people thought were clean. Finding and ranking hotels based on how well they satisfy such a requirement would significantly reduce the number of entities in consideration, facilitating decision making. Unlike traditional search, the query in this case is a set of preferences, and the result is a set of entities that match those preferences. The challenge is to accurately match the user's preferences with existing opinions in order to recommend the best entities. This special ranking problem is referred to as Opinion-Based Entity Ranking. Many of the existing opinion mining techniques can potentially be used for this new ranking task. I have explored information retrieval based techniques to solve this ranking problem specifically, and there have been a few follow-up works (from other groups) trying other approaches.
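A highly simplified sketch of the Opinion-Based Entity Ranking idea (the hotel names, opinion snippets and substring matching are all invented for illustration; a real system would use retrieval models over large opinion collections):

```python
# Hypothetical data: entities with opinion snippets mined about them.
opinions = {
    "Hotel A": ["very clean rooms", "friendly staff", "clean bathroom"],
    "Hotel B": ["rooms were dirty", "great location"],
}

def preference_score(entity, preferences):
    """Count how many opinion snippets mention each preferred aspect.
    Naive substring matching stands in for a real retrieval model."""
    return sum(pref in snippet
               for pref in preferences
               for snippet in opinions[entity])

# The 'query' is a set of preferences; the result is a ranked list of entities.
prefs = ["clean"]
ranking = sorted(opinions, key=lambda e: preference_score(e, prefs), reverse=True)
```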

### 2. Opinion Summarization (i.e. Sentiment Analysis + Text Summarization)

Opinion summaries play a critical role in helping users better analyze the entities in consideration (e.g. products, physicians, cars, politicians). Users are often looking out for major concerns or advantages in selecting a specific entity. Thus, a summary that can quickly highlight the key opinions about an entity would significantly help exploration of entities and aid decision making. The field of opinion summarization has long been explored, with most techniques being focused on generating summaries over a fixed set of topics; these are referred to as structured summaries. In the last few years, textual summaries of opinions have been gaining more and more popularity. Bing Liu's Opinion Mining Tutorial covers some of these recent works, or you can refer to point (5) of this article.

### 3. Opinion Acquisition (i.e. Opinion or Sentiment Crawling)

To support accurate search and analysis based on opinions, opinionated content is imperative. Relying on opinions from just one source not only makes the information unreliable but also incomplete, due to variations in opinions as well as the potential bias present in any single source. Although many applications rely on large amounts of opinions, there has been very limited work on collecting and integrating a complete set of opinions. I recently explored a very simple method for collecting large amounts of opinions on arbitrary entities.

The idea of an Opinion-Driven Decision Support System (ODSS) was developed as part of my thesis. For more information on this, please see Kavita's thesis.