Mar 3, 2016

Untold Truths About Text Mining and Web Mining in Practice

Truth #1: There is no such thing as "THE TOOL" for text mining


People often ask me, "So what tool do you use for your text analysis projects"...This is how it typically goes..I stare at them not sure whether I should laugh or cry, they stare at me thinking I am crazy and there is a few seconds of uncomfortable silence. I usually end up with a single tear in one eye and a partial half-hearted smile...then the confused other person clarifies him/her self by saying..."actually, what I meant was do you use NLTK or GATE". Now the tears start to pour. Well, the truth is, it all depends on the problem!! There is no magic tool or tools. If I am scraping the Web, I might use something like Nutch or write a custom crawler (no, not in python!) that does crawling and scraping at the same time. If I am building a text classifier I may just use Mallet with simple stemming where needed. If I am doing entity linking I might just use some similarity measures like Jaccard or Cosine. It also depends on the programming language I am working on. If I am using Python, I may choose to use scikit learn for text classification instead of Mallet. So, there really is no such thing as THE TOOL for text analysis - it always depends on what you are trying to solve and in which development environment. The idea of "THE TOOL" probably comes from some irresponsible advertising where you are promised that you can solve the world's text mining problems with this one tool.

Truth #2: Text Mining is 20% engineering, 40% algorithms and 40% science/statistics

If you think text mining or web mining is just pure statistics or science where you can for example, apply black box Machine Learning and solve the entire problem, you are in for a big surprise. Many text mining projects start with intelligent algorithms (from focused crawling, to compact text representation to graph search) with much of the code written from scratch coupled with some statistical modeling to solve very specific parts of the problem (e.g. CRF based web page segmentation). Developing the custom algorithms and tying the pieces together requires some level of engineering. This is why text mining and analytics is so appealing to some, as there is always going to be new challenges in every new problem you work on - whether its engineering problems or types of algorithms to use - its always a fun and challenging journey. Even a change in domain, can significantly change the scope of the problem.


Truth # 3: Not all data scientists can work on text mining / web mining / NLP projects

Data science is a very broad term and it encapsulates BI analysts, NLP Experts and ML Experts. The way data scientists are being trained today (outside academic programs), is that most of them get hands on experience working on structured data using R or Python and are mostly trained in descriptive statistics and supervised machine learning. Text and language is a whole other ball game. The language expert type of data scientist must have significant NLP knowledge as well as some knowledge in information retrieval, data mining, breadth of AI, as well as creative algorithm development capabilities.

Now you might be thinking so what is the whole point of GATE and NLTK and that Alchemy API for gods sake! GATE and NLTK and Alchemy type NLP APIs are just some tools to help you get started with text mining type projects. These tools provide some core functionality such as stemming, sentencing, chunking, etc. You are not required to use these tools. If you find other off the shelf tools that are easy to integrate into your code, go ahead and use it! If you are already working with python, then NLTK is always an option for you to use. The bottom line is, pick and choose and use *only* what you REALLY need, not something that looks or sounds cool as this just adds unneeded overhead and makes debugging a living nightmare.