Text mining as a better solution for analyzing unstructured data

By Priya Chetty on July 25, 2017

Text mining is a subdivision of data mining used to recognize hidden patterns and correlations in large amounts of data. It is also known as text data mining, intelligent text analysis and knowledge discovery in text, and it is concerned with extracting useful information from unstructured text data. Gupta & Lehal (2009) regard text mining as a new interdisciplinary area that combines data mining, information retrieval, machine learning, computational linguistics and statistics. Text mining has many applications. It is a valuable resource in social networking and blogging, customer relationship management, tracking public opinion and text filtering (Mostafa, 2013). It is also popular in the biomedical field, where practitioners have developed several bioinformatics data mining toolboxes for computational biology and where it is applied to text related to biology, medicine and chemistry.

Text mining as a better solution than data mining for unstructured data

According to Kroeze (2004), the difference between text mining and data mining is that text mining is capable of processing unstructured data, whereas data mining processes only structured data. Text mining is mainly concerned with extracting useful information from unstructured text. Until recently, IT specialists across the globe were mainly interested in extracting information from structured data, i.e., the data stored in data warehouses. Over time, however, the majority of the data available online has come to be unstructured. This mainly includes unstructured text in the form of articles, blogs, web pages and more. Unstructured data can also contain numbers, facts, dates and structured fields (Jose 2010). This growth of unstructured data has created a need for intelligent tools that can effectively manage knowledge and extract useful information.

Importance of data mining in different fields

Thus, the main strength of text mining over data mining lies in its potential to generate relevant information and create knowledge from the massive amounts of unstructured data available on the internet or corporate intranets. Although data mining has a wider range of applications, text mining is a more effective way of examining multiple documents and extracting information that establishes a correlation or a pattern (Kroeze 2004).

Difference between text mining and data mining

Text mining and keyword search

Text mining is a key technique for linking pieces of information extracted from large amounts of data to form new facts or hypotheses, which can then be explored further through conventional experimental processes (Jose 2010). It is fundamentally different from keyword search, a very common practice in web search. In keyword search, the user is generally looking for something that is already known to them and has been posted or published by someone else. The main issue with keyword search is that it discards information that is not directly related to the keyword provided.

Difference between text mining and keyword search

In text mining, by contrast, the aim is to discover previously unknown information that can be reported as a new finding. By applying text mining in the healthcare industry, new principles of patient stratification and previously unknown correlations between diseases can be discovered (Jensen et al. 2012). When the technique is integrated with genetic analysis, genotype relationships can be established. It therefore has huge potential in medical research and across healthcare sectors (Koh & Tan 2011). Traditionally, researchers used keyword search when they needed to find relevant documents that contained a particular keyword.
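The contrast between the two approaches can be illustrated with a minimal sketch. The toy clinical notes, the stopword list and the function names `keyword_search` and `cooccurrence_counts` are all hypothetical and chosen only for illustration; counting term co-occurrence is just one simple stand-in for pattern discovery, not the method used in the studies cited above.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; the notes and terms are invented for illustration.
NOTES = [
    "patient reports migraine and nausea after starting drug_a",
    "migraine with nausea observed, patient on drug_a",
    "patient has diabetes and hypertension",
    "hypertension follow-up, diabetes stable",
    "migraine episode, no other symptoms",
]

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"and", "after", "with", "on", "has", "no", "other", "the", "a"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z_]+", text.lower()) if t not in STOPWORDS}


def keyword_search(docs, keyword):
    """Return only the documents that contain the given keyword."""
    return [d for d in docs if keyword in tokenize(d)]


def cooccurrence_counts(docs):
    """Count how many documents each unordered pair of terms shares."""
    counts = Counter()
    for d in docs:
        for pair in combinations(sorted(tokenize(d)), 2):
            counts[pair] += 1
    return counts


# Keyword search answers only the question that was asked.
hits = keyword_search(NOTES, "migraine")

# Co-occurrence counting surfaces associations nobody searched for.
pairs = cooccurrence_counts(NOTES)
```

Here `hits` contains only the notes that mention "migraine", so the notes linking diabetes and hypertension are discarded. The `pairs` counter, in contrast, records that association without any query for it, which is the kind of unprompted pattern the text attributes to text mining.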

In the opinion of Banazir & Annes (2013), existing keyword search techniques perform poorly at document retrieval when the input consists of multiple keywords. This is due to the centralized storage of the relevant documents and data, which makes it necessary to encrypt them using standard cryptographic algorithms. Because keyword search retrieves relevant data without decrypting it, its search performance is poor. In such scenarios, text mining is considered a more general approach, because it improves the resulting performance and also reduces search time.

Three major steps for text mining

The process of text mining involves three major steps:

  1. Collecting unstructured data from various sources such as e-mails, websites and documents.
  2. Converting the information into a structured format.
  3. Applying descriptive and predictive analysis algorithms to extract information.
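The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the documents are hypothetical in-memory strings standing in for collected e-mails and web pages, the structured format is a simple term-document matrix, and the analysis shown is only descriptive (document frequency). The helper names `to_term_counts`, `build_matrix` and `document_frequency` are invented for this example.

```python
import re
from collections import Counter

# Step 1: collect unstructured text. These hypothetical strings stand in
# for e-mails, web pages and documents gathered from real sources.
raw_documents = {
    "email_1": "The delivery was late and the package was damaged.",
    "web_1": "Great product, fast delivery!",
    "doc_1": "Package arrived damaged; refund requested.",
}


def to_term_counts(text):
    """Step 2: turn free text into a structured bag-of-words record."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_matrix(documents):
    """Build a term-document matrix as rows of counts over a shared vocabulary."""
    counts = {name: to_term_counts(text) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


def document_frequency(vocabulary, rows):
    """Step 3 (descriptive): number of documents in which each term appears."""
    return {
        term: sum(1 for row in rows.values() if row[i] > 0)
        for i, term in enumerate(vocabulary)
    }


vocabulary, rows = build_matrix(raw_documents)
df = document_frequency(vocabulary, rows)
```

Once the text is in matrix form, every document is a fixed-length row of numbers, so the same table could in principle be passed to a predictive algorithm such as a classifier. That step is left out here to keep the sketch short.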

A brief comparative table of text mining and keyword search is provided below.

Text mining | Keyword search
Processes both structured and unstructured data | Processes only structured data
Can extract information from encrypted data | Works on decrypted data only
Less time consuming | More time consuming
Can determine hidden patterns and correlations | Provides apparent information
Prior knowledge of the subject not necessary | Prior knowledge of the subject necessary
Doesn’t consider text structure (data features) | Considers text structure

A comparison between text-mining and keyword processing

For example, companies can prefer text mining over other techniques in customer care applications, which require frequent analysis of text from various sources of information, including surveys, feedback forms and customer complaints about services and quality, to name a few.

In addition, text mining can provide an automated rapid-response system for customers, which reduces the load of grievance handling (Chakraborty et al. 2013).
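As a rough illustration of how incoming complaints might be routed automatically, the sketch below assigns each message to the category whose trigger terms it matches most often. The categories, the trigger terms and the `categorize` function are hypothetical; production systems typically rely on trained classifiers rather than a hand-written lexicon like this one.

```python
import re

# Hypothetical categories and trigger terms; a real system would learn
# or curate these from historical tickets.
CATEGORY_TERMS = {
    "billing": {"refund", "charge", "charged", "invoice", "payment"},
    "delivery": {"late", "delivery", "shipping", "damaged", "package"},
    "quality": {"broken", "defective", "faulty", "quality"},
}


def categorize(message):
    """Return the category with the most matching terms, or 'other' if none match."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    scores = {cat: len(tokens & terms) for cat, terms in CATEGORY_TERMS.items()}
    # On a tie, max() keeps the category listed first in CATEGORY_TERMS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


complaints = [
    "I was charged twice and want a refund",
    "The package arrived late and damaged",
    "Screen is faulty and the build quality is poor",
    "How do I change my password?",
]
routed = {c: categorize(c) for c in complaints}
```

Messages that match no trigger term fall into "other", which is where a human agent would still need to step in.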

Application of text mining in different industries

Practical real-world applications of text mining tools are found in the healthcare industry, credit card management and marketing, among others. The healthcare industry uses text mining for system management and for generating low-cost patient statistics. In credit card management, text mining is used for complaint management and for rating the performance of call center employees. In the marketing sector, it is used to optimize the strategies implemented and to manage resources (Jose 2010). Similarly, Netzer et al. (2012) proposed using text mining to convert content generated by users (survey data or sales data), which helps to develop insight into market structure and supports competition analysis. It can also assess the free content generated on social media sites. In addition, this helps companies gain an advantage in a competitive environment by monitoring and analyzing customer trends (He et al. 2013).

Text mining software

Finally, many text mining software packages are available as open source and commercial programs. The table below shows some of the most popular:

Software | Features
RapidMiner | Unified platform for data prep, machine learning, and model deployment.
Gensim | Extracts semantic information from unstructured data and performs topic modelling at large scale.
Natural Language Toolkit (NLTK) | Provides programs and libraries for statistical and symbolic natural language processing using Python.
Orange | Provides an add-on for text mining.
AlchemyLanguage | Entity extraction, keyword extraction, emotion and sentiment analysis.
OpenNLP | Processing of natural language.
SAS Text Miner | Automatic Boolean rule generator, theme discovery, term profiling.
Stanbol | Semantic content management, mainly used in scholarly projects; has a web-based text mining environment.
WordStat | Faster extraction of themes and trends, efficient analysis of qualitative content.

Major text-mining software (Source: Capterra, 2017).




