Text mining as a better solution for analyzing unstructured data

By Priya Chetty on July 25, 2017

Text mining is a subdivision of data mining used to recognize hidden patterns and correlations in large amounts of data. It is also known as text data mining, intelligent text analysis and knowledge discovery in text, and it is concerned with extracting useful information from unstructured text data. Gupta & Lehal (2009) regard text mining as a new interdisciplinary area that combines data mining, information retrieval, machine learning, computational linguistics and statistics. Text mining has many applications. It is a valuable resource in social networking and blogging, customer relationship management, tracking public opinion and text filtering (Mostafa, 2013). It is also popular in the biomedical field, where practitioners have developed several bioinformatics data mining toolboxes for computational biology and where it is applied to texts in biology, medicine and chemistry.

Text mining as a better solution to data mining for unstructured data

According to Kroeze (2004), the difference between text mining and data mining is that text mining can process unstructured data, whereas data mining processes only structured data. Text mining is mainly concerned with extracting useful information from unstructured text. Until recently, many IT specialists across the globe were interested in extracting information from structured data, i.e., the data stored in data warehouses. Over time, however, the majority of the data available online has come to be unstructured. This mainly includes unstructured text in the form of articles, blogs, web pages and more. Unstructured data can also contain numbers, facts, dates and structured fields (Jose 2010). The growth of unstructured data has created the need for an intelligent tool that can effectively manage knowledge and extract useful information.


Thus, the main strength of text mining over data mining lies in the possibility of generating relevant information and creating knowledge from the massive amounts of unstructured data available on the internet or corporate intranets. Although data mining has a wider range of applications, text mining is a more effective way of examining multiple documents and extracting information that establishes a correlation or a pattern (Kroeze 2004).


Text mining and key-word extraction

Text mining is a key element that can link information extracted from large amounts of data to form new facts or hypotheses, which can then be explored further through conventional experimental processes (Jose 2010). It is fundamentally different from keyword search, which is very common in web search. In keyword search, users generally look for something that is already known to them and has been posted or published by someone else. The main issue with keyword search is that it discards information that is not directly related to the keyword provided.
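The limitation of keyword search can be illustrated with a minimal sketch. The three documents and the helper name keyword_search below are invented for illustration and are not taken from any of the cited works; real search engines are considerably more elaborate.

```python
# Hypothetical mini-corpus; the documents are invented for illustration.
documents = {
    "doc1": "Aspirin reduces fever and relieves mild pain.",
    "doc2": "Patients with frequent headaches often report poor sleep.",
    "doc3": "Fever is a common symptom of influenza.",
}


def keyword_search(docs, keyword):
    """Return the ids of documents whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [doc_id for doc_id, text in docs.items() if needle in text.lower()]


# Only documents that literally contain the keyword are returned.
print(keyword_search(documents, "fever"))  # ['doc1', 'doc3']

# doc2 is about headaches, which is related to pain, but it is discarded
# because the word "pain" never appears in it.
print(keyword_search(documents, "pain"))   # ['doc1']
```

The second query shows the issue described above: a document that is relevant to the topic but does not contain the exact keyword is never returned.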


In text mining, however, the aim is to discover unknown information that can be presented as a new finding. By applying text mining in the healthcare industry, new principles of patient stratification and previously unknown correlations between diseases can be discovered (Jensen et al. 2012). When this technique is integrated with genetic analysis, genotype relationships can be established. Thus, this technique has huge potential in medical research and across healthcare sectors (Koh & Tan 2011). Traditionally, researchers used keyword search when they needed to find relevant documents containing a particular keyword.
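One simple way to surface associations that nobody searched for is to count which terms occur together in the same document. The sketch below assumes a tiny set of invented patient notes and a hand-picked stop-word list; it is not the method used by Jensen et al. (2012), only an illustration of pattern discovery without a predefined keyword.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical patient notes, invented for illustration only.
notes = [
    "patient reports migraine and insomnia",
    "insomnia and anxiety noted",
    "migraine with insomnia after medication change",
    "anxiety and hypertension under observation",
]

# Small hand-picked stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"and", "at", "with", "after", "under", "patient", "reports", "noted"}


def cooccurrence_counts(texts):
    """Count how often each pair of terms appears in the same document."""
    pairs = Counter()
    for text in texts:
        # Unique, alphabetically sorted terms so each pair has one canonical form.
        terms = sorted(set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS)
        pairs.update(combinations(terms, 2))
    return pairs


counts = cooccurrence_counts(notes)
# The most frequent pair emerges from the data, not from a user-supplied keyword.
print(counts.most_common(1))  # [(('insomnia', 'migraine'), 2)]
```

No keyword is supplied here: the pair of terms that stands out is a result of the analysis, which is the sense in which text mining aims at discovery rather than lookup.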

In the opinion of Banazir & Annes (2013), existing keyword search techniques perform poorly in document retrieval when the input consists of multiple keywords. This is because relevant documents and data are kept in centralized storage, which requires encryption with standard cryptographic algorithms. Because keyword search retrieves relevant data without decrypting it, search performance is poor. In such scenarios text mining is considered a more generic approach, because it optimizes the resulting performance and reduces search time as well.

Two major steps for text mining

The process of text mining involves two major steps:

Collecting unstructured data from various sources such as e-mails, websites and documents.
Converting the information into a structured format.
The next step involves applying descriptive and predictive analysis algorithms to extract information.
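The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample texts, the tokenize helper and the choice of a term-count matrix as the structured format are assumptions made for the example, and production tools use far richer preprocessing.

```python
from collections import Counter
import re

# Step 1: collect unstructured text from several (hypothetical) sources.
raw_sources = {
    "email": "The delivery was late and the package was damaged.",
    "website": "Great service, fast delivery!",
    "document": "Customer complained about a damaged package.",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Step 2: convert the text into a structured format, here one row of
# term counts per source (a simple term-document matrix).
term_counts = {source: Counter(tokenize(text)) for source, text in raw_sources.items()}
vocabulary = sorted(set().union(*term_counts.values()))
matrix = [[term_counts[source][term] for term in vocabulary] for source in raw_sources]

# Next step: a simple descriptive analysis, listing the terms that
# appear in at least two of the sources.
doc_freq = {
    term: sum(1 for counts in term_counts.values() if term in counts)
    for term in vocabulary
}
shared = sorted(term for term, n in doc_freq.items() if n >= 2)
print(shared)  # ['damaged', 'delivery', 'package']
```

Once the text is in matrix form, the same rows could be passed to predictive algorithms such as classifiers; the descriptive count shown here is only the simplest possible analysis.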

A brief comparative table of text mining and keyword search is provided below.

Text mining	Keyword processing
Processes both structured and unstructured data	Processes only structured data
Can extract information from encrypted data	Works on decrypted data only
Less time consuming	More time consuming
Can determine hidden patterns and correlations	Provides apparent information
Prior knowledge of the subject not necessary	Prior knowledge of the subject necessary
Does not consider text structure (data features)	Considers text structure

A comparison between text-mining and keyword processing

For example, companies can use text mining in preference to other techniques in customer care applications, which require frequent analysis of text from various sources of information. These include surveys, feedback forms and customer complaints about services and quality, to name a few.

In addition, text mining provides an automated rapid-response system for customers, which reduces the load of grievance handling (Chakraborty et al. 2013).

Application of text mining in different industries

Practical real-world applications of text mining tools are found in the healthcare industry, credit card management and marketing, among others. The healthcare industry uses text mining in system management and in generating low-cost patient statistics. In credit card management, text mining is used for complaint management and for rating the performance of call center employees. In marketing, text mining is used to optimize the strategies implemented and to manage resources (Jose 2010). Similarly, Netzer et al. (2012) proposed using text mining to convert user-generated content (survey data or sales data) into insight on market structure and competition. It can also assess the free content generated on social media sites. This helps companies gain an advantage in a competitive environment by monitoring and analyzing customer trends (He et al. 2013).

Text mining software

In conclusion, many text mining software packages are available as both open source and commercial programs. The table below lists some of the most popular ones:

Software	Features
RapidMiner	Unified platform for data prep, machine learning, and model deployment.
Gensim	Extracts semantic information from unstructured data and performs topic modelling at large scale.
Natural Language Toolkit (NLTK)	Provides programs and libraries for statistical and symbolic natural language processing in Python.
Orange	Provides a text mining add-on.
AlchemyLanguage	Entity extraction, keyword extraction, emotion and sentiment analysis.
OpenNLP	Processing of natural language.
SAS Text Miner	Automatic Boolean rule generator, theme discovery, term profiling.
Stanbol	Semantic content management, mainly used in scholarly projects; has a web-based text mining environment.
WordStat	Faster extraction of themes and trends, efficient analysis of qualitative content.

Major text mining software (Source: Capterra, 2017).

References

Banazir B & Annes, P., 2013. Efficient Keyword Search Using Text-mining Techniques: a Survey. Certified International Journal of Engineering and Innovative Technology, 9001(1), pp.2277–3754. Available at: http://www.ijeit.com/Vol 3/Issue 1/IJEIT1412201307_89.pdf [Accessed May 26, 2017].
Capterra, 2017. Best Text-Mining Software | 2017 Reviews of the Most Popular Systems. Available at: http://www.capterra.com/text-mining-software/ [Accessed May 26, 2017].
Chakraborty, G., Pagolu, M. & Garla, S., 2013. Text-Mining and Analysis. Available at: https://www.sas.com/storefront/aux/en/spmanaganalyzunstructured/65646_excerpt.pdf [Accessed May 26, 2017].
Gupta, V. & Lehal, G.S., 2009. A Survey of Text-Mining Techniques and Applications – Volume 1, No. 1, August 2009 – JETWI. Journal of emerging technologies in web intelligence, 1(1), pp.60–76. Available at: http://www.jetwi.us/index.php?m=content&c=index&a=show&catid=165&id=969 [Accessed May 26, 2017].
He, W., Zha, S. & Li, L., 2013. Social media competitive analysis and text-mining: A case study in the pizza industry. International Journal of Information Management, 33(3), pp.464–472. Available at: http://linkinghub.elsevier.com/retrieve/pii/S0268401213000030 [Accessed May 26, 2017].
Jensen, P.B., Jensen, L.J. & Brunak, S., 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6), pp.395–405. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22549152 [Accessed May 26, 2017].
Jose, D., 2010. Three Real-World Applications of Text-Mining to Solve Specific Business Problems by Derick Jose – BeyeNETWORK. Available at: http://www.b-eye-network.com/view/12783 [Accessed May 26, 2017].
Koh, H.C. & Tan, G., 2011. Data Mining Applications in Healthcare. Journal of Healthcare Information Management, 19(2). Available at: http://www.ssnpstudents.com/wp/wp-content/uploads/2015/02/10.1.1.92.3184.pdf [Accessed May 26, 2017].
Kroeze, J.H., 2004. Differentiating between data-mining and text-mining terminology. , 6(December).
Mostafa, M.M., 2013. More than words: Social networks’ text-mining for consumer brand sentiments. Available at: http://tarjomefa.com/wp-content/uploads/2016/02/4470-English.pdf [Accessed May 26, 2017].
Netzer, O. et al., 2012. Mine Your Own Business: Market-Structure Surveillance Through Text-Mining. , 31(3), pp.521–543. Available at: http://dx.doi.org/10.1287/mksc.1120.0713 [Accessed May 26, 2017].
