Text mining is a subdivision of data mining used to recognize hidden patterns and correlations in large amounts of data. It is also known as text data mining, intelligent text analysis and knowledge discovery in text, and is concerned with extracting useful information from unstructured text data. Gupta & Lehal (2009) regard text mining as a new interdisciplinary area that combines data mining, information retrieval, machine learning, computational linguistics and statistics. Text mining has many applications. It is a valuable resource in social networking and blogging, customer relationship management, tracking public opinion and text filtering (Mostafa, 2013). It is also popular in the biomedical field, where practitioners have developed several bioinformatics data mining toolboxes for computational biology and where it deals with text related to biology, medicine and chemistry.
Text mining as a better solution to data mining for unstructured data
According to Kroeze (2004), the difference between text mining and data mining is that text mining can process unstructured data, whereas data mining processes only structured data. Text mining is mainly concerned with extracting useful information from unstructured text. Until recently, many IT specialists across the globe were interested in extracting information from structured data, i.e. the data stored in data warehouses. Over time, however, the majority of the data available online has come to be unstructured. This mainly includes unstructured text in the form of articles, blogs, web pages and more. Unstructured data can also contain numbers, facts, dates and structured fields of data (Jose 2010). The growth of unstructured data has created a need for intelligent tools that can effectively manage knowledge and extract useful information.
Thus, the main strength of text mining over data mining lies in its ability to generate relevant information and create knowledge from the massive amounts of unstructured data available on the internet or corporate intranets. Although data mining has a wider range of applications, text mining is a more effective way of examining multiple documents and extracting information that establishes a correlation or a pattern (Kroeze 2004).
Text mining and keyword extraction
Text mining is a key element that can link together the information extracted from large amounts of data to form new facts or hypotheses, which can then be explored further through conventional experimental processes (Jose 2010). It is fundamentally different from keyword search, which is a very common practice in web search. In keyword search, the user is generally searching for something that is already known to them and has been posted or published by someone else. The main issue with keyword search is that it discards information that is not directly related to the keyword provided.
In text mining, however, the aim is to discover unknown information that can be revealed as a new finding. By applying text mining in the healthcare industry, new principles of patient stratification and previously unknown disease correlations can be discovered (Jensen et al. 2012). When this technique is integrated with genetic analysis, genotype-phenotype relationships can be better understood. Thus, this technique has huge potential for medical research and across healthcare sectors (Koh & Tan 2011). Traditionally, researchers used keyword search when they needed to find relevant documents containing a particular keyword.
In the opinion of Banazir & Annes (2013), existing keyword search techniques perform poorly in document retrieval when the input consists of multiple keywords. This is because relevant documents and data are held in centralized storage, which requires encryption using standard cryptographic algorithms. Because keyword search has to retrieve relevant data without decrypting it, its search performance is poor. In such scenarios, text mining is considered a more general approach, because it improves retrieval performance and reduces search time.
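The contrast drawn in this section can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the toy corpus, the stop-word list and the function names `keyword_search` and `cooccurring_pairs` are all assumptions made for illustration. Keyword search returns only documents that contain a term the user already knows, while a simple co-occurrence count (one very basic text mining technique) surfaces term pairs the user never asked about.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical toy corpus, invented purely for illustration.
DOCS = [
    "Patients with diabetes often report fatigue and blurred vision.",
    "Fatigue and blurred vision were common in the hypertension group.",
    "The diabetes clinic extended its opening hours.",
]

# A tiny, hand-picked stop-word list (an assumption, not a standard list).
STOPWORDS = {"with", "and", "the", "in", "its", "were", "often", "a", "of"}


def tokenize(text):
    """Lowercase a document and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def keyword_search(docs, keyword):
    """Return only the documents that contain the given keyword."""
    return [d for d in docs if keyword.lower() in tokenize(d)]


def cooccurring_pairs(docs, min_docs=2):
    """Return term pairs that appear together in at least `min_docs` documents."""
    counts = Counter()
    for d in docs:
        terms = sorted(set(tokenize(d)))
        counts.update(combinations(terms, 2))
    return {pair: n for pair, n in counts.items() if n >= min_docs}


if __name__ == "__main__":
    # Keyword search: the user must already know to ask about "diabetes".
    print(keyword_search(DOCS, "diabetes"))
    # Co-occurrence mining: pairs of terms that recur together across documents.
    print(cooccurring_pairs(DOCS))
```

In this toy corpus, searching for "diabetes" ignores the second document entirely, whereas the co-occurrence count picks up that "fatigue", "blurred" and "vision" recur together across documents regardless of which condition is mentioned.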
Two major steps for text mining
The process of text mining involves two major steps:
- Collecting unstructured data from various sources such as e-mails, websites and documents, and converting it into a structured format.
- Applying descriptive and predictive analysis algorithms to the structured data to extract information.
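The steps listed above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions: the sample documents are invented, the "structured format" is taken to be a simple document-term matrix, and the analysis shown is a basic descriptive statistic (document frequency). A predictive model is not included.

```python
from collections import Counter
import re

# Collect: these strings stand in for e-mails, web pages and documents.
# They are invented for illustration.
raw_documents = [
    "The delivery was late and the package was damaged.",
    "Great service, the delivery was fast!",
    "Package damaged again. Very poor service.",
]


def to_term_counts(text):
    """Turn one free-text document into a structured record of term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Convert: one structured row of term counts per document.
rows = [to_term_counts(doc) for doc in raw_documents]
vocabulary = sorted(set().union(*rows))

# Document-term matrix: one row per document, one column per vocabulary term.
matrix = [[row[term] for term in vocabulary] for row in rows]

# Analyse (descriptive): in how many documents does each term appear?
document_frequency = {
    term: sum(1 for row in rows if row[term] > 0) for term in vocabulary
}

if __name__ == "__main__":
    for term in ("delivery", "damaged", "service"):
        print(term, document_frequency[term])
```

Once the text is in matrix form, the same rows could be passed to any descriptive or predictive algorithm that expects tabular input, which is the point of the conversion step.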
A brief comparative table of text mining and keyword search is provided below.
|Text mining||Keyword search|
|Processes both structured and unstructured data||Processes only structured data|
|Can extract information from encrypted data||Works on decrypted data only|
|Less time consuming||More time consuming|
|Can determine hidden patterns and correlations||Provides apparent information|
|Prior knowledge of the subject not necessary||Prior knowledge of the subject necessary|
|Doesn’t consider text structure (data features)||Considers text structure|
A comparison between text-mining and keyword processing
For example, companies can use text mining in preference to other techniques in customer care applications, which require frequent analysis of text from various sources of information, such as surveys, feedback forms and customer complaints about services and quality.
In addition, text mining provides an automated rapid-response system for customers, which reduces the load of grievance handling (Chakraborty et al. 2013).
Application of text mining in different industries
Practical real-world users of text mining tools include the healthcare industry, credit card management and marketing. The healthcare industry uses text mining for managing systems and generating low-cost patient statistics. In credit card management, text mining is used for complaint management and for rating the performance of call center employees. In the marketing sector, text mining is used to optimize the strategies implemented and to manage resources (Jose 2010). Similarly, Netzer et al. (2012) proposed using text mining to convert user-generated content (survey data or sales data) into insight into market structure and competition. It can also assess the free content generated on social media sites. In addition, this helps companies to gain an advantage in a competitive environment by monitoring and analyzing customer trends (He et al. 2013).
Text mining software
Many text mining software packages are available as open source and commercial programs. The table below shows some of the most popular:
|RapidMiner||Unified platform for data prep, machine learning, and model deployment.|
|Gensim||Extracts semantic information from unstructured data and performs topic modelling on a large scale.|
|Natural Language Toolkit (NLTK)||Provides programs and libraries for statistical and symbolic natural language processing using Python.|
|Orange||Provides a text mining add-on.|
|AlchemyLanguage||Entity extraction, keyword extraction, emotion and sentiment analysis.|
|OpenNLP||Processing of natural language.|
|SAS Text Miner||Automatic Boolean rule generator, theme discovery, term profiling.|
|Stanbol||Semantic content management, mainly used in scholarly projects, has a web based text mining environment.|
|WordStat||Faster extraction of themes and trends, efficient analysis of qualitative content.|
Major text-mining software (Source: Capterra, 2017).
- Banazir B & Annes, P., 2013. Efficient Keyword Search Using Text-mining Techniques: a Survey. Certified International Journal of Engineering and Innovative Technology, 9001(1), pp.2277–3754. Available at: http://www.ijeit.com/Vol 3/Issue 1/IJEIT1412201307_89.pdf [Accessed May 26, 2017].
- Capterra, 2017. Best Text-Mining Software | 2017 Reviews of the Most Popular Systems. Available at: http://www.capterra.com/text-mining-software/ [Accessed May 26, 2017].
- Chakraborty, G., Pagolu, M. & Garla, S., 2013. Text-Mining and Analysis. Available at: https://www.sas.com/storefront/aux/en/spmanaganalyzunstructured/65646_excerpt.pdf [Accessed May 26, 2017].
- Gupta, V. & Lehal, G.S., 2009. A Survey of Text-Mining Techniques and Applications – Volume 1, No. 1, August 2009 – JETWI. Journal of emerging technologies in web intelligence, 1(1), pp.60–76. Available at: http://www.jetwi.us/index.php?m=content&c=index&a=show&catid=165&id=969 [Accessed May 26, 2017].
- He, W., Zha, S. & Li, L., 2013. Social media competitive analysis and text-mining: A case study in the pizza industry. International Journal of Information Management, 33(3), pp.464–472. Available at: http://linkinghub.elsevier.com/retrieve/pii/S0268401213000030 [Accessed May 26, 2017].
- Jensen, P.B., Jensen, L.J. & Brunak, S., 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6), pp.395–405. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22549152 [Accessed May 26, 2017].
- Jose, D., 2010. Three Real-World Applications of Text-Mining to Solve Specific Business Problems by Derick Jose – BeyeNETWORK. Available at: http://www.b-eye-network.com/view/12783 [Accessed May 26, 2017].
- Koh, H.C. & Tan, G., 2011. Data Mining Applications in Healthcare. Journal of Healthcare Information Management, 19(2). Available at: http://www.ssnpstudents.com/wp/wp-content/uploads/2015/02/10.1.1.92.3184.pdf [Accessed May 26, 2017].
- Kroeze, J.H., 2004. Differentiating between data-mining and text-mining terminology. , 6(December).
- Mostafa, M.M., 2013. More than words: Social networks’ text-mining for consumer brand sentiments. Available at: http://tarjomefa.com/wp-content/uploads/2016/02/4470-English.pdf [Accessed May 26, 2017].
- Netzer, O. et al., 2012. Mine Your Own Business: Market-Structure Surveillance Through Text-Mining. , 31(3), pp.521–543. Available at: http://dx.doi.org/10.1287/mksc.1120.0713 [Accessed May 26, 2017].
Latest posts by Ritika Taparia (see all)
- Text mining as a better solution for analyzing unstructured data - July 25, 2017