Starting with Rule-Based Sentiment Analysis
In today’s world, our opinions and those of others significantly shape decisions and actions. Rule-based sentiment analysis is a pragmatic choice for specific use cases, especially where transparency, speed, or cost matter. It is simple, fast, and requires no training data.
Lately, more and more individuals have been voicing their opinions on different subjects on the internet. Sentiment analysis involves classifying such opinions as positive, negative, or neutral, or into more granular emotional states like happy, excited, tender, scared, angry and sad (Ekman, 1999).
Rule-based sentiment analysis systems rely on manually crafted rules and patterns to determine the sentiment of a text. Such systems are useful in scenarios where simplicity, transparency, or domain-specific control is prioritized. While modern machine learning (ML) models like BERT or GPT-4 dominate research and large-scale applications, rule-based approaches remain practical for many real-world use cases.
One example is analyzing smartphone reviews, where keywords like “battery life” or “screen quality” dominate sentiments.
Furthermore, sentiment analysis based on manually crafted rules has been increasingly supplemented by advanced machine learning and deep learning approaches, which offer more nuanced and scalable solutions (Liu, 2012).
Importance of lexicons in rule-based sentiment analysis classifiers
An expert system is an intelligent computer program that uses knowledge and inference procedures to solve problems that are difficult enough to require significant human expertise for their solution.
Feigenbaum (1977)
Sentiment analysis systems focus on identifying and extracting subjective information from textual data. In a rule-based classifier, a predefined set of rules is constructed to identify specific patterns that are most likely associated with different classes. Rule-based classifiers are standard models generated from data using an unsupervised machine-learning algorithm (Chikersal et al., 2015). In sentiment analysis, rule induction learns weighted decision rules, where each rule is assigned a certainty factor (weight) that reflects its contribution to sentiment classification based on linguistic patterns (Berka, 2020).
Chikersal et al. (2015) used a rule-based classifier to label each tweet as ‘positive’, ‘negative’ or ‘unknown’ in their research. They applied the following rules:
- Emoticon-related rules classified a tweet as “positive” if it contained only positive emoticons and no negative emoticons. If a tweet contained only negative emoticons and no positive emoticons, it was classified as “negative”. If a tweet had no emoticons, lexicon-related rules were applied.
- Lexicon-related rules classified the tweets based on a pre-classified group of opinion words from different dictionaries like the Bing Liu lexicon (Liu et al., 2005), the NRC Emotion lexicon (Mohammad & Turney, 2013) and SentiWordNet (Esuli & Sebastiani, 2006). If a tweet contained more than two positive words without negation and no negative words from any of the lexicons, it was classified as positive.
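The rule cascade described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used by Chikersal et al. (2015): the emoticon sets and word lists are tiny hypothetical stand-ins for the real lexicons, negation handling is omitted, and both the treatment of mixed emoticons and the mirrored rule for negative words are assumptions.

```python
# Hypothetical, tiny stand-ins for real emoticon lists and opinion lexicons.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}
POSITIVE_WORDS = {"good", "great", "love", "amazing", "happy"}
NEGATIVE_WORDS = {"bad", "awful", "hate", "terrible", "sad"}


def classify_tweet(tweet: str) -> str:
    """Apply emoticon rules first, then fall back to lexicon rules."""
    raw_tokens = tweet.split()
    has_pos_emoticon = any(t in POSITIVE_EMOTICONS for t in raw_tokens)
    has_neg_emoticon = any(t in NEGATIVE_EMOTICONS for t in raw_tokens)

    # Emoticon-related rules.
    if has_pos_emoticon and not has_neg_emoticon:
        return "positive"
    if has_neg_emoticon and not has_pos_emoticon:
        return "negative"
    if has_pos_emoticon and has_neg_emoticon:
        return "unknown"  # assumption: mixed emoticons stay undecided

    # Lexicon-related rules: only reached when the tweet has no emoticons.
    words = [t.strip(".,!?").lower() for t in raw_tokens]
    n_pos = sum(w in POSITIVE_WORDS for w in words)
    n_neg = sum(w in NEGATIVE_WORDS for w in words)
    if n_pos > 2 and n_neg == 0:
        return "positive"
    if n_neg > 2 and n_pos == 0:
        return "negative"  # assumption: mirror of the positive rule
    return "unknown"


print(classify_tweet("Finally home :)"))
print(classify_tweet("good great amazing show"))
```

Keeping the rules in a fixed order makes the classifier easy to audit: every label can be traced back to the single rule that produced it.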
In rule-based sentiment analysis systems, lexicons play a fundamental role in classifying the sentiment polarity. Such lexicon-based methods directly compute sentiment scores from word lists or vocabulary without requiring feature selection, vectorization, or model training. Nanli et al. (2012) also stated that a lexicon-based approach uses a sentiment dictionary or vocabulary to identify positive and negative words and phrases. Lexicon-based classifier models are preferable in simulating the effects of linguistic context. They provide predefined sentiment scores for words, helping to classify the text input.
- Lexicons: happy (+3), excellent (+2), bad (-2), terrible (-3)
- Input text: The service was excellent, but the food was terrible.
- Sentiment score: (+2) + (-3) = -1; Slightly negative sentiment.
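The arithmetic in this example can be reproduced with a short script. The sketch below assumes a simple regular-expression tokenizer and uses only the four lexicon entries listed above; `lexicon_score` is a hypothetical helper name, not part of any library.

```python
import re

# The toy lexicon from the example above.
LEXICON = {"happy": 3, "excellent": 2, "bad": -2, "terrible": -3}


def lexicon_score(text: str) -> int:
    """Sum the lexicon scores of all words in the text; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


score = lexicon_score("The service was excellent, but the food was terrible.")
print(score)  # -1, i.e. (+2) + (-3)
```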
Lexicon-based sentiment analysis works by matching the words of a text against a sentiment lexicon and summing up their scores. Lexicons also work alongside linguistic rules to improve sentiment detection by considering negations, intensifiers, and modifiers.
- Negation handling: in “not happy”, the negation reverses the positive sentiment of “happy”.
- Intensifiers: in “very good”, if “good” is +2, then “very good” is boosted to +3.
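A minimal sketch of these two rules is shown below. It is a simplified illustration rather than how any particular tool works: the word lists, the boost sizes, and the choice to look only at the one or two words directly before a sentiment word are all assumptions made for the example.

```python
import re

LEXICON = {"happy": 3, "good": 2, "bad": -2, "terrible": -3}
NEGATIONS = {"not", "never", "no"}           # hypothetical negation list
INTENSIFIERS = {"very": 1, "extremely": 2}   # hypothetical boost sizes


def score_with_rules(text: str) -> int:
    """Sum lexicon scores, applying intensifier and negation rules."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        score = LEXICON[token]
        previous = tokens[i - 1] if i > 0 else ""
        if previous in INTENSIFIERS:
            # Push the score further away from zero.
            boost = INTENSIFIERS[previous]
            score += boost if score > 0 else -boost
            # Look one word further back for a negation ("not very good").
            previous = tokens[i - 2] if i > 1 else ""
        if previous in NEGATIONS:
            score = -score
        total += score
    return total


print(score_with_rules("very good"))   # 3
print(score_with_rules("not happy"))   # -3
```

Flipping the sign is the simplest possible negation rule; a production tool may instead dampen or scale the score.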
Furthermore, SentiWordNet is a lexical resource built on WordNet in which each entry already has assigned positivity, negativity, and objectivity scores. Other tools like the NRC Emotion Lexicon take lexicons beyond polarity (positive or negative) and categorize words into emotions like joy, anger, fear, and sadness.
- Delighted: Joy
- Frustrated: Anger
- Heartbroken: Sadness
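An emotion lexicon can be applied in the same way as a polarity lexicon, except that each match contributes to an emotion category instead of a numeric score. The sketch below uses only the three words listed above as a hypothetical stand-in for the much larger NRC Emotion Lexicon.

```python
import re
from collections import Counter

# Three-word stand-in for a real emotion lexicon.
EMOTION_LEXICON = {
    "delighted": "joy",
    "frustrated": "anger",
    "heartbroken": "sadness",
}


def emotion_counts(text: str) -> Counter:
    """Count how many words in the text fall into each emotion category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(EMOTION_LEXICON[t] for t in tokens if t in EMOTION_LEXICON)


print(emotion_counts("I was delighted at first, then frustrated and heartbroken."))
```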
Advantages of using VADER and TextBlob
Python modules such as TextBlob assign predefined sentiment scores to words, while the Valence Aware Dictionary and Sentiment Reasoner (VADER) additionally handles negation and intensifiers effectively. Both modules offer advantages, especially for handling real-world, informal, and social media text. Although SentiWordNet provides a pre-defined list of words that is good for general sentiment analysis, it struggles with negation, sarcasm, and social media text. Similarly, the NRC Emotion Lexicon classifies emotions (joy, anger, sadness, etc.) but cannot accurately determine the overall sentiment strength.
VADER is a lexicon and rule-based tool especially optimised for social media texts. It outperforms other lexicon-based approaches by handling emojis, slang, punctuation, and capitalisation and offers contextually aware sentiment scoring.
“This is 🔥🔥🔥!”
- Using VADER will result in a positive score.
- Using SentiWordNet or NRC will result in neutrality, as they lack slang understanding.
VADER returns a compound sentiment score ranging from -1 to +1, making it easy to classify texts as positive, neutral or negative.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Create the analyzer once and reuse it for every text
analyzer = SentimentIntensityAnalyzer()
text = "I LOVE this movie!! 😍🔥"
# polarity_scores returns the neg/neu/pos proportions and the compound score
score = analyzer.polarity_scores(text)
print(score)
# OUTPUT (a Python dictionary)
# {'neg': 0.0, 'neu': 0.215, 'pos': 0.785, 'compound': 0.8516}
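To turn the compound score into a label, a common convention is to treat scores of 0.05 and above as positive, scores of -0.05 and below as negative, and everything in between as neutral. The helper below is a sketch of that convention; `label_from_compound` is a hypothetical name, and the threshold is a tunable parameter rather than a fixed rule.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_from_compound(0.8516))  # positive
print(label_from_compound(0.0))     # neutral
```

A wider threshold sends more borderline texts to the neutral class, which can be useful when false positives are costly.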
Unlike SentiWordNet, TextBlob combines a lexicon-based default analyzer with an optional machine-learning one. The default analyzer reports polarity and subjectivity scores, the latter indicating how opinionated a text is. The optional Naïve Bayes classifier can pick up negations better.
from textblob import TextBlob
text = "This movie is really amazing and touching."
blob = TextBlob(text)
# sentiment holds the polarity (-1 to 1) and subjectivity (0 to 1) scores
print(blob.sentiment)
# OUTPUT (a named tuple)
# Sentiment(polarity=0.85, subjectivity=0.75)
Furthermore, TextBlob can also analyze sentiments of texts in languages beyond English using translation APIs.
Researchers have analyzed the accuracy of VADER and TextBlob across various datasets and conditions on several occasions. Their findings consistently show that VADER provides more precise sentiment analysis compared to other lexicon-based methods like SentiWordNet and TextBlob (Bonta et al., 2019a, 2019b; Nguyen et al., 2019; Srivastava et al., 2022). This is because VADER is specifically designed for social media, informal text, and short messages, handling elements like emojis, punctuation, and negation more effectively.
Customising sentiment lexicon in VADER
VADER’s default lexicon works well for general sentiment analysis, but fields such as finance, healthcare, and e-commerce use their own words and expressions. Because the lexicon can be edited, VADER can be tuned, alongside additional preprocessing, so that sentiment scores reflect the industry context.
For example, in the context of the stock market, “crash” is negative, but in sports “crash” might not have a sentiment impact.
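One way to picture domain-specific scoring is as a base lexicon plus a set of per-domain overrides. The sketch below uses plain dictionaries; the words, scores, and domain names are hypothetical, and `build_lexicon` is an illustrative helper rather than part of VADER.

```python
# Hypothetical base scores shared by all domains.
BASE_LEXICON = {"crash": -1.5, "win": 2.0}

# Hypothetical per-domain overrides.
DOMAIN_OVERRIDES = {
    "finance": {"crash": -3.0},  # a market crash is strongly negative
    "sports": {"crash": 0.0},    # treated as carrying no sentiment
}


def build_lexicon(domain: str) -> dict:
    """Return a copy of the base lexicon with the domain's overrides applied."""
    lexicon = dict(BASE_LEXICON)  # copy so the base lexicon stays untouched
    lexicon.update(DOMAIN_OVERRIDES.get(domain, {}))
    return lexicon


print(build_lexicon("finance")["crash"])  # -3.0
print(build_lexicon("sports")["crash"])   # 0.0
```

Keeping the overrides separate from the base lexicon makes it easy to switch domains without losing the original scores.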
VADER excels at detecting internet slang, emoticons, and emojis, but new trends keep emerging. Its predefined scores may not always match real-world sentiment in a dataset. Adjusting these scores improves accuracy, and updating the lexicon keeps it relevant.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# Initialize VADER
analyzer = SentimentIntensityAnalyzer()
# Add new words to the lexicon
new_words = {
    "underrated": 2.5,   # Make "underrated" strongly positive
    "meh": -1.5,         # Make "meh" slightly negative
    "overpriced": -2.0,  # Strongly negative sentiment
    "fire": 3.0,         # Slang for something amazing
}
# Update VADER lexicon
analyzer.lexicon.update(new_words)
Use analyzer.lexicon.get(word, "Not in Lexicon") to check whether a word is present in VADER’s lexicon and, if so, what score it carries.
VADER is English-centric but can be adapted for Hinglish (Hindi-English mix) and other languages by adding custom lexicons.
- Mast hai! : It’s amazing!
- Bakwas : Nonsense
- Faltu : Useless
When extending VADER with custom lexicons, lemmatizing the words ensures that inflected forms match the entries in the modified lexicon. When all words are already present in VADER’s lexicon, the effect of lemmatisation is minimal.
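The effect of lemmatization on lexicon matching can be illustrated with a toy example. In the sketch below, a hand-written lemma map stands in for a real lemmatizer (such as the ones provided by NLTK or spaCy), and the custom lexicon entries and scores are hypothetical.

```python
# Hypothetical custom lexicon that only contains base forms.
CUSTOM_LEXICON = {"disappoint": -2.0, "delight": 2.5}

# Hand-written stand-in for a real lemmatizer.
LEMMA_MAP = {
    "disappointed": "disappoint",
    "disappointing": "disappoint",
    "delighted": "delight",
    "delights": "delight",
}


def score_tokens(tokens, lemmatize=True):
    """Sum custom lexicon scores, optionally mapping tokens to their lemmas first."""
    total = 0.0
    for token in tokens:
        word = token.lower()
        if lemmatize:
            word = LEMMA_MAP.get(word, word)
        total += CUSTOM_LEXICON.get(word, 0.0)
    return total


tokens = ["The", "ending", "delighted", "me"]
print(score_tokens(tokens, lemmatize=False))  # 0.0, "delighted" is not in the lexicon
print(score_tokens(tokens, lemmatize=True))   # 2.5, matched via the lemma "delight"
```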
This article explored rule-based and lexicon-based sentiment analysis, highlighting tools like VADER, TextBlob, SentiWordNet, and the NRC Emotion Lexicon. While VADER stands out for its ready-made lexicon and real-time adaptability, it has limitations. Customizing lexicon scores can enhance sentiment detection, making it more accurate and relevant across different datasets and domains.
Exercise: Enhancing Sentiment Analysis by customizing VADER lexicons
Objectives
- Download and load the dataset and create a pandas DataFrame.
- Normalise the dataset by removing unnecessary HTML/XML tags.
- Correct and create a custom dictionary of irregular words.
- Define a function to apply sentiment analysis using VADER on each text entry.
- Add custom sentiment scores for VADER lexicons: tasty = 1, authentic = 1.5, yummy = 2, cold = -0.5, decent = 0.5, real = 1.2, mouthwatering = 2, taste less = -2
- Create a bar chart using Matplotlib to visualize sentiment distribution before and after adding custom scores.
References
- Berka, P. (2020). Sentiment analysis using rule-based and case-based reasoning. Journal of Intelligent Information Systems, 55 (1), 51–66. https://doi.org/10.1007/s10844-019-00591-8
- Bonta, V., Kumaresh, N., & Janardhan, N. (2019a). A Comprehensive Study on Lexicon Based Approaches for Sentiment Analysis. Asian Journal of Computer Science and Technology, 8 (S2), Article S2. https://doi.org/10.51983/ajcst-2019.8.S2.2037
- Bonta, V., Kumaresh, N., & Janardhan, N. (2019b). A Comprehensive Study on Lexicon Based Approaches for Sentiment Analysis. Asian Journal of Computer Science and Technology, 8 (S2), Article S2. https://doi.org/10.51983/ajcst-2019.8.S2.2037
- Chikersal, P., Poria, S., & Cambria, E. (2015). SeNTU: Sentiment Analysis of Tweets by Combining a Rule-based Classifier with Supervised Learning. In P. Nakov, T. Zesch, D. Cer, & D. Jurgens (Eds.), Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (pp. 647–651). Association for Computational Linguistics. https://doi.org/10.18653/v1/S15-2108
- Esuli, A., & Sebastiani, F. (2006). SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. In N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J. Mariani, J. Odijk, & D. Tapias (Eds.), Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC`06) . European Language Resources Association (ELRA). https://aclanthology.org/L06-1225/
- Feigenbaum, E. A. (1977). The art of artificial intelligence: Themes and case studies of knowledge engineering. Proceedings of the 5th International Joint Conference on Artificial Intelligence – Volume 2, 1014–1029.
- Liu, B. (2012). Sentiment Analysis: A Fascinating Problem. In B. Liu (Ed.), Sentiment Analysis and Opinion Mining (pp. 1–8). Springer International Publishing. https://doi.org/10.1007/978-3-031-02145-9_1
- Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: Analyzing and comparing opinions on the Web. Proceedings of the 14th International Conference on World Wide Web – WWW ’05, 342. https://doi.org/10.1145/1060745.1060797
- Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a Word-Emotion Association Lexicon (No. arXiv:1308.6297). arXiv. https://doi.org/10.48550/arXiv.1308.6297
- Nanli, Z., Ping, Z., Weiguo, L., & Meng, C. (2012). Sentiment analysis: A literature review. 2012 International Symposium on Management of Technology (ISMOT), 572–576. https://doi.org/10.1109/ISMOT.2012.6679538
- Nguyen, H., Veluchamy, A., Diop, M., & Iqbal, R. (2019). Comparative Study of Sentiment Analysis with Product Reviews Using Machine Learning and Lexicon-Based Approaches. SMU Data Science Review, 1 (4). https://scholar.smu.edu/datasciencereview/vol1/iss4/7
- Srivastava, R., Bharti, P. K., & Verma, P. (2022). Comparative Analysis of Lexicon and Machine Learning Approach for Sentiment Analysis. International Journal of Advanced Computer Science and Applications, 13 (3). https://doi.org/10.14569/IJACSA.2022.0130312
I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments and data visualization. My training approach emphasizes real-world application, clear interpretation of results and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.