Using POS tagging to identify valid bigrams for VADER

By Abhinash Jena on March 4, 2025

The Valence Aware Dictionary and Sentiment Reasoner (VADER) is a lexicon-based sentiment analysis tool that assigns predefined sentiment scores from a lexicon dictionary. VADER is known for handling slang, emojis, and punctuation-based emphasis, but it may not always capture domain-specific nuances. While VADER performs well for general sentiment analysis, fine-tuning its lexicon clusters can enhance its effectiveness in domain-specific fields like retail, finance, and social media analytics.

Rule-based sentiment analysis methods rely on publicly available lexicons to evaluate the polarity of text. New and improved lexicons must be continually developed to address challenges like outdated vocabulary, slang, and the need for more accurate sentiment scoring. This article emphasizes the importance of identifying unknown word clusters in a dataset using the n-gram method, and uses POS tagging to select meaningful lexicon clusters, improving the accuracy of VADER-based sentiment analysis across diverse applications.

Significance of POS tagging in Natural Language Processing (NLP)

In natural language processing (NLP), part-of-speech (POS) tagging labels the parts of a sentence so that computers can understand its context, meaning, grammar, and structure. It involves labeling words with tags such as “NN” for singular nouns, “VBZ” for third-person singular verbs, or “JJ” for adjectives. POS tagging is a foundational NLP step that enriches text with syntactic metadata, supports the selection of meaningful word pairs for sentiment classification, and is critical for accurate language understanding.


POS tagging approaches can be categorized into three types: rule-based tagging, statistical tagging, and hybrid tagging. The rule-based POS tagging approach applies hand-written rules and contextual information to assign tags but struggles with unknown text, requiring an exhaustive set of rules for better accuracy. The statistical approach relies on word frequency and probability from annotated data but may produce grammatically incorrect tag sequences. The hybrid approach combines both methods and can outperform them individually (Kumawat & Jain, 2015).

from nltk import pos_tag
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text.lower())  # Tokenization
tokens = [word for word in tokens if word.isalnum()]  # Remove punctuation
tags = pos_tag(tokens)  # Apply POS tagging

Probabilistic approach to determine word sequences

The earliest POS tagging system used the rule-based approach, where hand-written rules and contextual information were applied to assign tags. These rules are commonly called context frame rules. By the early 1990s, probabilistic methods started replacing rule-based POS tagging. Statistical approaches revolutionized natural language processing (NLP), leading to the term “statistical natural language processing” (Martinez, 2012).

A probabilistic method, such as the stochastic model, uses probability distributions to determine the most likely sequence of tags in each sequence of words. The stochastic model is based on various techniques, including the Hidden Markov Model (HMM) and N-gram (Kumawat & Jain, 2015). The Hidden Markov Model (HMM) is widely used for part-of-speech (POS) tagging (Chiche & Yitagesu, 2022). The n-gram method is widely used for combining two or more words into a single unit (Ankita & Abdul Nazeer, 2018).

EXAMPLE

Bigrams of the sentence “food was good” will be [(‘food’, ‘was’), (‘was’, ‘good’)].

The simplest part-of-speech (POS) tagger is the unigram POS tagger, which assigns each word the tag it most frequently receives in a tagged corpus (Ankita & Abdul Nazeer, 2018). The n-gram approach, by contrast, is a stochastic model that predicts a word’s tag based on the previous n−1 words in a sentence. N-grams capture context by considering sequences of words rather than individual words in isolation. A POS tagger using this model follows the Markov assumption: it considers only a limited number of preceding tags rather than the whole sentence. This makes the identification of contextual words more effective, and contextual sentiment scoring aligns better with human interpretation.
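To make the unigram idea concrete, here is a minimal from-scratch sketch (the tiny hand-tagged corpus and the "NN" fallback for unseen words are assumptions for illustration; NLTK ships a trainable UnigramTagger that works the same way):

```python
from collections import Counter, defaultdict

# A tiny hand-tagged corpus (hypothetical data for illustration)
tagged = [
    [("the", "DT"), ("food", "NN"), ("was", "VBD"), ("good", "JJ")],
    [("the", "DT"), ("service", "NN"), ("was", "VBD"), ("slow", "JJ")],
]

# Count how often each word receives each tag
counts = defaultdict(Counter)
for sent in tagged:
    for word, tag in sent:
        counts[word][tag] += 1

def unigram_tag(words, default="NN"):
    # Assign each word its most frequent tag; fall back to a default tag
    return [(w, counts[w].most_common(1)[0][0] if w in counts else default)
            for w in words]

print(unigram_tag(["the", "food", "was", "amazing"]))
# → [('the', 'DT'), ('food', 'NN'), ('was', 'VBD'), ('amazing', 'NN')]
```

Because "amazing" never appears in the training data, the unigram tagger can only guess; this is exactly the gap that context-aware n-gram models address.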

EXAMPLE

“Lead” can be a verb or a noun, and its part of speech tagging depends on the surrounding words.

Furthermore, the n-gram method helps avoid misclassifying neutral or contradictory sentiments in a sentence containing unknown words. Dandapat et al. (2007) noted that a highly accurate stochastic tagger needs a large dataset of annotated text to build context-aware lexicon clusters. With large datasets, however, storing all possible n-grams in memory can become impractical and inefficient.

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk import ngrams

# Sample dataset
reviews = [
  "food was good",
  "The place had an amazing ambience",
  "The food was deliciously awesome"
]

# Function to get bigrams
def get_bigrams(words):
  return list(ngrams(words, 2))

# Function to preprocess text and generate bigrams
def extract_bigrams(text):
  tokens = word_tokenize(text.lower())  # Tokenization
  tokens = [word for word in tokens if word.isalnum()]  # Remove punctuation
  return get_bigrams(tokens)  # Generate bigrams

# Process all reviews
for review in reviews:
  print(extract_bigrams(review))
[('food', 'was'), ('was', 'good')]

[('the', 'place'), ('place', 'had'), ('had', 'an'), ('an', 'amazing'), ('amazing', 'ambience')]

[('the', 'food'), ('food', 'was'), ('was', 'deliciously'), ('deliciously', 'awesome')]

Ankita and Abdul Nazeer (2018) used a Bloom Filter to determine the membership of an element in a labelled set. According to them, the Bloom Filter is memory efficient and takes less time than linear searches, such as the dictionary search in Python. While n-grams provide context and improve tagging accuracy, Bloom filters can enhance the efficiency of the process, especially when dealing with large datasets.
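A minimal Bloom filter can be sketched in a few lines (the bit-array size, number of hash functions, and MD5-based hashing below are illustrative choices, not the configuration used in the cited study):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size  # compact bit array

    def _positions(self, item):
        # Derive several bit positions by salting one hash function
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # No false negatives; false positives possible but rare
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("amazing ambience")
print("amazing ambience" in bf)  # True
print("terrible food" in bf)     # False (with high probability)
```

Membership tests touch only a fixed number of bits regardless of how many bigrams have been added, which is what makes the structure attractive for very large n-gram sets.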

Updating VADER lexicon clusters with bigrams that follow valid syntactic patterns

The Valence Aware Dictionary and Sentiment Reasoner (VADER) is a lexicon-based sentiment analysis tool with predefined sentiment scores. It relies on the presence of known sentiment-bearing words to infer sentiment. Bigrams are pairs of consecutive words in a text and play a crucial role in sentiment analysis; however, some word combinations carry context-dependent sentiment and may not be present in VADER’s lexicon dictionary of predefined sentiment scores.

EXAMPLE

“I’m killing it at the gym today!” In this context, “killing it” means achieving great results. The sentiment is positive.

“Oh great, the traffic is killing it.” This sentence uses sarcasm to convey frustration. The sentiment is negative.
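The “killing it” example can be handled by giving the bigram its own lexicon entry. Below is a toy sketch of the idea (the mini-lexicon, the score of 2.4, and the underscore-joining trick are illustrative assumptions, not VADER's built-in behavior; real VADER exposes a comparable word-to-score dictionary):

```python
# Toy lexicon mapping tokens to valence scores (assumed values)
lexicon = {"good": 1.9, "terrible": -2.1}

# Add a bigram entry by joining it into a single token
lexicon["killing_it"] = 2.4  # assumed score for illustration

def score(text):
    # Merge known bigrams into single tokens before the per-word lookup
    text = text.lower().replace("killing it", "killing_it")
    return sum(lexicon.get(w, 0.0) for w in text.split())

print(score("I'm killing it at the gym today"))  # 2.4
print(score("the food was good"))                # 1.9
```

Without the merged entry, a per-word lookup would see only the unknown words "killing" and "it" and score the sentence as neutral.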

Furthermore, encountering unknown words or phrases (lexicon clusters) that are not in VADER’s predefined dictionary is common, especially when dealing with social media posts, reviews and domain-specific phrases. Thus, determining new valid consecutive words in a meaningful way within a given corpus is a crucial step for accurate analysis. Valid syntactic patterns refer to combinations of words that follow valid grammatical rules.

EXAMPLE

An adjective followed by a noun like “happy child” is a valid syntactic pattern.

POS tagging for a large corpus is a labor-intensive and time-consuming task. Determining valid bigrams improves computational efficiency and minimizes the need for manual proofreading (Tsai & Chen, 2004). Processing all possible bigrams, including invalid ones, increases computational load unnecessarily. Although POS tagging helps to identify grammatical patterns, it does not guarantee that every bigram conforms to expected grammatical structures. When a model uses a random combination of words, the output will be syntactically incorrect, even if the individual words are tagged properly. Thus, preferring bigrams with valid syntactic patterns over random bigrams is critical for enhancing the accuracy and efficiency of the model.

# Function to preprocess text and filter bigrams using POS tagging combined with custom syntactic patterns 

def extract_bigrams(text, patterns):
  tokens = word_tokenize(text.lower())  # Tokenization
  tokens = [word for word in tokens if word.isalnum()]  # Remove punctuation
  tags = pos_tag(tokens)  # Apply POS tagging
  bigrams = get_bigrams(tags)  # Generate bigrams of (word, tag) pairs
  return [
    (w1, w2) for (w1, tag1), (w2, tag2) in bigrams
    if (tag1, tag2) in patterns
  ]

# Process all reviews
for review in reviews:
  print(extract_bigrams(review, [("JJ", "NN"), ("RB", "JJ")]))
[]

[('amazing', 'ambience')]

[('deliciously', 'awesome')]

Extract meaningful bigrams with rules

Researchers have shown that the n-gram method improves performance in tasks like text classification. Fürnkranz (1998) concluded that word sequences of length 2 or 3 were useful, while longer sequences reduced performance. Jensen and Martinez (2000) sought to improve text classification by using bigrams. Their work shows that adding sentiment scores based on context and meaning can make text classification models more accurate. However, identifying meaningful bigrams that might carry sentiment value is a challenge. Dey et al. (2018) described a systematic approach to identifying meaningful bigrams by considering six types of bigram combinations based on the interaction between unigrams (single words) and modifiers (intensifiers or negations):

  1. Positive Unigram + Amplifier: “very good”
  2. Positive Unigram + Downtoner: “slightly good”
  3. Positive Unigram + Negation: “not good”
  4. Negative Unigram + Amplifier: “extremely bad”
  5. Negative Unigram + Downtoner: “somewhat bad”
  6. Negative Unigram + Negation: “not bad”
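The six combinations can be detected with a few word lists. Here is a minimal sketch (the amplifier, downtoner, negation, and polarity lists are small illustrative samples, not the full sets used by Dey et al.):

```python
# Small illustrative word lists (assumptions, not exhaustive)
amplifiers = {"very", "extremely"}
downtoners = {"slightly", "somewhat"}
negations = {"not", "never"}
positive = {"good", "great"}
negative = {"bad", "terrible"}

def classify_bigram(w1, w2):
    # Determine the polarity of the second word, if any
    if w2 in positive:
        polarity = "Positive"
    elif w2 in negative:
        polarity = "Negative"
    else:
        return None  # second word carries no known sentiment
    # Determine the modifier type of the first word, if any
    if w1 in amplifiers:
        return f"{polarity} Unigram + Amplifier"
    if w1 in downtoners:
        return f"{polarity} Unigram + Downtoner"
    if w1 in negations:
        return f"{polarity} Unigram + Negation"
    return None  # first word is not a recognized modifier

print(classify_bigram("very", "good"))  # Positive Unigram + Amplifier
print(classify_bigram("not", "bad"))    # Negative Unigram + Negation
```

Note that the "Negation" cases are the ones a naive per-word lexicon lookup gets wrong: "not bad" scores negative word-by-word even though the bigram is mildly positive.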

TextBlob is another powerful Python library for processing textual data, and it can be helpful in identifying meaningful bigrams with sentiment. TextBlob provides built-in sentiment analysis, which assigns a polarity score and a subjectivity score. The subjectivity score ranges from 0 to 1 and signifies how opinionated a phrase is; it can be used to help identify meaningful bigrams, especially when the goal is to extract emotional or opinionated combinations.

from nltk.tokenize import word_tokenize
from nltk import ngrams
from textblob import TextBlob 

text = "The camera quality is amazing, but the battery life is terrible."

tokens = word_tokenize(text.lower())  # Tokenization

tokens = [word for word in tokens if word.isalnum()]  # Remove punctuation

bigrams = ngrams(tokens, 2)

# Analyze subjectivity of each bigram

meaningful_bigrams = []

for bigram in bigrams:
    
    bigram_text = " ".join(bigram)
    
    blob_bigram = TextBlob(bigram_text)
    
    subjectivity = blob_bigram.sentiment.subjectivity
    
    # Filter for high subjectivity bigrams
    
    if subjectivity > 0.5:
        
        meaningful_bigrams.append((bigram_text, subjectivity))
        
print(meaningful_bigrams)
[('is amazing', 0.9), ('amazing but', 0.9), ('is terrible', 1.0)]

Exercise: Compare VADER sentiment analysis results with valid bigrams

The goal is to enhance VADER’s sentiment analysis by incorporating syntactically valid bigrams that improve the accuracy of sentiment scoring. Try to find bigrams that occur in no more than 2% of the total corpus to avoid the problem of high dimensionality.

Objectives

  1. Download and load the dataset and create a pandas DataFrame.
  2. Normalize the dataset by removing unnecessary HTML/XML tags, converting text to lowercase, tokenizing, and removing punctuation.
  3. Correct misspelled words, using the local dictionary and user input for unknown words.
  4. Apply sentiment analysis using VADER on each text entry before updating lexicons.
  5. Follow the POS tag combinations [(“JJ”, “NN”), (“RB”, “JJ”), (“NN”, “VB”), (“JJ”, “VB”), (“VB”, “JJ”), (“NN”, “VBD”), (“NNS”, “VBP”)] to distinguish irregular words.
  6. Extract relevant bigrams based on predefined POS patterns and TextBlob sentiment subjectivity of > 0.
  7. Iterate through the extracted bigrams and prompt the user to provide sentiment scores for new bigrams or modifiers for existing ones.
  8. Update the custom sentiment lexicon with the user-provided scores.
  9. Save the sentiment scores to a local file.
  10. Re-apply sentiment analysis using VADER.
  11. Compare results before and after with a line graph using Matplotlib.

References

NOTES

I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments, and data visualization. My training approach emphasizes real-world application, clear interpretation of results, and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.
