Finding sentiment bigrams with supervised machine learning
Machine learning is a transformative branch of artificial intelligence (AI) focused on developing algorithms that enable computers to learn from data (Talwar & Kumar, 2013). Instead of following rigid, predefined rules, machine learning systems improve their performance over time by identifying patterns and relationships in data.
The primary objective of machine learning is to generalize beyond the examples in the training set. This is essential because, however much data is available, the model is unlikely to meet exactly the same examples again at test time. Machine learning is built on four foundational pillars:
- Data: This can be structured or unstructured.
- Features: Measurable data properties like currency, area, units, location, frequency, etc.
- Algorithms: Mathematical procedures or frameworks used to train models. Common families include supervised learning, unsupervised learning, and deep learning.
- Models: These are the learned patterns from the data and are used to make predictions or decisions.
Significance of machine learning in sentiment analysis
Sentiment analysis, the process of determining the emotional tone of a text (e.g., positive, negative, neutral), is a cornerstone of modern data-driven decision-making. Markov chain (n-gram) source models are among the simplest and most effective language models, and they efficiently capture local lexical regularities (Wang et al., 2003). Lexicon-based approaches to sentiment analysis rely on publicly available lexicons with pre-defined sentiment scores. Sentiment lexicons like VADER are lists of words and phrases with associated sentiment scores (e.g., positive, negative, neutral), and researchers have also tried to bring n-grams into lexicon-based sentiment analysis. However, these lexicons may not be exhaustive and may miss new or domain-specific expressions that carry sentiment.
Therefore, n-grams are crucial for identifying unfamiliar word combinations that follow valid syntactic patterns. Accommodating newly identified word combinations is often a challenge in lexicon-based sentiment analysis. Introducing machine learning (ML) can significantly mitigate this limitation by automatically discovering sentiment-bearing word combinations. Machine learning in sentiment analysis involves training algorithms on data so they can learn to classify sentiments on their own. Supervised machine learning approaches like Naïve Bayes, Support Vector Machines (SVM), Logistic Regression and Random Forests are regularly used in sentiment analysis (Dey et al., 2018).
The Maximum Entropy approach, also known as Logistic Regression, is widely used for text classification tasks such as sentiment analysis. It is a supervised learning algorithm that models the probability of a binary outcome based on one or more predictors. It works on text that has been converted into numerical features such as word frequencies and TF-IDF scores. Maximum entropy models are often considered more reliable than simpler methods like Naïve Bayes because of their flexibility, ability to handle complex feature interactions and probabilistic nature (Prabhat & Khullar, 2017). These models use a probability distribution that maximizes entropy (uncertainty) while satisfying constraints derived from the training data (Nigam et al., n.d.). Maximum entropy models provide unique benefits, particularly in scenarios where customizability, interpretability and computational efficiency are critical.
Feature extraction and vectorization for text classification
Classification refers to the process of grouping similar items into categories. In the context of text classification, this involves automatically assigning pre-defined index labels to new, unseen texts (Scott & Matwin, 1999). Feature extraction is the process of extracting relevant features from texts, including bigrams (word pairs) and topic distributions. These features help to capture the linguistic and semantic context of the n-grams. Features are also the basic building blocks or observations for the predictive model (Rawat & Khemchandani, 2017).
Imagine a dictionary that groups words by their emotional meaning, such as happy, sad, or angry. Following this pattern, the regression model learns to find new phrases like “very good” or “extremely disappointing” and add them to the appropriate sentiment group.
Feature extraction is crucial because the quality and relevance of the features directly impact the performance of the predictive model. Analysts use their domain knowledge to identify appropriate features or employ algorithms to generate new features from the given dataset. Effective feature extraction can also significantly enhance the model’s ability to make accurate predictions (Rawat & Khemchandani, 2017). Feature extraction typically involves a set of predefined keywords. Based on these keywords, the algorithm calculates the weights of the words in the text and then creates a digital vector, which serves as the feature vector of the text (Dzisevič & Šešok, 2019).
In machine learning, vectorization is the process of converting text data into numerical representations that models can work with. The features extracted from the text are arranged as a matrix in which each row represents a data point, such as a single review, and each column represents a feature, such as a word. The values in the matrix indicate the presence or frequency of each feature in each data point.
For the two reviews “Food is great” and “Food was unhealthy”, the vectorised matrix looks like this (the common words “is” and “was” are left out):
| | food | great | unhealthy |
|---|---|---|---|
| Review 1 | 1 | 1 | 0 |
| Review 2 | 1 | 0 | 1 |
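As a rough sketch of how such a matrix could be built by hand, the snippet below counts each vocabulary word per review. The small STOP_WORDS set and the tokenize helper are assumptions made here so that “is” and “was” drop out; in practice a library vectorizer would normally take care of this.

```python
# Minimal bag-of-words sketch; STOP_WORDS is an illustrative assumption.
reviews = ["Food is great", "Food was unhealthy"]
STOP_WORDS = {"is", "was"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


# Vocabulary: every distinct remaining word, sorted so the columns are stable.
vocabulary = sorted({word for review in reviews for word in tokenize(review)})

# One row per review, one column per vocabulary word, values are counts.
matrix = [[tokenize(review).count(word) for word in vocabulary] for review in reviews]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row can then be handed to a classifier together with the sentiment label of the review it came from.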
Training the Logistic Regression model
Logistic regression is a powerful predictive tool for binary classification tasks, including text classification for sentiment analysis. Training teaches a predictive model, such as a maximum entropy (logistic regression) model, to make predictions or classifications by feeding it labelled data. During training, the model learns patterns and relationships in the input data so that it can make accurate predictions on new, unseen data. Logistic regression learns one coefficient (weight) for each input feature. These coefficients indicate the contribution of each feature to the predicted probability. The model combines the weighted features to compute the log odds (logit) of the positive class. Once the model is trained, it can predict the class of an unfamiliar word combination by:
- Transforming the text into a word vector (using the same representation as the training data).
- Applying the learned coefficients to the input features.
- Computing the probability of the text belonging to a particular class (e.g., positive or negative sentiment).
- Assigning the class label based on a threshold (e.g., if probability >= 0.5, classify as positive).
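The prediction steps above can be sketched in a few lines of plain Python. The weights used here are hypothetical values chosen for illustration, not coefficients from a trained model, and predict_sentiment is a helper name introduced only for this example.

```python
import math


def predict_sentiment(features, weights, bias=0.0, threshold=0.5):
    """Return (probability, label) for one dictionary of feature values."""
    # Log odds: bias plus the weighted sum of the features that are present.
    logit = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    # The sigmoid maps the log odds to a probability between 0 and 1.
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = 1 if probability >= threshold else 0
    return probability, label


# Hypothetical coefficients, not taken from a trained model.
weights = {"very good": 2.0, "extremely disappointing": -2.5}

prob_pos, label_pos = predict_sentiment({"very good": 1}, weights)
prob_neg, label_neg = predict_sentiment({"extremely disappointing": 1}, weights)
prob_unknown, label_unknown = predict_sentiment({"never seen": 1}, weights)

print(round(prob_pos, 3), label_pos)
print(round(prob_neg, 3), label_neg)
print(round(prob_unknown, 3), label_unknown)
```

A bigram the model has never seen contributes nothing to the log odds, so with a zero bias its probability sits exactly on the 0.5 threshold.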
By representing texts as word vectors and learning coefficients for each feature, logistic regression can effectively predict the class of new texts. However, its performance depends heavily on the quality of the feature representation and on how well a linear decision boundary separates the classes.
Steps for classifying sentiment-bearing words from a corpus

Step 1: Data Preparation
- Collect a corpus of text that includes informal language, slang, and sentiment-bearing phrases.
- Normalize the corpus by removing HTML/XML tags, tokenizing, checking spelling and removing punctuation.
- Extract n-grams from the text, e.g. “very good”.
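A minimal sketch of the normalization and bigram extraction in Step 1 might look like the following. The regular expression is a deliberately crude way of stripping tags, spell checking is left out because it depends on an external dictionary, and the function names are assumptions made for this example.

```python
import re
import string


def normalize(text):
    """Strip tags, lowercase, remove punctuation and tokenize on whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # crude HTML/XML tag removal
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def extract_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


tokens = normalize("<p>The food was VERY good!</p>")
bigrams = extract_bigrams(tokens)
print(tokens)
print(bigrams)
```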
Step 2: Feature extraction
- Create a dictionary of n-gram frequencies from a given text.
- Extract relevant features from the text, such as n-grams (word combinations). These features are what the model will use to learn patterns.
- Assign labels of 1 when the sentiment is positive or neutral and 0 for negative sentiments in the data.
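One possible way to express Step 2 in code is shown below. It assumes that every text already comes with a numeric sentiment score (for instance from a lexicon) and that scores at or below -0.05 count as negative; both the scores and the cutoff are illustrative assumptions, as are the helper names.

```python
from collections import Counter


def bigram_features(tokens):
    """Map each bigram in a token list to the number of times it occurs."""
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return dict(Counter(bigrams))


def binary_label(score, negative_cutoff=-0.05):
    """Return 0 for a negative score and 1 for a positive or neutral one."""
    return 0 if score <= negative_cutoff else 1


# Hypothetical (tokens, sentiment score) pairs; the scores are invented.
examples = [
    (["very", "good", "food", "very", "good", "service"], 0.8),
    (["extremely", "disappointing", "service"], -0.6),
]

features_list = [bigram_features(tokens) for tokens, _ in examples]
sentiment_labels = [binary_label(score) for _, score in examples]

print(features_list)
print(sentiment_labels)
```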
Step 3: Vectorisation
- Convert the extracted features into a numerical representation that the model can understand.
- Build a vocabulary from the features and transform the features column into a matrix.
- Assign the matrix to ‘X’ and the labels to ‘y’, where ‘X’ holds the numerically transformed features and ‘y’ holds the sentiment labels 0 (negative) and 1 (positive).
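from sklearn.feature_extraction import DictVectorizer
# features_list: one dictionary of n-gram counts per text
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(features_list)
y = sentiment_labels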
Step 4: Split the data and train the Maximum Entropy (logistic regression) model
- Split the corpus into training data and test data using a fixed random state. The training set is used to teach the model patterns, and the test set is used to evaluate how well the model generalizes to unseen data.
from sklearn.model_selection import train_test_split
# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
- Fixing the seed at 42 will make the split dataset reliable and reproducible. This will ensure that the random shuffling will be the same every time, helping create consistent results, especially when experimenting with different models or model variations.
- Train a maximum entropy model such as sklearn.linear_model.LogisticRegression on the labeled data.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500, solver='lbfgs')
model.fit(X_train, y_train)
- Use appropriate values for max_iter and solver depending on the size of the dataset and the available computing resources. The fit method is called to train the model on the training data.
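The split, training and evaluation can be tried end to end on toy data, as in the sketch below. The twenty feature dictionaries are hypothetical, the 30% hold-out and the stratify argument are assumptions made for this example, and accuracy is only one of several metrics that could be reported.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positive and 10 negative bigram dictionaries.
features_list = (
    [{"very good": 1}] * 5
    + [{"great food": 1}] * 5
    + [{"extremely disappointing": 1}] * 5
    + [{"bad service": 1}] * 5
)
sentiment_labels = [1] * 10 + [0] * 10

X = DictVectorizer(sparse=False).fit_transform(features_list)
y = sentiment_labels

# stratify=y keeps both classes in each split; the 30% hold-out is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=500, solver="lbfgs")
model.fit(X_train, y_train)

# Hard labels for the accuracy score, plus the probability of the positive class.
y_pred = model.predict(X_test)
positive_probability = model.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

On real review data the score will depend on the size and quality of the labelled corpus, so a result from toy data like this says nothing about how well the approach works in practice.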
Step 5: Use the trained model on the test data for prediction
- Use the trained model to predict sentiment labels for the test data.
- Store the predicted labels in a new column of the dataset or dataframe. The line below assumes that y_test is a pandas Series that still carries the row index of df.
df.loc[y_test.index, "predicted_label"] = model.predict(X_test)
- Create a function to analyze the trained model and identify the n-grams most strongly associated with positive and negative sentiment.
def find_sentiment_bigrams(vectorizer, model, top_n=10):
    feature_names = vectorizer.get_feature_names_out()
    feature_importance = model.coef_[0]  # Coefficients from logistic regression
    # Pair feature names with their importance scores
    feature_weights = list(zip(feature_names, feature_importance))
    # Sort by absolute importance
    sorted_features = sorted(feature_weights, key=lambda x: abs(x[1]), reverse=True)
    # Return top positive and negative bigrams
    top_positive = [bigram for bigram, weight in sorted_features if weight > 0][:top_n]
    top_negative = [bigram for bigram, weight in sorted_features if weight < 0][:top_n]
    return top_positive, top_negative
Writing a program to train a Logistic regression model to make predictions
The program’s main purpose is to classify texts and determine whether their sentiment is positive or negative. It then uses this labelled data to train a maximum entropy model such as logistic regression and saves the model for later predictions. The result is a valuable tool for understanding customer opinions more generally.
Objectives
- Load the restaurant review data into a pandas DataFrame.
- Apply preprocessing steps to clean and normalize the text:
  - Remove HTML tags and punctuation.
  - Tokenize the text into individual words.
  - Handle misspellings using a spell checker and a local dictionary.
  - Convert text to lowercase.
- Use VADER to calculate a sentiment score for each normalized review, indicating the overall sentiment (positive, negative, or neutral).
- Create a binary sentiment label (positive or negative) based on the sentiment score, assigning 1 for positive and 0 for negative.
- Extract bigrams from the normalized reviews as features for the machine learning model.
- Convert the extracted features (bigrams) into a numerical representation using DictVectorizer.
- Split the data into training and testing sets to train and evaluate the model.
- Train a Logistic Regression model using the training data and the assigned sentiment labels.
- Save the trained model and its vectorizer.
- Use the trained model to predict the sentiment labels of unseen data.
- Analyze the model’s coefficients to identify the bigrams that are most strongly associated with positive and negative sentiment.
- Print the top positive and negative bigrams and, optionally, visualize the sentiment scores of the reviews.
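The saving and reuse objectives could be approached roughly as follows. This sketch uses hard-coded labels in place of lexicon scores, pickle from the standard library for persistence, and a hypothetical file name; the reviews, the bigram_counts helper and the temporary directory are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def bigram_counts(text):
    """Lowercase, split on whitespace and count adjacent word pairs."""
    tokens = text.lower().split()
    counts = {}
    for first, second in zip(tokens, tokens[1:]):
        key = f"{first} {second}"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical labelled reviews (1 = positive, 0 = negative).
reviews = [
    "food was very good",
    "service was very good",
    "food was extremely disappointing",
    "service was extremely disappointing",
]
labels = [1, 1, 0, 0]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([bigram_counts(review) for review in reviews])
model = LogisticRegression(max_iter=500, solver="lbfgs").fit(X, labels)

# Save both objects together: the model relies on the vectorizer's column order.
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_model.pkl")
with open(model_path, "wb") as handle:
    pickle.dump({"vectorizer": vectorizer, "model": model}, handle)

with open(model_path, "rb") as handle:
    restored = pickle.load(handle)

# transform (not fit_transform) reuses the saved vocabulary; unseen bigrams are ignored.
unseen = ["dessert was very good", "pizza was extremely disappointing"]
X_unseen = restored["vectorizer"].transform([bigram_counts(review) for review in unseen])
predictions = restored["model"].predict(X_unseen).tolist()
print(predictions)
```

Keeping the vectorizer next to the model matters because new text has to be mapped onto exactly the feature columns the model was trained on.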
References
- Dey, A., Jenamani, M., & Thakkar, J. J. (2018). Senti-N-Gram: An n-gram lexicon for sentiment analysis. Expert Systems with Applications, 103, 92–105. https://doi.org/10.1016/j.eswa.2018.03.004
- Dzisevič, R., & Šešok, D. (2019). Text Classification using Different Feature Extraction Approaches. 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), 1–4. https://doi.org/10.1109/eStream.2019.8732167
- Nigam, K., Lafferty, J., & McCallum, A. (n.d.). Using Maximum Entropy for Text Classification.
- Prabhat, A., & Khullar, V. (2017). Sentiment classification on big data using Naïve bayes and logistic regression. 2017 International Conference on Computer Communication and Informatics (ICCCI), 1–5. https://doi.org/10.1109/ICCCI.2017.8117734
- Rawat, T., & Khemchandani, V. (2017). Feature engineering (FE) tools and techniques for better classification performance. International Journal of Innovations in Engineering and Technology, 8(2), 169–179.
- Scott, S., & Matwin, S. (1999). Feature engineering for text classification. ICML, 99, 379–388.
- StatQuest with Josh Starmer (Director). (2018, March 5). StatQuest: Logistic Regression [Video recording]. https://www.youtube.com/watch?v=yIYKR4sgzI8
- Talwar, A., & Kumar, Y. (2013). Machine Learning: An artificial intelligence methodology. International Journal of Engineering and Computer Science, 2(12), Article 12. https://ijecs.in/index.php/ijecs/article/view/2261
- Wang, S., Schuurmans, D., Peng, F., & Zhao, Y. (2003). Semantic n-gram language modeling with the latent maximum entropy principle. 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), 1, I–I. https://doi.org/10.1109/ICASSP.2003.1198796
I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments and data visualization. My training approach emphasizes real-world application, clear interpretation of results and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.
