Lexical normalisation techniques for lexicon-based sentiment analysis
A large share of the world's population now uses the internet to share opinions through product reviews, which strongly influence buying decisions. Lexical normalisation is the process of converting the noisy, informal, or misspelled text commonly found on the internet into its standardised form. A lexicon is a collection of words and their meanings, such as those used in a review, a piece of feedback, or a message by an individual. It serves as the dictionary or vocabulary that Natural Language Processing (NLP) models refer to while analysing or classifying text based on predefined word associations.
In sentiment analysis, the accuracy of the results depends heavily on the quality of the text data. Because raw text contains many forms of noise, such as slang, typos, emojis, and unnecessary symbols, the study of natural language processing has increasingly focused on normalising the raw data. The presence of emoticons, slang and misspellings in the raw data necessitates a preprocessing step before feature extraction (Neethu & Rajasree, 2013).
Identifying irregular words from noise
The normalisation of unstructured text data forms an essential part of natural language processing (NLP). Lexical normalisation consists of removing punctuation and stopwords and rewriting ill-formed text using more conventional spelling to make it more readable (Jose & Raj, 2014). Stopwords are words that occur very frequently in a corpus but carry little informational value. As a result, they are frequently removed during text preprocessing. In tasks like information retrieval, stopwords contribute minimally to meaningful results, and their removal helps reduce computation time. Researchers recommend eliminating stopwords to improve efficiency (Hickman et al., 2022).
Jose & Raj (2014) found in their study that most internet data, such as Twitter messages or product reviews, is heavily laden with ill-formed tokens. Furthermore, the informal nature of such data adds to the complexity of processing it for sentiment analysis.
Users often extend words for emphasis, such as “goooood” instead of “good.”
Additionally, identifying such distorted words is challenging due to the noisy context. Therefore, the first goal is to convert these irregular words into their standard English forms. Han & Baldwin (2011) used a classifier to identify misspelled or distorted words and generated correction suggestions based on phonetic similarities. The most suitable correction was then chosen by considering both word similarity and the context of the sentence. Lexical normalisation is carried out in three phases (Han & Baldwin, 2011), illustrated by the sketch after the list:
- Generating a confusion set.
- Detecting ill-formed words.
- Selecting the most appropriate correction replacement.
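To make the three phases concrete, here is a minimal, hypothetical sketch that uses a toy vocabulary and simple string similarity from Python's standard difflib module. Han & Baldwin (2011) instead use a classifier and phonetic similarity, so this is only an illustration of the idea, not their method:

```python
import difflib

# Toy vocabulary standing in for a full English dictionary (illustrative only).
VOCABULARY = {"good", "night", "receive", "tomorrow", "see", "you", "awesome"}

def detect_ill_formed(tokens):
    """Phase 2: flag tokens that are not in the vocabulary as ill-formed."""
    return [t for t in tokens if t.lower() not in VOCABULARY]

def confusion_set(word, n=5, cutoff=0.6):
    """Phase 1: generate a confusion set of plausible corrections.
    Candidates are ranked here by string similarity only; Han & Baldwin (2011)
    additionally use phonetic similarity and a classifier."""
    return difflib.get_close_matches(word.lower(), VOCABULARY, n=n, cutoff=cutoff)

def select_correction(word):
    """Phase 3: pick the most appropriate replacement (top-ranked candidate)."""
    candidates = confusion_set(word)
    return candidates[0] if candidates else word

tokens = ["nite", "goooood", "recieve"]
for token in detect_ill_formed(tokens):
    print(token, "->", select_correction(token))
# nite -> night, goooood -> good, recieve -> receive
```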
The lexical normalisation of noisy text is supported by a multichannel model and a bottom-up parser. Lexical normalisation performs the preprocessing and normalisation tasks. Furthermore, it uses a multichannel database organised into an abbreviation channel, a non-noisy channel, a grapheme channel and a phoneme channel. The grapheme channel handles spelling alterations (e.g., recieve instead of receive), the phoneme channel handles phonetic modifications (e.g., nite instead of night) and the abbreviation channel handles forms built from a letter or group of letters extracted from a word or phrase (e.g., ASAP for as soon as possible) (Jose & Raj, 2014).
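As a rough illustration of how such channels can be organised, the sketch below represents each channel as a small lookup table that is consulted in turn. The entries and the data structure are hypothetical and do not reproduce the actual model of Jose & Raj (2014):

```python
# Hypothetical illustration of a multichannel lookup: each channel maps a noisy
# form to its normalised form. The entries here are toy examples.
CHANNELS = {
    "abbreviation": {"asap": "as soon as possible", "btw": "by the way"},
    "grapheme":     {"recieve": "receive", "definately": "definitely"},
    "phoneme":      {"nite": "night", "gr8": "great"},
}

def normalise_token(token):
    """Return the first match found across the channels; otherwise treat the
    token as belonging to the non-noisy channel and return it unchanged."""
    for channel, mapping in CHANNELS.items():
        if token.lower() in mapping:
            return mapping[token.lower()]
    return token

print([normalise_token(t) for t in ["recieve", "nite", "asap", "food"]])
# ['receive', 'night', 'as soon as possible', 'food']
```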
The lexical normalisation process
The lexical normalisation process standardises input tokens through multiple channels and incorporates a user feedback system to ensure consistency with the dictionary and grammatical rules (Ahmed, 2015). User intervention is crucial in text normalisation, as people tend to introduce different spellings for the same word. To address this, users can add and select their preferred correction for a specific noisy word. This approach ensures more accurate and effective text normalisation (Jose & Raj, 2014).
After loading the data, the first step in lexical normalisation is to standardise the sentences by removing HTML tags and replacing emojis and emoticons with sentiment polarity words. Then tokenize each sentence word by word to identify typos, abbreviations, or slang. Check each token against a vocabulary or dictionary of words by searching for an exact match. If a word is found, classify it as “in vocabulary”; otherwise mark it as “out of vocabulary” for further alteration (Ahmed, 2015). The tokenization step also removes empty spaces and punctuation.
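A minimal sketch of this cleaning and tokenization step, assuming the beautifulsoup4, emoji and nltk packages are installed. Here emoji.demojize converts emojis to their textual names; mapping those names to sentiment polarity words would need an extra lookup table:

```python
import re
import emoji
import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer model used by word_tokenize

raw = "The pasta was <b>goooood</b> 😍!!! Service was osm :)"

# 1. Strip HTML tags.
text = BeautifulSoup(raw, "html.parser").get_text()

# 2. Replace emojis with their textual names, e.g. 😍 -> smiling_face_with_heart-eyes.
#    A custom dictionary could map these names to polarity words such as "love".
text = emoji.demojize(text, delimiters=(" ", " "))

# 3. Collapse repeated punctuation and lowercase the text.
text = re.sub(r"([!?.])\1+", r"\1", text).lower()

# 4. Tokenize and drop tokens that are pure punctuation.
tokens = [t for t in word_tokenize(text) if any(ch.isalnum() for ch in t)]
print(tokens)
```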
Furthermore, use NLTK’s stopwords corpus to match and remove stopwords from the token list. Stopwords are common words like conjunctions (e.g., “or,” “and,” “but”) and pronouns (e.g., “he,” “she,” “it”) that appear frequently in sentences but contribute little to sentiment analysis. This approach is based on the idea that eliminating non-discriminative words reduces the classifier’s feature space, leading to more accurate results (Saif et al., n.d.).
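For example, stopword filtering with NLTK's built-in English stopword list might look like this:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))

tokens = ["the", "pasta", "was", "good", "but", "the", "service", "was", "slow"]

# Keep only the tokens that are not in the stopword list.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['pasta', 'good', 'service', 'slow']
```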
The following resources can be used to detect ill-formed words and suggest corrections:
- Local dictionary: A self-made set of ill-formed words with their correct replacements.
- Pyspellchecker: A Python spell-checking library with a vast corpus, or vocabulary, of clean words.
- Fuzzywuzzy: A Python module used to compare and match text even when there are slight differences, typos, or variations. It is built on the Levenshtein distance.
Firstly, ill-formed words among the tokens can be identified by passing the tokens through Pyspellchecker’s unknown-words function. User intervention is required to provide an appropriate replacement, ideally after considering the context, unless a predefined replacement already exists. The initial set of approximate matches is obtained by computing the Levenshtein distance between the query and the words in the dictionary; this produces a first set of candidates ranked by their textual similarity to the query (Ahmed, 2015). The user can then be asked to choose the right candidate from the suggested matches, as sketched below. If the user skips the intervention or does not enter a replacement, Pyspellchecker’s most probable correction can be used. The chosen replacements should be stored in a file for future matching. Additionally, lemmatize the tokens with a part-of-speech tagger to convert the words to their base form and improve precision (Komori & Eguchi, 2019).
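A minimal sketch of this detection-and-correction step, assuming pyspellchecker and fuzzywuzzy are installed; the corrections.json file name is an arbitrary choice for storing the user-defined replacements:

```python
import json
import os
from spellchecker import SpellChecker
from fuzzywuzzy import process

LOCAL_DICT_FILE = "corrections.json"  # arbitrary file name for stored corrections

spell = SpellChecker()
local_dict = json.load(open(LOCAL_DICT_FILE)) if os.path.exists(LOCAL_DICT_FILE) else {}

tokens = ["food", "was", "gud", "servce", "nice"]

for word in spell.unknown(tokens):            # detect out-of-vocabulary tokens
    if word in local_dict:                    # reuse a previously stored correction
        replacement = local_dict[word]
    else:
        # Rank candidate corrections by string similarity (fuzzywuzzy), then
        # let the user pick one; fall back to pyspellchecker's best guess.
        candidates = list(spell.candidates(word) or [])
        suggestions = [m for m, _ in process.extract(word, candidates, limit=3)] if candidates else []
        answer = input(f"Replace '{word}' (suggestions: {suggestions}): ").strip()
        replacement = answer or spell.correction(word) or word
        local_dict[word] = replacement        # remember the choice for next time
    tokens = [replacement if t == word else t for t in tokens]

with open(LOCAL_DICT_FILE, "w") as fh:
    json.dump(local_dict, fh)

print(tokens)
```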
Lemmatization is the process of reducing a word like “improving” or “improvements” to its base form, that is, “improve”.
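A minimal sketch of POS-aware lemmatization with NLTK's WordNetLemmatizer; the Treebank-to-WordNet tag mapping below is a common convention rather than part of any cited method:

```python
import nltk
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

for pkg in ("averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

def to_wordnet_pos(treebank_tag):
    """Map Penn Treebank tags (from pos_tag) to WordNet POS constants."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = ["improving", "improvements", "loved", "dishes"]

lemmas = [lemmatizer.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in pos_tag(tokens)]
print(lemmas)  # e.g. ['improve', 'improvement', 'love', 'dish'] (depends on assigned tags)
```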

Achieving accuracy in normalisation is challenging due to the numerous possible variations of a given token. This becomes even more complex with the continuous evolution of elisions and acronyms commonly used on the internet. It is essential to consider the different normalisation techniques available and select the ones that best fit the specific task.
Dictionary-based sentiment analysis is a rule-based approach that determines the sentiment of a text using predefined word lists (lexicons). The tokens, or bag of words, are matched against a predefined sentiment dictionary and their occurrences in the corpus are counted. Representing a text by such counts is also known as text vectorization.
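A minimal sketch of this matching-and-counting step with a toy sentiment lexicon (the word lists are illustrative only; real lexicons are much larger):

```python
from collections import Counter

# Toy sentiment lexicon mapping words to a polarity label (illustrative only).
LEXICON = {"good": "positive", "great": "positive", "love": "positive",
           "bad": "negative", "terrible": "negative", "hate": "negative"}

tokens = ["love", "pasta", "great", "service", "bad", "dessert", "great"]

# Count only the tokens that appear in the sentiment lexicon.
matches = Counter(t for t in tokens if t in LEXICON)
polarity = Counter(LEXICON[t] for t in tokens if t in LEXICON)

print(matches)   # Counter({'great': 2, 'love': 1, 'bad': 1})
print(polarity)  # Counter({'positive': 3, 'negative': 1})
```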
Exercise: Write Python code for the lexical normalisation process
Your challenge is to build a Python script that normalises text data by implementing the transformations discussed above. Develop a pipeline that performs the following (a possible skeleton is sketched after the library list):
- Load the sample dataset with raw restaurant reviews.
- Clean the raw text line by line. Remove HTML tags and special characters. Replace emojis and emoticons.
- Tokenize each line into a list of words or tokens. Fix ill-formed words like “goood” → “good”. Remove unnecessary punctuation.
- Filter unnecessary stopwords and correct spelling mistakes using Pyspellchecker.
- Replace abbreviations and slang with their proper forms using a dictionary. Use a local dictionary of abbreviations to replace “osm” → “awesome”. If a word is unknown, ask the user for input. Store new words in the local dictionary file for future use.
- Lemmatize the tokens with a part-of-speech tagger to their base form, e.g., “amazing” → “amaze”.
- Create a lexicon dictionary with the sentiment words “happy”, “sad”, “good”, “bad”, “love”, “hate”, “great”, “terrible”, “excellent” and “awful”.
- Count the occurrences of the predefined sentiment lexicon in the dataset.
- Plot the frequency using a bar chart.
Essential libraries and modules to use:
- beautifulsoup4: text cleaning and standardisation.
- emoji: converts emojis to words.
- re: regex to find and replace repeated punctuation.
- pyspellchecker: corrects misspelled words.
- json: flattens the Python dictionary and stores user-defined corrections to a file.
- Fuzzywuzzy: helps find approximate matches for slang from the file.
- nltk.corpus.stopwords: provides a list of common words to remove.
- pos_tag from nltk: grammatically tags the words (noun, verb, adjective, etc.).
- WordNetLemmatizer from nltk: lemmatizes the tokens to their root form.
- Counter from collections: counts the occurrence or frequency of each word in a review.
- matplotlib: plots the graphs.
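One possible skeleton for the pipeline, with the step functions left as trivial stubs to replace with your own implementations; the file name restaurant_reviews.txt and the function names are only suggestions:

```python
"""Skeleton for the lexical normalisation pipeline; replace each stub."""

from collections import Counter
import matplotlib.pyplot as plt

SENTIMENT_WORDS = ["happy", "sad", "good", "bad", "love", "hate",
                   "great", "terrible", "excellent", "awful"]

def clean_text(line):
    # TODO: strip HTML tags, replace emojis/emoticons, remove special characters.
    return line

def tokenize(line):
    # TODO: use nltk word_tokenize and drop punctuation; .split() is a placeholder.
    return line.lower().split()

def normalise_tokens(tokens):
    # TODO: remove stopwords, correct spelling, expand slang, lemmatize.
    return tokens

def main(path="restaurant_reviews.txt"):   # assumed name of the sample dataset
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tokens = normalise_tokens(tokenize(clean_text(line)))
            counts.update(t for t in tokens if t in SENTIMENT_WORDS)

    # Plot the frequency of each sentiment word as a bar chart.
    plt.bar(list(counts.keys()), list(counts.values()))
    plt.xlabel("Sentiment word")
    plt.ylabel("Frequency")
    plt.show()

if __name__ == "__main__":
    main()
```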
References
- Ahmed, B. (2015). Lexical normalisation of Twitter Data. 2015 Science and Information Conference (SAI), 326–328. https://doi.org/10.1109/SAI.2015.7237164
- Han, B., & Baldwin, T. (2011). Lexical Normalisation of Short Text Messages: Makn Sens a #twitter. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 368–378). Association for Computational Linguistics. https://aclanthology.org/P11-1038/
- Jose, G., & Raj, N. S. (2014). Lexico-syntactic normalization model for noisy SMS text. 2014 International Conference on Electronics, Communication and Computational Engineering (ICECCE), 163–168. https://doi.org/10.1109/ICECCE.2014.7086652
- Komori, O., & Eguchi, S. (2019). Statistical methods for imbalanced data in ecological and biological studies. In SpringerBriefs in statistics. https://doi.org/10.1007/978-4-431-55570-4
- Neethu, M. S., & Rajasree, R. (2013). Sentiment analysis in twitter using machine learning techniques. 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), 1–5. https://doi.org/10.1109/ICCCNT.2013.6726818
- Saif, H., Fernandez, M., He, Y., & Alani, H. (n.d.). On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter.
I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments and data visualization. My training approach emphasizes real-world application, clear interpretation of results and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.