Natural Language Processing for Bias Mitigation in Sentiment Analysis
Natural language processing (NLP) is a field of AI that deals with the interaction between computers and human language, enabling machines to understand, interpret, and generate it (de Jager, 2023).
NLP applications include chatbots, translation services, sentiment analysis, and voice assistants like Siri or Alexa.
Computers rely on structured data and struggle with irregularities such as slang, typos, dialect differences, punctuation, and inconsistent capitalization. Consider a message like:
I ll b der @ 10pm.
Raw text is unstructured and noisy. Preprocessing steps convert raw text into a structured format, improve model performance, reduce computational load, and handle inconsistencies in language. After cleaning and tokenization, the message above becomes:
["I", "will", "be", "there", "at", "10", "pm", "."]
Natural Language Processing (NLP) systems are only as reliable as the data they process. Preprocessing the data to mitigate bias is essential to ensure models produce accurate, fair, and generalizable results.
How does noise in data create bias in natural language processing?
The exponential growth of unstructured data has made it increasingly challenging to handle, especially as much of it is noisy. Informal communication channels such as comments, social networking posts, feedback, reviews, and chat often involve the deliberate use of non-standard word forms, adding to the complexity.

Noise can be categorized into types and sub-types; the discussion here is limited to naturally occurring, human-produced noise. Orthography refers to noise related to the way words are written. Some instances are considered errors, such as spelling mistakes, while others are seen as variations or deviations from standard writing that serve a purpose, like word obfuscation or lengthening (Al Sharou et al., 2021). Digital text generated in informal environments like online chats, SMS, emails, message boards, newsgroups, blogs, wikis, and web pages often contains a significant amount of noise. Such text includes spelling errors, special characters, non-standard word forms, grammar mistakes, and multilingual words, among other issues. Substitution is very common in text messaging: words or characters are replaced with numbers or letters that share the same phonetic sound in order to shorten the message, as in “2day” for “today”, “l8r” for “later”, and “byk” for “bike” (Subramaniam et al., 2009).
Left unaddressed, such noise leads to mistranslations, misclassifications in sentiment analysis, or chatbots misunderstanding user intent.
Orthographic variants refer to the different ways words are spelled due to regional variation, such as British vs. American English (‘centre’ vs. ‘center’), or words with multiple correct spellings (‘spelled’ vs. ‘spelt’). Word obfuscation involves disguising some characters within a word using numbers or symbols, for example to mask an offensive term, as in “sh*t” for “shit” (Al Sharou et al., 2021).
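To illustrate, obfuscated forms can sometimes be recovered with pattern matching. The sketch below is a rough heuristic under stated assumptions: a tiny hand-picked list of terms, with each inner character of a term allowed to be replaced by a symbol or digit.

```python
import re

# Assumption: a small illustrative list of terms to detect even when an
# inner character is disguised ("sh*t", "sh1t", "h8te", ...).
TERMS = ["shit", "hate"]

def obfuscation_pattern(term: str) -> re.Pattern:
    """Allow each inner letter to be itself, a symbol, or a digit."""
    inner = "".join(f"[{ch}\\W\\d]" for ch in term[1:-1])
    return re.compile(rf"\b{term[0]}{inner}{term[-1]}\b", re.IGNORECASE)

patterns = {term: obfuscation_pattern(term) for term in TERMS}

for text in ["what a sh*t day", "sh1t happens", "no h8te here"]:
    for term, pattern in patterns.items():
        if pattern.search(text):
            print(f"{text!r} -> obfuscated form of {term!r}")
```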
Furthermore, jokes, idioms, or metaphors that are absent from the model’s training data can lead to misinterpretation of the groups who use them. Moreover, punctuation marks are frequently employed to introduce meta-discourse and to convey emotions and verbal effects such as laughter or attitude (Subramaniam et al., 2009). Such noise is also referred to as syntactic or lexical noise.
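Because punctuation and casing can carry emotion, one option is to capture them as features before any cleaning step strips them away. A minimal sketch, with feature names and an emoticon pattern that are illustrative assumptions rather than a standard set:

```python
import re

def punctuation_features(text: str) -> dict:
    """Count simple punctuation cues that often signal emotion in informal text."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "ellipses": len(re.findall(r"\.{2,}", text)),
        "all_caps_words": len(re.findall(r"\b[A-Z]{2,}\b", text)),
        "emoticons": len(re.findall(r"[:;=][\-^]?[)(DPp]", text)),
    }

print(punctuation_features("WOW this is great!!! :)"))
# {'exclamations': 3, 'questions': 0, 'ellipses': 0, 'all_caps_words': 1, 'emoticons': 1}
```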
Semantic noise, like the kinds of textual noise discussed above, is a fundamental barrier to robust NLP systems, exacerbating errors, bias, and user dissatisfaction (de Jager, 2023). The use of non-standard language is often also a means of expressing identity, indicating authenticity, solidarity, or resistance to imposed norms (Bucholtz & Hall, 2005). Addressing such noise therefore requires contextual understanding, diverse data, and ethical vigilance. Without preprocessing, machine learning models learn spurious correlations, resulting in biased or discriminatory outcomes.
Context-oriented processing
While noisy data can degrade model performance, understanding its effects and implementing strategies to mitigate them can lead to robust and generalizable models. The preprocessing phase is where decisions are typically made on whether to clean, normalize, or retain the data as it is. Data normalization reduces inconsistencies in text, for example by applying a noisy channel model, and is often framed as a translation problem (Jose & Raj, 2014). Understanding the context of a word before normalizing or cleaning the data is critically important. Ignoring contextual information about an aspect or a word and treating it as “semantic noise” can lead to the loss of linguistic functionality and a narrow interpretation of the semantic capacity of language. Human interpreters resolve ambiguous language by contextualizing and making inferences, which cannot simply be dismissed as “guessing” (de Jager, 2023).
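As a toy illustration of the noisy channel idea, the sketch below picks the most probable in-vocabulary word within one edit of a noisy token, using made-up unigram counts as the language-model prior and a uniform error model. It handles single-character spelling noise only; phonetic substitutions such as “2day” generally need a dedicated lexicon or a translation-style model.

```python
from collections import Counter

# Toy unigram counts standing in for a language model estimated on clean text.
VOCAB = Counter({"today": 120, "later": 80, "late": 60, "toady": 2})

def edits1(word: str) -> set[str]:
    """All strings one delete, swap, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def normalize(word: str) -> str:
    """Noisy-channel pick: the most probable known word near the input."""
    if word in VOCAB:
        return word
    candidates = [w for w in edits1(word) if w in VOCAB] or [word]
    # With a uniform error model, the language-model prior decides.
    return max(candidates, key=lambda w: VOCAB[w])

print(normalize("tody"))   # today
print(normalize("latr"))   # later
```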
Consider a word with multiple senses, such as “bank”, which can refer to a financial institution or the side of a river. Without context, preprocessing steps like lemmatization or stemming may incorrectly normalize words, leading to a loss of meaning.
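A small illustration with NLTK's Porter stemmer (assuming nltk is installed): the stemmer sees only the isolated word form, so distinct senses collapse together and even distinct words can end up with the same stem.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# No context is available, so every sense of "bank" is reduced identically,
# and "university"/"universe" collapse to the same stem.
for word in ["banking", "banks", "banked", "university", "universe"]:
    print(word, "->", stemmer.stem(word))
# banking -> bank, banks -> bank, banked -> bank
# university -> univers, universe -> univers
```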
Models may memorize noise, such as the different meanings of the same aspect term, instead of learning meaningful patterns, especially with high-capacity architectures like deep neural networks. Aspect-based methods of sentiment analysis can be roughly divided into two types:
- non-deep learning methods, and
- deep learning methods.
Non-deep learning methods involve creating a set of features and training a classifier for aspect-category sentiment analysis using machine learning techniques. This approach is generally applicable but faces challenges such as identifying emotional keywords in texts containing multiple terms, misspellings, and slang. To address these issues, an efficient feature vector should be created through a two-step feature extraction process after preprocessing (Neethu & Rajasree, 2013). Non-deep learning methods use features such as lexicon features, n-grams, and word co-occurrence frequencies, with classifiers such as Maximum Entropy, Support Vector Machine (SVM), and Fuzzy Lattice Reasoning (FLR) trained to predict sentiment polarity. These methods are easy to implement and computationally simple but struggle to capture complex non-linear relationships between features and sentiment polarity (Liao et al., 2021).
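A compact sketch of such a non-deep learning pipeline with scikit-learn, using TF-IDF-weighted unigram and bigram features and a linear SVM; the texts and labels below are toy placeholders, not a real corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data; a real system would use a labelled corpus.
texts = [
    "the battery life is great",
    "screen quality is terrible",
    "love the camera on this phone",
    "worst purchase ever, total waste",
]
labels = ["positive", "negative", "positive", "negative"]

# Word and bigram features weighted by TF-IDF, fed to a linear SVM classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LinearSVC(),
)
model.fit(texts, labels)

print(model.predict(["the camera is great"]))  # e.g. ['positive']
```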
In contrast, deep learning methods can model complex non-linear relations. Word2vec offers an effective and simple way to vectorize words, useful in various NLP tasks. Popular deep neural networks for aspect-category sentiment analysis include 1D-CNN and LSTM networks. Recently, Transformer-based models like BERT, RoBERTa, and ALBERT have achieved state-of-the-art performance across various NLP tasks (Liao et al., 2021).
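With the Hugging Face transformers library, a pre-trained Transformer can be applied to sentiment analysis in a few lines. The sketch below assumes the library is installed and that the default English sentiment checkpoint can be downloaded on first use.

```python
from transformers import pipeline

# Loads a default fine-tuned sentiment model (a DistilBERT checkpoint) on first use.
classifier = pipeline("sentiment-analysis")

result = classifier("The service was slow, but the food was absolutely worth it.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```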
Above all, sentiment analysis surfaces the fact that different people hold different opinions and emotional tendencies on the same issue, insights that are powerful for competitive and market analysis. By incorporating appropriate context-aware techniques, sentiment analysis models can better handle the complexities of human language and mitigate bias to deliver meaningful, ethical outcomes.
References
- Al Sharou, K., Li, Z., & Specia, L. (2021). Towards a Better Understanding of Noise in Natural Language Processing. In R. Mitkov & G. Angelova (Eds.), Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021) (pp. 53–62). INCOMA Ltd. https://aclanthology.org/2021.ranlp-1.7/
- Bucholtz, M., & Hall, K. (2005). Identity and interaction: A sociocultural linguistic approach. Discourse Studies, 7(4–5), 585–614. https://doi.org/10.1177/1461445605054407
- de Jager, S. (2023). Semantic Noise and Conceptual Stagnation in Natural Language Processing. Angelaki, 28(3), 111–132. https://doi.org/10.1080/0969725X.2023.2216555
- Liao, W., Zeng, B., Yin, X., & Wei, P. (2021). An improved aspect-category sentiment analysis model for text sentiment analysis based on RoBERTa. Applied Intelligence, 51(6), 3522–3533. https://doi.org/10.1007/s10489-020-01964-1
- Neethu, M. S., & Rajasree, R. (2013). Sentiment analysis in twitter using machine learning techniques. 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), 1–5. https://doi.org/10.1109/ICCCNT.2013.6726818
- Subramaniam, L. V., Roy, S., Faruquie, T. A., & Negi, S. (2009). A survey of types of text noise and techniques to handle noisy text. Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, 115–122. https://doi.org/10.1145/1568296.1568315
I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments, and data visualization. My training approach emphasizes real-world application, clear interpretation of results, and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.