Fine-tuning a pre-trained transformer model for sentiment analysis

By Abhinash Jena on April 14, 2025

Fine-tuning a pre-trained model involves taking a model already trained on a large, general dataset and adapting it to perform well on a smaller, task-specific dataset. Transformers is an open-source library that provides many pre-trained models, including large language models (LLMs), for training and inference (Hugging Face, n.d.-b). Transformer language models use probabilistic methods to predict a missing word, or the next words, in a given sentence.

EXAMPLE

New Delhi is the ________ of India.

A language model can predict with > 95% probability that the missing word in the sentence is “capital”.

Transformer-based models have revolutionized sentiment analysis due to their ability to capture contextual meaning, nuance, and long-range dependencies in text. Several pre-trained transformer models, such as ALBERT, BERT, and RoBERTa, have focused on improving text understanding and context. Transformer models use self-attention to model bidirectional context and relationships between all words in a sentence.

pip install transformers torch

from transformers import pipeline

# Load the default pre-trained sentiment-analysis pipeline
classifier = pipeline('sentiment-analysis')
result = classifier("I love transformer models!")
print(result)
# Example output: [{'label': 'POSITIVE', 'score': 0.9998}]
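The pipeline hides the self-attention step mentioned earlier. As a rough illustration only, the sketch below computes scaled dot-product self-attention in plain Python. It assumes made-up 2-dimensional embeddings and uses the embeddings directly as queries, keys, and values; real transformer layers apply learned projections and multiple heads, so this is not how BERT is implemented.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors (no learned projections)."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this word to every word in the sentence, itself included
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all word vectors
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings for a three-word sentence (values invented for illustration)
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sentence. Words with similar vectors receive larger weights, which is the sense in which attention weighs word importance dynamically.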

Models like Word2Vec and lexicon-based models treat words in isolation and miss the context. Supervised ML models like logistic regression, by contrast, learn language patterns when the training data is good, and they can be tailored to different domains by training on domain-specific data. Still, probabilistic machine learning models capture only limited nuances in a sentence, whereas transformers use self-attention to weigh word importance dynamically.

Transformer models like BERT are pretrained on large corpora and outperform traditional machine learning models like SVM, LSTM and Naïve Bayes, but they require high computational resources. The cost of training these models is a significant challenge. Based on information released by Google, Sharir et al. (2020) estimate that, at list price, training an 11-billion-parameter variant of T5 costs well above $10 million. This highlights the substantial financial investment required to develop these models, which can be a barrier for many organizations. Moreover, neural networks require extensive tuning to achieve good results (Kuhn & Johnson, 2019). Thus, traditional machine learning models are often the better choice when the data is small and resources are limited (Dhola & Saradva, 2021).

Fine-tuning a pre-trained model

Pre-trained models are trained on large corpora using unsupervised learning. BERT significantly advances the state of the art across a diverse set of NLP tasks, demonstrating the power of deep bidirectional pre-training. It establishes a new benchmark for language understanding. Instead of training a model from scratch, fine-tuning leverages the general language understanding learned during pre-training and refines it for a particular application like sentiment analysis (Devlin et al., 2019). Fine-tuning a pre-trained model is a crucial step where the pre-trained model adapts to perform a specific task. It also significantly reduces high training costs and improves model performance for a specific task. Fine-tuning involves taking a neural network that has already been trained on a large general dataset like Wikipedia and continuing the training process on a specific, typically smaller dataset (Hugging Face, n.d.-a). It starts with a pre-trained model’s weights as a foundation and adapts them through additional training on a smaller dataset aligned for target tasks and use cases. This makes fine-tuning computationally cheaper than pre-training a model from scratch.

EXAMPLE

The BERT model trained on a large Wikipedia corpus that is made up of formally written English with correct grammar and spellings can be adapted to do sentiment analysis on a dataset consisting of loosely written text in Hinglish.

During fine-tuning, the loss function acts as a railing that makes the model aware of poor predictions. The loss function quantifies the difference between the model’s output (predictions) and the actual labelled data. Fine-tuning aims to find the optimal parameters that minimize the loss function. The loss function is also the measure to evaluate the model’s performance on unseen or test data. The commonly used loss function in text classification tasks is the Cross-entropy loss function.
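To make the cross-entropy loss concrete, here is a minimal sketch in plain Python for a single example. The logits are invented values for a hypothetical two-class (negative, positive) classifier. In practice the library computes this loss internally, so the code only illustrates the formula.

```python
import math

def softmax(logits):
    """Convert raw model outputs (logits) into class probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_label):
    """Negative log-probability that the model assigns to the correct class."""
    probs = softmax(logits)
    return -math.log(probs[true_label])

# Hypothetical logits in the order [negative, positive]; the true label is positive (index 1)
confident_correct = cross_entropy([-2.0, 3.0], true_label=1)
confident_wrong = cross_entropy([3.0, -2.0], true_label=1)
print(round(confident_correct, 4), round(confident_wrong, 4))
```

A confident correct prediction yields a loss close to zero, while a confident wrong prediction is penalized heavily. Fine-tuning adjusts the parameters to push the average of this value down over the training set.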

Figure: Fine-tuning a pre-trained model (Sankar, 2024)

Evaluating the fine-tuning progress

To evaluate the fine-tuning progress of the transformer model, observe the training and evaluation loss. This is provided by the Hugging Face Trainer class. The Trainer class in the Transformers library is designed to handle the training loops (epochs) and provides built-in mechanisms for logging and evaluation. When setting up the Trainer, use TrainingArguments to configure various training parameters, including logging and evaluation settings. Key arguments for this purpose include:

  • output_dir: The directory where training outputs (including logs and checkpoints) will be written.
  • logging_dir: The directory where TensorBoard logs will be saved (if report_to="tensorboard").
  • logging_strategy: Controls when logs are generated (“steps”, “epoch”, “no”). “steps” is often useful for detailed tracking.
  • logging_steps: The number of update steps between two logging actions.
  • evaluation_strategy: Controls when evaluation is performed (“no”, “steps”, “epoch”). “steps” or “epoch” are common.
  • eval_steps: The number of update steps between two evaluations if evaluation_strategy="steps".
  • per_device_eval_batch_size: The batch size for evaluation.
  • report_to: Integrations to use for reporting logs and results (e.g., “tensorboard”, “wandb”, “comet_ml”).

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",          # checkpoints and other training outputs
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",            # TensorBoard logs
    logging_strategy="steps",
    logging_steps=100,               # log the training loss every 100 update steps
    evaluation_strategy="epoch",     # or "steps"; recent transformers releases rename this to eval_strategy
    eval_steps=500,                  # only used if evaluation_strategy="steps"
    report_to="tensorboard",         # or "wandb", etc.
    save_strategy="epoch",
)

When creating the Trainer instance, pass the eval_dataset argument containing the evaluation data.

# model, tokenizer, and the tokenized datasets are assumed to be defined earlier
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer,
    # compute_metrics=your_evaluation_function  # Optional: for more detailed metrics
)

trainer.train()
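The optional compute_metrics argument expects a function that receives the model's predictions and the true labels and returns a dictionary of metric names and values. A minimal sketch that reports accuracy is shown below. It assumes the evaluation output unpacks into (logits, labels), and it uses plain Python loops so it works with lists as well as arrays. The sample inputs are invented.

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels); a sketch, not a full metric suite."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Invented logits for three reviews and their true labels
sample_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
sample_labels = [1, 0, 0]
print(compute_metrics((sample_logits, sample_labels)))
```

Passing such a function as compute_metrics makes the returned keys appear alongside the evaluation loss in the logs. Metrics like precision, recall, or F1 could be added to the same dictionary.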

During training, the Trainer periodically logs information to the console, including the training loss every logging_steps update steps. If an evaluation_strategy is set, it also logs the evaluation loss (and any other metrics computed by compute_metrics, if provided) at the specified intervals. The training and evaluation losses (and other metrics) are also written to the specified logging directory. After training, the trainer.state.log_history attribute contains a list of dictionaries, where each dictionary holds the metrics logged at one logging or evaluation step. To explicitly run evaluation on the eval_dataset after training (or at any point), use trainer.evaluate(). This returns a dictionary containing the evaluation loss and any other metrics computed by compute_metrics (Karagiannakos, 2021).

evaluation_results = trainer.evaluate()
print(evaluation_results)
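The logged history can be separated into training and evaluation curves for plotting or inspection. The sketch below assumes that training entries carry a "loss" key and evaluation entries an "eval_loss" key, each with a "step". The sample_history list is invented to stand in for trainer.state.log_history, so the numbers are not real results.

```python
def split_losses(log_history):
    """Separate logged entries into (step, training loss) and (step, evaluation loss) pairs."""
    train_losses, eval_losses = [], []
    for entry in log_history:
        if "loss" in entry:
            train_losses.append((entry["step"], entry["loss"]))
        if "eval_loss" in entry:
            eval_losses.append((entry["step"], entry["eval_loss"]))
    return train_losses, eval_losses

# Invented entries standing in for trainer.state.log_history
sample_history = [
    {"loss": 0.62, "learning_rate": 1.9e-05, "epoch": 0.5, "step": 100},
    {"loss": 0.41, "learning_rate": 1.7e-05, "epoch": 1.0, "step": 200},
    {"eval_loss": 0.38, "epoch": 1.0, "step": 200},
    {"loss": 0.29, "learning_rate": 1.5e-05, "epoch": 1.5, "step": 300},
    {"loss": 0.22, "learning_rate": 1.3e-05, "epoch": 2.0, "step": 400},
    {"eval_loss": 0.40, "epoch": 2.0, "step": 400},
]

train_losses, eval_losses = split_losses(sample_history)
print("train:", train_losses)
print("eval: ", eval_losses)
```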

Interpreting training and evaluation loss

Evaluating the training progress of a transformer model is important to ensure that the model is learning effectively and generalizing well. Training deep learning models like Transformers is computationally expensive. With proper evaluation, training can be stopped early when the validation performance plateaus or degrades. This helps to prevent overfitting (memorizing the training data), to detect underfitting (not learning enough), and to make better use of time and hardware.

Training loss indicates how well the model is fitting the training data. Ideally, it should decrease over time. A plateau or increase in training loss suggests either an issue such as a learning rate that is too high, or that the model has converged. Evaluation loss indicates how well the model generalizes to unseen data, which makes it the more important metric for assessing the model’s performance. Low training loss and low evaluation loss are the ideal scenario, suggesting the model is learning well and generalizing effectively. Low training loss and high evaluation loss indicate that the model has memorized the training data but is not generalizing well to new data. Techniques like regularization, dropout, or more data can help. High training loss and high evaluation loss suggest that the model is not learning the training data well. Increasing the number of training epochs and adjusting the learning rate can help. Finally, evaluation loss that rises while training loss keeps falling indicates overfitting (Karagiannakos, 2021).
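These patterns can be turned into a rough automated check. The sketch below is a simplified heuristic rather than an established diagnostic. The function name, the tolerance of 0.01, and the sample loss curves are all assumptions chosen for illustration.

```python
def diagnose_losses(train_losses, eval_losses, tolerance=0.01):
    """Classify a pair of loss curves using the rules of thumb described above."""
    # Training loss counts as falling if it ends clearly below where it started
    train_falling = train_losses[-1] < train_losses[0] - tolerance
    # Evaluation loss counts as rising if it ends clearly above its best value
    eval_rising = eval_losses[-1] > min(eval_losses) + tolerance
    if train_falling and eval_rising:
        return "overfitting"
    if not train_falling:
        return "not learning"
    return "learning and generalizing"

# Invented per-epoch loss curves
print(diagnose_losses([0.70, 0.40, 0.20, 0.10], [0.60, 0.45, 0.50, 0.60]))
print(diagnose_losses([0.70, 0.50, 0.40], [0.65, 0.50, 0.42]))
print(diagnose_losses([0.70, 0.70, 0.71], [0.70, 0.70, 0.70]))
```

A check like this does not replace looking at the curves, but it shows how the comparison between the two losses drives the diagnosis.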

The challenge of overfitting while fine-tuning a pre-trained model

Supervised data analysis involves identifying patterns between predictors and an identified outcome that is to be modeled or predicted. Unsupervised learning methods focus solely on identifying patterns among the predictors. In sentiment analysis based on supervised learning, a model is trained to recognize the relationship between the text input, such as reviews or comments (the predictors), and sentiments such as positive, negative and neutral (the outcome) (Kuhn & Johnson, 2019). Pre-trained models, such as those based on transformers (like BERT or DistilBERT), have revolutionized sentiment analysis. They have been trained on massive datasets, enabling them to understand language context remarkably well. However, this advantage also brings the risk of overfitting, especially in cross-domain sentiment analysis (Zhou et al., 2020).

Overfitting occurs when a model learns the training data too well, including its noise and idiosyncrasies. This leads to excellent performance on the training data but poor performance on unseen or test data, which makes the model useless for its intended purpose. Fine-tuning lets a pre-trained sentiment analysis model adapt to a specific dataset, but if that dataset is small or not representative enough, the model will still overfit. Choosing a simpler pre-trained model, on the other hand, can result in underfitting when the model has not been trained on a suitable dataset. Therefore, choosing a suitable pre-trained model is challenging (Bejani & Ghatee, 2021). The most effective way to combat overfitting is to train on a larger and more diverse dataset, which helps the model learn more generalizable features. Researchers have also proposed various other approaches to address overfitting (Sungheetha & Sharma R., 2020):

  • regularization,
  • data augmentation, and
  • early stopping

Regularization techniques penalize complex models, discouraging them from fitting the noise in the training data. The dropout technique randomly sets a fraction of the neurons to zero during training. This prevents neurons from co-adapting too much and forces the network to learn more robust features. DistilBERT already incorporates a default dropout rate, but different dropout rates can be set for different experiments. Furthermore, weight decay adds a penalty to the loss function proportional to the square of the weights. This encourages smaller weights, leading to a simpler model.

TrainingArguments(
    # ... other arguments ...
    weight_decay=0.01,
    # ...
)
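To illustrate the penalty that weight decay describes, the sketch below adds a term proportional to the sum of squared weights to a data loss. The function name and all values are invented. Note also that this is the classic L2 formulation: AdamW-style optimizers, commonly used for fine-tuning, apply the decay directly in the weight update instead of adding it to the loss, so treat this as a conceptual picture only.

```python
def l2_penalized_loss(data_loss, weights, weight_decay):
    """Data loss plus a penalty proportional to the sum of squared weights."""
    penalty = weight_decay * sum(w * w for w in weights)
    return data_loss + penalty

# Two hypothetical weight vectors that achieve the same data loss
small_weights = [0.5, -0.5, 0.25]
large_weights = [4.0, -3.0, 2.0]
print(l2_penalized_loss(0.30, small_weights, weight_decay=0.01))
print(l2_penalized_loss(0.30, large_weights, weight_decay=0.01))
```

With equal data loss, the model with larger weights ends up with the higher total loss, so the optimizer is nudged toward smaller weights.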

Data augmentation is relatively straightforward to implement and can improve model generalization. A good starting point is simple techniques such as synonym replacement and random deletion.
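The two techniques can be sketched in a few lines of plain Python. The SYNONYMS table below is a tiny hypothetical dictionary made up for this example; a real project would use a proper thesaurus or lexical resource. The function names and probabilities are likewise assumptions.

```python
import random

# Hypothetical synonym table, invented for illustration
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "movie": ["film"],
}

def synonym_replacement(words, rng, n=1):
    """Replace up to n words that have an entry in the synonym table."""
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    new_words = list(words)
    for i in candidates[:n]:
        new_words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return new_words

def random_deletion(words, rng, p=0.1):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(42)  # fixed seed so the augmentation is reproducible
sentence = "the movie was good but the ending was bad".split()
print(" ".join(synonym_replacement(sentence, rng, n=2)))
print(" ".join(random_deletion(sentence, rng, p=0.2)))
```

Each augmented sentence keeps the original label, so the training set grows without extra annotation. Aggressive deletion can remove the very word that carries the sentiment, so a small p is a safer starting point.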

Early stopping is an intuitive and effective way to prevent a model from learning noise in the training data. This technique involves monitoring validation performance and stopping training when it starts to decline. This prevents the model from continuing to learn the noise in the training data.

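from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    # ... other arguments ...
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",              # should match the evaluation schedule
    load_best_model_at_end=True,        # needed for EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # metric the callback monitors
    greater_is_better=False,            # lower loss is better
    # ...
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer,
    # Example: stop if eval loss doesn't improve by at least 0.001 for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.001)],
)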
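The patience logic behind early stopping can also be sketched without any framework. The function below is a simplified illustration and not the callback's actual implementation. Its name, the default values, and the sample losses are assumptions made for this example.

```python
def early_stopping_epoch(eval_losses, patience=3, threshold=0.001):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    evals_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best - threshold:
            # Meaningful improvement: remember it and reset the counter
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch
    return None

# Invented evaluation losses: the best value is reached at epoch 3
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.44]
print(early_stopping_epoch(losses))  # prints 6
```

Here the evaluation loss bottoms out at epoch 3, and three evaluations without improvement follow, so training stops at epoch 6 instead of running to the end.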

While DistilBERT is generally good, ensure it’s a reasonable starting point for a specific sentiment analysis task. For a very specialized domain, consider fine-tuning a model pre-trained on a more relevant corpus. Carefully tune the hyperparameters, such as the learning rate, batch size, number of training epochs, and regularization strength. A learning rate that is too high can cause the model to overfit quickly, while a very low learning rate might lead to underfitting or very slow convergence. Tools like Optuna or Ray Tune can help automate this process. When trying different mitigation strategies, change one thing at a time and observe its impact on the validation performance. This helps to understand which techniques are most effective for the task and dataset. Remember that finding the right balance often involves experimentation.

Performing sentiment analysis using a pre-trained DistilBERT model 

Write a program that performs sentiment analysis on text data using a pre-trained DistilBERT model from the Hugging Face Transformers library. Fine-tune the model on a labeled dataset and then use it to predict the sentiment of new reviews.

Objectives

  1. Install the necessary libraries.
  2. Load the training dataset and preprocess it.
  3. Tokenize the text data using the DistilBERT tokenizer.
  4. Create a custom dataset class and split it into training and validation sets.
  5. Fine-tune the DistilBERT model for sentiment analysis.
  6. Evaluate the model’s performance.
  7. Save the trained model and tokenizer for future use.
  8. Load test data, make predictions using the fine-tuned model and visualize them.
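As one possible starting point for objective 4, the sketch below shows a minimal dataset class and a reproducible train/validation split in plain Python. The class name, the helper names, and the token IDs are all invented for illustration. In a real run the encodings would come from the DistilBERT tokenizer, and the class would typically subclass the PyTorch Dataset class.

```python
import random

class ReviewDataset:
    """Minimal map-style dataset: supports len() and integer indexing."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of lists, e.g. {"input_ids": [...], "attention_mask": [...]}
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # One example: every encoding field for this index, plus its label
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

def split_indices(n_items, val_fraction=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out a fraction for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(n_items * val_fraction))
    return indices[n_val:], indices[:n_val]

def take(encodings, labels, indices):
    """Build a ReviewDataset from the selected indices."""
    subset = {key: [values[i] for i in indices] for key, values in encodings.items()}
    return ReviewDataset(subset, [labels[i] for i in indices])

# Made-up token IDs standing in for real tokenizer output
encodings = {
    "input_ids": [[101, 7, 102], [101, 8, 102], [101, 9, 102], [101, 10, 102], [101, 11, 102]],
    "attention_mask": [[1, 1, 1]] * 5,
}
labels = [1, 0, 1, 0, 1]

train_idx, val_idx = split_indices(len(labels))
train_dataset = take(encodings, labels, train_idx)
val_dataset = take(encodings, labels, val_idx)
print(len(train_dataset), len(val_dataset))
```

Fixing the seed keeps the split identical across runs, which makes validation results comparable when other settings change.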

References

NOTES

I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments and data visualization. My training approach emphasizes real-world application, clear interpretation of results and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.
