Using SGD Classifier to train models with incremental learning

By Abhinash Jena on July 31, 2025

This article explores a robust, adaptive framework for incremental learning for sentiment analysis using the SGD Classifier. It addresses key challenges such as concept drift, catastrophic forgetting, and hyperparameter sensitivity. Through empirical simulations and theoretical grounding, the article demonstrates how incremental learning can evolve with dynamic user sentiments. Such systems are well-suited for real-time applications like e-commerce review monitoring, social media sentiment tracking, and customer feedback systems.

The SGD Classifier is a machine learning model that classifies input data by computing a linear combination of the input features and applying a threshold to assign a class label. In mathematical terms, it separates classes using a linear boundary, such as a line in two dimensions or a hyperplane in higher dimensions. The gradient descent procedure in the Stochastic Gradient Descent (SGD) Classifier optimizes the weights by iteratively adjusting them to minimize the loss function based on the current prediction error.
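The decision rule can be sketched in a few lines of Python; the weights and bias below are illustrative values, not learned ones:

```python
import numpy as np

# A linear classifier assigns a label by thresholding w·x + b.
# These weights and bias are hand-picked for illustration, not learned.
w = np.array([0.8, -0.5])   # one weight per input feature
b = 0.1                     # bias (intercept) term

def predict(x):
    score = np.dot(w, x) + b       # linear combination of features
    return 1 if score >= 0 else 0  # threshold at the decision boundary

print(predict(np.array([1.0, 0.2])))   # -> 1 (positive side of the boundary)
print(predict(np.array([-1.0, 1.0])))  # -> 0 (negative side of the boundary)
```

Training with SGD amounts to nudging `w` and `b` after each sample so that this thresholded score makes fewer mistakes.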

Incremental learning helps SGD Classifier models adapt quickly to new patterns and drifts so they remain relevant as data evolves. Incremental approaches are designed for massive or streaming data, where batch methods are infeasible due to memory or time constraints (Sivylla Paraskevopoulou, 2024). The loss function directly shapes the model's convergence by determining the error landscape that the optimizer navigates, and it guides how the classifier learns by quantifying the difference between the actual target values and the model's predictions.

Understanding the key roles of the Loss function

The loss function is a mathematical method used to measure the difference between a model’s prediction and the actual value or the true label. The loss function should align with the model’s objective. The loss function assigns a numeric value to how ‘bad’ the model’s prediction is for a given training sample; smaller values mean better predictions.

EXAMPLE

In regression the goal is to minimize the differences between the predictions and the target values, while in classification the goal is to minimize the number of misclassifications (Yehoshua, 2023).
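As a concrete classification example, two common per-sample losses can be computed from a raw decision score s = w·x + b (labels encoded as -1/+1; the numbers below are illustrative):

```python
import numpy as np

# Per-sample losses for a binary classifier, with labels y in {-1, +1}
# and a raw decision score s = w·x + b.
def hinge_loss(y, s):
    return max(0.0, 1.0 - y * s)          # zero once the margin y*s exceeds 1

def log_loss(y, s):
    return np.log(1.0 + np.exp(-y * s))   # smooth, never exactly zero

# A confidently correct prediction: hinge loss is already zero,
# while log loss still assigns a small penalty.
print(hinge_loss(1, 2.5))            # -> 0.0
print(round(log_loss(1, 2.5), 4))    # small positive value

# A prediction inside the margin is penalized by both losses.
print(hinge_loss(1, 0.5))            # -> 0.5
```

Smaller values mean better predictions in both cases; the difference lies in how the penalty decays as the prediction becomes more confident.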

In incremental learning and sentiment classification, the selection and characteristics of the loss function play a critical role in ensuring adaptability, stability, and sustained accuracy. Throughout model training, Stochastic Gradient Descent (SGD) calculates the partial derivatives of the loss with respect to the model weights. Consequently, loss functions vary in their approaches to:

  • Challenging examples, such as neutral reviews exhibiting nuanced sentiment.
  • Outliers, including sarcastic reviews or incorrectly labeled instances.
  • Class imbalance, for example, when positive reviews significantly outnumber negative ones.

Furthermore, the choice of loss function also impacts the generalization behavior of a model. Different losses have varying stability and sensitivity to outliers, which affects how the final model performs on unseen data.

Loss Function | Used With | Characteristics
Log loss | Logistic regression | Probabilistic; penalizes confident wrong predictions more severely. Highly sensitive to outliers and class probabilities.
Hinge loss | Linear SVM | Focuses on classification margins rather than probability estimates. Robust to outliers.
Modified Huber | SGDClassifier (robust) | A quadratically smoothed hinge loss that grows only linearly for large errors, increasing robustness to noisy text.
Cross-entropy | Neural networks | Ideal for multiclass sentiment classification.
Common Loss Functions in Text Classification

In incremental learning, where models learn from data in small batches, the loss function provides informative feedback on each incoming batch. A well-chosen loss ensures that the model adapts without catastrophic forgetting (Lu et al., 2019).

EXAMPLE

Hinge loss (used in linear SVMs) is sensitive to margin violations, while modified Huber loss combines robustness with smooth gradients, making it suitable for noisy or evolving sentiment streams.

In supervised incremental learning, selecting an appropriate loss function is fundamental to achieving both convergence and generalization, meaning that a model trained on a specific dataset should also perform well on new, unseen data. A loss function influences how the model's parameters are updated during training: it evaluates the precision of a model's hypothesis by measuring the difference between its expected and true outputs (Akbari et al., 2021). This determines how quickly and reliably the model converges to a solution with minimized misclassifications. During incremental learning for sentiment analysis, sentiment patterns evolve over time, a phenomenon known as concept drift. In such scenarios, a well-chosen loss function helps stabilize the model's convergence despite shifts in the feature distribution.

Convergence comparison of loss functions in the SGD Classifier

The convergence comparison chart above shows how different loss functions affect model training over 20 epochs. Log loss (blue) shows a steady decline, indicating stable and smooth convergence; it is well-suited for probabilistic outputs in classification tasks like sentiment analysis, and it should be the preferred choice when using the SGD Classifier for dynamic, stream-based updates, especially under gradual drift. Hinge loss (orange) converges less smoothly, with more fluctuations in loss. Modified Huber (green) exhibits fast initial convergence, which can be ideal for incremental learning where rapid feedback is crucial.

Understanding the distinction between partial_fit() and fit()

In scikit-learn estimators such as the SGD Classifier, both the fit() and partial_fit() methods are used for training models, but they serve distinct learning purposes. The fit() method implements batch learning: the entire dataset is loaded into memory and the model is trained on all of it at once. Each call to fit() resets the model, so previous training information is discarded and the model learns from scratch. It is suitable when the dataset is small enough to fit into memory and there is no need to learn incrementally. The partial_fit() method enables incremental (online) learning: the model is updated with batches of data (mini-batches) sequentially, without forgetting what it learned from previous batches. Each call to partial_fit() updates the existing model parameters according to the new data, preserving the current state. This allows the model to learn new patterns over time without retraining from zero.
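The difference can be seen directly in code; the random data below is a stand-in for real feature vectors:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
X2, y2 = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

# fit(): each call retrains from scratch, discarding earlier state.
batch_model = SGDClassifier(random_state=0)
batch_model.fit(X1, y1)
batch_model.fit(X2, y2)           # knowledge from (X1, y1) is lost here

# partial_fit(): updates the existing weights with each new batch.
# The full set of class labels must be declared on the first call,
# since later batches may not contain every class.
online_model = SGDClassifier(random_state=0)
online_model.partial_fit(X1, y1, classes=np.array([0, 1]))
online_model.partial_fit(X2, y2)  # continues from the current state
```

After the second `partial_fit()` call, `online_model` reflects both batches, whereas `batch_model` reflects only the last `fit()` call.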

However, partial_fit() also introduces critical challenges that impact model performance, reliability, and interpretability over time. These challenges arise from the nature of non-stationary data, limited visibility into long-term trends, and the inherent sensitivity of stochastic optimization.

Sensitivity of SGD Classifier to hyperparameters

Incremental learning models are highly sensitive to hyperparameter settings, which influence model convergence, stability, and forgetting. The core issue stems from shared weights, which allow for generalization and graceful degradation but also make these models vulnerable to a radical form of forgetting (French, 1999).

Hyperparameter | Role | Sensitivity Impact
Learning rate (eta0 / schedule) | Controls the step size of each update during weight optimization. | Too high: unstable, oscillating updates. Too low: slow convergence; the optimal value must be chosen based on validation performance.
Regularization strength (alpha) | Penalty applied to model weights (L1/L2); higher values constrain the weights and help prevent overfitting. | Too high: underfitting. Too low: overfitting to recent batches.
max_features in TF-IDF | Reduces dimensionality, controlling noise. | Too few features: missed nuances. Too many: instability.
Batch size | Number of samples used to estimate gradients per update. | Small: high variance, faster adaptation, noisier and less stable training. Large: stable gradients, slower adaptation, and less generalization.
Momentum | Incorporates past gradients to smooth updates and accelerate convergence. | Zero/low: slow training. High: can overshoot minima if not combined with an appropriate learning rate.
Number of epochs / iterations | Number of complete passes over the training data. | Too few: underfitting from insufficient training. Too many: overfitting, with the model capturing noise.
Hyperparameters and their settings

Tuning hyperparameters like the learning rate, regularization strength, and loss function is essential because these settings govern how well the model:

  • Adapts to new patterns
  • Retains old knowledge
  • Reacts to concept drift

Choosing and tuning these hyperparameters is crucial for achieving optimal convergence, accuracy, and generalization. There is no universal set of optimal hyperparameters; the best values depend on the data, the task, and computational constraints. The interplay between alpha, learning rate, and regularization directly governs the stability-plasticity tradeoff (French, 1999). The success of models trained with SGD classifiers depends heavily on careful and time-consuming hyperparameter tuning, particularly for the learning rate, regularization, batch size, and feature scaling. Small changes can have large effects on model convergence, stability, and generalization, making dedicated tuning and validation indispensable.

In traditional batch learning, grid search is a widely used technique that exhaustively searches across a predefined set of hyperparameter combinations to identify those yielding the best performance (Pedregosa et al., 2011).

EXAMPLE

You might specify a list of possible learning rates, regularization strengths, and loss functions, and grid search will try all permutations.

However, in incremental learning, where the model learns from a non-stationary stream of data over time, grid search requires careful adaptation. Challenges of grid search in incremental contexts include:

  • No fixed validation set: In streaming settings, it is impractical to maintain a holdout set since data keeps arriving.
  • Delayed feedback: Performance differences across configurations may only emerge over several batches.

To address these challenges, a pseudo-grid search or prequential validation (Gama et al., 2014) approach can be implemented.

from sklearn.model_selection import ParameterGrid

param_grid = {
    'alpha': [0.00001, 0.0001, 0.001],
    'learning_rate': ['constant', 'optimal'],
    'loss': ['log_loss', 'modified_huber'],
    'eta0': [0.001, 0.01]
}

grid = list(ParameterGrid(param_grid))

The role of Gradient Descent and Stochasticity in SGD Classifier

A key feature of models trained with the SGD Classifier is that they learn by adjusting the model weights through repeated training. Another important characteristic is that they represent information in a distributed way, meaning that an item's representation is spread across many units, and each unit is involved in representing many items (McCloskey & Cohen, 1989). The classifier uses Stochastic Gradient Descent (SGD) as its optimization algorithm to tune these parameters based on the training data. A gradient is the vector of partial derivatives of the loss function with respect to the model parameters. Stochasticity refers to the randomness introduced during learning, which arises because updates are computed from individual samples or mini-batches rather than the full dataset.

In sentiment analysis, vocabulary patterns change over time, which is known as concept drift. To adapt to drift, the model weights are updated step by step in the direction that most reduces the prediction error. At each step of incremental learning, the gradient of the loss function guides these updates. Over many updates (epochs), the weights gradually descend toward values that minimize the loss on the training distribution. As the learning process repeats, gradient descent provides the systematic direction for reducing loss, while stochasticity injects natural variation into those steps. The combination ensures efficient learning from large or streaming data, robustness to noisy inputs, and adaptability to changing data patterns.
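A single SGD step for the logistic loss can be written out explicitly; the two hand-picked samples below are only for illustration:

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    """One stochastic gradient step for logistic loss, with y in {-1, +1}."""
    margin = y * np.dot(w, x)
    # dL/dw for L = log(1 + exp(-margin))
    grad = -y * x / (1.0 + np.exp(margin))
    return w - lr * grad          # move against the gradient

w = np.zeros(2)
# Stochasticity: samples arrive one at a time, in (possibly shuffled) order.
for x, y in [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]:
    w = sgd_step(w, x, y)

print(w)  # weights have moved toward classifying each sample correctly
```

After these two steps the first weight is positive (pushed up by the positive sample) and the second is negative (pushed down by the negative sample), which is exactly the per-sample descent behavior described above.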

Mitigating catastrophic forgetting in repeated cycles of model updates

Repeated cycles of model updates using only new data often lead to catastrophic forgetting of previously learned but still relevant patterns: when a model learns something new, it tends to forget what it learned previously. This causes the model to underperform on older but still important sentiment themes. Incremental learning models trained with the SGD Classifier are highly sensitive to data stream variability, so ensuring a robust learning environment becomes an essential design goal. Several strategies can help mitigate catastrophic forgetting:

  • Replay a randomly selected subset of old data alongside new data during incremental updates. This can be done by storing and periodically mixing a small, representative sample of past data with each new batch, helping the model retain earlier knowledge (Doan et al., 2023).
  • Taking advantage of regularization in incremental learning. Regularization techniques impose constraints on model complexity to prevent overfitting to recent or noisy batches, especially when data is non-stationary (Bottou, 2010).
    • L1 Regularization (Lasso): Useful in high-dimensional text data like TF-IDF representations.
    • L2 Regularization (Ridge): Smoothens the weight updates, reducing drastic shifts.
    • Elastic Net: Combines L1 and L2, balancing sparsity and weight shrinkage. Effective for preserving stability in online text classification tasks.
  • In addition to regularization, stabilizing weight updates also helps avoid learning volatility across streaming data batches.
    • Gradient Clipping: Prevents disproportionately large updates due to outliers or noisy data.
    • Learning Rate Decay: Gradually reduces the learning rate after each batch or epoch.
    • Averaged SGD: Maintains a running average of weights during updates.
Component | Description
TF-IDF Vectorization | Ensures consistent feature representation; older terms retain influence via IDF.
Initial Model Training | Use a regularized SGDClassifier with loss='log_loss', penalty='l2', and learning rate control to limit drastic weight updates and avoid overfitting to new batches.
Learning Rate Control | Maintains small updates for stability; avoids abrupt knowledge overwriting.
Drift Monitoring | Apply statistical drift detection (e.g., the KS test) to detect feature distribution shifts.
Batch Ingestion | Feed mini-batches (e.g., 1000 reviews) via partial_fit() for incremental updates.
Stabilization Layer | Apply gradient clipping, learning rate decay, and regularization.
Performance Evaluation | Track accuracy, F1-score, and drift scores over time.
Logging & Recalibration | Re-train or re-weight if significant degradation is detected.
Framework for Catastrophic Forgetting Mitigation

Implementing the SGD Classifier with the best grid search results

This implementation employed the SGDClassifier from scikit-learn, trained with the partial_fit() method to support incremental learning. The classifier uses a logistic loss function with L2 regularization, ensuring weight stability across updates. A prequential grid search was performed to optimize key hyperparameters, including alpha, eta0, learning_rate, and loss. Based on the results of this tuning process, the best configuration was identified as alpha=0.0001, eta0=0.001, learning_rate='optimal', and loss='log_loss'. These values were then used to configure the final streaming classifier.

The TF-IDF vectorizer, with a fixed vocabulary size of 1000 features, ensured representational consistency across batches. This controlled feature space helped preserve sentiment-related terms such as “food”, “ambience”, and “great”, even as the distribution of vocabulary evolved. Regularization through penalty='l2' further supported long-term weight stabilization, while the log_loss function allowed for probabilistic prediction, which is particularly beneficial in sentiment modeling with overlapping class features.

The pipeline included batch-wise processing of approximately 7000 customer reviews, split into sequential batches of 1000. Each batch was vectorized using the frozen TF-IDF model and used to update the classifier incrementally. Performance was evaluated using the F1 score to account for precision-recall balance, critical in sentiment analysis where class imbalance may exist. Additionally, the Kolmogorov–Smirnov (KS) test was used to detect concept drift by comparing the distribution of TF-IDF feature vectors between batches.

The results, visualized through a dual-axis plot, indicate that the classifier maintained stable F1 scores across most batches, with minimal performance degradation even in the presence of drift. The drift events were found to coincide with minor F1 fluctuations, validating the importance of monitoring distributional shifts while also affirming the classifier's resilience to these changes. This behavior reinforces the efficacy of using a drift-aware, incrementally updated model guided by regularization and hyperparameter tuning.

References

NOTES

I am an interdisciplinary educator, researcher, and technologist with over a decade of experience in applied coding, educational design, and research mentorship in fields spanning management, marketing, behavioral science, machine learning, and natural language processing. I specialize in simplifying complex topics such as sentiment analysis, adaptive assessments, and data visualization. My training approach emphasizes real-world application, clear interpretation of results, and the integration of data mining, processing, and modeling techniques to drive informed strategies across academic and industry domains.
