
Troubleshooting — Model Training & Serialization

Troubleshooting · 30 min · Module 2

Introduction

This guide covers the most common issues encountered during ML model training, evaluation, and serialization. Each problem is presented with its symptoms, causes, solutions, and preventive measures.


1. Version Mismatch When Loading Models

Symptom

ModuleNotFoundError: No module named 'sklearn.ensemble._forest'

or

UserWarning: Trying to unpickle estimator RandomForestClassifier from version 1.2.0
when using version 1.4.0. This might lead to breaking code or invalid results.

Cause

The model was serialized with a different version of scikit-learn than the one used to load it. The internal class structures change between versions.

Solution

# Option 1: Match the scikit-learn version used during training
pip install scikit-learn==1.2.0

# Option 2: Re-train and re-serialize with the current version
import joblib
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, 'model_new_version.joblib')

Prevention

import json
import platform
from datetime import date

import numpy as np
import sklearn

metadata = {
    "python_version": platform.python_version(),  # record the actual version, don't hardcode it
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "serialized_date": date.today().isoformat(),
}

with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
Best Practice

Always save a requirements.txt or metadata.json file with the exact versions of all dependencies alongside your serialized model.
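The companion check at load time can be sketched like this. It is a minimal sketch: the metadata dict is written inline here so the snippet is self-contained, but in practice you would read the model_metadata.json saved alongside the model.

```python
import json

import sklearn

# Simulate a metadata file written at training time with an older version.
with open('model_metadata.json', 'w') as f:
    json.dump({"sklearn_version": "1.2.0"}, f)

# At load time: compare recorded vs. running versions before unpickling.
with open('model_metadata.json') as f:
    saved = json.load(f)

if saved["sklearn_version"] != sklearn.__version__:
    print(f"⚠️ Model saved with scikit-learn {saved['sklearn_version']}, "
          f"running {sklearn.__version__}: re-train or pin the version.")
```

Running this check before `joblib.load` turns a cryptic ModuleNotFoundError into an actionable message.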


2. Overfitting — High Training Performance, Low Test Performance

Symptom

Training Accuracy: 0.9990
Test Accuracy: 0.7523
Gap: 0.2467 ← Too large!

Cause

The model memorized the noise in the training data instead of learning generalizable patterns. Common causes:

  • Model too complex (too many trees, unlimited depth)
  • Not enough training data
  • Too many features relative to the number of samples
  • No regularization

Solution

# Solution 1: Add regularization
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1, penalty='l2') # C small = more regularization

# Solution 2: Reduce model complexity
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,          # limit depth
    min_samples_leaf=10,  # require minimum samples per leaf
    max_features='sqrt',  # use subset of features
    random_state=42
)

# Solution 3: Feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)

# Solution 4: Cross-validation to detect early
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV Score: {scores.mean():.4f} ± {scores.std():.4f}")
# Large std = overfitting risk
Warning Sign

If the gap between training and test performance exceeds 10 percentage points, it is likely overfitting. If the standard deviation of cross-validation scores is greater than 0.05, the model is unstable.
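The two thresholds above (train/test gap over 10 percentage points, CV standard deviation over 0.05) can be checked programmatically. This sketch uses a synthetic dataset from make_classification purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Warning sign 1: gap between train and test performance
gap = model.score(X_train, y_train) - model.score(X_test, y_test)
if gap > 0.10:
    print(f"⚠️ Overfitting suspected: train/test gap = {gap:.2f}")

# Warning sign 2: unstable cross-validation scores
scores = cross_val_score(model, X_train, y_train, cv=5)
if scores.std() > 0.05:
    print(f"⚠️ Unstable model: CV std = {scores.std():.3f}")
```

Wiring these two checks into your training script makes the warning signs impossible to overlook.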

Prevention

| Strategy | When to Use |
|---|---|
| Systematic cross-validation | Always |
| Regularization (L1, L2) | Linear models |
| Limit depth / complexity | Decision trees, Random Forest |
| Increase data | Small dataset (< 1000 samples) |
| Early stopping | Deep learning, boosting |
| Dropout | Neural networks |

3. Underfitting — Low Performance Everywhere

Symptom

Training Accuracy: 0.6234
Test Accuracy: 0.6102

Cause

The model is too simple to capture the patterns in the data.

Solution

# Solution 1: Use a more complex model
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=200, max_depth=5)

# Solution 2: Add polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_train_poly = poly.fit_transform(X_train)

# Solution 3: Engineer better features
# (domain-specific — combine or transform existing features)

# Solution 4: Reduce regularization
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=10.0)  # higher C = less regularization
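One diagnostic distinguishes underfitting from overfitting in a single pass: a validation curve sweeps a complexity knob (max_depth here) and compares train vs. validation scores. The dataset below is synthetic, for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=400, n_features=20, random_state=42)

depths = [1, 2, 4, 8]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, param_name="max_depth", param_range=depths, cv=3,
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f}, val={va:.3f}")
# Validation score still rising with depth → the model was underfitting;
# validation dropping while train keeps rising → overfitting.
```

Reading the curve tells you which direction to move the complexity knob instead of guessing.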

4. Data Leakage — Results Too Good to Be True

Symptom

Cross-Validation Accuracy: 0.9999
Test Accuracy: 0.6543

Unrealistically high validation scores that drop drastically on the test set.

Cause

Information from the test set has leaked into the training process. Common sources:

| Leak Source | Example |
|---|---|
| Scaling before split | scaler.fit(X_all) before train_test_split |
| Feature including the target | Column 100% correlated with y (e.g., "diagnosis_code") |
| Duplicated data | Same patient in both train and test |
| Temporal order not respected | Future data in the training set |

Solution

# ❌ WRONG: Scaling before split (data leakage!)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fits on ALL data, including future test rows
X_train, X_test = train_test_split(X_scaled, ...)

# ✅ CORRECT: Scaling after split
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit ONLY on train
X_test_scaled = scaler.transform(X_test)        # transform only

# ✅ BEST: Use a Pipeline (automatically prevents leakage)
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)  # the scaler sees only training data
Data leakage is the most dangerous bug in ML

It produces falsely excellent metrics that never reproduce in production. Always verify: has my model seen, directly or indirectly, any data from the test set during training?

Prevention

  1. Always use Pipelines — they ensure that preprocessing only sees the training data
  2. Check correlations — a feature correlated at >0.95 with the target is suspicious
  3. Inspect duplicates — make sure no sample appears in multiple splits
  4. Respect temporal order for time series data
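Point 3 above, inspecting duplicates across splits, can be sketched with a full-row inner merge. The tiny DataFrame is a made-up example with one deliberately duplicated row:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Row 4 is an exact duplicate of row 0 (e.g., the same patient recorded twice).
df = pd.DataFrame({"a": [1, 2, 3, 4, 1], "b": [5, 6, 7, 8, 5]})
train_df, test_df = train_test_split(df, test_size=0.4, random_state=0)

# An inner merge on all columns finds identical rows present in both splits.
overlap = pd.merge(train_df, test_df, how="inner")
if len(overlap) > 0:
    print(f"⚠️ {len(overlap)} identical rows appear in both train and test")
```

With real data you would typically drop duplicates before splitting (df.drop_duplicates()) rather than after.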

5. Serialization Errors

5a. Pickle/Joblib — Custom Objects Not Serializable

Symptom

PicklingError: Can't pickle <class 'function'>: attribute lookup builtins.function failed

Cause

The pipeline contains lambda functions, closures, or non-serializable objects.

Solution

# ❌ WRONG: Lambda in pipeline
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(lambda x: x ** 2)  # lambdas can't be pickled

# ✅ CORRECT: Named, module-level function
def square_transform(x):
    return x ** 2

transformer = FunctionTransformer(square_transform)
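A quick round-trip check confirms the named-function version serializes cleanly. This is a sketch using joblib with an in-memory buffer; the same idea applies with a file on disk:

```python
import io

import joblib
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def square_transform(x):
    return x ** 2

transformer = FunctionTransformer(square_transform)

# Serialize and deserialize in memory, then verify behavior is unchanged.
buf = io.BytesIO()
joblib.dump(transformer, buf)
buf.seek(0)
loaded = joblib.load(buf)

assert np.array_equal(loaded.transform(np.array([2, 3])), np.array([4, 9]))
print("round-trip OK")
```

Doing this check right after fixing a PicklingError proves the fix before the model reaches production.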

5b. ONNX Conversion Fails

Symptom

RuntimeError: Unable to find a shape calculator for type 'MyCustomTransformer'

Cause

ONNX does not support all scikit-learn transformers/estimators.

Solution

# Check supported converters
from skl2onnx import supported_converters

print("Supported models:")
for name in sorted(supported_converters()):
    print(f" - {name}")

# Workaround: serialize the pipeline in parts
import joblib

# Save the custom preprocessing separately
joblib.dump(custom_preprocessor, 'preprocessor.joblib')

# Convert only the model to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(sklearn_model, initial_types=initial_type)
ONNX Compatibility

The most common scikit-learn models (LinearRegression, LogisticRegression, RandomForest, SVM, KNN, GradientBoosting) are supported. Check the list of converters before planning a conversion.

5c. Loaded Model Gives Different Results

Symptom

Original model accuracy: 0.9500
Loaded model accuracy: 0.8700 ← Different!

Cause

  • The preprocessing was not serialized with the model
  • The test data was transformed differently
  • The scikit-learn version is different

Solution

# ✅ Always serialize the COMPLETE pipeline
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
full_pipeline.fit(X_train, y_train)

# Save
joblib.dump(full_pipeline, 'full_pipeline.joblib')

# Load and verify immediately
loaded = joblib.load('full_pipeline.joblib')
original_pred = full_pipeline.predict(X_test[:5])
loaded_pred = loaded.predict(X_test[:5])
assert np.array_equal(original_pred, loaded_pred), "Mismatch detected!"

6. Memory Issues with Large Models

Symptom

MemoryError: Unable to allocate 2.50 GiB for an array

or the process is killed by the OS (killed/OOM).

Cause

The model is too large for the available memory. Random Forests with many deep trees are the most memory-hungry.

Solution

# Solution 1: Reduce model size
model = RandomForestClassifier(
    n_estimators=50,      # fewer trees
    max_depth=10,         # limit depth
    max_leaf_nodes=100,   # limit leaves
)

# Solution 2: Use compression when saving
joblib.dump(model, 'model.joblib', compress=9)  # max compression

# Solution 3: Use memory-mapped loading for large files
model = joblib.load('model.joblib', mmap_mode='r')

# Solution 4: Monitor model size
# (sys.getsizeof only measures the outer Python object, not the trees it
# references — measure the serialized size instead)
import pickle
model_size_mb = len(pickle.dumps(model)) / (1024 * 1024)
print(f"Serialized model size: {model_size_mb:.1f} MB")
| Parameter | Effect on Memory | Trade-off |
|---|---|---|
| n_estimators ↓ | ⬇️ Reduces | Less performance |
| max_depth ↓ | ⬇️ Reduces significantly | Risk of underfitting |
| max_leaf_nodes ↓ | ⬇️ Reduces | Less accuracy |
| compress=9 | Smaller file on disk | Slower loading |
| mmap_mode='r' | Lazy loading | Slower access |
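The compress=9 trade-off in the table can be measured directly. Sizes vary by model, so treat the output as illustrative; this sketch uses synthetic data and a small forest:

```python
import os

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Save the same model twice: uncompressed vs. maximum compression.
joblib.dump(model, 'model_raw.joblib')
joblib.dump(model, 'model_compressed.joblib', compress=9)

raw = os.path.getsize('model_raw.joblib')
packed = os.path.getsize('model_compressed.joblib')
print(f"raw: {raw / 1e6:.2f} MB, compressed: {packed / 1e6:.2f} MB")
```

Measuring both files on your own model tells you whether the slower load time of compress=9 buys enough disk savings to be worthwhile.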

7. Poor Metrics Despite Correct Pipeline

Symptom

All steps seem correct, but the metrics are poor.

Diagnosis

# Step 1: Check data quality
print(f"Missing values:\n{X_train.isnull().sum()[X_train.isnull().sum() > 0]}")
print(f"\nDuplicate rows: {X_train.duplicated().sum()}")
print(f"\nClass distribution:\n{y_train.value_counts(normalize=True)}")

# Step 2: Check feature relevance
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X_train_scaled, y_train)
feature_importance = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print("\nTop 10 features by mutual info:")
print(feature_importance.head(10))

# Step 3: Verify there are no constant/quasi-constant features
low_variance = X_train.columns[X_train.std() < 0.01]
if len(low_variance) > 0:
    print(f"\n⚠️ Low variance features: {list(low_variance)}")

# Step 4: Try a simple baseline
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"\nBaseline (majority class): {dummy.score(X_test, y_test):.4f}")
print(f"Your model: {model.score(X_test, y_test):.4f}")
Baseline Rule

If your model does not significantly beat a DummyClassifier, the problem is in the data, not the model. Focus on feature engineering and data quality.


8. Convergence Warnings

Symptom

ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Cause

The optimization algorithm did not converge within the allocated number of iterations.

Solution

# Solution 1: Increase max iterations
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # default is 100

# Solution 2: Scale your features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Solution 3: Try a different solver
model = LogisticRegression(solver='saga', max_iter=500)

# Solution 4: Reduce regularization strength
model = LogisticRegression(C=10.0, max_iter=500)
Do Not Ignore Warnings

ConvergenceWarning messages are not mere warnings — a model that has not converged produces unreliable results. Always fix this warning before proceeding.
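One way to enforce this is to escalate ConvergenceWarning to an exception during training so it cannot slip through unnoticed. In this sketch, max_iter=1 is deliberately absurd to force the warning:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, random_state=42)

converged = True
with warnings.catch_warnings():
    # Turn ConvergenceWarning into a hard error inside this block.
    warnings.simplefilter("error", ConvergenceWarning)
    try:
        LogisticRegression(max_iter=1).fit(X, y)  # too few iterations on purpose
    except ConvergenceWarning:
        converged = False

if not converged:
    print("did not converge: increase max_iter or scale the features")
```

In a CI pipeline, raising here fails the build instead of silently shipping an unconverged model.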


9. Quick Reference — Diagnostic Table

| Symptom | Probable Cause | Immediate Action |
|---|---|---|
| ModuleNotFoundError when loading | Version mismatch | Check sklearn version in metadata |
| Train score >> Test score | Overfitting | Reduce complexity, add regularization |
| Both train and test scores low | Underfitting | More complex model, more features |
| Unrealistic CV scores (>0.99) | Data leakage | Check pipeline, order of operations |
| PicklingError | Non-serializable object | Replace lambdas with named functions |
| ONNX conversion error | Unsupported transformer | Check the compatibility list |
| MemoryError | Model too large | Reduce n_estimators, max_depth |
| ConvergenceWarning | Not enough iterations | Increase max_iter, scale features |
| Scores don't beat baseline | Inadequate data | Feature engineering, check data quality |
| Loaded model accuracy differs | Incomplete pipeline | Serialize the complete pipeline |

Debugging Checklist

When you encounter a problem, follow this checklist systematically:

  1. ☐ Check dependency versions (Python, sklearn, numpy)
  2. ☐ Check for data leakage (scaling after split?)
  3. ☐ Compare train vs test scores (gap < 10%?)
  4. ☐ Check cross-validation scores (reasonable standard deviation?)
  5. ☐ Test a DummyClassifier as baseline
  6. ☐ Check for missing values and duplicates
  7. ☐ Verify the complete pipeline is serialized
  8. ☐ Test loading and prediction of the serialized model
  9. ☐ Check for warnings in logs (ConvergenceWarning, etc.)
  10. ☐ Reproduce the issue in a clean environment