Troubleshooting — Model Training & Serialization
Introduction
This guide covers the most common issues encountered during ML model training, evaluation, and serialization. Each problem is presented with its symptoms, causes, solutions, and preventive measures.
1. Version Mismatch When Loading Models
Symptom
ModuleNotFoundError: No module named 'sklearn.ensemble._forest'
or
UserWarning: Trying to unpickle estimator RandomForestClassifier from version 1.2.0 when using version 1.4.0. This might lead to breaking code or invalid results.
Cause
The model was serialized with a different version of scikit-learn than the one used to load it. The internal class structures change between versions.
Solution
# Option 1: Match the scikit-learn version
pip install scikit-learn==1.2.0 # match the version used during training
# Option 2: Re-train and re-serialize with the current version
import joblib
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, 'model_new_version.joblib')
Prevention
import json
import platform
from datetime import date

import numpy as np
import sklearn

metadata = {
    "python_version": platform.python_version(),
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "serialized_date": date.today().isoformat(),
}
with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
Always save a requirements.txt or metadata.json file with the exact versions of all dependencies alongside your serialized model.
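At load time, the saved metadata can be compared against the running environment before trusting the model. A minimal sketch, assuming the metadata format above (check_model_metadata is a hypothetical helper, not part of scikit-learn):

```python
import json

import sklearn

def check_model_metadata(metadata_path):
    """Return a list of version mismatches between the saved metadata
    and the current environment (empty list = safe to load)."""
    with open(metadata_path) as f:
        metadata = json.load(f)
    mismatches = []
    saved = metadata.get("sklearn_version")
    if saved != sklearn.__version__:
        mismatches.append(f"sklearn: saved {saved}, running {sklearn.__version__}")
    return mismatches
```

Run this check before `joblib.load` and refuse to load (or at least log loudly) when the returned list is non-empty.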
2. Overfitting — High Training Performance, Low Test Performance
Symptom
Training Accuracy: 0.9990
Test Accuracy: 0.7523
Gap: 0.2467 ← Too large!
Cause
The model memorized the noise in the training data instead of learning generalizable patterns. Common causes:
- Model too complex (too many trees, unlimited depth)
- Not enough training data
- Too many features relative to the number of samples
- No regularization
Solution
# Solution 1: Add regularization
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1, penalty='l2')  # smaller C = stronger regularization

# Solution 2: Reduce model complexity
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,           # limit depth
    min_samples_leaf=10,   # require a minimum number of samples per leaf
    max_features='sqrt',   # use a subset of features at each split
    random_state=42
)
# Solution 3: Feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
# Solution 4: Cross-validation to detect early
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV Score: {scores.mean():.4f} ± {scores.std():.4f}")
# Large std = overfitting risk
If the gap between training and test performance exceeds 10 percentage points, overfitting is likely. If the standard deviation of the cross-validation scores exceeds 0.05, the model is unstable across folds.
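These two rules of thumb are easy to wrap in a small helper (diagnose_fit is a hypothetical name; the 10-point gap and 0.70 score thresholds are heuristics, not hard rules):

```python
def diagnose_fit(train_score, test_score, gap_threshold=0.10, low_score=0.70):
    """Apply the rules of thumb above to a train/test score pair."""
    gap = train_score - test_score
    if gap > gap_threshold:
        return "likely overfitting"       # large train/test gap
    if train_score < low_score and test_score < low_score:
        return "likely underfitting"      # both scores low
    return "no obvious fit problem"
```

For the symptom shown above, `diagnose_fit(0.9990, 0.7523)` flags overfitting.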
Prevention
| Strategy | When to Use |
|---|---|
| Systematic cross-validation | Always |
| Regularization (L1, L2) | Linear models |
| Limit depth / complexity | Decision trees, Random Forest |
| Increase data | Small dataset (< 1000 samples) |
| Early stopping | Deep learning, boosting |
| Dropout | Neural networks |
3. Underfitting — Low Performance Everywhere
Symptom
Training Accuracy: 0.6234
Test Accuracy: 0.6102
Cause
The model is too simple to capture the patterns in the data.
Solution
# Solution 1: Use a more complex model
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=200, max_depth=5)
# Solution 2: Add polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_train_poly = poly.fit_transform(X_train)
# Solution 3: Engineer better features
# (domain-specific — combine or transform existing features)
# Solution 4: Reduce regularization
model = LogisticRegression(C=10.0) # higher C = less regularization
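One way to tell under- from overfitting in a single sweep is scikit-learn's validation_curve, which scores both the training and validation sets across a complexity parameter. The synthetic dataset and the max_depth grid below are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
depths = [1, 3, 5, 10]

# Low max_depth should underfit (both curves low); large max_depth
# may overfit (train curve far above the validation curve).
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:>2}: train={tr:.3f}  val={va:.3f}")
```

Pick the complexity where the validation curve peaks, not where the training curve does.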
4. Data Leakage — Results Too Good to Be True
Symptom
Cross-Validation Accuracy: 0.9999
Test Accuracy: 0.6543
Unrealistically high validation scores that drop drastically on the test set.
Cause
Information from the test set has leaked into the training process. Common sources:
| Leak Source | Example |
|---|---|
| Scaling before split | scaler.fit(X_all) before train_test_split |
| Feature including the target | Column 100% correlated with y (e.g., "diagnosis_code") |
| Duplicated data | Same patient in both train and test |
| Temporal order not respected | Future data in the training set |
Solution
# ❌ WRONG: Scaling before split (data leakage!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # fits on ALL data
X_train, X_test = train_test_split(X_scaled, ...)
# ✅ CORRECT: Scaling after split
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit ONLY on train
X_test_scaled = scaler.transform(X_test) # transform only
# ✅ BEST: Use a Pipeline (automatically prevents leakage)
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)  # the scaler sees only training data
Data leakage produces falsely excellent metrics that never reproduce in production. Always ask: has my model seen, directly or indirectly, any data from the test set during training?
Prevention
- Always use Pipelines — they ensure that preprocessing only sees the training data
- Check correlations — a feature correlated at >0.95 with the target is suspicious
- Inspect duplicates — make sure no sample appears in multiple splits
- Respect temporal order for time series data
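The duplicate check in particular is easy to automate. A minimal sketch (split_overlap is a hypothetical helper; it only catches exact row duplicates, not near-duplicates):

```python
import numpy as np

def split_overlap(X_train, X_test):
    """Count rows that appear verbatim in both splits.

    Any non-zero result means the test set contains 'seen' data."""
    train_rows = set(map(tuple, np.asarray(X_train).tolist()))
    test_rows = set(map(tuple, np.asarray(X_test).tolist()))
    return len(train_rows & test_rows)
```

Run it right after train_test_split and fail loudly if the count is greater than zero.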
5. Serialization Errors
5a. Pickle/Joblib — Custom Objects Not Serializable
Symptom
PicklingError: Can't pickle <class 'function'>: attribute lookup builtins.function failed
Cause
The pipeline contains lambda functions, closures, or non-serializable objects.
Solution
# ❌ WRONG: Lambda in pipeline
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(lambda x: x ** 2) # lambdas can't pickle
# ✅ CORRECT: Named function
def square_transform(x):
    return x ** 2

transformer = FunctionTransformer(square_transform)
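The fix can be verified with a quick round-trip: a module-level named function serializes fine, where a lambda in the same position would raise the PicklingError above.

```python
import pickle

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def square_transform(x):
    return x ** 2

transformer = FunctionTransformer(square_transform)
restored = pickle.loads(pickle.dumps(transformer))  # would fail for a lambda
result = restored.transform(np.array([[2.0, 3.0]]))  # array([[4., 9.]])
```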
5b. ONNX Conversion Fails
Symptom
RuntimeError: Unable to find a shape calculator for type 'MyCustomTransformer'
Cause
ONNX does not support all scikit-learn transformers/estimators.
Solution
# Check supported converters
from skl2onnx import supported_converters
print("Supported models:")
for name in sorted(supported_converters()):
    print(f" - {name}")
# Workaround: serialize pipeline in parts
# Save the custom preprocessing separately
joblib.dump(custom_preprocessor, 'preprocessor.joblib')
# Convert only the model to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(sklearn_model, initial_types=initial_type)
The most common scikit-learn models (LinearRegression, LogisticRegression, RandomForest, SVM, KNN, GradientBoosting) are supported. Check the list of converters before planning a conversion.
5c. Loaded Model Gives Different Results
Symptom
Original model accuracy: 0.9500
Loaded model accuracy: 0.8700 ← Different!
Cause
- The preprocessing was not serialized with the model
- The test data was transformed differently
- The scikit-learn version is different
Solution
# ✅ Always serialize the COMPLETE pipeline
full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
full_pipeline.fit(X_train, y_train)

# Save
joblib.dump(full_pipeline, 'full_pipeline.joblib')

# Load and verify immediately
loaded = joblib.load('full_pipeline.joblib')
original_pred = full_pipeline.predict(X_test[:5])
loaded_pred = loaded.predict(X_test[:5])
assert np.array_equal(original_pred, loaded_pred), "Mismatch detected!"
6. Memory Issues with Large Models
Symptom
MemoryError: Unable to allocate 2.50 GiB for an array
or the process is killed by the OS (killed/OOM).
Cause
The model is too large for the available memory. Random Forests with many deep trees are the most memory-hungry.
Solution
# Solution 1: Reduce model size
model = RandomForestClassifier(
    n_estimators=50,      # fewer trees
    max_depth=10,         # limit depth
    max_leaf_nodes=100,   # limit leaves
)
# Solution 2: Use compression when saving
joblib.dump(model, 'model.joblib', compress=9) # max compression
# Solution 3: Use memory-mapped loading for large files
model = joblib.load('model.joblib', mmap_mode='r')
# Solution 4: Measure the real model size
# (sys.getsizeof only counts the top-level Python object,
#  not the tree arrays inside — use the serialized size instead)
import pickle
model_size_mb = len(pickle.dumps(model)) / (1024 * 1024)
print(f"Serialized model size: {model_size_mb:.1f} MB")
| Parameter | Effect on Memory | Trade-off |
|---|---|---|
| n_estimators ↓ | ⬇️ Reduces | Lower performance |
| max_depth ↓ | ⬇️ Reduces significantly | Risk of underfitting |
| max_leaf_nodes ↓ | ⬇️ Reduces | Lower accuracy |
| compress=9 | Smaller file on disk | Slower loading |
| mmap_mode='r' | Lazy loading | Slower access |
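The compress trade-off from the table can be measured directly. This sketch trains a throwaway forest and compares on-disk sizes; exact sizes will vary by environment:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

sizes = {}
out_dir = tempfile.mkdtemp()
for level in (0, 3, 9):  # 0 = no compression, 9 = maximum
    path = os.path.join(out_dir, f"model_c{level}.joblib")
    joblib.dump(model, path, compress=level)
    sizes[level] = os.path.getsize(path)
    print(f"compress={level}: {sizes[level] / 1024:.0f} KiB")
```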
7. Poor Metrics Despite Correct Pipeline
Symptom
All steps seem correct, but the metrics are poor.
Diagnostic
# Step 1: Check data quality
print(f"Missing values:\n{X_train.isnull().sum()[X_train.isnull().sum() > 0]}")
print(f"\nDuplicate rows: {X_train.duplicated().sum()}")
print(f"\nClass distribution:\n{y_train.value_counts(normalize=True)}")
# Step 2: Check feature relevance
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_train, y_train)
feature_importance = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(f"\nTop 10 features by mutual info:")
print(feature_importance.head(10))
# Step 3: Verify no constant/quasi-constant features
low_variance = X_train.columns[X_train.std() < 0.01]
if len(low_variance) > 0:
    print(f"\n⚠️ Low variance features: {list(low_variance)}")
# Step 4: Try a simple baseline
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"\nBaseline (majority class): {dummy.score(X_test, y_test):.4f}")
print(f"Your model: {model.score(X_test, y_test):.4f}")
If your model does not significantly beat a DummyClassifier, the problem is in the data, not the model. Focus on feature engineering and data quality.
8. Convergence Warnings
Symptom
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Cause
The optimization algorithm did not converge within the allocated number of iterations.
Solution
# Solution 1: Increase max iterations
model = LogisticRegression(max_iter=1000) # default is 100
# Solution 2: Scale your features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Solution 3: Try a different solver
model = LogisticRegression(solver='saga', max_iter=500)
# Solution 4: Reduce regularization strength
model = LogisticRegression(C=10.0, max_iter=500)
ConvergenceWarning messages are not mere warnings — a model that has not converged produces unreliable results. Always fix this warning before proceeding.
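To confirm a fix actually silenced the warning (rather than just hiding it), the warning can be captured programmatically. In this sketch the synthetic data and the deliberate mis-scaling are illustrative; raises_convergence_warning is a hypothetical helper:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_wild = X * 1e4  # deliberately mis-scaled features

def raises_convergence_warning(model):
    """Fit the model and report whether a ConvergenceWarning was emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_wild, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)

# Scaling inside a Pipeline should make the warning go away.
fixed = Pipeline([("scaler", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=100))])
print("warns after scaling:", raises_convergence_warning(fixed))
```

The same helper can be wired into a test suite so a reintroduced convergence problem fails CI instead of scrolling by in logs.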
9. Quick Reference — Diagnostic Table
| Symptom | Probable Cause | Immediate Action |
|---|---|---|
| ModuleNotFoundError when loading | Version mismatch | Check sklearn version in metadata |
| Train score >> test score | Overfitting | Reduce complexity, add regularization |
| Both train and test scores low | Underfitting | More complex model, more features |
| Unrealistic CV scores (>0.99) | Data leakage | Check pipeline, order of operations |
| PicklingError | Non-serializable object | Replace lambdas with named functions |
| ONNX conversion error | Unsupported transformer | Check the compatibility list |
| MemoryError | Model too large | Reduce n_estimators, max_depth |
| ConvergenceWarning | Not enough iterations | Increase max_iter, scale features |
| Scores don't beat baseline | Inadequate data | Feature engineering, check data quality |
| Loaded model accuracy differs | Incomplete pipeline | Serialize the complete pipeline |
Debugging Checklist
When you encounter a problem, follow this checklist systematically:
- ☐ Check dependency versions (Python, sklearn, numpy)
- ☐ Check for data leakage (scaling after split?)
- ☐ Compare train vs test scores (gap < 10%?)
- ☐ Check cross-validation scores (reasonable standard deviation?)
- ☐ Test a DummyClassifier as baseline
- ☐ Check for missing values and duplicates
- ☐ Verify the complete pipeline is serialized
- ☐ Test loading and prediction of the serialized model
- ☐ Check for warnings in logs (ConvergenceWarning, etc.)
- ☐ Reproduce the issue in a clean environment
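Checklist items 7 and 8 can be automated as a single smoke test run before every deployment. A minimal sketch (smoke_test_serialization is a hypothetical helper name; the synthetic dataset stands in for your real data):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def smoke_test_serialization(pipeline, X_train, y_train, X_test):
    """Fit, dump, reload, and confirm predictions are identical."""
    pipeline.fit(X_train, y_train)
    path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)
    return np.array_equal(pipeline.predict(X_test), loaded.predict(X_test))

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", RandomForestClassifier(n_estimators=30, random_state=42))])
print("round-trip OK:", smoke_test_serialization(pipe, X_train, y_train, X_test))
```

If this ever prints False, work through the checklist above before shipping the artifact.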