Troubleshooting — Model Training & Serialization
Introduction
This guide covers the most common issues encountered during ML model training, evaluation, and serialization. Each problem is presented with its symptoms, causes, solutions, and preventive measures.
1. Version Mismatch When Loading Models
Symptom
ModuleNotFoundError: No module named 'sklearn.ensemble._forest'
or
UserWarning: Trying to unpickle estimator RandomForestClassifier from version 1.2.0 when using version 1.4.0. This might lead to breaking code or invalid results.
Cause
The model was serialized with a different version of scikit-learn than the one used to load it. The internal class structures change between versions.
Solution
# Option 1: Match the scikit-learn version
pip install scikit-learn==1.2.0 # match the version used during training
# Option 2: Re-train and re-serialize with the current version
import joblib
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, 'model_new_version.joblib')
Prevention
import json
import platform
from datetime import date

import numpy as np
import sklearn

metadata = {
    "python_version": platform.python_version(),
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "serialized_date": date.today().isoformat(),
}
with open('model_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
Always save a requirements.txt or metadata.json file with the exact versions of all dependencies alongside your serialized model.
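At load time, the saved metadata can be compared against the running environment before trusting the model. A minimal sketch, assuming the metadata format above (check_model_metadata is a hypothetical helper, not part of scikit-learn):

```python
import json

import sklearn

def check_model_metadata(metadata_path):
    """Return a list of version mismatches between the saved metadata
    and the current environment (empty list = safe to load)."""
    with open(metadata_path) as f:
        metadata = json.load(f)
    mismatches = []
    saved = metadata.get("sklearn_version")
    if saved != sklearn.__version__:
        mismatches.append(f"sklearn: saved {saved}, running {sklearn.__version__}")
    return mismatches
```

Run this check before `joblib.load` and refuse to load (or at least log loudly) when the returned list is non-empty.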
2. Overfitting — High Training Performance, Low Test Performance
Symptom
Training Accuracy: 0.9990
Test Accuracy: 0.7523
Gap: 0.2467 ← Too large!
Cause
The model memorized the noise in the training data instead of learning generalizable patterns. Common causes:
- Model too complex (too many trees, unlimited depth)
- Not enough training data
- Too many features relative to the number of samples
- No regularization
Solution
# Solution 1: Add regularization
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1, penalty='l2')  # smaller C = stronger regularization

# Solution 2: Reduce model complexity
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,           # limit depth
    min_samples_leaf=10,   # require a minimum number of samples per leaf
    max_features='sqrt',   # use a subset of features at each split
    random_state=42
)
# Solution 3: Feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
# Solution 4: Cross-validation to detect early
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV Score: {scores.mean():.4f} ± {scores.std():.4f}")
# Large std = overfitting risk
If the gap between training and test performance exceeds 10 percentage points, overfitting is likely. If the standard deviation of the cross-validation scores exceeds 0.05, the model is unstable across folds.
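These two rules of thumb are easy to wrap in a small helper (diagnose_fit is a hypothetical name; the 10-point gap and 0.70 score thresholds are heuristics, not hard rules):

```python
def diagnose_fit(train_score, test_score, gap_threshold=0.10, low_score=0.70):
    """Apply the rules of thumb above to a train/test score pair."""
    gap = train_score - test_score
    if gap > gap_threshold:
        return "likely overfitting"       # large train/test gap
    if train_score < low_score and test_score < low_score:
        return "likely underfitting"      # both scores low
    return "no obvious fit problem"
```

For the symptom shown above, `diagnose_fit(0.9990, 0.7523)` flags overfitting.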
Prevention
| Strategy | When to Use |
|---|---|
| Systematic cross-validation | Always |
| Regularization (L1, L2) | Linear models |
| Limit depth / complexity | Decision trees, Random Forest |
| Increase data | Small dataset (< 1000 samples) |
| Early stopping | Deep learning, boosting |
| Dropout | Neural networks |
3. Underfitting — Low Performance Everywhere
Symptom
Training Accuracy: 0.6234
Test Accuracy: 0.6102
Cause
The model is too simple to capture the patterns in the data.
Solution
# Solution 1: Use a more complex model
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=200, max_depth=5)
# Solution 2: Add polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_train_poly = poly.fit_transform(X_train)
# Solution 3: Engineer better features
# (domain-specific — combine or transform existing features)
# Solution 4: Reduce regularization
model = LogisticRegression(C=10.0) # higher C = less regularization
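One way to tell under- from overfitting in a single sweep is scikit-learn's validation_curve, which scores both the training and validation sets across a complexity parameter. The synthetic dataset and the max_depth grid below are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
depths = [1, 3, 5, 10]

# Low max_depth should underfit (both curves low); large max_depth
# may overfit (train curve far above the validation curve).
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:>2}: train={tr:.3f}  val={va:.3f}")
```

Pick the complexity where the validation curve peaks, not where the training curve does.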
4. Data Leakage — Results Too Good to Be True
Symptom
Cross-Validation Accuracy: 0.9999
Test Accuracy: 0.6543
Unrealistically high validation scores that drop drastically on the test set.
Cause
Information from the test set has leaked into the training process. Common sources:
| Leak Source | Example |
|---|---|
| Scaling before split | scaler.fit(X_all) before train_test_split |
| Feature including the target | Column 100% correlated with y (e.g., "diagnosis_code") |
| Duplicated data | Same patient in both train and test |
| Temporal order not respected | Future data in the training set |
Solution
# ❌ WRONG: Scaling before split (data leakage!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # fits on ALL data
X_train, X_test = train_test_split(X_scaled, ...)
# ✅ CORRECT: Scaling after split
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit ONLY on train
X_test_scaled = scaler.transform(X_test) # transform only
# ✅ BEST: Use a Pipeline (automatically prevents leakage)
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)  # the scaler sees only training data
Data leakage produces falsely excellent metrics that never reproduce in production. Always ask: has my model seen, directly or indirectly, any data from the test set during training?
Prevention
- Always use Pipelines — they ensure that preprocessing only sees the training data
- Check correlations — a feature correlated at >0.95 with the target is suspicious
- Inspect duplicates — make sure no sample appears in multiple splits
- Respect temporal order for time series data
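The duplicate check in particular is easy to automate. A minimal sketch (split_overlap is a hypothetical helper; it only catches exact row duplicates, not near-duplicates):

```python
import numpy as np

def split_overlap(X_train, X_test):
    """Count rows that appear verbatim in both splits.

    Any non-zero result means the test set contains 'seen' data."""
    train_rows = set(map(tuple, np.asarray(X_train).tolist()))
    test_rows = set(map(tuple, np.asarray(X_test).tolist()))
    return len(train_rows & test_rows)
```

Run it right after train_test_split and fail loudly if the count is greater than zero.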
5. Serialization Errors
5a. Pickle/Joblib — Custom Objects Not Serializable
Symptom
PicklingError: Can't pickle <class 'function'>: attribute lookup builtins.function failed
Cause
The pipeline contains lambda functions, closures, or non-serializable objects.
Solution
# ❌ WRONG: Lambda in pipeline
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(lambda x: x ** 2) # lambdas can't pickle
# ✅ CORRECT: Named function
def square_transform(x):
    return x ** 2

transformer = FunctionTransformer(square_transform)
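The fix can be verified with a quick round-trip: a module-level named function serializes fine, where a lambda in the same position would raise the PicklingError above.

```python
import pickle

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def square_transform(x):
    return x ** 2

transformer = FunctionTransformer(square_transform)
restored = pickle.loads(pickle.dumps(transformer))  # would fail for a lambda
result = restored.transform(np.array([[2.0, 3.0]]))  # array([[4., 9.]])
```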
5b. ONNX Conversion Fails
Symptom
RuntimeError: Unable to find a shape calculator for type 'MyCustomTransformer'
Cause
ONNX does not support all scikit-learn transformers/estimators.
Solution
# Check supported converters
from skl2onnx import supported_converters
print("Supported models:")
for name in sorted(supported_converters()):
    print(f" - {name}")
# Workaround: serialize pipeline in parts
# Save the custom preprocessing separately
joblib.dump(custom_preprocessor, 'preprocessor.joblib')
# Convert only the model to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(sklearn_model, initial_types=initial_type)
The most common scikit-learn models (LinearRegression, LogisticRegression, RandomForest, SVM, KNN, GradientBoosting) are supported. Check the list of converters before planning a conversion.
5c. Loaded Model Gives Different Results
Symptom
Original model accuracy: 0.9500
Loaded model accuracy: 0.8700 ← Different!
Cause
- The preprocessing was not serialized with the model
- The test data was transformed differently
- The scikit-learn version is different
Solution
# ✅ Always serialize the COMPLETE pipeline
full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
full_pipeline.fit(X_train, y_train)

# Save
joblib.dump(full_pipeline, 'full_pipeline.joblib')

# Load and verify immediately
loaded = joblib.load('full_pipeline.joblib')
original_pred = full_pipeline.predict(X_test[:5])
loaded_pred = loaded.predict(X_test[:5])
assert np.array_equal(original_pred, loaded_pred), "Mismatch detected!"
6. Memory Issues with Large Models
Symptom
MemoryError: Unable to allocate 2.50 GiB for an array
or the process is killed by the OS (killed/OOM).
Cause
The model is too large for the available memory. Random Forests with many deep trees are the most memory-hungry.
Solution
# Solution 1: Reduce model size
model = RandomForestClassifier(
    n_estimators=50,      # fewer trees
    max_depth=10,         # limit depth
    max_leaf_nodes=100,   # limit leaves
)
# Solution 2: Use compression when saving
joblib.dump(model, 'model.joblib', compress=9) # max compression
# Solution 3: Use memory-mapped loading for large files
model = joblib.load('model.joblib', mmap_mode='r')
# Solution 4: Measure the real model size
# (sys.getsizeof only counts the top-level Python object,
#  not the tree arrays inside — use the serialized size instead)
import pickle
model_size_mb = len(pickle.dumps(model)) / (1024 * 1024)
print(f"Serialized model size: {model_size_mb:.1f} MB")
| Parameter | Effect on Memory | Trade-off |
|---|---|---|
| n_estimators ↓ | ⬇️ Reduces | Lower performance |
| max_depth ↓ | ⬇️ Reduces significantly | Risk of underfitting |
| max_leaf_nodes ↓ | ⬇️ Reduces | Lower accuracy |
| compress=9 | Smaller file on disk | Slower loading |
| mmap_mode='r' | Lazy loading | Slower access |
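The compress trade-off from the table can be measured directly. This sketch trains a throwaway forest and compares on-disk sizes; exact sizes will vary by environment:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

sizes = {}
out_dir = tempfile.mkdtemp()
for level in (0, 3, 9):  # 0 = no compression, 9 = maximum
    path = os.path.join(out_dir, f"model_c{level}.joblib")
    joblib.dump(model, path, compress=level)
    sizes[level] = os.path.getsize(path)
    print(f"compress={level}: {sizes[level] / 1024:.0f} KiB")
```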
7. Poor Metrics Despite Correct Pipeline
Symptom
All steps seem correct, but the metrics are poor.
Diagnostic
# Step 1: Check data quality
print(f"Missing values:\n{X_train.isnull().sum()[X_train.isnull().sum() > 0]}")
print(f"\nDuplicate rows: {X_train.duplicated().sum()}")
print(f"\nClass distribution:\n{y_train.value_counts(normalize=True)}")
# Step 2: Check feature relevance
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_train, y_train)
feature_importance = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(f"\nTop 10 features by mutual info:")
print(feature_importance.head(10))
# Step 3: Verify no constant/quasi-constant features
low_variance = X_train.columns[X_train.std() < 0.01]
if len(low_variance) > 0:
    print(f"\n⚠️ Low variance features: {list(low_variance)}")
# Step 4: Try a simple baseline
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"\nBaseline (majority class): {dummy.score(X_test, y_test):.4f}")
print(f"Your model: {model.score(X_test, y_test):.4f}")
If your model does not significantly beat a DummyClassifier, the problem is in the data, not the model. Focus on feature engineering and data quality.
8. Convergence Warnings
Symptom
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Cause
The optimization algorithm did not converge within the allocated number of iterations.
Solution
# Solution 1: Increase max iterations
model = LogisticRegression(max_iter=1000) # default is 100
# Solution 2: Scale your features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Solution 3: Try a different solver
model = LogisticRegression(solver='saga', max_iter=500)
# Solution 4: Reduce regularization strength
model = LogisticRegression(C=10.0, max_iter=500)
ConvergenceWarning messages are not mere warnings — a model that has not converged produces unreliable results. Always fix this warning before proceeding.
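To confirm a fix actually silenced the warning (rather than just hiding it), the warning can be captured programmatically. In this sketch the synthetic data and the deliberate mis-scaling are illustrative; raises_convergence_warning is a hypothetical helper:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_wild = X * 1e4  # deliberately mis-scaled features

def raises_convergence_warning(model):
    """Fit the model and report whether a ConvergenceWarning was emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_wild, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)

# Scaling inside a Pipeline should make the warning go away.
fixed = Pipeline([("scaler", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=100))])
print("warns after scaling:", raises_convergence_warning(fixed))
```

The same helper can be wired into a test suite so a reintroduced convergence problem fails CI instead of scrolling by in logs.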
9. Quick Reference — Diagnostic Table
| Symptom | Probable Cause | Immediate Action |
|---|---|---|
| ModuleNotFoundError when loading | Version mismatch | Check sklearn version in metadata |
| Train score >> test score | Overfitting | Reduce complexity, add regularization |
| Both train and test scores low | Underfitting | More complex model, more features |
| Unrealistic CV scores (>0.99) | Data leakage | Check pipeline, order of operations |
| PicklingError | Non-serializable object | Replace lambdas with named functions |
| ONNX conversion error | Unsupported transformer | Check the compatibility list |
| MemoryError | Model too large | Reduce n_estimators, max_depth |
| ConvergenceWarning | Not enough iterations | Increase max_iter, scale features |
| Scores don't beat baseline | Inadequate data | Feature engineering, check data quality |
| Loaded model accuracy differs | Incomplete pipeline | Serialize the complete pipeline |
Debugging Checklist
When you encounter a problem, follow this checklist systematically:
- ☐ Check dependency versions (Python, sklearn, numpy)
- ☐ Check for data leakage (scaling after split?)
- ☐ Compare train vs test scores (gap < 10%?)
- ☐ Check cross-validation scores (reasonable standard deviation?)
- ☐ Test a DummyClassifier as baseline
- ☐ Check for missing values and duplicates
- ☐ Verify the complete pipeline is serialized
- ☐ Test loading and prediction of the serialized model
- ☐ Check for warnings in logs (ConvergenceWarning, etc.)
- ☐ Reproduce the issue in a clean environment
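Checklist items 7 and 8 can be automated as a single smoke test run before every deployment. A minimal sketch (smoke_test_serialization is a hypothetical helper name; the synthetic dataset stands in for your real data):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def smoke_test_serialization(pipeline, X_train, y_train, X_test):
    """Fit, dump, reload, and confirm predictions are identical."""
    pipeline.fit(X_train, y_train)
    path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)
    return np.array_equal(pipeline.predict(X_test), loaded.predict(X_test))

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", RandomForestClassifier(n_estimators=30, random_state=42))])
print("round-trip OK:", smoke_test_serialization(pipe, X_train, y_train, X_test))
```

If this ever prints False, work through the checklist above before shipping the artifact.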