
Model Training and Evaluation Concepts

Theory · 45 min · Module 2

Introduction

Before deploying an AI model, you need to train it properly and evaluate its performance rigorously. A model that performs well on training data but fails in production is worse than no model at all — it gives false confidence.

Real-World Analogy

Imagine a student who memorizes all the answers from a past exam without understanding the concepts. They will score 100% on that specific exam, but fail when faced with new questions. This is exactly what happens to a model that overfits its training data.


1. The ML Training Pipeline

The training pipeline is the structured sequence of steps that transforms raw data into a deployable model.

[Figure: ML Training Pipeline]

Each step in this pipeline is critical. Skipping or rushing any step can lead to models that seem good on paper but fail in production.


2. Data Splitting: Train / Validation / Test

Why Split Data?

We split data into separate sets to get an honest estimate of how well our model will perform on data it has never seen before.

[Figure: Data Splitting Strategy]

Set | Purpose | Usage | Typical Size
Training | Learn patterns from features | Used during model.fit() | 60-70%
Validation | Tune hyperparameters, select best model | Used during model selection | 15-20%
Test | Final unbiased evaluation | Used once at the end | 15-20%
Golden Rule

The test set must never be used for making decisions during development. It is only used for the final evaluation. If you look at test set performance and go back to modify your model, you have contaminated your evaluation.

Code Example: Splitting Data

from sklearn.model_selection import train_test_split

# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Result: 60% train, 20% validation, 20% test
print(f"Training: {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Validation: {len(X_val)} samples ({len(X_val)/len(X)*100:.0f}%)")
print(f"Test: {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
Stratification

The stratify=y parameter ensures that each subset has the same class proportions as the original dataset. This is essential for imbalanced datasets (e.g., 95% class A, 5% class B).
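As a quick sketch of what stratification buys you (using tiny synthetic labels, not data from this lesson), both subsets keep the original 5% minority ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 95 samples of class 0, 5 of class 1
y = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train class-1 ratio: {y_tr.mean():.2f}")  # 0.05, same as the full dataset
print(f"Test class-1 ratio:  {y_te.mean():.2f}")  # 0.05
```

Without stratify, a small test set can easily end up with zero minority samples by pure chance.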


3. Cross-Validation

The Problem with a Single Split

A single train/validation split can be lucky or unlucky. Maybe all the "easy" examples ended up in the training set. Cross-validation solves this by testing multiple splits.

Analogy

Imagine evaluating a restaurant by eating only one dish on one day. Maybe the chef was sick that day, or on the contrary it was their best dish. It is better to come back k times and try different dishes for a reliable evaluation.

K-Fold Cross-Validation

Type | Description | When to Use
K-Fold | Splits into K folds, each serving as test in turn | General use, K=5 or K=10
Stratified K-Fold | K-Fold preserving class proportions | Imbalanced classification
Leave-One-Out (LOO) | K = number of samples | Very small datasets (< 100)
Repeated K-Fold | Repeats K-Fold multiple times with different seeds | Very robust estimation
Time Series Split | Respects the temporal order of data | Sequential / time series data
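The last row is worth a quick sketch: with sequential data, shuffled folds would let the model train on the future. scikit-learn's TimeSeriesSplit (shown here on ten toy samples) always validates on indices that come after the training window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 samples in temporal order

tscv = TimeSeriesSplit(n_splits=3)
for i, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # The validation window always lies strictly after the training window
    print(f"Fold {i}: train={train_idx.tolist()}, val={val_idx.tolist()}")
```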

Code Example: Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified 5-Fold Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')

print(f"Scores per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
# Multiple metrics at once
from sklearn.model_selection import cross_validate

results = cross_validate(
    model, X_train, y_train, cv=cv,
    scoring=['accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted'],
    return_train_score=True
)

for metric in ['accuracy', 'f1_weighted']:
    train_score = results[f'train_{metric}'].mean()
    test_score = results[f'test_{metric}'].mean()
    print(f"{metric}: Train={train_score:.4f}, Val={test_score:.4f}")

4. Hyperparameter Tuning

Hyperparameters are settings you choose before training (unlike model parameters, which are learned during training).

Parameter | Learned During Training?
Model weights | ✅ Yes
Learning rate | ❌ No (you choose it)
Number of trees (n_estimators) | ❌ No
Max depth (max_depth) | ❌ No
Regularization (C, alpha) | ❌ No
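A minimal illustration of the distinction, using a toy LogisticRegression (not part of this lesson's pipeline): C is fixed before training, while the weights only exist after fit() runs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression(C=1.0)  # hyperparameter: chosen by you, before training
model.fit(X, y)                    # parameters (coef_, intercept_) are learned here
print("learned weight:", model.coef_[0][0])
print("learned intercept:", model.intercept_[0])
```

By scikit-learn convention, learned parameters carry a trailing underscore (coef_, intercept_), while hyperparameters are constructor arguments.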

Grid Search

Grid search tests all possible combinations of the candidate values. Exhaustive but expensive.

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Total: 3 × 4 × 3 × 3 = 108 combinations × 5 folds = 540 fits

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
best_model = grid_search.best_estimator_

Random Search

Random search samples the search space at random. More efficient when the space is large.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 10, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=50,  # only 50 random combinations (vs 108+ for grid)
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best params: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")
Method | Advantages | Disadvantages | When to Use
Grid Search | Exhaustive; guaranteed to find the best combination within the grid | Very slow (cost grows combinatorially) | Small search space
Random Search | Faster; explores the space better | No guarantee of finding the optimum | Large search space
Bayesian Optimization | Intelligent; converges quickly | More complex to implement | Models that are expensive to train

5. Evaluation Metrics

Classification Metrics

The Confusion Matrix

The confusion matrix is the foundation of all classification metrics.

Analogy: The Smoke Detector
  • True Positive: There is a fire, the alarm sounds ✅
  • False Positive: No fire, the alarm sounds anyway (burnt toast) 🚨
  • False Negative: There is a fire, but the alarm doesn't sound 💀
  • True Negative: No fire, the alarm stays silent ✅

An FN (missing a real fire) is much more serious than an FP (false alarm). The choice of metric depends on the cost of errors.
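These four counts can be read directly from scikit-learn's confusion matrix. A minimal sketch with made-up smoke-detector labels (1 = fire, 0 = no fire):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical detector outcomes
y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # what actually happened
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # what the alarm did

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=1, FN=1, TN=3
```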

Metrics Derived from the Confusion Matrix

Metric | Formula | Question It Answers
Accuracy | (TP + TN) / Total | What fraction of predictions is correct?
Precision | TP / (TP + FP) | Among positive predictions, how many are truly positive?
Recall (Sensitivity) | TP / (TP + FN) | Among actual positives, how many were detected?
F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall
Specificity | TN / (TN + FP) | Among actual negatives, how many were identified?
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)

y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred, average='weighted'):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax, cmap='Purples')
ax.set_title("Confusion Matrix")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()

AUC-ROC Curve

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate vs. the False Positive Rate at various classification thresholds. The AUC (Area Under the Curve) summarizes this into a single number.

AUC Value | Interpretation
1.0 | Perfect classifier
0.9 – 1.0 | Excellent
0.8 – 0.9 | Good
0.7 – 0.8 | Fair
0.5 | Random (no skill)
< 0.5 | Worse than random
from sklearn.metrics import roc_curve, roc_auc_score

# For binary classification
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='#7c3aed', lw=2, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

When to Use Which Metric?

Scenario | Priority Metric | Why
Spam detection | Precision | Avoid marking legit emails as spam (minimize FP)
Cancer screening | Recall | Don't miss real cancers (minimize FN)
Balanced classes | Accuracy or F1 | Both error types equally costly
Imbalanced classes | F1, AUC-ROC | Accuracy is misleading with class imbalance
Ranking predictions | AUC-ROC | Measures ranking quality across thresholds
The Accuracy Trap

On a dataset with 95% class A and 5% class B, a model that always predicts A will have 95% accuracy. This is why accuracy alone is insufficient for imbalanced datasets. Always use F1, Precision, Recall, and AUC-ROC as complements.
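The trap is easy to reproduce with scikit-learn's DummyClassifier on synthetic imbalanced labels (an illustration only, not this lesson's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)  # 95% class 0, 5% class 1

# A "model" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.95 — looks impressive
print(f"F1 (class 1): {f1_score(y, y_pred):.2f}")    # 0.00 — exposes the failure
```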

Regression Metrics

Metric | Formula | Interpretation
MSE (Mean Squared Error) | Σ(y − ŷ)² / n | Heavily penalizes large errors
RMSE (Root MSE) | √MSE | Same unit as the target variable
MAE (Mean Absolute Error) | Σ|y − ŷ| / n | Average error in absolute value
R² (Coefficient of Determination) | 1 − (SS_res / SS_tot) | Proportion of variance explained (1 is perfect; can be negative for very poor models)
MAPE | (100/n) × Σ(|y − ŷ| / |y|) | Error as a percentage
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")

6. Overfitting vs. Underfitting

[Figure: Bias-Variance Comparison]

Characteristic | Underfitting | Good Fit | Overfitting
Train accuracy | ❌ Low | ✅ High | ✅ Very High
Test accuracy | ❌ Low | ✅ High | ❌ Low
Model complexity | Too simple | Just right | Too complex
Bias | High | Low | Low
Variance | Low | Low | High

How to Detect

# Train and evaluate on both sets
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

gap = train_score - test_score
print(f"Train: {train_score:.4f}")
print(f"Test: {test_score:.4f}")
print(f"Gap: {gap:.4f}")

if train_score < 0.7 and test_score < 0.7:
    print("⚠️ Underfitting: model too simple")
elif gap > 0.10:
    print("⚠️ Overfitting: model too complex")
else:
    print("✅ Good generalization")

Remedies

Problem | Solutions
Underfitting | Increase model complexity, add more features, reduce regularization, train longer
Overfitting | Add more data, increase regularization, reduce features, use dropout, early stopping, cross-validation
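One overfitting remedy from the table, reducing model complexity, can be sketched with a capped max_depth on synthetic data (make_classification is a stand-in for a real dataset); shallower trees usually shrink the train/test gap:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

for depth in [None, 5, 2]:  # from unconstrained to strongly constrained
    model = RandomForestClassifier(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
    print(f"max_depth={depth}: train/test gap = {gap:.3f}")
```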

7. Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept in machine learning that describes the tension between two sources of error.

Analogy: The Sport Shooter
  • High bias: The shooter consistently aims too far to the left (systematic error)
  • High variance: The shots are scattered all around the target (instability)
  • Goal: Grouped shots at the center (low bias + low variance)
 | Low Variance | High Variance
Low Bias | ✅ Ideal (generalization) | ⚠️ Overfitting
High Bias | ⚠️ Underfitting | ❌ Worst case
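The tradeoff can be seen empirically with polynomial regression on synthetic sine data (a stand-in example, not from this lesson): a low degree underfits (high bias), a very high degree overfits (high variance), and cross-validated R² is typically best in between:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)  # noisy sine

results = {}
for degree in [1, 4, 15]:  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results[degree] = scores.mean()
    print(f"degree={degree:2d}: CV R² = {scores.mean():.3f} ± {scores.std():.3f}")
```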

8. Learning Curves

Learning curves plot model performance as a function of training set size or training iterations. They are the most powerful visual diagnostic tool for detecting overfitting and underfitting.

from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy',
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.figure(figsize=(10, 6))
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='#7c3aed')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color='#f59e0b')
plt.plot(train_sizes, train_mean, 'o-', color='#7c3aed', label='Training Score')
plt.plot(train_sizes, val_mean, 's-', color='#f59e0b', label='Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Interpreting Learning Curves

Pattern | Diagnosis | Action
Both curves low and converging | Underfitting | Use a more complex model or more features
Train high, validation low, gap persists | Overfitting | Add more data, regularize, simplify model
Both curves high and converging | Good fit | Model is ready for deployment
Validation curve still rising at the end | More data needed | Collect more training data

9. Putting It All Together

Here is a complete example combining all concepts:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GridSearchCV
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Load data
data = load_iris()
X, y = data.data, data.target

# 2. Split: 60% train, 20% val, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# 3. Build a pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# 4. Hyperparameter tuning with cross-validation
param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [3, 5, None],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    pipeline, param_grid, cv=cv,
    scoring='f1_weighted', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best CV F1-score: {grid_search.best_score_:.4f}")

# 5. Validate on validation set
best_model = grid_search.best_estimator_
val_pred = best_model.predict(X_val)
print(f"\nValidation Accuracy: {accuracy_score(y_val, val_pred):.4f}")

# 6. Final evaluation on test set
test_pred = best_model.predict(X_test)
print("\nTest Set Results:")
print(classification_report(y_test, test_pred, target_names=data.target_names))

Summary

🔑 Key Takeaways
  1. Data Splitting: Always split into train/validation/test. The test set is only for final evaluation.
  2. Cross-Validation: Use K-Fold (K=5 or 10) for robust performance estimates.
  3. Hyperparameter Tuning: GridSearch for small spaces, RandomSearch for large spaces.
  4. Metrics: Choose the metric based on the cost of errors (Precision vs Recall).
  5. Confusion Matrix: Foundation of all classification metrics. Learn to read it.
  6. Overfitting: High training performance, low test performance → model too complex.
  7. Underfitting: Low performance everywhere → model too simple.
  8. Bias-Variance: The goal is to minimize both simultaneously.
  9. Learning Curves: Essential visual tool for diagnosing problems.

Further Reading

Resource | Link
scikit-learn Model Selection Guide | sklearn.model_selection
Cross-Validation: Evaluating Estimator Performance | sklearn Cross-Validation
Metrics and Scoring | sklearn Metrics
Understanding the Bias-Variance Tradeoff | Scott Fortmann-Roe