
TP2 — Train, Evaluate and Serialize a Model

Practical Lab 90 min Intermediate

Lab Objectives

By the end of this lab, you will be able to:

  • ✅ Load and explore a real dataset
  • ✅ Preprocess the data (scaling, encoding)
  • ✅ Train multiple classification models
  • ✅ Evaluate and compare models with rigorous metrics
  • ✅ Visualize results (confusion matrix, ROC curve)
  • ✅ Serialize the best model in pickle, joblib, and ONNX
  • ✅ Load and verify serialized models
  • ✅ Generate an evaluation report

Prerequisites

| Prerequisite | Detail |
|---|---|
| Python | 3.10+ installed |
| Libraries | scikit-learn, pandas, numpy, matplotlib, seaborn |
| Knowledge | Module 2 — Concepts (Training & Serialization) |
| Environment | Virtual environment activated |

Install dependencies

pip install scikit-learn pandas numpy matplotlib seaborn joblib skl2onnx onnxruntime

Project architecture

tp2-model-evaluation/
├── tp2_train_evaluate.py # Main script
├── models/
│ ├── best_model.pkl # Serialized model (pickle)
│ ├── best_model.joblib # Serialized model (joblib)
│ ├── best_model.onnx # Serialized model (ONNX)
│ └── metadata.json # Model metadata
├── reports/
│ ├── confusion_matrix.png # Confusion matrix
│ ├── roc_curve.png # ROC curve
│ └── evaluation_report.txt # Text report
└── README.md

Step 1 — Setup and data loading

Dataset used

We use the Breast Cancer Wisconsin dataset from scikit-learn. This is a binary classification problem (malignant vs benign tumor) with 30 numeric features and 569 samples.

# tp2_train_evaluate.py — Step 1: Load and explore data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
import os

# Create output directories
os.makedirs('models', exist_ok=True)
os.makedirs('reports', exist_ok=True)

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Explore the dataset
print("=" * 60)
print("DATASET EXPLORATION")
print("=" * 60)
print(f"\nShape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nClass names: {data.target_names}")
print(f"\nFirst 5 features:")
print(X.iloc[:, :5].describe())
✅ Expected result
============================================================
DATASET EXPLORATION
============================================================

Shape: (569, 30)
Features: 30
Samples: 569

Target distribution:
1 357
0 212
Name: target, dtype: int64

Class names: ['malignant' 'benign']

Step 2 — Data preprocessing

# Step 2: Preprocessing and data splitting

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Validation set: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.0f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.0f}%)")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print(f"\nAfter scaling — Train mean: {X_train_scaled.mean():.6f}, std: {X_train_scaled.std():.4f}")
Watch out for data leakage!

scaler.fit_transform() is called only on the training set; the validation and test sets are transformed with scaler.transform(), without refitting. This prevents statistics from the held-out data from leaking into the preprocessing (data leakage).
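To make the pitfall concrete, here is a minimal, self-contained sketch on synthetic data (the array and scaler names are illustrative, not part of the lab script):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only, not the lab dataset).
rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_te = X_all[:80], X_all[80:]

# WRONG: fitting on the full dataset lets test-set statistics
# influence the scaling applied to the training data.
leaky_scaler = StandardScaler().fit(X_all)

# RIGHT: fit on the training split only, then reuse it everywhere.
clean_scaler = StandardScaler().fit(X_tr)
X_te_scaled = clean_scaler.transform(X_te)

# The two scalers learn different means, because the leaky one
# has also seen the 20 test rows. Expected to print False.
print(np.allclose(leaky_scaler.mean_, clean_scaler.mean_))
```

The difference looks tiny on well-behaved data, but it biases every downstream metric, which is exactly why the lab fits the scaler on the training split only.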

✅ Expected result
Training set:   341 samples (60%)
Validation set: 114 samples (20%)
Test set: 114 samples (20%)

After scaling — Train mean: -0.000000, std: 1.0000

Step 3 — Training multiple models

# Step 3: Train multiple models

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Train and evaluate each model with cross-validation
cv_results = {}
print("=" * 60)
print("CROSS-VALIDATION RESULTS (5-Fold)")
print("=" * 60)

for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='f1')
    cv_results[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }
    print(f"\n{name}:")
    print(f"  F1 scores: {scores.round(4)}")
    print(f"  Mean F1: {scores.mean():.4f} ± {scores.std():.4f}")
✅ Expected result (approximate)
============================================================
CROSS-VALIDATION RESULTS (5-Fold)
============================================================

Logistic Regression:
F1 scores: [0.9783 0.9778 0.9565 0.9778 0.9778]
Mean F1: 0.9736 ± 0.0087

Random Forest:
F1 scores: [0.9778 0.9556 0.9565 0.9778 0.9556]
Mean F1: 0.9647 ± 0.0107

SVM (RBF):
F1 scores: [0.9783 0.9778 0.9783 0.9778 0.9778]
Mean F1: 0.9780 ± 0.0003

...

Step 4 — Detailed evaluation on the validation set

# Step 4: Detailed evaluation on validation set

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score
)

# Train all models on full training set and evaluate on validation set
val_results = {}
print("\n" + "=" * 60)
print("VALIDATION SET RESULTS")
print("=" * 60)

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    y_proba = model.predict_proba(X_val_scaled)[:, 1]

    val_results[name] = {
        'accuracy': accuracy_score(y_val, y_pred),
        'precision': precision_score(y_val, y_pred),
        'recall': recall_score(y_val, y_pred),
        'f1': f1_score(y_val, y_pred),
        'auc_roc': roc_auc_score(y_val, y_proba),
        'predictions': y_pred,
        'probabilities': y_proba,
    }

# Display results as a comparison table
results_df = pd.DataFrame(val_results).T
results_df = results_df[['accuracy', 'precision', 'recall', 'f1', 'auc_roc']]
results_df = results_df.round(4)
results_df = results_df.sort_values('f1', ascending=False)

print("\n📊 Model Comparison Table:")
print(results_df.to_string())

# Identify best model
best_model_name = results_df['f1'].idxmax()
print(f"\n🏆 Best model: {best_model_name} (F1 = {results_df.loc[best_model_name, 'f1']:.4f})")

Step 5 — Visualization: Confusion Matrix & ROC Curve

# Step 5a: Confusion Matrix for the best model

from sklearn.metrics import ConfusionMatrixDisplay

best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
y_val_pred = best_model.predict(X_val_scaled)

fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(
    y_val, y_val_pred,
    display_labels=data.target_names,
    cmap='Purples',
    ax=ax
)
ax.set_title(f'Confusion Matrix - {best_model_name}', fontsize=14)
plt.tight_layout()
plt.savefig('reports/confusion_matrix.png', dpi=150)
plt.show()
print("✅ Confusion matrix saved to reports/confusion_matrix.png")
# Step 5b: ROC Curves for all models

from sklearn.metrics import roc_curve

fig, ax = plt.subplots(figsize=(10, 7))
colors = ['#7c3aed', '#3b82f6', '#10b981', '#f59e0b', '#ef4444']

for (name, result), color in zip(val_results.items(), colors):
    fpr, tpr, _ = roc_curve(y_val, result['probabilities'])
    ax.plot(fpr, tpr, color=color, lw=2,
            label=f"{name} (AUC = {result['auc_roc']:.4f})")

ax.plot([0, 1], [0, 1], 'k--', lw=1, alpha=0.5, label='Random')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves - Model Comparison', fontsize=14)
ax.legend(loc='lower right', fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('reports/roc_curve.png', dpi=150)
plt.show()
print("✅ ROC curves saved to reports/roc_curve.png")

Step 6 — Serialize the best model

Complete pipeline

We serialize the complete pipeline (scaler + model) to ensure preprocessing is included.

# Step 6: Serialize the best model in all formats

import pickle
import joblib
from sklearn.pipeline import Pipeline

# Build full pipeline with best model
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', models[best_model_name])
])
best_pipeline.fit(X_train, y_train)  # fit on UN-scaled data (the pipeline handles scaling)

# 6a. Pickle
with open('models/best_model.pkl', 'wb') as f:
    pickle.dump(best_pipeline, f)
print("✅ Saved: models/best_model.pkl")

# 6b. Joblib (with compression)
joblib.dump(best_pipeline, 'models/best_model.joblib', compress=3)
print("✅ Saved: models/best_model.joblib")

# 6c. ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_sklearn(best_pipeline, initial_types=initial_type)
with open('models/best_model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
print("✅ Saved: models/best_model.onnx")

# Compare file sizes
import os
for ext in ['pkl', 'joblib', 'onnx']:
    filepath = f'models/best_model.{ext}'
    size_kb = os.path.getsize(filepath) / 1024
    print(f"  {filepath:30s}{size_kb:8.1f} KB")

Step 7 — Load and verify serialized models

# Step 7: Load and verify all serialized models

import numpy as np

# Sample test data (first 5 samples)
X_sample = X_test.iloc[:5]
y_sample = y_test.iloc[:5]

print("=" * 60)
print("VERIFICATION - Serialized Models")
print("=" * 60)
print(f"\nTrue labels: {y_sample.values}")

# 7a. Load Pickle
with open('models/best_model.pkl', 'rb') as f:
    model_pkl = pickle.load(f)
pred_pkl = model_pkl.predict(X_sample)
print(f"Pickle predictions: {pred_pkl}")

# 7b. Load Joblib
model_joblib = joblib.load('models/best_model.joblib')
pred_joblib = model_joblib.predict(X_sample)
print(f"Joblib predictions: {pred_joblib}")

# 7c. Load ONNX
import onnxruntime as ort

session = ort.InferenceSession('models/best_model.onnx')
input_name = session.get_inputs()[0].name
X_sample_float = X_sample.values.astype(np.float32)
pred_onnx = session.run(None, {input_name: X_sample_float})[0]
print(f"ONNX predictions: {pred_onnx}")

# 7d. Verify consistency
assert np.array_equal(pred_pkl, pred_joblib), "Pickle/Joblib mismatch!"
print("\n✅ All serialization formats produce consistent predictions!")

# Full test set evaluation of loaded model
y_test_pred = model_joblib.predict(X_test)
final_accuracy = accuracy_score(y_test, y_test_pred)
final_f1 = f1_score(y_test, y_test_pred)
print(f"\n📊 Final Test Set Performance (loaded model):")
print(f" Accuracy: {final_accuracy:.4f}")
print(f" F1-Score: {final_f1:.4f}")
✅ Expected result
============================================================
VERIFICATION - Serialized Models
============================================================

True labels: [1 0 0 1 1]
Pickle predictions: [1 0 0 1 1]
Joblib predictions: [1 0 0 1 1]
ONNX predictions: [1 0 0 1 1]

✅ All serialization formats produce consistent predictions!

📊 Final Test Set Performance (loaded model):
Accuracy: 0.9737
F1-Score: 0.9808

Step 8 — Generate evaluation report

# Step 8: Generate evaluation report

import json
from datetime import datetime

# Save metadata
metadata = {
    "model_name": best_model_name,
    "version": "1.0.0",
    "timestamp": datetime.now().isoformat(),
    "dataset": "Breast Cancer Wisconsin",
    "n_samples": len(X),
    "n_features": X.shape[1],
    "split": {"train": len(X_train), "val": len(X_val), "test": len(X_test)},
    "test_metrics": {
        "accuracy": round(final_accuracy, 4),
        "f1_score": round(final_f1, 4),
        "precision": round(precision_score(y_test, y_test_pred), 4),
        "recall": round(recall_score(y_test, y_test_pred), 4),
    },
    "serialization_formats": ["pickle", "joblib", "onnx"],
    "hyperparameters": models[best_model_name].get_params(),
}

with open('models/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2, default=str)
print("✅ Metadata saved to models/metadata.json")

# Generate text report
report_lines = [
    "=" * 60,
    "MODEL EVALUATION REPORT",
    "=" * 60,
    f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
    f"Dataset: Breast Cancer Wisconsin ({len(X)} samples, {X.shape[1]} features)",
    "",
    "--- Data Split ---",
    f"Training: {len(X_train)} samples",
    f"Validation: {len(X_val)} samples",
    f"Test: {len(X_test)} samples",
    "",
    "--- Cross-Validation Results (F1 Score) ---",
]

for name, result in sorted(cv_results.items(), key=lambda x: x[1]['mean'], reverse=True):
    report_lines.append(f"  {name:25s}: {result['mean']:.4f} ± {result['std']:.4f}")

report_lines.extend([
    "",
    "--- Validation Set Results ---",
    results_df.to_string(),
    "",
    f"--- Best Model: {best_model_name} ---",
    f"Test Accuracy: {final_accuracy:.4f}",
    f"Test F1-Score: {final_f1:.4f}",
    "",
    "--- Classification Report (Test Set) ---",
    classification_report(y_test, y_test_pred, target_names=data.target_names),
    "",
    "--- Serialized Files ---",
])

for ext in ['pkl', 'joblib', 'onnx']:
    filepath = f'models/best_model.{ext}'
    size_kb = os.path.getsize(filepath) / 1024
    report_lines.append(f"  {filepath}: {size_kb:.1f} KB")

report_text = "\n".join(report_lines)
with open('reports/evaluation_report.txt', 'w') as f:
    f.write(report_text)

print("✅ Evaluation report saved to reports/evaluation_report.txt")
print("\n" + report_text)

Validation checklist

Before submitting your lab, verify the following points:

| # | Criterion | Verified |
|---|---|---|
| 1 | The dataset is correctly loaded and explored | ☐ |
| 2 | Data is split into 3 sets (train/val/test) | ☐ |
| 3 | Scaling is applied correctly (fit on train only) | ☐ |
| 4 | At least 3 models are trained and compared | ☐ |
| 5 | 5-fold cross-validation is used | ☐ |
| 6 | Metrics include accuracy, precision, recall, F1, AUC-ROC | ☐ |
| 7 | Confusion matrix is generated and saved | ☐ |
| 8 | ROC curves are generated and saved | ☐ |
| 9 | Best model is serialized in 3 formats (pkl, joblib, onnx) | ☐ |
| 10 | Serialized models are reloaded and verified | ☐ |
| 11 | An evaluation report is generated | ☐ |
| 12 | Metadata is saved in JSON | ☐ |
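The file-related items in this checklist can also be verified programmatically. Here is a minimal sketch (the `missing_artifacts` helper is ours, not part of the lab script; the paths assume you run it from the `tp2-model-evaluation/` root after the main script):

```python
import os

def missing_artifacts(paths):
    """Return the subset of paths that do not exist as files on disk."""
    return [p for p in paths if not os.path.isfile(p)]

# Files the lab is expected to produce (see the project architecture).
EXPECTED = [
    'models/best_model.pkl',
    'models/best_model.joblib',
    'models/best_model.onnx',
    'models/metadata.json',
    'reports/confusion_matrix.png',
    'reports/roc_curve.png',
    'reports/evaluation_report.txt',
]

missing = missing_artifacts(EXPECTED)
print("✅ all artifacts present" if not missing else f"❌ missing: {missing}")
```

A quick check like this catches the most common submission mistake: a script that ran partially and silently skipped a later step.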

Bonus challenges

🚀 Challenge 1 — Hyperparameter Tuning

Add a GridSearchCV or RandomizedSearchCV step on the best model to optimize its hyperparameters. Compare performance before and after tuning.

from sklearn.model_selection import GridSearchCV

# Example for Random Forest
param_grid = {
    'classifier__n_estimators': [50, 100, 200, 300],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__min_samples_split': [2, 5, 10],
}

grid_search = GridSearchCV(
    best_pipeline, param_grid, cv=5,
    scoring='f1', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1: {grid_search.best_score_:.4f}")
🚀 Challenge 2 — Learning Curves

Generate learning curves for the best model and determine whether it is overfitting or underfitting.

from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    best_pipeline, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='f1', n_jobs=-1
)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', color='#7c3aed', label='Training')
plt.plot(train_sizes, val_scores.mean(axis=1), 's-', color='#f59e0b', label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('F1 Score')
plt.title('Learning Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.savefig('reports/learning_curve.png', dpi=150)
plt.show()
🚀 Challenge 3 — MLflow Tracking

Integrate MLflow to automatically log experiments, metrics, and models.

import mlflow
import mlflow.sklearn

mlflow.set_experiment("tp2-breast-cancer")

for name, model in models.items():
    with mlflow.start_run(run_name=name):
        pipeline = Pipeline([('scaler', StandardScaler()), ('clf', model)])
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
        mlflow.log_metric("f1", f1_score(y_test, y_pred))
        mlflow.sklearn.log_model(pipeline, "model")