
LIA - Suggested Project Ideas


How to Choose Your Project

Selecting the right project is critical. Choose one that matches your skill level and interests. Use the decision flowchart below to help you decide.

Decision Flowchart

Selection Criteria

| Factor | Questions to Ask Yourself |
| --- | --- |
| Interest | Am I genuinely curious about this problem? |
| Data Availability | Is the dataset easy to access and well-documented? |
| Complexity | Can I realistically complete this in 15 hours? |
| Explainability | Will LIME/SHAP produce meaningful results for this model? |
| Demonstrability | Can I build a convincing live demo? |

Choose Wisely

You cannot change your project after Week 13. Make sure you have access to the dataset and understand the problem before committing.


Project Comparison Table

| # | Project | Type | Difficulty | Dataset Size | Best For |
| --- | --- | --- | --- | --- | --- |
| 1 | House Price Predictor | Regression | ⭐ Beginner | ~1,500 rows | Comfort with numbers |
| 2 | Sentiment Analyzer | Classification | ⭐⭐ Intermediate | ~50,000 rows | Interest in NLP |
| 3 | Spam Classifier | Classification | ⭐ Beginner | ~5,500 rows | First ML project |
| 4 | Customer Churn Predictor | Classification | ⭐⭐ Intermediate | ~7,000 rows | Business interest |
| 5 | Image Classifier | Classification | ⭐⭐⭐ Advanced | ~70,000 images | Computer vision interest |
| 6 | Medical Risk Assessment | Classification | ⭐⭐ Intermediate | ~300 rows | Healthcare interest |
| 7 | Movie Recommender | Classification | ⭐⭐⭐ Advanced | ~100,000 ratings | Recommender systems |
| 8 | Credit Scoring | Classification | ⭐⭐ Intermediate | ~30,000 rows | Finance interest |
| 9 | Insurance Cost Predictor | Regression | ⭐ Beginner | ~1,300 rows | Quick start, clean data |
| 10 | Employee Attrition Predictor | Classification | ⭐⭐ Intermediate | ~1,400 rows | HR / people analytics |

1. House Price Predictor

Predict housing prices based on property features.

| Aspect | Details |
| --- | --- |
| Type | Regression |
| Difficulty | ⭐ Beginner |
| Dataset | Boston Housing or Ames Housing |
| Samples | ~1,460 (Ames) / ~506 (Boston) |
| Features | Square footage, bedrooms, neighborhood, year built, etc. |
| Target | Sale price (continuous) |

Suggested Models

  • Linear Regression (baseline)
  • Random Forest Regressor
  • Gradient Boosting (XGBoost or LightGBM)

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { sqft, bedrooms, bathrooms, year_built, neighborhood } | { predicted_price, confidence_interval } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, r2_score, mae, features } |

Evaluation Metrics

  • MAE (Mean Absolute Error)
  • RMSE (Root Mean Square Error)
  • R² Score
  • MAPE (Mean Absolute Percentage Error)
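The four metrics above can be computed in a few lines with scikit-learn. A minimal sketch on invented toy prices (the numbers are illustrative only, not from any real dataset):

```python
# Toy illustration of the four regression metrics (values are made up).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100_000, 200_000, 300_000])   # actual sale prices
y_pred = np.array([110_000, 190_000, 310_000])   # model predictions

mae = mean_absolute_error(y_true, y_pred)                  # 10000.0
rmse = np.sqrt(mean_squared_error(y_true, y_pred))         # 10000.0
r2 = r2_score(y_true, y_pred)                              # ~0.985
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # ~6.1 %

print(mae, rmse, r2, mape)
```

Report MAE in the target's own units (dollars here); MAPE is handy for communicating relative error to non-technical audiences.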

Explainability Focus

SHAP is ideal here — waterfall plots clearly show how each feature pushes the predicted price up or down (e.g., larger square footage increases price, older year_built decreases it).

Sample predict request
```json
{
  "square_feet": 1500,
  "bedrooms": 3,
  "bathrooms": 2,
  "year_built": 2005,
  "neighborhood": "CollgCr",
  "garage_cars": 2,
  "overall_quality": 7
}
```

Response:

```json
{
  "predicted_price": 215400.50,
  "model_version": "1.0.0",
  "confidence": "Prediction based on Random Forest with R²=0.89"
}
```
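Whatever web framework you choose, the /predict handler reduces to the same logic: validate the payload, call the trained model, shape the response. A framework-agnostic sketch, where the field list follows the sample request above and the "model" is a hand-written placeholder rule, not a real trained estimator:

```python
# Framework-agnostic sketch of the /predict handler: validate the JSON
# payload, then delegate to a trained model. The field list mirrors the
# sample request; the toy model below is a placeholder, not a real one.
REQUIRED_FIELDS = {"square_feet", "bedrooms", "bathrooms", "year_built",
                   "neighborhood", "garage_cars", "overall_quality"}

def predict_price(payload: dict, model) -> dict:
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    price = model(payload)  # stand-in for model.predict(features)
    return {"predicted_price": round(price, 2), "model_version": "1.0.0"}

# Dummy "model": a hand-written linear rule, purely for demonstration.
toy_model = lambda p: 50_000 + 100 * p["square_feet"] + 10_000 * p["bedrooms"]

result = predict_price(
    {"square_feet": 1500, "bedrooms": 3, "bathrooms": 2, "year_built": 2005,
     "neighborhood": "CollgCr", "garage_cars": 2, "overall_quality": 7},
    toy_model,
)
print(result)
```

Keeping validation separate from the framework makes the handler easy to unit-test before you wire it into an API.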

2. Sentiment Analyzer

Classify text reviews as positive, negative, or neutral.

| Aspect | Details |
| --- | --- |
| Type | Text Classification (NLP) |
| Difficulty | ⭐⭐ Intermediate |
| Dataset | IMDB Reviews or Amazon Reviews |
| Samples | ~50,000 |
| Features | Raw text (requires TF-IDF or CountVectorizer) |
| Target | Sentiment (positive / negative) |

Suggested Models

  • Logistic Regression with TF-IDF (baseline)
  • Naive Bayes (MultinomialNB)
  • Random Forest with TF-IDF features

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { text: "This movie was fantastic!" } | { sentiment, confidence, keywords } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, accuracy, vocabulary_size } |

Evaluation Metrics

  • Accuracy
  • Precision / Recall per class
  • F1-Score
  • Confusion Matrix

Explainability Focus

LIME excels for text — it highlights which words contribute most to the prediction (e.g., "fantastic" → positive, "terrible" → negative).

Text Preprocessing

Don't forget: lowercase, remove punctuation, remove stopwords, and apply TF-IDF before training. Document your preprocessing pipeline carefully — it must be reproduced at prediction time.
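The simplest way to guarantee that the same preprocessing runs at prediction time is to bundle it with the model in a scikit-learn Pipeline. A minimal sketch on a four-sentence made-up corpus (real training would of course use the full dataset):

```python
# A scikit-learn Pipeline bundles vectorizer + classifier, so the exact
# same preprocessing runs at training and at prediction time.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus, just to show the mechanics.
texts = ["fantastic movie, loved it", "terrible plot, awful acting",
         "great film, would watch again", "boring and terrible"]
labels = ["positive", "negative", "positive", "negative"]

clf = Pipeline([
    # TfidfVectorizer lowercases and strips punctuation by itself.
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("model", LogisticRegression()),
])
clf.fit(texts, labels)

print(clf.predict(["fantastic film"]))
```

Pickling `clf` serializes the vectorizer and the classifier together, which is exactly the reproducibility the warning above asks for.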

Sample predict request
```json
{
  "text": "The product arrived quickly and works perfectly. Very satisfied with my purchase!"
}
```

Response:

```json
{
  "sentiment": "positive",
  "confidence": 0.94,
  "model_version": "1.0.0"
}
```

3. Spam Classifier

Classify SMS messages or emails as spam or not spam (ham).

| Aspect | Details |
| --- | --- |
| Type | Binary Classification (NLP) |
| Difficulty | ⭐ Beginner |
| Dataset | SMS Spam Collection |
| Samples | ~5,574 |
| Features | Raw SMS text |
| Target | Spam / Ham |

Suggested Models

  • Naive Bayes with CountVectorizer (baseline — classic spam filter)
  • Logistic Regression with TF-IDF
  • Support Vector Machine (SVM)

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { message: "You won a free iPhone!" } | { label, spam_probability } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, accuracy, f1_score } |

Evaluation Metrics

  • Accuracy
  • Precision (critical — avoid false positives marking real messages as spam)
  • Recall
  • F1-Score

Explainability Focus

LIME highlights the "spammy" words that trigger the classifier (e.g., "free", "winner", "click", "prize").

Why This is Great for Beginners

The SMS Spam dataset is small, clean, and well-documented. The problem is intuitive (everyone knows what spam looks like), and the results are easy to demo and explain.
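The whole baseline fits in a dozen lines. A sketch on a six-message invented corpus (the real project trains on the ~5,574-message SMS Spam Collection):

```python
# Minimal CountVectorizer + MultinomialNB spam filter -- the classic
# baseline for this project. The six training messages are made up.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now", "free gift card click to claim",
    "congratulations you are a winner click here",
    "are we still meeting tomorrow", "see you at lunch",
    "can you send me the report",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

spam_filter = Pipeline([
    ("bow", CountVectorizer()),   # word counts (bag of words)
    ("nb", MultinomialNB()),      # multinomial Naive Bayes
])
spam_filter.fit(messages, labels)

print(spam_filter.predict(["click to claim your free prize"]))
```

Because MultinomialNB works on raw word counts, this pairing trains in milliseconds even on the full dataset, which leaves you plenty of time for the API and explainability work.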

Sample predict request
```json
{
  "message": "CONGRATULATIONS! You've been selected to receive a $1000 Walmart gift card. Click here to claim now!"
}
```

Response:

```json
{
  "label": "spam",
  "spam_probability": 0.97,
  "model_version": "1.0.0"
}
```

4. Customer Churn Predictor

Predict whether a customer will leave (churn) a service.

| Aspect | Details |
| --- | --- |
| Type | Binary Classification |
| Difficulty | ⭐⭐ Intermediate |
| Dataset | Telco Customer Churn |
| Samples | ~7,043 |
| Features | Tenure, monthly charges, contract type, internet service, payment method, etc. |
| Target | Churn (Yes / No) |

Suggested Models

  • Logistic Regression (baseline)
  • Random Forest Classifier
  • Gradient Boosting (XGBoost)

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { tenure, monthly_charges, contract, internet_service, ... } | { churn_prediction, churn_probability, risk_level } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, auc_roc, features } |

Evaluation Metrics

  • Accuracy
  • AUC-ROC (primary metric for imbalanced classes)
  • Precision / Recall
  • F1-Score
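AUC-ROC scores how well the model ranks churners above non-churners, so it stays informative even when churners are a small minority. A toy illustration (these four labels and scores are invented purely to show the call):

```python
# AUC-ROC measures ranking quality of predicted probabilities, which is
# why it holds up on imbalanced classes. Toy numbers only.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # 1 = churned
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted churn probabilities

print(roc_auc_score(y_true, y_score))  # 0.75
```

Here one churner (score 0.35) is ranked below one non-churner (score 0.40), so 3 of the 4 churner/non-churner pairs are ordered correctly, giving 0.75.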

Explainability Focus

SHAP dependence plots reveal which factors drive churn most (e.g., month-to-month contracts, high monthly charges). Business-relevant insights make this great for your presentation.

Sample predict request
```json
{
  "tenure": 3,
  "monthly_charges": 85.50,
  "total_charges": 256.50,
  "contract": "Month-to-month",
  "internet_service": "Fiber optic",
  "payment_method": "Electronic check",
  "senior_citizen": 0
}
```

Response:

```json
{
  "churn_prediction": "Yes",
  "churn_probability": 0.82,
  "risk_level": "High",
  "model_version": "1.0.0"
}
```

5. Image Classifier (Simple)

Classify images into categories using pixel features or pre-extracted features.

| Aspect | Details |
| --- | --- |
| Type | Multi-class Classification |
| Difficulty | ⭐⭐⭐ Advanced |
| Dataset | Fashion MNIST or CIFAR-10 |
| Samples | ~70,000 (Fashion MNIST) |
| Features | 28×28 pixel values (flattened to 784 features) |
| Target | Category (T-shirt, Trouser, Pullover, etc.) |

Suggested Models

  • Logistic Regression on flattened pixels (baseline)
  • Random Forest on flattened pixels
  • SVM with dimensionality reduction (PCA)

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { pixels: [0.0, 0.0, ..., 0.5, ...] } (784 values) | { category, confidence, top_3_predictions } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, accuracy, classes } |

Evaluation Metrics

  • Accuracy
  • Per-class Precision and Recall
  • Confusion Matrix (10×10)
  • Macro F1-Score

Explainability Focus

LIME for images highlights which pixel regions contributed most to the classification. SHAP can show feature importance across pixel positions.

Advanced Challenge

This project requires handling high-dimensional input (784 features). Consider PCA for dimensionality reduction. The API payload will be larger — plan for validation of array length and value ranges.
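Both concerns can be sketched together: validate the incoming array's length and value range, then project it through a fitted PCA. The training data below is random noise standing in for Fashion MNIST, so the code only demonstrates shapes and validation, not accuracy:

```python
# Sketch: validate a 784-value payload, then shrink it with PCA before
# classification. Training data here is random noise, purely to show
# the shapes and the validation path -- swap in Fashion MNIST for real.
import numpy as np
from sklearn.decomposition import PCA

N_PIXELS = 784  # 28 x 28, flattened

def validate_pixels(pixels):
    arr = np.asarray(pixels, dtype=float)
    if arr.shape != (N_PIXELS,):
        raise ValueError(f"expected {N_PIXELS} values, got {arr.shape}")
    if arr.min() < 0.0 or arr.max() > 1.0:
        raise ValueError("pixel values must be in [0, 1]")
    return arr

rng = np.random.default_rng(0)
X_train = rng.random((200, N_PIXELS))    # stand-in for the real images
pca = PCA(n_components=50).fit(X_train)  # 784 -> 50 dimensions

x = validate_pixels(rng.random(N_PIXELS))
print(pca.transform(x.reshape(1, -1)).shape)  # (1, 50)
```

Fit PCA on the training set only and persist it with the classifier, so the API applies the identical projection at prediction time.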

Sample predict request
```json
{
  "pixels": [0.0, 0.0, 0.0, 0.0, 0.12, 0.45, 0.78, "... (784 values total)"]
}
```

Response:

```json
{
  "category": "Ankle boot",
  "confidence": 0.91,
  "top_3_predictions": [
    {"category": "Ankle boot", "confidence": 0.91},
    {"category": "Sneaker", "confidence": 0.06},
    {"category": "Sandal", "confidence": 0.02}
  ],
  "model_version": "1.0.0"
}
```

6. Medical Risk Assessment

Predict the risk of a medical condition based on patient features.

| Aspect | Details |
| --- | --- |
| Type | Binary Classification |
| Difficulty | ⭐⭐ Intermediate |
| Dataset | Heart Disease UCI or Diabetes |
| Samples | ~303 (Heart) / ~768 (Diabetes) |
| Features | Age, blood pressure, cholesterol, glucose, BMI, etc. |
| Target | Disease presence (Yes / No) |

Suggested Models

  • Logistic Regression (baseline — interpretable in healthcare)
  • Random Forest Classifier
  • Support Vector Machine

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { age, blood_pressure, cholesterol, glucose, bmi, ... } | { risk_level, probability, contributing_factors } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, recall, precision, features } |

Evaluation Metrics

  • Recall (primary — minimizing false negatives is critical in healthcare)
  • Precision
  • F1-Score
  • AUC-ROC
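Recall is TP / (TP + FN): of all patients who actually have the disease, how many did the model catch? A toy illustration with invented labels, where one sick patient is missed:

```python
# Recall penalizes false negatives -- the dangerous error in healthcare.
# Labels are invented purely to illustrate the computation.
from sklearn.metrics import recall_score, precision_score

y_true = [1, 1, 1, 0, 0]  # 1 = disease present
y_pred = [1, 0, 1, 0, 0]  # one sick patient missed (a false negative)

print(recall_score(y_true, y_pred))     # 2/3: one of three cases missed
print(precision_score(y_true, y_pred))  # 1.0: no false positives
```

Note how precision looks perfect here while recall reveals the missed diagnosis; this asymmetry is exactly why recall is the primary metric for this project.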

Explainability Focus

Explainability is essential in healthcare. SHAP force plots and LIME explanations show exactly which patient factors contribute to the risk assessment. This makes your analysis highly relevant for your presentation.

Ethical Considerations

In your report, discuss the ethical implications of deploying an AI system for medical risk assessment. Address topics like: model bias, the importance of recall over precision, and why the model should assist (not replace) medical professionals.

Sample predict request
```json
{
  "age": 55,
  "sex": 1,
  "blood_pressure": 140,
  "cholesterol": 250,
  "fasting_blood_sugar": 1,
  "max_heart_rate": 150,
  "exercise_angina": 0
}
```

Response:

```json
{
  "risk_level": "High",
  "probability": 0.78,
  "contributing_factors": ["high cholesterol", "elevated blood pressure", "age"],
  "model_version": "1.0.0"
}
```

7. Movie Recommender

Predict whether a user will like a movie based on their preferences.

| Aspect | Details |
| --- | --- |
| Type | Classification / Recommendation |
| Difficulty | ⭐⭐⭐ Advanced |
| Dataset | MovieLens 100K |
| Samples | ~100,000 ratings |
| Features | User demographics, movie genres, historical ratings |
| Target | Rating prediction or like/dislike binary |

Suggested Models

  • Logistic Regression on engineered features (baseline)
  • Random Forest on user-movie feature matrix
  • K-Nearest Neighbors (content-based)

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { user_age, genres: ["Action", "Sci-Fi"], avg_rating } | { will_like, confidence, similar_movies } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, accuracy, genres_supported } |

Evaluation Metrics

  • Accuracy
  • Precision@K
  • F1-Score
  • Mean Absolute Error (for rating prediction)

Explainability Focus

SHAP can reveal which genres, user demographics, or viewing patterns most influence recommendations. Feature interaction plots show how genre preference interacts with age or viewing history.

Advanced Project

This project requires significant feature engineering — you'll need to create a user-movie feature matrix from raw rating data. Budget extra time for data preprocessing.
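The core of that feature engineering is pivoting raw (user, movie, rating) rows into a user × movie matrix. A pandas sketch on five invented ratings:

```python
# Pivot raw (user, movie, rating) rows into a user x movie matrix --
# the starting point for the feature engineering. Ratings are made up.
import pandas as pd

ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [4.0, 3.5, 5.0, 2.0, 4.5],
})

matrix = ratings.pivot_table(index="user_id", columns="movie_id",
                             values="rating")
print(matrix)
# Unrated cells come out as NaN -- decide explicitly how to fill them
# (0, the user's mean rating, etc.) before feeding this to a model.
```

On MovieLens 100K this matrix is large and mostly empty, so how you handle the missing cells is itself a modeling decision worth a paragraph in your report.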

Sample predict request
```json
{
  "user_age": 28,
  "preferred_genres": ["Action", "Sci-Fi", "Thriller"],
  "average_rating": 3.8,
  "movie_genre": "Action",
  "movie_year": 2020
}
```

Response:

```json
{
  "will_like": true,
  "confidence": 0.85,
  "predicted_rating": 4.2,
  "model_version": "1.0.0"
}
```

8. Credit Scoring

Predict whether a loan applicant is likely to default.

| Aspect | Details |
| --- | --- |
| Type | Binary Classification |
| Difficulty | ⭐⭐ Intermediate |
| Dataset | German Credit Data or Give Me Some Credit |
| Samples | ~1,000 (German) / ~150,000 (Give Me Some Credit) |
| Features | Income, age, number of dependents, credit history, loan amount, etc. |
| Target | Default (Yes / No) |

Suggested Models

  • Logistic Regression (baseline — standard in finance)
  • Random Forest
  • Gradient Boosting (XGBoost)

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { income, age, debt_ratio, credit_lines, loan_amount } | { default_risk, probability, risk_category } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, auc_roc, features } |

Evaluation Metrics

  • AUC-ROC (primary metric in credit scoring)
  • Precision / Recall
  • F1-Score
  • KS Statistic (bonus — industry standard)
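The KS statistic is the maximum gap between the cumulative true-positive and false-positive rates across all score thresholds, i.e. max(TPR − FPR). It can be read straight off the ROC curve; a sketch on six invented scores:

```python
# KS statistic = max(TPR - FPR) over all thresholds, read off the ROC
# curve. The six labels/scores below are invented for illustration.
import numpy as np
from sklearn.metrics import roc_curve

y_true = [1, 1, 1, 0, 0, 0]              # 1 = defaulted
y_score = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # predicted default probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)
print(ks)  # 1.0 here, because the two classes are perfectly separated
```

Real credit models land well below 1.0; values around 0.3 to 0.5 are commonly considered acceptable in industry, so treat this toy result as a sanity check of the computation only.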

Explainability Focus

Explainability is legally required in many jurisdictions for credit decisions. SHAP force plots can explain individual loan decisions, making this project highly relevant for regulatory compliance discussions.

Regulatory Context

In the EU (GDPR) and Canada, individuals have the "right to explanation" for automated decisions. Credit scoring models must be explainable. Discuss this in your report.

Sample predict request
```json
{
  "annual_income": 52000,
  "age": 35,
  "debt_ratio": 0.35,
  "number_of_open_credit_lines": 5,
  "number_of_dependents": 2,
  "months_employed": 48
}
```

Response:

```json
{
  "default_risk": "Low",
  "default_probability": 0.12,
  "risk_category": "A",
  "model_version": "1.0.0"
}
```

9. Insurance Cost Predictor

Predict individual medical insurance costs based on personal attributes.

| Aspect | Details |
| --- | --- |
| Type | Regression |
| Difficulty | ⭐ Beginner |
| Dataset | Medical Cost Personal Dataset |
| Samples | ~1,338 |
| Features | Age, sex, BMI, children, smoker, region |
| Target | Insurance charges (continuous) |

Suggested Models

  • Linear Regression (baseline)
  • Random Forest Regressor
  • Gradient Boosting Regressor

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { age, sex, bmi, children, smoker, region } | { predicted_cost, model_version } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, r2_score, mae } |

Evaluation Metrics

  • MAE
  • RMSE
  • R² Score
  • MAPE

Explainability Focus

SHAP clearly shows that smoker status has a massive impact on predicted cost. This makes for an excellent, visually impactful demo. Waterfall plots for individual predictions are very intuitive.

Best Beginner Choice for Regression

This dataset is small, clean, requires minimal preprocessing, and produces very interpretable SHAP results. If you're new to ML and prefer regression, start here.
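The one preprocessing step this dataset does need is one-hot encoding of its categorical columns (sex, smoker, region). A sketch on four invented rows showing the encoding plus the linear baseline:

```python
# One-hot encode the categorical columns, then fit the linear baseline.
# The four rows below are invented; train on the real ~1,338-row dataset.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":     [32, 45, 28, 51],
    "bmi":     [27.5, 31.0, 22.4, 29.8],
    "smoker":  ["no", "yes", "no", "yes"],
    "region":  ["northwest", "southeast", "northwest", "southwest"],
    "charges": [5800.0, 39800.0, 3200.0, 41500.0],
})

# drop_first avoids redundant dummy columns (e.g. keeps smoker_yes only).
X = pd.get_dummies(df.drop(columns="charges"), drop_first=True)
model = LinearRegression().fit(X, df["charges"])
print(X.columns.tolist())  # numeric cols plus smoker_yes, region_* dummies
```

Apply the same encoding (same dummy columns, in the same order) to incoming API payloads, which is easiest if you save the training column list alongside the model.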

Sample predict request
```json
{
  "age": 32,
  "sex": "female",
  "bmi": 27.5,
  "children": 1,
  "smoker": "no",
  "region": "northwest"
}
```

Response:

```json
{
  "predicted_cost": 5765.30,
  "model_version": "1.0.0"
}
```

10. Employee Attrition Predictor

Predict whether an employee is likely to leave the company.

| Aspect | Details |
| --- | --- |
| Type | Binary Classification |
| Difficulty | ⭐⭐ Intermediate |
| Dataset | IBM HR Analytics Employee Attrition |
| Samples | ~1,470 |
| Features | Age, job satisfaction, monthly income, overtime, years at company, etc. |
| Target | Attrition (Yes / No) |

Suggested Models

  • Logistic Regression (baseline)
  • Random Forest Classifier
  • Gradient Boosting (XGBoost)

API Endpoints

| Endpoint | Method | Input | Output |
| --- | --- | --- | --- |
| /predict | POST | { age, job_satisfaction, monthly_income, overtime, years_at_company } | { attrition_risk, probability, key_factors } |
| /health | GET | (none) | { status, timestamp } |
| /model-info | GET | (none) | { model, f1_score, features } |

Evaluation Metrics

  • Accuracy
  • F1-Score (important due to class imbalance)
  • AUC-ROC
  • Recall

Explainability Focus

SHAP reveals which workplace factors drive attrition most (overtime, satisfaction, compensation). HR-relevant insights make the presentation compelling.

Sample predict request
```json
{
  "age": 30,
  "job_satisfaction": 2,
  "monthly_income": 3500,
  "overtime": "Yes",
  "years_at_company": 2,
  "distance_from_home": 25,
  "work_life_balance": 2
}
```

Response:

```json
{
  "attrition_risk": "High",
  "attrition_probability": 0.73,
  "key_factors": ["low job satisfaction", "overtime", "short tenure"],
  "model_version": "1.0.0"
}
```

Project Difficulty Summary

Instructor Recommendation

If this is your first end-to-end ML project, choose Spam Classifier, Insurance Cost Predictor, or House Price Predictor. These datasets are clean, small, and well-documented — letting you focus on the deployment pipeline rather than data wrangling. If you want a challenge, Customer Churn or Credit Scoring offer great business-relevant problems with interesting explainability results.


Can I Choose My Own Project?

Yes! You may propose your own project, provided it meets these minimum requirements:

| Requirement | Minimum |
| --- | --- |
| Dataset size | ≥ 500 samples |
| Number of features | ≥ 5 |
| Problem type | Classification or Regression |
| Data accessibility | Publicly available or approved by instructor |
| Explainability | LIME and/or SHAP must be applicable |

Submit your custom project proposal to your instructor by the end of Week 12 for approval.