Titanic Case Study

This case study demonstrates the Universal ML Framework using the famous Titanic dataset, showing how to predict passenger survival with minimal code.

Dataset Overview

The Titanic dataset contains information about passengers aboard the RMS Titanic, including:

Features:
- PassengerId: Unique identifier for each passenger
- Pclass: Passenger class (1st, 2nd, 3rd)
- Name: Passenger name
- Sex: Gender (male/female)
- Age: Age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)

Target:
- Survived: Survival status (0=No, 1=Yes)

Implementation

Complete Titanic Prediction Script

from universal_ml_framework import UniversalMLPipeline

# Configure the pipeline: full (non-fast) mode with Bayesian tuning on all CPU cores
pipeline = UniversalMLPipeline(
    problem_type='classification',
    random_state=42,
    verbose=True,
    fast_mode=False,
    tuning_method='bayesian',
    n_jobs=-1
)

# Run complete pipeline
pipeline.run_pipeline(
    train_path='titanic_train.csv',
    test_path='titanic_test.csv',
    target_column='Survived',
    problem_type='classification',
    id_column='PassengerId'
)

What Happens Automatically

Data Loading
- Loads training data (891 passengers)
- Loads test data (418 passengers)
- Identifies target column (Survived)
- Uses PassengerId as identifier

Feature Detection

The framework automatically categorizes features:
- Numeric: Age, SibSp, Parch, Fare
- Categorical: Pclass, Name, Sex, Ticket, Cabin, Embarked
- Binary: None (Survived is the target)
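The framework's actual detection rules are not shown here. As a rough sketch of how such a split could work, the hypothetical helper below treats a column as numeric only if it has a numeric dtype and more distinct values than an assumed cutoff (`max_categories`), so a low-cardinality integer column like Pclass lands in the categorical group:

```python
import numpy as np
import pandas as pd

def detect_feature_types(df, target, id_column=None, max_categories=5):
    """Split columns into numeric and categorical lists (illustrative heuristic only)."""
    numeric, categorical = [], []
    for col in df.columns:
        if col in (target, id_column):
            continue  # the target and the identifier are not features
        is_number = pd.api.types.is_numeric_dtype(df[col])
        # Numeric columns with only a few distinct values behave like categories
        if is_number and df[col].nunique() > max_categories:
            numeric.append(col)
        else:
            categorical.append(col)
    return {"numeric": numeric, "categorical": categorical}

# Small hand-made sample in the shape of the Titanic data
sample = pd.DataFrame({
    "PassengerId": range(1, 9),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 11.13],
})
feature_types = detect_feature_types(sample, target="Survived", id_column="PassengerId")
print(feature_types)
```

The cutoff of 5 is an assumption chosen for this sample; a different threshold would move columns such as SibSp or Parch between groups.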

Preprocessing
- Age: Median imputation → Standard scaling
- Fare: Median imputation → Standard scaling
- Sex, Embarked: Constant imputation → One-hot encoding
- Name, Ticket, Cabin: Handled as categorical features
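For readers who want to see what these preprocessing steps look like in plain scikit-learn, here is a minimal sketch. It is not the framework's internal code. The step names, the `"missing"` fill value, and the sample rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: median imputation, then standard scaling
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: constant imputation, then one-hot encoding
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["Age", "Fare"]),
    ("cat", categorical_steps, ["Sex", "Embarked"]),
])

# Tiny sample with missing values in numeric and categorical columns
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
transformed = preprocessor.fit_transform(sample)
print(transformed.shape)  # 2 scaled numeric columns + 2 Sex columns + 3 Embarked columns
```

Setting `handle_unknown="ignore"` keeps the encoder from failing when the test set contains a category that was absent from the training data.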

Model Training

Tests 7 classification algorithms:
- Random Forest Classifier
- Gradient Boosting Classifier
- Logistic Regression
- Support Vector Machine
- Naive Bayes
- K-Nearest Neighbors
- Decision Tree

Cross Validation
- Uses StratifiedKFold (5 folds)
- Preserves class distribution (survived vs not survived)
- Parallel processing across all CPU cores
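The comparison loop described above can be approximated with scikit-learn directly. The sketch below uses synthetic data as a stand-in for the preprocessed Titanic features and only three of the seven algorithms. The model settings are assumptions, not the framework's defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary problem standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, weights=[0.62, 0.38],
                           random_state=42)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    # cross_val_score also accepts n_jobs=-1 to spread folds across CPU cores
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.4f} (±{scores.std():.4f})")

best_model = max(results, key=results.get)
print("Best model:", best_model)
```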

Hyperparameter Tuning
- Uses Bayesian optimization, which picks each new trial based on the scores of earlier ones
- Tunes only the best-performing model from cross validation
- Draws on a predefined parameter search space for each algorithm

Prediction Generation
- Generates survival predictions for test set
- Uses PassengerId as identifier
- Exports to predictions.csv
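As an illustration of the export step, the sketch below builds a two-column table in the same `PassengerId,Prediction` layout and serializes it to CSV. The IDs and predictions are hypothetical inputs, and an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical inputs: IDs from the test set and model outputs for those rows
passenger_ids = [892, 893, 894]
predicted = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Prediction": predicted})

buffer = io.StringIO()  # stand-in for a file opened at predictions.csv
submission.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)
```

If the file is meant for the Kaggle Titanic competition, note that its submission format names the second column `Survived`, so a rename such as `submission.rename(columns={"Prediction": "Survived"})` would be needed first.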

Expected Results

Typical Performance Metrics

Cross-Validation Results:

📊 Cross validating models...

[1/7] 🔄 Training RandomForest...
  Fold 1/5: 0.8324
  Fold 2/5: 0.8202
  Fold 3/5: 0.8315
  Fold 4/5: 0.8427
  Fold 5/5: 0.8258
  ✅ RandomForest completed - Mean: 0.8305 (±0.0081)

[2/7] 🔄 Training GradientBoosting...
  Fold 1/5: 0.8268
  Fold 2/5: 0.8146
  Fold 3/5: 0.8315
  Fold 4/5: 0.8371
  Fold 5/5: 0.8202
  ✅ GradientBoosting completed - Mean: 0.8260 (±0.0078)

🏆 Best model: RandomForest

Final Results:

🎉 PIPELINE COMPLETED!
============================================================
✅ Problem Type: classification
✅ Best Model: RandomForest
✅ Best Score: 0.8456
============================================================

Feature Importance Analysis

The framework automatically identifies the most important features for survival prediction:

- Sex: Gender is the strongest predictor
- Fare: Ticket price indicates passenger class/wealth
- Age: Age affects survival probability
- Pclass: Passenger class (1st, 2nd, 3rd)
- SibSp/Parch: Family size relationships
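How the framework computes this ranking is not documented here. One common approach, shown below as an assumption rather than the framework's method, is to read the impurity-based `feature_importances_` of a fitted tree ensemble. The data is a tiny hand-made sample with already-encoded features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hand-made sample; Sex is encoded as 0=male, 1=female
X = pd.DataFrame({
    "Sex":    [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 2, 2],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 11.13, 30.07, 13.00],
})
y = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; sort to get a ranking
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so a ranking like this is best read as a rough guide.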

Output Files

Generated Files

After running the pipeline, you’ll find:

predictions.csv

PassengerId,Prediction
892,0
893,1
894,0
895,0
896,1
...

model_info.json

{
  "problem_type": "classification",
  "best_model": "RandomForest",
  "best_params": {
    "model__n_estimators": 200,
    "model__max_depth": 10,
    "model__min_samples_split": 2
  },
  "cv_score": 0.8456,
  "feature_types": {
    "numeric": ["Age", "SibSp", "Parch", "Fare"],
    "categorical": ["Pclass", "Name", "Sex", "Ticket", "Cabin", "Embarked"],
    "binary": []
  }
}
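The `model__` prefix in `best_params` follows scikit-learn's convention for addressing a parameter of a named pipeline step. The sketch below assumes the estimator sits in a step called `model` (inferred from the prefix, not confirmed by the framework) and shows how such a dictionary could be applied:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Assumed layout: the estimator is a pipeline step named "model"
pipe = Pipeline([("model", RandomForestClassifier(random_state=42))])

best_params = {
    "model__n_estimators": 200,
    "model__max_depth": 10,
    "model__min_samples_split": 2,
}
# "<step>__<param>" routes each value to the matching step's parameter
pipe.set_params(**best_params)

forest = pipe.named_steps["model"]
print(forest.n_estimators, forest.max_depth, forest.min_samples_split)  # 200 10 2
```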

best_model.pkl

Serialized trained model ready for production use.
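The serialization library behind `best_model.pkl` is not stated. Assuming a standard pickle, the round trip looks roughly like the sketch below, which uses a small stand-in model and in-memory bytes instead of the real file. If the framework saves with joblib, `joblib.load` would be the matching call.

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Stand-in model; the real file would hold the tuned pipeline
model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])

blob = pickle.dumps(model)      # the bytes a .pkl file would contain
restored = pickle.loads(blob)   # equivalent to pickle.load() on an open file

# The restored model predicts exactly like the original
print(restored.predict([[0], [3]]))
```

Pickle files can execute code when loaded, so only load model files from a trusted source.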

Advanced Usage

Custom Feature Engineering

For better results, you can add custom feature engineering:

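import pandas as pd

def titanic_feature_engineering(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Extract title (Mr, Mrs, Miss, ...) from name; the raw string avoids an invalid-escape warning
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

    # Create family size feature (the passenger plus relatives aboard)
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

    # Create age groups; missing ages stay NaN
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100],
                           labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])

    # Create fare groups from the quartiles of the DataFrame passed in
    df['FareGroup'] = pd.qcut(df['Fare'], q=4, labels=['Low', 'Medium', 'High', 'VeryHigh'])

    return df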

# Run with custom feature engineering
pipeline.run_pipeline(
    train_path='titanic_train.csv',
    test_path='titanic_test.csv',
    target_column='Survived',
    id_column='PassengerId',
    feature_engineering_func=titanic_feature_engineering
)
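To check what the title and family-size steps produce, the short sketch below applies the same regular expression to a few names in the dataset's "Surname, Title. Given names" format. The SibSp and Parch values are made up for the example.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])
# Capture the letters between a space and the first period that follows them
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']

sibsp, parch = 1, 0
family_size = sibsp + parch + 1  # the passenger plus relatives aboard
print(family_size)  # 2
```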

Exclude Irrelevant Features

# Exclude features that don't help prediction
pipeline.run_pipeline(
    train_path='titanic_train.csv',
    test_path='titanic_test.csv',
    target_column='Survived',
    id_column='PassengerId',
    exclude_columns=['Name', 'Ticket', 'Cabin']  # High cardinality features
)
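One way to decide which columns to exclude is to measure how many distinct values each holds relative to the number of rows. The sketch below is a hypothetical check, not part of the framework. The 0.9 cutoff and the sample rows are assumptions.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Allen, Mr. William Henry", "Moran, Mr. James"],
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "S", "S", "Q"],
})

# Share of distinct values per column; values near 1.0 mean almost every row is unique
cardinality = sample.nunique() / len(sample)
high_cardinality = cardinality[cardinality > 0.9].index.tolist()
print(high_cardinality)  # ['Name']
```

Columns flagged this way are candidates for `exclude_columns`, or for feature engineering such as the title extraction shown earlier.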

Performance Comparison

Framework vs Manual Implementation

Universal ML Framework:

# About 4 lines of code
from universal_ml_framework import UniversalMLPipeline

pipeline = UniversalMLPipeline(problem_type='classification', tuning_method='bayesian')
pipeline.run_pipeline(train_path='titanic_train.csv', test_path='titanic_test.csv', target_column='Survived', id_column='PassengerId')

Manual Implementation:

# 100+ lines of code
import pandas as pd
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# ... many more imports and 100+ lines of preprocessing, training, tuning code

Results Comparison:

Metric              Framework    Manual
Lines of Code       4            100+
Development Time    2 minutes    2-4 hours
Accuracy            84.56%       82-85%
Models Tested       7            1-2
Tuning              Bayesian     Manual/Grid

Key Insights

Why This Works Well

- Automatic Feature Detection: Separates numeric from categorical features without manual configuration
- Proper Preprocessing: Imputes missing values, scales numeric features, and one-hot encodes categorical ones
- Model Comparison: Tests multiple algorithms to find the best performer
- Smart Tuning: Bayesian optimization typically needs fewer trials than an exhaustive grid search to find good hyperparameters
- Production Ready: Saves predictions.csv, model_info.json, and best_model.pkl for downstream use

Lessons Learned

- Gender is Key: Sex is the most important feature for Titanic survival
- Class Matters: Passenger class strongly correlates with survival
- Age Factor: Survival patterns differ by age group, particularly for children and elderly passengers
- Family Size: Passengers traveling alone and those in very large families had lower survival rates
- Fare Proxy: Ticket fare serves as a proxy for socioeconomic status

This case study demonstrates how the Universal ML Framework can achieve competitive results with minimal effort, making machine learning accessible to users of all skill levels.