Titanic Case Study

This case study demonstrates the Universal ML Framework using the famous Titanic dataset, showing how to predict passenger survival with minimal code.

Dataset Overview

The Titanic dataset contains information about passengers aboard the RMS Titanic, including:

Features:

  • PassengerId: Unique identifier for each passenger

  • Pclass: Passenger class (1st, 2nd, 3rd)

  • Name: Passenger name

  • Sex: Gender (male/female)

  • Age: Age in years

  • SibSp: Number of siblings/spouses aboard

  • Parch: Number of parents/children aboard

  • Ticket: Ticket number

  • Fare: Passenger fare

  • Cabin: Cabin number

  • Embarked: Port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)

Target:

  • Survived: Survival status (0=No, 1=Yes)

Implementation

Complete Titanic Prediction Script

from universal_ml_framework import UniversalMLPipeline

# Create pipeline with optimal settings for Titanic dataset
pipeline = UniversalMLPipeline(
    problem_type='classification',
    random_state=42,
    verbose=True,
    fast_mode=False,
    tuning_method='bayesian',
    n_jobs=-1
)

# Run complete pipeline
pipeline.run_pipeline(
    train_path='titanic_train.csv',
    test_path='titanic_test.csv',
    target_column='Survived',
    problem_type='classification',
    id_column='PassengerId'
)

What Happens Automatically

  1. Data Loading

    • Loads training data (891 passengers)

    • Loads test data (418 passengers)

    • Identifies target column (Survived)

    • Uses PassengerId as identifier

  2. Feature Detection

    The framework automatically categorizes features:

    • Numeric: Age, SibSp, Parch, Fare

    • Categorical: Pclass, Name, Sex, Ticket, Cabin, Embarked

    • Binary: None (Survived is the target)

  3. Preprocessing

    • Age: Median imputation → Standard scaling

    • Fare: Median imputation → Standard scaling

    • Sex, Embarked: Constant imputation → One-hot encoding

    • Name, Ticket, Cabin: Handled as categorical features

  4. Model Training

    Tests 7 classification algorithms:

    • Random Forest Classifier

    • Gradient Boosting Classifier

    • Logistic Regression

    • Support Vector Machine

    • Naive Bayes

    • K-Nearest Neighbors

    • Decision Tree

  5. Cross Validation

    • Uses StratifiedKFold (5 folds)

    • Preserves class distribution (survived vs not survived)

    • Parallel processing across all CPU cores

  6. Hyperparameter Tuning

    • Uses Bayesian optimization for intelligent parameter search

    • Optimizes the best performing model

    • Uses a predefined hyperparameter search space for each algorithm

  7. Prediction Generation

    • Generates survival predictions for test set

    • Uses PassengerId as identifier

    • Exports to predictions.csv
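For readers who want to see roughly what steps 2 to 5 involve, the following is a minimal scikit-learn sketch, not the framework's actual implementation. The dtype-based detection rule, the step names, the "missing" fill value, the choice of LogisticRegression as the single candidate model, and the ten-row sample are all assumptions made for illustration; the sample is made up and is not real passenger data.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample in the Titanic column layout (not real passenger data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "Age": [22, 38, np.nan, 35, 54, 2, 27, 14, 4, 58],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 30.07, 16.70, 26.55],
    "Sex": ["male", "female", "female", "male", "male",
            "female", "male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S", np.nan, "S", "C", "S", "S"],
})
X, y = df.drop(columns="Survived"), df["Survived"]

# Step 2: dtype-based feature detection (hypothetical heuristic).
numeric = [col for col in X.columns if is_numeric_dtype(X[col])]
categorical = [col for col in X.columns if col not in numeric]

# Step 3: median imputation + scaling for numeric columns,
# constant imputation + one-hot encoding for categorical columns.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric),
    ("cat", categorical_steps, categorical),
])

# Steps 4-5: score one candidate model with stratified 5-fold cross-validation.
model = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.4f}")
```

Keeping the preprocessing inside the pipeline means the imputer, scaler, and encoder are refit on each training fold, so no statistics leak from the held-out fold into training.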

Expected Results

Typical Performance Metrics

Cross-Validation Results:

📊 Cross validating models...

[1/7] 🔄 Training RandomForest...
  Fold 1/5: 0.8324
  Fold 2/5: 0.8202
  Fold 3/5: 0.8315
  Fold 4/5: 0.8427
  Fold 5/5: 0.8258
  ✅ RandomForest completed - Mean: 0.8305 (±0.0081)

[2/7] 🔄 Training GradientBoosting...
  Fold 1/5: 0.8268
  Fold 2/5: 0.8146
  Fold 3/5: 0.8315
  Fold 4/5: 0.8371
  Fold 5/5: 0.8202
  ✅ GradientBoosting completed - Mean: 0.8260 (±0.0078)

🏆 Best model: RandomForest
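The mean scores in this log excerpt can be reproduced from the per-fold values with a few lines of standard-library Python. This sketch covers only the two models shown above and assumes the best model is simply the one with the highest mean fold score; the log does not state how the ± spread is calculated, so it is left out here.

```python
from statistics import mean

# Per-fold scores copied from the log excerpt above.
fold_scores = {
    "RandomForest": [0.8324, 0.8202, 0.8315, 0.8427, 0.8258],
    "GradientBoosting": [0.8268, 0.8146, 0.8315, 0.8371, 0.8202],
}

# Average the folds for each model, then pick the highest mean.
mean_scores = {name: mean(scores) for name, scores in fold_scores.items()}
best_model = max(mean_scores, key=mean_scores.get)

for name, score in mean_scores.items():
    print(f"{name}: {score:.4f}")
print("Best model:", best_model)
```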

Final Results:

🎉 PIPELINE COMPLETED!
============================================================
✅ Problem Type: classification
✅ Best Model: RandomForest
✅ Best Score: 0.8456
============================================================

Feature Importance Analysis

The framework automatically identifies the most important features for survival prediction:

  1. Sex - Gender is the strongest predictor

  2. Fare - Ticket price indicates passenger class/wealth

  3. Age - Age affects survival probability

  4. Pclass - Passenger class (1st, 2nd, 3rd)

  5. SibSp/Parch - Family size relationships
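One common way to obtain a ranking like the one above is the impurity-based importance that scikit-learn's random forest exposes through feature_importances_. Whether the framework uses this measure is an assumption; the sketch below only illustrates the idea on synthetic data in which the label depends on a single feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_samples = 300

# Synthetic data (not the Titanic data): the label depends only on `sex`.
sex = rng.integers(0, 2, size=n_samples)
noise = rng.normal(size=n_samples)
X = np.column_stack([sex, noise])
y = sex

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each feature name with its importance and sort from high to low.
ranking = sorted(
    zip(["sex", "noise"], forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances sum to 1 and tend to favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure of effect.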

Output Files

Generated Files

After running the pipeline, you’ll find:

predictions.csv

PassengerId,Prediction
892,0
893,1
894,0
895,0
896,1
...
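A quick sanity check on the predictions file is to count how many passengers are predicted to survive. The sketch below parses the five sample rows shown above from an in-memory string; with a real run you would open predictions.csv instead.

```python
import csv
import io

# The five sample rows shown above; use open("predictions.csv") for a real run.
sample = """PassengerId,Prediction
892,0
893,1
894,0
895,0
896,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
predicted_survivors = [int(row["PassengerId"]) for row in rows if row["Prediction"] == "1"]
survival_rate = len(predicted_survivors) / len(rows)

print("Predicted survivors:", predicted_survivors)
print(f"Predicted survival rate: {survival_rate:.0%}")
```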

model_info.json

{
  "problem_type": "classification",
  "best_model": "RandomForest",
  "best_params": {
    "model__n_estimators": 200,
    "model__max_depth": 10,
    "model__min_samples_split": 2
  },
  "cv_score": 0.8456,
  "feature_types": {
    "numeric": ["Age", "SibSp", "Parch", "Fare"],
    "categorical": ["Pclass", "Name", "Sex", "Ticket", "Cabin", "Embarked"],
    "binary": []
  }
}
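The metadata file can be read back with the standard json module, for example to reuse the tuned hyperparameters elsewhere. The sketch below parses an abbreviated copy of the JSON above from a string and assumes that the "model__" prefix is the scikit-learn pipeline step name, which can be stripped to get plain estimator arguments.

```python
import json

# Abbreviated copy of model_info.json; use json.load(open(...)) for a real file.
raw = """
{
  "problem_type": "classification",
  "best_model": "RandomForest",
  "best_params": {
    "model__n_estimators": 200,
    "model__max_depth": 10,
    "model__min_samples_split": 2
  },
  "cv_score": 0.8456
}
"""
model_info = json.loads(raw)

# Drop the "<step>__" prefix so the keys match the estimator's own arguments.
estimator_params = {
    key.split("__", 1)[1]: value
    for key, value in model_info["best_params"].items()
}
print(model_info["best_model"], estimator_params)
```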

best_model.pkl

Serialized trained model ready for production use.
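Assuming the file is written with Python's standard pickle module (the framework may use a different serializer such as joblib), loading it follows the usual round trip. The sketch below uses a small stand-in model and a temporary directory so that it runs without the real best_model.pkl.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression

# Stand-in model; the real best_model.pkl is written by the framework.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "best_model.pkl"
    path.write_bytes(pickle.dumps(model))        # save
    restored = pickle.loads(path.read_bytes())   # load

# The restored model should predict exactly like the original.
new_rows = [[0.0], [3.0]]
original_predictions = model.predict(new_rows).tolist()
restored_predictions = restored.predict(new_rows).tolist()
print(restored_predictions)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code, and load the model with the same library versions that were used to train it.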

Advanced Usage

Custom Feature Engineering

For better results, you can add custom feature engineering:

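import pandas as pd

def titanic_feature_engineering(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Extract title from name (e.g. "Mr", "Mrs", "Miss"); raw string avoids an invalid escape
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)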

    # Create family size feature
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

    # Create age groups
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100],
                           labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])

    # Create fare groups
    df['FareGroup'] = pd.qcut(df['Fare'], q=4, labels=['Low', 'Medium', 'High', 'VeryHigh'])

    return df

# Run with custom feature engineering
pipeline.run_pipeline(
    train_path='titanic_train.csv',
    test_path='titanic_test.csv',
    target_column='Survived',
    id_column='PassengerId',
    feature_engineering_func=titanic_feature_engineering
)
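One caveat with quantile-based features such as FareGroup: if the function is applied to the training and test files separately, pd.qcut derives different bin edges for each file. Whether the framework calls the function once or per file is not stated here, so the following is only a sketch of one way to keep the bins consistent, by learning the edges on the training data and reusing them. The fare values are made up.

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration.
train = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 151.55, 263.0]})
test = pd.DataFrame({"Fare": [7.9, 30.0, 512.33]})

labels = ["Low", "Medium", "High", "VeryHigh"]

# Learn the quartile edges from the training data only.
_, edges = pd.qcut(train["Fare"], q=4, labels=labels, retbins=True)

# Open up the outer edges so unseen fares still fall into a bin.
bins = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the same edges to both data sets.
train["FareGroup"] = pd.cut(train["Fare"], bins=bins, labels=labels)
test["FareGroup"] = pd.cut(test["Fare"], bins=bins, labels=labels)
print(test)
```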

Exclude Irrelevant Features

# Exclude features that don't help prediction
pipeline.run_pipeline(
    train_path='titanic_train.csv',
    test_path='titanic_test.csv',
    target_column='Survived',
    id_column='PassengerId',
    exclude_columns=['Name', 'Ticket', 'Cabin']  # High cardinality features
)
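To decide which columns to exclude, it can help to measure how many distinct values each text column has relative to its non-missing rows. The sketch below uses a made-up sample and a hypothetical threshold of 0.5; neither comes from the framework.

```python
import numpy as np
import pandas as pd

# Made-up sample rows in the Titanic column layout.
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Ticket": ["T-001", "T-002", "T-003", "T-004", "T-005", "T-006"],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, "E46"],
    "Embarked": ["S", "C", "S", "S", "Q", "S"],
})

# Share of distinct values among the non-missing entries of each column.
unique_ratio = df.nunique() / df.count()

# Hypothetical cut-off: flag columns where most values are distinct.
threshold = 0.5
high_cardinality = unique_ratio[unique_ratio > threshold].index.tolist()
print(high_cardinality)
```

The resulting list can then be passed as exclude_columns, as in the call above.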

Performance Comparison

Framework vs Manual Implementation

Universal ML Framework:

# 4 lines of code
from universal_ml_framework import UniversalMLPipeline

pipeline = UniversalMLPipeline(problem_type='classification', tuning_method='bayesian')
pipeline.run_pipeline(train_path='titanic_train.csv', test_path='titanic_test.csv', target_column='Survived', id_column='PassengerId')

Manual Implementation:

# 100+ lines of code
import pandas as pd
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# ... many more imports and 100+ lines of preprocessing, training, tuning code

Results Comparison:

Metric              Framework    Manual
------------------  -----------  -----------
Lines of Code       4            100+
Development Time    2 minutes    2-4 hours
Accuracy            84.56%       82-85%
Models Tested       7            1-2
Tuning              Bayesian     Manual/Grid

Key Insights

Why This Works Well

  1. Automatic Feature Detection: Correctly identifies numeric vs categorical features

  2. Proper Preprocessing: Handles missing values and scaling appropriately

  3. Model Comparison: Tests multiple algorithms to find the best performer

  4. Smart Tuning: Bayesian optimization typically finds strong hyperparameters in fewer evaluations than an exhaustive grid search

  5. Production Ready: Generates all necessary files for deployment

Lessons Learned

  1. Gender is Key: Sex is the most important feature for Titanic survival

  2. Class Matters: Passenger class strongly correlates with survival

  3. Age Factor: Survival rates vary by age group; children in particular survived at higher rates than adults

  4. Family Size: Passengers travelling alone and those in very large families had lower survival rates than those in small families

  5. Fare Proxy: Ticket fare serves as a proxy for socioeconomic status

This case study demonstrates how the Universal ML Framework can achieve competitive results with minimal effort, making machine learning accessible to users of all skill levels.