Architecture & Design

Core Components

UniversalMLPipeline Class

UniversalMLPipeline is the main class. It orchestrates the entire machine learning workflow, from data loading to model persistence.

# Interface outline only; method bodies are omitted.
class UniversalMLPipeline:
    def __init__(self, problem_type, random_state, verbose, fast_mode, tuning_method, n_jobs): ...
    def load_data(self, train_path, test_path, target_column): ...
    def auto_detect_features(self, df, exclude_columns): ...
    def create_preprocessor(self): ...
    def prepare_data(self, custom_features): ...
    def define_models(self): ...
    def cross_validate_models(self): ...
    def hyperparameter_tuning(self): ...
    def make_predictions(self, save_predictions, id_column): ...
    def save_model(self, filename): ...
    def run_pipeline(self, **kwargs): ...

Key Attributes

problem_type: 'classification' or 'regression'

fast_mode: Speed optimization for large datasets

tuning_method: 'grid', 'random', or 'bayesian'

n_jobs: Number of cores for parallel processing

feature_types: Dictionary containing detected feature types

cv_results: Cross-validation results for all models

best_pipeline: Best performing pipeline after tuning

Pipeline Workflow

The framework follows an 8-stage automated workflow:

  1. Data Loading

    • Load CSV files using pandas

    • Create backup of original data

    • Display dataset information

    • Handle the case where no test data is provided
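A minimal sketch of the loading stage, assuming the test set is optional and that "backup" means a deep copy of the training frame. The function below is a standalone illustration, not the class method itself, and the in-memory CSV stands in for a real file path.

```python
import io

import pandas as pd


def load_data(train_source, test_source=None):
    """Load train (and optionally test) CSV data and keep an untouched backup."""
    train_df = pd.read_csv(train_source)
    train_backup = train_df.copy(deep=True)  # backup of the original data

    # Assumption: the test set is optional, so return None when it is absent.
    test_df = pd.read_csv(test_source) if test_source is not None else None

    # Display basic dataset information.
    print(f"Train shape: {train_df.shape}")
    print(f"Missing values per column:\n{train_df.isna().sum()}")
    if test_df is None:
        print("No test data provided.")
    else:
        print(f"Test shape: {test_df.shape}")
    return train_df, train_backup, test_df


# Usage with an in-memory CSV in place of a file on disk.
csv_text = "id,age,city,target\n1,34,Paris,1\n2,,Rome,0\n3,51,,1\n"
train_df, train_backup, test_df = load_data(io.StringIO(csv_text))
```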

  2. Feature Detection

    • Automatically categorize features into:

      • Numeric: Continuous numerical features

      • Categorical: Text or discrete categorical features

      • Binary: Boolean or 0/1 features

    • Skip columns with >80% missing values
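The detection rules above could be sketched roughly as follows. The exact heuristics of the framework are not shown in this document, so the binary test (boolean dtype, or only 0/1 values) and the extra "skipped" key are assumptions made for illustration.

```python
import pandas as pd


def auto_detect_features(df, exclude_columns=(), missing_threshold=0.8):
    """Split columns into numeric, categorical, and binary groups."""
    feature_types = {"numeric": [], "categorical": [], "binary": [], "skipped": []}
    for col in df.columns:
        if col in exclude_columns:
            continue
        series = df[col]
        # Skip columns where more than 80% of the values are missing.
        if series.isna().mean() > missing_threshold:
            feature_types["skipped"].append(col)
            continue
        non_null = series.dropna()
        # Assumed rule: boolean dtype or only 0/1 values counts as binary.
        if pd.api.types.is_bool_dtype(series) or set(non_null.unique()) <= {0, 1}:
            feature_types["binary"].append(col)
        elif pd.api.types.is_numeric_dtype(series):
            feature_types["numeric"].append(col)
        else:
            feature_types["categorical"].append(col)
    return feature_types


df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [34.0, None, 51.0, 28.0, 45.0, 39.0],
    "city": ["Paris", "Rome", None, "Oslo", "Rome", "Paris"],
    "is_member": [1, 0, 1, 1, 0, 1],
    "notes": [None, None, None, None, None, "call back"],  # 5 of 6 missing
    "target": [1, 0, 1, 0, 1, 0],
})
feature_types = auto_detect_features(df, exclude_columns=("id", "target"))
print(feature_types)
```

Note that a column with exactly 80% missing values is kept, because the rule is "more than 80%".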

  3. Preprocessing

    • Numeric: Median imputation → Standard scaling

    • Categorical: Constant imputation → One-hot encoding

    • Binary: Zero imputation

    • Uses ColumnTransformer for efficient processing
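The three preprocessing branches map naturally onto scikit-learn's ColumnTransformer. The sketch below assumes a feature_types dictionary with "numeric", "categorical", and "binary" keys; the fill value "missing" and handle_unknown="ignore" are illustrative choices, not confirmed settings of the framework.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def create_preprocessor(feature_types):
    """Build one transformer per feature group and combine them."""
    numeric = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical = Pipeline([
        # Assumed fill value; the source only says "constant imputation".
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    binary = SimpleImputer(strategy="constant", fill_value=0)  # zero imputation
    return ColumnTransformer([
        ("num", numeric, feature_types["numeric"]),
        ("cat", categorical, feature_types["categorical"]),
        ("bin", binary, feature_types["binary"]),
    ])


df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0],
    "city": pd.Series(["Paris", "Rome", np.nan, "Rome"], dtype=object),
    "is_member": [1.0, 0.0, np.nan, 1.0],
})
feature_types = {"numeric": ["age"], "categorical": ["city"], "binary": ["is_member"]}
preprocessor = create_preprocessor(feature_types)
X = preprocessor.fit_transform(df)
print(X.shape)
```

Because the imputers and scaler are fitted inside the transformer, wrapping it in a Pipeline with the model keeps cross-validation free of leakage from held-out folds.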

  4. Model Definition

    Fast Mode (for large datasets):

    • Classification: RandomForest, LogisticRegression, NaiveBayes

    • Regression: RandomForest, LinearRegression

    Normal Mode (full model set):

    • Classification: 7 algorithms

    • Regression: 6 algorithms
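As an illustration of the fast-mode model set, the sketch below builds a name-to-estimator dictionary. GaussianNB is an assumed stand-in for "NaiveBayes", and the hyperparameters are placeholders. The full-mode algorithm lists are not named in this document, so they are deliberately left out.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.naive_bayes import GaussianNB


def define_fast_models(problem_type, random_state=42, n_jobs=-1):
    """Return the reduced fast-mode model set for the given problem type."""
    if problem_type == "classification":
        return {
            "RandomForest": RandomForestClassifier(
                n_estimators=100, random_state=random_state, n_jobs=n_jobs
            ),
            "LogisticRegression": LogisticRegression(max_iter=1000),
            # Assumption: the Gaussian variant; the source only says NaiveBayes.
            "NaiveBayes": GaussianNB(),
        }
    if problem_type == "regression":
        return {
            "RandomForest": RandomForestRegressor(
                n_estimators=100, random_state=random_state, n_jobs=n_jobs
            ),
            "LinearRegression": LinearRegression(),
        }
    raise ValueError(f"Unknown problem_type: {problem_type!r}")


models = define_fast_models("classification")
print(sorted(models))
```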

  5. Cross Validation

    • Classification: StratifiedKFold (preserves class distribution)

    • Regression: KFold

    • Folds: 3 (fast mode) or 5 (normal mode)

    • Parallel: Multi-core cross-validation
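The splitter choice and fold count described above can be sketched as below. Shuffling with a fixed random_state is an assumption added for reproducibility, and the synthetic dataset only stands in for real training data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score


def make_cv(problem_type, fast_mode=False, random_state=42):
    """Pick the splitter and fold count for the given problem type and mode."""
    n_splits = 3 if fast_mode else 5
    if problem_type == "classification":
        # StratifiedKFold keeps the class proportions similar in every fold.
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return KFold(n_splits=n_splits, shuffle=True, random_state=random_state)


X, y = make_classification(n_samples=200, n_features=8, random_state=0)
cv = make_cv("classification", fast_mode=True)
# n_jobs=-1 evaluates the folds in parallel on all available cores.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy", n_jobs=-1
)
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```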

  6. Hyperparameter Tuning

    • Grid Search: Exhaustive parameter search

    • Random Search: Random parameter sampling (default)

    • Bayesian Search: Model-guided search that uses earlier results to choose the next parameters to try

    • Comprehensive parameter grids for each algorithm
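A rough sketch of how tuning_method might select a search strategy with scikit-learn. The parameter grid here is a small placeholder, not the framework's actual grid. Bayesian search usually relies on a third-party package (for example scikit-optimize's BayesSearchCV); this document does not say which one is used, so that branch is left unimplemented.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


def build_search(estimator, tuning_method="random", cv=3, n_iter=5,
                 random_state=42, n_jobs=1):
    """Return a search object for the requested tuning method."""
    if tuning_method == "grid":
        # Exhaustive search over every combination in the grid.
        grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
        return GridSearchCV(estimator, grid, cv=cv, n_jobs=n_jobs)
    if tuning_method == "random":
        # Samples n_iter parameter combinations from the distributions.
        dists = {"n_estimators": randint(20, 100), "max_depth": [3, 5, None]}
        return RandomizedSearchCV(estimator, dists, n_iter=n_iter, cv=cv,
                                  random_state=random_state, n_jobs=n_jobs)
    if tuning_method == "bayesian":
        raise NotImplementedError("Bayesian search needs an optional third-party package")
    raise ValueError(f"Unknown tuning_method: {tuning_method!r}")


X, y = make_classification(n_samples=150, n_features=6, random_state=0)
search = build_search(RandomForestClassifier(random_state=0), "random")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Random search is a sensible default because its cost is fixed by n_iter rather than growing with the size of the grid.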

  7. Prediction Generation

    • Generate predictions on test set

    • Support for custom ID columns

    • Automatic CSV export

    • Prediction statistics
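The prediction stage might look roughly like this. The column names ("row_id", "prediction") and the fallback to a running index when no ID column is given are assumptions for illustration, and an in-memory buffer replaces the CSV file.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def make_predictions(model, test_df, feature_columns, id_column=None, output=None):
    """Predict on the test set and optionally export the result as CSV."""
    preds = model.predict(test_df[feature_columns])
    if id_column is not None:
        id_name, ids = id_column, test_df[id_column].to_numpy()
    else:
        # Assumed fallback: a simple running index when no ID column is given.
        id_name, ids = "id", np.arange(len(test_df))
    result = pd.DataFrame({id_name: ids, "prediction": preds})
    if output is not None:
        result.to_csv(output, index=False)
    # Prediction statistics: counts per predicted value.
    print(result["prediction"].value_counts())
    return result


train = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "target": [0, 0, 0, 1, 1, 1],
})
test = pd.DataFrame({"row_id": [101, 102], "x1": [0.5, 4.5], "x2": [1.0, 0.0]})
model = LogisticRegression().fit(train[["x1", "x2"]], train["target"])

buffer = io.StringIO()  # stands in for a CSV file path
result = make_predictions(model, test, ["x1", "x2"], id_column="row_id", output=buffer)
```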

  8. Model Persistence

    • Save trained pipeline (joblib)

    • Export model metadata (JSON)

    • Include performance metrics and feature information
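Persistence with joblib plus a JSON sidecar file could be sketched as follows. The metadata field names and the convention of storing the JSON next to the model file are hypothetical; only the use of joblib and JSON comes from the description above.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def save_model(pipeline, filename, metadata):
    """Save the fitted pipeline with joblib and its metadata as JSON."""
    model_path = Path(filename)
    joblib.dump(pipeline, model_path)
    meta_path = model_path.with_suffix(".json")  # assumed naming convention
    meta_path.write_text(json.dumps(metadata, indent=2))
    return model_path, meta_path


X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipeline = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipeline.fit(X, y)

metadata = {  # hypothetical field names
    "problem_type": "classification",
    "saved_at": datetime.now(timezone.utc).isoformat(),
    "train_accuracy": float(pipeline.score(X, y)),
    "feature_types": {"numeric": ["f0", "f1", "f2", "f3"]},
}

with tempfile.TemporaryDirectory() as tmp:
    model_path, meta_path = save_model(pipeline, Path(tmp) / "best_pipeline.joblib", metadata)
    restored = joblib.load(model_path)
    restored_meta = json.loads(meta_path.read_text())
```

Saving the whole pipeline, rather than the bare estimator, means the fitted preprocessing steps travel with the model. Files written by joblib should only be loaded from trusted sources, since loading can execute arbitrary code.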

Design Principles

Universal Design

  • Works with any tabular dataset

  • No manual feature engineering required

  • Automatic problem type detection

  • Robust error handling

Automation First

  • Minimal user configuration

  • Smart defaults for all parameters

  • Automatic feature type detection

  • End-to-end pipeline execution

Performance Optimization

  • Multi-core parallel processing

  • Fast mode for large datasets

  • Memory-efficient transformations

  • Scalable architecture

Production Ready

  • Model persistence and versioning

  • Comprehensive metadata tracking

  • Reproducible results

  • Error handling and validation

Extensibility

  • Plugin architecture for custom models

  • Custom preprocessing functions

  • Configurable validation strategies

  • Modular component design

Performance Characteristics

Speed Benchmarks

Typical Performance (10K rows, 20 features):

  • Fast Mode: 2-5 minutes

  • Normal Mode: 5-15 minutes

  • Bayesian Tuning: 10-30 minutes

Memory Usage

  • Small Dataset (<1K rows): ~50MB

  • Medium Dataset (10K rows): ~200MB

  • Large Dataset (100K rows): ~1-2GB

Scalability Features

  • Fast Mode: About 70% faster than normal mode

  • Parallel Processing: Near-linear scaling with CPU cores for cross-validation and tuning

  • Memory Management: Efficient data handling

  • Reduced Model Set: Optimized algorithm selection

Error Handling Strategy

Data Quality Validation

  • Missing value threshold checking

  • Feature type validation

  • Empty dataset handling

  • Corrupted file detection
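A minimal sketch of these checks, assuming problems with the file itself are raised as errors while sparse columns only produce warnings. The function names and messages are hypothetical.

```python
import io

import numpy as np
import pandas as pd


def safe_read_csv(source):
    """Read a CSV and turn parser failures into one clear error."""
    try:
        return pd.read_csv(source)
    except (pd.errors.EmptyDataError, pd.errors.ParserError, UnicodeDecodeError) as exc:
        raise ValueError(f"Could not parse CSV input: {exc}") from exc


def validate_dataset(df, target_column, missing_threshold=0.8):
    """Raise on unusable data; return warnings for recoverable issues."""
    if df.empty:
        raise ValueError("Dataset is empty")
    if target_column not in df.columns:
        raise ValueError(f"Target column {target_column!r} not found")
    issues = []
    missing_ratio = df.isna().mean()
    for col in missing_ratio[missing_ratio > missing_threshold].index:
        issues.append(f"Column {col!r} exceeds the missing-value threshold")
    if df[target_column].isna().any():
        issues.append("Target column contains missing values")
    return issues


df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],  # 5 of 6 missing
    "target": [0, 1, 0, 1, 0, 1],
})
issues = validate_dataset(df, "target")
print(issues)
```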

Graceful Degradation

  • Fallback mechanisms for missing components

  • Default parameter substitution

  • Alternative algorithm selection

  • Robust preprocessing pipelines
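One way to picture a fallback mechanism: if the optional dependency for Bayesian tuning is not installed, fall back to random search instead of failing. The dependency name "skopt" (scikit-optimize) is an assumption, since this document does not name the library behind the 'bayesian' option.

```python
import importlib.util
import warnings

SUPPORTED_METHODS = {"grid", "random", "bayesian"}


def resolve_tuning_method(requested, fallback="random"):
    """Return a usable tuning method, degrading gracefully when needed."""
    if requested not in SUPPORTED_METHODS:
        warnings.warn(f"Unknown tuning method {requested!r}; using {fallback!r}")
        return fallback
    # Assumption: Bayesian search depends on the optional 'skopt' package.
    if requested == "bayesian" and importlib.util.find_spec("skopt") is None:
        warnings.warn(f"Bayesian search is unavailable; using {fallback!r}")
        return fallback
    return requested


print(resolve_tuning_method("grid"))
print(resolve_tuning_method("bayesian"))
```

Warning instead of raising keeps an end-to-end run alive, at the cost of possibly using a different method than the one requested, so the substitution should be visible in the logs.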

Validation Framework

  • Cross-validation integrity checks

  • Score normalization and validation

  • Pipeline consistency verification

  • Output format validation