Architecture & Design
Core Components
UniversalMLPipeline Class
The main pipeline class that orchestrates the entire machine learning workflow.
class UniversalMLPipeline:
def __init__(self, problem_type, random_state, verbose, fast_mode, tuning_method, n_jobs)
def load_data(self, train_path, test_path, target_column)
def auto_detect_features(self, df, exclude_columns)
def create_preprocessor(self)
def prepare_data(self, custom_features)
def define_models(self)
def cross_validate_models(self)
def hyperparameter_tuning(self)
def make_predictions(self, save_predictions, id_column)
def save_model(self, filename)
def run_pipeline(self, **kwargs)
Key Attributes
- problem_type:
‘classification’ or ‘regression’
- fast_mode:
Speed optimization for large datasets
- tuning_method:
‘grid’, ‘random’, or ‘bayesian’
- n_jobs:
Number of cores for parallel processing
- feature_types:
Dictionary containing detected feature types
- cv_results:
Cross-validation results for all models
- best_pipeline:
Best performing pipeline after tuning
Pipeline Workflow
The framework follows an 8-stage automated workflow:
Data Loading
Load CSV files using pandas
Create backup of original data
Display dataset information
Handle missing test data
Feature Detection
Automatically categorize features into:
Numeric: Continuous numerical features
Categorical: Text or discrete categorical features
Binary: Boolean or 0/1 features
Skip columns with >80% missing values
Preprocessing
Numeric: Median imputation → Standard scaling
Categorical: Constant imputation → One-hot encoding
Binary: Zero imputation
Uses ColumnTransformer for efficient processing
Model Definition
Fast Mode (for large datasets):
Classification: RandomForest, LogisticRegression, NaiveBayes
Regression: RandomForest, LinearRegression
Full Mode:
Classification: 7 algorithms
Regression: 6 algorithms
Cross Validation
Classification: StratifiedKFold (preserves class distribution)
Regression: KFold
Folds: 3 (fast mode) or 5 (normal mode)
Parallel: Multi-core cross-validation
Hyperparameter Tuning
Grid Search: Exhaustive parameter search
Random Search: Random parameter sampling (default)
Bayesian Search: Smart parameter exploration
Comprehensive parameter grids for each algorithm
Prediction Generation
Generate predictions on test set
Support for custom ID columns
Automatic CSV export
Prediction statistics
Model Persistence
Save trained pipeline (joblib)
Export model metadata (JSON)
Include performance metrics and feature information
Design Principles
Universal Design
Works with any tabular dataset
No manual feature engineering required
Automatic problem type detection
Robust error handling
Automation First
Minimal user configuration
Smart defaults for all parameters
Automatic feature type detection
End-to-end pipeline execution
Performance Optimization
Multi-core parallel processing
Fast mode for large datasets
Memory-efficient transformations
Scalable architecture
Production Ready
Model persistence and versioning
Comprehensive metadata tracking
Reproducible results
Error handling and validation
Extensibility
Plugin architecture for custom models
Custom preprocessing functions
Configurable validation strategies
Modular component design
Performance Characteristics
Speed Benchmarks
Typical Performance (10K rows, 20 features):
Fast Mode: 2-5 minutes
Normal Mode: 5-15 minutes
Bayesian Tuning: 10-30 minutes
Memory Usage
Small Dataset (<1K rows): ~50MB
Medium Dataset (10K rows): ~200MB
Large Dataset (100K rows): ~1-2GB
Scalability Features
Fast Mode: 70% speed improvement
Parallel Processing: Linear scaling with CPU cores
Memory Management: Efficient data handling
Reduced Model Set: Optimized algorithm selection
Error Handling Strategy
Data Quality Validation
Missing value threshold checking
Feature type validation
Empty dataset handling
Corrupted file detection
Graceful Degradation
Fallback mechanisms for missing components
Default parameter substitution
Alternative algorithm selection
Robust preprocessing pipelines
Validation Framework
Cross-validation integrity checks
Score normalization and validation
Pipeline consistency verification
Output format validation