Advanced Usage
Performance Optimization
Fast Mode for Large Datasets
When working with large datasets (>100K rows), enable fast mode for significant speed improvements:
# Enable fast mode
pipeline = UniversalMLPipeline(fast_mode=True)
# Or enable during execution
pipeline.run_pipeline('large_data.csv', 'target', fast_mode=True)
Fast Mode Optimizations:
Reduced model set (2-3 fastest algorithms)
Fewer cross-validation folds (3 instead of 5)
Optimized hyperparameters
70% speed improvement on average
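To see why a smaller model set and fewer folds save time, it helps to count model fits. The sketch below is back-of-the-envelope arithmetic only: `count_fits` is a hypothetical helper, and the model, fold, and candidate counts are illustrative assumptions rather than the library's actual defaults.

```python
def count_fits(n_models: int, n_folds: int, n_candidates: int) -> int:
    """Number of model fits: one per model, per hyperparameter candidate, per fold."""
    return n_models * n_candidates * n_folds


# Assumed, illustrative settings (not the library's real defaults).
normal = count_fits(n_models=5, n_folds=5, n_candidates=10)
fast = count_fits(n_models=3, n_folds=3, n_candidates=10)
reduction = 1 - fast / normal

print(f"normal: {normal} fits, fast: {fast} fits, {reduction:.0%} fewer")
```

Fit counts are only a proxy. Wall-clock time also depends on how expensive each remaining model is to train.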
Multi-Core Processing
Leverage all available CPU cores for parallel processing:
# Use all available cores (default)
pipeline = UniversalMLPipeline(n_jobs=-1)
# Use specific number of cores
pipeline = UniversalMLPipeline(n_jobs=4)
# Single-core processing
pipeline = UniversalMLPipeline(n_jobs=1)
Parallel Components:
Model training during cross-validation
Hyperparameter search (Grid/Random/Bayesian)
Individual algorithm parallelization
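The `n_jobs` values above follow the convention used by joblib and scikit-learn, where negative numbers count back from the total number of cores. The sketch below shows that mapping; it assumes the pipeline passes `n_jobs` through unchanged, and `resolve_n_jobs` is a hypothetical helper, not part of the library.

```python
import os
from typing import Optional


def resolve_n_jobs(n_jobs: int, cpu_count: Optional[int] = None) -> int:
    """Translate an n_jobs value into a worker count (joblib convention)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all but one, and so on (never below 1).
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs


print(resolve_n_jobs(-1), "workers on this machine")
```

On a memory-constrained machine, `n_jobs=-2` or a small positive number leaves headroom, since each worker may hold its own copy of the data.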
Hyperparameter Tuning Strategies
Grid Search
Exhaustive search through all parameter combinations:
pipeline = UniversalMLPipeline(tuning_method='grid')
Pros: Thorough exploration, guaranteed to find the best combination within the grid
Cons: Computationally expensive; the number of combinations grows exponentially with the number of parameters
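The cost of an exhaustive search is easiest to see by enumerating a grid. The following is a minimal sketch using only the standard library; the grid values are hypothetical and `expand_grid` is an illustrative helper, not the pipeline's own code.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Return every parameter combination in the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


# Hypothetical grid, for illustration only.
grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
}
combos = expand_grid(grid)
print(len(combos))  # 3 * 3 * 2 combinations
```

Every extra parameter multiplies the total, and each combination is then fitted once per cross-validation fold.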
Random Search
Random sampling from parameter distributions (default):
pipeline = UniversalMLPipeline(tuning_method='random')
Pros: Efficient, good performance/time ratio
Cons: May miss optimal combinations
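Random search trades completeness for a fixed budget: it evaluates only `n_iter` candidates regardless of how large the search space is. The sketch below illustrates the idea on a small discrete grid. It is a simplification (the names `sample_candidates` and `n_iter` are assumptions here, and real implementations can also sample from continuous distributions).

```python
import random
from itertools import product


def sample_candidates(grid: dict, n_iter: int, seed: int = 42) -> list:
    """Draw up to n_iter distinct parameter combinations at random from a grid."""
    names = list(grid)
    # Materializing the full grid is fine for a small illustration like this one.
    all_combos = [dict(zip(names, values)) for values in product(*grid.values())]
    rng = random.Random(seed)
    return rng.sample(all_combos, k=min(n_iter, len(all_combos)))


# Hypothetical grid, for illustration only.
grid = {
    "n_estimators": [50, 100, 200, 500],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10, 20],
}
candidates = sample_candidates(grid, n_iter=5)
print(f"evaluating {len(candidates)} of 64 possible combinations")
```

Fixing the seed keeps the sampled candidates reproducible between runs.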
Bayesian Optimization
Smart parameter exploration using Gaussian processes:
# Requires: pip install scikit-optimize
pipeline = UniversalMLPipeline(tuning_method='bayesian')
Pros: Intelligent search, fewer iterations needed
Cons: Additional dependency, complex implementation
Custom Feature Engineering
Custom Preprocessing Functions
Apply custom transformations before the pipeline:
import numpy as np
import pandas as pd
def custom_feature_engineering(df):
    # Create interaction features
    df['feature_interaction'] = df['feature1'] * df['feature2']
    # Log transformation for skewed features
    df['log_feature'] = np.log1p(df['skewed_feature'])
    # Binning continuous variables; ages outside (0, 100] become NaN
    df['age_group'] = pd.cut(df['age'], bins=[0, 25, 50, 75, 100],
                             labels=['young', 'adult', 'middle', 'senior'])
    return df
pipeline.run_pipeline(
    'data.csv',
    'target',
    feature_engineering_func=custom_feature_engineering
)
Manual Feature Selection
Override automatic feature detection:
# Define custom feature types
pipeline.feature_types = {
    'numeric': ['age', 'income', 'score'],
    'categorical': ['city', 'job_type', 'education'],
    'binary': ['has_phone', 'is_married', 'owns_car']
}
# Or specify exact features to use
custom_features = ['age', 'income', 'city', 'education']
pipeline.run_pipeline('data.csv', 'target', custom_features=custom_features)
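When feature types are assigned by hand, a typo or a column listed under two types is easy to miss. The sketch below is one way to check a mapping against the dataset's column names before running the pipeline. `validate_feature_types` is a hypothetical helper, and the column list is assumed for illustration.

```python
def validate_feature_types(feature_types: dict, columns: list) -> list:
    """Return a list of problems; an empty list means the mapping looks consistent."""
    problems = []
    seen = {}  # column name -> first feature type it was listed under
    for kind, names in feature_types.items():
        for name in names:
            if name not in columns:
                problems.append(f"{name!r} ({kind}) is not a column in the data")
            if name in seen:
                problems.append(f"{name!r} is listed as both {seen[name]} and {kind}")
            seen.setdefault(name, kind)
    return problems


# Assumed column names, for illustration only.
columns = ["age", "income", "score", "city", "job_type", "education",
           "has_phone", "is_married", "owns_car", "target"]
feature_types = {
    "numeric": ["age", "income", "score"],
    "categorical": ["city", "job_type", "education"],
    "binary": ["has_phone", "is_married", "owns_car"],
}
print(validate_feature_types(feature_types, columns) or "no problems found")
```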
Advanced Configuration
Verbose Mode
Get detailed progress information:
pipeline = UniversalMLPipeline(verbose=True)
Verbose Output Includes:
Model-by-model training progress
Fold-by-fold cross-validation scores
Detailed hyperparameter tuning results
Step-by-step pipeline execution
Custom ID Columns
Handle datasets with custom identifier columns:
# Use PassengerId from Titanic dataset
pipeline.run_pipeline(
    'titanic_train.csv',
    'Survived',
    'titanic_test.csv',
    id_column='PassengerId'
)
Column Exclusion
Exclude irrelevant columns from training:
pipeline.run_pipeline(
    'data.csv',
    'target',
    exclude_columns=['id', 'timestamp', 'name', 'description']
)
Model Customization
Adding Custom Models
Extend the framework with your own algorithms:
from sklearn.ensemble import ExtraTreesClassifier
# Add custom model after initialization
pipeline = UniversalMLPipeline()
pipeline.models['ExtraTrees'] = ExtraTreesClassifier(random_state=42)
# Run pipeline with extended model set
pipeline.run_pipeline('data.csv', 'target')
Custom Parameter Grids
Define custom hyperparameter grids:
# Override default parameter grids
custom_grids = {
    'RandomForest': {
        'model__n_estimators': [50, 100, 200, 500],
        'model__max_depth': [5, 10, 20, None],
        'model__min_samples_split': [2, 5, 10, 20]
    }
}
# Apply custom grids by overriding the private grid method (there is no public setter)
pipeline._get_param_grids = lambda: custom_grids
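The `model__` prefix in the grid keys follows scikit-learn's `<step>__<parameter>` naming for parameters of a pipeline step. The sketch below adds that prefix to plain parameter names so a grid can be written without it. It assumes the estimator step is named `model`, as the keys above suggest, and `prefix_grid` is a hypothetical helper.

```python
def prefix_grid(grid: dict, step: str = "model") -> dict:
    """Prefix plain parameter names with '<step>__' unless they already carry it."""
    prefix = f"{step}__"
    return {
        (name if name.startswith(prefix) else prefix + name): values
        for name, values in grid.items()
    }


plain = {"n_estimators": [50, 100], "model__max_depth": [5, None]}
print(prefix_grid(plain))
```

A key without the expected prefix is typically rejected by scikit-learn as an invalid parameter for the pipeline, so normalizing the names up front gives an earlier, clearer failure point.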
Production Deployment
Model Loading and Inference
Load saved models for production use:
import joblib
import json
# Load trained model
model = joblib.load('best_model.pkl')
# Load model metadata
with open('model_info.json', 'r') as f:
    model_info = json.load(f)
# Make predictions on new data (new_data must contain the same feature columns used for training)
predictions = model.predict(new_data)
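Because `joblib.load` unpickles the file, loading a model can execute arbitrary code, so only trusted artifacts should be loaded. One common safeguard is to record a checksum at training time and verify it before loading. The sketch below uses only the standard library; `sha256_of` and `verify_artifact` are hypothetical helpers, and where the expected hash is stored is left open.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_sha256: str) -> None:
    """Raise ValueError if the file's checksum differs from the recorded one."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
```

In use, `verify_artifact('best_model.pkl', recorded_hash)` would run immediately before `joblib.load('best_model.pkl')`, with `recorded_hash` read from wherever it was saved at training time.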
Batch Processing
Process multiple datasets efficiently:
datasets = [
    ('customer_data.csv', 'churn', 'classification'),
    ('sales_data.csv', 'revenue', 'regression'),
    ('marketing_data.csv', 'conversion', 'classification')
]
results = {}
for data_path, target, problem_type in datasets:
    pipeline = UniversalMLPipeline(
        problem_type=problem_type,
        fast_mode=True  # Speed up batch processing
    )
    pipeline.run_pipeline(data_path, target)
    results[data_path] = {
        'best_model': pipeline.best_model_name,
        'cv_score': pipeline.best_score
    }
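In a long batch, one bad file should not discard the results already computed. The sketch below is one way to isolate failures per dataset. `run_batch` is a hypothetical helper, and `fake_run` is a stand-in for a function that would build a `UniversalMLPipeline` and call `run_pipeline` as in the loop above.

```python
from typing import Callable, Iterable


def run_batch(jobs: Iterable[tuple], run_one: Callable[..., dict]) -> tuple:
    """Run every job; collect results and errors separately instead of stopping early."""
    results, errors = {}, {}
    for data_path, target, problem_type in jobs:
        try:
            results[data_path] = run_one(data_path, target, problem_type)
        except Exception as exc:  # keep going, but record what went wrong
            errors[data_path] = f"{type(exc).__name__}: {exc}"
    return results, errors


def fake_run(data_path: str, target: str, problem_type: str) -> dict:
    """Stand-in for a real pipeline run; fails for one made-up file name."""
    if data_path == "missing.csv":
        raise FileNotFoundError(data_path)
    return {"best_model": "placeholder", "cv_score": 0.0}


results, errors = run_batch(
    [("customer_data.csv", "churn", "classification"),
     ("missing.csv", "revenue", "regression")],
    fake_run,
)
print("succeeded:", list(results), "failed:", errors)
```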
Model Monitoring
Track model performance over time:
# Save model metadata with timestamp
import datetime
model_info = {
    'timestamp': datetime.datetime.now().isoformat(),
    'problem_type': pipeline.problem_type,
    'best_model': pipeline.best_model_name,
    'cv_score': pipeline.best_score,
    'feature_count': len(pipeline.feature_types['numeric'] +
                         pipeline.feature_types['categorical'] +
                         pipeline.feature_types['binary']),
    'training_samples': len(pipeline.X)
}
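A single `model_info` dictionary only becomes useful for tracking once records are kept across runs. The sketch below appends each record to a JSON Lines file and flags a drop in `cv_score`. The helper names, the log format, and the tolerance are assumptions, and the comparison assumes a higher score is better.

```python
import json


def append_record(log_path: str, record: dict) -> None:
    """Append one model_info record as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def score_dropped(log_path: str, tolerance: float = 0.02) -> bool:
    """True if the latest cv_score is more than `tolerance` below the best earlier one."""
    with open(log_path, encoding="utf-8") as f:
        scores = [json.loads(line)["cv_score"] for line in f if line.strip()]
    if len(scores) < 2:
        return False  # nothing to compare against yet
    return scores[-1] < max(scores[:-1]) - tolerance
```

For error-based regression metrics, where lower is better, the comparison would need to be inverted.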
Troubleshooting
Common Issues and Solutions
Memory Errors with Large Datasets:
# Enable fast mode and reduce cores
pipeline = UniversalMLPipeline(fast_mode=True, n_jobs=2)
Slow Training:
# Use random search together with fast mode
pipeline = UniversalMLPipeline(
    tuning_method='random',
    fast_mode=True
)
Poor Model Performance:
# Try different tuning method
pipeline = UniversalMLPipeline(tuning_method='bayesian')
# Or add custom feature engineering
pipeline.run_pipeline(
    'data.csv',
    'target',
    feature_engineering_func=your_custom_function
)
Missing Dependencies:
# Install optional dependencies
pip install scikit-optimize # For Bayesian optimization
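A script can also check for the optional dependency up front and fall back to another tuning method rather than failing midway. The sketch below assumes this fallback behavior is wanted; `choose_tuning_method` is a hypothetical helper, and `skopt` is the import name of the scikit-optimize package.

```python
import importlib.util


def choose_tuning_method(preferred: str = "bayesian", fallback: str = "random") -> str:
    """Return `preferred`, unless it is 'bayesian' and scikit-optimize is not importable."""
    if preferred == "bayesian" and importlib.util.find_spec("skopt") is None:
        return fallback
    return preferred


method = choose_tuning_method("bayesian")
print(f"using tuning_method={method!r}")
```

The result could then be passed as `tuning_method=method` when constructing the pipeline.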
Performance Monitoring
Track pipeline execution time:
import time
start_time = time.perf_counter()  # monotonic clock, better suited to measuring intervals
pipeline.run_pipeline('data.csv', 'target')
execution_time = time.perf_counter() - start_time
print(f"Pipeline completed in {execution_time:.2f} seconds")
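When several stages need timing, a small context manager avoids repeating the start/stop bookkeeping. This is a minimal sketch: `timed` is a hypothetical helper, and `time.sleep` stands in for the pipeline call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, sink: dict):
    """Record the elapsed wall-clock seconds of the with-block under sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed runs are timed too.
        sink[label] = time.perf_counter() - start


timings = {}
with timed("pipeline", timings):
    time.sleep(0.01)  # stand-in for pipeline.run_pipeline('data.csv', 'target')
print(f"Pipeline completed in {timings['pipeline']:.2f} seconds")
```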