Building a Complete Machine Learning Pipeline for Binary Classification
📌TL;DR
Built a complete four-stage ML pipeline for binary classification: (1) cross-validation for model evaluation, (2) grid search for automatic hyperparameter tuning across multiple dataset sizes, (3) training with the optimal configuration, and (4) testing with comprehensive metrics. Compared Decision Tree, Random Forest, and Logistic Regression on datasets from 200 to 2000 samples. Random Forest achieved the best test accuracy at 77.5%, while Logistic Regression offered the fastest inference (0.0001 s). Demonstrates a systematic approach built on k-fold validation, model persistence via pickle, and the importance of dataset size for stable performance.
Machine learning pipelines can seem overwhelming at first, but with the right approach and structure, you can build a robust system that handles everything from data preprocessing to model evaluation. In this tutorial, I'll walk you through building a complete ML pipeline that includes cross-validation, hyperparameter optimization, training, and testing.
What We're Building
This project implements a comprehensive binary classification pipeline that supports three popular algorithms: Decision Tree, Random Forest, and Logistic Regression. The pipeline is modular, scalable, and designed to find the best model configuration automatically.
Project Architecture
The pipeline consists of four main stages:
- Cross-Validation: Evaluate different configurations using k-fold validation
- Grid Search: Automatically find optimal hyperparameters
- Training: Build the final model with optimized settings
- Testing: Evaluate performance on unseen data
Each stage is implemented as a separate Python script, making the workflow easy to understand and modify.
Core Technologies
The project uses several key Python libraries:
- scikit-learn: For machine learning algorithms and preprocessing
- pandas: For data manipulation and CSV handling
- numpy: For numerical operations and array handling
- pickle: For model persistence (part of the Python standard library, so nothing extra to install)
Setting Up the Environment
First, install the required dependencies:
pip install numpy pandas scikit-learn
Building the Helper Functions
I created an auxiliary_functions.py module that handles all the common operations. Here are the key functions:
Loading and Preprocessing Data
from typing import Tuple

import numpy as np
import pandas as pd

def load_raw_dataset(dataset_filename: str) -> Tuple[np.ndarray, np.ndarray]:
    df = pd.read_csv(dataset_filename)
    # The last column holds the label; everything before it is a feature.
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    # Map categorical "Yes"/"No" entries to 1/0 before casting to float.
    dataset_x = X.replace({"Yes": 1, "No": 0}).astype(float).to_numpy()
    dataset_y = y.to_numpy()
    return dataset_x, dataset_y
The function handles CSV loading and automatically converts categorical values. This simplifies the preprocessing step significantly.
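For instance, a hypothetical call on the small training file (assuming its last column is the Yes/No label) looks like this:

```python
# Hypothetical usage of the helper above.
X, y = load_raw_dataset("training_data_small.csv")
print(X.shape, y.shape)  # e.g. (200, n_features) and (200,)
```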
Standardization
from sklearn.preprocessing import StandardScaler

def apply_normalization(raw_dataset: np.ndarray, scaler: StandardScaler | None):
    if scaler is None:
        # Training path: fit a new scaler and transform in one step.
        scaler = StandardScaler()
        new_X = scaler.fit_transform(raw_dataset)
    else:
        # Test path: reuse the training statistics; never refit here.
        new_X = scaler.transform(raw_dataset)
    return new_X, scaler
Feature standardization is crucial for algorithms like Logistic Regression. This function either fits a new scaler on training data or uses an existing one for test data, preventing data leakage.
Stage 1: Cross-Validation
The first script evaluates a single configuration using k-fold cross-validation:
python part_01_cross_validation.py hyperparameters.json training_data_small.csv 5
This script:
- Loads the dataset and hyperparameters
- Splits data into k folds
- Trains on k-1 folds and validates on the remaining fold
- Repeats k times so each fold serves as validation once
- Reports average metrics across all folds
Cross-validation gives you a reliable estimate of model performance and helps detect overfitting early.
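To make the mechanics concrete, here is a minimal sketch of k-fold evaluation. This is not the script itself, which reads its model and settings from hyperparameters.json; the Logistic Regression here is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cross_validate(X: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    """Average validation accuracy over k stratified folds."""
    kfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kfold.split(X, y):
        # Fit the scaler on the training folds only, to avoid leakage.
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X[train_idx])
        X_val = scaler.transform(X[val_idx])
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y[train_idx])
        scores.append(accuracy_score(y[val_idx], model.predict(X_val)))
    return float(np.mean(scores))
```

StratifiedKFold keeps the class ratio roughly constant in every fold, which matters for binary classification on small datasets.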
Stage 2: Grid Search
The grid search script automates hyperparameter tuning:
python part_02_grid_search.py hyperparameters.json training_data_{}.csv 5
The grid search (sketched in code after this list):
- Iterates through all hyperparameter combinations
- Tests each combination using cross-validation
- Tracks performance metrics for every configuration
- Identifies the best hyperparameters for each algorithm
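A minimal sketch of that loop, using scikit-learn's ParameterGrid; the grid values here are assumptions, since the script reads the real ones from hyperparameters.json:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

def grid_search(X: np.ndarray, y: np.ndarray, k: int = 5):
    """Return the best-scoring hyperparameter combination."""
    # Hypothetical grid; the real values come from hyperparameters.json.
    param_grid = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, 8]}
    best_score, best_params = -np.inf, None
    for params in ParameterGrid(param_grid):
        model = RandomForestClassifier(**params, random_state=42)
        score = cross_val_score(model, X, y, cv=k).mean()
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```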
Grid Search Results
After running grid search on all datasets, I found interesting patterns:
Decision Trees
- Best configuration: Gini criterion with max_depth=8 on the large dataset
- Validation accuracy: 64.5%
- Very sensitive to depth (max_depth=None leads to overfitting)
Random Forest
- Best configuration: 100 trees with max_depth=2 on the small dataset
- Validation accuracy: 70.0%
- More trees generally improved performance but increased training time
Logistic Regression
- Best configuration: No penalty with C=4.0 on the large dataset
- Validation accuracy: 70.5%
- Very fast training and inference times
- Consistent performance across dataset sizes
Stage 3: Training
After identifying optimal hyperparameters, train the final model:
python part_03_training.py hyperparameters.json training_data_large.csv scaler_lr.pkl classifier_lr.pkl
This script:
- Loads the full training dataset
- Applies standardization and fits the scaler
- Trains the classifier with optimal hyperparameters
- Saves both the scaler and classifier to disk
The pickle format allows you to load these trained models later without retraining.
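Condensed, the training stage amounts to the following; the LogisticRegression settings are my assumption based on the grid-search results above, since the script reads them from hyperparameters.json:

```python
import pickle

from sklearn.linear_model import LogisticRegression

from auxiliary_functions import load_raw_dataset, apply_normalization

# Load the full training set and standardize it, fitting a fresh scaler.
X_raw, y = load_raw_dataset("training_data_large.csv")
X, scaler = apply_normalization(X_raw, scaler=None)

# Assumed optimal configuration from the grid search (no penalty).
clf = LogisticRegression(penalty=None, max_iter=1000)
clf.fit(X, y)

# Persist both artifacts so Stage 4 can reload them without retraining.
with open("scaler_lr.pkl", "wb") as f:
    pickle.dump(scaler, f)
with open("classifier_lr.pkl", "wb") as f:
    pickle.dump(clf, f)
```

Note that `penalty=None` requires scikit-learn 1.2+; older versions spell it `penalty="none"`.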
Stage 4: Testing
Finally, evaluate the trained model on unseen data:
python part_04_testing.py testing_data.csv scaler_lr.pkl classifier_lr.pkl
This script loads the saved scaler and classifier, then reports comprehensive metrics (a condensed sketch follows the list):
- Accuracy
- Precision, Recall, F1-Score per class
- Macro-averaged metrics
- Confusion matrix analysis
- Inference time
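A condensed sketch of the testing stage, reusing the Stage 3 artifacts:

```python
import pickle
import time

from sklearn.metrics import classification_report, confusion_matrix

from auxiliary_functions import load_raw_dataset, apply_normalization

# Reload the artifacts saved in Stage 3.
with open("scaler_lr.pkl", "rb") as f:
    scaler = pickle.load(f)
with open("classifier_lr.pkl", "rb") as f:
    clf = pickle.load(f)

# Transform the test set with the *training* scaler -- no refitting.
X_raw, y = load_raw_dataset("testing_data.csv")
X, _ = apply_normalization(X_raw, scaler=scaler)

start = time.perf_counter()
y_pred = clf.predict(X)
print(f"Inference time: {time.perf_counter() - start:.4f} s")

# Per-class and macro-averaged precision/recall/F1, plus the confusion matrix.
print(classification_report(y, y_pred))
print(confusion_matrix(y, y_pred))
```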
Final Test Results
Here are the results on the test dataset:
| Classifier | Accuracy | Macro Avg Recall | Macro Avg Precision | Macro Avg F1 | Inference Time |
|---|---|---|---|---|---|
| Decision Tree | 75.50% | 75.50% | 75.50% | 75.50% | 0.0004 s |
| Random Forest | 77.50% | 77.50% | 78.77% | 77.25% | 0.0014 s |
| Logistic Regression | 76.50% | 76.50% | 76.50% | 76.50% | 0.0001 s |
Random Forest achieved the best performance with 77.5% accuracy, though Logistic Regression was faster during inference.
Key Insights and Best Practices
Dataset Size Matters
The experiments revealed that model performance improved with larger datasets:
- Small dataset (200 samples): High variance, unstable results
- Medium dataset (400 samples): Better generalization
- Large dataset (800 samples): Most reliable performance
- Very large dataset (2000 samples): Best accuracy and stability
Hyperparameter Impact
Decision Trees:
- Shallow trees (depth 2-4) usually generalized better than deep ones, though the single best configuration used max_depth=8 on the large dataset
- Gini and entropy criteria performed similarly
- Unrestricted depth led to severe overfitting
Random Forest:
- More trees consistently improved performance
- Shallow individual trees (depth 2) with many trees (100) worked best
- Training time scaled linearly with tree count
Logistic Regression:
- Among penalized configurations, L1 regularization performed slightly better than L2, though the overall best run used no penalty
- C value around 1.0-4.0 gave optimal results
- Very fast and stable across all dataset sizes
Computational Efficiency
Training times varied significantly:
- Decision Tree: ~0.003-0.015 seconds (fastest)
- Logistic Regression: ~0.004-0.016 seconds (fast)
- Random Forest: ~0.02-0.33 seconds (slowest but most accurate)
For production systems with tight latency requirements, Logistic Regression offers the best balance of speed and accuracy.
Common Pitfalls to Avoid
Data Leakage: Never fit the scaler on test data. Always use the scaler fitted on training data.
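In code, the rule looks like this (a toy illustration with random data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train, X_test = np.random.rand(80, 4), np.random.rand(20, 4)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on training data only
X_test_std = scaler.transform(X_test)        # reuse the training statistics
```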
Overfitting: Deep decision trees and complex models may achieve 100% training accuracy but fail on validation data.
Imbalanced Folds: The splitting function ensures balanced partition sizes, preventing some folds from being too small.
Ignoring Standardization: Algorithms like Logistic Regression perform poorly without standardization.
Conclusion
This machine learning pipeline demonstrates a systematic approach to binary classification. The modular design makes it easy to experiment with different configurations, while the automated grid search saves time and finds optimal hyperparameters.
Key takeaways:
- Cross-validation provides reliable performance estimates
- Grid search automates hyperparameter tuning efficiently
- Model persistence with pickle enables deployment without retraining
- Multiple dataset sizes help understand scalability
- Comprehensive metrics give a complete picture of model performance
The Random Forest classifier achieved the best test accuracy at 77.5%, though Logistic Regression offers a faster alternative with only slightly lower accuracy. The choice between them depends on your specific requirements for speed versus accuracy.
📓 Interactive Notebook
Explore the complete implementation with all four pipeline stages:
- View on GitHub
- Run Part 1: Cross-Validation
- Run Part 2: Grid Search
- Run Part 3: Training
- Run Part 4: Testing
Check the repository for the complete code, helper functions, and sample datasets.