Predicting a Developer's Programming Language Choice with Machine Learning
📌TL;DR
Built a complete four-stage ML pipeline that frames a developer's programming language choice as a binary classification problem: (1) cross-validation for model evaluation, (2) grid search for automatic hyperparameter tuning across multiple dataset sizes, (3) training with the optimal configuration, and (4) testing with comprehensive metrics. Compared Decision Tree, Random Forest, and Logistic Regression on datasets ranging from 200 to 2000 developer samples. Random Forest achieved the best test accuracy (77.5%), while Logistic Regression offered the fastest inference (0.0001 s). The project demonstrates a systematic workflow with k-fold validation, model persistence via pickle, and shows why dataset size matters for stable performance.
What programming language will a developer choose? This question might seem simple, but predicting a developer's language preference based on their characteristics, experience, and background is a fascinating machine learning problem. Whether it's Python vs Java, JavaScript vs TypeScript, or any other language pairing, understanding the factors that influence this choice can help with recruitment, team building, and tool selection.
Machine learning pipelines can seem overwhelming at first, but with the right approach and structure, you can build a robust system that handles everything from data preprocessing to model evaluation. In this tutorial, I'll walk you through building a complete ML pipeline to predict a developer's programming language choice, including cross-validation, hyperparameter optimization, training, and testing.
What We're Building
This project implements a comprehensive binary classification pipeline to predict a developer's programming language choice. The system uses developer characteristics such as experience level, education background, preferred development environment, and other relevant features to classify which programming language a developer is likely to choose. The pipeline supports three popular algorithms: Decision Tree, Random Forest, and Logistic Regression. It's modular, scalable, and designed to find the best model configuration automatically.
Project Architecture
The pipeline consists of four main stages:
- Cross-Validation: Evaluate different configurations using k-fold validation
- Grid Search: Automatically find optimal hyperparameters
- Training: Build the final model with optimized settings
- Testing: Evaluate performance on unseen data
Each stage is implemented as a separate Python script, making the workflow easy to understand and modify.
Core Technologies
The project uses several key Python libraries:
- scikit-learn: For machine learning algorithms and preprocessing
- pandas: For data manipulation and CSV handling
- numpy: For numerical operations and array handling
- pickle: For model persistence
Setting Up the Environment
First, install the required dependencies:
```bash
pip install numpy pandas scikit-learn
```
Building the Helper Functions
I created an auxiliary_functions.py module that handles all the common operations. Here are the key functions:
Loading and Preprocessing Data
```python
from typing import Tuple

import numpy as np
import pandas as pd


def load_raw_dataset(dataset_filename: str) -> Tuple[np.ndarray, np.ndarray]:
    """Load a CSV, map Yes/No answers to 1/0, and return (features, labels)."""
    df = pd.read_csv(dataset_filename)
    X = df.iloc[:, :-1]  # all columns except the last are features
    y = df.iloc[:, -1]   # the last column is the binary target
    dataset_x = X.replace({"Yes": 1, "No": 0}).astype(float).to_numpy()
    dataset_y = y.to_numpy()
    return dataset_x, dataset_y
```
The function handles CSV loading and converts the categorical Yes/No answers into numeric values automatically. For the developer language prediction task, the features might include years of experience, education level, preferred development environment, types of projects worked on, and other developer characteristics. The target variable in the last column encodes the binary choice (e.g., Python vs Java, or preference for one language over another). Handling the conversion inside the loader simplifies the rest of the preprocessing significantly.
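To make the expected format concrete, here is a toy example using `load_raw_dataset` from above; the column names are purely illustrative, and the real dataset's columns may differ:

```python
import pandas as pd

# Hypothetical developer data; the last column is the binary target.
toy = pd.DataFrame({
    "years_experience": [3, 10, 1],
    "has_cs_degree": ["Yes", "No", "Yes"],
    "uses_ide": ["Yes", "Yes", "No"],
    "chose_python": [1, 0, 1],
})
toy.to_csv("toy_developers.csv", index=False)

X, y = load_raw_dataset("toy_developers.csv")
print(X.shape, y[:3])  # (3, 3) [1 0 1]
```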
Standardization
```python
from sklearn.preprocessing import StandardScaler


def apply_normalization(raw_dataset: np.ndarray, scaler: StandardScaler | None):
    """Standardize features; fit a new scaler if none is supplied."""
    if scaler is None:
        scaler = StandardScaler()
        new_X = scaler.fit_transform(raw_dataset)  # fit on training data
    else:
        new_X = scaler.transform(raw_dataset)      # reuse the training-data scaler
    return new_X, scaler
```
Feature standardization is crucial for algorithms like Logistic Regression. This function either fits a new scaler on training data or uses an existing one for test data, preventing data leakage.
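A quick usage sketch, with synthetic arrays standing in for the real feature matrices:

```python
import numpy as np

# Synthetic stand-ins for the real training and test features.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 3)), rng.normal(size=(20, 3))

# Fit the scaler on training data only...
X_train_scaled, scaler = apply_normalization(X_train, scaler=None)
# ...then reuse the same scaler on test data to prevent leakage.
X_test_scaled, _ = apply_normalization(X_test, scaler=scaler)
```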
Stage 1: Cross-Validation
The first script evaluates a single configuration using k-fold cross-validation:
```bash
python part_01_cross_validation.py hyperparameters.json training_data_small.csv 5
```
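The first argument names a JSON file holding the configuration to evaluate. I'm not reproducing the repository's exact schema here, so treat this as a hypothetical illustration of what generating such a file might look like:

```python
import json

# Hypothetical hyperparameters.json contents; the repository's actual
# schema may differ. This flat layout is for illustration only.
config = {"classifier": "random_forest", "n_estimators": 100, "max_depth": 2}

with open("hyperparameters.json", "w") as f:
    json.dump(config, f, indent=2)
```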
This script:
- Loads the dataset and hyperparameters
- Splits data into k folds
- Trains on k-1 folds and validates on the remaining fold
- Repeats k times so each fold serves as validation once
- Reports average metrics across all folds
Cross-validation gives you a reliable estimate of model performance and helps detect overfitting early.
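Conceptually, the evaluation loop looks like the following simplified sketch built on scikit-learn's StratifiedKFold; the actual script may split the folds differently:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler


def cross_validate(X: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    """Return the mean validation accuracy over k stratified folds."""
    scores = []
    splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, val_idx in splitter.split(X, y):
        scaler = StandardScaler().fit(X[train_idx])  # fit on training folds only
        clf = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
        preds = clf.predict(scaler.transform(X[val_idx]))
        scores.append(accuracy_score(y[val_idx], preds))
    return float(np.mean(scores))
```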
Stage 2: Grid Search
The grid search script automates hyperparameter tuning:
```bash
python part_02_grid_search.py hyperparameters.json training_data_{}.csv 5
```
The grid search (sketched in code below):
- Iterates through all hyperparameter combinations
- Tests each combination using cross-validation
- Tracks performance metrics for every configuration
- Identifies the best hyperparameters for each algorithm
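A condensed sketch of that loop for a hypothetical Random Forest grid; the real project's grid values and helper functions may differ:

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumes load_raw_dataset from earlier and the small training file.
X, y = load_raw_dataset("training_data_small.csv")

# Hypothetical grid; the values tested in the real project may differ.
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, 8, None]}

best_score, best_params = -1.0, None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    clf = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(clf, X, y, cv=5).mean()  # 5-fold CV accuracy
    if score > best_score:
        best_score, best_params = score, params

print(f"Best: {best_params} ({best_score:.3f})")
```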
Grid Search Results
After running grid search on all datasets, I found interesting patterns:
Decision Trees
- Best configuration: Gini criterion with max_depth=8 on the large dataset
- Validation accuracy: 64.5%
- Very sensitive to depth (max_depth=None leads to overfitting)
Random Forest
- Best configuration: 100 trees with max_depth=2 on small dataset
- Validation accuracy: 70.0%
- More trees generally improved performance but increased training time
Logistic Regression
- Best configuration: No penalty with C=4.0 on large dataset
- Validation accuracy: 70.5%
- Very fast training and inference times
- Consistent performance across dataset sizes
Stage 3: Training
After identifying optimal hyperparameters, train the final model:
```bash
python part_03_training.py hyperparameters.json training_data_large.csv scaler_lr.pkl classifier_lr.pkl
```
This script:
- Loads the full training dataset
- Applies standardization and fits the scaler
- Trains the classifier with optimal hyperparameters
- Saves both the scaler and classifier to disk
The pickle format allows you to load these trained models later without retraining.
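The persistence itself is only a few lines. A sketch, with unfitted stand-ins so the snippet runs on its own; in the real script, `scaler` and `classifier` are the objects fitted in this stage:

```python
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-ins for the objects fitted during training.
scaler, classifier = StandardScaler(), LogisticRegression()

# Save the scaler and classifier to disk.
with open("scaler_lr.pkl", "wb") as f:
    pickle.dump(scaler, f)
with open("classifier_lr.pkl", "wb") as f:
    pickle.dump(classifier, f)

# Later (e.g. in the testing stage), restore them without retraining.
with open("scaler_lr.pkl", "rb") as f:
    scaler = pickle.load(f)
with open("classifier_lr.pkl", "rb") as f:
    classifier = pickle.load(f)
```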
Stage 4: Testing
Finally, evaluate the trained model on unseen data:
```bash
python part_04_testing.py testing_data.csv scaler_lr.pkl classifier_lr.pkl
```
This script loads the saved scaler and classifier, then reports comprehensive metrics (computed roughly as in the sketch after this list):
- Accuracy
- Precision, Recall, F1-Score per class
- Macro-averaged metrics
- Confusion matrix analysis
- Inference time
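scikit-learn provides all of these out of the box. A sketch, assuming `X_test` has already been scaled with the saved scaler, `y_test` holds the true labels, and `classifier` is the unpickled model:

```python
import time

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

start = time.perf_counter()
y_pred = classifier.predict(X_test)  # time only the prediction step
inference_time = time.perf_counter() - start

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))  # per-class and macro-averaged metrics
print(confusion_matrix(y_test, y_pred))
print(f"Inference time: {inference_time:.4f} s")
```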
Final Test Results
Here are the results on the test dataset for predicting developer language choice:
| Classifier | Accuracy | Macro Avg Recall | Macro Avg Precision | Macro Avg F1 | Inference Time |
|---|---|---|---|---|---|
| Decision Tree | 75.50% | 75.50% | 75.50% | 75.50% | 0.0004 s |
| Random Forest | 77.50% | 77.50% | 78.77% | 77.25% | 0.0014 s |
| Logistic Regression | 76.50% | 76.50% | 76.50% | 76.50% | 0.0001 s |
Random Forest achieved the best performance with 77.5% accuracy in predicting a developer's programming language choice, though Logistic Regression was faster during inference. This suggests that ensemble methods can capture complex patterns in developer characteristics that influence language preferences.
Key Insights and Best Practices
Dataset Size Matters
The experiments revealed that model performance improved with larger datasets of developer information:
- Small dataset (200 developers): High variance, unstable results in language prediction
- Medium dataset (400 developers): Better generalization across different developer profiles
- Large dataset (800 developers): Most reliable performance for predicting language choices
- Very large dataset (2000 developers): Best accuracy and stability in identifying language preferences
Hyperparameter Impact
Decision Trees:
- Shallow trees (depth 2-4) generalized better than deep trees
- Gini and entropy criteria performed similarly
- Unrestricted depth led to severe overfitting
Random Forest:
- More trees consistently improved performance
- Shallow individual trees (depth 2) with many trees (100) worked best
- Training time scaled linearly with tree count
Logistic Regression:
- L1 regularization performed slightly better than L2
- C value around 1.0-4.0 gave optimal results
- Very fast and stable across all dataset sizes
Computational Efficiency
Training times varied significantly:
- Decision Tree: ~0.003-0.015 seconds (fastest)
- Logistic Regression: ~0.004-0.016 seconds (fast)
- Random Forest: ~0.02-0.33 seconds (slowest but most accurate)
For production systems with tight latency requirements, Logistic Regression offers the best balance of speed and accuracy when predicting developer language choices in real-time scenarios.
Common Pitfalls to Avoid
Data Leakage: Never fit the scaler on test data. Always use the scaler fitted on training data.
Overfitting: Deep decision trees and complex models may achieve 100% training accuracy but fail on validation data.
Imbalanced Folds: The splitting function ensures balanced partition sizes, preventing some folds from ending up too small (see the sketch below).
Ignoring Standardization: Algorithms like Logistic Regression perform poorly without standardization.
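On the fold-balance point, one simple way to guarantee near-equal partitions is np.array_split, which never lets fold sizes differ by more than one element. This is a minimal sketch, not necessarily the repository's exact splitting function:

```python
import numpy as np


def split_into_folds(n_samples: int, k: int, seed: int = 0) -> list:
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return np.array_split(indices, k)  # fold sizes differ by at most one


print([len(fold) for fold in split_into_folds(202, 5)])  # [41, 41, 40, 40, 40]
```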
Conclusion
This machine learning pipeline demonstrates a systematic approach to predicting a developer's programming language choice using binary classification. The modular design makes it easy to experiment with different configurations, while the automated grid search saves time and finds optimal hyperparameters.
Key takeaways:
- Cross-validation provides reliable performance estimates for language prediction
- Grid search automates hyperparameter tuning efficiently across different model types
- Model persistence with pickle enables deployment without retraining
- Multiple dataset sizes help understand how model performance scales with more developer data
- Comprehensive metrics give a complete picture of how well we can predict language preferences
The Random Forest classifier achieved the best test accuracy at 77.5% for predicting developer language choice, though Logistic Regression offers a faster alternative with only slightly lower accuracy. The choice between them depends on your specific requirements for speed versus accuracy. Understanding which features are most predictive of language choice can provide valuable insights into developer preferences and help with recruitment, team composition, and technology stack decisions.
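As a starting point for that analysis, a fitted Random Forest exposes scikit-learn's built-in `feature_importances_`. A sketch, with a hypothetical pickle filename and illustrative feature names:

```python
import pickle

import numpy as np

with open("classifier_rf.pkl", "rb") as f:  # hypothetical filename for the RF model
    forest = pickle.load(f)

# Illustrative names; the real dataset's columns may differ.
feature_names = ["years_experience", "has_cs_degree", "uses_ide"]
for idx in np.argsort(forest.feature_importances_)[::-1]:
    print(f"{feature_names[idx]}: {forest.feature_importances_[idx]:.3f}")
```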
📓 Interactive Notebook
Explore the complete implementation with all four pipeline stages:
- View on GitHub
- Run Part 1: Cross-Validation
- Run Part 2: Grid Search
- Run Part 3: Training
- Run Part 4: Testing
Check the repository for the complete code, helper functions, and sample datasets.
