
US Healthcare Fraud Detection: Building a Provider-Level Machine Learning Pipeline

December 12, 2025 · Jonesh Shrestha
Healthcare Fraud Detection · Random Forest · XGBoost · Fraud Detection · SMOTE · Feature Engineering · Imbalanced Learning · Provider Risk Scoring

📌 TL;DR

Built a provider-level fraud detection pipeline on Medicare-inspired claims data (5,410 providers, 558,211 claims from 138,556 beneficiaries) to identify fraudulent healthcare providers. Engineered 47 provider-level features from raw claims and beneficiary data, handled the class imbalance (9.35% fraud rate) with SMOTE oversampling, and compared Logistic Regression, Random Forest, and XGBoost. Random Forest achieved 95.46% ROC-AUC, exceeding the 80% target. Optimizing the decision threshold to 0.46 on the validation set for maximum F1-score yielded 82.18% recall (catching roughly 4 out of 5 fraudulent providers) and 53.90% precision on the test set. 10-fold cross-validation confirmed model stability with 99.40% ± 0.36% ROC-AUC. The project demonstrates a complete KDD pipeline: unsupervised learning (KMeans clustering, Isolation Forest), feature selection via Mutual Information, hyperparameter tuning with GridSearchCV, and threshold optimization for real-world fraud detection applications.

Introduction

Healthcare fraud, waste, and abuse (FWA) account for billions of dollars in annual losses in the US healthcare system. The goal of this project was to build a model that identifies potentially fraudulent providers from their claims patterns, with a target of at least 0.80 ROC-AUC.

The key design choice: predict fraud at the provider level. The label is per-provider (Yes/No), so I engineered provider-level features from raw claim and beneficiary data before training any models.

What I Built

A full KDD-style pipeline:

  1. Data integration and preprocessing
  2. Provider-level feature engineering (47 features)
  3. Exploratory Data Analysis (EDA)
  4. Unsupervised learning (KMeans, Isolation Forest)
  5. Feature selection (Mutual Information and percentile selection)
  6. Supervised learning (Logistic Regression, Random Forest, XGBoost)
  7. Threshold optimization on validation set
  8. Cross-validation for stability

Dataset Overview

The dataset has four components (all Medicare-inspired):

  • Beneficiary data: 138,556 records, demographics and chronic conditions
  • Inpatient claims: 40,474 records
  • Outpatient claims: 517,737 records
  • Provider labels: 5,410 providers with fraud label

A few quality checks that mattered:

  • All beneficiary IDs in claims exist in beneficiary data (100% linkage)
  • All provider IDs in labels match providers in claims (perfect match)
  • No duplicate claim IDs

Step 1: Preprocessing and Provider-Level Feature Engineering

Why provider-level aggregation?

Because the target is a binary provider label, claim-level modeling would require a second modeling layer (claim to provider) or label propagation. Instead, I aggregated all raw signals into a single row per provider to match the prediction task and keep the modeling clean.
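To make the aggregation concrete, here is a minimal pandas sketch. The file and column names (ClaimID, BeneID, InscClaimAmtReimbursed, and so on) are assumptions about a Medicare-style schema, not necessarily the exact ones in my notebook:

import pandas as pd

# Hypothetical file names; the dataset ships as separate inpatient,
# outpatient, beneficiary, and provider-label files.
inpatient = pd.read_csv("Train_Inpatientdata.csv")
outpatient = pd.read_csv("Train_Outpatientdata.csv")
labels = pd.read_csv("Train.csv")  # assumed columns: Provider, PotentialFraud

claims = pd.concat([inpatient, outpatient], ignore_index=True)

# Aggregate raw claims into one row per provider (column names are assumptions)
provider_features = claims.groupby("Provider").agg(
    TotalClaims=("ClaimID", "count"),
    UniqueBeneficiaries=("BeneID", "nunique"),
    TotalReimbursement=("InscClaimAmtReimbursed", "sum"),
    AvgReimbursement=("InscClaimAmtReimbursed", "mean"),
    MaxReimbursement=("InscClaimAmtReimbursed", "max"),
    TotalDeductible=("DeductibleAmtPaid", "sum"),
).reset_index()

# Example of a derived ratio feature
provider_features["ClaimsPerBeneficiary"] = (
    provider_features["TotalClaims"] / provider_features["UniqueBeneficiaries"]
)

# Attach the per-provider fraud label for supervised modeling
provider_df = provider_features.merge(labels, on="Provider", how="left")

The same pattern extends to the temporal, medical-code, physician, and demographic groups described below.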

The 47 engineered features

I engineered 47 provider-level features grouped into:

  • Claim volume (6): total claims, inpatient vs outpatient counts, ratios, unique beneficiaries, claims per beneficiary
  • Financial (7): total reimbursement/deductible, averages, std dev, ratios, max/min
  • Temporal (2): average claim duration, average admission duration
  • Medical code (6): unique diagnosis/procedure codes, averages per claim, "top code" concentration ratio
  • Physician (5): unique physician counts and diversity ratios
  • Demographics (21): aggregated beneficiary age, chronic condition burden, percentages by gender/race/deceased/chronic flags

Handling missingness correctly

Not all missing values mean "bad data" in claims. I treated them based on semantics:

  • Date of death is mostly missing (expected). I did not use it directly, only as "percent deceased beneficiaries per provider."
  • Missing diagnosis/procedure codes often mean "no additional codes."
  • Missing physician fields naturally produce zero counts when features are "unique physician counts."

Scaling

I standardized features using StandardScaler to support distance-based methods (KMeans) and improve optimization stability for Logistic Regression. I tested log scaling for skewed financial features, but it reduced clustering quality and did not help Logistic Regression, so I did not use it for the final models.
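A minimal sketch of the scaling step, assuming provider_df from the aggregation above and feature_columns as the list of 47 engineered column names:

from sklearn.preprocessing import StandardScaler

# Standardize each provider-level feature to zero mean and unit variance.
# For the supervised models, fit the scaler on the training split only and
# reuse it on validation/test to avoid leakage.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(provider_df[feature_columns])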

Step 2: Exploratory Data Analysis

Two things jumped out during EDA:

1) Heavy skew and outliers

Provider behavior is not "nice and Gaussian." Many features were right-skewed with extreme outliers, such as total claims having a max of 8,240.

2) Strong correlations inside feature groups

Volume and financial features were highly correlated (for example, total claims vs outpatient claims). This confirmed the engineered features were consistent, but it also introduced multicollinearity that could hurt some models and interpretability.

EDA-based fraud signals

Fraud-labeled providers tended to show:

  • Higher claim volumes and higher average reimbursement per claim
  • Longer or inconsistent durations
  • More diverse diagnosis and procedure codes
  • More unique beneficiaries and physicians

Step 3: Unsupervised Learning (Risk Discovery Without Labels)

I used unsupervised learning for two reasons:

  1. Validate whether engineered features separate suspicious behavior at all
  2. Provide extra "risk context" beyond a single classifier score

KMeans clustering

I tested k from 2 to 10 (using elbow and silhouette analysis) and focused on k=3 for interpretability.

Result: one cluster represented "high volume / high reimbursement providers" and had a 60.97% fraud rate compared to the 9.35% baseline.
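A sketch of the clustering sweep, assuming X_scaled is the standardized feature matrix and y is a 0/1 numpy array of fraud labels (used only afterwards to measure enrichment):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow + silhouette sweep over k = 2..10
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels_k = km.fit_predict(X_scaled)
    print(k, round(km.inertia_), round(silhouette_score(X_scaled, labels_k), 3))

# Inspect fraud enrichment per cluster at k = 3
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
for c in range(3):
    print(f"cluster {c}: fraud rate = {y[clusters == c].mean():.2%}")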

Isolation Forest anomaly detection

I flagged the top 10% most anomalous providers and checked enrichment:

  • Anomalies had 32.35% fraud rate
  • They captured 34.58% of all fraud cases

Isolation Forest works by isolating points via random splits, where outliers tend to be isolated in fewer splits.
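A sketch of the anomaly check under the same assumptions (X_scaled, 0/1 labels y):

from sklearn.ensemble import IsolationForest

# Flag roughly the top 10% most anomalous providers
iso = IsolationForest(contamination=0.10, random_state=42)
is_anomaly = iso.fit_predict(X_scaled) == -1  # -1 marks anomalies

# Enrichment: fraud rate among anomalies, and share of all fraud captured
print(f"fraud rate in anomalies: {y[is_anomaly].mean():.2%}")
print(f"fraud captured: {y[is_anomaly].sum() / y.sum():.2%}")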

Step 4: Feature Selection (Mutual Information and Percentile Selection)

Since many features are correlated, I used Mutual Information (MI) as a model-agnostic way to score feature usefulness, then used percentile-based selection to find a compact subset.

From the notebook results:

  • Testing percentiles from 10% to 100% via cross-validation, the 10th percentile performed best, selecting 5 features:
    1. InpatientClaims
    2. TotalReimbursement
    3. MaxReimbursement
    4. TotalDeductible
    5. UniqueProcedureCodes

This matched the EDA intuition: fraud is strongly tied to volume, money flow, and code diversity.
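A sketch of the percentile sweep, assuming X_scaled, y, and feature_names (the list of 47 engineered feature names) from earlier; the Logistic Regression is just a stand-in scorer for the cross-validated comparison, not necessarily the estimator used in the notebook:

from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Score each percentile of MI-ranked features with cross-validation
for pct in range(10, 101, 10):
    pipe = make_pipeline(
        SelectPercentile(score_func=mutual_info_classif, percentile=pct),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    auc = cross_val_score(pipe, X_scaled, y, cv=5, scoring="roc_auc").mean()
    print(pct, round(auc, 4))

# Features kept at the winning percentile (10%)
sel = SelectPercentile(score_func=mutual_info_classif, percentile=10).fit(X_scaled, y)
print([f for f, keep in zip(feature_names, sel.get_support()) if keep])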

Step 5: Supervised Modeling

Train-validation-test split

I used a stratified split so each partition preserved the 9.35% fraud rate:

  • Train: 3,787 providers (70%)
  • Validation: 541 providers (10%)
  • Test: 1,082 providers (20%)

Also important: the dataset's separate "test" file did not include fraud labels, so all supervised work used the labeled data with internal splits.

Handling class imbalance

I compared:

  • Class weighting
  • SMOTE oversampling

SMOTE-NC was considered but not used because the provider-level features are continuous.
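A sketch of the split and oversampling, assuming X and y are the provider feature matrix and 0/1 labels; SMOTE is applied to the training partition only so validation and test stay untouched:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# 70/10/20 stratified split: carve out 20% for test, then 1/8 of the
# remainder (= 10% overall) for validation
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.125, stratify=y_tmp, random_state=42
)

# Oversample the minority (fraud) class in the training set only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)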

Baseline: Logistic Regression

With class weighting:

  • Precision 0.3966
  • Recall 0.9307
  • F1 0.5562
  • ROC-AUC 0.9587

This model catches most fraud (high recall), but it produces many false positives (low precision).
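For reference, a class-weighted baseline along these lines (assuming the scaled splits from above):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Class weighting makes errors on the rare fraud class cost more during fitting
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
logreg.fit(X_train, y_train)

test_proba = logreg.predict_proba(X_test)[:, 1]
print(classification_report(y_test, logreg.predict(X_test)))
print("ROC-AUC:", round(roc_auc_score(y_test, test_proba), 4))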

Random Forest vs XGBoost (with GridSearchCV)

I tuned both models with GridSearchCV and compared on the held-out test set.

From the report:

  • Random Forest: fraud F1 0.6473
  • XGBoost: fraud F1 0.6288
  • Similar ROC-AUC, but Random Forest won on F1

Feature importance aligned with EDA and feature selection, especially TotalReimbursement, TotalDeductible, and UniqueProcedureCodes.
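An illustrative GridSearchCV setup for the Random Forest; the grid below is a plausible example, not the exact search space from the notebook, and XGBoost was tuned the same way with its own grid:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # optimize for the fraud-class precision/recall trade-off
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train_res, y_train_res)   # SMOTE-resampled training set
best_rf = grid.best_estimator_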

Step 6: Threshold Optimization (Making Recall vs Precision Explicit)

Fraud detection is not a "default threshold = 0.5" problem. The threshold is a business decision:

  • Lower threshold: catch more fraud (higher recall), but flag more normal providers (lower precision)
  • Higher threshold: fewer false alarms, but more missed fraud

I optimized the decision threshold on the validation set to maximize F1, then evaluated once on test.
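A sketch of the sweep, assuming best_rf and the validation/test splits from above:

import numpy as np
from sklearn.metrics import f1_score

# Sweep candidate thresholds on validation probabilities, keep the best F1
val_proba = best_rf.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.05, 0.96, 0.01)
best_t = max(thresholds, key=lambda t: f1_score(y_val, (val_proba >= t).astype(int)))

# Apply the chosen threshold exactly once to the test set
test_pred = (best_rf.predict_proba(X_test)[:, 1] >= best_t).astype(int)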

Best threshold: 0.46

On test, this improved recall to 82.18% and F1 to 0.6510 (precision became 53.90%).

Confusion matrix at the optimized threshold:

                     Predicted Non-Fraud   Predicted Fraud
Actual Non-Fraud     910 (TN)              71 (FP)
Actual Fraud         18 (FN)               83 (TP)
  • True Positives: 83 fraud cases correctly identified
  • False Negatives: 18 fraud cases missed (17.82% miss rate)
  • False Positives: 71 non-fraud cases flagged (audit candidates)

Step 7: Validation and Interpretation

Cross-validation stability

A 10-fold CV on the best model produced:

  • Mean ROC-AUC 0.9940 ± 0.0036
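A sketch of the stability check, assuming the full labeled provider set X, y and the tuned best_rf:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# 10-fold stratified CV keeps the 9.35% fraud rate in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(best_rf, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")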

How to interpret errors in this setting

  • False negatives likely represent subtle fraud that looks normal.
  • False positives should be treated as audit candidates, not automatic punishment.

This framing matters if you ever deploy the model as a screening tool.

Key Insights

What worked well

  1. Provider-level aggregation: Matching the feature granularity to the target label simplified the modeling pipeline and improved interpretability.
  2. Engineered features: The 47 features captured diverse fraud signals (volume, financial, temporal, medical codes, demographics).
  3. Threshold optimization: Moving from default 0.5 to 0.46 significantly improved recall (from 77% to 82%) with minimal precision loss.
  4. Model stability: High cross-validation scores confirmed the model generalizes well across different provider subsets.

Practical applications

  • Risk-Scoring Tool: Deploy as automated provider risk assessment system
  • Audit Prioritization: Focus investigative resources on high-risk providers (82% fraud capture rate)
  • Cost Savings: Reduce manual investigation workload while maintaining high fraud detection
  • Decision Support: Clear feature importance provides actionable insights for investigators

Limitations

  • False Negatives: 17.8% of fraud cases remain undetected
  • Synthetic Data: Results based on Medicare-inspired data; real-world validation required
  • Tabular Features Only: Does not incorporate temporal patterns or network relationships
  • Class Imbalance: Even with SMOTE, minority class remains challenging

Future work

If I extend this project, I would focus on:

  • Richer temporal features (sequences, seasonality, bursts)
  • Graph-based signals over provider-beneficiary networks
  • Cost-sensitive optimization tied to audit capacity and investigation cost
  • Validation on real Medicare claims data (subject to data access and privacy regulations)
  • Continuous model updates as fraud patterns evolve

Reproducibility Notes

Language: Python 3.11.11

Core libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, imbalanced-learn, xgboost

Typical environment setup:

pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn xgboost

📓 Jupyter Notebook

Want to explore the complete code and run it yourself? Access the full Jupyter notebook with detailed implementations, visualizations, and methodology:

→ View Notebook on GitHub


Note: The repository includes the full dataset (ZIP file), complete notebook, and README with detailed setup instructions.