Cancer Cell Prediction Using Support Vector Machines (SVM)
📌TL;DR
Built an SVM classifier to predict benign vs malignant cancer cells, reaching 95.71% accuracy on held-out test data (683 cleaned samples in total, 137 of them reserved for testing). Used an RBF kernel for non-linear classification, with 100% precision for benign cases and 90.9% for malignant. Demonstrates data preprocessing, an 80/20 train/test split, and model evaluation using a confusion matrix and classification report, plus visualizations of class distributions to help interpret the model's predictions.
Introduction
Predicting whether cancer cells are benign or malignant is a critical task in medical diagnosis. In this tutorial, I'll walk you through building a Support Vector Machine (SVM) classifier to predict cancer cell types based on cell characteristics. This project demonstrates how machine learning can assist in medical diagnostics by analyzing cell features to classify tumors.
Understanding the Problem
We're working with a dataset that contains various characteristics of cell samples, such as clump thickness, cell size uniformity, and other morphological features. Our goal is to classify these cells into two categories:
- Class 2: Benign (non-cancerous)
- Class 4: Malignant (cancerous)
Dataset Overview
The dataset includes several features that describe cell characteristics (a quick loading-and-inspection sketch follows this list):
- Clump: Clump thickness
- UnifSize: Uniformity of cell size
- UnifShape: Uniformity of cell shape
- MargAdh: Marginal adhesion
- SingEpiSize: Single epithelial cell size
- BareNuc: Bare nuclei
- BlandChrom: Bland chromatin
- NormNucl: Normal nucleoli
- Mit: Mitoses
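A quick way to load the data and check how the two classes are distributed, as a minimal sketch (the file name cell_samples.csv is just a placeholder for wherever your copy of the dataset lives):

import pandas as pd

# 'cell_samples.csv' is a placeholder path; point it at your own copy of the data
df = pd.read_csv('cell_samples.csv')
print(df.shape)                       # total rows and columns before any cleaning
print(df['Class'].value_counts())     # counts of benign (2) vs malignant (4) samples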
Data Exploration and Visualization
Before jumping into modeling, it's essential to understand what our data looks like. I started by visualizing the relationship between clump thickness and uniformity of cell size for both malignant and benign samples:
import matplotlib.pyplot as plt

# Take the first 50 samples of each class so the scatter plot stays readable
malignant = df[df['Class'] == 4].head(50)
benign = df[df['Class'] == 2].head(50)

plt.scatter(malignant['Clump'], malignant['UnifSize'], color='darkblue', marker='o', label='malignant')
plt.scatter(benign['Clump'], benign['UnifSize'], color='yellow', marker='o', label='benign')
plt.xlabel('Clump')
plt.ylabel('UnifSize')
plt.legend()
plt.show()
This visualization helps us see if there's a clear separation between benign and malignant cells based on these features, which gives us confidence that machine learning can learn to distinguish between them.
Data Preprocessing
One of the most important steps in any machine learning project is data preprocessing. When I examined the data, I discovered that the BareNuc column contained non-numerical values. Here's how I handled this:
Converting Non-Numerical Data
The BareNuc column contained values that couldn't be directly used in our model because they weren't numeric. I used pd.to_numeric() with errors='coerce' to convert the column to numeric values. This approach converts any non-convertible values to NaN (Not a Number), which allows us to identify and remove problematic rows:
# Coerce any non-numeric entries to NaN, drop those rows, then cast back to integers
df['BareNuc'] = pd.to_numeric(df['BareNuc'], errors='coerce')
df.dropna(subset=['BareNuc'], inplace=True)   # the frame drops from 699 to 683 rows
df['BareNuc'] = df['BareNuc'].astype('int')
After this preprocessing step, our dataset went from 699 entries to 683 entries, removing 16 rows with invalid data. This is a small price to pay for ensuring data quality.
Feature Selection
For our model, I selected the nine cell characteristic features as our independent variables (X) and the Class column as our dependent variable (y):
feature_df = df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = feature_df.values
y = df['Class']
Train/Test Split
To properly evaluate our model's performance, we need to split the data into training and testing sets. I used an 80/20 split, meaning 80% of the data is used to train the model and 20% is held back to test its performance on unseen data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
This resulted in:
- Training set: 546 samples
- Testing set: 137 samples
Setting random_state=4 ensures that we get the same split every time we run the code, which is important for reproducibility.
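A quick sanity check on the resulting array shapes confirms these counts:

print('Train set:', X_train.shape, y_train.shape)   # expect 546 samples
print('Test set:', X_test.shape, y_test.shape)      # expect 137 samples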
Building the SVM Model
Understanding SVM and Kernels
Support Vector Machines work by finding the optimal hyperplane that separates different classes in the feature space. However, real-world data is rarely linearly separable. That's where kernel functions come in.
Kernelling is the process of mapping data into a higher-dimensional space where it becomes linearly separable. Think of it like this: if you can't separate two groups on a flat surface, you might be able to separate them if you lift one group into 3D space.
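Here's a toy illustration of that idea (separate from the cancer data): a class sandwiched between points of another class can't be split by a single threshold in one dimension, but a simple lifting map makes a straight-line separation possible:

import numpy as np

# One-dimensional data: class 0 sits between the class 1 points,
# so no single threshold on x can separate them
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
labels = np.array([1, 1, 0, 1, 1])

# Lift each point with the mapping x -> (x, x**2); in this 2D space the
# horizontal line x**2 = 0.5 cleanly separates the two classes
lifted = np.column_stack([x, x ** 2])
print(lifted)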
RBF Kernel
I started with the Radial Basis Function (RBF) kernel, which is particularly effective for non-linear classification problems:
from sklearn import svm

# Fit an SVM with the RBF kernel, then predict labels for the held-out test set
svm_clf = svm.SVC(kernel='rbf')
svm_clf.fit(X_train, y_train)
yhat = svm_clf.predict(X_test)
The RBF kernel transforms the data into an infinite-dimensional space, allowing the SVM to find complex decision boundaries. This makes it very powerful for capturing intricate patterns in medical data.
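To make this more concrete: the RBF kernel never builds that space explicitly, it only computes a similarity score exp(-gamma * ||x - x'||^2) between pairs of samples. A small sketch comparing a manual computation against scikit-learn's rbf_kernel helper (the gamma value here is arbitrary, chosen just for illustration):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two samples from the training data
x1, x2 = X_train[0], X_train[1]
gamma = 0.1  # illustrative value; SVC's default ('scale') is derived from the data

manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
from_library = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=gamma)[0, 0]
print(manual, from_library)  # the two values agree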
Model Evaluation
Confusion Matrix
The confusion matrix is a powerful tool that shows us exactly where our model is making correct and incorrect predictions:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, yhat)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['2(benign)','4(malignant)'])
disp.plot()
plt.show()
This visual representation helps us understand not just how many predictions were wrong, but specifically which types of errors the model is making, which is crucial in medical applications.
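The per-class precision figures quoted in the TL;DR come from a classification report; a minimal sketch to reproduce them alongside overall accuracy:

from sklearn.metrics import classification_report, accuracy_score

print(classification_report(y_test, yhat, target_names=['benign (2)', 'malignant (4)']))
print('Accuracy:', accuracy_score(y_test, yhat))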
F1-Score
The F1-score provides a balanced measure of our model's performance by considering both precision (how many predicted positives are actually positive) and recall (how many actual positives we correctly identified):
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
Using average='weighted' is important here because it accounts for any class imbalance in our dataset, giving us a more accurate picture of overall model performance.
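For intuition, a single class's F1-score can be recomputed by hand as 2 * precision * recall / (precision + recall); a small sketch treating the malignant class (4) as the positive label:

from sklearn.metrics import precision_score, recall_score

prec = precision_score(y_test, yhat, pos_label=4)   # malignant treated as the positive class
rec = recall_score(y_test, yhat, pos_label=4)
print(2 * prec * rec / (prec + rec))                # matches f1_score(y_test, yhat, pos_label=4)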
Jaccard Index
The Jaccard score measures the similarity between our predicted set and the actual set of classifications:
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat, pos_label=2)
I specified pos_label=2 because I'm treating the benign class (Class 2) as the positive case. This metric is particularly useful when we want to understand how much overlap exists between our predictions and the actual labels.
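Equivalently, the Jaccard score can be read straight off the confusion matrix as TP / (TP + FP + FN), the overlap between predicted and actual benign samples divided by their union:

from sklearn.metrics import confusion_matrix

# Put benign (2) last so it lands in the bottom-right (true positive) cell
tn, fp, fn, tp = confusion_matrix(y_test, yhat, labels=[4, 2]).ravel()
print(tp / (tp + fp + fn))   # same value as jaccard_score(y_test, yhat, pos_label=2)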
Comparing Kernel Functions
To ensure we're using the best approach, I also tested the model with a linear kernel:
# Refit with a linear kernel and score it with the same metrics for a direct comparison
svm_clf = svm.SVC(kernel='linear')
svm_clf.fit(X_train, y_train)
yhat = svm_clf.predict(X_test)
print('F1 (weighted):', f1_score(y_test, yhat, average='weighted'))
print('Jaccard (benign):', jaccard_score(y_test, yhat, pos_label=2))
The linear kernel assumes the data can be separated by a straight line (or hyperplane in higher dimensions). While this is computationally faster, it's less flexible than the RBF kernel.
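If you want a more systematic comparison than refitting kernels by hand, a cross-validated grid search can score several kernel and regularization settings in one pass. A minimal sketch (the grid values are illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

# Illustrative (untuned) grid over kernel type and regularization strength
param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
search = GridSearchCV(svm.SVC(), param_grid, scoring='f1_weighted', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)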
Key Takeaways
Data Quality Matters: Preprocessing steps like handling non-numerical values are crucial for model success. Even though we lost 16 rows, ensuring data quality improved our model's reliability.
Kernel Selection: The choice between RBF and linear kernels depends on your data. RBF kernels are more flexible and can capture complex, non-linear relationships, making them ideal for medical data where relationships between features can be intricate.
Multiple Metrics: Using multiple evaluation metrics (confusion matrix, F1-score, Jaccard score) gives us a comprehensive understanding of model performance. In medical applications, understanding the types of errors (false positives vs. false negatives) is just as important as overall accuracy.
Feature Importance: The nine morphological features we used capture different aspects of cell characteristics. The model learns which combinations of these features are most indicative of malignancy.
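One rough way to peek at those learned combinations is to inspect the weight vector of a linear-kernel SVM; this sketch applies to the linear kernel only, since an RBF SVM has no direct per-feature weights:

# Fit a linear-kernel SVM and pair its learned weights with the feature names
linear_svm = svm.SVC(kernel='linear').fit(X_train, y_train)
weights = pd.Series(linear_svm.coef_[0], index=feature_df.columns)
print(weights.sort_values(ascending=False))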
Practical Applications
This type of model can assist medical professionals by:
- Providing a second opinion on cell classifications
- Flagging potentially malignant samples for further review
- Speeding up the diagnostic process by quickly analyzing cell characteristics
- Reducing human error in routine classifications
Conclusion
Support Vector Machines, especially with kernel functions like RBF, are powerful tools for medical classification tasks. By mapping cell characteristics into higher-dimensional spaces, we can identify complex patterns that distinguish benign from malignant cells. The key to success lies in proper data preprocessing, thoughtful feature selection, and comprehensive model evaluation using multiple metrics.
This project demonstrates that machine learning can be a valuable tool in medical diagnostics, though it should always be used in conjunction with expert medical judgment rather than as a replacement for it.
📓 Jupyter Notebook
Want to explore the complete code and run it yourself? Access the full Jupyter notebook with detailed implementations and visualizations:
You can also run it interactively:
