Unsupervised Document Clustering of BBC News Articles with K-Means
📌 TL;DR
Applied K-means clustering to 2,225 BBC News articles (2004-2005) using term frequency features. K=5 achieved 76.4% homogeneity and 76.7% completeness against editorial categories (business, entertainment, politics, sport, tech). Systematic experimentation with K=4-8 revealed optimal balance at K=5, successfully identifying specialized clusters including gaming (100% "game" term coverage), politics ("mr", "govern"), tech ("use", "technolog"), and entertainment ("film", "award"). Built cosine similarity-based classifier for new document assignment. Demonstrates unsupervised learning's power to discover semantic structure from raw term frequencies, with word cloud visualizations and quantitative evaluation metrics.
Introduction
Can an algorithm automatically organize news articles by topic without being told what the categories are? This is the power of unsupervised learning. I tackled this challenge using K-means clustering on 2,225 BBC News articles spanning business, entertainment, politics, sport, and tech.
The fascinating part? The algorithm never sees category labels during training. It discovers natural groupings purely from word usage patterns. The goal was to see how well these data-driven clusters align with human editorial categories and whether the algorithm might discover patterns humans missed.
The Dataset: BBC News 2004-2005
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
tf_df = pd.read_csv("BBC_News_5_Categories/BBC_News_5_TF.csv", index_col=0)
features_df = pd.read_csv("BBC_News_5_Categories/BBC_News_5_Features.csv", header=None)
classes_df = pd.read_csv("BBC_News_5_Categories/BBC_News_5_Classes.csv", header=None, index_col=0)
print(f"Documents: {tf_df.shape[1]}") # 2225
print(f"Terms: {tf_df.shape[0]}") # 6167
The data comes as a term-document matrix: rows are terms (words), columns are documents (articles). Each cell contains the frequency of that term in that document.
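For intuition, here's a toy version of the same layout with made-up words and counts (not from the dataset):
import pandas as pd
# Toy term-document matrix in the same orientation: terms as rows, documents as columns
toy_tf = pd.DataFrame(
    {"doc_0": [3, 0, 1], "doc_1": [0, 2, 2]},
    index=["market", "film", "year"],
)
print(toy_tf.loc["year", "doc_1"])  # frequency of "year" in doc_1 -> 2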
Exploring Term Frequencies
term_frequencies = tf_df.sum(axis=1)
tf_sorted = term_frequencies.sort_values(ascending=False)
print(tf_sorted.head(20))
This gives us the most common words across all articles. When I plotted this:
plt.plot(sorted(term_frequencies, reverse=True))
plt.xlabel("Terms")
plt.ylabel("Frequency")
plt.title("Term Frequency Distribution (Zipf's Law)")
plt.show()
Classic Zipf's law! A few terms appear extremely frequently (like "year", "mr", "game"), while most terms are rare. This is a fundamental property of natural language.
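Zipf's law is easiest to confirm on log-log axes, where a Zipfian rank-frequency curve flattens into a near-straight line. A small variant of the plot above, reusing the term_frequencies series:
# Same distribution on log-log axes; ranks start at 1 to stay on the log scale
ranks = np.arange(1, len(term_frequencies) + 1)
plt.loglog(ranks, sorted(term_frequencies, reverse=True))
plt.xlabel("Term rank (log scale)")
plt.ylabel("Frequency (log scale)")
plt.title("Rank-Frequency on Log-Log Axes")
plt.show()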
Train-Test Split
from sklearn.model_selection import train_test_split
classes_np = classes_df.to_numpy().ravel()
X_train, X_test, y_train, y_test = train_test_split(
tf_df.T, classes_np, test_size=0.2, random_state=99
)
print(f"X_train: {X_train.shape}") # (1780, 6167)
print(f"X_test: {X_test.shape}") # (445, 6167)
Notice the .T transpose! Scikit-learn and most ML tooling expect rows as samples and columns as features. Our data had terms as rows, so I transposed to get documents as rows.
Even though clustering is unsupervised, I created a test set to evaluate how well the discovered clusters generalize to new documents.
Building Custom K-Means from Scratch
Here's the K-means implementation I used (custom, not scikit-learn):
from scipy.spatial import distance
import numpy.ma as ma
def kmeans(Data, K=3, max_iterations=20, metric="cosine", mask_zeros=False):
"""
K-means clustering with configurable distance metric.
Args:
Data: N x M matrix (N samples, M features)
K: number of clusters
max_iterations: max iterations if no convergence
metric: distance metric (euclidean, cosine, etc.)
mask_zeros: whether to mask zeros when computing centroids
Returns:
clusters: cluster labels for each instance
centroids: K centroid vectors
"""
# Random initialization: pick K random samples as initial centroids
idx = np.random.choice(len(Data), K, replace=False)
centroids = Data[idx, :]
# Initial assignment: find nearest centroid for each sample
clusters = np.argmin(distance.cdist(Data, centroids, metric), axis=1)
    for j in range(max_iterations):
        new_centroids = np.zeros((K, Data.shape[1]))
        for i in range(K):
            cluster_i = Data[clusters == i, :]
            if cluster_i.shape[0] == 0:
                # Guard: an empty cluster keeps its previous centroid
                # instead of producing a NaN mean over zero samples
                new_centroids[i] = centroids[i]
            elif mask_zeros:
                mc = ma.masked_array(cluster_i, mask=(cluster_i == 0))
                new_centroids[i] = mc.mean(axis=0)
            else:
                new_centroids[i] = cluster_i.mean(axis=0)
        centroids = new_centroids
# Reassign clusters based on new centroids
new_clusters = np.argmin(distance.cdist(Data, centroids, metric), axis=1)
# Check convergence
if np.array_equal(clusters, new_clusters):
break
clusters = new_clusters
return clusters, centroids
Why Cosine Distance for Text?
The metric="cosine" parameter is critical. For text data, cosine distance captures semantic similarity regardless of document length.
- A 100-word article and a 1000-word article about the same topic should be similar
- Cosine distance focuses on which words appear together, not how many total words
- Euclidean distance fails in high-dimensional sparse spaces (most terms don't appear in most documents)
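Here's a minimal numeric sketch of that behavior, using made-up term counts:
import numpy as np
from scipy.spatial import distance
# A short and a long document with identical word proportions
short_doc = np.array([2.0, 1.0, 0.0, 1.0])
long_doc = short_doc * 10  # same topic, 10x the length
other_topic = np.array([0.0, 0.0, 5.0, 1.0])
print(distance.cosine(short_doc, long_doc))      # ~0.0: same direction, length ignored
print(distance.euclidean(short_doc, long_doc))   # ~22.0: magnitude dominates
print(distance.cosine(short_doc, other_topic))   # ~0.92: different word patterns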
Running Initial Clustering
X_train_np = X_train.to_numpy()
clusters, centroids = kmeans(X_train_np, K=3)
print(f"Cluster 0: {(clusters == 0).sum()} documents")
print(f"Cluster 1: {(clusters == 1).sum()} documents")
print(f"Cluster 2: {(clusters == 2).sum()} documents")
Output:
Cluster 0: 716 documents
Cluster 1: 438 documents
Cluster 2: 626 documents
Reasonably balanced clusters! Now the question is: what do they represent?
Analyzing Clusters: What Topics Did We Find?
I created a comprehensive cluster analysis function:
def cluster_report(data, cluster, centroids, features):
"""
Generate detailed report for each cluster.
Returns:
cluster_dict: {cluster_label: DataFrame with Freq, DF, % of Docs}
cluster_sizes: {cluster_label: size}
"""
cluster_sizes = {}
cluster_dict = {}
for cluster_label in np.unique(cluster):
cluster_size = (cluster == cluster_label).sum()
cluster_sizes[cluster_label] = cluster_size
# Get documents in this cluster
label_idx = np.where(cluster == cluster_label)
cluster_tf = data[label_idx]
# Document Frequency: how many docs contain each term
df = (cluster_tf > 0).sum(axis=0)
pct_docs = (df / cluster_size) * 100
# Use centroid directly for frequency weights
freq = centroids[cluster_label]
report = pd.DataFrame({
'': features,
'Freq': freq,
'DF': df,
'% of Docs': pct_docs
})
report = report.set_index('')
cluster_dict[cluster_label] = report
return cluster_dict, cluster_sizes
This function computes three key metrics for each term in each cluster:
- Freq: Centroid weight (mean frequency across cluster documents)
- DF: Document frequency (how many cluster docs contain this term)
- % of Docs: DF as percentage of cluster size
Displaying Top Terms
features_np = features_df.to_numpy().ravel()
cluster_dict, cluster_sizes = cluster_report(X_train_np, clusters, centroids, features_np)
def display_clusters_summary(cluster_dict, cluster_sizes, num_terms=10):
for c in cluster_dict.keys():
cluster_rep = cluster_dict[c]
print(f"\nCluster {c} | size = {cluster_sizes[c]}")
print("----------------------------")
print(cluster_rep.sort_values(by="Freq", ascending=False).head(num_terms))
display_clusters_summary(cluster_dict, cluster_sizes, 10)
Output for K=3:
Cluster 0 | size = 716
----------------------------
Freq DF % of Docs
year 1.47 494 68.99
game 1.45 301 42.04
play 1.10 360 50.28
film 1.03 177 24.72
time 0.91 360 50.28
best 0.89 223 31.15
win 0.86 313 43.72
Cluster 1 | size = 438
----------------------------
Freq DF % of Docs
year 2.14 348 79.45
bn 1.38 194 44.29
compani 1.15 216 49.32
market 1.09 219 50.00
firm 0.92 190 43.38
Cluster 2 | size = 626
----------------------------
Freq DF % of Docs
mr 3.06 418 66.77
peopl 1.93 414 66.13
use 1.52 336 53.67
say 1.21 338 53.99
govern 1.06 251 40.10
Interpretation
Cluster 0 (716 docs) - Sports & Entertainment:
Mixed cluster with "game" (42%), "play" (50%), "film" (25%). Sports and entertainment share similar language ("year", "best", "win").
Cluster 1 (438 docs) - Business & Economics:
Clear business focus! "bn" (billion) in 44% of docs, "compani" in 49%, "market" in 50%. Strong financial terminology.
Cluster 2 (626 docs) - Politics & General News:
"mr" appears in 67% of documents (referring to officials), "govern" in 40%, "peopl" in 66%. Classic political discourse.
Systematic Experimentation: Finding Optimal K
The real question: what's the best number of clusters? I tested K=4 through K=8 with multiple runs each.
from collections import Counter
for k in range(4, 9):
print(f"\n{'='*50}")
print(f"Testing K = {k}")
print('='*50)
for run in range(5):
clusters, centroids = kmeans(X_train_np, K=k)
print(f"\nRun {run + 1}:")
for cluster in np.unique(clusters):
            cluster_idx = np.where(clusters == cluster)[0]
            # Index into y_train, not classes_np: the clusters refer to the shuffled training set
            categories = [y_train[i] for i in cluster_idx]
counts = dict(Counter(categories).most_common())
print(f" Cluster {cluster} (n={len(cluster_idx)}): {counts}")
This shows me which actual categories ended up in each cluster, helping identify meaningful patterns.
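A more compact view of any single run is a contingency table. Here's a sketch using pandas (my addition, assuming one run's clusters aligned with y_train):
# Rows: true editorial categories; columns: discovered clusters
print(pd.crosstab(pd.Series(y_train, name="category"),
                  pd.Series(clusters, name="cluster")))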
Key Findings
K=4: Politics and tech separate nicely, but sports/entertainment still mixed.
K=5 ⭐ (Optimal): Something amazing happened!
best_k = 5
best_clusters, best_centroids = kmeans(X_train_np, K=best_k)
cluster_dict, sizes = cluster_report(X_train_np, best_clusters, best_centroids, features_np)
# One run produced this:
# Cluster 0 (234 docs): Gaming cluster - "game" freq 8.99, appears in 100% of docs!
# Cluster 1 (189 docs): Entertainment - "film" (2.86), "award" (1.59), "star" (1.27)
# Cluster 2 (245 docs): Sports - "play" (1.21), "win" (1.12), "england" (0.93)
# Cluster 3 (963 docs): Politics - "mr", "govern", "labour", "elect"
# Cluster 4 (149 docs): Technology - "use", "technolog", "mobil"
K=5 successfully separated:
- Video games from traditional sports (!)
- Film/entertainment from sports events
- Clear political discourse
- Technology topics
The gaming cluster is remarkable: "game" appears in 100% of its documents with a mean frequency of 8.99! The algorithm discovered that gaming articles have vocabulary distinctive enough to deserve their own cluster.
K=6-8: Over-segmentation. Clusters become too small (41-doc clusters at K=8) and unstable across runs.
Word Cloud Visualizations
To make clusters interpretable at a glance, I created word clouds:
from wordcloud import WordCloud
def plot_wordclouds(cluster_dict):
for cluster_label, df in cluster_dict.items():
freq_dict = df["Freq"].to_dict()
wc = WordCloud(
width=800,
height=400,
background_color="white",
colormap="viridis"
)
wc.generate_from_frequencies(freq_dict)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title(f"Cluster {cluster_label}", fontsize=16)
plt.show()
plot_wordclouds(cluster_dict)
Word clouds scale term size by frequency, making dominant vocabulary immediately visible. Business clusters show "compani", "market", "bn" prominently. Political clusters emphasize "mr", "govern", "labour". Sports clusters highlight "game", "play", "win".
These visualizations are perfect for stakeholder presentations: no need to explain term frequency tables!
Quantitative Evaluation: How Well Did We Do?
Time for objective metrics. I compared discovered clusters against the original editorial categories.
from sklearn.metrics import completeness_score, homogeneity_score
k5_clusters, k5_centroids = kmeans(X_train_np, K=5)
homogeneity = homogeneity_score(y_train, k5_clusters)
completeness = completeness_score(y_train, k5_clusters)
print(f"Homogeneity Score: {homogeneity:.4f}") # 0.7641
print(f"Completeness Score: {completeness:.4f}") # 0.7672
Understanding the Metrics
Homogeneity (0.7641): Do clusters contain only documents from a single category?
→ Each cluster is 76.4% "pure". Some mixing occurs, but clusters are mostly coherent.
Completeness (0.7672): Are all documents from a category in the same cluster?
→ Category members are 76.7% grouped together. Some fragmentation exists.
Both scores range 0-1, where 1 is perfect. These are strong scores for completely unsupervised learning!
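To build intuition for what each metric penalizes, here's a toy example with made-up labels:
from sklearn.metrics import completeness_score, homogeneity_score
labels_true = ["a", "a", "a", "b", "b", "b"]
labels_pred = [0, 0, 1, 1, 1, 1]
# Cluster 1 mixes classes a and b -> homogeneity < 1
print(homogeneity_score(labels_true, labels_pred))
# Class a is split across clusters 0 and 1 -> completeness < 1
print(completeness_score(labels_true, labels_pred))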
Why Not Perfect?
The imperfect scores make sense:
- Vocabulary overlap: Business and tech both discuss "compani", "new", "technolog"
- Ambiguous content: Some articles span categories (e.g., business of entertainment)
- K-means assumptions: Spherical clusters and equal variance; natural language doesn't always follow these
The algorithm discovered structure that aligns with human editorial decisions about 76% of the time, using only word frequencies. That's impressive!
Classifying New Documents
Now that I have the cluster structure, how do I classify new articles?
def classify_document_cosine(doc_vector, centroids):
"""
Assign document to nearest cluster using cosine similarity.
Returns:
cluster_label: assigned cluster
max_similarity: similarity score to that cluster
"""
# Normalize document and centroids
doc_norm = np.linalg.norm(doc_vector)
centroids_norm = np.linalg.norm(centroids, axis=1)
# Compute cosine similarities
if doc_norm > 0:
cosine_sims = np.dot(centroids, doc_vector) / (centroids_norm * doc_norm)
else:
cosine_sims = np.zeros(len(centroids))
# Assign to highest similarity
cluster_label = np.argmax(cosine_sims)
max_similarity = cosine_sims[cluster_label]
return cluster_label, max_similarity
Classifying Test Set
X_test_np = X_test.to_numpy()
bbc_news_df = pd.read_csv("BBC_News_5_Categories/bbc-5categories.csv")
titles = bbc_news_df["title"].to_numpy()
results = []
for i in range(len(X_test_np)):
doc_id = X_test.index[i]
title = titles[doc_id]
    # Classify against the K=5 centroids fitted above
    cluster, similarity = classify_document_cosine(X_test_np[i], k5_centroids)
results.append({
'Document ID': doc_id,
'Title': title,
'Cluster': cluster,
'Similarity': similarity
})
results_df = pd.DataFrame(results)
print(results_df.head(20))
Sample output:
Document ID Title Cluster Similarity
0 1823 EU to probe Apple's French deal 1 0.923
1 1456 Blair backs school discipline 2 0.887
2 2034 England crash out of Under-21s 0 0.945
3 1289 Nokia unveils N-series handset 1 0.912
...
Documents with high similarity (>0.9) show strong topical match. Lower similarities indicate mixed characteristics or ambiguous content.
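One natural extension (my suggestion, with a hypothetical untuned cutoff) is to flag low-similarity assignments for manual review:
# Flag ambiguous documents whose best cosine similarity falls below a cutoff
REVIEW_THRESHOLD = 0.5  # hypothetical cutoff, would need tuning
flagged = results_df[results_df["Similarity"] < REVIEW_THRESHOLD]
print(f"{len(flagged)} documents flagged for manual review")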
This system enables real-time automatic categorization of incoming news articles: no manual tagging required!
What I Learned: Key Insights
1. Topic Structure Emerges from Word Patterns
Without any category labels, K-means discovered meaningful topical groupings purely from term frequencies. This demonstrates that semantic structure is encoded in word usage patterns.
2. Cosine Distance is Essential for Text
Using Euclidean distance would have failed. Text data is:
- High-dimensional: 6,167 features
- Sparse: Most terms don't appear in most documents
- Variable length: Documents have different word counts
Cosine distance handles all these properties by measuring angle between vectors, not magnitude.
3. Optimal K Requires Experimentation
There's no formula for perfect K. I had to:
- Test multiple values (K=4-8)
- Run multiple trials per K (random initialization varies)
- Balance cluster stability vs. specialization
- Consider domain knowledge (5 original categories suggested K=5)
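A complementary heuristic I could have used is silhouette analysis, which scores how tight and well-separated clusters are. A sketch assuming the kmeans() function defined earlier:
from sklearn.metrics import silhouette_score
# Silhouette ranges from -1 to 1; higher means better-separated clusters
for k in range(4, 9):
    labels, _ = kmeans(X_train_np, K=k)
    score = silhouette_score(X_train_np, labels, metric="cosine")
    print(f"K={k}: silhouette = {score:.3f}")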
4. Random Initialization Matters
K-means results vary with initial centroid placement. That's why I ran 5 trials per K value. Some runs produced better clusters than others; this is expected with random initialization. One way to make the selection systematic is sketched below.
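Here's a minimal sketch (my addition, not from the notebook) that keeps the best of several runs, scoring each by the total cosine distance of documents to their assigned centroids:
# Keep the run with the lowest within-cluster cosine distance (the K-means objective)
best_score, best_result = np.inf, None
for trial in range(5):
    labels, cents = kmeans(X_train_np, K=5)
    dists = distance.cdist(X_train_np, cents, "cosine")
    score = dists[np.arange(len(labels)), labels].sum()
    if score < best_score:
        best_score, best_result = score, (labels, cents)
best_labels, best_cents = best_result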
5. The Algorithm Found Something Humans Didn't
The gaming cluster at K=5 wasn't an original category! The algorithm discovered that video game articles have distinctive enough vocabulary to warrant separation from general sports. This is the power of unsupervised learning: it can reveal patterns humans didn't explicitly define.
6. Vocabulary Overlap Limits Separation
Business and tech articles are the hardest to separate because they share terminology. Sports and entertainment also overlap through action language ("win", "best", "star"). Perfect clustering is impossible when categories naturally blend.
Practical Applications
These techniques power real-world systems:
News Aggregators: Automatically organize articles by topic without manual tagging
Content Management: Route incoming articles to appropriate sections
Recommendation Engines: Find similar articles by cluster membership
Search Enhancement: Filter results by automatically detected topic
Trend Detection: Identify emerging topics by monitoring cluster evolution
When to Use K-Means Clustering
Use K-means when:
- You have no labeled training data
- You want to discover natural groupings
- You can experiment with different K values
- Your data works with distance metrics (numerical, TF-IDF weighted text)
- You need fast, scalable clustering (K-means is efficient)
Don't use K-means when:
- You know the categories and have labeled data (use supervised learning instead)
- Clusters have irregular shapes (use DBSCAN or hierarchical clustering)
- You can't tune K experimentally
- Features are categorical (K-means assumes numerical data)
Conclusion
This project proved that unsupervised learning can discover meaningful structure in text data. K-means clustering with K=5 achieved 76% homogeneity and completeness against human editorial categories, using only term frequencies.
The real insight came from systematic experimentation. By testing K=4-8 with multiple runs, I found that K=5 provides the best balance: it matches the editorial categories while revealing specialized subcategories like gaming. The algorithm successfully separated politics, technology, business, entertainment, and sports based purely on word usage patterns.
Building this from scratch taught me that text clustering requires:
- Proper distance metrics (cosine for text)
- Careful data formatting (documents as rows)
- Systematic K selection (test multiple values)
- Qualitative and quantitative evaluation (term analysis + metrics)
- Multiple runs to account for initialization variance
The fact that an algorithm can organize 2,225 articles into meaningful topics without seeing a single label demonstrates the power of unsupervised learning. Combined with word clouds for visualization and cosine similarity for classification, this creates a complete document organization pipeline.
Understanding clustering deeply, not just calling sklearn.cluster.KMeans(), enables making informed decisions about distance metrics, cluster counts, and evaluation approaches. That's the difference between using machine learning and truly understanding it.
📓 Jupyter Notebook
Want to explore the complete code and run it yourself? The full Jupyter notebook includes the detailed implementations and visualizations shown here.
