Unsupervised Document Clustering of BBC News Articles with K-Means

November 2, 2025 · Jonesh Shrestha

📌TL;DR

Applied K-means clustering to 2,225 BBC News articles (2004-2005) using term frequency features. K=5 achieved 76.4% homogeneity and 76.7% completeness against editorial categories (business, entertainment, politics, sport, tech). Systematic experimentation with K=4-8 revealed optimal balance at K=5, successfully identifying specialized clusters including gaming (100% "game" term coverage), politics ("mr", "govern"), tech ("use", "technolog"), and entertainment ("film", "award"). Built cosine similarity-based classifier for new document assignment. Demonstrates unsupervised learning's power to discover semantic structure from raw term frequencies, with word cloud visualizations and quantitative evaluation metrics.

Introduction

Can an algorithm automatically organize news articles by topic without being told what the categories are? This is the power of unsupervised learning. I tackled this challenge using K-means clustering on 2,225 BBC News articles spanning business, entertainment, politics, sport, and tech.

The fascinating part? The algorithm never sees category labels during training. It discovers natural groupings purely from word usage patterns. The goal was to see how well these data-driven clusters align with human editorial categories and whether the algorithm might discover patterns humans missed.

The Dataset: BBC News 2004-2005

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tf_df = pd.read_csv("BBC_News_5_Categories/BBC_News_5_TF.csv", index_col=0)
features_df = pd.read_csv("BBC_News_5_Categories/BBC_News_5_Features.csv", header=None)
classes_df = pd.read_csv("BBC_News_5_Categories/BBC_News_5_Classes.csv", header=None, index_col=0)

print(f"Documents: {tf_df.shape[1]}")    # 2225
print(f"Terms: {tf_df.shape[0]}")        # 6167

The data comes as a term-document matrix: rows are terms (words), columns are documents (articles). Each cell contains the frequency of that term in that document.
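
To get a feel for this layout, here's a small optional peek at one article's most frequent terms (just slicing the first column; any column works):

# Peek at the highest-frequency terms of one article (column 0 of the matrix)
first_doc = tf_df.iloc[:, 0]
print(first_doc[first_doc > 0].sort_values(ascending=False).head(10))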

Exploring Term Frequencies

term_frequencies = tf_df.sum(axis=1)
tf_sorted = term_frequencies.sort_values(ascending=False)

print(tf_sorted.head(20))

This gives us the most common words across all articles. When I plotted this:

plt.plot(sorted(term_frequencies, reverse=True))
plt.xlabel("Terms")
plt.ylabel("Frequency")
plt.title("Term Frequency Distribution (Zipf's Law)")
plt.show()

Classic Zipf's law! A few terms appear extremely frequently (like "year", "mr", "game"), while most terms are rare. This is a fundamental property of natural language.
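
If you want to see the Zipf shape more explicitly, frequency plotted against rank on log-log axes should come out roughly as a straight line; a minimal sketch reusing tf_sorted from above:

ranks = np.arange(1, len(tf_sorted) + 1)
plt.loglog(ranks, tf_sorted.values)   # a roughly straight line is the Zipf signature
plt.xlabel("Rank (log scale)")
plt.ylabel("Frequency (log scale)")
plt.title("Term Frequency vs. Rank")
plt.show()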

Train-Test Split

from sklearn.model_selection import train_test_split

classes_np = classes_df.to_numpy().ravel()

X_train, X_test, y_train, y_test = train_test_split(
    tf_df.T, classes_np, test_size=0.2, random_state=99
)

print(f"X_train: {X_train.shape}")  # (1780, 6167)
print(f"X_test: {X_test.shape}")    # (445, 6167)

Notice the .T transpose! Machine learning libraries like scikit-learn expect rows to be samples and columns to be features. Our data had terms as rows, so I transposed to get documents as rows.

Even though clustering is unsupervised, I created a test set to evaluate how well the discovered clusters generalize to new documents.

Building Custom K-Means from Scratch

Here's the K-means implementation I used (custom, not scikit-learn):

from scipy.spatial import distance
import numpy.ma as ma

def kmeans(Data, K=3, max_iterations=20, metric="cosine", mask_zeros=False):
    """
    K-means clustering with configurable distance metric.

    Args:
        Data: N x M matrix (N samples, M features)
        K: number of clusters
        max_iterations: max iterations if no convergence
        metric: distance metric (euclidean, cosine, etc.)
        mask_zeros: whether to mask zeros when computing centroids

    Returns:
        clusters: cluster labels for each instance
        centroids: K centroid vectors
    """
    # Random initialization: pick K random samples as initial centroids
    idx = np.random.choice(len(Data), K, replace=False)
    centroids = Data[idx, :]

    # Initial assignment: find nearest centroid for each sample
    clusters = np.argmin(distance.cdist(Data, centroids, metric), axis=1)

    for j in range(max_iterations):
        centroids = np.zeros((K, Data.shape[1]))

        for i in range(K):
            cluster_i = Data[clusters == i, :]

            if mask_zeros:
                mc = ma.masked_array(cluster_i, mask=(cluster_i == 0))
                centroids[i] = mc.mean(axis=0)
            else:
                centroids[i] = cluster_i.mean(axis=0)

        # Reassign clusters based on new centroids
        new_clusters = np.argmin(distance.cdist(Data, centroids, metric), axis=1)

        # Check convergence
        if np.array_equal(clusters, new_clusters):
            break

        clusters = new_clusters

    return clusters, centroids

Why Cosine Distance for Text?

The metric="cosine" parameter is critical. For text data, cosine distance captures semantic similarity regardless of document length (the short sketch after the list below makes the difference concrete).

  • A 100-word article and a 1000-word article about the same topic should be similar
  • Cosine distance focuses on which words appear together, not how many total words
  • Euclidean distance fails in high-dimensional sparse spaces (most terms don't appear in most documents)
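
Here's a tiny illustrative sketch with made-up three-term count vectors: under cosine distance the long same-topic document is essentially identical, while Euclidean distance penalizes it for length and would even call the off-topic document closer.

from scipy.spatial import distance
import numpy as np

short_doc = np.array([2.0, 1.0, 0.0])    # hypothetical counts for ["game", "win", "market"]
long_doc  = np.array([20.0, 10.0, 0.0])  # same topic, 10x the length
other_doc = np.array([0.0, 1.0, 2.0])    # different topic, similar length

print(distance.cosine(short_doc, long_doc))     # ~0.0  -> same direction, same topic
print(distance.euclidean(short_doc, long_doc))  # ~20.1 -> penalized for length
print(distance.cosine(short_doc, other_doc))    # ~0.8  -> correctly farther away
print(distance.euclidean(short_doc, other_doc)) # ~2.8  -> misleadingly "closer" than long_doc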

Running Initial Clustering

X_train_np = X_train.to_numpy()
clusters, centroids = kmeans(X_train_np, K=3)

print(f"Cluster 0: {(clusters == 0).sum()} documents")
print(f"Cluster 1: {(clusters == 1).sum()} documents")
print(f"Cluster 2: {(clusters == 2).sum()} documents")

Output:

Cluster 0: 716 documents
Cluster 1: 438 documents
Cluster 2: 626 documents

Reasonably balanced clusters! Now the question is: what do they represent?

Analyzing Clusters: What Topics Did We Find?

I created a comprehensive cluster analysis function:

def cluster_report(data, cluster, centroids, features):
    """
    Generate detailed report for each cluster.

    Returns:
        cluster_dict: {cluster_label: DataFrame with Freq, DF, % of Docs}
        cluster_sizes: {cluster_label: size}
    """
    cluster_sizes = {}
    cluster_dict = {}

    for cluster_label in np.unique(cluster):
        cluster_size = (cluster == cluster_label).sum()
        cluster_sizes[cluster_label] = cluster_size

        # Get documents in this cluster
        label_idx = np.where(cluster == cluster_label)
        cluster_tf = data[label_idx]

        # Document Frequency: how many docs contain each term
        df = (cluster_tf > 0).sum(axis=0)
        pct_docs = (df / cluster_size) * 100

        # Use centroid directly for frequency weights
        freq = centroids[cluster_label]

        report = pd.DataFrame({
            '': features,
            'Freq': freq,
            'DF': df,
            '% of Docs': pct_docs
        })

        report = report.set_index('')
        cluster_dict[cluster_label] = report

    return cluster_dict, cluster_sizes

This function computes three key metrics for each term in each cluster (a tiny worked example follows the list):

  • Freq: Centroid weight (mean frequency across cluster documents)
  • DF: Document frequency (how many cluster docs contain this term)
  • % of Docs: DF as percentage of cluster size
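
As a tiny worked example (made-up numbers): suppose a 3-document cluster contains the term "market" with counts 4, 0, and 2.

counts = np.array([4, 0, 2])               # "market" counts across the cluster's 3 documents
freq = counts.mean()                        # Freq      = 2.0  (this term's centroid weight)
doc_freq = (counts > 0).sum()               # DF        = 2 documents contain the term
pct_docs = doc_freq / len(counts) * 100     # % of Docs = 66.7
print(freq, doc_freq, pct_docs)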

Displaying Top Terms

features_np = features_df.to_numpy().ravel()
cluster_dict, cluster_sizes = cluster_report(X_train_np, clusters, centroids, features_np)

def display_clusters_summary(cluster_dict, cluster_sizes, num_terms=10):
    for c in cluster_dict.keys():
        cluster_rep = cluster_dict[c]
        print(f"\nCluster {c} | size = {cluster_sizes[c]}")
        print("----------------------------")
        print(cluster_rep.sort_values(by="Freq", ascending=False).head(num_terms))

display_clusters_summary(cluster_dict, cluster_sizes, 10)

Output for K=3:

Cluster 0 | size = 716
----------------------------
        Freq   DF  % of Docs
year    1.47  494      68.99
game    1.45  301      42.04
play    1.10  360      50.28
film    1.03  177      24.72
time    0.91  360      50.28
best    0.89  223      31.15
win     0.86  313      43.72

Cluster 1 | size = 438
----------------------------
         Freq   DF  % of Docs
year     2.14  348      79.45
bn       1.38  194      44.29
compani  1.15  216      49.32
market   1.09  219      50.00
firm     0.92  190      43.38

Cluster 2 | size = 626
----------------------------
        Freq   DF  % of Docs
mr      3.06  418      66.77
peopl   1.93  414      66.13
use     1.52  336      53.67
say     1.21  338      53.99
govern  1.06  251      40.10

Interpretation

Cluster 0 (716 docs) - Sports & Entertainment:
Mixed cluster with "game" (42%), "play" (50%), "film" (25%). Sports and entertainment share similar language ("year", "best", "win").

Cluster 1 (438 docs) - Business & Economics:
Clear business focus! "bn" (billion) in 44% of docs, "compani" in 49%, "market" in 50%. Strong financial terminology.

Cluster 2 (626 docs) - Politics & General News:
"mr" appears in 67% of documents (referring to officials), "govern" in 40%, "peopl" in 66%. Classic political discourse.

Systematic Experimentation: Finding Optimal K

The real question: what's the best number of clusters? I tested K=4 through K=8 with multiple runs each.

from collections import Counter

for k in range(4, 9):
    print(f"\n{'='*50}")
    print(f"Testing K = {k}")
    print('='*50)

    for run in range(5):
        clusters, centroids = kmeans(X_train_np, K=k)

        print(f"\nRun {run + 1}:")
        for cluster in np.unique(clusters):
            cluster_idx = [i for i, label in enumerate(clusters) if label == cluster]
            categories = [y_train[i] for i in cluster_idx]
            counts = dict(Counter(categories).most_common())
            print(f"  Cluster {cluster} (n={len(cluster_idx)}): {counts}")

This shows me which actual categories ended up in each cluster, helping identify meaningful patterns.
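
A more compact view of the same cluster-vs-category breakdown for a single run is a contingency table; here's a quick sketch with pandas (assuming clusters holds that run's labels):

# Rows: true editorial categories, columns: discovered clusters
contingency = pd.crosstab(pd.Series(y_train, name="category"),
                          pd.Series(clusters, name="cluster"))
print(contingency)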

Key Findings

K=4: Politics and tech separate nicely, but sports/entertainment still mixed.

K=5 ⭐ (Optimal): Something amazing happened!

best_k = 5
best_clusters, best_centroids = kmeans(X_train_np, K=best_k)
cluster_dict, sizes = cluster_report(X_train_np, best_clusters, best_centroids, features_np)

# One run produced this:
# Cluster 0 (234 docs): Gaming cluster - "game" freq 8.99, appears in 100% of docs!
# Cluster 1 (189 docs): Entertainment - "film" (2.86), "award" (1.59), "star" (1.27)
# Cluster 2 (245 docs): Sports - "play" (1.21), "win" (1.12), "england" (0.93)
# Cluster 3 (963 docs): Politics - "mr", "govern", "labour", "elect"
# Cluster 4 (149 docs): Technology - "use", "technolog", "mobil"

K=5 successfully separated:

  • Video games from traditional sports (!)
  • Film/entertainment from sports events
  • Clear political discourse
  • Technology topics

The gaming cluster is remarkable: "game" appears in 100% of documents with frequency 8.99! The algorithm discovered that gaming articles have such distinctive vocabulary they deserve their own cluster.
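
Claims like this are easy to verify against a specific run. A small check (the gaming cluster's index varies from run to run; I assume here it came out as cluster 0 of best_clusters):

game_col = np.where(features_np == "game")[0][0]         # column index of the term "game"
gaming_docs = X_train_np[best_clusters == 0]             # documents assigned to the gaming cluster
coverage = (gaming_docs[:, game_col] > 0).mean() * 100   # share of those docs containing "game"
print(f'"game" appears in {coverage:.1f}% of gaming-cluster documents')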

K=6-8: Over-segmentation. Clusters become too small (41-doc clusters at K=8) and unstable across runs.

Word Cloud Visualizations

To make clusters interpretable at a glance, I created word clouds:

from wordcloud import WordCloud

def plot_wordclouds(cluster_dict):
    for cluster_label, df in cluster_dict.items():
        freq_dict = df["Freq"].to_dict()

        wc = WordCloud(
            width=800,
            height=400,
            background_color="white",
            colormap="viridis"
        )
        wc.generate_from_frequencies(freq_dict)

        plt.figure(figsize=(10, 5))
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"Cluster {cluster_label}", fontsize=16)
        plt.show()

plot_wordclouds(cluster_dict)

Word clouds scale term size by frequency, making dominant vocabulary immediately visible. Business clusters show "compani", "market", "bn" prominently. Political clusters emphasize "mr", "govern", "labour". Sports clusters highlight "game", "play", "win".

These visualizations are perfect for stakeholder presentations: no need to explain term frequency tables!

Quantitative Evaluation: How Well Did We Do?

Time for objective metrics. I compared discovered clusters against the original editorial categories.

from sklearn.metrics import completeness_score, homogeneity_score

k5_clusters, k5_centroids = kmeans(X_train_np, K=5)

homogeneity = homogeneity_score(y_train, k5_clusters)
completeness = completeness_score(y_train, k5_clusters)

print(f"Homogeneity Score: {homogeneity:.4f}")    # 0.7641
print(f"Completeness Score: {completeness:.4f}")  # 0.7672

Understanding the Metrics

Homogeneity (0.7641): Do clusters contain only documents from a single category?
→ Each cluster is 76.4% "pure". Some mixing occurs, but clusters are mostly coherent.

Completeness (0.7672): Are all documents from a category in the same cluster?
→ Category members are 76.7% grouped together. Some fragmentation exists.

Both scores range 0-1, where 1 is perfect. These are strong scores for completely unsupervised learning!
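
A toy example makes the difference between the two metrics concrete; this sketch scores two deliberately extreme labelings:

from sklearn.metrics import homogeneity_score, completeness_score

true_labels = ["sport", "sport", "politics", "politics"]

# Pure but fragmented clusters: perfect homogeneity, imperfect completeness
print(homogeneity_score(true_labels, [0, 1, 2, 2]))    # 1.0
print(completeness_score(true_labels, [0, 1, 2, 2]))   # < 1.0 (sport is split across clusters)

# Everything lumped into one cluster: zero homogeneity, perfect completeness
print(homogeneity_score(true_labels, [0, 0, 0, 0]))    # 0.0
print(completeness_score(true_labels, [0, 0, 0, 0]))   # 1.0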

Why Not Perfect?

The imperfect scores make sense:

  1. Vocabulary overlap: Business and tech both discuss "compani", "new", "technolog"
  2. Ambiguous content: Some articles span categories (e.g., business of entertainment)
  3. K-means assumptions: spherical clusters and roughly equal variance; natural language doesn't always follow these

The algorithm discovered structure that aligns roughly 76% with human editorial decisions, using only word frequencies. That's impressive!

Classifying New Documents

Now that I have a cluster structure, how do I classify new articles?

def classify_document_cosine(doc_vector, centroids):
    """
    Assign document to nearest cluster using cosine similarity.

    Returns:
        cluster_label: assigned cluster
        max_similarity: similarity score to that cluster
    """
    # Normalize document and centroids
    doc_norm = np.linalg.norm(doc_vector)
    centroids_norm = np.linalg.norm(centroids, axis=1)

    # Compute cosine similarities
    if doc_norm > 0:
        cosine_sims = np.dot(centroids, doc_vector) / (centroids_norm * doc_norm)
    else:
        cosine_sims = np.zeros(len(centroids))

    # Assign to highest similarity
    cluster_label = np.argmax(cosine_sims)
    max_similarity = cosine_sims[cluster_label]

    return cluster_label, max_similarity

Classifying Test Set

X_test_np = X_test.to_numpy()
bbc_news_df = pd.read_csv("BBC_News_5_Categories/bbc-5categories.csv")
titles = bbc_news_df["title"].to_numpy()

results = []

for i in range(len(X_test_np)):
    doc_id = X_test.index[i]
    title = titles[doc_id]

    cluster, similarity = classify_document_cosine(X_test_np[i], k5_centroids)

    results.append({
        'Document ID': doc_id,
        'Title': title,
        'Cluster': cluster,
        'Similarity': similarity
    })

results_df = pd.DataFrame(results)
print(results_df.head(20))

Sample output:

   Document ID                            Title  Cluster  Similarity
0         1823  EU to probe Apple's French deal        1       0.923
1         1456     Blair backs school discipline        2       0.887
2         2034  England crash out of Under-21s        0       0.945
3         1289    Nokia unveils N-series handset        1       0.912
...

Documents with high similarity (>0.9) show a strong topical match. Lower similarities indicate mixed characteristics or ambiguous content.

This system enables real-time automatic categorization of incoming news articles, with no manual tagging required!
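
For a genuinely new article (raw text rather than an existing column of the term-document matrix), the text first has to be mapped onto the training vocabulary. Here's a rough sketch under two assumptions: that the original features were produced with Porter stemming (the vocabulary certainly looks stemmed: "compani", "technolog"), and that the K=5 centroids live in k5_centroids. The helper vectorize_new_article is mine, not part of the original pipeline.

import re
from collections import Counter
from nltk.stem import PorterStemmer

def vectorize_new_article(text, features):
    # Assumed preprocessing: lowercase, keep alphabetic tokens, Porter-stem them
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    # Align to the training vocabulary; words outside it are simply dropped
    return np.array([counts.get(term, 0) for term in features], dtype=float)

new_text = "The chancellor said the firm's profits beat market forecasts this year."
vec = vectorize_new_article(new_text, features_np)
label, sim = classify_document_cosine(vec, k5_centroids)
print(f"Assigned to cluster {label} (cosine similarity {sim:.3f})")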

What I Learned: Key Insights

1. Topic Structure Emerges from Word Patterns

Without any category labels, K-means discovered meaningful topical groupings purely from term frequencies. This demonstrates that semantic structure is encoded in word usage patterns.

2. Cosine Distance is Essential for Text

Using Euclidean distance would have failed. Text data is:

  • High-dimensional: 6,167 features
  • Sparse: Most terms don't appear in most documents
  • Variable length: Documents have different word counts

Cosine distance handles all these properties by measuring angle between vectors, not magnitude.

3. Optimal K Requires Experimentation

There's no formula for perfect K. I had to:

  • Test multiple values (K=4-8)
  • Run multiple trials per K (random initialization varies)
  • Balance cluster stability vs. specialization
  • Consider domain knowledge (5 original categories suggested K=5)

4. Random Initialization Matters

K-means results vary with initial centroid placement. That's why I ran 5 trials per K value. Some runs produced better clusters than others; this is expected with random initialization.
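
One way to quantify that variation is to fix a different seed for each run and score every run against the training labels (the labels are only used for scoring, never for fitting); a minimal sketch:

from sklearn.metrics import homogeneity_score

scores = []
for seed in range(5):
    np.random.seed(seed)                      # controls the random centroid initialization
    run_clusters, _ = kmeans(X_train_np, K=5)
    scores.append(homogeneity_score(y_train, run_clusters))

print([round(s, 3) for s in scores])          # the spread reflects initialization sensitivity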

5. The Algorithm Found Something Humans Didn't

The gaming cluster at K=5 wasn't an original category! The algorithm discovered that video game articles have distinctive enough vocabulary to warrant separation from general sports. This is the power of unsupervised learning: it can reveal patterns humans didn't explicitly define.

6. Vocabulary Overlap Limits Separation

Business and tech articles are the hardest to separate; they share terminology. Sports and entertainment also overlap with action language ("win", "best", "star"). Perfect clustering is impossible when categories naturally blend.

Practical Applications

These techniques power real-world systems:

News Aggregators: Automatically organize articles by topic without manual tagging

Content Management: Route incoming articles to appropriate sections

Recommendation Engines: Find similar articles by cluster membership

Search Enhancement: Filter results by automatically detected topic

Trend Detection: Identify emerging topics by monitoring cluster evolution

When to Use K-Means Clustering

Use K-means when:

  • You have no labeled training data
  • You want to discover natural groupings
  • You can experiment with different K values
  • Your data works with distance metrics (numerical, TF-IDF weighted text)
  • You need fast, scalable clustering (K-means is efficient)

Don't use K-means when:

  • You know the categories and have labeled data (use supervised learning instead)
  • Clusters have irregular shapes (use DBSCAN or hierarchical clustering)
  • You can't tune K experimentally
  • Features are categorical (K-means assumes numerical data)

Conclusion

This project proved that unsupervised learning can discover meaningful structure in text data. K-means clustering with K=5 achieved 76% homogeneity and completeness against human editorial categories, using only term frequencies.

The real insight came from systematic experimentation. By testing K=4-8 with multiple runs, I found that K=5 provides the optimal balance: matching editorial categories while revealing specialized subcategories like gaming. The algorithm successfully separated politics, technology, business, entertainment, and sports based purely on word usage patterns.

Building this from scratch taught me that text clustering requires:

  • Proper distance metrics (cosine for text)
  • Careful data formatting (documents as rows)
  • Systematic K selection (test multiple values)
  • Qualitative and quantitative evaluation (term analysis + metrics)
  • Multiple runs to account for initialization variance

The fact that an algorithm can organize 2,225 articles into meaningful topics without seeing a single label demonstrates the power of unsupervised learning. Combined with word clouds for visualization and cosine similarity for classification, this creates a complete document organization pipeline.

Understanding clustering deeply, not just calling sklearn.cluster.KMeans(), enables informed decisions about distance metrics, cluster counts, and evaluation approaches. That's the difference between using machine learning and truly understanding it.


📓 Jupyter Notebook

Want to explore the complete code and run it yourself? Access the full Jupyter notebook with detailed implementations and visualizations:

→ View Notebook on GitHub

You can also run it interactively: