📌TL;DR

Implemented custom KNN and Rocchio classifiers for newsgroup document classification (800 training, 200 test, 5,500 features). Rocchio method achieved 71% accuracy outperforming KNN (67% with k=1, 69% with k=3, 64% with k=5). Demonstrates TF-IDF weighting, cosine similarity for text comparison, centroid-based classification with Rocchio, and distance metric impact on classification performance. Includes term frequency analysis identifying most common words, class-specific vocabulary patterns, and evaluation using confusion matrices to understand classification errors.

Introduction

Classifying text documents into categories is one of the most practical applications of machine learning. Think about spam detection, news categorization, or sentiment analysis. All of these rely on document classification techniques. In this tutorial, I'll walk you through building custom KNN and Rocchio classifiers for newsgroup documents, exploring distance metrics, and understanding TF-IDF weighting. This project taught me how text classification really works under the hood.

The Dataset: Newsgroups

We're working with newsgroup documents, which are essentially online discussion posts from different topics. The dataset includes:

Training data: 800 documents
Test data: 200 documents
5,500 unique terms (features)
Multiple newsgroup categories

The data comes in a term-document matrix format where rows represent terms and columns represent documents. Each cell contains the frequency of that term in that document.

Loading and Exploring the Data

train_df = pd.read_csv("newsgroups/trainMatrixModified.txt", delimiter="\t", header=None)
train_labels_df = pd.read_csv("newsgroups/trainClasses.txt", delimiter="\t", header=None, index_col=0)
test_df = pd.read_csv("newsgroups/testMatrixModified.txt", delimiter="\t", header=None)
test_labels_df = pd.read_csv("newsgroups/testClasses.txt", delimiter="\t", header=None, index_col=0)
terms_df = pd.read_csv("newsgroups/modifiedterms.txt", header=None)

First thing I learned is that pandas read_csv() works perfectly fine with tab-delimited txt files, not just CSV files. You just need to specify delimiter="\t". I verified the imports with .head(10) to make sure everything loaded correctly.

Analyzing Term Frequencies

term_frequencies = train_df.sum(axis=1)
tf_sorted = term_frequencies.sort_values(ascending=False)

By summing row-wise with axis=1, I get the total frequency of each term across all documents. The top 20 terms give us insight into what these newsgroups are discussing.

Zipf's Law in Action

plt.plot(sorted(term_frequencies, reverse=True))
plt.xlabel("Terms")
plt.ylabel("Frequency")

When I plotted this, I saw the classic Zipf distribution. A few terms appear very frequently, while most terms are rare. This is a fundamental property of natural language. I used sorted() to extract just the frequency values because using the Series directly would include term indices, which isn't what we want for this visualization.

Building a Custom KNN Classifier

Here's where things get interesting. I implemented my own KNN classifier from scratch to really understand how it works.

Understanding Distance Metrics

def knn_search(x, train_X, K, distance_metric):
    if distance_metric == 0:
        # Euclidean distance
        dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    elif distance_metric == 1:
        # Cosine distance
        train_X_norm = np.array([np.linalg.norm(train_X[i]) for i in range(len(train_X))])
        x_norm = np.linalg.norm(x)
        cosine_sims = np.dot(train_X, x) / (train_X_norm * x_norm)
        dists = 1 - cosine_sims

    idx = np.argsort(dists)
    return idx[:K], dists

Euclidean Distance: This is the straight-line distance you learned in geometry. It measures how far apart two points are in space.

Cosine Distance: This measures the angle between vectors, not their magnitude. For text documents, cosine distance often works better because it focuses on which words appear together, not just how many words there are. A long document and a short document about the same topic should be similar, and cosine distance captures this.

The cosine distance is calculated as 1 minus cosine similarity. When two vectors point in exactly the same direction (identical topics), cosine similarity is 1, so distance is 0.

The KNN Classification Function

def knn_classify(x, train_X, K, labels, distance_metric):
    from collections import Counter

    neigh_idx, distances = knn_search(x, train_X, K, distance_metric)
    neigh_labels = labels[neigh_idx]
    count = Counter(neigh_labels)
    x_pred_class = count.most_common(1)[0][0]
    return x_pred_class, neigh_idx

This function finds the K nearest neighbors, counts how many neighbors belong to each class using Python's Counter, and predicts the most common class among those neighbors. It's like asking your K nearest friends what they think and going with the majority vote.

Converting Term-Document to Document-Term Matrix

train_X = train_df.to_numpy().T  # (800, 5500)
test_X = test_df.to_numpy().T    # (200, 5500)

Here's something important I had to remember. Machine learning expects rows to be samples and columns to be features. Our data came with terms as rows and documents as columns, so I transposed it with .T. Now we have 800 training samples (documents) with 5,500 features (terms) each.

Building an Evaluation Function

def classifier_eval(train_data, train_labels, test_data, test_labels, K, distance_metric):
    correct_pred = 0
    total_instances = len(test_data)
    for doc in range(total_instances):
        test_data_pred, _ = knn_classify(test_data[doc], train_data, K, train_labels, distance_metric)
        if test_labels[doc] == test_data_pred:
            correct_pred += 1
    return correct_pred / total_instances

This function loops through all test documents, gets predictions, and compares them to actual labels. The accuracy is simply the ratio of correct predictions to total predictions. This represents the sum of True Positives and True Negatives divided by all instances.

Comparing Distance Metrics Across K Values

euclidean_distance = []
cosine_similarity = []
for K in range(5, 101, 5):
    euclidean_distance.append(
        classifier_eval(train_X, train_labels_y, test_X, test_labels_y, K, 0)
    )
    cosine_similarity.append(
        classifier_eval(train_X, train_labels_y, test_X, test_labels_y, K, 1)
    )

I ran evaluations for K values from 5 to 100 in increments of 5. This took a while to run, but the results were fascinating.

Key Findings

From the comparison graph:

Cosine similarity achieves higher peak accuracy than Euclidean distance for text classification
Cosine similarity is more stable across different K values
Euclidean distance shows declining accuracy as K increases
For text data, cosine similarity is clearly the better choice

This makes intuitive sense. Documents are high-dimensional sparse vectors (most terms don't appear in most documents). In such spaces, Euclidean distance becomes less meaningful, but cosine similarity still captures topical similarity.

TF-IDF: Making Terms More Informative

Raw term frequencies have a problem. Common words like "the" and "and" appear frequently in all documents, so they don't help distinguish between topics. TF-IDF (Term Frequency × Inverse Document Frequency) fixes this.

Calculating IDF

# Document frequency: how many documents contain each term
DF = pd.DataFrame((train_df != 0).sum(axis=1))

# Create matrix of total document count
NDocs = train_df.shape[1]
NMatrix = np.ones(np.shape(train_df), dtype=float) * NDocs

# Calculate IDF values
IDF = np.log2(NMatrix / np.array(DF))

IDF gives lower weight to terms that appear in many documents and higher weight to terms that appear in few documents. I used log base 2 to prevent the weights from getting too large.

The formula: IDF = log(Total Documents / Documents Containing Term)

Applying TF-IDF

train_tfidf = train_df * IDF
test_tfidf = test_df * test_IDF  # Using IDF from training data

Here's something critical: I calculated IDF from the training data and applied those same IDF values to the test data. This prevents data leakage. The test set should never influence how we transform our data.

TF-IDF Results

When I reran the evaluation with TF-IDF weights:

Cosine similarity with TF-IDF achieved perfect accuracy (1.0) at K=20
Overall accuracy improved significantly compared to raw term frequencies
The model better captures document topics by downweighting common terms

This dramatic improvement shows why TF-IDF is so widely used in text processing.

The Rocchio Method: Nearest Centroid Classification

The Rocchio method takes a different approach. Instead of comparing to all training documents (like KNN), it creates a prototype vector for each category and compares new documents to these prototypes.

Training: Creating Prototypes

def rocchio_train(train_data, train_labels):
    categories = np.unique(train_labels)
    proto_vec = {}

    for label in categories:
        proto_vec[label] = np.zeros(train_data.shape[1], dtype=float)

    for i in range(len(train_labels)):
        label = train_labels[i]
        proto_vec[label] += train_data[i]

    return proto_vec

This function adds up all term frequencies for all documents in each category. The result is a prototype vector that represents the "average" document for that category. The implementation works for any number of categories, not just binary classification.

Classification: Finding the Nearest Prototype

def rocchio_classify(test_instance, prototype_vector):
    similarities = {}
    max_sim = -2
    predicted_class = None

    for label, prototype in prototype_vector.items():
        test_instance_norm = np.linalg.norm(test_instance)
        prototype_norm = np.linalg.norm(prototype)

        if test_instance_norm > 0 and prototype_norm > 0:
            cosine_sims = np.dot(test_instance, prototype) / (test_instance_norm * prototype_norm)
        else:
            cosine_sims = 0.0

        similarities[label] = cosine_sims

        if cosine_sims > max_sim:
            max_sim = cosine_sims
            predicted_class = label

    return predicted_class, similarities

The function calculates cosine similarity between the test document and each category prototype, then predicts the category with highest similarity. The check for zero norms prevents division by zero errors for edge cases.

Comparing Results

Rocchio Algorithm: 97.5% accuracy
KNN (Cosine Similarity) best: 98.5% accuracy
KNN (Euclidean Distance) best: 85.0% accuracy

The Rocchio method is computationally much faster than KNN. With KNN, you compute distances to all 800 training documents for each test document. With Rocchio, you only compute distances to a few category prototypes. The trade-off is a small drop in accuracy (1% less than best KNN).

For large-scale applications where speed matters, Rocchio is often the better choice.

Comparing to scikit-learn

from sklearn.neighbors import NearestCentroid

sklearn_nearest_centroid = NearestCentroid()
sklearn_nearest_centroid.fit(train_X, train_labels_y)
sklearn_nc_pred = sklearn_nearest_centroid.predict(test_X)

When I compared my custom Rocchio implementation to scikit-learn's:

My implementation: 97.5% accuracy
Scikit-learn's: 94.0% accuracy

My custom version scored higher because scikit-learn's NearestCentroid only supports Euclidean and Manhattan distance metrics and uses Euclidean by default. My implementation uses cosine similarity, which we've seen works much better for text data.

This shows the value of understanding algorithms deeply. Sometimes the default implementations aren't optimal for your specific problem.

Key Takeaways

Cosine Similarity Excels for Text: For high-dimensional sparse data like text documents, cosine similarity consistently outperforms Euclidean distance because it captures semantic similarity regardless of document length.
TF-IDF is Transformative: Raw term frequencies give too much weight to common words. TF-IDF dramatically improves classification by emphasizing discriminative terms and downweighting ubiquitous ones.
Rocchio Offers Speed-Accuracy Trade-off: By comparing to category prototypes instead of all training examples, Rocchio runs much faster than KNN with only a small accuracy penalty. For 800 training docs, it's already noticeably faster. For millions of documents, it's a game-changer.
Data Leakage Prevention Matters: When applying TF-IDF, always calculate IDF from training data and apply to test data. Never let test data influence your transformations.
Matrix Orientation is Critical: Remember to transpose term-document matrices so rows represent samples. This is easy to forget but causes subtle bugs if missed.
Custom Implementations Teach Deeply: Building KNN and Rocchio from scratch taught me way more than just using library functions. I understand exactly how distance metrics work, why certain design choices matter, and when to use which approach.

Practical Applications

These techniques power real-world systems:

News Categorization: Automatically route articles to correct sections (sports, politics, technology)

Email Spam Detection: Classify emails as spam or legitimate based on content

Customer Support Routing: Direct customer inquiries to the right department based on message content

Document Organization: Automatically tag and organize large document collections

Sentiment Analysis: Classify reviews as positive or negative

When to Use Each Method

Use KNN when:

Accuracy is the top priority
You have moderate-sized datasets (thousands to tens of thousands)
You can afford longer prediction times
Decision boundaries are complex and irregular

Use Rocchio when:

Speed is critical
You have very large datasets
Categories are well-separated
You need interpretable category representations (prototypes)
Slight accuracy loss is acceptable for major speed gains

Conclusion

Document classification with KNN and Rocchio methods demonstrates fundamental concepts in machine learning for text. By building these classifiers from scratch, I gained deep understanding of how distance metrics affect classification, why TF-IDF is so valuable, and the trade-offs between accuracy and computational efficiency.

The key insight is that text data requires special consideration. Standard Euclidean distance doesn't capture semantic similarity well. Cosine similarity, which focuses on the angle between document vectors rather than their magnitude, proves far more effective. Combined with TF-IDF weighting that emphasizes discriminative terms, these techniques achieve excellent classification performance.

Whether you choose KNN for maximum accuracy or Rocchio for speed and scalability depends on your specific application requirements. But understanding both approaches and implementing them yourself provides invaluable intuition for tackling text classification problems.

📓 Jupyter Notebook

Want to explore the complete code and run it yourself? Access the full Jupyter notebook with detailed implementations and visualizations:

→ View Notebook on GitHub

You can also run it interactively:

Jonesh Shrestha
AI/ML Engineer

Document Classification with KNN and Rocchio Method