Build a spam classifier from scratch: complete Python tutorial

You’ve learned machine learning fundamentals and built basic models. Now it’s time to create something genuinely useful: a spam email classifier that automatically identifies junk messages. This is the kind of system that protects billions of inboxes every day.

Building real machine learning projects that solve actual problems transforms theoretical knowledge into practical skills. Spam classification is perfect for this because it combines text processing, feature extraction, and classification in one complete workflow. By the end, you’ll have a working classifier you can actually use.

Building a spam classifier from scratch teaches you natural language processing fundamentals, text preprocessing techniques, and how to handle real, messy data. Unlike clean tutorial datasets, text data requires extensive preparation before models can use it. Let me show you the complete process from raw messages to accurate predictions.

Understanding the spam classification problem

Spam classification is binary classification with two categories: spam or legitimate email, also called ham. Given a message’s content, your model predicts which category it belongs to. The concept is simple, but the implementation is tricky because text is unstructured and messy.

The challenge is that spam constantly evolves. Spammers adapt their messages to bypass filters. Your model needs to learn patterns that generalize rather than memorizing specific spam examples. Good features and proper evaluation matter enormously.

Evaluation metrics are crucial for spam classification. Accuracy alone is misleading with imbalanced classes: when only about 13 percent of messages are spam, a model that labels everything ham scores roughly 87 percent accuracy while catching zero spam. Precision measures what fraction of emails flagged as spam actually are spam. Recall measures what fraction of actual spam you catch. Both matter, but in different ways.

False positives, where legitimate emails get marked as spam, are often worse than false negatives, where spam gets through. Users tolerate some spam but get furious when important emails disappear into spam folders. Your evaluation should reflect this reality.
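
To make these definitions concrete, here’s a tiny sketch computing both metrics with scikit-learn. The label arrays below are made up purely for illustration.

from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = ham
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Precision: of the 4 messages flagged as spam, how many really are?
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
# Recall: of the 4 actual spam messages, how many did we catch?
print(f"Recall: {recall_score(y_true, y_pred):.2f}")  # 0.75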

Loading and exploring the dataset

We’ll use the SMS Spam Collection dataset, which contains 5,574 SMS messages labeled as spam or ham. The messages are texts rather than emails, but the patterns and the workflow carry over directly. It’s publicly available and perfect for learning. Each row has a label and the message text.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load dataset
url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/data/sms.tsv'
sms = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Explore the data
print(sms.head())
print(f"\nDataset shape: {sms.shape}")
print(f"\nClass distribution:\n{sms['label'].value_counts()}")
print(f"\nSpam percentage: {(sms['label'] == 'spam').mean() * 100:.2f}%")

The dataset has about 13 percent spam messages, which is realistic but creates class imbalance you’ll need to handle. Look at actual message examples to understand what you’re working with.

# Sample messages
print("\nSample ham messages:")
print(sms[sms['label'] == 'ham']['message'].head(3).values)

print("\nSample spam messages:")
print(sms[sms['label'] == 'spam']['message'].head(3).values)

Spam messages often contain words like free, win, call now, and excessive punctuation. Ham messages are normal conversations. Your model needs to learn these patterns from examples.
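
If you’d rather see these patterns in the data than take them on faith, a rough word count per class makes them visible. This sketch uses a crude lowercase split rather than proper tokenization, which is fine for a quick look.

from collections import Counter

# Crude tokenization: lowercase and split on whitespace
spam_words = Counter(
    word
    for msg in sms[sms['label'] == 'spam']['message']
    for word in msg.lower().split()
)
ham_words = Counter(
    word
    for msg in sms[sms['label'] == 'ham']['message']
    for word in msg.lower().split()
)

print("Most common spam words:", spam_words.most_common(10))
print("Most common ham words:", ham_words.most_common(10))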

Preprocessing text data

Text preprocessing converts raw messages into a form machine learning algorithms can process. This involves several steps that clean and standardize the text.

Convert labels to binary values. Spam becomes 1 and ham becomes 0. This is standard for binary classification.

# Convert labels to binary
sms['label_num'] = sms['label'].map({'ham': 0, 'spam': 1})

# Verify conversion
print(sms[['label', 'label_num']].head())

Text cleaning removes noise and standardizes format. Common steps include converting to lowercase, removing punctuation, and handling special characters. The goal is reducing variation that doesn’t carry meaning.

For this tutorial, we’ll keep preprocessing simple and let TF-IDF handle most of the work. In production systems, you might add stemming or lemmatization to reduce words to root forms. Removing stop words like the, is, and at can help too.
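
If you later want that heavier preprocessing, here’s a sketch of what stemming might look like using NLTK’s PorterStemmer. Note that NLTK is an extra dependency this tutorial doesn’t otherwise need, and the regex cleanup is just one reasonable choice.

import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_text(text):
    # Lowercase and strip everything except letters and spaces
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    # Reduce each word to its stem: "winning" -> "win"
    return ' '.join(stemmer.stem(word) for word in text.split())

print(clean_text("CONGRATULATIONS!! You're WINNING big prizes!"))
# Stop words can be dropped inside the vectorizer instead:
# TfidfVectorizer(stop_words='english')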

Split data into training and test sets before any feature extraction. Use 80 percent for training and 20 percent for testing. Stratify by label to maintain class distribution in both sets.

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    sms['message'],
    sms['label_num'],
    test_size=0.2,
    random_state=42,
    stratify=sms['label_num']
)

print(f"Training set: {len(X_train)} messages")
print(f"Test set: {len(X_test)} messages")
print(f"Training spam rate: {y_train.mean():.2%}")
print(f"Test spam rate: {y_test.mean():.2%}")

Extracting features with TF-IDF

Machine learning models need numerical features, not raw text. TF-IDF (Term Frequency-Inverse Document Frequency) converts text into numbers that represent word importance.

TF-IDF has two components. Term frequency measures how often a word appears in a document. Inverse document frequency measures how unique a word is across all documents. Multiply these to get TF-IDF scores.

Common words like the appear in most documents and get low scores. Distinctive words that appear in few documents get high scores. This helps identify words that actually discriminate between spam and ham.
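
Here’s the idea on a toy corpus. The three documents below are invented for illustration; note that scikit-learn uses a smoothed variant of the IDF formula, so the exact numbers differ slightly from the textbook definition.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "free" and "win" appear in one document each,
# while "the" and "is" appear in two of the three
docs = [
    "the meeting is at noon",
    "win a free prize now",
    "the report is due",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# Rare, distinctive words get higher weights than shared ones
print(pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out()).round(2))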

# Create TF-IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=3000,  # Keep top 3000 features
    min_df=2,  # Word must appear in at least 2 documents
    max_df=0.8,  # Word can't appear in more than 80% of documents
    ngram_range=(1, 2)  # Use unigrams and bigrams
)

# Fit on training data and transform both sets
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"TF-IDF matrix shape: {X_train_tfidf.shape}")
print(f"Number of features: {len(tfidf.get_feature_names_out())}")

The ngram_range parameter lets you capture phrases, not just single words. A bigram like free money carries a different meaning than either word alone. This helps catch spam patterns that single words miss.

The resulting matrix is sparse with mostly zeros because each message uses only a small fraction of the vocabulary. Sklearn handles sparse matrices efficiently.
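
You can check the sparsity directly. With 3,000 features and short messages, well over 99 percent of the entries should be zero.

# Fraction of zero entries in the training matrix
n_cells = X_train_tfidf.shape[0] * X_train_tfidf.shape[1]
sparsity = 1 - X_train_tfidf.nnz / n_cells
print(f"Sparsity: {sparsity:.2%}")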

Training and comparing classifiers

Try multiple algorithms to see what works best. Naive Bayes is fast and works well for text. Logistic Regression is also effective and produces better-calibrated probability estimates.

# Train Naive Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# Train Logistic Regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_tfidf, y_train)

# Make predictions
nb_pred = nb_model.predict(X_test_tfidf)
lr_pred = lr_model.predict(X_test_tfidf)

print("Naive Bayes Results:")
print(classification_report(y_test, nb_pred, target_names=['ham', 'spam']))

print("\nLogistic Regression Results:")
print(classification_report(y_test, lr_pred, target_names=['ham', 'spam']))

The classification report shows precision, recall, and F1 score for each class. Precision is what fraction of predicted spam actually is spam. Recall is what fraction of actual spam you caught.

For spam filtering, precision on the spam class matters more than recall. You’d rather let some spam through than mistakenly block important emails. A spam precision of 95 percent or higher is a reasonable target on this dataset.

Evaluating with confusion matrix

The confusion matrix shows exactly where your model makes mistakes. It breaks down predictions into true positives, true negatives, false positives, and false negatives.

# Confusion matrix for best model
print("\nConfusion Matrix (Logistic Regression):")
cm = confusion_matrix(y_test, lr_pred)
print(cm)
print("\nExplanation:")
print(f"True negatives (correctly identified ham): {cm[0][0]}")
print(f"False positives (ham marked as spam): {cm[0][1]}")
print(f"False negatives (spam marked as ham): {cm[1][0]}")
print(f"True positives (correctly identified spam): {cm[1][1]}")

False positives are your biggest concern. Each one represents a legitimate email wrongly flagged as spam. Monitor this number carefully.

False negatives let spam through but are less harmful. Users expect some spam to slip past filters. The tradeoff between precision and recall lets you tune the model’s aggressiveness.
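
One concrete way to trade recall for precision is to raise the decision threshold above the default 0.5, so a message is only flagged when the model is confident. This is a sketch; the threshold values are arbitrary examples, not tuned choices.

from sklearn.metrics import precision_score, recall_score

# Probability that each test message is spam (column 1 = class 1)
spam_proba = lr_model.predict_proba(X_test_tfidf)[:, 1]

for threshold in [0.5, 0.7, 0.9]:
    preds = (spam_proba >= threshold).astype(int)
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")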

Testing on new messages

The payoff of a trained classifier is that it can classify new, unseen messages. Let’s test it on custom examples.

# Test messages
new_messages = [
    "Hey, are we still meeting for lunch tomorrow?",
    "CONGRATULATIONS! You've won a FREE iPhone! Click here now!",
    "Can you send me that report by end of day?",
    "Claim your prize now! Limited time offer! Call 555-1234"
]

# Transform and predict
new_tfidf = tfidf.transform(new_messages)
predictions = lr_model.predict(new_tfidf)
probabilities = lr_model.predict_proba(new_tfidf)

# Display results
for i, msg in enumerate(new_messages):
    label = "SPAM" if predictions[i] == 1 else "HAM"
    confidence = probabilities[i][predictions[i]] * 100
    print(f"\nMessage: {msg}")
    print(f"Prediction: {label} (confidence: {confidence:.1f}%)")

The model should correctly identify obvious spam with words like free, congratulations, and click here. Normal messages should classify as ham with high confidence.
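
You can also inspect which features push the logistic regression toward spam. The learned weights live in the model’s coef_ attribute, and the largest positive values mark the most spam-like terms.

import numpy as np

feature_names = tfidf.get_feature_names_out()
coefs = lr_model.coef_[0]

# Ten features with the largest positive (most spam-like) weights
for idx in np.argsort(coefs)[-10:][::-1]:
    print(f"{feature_names[idx]}: {coefs[idx]:.2f}")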

Improving classifier performance

Several techniques can boost performance if initial results aren’t good enough. Add more sophisticated text preprocessing like stemming or lemmatization. Remove stop words that don’t carry meaning. Experiment with different ngram ranges.

Try different classifiers like Support Vector Machines or ensemble methods. Adjust hyperparameters through grid search. Balance the dataset if class imbalance is causing problems.
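
As a sketch of that tuning step, GridSearchCV can search hyperparameters and address imbalance at the same time via class_weight. The grid values below are illustrative starting points, not recommendations.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative grid: regularization strength and class weighting
param_grid = {'C': [0.1, 1, 10], 'class_weight': [None, 'balanced']}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    scoring='precision',  # optimize for precision, per the discussion above
    cv=5
)
grid.fit(X_train_tfidf, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV precision: {grid.best_score_:.3f}")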

Feature engineering helps too. Add features like message length, number of exclamation marks, or presence of URLs. Spam often has distinctive structural patterns beyond just word choice.

# Add length feature
sms['length'] = sms['message'].str.len()
print(f"\nAverage ham length: {sms[sms['label'] == 'ham']['length'].mean():.0f}")
print(f"Average spam length: {sms[sms['label'] == 'spam']['length'].mean():.0f}")

Combining TF-IDF features with handcrafted features often improves performance. Stack them together before training your classifier.
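
As a sketch, scipy’s sparse hstack can append a handcrafted column such as message length to the TF-IDF matrix. Scaling the extra column first is a judgment call here; it keeps raw lengths from dwarfing the TF-IDF weights.

import numpy as np
from scipy.sparse import hstack, csr_matrix

# Message length as an extra column, scaled down to roughly 0-1
train_len = csr_matrix(X_train.str.len().values.reshape(-1, 1) / 1000)
test_len = csr_matrix(X_test.str.len().values.reshape(-1, 1) / 1000)

X_train_combined = hstack([X_train_tfidf, train_len])
X_test_combined = hstack([X_test_tfidf, test_len])

print(f"Combined shape: {X_train_combined.shape}")  # one extra column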

Deploying your spam classifier

A working spam classifier is only useful if you can run it outside your notebook. Save the trained model and vectorizer so you can load them later without retraining.

import joblib

# Save model and vectorizer
joblib.dump(lr_model, 'spam_classifier_model.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')

# Load later
# loaded_model = joblib.load('spam_classifier_model.pkl')
# loaded_tfidf = joblib.load('tfidf_vectorizer.pkl')

You could integrate this into an email client, create a web API that accepts messages and returns predictions, or build a batch processor that filters large volumes of email.
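
As a sketch of the web API route, a minimal Flask app could load those saved files and serve predictions. Flask is an assumption here; any web framework would work the same way.

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('spam_classifier_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON like {"message": "some message text"}
    text = request.get_json()['message']
    features = vectorizer.transform([text])
    return jsonify({'spam': bool(model.predict(features)[0])})

if __name__ == '__main__':
    app.run()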

Building a spam classifier from scratch taught you the complete text classification workflow. You preprocessed text data, extracted features with TF-IDF, trained multiple classifiers, and evaluated results properly. These same steps apply to any text classification problem from sentiment analysis to document categorization.

Ready to tackle another machine learning project with a completely different data type? Check out our tutorial on image classification with deep learning to build a cats vs dogs classifier using convolutional neural networks and see how computer vision projects differ from text classification.