
AI-Powered Document Classification: Complete Guide for 2025

I built a classifier to stop hand-sorting invoices and contracts. Here’s the setup, code, and what failed along the way.

Written by Convert Magic Team · 12 min read

Three of us were spending an hour a day moving invoices, contracts, and receipts into the right folders. After one mislabeled contract delayed a client payment, I built a small classifier to do the sorting for us. This is the exact setup that worked (and the places I tripped up).

How I tested

  • Dataset: 1,200 invoices, 800 contracts, 500 receipts, 300 support emails (all anonymized); 80/10/10 train/val/test split (split sketch after this list)
  • Labels: invoice, contract, receipt, support_email
  • Hardware: 2021 M1 MacBook Pro; Python 3.11; scikit-learn 1.4
  • Goal: >92% macro F1 on the test set, minimal false positives on contracts
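
The 80/10/10 split is two stratified passes with scikit-learn. A quick sketch, assuming docs and labels hold the raw texts and their classes (same names as in the baseline further down):

from sklearn.model_selection import train_test_split

# First pass: hold out 20% of the data, stratified by label
train_x, hold_x, train_y, hold_y = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels
)
# Second pass: split the 20% hold-out evenly into validation and test sets
val_x, test_x, val_y, test_y = train_test_split(
    hold_x, hold_y, test_size=0.5, random_state=42, stratify=hold_y
)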

Why this matters

  • Misrouted docs cost us real money: one invoice sat for 11 days because it lived in “marketing.”
  • Compliance: contracts and PHI must land in the right bucket, not a shared inbox.
  • Humans get bored; the model does not. Accuracy went from ~88% with manual sorting to 95% with the automated pipeline plus human review.

Implementation playbook

  1. Collect and label only what you need. Start with 300–500 examples per class. Keep filenames and small snippets for audits.
  2. Clean text just enough: strip signatures, boilerplate footers, and tracking pixels; avoid over-cleaning dates and amounts, which are predictive (rough cleaning sketch after this list).
  3. Features: TF-IDF still wins for speed and simplicity. Try character n-grams for noisy scans.
  4. Model: start with Linear SVM or Logistic Regression before reaching for transformers. They’re fast to train and cheap to ship.
  5. Evaluate with per-class precision/recall; contracts usually need the highest precision.
  6. Deploy behind a simple API and keep a feedback loop—humans can re-label mistakes weekly.

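For step 2, the cleaning is intentionally light. Here is a rough sketch; the regexes are illustrative, not the exact ones we run in production:

import re

def clean_text(text: str) -> str:
    # Drop everything after a common signature delimiter ("--" on its own line)
    text = re.split(r"\n-- ?\n", text)[0]
    # Strip boilerplate footer lines such as unsubscribe / confidentiality notices
    text = re.sub(r"(?im)^.*(unsubscribe|confidentiality notice).*$", "", text)
    # Remove tracking-pixel image tags and any leftover HTML
    text = re.sub(r"(?i)<img[^>]*>", "", text)
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse whitespace; leave dates and amounts untouched -- they are predictive
    return re.sub(r"\s+", " ", text).strip()
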
Minimal, fast baseline (TF-IDF + Logistic Regression)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

docs = [...]   # list of document texts
labels = [...] # matching labels

train_x, test_x, train_y, test_y = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=50000,
        ngram_range=(1, 2),      # unigrams + bigrams
        min_df=2                 # drop terms that appear in only one document
    )),
    ("clf", LogisticRegression(
        max_iter=200,
        class_weight="balanced"  # offset the class imbalance (1,200 invoices vs. 300 emails)
    ))
])

model.fit(train_x, train_y)
preds = model.predict(test_x)
print(classification_report(test_y, preds))
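
Playbook step 3 mentions character n-grams for noisy scans; the change is confined to the vectorizer. A sketch of the variant I reach for when OCR quality is bad (the exact n-gram range and settings are worth tuning on your own data):

# Character n-grams (2-5) within word boundaries are more forgiving of OCR typos
ocr_model = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char_wb",
        ngram_range=(2, 5),
        min_df=2
    )),
    ("clf", LogisticRegression(max_iter=200, class_weight="balanced"))
])
ocr_model.fit(train_x, train_y)
print(classification_report(test_y, ocr_model.predict(test_x)))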

What went wrong (and fixes)

  • OCR noise from scans: character n-grams (2–5) lifted recall on receipts by 3 points.
  • Contracts mislabeled as invoices: adding a handful of regex-derived features (e.g., contains("governing law")) improved precision; alternatively, set a higher threshold for the contract class (sketch after this list).
  • Drift over time: quarterly re-training with 200–300 fresh examples kept F1 within 1–2 points.
  • Privacy concerns: we hash document IDs and strip PII before logging predictions; keep raw docs in a secure bucket, not in model logs.
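
The higher-threshold option is a few lines on top of the baseline. A sketch; the 0.8 value is an illustrative starting point to tune on the validation set, not what we actually ship:

import numpy as np

CONTRACT_THRESHOLD = 0.8  # illustrative; tune on the validation set

probs = model.predict_proba(test_x)
classes = model.classes_
contract_idx = list(classes).index("contract")

adjusted = []
for row in probs:
    best = classes[row.argmax()]
    # Only accept "contract" when the model is confident; otherwise fall back to the runner-up class
    if best == "contract" and row[contract_idx] < CONTRACT_THRESHOLD:
        best = classes[np.argsort(row)[-2]]
    adjusted.append(best)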

If you’re starting from zero next week

  • Label 100 docs per class; ship a TF-IDF + linear model; aim for >90% macro F1.
  • Add human review for anything under 0.6 confidence (see the serving sketch below).
  • Schedule a weekly error review to decide which 50–100 items to re-label.
  • Only move to a transformer once the simple model and labeling loop plateau.
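
For the "ship it behind an API with human review" part, here is the shape of a minimal service. This is a sketch assuming FastAPI, pydantic, and a pipeline saved with joblib; the filename and endpoint are placeholders, not our production setup:

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("classifier.joblib")  # the saved TF-IDF + LogisticRegression pipeline

class Document(BaseModel):
    text: str

@app.post("/classify")
def classify(doc: Document):
    probs = model.predict_proba([doc.text])[0]
    confidence = float(probs.max())
    return {
        "label": str(model.classes_[probs.argmax()]),
        "confidence": confidence,
        # anything under 0.6 confidence goes to the human review queue
        "needs_review": confidence < 0.6,
    }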
