Building an ML-Powered Transaction Classifier with Retraining and A/B Testing

Every month I download a CSV from my bank with all our household transactions. Each one needs a category: groceries, fuel, mortgage, subscriptions, insurance. With 200+ transactions per month across two accounts, manual categorization gets tedious fast. So I built a machine learning classifier that does it automatically.

What started as a simple script has grown into a system with model retraining, A/B testing, confidence-based review workflows, and experiment tracking. It runs on a Synology NAS in Docker containers, costs nothing to operate, and gets better over time as we correct its mistakes. This post covers how it works and what I learned building ML for a real (if small-scale) production environment.

The problem

Bank transaction CSVs contain a description field, an amount, a date, and a counterparty name. The description is messy: abbreviated merchant names, reference numbers, sometimes just a string of digits. The goal is to assign each transaction to one of 66 predefined categories like “Groceries”, “Fuel”, “Salary”, “Software subscriptions”, or “Mortgage payment”.

A rules-based approach (if description contains “Albert Heijn” then “Groceries”) works for the obvious cases, but breaks down quickly. The same supermarket can appear as “AH”, “Albert Heijn”, “AH to Go”, or “AH Bezorgservice”. And new merchants appear constantly.
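To make the brittleness concrete, here is a minimal sketch of such a rule table. The `RULES` dictionary and the `categorize_by_rules` helper are made up for illustration and are not taken from the project.

```python
# Hypothetical rule table: lowercase substring -> category (illustrative only).
RULES = {"albert heijn": "Groceries"}


def categorize_by_rules(description):
    """Return the category of the first matching rule, or None if none match."""
    text = description.lower()
    for needle, category in RULES.items():
        if needle in text:
            return category
    return None


descriptions = ["Albert Heijn 1234 Utrecht", "AH to Go Centraal", "AH Bezorgservice"]
results = [categorize_by_rules(d) for d in descriptions]
```

Only the first description matches. Each spelling variant needs its own rule, and a bare "ah" substring would match far too many unrelated descriptions.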

Feature engineering

The classifier uses four types of features:

  1. Text features (TF-IDF): The transaction description and notes are concatenated, lowercased, and vectorized with TF-IDF (max 1000 features, unigrams + bigrams, min_df=2, max_df=0.8, unicode accent stripping). This captures merchant names, payment references, and recurring patterns.
  2. Amount magnitude: log1p(abs(amount)). The log transform handles the wide range from small purchases to mortgage payments. A transaction of 3.50 behaves differently from one of 1,200.00, and the magnitude is often a strong signal for the category.
  3. Expense flag: Binary indicator for whether the amount is negative (expense) or positive (income). Simple but effective: salary and refunds are positive, while nearly all other transactions are negative.
  4. Temporal features: Day of week, day of month, and month. Some categories cluster temporally: salary arrives on the 24th, mortgage debits on the 1st, grocery shopping peaks on Saturdays.

These features are combined into a sparse matrix with scipy.sparse.hstack. Categories with fewer than 10 training samples are excluded to avoid unreliable predictions.
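The non-text features and the rare-category filter can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: the function names and the `category` key are hypothetical, and the TF-IDF block and the scipy.sparse.hstack step are left out.

```python
import math
from collections import Counter
from datetime import date


def numeric_features(amount, booked_on):
    """Return [log-magnitude, expense flag, weekday, day of month, month]."""
    return [
        math.log1p(abs(amount)),      # amount magnitude on a log scale
        1.0 if amount < 0 else 0.0,   # expense flag: 1 for negative amounts
        float(booked_on.weekday()),   # 0 = Monday ... 6 = Sunday
        float(booked_on.day),
        float(booked_on.month),
    ]


def drop_rare_categories(rows, min_samples=10):
    """Keep only rows whose category has at least min_samples examples."""
    counts = Counter(row["category"] for row in rows)
    return [row for row in rows if counts[row["category"]] >= min_samples]


# A 3.50 expense booked on a Saturday.
features = numeric_features(-3.50, date(2024, 6, 1))
```

In the real pipeline these dense columns would be stacked next to the sparse TF-IDF matrix, so every row ends up with text and numeric signals side by side.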

Model training with Optuna

I train four model types in parallel: Random Forest, LightGBM, XGBoost, and Logistic Regression. Each gets its own Optuna study with 50 trials using the TPE sampler. The data is split 60/20/20: training, validation (for Optuna), and a held-out test set for final evaluation.

# Simplified training loop
# X_train, y_train, X_val, y_val come from the 60/20/20 split described above
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from lightgbm import LGBMClassifier

def objective_rf(trial):
    # Search space for the Random Forest hyperparameters
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 5, 50),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
    }
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    # Optuna maximizes the weighted F1 score on the validation set
    return f1_score(y_val, model.predict(X_val), average='weighted')

study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
# Stop after 50 trials or 20 minutes, whichever comes first
study.optimize(objective_rf, n_trials=50, timeout=1200)

The best model across all four types is selected based on weighted F1 score on the test set. In practice, LightGBM and XGBoost consistently outperform Random Forest and Logistic Regression, reaching 80-90% accuracy on 66 categories.
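The 60/20/20 split mentioned earlier can be sketched as a seeded shuffle followed by two cuts. This is a standard-library illustration with a hypothetical helper name. It is not stratified, so with many small categories a stratified split (for example via scikit-learn) would likely be the better choice in practice.

```python
import random


def split_60_20_20(items, seed=42):
    """Shuffle a copy of items and return (train, validation, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_60_20_20(range(100))
```

The validation part is what the Optuna objective scores against. The test part should stay untouched until the final comparison between model types.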

Experiment tracking with MLflow

Every training run is logged to MLflow: hyperparameters, metrics (accuracy, F1, precision, recall), confusion matrix, and the serialized model. The Model Registry handles versioning and stage transitions (Staging to Production).

This sounds like overkill for a personal finance app, and it probably is. But it solves a real problem: when you retrain a model and the accuracy drops, you want to know why. Was it the new data? The hyperparameters? A category that got too few samples? MLflow lets you compare runs side by side and roll back if needed.

The confidence-based review workflow

Not all predictions are equal. The classifier outputs probability scores for each category, and the UI uses these to create a three-tier review workflow:

Confidence   Indicator   Action
> 80%        Green       Auto-categorized, just verify
60-80%       Yellow      Suggested category, accept or change
< 60%        Red         Top 5 suggestions, manual selection

In practice, about 70-80% of transactions land in the green zone. The remaining 20-30% need a quick review. This is dramatically faster than categorizing everything manually, and the corrections feed back into the training data.
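The tiering logic is small enough to sketch in a few lines. The function name and the dictionary input are assumptions for illustration. The thresholds are the ones from the table, with exactly 80% treated as yellow.

```python
def review_tier(probabilities, top_n=5):
    """Map {category: probability} to (tier, suggested categories)."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    best_category, best_score = ranked[0]
    if best_score > 0.80:
        return "green", [best_category]    # auto-categorized, just verify
    if best_score >= 0.60:
        return "yellow", [best_category]   # suggestion, accept or change
    # Low confidence: offer the top candidates for manual selection.
    return "red", [category for category, _ in ranked[:top_n]]


tier, suggestions = review_tier({"Groceries": 0.91, "Fuel": 0.09})
```

With a scikit-learn style classifier, the input dictionary could be built by pairing the class labels with one row of `predict_proba` output.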

Continuous learning: the retraining loop

Every time a user corrects a prediction, the correction is logged to a JSONL file with the original prediction, the corrected category, and the confidence score. This creates a growing dataset of “mistakes the model made” that is gold for retraining.
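A correction log like this needs very little code. The sketch below is an assumption about the shape, not the project's actual schema: the field names are hypothetical, and the timestamp and description fields are additions beyond the three items listed above.

```python
import json
from datetime import datetime, timezone


def log_correction(path, description, predicted, corrected, confidence):
    """Append one correction to the JSONL file as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed field
        "description": description,                           # assumed field
        "predicted": predicted,
        "corrected": corrected,
        "confidence": confidence,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_corrections(path):
    """Read all corrections back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending one line per correction keeps writes cheap and makes the file easy to sync and to concatenate with earlier logs.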

Retraining is triggered when:

  • 50+ new corrections have accumulated
  • Estimated accuracy drops below 75% (tracked through the prediction confidence distribution as a proxy)
  • New categories are added
  • Quarterly maintenance cycle
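The four triggers above can be sketched as one check that returns the reasons that fired. The function and argument names are hypothetical, and treating "quarterly" as 90 days since the last training run is an assumption.

```python
def should_retrain(new_corrections, estimated_accuracy,
                   new_categories_added, days_since_last_training):
    """Return the list of retraining triggers that currently apply."""
    reasons = []
    if new_corrections >= 50:
        reasons.append("corrections")
    if estimated_accuracy < 0.75:
        reasons.append("accuracy")
    if new_categories_added:
        reasons.append("new_categories")
    if days_since_last_training >= 90:  # assumed length of a quarter
        reasons.append("quarterly")
    return reasons


reasons = should_retrain(62, 0.83, False, 30)
```

An empty list means no retraining is due. Returning the reasons instead of a bare boolean makes it easy to record why a run was started.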

The retraining workflow:

# 1. Sync corrections from production (NAS) to development machine
rsync -avz nas:/volume1/.../user_corrections/ ./corrections/

# 2. Retrain with Optuna optimization
python train_model_optuna.py

# 3. Review results in MLflow UI
mlflow ui --port 5000

# 4. If better: promote to production
python set_model_alias.py --model-name kasboek --stage Production

# 5. Deploy to NAS
./update_model_to_nas.sh

Training runs on my MacBook (Apple Silicon, 32GB) because the NAS (Intel Atom, 2GB RAM) is too constrained for Optuna’s 50 trials across 4 model types. The trained model (typically 500KB-2MB as a pickle file) is then deployed to the NAS where it serves predictions.

A/B testing new models

Before fully replacing a production model, I run an A/B test. The system splits traffic between the current champion and the new challenger (default 80/20). Each prediction is logged with which model produced it. After a testing period, a statistical analysis determines whether the challenger is significantly better.

# Champion/Challenger configuration
ab_config = {
    'champion_model': 'kasboek_v3',
    'challenger_model': 'kasboek_v4_lightgbm',
    'traffic_split': 0.8,  # 80% champion, 20% challenger
    'min_samples': 100,     # minimum predictions before analysis
}

# After the testing period, run the analysis from the shell:
#   python analyze_ab_test.py
# Output: "Challenger wins on F1 (0.87 vs 0.82, p=0.003). Promote."
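Two pieces of such a setup can be sketched with the standard library: routing and a significance check. Both are assumptions rather than the project's code. The routing hashes a transaction ID so the same transaction always reaches the same model. The check is a two-proportion z-test on accuracy, which stands in for the F1-based analysis that analyze_ab_test.py reports.

```python
import hashlib
from statistics import NormalDist


def pick_model(transaction_id, champion_share=0.8):
    """Deterministically route a transaction to 'champion' or 'challenger'."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "champion" if bucket < champion_share else "challenger"


def two_proportion_p_value(correct_a, total_a, correct_b, total_b):
    """Two-sided p-value for the difference between two accuracy rates."""
    p_a, p_b = correct_a / total_a, correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    std_err = (pooled * (1 - pooled) * (1 / total_a + 1 / total_b)) ** 0.5
    if std_err == 0:
        return 1.0  # both rates are 0% or 100%: no measurable difference
    z = (p_b - p_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


assignment = pick_model("2024-06-01-0001")
```

With an 80/20 split the challenger sees few transactions, so a minimum sample size like the `min_samples` setting above matters before reading anything into the p-value.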

Is A/B testing overkill for a household finance app? Yes. Did building it teach me more about production ML than any tutorial? Also yes.

Deployment: Docker on a Synology NAS

The entire system runs on a Synology NAS in Docker containers:

Container    Port     Purpose
MongoDB      27018    Transaction storage (8000+ records)
Flask app    8081     Web interface + ML inference

The Flask app loads the trained model once at startup (singleton pattern) and serves predictions in under 50ms per transaction. The ML model is a pickle file containing the fitted model, TF-IDF vectorizer, label encoder, and metadata. Gunicorn handles the WSGI layer.
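The load-once pattern can be sketched with a cached loader. This is one possible way to get singleton behaviour, not necessarily how the app does it, and the function name is hypothetical.

```python
import pickle
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model_bundle(path):
    """Load the pickled bundle once; later calls return the cached object."""
    # Only unpickle files you created yourself: pickle can execute code on load.
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Every request handler can call `get_model_bundle(path)` freely. Within a worker process, only the first call touches the disk, and each Gunicorn worker keeps its own copy in memory.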

The NAS has limited resources (Intel Atom quad-core, 2GB RAM), which forces useful constraints: the model must be small, inference must be fast, and memory usage must stay low. These are the same constraints you face deploying ML to edge devices or cost-sensitive cloud environments.

What I learned

Feature engineering beats model complexity. The jump from Random Forest to LightGBM improved accuracy by a few percentage points. Adding the amount magnitude feature improved it by ten percentage points. The TF-IDF bigrams (capturing “Albert Heijn” as a single feature rather than two separate words) were another significant step.

User corrections are your best training data. The initial model trained on 500 manually categorized transactions was decent. After 6 months of corrections, it had seen every edge case in our spending patterns. The correction-based retraining loop is the single most valuable part of the system.

Confidence thresholds matter more than accuracy. An 85% accurate model that knows when it’s uncertain is more useful than a 90% accurate model that’s confidently wrong. The three-tier confidence workflow means the user only spends time on the hard cases.

MLOps practices scale down. MLflow, Optuna, A/B testing, model registry, staged deployments. These aren’t just for large teams. On a personal project with one user, they solve the same problems: reproducibility, comparison, and safe deployment. The tools work at any scale.

Stack summary

Component                Technology
Web framework            Flask + Jinja2
Database                 MongoDB (Docker)
ML models                LightGBM, XGBoost, Random Forest, Logistic Regression
Hyperparameter tuning    Optuna (TPE sampler, 50 trials)
Experiment tracking      MLflow + Model Registry
Text features            scikit-learn TF-IDF
Deployment               Docker + Gunicorn on Synology NAS
Training hardware        MacBook Pro (Apple Silicon)

The full system is about 2000 lines of Python across the Flask app, training scripts, and utility functions. It handles our household finances reliably, gets smarter with each use, and taught me more about production ML than reading about it ever did. Sometimes the best way to learn MLOps is to build something you actually use every day.
