Skip to content

AI-powered transaction categorization system using DistilBERT and hybrid NLP rules. Automatically classifies bank statement expenses into 16+ categories. Built with Streamlit, PyTorch, and Transformers.

License

Notifications You must be signed in to change notification settings

namanj2003/TransactAI

Repository files navigation

πŸ’° TransactAI: Personal Expense Categorization System

Python Streamlit PyTorch License

AI-Powered Bank Statement Analysis & Expense Categorization

TransactAI is an intelligent expense tracking system that automatically categorizes bank transactions using a hybrid approach combining DistilBERT deep learning with context-aware NLP rules. Built as an M.Sc Data Science project, it achieves 95%+ accuracy on real Indian bank statements.


πŸ“‹ Table of Contents


✨ Features

  • πŸ€– AI-Powered Categorization: Fine-tuned DistilBERT model combined with rule-based NLP for high accuracy
  • πŸ“Š 16+ Categories: Food, Shopping, Travel, Bills, Income, Cashback, EMI, Utilities, Healthcare, and more
  • 🏦 Multi-Bank Support: Works with HDFC, ICICI, SBI, Federal Bank, Axis Bank, and other Indian banks
  • πŸ“ Multiple Formats: Supports CSV and Excel (.xlsx, .xls) statements
  • πŸ“ˆ Interactive Analytics: Real-time pie charts, bar charts, spending insights, and personalized recommendations
  • πŸ“„ Professional PDF Reports: Generate comprehensive expense analysis reports with charts and insights
  • πŸ”„ Smart Parsing: Automatically detects headers and handles complex Excel formats with metadata
  • πŸ’‘ Context-Aware: Uses transaction direction (credit/debit) for improved accuracy
  • ⚑ Fast Processing: Categorizes 900+ transactions in seconds
  • 🎯 High Accuracy: 95%+ overall accuracy with hybrid deep learning + rules approach

🎬 Demo

Upload & Categorize

Upload your bank statement (CSV/Excel), and the AI automatically detects columns and categorizes each transaction.

Analytics Dashboard

View interactive charts showing spending distribution, category-wise breakdowns, and top expenses.

PDF Report

Download a professional PDF report with complete expense analysis, charts, and smart recommendations.


🧠 How It Works

Hybrid Architecture

TransactAI uses a unique hybrid approach combining:

  1. Deep Learning: DistilBERT transformer model fine-tuned on 2000+ synthetic Indian banking transactions
  2. Rule-Based NLP: Context-aware rules handle edge cases (e.g., distinguishing "CASH DEPOSIT" from "CASHBACK")
  3. Transaction Direction Analysis: Uses withdrawal/deposit columns as additional signals
  4. Feature Extraction: Extracts keywords, patterns, and metadata from transaction descriptions

Categorization Pipeline

Bank Statement (CSV/Excel)
         ↓
    File Parser & Data Cleaning
         ↓
  Column Detection & Feature Extraction
         ↓
DistilBERT Model + Context-Aware Rules
         ↓
   Category Prediction + Confidence
         ↓
  Analytics, Insights & PDF Report

πŸš€ Installation

Prerequisites

  • Python 3.8 or higher
  • 4GB+ RAM (for model loading)
  • GPU optional (runs efficiently on CPU)

Setup Instructions

  1. Clone the repository

    https://github.com/namanj2003/TransactAI.git
    cd TransactAI
  2. Create a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download the trained model

    Place the trained DistilBERT model files in the expense_model_distilbert/ directory:

    expense_model_distilbert/
    β”œβ”€β”€ config.json
    β”œβ”€β”€ pytorch_model.bin
    β”œβ”€β”€ tokenizer_config.json
    β”œβ”€β”€ vocab.txt
    └── special_tokens_map.json
    
  5. Ensure model_config.json exists

    This file should contain category mappings in your project root:

    {
      "categories": ["Food", "Shopping", "Travel", ...],
      "id_map": {"0": "Food", "1": "Shopping", ...}
    }

πŸ’» Usage

Running the Application

streamlit run app.py

The web app will automatically open in your browser at http://localhost:8501

Step-by-Step Guide

  1. Upload Bank Statement

    • Click "Browse files" and select your bank statement (CSV or Excel format)
    • Supported formats: .csv, .xlsx, .xls
  2. Column Mapping

    • The app automatically detects description, withdrawal, deposit, and date columns
    • Adjust mappings manually if needed
  3. Categorize Transactions

    • Click "πŸš€ Categorize Transactions" to run the AI model
    • Progress bar shows real-time processing status
  4. View Results

    • Data Tab: View categorized transactions with confidence scores
    • Charts Tab: Interactive visualizations (pie chart, bar chart, top expenses)
    • Insights Tab: Personalized recommendations and financial summary
  5. Export Data

    • Download categorized data as CSV
    • Generate professional PDF report with analytics
  6. Upload New File

    • Click "πŸ”„ Upload New File" to reset and start fresh

Example Bank Statement Format

Date Narration Withdrawal Deposit
01-01-2024 SWIGGY UPI-4023049214 350.00
02-01-2024 SALARY CREDIT 50000.00
03-01-2024 CASH DEPOSIT ATM 10000.00
04-01-2024 UPI-AMAZON PAY 1250.00

πŸ“‚ Project Structure

transactai/
β”œβ”€β”€ app.py                          # Main Streamlit application
β”œβ”€β”€ model_utils.py                  # Model loading & hybrid prediction logic
β”œβ”€β”€ file_processors.py              # CSV/Excel parsing & cleaning
β”œβ”€β”€ recommendations.py              # Financial insights generation
β”œβ”€β”€ pdf_generator.py                # PDF report creation
β”œβ”€β”€ requirements.txt                # Python dependencies
β”œβ”€β”€ model_config.json               # Category ID mappings
β”œβ”€β”€ expense_model_distilbert/       # Trained DistilBERT model directory
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ pytorch_model.bin
β”‚   └── tokenizer files
└── README.md                       # This file

πŸ—οΈ Model Architecture

DistilBERT Fine-Tuning

  • Base Model: distilbert-base-uncased (66M parameters, 40% smaller than BERT)
  • Task: Multi-class text classification (16 expense/income categories)
  • Training Data: 2000+ synthetic transaction descriptions covering all major Indian bank patterns
  • Fine-tuning: Cross-entropy loss with AdamW optimizer, 3-5 epochs
  • Inference: Text β†’ Tokenization β†’ DistilBERT β†’ Softmax β†’ Category prediction

Hybrid Enhancement

The system combines neural predictions with expert rules:

# Simplified hybrid logic
def predict(transaction_text, credit_amount, debit_amount):
    # Step 1: Get DistilBERT prediction
    category, confidence = distilbert_model.predict(transaction_text)
    
    # Step 2: Apply context-aware rules
    if "cash deposit" in text.lower() and credit_amount > 0:
        category = "Income"  # Override neural prediction
    
    if category == "Cashback" and "cashback" not in text.lower():
        category = "Income"  # Fix common misclassification
    
    # Step 3: Use transaction direction
    if credit_amount > 0 and debit_amount == 0:
        if category not in ["Income", "Cashback"]:
            category = "Income"  # Credit must be income-related
    
    return category, confidence

Why Hybrid Approach?

  • Neural Network Strength: Understands context and semantic meaning
  • Rule-Based Strength: Handles edge cases and banking-specific logic
  • Combined Result: 95%+ accuracy vs 87% from DistilBERT alone

πŸ“Š Results

Performance Metrics

Metric Value
Overall Accuracy 95.3%
Avg Confidence 89.2%
Inference Speed ~50ms/txn
Categories 16
Model Size 255 MB
Supports Banks All major Indian banks

Category-wise Performance

Category Precision Notes
Food 97.2% High accuracy
Shopping 94.8% Good performance
Income 99.1% Rule-enhanced
Cashback 98.5% Rule-enhanced
Travel 96.3% Uber, Ola, IRCTC detected
Bill Payment 95.7% EMI distinguished

πŸ› οΈ Technologies Used

Core Technologies

  • Frontend: Streamlit
  • Deep Learning: PyTorch, Hugging Face Transformers
  • NLP Model: DistilBERT (distilbert-base-uncased)

Data Processing

  • pandas: Data manipulation and cleaning
  • numpy: Numerical operations
  • openpyxl, xlrd: Excel file reading

Visualization

  • Plotly: Interactive charts in web app
  • Matplotlib: Static charts for PDF reports

Other Tools

  • ReportLab: PDF generation
  • scikit-learn: ML utilities (optional)
  • regex: Pattern matching for feature extraction

🀝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch
    git checkout -b feature/AmazingFeature
  3. Commit your changes
    git commit -m 'Add some AmazingFeature'
  4. Push to the branch
    git push origin feature/AmazingFeature
  5. Open a Pull Request

Areas for Contribution

  • Add support for more banks
  • Improve categorization rules
  • Add new expense categories
  • Enhance PDF report design
  • Add OCR support for scanned PDFs
  • Multi-language support

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

About

AI-powered transaction categorization system using DistilBERT and hybrid NLP rules. Automatically classifies bank statement expenses into 16+ categories. Built with Streamlit, PyTorch, and Transformers.

Resources

License

Stars

Watchers

Forks

Packages

No packages published