💰 TransactAI: Personal Expense Categorization System

AI-Powered Bank Statement Analysis & Expense Categorization

TransactAI is an intelligent expense tracking system that automatically categorizes bank transactions using a hybrid approach combining DistilBERT deep learning with context-aware NLP rules. Built as an M.Sc Data Science project, it achieves 95%+ accuracy on real Indian bank statements.

✨ Features

🤖 AI-Powered Categorization: Fine-tuned DistilBERT model combined with rule-based NLP for high accuracy
📊 16+ Categories: Food, Shopping, Travel, Bills, Income, Cashback, EMI, Utilities, Healthcare, and more
🏦 Multi-Bank Support: Works with HDFC, ICICI, SBI, Federal Bank, Axis Bank, and other Indian banks
📁 Multiple Formats: Supports CSV and Excel (.xlsx, .xls) statements
📈 Interactive Analytics: Real-time pie charts, bar charts, spending insights, and personalized recommendations
📄 Professional PDF Reports: Generate comprehensive expense analysis reports with charts and insights
🔄 Smart Parsing: Automatically detects headers and handles complex Excel formats with metadata
💡 Context-Aware: Uses transaction direction (credit/debit) for improved accuracy
⚡ Fast Processing: Categorizes 900+ transactions in seconds
🎯 High Accuracy: 95%+ overall accuracy with hybrid deep learning + rules approach

🎬 Demo

Upload & Categorize

Upload your bank statement (CSV/Excel), and the AI automatically detects columns and categorizes each transaction.

Analytics Dashboard

View interactive charts showing spending distribution, category-wise breakdowns, and top expenses.

PDF Report

Download a professional PDF report with complete expense analysis, charts, and smart recommendations.

🧠 How It Works

Hybrid Architecture

TransactAI uses a unique hybrid approach combining:

Deep Learning: DistilBERT transformer model fine-tuned on 2000+ synthetic Indian banking transactions
Rule-Based NLP: Context-aware rules handle edge cases (e.g., distinguishing "CASH DEPOSIT" from "CASHBACK")
Transaction Direction Analysis: Uses withdrawal/deposit columns as additional signals
Feature Extraction: Extracts keywords, patterns, and metadata from transaction descriptions

Categorization Pipeline

Bank Statement (CSV/Excel)
         ↓
    File Parser & Data Cleaning
         ↓
  Column Detection & Feature Extraction
         ↓
DistilBERT Model + Context-Aware Rules
         ↓
   Category Prediction + Confidence
         ↓
  Analytics, Insights & PDF Report

🚀 Installation

Prerequisites

Python 3.8 or higher
4GB+ RAM (for model loading)
GPU optional (runs efficiently on CPU)

Setup Instructions

Clone the repository

https://github.com/namanj2003/TransactAI.git
cd TransactAI

Create a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies
```
pip install -r requirements.txt
```

Download the trained model

Place the trained DistilBERT model files in the expense_model_distilbert/ directory:

expense_model_distilbert/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
├── vocab.txt
└── special_tokens_map.json

Ensure model_config.json exists

This file should contain category mappings in your project root:

{
  "categories": ["Food", "Shopping", "Travel", ...],
  "id_map": {"0": "Food", "1": "Shopping", ...}
}

💻 Usage

Running the Application

streamlit run app.py

The web app will automatically open in your browser at http://localhost:8501

Step-by-Step Guide

Upload Bank Statement
- Click "Browse files" and select your bank statement (CSV or Excel format)
- Supported formats: .csv, .xlsx, .xls
Column Mapping
- The app automatically detects description, withdrawal, deposit, and date columns
- Adjust mappings manually if needed
Categorize Transactions
- Click "🚀 Categorize Transactions" to run the AI model
- Progress bar shows real-time processing status
View Results
- Data Tab: View categorized transactions with confidence scores
- Charts Tab: Interactive visualizations (pie chart, bar chart, top expenses)
- Insights Tab: Personalized recommendations and financial summary
Export Data
- Download categorized data as CSV
- Generate professional PDF report with analytics
Upload New File
- Click "🔄 Upload New File" to reset and start fresh

Example Bank Statement Format

Date	Narration	Withdrawal	Deposit
01-01-2024	SWIGGY UPI-4023049214	350.00
02-01-2024	SALARY CREDIT		50000.00
03-01-2024	CASH DEPOSIT ATM		10000.00
04-01-2024	UPI-AMAZON PAY	1250.00

📂 Project Structure

transactai/
├── app.py                          # Main Streamlit application
├── model_utils.py                  # Model loading & hybrid prediction logic
├── file_processors.py              # CSV/Excel parsing & cleaning
├── recommendations.py              # Financial insights generation
├── pdf_generator.py                # PDF report creation
├── requirements.txt                # Python dependencies
├── model_config.json               # Category ID mappings
├── expense_model_distilbert/       # Trained DistilBERT model directory
│   ├── config.json
│   ├── pytorch_model.bin
│   └── tokenizer files
└── README.md                       # This file

🏗️ Model Architecture

DistilBERT Fine-Tuning

Base Model: distilbert-base-uncased (66M parameters, 40% smaller than BERT)
Task: Multi-class text classification (16 expense/income categories)
Training Data: 2000+ synthetic transaction descriptions covering all major Indian bank patterns
Fine-tuning: Cross-entropy loss with AdamW optimizer, 3-5 epochs
Inference: Text → Tokenization → DistilBERT → Softmax → Category prediction

Hybrid Enhancement

The system combines neural predictions with expert rules:

# Simplified hybrid logic
def predict(transaction_text, credit_amount, debit_amount):
    # Step 1: Get DistilBERT prediction
    category, confidence = distilbert_model.predict(transaction_text)
    
    # Step 2: Apply context-aware rules
    if "cash deposit" in text.lower() and credit_amount > 0:
        category = "Income"  # Override neural prediction
    
    if category == "Cashback" and "cashback" not in text.lower():
        category = "Income"  # Fix common misclassification
    
    # Step 3: Use transaction direction
    if credit_amount > 0 and debit_amount == 0:
        if category not in ["Income", "Cashback"]:
            category = "Income"  # Credit must be income-related
    
    return category, confidence

Why Hybrid Approach?

Neural Network Strength: Understands context and semantic meaning
Rule-Based Strength: Handles edge cases and banking-specific logic
Combined Result: 95%+ accuracy vs 87% from DistilBERT alone

📊 Results

Performance Metrics

Metric	Value
Overall Accuracy	95.3%
Avg Confidence	89.2%
Inference Speed	~50ms/txn
Categories	16
Model Size	255 MB
Supports Banks	All major Indian banks

Category-wise Performance

Category	Precision	Notes
Food	97.2%	High accuracy
Shopping	94.8%	Good performance
Income	99.1%	Rule-enhanced
Cashback	98.5%	Rule-enhanced
Travel	96.3%	Uber, Ola, IRCTC detected
Bill Payment	95.7%	EMI distinguished

🛠️ Technologies Used

Core Technologies

Frontend: Streamlit
Deep Learning: PyTorch, Hugging Face Transformers
NLP Model: DistilBERT (distilbert-base-uncased)

Data Processing

pandas: Data manipulation and cleaning
numpy: Numerical operations
openpyxl, xlrd: Excel file reading

Visualization

Plotly: Interactive charts in web app
Matplotlib: Static charts for PDF reports

Other Tools

ReportLab: PDF generation
scikit-learn: ML utilities (optional)
regex: Pattern matching for feature extraction

🤝 Contributing

Contributions are welcome! Here's how you can help:

Fork the repository
Create a feature branch
```
git checkout -b feature/AmazingFeature
```
Commit your changes
```
git commit -m 'Add some AmazingFeature'
```
Push to the branch
```
git push origin feature/AmazingFeature
```
Open a Pull Request

Areas for Contribution

Add support for more banks
Improve categorization rules
Add new expense categories
Enhance PDF report design
Add OCR support for scanned PDFs
Multi-language support

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.idea		.idea
__pycache__		__pycache__
expense_model_distilbert		expense_model_distilbert
.gitattributes		.gitattributes
.gitignore		.gitignore
DistilBERT_Custom_Trained_Model.ipynb		DistilBERT_Custom_Trained_Model.ipynb
LICENSE.md		LICENSE.md
NLP_Project_Dataset.ipynb		NLP_Project_Dataset.ipynb
README.md		README.md
app.py		app.py
file_processors.py		file_processors.py
model_config.json		model_config.json
model_utils.py		model_utils.py
pdf_generator.py		pdf_generator.py
recommendations.py		recommendations.py
requirements.txt		requirements.txt

License

namanj2003/TransactAI

Folders and files

Latest commit

History

Repository files navigation