AI-Powered Bank Statement Analysis & Expense Categorization
TransactAI is an intelligent expense tracking system that automatically categorizes bank transactions using a hybrid approach combining DistilBERT deep learning with context-aware NLP rules. Built as an M.Sc Data Science project, it achieves 95%+ accuracy on real Indian bank statements.
- Features
- Demo
- How It Works
- Installation
- Usage
- Project Structure
- Model Architecture
- Results
- Technologies Used
- Contributing
- License
- Contact
- AI-Powered Categorization: Fine-tuned DistilBERT model combined with rule-based NLP for high accuracy
- 16+ Categories: Food, Shopping, Travel, Bills, Income, Cashback, EMI, Utilities, Healthcare, and more
- Multi-Bank Support: Works with HDFC, ICICI, SBI, Federal Bank, Axis Bank, and other Indian banks
- Multiple Formats: Supports CSV and Excel (.xlsx, .xls) statements
- Interactive Analytics: Real-time pie charts, bar charts, spending insights, and personalized recommendations
- Professional PDF Reports: Generate comprehensive expense analysis reports with charts and insights
- Smart Parsing: Automatically detects headers and handles complex Excel formats with metadata
- Context-Aware: Uses transaction direction (credit/debit) for improved accuracy
- Fast Processing: Categorizes 900+ transactions in seconds
- High Accuracy: 95%+ overall accuracy with hybrid deep learning + rules approach
Upload your bank statement (CSV/Excel), and the AI automatically detects columns and categorizes each transaction.
View interactive charts showing spending distribution, category-wise breakdowns, and top expenses.
Download a professional PDF report with complete expense analysis, charts, and smart recommendations.
TransactAI uses a unique hybrid approach combining:
- Deep Learning: DistilBERT transformer model fine-tuned on 2000+ synthetic Indian banking transactions
- Rule-Based NLP: Context-aware rules handle edge cases (e.g., distinguishing "CASH DEPOSIT" from "CASHBACK")
- Transaction Direction Analysis: Uses withdrawal/deposit columns as additional signals
- Feature Extraction: Extracts keywords, patterns, and metadata from transaction descriptions
Bank Statement (CSV/Excel)
↓
File Parser & Data Cleaning
↓
Column Detection & Feature Extraction
↓
DistilBERT Model + Context-Aware Rules
↓
Category Prediction + Confidence
↓
Analytics, Insights & PDF Report
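To make the feature-extraction stage more concrete, here is a minimal sketch of how keywords and direction signals might be pulled from one transaction row. The keyword list, the `extract_features` name, and the returned field names are assumptions for illustration; the project's actual extraction logic is not shown in this README.

```python
import re

# Hypothetical keyword list; the real vocabulary is not documented here.
MERCHANT_KEYWORDS = ("swiggy", "zomato", "amazon", "uber", "ola", "irctc", "salary", "cashback")


def extract_features(description, withdrawal=0.0, deposit=0.0):
    """Return simple text and direction features for one transaction row."""
    text = description.lower()
    # Drop long digit runs (reference numbers), then any remaining non-letters.
    cleaned = re.sub(r"\d{6,}", "", text)
    cleaned = re.sub(r"[^a-z ]+", " ", cleaned)
    cleaned = " ".join(cleaned.split())
    return {
        "clean_text": cleaned,
        "keywords": [k for k in MERCHANT_KEYWORDS if k in text],
        "is_upi": "upi" in text,
        # A positive deposit is treated as a credit, anything else as a debit.
        "direction": "credit" if deposit > 0 else "debit",
    }


features = extract_features("SWIGGY UPI-4023049214", withdrawal=350.0)
print(features)
```

The cleaned text would be what goes to the tokenizer, while the keyword and direction fields are the kind of signal the rule layer can use.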
- Python 3.8 or higher
- 4GB+ RAM (for model loading)
- GPU optional (runs efficiently on CPU)
1. Clone the repository

       git clone https://github.com/namanj2003/TransactAI.git
       cd TransactAI

2. Create a virtual environment

       python -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

       pip install -r requirements.txt

4. Download the trained model

   Place the trained DistilBERT model files in the expense_model_distilbert/ directory:

       expense_model_distilbert/
       ├── config.json
       ├── pytorch_model.bin
       ├── tokenizer_config.json
       ├── vocab.txt
       └── special_tokens_map.json

5. Ensure model_config.json exists

   This file should contain category mappings in your project root:

       { "categories": ["Food", "Shopping", "Travel", ...], "id_map": {"0": "Food", "1": "Shopping", ...} }
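As a rough illustration of how the category mappings could be read and sanity-checked at startup, the sketch below parses the two keys shown above. The function names and the validation step are assumptions, not the project's actual loader, and the inline example uses a shortened two-category config.

```python
import json


def parse_category_map(config):
    """Convert the JSON id_map (string keys) into an int -> category dict."""
    categories = config["categories"]
    id_map = {int(key): name for key, name in config["id_map"].items()}
    # Every mapped name should also appear in the categories list.
    unknown = set(id_map.values()) - set(categories)
    if unknown:
        raise ValueError(f"id_map refers to unknown categories: {sorted(unknown)}")
    return id_map


def load_category_map(path="model_config.json"):
    """Read model_config.json from disk and return the validated mapping."""
    with open(path, encoding="utf-8") as handle:
        return parse_category_map(json.load(handle))


# Shortened example config for illustration only.
example = {"categories": ["Food", "Shopping"], "id_map": {"0": "Food", "1": "Shopping"}}
print(parse_category_map(example))
```

JSON object keys are always strings, so converting them to integers once at load time keeps later lookups by class index simple.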
Start the app:

    streamlit run app.py

The web app will automatically open in your browser at http://localhost:8501
1. Upload Bank Statement
   - Click "Browse files" and select your bank statement (CSV or Excel format)
   - Supported formats: .csv, .xlsx, .xls
2. Column Mapping
   - The app automatically detects description, withdrawal, deposit, and date columns
   - Adjust mappings manually if needed
3. Categorize Transactions
   - Click "Categorize Transactions" to run the AI model
   - Progress bar shows real-time processing status
4. View Results
   - Data Tab: View categorized transactions with confidence scores
   - Charts Tab: Interactive visualizations (pie chart, bar chart, top expenses)
   - Insights Tab: Personalized recommendations and financial summary
5. Export Data
   - Download categorized data as CSV
   - Generate professional PDF report with analytics
6. Upload New File
   - Click "Upload New File" to reset and start fresh
| Date | Narration | Withdrawal | Deposit |
|---|---|---|---|
| 01-01-2024 | SWIGGY UPI-4023049214 | 350.00 | |
| 02-01-2024 | SALARY CREDIT | | 50000.00 |
| 03-01-2024 | CASH DEPOSIT ATM | | 10000.00 |
| 04-01-2024 | UPI-AMAZON PAY | 1250.00 | |
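For a statement shaped like the sample above, automatic column detection could work by matching header names against known aliases. The alias table and the `detect_columns` helper below are hypothetical; they only sketch the idea and do not reflect the app's real detection code.

```python
# Hypothetical header aliases; real bank exports may use other names.
COLUMN_ALIASES = {
    "date": ("date",),
    "description": ("narration", "description", "particulars", "remarks"),
    "withdrawal": ("withdrawal", "debit"),
    "deposit": ("deposit", "credit"),
}


def detect_columns(headers):
    """Map each role (date, description, ...) to the first matching header."""
    mapping = {}
    for header in headers:
        name = header.strip().lower()
        for role, aliases in COLUMN_ALIASES.items():
            # Assign a role only once, to the first header that matches it.
            if role not in mapping and any(alias in name for alias in aliases):
                mapping[role] = header
                break
    return mapping


sample = detect_columns(["Date", "Narration", "Withdrawal", "Deposit"])
other = detect_columns(["Txn Date", "Particulars", "Debit", "Credit"])
print(sample)
print(other)
```

Roles that are missing from the returned mapping are the ones a user would need to assign manually in the column mapping step.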
transactai/
├── app.py                      # Main Streamlit application
├── model_utils.py              # Model loading & hybrid prediction logic
├── file_processors.py          # CSV/Excel parsing & cleaning
├── recommendations.py          # Financial insights generation
├── pdf_generator.py            # PDF report creation
├── requirements.txt            # Python dependencies
├── model_config.json           # Category ID mappings
├── expense_model_distilbert/   # Trained DistilBERT model directory
│   ├── config.json
│   ├── pytorch_model.bin
│   └── tokenizer files
└── README.md                   # This file
- Base Model: distilbert-base-uncased (66M parameters, 40% smaller than BERT)
- Task: Multi-class text classification (16 expense/income categories)
- Training Data: 2000+ synthetic transaction descriptions covering all major Indian bank patterns
- Fine-tuning: Cross-entropy loss with AdamW optimizer, 3-5 epochs
- Inference: Text → Tokenization → DistilBERT → Softmax → Category prediction
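The last two inference steps (softmax, then category prediction) can be sketched in plain Python. The logits and the three-entry `id_map` below are made-up values for illustration; in the real pipeline the logits would come from the fine-tuned DistilBERT classifier and the mapping from model_config.json.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode_prediction(logits, id_map):
    """Return the most likely category and its probability (the confidence)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id_map[best], probs[best]


# Hypothetical, truncated mapping and logits.
id_map = {0: "Food", 1: "Shopping", 2: "Travel"}
category, confidence = decode_prediction([2.0, 0.5, -1.0], id_map)
print(category, round(confidence, 3))
```

The confidence score reported alongside each transaction is, under this reading, simply the softmax probability of the winning class.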
The system combines neural predictions with expert rules:

    # Simplified hybrid logic. `distilbert_model` stands in for the loaded classifier.
    def predict(transaction_text, credit_amount, debit_amount):
        text = transaction_text.lower()

        # Step 1: Get DistilBERT prediction
        category, confidence = distilbert_model.predict(transaction_text)

        # Step 2: Apply context-aware rules
        if "cash deposit" in text and credit_amount > 0:
            category = "Income"  # Override neural prediction
        if category == "Cashback" and "cashback" not in text:
            category = "Income"  # Fix common misclassification

        # Step 3: Use transaction direction
        if credit_amount > 0 and debit_amount == 0:
            if category not in ["Income", "Cashback"]:
                category = "Income"  # Credit must be income-related

        return category, confidence

- Neural Network Strength: Understands context and semantic meaning
- Rule-Based Strength: Handles edge cases and banking-specific logic
- Combined Result: 95%+ accuracy vs 87% from DistilBERT alone
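To see the rule layer acting on its own, the sketch below separates it from the model and feeds it a few rows with made-up raw predictions. The `apply_rules` name and the "raw" categories are assumptions chosen to show an override; they are not recorded model outputs.

```python
def apply_rules(text, category, credit_amount=0.0, debit_amount=0.0):
    """Post-process a model prediction with the context-aware rules."""
    lowered = text.lower()
    # A cash deposit that credits the account is income, whatever the model said.
    if "cash deposit" in lowered and credit_amount > 0:
        return "Income"
    # "Cashback" without the word in the narration is treated as a misclassification.
    if category == "Cashback" and "cashback" not in lowered:
        return "Income"
    # A pure credit must be income-related.
    if credit_amount > 0 and debit_amount == 0 and category not in ("Income", "Cashback"):
        return "Income"
    return category


# (narration, hypothetical raw prediction, credit, debit)
rows = [
    ("SWIGGY UPI-4023049214", "Food", 0.0, 350.0),
    ("CASH DEPOSIT ATM", "Cashback", 10000.0, 0.0),
    ("SALARY CREDIT", "Shopping", 50000.0, 0.0),
]
final = [apply_rules(text, cat, credit, debit) for text, cat, credit, debit in rows]
print(final)
```

Keeping the rules as a pure function of the prediction and the amounts makes them easy to unit test without loading the model.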
| Metric | Value |
|---|---|
| Overall Accuracy | 95.3% |
| Avg Confidence | 89.2% |
| Inference Speed | ~50ms/txn |
| Categories | 16 |
| Model Size | 255 MB |
| Supported Banks | All major Indian banks |
| Category | Precision | Notes |
|---|---|---|
| Food | 97.2% | High accuracy |
| Shopping | 94.8% | Good performance |
| Income | 99.1% | Rule-enhanced |
| Cashback | 98.5% | Rule-enhanced |
| Travel | 96.3% | Uber, Ola, IRCTC detected |
| Bill Payment | 95.7% | EMI distinguished |
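For readers who want to compute the same kind of metrics on their own labeled statements, overall accuracy and per-category precision can be derived as sketched below. The README does not say how the reported figures were produced, so this is only one standard way to do it, and the labels are toy data, not project results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of transactions whose predicted category matches the label."""
    hits = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return hits / len(y_true)


def per_category_precision(y_true, y_pred):
    """For each predicted category: correct predictions / total predictions."""
    predicted = Counter(y_pred)
    correct = Counter(pred for truth, pred in zip(y_true, y_pred) if truth == pred)
    return {category: correct[category] / count for category, count in predicted.items()}


# Toy labels for illustration only.
y_true = ["Food", "Food", "Income", "Travel"]
y_pred = ["Food", "Income", "Income", "Travel"]
print(accuracy(y_true, y_pred))
print(per_category_precision(y_true, y_pred))
```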
- Frontend: Streamlit
- Deep Learning: PyTorch, Hugging Face Transformers
- NLP Model: DistilBERT (distilbert-base-uncased)
- pandas: Data manipulation and cleaning
- numpy: Numerical operations
- openpyxl, xlrd: Excel file reading
- Plotly: Interactive charts in web app
- Matplotlib: Static charts for PDF reports
- ReportLab: PDF generation
- scikit-learn: ML utilities (optional)
- regex: Pattern matching for feature extraction
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch

      git checkout -b feature/AmazingFeature

- Commit your changes

      git commit -m 'Add some AmazingFeature'

- Push to the branch

      git push origin feature/AmazingFeature

- Open a Pull Request
- Add support for more banks
- Improve categorization rules
- Add new expense categories
- Enhance PDF report design
- Add OCR support for scanned PDFs
- Multi-language support
This project is licensed under the MIT License - see the LICENSE file for details.