# FinancialRAG

A Retrieval-Augmented Generation (RAG) application for financial data analysis using the DeepSeek LLM, built with Streamlit and powered by LangChain.
FinancialRAG is an intelligent document analysis tool that allows you to upload financial PDF documents and ask questions about them using natural language. The application uses advanced RAG techniques to provide accurate, context-aware answers by retrieving relevant information from your documents.
## Features

- 📄 PDF Document Processing: Upload and process financial PDF documents using Docling
- 🔍 Semantic Search: Uses FAISS vector database for efficient similarity search
- 💬 Natural Language Q&A: Ask questions in plain English about your financial documents
- 🖼️ PDF Preview: View uploaded PDFs directly in the sidebar
- 💾 Persistent Storage: Save processed documents and reuse them without reprocessing
- 🔄 Streaming Responses: Real-time answer generation with streaming support
- 🎯 Context-Aware Answers: Leverages DeepSeek R1 model for accurate financial analysis
## Prerequisites

Before running this application, you need to have:
- Python 3.11+ installed on your system
- Poppler (for PDF-to-image conversion):
  - Ubuntu/Debian: `sudo apt-get install poppler-utils`
  - macOS: `brew install poppler`
  - Windows: download from poppler for Windows
- Ollama installed and running locally
- Required Ollama models:
  - `nomic-embed-text` (for embeddings)
  - `deepseek-r1:1.5b` (for question answering)
## Installation

- Install Ollama from https://ollama.ai
- Pull the required models:

  ```bash
  ollama pull nomic-embed-text
  ollama pull deepseek-r1:1.5b
  ```

- Ensure Ollama is running (see the sanity check after these steps):

  ```bash
  ollama serve
  ```
- Clone the repository:

  ```bash
  git clone https://github.com/hyperion912/FinancialRAG.git
  cd FinancialRAG
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
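If you want to sanity-check the Ollama side from Python, Ollama exposes a local REST API whose `/api/tags` endpoint lists installed models. A minimal check, assuming the default base URL and the `requests` package (not a project dependency):

```python
# check_ollama.py - verify Ollama is reachable and the required models are pulled.
# Assumes Ollama's default base URL; adjust if yours differs.
import requests

BASE_URL = "http://localhost:11434"
REQUIRED = {"nomic-embed-text", "deepseek-r1:1.5b"}

resp = requests.get(f"{BASE_URL}/api/tags", timeout=5)
resp.raise_for_status()

# /api/tags returns {"models": [{"name": "deepseek-r1:1.5b", ...}, ...]};
# names may carry a tag suffix such as "nomic-embed-text:latest".
installed = {m["name"] for m in resp.json().get("models", [])}
missing = {m for m in REQUIRED
           if not any(i == m or i.startswith(m + ":") for i in installed)}

print("Ollama is running.")
print("Missing models:", missing or "none")
```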
## Usage

- Start the Streamlit application:

  ```bash
  streamlit run app.py
  ```

- Open your browser and navigate to http://localhost:8501
- Upload a new document:
  - Select "Upload New Document" from the dropdown
  - Upload a PDF file containing financial data
  - Click "Process PDF and Store in Vector DB"
  - Wait for processing to complete
- Query existing documents:
  - Select a previously processed document from the dropdown
  - Enter your question in the text input field
  - Click "Submit Question" to get an answer
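The streaming responses mentioned under Features can be approximated in a few lines. A minimal sketch, assuming `ChatOllama` from `langchain-ollama` and Streamlit's `st.write_stream` (Streamlit 1.31+); the actual wiring in `app.py` may differ:

```python
# streaming_demo.py - hedged sketch of token-by-token streaming in Streamlit.
import streamlit as st
from langchain_ollama import ChatOllama

llm = ChatOllama(model="deepseek-r1:1.5b", base_url="http://localhost:11434")

question = st.text_input("Ask a question about your document")
if question:
    # llm.stream() yields message chunks; st.write_stream renders their
    # text content incrementally as it arrives from the model.
    st.write_stream(chunk.content for chunk in llm.stream(question))
```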
## Project Structure

```
FinancialRAG/
├── app.py              # Main Streamlit application
├── rag.py              # RAG pipeline implementation
├── ragbot.ipynb        # Jupyter notebook for experimentation
├── requirements.txt    # Python dependencies
├── vector_db/          # Storage for FAISS vector databases and PDFs
├── .devcontainer/      # Dev container configuration
└── README.md           # This file
```
## How It Works

- Document Processing: PDFs are converted to markdown using Docling
- Text Splitting: Markdown content is split into chunks based on headers
- Embedding: Text chunks are embedded using the `nomic-embed-text` model
- Vector Storage: Embeddings are stored in a FAISS vector database
- Retrieval: When a question is asked, relevant chunks are retrieved using MMR search
- Answer Generation: The DeepSeek model generates answers based on the retrieved context (see the sketch below)
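To make these six steps concrete, here is a minimal end-to-end sketch built from the libraries listed under Dependencies. It is illustrative only: the file name, variable names, and prompt are assumptions, not the actual contents of `rag.py`.

```python
# rag_sketch.py - illustrative end-to-end RAG pipeline (not the project's actual code).
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS

# 1. Document processing: Docling converts the PDF to markdown.
markdown = DocumentConverter().convert("report.pdf").document.export_to_markdown()

# 2. Text splitting: chunk on markdown headers so each chunk keeps its section context.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
chunks = splitter.split_text(markdown)

# 3-4. Embedding with nomic-embed-text, then storage in a FAISS index.
embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
store = FAISS.from_documents(chunks, embeddings)
store.save_local("vector_db/report")  # persist so the PDF never needs reprocessing

# 5. Retrieval: MMR balances relevance against redundancy among the top-k chunks.
retriever = store.as_retriever(search_type="mmr", search_kwargs={"k": 5})

# 6. Answer generation with DeepSeek, grounded in the retrieved context.
llm = ChatOllama(model="deepseek-r1:1.5b", base_url="http://localhost:11434")
question = "What was the year-over-year revenue growth?"
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```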
## Configuration

The application uses the following default configuration:

- Ollama Base URL: `http://localhost:11434`
- Embedding Model: `nomic-embed-text`
- LLM Model: `deepseek-r1:1.5b`
- Vector DB Folder: `vector_db/`
- Retrieval Method: MMR (Maximum Marginal Relevance)
- Top K Results: 5
To modify these settings, edit the respective files:

- `app.py` for Streamlit UI settings
- `rag.py` for RAG pipeline configuration (see the sketch below for how these values might be declared)
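Purely as an illustration (the real `rag.py` may organize this differently), the defaults could be declared as module-level constants:

```python
# Hypothetical configuration block for rag.py - names are illustrative assumptions.
OLLAMA_BASE_URL = "http://localhost:11434"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "deepseek-r1:1.5b"
VECTOR_DB_FOLDER = "vector_db/"
RETRIEVAL_SEARCH_TYPE = "mmr"  # Maximum Marginal Relevance
RETRIEVAL_TOP_K = 5
```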
## Dependencies

Core dependencies include:

- `langchain` - LLM orchestration framework
- `langchain-community` - Community integrations
- `langchain-ollama` - Ollama integration
- `faiss-cpu` - Vector similarity search
- `docling` - PDF-to-markdown conversion
- `streamlit` - Web UI framework
- `pdf2image` - PDF rendering

See `requirements.txt` for the complete list.
## Development

This project includes a dev container configuration for easy development setup. If you're using VS Code or GitHub Codespaces:
- Open the project in VS Code
- Click "Reopen in Container" when prompted
- The environment will be automatically configured
## Troubleshooting

### Ollama Connection Issues

- Ensure Ollama is running: `ollama serve`
- Check if the models are installed: `ollama list`
- Verify the base URL in the code matches your Ollama installation
### PDF Processing Errors

- Ensure the PDF is not corrupted
- Check if the PDF contains actual text, not just scanned images (see the check below)
- Verify you have sufficient disk space for image conversion
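One quick way to test whether a PDF has an extractable text layer is `pypdf`. Note that `pypdf` is not in `requirements.txt`; it is an assumed extra dependency used only for this check:

```python
# pdf_text_check.py - does the PDF carry real text, or only scanned images?
# Assumes pypdf is installed (pip install pypdf); not a project dependency.
from pypdf import PdfReader

reader = PdfReader("statement.pdf")
text = "".join(page.extract_text() or "" for page in reader.pages)

if text.strip():
    print(f"Found ~{len(text)} characters of extractable text.")
else:
    print("No text layer found - the PDF is probably scanned images and may need OCR.")
```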
### Memory or Performance Issues

- Consider using smaller documents
- Reduce the chunk size in `rag.py` (see the sketch below)
- Use a more lightweight embedding model
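Because header-based splitting alone can produce oversized chunks for long sections, one hedged option is a secondary size-based split with `RecursiveCharacterTextSplitter`. The 500/50 values below are illustrative, not the project's defaults:

```python
# Secondary size-based split to cap chunk length - illustrative, not the project's code.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

size_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # smaller chunks lower peak memory during embedding
    chunk_overlap=50,  # overlap preserves context across chunk boundaries
)

# Example input; in the pipeline this would be the header-split chunks.
docs = [Document(page_content="A very long financial section... " * 100)]
smaller_chunks = size_splitter.split_documents(docs)
print(len(smaller_chunks), "chunks")
```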
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is open source. Please check the repository for license information.