A Retrieval-Augmented Generation (RAG) system for querying and extracting insights from your documents using LLMs.
- Document Ingestion: Process files (PDF, DOCX, TXT, CSV, XLSX, images), web pages, and database content
- Vector Search: Semantic similarity search using ChromaDB with embedding models
- Hybrid Retrieval: Combine vector search with keyword filtering for more relevant results
- Question-Answering: Generate accurate responses based on context from your documents
- Structured Extraction: Extract specific data from documents using customizable schemas
- Modern Web Interface: React-based UI for document management and querying
- RESTful API: Comprehensive API for integration with other systems
- WebSocket Support: Real-time updates and notifications
- Multiple Data Sources: Connect to file systems, databases, websites, and APIs
- Health Monitoring: System health checks for all components
- Python 3.8+
- PostgreSQL database (optional, can use SQLite for development)
- MinIO or S3-compatible object storage (optional)
- ChromaDB for vector storage
- Node.js and npm for frontend development
- Install Python 3.8 or higher
- Install PostgreSQL (optional)
- Install MinIO (optional)
- Install ChromaDB (optional for local development)
- Install Node.js and npm for frontend
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/rag.git
  cd rag
  ```

- Create a virtual environment:

  ```bash
  python -m venv rag_env
  ```

- Copy the example environment file and edit it:

  ```bash
  cp .env.example .env
  ```

  - Edit `.env` with your configuration (database URL, API keys, etc.)

- Activate the virtual environment:
  - Windows: `rag_env\Scripts\activate`
  - macOS/Linux: `source rag_env/bin/activate`

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create required directories:

  ```bash
  mkdir -p data/uploads data/raw data/processed
  ```

- Start the backend:

  ```bash
  python run.py
  ```
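Once the backend is running, a quick way to confirm it from a script is to call the health endpoint listed in the API section below. This is a minimal sketch, assuming the backend listens on `http://localhost:8080` (the same address the frontend setup points at) and that the `requests` package is installed:

```python
import requests

# Assumed local backend address; adjust to match your configuration.
BASE_URL = "http://localhost:8080"

def check_health() -> None:
    # Call the health endpoint and print whatever status the backend reports.
    response = requests.get(f"{BASE_URL}/api/v1/health", timeout=10)
    response.raise_for_status()
    print(response.json())

if __name__ == "__main__":
    check_health()
```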
- Navigate to the frontend directory:

  ```bash
  cd frontend
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Set the environment variable for the API URL:
  - Create `.env.local` with `NEXT_PUBLIC_API_URL=http://localhost:8080`

- Start the development server:

  ```bash
  npm run dev
  ```

- Access the frontend at `http://localhost:3000`
- Make sure Docker and Docker Compose are installed
- Configure environment variables in `docker-compose.yml`
- Build and start the containers:

  ```bash
  docker-compose up -d
  ```
```
rag/
├── backend/ # Backend application
│ ├── app/ # Application code
│ │ ├── api/ # API endpoints
│ │ ├── core/ # Core configuration
│ │ ├── db/ # Database models and connection
│ │ ├── models/ # SQLAlchemy models
│ │ ├── schemas/ # Pydantic schemas
│ │ ├── services/ # Business logic
│ │ └── static/ # Static files
│ ├── data/ # Data directory
│ ├── tests/ # Test files
│ ├── .env # Environment variables
│ └── requirements.txt # Python dependencies
├── frontend/ # Frontend application
│ ├── public/ # Public assets
│ ├── src/ # Source code
│ │ ├── app/ # Next.js app directory
│ │ ├── components/ # React components
│ │ └── lib/ # Utility functions and types
│ ├── .env.local # Local environment variables
│ └── package.json # Node.js dependencies
└── docker-compose.yml # Docker configuration
```
The system is configured using environment variables in the `.env` file:

- `DATABASE_URL`: Connection string for the database
- `CHROMA_HOST`: ChromaDB host
- `CHROMA_PORT`: ChromaDB port
- `MINIO_URL`: MinIO/S3 URL
- `MINIO_ACCESS_KEY`: MinIO/S3 access key
- `MINIO_SECRET_KEY`: MinIO/S3 secret key
- `OPENAI_API_KEY`: OpenAI API key for embeddings and LLM
- `OPENAI_MODEL`: OpenAI model to use (default: gpt-4o)
- `EMBEDDING_MODEL`: Embedding model to use
- Check `.env.example` for all available configuration options
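As an illustration only, a minimal sketch of reading a few of these variables with `os.getenv`; the defaults shown are assumptions for local development, and the actual backend may load its settings differently (for example via a dedicated settings module):

```python
import os

# Hypothetical settings snapshot built from the environment variables above.
# Defaults here are illustrative assumptions, not the backend's real defaults.
settings = {
    "database_url": os.getenv("DATABASE_URL", "sqlite:///./rag.db"),
    "chroma_host": os.getenv("CHROMA_HOST", "localhost"),
    "chroma_port": int(os.getenv("CHROMA_PORT", "8000")),
    "openai_model": os.getenv("OPENAI_MODEL", "gpt-4o"),
}

print(settings)
```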
- Upload Documents: Use the web interface to upload files or add web pages
- View Documents: Browse and manage your uploaded documents
- Data Sources: Configure external data sources for automatic ingestion
- Chat Interface: Use the chat interface to ask questions about your documents
- Document Retrieval: View which documents were used to generate answers
- Performance Metrics: See retrieval and generation metrics for each query
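Besides the web interface, documents can also be ingested programmatically through the upload endpoint listed in the API section below. A minimal sketch, assuming a local backend at `http://localhost:8080`, the `requests` package, and a multipart field named `file` (the field name is an assumption, not documented here):

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed local backend address

# Hypothetical upload call against POST /api/v1/documents/upload.
# The multipart field name "file" is an assumption about the API's request format.
with open("report.pdf", "rb") as fh:
    response = requests.post(
        f"{BASE_URL}/api/v1/documents/upload",
        files={"file": ("report.pdf", fh, "application/pdf")},
        timeout=60,
    )
response.raise_for_status()
print(response.json())
```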
Extract specific data from documents using a schema definition:
```json
{
  "document_id": "your-document-id",
  "schema_definition": {
    "title": "string",
    "author": "string",
    "publication_date": "date",
    "abstract": "string",
    "keywords": "list"
  }
}
```
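A minimal sketch of sending this payload to the structured extraction endpoint listed in the API section below; the base URL and the use of `requests` are assumptions about a local setup:

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed local backend address

payload = {
    "document_id": "your-document-id",
    "schema_definition": {
        "title": "string",
        "author": "string",
        "publication_date": "date",
        "abstract": "string",
        "keywords": "list",
    },
}

# POST the schema definition to the structured extraction endpoint.
response = requests.post(
    f"{BASE_URL}/api/v1/extraction/structured-extract",
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json())  # extracted fields matching the schema
```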
- Documents: `/api/v1/documents`
  - `GET`: List all documents
  - `POST /upload`: Upload a document
  - `POST /web`: Process a web page
  - `DELETE /{id}`: Delete a document
- Query: `/api/v1/query`
  - `POST /retrieve`: Retrieve relevant documents
  - `POST /generate`: Generate an answer
- Data Sources: `/api/v1/datasources`
  - `GET`: List all data sources
  - `POST`: Add a new data source
  - `DELETE /{id}`: Delete a data source
- Extraction: `/api/v1/extraction/structured-extract`
  - `POST`: Extract structured data from documents
- Health: `/api/v1/health`
  - `GET`: Check system health
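As a sketch of how the query endpoints above might be called from a script; the base URL and the request body shape (a single `query` field) are assumptions, since the exact payload format is not documented here:

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed local backend address

# Hypothetical request body; the actual field names expected by
# /api/v1/query/* are not documented here and may differ.
question = {"query": "What are the key findings in the uploaded report?"}

# Retrieve relevant document chunks for the question.
retrieved = requests.post(f"{BASE_URL}/api/v1/query/retrieve", json=question, timeout=60)
retrieved.raise_for_status()

# Generate an answer grounded in the retrieved context.
answer = requests.post(f"{BASE_URL}/api/v1/query/generate", json=question, timeout=120)
answer.raise_for_status()

print(retrieved.json())
print(answer.json())
```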
The system supports WebSockets for real-time updates:
- WebSocket endpoint: `/api/ws`
- Test page: `/websocket`
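A minimal sketch of subscribing to these updates with the third-party `websockets` package; the `ws://localhost:8080` address is an assumption about a local setup, and messages are printed as received without assuming a particular format:

```python
import asyncio

import websockets  # third-party package: pip install websockets

WS_URL = "ws://localhost:8080/api/ws"  # assumed local backend address

async def listen() -> None:
    # Connect to the WebSocket endpoint and print incoming notifications.
    async with websockets.connect(WS_URL) as ws:
        async for message in ws:
            print("update:", message)

if __name__ == "__main__":
    asyncio.run(listen())
```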
- Database Connection Error
  - Verify database credentials in `.env`
  - Check if PostgreSQL is running
- MinIO Connection Error
  - Verify MinIO/S3 credentials in `.env`
  - Check if MinIO is running
- OpenAI API Key Error
  - Verify your OpenAI API key in `.env`
  - Check API key permissions
Run the application in debug mode for more detailed logs:
```bash
LOG_LEVEL=debug python run.py
```
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License