A modular, extensible framework for building and benchmarking Retrieval Augmented Generation (RAG) systems.
RAG Bench provides a flexible foundation for:
- Building customizable RAG pipelines with interchangeable components
- Benchmarking different RAG configurations and strategies
- Evaluating performance across a range of metrics
- Experimenting with advanced techniques like query enhancement and reranking
The system is designed to be extended for domain-specific applications while maintaining a consistent architecture.
This template provides a complete RAG system with all essential components implemented:
rag_bench/
├── components/              # Core component implementations
│   ├── embedding_component.py
│   ├── llm_component.py
│   └── vector_store_component.py
├── core/                    # Core system logic and types
│   ├── document_processors.py
│   ├── engine.py
│   ├── query_enhancers.py
│   └── types.py
├── db/                      # Database connections and models
│   └── base.py
├── dependency_injection.py  # Dependency injection configuration
├── evaluation/              # Benchmarking and evaluation framework
│   ├── benchmark.py
│   ├── metrics.py
│   ├── run_benchmark.py
│   └── sample_data/
├── main.py                  # Application entry point
├── models/                  # Data models
│   └── document.py
├── routers/                 # API routes
│   └── api_v1/
├── settings/                # Configuration and settings
│   ├── settings.py
│   └── settings_loader.py
└── workflows/               # Document processing workflows
    └── ingest.py
- Modular Architecture: Swap components without changing the core system
- Comprehensive Evaluation: Measure retrieval quality, answer correctness, latency, and more
- Multiple Strategies: Compare different retrieval, processing, and generation approaches
- Benchmarking Framework: Run standardized tests across configurations
- Extension Points: Add custom implementations for specific use cases
RAG Bench includes the following core components:
- Document Ingestor: Processes and chunks documents with metadata preservation
- Text Splitter: Divides documents into appropriately sized chunks with overlap
- Metadata Management: Preserves document provenance and relationships
- Embedding Components: Generate vector representations of text using configurable models
- Vector Store: Efficiently store and retrieve embeddings using PGVector
- Query Enhancers: Multiple strategies to improve query effectiveness:
- LLM-based query expansion for broader semantic coverage
- Hyponym expansion for adding related terms
- Stop word removal for focusing on meaningful terms (a minimal sketch follows this component list)
- Hybrid approaches that combine multiple techniques
- Document Processors: Advanced filtering and reranking:
- Threshold filtering based on similarity scores
- Semantic reranking using embedding models
- LLM-based reranking for nuanced relevance assessments
- Diversity reranking to reduce redundancy in results
- LLM Integration: Flexible integration with multiple LLM providers
- RAG Engine: Core orchestration of the retrieval and generation process
- Evaluation Framework: Comprehensive benchmarking system:
- Metrics collection for retrieval quality and generation performance
- LLM-based evaluation of answer correctness, completeness, and helpfulness
- Precision/recall calculation against known relevant documents
- Comparative reporting across different configurations
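To make these interfaces concrete, here is a minimal sketch of the stop word removal strategy, written against the QueryEnhancer interface shown in the extension examples later in this README. The class name and the stop-word list are illustrative, not part of the shipped components:

```python
from typing import Optional

from rag_bench.core.types import QueryEnhancer


class StopWordRemovalEnhancer(QueryEnhancer):
    """Drops common stop words so retrieval focuses on meaningful terms."""

    # Illustrative stop-word list; a real implementation might use NLTK or spaCy.
    STOP_WORDS = {"a", "an", "the", "is", "are", "of", "for", "to", "in", "what", "how"}

    async def enhance(self, query: str, conversation_id: Optional[str] = None) -> str:
        kept = [word for word in query.split() if word.lower() not in self.STOP_WORDS]
        # Fall back to the original query if everything was filtered out.
        return " ".join(kept) if kept else query
```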
- Python 3.10+ (3.12 recommended for best compatibility)
- PostgreSQL with the pgvector extension (detected automatically by the setup script)
- Poetry for dependency management
Note: The setup_all.py script will attempt to detect your PostgreSQL installation automatically and fall back to mock mode if it is not found.
# Install pyenv (macOS with Homebrew)
brew update
brew install pyenv
# For zsh (default on newer macOS)
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(pyenv init -)"' >> ~/.zshrc
# For bash
# echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
# echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
# echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
# Restart your terminal or source the configuration
source ~/.zshrc # or source ~/.bash_profile for bash
# Install Python 3.12
pyenv install 3.12
# Clone the repository
git clone https://github.com/yourusername/rag-bench.git
cd rag-bench
# Set Python 3.12 as the version for this directory
pyenv local 3.12
# Install Poetry if you don't have it
# curl -sSL https://install.python-poetry.org | python3 -
# Configure Poetry to use Python 3.12
poetry env use $(pyenv which python)
# Install dependencies
poetry install
# Install PostgreSQL
brew install postgresql@16
# Start PostgreSQL service
brew services start postgresql@16
# Add PostgreSQL to your PATH (one-time setup)
export PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH"
# For permanent setup, add this line to your .zshrc or .bash_profile
# Create database
createdb rag_bench
# Install pgvector
brew install pgvector
# For PostgreSQL 16, you may need to build pgvector from source instead
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install
# Enable the pgvector extension
psql -d rag_bench -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Install PostgreSQL
sudo apt update
sudo apt install -y postgresql postgresql-contrib build-essential postgresql-server-dev-all
# Start PostgreSQL service
sudo systemctl start postgresql
sudo systemctl enable postgresql
# Switch to postgres user
sudo -u postgres psql
# Create a database (run inside psql)
CREATE DATABASE rag_bench;
\q
# Install pgvector from source
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
# Enable pgvector extension
sudo -u postgres psql -d rag_bench -c "CREATE EXTENSION IF NOT EXISTS vector;"
Update the settings.yaml file with your PostgreSQL credentials:
# PostgreSQL settings
postgres:
  host: localhost
  port: 5432
  user: yourusername  # Change to your system username
  password: ""        # Set password if required
  database: rag_bench
  schema_name: public

pgvector:
  host: localhost
  port: 5432
  user: yourusername  # Change to your system username
  password: ""        # Set password if required
  database: rag_bench
  schema_name: public
The initialization script will download a small, freely available LLM model by default (TinyLlama 1.1B). However, if you want to use larger or restricted models like Llama-3, you'll need to authenticate with Hugging Face:
- Create a Hugging Face account at https://huggingface.co/join if you don't already have one
- Generate an access token:
  - Go to https://huggingface.co/settings/tokens
  - Click "New token"
  - Name it (e.g., "rag-bench-access")
  - Select "read" access
  - Click "Generate token"
  - Copy the token
- Log in using the Hugging Face CLI:
# Install huggingface_hub if needed
pip install huggingface_hub
# Login with your token
huggingface-cli login
# Or alternative method:
python -c "from huggingface_hub import login; login()"Enter your token when prompted. This will save your credentials locally.
- View available models and specify one to download:
# List all available models
poetry run python initialize_models.py --list-models
# Download a specific model
poetry run python initialize_models.py --model phi-2.Q4_K_M.gguf
Available models include:
- TinyLlama 1.1B (700MB) - Default, fastest but least capable
- Phi-2 (1.3GB) - Good balance of size and quality
- Orca-2 7B (3.8GB) - Higher quality, requires more RAM
- Llama-3 8B (4.6GB) - Highest quality, requires special access approval
For the easiest setup experience, we provide a unified setup script that handles everything:
# Run the all-in-one setup script
poetry run python setup_all.py
This script will:
- Check PostgreSQL availability (automatically finds Homebrew installations)
- Create the database and enable pgvector extension
- Download a small LLM model for local inference (TinyLlama 1.1B)
- Drop existing vector tables to prevent dimension mismatch errors
- Ingest sample documents
- Provide instructions for testing the system
The script is designed to be robust and will:
- Automatically detect PostgreSQL installations in standard locations
- Fall back to mock mode if PostgreSQL isn't found
- Fall back to mock mode if model download fails
- Handle clean reinstalls by dropping and recreating tables
- Work with minimal user intervention
Note: If you prefer to perform these steps manually, you can run each component separately:
# Download LLM model
poetry run python initialize_models.py
# See available models
poetry run python initialize_models.py --list-models
# Ingest sample documents
poetry run python ingest_docs.py
Note: LLM model files can be large (4-6 GB for the larger models). Please ensure you have sufficient disk space.
If you encounter any of these issues:
Vector Dimension Mismatch Error
DimensionError: Cannot insert 768-dimensional vector into 1536-dimensional column
Solution: Re-run setup_all.py which will drop existing tables with mismatched dimensions.
PostgreSQL Not Found
If PostgreSQL isn't found in the standard paths, the script will automatically fall back to mock mode, which still allows you to test the system functionality without a database.
LLM Model Errors
If model downloads fail due to Hugging Face rate limits or network issues, the system will fall back to mock LLM mode.
The system can run in different modes:
- Fully Local Mode - All components run locally (no API keys needed)
- Hybrid Mode - Some components use local resources, others use cloud APIs
- Cloud Mode - All components use cloud APIs (requires API keys)
Edit settings.yaml to configure your preferred mode:
This mode uses local LLM inference, local embeddings, and PostgreSQL:
# Local mode (no API keys required)
llm:
  mode: local
embedding:
  mode: huggingface
vectorstore:
  mode: pgvector

# Local LLM settings
local_llm:
  model_path: models/llama-3-8b-instruct.gguf
  context_length: 4096
  n_gpu_layers: 0  # Increase for GPU acceleration
  max_tokens: 1024
  temperature: 0.7

# HuggingFace embedding settings
huggingface:
  embedding_model: sentence-transformers/all-mpnet-base-v2

# PostgreSQL connection settings
pgvector:
  host: localhost
  port: 5432
  user: yourusername  # Change this to your system username
  password: ""        # Update with your password if needed
  database: rag_bench
This mode uses OpenAI for LLM inference and embeddings:
# Cloud mode (API keys required)
llm:
  mode: openai
embedding:
  mode: openai
vectorstore:
  mode: pgvector  # Still uses local PostgreSQL

# OpenAI settings
openai:
  api_key: ${OPENAI_API_KEY}  # Set this environment variable
  model: gpt-4o
  embedding_model: text-embedding-3-large

# PostgreSQL connection settings
pgvector:
  host: localhost
  port: 5432
  user: yourusername  # Change this to your system username
  password: ""        # Update with your password if needed
  database: rag_bench
This mode combines a local LLM with OpenAI embeddings:
# Hybrid mode
llm:
  mode: local
embedding:
  mode: openai  # Uses OpenAI for embeddings
vectorstore:
  mode: pgvector

# Local LLM settings
local_llm:
  model_path: models/llama-3-8b-instruct.gguf
  context_length: 4096
  n_gpu_layers: 0

# OpenAI settings (only for embeddings)
openai:
  api_key: ${OPENAI_API_KEY}
  embedding_model: text-embedding-3-large
For the quickest setup:
# Run the all-in-one setup script
poetry run python setup_all.py
This script handles:
- PostgreSQL detection and initialization
- Database and pgvector extension setup
- Downloading LLM models
- Creating tables and schema
- Ingesting sample documents
If you encounter any issues, the script will provide troubleshooting guidance.
After setup is complete, start the server:
# Start the server
poetry run python -m rag_bench.main
The server will run at http://localhost:8000 by default.
After starting the server, verify it's working correctly using the test script:
# Make the script executable
chmod +x simple_test.sh
# Run the test script
./simple_test.sh
This script makes basic curl requests to check that the server responds correctly to two test queries:
- "What is RAG?"
- "What are the key components of a RAG system?"
The response should include an answer and sources with relevance scores. If you see responses formatted as JSON, the server is functioning correctly.
Example output:
{
"answer": "RAG is a technique that enhances LLM outputs with external knowledge.",
"sources": [
{
"source": "sample",
"relevance_score": 0.7988745719194412,
"title": "RAG Introduction"
},
{
"source": "sample",
"relevance_score": 0.8848782032728195,
"title": "LLMs"
},
{
"source": "sample",
"relevance_score": 0.9188621118664742,
"title": "Embeddings"
}
]
}
Before making queries, you need to ingest documents into the vector database. Here's a basic example:
# Example script to ingest documents (save as ingest_docs.py)
import asyncio

from langchain.schema import Document as LangchainDocument

from rag_bench.settings.settings import Settings
from rag_bench.settings.settings_loader import load_settings
from rag_bench.dependency_injection import get_vector_store_component


async def ingest_sample_documents():
    # Load settings
    settings_dict = load_settings("settings.yaml")
    settings = Settings.model_validate(settings_dict)

    # Create vector store component
    vector_store = get_vector_store_component(settings)

    # Create sample documents
    documents = [
        LangchainDocument(
            page_content="RAG (Retrieval Augmented Generation) is a technique that enhances LLM outputs with external knowledge.",
            metadata={"source": "sample", "title": "RAG Introduction"},
        ),
        LangchainDocument(
            page_content="Vector databases store and retrieve embeddings efficiently, enabling semantic search.",
            metadata={"source": "sample", "title": "Vector Databases"},
        ),
        LangchainDocument(
            page_content="Embeddings convert text into numerical vectors that capture semantic meaning.",
            metadata={"source": "sample", "title": "Embeddings"},
        ),
    ]

    # Add documents to the vector store
    await vector_store.aadd_documents(documents)
    print(f"Ingested {len(documents)} documents")


if __name__ == "__main__":
    asyncio.run(ingest_sample_documents())
Run the script to ingest the sample documents:
poetry run python ingest_docs.py
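To ingest your own files instead of the inline samples, split them into overlapping chunks first. Here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter; the file path and chunk sizes are illustrative:

```python
# Sketch: chunk a local text file for ingestion (path and sizes are illustrative)
from langchain.schema import Document as LangchainDocument
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

with open("docs/my_domain_notes.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)
documents = [
    LangchainDocument(
        page_content=chunk,
        metadata={"source": "my_domain_notes.txt", "chunk": i},
    )
    for i, chunk in enumerate(chunks)
]
# Pass `documents` to vector_store.aadd_documents(...) exactly as in ingest_docs.py above.
```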
Once documents are ingested and the server is running, you can test the system:
- Using the API endpoint:
  curl "http://localhost:8000/api/v1/query?q=What%20is%20RAG"
- Using a web browser:
  Open http://localhost:8000/api/v1/query?q=What%20is%20RAG in your browser
You should receive a response that includes information retrieved from the documents along with sources.
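You can also call the endpoint from Python; a minimal sketch assuming the requests package is installed:

```python
# Query the running server from Python (assumes `pip install requests`)
import requests

response = requests.get(
    "http://localhost:8000/api/v1/query",
    params={"q": "What is RAG?"},
    timeout=30,
)
response.raise_for_status()

data = response.json()
print(data["answer"])
for source in data.get("sources", []):
    print(f'- {source["title"]} (relevance: {source["relevance_score"]:.2f})')
```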
RAG Bench includes a comprehensive benchmarking framework for systematic evaluation of different configurations:
# Run a benchmark with different configurations
poetry run python -m rag_bench.evaluation.run_benchmark --config rag_bench/evaluation/sample_data/benchmark_config.json
# Or use the convenience script
./benchmark.sh
The benchmark will:
- Compare different RAG configurations (baseline, query expansion, reranking, etc.)
- Run each query from the evaluation set through each configuration
- Generate metrics for retrieval quality, answer quality, and performance
- Output detailed results to the benchmark_results directory
The benchmarking framework allows comparing multiple system configurations:
{
"name": "rag_benchmark_basic",
"description": "Basic RAG benchmark comparing different configurations",
"evaluation_set_path": "path/to/evaluation_set.json",
"output_dir": "benchmark_results",
"use_llm_evaluation": true,
"num_iterations": 1,
"configurations": [
{
"name": "baseline",
"similarity_top_k": 3,
"similarity_threshold": 0.7,
"use_reranking": false,
"use_query_expansion": false
},
{
"name": "with_reranking",
"similarity_top_k": 5,
"similarity_threshold": 0.5,
"use_reranking": true,
"reranker_type": "semantic",
"use_query_expansion": false
},
{
"name": "full_pipeline",
"similarity_top_k": 5,
"similarity_threshold": 0.5,
"use_reranking": true,
"reranker_type": "hybrid",
"use_query_expansion": true,
"query_expansion_type": "llm"
}
]
}
The system collects and compares multiple metrics across configurations (a minimal precision/recall sketch follows this list):
- Retrieval Performance: Precision, recall, document scores
- Runtime Performance: Total time, retrieval time, generation time
- Answer Quality: Correctness, completeness, conciseness, groundedness
- Resource Usage: Number of documents retrieved and used
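For reference, here is a minimal sketch of how retrieval precision and recall can be computed against the relevant_doc_ids listed in an evaluation set. It is a simplified stand-in for the logic in evaluation/metrics.py, not the exact implementation:

```python
from typing import List, Set, Tuple


def retrieval_precision_recall(retrieved_ids: List[str], relevant_ids: Set[str]) -> Tuple[float, float]:
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant."""
    if not retrieved_ids or not relevant_ids:
        return 0.0, 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids), hits / len(relevant_ids)


# Example: 2 of 3 retrieved documents are relevant, and both known-relevant documents were found
precision, recall = retrieval_precision_recall(
    ["doc-industry-y-1", "doc-requirements-x-1", "doc-unrelated-9"],
    {"doc-industry-y-1", "doc-requirements-x-1"},
)
print(precision, recall)  # 0.67, 1.0
```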
Benchmarks generate several output files for analysis (a post-processing sketch follows this list):
- Summary Reports: High-level comparison of configurations
- Detailed CSV Data: Complete metrics for each query
- Comparison Charts: Visual comparison of key metrics
- Raw JSON Results: Complete data for further analysis
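The CSV and JSON outputs are convenient to post-process. Below is a minimal pandas sketch; the file name and column names (configuration, total_time, answer_correctness) are assumptions about the output layout rather than a documented schema:

```python
# Compare configurations from a benchmark CSV (file and column names are assumed, not guaranteed)
import pandas as pd

df = pd.read_csv("benchmark_results/detailed_results.csv")

summary = (
    df.groupby("configuration")[["total_time", "answer_correctness"]]
    .mean()
    .sort_values("answer_correctness", ascending=False)
)
print(summary)
```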
This template is designed to be extended for domain-specific applications. Here are the key extension points:
Create custom implementations by extending the base classes:
from rag_bench.core.types import QueryEnhancer
from typing import Optional

class DomainSpecificEnhancer(QueryEnhancer):
    """Enhances queries with domain-specific knowledge."""

    async def enhance(self, query: str, conversation_id: Optional[str] = None) -> str:
        # Custom domain-specific logic here
        # Example: add industry terminology, expand domain abbreviations, etc.
        # Illustrative expansion map; replace with your own domain vocabulary
        expansions = {"RAG": "RAG (Retrieval Augmented Generation)"}
        enhanced_query = query
        for term, expansion in expansions.items():
            enhanced_query = enhanced_query.replace(term, expansion)
        return enhanced_query

from rag_bench.core.types import DocumentPostProcessor, DocumentWithScore
from typing import List

class DomainRelevanceProcessor(DocumentPostProcessor):
    """Processes documents based on domain-specific relevance criteria."""

    async def process(self, documents: List[DocumentWithScore], query: str) -> List[DocumentWithScore]:
        # Custom filtering or reranking logic
        # Example: apply domain-specific weighting, filter by recency, etc.
        # Illustrative threshold filter; assumes DocumentWithScore exposes a `score` field
        processed_documents = [doc for doc in documents if doc.score >= 0.5]
        return processed_documents
Create evaluation sets with queries and expected answers relevant to your domain:
{
"name": "Domain Specific Questions",
"description": "Evaluation set for testing domain-specific knowledge",
"queries": [
{
"id": "domain-001",
"query": "What are the key requirements for X in industry Y?",
"expected_answer": "The key requirements for X in industry Y include...",
"relevant_doc_ids": ["doc-industry-y-1", "doc-requirements-x-1"]
},
{
"id": "domain-002",
"query": "How does process Z affect outcomes in scenario W?",
"expected_answer": "Process Z affects outcomes in scenario W by...",
"relevant_doc_ids": ["doc-process-z-1", "doc-scenario-w-1"]
}
],
"metadata": {
"domain": "Your Domain",
"version": "1.0"
}
}
The system exposes a REST API for interacting with the RAG pipeline (a request sketch follows the endpoint list):
- GET /api/v1/query?q=your_query - Simple query endpoint
- POST /api/v1/chat/message - Chat interface with conversation history
- POST /api/v1/ingest - Add documents to the system
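The request body for the write endpoints depends on the router implementation in routers/api_v1. The sketch below assumes a simple JSON payload of documents with content and metadata; adjust it to match the actual schema:

```python
# Add a document via the ingest endpoint (payload shape is an assumption; check routers/api_v1)
import requests

payload = {
    "documents": [
        {
            "content": "Reranking reorders retrieved documents by estimated relevance.",
            "metadata": {"source": "notes", "title": "Reranking"},
        }
    ]
}
response = requests.post("http://localhost:8000/api/v1/ingest", json=payload, timeout=60)
print(response.status_code, response.json())
```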
To adapt this template for your specific use case:
- Domain-Specific Data: Add your own document ingestion pipelines in the workflows directory
- Custom Enhancers: Implement domain-specific query enhancers for terminology, abbreviations, etc.
- Custom Processors: Add specialized document processors for your content types
- Evaluation Sets: Create benchmark data relevant to your domain
- UI Integration: Extend the API with additional endpoints as needed
This template can be adapted for various RAG applications:
- Customer Support: Connect to product documentation and support tickets
- Legal Research: Link to case law, statutes, and legal documents
- Financial Analysis: Connect to financial reports, news, and market data
- Medical Information: Adapt for connecting to medical literature and clinical guidelines
- Technical Documentation: Build a system for software documentation and code examples
This project is licensed under the MIT License - see the LICENSE file for details.