A comprehensive Retrieval-Augmented Generation (RAG) system designed for the insurance industry. This system builds an intelligent knowledge base from documentation, policy materials, and customer service guides, enabling accurate, up-to-date responses to insurance-related queries.
This project teaches you:
- Embeddings: Converting text into numerical vectors for semantic search
- Vector Databases: Storing and querying high-dimensional vectors (ChromaDB, Pinecone, Weaviate)
- Chunking Strategies: Breaking documents into optimal-sized pieces
- Re-ranking: Improving retrieval quality with cross-encoders
- Knowledge Freshness: Detecting and flagging outdated content
- Token Counting: Understanding API usage and costs
- Caching: Using Redis to cache embeddings and LLM responses
- Request Batching: Efficiently processing multiple queries
- Model Comparison: Evaluating trade-offs between GPT-4, GPT-3.5, and local models
Retrieval-Augmented Generation (RAG) combines:
- Retrieval: Finding relevant documents from a knowledge base
- Augmentation: Adding retrieved context to prompts
- Generation: Using LLMs to generate answers based on retrieved context
Why RAG?
- Reduces hallucinations by grounding answers in source documents
- Enables up-to-date information beyond training cutoff
- Provides source citations for transparency
- More cost-effective than fine-tuning for domain knowledge
Embeddings are numerical representations of text that capture semantic meaning:
- Similar texts have similar embedding vectors
- Enables semantic search (finding relevant content by meaning, not just keywords)
- Common models: OpenAI text-embedding-ada-002, text-embedding-3-small/large
Example:
"auto insurance claim" β [0.123, -0.456, 0.789, ...]
"car insurance filing" β [0.125, -0.454, 0.791, ...] # Similar!
"restaurant menu" β [-0.234, 0.567, -0.123, ...] # Different!
Vector databases efficiently store and search millions of embeddings:
- ChromaDB: Open-source, lightweight, easy to use
- Pinecone: Managed, scalable, production-ready
- Weaviate: GraphQL-based, supports hybrid search
Key Operations:
- Insert: Add document embeddings with metadata
- Query: Find similar embeddings (cosine similarity, dot product)
- Filter: Combine vector search with metadata filters
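A minimal ChromaDB sketch of the insert / query / filter operations above (collection name, documents, and metadata fields are illustrative):

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.get_or_create_collection("insurance_docs")

# Insert: add documents with metadata (Chroma embeds them with its default model)
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=["Comprehensive coverage pays for damage not caused by a collision...",
               "To file a claim, contact your agent or use the mobile app..."],
    metadatas=[{"source": "policy_guide", "last_updated": "2024-01-15"},
               {"source": "claims_faq", "last_updated": "2024-03-02"}],
)

# Query: vector search combined with a metadata filter
results = collection.query(
    query_texts=["How do I file a claim?"],
    n_results=2,
    where={"source": "claims_faq"},
)
print(results["documents"])
```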
Why chunk documents?
- LLM context windows are limited (GPT-4: 128k tokens, GPT-3.5: 16k tokens)
- Better retrieval when chunks match query granularity
- Balance: too small = lose context, too large = irrelevant content
Strategies:
- Fixed-size chunking: Simple, predictable sizes (e.g., 500 tokens)
- Sentence-aware chunking: Split at sentence boundaries
- Semantic chunking: Group semantically related sentences
- Recursive chunking: Hierarchical splitting (paragraph → sentence → word)
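A minimal sketch of the simplest strategy, fixed-size chunking with overlap, using tiktoken to count tokens (the chunk size and overlap values are illustrative defaults):

```python
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token chunks with a sliding overlap."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = chunk_by_tokens("Comprehensive coverage pays for damage... " * 200)
print(len(chunks), "chunks")
```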
Problem: Vector search returns many candidates, but the top-K hits are not always the most relevant.
Solution: Re-rank results with cross-encoder models (e.g., BGE-reranker, Cohere Rerank).
Two-stage approach:
- Retrieval: Fast vector search returns 50-100 candidates
- Re-ranking: Slower but more accurate model ranks top 5-10
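A minimal sketch of the second stage using a sentence-transformers cross-encoder; the candidate list would normally come from the vector search step:

```python
from sentence_transformers import CrossEncoder

# Stage 1 (not shown): vector search returns ~50-100 candidate chunks
candidates = [
    "To file a comprehensive claim, call your agent or use the mobile app...",
    "Collision coverage applies when your car hits another vehicle...",
    "Our office hours are Monday through Friday...",
]

# Stage 2: a cross-encoder scores each (query, chunk) pair jointly
reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "How do I file a comprehensive claim?"
scores = reranker.predict([(query, chunk) for chunk in candidates])

# Keep only the highest-scoring chunks for the prompt
top_k = sorted(zip(scores, candidates), reverse=True)[:2]
for score, chunk in top_k:
    print(f"{score:.3f}  {chunk[:60]}")
```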
Challenge: Documentation changes over time, and cached embeddings become stale.
Solution: Track source update dates and flag outdated content.
Implementation:
- Store metadata: last_updated, version, source_url
- Compare against known update dates
- Flag or re-embed outdated chunks
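A minimal sketch of a freshness check comparing each chunk's stored last_updated against the source's latest known update date (the metadata file layout and field names beyond those listed above are assumptions):

```python
import json
from datetime import datetime

def find_stale_chunks(metadata_path: str = "data/metadata.json") -> list[dict]:
    """Return chunks whose embedding predates the source's last update."""
    with open(metadata_path) as f:
        # Assumed layout: list of {"chunk_id", "last_updated", "source_last_updated", "source_url", ...}
        records = json.load(f)

    stale = []
    for record in records:
        embedded_at = datetime.fromisoformat(record["last_updated"])
        source_updated = datetime.fromisoformat(record["source_last_updated"])
        if source_updated > embedded_at:
            stale.append(record)  # flag for re-embedding
    return stale

for record in find_stale_chunks():
    print("Stale:", record["chunk_id"], record.get("source_url"))
```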
RAG/
├── README.md                      # This comprehensive guide
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment variables template
│
├── scraper/                       # Web scraping module
│   ├── __init__.py
│   ├── web_scraper.py             # Scrape documentation sites
│   └── content_extractor.py       # Extract and clean text
│
├── chunking/                      # Text chunking strategies
│   ├── __init__.py
│   ├── chunker.py                 # Main chunking logic
│   └── strategies.py              # Different chunking methods
│
├── embeddings/                    # Embedding generation
│   ├── __init__.py
│   ├── embedder.py                # Generate embeddings
│   └── providers.py               # Support multiple embedding models
│
├── vector_db/                     # Vector database integration
│   ├── __init__.py
│   ├── chromadb_client.py         # ChromaDB implementation
│   └── base.py                    # Abstract base class
│
├── reranking/                     # Re-ranking module
│   ├── __init__.py
│   └── reranker.py                # Re-rank retrieved results
│
├── query_interface/               # Query and generation
│   ├── __init__.py
│   ├── rag_pipeline.py            # Main RAG orchestration
│   ├── query_processor.py         # Process user queries
│   └── prompt_templates.py        # RAG prompt templates
│
├── cache/                         # Caching layer
│   ├── __init__.py
│   ├── redis_cache.py             # Redis caching for embeddings/responses
│   └── cache_manager.py           # Cache management utilities
│
├── optimization/                  # Cost & latency optimization
│   ├── __init__.py
│   ├── token_counter.py           # Count tokens and estimate costs
│   ├── latency_tracker.py         # Track request latency
│   ├── cost_calculator.py         # Calculate API costs
│   └── model_comparison.py        # Compare different models
│
├── freshness/                     # Knowledge freshness checks
│   ├── __init__.py
│   └── freshness_checker.py       # Check and flag outdated content
│
├── utils/                         # Utilities
│   ├── __init__.py
│   ├── config.py                  # Configuration management
│   └── logger.py                  # Logging utilities
│
├── data/                          # Data storage
│   ├── raw/                       # Raw scraped content
│   ├── processed/                 # Processed chunks
│   └── metadata.json              # Document metadata with update dates
│
├── logs/                          # Application logs
│
└── main.py                        # Main entry point
cd RAG
pip install -r requirements.txt

Create a .env file from .env.example:
# OpenAI (for embeddings and GPT models)
OPENAI_API_KEY=your_openai_key_here
# Redis (for caching) - optional, will use in-memory cache if not set
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
# Ollama (for local Llama 3) - optional
OLLAMA_BASE_URL=http://localhost:11434
# Configuration
EMBEDDING_MODEL=text-embedding-3-small
LLM_MODEL=gpt-3.5-turbo
VECTOR_DB=chromadb
CHUNK_SIZE=500
CHUNK_OVERLAP=50

# Scrape documentation (example: insurance policy documentation)
python main.py scrape --url https://example-insurance.com/documentation/policies/
# Or use local files
python main.py index --directory ./data/raw/policies/
# Process and embed documents
python main.py embed

# Interactive query
python main.py query "What is the deductible for comprehensive coverage?"
# With model comparison
python main.py query "How do I file a claim?" --compare-models
# Check knowledge freshness
python main.py check-freshness

- Answer policy questions instantly
- Provide accurate claim filing procedures
- Explain coverage details
- Help agents find answers quickly
- Ensure consistent information across team
- Reduce training time for new agents
- Search across thousands of policy documents
- Find relevant clauses and conditions
- Compare different policy types
- Track regulation updates
- Ensure documentation is current
- Flag outdated policies
Scrapes documentation websites and extracts clean text:
- Handles JavaScript-rendered content
- Extracts metadata (title, date, URL)
- Cleans HTML and formatting
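A minimal scraping sketch with requests and BeautifulSoup (the URL and the removed tags are illustrative; JavaScript-rendered pages would need a headless browser such as Playwright instead):

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> dict:
    """Fetch a documentation page and return cleaned text plus basic metadata."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Drop navigation, scripts, and styling before extracting text
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "text": soup.get_text(separator="\n", strip=True),
    }

page = scrape_page("https://example-insurance.com/documentation/policies/")
print(page["title"], len(page["text"]), "characters")
```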
Implements multiple chunking strategies:
- Fixed-size: Split by token count
- Sentence-aware: Respect sentence boundaries
- Semantic: Group related sentences
- Recursive: Hierarchical splitting
Generates embeddings for text chunks:
- Supports OpenAI, Cohere, HuggingFace models
- Batch processing for efficiency
- Handles rate limits and retries
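A minimal sketch of batch embedding with a simple retry/backoff on rate limits, using the OpenAI client (batch size and backoff values are illustrative):

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small",
                batch_size: int = 100, max_retries: int = 5) -> list[list[float]]:
    """Embed texts in batches, backing off and retrying on rate limits."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                response = client.embeddings.create(model=model, input=batch)
                vectors.extend(item.embedding for item in response.data)
                break
            except RateLimitError:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
        else:
            raise RuntimeError(f"Batch starting at {start} failed after {max_retries} retries")
    return vectors
```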
Stores and queries embeddings:
- Create collections with metadata
- Similarity search with filters
- Update and delete operations
Main orchestration:
- Embed user query
- Retrieve similar chunks from vector DB
- Re-rank results
- Augment prompt with context
- Generate answer with LLM
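A minimal end-to-end sketch of that orchestration, reusing the ChromaDB collection and cross-encoder from the earlier snippets (the prompt wording and candidate counts are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def answer(query: str, collection, reranker, top_k: int = 3) -> str:
    """Retrieve, re-rank, augment, and generate."""
    # 1-2. Embed the query and retrieve candidate chunks (Chroma embeds query_texts itself)
    results = collection.query(query_texts=[query], n_results=20)
    candidates = results["documents"][0]

    # 3. Re-rank candidates and keep the best few
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    context = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:top_k]]

    # 4-5. Augment the prompt with context and generate the answer
    prompt = ("Answer using only the context below.\n\nContext:\n"
              + "\n---\n".join(context)
              + f"\n\nQuestion: {query}")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are an insurance assistant."},
                  {"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```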
Monitors content staleness:
- Compares chunk timestamps with source dates
- Flags outdated content
- Triggers re-embedding when needed
Caches expensive operations:
- Embeddings (same text = same embedding)
- LLM responses (frequent queries)
- Retrieval results
Tracks API usage:
- Token counting (input + output)
- Cost estimation per request
- Cumulative usage tracking
Why it matters:
- OpenAI charges per token (input and output)
- GPT-4: ~$0.03 per 1K input tokens, ~$0.06 per 1K output tokens
- GPT-3.5: ~$0.0015 per 1K input tokens, ~$0.002 per 1K output tokens
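A minimal sketch of counting tokens with tiktoken and estimating per-request cost; the price table mirrors the approximate figures above and may be out of date:

```python
import tiktoken

PRICES = {  # USD per 1K tokens; approximate, see the model comparison table below
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
    "gpt-4": {"input": 0.03, "output": 0.06},
}

def estimate_cost(prompt: str, completion: str, model: str = "gpt-3.5-turbo") -> float:
    """Count input and output tokens and convert them to an estimated dollar cost."""
    encoding = tiktoken.get_encoding("cl100k_base")
    input_tokens = len(encoding.encode(prompt))
    output_tokens = len(encoding.encode(completion))
    price = PRICES[model]
    return input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]

print(f"${estimate_cost('What is my deductible?', 'Your deductible is...'):.6f}")
```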
Example:
# Query: "What is my deductible?"
# Input tokens: 50 (query + context)
# Output tokens: 100 (response)
# GPT-3.5 cost: (50/1000 * $0.0015) + (100/1000 * $0.002) = $0.000275

What to cache:
- Embeddings: Same text → same embedding (save API calls)
- LLM Responses: Frequent queries → cache responses
- Retrieval Results: Similar queries → reuse top chunks
Cache TTL:
- Embeddings: Never expire (immutable)
- LLM Responses: 24 hours (may need updates)
- Retrieval: 1 hour (documents may update)
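A minimal Redis caching sketch for embeddings and LLM responses with the TTLs above (the key naming and hashing scheme are illustrative):

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def _key(prefix: str, text: str) -> str:
    return f"{prefix}:{hashlib.sha256(text.encode()).hexdigest()}"

def cached_embedding(text: str, embed_fn) -> list[float]:
    """Embeddings are immutable, so cache them with no expiry."""
    key = _key("emb", text)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed_fn(text)
    r.set(key, json.dumps(vector))  # no TTL
    return vector

def cached_answer(query: str, answer_fn) -> str:
    """LLM responses may go stale, so expire them after 24 hours."""
    key = _key("llm", query)
    hit = r.get(key)
    if hit is not None:
        return hit
    result = answer_fn(query)
    r.set(key, result, ex=60 * 60 * 24)
    return result
```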
GPT-4 vs GPT-3.5 vs Llama 3:
| Model | Cost (per 1K tokens) | Latency | Quality | Best For |
|---|---|---|---|---|
| GPT-4 | ~$0.03 input | ~2-3s | Highest | Complex reasoning |
| GPT-3.5 | ~$0.0015 input | ~0.5s | Good | Simple queries |
| Llama 3 | Free (local) | ~1-2s | Good | Cost-sensitive, privacy |
When to use what:
- GPT-4: Complex questions requiring reasoning
- GPT-3.5: Simple Q&A, high-volume queries
- Llama 3: Cost-sensitive, privacy-critical, offline use
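One way to act on these trade-offs, as a sketch only: a simple per-query router (the heuristic below is an illustrative assumption, not part of the project):

```python
def choose_model(query: str, require_privacy: bool = False) -> str:
    """Route a query to a model based on the trade-offs above."""
    if require_privacy:
        return "llama3"          # local via Ollama: no data leaves the machine
    reasoning_markers = ("compare", "why", "explain the difference", "analyze")
    if len(query.split()) > 40 or any(m in query.lower() for m in reasoning_markers):
        return "gpt-4"           # complex questions requiring reasoning
    return "gpt-3.5-turbo"       # simple, high-volume Q&A

print(choose_model("What is my deductible?"))                         # gpt-3.5-turbo
print(choose_model("Compare comprehensive and collision coverage"))   # gpt-4
```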
- Learn about embeddings and vector similarity
- Implement simple chunking
- Build basic retrieval system
- Set up ChromaDB
- Implement semantic search
- Add metadata filtering
- Implement re-ranking
- Experiment with chunking strategies
- Optimize retrieval quality
- Add caching layer
- Implement token counting
- Compare model performance
- Knowledge freshness checks
- Error handling and retries
- Monitoring and logging
"How do I file a comprehensive auto insurance claim?"
Query → Embedding Vector: [0.123, -0.456, ...]
Search vector DB → Find top 10 similar chunks
Cross-encoder model → Rank and select top 3 chunks
System: "You are an insurance assistant..."
Context: [Retrieved chunks about claim filing]
User: "How do I file a comprehensive auto insurance claim?"
Generate answer based on context
Answer: "To file a comprehensive claim..."
Sources: [URL1, URL2, URL3]
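A minimal sketch of assembling that prompt and attaching source citations (the template wording and chunk fields are illustrative):

```python
def build_messages(query: str, chunks: list[dict]) -> list[dict]:
    """Assemble the system / context / user messages shown in the flow above."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk['text']} (source: {chunk['url']})"
        for i, chunk in enumerate(chunks)
    )
    system = ("You are an insurance assistant. Answer only from the provided context "
              "and cite sources by number. If the context is insufficient, say so.")
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_messages(
    "How do I file a comprehensive auto insurance claim?",
    [{"text": "File comprehensive claims via the mobile app or your agent...",
      "url": "https://example-insurance.com/claims"}],
)
```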
- Retrieval Quality: Are retrieved chunks relevant?
- Response Accuracy: Are answers correct?
- Latency: Query → response time
- Cost: Tokens and API costs per query
- Cache Hit Rate: Percentage of cached responses
- Freshness: Number of outdated chunks
All operations are logged:
- Query and response
- Retrieved chunks
- Token usage
- Latency
- Cache hits/misses
Problem: Poor retrieval quality
Solution: Experiment with chunk sizes (200-1000 tokens)

Problem: Vector search returns wrong chunks
Solution: Use re-ranking, improve embeddings, add metadata filters

Problem: Documents updated but embeddings stale
Solution: Implement freshness checks, re-embed periodically

Problem: Too many API calls
Solution: Cache aggressively, use GPT-3.5 for simple queries

Problem: Long response times
Solution: Cache responses, batch embeddings, use faster models
- ChromaDB Documentation
- LangChain RAG Tutorial
- OpenAI Embeddings Guide
- RAG Paper (Lewis et al., 2020)
- In-Context Retrieval-Augmented Language Models
- Experiment: Try different chunking strategies
- Optimize: Measure and improve latency/cost
- Scale: Move to Pinecone for larger datasets
- Enhance: Add hybrid search (vector + keyword)
- Deploy: Build API endpoint or web interface
This is a learning project. Feel free to:
- Add more chunking strategies
- Support additional vector databases
- Implement hybrid search
- Add evaluation metrics (retrieval accuracy, answer quality)
Built for learning RAG pipelines and optimization techniques in a real-world insurance context.