This project is a complete, production-ready backend system for Retrieval-Augmented Generation (RAG). Its primary purpose is to process uploaded PDF documents, convert them into a searchable format, and store them in a vector database. This enables a larger application to perform semantic searches and retrieve relevant document chunks to answer user questions.
The system is designed as a set of decoupled microservices that can be scaled and maintained independently. It is fully containerized with Docker for easy and consistent deployment.
- PDF Document Upload: Users can upload PDF files through a simple /upload REST endpoint.
- Asynchronous Processing: Document processing runs in the background on a task queue (Celery + Redis), so the API can respond quickly without making the user wait.
- Text Extraction: The system automatically extracts text from PDF files using PyMuPDF.
- Text Chunking & Embedding: It segments the extracted text into smaller chunks (sentences) and generates a semantic vector embedding for each chunk using a sentence-transformers model.
- Vector Storage: Embeddings and associated metadata (like the original filename) are stored in the Weaviate vector database.
- Task Status Tracking: A /status/{task_id} endpoint allows clients to poll for the status of their upload and processing job.
- Containerized & Ready for Deployment: The entire application stack is defined in a docker-compose.yml file, allowing you to build and run all services with a single command.
- Secure by Design: Includes measures for secure file handling, secrets management, and container security.
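The sentence-level chunking mentioned in the feature list can be sketched as follows. This is a minimal illustration, assuming a naive regex-based splitter; `split_into_sentences` is a hypothetical helper, and the splitter the worker actually uses may differ.

```python
import re

# Assumption: sentences end with '.', '!' or '?' followed by whitespace.
# This naive rule will mis-split abbreviations such as "e.g. this".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(text: str) -> list[str]:
    """Collapse runs of whitespace, then split after sentence-ending punctuation."""
    normalized = " ".join(text.split())
    return [piece for piece in _SENTENCE_END.split(normalized) if piece]


if __name__ == "__main__":
    sample = "Weaviate stores vectors. Celery runs tasks!  Is Redis the broker?"
    for chunk in split_into_sentences(sample):
        print(chunk)
```

Each returned string would then be passed to the embedding model as one chunk.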
The system is composed of the following microservices:
flowchart TD
A[User uploads PDF] --> B[API Gateway - FastAPI Web Server]
B -->|enqueue task| C[Redis + Celery Queue]
C --> D[Embedding Worker Service]
D -->|OCR/Text Extract| E[Text Extractor]
D -->|Embedding| F[Sentence Transformer]
D -->|Store| G[Weaviate Vector DB]
- API Gateway (FastAPI): The user-facing service that exposes a REST API for uploading documents and checking their processing status.
- Embedding Worker (Celery): A background worker that handles the heavy lifting of document processing.
- Weaviate Vector DB: A specialized database that stores the document chunks and their corresponding vector embeddings.
- Redis: Acts as a message broker, managing the queue of documents to be processed.
- Flower: A web-based monitoring tool for Celery.
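The division of labor between the gateway and the worker can be illustrated with an in-process simulation. This is only a sketch of the enqueue-then-poll pattern: `queue.Queue` and a plain dict stand in for Redis and Celery's task state, and the names `enqueue_document`, `worker_loop`, and `task_status` are hypothetical, not taken from the project.

```python
import queue
import threading
import uuid

# In-memory stand-ins (assumptions): the real system uses Redis as the broker
# and Celery to track task state.
task_queue: queue.Queue = queue.Queue()
task_status: dict[str, str] = {}


def enqueue_document(filename: str) -> str:
    """Gateway side: mark the task PENDING, enqueue it, and return immediately."""
    task_id = str(uuid.uuid4())
    task_status[task_id] = "PENDING"
    task_queue.put((task_id, filename))
    return task_id


def worker_loop() -> None:
    """Worker side: process queued tasks until a None sentinel arrives."""
    while True:
        item = task_queue.get()
        if item is None:
            break
        task_id, _filename = item
        # Text extraction, embedding, and storage would happen here.
        task_status[task_id] = "SUCCESS"


if __name__ == "__main__":
    worker = threading.Thread(target=worker_loop)
    worker.start()
    tid = enqueue_document("document.pdf")
    task_queue.put(None)  # stop the worker after the queued task
    worker.join()
    print(tid, task_status[tid])
```

The point of the pattern is that `enqueue_document` returns a task ID before any processing has happened, which is what lets the API respond quickly.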
| Component | Tech |
|---|---|
| API Gateway | FastAPI, Python 3.11 |
| Background Worker | Celery |
| Message Broker | Redis |
| PDF Parsing | PyMuPDF |
| Embedding | sentence-transformers |
| Vector DB | Weaviate |
| Monitoring | Flower Dashboard |
| Environment | Docker + docker-compose |
- Clone the repository:
  git clone <repository-url>
  cd rag_llama_index
- Create the environment file: Copy the example environment file to create your own configuration.
  cp .env.example .env
  You can modify the .env file to change the embedding model or other settings if needed.
- Build and run the services:
  docker-compose up --build
  This command builds the Docker images for the API gateway and the worker, then starts all the services.
All protected endpoints require an API key to be passed in the X-API-Key header. You can configure valid keys in your PostgreSQL database.
Send a POST request to the /upload endpoint with a PDF file.
Example using curl:
curl -X POST \
-H "X-API-Key: your-secret-api-key" \
-F "file=@/path/to/your/document.pdf" \
http://localhost:8000/upload

The API will respond with a task_id:
{
"message": "File uploaded successfully",
"task_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef"
}

Use the task_id from the upload response to check the status of the document processing.
Example using curl:
curl -H "X-API-Key: your-secret-api-key" \
http://localhost:8000/status/a1b2c3d4-e5f6-7890-1234-567890abcdef

The response will show the current status (PENDING, SUCCESS, FAILURE, etc.) and the result if the task is complete.
{
"task_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"status": "SUCCESS",
"result": {
"status": "success",
"chunks_processed": 150
}
}

Once a document has been processed successfully, you can search for information using the /query endpoint.
Example using curl:
curl -X POST \
-H "X-API-Key: your-secret-api-key" \
-H "Content-Type: application/json" \
-d '{"query": "What is the main topic of the document?"}' \
http://localhost:8000/query

This will return a task_id. You can use the /status endpoint again to retrieve the search results.
Example result from the /status endpoint:
{
"task_id": "...",
"status": "SUCCESS",
"result": {
"status": "success",
"results": [
{
"text": "This is a relevant chunk of text from your document.",
"file": "document.pdf"
},
{
"text": "This is another relevant chunk of text.",
"file": "document.pdf"
}
]
}
}

You can monitor the status of the background processing tasks using the Flower dashboard.
This dashboard provides a real-time overview of the Celery workers and the tasks they are processing.
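The curl calls in the usage examples can also be scripted from Python. The sketch below uses only the standard library and assumes the base URL, header name, and JSON shapes shown in those examples. The helper names (`build_query_request`, `fetch_status`, `wait_for_result`) are hypothetical, and treating SUCCESS and FAILURE as the only terminal states is an assumption.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # taken from the curl examples
API_KEY = "your-secret-api-key"     # placeholder, as in the curl examples


def build_query_request(question: str) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


def fetch_status(task_id: str) -> dict:
    """Call GET /status/{task_id} and decode the JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/status/{task_id}", headers={"X-API-Key": API_KEY}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_result(task_id, fetch=fetch_status, interval=1.0, max_polls=30):
    """Poll until the task reports SUCCESS or FAILURE, or give up."""
    for _ in range(max_polls):
        payload = fetch(task_id)
        # Assumption: any other status means the task is still in progress.
        if payload.get("status") in ("SUCCESS", "FAILURE"):
            return payload
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

To run a search, send the request with `urllib.request.urlopen(build_query_request("..."))`, read the `task_id` from the JSON response, and pass it to `wait_for_result`. The `fetch` parameter exists so the polling loop can be exercised without a running server.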