RAG Embedding Worker

This project is a complete, production-ready backend system for Retrieval-Augmented Generation (RAG). Its primary purpose is to process uploaded PDF documents, convert them into a searchable format, and store them in a vector database. This enables a larger application to perform semantic searches and retrieve relevant document chunks to answer user questions.

The system is designed as a set of decoupled microservices that can be scaled and maintained independently. It is fully containerized with Docker for easy and consistent deployment.

Features

PDF Document Upload: Users can upload PDF files through a simple /upload REST endpoint.
Asynchronous Processing: Document processing runs in the background on a task queue (Celery + Redis), so the API responds right away instead of blocking until processing finishes.
Text Extraction: The system automatically extracts text from PDF files using PyMuPDF.
Text Chunking & Embedding: It segments the extracted text into smaller chunks (sentences) and generates semantic vector embeddings for each chunk using a sentence-transformers model.
Vector Storage: Embeddings and associated metadata (like the original filename) are stored in the Weaviate vector database.
Task Status Tracking: A /status/{task_id} endpoint allows clients to poll for the status of their upload and processing job.
Containerized & Ready for Deployment: The entire application stack is defined in a docker-compose.yml file, allowing you to build and run all services with a single command.
Secure by Design: Includes measures for secure file handling, secrets management, and container security.
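
To make the chunking step concrete, here is a minimal sketch of sentence-level splitting. It assumes a naive rule (a sentence ends at `.`, `!` or `?` followed by whitespace). The function name `split_into_sentences` is hypothetical, and the worker's actual splitter may differ.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# Real text (abbreviations, decimals) needs a more careful splitter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(text: str) -> list[str]:
    """Split extracted text into sentence-like chunks, dropping empty pieces."""
    # PDF extraction often leaves hard line breaks; collapse all whitespace runs.
    normalized = " ".join(text.split())
    return [part for part in _SENTENCE_END.split(normalized) if part]


if __name__ == "__main__":
    sample = "RAG retrieves context first.\nThen a model answers! Does it scale?"
    for index, chunk in enumerate(split_into_sentences(sample)):
        print(index, chunk)
```

Each resulting chunk would then be passed to the embedding model and stored together with its metadata.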

Architecture

The system is composed of the following microservices:

flowchart TD
  A[User uploads PDF] --> B[API Gateway - FastAPI Web Server]
  B -->|enqueue task| C[Redis + Celery Queue]
  C --> D[Embedding Worker Service]
  D -->|OCR/Text Extract| E[Text Extractor]
  D -->|Embedding| F[Sentence Transformer]
  D -->|Store| G[Weaviate Vector DB]

API Gateway (FastAPI): The user-facing service that exposes a REST API for uploading documents and checking their processing status.
Embedding Worker (Celery): A background worker that handles the heavy lifting of document processing.
Weaviate Vector DB: A specialized database that stores the document chunks and their corresponding vector embeddings.
Redis: Acts as a message broker, managing the queue of documents to be processed.
Flower: A web-based monitoring tool for Celery.
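
As a rough illustration of what the vector database contributes, the sketch below ranks stored chunks against a query vector by brute force. It assumes cosine similarity as the metric. The names `cosine_similarity` and `top_k` are hypothetical, and a real vector database uses an approximate index rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float],
          stored: list[tuple[str, list[float]]],
          k: int = 2) -> list[str]:
    """Return the texts of the k stored chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in stored]
    # Highest similarity first; sorting on the score alone keeps ties in input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
    chunks = [("about cats", [1.0, 0.0]), ("about tax law", [0.0, 1.0]), ("cats and law", [1.0, 1.0])]
    print(top_k([1.0, 0.0], chunks, k=2))
```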

Technology Stack

| Component | Tech |
| --- | --- |
| API Gateway | FastAPI, Python 3.11 |
| Background Worker | Celery |
| Message Broker | Redis |
| PDF Parsing | PyMuPDF |
| Embedding | sentence-transformers |
| Vector DB | Weaviate |
| Monitoring | Flower Dashboard |
| Environment | Docker + docker-compose |

Getting Started

Prerequisites

Installation

Clone the repository:

git clone <repository-url>
cd rag_llama_index

Create the environment file: Copy the example environment file to create your own configuration.
```
cp .env.example .env
```
You can modify the .env file to change the embedding model or other settings if needed.
Build and run the services:
```
docker-compose up --build
```
This command will build the Docker images for the API gateway and the worker, and then start all the services.

How to Use

All protected endpoints require an API key to be passed in the X-API-Key header. You can configure valid keys in your PostgreSQL database.

1. Upload a PDF File

Send a POST request to the /upload endpoint with a PDF file.

Example using curl:

curl -X POST \
  -H "X-API-Key: your-secret-api-key" \
  -F "file=@/path/to/your/document.pdf" \
  http://localhost:8000/upload

The API will respond with a task_id:

{
  "message": "File uploaded successfully",
  "task_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef"
}
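
The same upload can be scripted from Python. The sketch below uses only the standard library and mirrors the curl call above (form field `file`, `X-API-Key` header, `task_id` in the response). The helper names `build_upload_request` and `upload_pdf` are hypothetical, and the `application/pdf` part type is an assumption.

```python
import json
import os
import urllib.request
import uuid


def build_upload_request(pdf_bytes: bytes, filename: str, api_key: str,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one PDF in the 'file' field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/upload",
        data=head + pdf_bytes + tail,
        headers={
            "X-API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def upload_pdf(path: str, api_key: str) -> str:
    """Upload the PDF at 'path' and return the task_id from the JSON response."""
    with open(path, "rb") as handle:
        request = build_upload_request(handle.read(), os.path.basename(path), api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["task_id"]
```

For example, `upload_pdf("/path/to/your/document.pdf", "your-secret-api-key")` would return the `task_id` string if the services are running locally.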

2. Check Processing Status

Use the task_id from the upload response to check the status of the document processing.

Example using curl:

curl -H "X-API-Key: your-secret-api-key" \
  http://localhost:8000/status/a1b2c3d4-e5f6-7890-1234-567890abcdef

The response will show the current status (PENDING, SUCCESS, FAILURE, etc.) and the result if the task is complete.

{
  "task_id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "status": "SUCCESS",
  "result": {
    "status": "success",
    "chunks_processed": 150
  }
}
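
Since processing is asynchronous, a client usually polls until the task finishes. Here is a minimal polling sketch using the standard library. It assumes `SUCCESS` and `FAILURE` are the only terminal states the client needs to handle. The names `fetch_status` and `wait_for_task`, the polling interval, and the attempt limit are all hypothetical.

```python
import json
import time
import urllib.request

# Assumption: any other status (e.g. PENDING) means "keep waiting".
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


def fetch_status(task_id: str, api_key: str,
                 base_url: str = "http://localhost:8000") -> dict:
    """GET /status/{task_id} and return the decoded JSON payload."""
    request = urllib.request.Request(
        f"{base_url}/status/{task_id}", headers={"X-API-Key": api_key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_task(task_id: str, fetch, interval: float = 2.0,
                  max_attempts: int = 30, sleep=time.sleep) -> dict:
    """Call fetch(task_id) until a terminal status appears or attempts run out."""
    for _ in range(max_attempts):
        payload = fetch(task_id)
        if payload["status"] in TERMINAL_STATES:
            return payload
        sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Against a running stack this could be called as `wait_for_task(task_id, lambda t: fetch_status(t, "your-secret-api-key"))`. Passing `fetch` and `sleep` as parameters keeps the loop easy to test without a server.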

3. Query Your Documents

Once a document has been processed successfully, you can search for information using the /query endpoint.

Example using curl:

curl -X POST \
  -H "X-API-Key: your-secret-api-key" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the main topic of the document?"}' \
  http://localhost:8000/query

This will return a task_id. You can use the /status endpoint again to retrieve the search results.

Example Result from /status endpoint:

{
    "task_id": "...",
    "status": "SUCCESS",
    "result": {
        "status": "success",
        "results": [
            {
                "text": "This is a relevant chunk of text from your document.",
                "file": "document.pdf"
            },
            {
                "text": "This is another relevant chunk of text.",
                "file": "document.pdf"
            }
        ]
    }
}
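
A client will typically build the query request and then pull the matching chunks out of the status payload. The sketch below follows the request and response shapes shown above. The helper names `build_query_request` and `extract_chunks` are hypothetical.

```python
import json
import urllib.request


def build_query_request(query: str, api_key: str,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the JSON POST for /query, mirroring the curl example."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def extract_chunks(status_payload: dict) -> list[tuple[str, str]]:
    """Return (file, text) pairs from a finished query task, or [] if it is not done."""
    if status_payload.get("status") != "SUCCESS":
        return []
    # 'result' may be missing or null while the task is still running.
    results = (status_payload.get("result") or {}).get("results", [])
    return [(item["file"], item["text"]) for item in results]


if __name__ == "__main__":
    payload = {
        "task_id": "...",
        "status": "SUCCESS",
        "result": {
            "status": "success",
            "results": [{"text": "This is a relevant chunk of text from your document.",
                         "file": "document.pdf"}],
        },
    }
    for file_name, text in extract_chunks(payload):
        print(f"{file_name}: {text}")
```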

Monitoring

You can monitor the status of the background processing tasks using the Flower dashboard.

URL: http://localhost:5555

This dashboard provides a real-time overview of the Celery workers and the tasks they are processing.


damrongsak/agentic-retrieval-core
