This FastAPI server provides a scalable and efficient solution for serving generative AI models requiring GPU acceleration. It supports HTTP clients and can be deployed locally or within a Docker container.
The service implements OpenAI-compatible endpoints, allowing seamless integration with existing tools and libraries.
A built-in queuing system manages GPU resources, ensuring requests are processed sequentially in the order received. Horizontal scaling is achieved by deploying multiple instances of the server, allowing you to handle increased demand.
The server is configured to run one model type at a time. The model type is specified at startup and determines the model's behavior and the type of requests it can handle. The server supports multiple model types, including LLMs, NER, text and image embeddings, and vision models.
- LLMs
  - KEY: `text` - IE: `microsoft/Phi-3.5-mini-instruct`
- NER (Entity Recognition)
  - KEY: `ner` - IE: `urchade/gliner_medium-v2.1`
- Text Embedding
  - KEY: `embed` - IE: `nomic-ai/nomic-embed-text-v1.5`
- Image Embedding
  - KEY: `vision-embed` - IE: `nomic-ai/nomic-embed-vision-v1.5`
- Vision
  - KEY: `vision` - IE: `microsoft/Florence-2-large`
- Image
  - KEY: `image` - IE: `PixArt-alpha/PixArt-XL-2-1024-MS`
- This project uses the Black code formatter. Please install the Black formatter in your IDE to ensure consistent code formatting before submitting a PR.
You can install the required Python packages using UV or a standard Python virtual environment. The project requires Python 3.10 or higher. When adding new Python packages, please update the dev_requirements.txt file and the pyproject.toml file.
- Local Python install on Windows:

```shell
python -m venv venv
.\venv\Scripts\activate
pip install -r dev_requirements.txt
```

- Local Python install on MacOS/Linux:

```shell
python3 -m venv venv
source venv/bin/activate
pip install -r dev_requirements.txt
```

- You can also use the `uv` tool to create a virtual environment and install the required packages.

```shell
uv venv
.venv\Scripts\activate
uv pip install -r pyproject.toml
```

- You can test your local PyTorch/CUDA installation by using the `utils/torch_test.ipynb` notebook.
- The model files for a specific model are downloaded on server start up if they are not already present. The model files location is determined by the start up configuration: the files are placed either in the `model_files` directory at the root of the project or in a Docker volume location.
- When developing locally, if you prefer, you can manually download the model files using the `utils/dl_model_files.py` script. The start up lifecycle will then check the `model_files` directory to ensure the files are present.
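If you want a sense of what that manual download involves, here is a minimal sketch using `huggingface_hub.snapshot_download` (an illustration only, not the actual `utils/dl_model_files.py` script; the shortname-to-directory mapping is an assumption):

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Hypothetical example: place the GLiNER files under model_files/<shortname>/
model_repo_id = "urchade/gliner_multi-v2.1"
shortname = "gliner-multi-v2-1"
target_dir = Path("model_files") / shortname

# Downloads (or reuses a cached copy of) every file in the repository.
snapshot_download(repo_id=model_repo_id, local_dir=target_dir)
print(f"Model files available at {target_dir.resolve()}")
```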
- --model: A simplified, path-safe name for the model (e.g., 'phi-3-5-mini-instruct' for 'microsoft/Phi-3.5-mini-instruct'). Only use alphanumeric characters and hyphens (-). Avoid special characters like /, @, or spaces.
- --model_repo_id: The complete repository ID from Hugging Face Hub (e.g., 'microsoft/Phi-3.5-mini-instruct'). This should match exactly how it appears on Hugging Face, including the namespace and forward slashes.
- --model_type: The type of model being served (e.g., 'text', 'ner', 'embed', 'vision-embed', 'vision', 'image'). Must match one of the supported model type keys listed in the "Currently Supported Model Types" section.
- --semoss_id (OPTIONAL): The SEMOSS Engine ID associated with the model. If you are running locally and do not have one, you can omit this argument.
- --host (OPTIONAL): The host IP address for the server. Defaults to '0.0.0.0'. This can be omitted in most cases.
- --port (OPTIONAL): The port number for the server. Defaults to '8888'. This can be omitted in most cases.
- --local_files (OPTIONAL): A flag indicating that model files should be loaded from the project's root directory rather than from a Docker volume location. When this flag is used, the server expects model files to be present in the local model_files directory at the root instead of looking for them in mounted volumes. Useful for local development and testing.
- --no_redis (OPTIONAL RECOMMENDED): A flag to disable Redis deployment status updates. When running the server locally without Redis infrastructure, use this flag to prevent the server from attempting to connect to and update Redis with deployment status information. This is typically used during local development and testing scenarios where Redis is not available or needed.
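For reference, here is a sketch of how these flags could be declared with `argparse` (flag names mirror the list above; the real `server/main.py` may define them differently):

```python
import argparse

parser = argparse.ArgumentParser(description="Serve a generative AI model over FastAPI")
parser.add_argument("--model", required=True, help="Simplified, path-safe model name")
parser.add_argument("--model_repo_id", required=True, help="Hugging Face Hub repository ID")
parser.add_argument("--model_type", required=True,
                    choices=["text", "ner", "embed", "vision-embed", "vision", "image"])
parser.add_argument("--semoss_id", default=None, help="SEMOSS Engine ID (optional)")
parser.add_argument("--host", default="0.0.0.0")
parser.add_argument("--port", type=int, default=8888)
parser.add_argument("--local_files", action="store_true",
                    help="Load model files from the local model_files directory")
parser.add_argument("--no_redis", action="store_true",
                    help="Disable Redis deployment status updates")
parser.add_argument("--model_device", default=None, help="e.g. 'cuda:0' or 'cpu'")

args = parser.parse_args()
```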
The following are example commands for running the server locally (not in a Docker container) and should be executed at the root of the project.
- EX: Running the Nomic Text Embedding model locally with the model files stored in the `model_files` directory.

```shell
python server/main.py --model nomic-ai-nomic-embed-text-v1-5 --model_repo_id nomic-ai/nomic-embed-text-v1.5 --model_type embed --semoss_id 2aa0e4bf-08d5-452e-aa75-dd417f8ae610 --local_files --no_redis
```

- EX: Running the GLiNER NER model locally with the model files stored in the `model_files` directory. Note that we did not include the `--no_redis` flag, so this assumes you have a Redis instance running locally.

```shell
python server/main.py --model gliner-multi-v2-1 --model_repo_id urchade/gliner_multi-v2.1 --model_type ner --semoss_id abd20c47-2ce7-45ef-a10a-572150a3b0d6 --local_files
```

When the `--model_device` flag isn't specified, the server automatically selects the best available device:

- Uses CUDA if available (`cuda:0`)
- Falls back to CPU if no GPU is present
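A minimal sketch of how such a fallback is typically implemented with PyTorch (an illustration; the server's actual selection logic may differ):

```python
import torch

def resolve_device(requested: str | None = None) -> str:
    """Use the requested device if given, otherwise prefer CUDA and fall back to CPU."""
    if requested:
        return requested
    return "cuda:0" if torch.cuda.is_available() else "cpu"

print(resolve_device())        # "cuda:0" on a GPU machine, "cpu" otherwise
print(resolve_device("cpu"))   # explicit override, mirroring --model_device cpu
```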
- EX: Running the TinyLlama-1.1B-Chat model locally with CPU using the `--model_device` flag.

```shell
python server/main.py --model TinyLlama-1.1B-Chat-v1.0 --model_repo_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --model_type text --semoss_id 2aa0e4bf-08d5-452e-aa75-dd417f8ae610 --local_files --no_redis --model_device cpu
```

The project uses a two-stage Docker build process to optimize build times and reduce the final image size. This process involves creating a base image with common dependencies first, then building the server image on top of it.
The base image (defined in Dockerfile.base) contains:
- NVIDIA CUDA runtime
- Python environment setup
- Common ML dependencies (PyTorch, Transformers, etc.)
- Flash Attention configuration
- Git LFS support
To build the base image (this will take a long time due to the Flash Attention installation):

```shell
docker build -f Dockerfile.base -t remote-client-server-base:latest .
```

The server image (defined in `Dockerfile`) builds upon the base image and adds:
- Application code
- Server configurations
- Model file management
- Additional dependencies
To build the server image locally:

```shell
docker build -t remote-client-server:latest .
```

For local development, use the provided build script.

With Windows PowerShell:

```shell
./build.ps1
```

With Unix/MacOS (you may need to run `chmod +x build.sh` first):

```shell
./build.sh
```

With Bash:

```shell
bash build.sh
```

The script performs the following steps:
- Builds the base image using the Dockerfile.base if needed
- Builds the server image using the Dockerfile and local base image
To run the container with the model files on an attached volume, first create the volume and then use the following command to start the container with the volume mounted. (NOTE: The volume name is `my-volume` in this example; you can name the volume whatever you like.)

```shell
docker run -p 8888:8888 -e MODEL=gliner-multi-v2-1 -e MODEL_REPO_ID=urchade/gliner_multi-v2.1 -e MODEL_TYPE=ner -e SEMOSS_ID=abd20c47-2ce7-45ef-a10a-572150a3b0d6 --gpus all --name remote-client-server -v my-volume:/app/model_files remote-client-server
```

Once the server is running, visit:

- http://127.0.0.1:8888/docs for Swagger UI documentation.
- http://127.0.0.1:8888/redoc for ReDoc documentation.
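Once the container is up, you can also sanity-check it from Python. A small sketch using the `requests` library against the endpoints listed below (response payload shapes beyond what this README states are not assumed):

```python
import requests

base = "http://127.0.0.1:8888"

# Basic liveness check.
print(requests.get(f"{base}/api/health").status_code)   # expect 200 when healthy

# Current model, queue size, GPU utilization, and server status.
print(requests.get(f"{base}/api/status").json())

# Plain-text queue size, e.g. "queue_size 0".
print(requests.get(f"{base}/api/queue").text)
```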
- `/api/chat/completions` - An OpenAI API compatible endpoint for chat completions (text models).
- `/api/embeddings` - An OpenAI API compatible endpoint for text embeddings.
- `/api/generate` - A generic endpoint for generations from models not natively supported by the OpenAI API. IE: NER models.
- `/api/health` - Health check endpoint.
- `/api/status` - Returns an object with values for the current model, queue size, GPU utilization, and server status.
- `/api/queue` - Returns the current queue size as a plain text response. (IE: "queue_size 0")
- `/metrics` - Returns Prometheus metrics.
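Because the chat completion and embedding endpoints are OpenAI API compatible, you can point the official `openai` Python client at the server. A sketch (the base URL, API key handling, and model names shown are assumptions based on the examples in this README):

```python
from openai import OpenAI

# The client appends /chat/completions and /embeddings to this base URL,
# matching the /api/... routes above. The api_key is unused but required by the client.
client = OpenAI(base_url="http://127.0.0.1:8888/api", api_key="not-needed")

# Text embeddings (assumes an embedding model such as nomic-embed-text is being served).
emb = client.embeddings.create(model="nomic-ai/nomic-embed-text-v1.5", input=["hello world"])
print(len(emb.data[0].embedding))

# Chat completion (assumes a text model is being served instead, since the
# server runs one model type at a time).
chat = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)
```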
- Text generation models are served using the vLLM library, which provides efficient and scalable inference for large language models. vLLM supports advanced features like continuous batching, model parallelism, and optimized memory management. Because of this, text models require running the server in a Docker container.
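For context, this is roughly what vLLM's offline generation API looks like (a standalone sketch of the library itself, not the server's internal integration):

```python
from vllm import LLM, SamplingParams

# Load a small chat model; vLLM manages GPU memory and batching internally.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Write one sentence about GPUs."], params)
print(outputs[0].outputs[0].text)
```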
The server integrates with Redis to maintain deployment status and facilitate scaling operations. Each model deployment is associated with a Redis hash that stores critical operational metrics and status information.
Each deployment maintains a Redis hash using the SEMOSS Engine ID as the key, with the format `{semoss_id}:deployment`. The hash contains the following fields:
- `model_name`: The simplified path-safe name of the model
- `model_repo_id`: The complete HuggingFace repository ID
- `model_type`: The type of model being served (text, ner, embed, etc.)
- `model_device`: Controls the hardware device used for model setup
- `semoss_id`: The unique SEMOSS Engine ID associated with the deployment
- `address`: The IP address and port where the model is accessible
- `start_time`: Timestamp when the deployment was initiated
- `last_request`: Timestamp of the most recent generation request
- `generations`: Counter of total generations performed
- `shutdown_lock`: Flag indicating if the deployment is exempt from scaling down
Example Redis hash:
"abd20c47-2ce7-45ef-a10a-572150a3b0d6:deployment" : {
"model_name": "gliner-multi-v2-1",
"model_repo_id": "urchade/gliner_multi-v2.1",
"model_type": "ner",
"model_device": "cuda:0",
"semoss_id": "abd20c47-2ce7-45ef-a10a-572150a3b0d6",
"address": "10.218.221.138:31213",
"start_time": "2025-01-16T17:41:32.934832-05:00",
"last_request": "2025-01-17T09:43:04.588880-05:00",
"generations": "588",
"shutdown_lock": "false"
}
The server automatically updates specific fields in the Redis hash during generation operations:
- Generation Counter: Incremented after each successful model generation
- Last Request Time: Updated with the timestamp of the most recent request
These updates enable monitoring systems to track usage patterns and make scaling decisions based on actual deployment activity.
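A minimal sketch of what these updates could look like with the `redis` Python client (the key name follows the `{semoss_id}:deployment` format above; the actual RedisManager implementation may differ):

```python
from datetime import datetime, timezone

import redis

# Hypothetical connection details; the real server reads these from its configuration.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_generation(semoss_id: str) -> None:
    """Increment the generation counter and refresh the last request timestamp."""
    key = f"{semoss_id}:deployment"
    r.hincrby(key, "generations", 1)
    r.hset(key, "last_request", datetime.now(timezone.utc).isoformat())

record_generation("abd20c47-2ce7-45ef-a10a-572150a3b0d6")
```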
This is not comprehensive, but a high-level overview of the project structure.
- `server`: Contains the main server application code.
  - `server/main.py`: The entry point for the server.
  - `server/gaas`: Contains the classes for running specific models (text, ner, embed, etc.) and managing the models.
    - `server/gaas/model_manager/model_manager.py`: The ModelManager class is a singleton that manages the model instance and loads the model into memory. (A generic singleton sketch follows this list.)
    - `server/gaas/model_manager/model_files_manager.py`: The ModelFilesManager manages the location and parsing of model files.
  - `server/model_utils`
    - `server/model_utils/download.py`: Contains all of the logic for downloading model files and verifying their integrity during the server start up lifecycle.
    - `server/model_utils/model_config.py`: Contains the configuration for the currently running model by pulling the OS environment variables.
  - `server/pydantic_models`: Contains the Pydantic models for request and response validation.
  - `server/queue_manager`: Contains the QueueManager singleton class for managing the job queue.
  - `server/redis_manager`: Contains the RedisManager class for managing the Redis connection and deployment status updates.
  - `server/router`: Contains the FastAPI routers for the different endpoints.
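As a rough illustration of the singleton pattern used by classes like ModelManager and QueueManager (a generic sketch, not the project's actual implementation):

```python
class ModelManager:
    """Singleton that holds the single loaded model instance for the process."""

    _instance = None

    def __new__(cls):
        # Reuse the same instance on every instantiation so the model is loaded once.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model = None  # loaded lazily at startup
        return cls._instance

    def load(self, loader):
        if self.model is None:
            self.model = loader()
        return self.model

assert ModelManager() is ModelManager()  # both references point to the same instance
```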
The following is a high-level overview of the server lifecycle when starting up and processing requests when the container is started with the following command:
```shell
docker run -p 8888:8888 -e MODEL=gliner-multi-v2-1 -e MODEL_REPO_ID=urchade/gliner_multi-v2.1 -e MODEL_TYPE=ner -e SEMOSS_ID=abd20c47-2ce7-45ef-a10a-572150a3b0d6 --gpus all --name remote-client-server -v my-volume:/app/model_files remote-client-server
```
- Model File Verification
  - The server uses the shortname of the model to check if an existing path exists on the volume. (see `download.py`)
  - If the path does not exist, the server will attempt to download the model files from the Hugging Face Hub.
  - If the path exists, the server will verify the integrity of the model files.
- Model Loading
  - The server will load the model into memory using the ModelManager class.
  - The ModelManager class will use the OS environment variables to determine the model configuration.
  - The ModelManager and ModelFilesManager classes will parse the model files to determine whether to utilize Flash Attention.
- Queue Initialization
  - The server will initialize the QueueManager class to manage the job queue based on the model type.
- Request Received
  - The server will receive a request at a given endpoint (e.g. `/api/embeddings`) and validate it using the Pydantic models.
  - The request will be assigned a unique job id and added to the queue.
- Queue Processing (a minimal sketch of this flow follows the list)
  - The QueueManager will process the job queue sequentially.
  - The QueueManager will report the current queue position as it updates.
  - The job will be popped from the queue and sent to the ModelManager for processing.
- Model Processing
  - The ModelManager will process the job using the model instance.
  - The ModelManager will return the result to the QueueManager.
- Queue Update
  - The QueueManager will update the job status with the payload and set the job as complete.
  - The QueueManager will update the Redis deployment hash with the latest request timestamp and generation count.
- Response
  - The server will return the response to the client.
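A compressed sketch of this request/queue/model flow using `asyncio` (an illustration of the pattern only; the real QueueManager and ModelManager have more responsibilities):

```python
import asyncio
import uuid

queue: asyncio.Queue = asyncio.Queue()

async def worker(model):
    # Jobs are processed sequentially, in the order they were enqueued.
    while True:
        job_id, payload, future = await queue.get()
        future.set_result(model(payload))   # hand the payload off to the model
        queue.task_done()

async def submit(payload):
    # Each request gets a unique job id and a future that resolves when its turn comes.
    job_id = str(uuid.uuid4())
    future = asyncio.get_running_loop().create_future()
    await queue.put((job_id, payload, future))
    return await future

async def main():
    worker_task = asyncio.create_task(worker(lambda text: text.upper()))  # stand-in "model"
    print(await submit("hello"))   # -> "HELLO"
    worker_task.cancel()           # shut the worker down for this demo

asyncio.run(main())
```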
- For more information on how the server is deployed, see the Kubernetes Model Scaler
- Please create a new branch for your changes and submit a pull request for review.
- In the PR description, please include a brief summary of the changes and any relevant information.
- Ensure that your code is formatted using the Black code formatter.
- PRs will require review from at least one team member (Ryan Weiler or Kunal Patel for now).