Foundry Local is an on-device AI inference solution that lets you run AI models locally through a CLI, SDK, or REST API. This repository provides a collection of Jupyter Notebook tutorials to help you get started and explore advanced capabilities.
Website: www.foundrylocal.ai
Foundry Local is currently in preview.
Foundry Local is a Microsoft on-device AI inference solution designed to let developers and organizations run modern generative AI models directly on their local hardware (Windows PCs, macOS with Apple Silicon, or servers) without relying on cloud-based endpoints.
- Complete Data Privacy – All prompts and outputs are processed entirely on your device. Data never leaves your system, making it ideal for sensitive, confidential, or regulated workloads in healthcare, government, finance, and more.
- Low-Latency Inference – Run AI models locally for real-time, interactive experiences with minimal latency; no network round-trips required.
- Offline Operation – Once models are downloaded, everything works fully offline. Perfect for remote environments, air-gapped systems, or locations with unreliable connectivity.
- Cost Efficiency – Leverage your existing hardware (CPU, GPU, NPU) for inference, eliminating recurring cloud costs and providing predictable cost control.
- OpenAI-Compatible API – Foundry Local exposes an OpenAI-compatible REST API, allowing you to use the same code for local and cloud-based inference. Switch between local and Azure endpoints by simply changing the base URL.
- Multiple Integration Options – Interact via CLI, Python SDK, JavaScript SDK, .NET SDK, or REST API; flexible integration for any workflow.
- Automatic Hardware Optimization – Foundry Local detects your hardware and automatically downloads the best-optimized model variant (NVIDIA CUDA, AMD DirectML, Apple Metal, Intel/Qualcomm NPU, or CPU with INT4/INT8 quantization).
- No Azure Subscription Required – Use Foundry Local entirely standalone, though hybrid cloud-to-edge workflows with Azure AI Foundry are fully supported.
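The "same code, different base URL" idea behind the OpenAI-compatible API can be sketched with the standard library alone. The port and model alias below are placeholders: Foundry Local assigns its service port at startup (check `foundry service status`), and `foundry model list` shows the real aliases on your machine.

```python
import json
import urllib.request

# Placeholder endpoints -- the local port is assigned by the Foundry Local
# service at startup, and the Azure URL is a stand-in for your own resource.
LOCAL_BASE = "http://localhost:5273/v1"
AZURE_BASE = "https://YOUR-RESOURCE.openai.azure.com/openai/v1"

def build_chat_request(base_url: str, model: str, prompt: str):
    """Build an OpenAI-style chat-completions request; only base_url
    differs between local and cloud inference."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Swap endpoints without touching any other code:
req = build_chat_request(LOCAL_BASE, "phi-3.5-mini", "Hello!")
# With a running Foundry Local service you would then call:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same `build_chat_request` helper works against an Azure endpoint by passing `AZURE_BASE` (plus your API key header) instead of `LOCAL_BASE`.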
| Platform | Details |
|---|---|
| Windows | Windows 10/11 (x64, ARM), Windows Server 2025 |
| macOS | macOS with Apple Silicon (M1/M2/M3/M4) |
| Hardware | Min 8 GB RAM (16 GB recommended); NVIDIA, AMD, Intel, Qualcomm GPUs/NPUs, Apple Metal |
- Applications handling sensitive or regulated data (HIPAA, GDPR)
- Scenarios with unreliable or no internet access
- Prototyping and developing AI applications before cloud deployment
- Real-time, interactive AI-driven applications requiring low latency
- Reducing ongoing public cloud inference costs
| # | Notebook | Description |
|---|---|---|
| 01 | Getting Started with Foundry Local | Introduction to Foundry Local – installation, setup, and running your first local model |
| 02 | Foundry Local Chat Completions | Using the chat completions API to interact with local models |
| 03 | Foundry Local Practical Applications | Real-world use cases and practical examples with Foundry Local |
| 04 | Foundry Local Mistral 7B | Running and interacting with the Mistral 7B model locally |
| 05 | Advanced Function Calling with Foundry Local | Implementing advanced function calling and tool use with local models |
| 06 | Deploying Custom Models with Microsoft Olive and Foundry Local | Optimizing and deploying custom models using Microsoft Olive |
Foundry Local's architecture is designed for efficient, private, and scalable on-device AI inference. For the complete architecture reference, see the official documentation: Foundry Local Architecture on Microsoft Learn.
```
┌────────────────────────────────────────────────────────────────┐
│                    Developer / Application                     │
│              (CLI, Python SDK, JS SDK, .NET SDK)               │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                     Foundry Local Service                      │
│             (OpenAI-Compatible REST API Endpoint)              │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐    │
│  │    Model     │  │    Cache     │  │      Service       │    │
│  │   Manager    │  │   Manager    │  │      Manager       │    │
│  └──────┬───────┘  └──────┬───────┘  └────────────────────┘    │
│         │                 │                                    │
│         ▼                 ▼                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                      ONNX Runtime                        │  │
│  │     (CPU / CUDA / DirectML / Metal / NPU Providers)      │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                         Local Hardware                         │
│         (CPU, NVIDIA GPU, AMD GPU, Apple Silicon, NPU)         │
└────────────────────────────────────────────────────────────────┘
```
| Component | Role |
|---|---|
| Foundry Local Service | Core engine that orchestrates local AI model execution. Exposes an OpenAI-compatible REST API endpoint for inference and model management. |
| Model Manager | Handles the full model lifecycle: downloading, loading, unloading, compilation, and removal from cache. |
| Cache Manager | Manages local storage of AI models. Configure cache locations, list cached models, and optimize storage space. |
| Service Manager | Controls the Foundry Local Service β start, stop, monitor, and restart for maintenance or configuration changes. |
| ONNX Runtime | The inference engine that executes optimized models across supported hardware. Uses execution providers (CUDA, DirectML, Metal, CPU) for hardware-specific acceleration. |
| CLI & SDKs | Primary interfaces to interact with the service. CLI for command-line operations; Python, JavaScript, C#, and Rust SDKs for programmatic integration. |
- Request – The developer sends a request via CLI, SDK, or REST API
- Routing – The Foundry Local Service receives the request through its OpenAI-compatible endpoint
- Model Operations – The Model Manager loads the requested model (downloading and caching it if needed)
- Inference – ONNX Runtime executes the inference using the optimal hardware execution provider
- Response – Results are returned through the same API interface
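The request/response ends of this flow can be traced in a short client-side sketch. This is a hypothetical illustration, not the SDK's API: steps 3–4 (Model Operations, Inference) happen inside the service, so they appear only as comments.

```python
import json
import urllib.request

def extract_reply(response_body: dict) -> str:
    """Step 5 (Response): pull the assistant message out of an
    OpenAI-style chat-completions response body."""
    return response_body["choices"][0]["message"]["content"]

def chat(base_url: str, model: str, prompt: str) -> str:
    # Step 1 (Request): the application builds an OpenAI-style request.
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # Step 2 (Routing): the service receives this at its OpenAI-compatible
    # endpoint. Steps 3-4 then run inside the service: the Model Manager
    # loads/caches the model and ONNX Runtime executes it on the best
    # available hardware. The call below requires a running service.
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

# extract_reply demonstrated on a minimal fake response (no service needed):
fake = {"choices": [{"message": {"role": "assistant", "content": "hi!"}}]}
print(extract_reply(fake))  # hi!
```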
- Local-first design – All processing happens on-device with no data leaving the system
- Cloud-compatible – Same API interface as Azure OpenAI, enabling seamless local-to-cloud portability
- Hardware-aware – Automatic detection and optimization for available compute resources
- Efficient caching – Models are downloaded once and cached locally for instant offline access
The official Foundry Local documentation is available at www.foundrylocal.ai and covers everything you need to get started and build on-device AI applications.
| Resource | Link | Description |
|---|---|---|
| Official Website | foundrylocal.ai | Main homepage with overview, downloads, and getting started guides |
| Microsoft Learn | Foundry Local on Microsoft Learn | In-depth documentation including concepts, quickstarts, and API references |
| Architecture | Foundry Local Architecture | Detailed architecture overview and component descriptions |
| Getting Started Guide | Get Started with Foundry Local | Step-by-step guide to install and run your first model |
- Installation & Setup – How to install Foundry Local on Windows, macOS, and servers
- CLI Reference – Full command-line interface documentation (`foundry model list`, `foundry model run`, etc.)
- SDK Integration – Python, JavaScript, and .NET SDK guides with code examples
- REST API – OpenAI-compatible REST API reference for seamless integration
- Hardware Optimization – How Foundry Local auto-detects and optimizes for your hardware (NVIDIA/AMD GPU, Apple Silicon, NPU, CPU)
- Custom Model Deployment – Guide to converting and deploying your own models using Microsoft Olive
Foundry Local provides a curated catalog of pre-optimized, open-source AI models ready to run on your device. Browse the full model catalog at foundrylocal.ai/models.
More than 25 models are available.
The model catalog is regularly updated. Visit foundrylocal.ai/models for the latest available models.
Foundry Local automatically detects your hardware and downloads the best-optimized variant for your device:
- NVIDIA GPU – CUDA-accelerated ONNX models
- AMD GPU – DirectML-optimized models
- Apple Silicon – Metal-accelerated models for M-series chips
- Intel/Qualcomm NPU – Neural Processing Unit optimized models
- CPU – Quantized INT4/INT8 models for CPU-only inference
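As a rough illustration of this selection, ONNX Runtime exposes its available execution providers by name. The mapping below from provider names to the hardware tiers above is an illustrative assumption, not Foundry Local's actual selection logic; with the `onnxruntime` package installed, `onnxruntime.get_available_providers()` returns the real list for your machine.

```python
# Illustrative mapping from ONNX Runtime execution-provider names to the
# hardware tiers listed above. The provider names are real ONNX Runtime
# identifiers; the tier labels and priority rule are assumptions.
PROVIDER_TIERS = {
    "CUDAExecutionProvider": "NVIDIA GPU (CUDA)",
    "DmlExecutionProvider": "DirectML (Windows GPU)",
    "CoreMLExecutionProvider": "Apple Silicon (Metal/CoreML)",
    "QNNExecutionProvider": "Qualcomm NPU",
    "OpenVINOExecutionProvider": "Intel (OpenVINO)",
    "CPUExecutionProvider": "CPU (INT4/INT8 quantized models)",
}

def best_tier(available: list[str]) -> str:
    """Return the tier of the first (highest-priority) known provider;
    ONNX Runtime lists providers in priority order."""
    for name in available:
        if name in PROVIDER_TIERS:
            return PROVIDER_TIERS[name]
    return "unknown"

# On a CPU-only machine:
print(best_tier(["CPUExecutionProvider"]))  # CPU (INT4/INT8 quantized models)
# Feed it the real list with:
# import onnxruntime; best_tier(onnxruntime.get_available_providers())
```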
```shell
# List all available models in the catalog
foundry model list

# Get detailed info about a specific model
foundry model info <model-alias>

# Download and run a model
foundry model run <model-alias>

# Remove a cached model
foundry model remove <model-alias>
```

You can also deploy custom models from Hugging Face by converting them to ONNX format using Microsoft Olive. See Notebook 06 for a complete walkthrough.
A reference list of models is also available in this repository: models.xlsx
- Python 3.10+
- Foundry Local installed β see foundrylocal.ai for installation instructions
- Jupyter Notebook or JupyterLab
1. Clone this repository:

   ```shell
   git clone https://github.com/retkowsky/foundry-local.git
   cd foundry-local
   ```

2. Install the required Python packages:

   ```shell
   pip install -r requirements.txt
   ```

3. Launch Jupyter and open any notebook:

   ```shell
   jupyter notebook
   ```
| Package | Purpose |
|---|---|
| `foundry-local` | Core Foundry Local package |
| `foundry-local-sdk` | Foundry Local Python SDK |
| `openai` | OpenAI-compatible API client |
| `onnxruntime` / `onnxruntime-genai` | ONNX Runtime for model inference |
| `olive-ai` | Microsoft Olive for model optimization |
| `transformers` | Hugging Face Transformers |
| `torch` | PyTorch |
- Foundry Local Website
- Foundry Local Documentation
- Foundry Local Architecture
- Available Models Catalog
- Microsoft Learn – Foundry Local
- Models Reference (Excel)
| Field | Details |
|---|---|
| Name | Serge Retkowsky |
| Created | 26 February 2026 |
| Last updated | 26 February 2026 |
| Email | serge.retkowsky@microsoft.com |
| LinkedIn | https://www.linkedin.com/in/serger/ |
| Medium publications | https://medium.com/@sergems18/ |

