Foundry Local πŸš€

Foundry Local is an on-device AI inference solution that lets you run AI models locally through a CLI, SDK, or REST API. This repository provides a collection of Jupyter Notebook tutorials to help you get started and explore advanced capabilities.

🌐 Website: www.foundrylocal.ai

Foundry Local is currently in preview.


🧠 What is Foundry Local?

Foundry Local is a Microsoft on-device AI inference solution designed to let developers and organizations run modern generative AI models directly on their local hardware β€” Windows PCs, macOS (Apple Silicon), or servers β€” without relying on cloud-based endpoints.

Key Highlights

  • πŸ”’ Complete Data Privacy β€” All prompts and outputs are processed entirely on your device. Data never leaves your system, making it ideal for sensitive, confidential, or regulated workloads in healthcare, government, finance, and more.
  • ⚑ Low-Latency Inference β€” Run AI models locally for real-time, interactive experiences with minimal latency β€” no network round-trips required.
  • πŸ“΄ Offline Operation β€” Once models are downloaded, everything works fully offline. Perfect for remote environments, air-gapped systems, or locations with unreliable connectivity.
  • πŸ’° Cost Efficiency β€” Leverage your existing hardware (CPU, GPU, NPU) for inference, eliminating recurring cloud costs and providing predictable cost control.
  • πŸ”— OpenAI-Compatible API β€” Foundry Local exposes an OpenAI-compatible REST API, allowing you to use the same code for local and cloud-based inference. Switch between local and Azure endpoints by simply changing the base URL.
  • πŸ› οΈ Multiple Integration Options β€” Interact via CLI, Python SDK, JavaScript SDK, .NET SDK, or REST API β€” flexible integration for any workflow.
  • βš™οΈ Automatic Hardware Optimization β€” Foundry Local detects your hardware and automatically downloads the best-optimized model variant (NVIDIA CUDA, AMD DirectML, Apple Metal, Intel/Qualcomm NPU, or CPU with INT4/INT8 quantization).
  • πŸš€ No Azure Subscription Required β€” Use Foundry Local entirely standalone, though hybrid cloud-to-edge workflows with Azure AI Foundry are fully supported.
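The "same code, different base URL" idea behind the OpenAI-compatible API can be sketched in a few lines of Python. The localhost port and the Azure endpoint below are illustrative assumptions, not values confirmed by this repository; the actual local port is assigned by the Foundry Local service at startup.

```python
# Sketch: switch between local and cloud inference by changing only the
# base URL handed to an OpenAI-compatible client. All URLs here are
# placeholders for illustration.

def select_base_url(use_local: bool) -> str:
    """Return the endpoint to pass to an OpenAI-compatible client."""
    if use_local:
        # Foundry Local serves on a localhost port chosen by the service;
        # treat this value as a placeholder, not a guaranteed default.
        return "http://localhost:5273/v1"
    # Hypothetical cloud endpoint for comparison.
    return "https://YOUR-RESOURCE.openai.azure.com/v1"

# The same application code then works against either endpoint, e.g.:
#   from openai import OpenAI
#   client = OpenAI(base_url=select_base_url(use_local=True),
#                   api_key="not-needed-locally")

print(select_base_url(True))   # http://localhost:5273/v1
```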

Supported Platforms

| Platform | Details |
|---|---|
| Windows | Windows 10/11 (x64, ARM), Windows Server 2025 |
| macOS | macOS with Apple Silicon (M1/M2/M3/M4) |
| Hardware | Min 8 GB RAM (16 GB recommended); NVIDIA, AMD, Intel, Qualcomm GPUs/NPUs, Apple Metal |

Typical Use Cases

  • πŸ₯ Applications handling sensitive or regulated data (HIPAA, GDPR)
  • 🌐 Scenarios with unreliable or no internet access
  • πŸ§ͺ Prototyping and developing AI applications before cloud deployment
  • ⏱️ Real-time, interactive AI-driven applications requiring low latency
  • πŸ’Έ Reducing ongoing public cloud inference costs

πŸ“š Notebooks

| # | Notebook | Description |
|---|---|---|
| 01 | Getting Started with Foundry Local | Introduction to Foundry Local β€” installation, setup, and running your first local model |
| 02 | Foundry Local Chat Completions | Using the chat completions API to interact with local models |
| 03 | Foundry Local Practical Applications | Real-world use cases and practical examples with Foundry Local |
| 04 | Foundry Local Mistral 7B | Running and interacting with the Mistral 7B model locally |
| 05 | Advanced Function Calling with Foundry Local | Implementing advanced function calling and tool use with local models |
| 06 | Deploying Custom Models with Microsoft Olive and Foundry Local | Optimizing and deploying custom models using Microsoft Olive |

πŸ—οΈ Architecture

Foundry Local's architecture is designed for efficient, private, and scalable on-device AI inference. For the complete architecture reference, see the official documentation: Foundry Local Architecture on Microsoft Learn.

Core Components

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Developer / Application                      β”‚
β”‚              (CLI, Python SDK, JS SDK, .NET SDK)                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Foundry Local Service                         β”‚
β”‚            (OpenAI-Compatible REST API Endpoint)                β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚   Model      β”‚  β”‚    Cache     β”‚  β”‚     Service        β”‚     β”‚
β”‚  β”‚   Manager    β”‚  β”‚    Manager   β”‚  β”‚     Manager        β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚         β”‚                 β”‚                                     β”‚
β”‚         β–Ό                 β–Ό                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚                    ONNX Runtime                         β”‚    β”‚
β”‚  β”‚     (CPU / CUDA / DirectML / Metal / NPU Providers)     β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Local Hardware                               β”‚
β”‚          (CPU, NVIDIA GPU, AMD GPU, Apple Silicon, NPU)         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Component Details

| Component | Role |
|---|---|
| Foundry Local Service | Core engine that orchestrates local AI model execution. Exposes an OpenAI-compatible REST API endpoint for inference and model management. |
| Model Manager | Handles the full model lifecycle β€” downloading, loading, unloading, compilation, and removal from cache. |
| Cache Manager | Manages local storage of AI models. Configure cache locations, list cached models, and optimize storage space. |
| Service Manager | Controls the Foundry Local Service β€” start, stop, monitor, and restart for maintenance or configuration changes. |
| ONNX Runtime | The inference engine that executes optimized models across supported hardware. Uses execution providers (CUDA, DirectML, Metal, CPU) for hardware-specific acceleration. |
| CLI & SDKs | Primary interfaces to interact with the service. CLI for command-line operations; Python, JavaScript, C#, and Rust SDKs for programmatic integration. |

How It Works

  1. Request β€” The developer sends a request via CLI, SDK, or REST API
  2. Routing β€” The Foundry Local Service receives the request through its OpenAI-compatible endpoint
  3. Model Operations β€” The Model Manager loads the requested model (downloading and caching if needed)
  4. Inference β€” ONNX Runtime executes the inference using the optimal hardware execution provider
  5. Response β€” Results are returned through the same API interface
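Step 1 of the flow above can be sketched as the request an application hands to the service's OpenAI-compatible endpoint. The model alias below is a hypothetical example; real aliases come from `foundry model list`.

```python
# Sketch: build an OpenAI-style chat completions payload for the local
# endpoint. The Model Manager resolves the alias to a cached model; the
# alias used here is illustrative only.

def build_chat_request(model: str, user_message: str) -> dict:
    """Construct an OpenAI-compatible chat completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stream": False,  # set True for token-by-token streaming
    }

payload = build_chat_request("phi-3.5-mini", "Summarize Foundry Local in one line.")
print(sorted(payload))  # ['messages', 'model', 'stream']
```

Because the body matches the OpenAI schema, the same payload can be POSTed to the local service or to a cloud endpoint without modification.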

Key Architectural Benefits

  • πŸ” Local-first design β€” All processing happens on-device with no data leaving the system
  • πŸ”„ Cloud-compatible β€” Same API interface as Azure OpenAI, enabling seamless local-to-cloud portability
  • ⚑ Hardware-aware β€” Automatic detection and optimization for available compute resources
  • πŸ“¦ Efficient caching β€” Models are downloaded once and cached locally for instant offline access

πŸ“– Documentation

The official Foundry Local documentation is available at www.foundrylocal.ai and covers everything you need to get started and build on-device AI applications.

Key Documentation Resources

| Resource | Link | Description |
|---|---|---|
| 🌐 Official Website | foundrylocal.ai | Main homepage with overview, downloads, and getting started guides |
| πŸ“˜ Microsoft Learn | Foundry Local on Microsoft Learn | In-depth documentation including concepts, quickstarts, and API references |
| πŸ—οΈ Architecture | Foundry Local Architecture | Detailed architecture overview and component descriptions |
| πŸš€ Getting Started Guide | Get Started with Foundry Local | Step-by-step guide to install and run your first model |

What You'll Find in the Docs

  • Installation & Setup β€” How to install Foundry Local on Windows, macOS, and servers
  • CLI Reference β€” Full command-line interface documentation (foundry model list, foundry model run, etc.)
  • SDK Integration β€” Python, JavaScript, and .NET SDK guides with code examples
  • REST API β€” OpenAI-compatible REST API reference for seamless integration
  • Hardware Optimization β€” How Foundry Local auto-detects and optimizes for your hardware (NVIDIA/AMD GPU, Apple Silicon, NPU, CPU)
  • Custom Model Deployment β€” Guide to converting and deploying your own models using Microsoft Olive

πŸ€– Available Models

Foundry Local provides a curated catalog of pre-optimized, open-source AI models ready to run on your device. Browse the full model catalog at foundrylocal.ai/models.

Featured Models

More than 25 models are available.

πŸ’‘ The model catalog is regularly updated. Visit foundrylocal.ai/models for the latest available models.

Hardware-Optimized Variants

Foundry Local automatically detects your hardware and downloads the best-optimized variant for your device:

  • 🟒 NVIDIA GPU β€” CUDA-accelerated ONNX models
  • πŸ”΄ AMD GPU β€” DirectML-optimized models
  • 🍎 Apple Silicon β€” Metal-accelerated models for M-series chips
  • πŸ”΅ Intel/Qualcomm NPU β€” Neural Processing Unit optimized models
  • πŸ’» CPU β€” Quantized INT4/INT8 models for CPU-only inference
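The variant-selection idea above can be illustrated as a simple lookup: detected hardware maps to an optimized build, with quantized CPU models as the fallback. The labels and detection here are simplified assumptions for illustration; the real logic lives inside Foundry Local.

```python
# Illustrative sketch of hardware-to-variant mapping. These labels are
# assumptions, not Foundry Local's actual internal identifiers.

VARIANTS = {
    "nvidia_gpu": "cuda",        # CUDA-accelerated ONNX models
    "amd_gpu": "directml",       # DirectML-optimized models
    "apple_silicon": "metal",    # Metal-accelerated M-series builds
    "npu": "npu",                # Intel/Qualcomm NPU-optimized builds
}

def pick_variant(hardware: str) -> str:
    """Fall back to a quantized CPU build when no accelerator is detected."""
    return VARIANTS.get(hardware, "cpu-int4")

print(pick_variant("apple_silicon"))  # metal
print(pick_variant("unknown"))        # cpu-int4
```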

Model Management CLI

```shell
# List all available models in the catalog
foundry model list

# Get detailed info about a specific model
foundry model info <model-alias>

# Download and run a model
foundry model run <model-alias>

# Remove a cached model
foundry model remove <model-alias>
```

Bring Your Own Models

You can also deploy custom models from Hugging Face by converting them to ONNX format using Microsoft Olive. See Notebook 06 for a complete walkthrough.

A reference list of models is also available in this repository: πŸ“Š models.xlsx


βš™οΈ Getting Started

Prerequisites

  • Python 3.10+
  • Foundry Local installed β€” see foundrylocal.ai for installation instructions
  • Jupyter Notebook or JupyterLab

Installation

  1. Clone this repository:

     ```shell
     git clone https://github.com/retkowsky/foundry-local.git
     cd foundry-local
     ```

  2. Install the required Python packages:

     ```shell
     pip install -r requirements.txt
     ```

  3. Launch Jupyter and open any notebook:

     ```shell
     jupyter notebook
     ```

πŸ”‘ Key Dependencies

| Package | Purpose |
|---|---|
| foundry-local | Core Foundry Local package |
| foundry-local-sdk | Foundry Local Python SDK |
| openai | OpenAI-compatible API client |
| onnxruntime / onnxruntime-genai | ONNX Runtime for model inference |
| olive-ai | Microsoft Olive for model optimization |
| transformers | Hugging Face Transformers |
| torch | PyTorch |

πŸ“„ Resources


Author

| Field | Details |
|---|---|
| Name | Serge Retkowsky |
| Created | 26 February 2026 |
| Last updated | 26 February 2026 |
| Email | serge.retkowsky@microsoft.com |
| LinkedIn | https://www.linkedin.com/in/serger/ |
| Medium publications | https://medium.com/@sergems18/ |
