What is Foundry Local?
Foundry Local is an end-to-end solution from Microsoft for building AI applications that run entirely on the user’s device. The local AI runtime handles model acquisition, hardware acceleration, model management, and inference via ONNX Runtime. The runtime adds only about 20 MB to the application and runs inference in-process. This lets you embed AI where package size, data privacy, and offline capability matter.
Foundry Local solves a concrete problem: it lets you add AI features to client applications without sending data to the cloud, without network latency, and without per-token costs. No Azure subscription is required. Responses start immediately, and the application works offline. For inference on your own infrastructure at enterprise scale with Kubernetes-native operations, Microsoft offers a separate option, Foundry Local on Azure Local.
Key Features
- Lightweight on-device runtime: The runtime, built on ONNX Runtime, handles model acquisition, hardware acceleration, and inference within the application process and adds only about 20 MB to the app package.
- OpenAI-compatible API: Foundry Local supports OpenAI request and response formats, including the OpenAI Responses API format, so existing OpenAI SDK applications can be reused with minimal code changes.
- Automatic hardware acceleration: Foundry Local detects the available hardware and selects the best execution provider across CPU, GPU, and NPU, with seamless fallback to CPU. Execution provider and driver updates are managed automatically.
- Curated model catalog: A versioned catalog of quantized models optimized for on-device use covers chat completions (such as GPT OSS, Qwen, DeepSeek, Mistral, Phi) and audio transcription (such as Whisper). Models download on first use and are cached locally.
Typical Use Cases
Embed AI in client applications: Developers integrate AI features directly into desktop applications through the SDK for C#, JavaScript, Python, or Rust. Inference runs within the application process, with no separate backend and no cloud dependency.
Process sensitive data on the device: Applications process audio, text, or images locally so the data never leaves the device. This suits scenarios with strict data-privacy and compliance requirements.
Offline and edge scenarios: In environments with limited or no connectivity, Foundry Local delivers AI features without network access. After the initial model download, no connection is required for inference itself.
Benefits
- No per-token costs and no Azure subscription required.
- Data stays on the device, with responses starting immediately and no network latency.
- Cross-platform support for Windows, macOS (Apple silicon), and Linux, with automatic selection of the best hardware.
Integration with innFactory
As a Microsoft Solutions Partner, innFactory supports you with the adoption and operation of this service.
Typical Use Cases
Frequently Asked Questions
What is Foundry Local?
Foundry Local is an end-to-end solution from Microsoft for running AI models entirely on the user's device. It combines a lightweight runtime built on ONNX Runtime, a curated model catalog, and an SDK for C#, JavaScript, Python, and Rust. Data never leaves the device, and there are no per-token costs.
When should I use Foundry Local?
Foundry Local fits when sensitive data must stay on the device, when applications need to work offline or in limited-connectivity environments, when you need low latency for real-time interactions, or when you want to reduce the per-token costs of cloud-based inference. It is designed for single-user scenarios on client devices.
How much does Foundry Local cost?
Foundry Local has no per-token cost and requires no Azure subscription. The models run entirely on local hardware. Use is governed by the product terms and licenses that apply to the software and the models in use.
Which platforms does Foundry Local support, and is it OpenAI-compatible?
Foundry Local supports Windows, macOS (Apple silicon), and Linux, and automatically uses the best available hardware (CPU, GPU, or NPU). The runtime exposes an OpenAI-compatible API, including the OpenAI Responses API format. Existing OpenAI SDK applications can be repointed to a Foundry Local endpoint with minimal changes.
