What is GKE Inference Gateway?
GKE Inference Gateway is a Kubernetes-native gateway for serving generative AI on Google Kubernetes Engine (GKE). It extends the GKE Gateway and builds on the open Gateway API Inference Extension (llm-d). Unlike classic load balancing, it makes model-aware routing decisions: requests go to the best-suited Pod replica based on live metrics such as KV cache utilization, GPU or TPU utilization, and request-queue length, rather than generic round-robin.
This solves a core problem when running your own LLM inference on Kubernetes: generic load balancing ignores the characteristics of language models, leading to high time-to-first-token, uneven accelerator utilization, and unnecessary token cost. LLM-aware routing and disaggregated serving lower latency and cost measurably and make serving generative AI on GKE more predictable.
Core Features
- Model-aware routing: Requests are routed by model name (OpenAI API format). This enables traffic splitting, gradual rollouts, and dynamic multiplexing of multiple LoRA adapters on shared accelerators.
- Prefix-cache-aware routing: Requests that share context go to the same replicas to maximize cache hits. Google reports this improves time-to-first-token for prefix-heavy multi-turn workloads and reduces accelerator requirements.
- Disaggregated serving: The compute-intensive prefill phase is separated from the memory-intensive decode phase onto independently scalable nodes. Google reports roughly 60 percent higher throughput along with better TTFT and TPOT.
- LLM-optimized autoscaling and observability: Autoscaling (HPA) uses model server metrics to scale efficiently. Cloud Monitoring dashboards track request rate, latency, errors, and saturation.
Typical Use Cases
Self-hosted LLM serving on GKE: Teams that run language models themselves on GKE distribute load in an accelerator- and cache-aware way instead of round-robin. This lowers latency and uses expensive GPUs and TPUs more evenly.
Multi-tenant LoRA serving: Multiple fine-tuned LoRA adapters run on shared accelerators and are addressed by model name. This makes it cost-effective to serve many specialized model variants.
Gradual model rollouts: New model versions initially receive only part of the traffic via traffic splitting by model name. This allows controlled rollouts and fast rollback without separate infrastructure.
Benefits
- Lower time-to-first-token and reduced token cost through LLM-aware and prefix-cache-aware routing.
- Higher throughput through disaggregated serving with separately scalable prefill and decode phases.
- No separate product charge, billing only for the GKE resources you use, with operation in EU regions.
Integration with innFactory
As a certified Google Cloud Partner, innFactory supports you with the adoption and operation of this service.
Typical Use Cases
Frequently Asked Questions
What is GKE Inference Gateway?
GKE Inference Gateway is an extension to the GKE Gateway that optimizes routing and load balancing for generative AI and LLM workloads on Kubernetes. Instead of generic round-robin, it uses live metrics such as KV cache utilization, accelerator utilization, and request-queue length to route requests to the right Pod replica. The single-cluster version has been generally available since September 2025.
When should I use GKE Inference Gateway?
Use it when you self-host LLMs or generative AI on GKE and want to lower latency and token cost. It fits multi-turn chat with prefix-cache-aware routing, serving many LoRA adapters on shared GPUs or TPUs, and gradual model rollouts with traffic splitting by model name.
How much does GKE Inference Gateway cost?
There is no separate product charge for GKE Inference Gateway. You pay for the underlying GKE resources: compute and accelerators (GPU/TPU), load balancing, and networking. Cost therefore scales with the size of your inference infrastructure.
Is GKE Inference Gateway available in the EU and how is it secured?
GKE Inference Gateway runs in the GKE regions, including EU regions, so data can be processed in the EU. For AI safety it integrates with Model Armor and NVIDIA NeMo Guardrails to screen prompts and responses for harmful content and threats.
