GKE Inference Gateway - LLM Routing on Kubernetes · innFactory

What is GKE Inference Gateway?

GKE Inference Gateway is a Kubernetes-native gateway for serving generative AI on Google Kubernetes Engine (GKE). It extends the GKE Gateway and builds on the open Gateway API Inference Extension (llm-d). Unlike classic load balancing, it makes model-aware routing decisions: requests go to the best-suited Pod replica based on live metrics such as KV cache utilization, GPU or TPU utilization, and request-queue length, rather than generic round-robin.

This solves a core problem when running your own LLM inference on Kubernetes: generic load balancing ignores the characteristics of language models, leading to high time-to-first-token, uneven accelerator utilization, and unnecessary token cost. LLM-aware routing and disaggregated serving lower latency and cost measurably and make serving generative AI on GKE more predictable.

Core Features

Model-aware routing: Requests are routed by model name (OpenAI API format). This enables traffic splitting, gradual rollouts, and dynamic multiplexing of multiple LoRA adapters on shared accelerators.
Prefix-cache-aware routing: Requests that share context go to the same replicas to maximize cache hits. Google reports this improves time-to-first-token for prefix-heavy multi-turn workloads and reduces accelerator requirements.
Disaggregated serving: The compute-intensive prefill phase is separated from the memory-intensive decode phase onto independently scalable nodes. Google reports roughly 60 percent higher throughput along with better TTFT and TPOT.
LLM-optimized autoscaling and observability: Autoscaling (HPA) uses model server metrics to scale efficiently. Cloud Monitoring dashboards track request rate, latency, errors, and saturation.

Typical Use Cases

Self-hosted LLM serving on GKE: Teams that run language models themselves on GKE distribute load in an accelerator- and cache-aware way instead of round-robin. This lowers latency and uses expensive GPUs and TPUs more evenly.

Multi-tenant LoRA serving: Multiple fine-tuned LoRA adapters run on shared accelerators and are addressed by model name. This makes it cost-effective to serve many specialized model variants.

Gradual model rollouts: New model versions initially receive only part of the traffic via traffic splitting by model name. This allows controlled rollouts and fast rollback without separate infrastructure.

Benefits

Lower time-to-first-token and reduced token cost through LLM-aware and prefix-cache-aware routing.
Higher throughput through disaggregated serving with separately scalable prefill and decode phases.
No separate product charge, billing only for the GKE resources you use, with operation in EU regions.

Integration with innFactory

As a certified Google Cloud Partner, innFactory supports you with the adoption and operation of this service.

Frequently Asked Questions

What is GKE Inference Gateway?

GKE Inference Gateway is an extension to the GKE Gateway that optimizes routing and load balancing for generative AI and LLM workloads on Kubernetes. Instead of generic round-robin, it uses live metrics such as KV cache utilization, accelerator utilization, and request-queue length to route requests to the right Pod replica. The single-cluster version has been generally available since September 2025.

When should I use GKE Inference Gateway?

Use it when you self-host LLMs or generative AI on GKE and want to lower latency and token cost. It fits multi-turn chat with prefix-cache-aware routing, serving many LoRA adapters on shared GPUs or TPUs, and gradual model rollouts with traffic splitting by model name.

How much does GKE Inference Gateway cost?

There is no separate product charge for GKE Inference Gateway. You pay for the underlying GKE resources: compute and accelerators (GPU/TPU), load balancing, and networking. Cost therefore scales with the size of your inference infrastructure.

Is GKE Inference Gateway available in the EU and how is it secured?

GKE Inference Gateway runs in the GKE regions, including EU regions, so data can be processed in the EU. For AI safety it integrates with Model Armor and NVIDIA NeMo Guardrails to screen prompts and responses for harmful content and threats.

GKE Inference Gateway - LLM Routing on Kubernetes

What is GKE Inference Gateway?

Core Features

Typical Use Cases

Benefits

Integration with innFactory

Typical Use Cases

Frequently Asked Questions

What is GKE Inference Gateway?

When should I use GKE Inference Gateway?

How much does GKE Inference Gateway cost?

Is GKE Inference Gateway available in the EU and how is it secured?

Quick Links

Google Cloud Partner

Similar Products from Other Clouds

Amazon Augmented AI (A2I) - Human Review for ML

Amazon Bedrock AgentCore - AI Agent Runtime

Amazon Bedrock Agents (Classic): Status and Alternative

Amazon Bedrock Data Automation - Structure Data

Amazon Bedrock Guardrails - Safety for Generative AI

Amazon Bedrock Knowledge Bases: Managed RAG

Ready to start with GKE Inference Gateway - LLM Routing on Kubernetes?