
Gemma LLM - Removed

Gemma 3n E2B ran CPU-only inference on the free-tier ARM64 nodes (llama.cpp, Q4_K_M GGUF). The 12K+ token system prompts OpenClaw generates took 15+ minutes to process on ARM CPUs, causing persistent agent timeouts.
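As a back-of-envelope sanity check on that bottleneck: the prefill throughput implied by the figures above (12K+ tokens in roughly 15 minutes) works out to about 13 tokens/s. The rate below is an assumption derived from those numbers, not a measured benchmark:

```python
# Rough check of the prompt-processing bottleneck described above.
# PREFILL_TOK_PER_S is inferred from the page's own figures
# (12K tokens in ~15 minutes); it is an assumption, not a benchmark.
PROMPT_TOKENS = 12_000
PREFILL_TOK_PER_S = 13.3  # assumed ARM64 CPU prefill throughput

seconds = PROMPT_TOKENS / PREFILL_TOK_PER_S
print(f"{seconds / 60:.1f} min to process the system prompt")  # ≈ 15.0 min
```

At that rate, every agent turn pays a double-digit-minute tax before generation even starts, which is why no amount of batching or quantization tuning could close the gap.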

After the available optimizations (batch size, context window, prompt caching, quantization levels) were exhausted, local inference was replaced with cloud APIs. Cloud models deliver sub-second responses with no pressure on node resources.

| Before | After |
| --- | --- |
| ghcr.io/nsudhanva/llama-server sidecar | No local LLM |
| gemma.k8s.sudhanva.me endpoint | Removed |
| gemma-api-key secret | Removed |
| argocd/apps/gemma/ | Deleted |
| 5Gi model cache PVC | Freed |

OpenClaw’s model configuration in argocd/apps/openclaw/openclawinstance.yaml handles all AI provider routing. See the OpenClaw docs for details.