# Gemma 3 LLM - Self-Hosted AI on OKE
This cluster runs Gemma 3 1B IT, Google’s lightweight instruction-tuned language model, using llama.cpp for efficient CPU inference on ARM64 nodes.
## Endpoint

https://gemma.k8s.sudhanva.me

## High-Level Architecture

The Gemma deployment follows a sidecar pattern with authentication handled at the proxy level:
```mermaid
flowchart TB
    subgraph Internet
        User((User))
    end
    subgraph OCI["Oracle Cloud Infrastructure"]
        LB[OCI Load Balancer<br/>10 Mbps Free Tier]
        subgraph OKE["OKE Cluster"]
            subgraph Gateway["Envoy Gateway"]
                EG[Gateway API<br/>TLS Termination]
            end
            subgraph GemmaPod["Gemma Pod"]
                Auth[OpenResty<br/>Auth Proxy<br/>:8080]
                LLM[llama-server<br/>Gemma 3 1B<br/>:8000]
            end
            subgraph Storage["Persistent Storage"]
                PVC[(Model Cache<br/>5GB PVC)]
            end
        end
        subgraph Vault["OCI Vault"]
            Secret[API Key Secret]
        end
    end
    subgraph HuggingFace["HuggingFace Hub"]
        GGUF[(GGUF Model<br/>806 MB)]
    end
    User -->|HTTPS| LB
    LB --> EG
    EG -->|HTTP| Auth
    Auth -->|Validated| LLM
    LLM --> PVC
    GGUF -.->|First Run| LLM
    Secret -.->|ExternalSecret| Auth
```
## Request Flow

Every API request goes through multiple layers of processing:
```mermaid
sequenceDiagram
    participant U as User
    participant LB as OCI LB
    participant EG as Envoy Gateway
    participant AP as Auth Proxy
    participant LS as llama-server
    participant HF as HuggingFace
    U->>+LB: HTTPS Request
    LB->>+EG: TLS Terminated
    EG->>+AP: HTTP Forward
    Note over AP: Validate Bearer Token
    alt Invalid/Missing Token
        AP-->>U: 401 Unauthorized
    else Valid Token
        AP->>+LS: Proxy Request
        alt Model Not Cached
            LS->>HF: Download GGUF
            HF-->>LS: Model Weights
        end
        LS->>LS: Run Inference
        LS-->>-AP: Response
        AP-->>-EG: Response
        EG-->>-LB: Response
        LB-->>-U: HTTPS Response
    end
```
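The "Validate Bearer Token" step is a simple string check on the Authorization header. The real proxy implements it in Lua inside OpenResty; the sketch below expresses the same logic in Python (`EXPECTED_KEY` is a placeholder for the mounted secret value):

```python
import hmac

EXPECTED_KEY = "YOUR_API_KEY"  # placeholder; the real proxy reads this from the mounted K8s secret

def check_auth(authorization_header):
    """Return an HTTP status: 200 to proxy the request on, 401 to reject it."""
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return 401
    token = authorization_header[len(prefix):]
    # constant-time comparison avoids leaking key bytes via timing
    return 200 if hmac.compare_digest(token, EXPECTED_KEY) else 401

print(check_auth("Bearer YOUR_API_KEY"))  # 200
print(check_auth("Bearer wrong-key"))     # 401
```

Anything that is not an exact `Bearer <key>` match, including a missing header, gets the 401 shown in the diagram.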
## Authentication

All requests require a Bearer token in the Authorization header:

```bash
curl https://gemma.k8s.sudhanva.me/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```python
import openai

client = openai.OpenAI(
    base_url="https://gemma.k8s.sudhanva.me/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```javascript
const response = await fetch('https://gemma.k8s.sudhanva.me/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gemma-3-1b-it',
    messages: [{ role: 'user', content: 'Hello!' }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);
```

## Streaming
For longer generations, use streaming to prevent timeout errors and get real-time output:

```bash
curl -N https://gemma.k8s.sudhanva.me/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
```

```python
import openai

client = openai.OpenAI(
    base_url="https://gemma.k8s.sudhanva.me/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
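On the wire, an OpenAI-compatible stream like llama-server's is server-sent events: each line carries `data: <json chunk>` and the stream ends with `data: [DONE]`. The SDK hides this; a minimal sketch of the parsing it does (the sample lines are illustrative):

```python
import json

def collect_stream(lines):
    """Accumulate content deltas from OpenAI-style SSE lines into one string."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

sse = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(collect_stream(sse))  # Hello!
```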
## Pod Architecture

The Gemma deployment uses a sidecar pattern with two containers:
```mermaid
flowchart LR
    subgraph Pod["gemma pod"]
        direction TB
        subgraph AuthContainer["auth-proxy container"]
            OpenResty[OpenResty/nginx<br/>Port 8080]
            Lua[Lua Auth Script]
        end
        subgraph LLMContainer["llama-server container"]
            Server[llama-server<br/>Port 8000]
            Model[Gemma 3 1B<br/>Q4_K_M]
        end
        OpenResty --> Lua
        Lua -->|localhost:8000| Server
        Server --> Model
    end
    subgraph Volumes
        ConfigMap[nginx ConfigMap]
        SecretVol[API Key Secret]
        PVC[(Model PVC)]
    end
    ConfigMap --> OpenResty
    SecretVol --> Lua
    PVC --> Model
    Ingress[Envoy Gateway] -->|:8080| OpenResty
```
## Container Details

| Container | Image | Purpose | Resources |
|---|---|---|---|
| auth-proxy | openresty/openresty:alpine | Bearer token validation | 32-64 MB, 10-50m CPU |
| llama-server | ghcr.io/nsudhanva/llama-server:latest | LLM inference | 2-4 GB, 1.2-2 CPU |
## Model Configuration

```mermaid
flowchart LR
    subgraph Config["llama-server Configuration"]
        HFRepo["--hf-repo<br/>ggml-org/gemma-3-1b-it-GGUF"]
        HFFile["--hf-file<br/>gemma-3-1b-it-Q4_K_M.gguf"]
        Context["-c 4096<br/>Context Length"]
        GPU["-ngl 0<br/>CPU Only"]
    end
    HFRepo --> Download[Auto Download]
    HFFile --> Download
    Download --> Cache[(HF_HOME<br/>/models PVC)]
    Cache --> Inference[Inference Engine]
    Context --> Inference
    GPU --> Inference
```
## Settings

| Setting | Value | Description |
|---|---|---|
| --hf-repo | ggml-org/gemma-3-1b-it-GGUF | HuggingFace GGUF repository |
| --hf-file | gemma-3-1b-it-Q4_K_M.gguf | Q4 quantized model file |
| -c | 4096 | Context length (tokens) |
| -ngl | 0 | GPU layers (0 = CPU only) |
| --threads | 2 | Parallel inference threads |
| --batch-size | 512 | Batch size for prompt processing |
| HF_HOME | /models | Model cache directory (PVC mounted) |
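Put together, the table corresponds to a launch command along these lines. This is a sketch assembled from the settings above; the exact container args live in the deployment manifest:

```python
# Sketch: the llama-server invocation implied by the settings table.
settings = {
    "--hf-repo": "ggml-org/gemma-3-1b-it-GGUF",
    "--hf-file": "gemma-3-1b-it-Q4_K_M.gguf",
    "-c": "4096",          # context length in tokens
    "-ngl": "0",           # zero GPU layers: pure CPU inference
    "--threads": "2",
    "--batch-size": "512",
}
argv = ["llama-server"] + [part for kv in settings.items() for part in kv]
env = {"HF_HOME": "/models"}  # cache GGUF downloads on the PVC

print(" ".join(argv))
```

On first start, llama-server sees no cached file under `HF_HOME` and pulls the GGUF from the `--hf-repo`/`--hf-file` pair; subsequent restarts load straight from the PVC.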
## Resource Allocation

The deployment is optimized for OCI Always Free tier constraints:

```mermaid
pie showData
    title Memory Usage (4GB Limit)
    "Model Weights (Q4)" : 806
    "KV Cache" : 1500
    "Runtime Overhead" : 500
    "Available Buffer" : 1194
```
| Resource | Allocated | Notes |
|---|---|---|
| Memory Request | 2 GB | Minimum for model loading |
| Memory Limit | 4 GB | Allows KV cache growth |
| CPU Request | 1.2 cores | Dedicated compute |
| CPU Limit | 2 cores | Burst capacity for inference |
| Storage | 5 GB PVC | Model cache persistence |
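The chart's slices account for the full 4 GB memory limit, taking 1 GB as 1000 MB. A quick check of the arithmetic:

```python
# Memory budget from the chart above, in MB (1 GB taken as 1000 MB here).
budget_mb = {
    "model weights (Q4_K_M GGUF)": 806,
    "KV cache (4096-token context)": 1500,
    "runtime overhead": 500,
    "available buffer": 1194,
}
total = sum(budget_mb.values())
print(total)  # 4000 -> matches the 4 GB pod memory limit
```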
## Secrets Management

The API key flows from OCI Vault to the cluster via External Secrets Operator:
```mermaid
flowchart LR
    subgraph Terraform
        TF[terraform.tfvars<br/>gemma_api_key]
    end
    subgraph OCI["OCI Vault"]
        VaultSecret[Vault Secret<br/>gemma-api-key]
    end
    subgraph Kubernetes
        ESO[External Secrets<br/>Operator]
        ExtSecret[ExternalSecret CR]
        K8sSecret[K8s Secret<br/>gemma-api-key]
        Pod[Auth Proxy]
    end
    TF -->|terraform apply| VaultSecret
    VaultSecret -->|Sync| ESO
    ESO -->|Reads| ExtSecret
    ExtSecret -->|Creates| K8sSecret
    K8sSecret -->|Volume Mount| Pod
```
## Configuration

Set the API key in terraform.tfvars:

```hcl
gemma_api_key = "your-secret-key"
```

After setting it, run terraform apply to create the vault secret, then sync managed-secrets in ArgoCD.
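The ExternalSecret resource that ESO reconciles looks roughly like the sketch below. Names, the store reference, and the refresh interval are illustrative; check the managed-secrets app for the actual manifest:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gemma-api-key            # illustrative name
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: oci-vault              # illustrative store name
  target:
    name: gemma-api-key          # the K8s Secret the auth proxy mounts
  data:
    - secretKey: api-key
      remoteRef:
        key: gemma-api-key       # the OCI Vault secret created by Terraform
```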
## Available Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions (recommended) |
| /v1/completions | POST | Text completions |
| /v1/models | GET | List available models |
| /v1/embeddings | POST | Text embeddings |
| /health | GET | Health check |
| /slots | GET | View inference slots status |
## Monitoring

### Check Pod Status

```bash
kubectl get pods -l app=gemma
```

### View Logs

```bash
kubectl logs -f deploy/gemma -c llama-server
```

### Check Model Loading

```bash
kubectl exec deploy/gemma -c llama-server -- curl -s localhost:8000/health
```

### View Inference Slots

```bash
kubectl exec deploy/gemma -c llama-server -- curl -s localhost:8000/slots
```

## Why llama.cpp?

We evaluated multiple inference engines for running LLMs on the free tier:
```mermaid
quadrantChart
    title LLM Inference Engines Comparison
    x-axis Low Memory --> High Memory
    y-axis Poor ARM64 --> Great ARM64
    quadrant-1 Suitable for Free Tier
    quadrant-2 Good but Memory Heavy
    quadrant-3 Not Recommended
    quadrant-4 Limited ARM Support
    llama.cpp: [0.2, 0.9]
    Ollama: [0.4, 0.6]
    vLLM: [0.85, 0.3]
    TGI: [0.75, 0.4]
```
## Comparison

| Feature | llama.cpp | vLLM | Ollama |
|---|---|---|---|
| Memory usage | Best | Heavy | Good |
| ARM64 CPU | Native NEON | Experimental | Good |
| OpenAI API | Native | Native | Wrapper |
| Quantization | GGUF (Q4=806MB) | BF16 (2GB+) | GGUF |
| Production ready | Yes | Yes | Dev-focused |
| Auto model download | Yes | Yes | Manual pull |
## Troubleshooting

### Pod in CrashLoopBackOff

Check logs for the specific error:

```bash
kubectl logs deploy/gemma -c llama-server --previous
```

Common causes:

- OOMKilled: reduce the context size (-c) or use a smaller quantization
- Model download failed: check your HuggingFace token if using gated models
### 401 Unauthorized

Verify your API key:

```bash
kubectl get secret gemma-api-key -o jsonpath='{.data.api-key}' | base64 -d
```

### Slow Inference

This is expected on CPU. For better performance:
- Use streaming to get incremental responses
- Keep prompts concise
- Reduce max_tokens in requests