# Gemma 3 LLM - Self-Hosted AI on OKE
This cluster runs Gemma 3 1B IT, Google’s lightweight instruction-tuned language model, using llama.cpp for efficient CPU inference on ARM64 nodes.
## Endpoint

https://gemma.k8s.sudhanva.me

## High-Level Architecture

The Gemma deployment follows a sidecar pattern with authentication handled at the proxy level:
```mermaid
flowchart TB
    subgraph Internet
        User((User))
    end
    subgraph OCI["Oracle Cloud Infrastructure"]
        LB[OCI Load Balancer<br/>10 Mbps Free Tier]
        subgraph OKE["OKE Cluster"]
            subgraph Gateway["Envoy Gateway"]
                EG[Gateway API<br/>TLS Termination]
            end
            subgraph GemmaPod["Gemma Pod"]
                Auth[OpenResty<br/>Auth Proxy<br/>:8080]
                LLM[llama-server<br/>Gemma 3 1B<br/>:8000]
            end
            subgraph Storage["Persistent Storage"]
                PVC[(Model Cache<br/>5GB PVC)]
            end
        end
        subgraph Vault["OCI Vault"]
            Secret[API Key Secret]
        end
    end
    subgraph HuggingFace["HuggingFace Hub"]
        GGUF[(GGUF Model<br/>806 MB)]
    end
    User -->|HTTPS| LB
    LB --> EG
    EG -->|HTTP| Auth
    Auth -->|Validated| LLM
    LLM --> PVC
    GGUF -.->|First Run| LLM
    Secret -.->|ExternalSecret| Auth
```
## Request Flow

Every API request goes through multiple layers of processing:
```mermaid
sequenceDiagram
    participant U as User
    participant LB as OCI LB
    participant EG as Envoy Gateway
    participant AP as Auth Proxy
    participant LS as llama-server
    participant HF as HuggingFace
    U->>+LB: HTTPS Request
    LB->>+EG: TLS Terminated
    EG->>+AP: HTTP Forward
    Note over AP: Validate Bearer Token
    alt Invalid/Missing Token
        AP-->>U: 401 Unauthorized
    else Valid Token
        AP->>+LS: Proxy Request
        alt Model Not Cached
            LS->>HF: Download GGUF
            HF-->>LS: Model Weights
        end
        LS->>LS: Run Inference
        LS-->>-AP: Response
        AP-->>-EG: Response
        EG-->>-LB: Response
        LB-->>-U: HTTPS Response
    end
```
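The "Validate Bearer Token" step is a simple string check on the Authorization header. The real proxy implements it in Lua inside OpenResty; the sketch below expresses the same logic in Python (`EXPECTED_KEY` is a placeholder for the mounted secret value):

```python
import hmac

EXPECTED_KEY = "YOUR_API_KEY"  # placeholder; the real proxy reads this from the mounted K8s secret

def check_auth(authorization_header):
    """Return an HTTP status: 200 to proxy the request on, 401 to reject it."""
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return 401
    token = authorization_header[len(prefix):]
    # constant-time comparison avoids leaking key bytes via timing
    return 200 if hmac.compare_digest(token, EXPECTED_KEY) else 401

print(check_auth("Bearer YOUR_API_KEY"))  # 200
print(check_auth("Bearer wrong-key"))     # 401
```

Anything that is not an exact `Bearer <key>` match, including a missing header, gets the 401 shown in the diagram.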
## Authentication

All requests require a Bearer token in the Authorization header:

```bash
curl https://gemma.k8s.sudhanva.me/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```python
import openai

client = openai.OpenAI(
    base_url="https://gemma.k8s.sudhanva.me/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```javascript
const response = await fetch('https://gemma.k8s.sudhanva.me/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gemma-3-1b-it',
    messages: [{ role: 'user', content: 'Hello!' }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);
```

## Streaming
For longer generations, use streaming to prevent timeout errors and get real-time output:

```bash
curl -N https://gemma.k8s.sudhanva.me/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
```

```python
import openai

client = openai.OpenAI(
    base_url="https://gemma.k8s.sudhanva.me/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
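On the wire, an OpenAI-compatible stream like llama-server's is server-sent events: each line carries `data: <json chunk>` and the stream ends with `data: [DONE]`. The SDK hides this; a minimal sketch of the parsing it does (the sample lines are illustrative):

```python
import json

def collect_stream(lines):
    """Accumulate content deltas from OpenAI-style SSE lines into one string."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

sse = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print(collect_stream(sse))  # Hello!
```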
## Pod Architecture

The Gemma deployment uses a sidecar pattern with two containers:
```mermaid
flowchart LR
    subgraph Pod["gemma pod"]
        direction TB
        subgraph AuthContainer["auth-proxy container"]
            OpenResty[OpenResty/nginx<br/>Port 8080]
            Lua[Lua Auth Script]
        end
        subgraph LLMContainer["llama-server container"]
            Server[llama-server<br/>Port 8000]
            Model[Gemma 3 1B<br/>Q4_K_M]
        end
        OpenResty --> Lua
        Lua -->|localhost:8000| Server
        Server --> Model
    end
    subgraph Volumes
        ConfigMap[nginx ConfigMap]
        SecretVol[API Key Secret]
        PVC[(Model PVC)]
    end
    ConfigMap --> OpenResty
    SecretVol --> Lua
    PVC --> Model
    Ingress[Envoy Gateway] -->|:8080| OpenResty
```
## Container Details

| Container | Image | Purpose | Resources |
|---|---|---|---|
| auth-proxy | openresty/openresty:alpine | Bearer token validation | 32-64 MB, 10-50m CPU |
| llama-server | ghcr.io/nsudhanva/llama-server:latest | LLM inference | 2-4 GB, 1.2-2 CPU |
## Model Configuration

```mermaid
flowchart LR
    subgraph Config["llama-server Configuration"]
        HFRepo["--hf-repo<br/>ggml-org/gemma-3-1b-it-GGUF"]
        HFFile["--hf-file<br/>gemma-3-1b-it-Q4_K_M.gguf"]
        Context["-c 4096<br/>Context Length"]
        GPU["-ngl 0<br/>CPU Only"]
    end
    HFRepo --> Download[Auto Download]
    HFFile --> Download
    Download --> Cache[(HF_HOME<br/>/models PVC)]
    Cache --> Inference[Inference Engine]
    Context --> Inference
    GPU --> Inference
```
## Settings

| Setting | Value | Description |
|---|---|---|
| --hf-repo | ggml-org/gemma-3-1b-it-GGUF | HuggingFace GGUF repository |
| --hf-file | gemma-3-1b-it-Q4_K_M.gguf | Q4 quantized model file |
| -c | 4096 | Context length (tokens) |
| -ngl | 0 | GPU layers (0 = CPU only) |
| --threads | 2 | Parallel inference threads |
| --batch-size | 512 | Batch size for prompt processing |
| HF_HOME | /models | Model cache directory (PVC mounted) |
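Put together, the table corresponds to a launch command along these lines. This is a sketch assembled from the settings above; the exact container args live in the deployment manifest:

```python
# Sketch: the llama-server invocation implied by the settings table.
settings = {
    "--hf-repo": "ggml-org/gemma-3-1b-it-GGUF",
    "--hf-file": "gemma-3-1b-it-Q4_K_M.gguf",
    "-c": "4096",          # context length in tokens
    "-ngl": "0",           # zero GPU layers: pure CPU inference
    "--threads": "2",
    "--batch-size": "512",
}
argv = ["llama-server"] + [part for kv in settings.items() for part in kv]
env = {"HF_HOME": "/models"}  # cache GGUF downloads on the PVC

print(" ".join(argv))
```

On first start, llama-server sees no cached file under `HF_HOME` and pulls the GGUF from the `--hf-repo`/`--hf-file` pair; subsequent restarts load straight from the PVC.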
## Resource Allocation

The deployment is optimized for OCI Always Free tier constraints:

```mermaid
pie showData
    title Memory Usage (4GB Limit)
    "Model Weights (Q4)" : 806
    "KV Cache" : 1500
    "Runtime Overhead" : 500
    "Available Buffer" : 1194
```
| Resource | Allocated | Notes |
|---|---|---|
| Memory Request | 2 GB | Minimum for model loading |
| Memory Limit | 4 GB | Allows KV cache growth |
| CPU Request | 1.2 cores | Dedicated compute |
| CPU Limit | 2 cores | Burst capacity for inference |
| Storage | 5 GB PVC | Model cache persistence |
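The chart's slices account for the full 4 GB memory limit, taking 1 GB as 1000 MB. A quick check of the arithmetic:

```python
# Memory budget from the chart above, in MB (1 GB taken as 1000 MB here).
budget_mb = {
    "model weights (Q4_K_M GGUF)": 806,
    "KV cache (4096-token context)": 1500,
    "runtime overhead": 500,
    "available buffer": 1194,
}
total = sum(budget_mb.values())
print(total)  # 4000 -> matches the 4 GB pod memory limit
```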
## Secrets Management

The API key flows from OCI Vault to the cluster via External Secrets Operator:
```mermaid
flowchart LR
    subgraph Terraform
        TF[terraform.tfvars<br/>gemma_api_key]
    end
    subgraph OCI["OCI Vault"]
        VaultSecret[Vault Secret<br/>gemma-api-key]
    end
    subgraph Kubernetes
        ESO[External Secrets<br/>Operator]
        ExtSecret[ExternalSecret CR]
        K8sSecret[K8s Secret<br/>gemma-api-key]
        Pod[Auth Proxy]
    end
    TF -->|terraform apply| VaultSecret
    VaultSecret -->|Sync| ESO
    ESO -->|Reads| ExtSecret
    ExtSecret -->|Creates| K8sSecret
    K8sSecret -->|Volume Mount| Pod
```
## Configuration

Set the API key in terraform.tfvars:

```hcl
gemma_api_key = "your-secret-key"
```

After setting it, run terraform apply to create the vault secret, then sync managed-secrets in ArgoCD.
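The ExternalSecret resource that ESO reconciles looks roughly like the sketch below. Names, the store reference, and the refresh interval are illustrative; check the managed-secrets app for the actual manifest:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gemma-api-key            # illustrative name
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: oci-vault              # illustrative store name
  target:
    name: gemma-api-key          # the K8s Secret the auth proxy mounts
  data:
    - secretKey: api-key
      remoteRef:
        key: gemma-api-key       # the OCI Vault secret created by Terraform
```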
## Available Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions (recommended) |
| /v1/completions | POST | Text completions |
| /v1/models | GET | List available models |
| /v1/embeddings | POST | Text embeddings |
| /health | GET | Health check |
| /slots | GET | View inference slots status |
## Monitoring

### Check Pod Status

```bash
kubectl get pods -l app=gemma
```

### View Logs

```bash
kubectl logs -f deploy/gemma -c llama-server
```

### Check Model Loading

```bash
kubectl exec deploy/gemma -c llama-server -- curl -s localhost:8000/health
```

### View Inference Slots

```bash
kubectl exec deploy/gemma -c llama-server -- curl -s localhost:8000/slots
```

## Why llama.cpp?

We evaluated multiple inference engines for running LLMs on the free tier:
```mermaid
quadrantChart
    title LLM Inference Engines Comparison
    x-axis Low Memory --> High Memory
    y-axis Poor ARM64 --> Great ARM64
    quadrant-1 Suitable for Free Tier
    quadrant-2 Good but Memory Heavy
    quadrant-3 Not Recommended
    quadrant-4 Limited ARM Support
    llama.cpp: [0.2, 0.9]
    Ollama: [0.4, 0.6]
    vLLM: [0.85, 0.3]
    TGI: [0.75, 0.4]
```
## Comparison

| Feature | llama.cpp | vLLM | Ollama |
|---|---|---|---|
| Memory usage | Best | Heavy | Good |
| ARM64 CPU | Native NEON | Experimental | Good |
| OpenAI API | Native | Native | Wrapper |
| Quantization | GGUF (Q4=806MB) | BF16 (2GB+) | GGUF |
| Production ready | Yes | Yes | Dev-focused |
| Auto model download | Yes | Yes | Manual pull |
## Troubleshooting

### Pod in CrashLoopBackOff

Check logs for the specific error:

```bash
kubectl logs deploy/gemma -c llama-server --previous
```

Common causes:

- OOMKilled: reduce the context size (-c) or use a smaller quantization
- Model download failed: check your HuggingFace token if using gated models
### 401 Unauthorized

Verify your API key:

```bash
kubectl get secret gemma-api-key -o jsonpath='{.data.api-key}' | base64 -d
```

### Slow Inference

This is expected on CPU. For better performance:
- Use streaming to get incremental responses
- Keep prompts concise
- Reduce max_tokens in requests