
Gemma 3 LLM - Self-Hosted AI on OKE

This cluster runs Gemma 3 1B IT, Google’s lightweight instruction-tuned language model, using llama.cpp for efficient CPU inference on ARM64 nodes.

https://gemma.k8s.sudhanva.me

The Gemma deployment follows a sidecar pattern with authentication handled at the proxy level:

flowchart TB
    subgraph Internet
        User((User))
    end

    subgraph OCI["Oracle Cloud Infrastructure"]
        LB[OCI Load Balancer<br/>10 Mbps Free Tier]

        subgraph OKE["OKE Cluster"]
            subgraph Gateway["Envoy Gateway"]
                EG[Gateway API<br/>TLS Termination]
            end

            subgraph GemmaPod["Gemma Pod"]
                Auth[OpenResty<br/>Auth Proxy<br/>:8080]
                LLM[llama-server<br/>Gemma 3 1B<br/>:8000]
            end

            subgraph Storage["Persistent Storage"]
                PVC[(Model Cache<br/>5GB PVC)]
            end
        end

        subgraph Vault["OCI Vault"]
            Secret[API Key Secret]
        end
    end

    subgraph HuggingFace["HuggingFace Hub"]
        GGUF[(GGUF Model<br/>806 MB)]
    end

    User -->|HTTPS| LB
    LB --> EG
    EG -->|HTTP| Auth
    Auth -->|Validated| LLM
    LLM --> PVC
    GGUF -.->|First Run| LLM
    Secret -.->|ExternalSecret| Auth

Every API request goes through multiple layers of processing:

sequenceDiagram
    participant U as User
    participant LB as OCI LB
    participant EG as Envoy Gateway
    participant AP as Auth Proxy
    participant LS as llama-server
    participant HF as HuggingFace

    U->>+LB: HTTPS Request
    LB->>+EG: TLS Terminated
    EG->>+AP: HTTP Forward

    Note over AP: Validate Bearer Token

    alt Invalid/Missing Token
        AP-->>U: 401 Unauthorized
    else Valid Token
        AP->>+LS: Proxy Request

        alt Model Not Cached
            LS->>HF: Download GGUF
            HF-->>LS: Model Weights
        end

        LS->>LS: Run Inference
        LS-->>-AP: Response
        AP-->>-EG: Response
        EG-->>-LB: Response
        LB-->>-U: HTTPS Response
    end
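
The token check at the proxy is simple to state: the Lua script compares the Bearer token against the API key mounted from the Kubernetes Secret. A minimal Python sketch of that logic (the real implementation is a Lua script inside OpenResty; `API_KEY` stands in for the mounted secret value):

```python
import hmac

API_KEY = "your-secret-key"  # placeholder for the value mounted from the K8s Secret

def authorize(auth_header):
    """Return 200 if the Bearer token matches, else 401 (mirrors the proxy check)."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return 401
    token = auth_header[len("Bearer "):]
    # constant-time compare avoids leaking key bytes through timing differences
    return 200 if hmac.compare_digest(token, API_KEY) else 401
```

On a mismatch the proxy short-circuits with 401 Unauthorized, so unauthenticated traffic never reaches llama-server.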

All requests require a Bearer token in the Authorization header:

curl https://gemma.k8s.sudhanva.me/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
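
The endpoint is OpenAI-compatible, so any OpenAI-style client works as well. A minimal Python sketch using only the standard library (`YOUR_API_KEY` is a placeholder; the request is built but not sent here):

```python
import json
import urllib.request

API_URL = "https://gemma.k8s.sudhanva.me/v1/chat/completions"

def chat_request(prompt, api_key):
    """Build a POST request equivalent to the curl example above."""
    body = json.dumps({
        "model": "gemma-3-1b-it",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("Hello!", "YOUR_API_KEY")
# urllib.request.urlopen(req) would send it and return the JSON completion
```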

For longer generations, use streaming to prevent timeout errors and get real-time output:

curl -N https://gemma.k8s.sudhanva.me/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
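
With `"stream": true` the server emits OpenAI-style server-sent events: one JSON chunk per `data:` line, terminated by `data: [DONE]`. A client-side parsing sketch, assuming the standard chat-completion chunk schema:

```python
import json

def extract_deltas(sse_lines):
    """Join the content fragments carried in 'data:' lines of an SSE stream."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print(extract_deltas(sample))  # Hello
```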

The Gemma deployment uses a sidecar pattern with two containers:

flowchart LR
    subgraph Pod["gemma pod"]
        direction TB

        subgraph AuthContainer["auth-proxy container"]
            OpenResty[OpenResty/nginx<br/>Port 8080]
            Lua[Lua Auth Script]
        end

        subgraph LLMContainer["llama-server container"]
            Server[llama-server<br/>Port 8000]
            Model[Gemma 3 1B<br/>Q4_K_M]
        end

        OpenResty --> Lua
        Lua -->|localhost:8000| Server
        Server --> Model
    end

    subgraph Volumes
        ConfigMap[nginx ConfigMap]
        SecretVol[API Key Secret]
        PVC[(Model PVC)]
    end

    ConfigMap --> OpenResty
    SecretVol --> Lua
    PVC --> Model

    Ingress[Envoy Gateway] -->|:8080| OpenResty

| Container | Image | Purpose | Resources |
| --- | --- | --- | --- |
| auth-proxy | `openresty/openresty:alpine` | Bearer token validation | 32-64 MB, 10-50m CPU |
| llama-server | `ghcr.io/nsudhanva/llama-server:latest` | LLM inference | 2-4 GB, 1.2-2 CPU |

flowchart LR
    subgraph Config["llama-server Configuration"]
        HFRepo["--hf-repo<br/>ggml-org/gemma-3-1b-it-GGUF"]
        HFFile["--hf-file<br/>gemma-3-1b-it-Q4_K_M.gguf"]
        Context["-c 4096<br/>Context Length"]
        GPU["-ngl 0<br/>CPU Only"]
    end

    HFRepo --> Download[Auto Download]
    HFFile --> Download
    Download --> Cache[(HF_HOME<br/>/models PVC)]
    Cache --> Inference[Inference Engine]
    Context --> Inference
    GPU --> Inference

| Setting | Value | Description |
| --- | --- | --- |
| `--hf-repo` | `ggml-org/gemma-3-1b-it-GGUF` | HuggingFace GGUF repository |
| `--hf-file` | `gemma-3-1b-it-Q4_K_M.gguf` | Q4 quantized model file |
| `-c` | 4096 | Context length (tokens) |
| `-ngl` | 0 | GPU layers (0 = CPU only) |
| `--threads` | 2 | Parallel inference threads |
| `--batch-size` | 512 | Batch size for prompt processing |
| `HF_HOME` | `/models` | Model cache directory (PVC mounted) |

The deployment is optimized for OCI Always Free tier constraints:

pie showData
    title Memory Usage (4GB Limit)
    "Model Weights (Q4)" : 806
    "KV Cache" : 1500
    "Runtime Overhead" : 500
    "Available Buffer" : 1194

| Resource | Allocated | Notes |
| --- | --- | --- |
| Memory request | 2 GB | Minimum for model loading |
| Memory limit | 4 GB | Allows KV cache growth |
| CPU request | 1.2 cores | Dedicated compute |
| CPU limit | 2 cores | Burst capacity for inference |
| Storage | 5 GB PVC | Model cache persistence |
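
The budget in the chart can be sanity-checked with simple arithmetic (figures in MB taken from the chart above, treating the 4 GB limit as 4000 MB):

```python
model_q4 = 806   # Q4_K_M model weights
kv_cache = 1500  # worst-case KV cache at the 4096-token context
runtime  = 500   # llama.cpp and OpenResty overhead
limit    = 4000  # pod memory limit

buffer = limit - (model_q4 + kv_cache + runtime)
print(buffer)  # 1194 MB of headroom before the pod risks an OOMKill
```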

The API key flows from OCI Vault to the cluster via External Secrets Operator:

flowchart LR
    subgraph Terraform
        TF[terraform.tfvars<br/>gemma_api_key]
    end

    subgraph OCI["OCI Vault"]
        VaultSecret[Vault Secret<br/>gemma-api-key]
    end

    subgraph Kubernetes
        ESO[External Secrets<br/>Operator]
        ExtSecret[ExternalSecret CR]
        K8sSecret[K8s Secret<br/>gemma-api-key]
        Pod[Auth Proxy]
    end

    TF -->|terraform apply| VaultSecret
    VaultSecret -->|Sync| ESO
    ESO -->|Reads| ExtSecret
    ExtSecret -->|Creates| K8sSecret
    K8sSecret -->|Volume Mount| Pod

Set the API key in terraform.tfvars:

gemma_api_key = "your-secret-key"

After setting, run terraform apply to create the vault secret, then sync managed-secrets in ArgoCD.

| Endpoint | Method | Description |
| --- | --- | --- |
| `/v1/chat/completions` | POST | Chat completions (recommended) |
| `/v1/completions` | POST | Text completions |
| `/v1/models` | GET | List available models |
| `/v1/embeddings` | POST | Text embeddings |
| `/health` | GET | Health check |
| `/slots` | GET | View inference slot status |

Check pod status, follow the server logs, and probe the health and slots endpoints from inside the container:

kubectl get pods -l app=gemma
kubectl logs -f deploy/gemma -c llama-server
kubectl exec deploy/gemma -c llama-server -- curl -s localhost:8000/health
kubectl exec deploy/gemma -c llama-server -- curl -s localhost:8000/slots

We evaluated multiple inference engines for running LLMs on the free tier:

quadrantChart
    title LLM Inference Engines Comparison
    x-axis Low Memory --> High Memory
    y-axis Poor ARM64 --> Great ARM64
    quadrant-1 Suitable for Free Tier
    quadrant-2 Good but Memory Heavy
    quadrant-3 Not Recommended
    quadrant-4 Limited ARM Support
    llama.cpp: [0.2, 0.9]
    Ollama: [0.4, 0.6]
    vLLM: [0.85, 0.3]
    TGI: [0.75, 0.4]
| Feature | llama.cpp | vLLM | Ollama |
| --- | --- | --- | --- |
| Memory usage | Best | Heavy | Good |
| ARM64 CPU | Native NEON | Experimental | Good |
| OpenAI API | Native | Native | Wrapper |
| Quantization | GGUF (Q4 = 806 MB) | BF16 (2 GB+) | GGUF |
| Production ready | Yes | Yes | Dev-focused |
| Auto model download | Yes | Yes | Manual pull |

If the pod crashes or restarts, check the logs for the specific error:

kubectl logs deploy/gemma -c llama-server --previous

Common causes:

  • OOMKilled: Reduce context size (-c) or use smaller quantization
  • Model download failed: Check HuggingFace token if using gated models

If requests return 401 Unauthorized, verify the API key stored in the cluster:

kubectl get secret gemma-api-key -o jsonpath='{.data.api-key}' | base64 -d

Slow responses are expected with CPU-only inference. For better performance:

  • Use streaming to get incremental responses
  • Keep prompts concise
  • Reduce max_tokens in requests
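
A request that applies the last two tips together: capping `max_tokens` bounds the slow CPU decode phase, and `stream: true` delivers tokens as they are generated. Both fields are part of the OpenAI-compatible request schema:

```python
import json

payload = {
    "model": "gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Summarize quantum computing in two sentences."}],
    "max_tokens": 128,  # bound the generation length
    "stream": True,     # incremental output instead of one long wait
}
print(json.dumps(payload))
```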