
OKE on OCI Troubleshooting - Common Issues and Solutions

flowchart TB
    subgraph Issues["Common Issues"]
        OOC[Out of Capacity]
        ARM[ARM64 Images]
        DNS[DNS Not Resolving]
    end

    subgraph Solutions["Solutions"]
        AD[Change Availability Domain]
        Multi[Multi-arch Build]
        Annot[Add DNS Annotation]
    end

    OOC --> AD
    ARM --> Multi
    DNS --> Annot

Ampere A1 instances are frequently unavailable in popular regions.

flowchart LR
    subgraph Problem
        OCI[OCI API] -->|Out of Capacity| Fail[Provisioning Failed]
    end

    subgraph Solution
        AD0[AD-0] -.->|try| OCI2[OCI API]
        AD1[AD-1] -.->|try| OCI2
        AD2[AD-2] -.->|try| OCI2
        OCI2 --> Success[Provisioning OK]
    end

Fix: Change the availability_domain index in compute.tf to 0, 1, or 2 until provisioning succeeds.
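A minimal sketch of that change in compute.tf, assuming the configuration looks up availability domains with a data source (resource and variable names here are illustrative):

```hcl
# compute.tf - try a different AD when one is out of A1 capacity
data "oci_identity_availability_domains" "ads" {
  compartment_id = var.compartment_id
}

resource "oci_core_instance" "arm_node" {
  # Change the index (0, 1, or 2) to target another availability domain
  availability_domain = data.oci_identity_availability_domains.ads.availability_domains[1].name
  shape               = "VM.Standard.A1.Flex"
  # ...
}
```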

Images built only for amd64 fail with exec format error on ARM64 nodes.

Build multi-architecture images using GitHub Actions with docker/setup-qemu-action for linux/amd64,linux/arm64.
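A sketch of such a workflow, assuming the image is pushed to GHCR (workflow name, tag, and registry path are illustrative):

```yaml
# .github/workflows/build.yml - multi-arch image build (illustrative names)
name: build
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3    # emulation for cross-arch builds
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ghcr.io/${{ github.repository }}/app:latest
```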

OKE uses the OCI CSI driver (Block Volume) for persistent storage. Ensure your StorageClass is configured correctly (default oci-bv is usually provided).
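For example, a claim bound to the default oci-bv class might look like this (the claim name is illustrative; note that OCI Block Volume enforces a minimum volume size, commonly 50 Gi):

```yaml
# Illustrative PVC using the default OCI Block Volume StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: oci-bv
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```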

When using Kustomize to inflate Helm charts, Argo CD requires explicit enablement.

Error: must specify --enable-helm

Fix: Patch argocd-cm ConfigMap and restart the repo-server:

kubectl -n argocd patch cm argocd-cm --type=merge -p '{"data":{"kustomize.buildOptions":"--enable-helm"}}'
kubectl -n argocd rollout restart deploy argocd-repo-server

OCI requires OpenSSH formatted public keys, not PEM format.

Convert PEM keys:

ssh-keygen -y -f ~/.oci/oci_api_key.pem > ssh_key.pub

Docker Hub rate-limits OCI artifact requests from cloud IPs.

Use Git-based installation for Envoy Gateway instead of Helm OCI.
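One way to do this is to pull the release manifest from GitHub via Kustomize (the version and file path are assumptions; check the Envoy Gateway releases page for the current manifest URL):

```yaml
# kustomization.yaml - install Envoy Gateway from a GitHub release instead of Helm OCI
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - https://github.com/envoyproxy/gateway/releases/download/v1.1.0/install.yaml
```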

Scoped Cloudflare API tokens may fail to discover the zone ID automatically.

Error: Could not route to /client/v4/zones//dns_records...

Fix: Explicitly provide the zone ID with --zone-id-filter=<zone-id>.
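In the External DNS Deployment this is an extra container argument (a sketch; substitute your own zone ID):

```yaml
# external-dns container args (fragment)
args:
  - --provider=cloudflare
  - --zone-id-filter=<zone-id>
```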

External DNS may not detect HTTPRoute targets if the Gateway status address is internal.

Fix: Add the annotation external-dns.alpha.kubernetes.io/target: <public-ip> to the HTTPRoute.
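For example (the IP and hostname are illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: docs-route
  annotations:
    # Publish this IP instead of the Gateway's internal status address
    external-dns.alpha.kubernetes.io/target: "203.0.113.10"
spec:
  hostnames:
    - k8s.example.com
  # ...
```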

If the Gateway shows PROGRAMMED: False with RefNotPermitted errors, it cannot access TLS secrets from other namespaces.

Symptom: kubectl describe gateway public-gateway shows:

Certificate ref to secret argocd/argocd-tls not permitted by any ReferenceGrant

Fix: Create ReferenceGrants in each namespace containing TLS secrets:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-gateway-to-secrets
  namespace: argocd
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: Gateway
      namespace: envoy-gateway-system
  to:
    - group: ""
      kind: Secret

The Envoy Gateway config template includes these ReferenceGrants automatically.

Applications may fail to sync if dependencies aren’t deployed yet.

Symptom: sync fails with one or more synchronization tasks are not valid, along with a message about missing CRDs.

Common dependency issues:

  • envoy-gateway needs external-dns CRD (DNSEndpoint)
  • docs-app needs cert-manager CRD (Certificate)
  • All apps using Helm charts need kustomize.buildOptions: "--enable-helm" in argocd-cm

Fix: Manually sync in order:

kubectl -n argocd patch application external-dns --type=merge -p '{"operation":{"sync":{}}}'
kubectl -n argocd patch application cert-manager --type=merge -p '{"operation":{"sync":{}}}'
kubectl -n argocd patch application envoy-gateway --type=merge -p '{"operation":{"sync":{}}}'
kubectl -n argocd patch application docs-app --type=merge -p '{"operation":{"sync":{}}}'

If applications using Kustomize with helmCharts fail with must specify --enable-helm:

Fix: Ensure the argocd-cm ConfigMap has the correct setting:

kubectl -n argocd patch cm argocd-cm --type=merge -p '{"data":{"kustomize.buildOptions":"--enable-helm"}}'
kubectl -n argocd rollout restart deploy argocd-repo-server

The ArgoCD kustomization includes this configuration automatically via the argocd-self-managed application.

The Envoy Gateway controller modifies the Gateway resource after ArgoCD applies it, causing perpetual OutOfSync status.

Symptom: envoy-gateway application shows OutOfSync but Healthy.

Fix: Add ignoreDifferences to the Application spec:

spec:
  ignoreDifferences:
    - group: gateway.networking.k8s.io
      kind: Gateway
      jsonPointers:
        - /spec/listeners
        - /status

This is included in the applications.yaml.tpl template automatically.

When Envoy Gateway terminates TLS, the ArgoCD server receives plain HTTP, issues its own redirect to HTTPS, and the cycle repeats as a redirect loop.

Symptom: cd.k8s.yourdomain.com returns HTTP 307 redirect loop.

Fix: Configure ArgoCD to run in insecure mode (TLS handled by Gateway):

valuesInline:
  server:
    extraArgs:
      - --insecure

The ArgoCD kustomization template includes this configuration.

HTTPRoutes may serve content on both HTTP and HTTPS if not bound to specific listeners.

Symptom: http://k8s.yourdomain.com returns 200 instead of redirecting to HTTPS.

Fix: Use sectionName to bind routes to HTTPS listeners and create separate redirect routes:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: docs-route
spec:
  parentRefs:
    - name: public-gateway
      namespace: envoy-gateway-system
      sectionName: https-docs
  # ... backend config
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: docs-redirect
spec:
  parentRefs:
    - name: public-gateway
      namespace: envoy-gateway-system
      sectionName: http
  hostnames:
    - "k8s.yourdomain.com"
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301

The HTTPRoute templates include these redirect configurations.

Go templates in ExternalSecret resources require specific syntax for nested expressions.

Symptom: ExternalSecret shows SecretSyncedError with unable to parse template.

Fix: Use %s format specifiers instead of escaped quotes:

# Wrong
"auth": "{{ printf \"${username}:%s\" .password | b64enc }}"
# Correct
"auth": "{{ printf "%s:%s" "${username}" .password | b64enc }}"

Critical for Development/Testing: Let’s Encrypt enforces strict rate limits (5 certificates per week for the same set of domains).

Symptom: Certificate shows Failed status with error:

429 urn:ietf:params:acme:error:rateLimited: too many certificates (5) already issued for this exact set of identifiers in the last 168h0m0s

This is common during iterative cluster development where you destroy and recreate the cluster frequently.

Prevention:

  1. Use Staging Issuer (Recommended): For development, use the Let’s Encrypt Staging environment which has much higher limits. The certificates won’t be trusted by browsers (you’ll see a warning), but it verifies the entire ACME flow works.

    Update cluster-issuer.yaml (or create a separate staging issuer):

    spec:
      acme:
        server: https://acme-staging-v02.api.letsencrypt.org/directory
  2. Wait: The limit resets after 7 days from the first issuance.
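A complete staging issuer might look like the following sketch; the issuer name, email, and secret name are illustrative, and the HTTP-01 solver assumes the Gateway API integration described elsewhere in this guide:

```yaml
# Illustrative Let's Encrypt staging ClusterIssuer (cert-manager)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          gatewayHTTPRoute:
            parentRefs:
              - name: public-gateway
                namespace: envoy-gateway-system
```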

When restarting or rolling out Envoy Gateway pods, new pods may remain Pending.

Symptom: kubectl get pods -n envoy-gateway-system shows:

envoy-...-new 0/2 Pending 0 5m
envoy-...-old 2/2 Running 0 1h

Events show: 0/3 nodes are available: 1 node(s) didn't have free ports for the requested pod ports

This occurs because Envoy uses hostPort for ports 80 and 443. Only one pod can bind these ports on a node at a time, causing deployment rollouts to deadlock.

Fix: Delete the old pod to free the ports:

kubectl delete pod -n envoy-gateway-system <old-pod-name> --grace-period=10

The new pod will then schedule and start.

Note: This is expected behavior for hostPort deployments. The deployment strategy could be changed to Recreate instead of RollingUpdate to avoid this, but that causes brief downtime during updates.
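If brief downtime during updates is acceptable, the strategy change is a small patch to the Envoy Deployment (a sketch; field names follow the standard Kubernetes Deployment spec):

```yaml
# Deployment strategy patch - trades the hostPort deadlock for brief downtime
spec:
  strategy:
    type: Recreate
```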

ACME HTTP-01 Challenges Failing with Cloudflare


When using Cloudflare as your DNS provider with HTTP-01 ACME challenges, certificate issuance may fail if Cloudflare proxy is enabled.

Symptom: Certificate stuck in Pending state, cert-manager logs show:

Waiting for HTTP-01 challenge propagation: wrong status code '403'

This occurs because Cloudflare’s proxy intercepts the /.well-known/acme-challenge/ requests and returns 403 Forbidden.

Fix: Disable Cloudflare proxy for your DNS records. In DNSEndpoint resources:

spec:
  endpoints:
    - dnsName: k8s.example.com
      recordType: A
      targets:
        - "1.2.3.4"
      providerSpecific:
        - name: cloudflare-proxied
          value: "false"

The cloudflare-proxied: "false" setting creates DNS-only (grey cloud) records instead of proxied (orange cloud) records.

External DNS Not Creating Records for Gateway API


External DNS supports the Gateway API, but it may not create records if it only watches HTTPRoute resources and the Gateway lacks a routable address.

Symptom: DNS records not created, External DNS logs show no activity for your domains.

Fix: Use DNSEndpoint CRD to explicitly define DNS records:

apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: gateway-dns
  namespace: envoy-gateway-system
spec:
  endpoints:
    - dnsName: k8s.example.com
      recordType: A
      targets:
        - "1.2.3.4" # Your Load Balancer IP

The envoy-gateway kustomization includes DNSEndpoint resources that are populated with the Load Balancer IP by Terraform.

OCI Network Load Balancer Backend Health Check Failures


When using OCI Network Load Balancer (NLB) with NodePort services, health checks may fail if the Network Security List (NSL) doesn’t allow traffic on NodePort ranges.

Symptom: NLB backend health shows unhealthy, ACME challenges timeout, services return 503.

The NLB health checks originate from OCI’s infrastructure and must reach the NodePort on worker nodes.

Fix: Add an ingress rule to the private subnet’s NSL allowing TCP traffic on NodePort range (30000-32767):

# In network.tf - private subnet security list
ingress_security_rules {
  protocol    = "6"           # TCP
  source      = "10.0.0.0/16" # VCN CIDR
  source_type = "CIDR_BLOCK"
  description = "Allow NLB to reach NodePorts"
  tcp_options {
    min = 30000
    max = 32767
  }
}

The Terraform configuration includes this rule automatically.

When setting up OCI Identity Domain OIDC authentication for Open WebUI, several configuration issues can cause login failures.

flowchart TB
    subgraph Errors["Common OIDC Errors"]
        E1[401 on JWK Fetch]
        E2[401 on Metadata]
        E3[invalid_scope]
        E4[Redirect Loop]
    end

    subgraph Fixes["Solutions"]
        F1[Enable Access<br/>Signing Certificate]
        F2[Fix Provider URL<br/>Format]
        F3[Add Scopes in<br/>OCI Console]
        F4[Match Redirect URI<br/>Exactly]
    end

    E1 --> F1
    E2 --> F2
    E3 --> F3
    E4 --> F4

Symptom: Login fails with error in logs:

httpx.HTTPStatusError: Client error '401 Unauthorized' for url
'https://idcs-xxx.identity.oraclecloud.com/admin/v1/SigningCert/jwk'

This occurs because the JWK endpoint requires authentication by default in OCI Identity Domain.

Fix:

  1. Navigate to OCI Console → Identity & Security → Domains → Default
  2. Go to Settings → Domain settings
  3. Click Edit domain settings
  4. Under “Access signing certificate”, enable Configure client access
  5. Save changes
  6. Restart Open WebUI: kubectl rollout restart deploy/open-webui

Symptom: Login fails immediately with:

httpx.HTTPStatusError: Client error '401 Unauthorized' for url
'https://idcs-xxx.identity.oraclecloud.com/.well-known/openid-configuration'

This occurs when the OIDC provider URL is incorrectly formatted (e.g., includes :443 port or is missing the discovery path).

Fix: Ensure the provider URL in terraform.tfvars includes the full discovery endpoint:

# Correct format
oidc_provider_url = "https://idcs-xxxxx.identity.oraclecloud.com/.well-known/openid-configuration"
# Wrong formats (will cause 401)
oidc_provider_url = "https://idcs-xxxxx.identity.oraclecloud.com:443"
oidc_provider_url = "https://idcs-xxxxx.identity.oraclecloud.com"

After fixing, run terraform apply to update the vault secret, then restart Open WebUI.

Symptom: OAuth flow fails with scope error:

Error: invalid_scope - Scope 'openid' is not configured for the application

This occurs when required OIDC scopes are not enabled in the OCI Identity Domain application.

Fix:

  1. Navigate to OCI Console → Identity & Security → Domains → Default
  2. Go to Applications → open-webui
  3. Under Resources, click Token Issuance Policy
  4. Add these scopes:
    • openid (required for OIDC)
    • profile (user name)
    • email (user email)
  5. Save changes

Symptom: After clicking “Continue with Oracle”, the browser loops between Open WebUI and OCI Identity.

This occurs when the redirect URI in the OCI application doesn’t match the callback URL exactly.

Fix: Verify the redirect URI in the OCI application matches:

https://chat.k8s.sudhanva.me/oauth/oidc/callback

Note: The protocol must be https, and there should be no trailing slash.

Verify the External Secret is syncing correctly:

# Check ExternalSecret status
kubectl get externalsecret oidc-credentials-sync
# Verify secret exists
kubectl get secret oidc-credentials
# View provider URL (should include /.well-known/openid-configuration)
kubectl get secret oidc-credentials -o jsonpath='{.data.provider-url}' | base64 -d

Symptom: OAuth flow fails after entering credentials.

Ensure the OCI application has these grant types enabled:

  • Authorization code - Required for the OIDC authorization code flow
  • Refresh token - Required for session refresh

GitHub Container Registry packages are private by default, even for public repositories.

Symptom: Pods show ImagePullBackOff with error:

Failed to pull image "ghcr.io/username/repo/image:tag": unauthorized

Fix: Make the GHCR package public:

  1. Go to https://github.com/users/<username>/packages/container/<repo>%2F<image>/settings
  2. Scroll to “Danger Zone”
  3. Click “Change visibility” → Select “Public”

Alternatively, create an imagePullSecret with a GitHub PAT that has read:packages scope.
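A sketch of the pull-secret route (the secret name and image path are illustrative; the secret itself is typically created with kubectl create secret docker-registry using the PAT as the password):

```yaml
# Pod spec fragment referencing a GHCR pull secret
spec:
  imagePullSecrets:
    - name: ghcr-pull
  containers:
    - name: app
      image: ghcr.io/username/repo/image:tag
```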