Sprint 69: cloudflared Hotfix + GitOps Consistency Improvement
Decisions
D1: Keep cloudflared hotfix as direct AlgoSu apply, defer GitOps migration
- Context: blog.algo-su.com was inaccessible due to Cloudflare Error 1033. During root cause investigation, we discovered that cloudflared is not under ArgoCD management (only the `kubectl.kubernetes.io/last-applied-configuration` annotation exists, with no ArgoCD tracking-id). A decision was needed: proceed with the GitOps migration simultaneously, or apply the hotfix first.
- Choice: Apply the hotfix immediately with `kubectl apply` while keeping the `AlgoSu/infra/k3s/cloudflared.yaml` path as-is. GitOps migration (69-2), tag pinning (69-3), and orphan manifest cleanup (69-4) are deferred.
- Alternatives: (a) Hotfix + GitOps migration simultaneously: rejected due to the risk of extended service downtime. (b) Apply the change only with `kubectl edit` directly on the cluster: rejected because it would drift from the source files.
- Code Paths: `infra/k3s/cloudflared.yaml`
Patterns
Not applicable (single flag addition hotfix)
Gotchas
G1: cloudflared --metrics must be explicitly specified — liveness probe port alignment required
- Symptom: The cloudflared pod went through 427 CrashLoopBackOff cycles over 28 hours. In each cycle the pod ran for ~88 seconds, then terminated with the log line `Initiating graceful shutdown due to signal terminated`. Cloudflare Error 1033 (Argo Tunnel error) appeared intermittently on blog.algo-su.com.
- Root Cause: The Deployment args had no `--metrics` flag, so cloudflared did not bind its metrics/ready endpoint to port 2000 but to a port it selected itself (observed: `[::]:20241`). The probe, however, is fixed at `livenessProbe.httpGet.port: 2000`, so it failed 1275 times with `dial tcp 10.42.0.216:2000: connection refused` → kubelet sends SIGTERM after 3 failures → container exits normally with code 0 → CrashLoopBackOff loop.
- Fix: Add the two list items `- --metrics` and `- 0.0.0.0:2000` at the beginning of `args`. After redeployment, the logs confirmed `Starting metrics server on [::]:2000/metrics`, the new pod stayed at `Running 1/1` with 0 restarts, and blog.algo-su.com recovered to HTTP 200.
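
The mismatch in G1 could in principle be caught before deployment. The following is a minimal Python sketch, not part of the actual hotfix: it assumes the Deployment manifest has already been parsed into a dict (for example from `kubectl get deployment -o json`), and the function names `metrics_port` and `probe_port_mismatches` as well as the example args are hypothetical.

```python
def metrics_port(args):
    """Return the port from `--metrics ADDR` or `--metrics=ADDR`, or None if absent."""
    for i, arg in enumerate(args):
        if arg == "--metrics" and i + 1 < len(args):
            addr = args[i + 1]
        elif arg.startswith("--metrics="):
            addr = arg.split("=", 1)[1]
        else:
            continue
        # ADDR looks like "0.0.0.0:2000"; the port follows the last colon.
        return int(addr.rsplit(":", 1)[1])
    return None


def probe_port_mismatches(deployment):
    """List (container, probe port, metrics port) tuples where the two disagree.

    Assumes numeric probe ports; a named port (string) is always reported.
    """
    problems = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        probe = container.get("livenessProbe", {}).get("httpGet")
        if probe is None:
            continue  # no HTTP liveness probe, nothing to align
        bound = metrics_port(container.get("args", []))
        if bound != probe["port"]:
            problems.append((container["name"], probe["port"], bound))
    return problems


# Hypothetical manifest fragment resembling the pre-hotfix state.
example = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "cloudflared",
        "args": ["tunnel", "run"],  # illustrative args, no --metrics
        "livenessProbe": {"httpGet": {"path": "/ready", "port": 2000}},
    }]}}}
}

if __name__ == "__main__":
    print(probe_port_mismatches(example))
```

A check like this would report a missing `--metrics` flag as a metrics port of `None` against probe port 2000, which is the situation described in the Root Cause above.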
G2: Don't only check Ingress when verifying external domain routing
- Symptom: During CD deployment verification, `kubectl get ingress` showed no blog route, which was briefly misread as "blog not exposed externally".
- Root Cause: The blog service bypasses Ingress. The `algosu/cloudflared` pod delivers traffic directly to the in-cluster `blog` Service (ClusterIP) via Cloudflare Tunnel (QUIC), so nothing appears in Ingress.
- Fix: When verifying an external domain response: (1) confirm the tunnel pod with `kubectl get pods | grep cloudflared`, (2) verify the actual HTTP response with `curl -sSI https://<domain>/`, (3) check the routing path in the Cloudflare Zero Trust dashboard. Record the blog domain and routing mechanism in `reference_domain.md` to prevent recurrence.
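
Steps (1) and (2) of the G2 fix could be scripted roughly as follows. This is only a sketch under assumptions: it takes the text output of `kubectl get pods -o json` and `curl -sSI https://<domain>/` as plain strings rather than running the commands itself, and the helper names `ready_tunnel_pods` and `http_status` are hypothetical.

```python
import json


def ready_tunnel_pods(pods_json, name_fragment="cloudflared"):
    """Names of Running pods containing name_fragment whose containers are all ready."""
    ready = []
    for pod in json.loads(pods_json)["items"]:
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        if (
            name_fragment in name
            and status.get("phase") == "Running"
            and containers
            and all(c.get("ready") for c in containers)
        ):
            ready.append(name)
    return ready


def http_status(head_output):
    """Status code from the first line of `curl -sSI` output, e.g. 'HTTP/2 200'."""
    first_line = head_output.strip().splitlines()[0]
    if not first_line.startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {first_line!r}")
    return int(first_line.split()[1])
```

Step (3), checking the route in the Cloudflare Zero Trust dashboard, remains a manual step in this sketch.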
G3: cloudflared outside GitOps management — drift detection not possible
- Symptom: `kubectl -n argocd get application algosu` shows `Synced / Healthy`, yet cloudflared was in a CrashLoop state without the `--metrics` flag for 28 hours. ArgoCD did not report the issue.
- Root Cause: The cloudflared Deployment was applied directly with `kubectl apply -f AlgoSu/infra/k3s/cloudflared.yaml` and is not under ArgoCD tracking. A separate orphan manifest, `algosu/base/monitoring/cloudflared.yaml`, exists in aether-gitops, but it is not referenced in the overlays/prod kustomization and is therefore ignored.
- Fix: (Deferred) In 69-2, migrate cloudflared to the aether-gitops base and include it in the overlays/prod resources. Then delete `AlgoSu/infra/k3s/cloudflared.yaml` and clean up (or promote to SSoT) `aether-gitops/algosu/base/monitoring/cloudflared.yaml`. After migration, ArgoCD Health can report drift and failures.
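
Until the migration lands, unmanaged resources like this one can be spotted from their annotations, which is how the gap was found during the investigation. A minimal sketch, assuming ArgoCD is configured for annotation-based tracking (`argocd.argoproj.io/tracking-id`) and that the resource has been parsed into a dict; the function name `management_source` is hypothetical.

```python
TRACKING_ANNOTATION = "argocd.argoproj.io/tracking-id"
LAST_APPLIED_ANNOTATION = "kubectl.kubernetes.io/last-applied-configuration"


def management_source(resource):
    """Classify a resource as 'argocd', 'kubectl-apply', or 'unknown' by annotations."""
    annotations = resource.get("metadata", {}).get("annotations") or {}
    # Check the tracking annotation first: ArgoCD-managed resources may carry
    # the last-applied annotation as well.
    if TRACKING_ANNOTATION in annotations:
        return "argocd"
    if LAST_APPLIED_ANNOTATION in annotations:
        return "kubectl-apply"
    return "unknown"
```

Feeding this the JSON of each Deployment in a namespace would list the ones that return `kubectl-apply`, which are the candidates for migration into aether-gitops. If the cluster uses label-based tracking instead, the check would need to look at labels rather than this annotation.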
Metrics
- Commits: 1 (49b719a)
- Files changed: 1 (+4/-0)
- Service impact: blog.algo-su.com restored from 28 hours of intermittent Error 1033 to consistently returning HTTP 200
- Deferred tasks: 3 (69-2 GitOps migration, 69-3 tag pinning, 69-4 orphan manifest cleanup)