Context: In aether-gitops f5f391d commit, when adding new key to SealedSecret, only gateway/github-worker were updated, identity-service-secrets missing. Sprint 125 Wave C code (token-encryption.service.ts) requires the key but cluster Secret not reflected → new ReplicaSet CrashLoopBackOff 308 times (26 hours)
Choice (Track 1, temporary): Directly kubectl patch cluster Secret with same key used by gateway/github-worker + rollout restart → immediate production recovery. 0 user impact (no re-authentication needed, DB tokens 4 entries preserved)
Choice (Track 2, formal): SealedSecret manifest PR (#2) to add key → however unseal failure due to controller key mismatch debt → absorbed into Wave B-2 (PR #3) for GitOps consistency recovery
Alternatives: Issue new key → invalidate DB GitHub tokens 4 entries, force user re-authentication (not chosen)
PR: aether-gitops #2 (Track 2 draft), Track 1 is kubectl patch temporary → structural blocking decided in ADR-028 (dev server separation)
Context: After sealed-secrets controller key rotation (~2026-04-02, sealed-secrets-keyqvbr5 53d → sealed-secrets-keycdlrs 23d), 8 SealedSecret manifests not re-sealed. Cluster's unsealed Secret maintained in pre-rotation state with no production impact, but manifest changes blocked from cluster application — exposed in Wave A-2 PR #2
Choice: Extract plaintext keys from cluster memory directly → re-seal with current active controller cert using kubeseal --raw → bulk replace encryptedData: blocks in 8 manifests. Preserve apiVersion/kind/metadata/template:
Alternatives: Restore controller old key → increased infra complexity / Maintain only cluster Secret direct patch → permanent GitOps consistency break (not chosen)
Side effect: INTERNAL_KEY_AI_ANALYSIS key missing from submission-service-secrets manifest (exists in cluster) → absorbed into manifest = additional GitOps consistency recovery achieved simultaneously
Verification: After merge, all 8 in kubectl get sealedsecret -n algosu reached Synced=True, cluster Secret data sha256 hash identical before and after merge (plaintext value unchanged), production pod restarts 0
Context: Both Sprint 130 incidents had ArgoCD Health=Degraded status but left unnoticed for 26h~2 days 5 hours without alerts
Shocking finding (Architect): alertmanager receiver: 'null' — 13 existing rules (PodRestartFrequent, ServiceDown, OOMKilled, etc.) were firing normally but all alerts were silently dropped. The real root cause of Sprint 130 no-alerts was not rule absence but outlet (receiver) itself disabled
When to Reuse: When doing path matching with req.path in NestJS consumer.apply(...).forRoutes('*') or wildcard routes. May cause regression where mount-strip interprets as / — use req.originalUrl ?? req.url for non-stripped path
Verification pattern: Inject originalUrl/url together in unit test mocks (production simulation). path-only setter cannot detect regressions
Before merge: capture baseline with kubectl get secret -n <ns> <name> -o json | jq -S '.data' | sha256sum
Wait for ArgoCD sync completion after merge
Calculate hash again → compare diff
If identical, plaintext value unchanged → cluster impact confirmed none
P4: Production hotfix Track 1 + Track 2 pattern
When to Reuse: When production recovery is urgent with GitOps flow blocked
Track 1 (immediate recovery): Direct cluster patch — but separately record in ADR + maintain consistency with memory feedback_avoid_prod_direct_edit.md
Track 2 (post-reconciliation): Absorb cluster traces into manifest via GitOps PR — process within same Sprint as Track 1
Gotchas
G1: Memory window not updated → wrong sprint number at /start
Symptom: Wrong sprint number embedded in PR/branch/commit messages (e.g., Sprint 93 → actual 130)
Root Cause: /start skill depends solely on sprint-window.md, no cross-check with docs/adr/sprints/
Fix: In this Sprint, permanently recorded via ADR-026 naming error mapping. Automation is a Sprint 131+ candidate
G2: No procedure for SealedSecret manifest re-seal after controller key rotation
Symptom: SealedSecret in Status=False (no key could decrypt secret) state. Cluster Secret maintained in pre-rotation state with no production impact → no alerts → exposed only when manifest changes blocked from cluster application
Root Cause: No manifest re-seal synchronization procedure for either automatic/manual rotation
Fix: Bulk re-seal in this Sprint Wave B-2. Automation (CI job) is a Sprint 131+ candidate
G3: NestJS forRoutes('*') unit test cannot catch mount-strip regression
Symptom: Production regression not detected by direct req.path mock injection in spec.ts
Root Cause: Express middleware mount behavior not simulated in unit test environment
Fix: Inject originalUrl/url mock together (production simulation). Adding e2e integration tests is a P3 follow-up
G4: aether-gitops main direct push allows missing commit to bypass PR review
Symptom: identity-service-secrets key addition missing in f5f391d commit. Single reviewer flow allowed human error through
Fix: Structurally blocked when ADR-027 (branch discipline introduction) is adopted
G5: Production cluster kubectl direct modification environment
Symptom: Direct cluster Secret patch in Sprint 130 Track 1. GitOps consistency temporarily broken + change tracking difficult
Fix: Structurally blocked when ADR-028 (dev server separation + production read-only) is adopted
G6: alertmanager receiver: 'null' causes all alerts to silently drop
Symptom: 13 PrometheusRules were firing normally but operators were not notified. ArgoCD Health=Degraded also without alerts. Direct cause of Sprint 130 two incidents 6 days/26 hours unnoticed + sealed-secrets controller key mismatch 23 days unnoticed
Root Cause: alertmanager.yaml receiver set to null → all routing mapped to discard destination. Would not have been found by checking rules alone (rule inventory was normal)
Fix: Activated discord-default + discord-critical in Wave B-1 (PR aether-gitops #4). End-to-end verification including receiver behavior is required when checking, not just rules (amtool/actual Discord reach test)
Recurrence prevention: Consider adding self-test rule for alertmanager receiver behavior monitoring in Sprint 131 (e.g., fire dummy alert every 5 minutes → high alert if Discord not reached)