Monitoring Dashboard Refactoring
Sprint 81: Monitoring Dashboard Refactoring
Decisions
D1: Achieve DRY with Alert Rule Recording Rules
- Context: HighErrorRate 4, HighMemoryUsage 4, HighLatencyP95 2 existed as copy-paste duplicates differing only in service name. Every time a new service is added, a new rule must be copied with risk of omission.
- Choice: Introduced recording rules (
algosu:http_error_rate:5m,algosu:memory_usage_pct) + consolidated into single regex-based alert. 10 alerts → 5 alerts + 2 recording rules = 7 total. - Alternatives: (a) Keep existing copy-paste — risk of omission when adding services, rejected. (b) Switch to Grafana Managed Alerting — dual management with Prometheus rules, rejected.
- Code Paths:
infra/k3s/monitoring/prometheus-rules.yaml
D2: Switch HighMemoryUsage to RSS/container limits Based Alert
- Context: heapUsed/heapTotal based alert was triggering false positives during V8's normal heap management pattern (89.2% firing, actual RSS 70MB/1Gi). Node.js V8 normally allocates a small heap and maintains high usage until GC runs.
- Choice: Changed to
process_resident_memory_bytes / kube_pod_container_resource_limits{resource="memory"}basis. Dynamically references limits from kube-state-metrics. - Alternatives: (a) Raise threshold to 95% — not a fundamental fix, rejected. (b) Use fixed value (1Gi) — inconsistent when limits change, rejected.
- Code Paths:
infra/k3s/monitoring/prometheus-rules.yaml,infra/k3s/gateway.yaml
D3: Alertmanager Deployment Recovery + github-worker metrics Service Creation
- Context: Alertmanager Pod not deployed so all alerts were going to a blackhole. github-worker-metrics Service missing causing scrape failures. kube-state-metrics scrape config missing from ConfigMap.
- Choice: Deploy Pod with existing alertmanager.yaml, apply existing Service definition from github-worker.yaml, add kube-state-metrics job to prometheus-config.
- Code Paths:
infra/k3s/monitoring/alertmanager.yaml,infra/k3s/github-worker.yaml,infra/k3s/monitoring/prometheus-config.yaml
Patterns
P1: Prometheus Recording Rule + Regex-Based Single Alert Pattern
- Where:
prometheus-rules.yamlrecording rule groupalgosu.recording - When to Reuse: When applying the same metric pattern to multiple services. Auto-includes new services with
{__name__=~"algosu_.+_http_requests_total"}regex.
Gotchas
G1: V8 heapUsed/heapTotal Ratio is Unsuitable as Memory Leak Indicator
- Symptom: Gateway HighMemoryUsage alert constantly firing (89.2%). Actual RSS is 70MB/1Gi (7%).
- Root Cause: V8 allocates a small heap (~51MB) when
--max-old-space-sizeis not set and maintains high usage until GC runs. heapUsed/heapTotal is always near 80%+. - Fix: Switched to RSS/container limits based alert + added
NODE_OPTIONS=--max-old-space-size=768.
G2: aether-gitops and Source Repo ConfigMap Mismatch
- Symptom: prometheus-config ConfigMap differed between cluster (aether-gitops) and source repo. alerting section, github-worker job, etc. were missing.
- Root Cause: aether-gitops sync was omitted when source repo was changed.
- Fix: Copy same file to aether-gitops after source repo change + commit/push. Register missing resources (grafana-service-dashboard, alertmanager) in kustomization.yaml.
Metrics
- Commits: 2 (6dd91ed, b197297)
- Files changed: 6 (+66/-125)
- aether-gitops: 2 (4ca6f74 monitoring sync, bef7dc6 gateway NODE_OPTIONS)
- Carryover items: Anonymous access, Loki restart, Grafana upgrade, Promtail structured_metadata