Circuit Breaker Standalone Sprint — opossum Introduction + 5 Host Single CB + Grafana Dashboard
Sprint 135: Circuit Breaker Standalone Sprint
Sprint Goal
Introduce the Circuit Breaker pattern to NestJS HTTP call sites to block external service failure propagation. Proceed in order: Wave A (PoC) → Wave B (github-worker expansion) → Wave A sync (separate PR) → Wave C (submission addition) → Wave D (Grafana dashboard).
Final Result Summary (Wave E Comprehensive)
| Item | Result |
|---|---|
| Merged PRs | 5 (#167 / #168 / #169 / #170 / #171) |
| CB instances | 5 (host single isolation) |
| Protected external services | 4 (AI Analysis / submission internal / Gateway / Problem Service) |
| New NestJS modules | 2 (CircuitBreakerModule@Global, ProblemServiceClientModule) |
| New library | opossum v8 (Node.js TypeScript), existing Python circuit_breaker.py retained |
| Cumulative tests | submission 343 + github-worker 175 = 518 tests |
| Critic cross-review | 17 rounds (3/4/3/5/3 per PR) — P1 8 + P2 9 all resolved before merge |
| Grafana dashboard | 1 set (12 panels, uid algosu-cb) |
Host Single CB Instances
| Service | CB Name | External Call | Fallback Policy |
|---|---|---|---|
| submission | aiQuotaCheck | AI Analysis /quota/check | () => true (fail-open, errorFilter: () => false override — default whitelist neutralized since it's a fixed endpoint) |
| submission | problem-service-internal | Problem Service (2 ops via dispatcher) | Per-op (undefined / {isLate: false, weekNumber: null}) |
| github-worker | submission-internal | submission internal API (5 ops via dispatcher) | throw propagation (DLQ handling) |
| github-worker | gateway-getUserGitHubInfo | Gateway /internal/users/.../github-encrypted-token | throw propagation |
| github-worker | problem-getProblemInfo | Problem Service /internal/{id} | default value + try/catch safety net |
PR Merge Commits
| PR | Wave | Squash commit | Key changes |
|---|---|---|---|
| #167 | A | 459cd8a | submission aiQuotaCheck PoC + opossum introduction + Prometheus metrics 3 types |
| #168 | B | c561488 | github-worker plain class wrapper + host single CB 3 + errorFilter whitelist |
| #169 | A sync | 1f40247 | Wave B policy sync + errorFilter wrapper + WeakSet marker (P2 exact resolution) |
| #170 | C | 7d4c539 | ProblemServiceClient + CircuitBreakerModule@Global + isConfigReady |
| #171 | D | 2c5d8e3 | Grafana CB dashboard (12 panels, TypeScript + Python schema separated) |
Decisions
D1: opossum Library Adoption (Wave A)
- Context: Python reference implementation (
services/ai-analysis/src/circuit_breaker.py) is a custom implementation (threading.Lock-based 3-state machine). TypeScript porting required a choice between self-implementation vs library introduction. - Options: (A) opossum — Node.js leading CB library, Netflix Hystrix pattern, event-based metrics integration / (B) Self-implementation — 1:1 port of Python reference.
- Choice: (A) opossum v8. Reason: Event-based state transition simplifies Prometheus metrics integration, proven sliding window statistics, NestJS DI compatible, minimizes maintenance burden.
- Code Paths:
services/submission/package.json(opossum@8, @types/opossum@8)
D2: CB Thresholds — Python Reference Consistency (Wave A)
- Context: Mapping Python CB's
failure_threshold=5,recovery_timeout=30,half_open_requests=2to opossum options. - Choice:
volumeThreshold: 5(judge after minimum 5 requests),errorThresholdPercentage: 50(failure rate 50%+),resetTimeout: 30000(30 seconds to HALF_OPEN),rollingCountTimeout: 60000(60 second aggregate window),timeout: 10000(fetch 10 second timeout) - Rationale: Maintains same flow as Python — "5 failures → OPEN → 30 second wait → HALF_OPEN". Planned adjustment after accumulating production operation data.
D3: Prometheus Metrics Design (Wave A)
- Choice: 3 metrics, sharing existing MetricsService Registry
algosu_submission_circuit_breaker_state(Gauge, label: name) — 0=CLOSED / 1=HALF_OPEN / 2=OPENalgosu_submission_circuit_breaker_failures_total(Counter, label: name)algosu_submission_circuit_breaker_requests_total(Counter, labels: name, result) — result: success/failure/reject/timeout
- Architecture: MetricsModule shares prom-client Registry via
METRICS_REGISTRYcustom provider. CircuitBreakerModule → MetricsModule import → registers CB metrics on same Registry → collected by Prometheus at /metrics endpoint. - Code Paths:
services/submission/src/common/circuit-breaker/circuit-breaker.service.ts,services/submission/src/common/metrics/metrics.module.ts
D4: AI Quota Check Fallback Strategy — fail-open (Wave A)
- Context:
saga-orchestrator.service.ts:checkAiQuotaexisting behavior: returntrueon API failure (allow AI analysis). Should also allow when CB OPEN to avoid Saga compensation transaction conflicts. - Choice:
fallback: () => true(fail-open). When CB OPEN, skip AI Analysis Service call entirely and return allow immediately → Saga flow uninterrupted. - Rationale: checkAiQuota is a non-critical guard (quota check). Service stays uninterrupted even if AI analysis proceeds without quota check on failure. Context differs from Python reference's "DELAYED + analysis delayed" fallback, but fail-open is appropriate for this quota pre-check call.
- Code Paths:
services/submission/src/saga/saga-orchestrator.service.ts:110-133(createBreaker),services/submission/src/saga/saga-orchestrator.service.ts:324-332(checkAiQuota)
D5: retry vs Circuit Breaker Priority (Design Principle)
- Context: Order determination required when retry and CB coexist.
- Choice: Retry 3 times → CB judgment (retry failures reflected in CB failure count). Current Wave A target checkAiQuota has no retry, so only CB applied (same as existing code). For Wave B+ expansion, calls with retry use "retry → CB" order.
- Rationale: Retry recovers from transient network errors, CB detects persistent failures. If retry wraps CB externally, unnecessary delay before recording CB failure. Placing retry inside CB enables faster blocking.
Wave A Output
| Item | Result |
|---|---|
| PR | feat/sprint-135-cb-poc (13 files, +661/-58) |
| Commits | 3 atomic (module creation → call site application → tests) |
| Tests | 21 suites / 283 tests (existing 268 + new 15) |
| typecheck | 0 errors |
| lint | 0 errors (9 pre-existing warnings) |
Wave A Memory Correction
- sprint-window.md "github-worker 5 locations" → "7 locations (status-reporter 5 + worker.ts 2)": Confirmed actual
grep -n fetch services/github-worker/src/status-reporter.ts services/github-worker/src/worker.tsresults show 7 call sites.
D6: github-worker CB Expansion — plain class Wrapper (Wave B)
- Context: github-worker is standalone TS, not NestJS. CB introduction required in an environment without NestJS DI pattern.
- Choice: Wrapped opossum in
CircuitBreakerManagerplain class. Single instance created in main.ts → injected into GitHubWorker/StatusReporter constructors. - Host single CB instances 3 (after Critic 3rd P1 integration): submission-internal 1 (5 methods integrated — generic dispatcher), gateway-internal 1 (gateway-getUserGitHubInfo), problem-service 1 (problem-getProblemInfo). Host 1 = CB 1 principle for stronger host-isolation (blocking dead host load amplification).
- Initial design was separate CB per 5 methods, but Critic 3rd pointed out that reporters sharing the same host create dead host hammering risk without load distribution → integrated to host single CB.
- Fallback strategy: StatusReporter 5 locations throw propagation (DLQ/idempotency handling), gateway-getUserGitHubInfo throw propagation (can't push without token), problem-getProblemInfo returns default value + external try/catch safety net (maintains existing catch behavior → 0 regressions).
- Metrics prefix:
algosu_github_worker_circuit_breaker_*(only service prefix differs, same label/structure as Wave A) - Code Paths:
services/github-worker/src/circuit-breaker.ts(new + spec),services/github-worker/src/main.ts,services/github-worker/src/worker.ts,services/github-worker/src/status-reporter.ts,services/github-worker/src/metrics.ts(registry export)
D7: errorFilter Policy + Host Single CB (Wave B Critic 1st~3rd Integration)
- Context (1st): Critic 1st review P1 2 items — 4xx permanent errors (404 not found, etc.) counted as CB failures, causing circuit OPEN. Wide-scale outage risk of normal messages being rejected during 30-second worker paralysis (Critic 1st P1).
- Context (2nd): 1st fix excluded all 4xx (
>=400 && <500) via errorFilter, but 401/403 also pass through → internal-auth outage protection fails if permanent 401/403 occurs from X-Internal-Key rotation/misconfiguration. Also, opossum errorFilter pass emitssuccessevent with first argument being Error instance (filtered error object) → counted withresult="success"label → inaccurate metrics (Critic 2nd P1+P2). - Context (3rd):
- status-reporter's 5 method CBs (
submission-getSubmission+ 4 others) share the same submission-service host. If submission-service fails, onlysubmission-getSubmissionopens and the other 4 remain CLOSED → continue calling dead host → host-isolation purpose neutralized + load amplification (Critic 3rd P1). FILTERED_BUSINESS_STATUSincludes 400 → CB OPEN not triggered on DTO/contract regression (header missing, validation drift, schema mismatch), allowing infinite hammering of dead dependency (Critic 3rd P2).
- status-reporter's 5 method CBs (
- Choice (1st & 2nd maintained):
- Narrowed errorFilter scope to explicit business 4xx whitelist — 401/403/408/429 etc. counted as CB failures to maintain auth/permission/timeout/overload outage protection.
- When opossum errorFilter passes, branch on result being Error instance in success event handler → count as
requests_total{result="filtered"}separately (metric accuracy). resultlabel enum expanded:success | failure | reject | timeout | filtered.
- Choice (3rd):
- Integrated status-reporter into host single CB (
submission-internal). 5 ops (get/reportSuccess/reportFailed/reportTokenInvalid/reportSkipped) share generic dispatcher (_dispatch+_resolveEndpoint). Unified to single SubmissionRequest payload ({ op, submissionId, body? }). All 5 methods blocked simultaneously when host OPEN → dead host protection. FILTERED_BUSINESS_STATUS = {404, 410, 422}— removed 400. 400 is a contract regression signal → counted as CB failure to trigger circuit OPEN + alert.- StatusReporter public API signature unchanged — no impact on existing callers (worker.ts).
- Integrated status-reporter into host single CB (
- Rationale:
- Host-isolation essence is "don't keep calling a failed host" — per-method separation on the same host only causes load amplification. Integrated with host 1 = CB 1 principle.
- 400 Bad Request looks like a permanent business error but may be contract drift (header missing, schema mismatch) → circuit protection target. True permanent business errors are sufficiently represented by 404 (not found) / 410 (permanently removed) / 422 (rule violation).
- Code Paths:
services/github-worker/src/status-reporter.ts(5 CBs → 1 host CB + dispatcher /_dispatch+_resolveEndpointSRP separation)services/github-worker/src/circuit-breaker.ts(removed 400 fromFILTERED_BUSINESS_STATUS)services/github-worker/src/status-reporter.spec.ts(host single CB validation +_resolveEndpoint/_dispatchunit test new)services/github-worker/src/circuit-breaker.spec.ts(new 400 protection OPEN case + whitelist definition/unit behavior update)services/github-worker/src/worker.spec.ts(5 CB registration validation → 1 host CB validation)
- Test updates/additions (3rd): status-reporter.spec.ts host single CB registration validation +
_resolveEndpoint5 items (op-by-op endpoint mapping) +_dispatch4 items (get/reportSuccess/reportFailed/non-ok status attachment) + circuit-breaker.spec.ts 1 new 400 OPEN transition. jest 169 → 175 (+6 net), coverage stmts 99.79% / branches 97.45% / functions 100% / lines 100% (threshold 98/92/100/98 met) - Wave A compatibility: Same whitelist policy (including removing 400) + filtered metrics separation seed for
submission/circuit-breaker.service.ts→ Sprint 135 Wave C or separate PR. Currently fetchAiQuota usesfallback: () => trueso no user impact, but follow-up correction recommended for consistency, metric accuracy, and auth/contract failure protection.
Wave B Output
| Item | Result |
|---|---|
| Branch | feat/sprint-135-cb-worker |
| CB instances | 3 (submission-internal / gateway-getUserGitHubInfo / problem-getProblemInfo) — Critic 3rd P1 integrated from 7 → 3 |
| Commits | 5 atomic (deps+manager → status-reporter → worker → tests → ADR) + 2 (1st errorFilter fix + ADR D7) + 2 (Critic 2nd whitelist+filtered label fix + ADR D7 update) + 2~3 (Critic 3rd host single CB + remove 400 + ADR D7 update) |
| Tests | 8 suites / 175 tests |
| Coverage | stmts 99.79% / branches 97.45% / functions 100% / lines 100% (threshold 98/92/100/98 met) |
| typecheck | 0 errors |
| lint | 0 errors |
D8: Wave A submission CB Policy Sync with Wave B (Separate PR)
- Context: Wave B (D7) errorFilter whitelist policy not applied to Wave A's
services/submission/src/common/circuit-breaker/module. Synchronization needed for metric accuracy + contract regression protection consistency. fetchAiQuota usesfallback: () => trueso no user impact, but dead dependency hammering risk from contract regression / auth outage is identical. - Choice:
- Applied
FILTERED_BUSINESS_STATUS = {404, 410, 422}whitelist (same as Wave B, excluding 400) - Added
DEFAULT_ERROR_FILTER— applied by default increateBreaker, overridable by caller - In success event, branch on
result instanceof Error→requests_total{result="filtered"}label separation (prevent success count contamination) - fetchAiQuota non-2xx throw now attaches
error.status→ errorFilter can branch - Added
buildHttpError(message, status)helper tocircuit-breaker.constants.ts(reusability + consistency)
- Applied
- Rationale: CB policies of both modules (submission/github-worker) must be consistent for operational/monitoring consistency. When permanent 5xx/401/403 occurs → CB OPEN → fallback triggered (fail-open maintained) → dead dependency hammering blocked. Public API signature unchanged (createBreaker/getBreaker/getState/onModuleDestroy all maintained) — no regression in callers (saga-orchestrator).
- Code Paths:
services/submission/src/common/circuit-breaker/circuit-breaker.service.ts(FILTERED_BUSINESS_STATUS / DEFAULT_ERROR_FILTER / errorFilter option / success handler filtered branch)services/submission/src/common/circuit-breaker/circuit-breaker.constants.ts(buildHttpErrorhelper)services/submission/src/saga/saga-orchestrator.service.ts(use buildHttpError infetchAiQuota)services/submission/src/common/circuit-breaker/circuit-breaker.service.spec.ts(errorFilter unit + integration + buildHttpError validation, +15 tests)services/submission/src/saga/saga-orchestrator.service.spec.ts(2 fetchAiQuota status attachment additions)
- Tests: 21 suites / 290 → 305 tests (+15 net), coverage threshold met (stmts 97.91% / branches 92.82% / functions 96.34% / lines 97.98% — all thresholds 97/92/96/97 passed)
D8 Critic 1st Follow-up Corrections (P1+P2)
- P1 —
aiQuotaCheckCB witherrorFilter: () => falseoverride applied:fetchAiQuotaonly calls a fixed endpoint (/quota/check) → 404/410/422 are not resource-not-found but "AI Analysis Service route misconfig or service absence" signals.- Applying default whitelist (
{404,410,422}) as-is allows infinite calls to dead service + CB OPEN not triggered + no alert fired. errorFilter: () => falsecounts all non-2xx as CB failures so CB OPEN whenvolumeThresholdreached → fallback() => truekeeps user impact at 0 + alert signal obtained.- Code Paths:
services/submission/src/saga/saga-orchestrator.service.ts:onModuleInitcreateBreaker call - Tests: 2 new
errorFilteroption validation + override behavior unit tests insaga-orchestrator.service.spec.ts
- P2 Exact Resolution (Critic 2nd):
- Introduced errorFilter wrapper + WeakSet marker pattern for accurate success/filtered branching.
- Wrapper: when filtered — (a) count
requests_total{result="filtered"}+ (b) add marker to WeakSet. - Success handler: if result is object and WeakSet has marker, skip (prevent double-counting).
- Resolves both incorrect cases of previous
instanceof Errorheuristic (Error resolve / non-Error throw). - Remaining limit: primitives (string/number) cannot be added to WeakSet → primitive errorFilter pass adds 1 success count (practical impact 0, this project only throws Error/object).
- Code Paths:
services/submission/src/common/circuit-breaker/circuit-breaker.service.tscreateBreaker (wrapper + WeakSet) + success handler (marker lookup) + onModuleDestroy (WeakSet cleanup) - Test additions: plain object throw + filtered (1) / WeakSet reuse safety (1) / primitive limit documented (1) / object resolve regression prevention (1) — total +4
- Sprint 136+ seed update: Apply same wrapper pattern to Wave B (
services/github-worker/src/circuit-breaker.ts) (currentlyinstanceof Errorheuristic).
D9: Wave C — submission Service Problem Service 2 Call Sites CB Applied
- Context: submission service's 2 Problem Service HTTP call sites (
saga-orchestrator.fetchSourcePlatform,submission.service.checkLateSubmission) are unprotected by CB. Infinite calling possible on dead host + graceful degradation exists on deadline query failure but no dead host hammering block. - Choice:
- Host single CB (
problem-service-internal) — same dispatcher pattern as Wave B status-reporter. - Encapsulated in
ProblemServiceClientNestJS service — 2 call ops (getSourcePlatform,getDeadline) integrated via dispatcher (_dispatch+_doGetSourcePlatform+_doGetDeadlineSRP separation). - Per-op fallback (
getSourcePlatform: undefined,getDeadline: {isLate: false, weekNumber: null}) — maintains existing graceful degradation. - Default 4xx whitelist (
{404, 410, 422}) applied — Problem Service uses dynamic endpoint (/internal/{problemId}) so 404 = "problem not found" is a natural business error, only 5xx/auth/timeout counted as CB failures. - Non-2xx uses
buildHttpError(message, status)to attach status → DEFAULT_ERROR_FILTER can branch on whitelist.
- Host single CB (
- Code Paths:
services/submission/src/common/problem-service-client/(new module — module/client/index/spec)services/submission/src/saga/saga-orchestrator.service.ts(removefetchSourcePlatformprivate method → delegate to client, remove problemServiceUrl/Key fields)services/submission/src/submission/submission.service.ts(simplifycheckLateSubmissionbody to 1-line client delegation, remove ConfigService dependency)services/submission/src/submission/submission.module.ts(importProblemServiceClientModule)
- Host single CB instances (Sprint 135 comprehensive):
- submission service:
aiQuotaCheck(1) +problem-service-internal(1) = 2 - github-worker:
submission-internal(1) +gateway-getUserGitHubInfo(1) +problem-getProblemInfo(1) = 3 - Total 5 CB instances protecting 4 external services (AI Analysis / submission internal / Gateway / Problem Service)
- submission service:
- Tests:
problem-service-client.spec.ts24 new + saga-orchestrator/submission.service/ai-satisfaction spec updates. Total 334 tests pass, coverage stmts 98.46% / branches 94.53% / functions 96.59% / lines 98.55% (threshold 97/92/96/97 met) - Public API signature unchanged: No impact on SagaOrchestratorService.advanceToAiQueued / SubmissionService.create callers
- Critic 1st P2 follow-up correction: Regression block when env (
PROBLEM_SERVICE_KEY) not set → fetch slow path (timeout 5 seconds → CB OPEN). Added key validation at start of public methods (getSourcePlatform/getDeadline) → immediate fallback return for sub-millisecond recovery. Preserved existingsubmission.service.checkLateSubmission'sgetOrThrowimmediate fallback behavior. Added 2 env-not-set fallback validation cases toproblem-service-client.spec.ts(fetch not triggered + hostBreaker.fire not called + default value returned). - Critic 2nd P1 follow-up correction: Marked
CircuitBreakerModuleas@Global()+ removed fromProblemServiceClientModule.imports. NestJS instantiatesCircuitBreakerServiceseparately per module scope, causing prom-client duplicate metric registration error and submission service boot failure. Single import in AppModule or SubmissionModule suffices for global use. - Critic 3rd P1 follow-up correction — defense code added: Added
instanceof Errorcheck tohostBreaker.fireresult ingetSourcePlatform/getDeadline. Current opossum 8.x callsreject(error)when errorFilter passes → caught in catch → fallback returned, but explicit check (1) guards against future opossum behavior changes + (2) clarifies code intent + (3) ensures Error objects don't reach business logic (sourcePlatform: Error,isLate: Error). Defense in depth. 2 new tests (1 each for getSourcePlatform/getDeadline — mock Error as resolve → verify fallback returned). Total 336 → 338 tests pass, problem-service-client.ts coverage stmts/branches/functions/lines all 100% maintained. - Critic 4th P2 follow-up correction — URL validation added: 1st P2 fix validated only
problemServiceKeyand leftproblemServiceUrlfalling back to default value ('http://problem-service:3002'). URL not set + KEY set → fetch to default host 5-second timeout → CB OPEN regression possible. In constructor, removed default (preserved as?? ''empty string) +isConfigReady()private helper validates both URL and KEY + guard in public methods (getSourcePlatform/getDeadline) changed fromif (!this.problemServiceKey)→if (!this.isConfigReady()).getOrThrownot used (boot-time throw regression risk) — get + empty string fallback pattern maintained. 5 new tests (URL not set / URL+KEY both not set each for getSourcePlatform/getDeadline + ConfigService.get returns undefined → no default applied). Total 338 → 343 tests pass, problem-service-client.ts stmts/branches/lines 100% maintained (functions 91.66% limited to index.ts empty re-export — this file 100%).
D10: Wave D — Grafana CB Dashboard
- Context: Sprint 135 Wave A/B/C introduced 5 host single CBs in production. No operation dashboard to monitor CB state changes/request throughput/failure rate at a glance.
- Choice:
- Created new ConfigMap
grafana-cb-dashboard(infra/k3s/monitoring/grafana-cb-dashboard.yaml) - Added ConfigMap mount to
grafana.yaml's projected volume → automatic provisioning - 5 panel types: State Matrix (current state) + Request Rate by Result (throughput) + Failure Rate + State Timeline (state history) + Distribution (cumulative stats) + Stats Table
- Integrated metrics from both submission + github-worker (
algosu_submission_circuit_breaker_*+algosu_github_worker_circuit_breaker_*) - Template variable
name(multi-select, includeAll) — auto-exposes all 5 CB instances vialabel_values({__name__=~"algosu_(submission|github_worker)_circuit_breaker_state"}, name)query + panel filtering - Dashboard uid:
algosu-cb, refresh: 30s, schemaVersion: 39, links: SLO Overview / Service Debug
- Created new ConfigMap
- Code Paths:
infra/k3s/monitoring/grafana-cb-dashboard.yaml(new)infra/k3s/monitoring/grafana.yaml(1 line added to projected sources)
- Operational value: Visual alert immediately when CB OPEN + observe infra failures (failure/timeout/reject) and business 4xx (filtered) separately via result label.
- Verification: yaml.safe_load pass + ConfigMap's JSON
json.loadspass + Deployment's projected sources 3 (slo/service/cb) consistency. - Critic 1st P2 follow-up correction — AI Analysis addition + schema difference acknowledged:
- Initial D10 implementation only visualized TypeScript CBs (submission 2 + github-worker 3) → Sprint 135 original reference Python CB (
services/ai-analysis/src/circuit_breaker.py, metricalgosu_ai_analysis_circuit_breaker_state) operates in production but was missing from dashboard → Claude API failures would show "all normal" without observation. - Schema differences: TypeScript (0=CLOSED/1=HALF_OPEN/2=OPEN, name label present, failures_total + requests_total{result} exist) vs Python (0=CLOSED/0.5=HALF_OPEN/1=OPEN, no name label, only state metric exposed) — simple regex integration not possible.
- State Matrix panel separated: Existing Panel id=1 width 24→18 (
Circuit Breaker State (TypeScript)), new Panel id=7 width 6 (AI Analysis CB State (Python), mappings 0/0.5/1, thresholds green→yellow at 0.5→red at 1). - State Timeline separated: Existing Panel id=4 width 24→18 (
Circuit Breaker State Timeline (TypeScript)), new Panel id=8 width 6 (AI Analysis CB Timeline (Python schema), same 0/0.5/1 mappings). - Request Rate / Failure Rate / Distribution / Stats Table: ai-analysis lacks
failures_total/requests_total{result}metrics → no changes, description added ("AI Analysis Python CB ... metrics absent — to be added in Sprint 136+ when schema unified"). - Verification: yaml.safe_load pass + json.loads pass + 12 panel IDs (100/1/7/101/2/3/102/4/8/103/5/6) no conflicts + gridPos per-row 24-width consistency (y:1 18+6 / y:17 18+6).
- Sprint 136+ seed: Unify ai-analysis Python CB metric schema with TypeScript (state value 0/1/2 +
namelabel +failures_total+requests_total{result}) → can integrate into single Grafana query regex. Requires simultaneous update ofservices/ai-analysis/src/metrics.py+circuit_breaker.py+ operational alert rules.
- Initial D10 implementation only visualized TypeScript CBs (submission 2 + github-worker 3) → Sprint 135 original reference Python CB (
Carryover (Wave E Comprehensive — All Complete)
Sprint 135 Internal Work (All Complete ✅)
- Wave A: submission
aiQuotaCheckPoC + opossum introduction (D1~D5) — PR #167459cd8a - Wave B: github-worker 7 sites CB applied + host single CB pattern (D6~D7) — PR #168
c561488 - Wave A follow-up correction (separate PR): errorFilter wrapper + WeakSet marker for exact P2 resolution (D8) — PR #169
1f40247 - Wave C: submission 2 sites added (
ProblemServiceClient+CircuitBreakerModule@Global) (D9) — PR #1707d4c539 - Wave D: Grafana CB dashboard (TypeScript + Python schema separated) (D10) — PR #171
2c5d8e3 - Wave E: Sprint 135 ADR comprehensive update + sprint-window.md final cleanup — this ADR
status: completed
Sprint 136+ Carryover Seeds
CB Consistency Improvements
- github-worker errorFilter wrapper sync: Apply errorFilter wrapper + WeakSet marker pattern to
services/github-worker/src/circuit-breaker.ts(currentlyinstanceof Errorheuristic → same exact branching as Wave A). Only submission applied in Sprint Wave A sync PR #169, two modules need consistency recovery. - ai-analysis Python CB metric schema unification: Add state value 0/1/2 +
namelabel +failures_total+requests_total{result}to match TypeScript. Updateservices/ai-analysis/src/metrics.py+circuit_breaker.py+ operational alert rules simultaneously. After unification, the separated Python schema panels in Grafana dashboard can be integrated into a single query regex.
Separate Seeds (non-CB)
- CLAUDE.md L11 naming mismatch correction: Document says "ai-feedback" → actual directory is
services/ai-analysis/. Metadata consistency issue discovered during Sprint Wave A. - E2E auto PR CI integration (Sprint 134 carryover): Integrate
e2e-full.sh(657 lines) into.github/workflows/ci.ymlPR trigger. Currently only manual workflow_dispatch execution is possible.