Sprint 142 — Prompt Optimization — AI Improved Code Correctness Gate
Context
User feedback: "The AI improved code comes back wrong when run on Programmers."
Regression where optimizedCode generated by the ai-analysis service is judged incorrect by the scoring platform.
8 Diagnosed Defects (P0~P3)
| Priority | Defect | Location |
|---|---|---|
| P0 | No behavioral equivalence enforcement | prompt.py:61 SYSTEM_PROMPT response rules |
| P0 | No function signature preservation instruction | prompt.py:216-224 _build_platform_context |
| P1 | Ambiguous correctness evaluation target | prompt.py:22-24 |
| P1 | No optimizedCode self-verification field | prompt.py:64-85 JSON schema |
| P2 | Silent skip when problem context absent | prompt.py:248-254 build_user_prompt |
| P2 | Weight bias (correctness 30% < readability+structure+best practice 45%) | prompt.py:164-170 |
| P3 | MAX_TOKENS = 8192 truncation risk | claude_client.py:32 |
| P3 | Group analysis code [:500] truncation | prompt.py:325 |
Infrastructure Investigation Result (Separate — Outside Scope)
The Submission entity has no problemTitle/problemDescription columns, and the saga publisher payload does not include them, so the ai-analysis worker is always called with empty strings. This is a structural defect: the problem context itself never reaches the LLM. This sprint only strengthens prompt-side guards; the infrastructure fix is carried over as a Sprint 143+ seed.
Decisions
Scope: Wave A+B (Option b — Standard Volume)
- Wave A (P0): SYSTEM_PROMPT/SQL_SYSTEM_PROMPT absolute behavioral equivalence rules + _build_platform_context instruction strengthening
- Wave B (P1): Add JSON schema optimizedCodeMeta + claude_client self-verification fallback
- Wave C (P2): Carried to Sprint 143+. Weight rebalancing risks a score distribution regression and loss of the comparison baseline with existing data
Rationale: Wave A alone (P0) gives no way to detect the case where the LLM ignores the rules. The self-verification field (P1) adds violation detection plus a safe fallback, which secures the ROI. Decided after confirming that no frontend compatibility changes are required.
Wave A — Behavioral Equivalence Absolute Rules
Added [Top Priority Rules] section before SYSTEM_PROMPT/SQL_SYSTEM_PROMPT response rules:
- Absolute prohibition on changing function signature / input-output format / result column names and order
- Permitted changes are internal implementation only
- Explicitly states scoring failure on violation + "if unsure, return original" safety net
_build_platform_context instruction strengthening:
- PROGRAMMERS: Function signature preservation instruction (SQL branch added in Sprint 142 R3)
- BOJ: Standard I/O format preservation instruction (Python API hardcoding removed in Sprint 142 R1 → language-neutralized)
Wave B — Self-Verification Metadata + Safe Fallback
JSON schema optimizedCodeMeta added:
{
"signaturePreserved": true/false, // not included for SQL
"behaviorEquivalent": true/false,
"changes": ["summary of changes"]
}
In claude_client._parse_response:
- _is_explicit_false helper for strict boolean verification (Sprint 142 R1 P2 fix)
- Only explicit false or "false" string triggers fallback
- On fallback: optimized_code = None + parsed["optimizedCode"] = None applied simultaneously (Sprint 142 R2 P1 fix: frontend parseFeedback prioritizes optimizedCode in feedback JSON, so both must be cleared)
Score ↔ Self-Verification Separation Policy (Sprint 142 R4 P1 fix)
correctness rubric:
- ❌ "Evaluate both submitted code and optimizedCode" (draft — causes user score penalty if LLM suggests incorrect improved code)
- ✅ "Evaluate submitted code only, optimizedCode handled separately by self-verification meta"
Core principle: Separation of responsibilities between scoring (evaluating user code) and verification (LLM self-verification meta). Guarantees users are not penalized for LLM mistakes.
SQL Signature Gate Removal (Sprint 142 R4 P2 fix)
SQL has no concept of signature → Remove signaturePreserved from optimizedCodeMeta schema in SQL_SYSTEM_PROMPT. _is_explicit_false(None)=False so missing values auto-pass → only behaviorEquivalent is effectively verified.
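A small sketch of why dropping the field from the SQL schema is enough. The `GATED_KEYS` tuple and the `failed_gates` function are hypothetical names for illustration, and the helper body is an assumed reconstruction of `_is_explicit_false`, not the actual implementation.

```python
from typing import Any, List

# Hypothetical: the meta keys that can trigger the fallback.
GATED_KEYS = ("signaturePreserved", "behaviorEquivalent")


def _is_explicit_false(value: Any) -> bool:
    # Assumed behavior: only the boolean False or the string "false" counts.
    return value is False or (
        isinstance(value, str) and value.strip().lower() == "false"
    )


def failed_gates(meta: dict) -> List[str]:
    """Return the gated keys that the LLM explicitly marked as false."""
    return [key for key in GATED_KEYS if _is_explicit_false(meta.get(key))]


# SQL response: signaturePreserved is absent, so meta.get() yields None,
# which is not an explicit false and therefore passes.
sql_ok = {"behaviorEquivalent": True, "changes": ["example change"]}
sql_bad = {"behaviorEquivalent": False, "changes": ["example change"]}

passed = failed_gates(sql_ok)     # []
rejected = failed_gates(sql_bad)  # ["behaviorEquivalent"]
```

With this shape, no SQL-specific branch is needed in the parser; the schema alone decides which gates are effectively active.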
PROGRAMMERS + SQL Branch (Sprint 142 R3 P2 fix)
AddProblemModal supports PROGRAMMERS + sql combination (real use case). _build_platform_context(source_platform, language) signature change:
- PROGRAMMERS + sql → result column name/order/sort preservation rules
- PROGRAMMERS + other → function signature preservation rules
- BOJ → SQL not supported, single message
Patterns (Reusable)
Critic Multiple Round Repeated Verification Pattern
| Round | Issues Caught |
|---|---|
| R1 | P2×2 (string boolean bypass / Python API hardcoding) |
| R2 | P1×1 (rejected optimizedCode remaining in feedback JSON) |
| R3 | P2×1 (PROGRAMMERS+SQL conflict) |
| R4 | P1×1 + P2×1 (score/verification coupling / SQL signature gate) |
| R5 | Clean pass ✅ |
Total 5 rounds · P0 0 · P1 2 · P2 4 all resolved.
Observation: Defect patterns that are difficult to catch with single-round verification (e.g., feedback JSON serialization and frontend exposure, SQL/algorithm branch conflicts, score stability regressions) were discovered only through repeated rounds. This is the value of repeated verification.
LLM Self-Verification Metadata Pattern
When LLM output reliability cannot be enforced from code:
- Add self-verification fields to JSON schema (boolean + change summary)
- Code triggers fallback only on an explicit false, either the boolean or the string "false" (blocks string boolean bypass)
- Missing/None/type mismatch does not trigger fallback (backward compatible)
- Rejected outputs removed from all serialization paths
Score ↔ Self-Verification Separation Principle
In LLM output scoring systems:
- Score: Evaluates user artifacts only
- Self-verification: Verifies reliability of LLM's own output
- Combining both responsibilities in one field transfers LLM mistakes to user penalties
Lessons
Single Topic Single PR + Multiple Critic Rounds (Sprint 142) vs Group PR Split (Sprint 141)
- Single topic (Sprint 142): Small change scope + deep verification → incremental improvement through multiple Critic rounds
- Group split (Sprint 141): Diverse topics + 0 dependencies → single-round verification per PR
This sprint is centered on a single file (prompt.py) but distributes merge burden across 5 Critic rounds. Pattern of compensating change depth with verification depth.
Importance of Pre-Merge Frontend Compatibility Check
The R2 P1 (rejected optimizedCode remaining in feedback JSON) could have been blocked before merge by analyzing parseFeedback behavior first. It was instead found by external Critic verification in R2. The policy of bulk-checking the entire consumer stack on a change (Sprint 141 B-2 pattern) should also be applied to the frontend.
Value of "User Visual Verification" (Reconfirmed)
This sprint passed all code/CI/Critic verification, but actual Programmers resubmission → scoring pass verification is only possible in the user's environment. Same pattern as calendar regression discovery in Sprint 139/140 — automated verification + user visual verification in tandem is essential.
Verification Results
- CI: Quality/Test AI Analysis/E2E/Coverage Gate all pass (5 PR pushes all GREEN)
- 36 new tests added (test_prompt 18 + test_claude_client 18)
- 100% backward compatible with existing tests (0 regressions)
- Critic 5 rounds P0/P1/P2 all resolved
- Branch discipline ✅: new branch + PR + Squash merge — 8 consecutive sprints compliant
Carryover (Sprint 143+)
Sprint 142 New Seeds
- P2: Weight rebalancing (correctness 30% → 40% review)
- P2: Guard that withholds optimizedCode generation when problem context is absent
- P3: Token truncation / group analysis truncation guard (MAX_TOKENS=8192, code_preview[:500])
- New infrastructure: ai-analysis worker → problem service /internal/:id cross-service call to actually inject problem info (title/description/examples)
- User visual verification: Actual Programmers resubmission → scoring pass confirmation
Sprint 141 Remaining (4 items)
- Calendar provider dependency defense (Group C P2)
- prometheus-rules / dashboard automatic verification CI
- E2E full integration UX enhancement (D-3 follow-up)
- User visual verification — English environment calendar + production Grafana CB dashboard ai-analysis consistency
Total accumulated carryover 9 items → Sprint 143 cleanup targets.