Sprint 142 — Prompt Optimization — AI Improved Code Correctness Gate

Context

User feedback: "The AI improved code comes back wrong when run on Programmers."

This is a regression: the optimizedCode generated by the ai-analysis service is judged incorrect by the scoring platform.

8 Diagnosed Defects (P0~P3)

| Priority | Defect | Location |
|---|---|---|
| P0 | No behavioral equivalence enforcement | prompt.py:61 SYSTEM_PROMPT response rules |
| P0 | No function signature preservation instruction | prompt.py:216-224 _build_platform_context |
| P1 | Ambiguous correctness evaluation target | prompt.py:22-24 |
| P1 | No optimizedCode self-verification field | prompt.py:64-85 JSON schema |
| P2 | Silent skip when problem context absent | prompt.py:248-254 build_user_prompt |
| P2 | Weight bias (correctness 30% < readability + structure + best practice 45%) | prompt.py:164-170 |
| P3 | MAX_TOKENS = 8192 truncation risk | claude_client.py:32 |
| P3 | Group analysis code [:500] truncation | prompt.py:325 |

Infrastructure Investigation Result (Separate — Outside Scope)

The Submission entity has no problemTitle/problemDescription columns, and the saga publisher payload does not include them, so the ai-analysis worker is always called with empty strings. This is a structural defect: the problem context itself is never delivered to the LLM. This sprint only strengthens the prompt-side guards; the infrastructure fix is carried over as a Sprint 143+ seed.

Decisions

Scope: Wave A+B (Option b — Standard Volume)

  • Wave A (P0): SYSTEM_PROMPT/SQL_SYSTEM_PROMPT absolute behavioral equivalence rules + _build_platform_context instruction strengthening
  • Wave B (P1): Add JSON schema optimizedCodeMeta + claude_client self-verification fallback
  • Wave C (P2): Carried over to Sprint 143+, because weight rebalancing risks a score distribution regression and the loss of the comparison baseline with existing data

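Rationale: With Wave A alone (P0), a violation goes undetected if the LLM ignores the rules. The self-verification field (P1) makes violations detectable and enables a safe fallback, which is what justifies the extra work. The scope was decided after confirming that no frontend compatibility changes are required.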

Wave A — Behavioral Equivalence Absolute Rules

Added [Top Priority Rules] section before SYSTEM_PROMPT/SQL_SYSTEM_PROMPT response rules:

  • Absolute prohibition on changing function signature / input-output format / result column names and order
  • Permitted changes are internal implementation only
  • An explicit statement that a violation causes scoring failure, plus an "if unsure, return the original" safety net
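
The placement described above can be sketched as follows. This is a minimal sketch, not the actual prompt.py content: the rule wording, the RESPONSE_RULES text, and the build_system_prompt helper are all hypothetical and only illustrate that the top-priority rules come before the response rules.

```python
# Hypothetical wording; the real SYSTEM_PROMPT text is not shown in this document.
TOP_PRIORITY_RULES = """[Top Priority Rules]
1. Never change the function signature, the input/output format, or the result column names and order.
2. Only the internal implementation may change.
3. A violation makes the code fail scoring. If unsure, return the original code unchanged."""

# Placeholder for the existing response rules section.
RESPONSE_RULES = """[Response Rules]
Respond with a single JSON object."""


def build_system_prompt(role_intro: str) -> str:
    """Assemble a system prompt with the top-priority rules ahead of the response rules."""
    return "\n\n".join([role_intro.strip(), TOP_PRIORITY_RULES, RESPONSE_RULES])


prompt = build_system_prompt("You are a code reviewer.")
```

Putting the equivalence rules first is a design choice: the constraint is stated before any formatting instruction, so it is not buried below the schema details.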

_build_platform_context instruction strengthening:

  • PROGRAMMERS: Function signature preservation instruction (SQL branch added in Sprint 142 R3)
  • BOJ: Standard I/O format preservation instruction (Python API hardcoding removed in Sprint 142 R1, making the instruction language-neutral)

Wave B — Self-Verification Metadata + Safe Fallback

JSON schema optimizedCodeMeta added:

JSON
{
  "signaturePreserved": true/false,  // not included for SQL
  "behaviorEquivalent": true/false,
  "changes": ["summary of changes"]
}

In claude_client._parse_response:

  1. The _is_explicit_false helper performs strict boolean verification (Sprint 142 R1 P2 fix)
  2. Only an explicit false or the string "false" triggers the fallback
  3. On fallback, optimized_code = None and parsed["optimizedCode"] = None are applied together (Sprint 142 R2 P1 fix). The frontend parseFeedback prioritizes optimizedCode in the feedback JSON, so both must be cleared

Score ↔ Self-Verification Separation Policy (Sprint 142 R4 P1 fix)

correctness rubric:

  • ❌ "Evaluate both submitted code and optimizedCode" (draft — causes user score penalty if LLM suggests incorrect improved code)
  • ✅ "Evaluate submitted code only, optimizedCode handled separately by self-verification meta"

Core principle: Responsibilities are separated between scoring (evaluating the user's code) and verification (the LLM's self-verification meta). This ensures that users are not penalized for LLM mistakes.

SQL Signature Gate Removal (Sprint 142 R4 P2 fix)

SQL has no concept of a function signature, so signaturePreserved was removed from the optimizedCodeMeta schema in SQL_SYSTEM_PROMPT. Because _is_explicit_false(None) returns False, a missing value passes automatically, so in practice only behaviorEquivalent is verified.

PROGRAMMERS + SQL Branch (Sprint 142 R3 P2 fix)

AddProblemModal supports the PROGRAMMERS + sql combination, which is a real use case. The signature was changed to _build_platform_context(source_platform, language), with the following branches:

  • PROGRAMMERS + sql → result column name/order/sort preservation rules
  • PROGRAMMERS + other → function signature preservation rules
  • BOJ → SQL not supported, single message

Patterns (Reusable)

Multi-Round Critic Verification Pattern

| Round | Issues Caught |
|---|---|
| R1 | P2×2 (string boolean bypass / Python API hardcoding) |
| R2 | P1×1 (feedback JSON remaining) |
| R3 | P2×1 (PROGRAMMERS+SQL conflict) |
| R4 | P1×1 + P2×1 (score/verification coupling / SQL signature gate) |
| R5 | Clean pass |

Total 5 rounds · P0 0 · P1 2 · P2 4 all resolved.

Observation: Defect patterns that are hard to catch with single-round verification (e.g., feedback JSON serialization and frontend exposure, SQL/algorithm branch conflicts, score stability regressions) were discovered through repeated rounds. This is where repeated verification pays off.

LLM Self-Verification Metadata Pattern

When LLM output reliability cannot be enforced from code:

  1. Add self-verification fields to JSON schema (boolean + change summary)
  2. Code triggers fallback only on explicit false (blocks string boolean bypass)
  3. Missing/None/type mismatch does not trigger fallback (backward compatible)
  4. Rejected outputs removed from all serialization paths

Score ↔ Self-Verification Separation Principle

In LLM output scoring systems:

  • Score: Evaluates user artifacts only
  • Self-verification: Verifies reliability of LLM's own output
  • Combining both responsibilities in one field transfers LLM mistakes to user penalties

Lessons

Single Topic Single PR + Multiple Critic Rounds (Sprint 142) vs Group PR Split (Sprint 141)

  • Single topic (Sprint 142): Small change scope + deep verification → incremental improvement through multiple Critic rounds
  • Group split (Sprint 141): Diverse topics + 0 dependencies → single-round verification per PR

This sprint is centered on a single file (prompt.py) but distributes the merge burden across 5 Critic rounds. The pattern is to compensate for change depth with verification depth.

Importance of Pre-Merge Frontend Compatibility Check

The P1 issue (feedback JSON remaining) could have been prevented by analyzing parseFeedback behavior before merge. Instead, it was found by external Critic verification in R2. The bulk check of the entire consumer stack on a policy change (Sprint 141 B-2 pattern) should also be applied to the frontend.

Value of "User Visual Verification" (Reconfirmed)

This sprint passed all code, CI, and Critic verification, but confirming that an actual Programmers resubmission passes scoring is only possible in the user's environment. This is the same pattern as the calendar regression discovery in Sprint 139/140: automated verification and user visual verification must be run in tandem.

Verification Results

  • CI: Quality/Test AI Analysis/E2E/Coverage Gate all pass (5 PR pushes all GREEN)
  • 36 new tests added (test_prompt 18 + test_claude_client 18)
  • 100% backward compatible with existing tests (0 regressions)
  • Critic 5 rounds P0/P1/P2 all resolved
  • Branch discipline ✅: new branch + PR + Squash merge — 8 consecutive sprints compliant

Carryover (Sprint 143+)

Sprint 142 New Seeds

  • P2: Weight rebalancing (correctness 30% → 40% review)
  • P2: Guard that holds back optimizedCode generation when problem context is absent
  • P3: Token truncation / group analysis truncation guard (MAX_TOKENS=8192, code_preview[:500])
  • New infrastructure: ai-analysis worker → problem service /internal/:id cross-service call to actually inject problem info (title/description/examples)
  • User visual verification: Actual Programmers resubmission → scoring pass confirmation

Sprint 141 Remaining (4 items)

  • Calendar provider dependency defense (Group C P2)
  • prometheus-rules / dashboard automatic verification CI
  • E2E full integration UX enhancement (D-3 follow-up)
  • User visual verification — English environment calendar + production Grafana CB dashboard ai-analysis consistency

Total accumulated carryover 9 items → Sprint 143 cleanup targets.