# Agent Reviewer Rules

> ⚠️ **IMPORTANT:** This document is for REVIEW AGENTS who handle commits, PRs, and code reviews. Regular agents follow AGENT_WORKER.md for implementation tasks and DO NOT make commits.
## Review Philosophy

**Mission:** Prevent the circular development patterns identified in the commit history.

**Standards:**
- Reject code that doesn't meet quality bar
- Ask for tests, don't accept "I'll add them later"
- Check token counts for prompt changes
- Verify architectural consistency
- Demand clear error messages
**Reviewer Authority:**
- Can block PR for: missing tests, token bloat, architecture violations
- Cannot approve own code
- Must provide constructive feedback with specific fixes
## Review Checklist

### Phase 1: Structure & Hygiene (Block if failed)
- [ ] **Branch naming follows convention**
  - Format: `type/description` (e.g., `fix/tool-parsing`)
  - Not: `quick-fix`, `temp-branch`, `dev`
- [ ] **Commit messages are clear**
  - Format: `type(scope): description`
  - No: `fix stuff`, `WIP`, `asdf`, `omg finally`
  - Each commit should be reviewable independently
- [ ] **No production debugging code**
  - Search for: `print(`, `console.log`, `debugger`, `TODO`, `FIXME`, `XXX`
  - Check: No commented-out code blocks
  - Check: No temporary files committed
- [ ] **Git history is clean**
  - No "fix typo" commits after the initial commit
  - No "WIP" commits in the PR
  - No merge commits (rebase instead)
  - Squash fixup commits
### Phase 2: Code Quality (Block if failed)
- [ ] **Tests exist and pass**
  - Unit tests for new functions
  - Integration tests for API changes
  - Run: `pytest -v` (must pass)
  - Coverage: ≥80% for new code
  - **BLOCKING:** No tests = No merge
- [ ] **Type hints present**
  - All function parameters typed
  - All return values typed
  - Run: `mypy src/` (must pass with zero errors)
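A minimal sketch of what "fully typed" looks like in practice. The function name is illustrative; the `name:size:quant` model string and error-message style reuse this document's own example from Issue 5:

```python
def parse_model_spec(spec: str) -> tuple[str, str, str]:
    """Split a 'name:size:quant' model string into its three parts."""
    parts = spec.split(":")
    if len(parts) != 3:
        # Clear error: what was received, what was expected
        raise ValueError(
            f"Invalid model format: '{spec}'. "
            "Expected 'name:size:quant' (e.g., 'qwen:7b:q4')"
        )
    name, size, quant = parts
    return name, size, quant
```

Both the parameter and return types are explicit, so `mypy` can verify every call site.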
- [ ] **No code smells**
  - No functions > 50 lines
  - No files > 300 lines
  - No indentation > 3 levels deep
  - No circular imports
  - No duplicate code (> 3 lines copied)
- [ ] **Minimal, maintainable, modular code**
  - Minimal: only the code needed to solve the problem, no over-engineering
  - Maintainable: clear names, self-documenting, consistent style
  - Modular: Single Responsibility Principle, loose coupling, clear interfaces
  - STRICT ENFORCEMENT:
    - Functions should do ONE thing (if a function does 2+ things, break it up)
    - No monolithic blocks (> 50 lines in one function)
    - Clear separation of concerns
    - Interfaces between modules are stable and well-defined
    - Easy for new maintainers to understand
    - No "temp" or "quick" solutions: production quality only
  - **BLOCKING:** Code that is too complex, monolithic, or poorly structured must be rejected
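A small sketch of the single-responsibility rule with hypothetical helper names: each function does exactly one thing, and the seam between them is a clear interface:

```python
import json


def load_config(text: str) -> dict:
    """Parse raw JSON config text into a dict (parsing only)."""
    return json.loads(text)


def validate_config(config: dict) -> None:
    """Check required keys (validation only); raise with a clear message."""
    missing = {"model", "tools"} - config.keys()
    if missing:
        raise ValueError(f"Config missing required keys: {sorted(missing)}")
```

A reviewer should push back on a combined `load_and_validate_config()` that mixes parsing, validation, and I/O in one body.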
- [ ] **Error handling is robust**
  - No bare `except:` clauses
  - All errors have clear messages
  - No silent failures
  - Edge cases handled
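A sketch of the difference (the helper name is illustrative): a bare `except:` silently swallows every failure, while a targeted `except` re-raises with a clear message and preserves the cause:

```python
# ❌ WRONG - bare except hides the failure entirely
# try:
#     value = int(raw)
# except:
#     value = 0

# ✅ CORRECT - targeted, with a clear message and chained cause
def safe_int(value: str) -> int:
    """Convert a string to int, raising a descriptive error on failure."""
    try:
        return int(value)
    except ValueError as exc:
        raise ValueError(f"Expected an integer, got '{value}'") from exc
```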
- [ ] **Documentation is adequate**
  - All public functions have docstrings
  - Complex logic has inline comments
  - README updated if the change is user-facing
  - Architecture doc updated if a pattern changes
### Phase 3: Token Budget (Block if failed)

For any prompt/instruction changes:
- [ ] **Token count documented**
  - Before: X tokens
  - After: Y tokens
  - Change: +/- Z tokens
- [ ] **Within budget**
  - System prompt + instructions ≤ 2000 tokens (HARD LIMIT)
  - Leaves ≥ 50% of the context window for user input
  - **BLOCKING:** Over budget = request a reduction
- [ ] **Efficient wording**
  - No redundant examples
  - No verbose explanations
  - Prefer code over prose
**Token Counting Command:**

```shell
# Count tokens in a string
echo "Your prompt here" | python -c "import sys; import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(sys.stdin.read())))"
```
### Phase 4: Architecture (Block if failed)
- [ ] **Consistent with ARCHITECTURE.md**
  - No new patterns without updating the docs
  - No mixing of concerns
  - Follows the existing module structure
- [ ] **No architecture changes in fixes**
  - Bug fixes should not refactor
  - Refactors should be separate PRs
  - Exception: if a fix requires an architecture change, document WHY
- [ ] **Parser rules**
  - Only ONE parser per format
  - No alternative parsing paths
  - Clear regex patterns
  - Handles all documented cases
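A single-parser sketch built on the `TOOL:`/`ARGUMENTS:` pattern this document uses in Issue 1 (the function name is illustrative): one compiled pattern, one entry point, no alternative parsing paths:

```python
import re

# The ONE pattern for the tool-call format; no v1/v2/legacy variants
TOOL_PATTERN = re.compile(r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})')


def parse_tool_calls(text: str) -> list[tuple[str, str]]:
    """Return (tool_name, raw_json_args) pairs found in model output."""
    return TOOL_PATTERN.findall(text)
```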
- [ ] **No feature flags in core**
  - Code should not have `if config.get("ENABLE_X"):`
  - Pick one approach, remove the old one
  - A/B testing only in a separate branch
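One way to mechanically catch flag-style branching during review (the regex and function name are illustrative, not part of any codebase):

```python
import re

# Flags lines like: if config.get("ENABLE_X"):
FLAG_PATTERN = re.compile(r'config\.get\(\s*["\']ENABLE_\w+["\']')


def find_feature_flags(source: str) -> list[int]:
    """Return 1-based line numbers that contain ENABLE_* flag checks."""
    return [
        i for i, line in enumerate(source.splitlines(), 1)
        if FLAG_PATTERN.search(line)
    ]
```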
### Phase 5: Research & Continuous Learning

For significant changes (> 100 lines or new algorithms):
- [ ] **Research documented**
  - Check the `research/` folder for related findings
  - PR description mentions alternatives considered
  - Links to sources (docs, papers, repos)
  - Not: "I thought this would work"
  - Yes: "Based on [source], this approach handles [case] better than [alternative]"
- [ ] **Best practices followed**
  - Implementation matches current language/framework conventions
  - No deprecated patterns
  - Modern Python features used appropriately (3.9+)
- [ ] **No reinvention**
  - Check whether the standard library solves the problem
  - Check whether a well-maintained package exists
  - If a custom implementation is needed, document WHY
**Research Documentation Requirements:**

```markdown
## Research
- Alternatives considered: [list]
- Sources: [links]
- Decision: [why the chosen approach]
- Benchmarks: [if applicable]
```
### Phase 6: Logic Correctness
- [ ] **Logic is sound**
  - Read through the code
  - Check edge cases
  - Verify error conditions
  - Question anything unclear
- [ ] **No performance regressions**
  - No blocking I/O in async functions (unless wrapped)
  - No memory leaks
  - No N+1 queries
  - Reasonable algorithmic complexity
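A sketch of the "unless wrapped" exception, assuming Python 3.9+ for `asyncio.to_thread` (function names are illustrative): the blocking read runs in a worker thread instead of stalling the event loop:

```python
import asyncio


def read_file_blocking(path: str) -> str:
    """Plain blocking read: fine in sync code, not directly inside async."""
    with open(path) as f:
        return f.read()


async def read_file(path: str) -> str:
    # ✅ Wrapped: the blocking call runs in a thread, not on the event loop
    return await asyncio.to_thread(read_file_blocking, path)
```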
- [ ] **Security check**
  - No SQL injection vectors
  - No command injection (bash execution sanitized)
  - Path traversal protection (for file ops)
  - No secrets in code
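A path-traversal guard sketch for file operations, assuming Python 3.9+ for `Path.is_relative_to` (the helper name is illustrative): resolve the requested path and reject anything that escapes the base directory:

```python
from pathlib import Path


def resolve_safe(base_dir: str, user_path: str) -> Path:
    """Resolve user_path inside base_dir; reject anything that escapes it."""
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    if not target.is_relative_to(base):
        raise ValueError(
            f"Path escapes base directory: '{user_path}' (base: '{base_dir}')"
        )
    return target
```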
## Review Report Format

After the review, write a report to `reports/PR-{number}-{branch}.md`:
```markdown
# Review Report: PR #{number} - {branch}

**Reviewer:** {your name}
**Date:** {YYYY-MM-DD}
**Status:** [APPROVED / CHANGES_REQUESTED / BLOCKED]

## Summary
Brief description of what this PR does and an overall quality assessment.

## Detailed Findings

### ✅ Passed
- [List items that passed review]
- [Be specific: "Tests cover 85% of new code"]

### ⚠️ Warnings (Non-blocking)
- [Minor issues that don't block merge]
- [Style suggestions]
- [Future improvements]

### ❌ Blockers (Must fix)
1. **[Category]** [Specific issue]
   - **Location:** `file.py:123`
   - **Problem:** [What's wrong]
   - **Fix:** [Exactly what to change]
   - **Why:** [Why this matters]
2. **[Category]** [Specific issue]
   - ...

## Token Impact Analysis
- Component: [what changed]
- Before: [X] tokens
- After: [Y] tokens
- Impact: [+/- Z] tokens
- Within budget: [Yes/No]

## Test Coverage
- New code coverage: [X]%
- Tests pass: [Yes/No]
- Integration tests: [Present/Missing]

## Architecture Review
- Follows existing patterns: [Yes/No]
- Introduces new dependencies: [List if any]
- Breaking changes: [Yes/No - explain if yes]

## Research Review
- Alternatives considered: [Listed/None]
- Sources cited: [Yes/No]
- Best practices followed: [Yes/No]
- Research documented: [Yes/No - location]

## Code Quality Score
- Structure: [0-10]
- Testing: [0-10]
- Documentation: [0-10]
- Logic: [0-10]
- **Overall: [0-10]**

## Action Items
- [ ] [Specific fix needed]
- [ ] [Specific fix needed]
- [ ] [Test to add]

## Verdict
[APPROVED / CHANGES_REQUESTED / BLOCKED]

**If CHANGES_REQUESTED:**
- Address all blockers
- Re-request review when ready

**If BLOCKED:**
- Major issues require architecture discussion
- Schedule a meeting before continuing
```
## Severity Levels

### 🔴 BLOCKING (Cannot merge)
- Missing tests for new functionality
- Token budget exceeded
- Bare `except:` clauses
- Production debugging code (`print` statements)
- Breaking changes without documentation
- Security vulnerabilities
- Tests failing
- Type check errors
- Architecture violations
### 🟡 CHANGES_REQUESTED (Fix before merge)
- Unclear variable names
- Missing docstrings
- Inefficient algorithms
- Missing error handling
- Unclear commit messages
- Minor style issues
### 🟢 APPROVED (Optional suggestions)
- Style preferences
- Future improvements
- Optional refactors
## Common Issues to Watch For

### Issue 1: Tool Parsing Duplication
```python
# ❌ WRONG - Multiple parsers
def parse_tools_v1(text): ...
def parse_tools_v2(text): ...
def parse_tools_legacy(text): ...

# ✅ CORRECT - Single parser
TOOL_PATTERN = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
```
Check: Search for `def parse` - there should be ONE per format.
### Issue 2: Token Bloat
```python
# ❌ WRONG - Too verbose
SYSTEM_PROMPT = """
You are an AI assistant. Here are detailed instructions...
[2000 words of explanation]
[10 examples]
"""

# ✅ CORRECT - Concise
SYSTEM_PROMPT = """Use TOOL: name\nARGUMENTS: {...} format. Available: read, write, bash."""
```
Check: Count tokens, verify < 2000.
### Issue 3: Architecture Drift
```python
# ❌ WRONG - Mixing concerns in one file
# src/api/routes.py
def handle_request(): ...
def parse_tools(): ...
def execute_tool(): ...
def format_response(): ...

# ✅ CORRECT - Separated
# src/api/routes.py     - only HTTP handling
# src/tools/parser.py   - only parsing
# src/tools/executor.py - only execution
```
Check: Each module has ONE responsibility.
### Issue 4: Debug Code Left In
```python
# ❌ WRONG
def process(data):
    print(f"DEBUG: data={data}")  # REMOVE THIS
    result = transform(data)
    print(f"DEBUG: result={result}")  # REMOVE THIS
    return result

# ✅ CORRECT
import logging

logger = logging.getLogger(__name__)

def process(data):
    logger.debug("Processing data", extra={"data_size": len(data)})
    return transform(data)
```
Check:

```shell
# -n prefixes file:line:, so filter commented-out prints on the content side
grep -rn "print(" src/ --include="*.py" | grep -v ": *#"
```
### Issue 5: Missing Error Context
```python
# ❌ WRONG
raise ValueError("Invalid input")

# ✅ CORRECT
raise ValueError(f"Invalid model format: '{model_str}'. Expected: 'name:size:quant' (e.g., 'qwen:7b:q4')")
```
Check: All errors explain what was expected vs received.
## Review Workflow
1. **First Pass: Structure (5 min)**
   - Check branch name, commits, and for debug code
   - If failed → write report, BLOCK
2. **Second Pass: Quality (10 min)**
   - Run tests, check types, review code
   - If failed → write report, CHANGES_REQUESTED
3. **Third Pass: Deep Dive (15 min)**
   - Read the logic, check edge cases
   - Verify token counts
   - Check architecture
   - Write a detailed report
4. **Final Decision (5 min)**
   - APPROVE / CHANGES_REQUESTED / BLOCK
   - Write the report to the `reports/` folder
   - Post a summary in the PR comments
Total time per review: 30-35 minutes
## Reviewer Self-Check

Before submitting a review:
- [ ] I ran all tests locally
- [ ] I checked type hints
- [ ] I counted tokens (if applicable)
- [ ] I read every line of changed code
- [ ] My feedback is specific and actionable
- [ ] I explained WHY for each blocker
- [ ] I wrote a report to the `reports/` folder
## Escalation
Escalate to architecture discussion if:
- PR changes core patterns
- Token budget cannot be met
- Two reviewers disagree
- Breaking changes proposed
Don't just approve to be nice. Don't let technical debt accumulate.
## Report Storage

All reports go in the `reports/` folder:
```
reports/
├── PR-123-fix-tool-parsing.md
├── PR-124-add-federation.md
├── PR-125-refactor-consensus.md
└── README.md   # Index of all reviews
```
This folder is gitignored - reports stay local.
Generate the index with:

```shell
ls -1 reports/PR-*.md | sort -t'-' -k2 -n > reports/README.md
```
**Remember:** You're the last line of defense against technical debt. Be thorough, be kind, be strict.