# Agent Reviewer Rules

> ⚠️ **IMPORTANT:** This document is for REVIEW AGENTS who handle commits, PRs, and code reviews. Regular agents follow AGENT_WORKER.md for implementation tasks and DO NOT make commits.
## Review Philosophy

**Mission:** Prevent the circular development patterns identified in the commit history.

**Standards:**
- Reject code that doesn't meet quality bar
- Ask for tests, don't accept "I'll add them later"
- Check token counts for prompt changes
- Verify architectural consistency
- Demand clear error messages
**Reviewer Authority:**
- Can block PR for: missing tests, token bloat, architecture violations
- Cannot approve own code
- Must provide constructive feedback with specific fixes
## Review Checklist

### Phase 1: Structure & Hygiene (Block if failed)
- [ ] **Branch naming follows convention**
  - Format: `type/description` (e.g., `fix/tool-parsing`)
  - Not: `quick-fix`, `temp-branch`, `dev`
- [ ] **Commit messages are clear**
  - Format: `type(scope): description`
  - No: `fix stuff`, `WIP`, `asdf`, `omg finally`
  - Each commit should be reviewable independently
- [ ] **No production debugging code**
  - Search for: `print(`, `console.log`, `debugger`, `TODO`, `FIXME`, `XXX`
  - Check: No commented-out code blocks
  - Check: No temporary files committed
- [ ] **Git history is clean**
  - No "fix typo" commits after the initial commit
  - No "WIP" commits in the PR
  - No merge commits (rebase instead)
  - Squash fixup commits
### Phase 2: Code Quality (Block if failed)
- [ ] **Tests exist and pass**
  - Unit tests for new functions
  - Integration tests for API changes
  - Run: `pytest -v` (must pass)
  - Coverage: ≥80% for new code
  - **BLOCKING:** No tests = No merge
- [ ] **Type hints present**
  - All function parameters typed
  - All return values typed
  - Run: `mypy src/` (must pass with zero errors)
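A minimal sketch of what "fully typed" looks like in practice. The function name is illustrative; the `name:size:quant` model string and error-message style reuse this document's own example from Issue 5:

```python
def parse_model_spec(spec: str) -> tuple[str, str, str]:
    """Split a 'name:size:quant' model string into its three parts."""
    parts = spec.split(":")
    if len(parts) != 3:
        # Clear error: what was received, what was expected
        raise ValueError(
            f"Invalid model format: '{spec}'. "
            "Expected 'name:size:quant' (e.g., 'qwen:7b:q4')"
        )
    name, size, quant = parts
    return name, size, quant
```

Both the parameter and return types are explicit, so `mypy` can verify every call site.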
- [ ] **No code smells**
  - No functions > 50 lines
  - No files > 300 lines
  - No indentation > 3 levels deep
  - No circular imports
  - No duplicate code (> 3 lines copied)
- [ ] **Minimal, maintainable, modular code**
  - Minimal: only the code needed to solve the problem, no over-engineering
  - Maintainable: clear names, self-documenting, consistent style
  - Modular: Single Responsibility Principle, loose coupling, clear interfaces
  - STRICT ENFORCEMENT:
    - Functions should do ONE thing (if a function does 2+ things, break it up)
    - No monolithic blocks (> 50 lines in one function)
    - Clear separation of concerns
    - Interfaces between modules are stable and well-defined
    - Easy for new maintainers to understand
    - No "temp" or "quick" solutions: production quality only
  - **BLOCKING:** Code that is too complex, monolithic, or poorly structured must be rejected
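A small sketch of the single-responsibility rule with hypothetical helper names: each function does exactly one thing, and the seam between them is a clear interface:

```python
import json


def load_config(text: str) -> dict:
    """Parse raw JSON config text into a dict (parsing only)."""
    return json.loads(text)


def validate_config(config: dict) -> None:
    """Check required keys (validation only); raise with a clear message."""
    missing = {"model", "tools"} - config.keys()
    if missing:
        raise ValueError(f"Config missing required keys: {sorted(missing)}")
```

A reviewer should push back on a combined `load_and_validate_config()` that mixes parsing, validation, and I/O in one body.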
- [ ] **Error handling is robust**
  - No bare `except:` clauses
  - All errors have clear messages
  - No silent failures
  - Edge cases handled
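A sketch of the difference (the helper name is illustrative): a bare `except:` silently swallows every failure, while a targeted `except` re-raises with a clear message and preserves the cause:

```python
# ❌ WRONG - bare except hides the failure entirely
# try:
#     value = int(raw)
# except:
#     value = 0

# ✅ CORRECT - targeted, with a clear message and chained cause
def safe_int(value: str) -> int:
    """Convert a string to int, raising a descriptive error on failure."""
    try:
        return int(value)
    except ValueError as exc:
        raise ValueError(f"Expected an integer, got '{value}'") from exc
```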
- [ ] **Documentation is adequate**
  - All public functions have docstrings
  - Complex logic has inline comments
  - README updated if the change is user-facing
  - Architecture doc updated if a pattern changes
### Phase 3: Token Budget (Block if failed)

For any prompt/instruction changes:
- [ ] **Token count documented**
  - Before: X tokens
  - After: Y tokens
  - Change: +/- Z tokens
- [ ] **Within budget**
  - System prompt + instructions ≤ 2000 tokens (HARD LIMIT)
  - Leaves ≥ 50% of the context window for user input
  - **BLOCKING:** Over budget = request a reduction
- [ ] **Efficient wording**
  - No redundant examples
  - No verbose explanations
  - Prefer code over prose
**Token Counting Command:**

```shell
# Count tokens in a string
echo "Your prompt here" | python -c "import sys; import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(sys.stdin.read())))"
```
### Phase 4: Architecture (Block if failed)
- [ ] **Consistent with ARCHITECTURE.md**
  - No new patterns without updating the docs
  - No mixing of concerns
  - Follows the existing module structure
- [ ] **No architecture changes in fixes**
  - Bug fixes should not refactor
  - Refactors should be separate PRs
  - Exception: if a fix requires an architecture change, document WHY
- [ ] **Parser rules**
  - Only ONE parser per format
  - No alternative parsing paths
  - Clear regex patterns
  - Handles all documented cases
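A single-parser sketch built on the `TOOL:`/`ARGUMENTS:` pattern this document uses in Issue 1 (the function name is illustrative): one compiled pattern, one entry point, no alternative parsing paths:

```python
import re

# The ONE pattern for the tool-call format; no v1/v2/legacy variants
TOOL_PATTERN = re.compile(r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})')


def parse_tool_calls(text: str) -> list[tuple[str, str]]:
    """Return (tool_name, raw_json_args) pairs found in model output."""
    return TOOL_PATTERN.findall(text)
```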
- [ ] **No feature flags in core**
  - Code should not have `if config.get("ENABLE_X"):`
  - Pick one approach, remove the old one
  - A/B testing only in a separate branch
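One way to mechanically catch flag-style branching during review (the regex and function name are illustrative, not part of any codebase):

```python
import re

# Flags lines like: if config.get("ENABLE_X"):
FLAG_PATTERN = re.compile(r'config\.get\(\s*["\']ENABLE_\w+["\']')


def find_feature_flags(source: str) -> list[int]:
    """Return 1-based line numbers that contain ENABLE_* flag checks."""
    return [
        i for i, line in enumerate(source.splitlines(), 1)
        if FLAG_PATTERN.search(line)
    ]
```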
### Phase 5: Research & Continuous Learning

For significant changes (> 100 lines or new algorithms):
- [ ] **Research documented**
  - Check the `research/` folder for related findings
  - PR description mentions alternatives considered
  - Links to sources (docs, papers, repos)
  - Not: "I thought this would work"
  - Yes: "Based on [source], this approach handles [case] better than [alternative]"
- [ ] **Best practices followed**
  - Implementation matches current language/framework conventions
  - No deprecated patterns
  - Modern Python features used appropriately (3.9+)
- [ ] **No reinvention**
  - Check whether the standard library solves the problem
  - Check whether a well-maintained package exists
  - If a custom implementation is needed, document WHY
**Research Documentation Requirements:**

```markdown
## Research
- Alternatives considered: [list]
- Sources: [links]
- Decision: [why the chosen approach]
- Benchmarks: [if applicable]
```
### Phase 6: Logic Correctness
- [ ] **Logic is sound**
  - Read through the code
  - Check edge cases
  - Verify error conditions
  - Question anything unclear
- [ ] **No performance regressions**
  - No blocking I/O in async functions (unless wrapped)
  - No memory leaks
  - No N+1 queries
  - Reasonable algorithmic complexity
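A sketch of the "unless wrapped" exception, assuming Python 3.9+ for `asyncio.to_thread` (function names are illustrative): the blocking read runs in a worker thread instead of stalling the event loop:

```python
import asyncio


def read_file_blocking(path: str) -> str:
    """Plain blocking read: fine in sync code, not directly inside async."""
    with open(path) as f:
        return f.read()


async def read_file(path: str) -> str:
    # ✅ Wrapped: the blocking call runs in a thread, not on the event loop
    return await asyncio.to_thread(read_file_blocking, path)
```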
- [ ] **Security check**
  - No SQL injection vectors
  - No command injection (bash execution sanitized)
  - Path traversal protection (for file ops)
  - No secrets in code
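A path-traversal guard sketch for file operations, assuming Python 3.9+ for `Path.is_relative_to` (the helper name is illustrative): resolve the requested path and reject anything that escapes the base directory:

```python
from pathlib import Path


def resolve_safe(base_dir: str, user_path: str) -> Path:
    """Resolve user_path inside base_dir; reject anything that escapes it."""
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    if not target.is_relative_to(base):
        raise ValueError(
            f"Path escapes base directory: '{user_path}' (base: '{base_dir}')"
        )
    return target
```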
## Review Report Format

After the review, write a report to `reports/PR-{number}-{branch}.md`:
```markdown
# Review Report: PR #{number} - {branch}

**Reviewer:** {your name}
**Date:** {YYYY-MM-DD}
**Status:** [APPROVED / CHANGES_REQUESTED / BLOCKED]

## Summary
Brief description of what this PR does and an overall quality assessment.

## Detailed Findings

### ✅ Passed
- [List items that passed review]
- [Be specific: "Tests cover 85% of new code"]

### ⚠️ Warnings (Non-blocking)
- [Minor issues that don't block merge]
- [Style suggestions]
- [Future improvements]

### ❌ Blockers (Must fix)
1. **[Category]** [Specific issue]
   - **Location:** `file.py:123`
   - **Problem:** [What's wrong]
   - **Fix:** [Exactly what to change]
   - **Why:** [Why this matters]
2. **[Category]** [Specific issue]
   - ...

## Token Impact Analysis
- Component: [what changed]
- Before: [X] tokens
- After: [Y] tokens
- Impact: [+/- Z] tokens
- Within budget: [Yes/No]

## Test Coverage
- New code coverage: [X]%
- Tests pass: [Yes/No]
- Integration tests: [Present/Missing]

## Architecture Review
- Follows existing patterns: [Yes/No]
- Introduces new dependencies: [List if any]
- Breaking changes: [Yes/No - explain if yes]

## Research Review
- Alternatives considered: [Listed/None]
- Sources cited: [Yes/No]
- Best practices followed: [Yes/No]
- Research documented: [Yes/No - location]

## Code Quality Score
- Structure: [0-10]
- Testing: [0-10]
- Documentation: [0-10]
- Logic: [0-10]
- **Overall: [0-10]**

## Action Items
- [ ] [Specific fix needed]
- [ ] [Specific fix needed]
- [ ] [Test to add]

## Verdict
[APPROVED / CHANGES_REQUESTED / BLOCKED]

**If CHANGES_REQUESTED:**
- Address all blockers
- Re-request review when ready

**If BLOCKED:**
- Major issues require architecture discussion
- Schedule a meeting before continuing
```
## Severity Levels

### 🔴 BLOCKING (Cannot merge)
- Missing tests for new functionality
- Token budget exceeded
- Bare `except:` clauses
- Production debugging code (`print` statements)
- Breaking changes without documentation
- Security vulnerabilities
- Tests failing
- Type check errors
- Architecture violations
### 🟡 CHANGES_REQUESTED (Fix before merge)
- Unclear variable names
- Missing docstrings
- Inefficient algorithms
- Missing error handling
- Unclear commit messages
- Minor style issues
### 🟢 APPROVED (Optional suggestions)
- Style preferences
- Future improvements
- Optional refactors
## Common Issues to Watch For

### Issue 1: Tool Parsing Duplication
```python
# ❌ WRONG - Multiple parsers
def parse_tools_v1(text): ...
def parse_tools_v2(text): ...
def parse_tools_legacy(text): ...

# ✅ CORRECT - Single parser
TOOL_PATTERN = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
```
Check: Search for `def parse` - there should be ONE per format.
### Issue 2: Token Bloat
```python
# ❌ WRONG - Too verbose
SYSTEM_PROMPT = """
You are an AI assistant. Here are detailed instructions...
[2000 words of explanation]
[10 examples]
"""

# ✅ CORRECT - Concise
SYSTEM_PROMPT = """Use TOOL: name\nARGUMENTS: {...} format. Available: read, write, bash."""
```
Check: Count tokens, verify < 2000.
### Issue 3: Architecture Drift
```python
# ❌ WRONG - Mixing concerns in one file
# src/api/routes.py
def handle_request(): ...
def parse_tools(): ...
def execute_tool(): ...
def format_response(): ...

# ✅ CORRECT - Separated
# src/api/routes.py     - only HTTP handling
# src/tools/parser.py   - only parsing
# src/tools/executor.py - only execution
```
Check: Each module has ONE responsibility.
### Issue 4: Debug Code Left In
```python
# ❌ WRONG
def process(data):
    print(f"DEBUG: data={data}")  # REMOVE THIS
    result = transform(data)
    print(f"DEBUG: result={result}")  # REMOVE THIS
    return result

# ✅ CORRECT
import logging

logger = logging.getLogger(__name__)

def process(data):
    logger.debug("Processing data", extra={"data_size": len(data)})
    return transform(data)
```
Check:

```shell
# -n prefixes file:line:, so filter commented-out prints on the content side
grep -rn "print(" src/ --include="*.py" | grep -v ": *#"
```
### Issue 5: Missing Error Context
```python
# ❌ WRONG
raise ValueError("Invalid input")

# ✅ CORRECT
raise ValueError(f"Invalid model format: '{model_str}'. Expected: 'name:size:quant' (e.g., 'qwen:7b:q4')")
```
Check: All errors explain what was expected vs received.
## Review Workflow
1. **First Pass: Structure (5 min)**
   - Check branch name, commits, and for debug code
   - If failed → write report, BLOCK
2. **Second Pass: Quality (10 min)**
   - Run tests, check types, review code
   - If failed → write report, CHANGES_REQUESTED
3. **Third Pass: Deep Dive (15 min)**
   - Read the logic, check edge cases
   - Verify token counts
   - Check architecture
   - Write a detailed report
4. **Final Decision (5 min)**
   - APPROVE / CHANGES_REQUESTED / BLOCK
   - Write the report to the `reports/` folder
   - Post a summary in the PR comments
Total time per review: 30-35 minutes
## Reviewer Self-Check

Before submitting a review:
- [ ] I ran all tests locally
- [ ] I checked type hints
- [ ] I counted tokens (if applicable)
- [ ] I read every line of changed code
- [ ] My feedback is specific and actionable
- [ ] I explained WHY for each blocker
- [ ] I wrote a report to the `reports/` folder
## Escalation
Escalate to architecture discussion if:
- PR changes core patterns
- Token budget cannot be met
- Two reviewers disagree
- Breaking changes proposed
Don't just approve to be nice. Don't let technical debt accumulate.
## Report Storage

All reports go in the `reports/` folder:
```
reports/
├── PR-123-fix-tool-parsing.md
├── PR-124-add-federation.md
├── PR-125-refactor-consensus.md
└── README.md   # Index of all reviews
```
This folder is gitignored - reports stay local.
Generate the index with:

```shell
ls -1 reports/PR-*.md | sort -t'-' -k2 -n > reports/README.md
```
**Remember:** You're the last line of defense against technical debt. Be thorough, be kind, be strict.