# Agent Reviewer Rules

> **⚠️ IMPORTANT:** This document is for REVIEW AGENTS who handle commits, PRs, and code reviews.
> Regular agents follow AGENT_WORKER.md for implementation tasks and DO NOT make commits.

## Review Philosophy

**Mission:** Prevent the circular development patterns identified in commit history.

**Standards:**

- Reject code that doesn't meet the quality bar
- Ask for tests; don't accept "I'll add them later"
- Check token counts for prompt changes
- Verify architectural consistency
- Demand clear error messages

**Reviewer Authority:**

- Can block a PR for: missing tests, token bloat, architecture violations
- Cannot approve own code
- Must provide constructive feedback with specific fixes

## Review Checklist

### Phase 1: Structure & Hygiene (Block if failed)

- [ ] **Branch naming follows convention**
  - Format: `type/description` (e.g., `fix/tool-parsing`)
  - Not: `quick-fix`, `temp-branch`, `dev`
- [ ] **Commit messages are clear**
  - Format: `type(scope): description`
  - No: `fix stuff`, `WIP`, `asdf`, `omg finally`
  - Each commit should be reviewable independently
- [ ] **No production debugging code**
  - Search for: `print(`, `console.log`, `debugger`, `TODO`, `FIXME`, `XXX`
  - Check: No commented-out code blocks
  - Check: No temporary files committed
- [ ] **Git history is clean**
  - No "fix typo" commits after the initial commit
  - No "WIP" commits in the PR
  - No merge commits (rebase instead)
  - Squash fixup commits

### Phase 2: Code Quality (Block if failed)

- [ ] **Tests exist and pass**
  - Unit tests for new functions
  - Integration tests for API changes
  - Run: `pytest -v` (must pass)
  - Coverage: ≥80% for new code
  - **BLOCKING:** No tests = No merge
- [ ] **Type hints present**
  - All function parameters typed
  - All return values typed
  - Run: `mypy src/` (must pass with zero errors)
- [ ] **No code smells**
  - No functions > 50 lines
  - No files > 300 lines
  - No indentation > 3 levels deep
  - No circular imports
  - No duplicate code (>3 lines copied)
- [ ] **Minimal,
Maintainable, Modular Code**
  - **Minimal:** Only code needed to solve the problem, no over-engineering
  - **Maintainable:** Clear names, self-documenting, consistent style
  - **Modular:** Single Responsibility Principle, loose coupling, clear interfaces
  - **STRICT ENFORCEMENT:**
    - Functions should do ONE thing (if one does 2+ things, break it up)
    - No monolithic blocks (>50 lines in one function)
    - Clear separation of concerns
    - Interfaces between modules are stable and well-defined
    - Easy to understand for new maintainers
    - No "temp" or "quick" solutions - production quality only
  - **BLOCKING:** Code that is too complex, monolithic, or poorly structured must be rejected
- [ ] **Error handling is robust**
  - No bare `except:` clauses
  - All errors have clear messages
  - No silent failures
  - Edge cases handled
- [ ] **Documentation is adequate**
  - All public functions have docstrings
  - Complex logic has inline comments
  - README updated if user-facing change
  - Architecture doc updated if pattern changes

### Phase 3: Token Budget (Block if failed)

**For any prompt/instruction changes:**

- [ ] **Token count documented**
  - Before: X tokens
  - After: Y tokens
  - Change: +/- Z tokens
- [ ] **Within budget**
  - System prompt + instructions ≤ 2000 tokens (HARD LIMIT)
  - Leaves ≥ 50% of the context window for user input
  - **BLOCKING:** Over budget = Request reduction
- [ ] **Efficient wording**
  - No redundant examples
  - No verbose explanations
  - Prefer code over prose

**Token Counting Command:**

```bash
# Count tokens in a string
echo "Your prompt here" | python -c "import sys; import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(sys.stdin.read())))"
```

### Phase 4: Architecture (Block if failed)

- [ ] **Consistent with ARCHITECTURE.md**
  - No new patterns without updating docs
  - No mixing of concerns
  - Follows existing module structure
- [ ] **No architecture changes in fixes**
  - Bug fixes should not refactor
  - Refactors should be separate PRs
  - **Exception:**
If a fix requires an arch change, document WHY
- [ ] **Parser rules**
  - Only ONE parser per format
  - No alternative parsing paths
  - Clear regex patterns
  - Handles all documented cases
- [ ] **No feature flags in core**
  - Code should not have `if config.get("ENABLE_X"):`
  - Pick one approach, remove the old one
  - A/B testing only in a separate branch

### Phase 5: Research & Continuous Learning

**For significant changes (>100 lines or new algorithms):**

- [ ] **Research documented**
  - Check the `research/` folder for related findings
  - PR description mentions alternatives considered
  - Links to sources (docs, papers, repos)
  - Not: "I thought this would work"
  - Yes: "Based on [source], this approach handles [case] better than [alternative]"
- [ ] **Best practices followed**
  - Implementation matches current language/framework conventions
  - No deprecated patterns
  - Modern Python features used appropriately (3.9+)
- [ ] **No reinvention**
  - Check if the standard library solves the problem
  - Check if a well-maintained package exists
  - If a custom implementation is needed, document WHY

**Research Documentation Requirements:**

```markdown
## Research
- Alternatives considered: [list]
- Sources: [links]
- Decision: [why chosen approach]
- Benchmarks: [if applicable]
```

### Phase 6: Logic Correctness

- [ ] **Logic is sound**
  - Read through the code
  - Check edge cases
  - Verify error conditions
  - Question anything unclear
- [ ] **No performance regressions**
  - No blocking I/O in async functions (unless wrapped)
  - No memory leaks
  - No N+1 queries
  - Reasonable algorithmic complexity
- [ ] **Security check**
  - No SQL injection vectors
  - No command injection (bash execution sanitized)
  - Path traversal protection (for file ops)
  - No secrets in code

## Review Report Format

After review, write a report to `reports/PR-{number}-{branch}.md`:

```markdown
# Review Report: PR #{number} - {branch}

**Reviewer:** {your name}
**Date:** {YYYY-MM-DD}
**Status:** [APPROVED / CHANGES_REQUESTED / BLOCKED]

## Summary

Brief
description of what this PR does and overall quality assessment.

## Detailed Findings

### ✅ Passed
- [List items that passed review]
- [Be specific: "Tests cover 85% of new code"]

### ⚠️ Warnings (Non-blocking)
- [Minor issues that don't block merge]
- [Style suggestions]
- [Future improvements]

### ❌ Blockers (Must fix)
1. **[Category]** [Specific issue]
   - **Location:** `file.py:123`
   - **Problem:** [What's wrong]
   - **Fix:** [Exactly what to change]
   - **Why:** [Why this matters]
2. **[Category]** [Specific issue]
   - ...

## Token Impact Analysis
- Component: [what changed]
- Before: [X] tokens
- After: [Y] tokens
- Impact: [+/- Z] tokens
- Within budget: [Yes/No]

## Test Coverage
- New code coverage: [X]%
- Tests pass: [Yes/No]
- Integration tests: [Present/Missing]

## Architecture Review
- Follows existing patterns: [Yes/No]
- Introduces new dependencies: [List if any]
- Breaking changes: [Yes/No - explain if yes]

## Research Review
- Alternatives considered: [Listed/None]
- Sources cited: [Yes/No]
- Best practices followed: [Yes/No]
- Research documented: [Yes/No - location]

## Code Quality Score
- Structure: [0-10]
- Testing: [0-10]
- Documentation: [0-10]
- Logic: [0-10]
- **Overall: [0-10]**

## Action Items
- [ ] [Specific fix needed]
- [ ] [Specific fix needed]
- [ ] [Test to add]

## Verdict
[APPROVED / CHANGES_REQUESTED / BLOCKED]

**If CHANGES_REQUESTED:**
- Address all blockers
- Re-request review when ready

**If BLOCKED:**
- Major issues require architecture discussion
- Schedule meeting before continuing
```

## Severity Levels

### 🔴 BLOCKING (Cannot merge)
- Missing tests for new functionality
- Token budget exceeded
- Bare `except:` clauses
- Production debugging code (`print` statements)
- Breaking changes without documentation
- Security vulnerabilities
- Tests failing
- Type check errors
- Architecture violations

### 🟡 CHANGES_REQUESTED (Fix before merge)
- Unclear variable names
- Missing docstrings
- Inefficient algorithms
- Missing error handling
-
Unclear commit messages
- Minor style issues

### 🟢 APPROVED (Optional suggestions)
- Style preferences
- Future improvements
- Optional refactors

## Common Issues to Watch For

### Issue 1: Tool Parsing Duplication

```python
# ❌ WRONG - Multiple parsers
def parse_tools_v1(text): ...
def parse_tools_v2(text): ...
def parse_tools_legacy(text): ...

# ✅ CORRECT - Single parser
TOOL_PATTERN = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
```

**Check:** Search for "def parse" - there should be ONE per format.

### Issue 2: Token Bloat

```python
# ❌ WRONG - Too verbose
SYSTEM_PROMPT = """
You are an AI assistant. Here are detailed instructions...
[2000 words of explanation]
[10 examples]
"""

# ✅ CORRECT - Concise
SYSTEM_PROMPT = """Use TOOL: name\nARGUMENTS: {...} format. Available: read, write, bash."""
```

**Check:** Count tokens, verify < 2000.

### Issue 3: Architecture Drift

```python
# ❌ WRONG - Mixing concerns in one file
# src/api/routes.py
def handle_request(): ...
def parse_tools(): ...
def execute_tool(): ...
def format_response(): ...

# ✅ CORRECT - Separated
# src/api/routes.py - only HTTP handling
# src/tools/parser.py - only parsing
# src/tools/executor.py - only execution
```

**Check:** Each module has ONE responsibility.

### Issue 4: Debug Code Left In

```python
# ❌ WRONG
def process(data):
    print(f"DEBUG: data={data}")  # REMOVE THIS
    result = transform(data)
    print(f"DEBUG: result={result}")  # REMOVE THIS
    return result

# ✅ CORRECT
import logging

logger = logging.getLogger(__name__)

def process(data):
    logger.debug("Processing data", extra={"data_size": len(data)})
    return transform(data)
```

**Check:** `grep -r "print(" src/ --include="*.py" | grep -v "^#"`

### Issue 5: Missing Error Context

```python
# ❌ WRONG
raise ValueError("Invalid input")

# ✅ CORRECT
raise ValueError(f"Invalid model format: '{model_str}'. Expected: 'name:size:quant' (e.g., 'qwen:7b:q4')")
```

**Check:** All errors explain what was expected vs. what was received.

## Review Workflow

1.
**First Pass: Structure** (5 min)
   - Check branch name, commits, no debug code
   - If failed → Write report, BLOCK
2. **Second Pass: Quality** (10 min)
   - Run tests, check types, review code
   - If failed → Write report, CHANGES_REQUESTED
3. **Third Pass: Deep Dive** (15 min)
   - Read logic, check edge cases
   - Verify token counts
   - Check architecture
   - Write detailed report
4. **Final Decision** (5 min)
   - APPROVE / CHANGES_REQUESTED / BLOCK
   - Write report to the `reports/` folder
   - Post summary in PR comments

**Total time per review: 30-35 minutes**

## Reviewer Self-Check

Before submitting a review:

- [ ] I ran all tests locally
- [ ] I checked type hints
- [ ] I counted tokens (if applicable)
- [ ] I read every line of changed code
- [ ] My feedback is specific and actionable
- [ ] I explained WHY for each blocker
- [ ] I wrote a report to the `reports/` folder

## Escalation

Escalate to an architecture discussion if:

- The PR changes core patterns
- The token budget cannot be met
- Two reviewers disagree
- Breaking changes are proposed

**Don't just approve to be nice.**
**Don't let technical debt accumulate.**

## Report Storage

All reports go in the `reports/` folder:

```
reports/
├── PR-123-fix-tool-parsing.md
├── PR-124-add-federation.md
├── PR-125-refactor-consensus.md
└── README.md    # Index of all reviews
```

**This folder is gitignored - reports stay local.**

Generate the index with:

```bash
ls -1 reports/PR-*.md | sort -t'-' -k2 -n > reports/README.md
```

---

**Remember: You're the last line of defense against technical debt. Be thorough, be kind, be strict.**
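The mechanical parts of the first pass (debug prints, bare `except:` clauses, leftover `TODO`/`FIXME`/`XXX` markers) can be scripted so reviewers spend their time on logic instead. A minimal sketch, assuming plain regex scanning of Python source; `scan_source` and `BLOCKING_PATTERNS` are hypothetical names, not part of any existing tooling in this repo:

```python
import re

# Patterns that block a merge per Phases 1-2 (assumed approximations,
# not an exhaustive implementation of the checklist).
BLOCKING_PATTERNS = {
    "debug print": re.compile(r"^\s*print\("),
    "bare except": re.compile(r"^\s*except\s*:"),
    "leftover marker": re.compile(r"#\s*(TODO|FIXME|XXX)\b"),
}

def scan_source(source: str) -> list[tuple[int, str]]:
    """Return (line_number, issue_label) pairs for blocking patterns."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in BLOCKING_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

Note that a regex scan is deliberately crude: it will not flag `except Exception:` (which may still deserve a comment) and cannot replace reading the diff; it only guarantees the grep-level blockers never reach a human reviewer.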