
Prompting Strategies for Single Agents

A practical, research-backed field guide for writing prompts that make a single LLM agent more capable and reliable.

Use it for:

  • system prompt design
  • reasoning and tool-use strategy
  • structured output and format control
  • reliability and brittleness mitigation
  • uncertainty and verification policy

The goal is to keep the highest-signal findings that actually change how a prompt should be written. Orchestration, multi-agent design, and evaluation live in Research-orchestration.md.

Fast Takeaways

  1. Start with zero-shot + clear instruction. Add few-shot examples only when you need format stability, not extra reasoning.
  2. Put documents and context at the top of the prompt. Put the query and task at the bottom. This alone can improve quality by ~30%.
  3. State objective, constraints, and success criteria explicitly. Explain why each constraint exists, not just what it is.
  4. Use XML tags for structure. Ambiguous delimiters in a long prompt cause misinterpretation.
  5. Give the model an honest escape hatch: unknown, need evidence, or search more. Do not build a prompt that forces false confidence.
  6. Test every prompt with at least 3–5 paraphrase variants. A single-character change can collapse performance by tens of points.
  7. For Claude 4.x: use adaptive thinking with an effort parameter instead of manual budget_tokens. Normal phrasing beats ALL-CAPS urgency.
  8. Principles outperform personas. Put behavior into numbered constraints with rationales, not theatrical character descriptions.
  9. Prefer "do X" over "don't do Y". Negation-only constraints leave behavioral gaps.
  10. External verification beats self-critique. Ground revision passes in search results, test output, or grader feedback.

What To Copy Into Prompts

Structure

  • Role assignment in the first sentence of the system prompt.
  • Constraints written as "do X because Y" — the rationale makes the rule generalizable.
  • XML sections for mixing content types: <instructions>, <context>, <examples>, <input>.
  • Long documents at the top; the task instruction and query at the bottom.
  • 3–5 few-shot examples inside <example> tags: diverse, covering edge cases (a template sketch follows this list).
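
As a concrete illustration, here is a minimal sketch of that structure in Python. The XML tag names follow the conventions above; the role text, constraints, and helper name are illustrative placeholders, not a canonical template.

```python
# Sketch: assembling a prompt with the structure above. Role first,
# constraints with rationales, documents at the top, query at the bottom.

SYSTEM_PROMPT = """You are a careful technical support analyst.

Constraints:
1. Answer only from the provided documents, because unsupported claims mislead users.
2. If the documents do not contain the answer, say "unknown" and list the missing evidence.
"""

def build_user_prompt(documents: list[str], query: str) -> str:
    # Index each document so the model has explicit anchors.
    doc_block = "\n".join(
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(documents, start=1)
    )
    return (
        f"<context>\n{doc_block}\n</context>\n\n"
        "<instructions>\nAnswer the question using only <context>.\n</instructions>\n\n"
        f"<input>\n{query}\n</input>"
    )
```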

Reasoning

  • Test zero-shot CoT before inventing a multi-step scaffold. A minimal reasoning cue often closes the gap.
  • Use "think step by step" or "reason through this" as a baseline, then measure what actually helps.
  • For reasoning-heavy tasks: include <thinking> examples in few-shot demonstrations — Claude generalizes the style.
  • Use adaptive thinking (effort: high) for hard problems. Use effort: low or disabled thinking for classification and low-latency work.
  • Prompt for interleaved reasoning over tool results: "After receiving tool results, carefully reflect on their quality before deciding next steps."

Verification

  • Append "Before you finish, verify your answer against [criteria]" for coding and math tasks.
  • For factual tasks, ask for quote-extraction before answering: "Find quotes relevant to [X] in <quotes> tags, then answer." This forces active retrieval of middle-context content.
  • After two failed self-correction attempts, prefer grounded external feedback (tests, search, grader) over another introspection pass (a minimal loop is sketched below).
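
A minimal sketch of such a grounded loop, assuming a generic `generate` callable and pytest as the verifier; both are stand-ins for whatever model client and test suite the harness actually uses.

```python
import subprocess

MAX_ATTEMPTS = 3  # keep the loop bounded; escalate after repeated failures

def grounded_revision_loop(generate, task: str) -> str:
    # `generate` maps a prompt string to model output text (assumption).
    feedback = ""
    draft = ""
    for _ in range(MAX_ATTEMPTS):
        prompt = task if not feedback else f"{task}\n\nPrevious test failures:\n{feedback}"
        draft = generate(prompt)
        # In a real harness the draft would be applied to the repo before testing.
        result = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return draft
        feedback = result.stdout[-2000:]  # compact, grounded evidence for the next pass
    return draft  # best effort; the caller should escalate, not keep looping
```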

Tool Use

  • Be imperative: "Change this function" not "Could you suggest changes?" — the model takes the verb literally.
  • Replace "CRITICAL: You MUST use this tool" with "Use this tool when [condition]" — Claude 4.x overtriggers on aggressive phrasing.
  • For parallel tool calls, prompt explicitly for parallelism: "Call all three tools in a single turn." Otherwise execution is often sequential.
  • Never speculate about code you have not read. If the model tends to hallucinate file contents, add: "Never describe code you have not opened."

Uncertainty Policy

  • Put uncertainty policy in the prompt: "If you cannot verify a claim, say 'unverified' and explain what evidence is missing."
  • Give the model explicit permission to say unknown rather than guessing — this makes refusals useful rather than blocking.
  • State when to escalate: "If the task requires permissions you do not have, stop and describe what you need."

Core Sources: Reasoning and Chain of Thought

1. Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022)

Source: https://arxiv.org/abs/2205.11916

Why it matters:

  • A very small reasoning cue can unlock much better performance than a plain direct answer.

Key takeaway:

  • Before building a complicated prompt chain, test a minimal reasoning baseline.

Implication:

  • Use simple reasoning scaffolds as the baseline to beat.
  • If a complex workflow does not outperform a minimal prompt + tool loop, the workflow is probably not worth it.

2. Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022)

Source: https://arxiv.org/abs/2203.11171

Why it matters:

  • Sampling multiple reasoning paths and aggregating them can improve correctness on hard reasoning tasks.

Key takeaway:

  • Best-of-N / vote / pass@k style decoding is often more useful than one brittle "perfect prompt".

Implication:

  • Use selectively for high-value reasoning or planning steps (a voting sketch follows this list).
  • Do not apply blindly to every turn — it is a latency and cost tradeoff.
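
A minimal sketch of self-consistency voting, assuming a `sample` callable that returns one final answer per call at non-zero temperature; extracting the final answer from model text is left to the caller.

```python
from collections import Counter

def self_consistent_answer(sample, prompt: str, n: int = 5) -> str:
    # Diversity comes from sampling temperature, not from changing the prompt.
    answers = [sample(prompt) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    if count <= n // 2:
        # Low agreement is itself a signal: escalate rather than guess.
        return "unknown"
    return best
```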

3. ReAct: Synergizing Reasoning and Acting (Yao et al., 2022)

Source: https://arxiv.org/abs/2210.03629

Why it matters:

  • ReAct formalized the now-standard pattern of interleaving reasoning with external actions.

Key takeaway:

  • Reasoning is better when it can touch the world.

Implication:

  • For factual, interactive, or environment-dependent tasks, combine thinking with acting instead of pushing all work into one monologue.
  • This is a strong default for search, repo work, shell use, and structured tool loops (a minimal loop is sketched below).
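
A minimal sketch of the thought/action/observation loop, assuming the model returns a small step dict; the step protocol and the `tools` registry are illustrative, not the paper's exact format.

```python
def react_loop(model, tools: dict, task: str, max_steps: int = 8) -> str:
    # `model` maps a transcript string to a step dict (an assumption of this
    # sketch): {"thought", "action", "input"} or {"final": answer}.
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(transcript))
        if "final" in step:
            return step["final"]
        observation = tools[step["action"]](step["input"])  # act on the world
        transcript.append(f"Thought: {step['thought']}")
        transcript.append(f"Action: {step['action']}({step['input']})")
        transcript.append(f"Observation: {observation}")
    return "unknown"  # bounded: stop and escalate instead of looping forever
```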

4. Tree of Thoughts / Search-Style Deliberation

Sources:

  • https://arxiv.org/abs/2305.10601 (Tree of Thoughts: Deliberate Problem Solving, Yao et al., 2023)

Why it matters:

  • Search over candidate reasoning paths can improve hard problems when a single left-to-right pass is too brittle.

Key takeaway:

  • Deliberate search can help, but it is expensive. Use it for genuinely hard branches, not routine chat.

Implication:

  • Keep search/planning loops bounded.
  • Reach for tree search only when the task is hard enough and the eval gain justifies the extra cost.

5. Zero-Shot Can Be Stronger than Few-Shot CoT (2025)

Source: https://arxiv.org/abs/2506.14641

Why it matters:

  • For strong modern models, few-shot CoT examples mainly align output format, not reasoning quality. Attention analysis shows models largely ignore exemplar content.

Key takeaway:

  • For frontier models (Claude 3.5+, Qwen2.5-72B+): start with zero-shot + clear instruction. Add few-shot examples primarily for format control, not reasoning.
  • For smaller or fine-tuned models: few-shot CoT with worked steps still provides meaningful lift.

Implication:

  • Test zero-shot first on capable models.
  • If adding few-shot, target 3–5 diverse examples focused on edge-case output formats.
  • For format stability at lower cost: add 1–2 examples rather than 5+.

Core Sources: Tool Use and Self-Correction

6. Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023)

Source: https://arxiv.org/abs/2302.04761

Why it matters:

  • Tool use does not have to be hard-coded: a model can learn to decide when and how to call tools from its information need rather than from keyword triggers.

Key takeaway:

  • A good prompt should direct tool decisions from the information need, not from crude keyword triggers.

Implication:

  • Prefer model-directed tool decisions over brittle word lists.
  • Keep a simple fallback policy, but do not let the fallback dominate product behavior.

7. CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing (Gou et al., 2024)

Source: https://arxiv.org/abs/2305.11738

Why it matters:

  • Self-correction becomes much stronger when the critique is grounded in external tools instead of pure self-reflection.

Key takeaway:

  • Verification works better with evidence than with vibes.

Implication:

  • When possible, critique drafts against search results, tests, or environment state.
  • A grounded revision pass is usually higher value than another creative generation pass.

8. Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023)

Source: https://arxiv.org/abs/2303.17651

Why it matters:

  • Even without external tools, generate → critique → revise can improve outputs.

Key takeaway:

  • Revision is a useful primitive, but should be bounded and measured.

Implication:

  • Keep self-refine loops short.
  • Prefer one clear revision pass over open-ended introspection (see the sketch below).
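
A sketch of a single bounded generate → critique → revise pass, again assuming a generic `generate` callable; the prompt wording is illustrative.

```python
def self_refine_once(generate, task: str) -> str:
    draft = generate(task)
    critique = generate(
        "Critique this draft against the task. List concrete defects only.\n"
        f"Task: {task}\nDraft:\n{draft}"
    )
    # One revision pass, grounded in the written critique, then stop.
    return generate(
        "Revise the draft to fix every listed defect. Change nothing else.\n"
        f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}"
    )
```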

9. Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)

Source: https://arxiv.org/abs/2303.11366

Why it matters:

  • Reflection across attempts can improve repeated-task performance.

Key takeaway:

  • Memory is most useful when it captures compact lessons from failures, not giant transcripts.

Implication:

  • Store short, actionable reflections from past failures.
  • Use reflection memory across repeated tasks or sessions, not as an excuse to keep every token forever (a minimal store is sketched below).
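
A minimal sketch of such a memory, using a hypothetical reflections.jsonl store; the point is that entries are distilled lessons, not transcripts.

```python
import json
from pathlib import Path

MEMORY = Path("reflections.jsonl")  # hypothetical store for compact lessons

def record_reflection(task_id: str, failure: str, lesson: str) -> None:
    # Store the distilled lesson, not the full transcript.
    with MEMORY.open("a") as f:
        f.write(json.dumps({"task": task_id, "failure": failure, "lesson": lesson}) + "\n")

def lessons_for(task_id: str, limit: int = 5) -> list[str]:
    # Surface the few most recent lessons for this task before the next attempt.
    if not MEMORY.exists():
        return []
    rows = [json.loads(line) for line in MEMORY.read_text().splitlines()]
    return [r["lesson"] for r in rows if r["task"] == task_id][-limit:]
```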

10. Programmatic Tool Calling (Anthropic, 2026)

Source: https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling

Why it matters:

  • Letting the model write code that fans out or sequences tool calls, filters large results, and returns a compact summary beats paying a model round-trip per tool call.

Key takeaways:

  • Useful when: 3+ dependent tool calls, large datasets, or parallel checks across many items.
  • Tool outputs must be treated as untrusted strings. Injection hygiene matters if the execution environment will parse the results.
  • Not the default for single fast calls or highly interactive steps where code-execution overhead outweighs the gain.

Implication:

  • Add batched tool fanout only as an opt-in executor pattern for research-heavy or data-heavy tasks (see the sketch below).
  • Log caller/executor state clearly enough to debug failures and reuse behavior.
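
A sketch of the executor pattern the docs describe: generated code fans out many tool calls, filters locally, and returns only a compact summary to the model. `check_inventory` is a stand-in for a registered tool, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor

def check_inventory(sku: str) -> dict:
    # Stand-in for one registered tool call; returns a fake record for the sketch.
    return {"sku": sku, "stock": hash(sku) % 50}

def batch_check(skus: list[str]) -> str:
    # Fan out in parallel instead of one model round-trip per call.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(check_inventory, skus))
    low = [r for r in results if r["stock"] < 10]
    # Only this filtered summary re-enters the model's context.
    return (
        f"{len(low)} of {len(skus)} SKUs below threshold: "
        + ", ".join(r["sku"] for r in low)
    )
```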

Core Sources: Prompt Design and Reliability

11. Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023/2024)

Source: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/

Why it matters:

  • LLMs attend well to content at the beginning and end of a context window but show 30%+ degradation on content buried in the middle. The effect holds even for models designed for long contexts.

Key takeaway:

  • Placement is a first-class concern. Put documents first, put the query last.

Implication:

  • Place long documents and context near the top of the prompt; place the task instruction and query at the bottom (end of the prompt). Anthropic's docs confirm: up to 30% quality improvement from this ordering.
  • For RAG: use focused retrieval so the relevant chunk is short or placed at a privileged position. Do not fill the context window with undifferentiated text.
  • Ask the model to extract and quote relevant passages before answering, which forces active retrieval of middle-context content.
  • Use XML-tagged document structure with index numbers to give the model explicit anchors when presenting multiple documents (see the sketch below).
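
A minimal sketch combining these placement rules, assuming a plain string prompt; the instruction wording mirrors the bullets above.

```python
def quote_then_answer_prompt(documents: list[str], question: str) -> str:
    # Documents first, with indexed XML anchors; the task and question last.
    docs = "\n".join(
        f'<document index="{i}">\n{d}\n</document>'
        for i, d in enumerate(documents, start=1)
    )
    return (
        f"{docs}\n\n"
        "First, write the quotes most relevant to the question in <quotes> tags. "
        "Then answer using only those quotes.\n\n"
        f"Question: {question}"
    )
```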

Avoid:

  • Burying critical facts in the middle of a long prompt.
  • Assuming "larger context window = better utilization" — window size and utilization quality are independent.

12. Quantifying LM Sensitivity to Spurious Features in Prompt Design (ICLR 2024)

Source: https://arxiv.org/abs/2310.11324

Why it matters:

  • LLMs show extreme sensitivity to superficially trivial prompt variations, with performance swings of up to 76 accuracy points from single-character formatting differences. This does not improve with scale.

Key takeaway:

  • Test prompts in multiple phrasings before deploying. Format effects do not transfer across models.

Implication:

  • Test with at least 3–5 paraphrase variants before deploying. If performance swings more than ~5%, the prompt is brittle (a testing sketch follows this list).
  • Add 1–2 representative few-shot examples as a "stabilizer" even when zero-shot quality is acceptable — even one example substantially reduces brittleness.
  • Use XML tag structure to reduce the model's need to parse ambiguous delimiters.
  • Track prompt versions with version control and re-evaluate after any model upgrade.
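
A sketch of a pre-deployment brittleness check, assuming an `evaluate` callable that scores one prompt variant against a fixed eval set and returns accuracy in [0, 1].

```python
def brittleness_check(evaluate, variants: list[str]) -> float:
    # 3-5 hand-written paraphrases are usually enough to expose brittleness.
    scores = [evaluate(v) for v in variants]
    spread = max(scores) - min(scores)
    if spread > 0.05:  # more than ~5 points of swing: treat the prompt as brittle
        print(f"brittle: accuracy ranges {min(scores):.2f}-{max(scores):.2f}")
    return spread
```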

Avoid:

  • Deploying prompts tested in only one phrasing.
  • Changing punctuation, casing, or whitespace in production prompts without re-evaluation.
  • Comparing models using a single prompt format — ranking reversals are common.

13. POSIX: A Prompt Sensitivity Index For Large Language Models (EMNLP 2024)

Source: https://arxiv.org/abs/2410.02185

Why it matters:

  • Adding even one few-shot example dramatically reduces prompt sensitivity. Model size and instruction tuning do not.

Key takeaway:

  • If your prompt breaks under slight rewordings, add an example before hunting for a better phrasing.

Implication:

  • Template changes cause the highest sensitivity on multiple-choice tasks; paraphrasing causes the highest sensitivity on open-ended generation. Tune mitigation to the task type.
  • For production agents with evolving prompts: the POSIX index is a useful pre-deployment stability check.

14. Principled Instructions Are All You Need (2024)

Source: https://arxiv.org/abs/2312.16171

Why it matters:

  • Giving a model 26 structured principles as part of a zero-shot prompt raised GPT-4 accuracy by 57.7% over unstructured baseline prompts.

Key takeaway:

  • Write an explicit "operating principles" section in your system prompt — a short numbered list of rules with rationales.

Implication:

  • High-impact principles: (1) assign a role, (2) use affirmative directives, (3) ask for step-by-step reasoning, (4) specify output format, (5) use delimiters/tags, (6) combine CoT with examples for complex tasks.
  • Principle-based prompting and few-shot prompting are complementary, not competing — combine them for complex reasoning tasks (an example principles block follows this list).
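
For illustration, a short principles block of this shape, written as a Python constant; the specific rules and rationales are examples, not the paper's list of 26.

```python
# Illustrative "operating principles" section for a system prompt.
OPERATING_PRINCIPLES = """\
You are a code reviewer focused on correctness and security.

Operating principles:
1. Review only the diff provided, because out-of-scope comments dilute signal.
2. Reason step by step through each changed function before judging it.
3. Report findings as a numbered list: severity, file, line, suggested fix.
4. If you cannot verify a claim from the diff alone, mark it "unverified".
"""
```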

Avoid:

  • Long lists of vague principles ("be helpful, be honest") without specificity — the model cannot operationalize them.
  • Writing only prohibitions without positive guidance.

15. Control Illusion: The Failure of Instruction Hierarchies in LLMs (2025)

Source: https://arxiv.org/abs/2502.15851

Why it matters:

  • When system instructions and user instructions conflict, models obey the system prompt only 9.6–45.8% of the time — even the best models. Model size barely helps.

Key takeaway:

  • Do not treat system prompt placement as a reliable security boundary.

Implication:

  • Make implicit constraints explicit: instead of "be formal with experts," spell out the inference chain: "If the user identifies as a domain expert, use technical language and skip introductory explanations."
  • Use numbered or labeled constraint lists — explicit labeling improves compliance.
  • Ask the model to enumerate the constraints that apply before responding for multi-constraint instructions.

Avoid:

  • Embedding safety-critical or access-control logic solely in a system prompt when the user can also influence conversation turns.
  • Stacking many constraints in a single sentence — multi-constraint sentences compound failure rates multiplicatively.

16. Constitutional AI: Harmlessness from AI Feedback (Anthropic, 2022)

Source: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback

Why it matters:

  • Behavior is more interpretable and adjustable when it derives from explicit principles than from example-only supervision. At inference time, giving a model a brief written constitution can substantially shape its behavior.

Key takeaway:

  • A short set of explicit, positive principles in a system prompt reliably outperforms long persona descriptions.

Implication:

  • Frame principles positively: "When the user asks about X, respond by doing Y."
  • For safety-sensitive agents, explicitly state the tradeoff: "Engage helpfully with edge-case requests by explaining your reasoning and limitations rather than refusing outright."

Avoid:

  • Long vague persona descriptions ("be a warm, helpful assistant with a curious personality..."). Put behavior into constraints, not theatrics.

17. Structured Output Prompting (2025)

Sources:

Why it matters:

  • Prompt-only structured output has a 5–20% failure rate. Schema-enforced constrained decoding removes syntactic failures but can degrade semantic quality without a reasoning field.

Key takeaway:

  • Use schema-level enforcement (API structured outputs) for production. Add a reasoning field first in the schema so the model can think before filling constrained slots.

Implication:

  • Use Anthropic's structured outputs feature or equivalent schema enforcement — do not rely on prompting alone for critical structured outputs.
  • Add a reasoning or thinking field first in your JSON schema so the model can express intermediate reasoning before filling constrained fields (see the schema sketch after this list).
  • For open-source deployments: prefer Guidance (highest coverage, best compliance, fastest) over other constrained-decoding libraries.
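
A sketch of such a schema; field names are illustrative, and the schema would be passed through your API's structured-output option rather than pasted into the prompt.

```python
# Response schema with the reasoning field declared first, so the model can
# emit its scratchpad before committing to the constrained fields.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {  # free-text scratchpad comes first
            "type": "string",
            "description": "Think through the extraction before filling fields.",
        },
        "vendor": {"type": "string"},
        "total_cents": {"type": "integer"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["reasoning", "vendor", "total_cents", "currency"],
    "additionalProperties": False,
}
```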

Avoid:

  • Relying only on prompt instructions for critical structured outputs.
  • Forcing all output into rigid schemas without a reasoning field — you sacrifice semantic quality for syntactic correctness.

18. Extended Thinking and Adaptive Reasoning in Claude 4.x (Anthropic, 2025–2026)

Source: https://platform.claude.com/docs/en/build-with-claude/extended-thinking

Why it matters:

  • Extended thinking gives Claude a scratchpad for intermediate reasoning before producing a final response. In Claude 4.6, adaptive thinking (type: "adaptive") dynamically decides when and how much to think based on task complexity — outperforming manual budget_tokens.

Key takeaways:

  • Adaptive thinking skips reasoning on simple queries automatically and reasons deeply on complex ones.
  • General instructions outperform prescriptive steps: "Think thoroughly about this" often beats a hand-written step-by-step plan.
  • Interleaved thinking between tool calls enables more sophisticated reasoning about tool results.
  • Overthinking is real: Opus 4.6 at high effort settings does extensive exploration. If unwanted: "Choose an approach and commit to it. Avoid revisiting decisions unless you encounter new contradicting information."
  • Math performance scales logarithmically with thinking token budget — diminishing returns above 32k.

Implication:

  • Use thinking: {type: "adaptive"} with effort: "high" for complex reasoning or multi-step tool use (a request sketch follows this list).
  • Use effort: "low" or disabled thinking for chat and classification workloads.
  • Include <thinking> examples in few-shot demonstrations for reasoning-heavy tasks — Claude generalizes the style.
  • After tool results: "Carefully reflect on the results before deciding the next step." This triggers useful interleaved thinking.
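
A minimal request sketch using the parameter names quoted above; where the effort setting lives in the request body is an assumption drawn from this doc's phrasing, so check the current API reference for exact field names.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",  # illustrative model id
    max_tokens=4096,
    # Adaptive thinking per the docs above; effort placement is an assumption.
    thinking={"type": "adaptive", "effort": "high"},
    messages=[
        {"role": "user", "content": "Plan the refactor. Think it through, then list steps."}
    ],
)
print(response.content[-1].text)  # final text block follows any thinking blocks
```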

Avoid:

  • Using budget_tokens on Claude 4.6+ — it is deprecated and inferior to adaptive thinking.
  • Setting effort: "max" for simple tasks — inflates latency and cost with no quality benefit.
  • Writing a detailed prescribed reasoning chain and expecting Claude to follow it exactly — Claude's own reasoning typically exceeds the prescribed plan. Give direction, not a script.

19. LLMLingua: Compressing Prompts for Accelerated Inference (EMNLP 2023 / ACL 2024)

Sources:

  • https://arxiv.org/abs/2310.05736 (LLMLingua, EMNLP 2023)
  • https://arxiv.org/abs/2403.12968 (LLMLingua-2, ACL 2024)

Why it matters:

  • Long prompts degrade quality (lost-in-the-middle) and cost money. LLMLingua-2 achieves up to 20x compression with only ~1.5 accuracy point drop, and is 3–6x faster than v1.

Key takeaway:

  • Compress the context/documents portion of the prompt, not the instructions.

Implication:

  • Use LLMLingua-2 as the default for RAG pipelines: compress retrieved passages before inserting them to reduce context length and improve signal-to-noise (see the sketch below).
  • Compression is also a mitigation for the "lost in the middle" problem — shorter context places key information closer to the ends.
  • Apply to natural prose/documents, not structured instructions or few-shot examples.
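
A sketch of that pipeline placement. PromptCompressor and compress_prompt are the llmlingua package's public API; the model name, rate, and placeholder inputs are typical values to verify against the repo.

```python
from llmlingua import PromptCompressor  # pip install llmlingua

retrieved_passages = ["<long retrieved chunk 1>", "<long retrieved chunk 2>"]
task_instruction = "Answer the question using only the context above."
user_query = "What changed in the Q3 billing pipeline?"

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

result = compressor.compress_prompt(
    retrieved_passages,   # compress only the context/documents portion
    instruction="",       # keep instructions out of the compressor
    question=user_query,  # preserved so the task survives compression
    rate=0.33,            # ~3x; avoid >10x for precise factual recall
)

# Compressed context at the top, instruction and query at the bottom.
prompt = f"{result['compressed_prompt']}\n\n{task_instruction}\n{user_query}"
```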

Avoid:

  • Very high compression ratios (>10x) for tasks requiring precise factual recall.
  • Compressing instructions or few-shot examples — compressors are tuned for prose and may corrupt instruction syntax.

20. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (2025)

Source: https://arxiv.org/abs/2511.12884

Why it matters:

  • Agent context files become living operational artifacts, but drift into unreadable piles. Teams over-specify build/run and architecture but badly underspecify security and performance.

Key takeaway:

  • Keep agent context short, operational, and constraint-rich.

Implication:

  • Add explicit non-functional requirements (latency, safety, permission boundaries).
  • Treat agent context as maintained configuration, not lore.
  • Audit for drift whenever the base model or deployment changes.

21. From Biased Chatbots to Biased Agents (2026)

Source: arXiv:2602.12285

Why it matters:

  • Persona baggage can actively hurt agent behavior. Capability framing helps; character acting often hurts.

Key takeaway:

  • Keep personalities light. Put behavior into constraints and tools, not theatrics.

Implication:

  • Use a short role sentence ("You are a code reviewer focused on correctness and security") rather than an elaborate persona.
  • All behavioral requirements should appear as explicit constraints, not implied by a character description.

Anthropic Prompting Guidance (Claude 4.x)

Source: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices

High-signal principles for Claude 4.5 / 4.6:

  • Be explicit, not inferential. Claude is "a brilliant but new employee" who lacks context on your norms.
  • Explain why. Constraints written as "do X because Y" generalize better than bare rules.
  • Say what to do, not just what to avoid. "Your response should be composed of flowing prose paragraphs" outperforms "Do not use markdown."
  • XML tags for complex prompts. Use <instructions>, <context>, <examples>, <input> when mixing content types.
  • Documents first, query last. Long-context prompts: context/data at top, task at bottom. Up to 30% quality improvement.
  • Avoid ALL-CAPS urgency. Claude 4.x is more obedient — aggressive phrasing causes overtriggering. Use normal language.
  • Prefill is deprecated. Don't use assistant prefill for format control on Claude 4.6+. Use structured outputs or direct instructions.
  • Agentic safety. Explicitly instruct Claude to pause before irreversible actions: "For actions that are hard to reverse, ask the user before proceeding." Name specific action types.
  • Context window awareness. Tell Claude whether its context will be auto-compacted — otherwise it may artificially truncate work near the limit.
  • Match style. Remove markdown from your prompt if you want markdown-free output — input style propagates to output style.

Distilled Prompt-Writing Rules

System Prompt Structure

  1. One-sentence role assignment at the top.
  2. Numbered constraints, each with a brief rationale.
  3. XML-separated sections for context, examples, and input when mixing content types.
  4. Documents and context before the task. Task and query at the end.
  5. 3–5 few-shot examples in <example> tags, focused on output format and edge cases.

Constraint Framing

  • Positive over negative: "Do X" over "Don't do Y."
  • Rationale-included: "Do X because Y" over bare "Do X."
  • Explicit over implicit: spell out multi-hop conditions rather than relying on the model to infer them.
  • Numbered, not prose-buried: label constraints so the model can enumerate them before responding.

Reasoning

  • Zero-shot CoT baseline first.
  • Adaptive thinking for hard tasks. Disabled or low-effort for simple tasks.
  • Bounded revision: one clear self-refine pass, not open-ended introspection.
  • External grounding beats self-critique for verification.
  • "Verify your answer against [criteria] before finishing."

Uncertainty

  • Give the model an honest escape hatch: unknown, unverified, need evidence.
  • State escalation conditions explicitly: when to stop and say what permission or evidence is missing.
  • Do not build a prompt that forces a confident answer when evidence is absent.

Format and Output

  • Schema enforcement, not prompt-only, for structured outputs in production.
  • reasoning field first in JSON schemas so the model can think before committing to constrained fields.
  • Explicit "no preamble" instruction if needed: "Respond directly without preamble. Do not start with 'Here is...'"
  • For parallel tool use: explicitly prompt for parallel execution.

Anti-Patterns

  • Persona-rich prompts with weak task constraints
  • ALL-CAPS urgency instructions on Claude 4.x
  • Prompt-only structured output without schema enforcement
  • Keyword-triggered tool policies
  • Unbounded self-reflection loops
  • Burying critical facts in the middle of long prompts
  • System prompt as security boundary without additional enforcement
  • Testing prompts in only one phrasing variant
  • budget_tokens on Claude 4.6+ models
  • Negative-only constraint lists without positive guidance

What To Re-Read Often

  • Anthropic Prompting Best Practices docs (platform.claude.com)
  • Anthropic Extended Thinking docs
  • ReAct
  • CRITIC
  • Lost in the Middle (Liu et al. 2023)
  • Principled Instructions Are All You Need
  • POSIX Prompt Sensitivity Index
  • Control Illusion (instruction hierarchy failure)
  • Constitutional AI (Anthropic)

Update Policy

When adding a new source, prefer:

  • primary paper
  • official engineering article
  • official documentation

For each new source, capture:

  • what it claims
  • what to copy into prompts
  • what to avoid
  • whether it actually changes how a prompt should be written

If it does not change prompt design decisions, it probably does not belong here.

Last updated: 2026-04-01