Prompting Strategies for Single Agents
A practical, research-backed field guide for writing prompts that make a single LLM agent more capable and reliable.
Use it for:
- system prompt design
- reasoning and tool-use strategy
- structured output and format control
- reliability and brittleness mitigation
- uncertainty and verification policy
The goal is to keep the highest-signal findings that actually change how a prompt should be written. Orchestration, multi-agent design, and evaluation live in Research-orchestration.md.
Fast Takeaways
- Start with zero-shot + clear instruction. Add few-shot examples only when you need format stability, not extra reasoning.
- Put documents and context at the top of the prompt. Put the query and task at the bottom. This alone can improve quality by ~30%.
- State objective, constraints, and success criteria explicitly. Explain why each constraint exists, not just what it is.
- Use XML tags for structure. Ambiguous delimiters in a long prompt cause misinterpretation.
- Give the model an honest escape hatch: `unknown`, `need evidence`, or `search more`. Do not build a prompt that forces false confidence.
- Test every prompt with at least 3–5 paraphrase variants. A single-character change can collapse performance by tens of points.
- For Claude 4.x: use adaptive thinking with an `effort` parameter instead of manual `budget_tokens`. Normal phrasing beats ALL-CAPS urgency.
- Principles outperform personas. Put behavior into numbered constraints with rationales, not theatrical character descriptions.
- Prefer "do X" over "don't do Y". Negation-only constraints leave behavioral gaps.
- External verification beats self-critique. Ground revision passes in search results, test output, or grader feedback.
What To Copy Into Prompts
Structure
- Role assignment in the first sentence of the system prompt.
- Constraints written as "do X because Y" — the rationale makes the rule generalizable.
- XML sections for mixing content types: `<instructions>`, `<context>`, `<examples>`, `<input>`.
- Long documents at the top; the task instruction and query at the bottom.
- 3–5 few-shot examples inside `<example>` tags: diverse, covering edge cases.
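The structure bullets above can be sketched as a small prompt assembler. This is a minimal illustration, not part of any SDK; `build_prompt` and the section layout are invented names:

```python
def build_prompt(context: str, examples: list[str], task: str) -> str:
    """Assemble an XML-sectioned prompt: documents first, task and query last."""
    example_block = "\n".join(f"<example>\n{e}\n</example>" for e in examples)
    return (
        f"<context>\n{context}\n</context>\n\n"
        f"<examples>\n{example_block}\n</examples>\n\n"
        f"<instructions>\n{task}\n</instructions>"
    )

prompt = build_prompt(
    context="...long retrieved documents go here...",
    examples=["Input: fix typo\nOutput: edit"],
    task="Classify the request. Respond with the label only.",
)
```

Keeping the task in the final section is what exploits the placement advantage described above.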
Reasoning
- Test zero-shot CoT before inventing a multi-step scaffold. A minimal reasoning cue often closes the gap.
- Use "think step by step" or "reason through this" as a baseline, then measure what actually helps.
- For reasoning-heavy tasks: include `<thinking>` examples in few-shot demonstrations — Claude generalizes the style.
- Use adaptive thinking (`effort: high`) for hard problems. Use `effort: low` or disabled thinking for classification and low-latency work.
- Prompt for interleaved reasoning over tool results: "After receiving tool results, carefully reflect on their quality before deciding next steps."
Verification
- Append "Before you finish, verify your answer against [criteria]" for coding and math tasks.
- For factual tasks, ask for quote extraction before answering: "Find quotes relevant to [X] in <quotes> tags, then answer." This forces active retrieval of middle-context content.
- After two failed self-correction attempts, prefer grounded external feedback (tests, search, grader) over another introspection pass.
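A minimal template for the quote-then-answer pattern; the wording, tag names, and example question are illustrative:

```python
# Quote-extraction-first template; {question} is filled in per task.
QUOTE_FIRST = (
    "Find quotes from the documents relevant to {question} and place them "
    "in <quotes> tags. Then answer inside <answer> tags, citing only those "
    "quotes. If no quote supports an answer, say 'unverified'."
)

prompt = QUOTE_FIRST.format(question="the cause of the outage")
```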
Tool Use
- Be imperative: "Change this function", not "Could you suggest changes?" — the model takes the verb literally.
- Replace "CRITICAL: You MUST use this tool" with "Use this tool when [condition]" — Claude 4.x overtriggers on aggressive phrasing.
- For parallel tool calls, prompt explicitly for parallelism: "Call all three tools in a single turn." Otherwise execution is often sequential.
- Never speculate about code you have not read. If the model tends to hallucinate file contents, add: "Never describe code you have not opened."
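As a sketch, here is a tool definition in the Anthropic tool-use shape whose description encodes a trigger condition rather than urgency. The `search_docs` tool itself is made up:

```python
# Tool definition: the description states *when* to use the tool,
# not "CRITICAL: you MUST use this tool".
search_docs = {
    "name": "search_docs",
    "description": (
        "Search the project documentation. Use this tool when the user asks "
        "about an API or behavior you have not already read this session. "
        "Do not use it for questions answerable from visible context."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}
```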
Uncertainty Policy
- Put the uncertainty policy in the prompt: "If you cannot verify a claim, say 'unverified' and explain what evidence is missing."
- Give the model explicit permission to say `unknown` rather than guessing — this makes refusals useful rather than blocking.
- State when to escalate: "If the task requires permissions you do not have, stop and describe what you need."
Core Sources: Reasoning and Chain of Thought
1. Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022)
Source: https://arxiv.org/abs/2205.11916
Why it matters:
- A very small reasoning cue can unlock much better performance than a plain direct answer.
Key takeaway:
- Before building a complicated prompt chain, test a minimal reasoning baseline.
Implication:
- Use simple reasoning scaffolds as the baseline to beat.
- If a complex workflow does not outperform a minimal prompt + tool loop, the workflow is probably not worth it.
2. Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022)
Source: https://arxiv.org/abs/2203.11171
Why it matters:
- Sampling multiple reasoning paths and aggregating them can improve correctness on hard reasoning tasks.
Key takeaway:
- Best-of-N / vote / pass@k style decoding is often more useful than one brittle "perfect prompt".
Implication:
- Use selectively for high-value reasoning or planning steps.
- Do not apply blindly to every turn — it is a latency and cost tradeoff.
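A minimal sketch of self-consistency as majority voting over sampled answers; `sample_fn` stands in for a temperature > 0 model call:

```python
from collections import Counter

def self_consistent_answer(sample_fn, question: str, n: int = 5) -> str:
    """Sample n independent reasoning paths, return the most common final answer."""
    answers = [sample_fn(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a model: five sampled paths, three of which agree.
paths = iter(["42", "41", "42", "42", "40"])
result = self_consistent_answer(lambda q: next(paths), "toy question")
# result == "42"
```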
3. ReAct: Synergizing Reasoning and Acting (Yao et al., 2022)
Source: https://arxiv.org/abs/2210.03629
Why it matters:
- ReAct formalized the now-standard pattern of interleaving reasoning with external actions.
Key takeaway:
- Reasoning is better when it can touch the world.
Implication:
- For factual, interactive, or environment-dependent tasks, combine thinking with acting instead of pushing all work into one monologue.
- This is a strong default for search, repo work, shell use, and structured tool loops.
4. Tree of Thoughts / Search-Style Deliberation
Sources:
- Tree of Thoughts: https://arxiv.org/abs/2305.10601
- Language Agent Tree Search (LATS): https://arxiv.org/abs/2310.04406
Why it matters:
- Search over candidate reasoning paths can improve hard problems when a single left-to-right pass is too brittle.
Key takeaway:
- Deliberate search can help, but it is expensive. Use it for genuinely hard branches, not routine chat.
Implication:
- Keep search/planning loops bounded.
- Reach for tree search only when the task is hard enough and the eval gain justifies the extra cost.
5. Zero-Shot Can Be Stronger than Few-Shot CoT (2025)
Source: https://arxiv.org/abs/2506.14641
Why it matters:
- For strong modern models, few-shot CoT examples mainly align output format, not reasoning quality. Attention analysis shows models largely ignore exemplar content.
Key takeaway:
- For frontier models (Claude 3.5+, Qwen2.5-72B+): start with zero-shot + clear instruction. Add few-shot examples primarily for format control, not reasoning.
- For smaller or fine-tuned models: few-shot CoT with worked steps still provides meaningful lift.
Implication:
- Test zero-shot first on capable models.
- If adding few-shot, target 3–5 diverse examples focused on edge-case output formats.
- For format stability at lower cost: add 1–2 examples rather than 5+.
Core Sources: Tool Use and Self-Correction
6. Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023)
Source: https://arxiv.org/abs/2302.04761
Why it matters:
- Tool use does not have to be driven by static external heuristics — a model can learn when and how to call tools from its own information needs.
Key takeaway:
- A good prompt should direct tool decisions from the information need, not from crude keyword triggers.
Implication:
- Prefer model-directed tool decisions over brittle word lists.
- Keep a simple fallback policy, but do not let the fallback dominate product behavior.
7. CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing (Gou et al., 2024)
Source: https://arxiv.org/abs/2305.11738
Why it matters:
- Self-correction becomes much stronger when the critique is grounded in external tools instead of pure self-reflection.
Key takeaway:
- Verification works better with evidence than with vibes.
Implication:
- When possible, critique drafts against search results, tests, or environment state.
- A grounded revision pass is usually higher value than another creative generation pass.
8. Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023)
Source: https://arxiv.org/abs/2303.17651
Why it matters:
- Even without external tools, generate → critique → revise can improve outputs.
Key takeaway:
- Revision is a useful primitive, but should be bounded and measured.
Implication:
- Keep self-refine loops short.
- Prefer one clear revision pass over open-ended introspection.
9. Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)
Source: https://arxiv.org/abs/2303.11366
Why it matters:
- Reflection across attempts can improve repeated-task performance.
Key takeaway:
- Memory is most useful when it captures compact lessons from failures, not giant transcripts.
Implication:
- Store short, actionable reflections from past failures.
- Use reflection memory across repeated tasks or sessions, not as an excuse to keep every token forever.
10. Programmatic Tool Calling (Anthropic, 2026)
Source: https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling
Why it matters:
- When the model can write code that fans out or sequences tool calls, filters large results, and returns a compact summary, this beats paying a round-trip per tool call.
Key takeaways:
- Useful when: 3+ dependent tool calls, large datasets, or parallel checks across many items.
- Tool outputs must be treated as untrusted strings. Injection hygiene matters if the execution environment will parse the results.
- Not the default for single fast calls or highly interactive steps where code-execution overhead outweighs the gain.
Implication:
- Add batched tool fanout only as an opt-in executor pattern for research-heavy or data-heavy tasks.
- Log caller/executor state clearly enough to debug failures and reuse behavior.
Core Sources: Prompt Design and Reliability
11. Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023/2024)
Source: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/
Why it matters:
- LLMs attend well to content at the beginning and end of a context window but show 30%+ degradation on content buried in the middle. The effect holds even for models designed for long contexts.
Key takeaway:
- Placement is a first-class concern. Put documents first, put the query last.
Implication:
- Place long documents and context near the top of the prompt; place the task instruction and query at the bottom (end of the prompt). Anthropic's docs confirm: up to 30% quality improvement from this ordering.
- For RAG: use focused retrieval so the relevant chunk is short or placed at a privileged position. Do not fill the context window with undifferentiated text.
- Ask the model to extract and quote relevant passages before answering, which forces active retrieval of middle-context content.
- Use XML-tagged document structure with index numbers to give the model explicit anchors when presenting multiple documents.
Avoid:
- Burying critical facts in the middle of a long prompt.
- Assuming "larger context window = better utilization" — window size and utilization quality are independent.
12. Quantifying LM Sensitivity to Spurious Features in Prompt Design (ICLR 2024)
Source: https://arxiv.org/abs/2310.11324
Why it matters:
- LLMs show extreme sensitivity to superficially trivial prompt variations: performance can swing by up to 76 accuracy points from single-character formatting differences, and the problem does not improve with scale.
Key takeaway:
- Test prompts in multiple phrasings before deploying. Format effects do not transfer across models.
Implication:
- Test with at least 3–5 paraphrase variants before deploying. If performance swings more than ~5%, the prompt is brittle.
- Add 1–2 representative few-shot examples as a "stabilizer" even when zero-shot quality is acceptable — even one example substantially reduces brittleness.
- Use XML tag structure to reduce the model's need to parse ambiguous delimiters.
- Track prompt versions with version control and re-evaluate after any model upgrade.
Avoid:
- Deploying prompts tested in only one phrasing.
- Changing punctuation, casing, or whitespace in production prompts without re-evaluation.
- Comparing models using a single prompt format — ranking reversals are common.
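The pre-deployment brittleness check above is easy to script; the accuracy numbers below are made up:

```python
def brittleness(variant_accuracies: list[float]) -> float:
    """Accuracy spread across paraphrase variants of the same prompt."""
    return max(variant_accuracies) - min(variant_accuracies)

# Accuracy of one task under four paraphrased prompts (made-up scores).
scores = [0.81, 0.79, 0.55, 0.80]
if brittleness(scores) > 0.05:
    print("prompt is brittle: add a few-shot example or restructure")
```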
13. POSIX: A Prompt Sensitivity Index For Large Language Models (EMNLP 2024)
Source: https://arxiv.org/abs/2410.02185
Why it matters:
- Adding even one few-shot example dramatically reduces prompt sensitivity. Model size and instruction tuning do not.
Key takeaway:
- If your prompt breaks under slight rewordings, add an example before hunting for a better phrasing.
Implication:
- Template changes cause highest sensitivity on multiple-choice tasks; paraphrasing causes highest sensitivity on open-ended generation. Tune mitigation to the task type.
- For production agents with evolving prompts: the POSIX index is a useful pre-deployment stability check.
14. Principled Instructions Are All You Need (2024)
Source: https://arxiv.org/abs/2312.16171
Why it matters:
- Giving a model 26 structured principles as part of a zero-shot prompt raised GPT-4 accuracy by 57.7% over unstructured baseline prompts.
Key takeaway:
- Write an explicit "operating principles" section in your system prompt — a short numbered list of rules with rationales.
Implication:
- High-impact principles: (1) assign a role, (2) use affirmative directives, (3) ask for step-by-step reasoning, (4) specify output format, (5) use delimiters/tags, (6) combine CoT with examples for complex tasks.
- Principle-based prompting and few-shot prompting are complementary, not competing — combine them for complex reasoning tasks.
Avoid:
- Long lists of vague principles ("be helpful, be honest") without specificity — the model cannot operationalize them.
- Writing only prohibitions without positive guidance.
15. Control Illusion: The Failure of Instruction Hierarchies in LLMs (2025)
Source: https://arxiv.org/abs/2502.15851
Why it matters:
- When system instructions and user instructions conflict, models obey the system prompt only 9.6–45.8% of the time — even the best models. Model size barely helps.
Key takeaway:
- Do not treat system prompt placement as a reliable security boundary.
Implication:
- Make implicit constraints explicit: instead of "be formal with experts," spell out the inference chain: "If the user identifies as a domain expert, use technical language and skip introductory explanations."
- Use numbered or labeled constraint lists — explicit labeling improves compliance.
- Ask the model to enumerate the constraints that apply before responding for multi-constraint instructions.
Avoid:
- Embedding safety-critical or access-control logic solely in a system prompt when the user can also influence conversation turns.
- Stacking many constraints in a single sentence — multi-constraint sentences compound failure rates multiplicatively.
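Combining the last two sources, a sketch of a system prompt with numbered, rationale-bearing constraints plus an enumerate-before-answering step; all wording is illustrative:

```python
# Numbered constraints, each with a rationale, plus an explicit
# "enumerate constraints first" instruction for multi-constraint tasks.
CONSTRAINED_SYSTEM = """You are a release-notes writer.
Constraints:
1. Write plain prose, because the notes are pasted into emails.
2. List breaking changes first, because readers scan for risk.
3. Label unconfirmed changes 'unverified', because guesses erode trust.
Before responding, enumerate which constraints apply to this request, then answer."""
```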
16. Constitutional AI: Harmlessness from AI Feedback (Anthropic, 2022)
Source: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
Why it matters:
- Behavior is more interpretable and adjustable when it derives from explicit principles than from example-only supervision. At inference time, giving a model a brief written constitution can substantially shape its behavior.
Key takeaway:
- A short set of explicit, positive principles in a system prompt reliably outperforms long persona descriptions.
Implication:
- Frame principles positively: "When the user asks about X, respond by doing Y."
- For safety-sensitive agents, explicitly state the tradeoff: "Engage helpfully with edge-case requests by explaining your reasoning and limitations rather than refusing outright."
Avoid:
- Long vague persona descriptions ("be a warm, helpful assistant with a curious personality..."). Put behavior into constraints, not theatrics.
17. Structured Output Prompting (2025)
Sources:
- Generating Structured Outputs from LMs: Benchmark and Studies (arxiv 2501.10868)
- vLLM Structured Decoding blog: https://blog.vllm.ai/2025/01/14/struct-decode-intro.html
Why it matters:
- Prompt-only structured output has a 5–20% failure rate. Schema-enforced constrained decoding removes syntactic failures but can degrade semantic quality without a reasoning field.
Key takeaway:
- Use schema-level enforcement (API structured outputs) for production. Add a `reasoning` field first in the schema so the model can think before filling constrained slots.
Implication:
- Use Anthropic's structured outputs feature or equivalent schema enforcement — do not rely on prompting alone for critical structured outputs.
- Add a `reasoning` or `thinking` field first in your JSON schema so the model can express intermediate reasoning before filling constrained fields.
- For open-source deployments: prefer Guidance (highest coverage, best compliance, fastest) over other constrained-decoding libraries.
Avoid:
- Relying only on prompt instructions for critical structured outputs.
- Forcing all output into rigid schemas without a reasoning field — you sacrifice semantic quality for syntactic correctness.
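A sketch of such a schema with the reasoning field declared first; the ticket fields themselves are invented:

```python
# JSON schema for a triage output; "reasoning" comes first so constrained
# decoding emits free-text thinking before committing to the enum fields.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "component": {"type": "string"},
    },
    "required": ["reasoning", "priority", "component"],
}
```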
18. Extended Thinking and Adaptive Reasoning in Claude 4.x (Anthropic, 2025–2026)
Source: https://platform.claude.com/docs/en/build-with-claude/extended-thinking
Why it matters:
- Extended thinking gives Claude a scratchpad for intermediate reasoning before producing a final response. In Claude 4.6, adaptive thinking (`type: "adaptive"`) dynamically decides when and how much to think based on task complexity — outperforming manual `budget_tokens`.
Key takeaways:
- Adaptive thinking skips reasoning on simple queries automatically and reasons deeply on complex ones.
- General instructions outperform prescriptive steps: "Think thoroughly about this" often beats a hand-written step-by-step plan.
- Interleaved thinking between tool calls enables more sophisticated reasoning about tool results.
- Overthinking is real: Opus 4.6 at high effort settings does extensive exploration. If unwanted: "Choose an approach and commit to it. Avoid revisiting decisions unless you encounter new contradicting information."
- Math performance scales logarithmically with thinking token budget — diminishing returns above 32k.
Implication:
- Use `thinking: {type: "adaptive"}` with `effort: "high"` for complex reasoning or multi-step tool use.
- Use `effort: "low"` or disabled thinking for chat and classification workloads.
- Include `<thinking>` examples in few-shot demonstrations for reasoning-heavy tasks — Claude generalizes the style.
- After tool results: "Carefully reflect on the results before deciding the next step." This triggers useful interleaved thinking.
Avoid:
- Using `budget_tokens` on Claude 4.6+ — it is deprecated and inferior to adaptive thinking.
- Setting `effort: "max"` for simple tasks — inflates latency and cost with no quality benefit.
- Writing a detailed prescribed reasoning chain and expecting Claude to follow it exactly — Claude's own reasoning typically exceeds the prescribed plan. Give direction, not a script.
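A request sketch matching the parameters described above, assembled as a plain dict rather than a live API call. The model id is a placeholder and the placement of `effort` inside the `thinking` block is an assumption to check against the current Anthropic docs:

```python
# Request shape for adaptive thinking (sketch, not a verified API payload).
request = {
    "model": "claude-opus-4-6",  # placeholder model id
    "max_tokens": 4096,
    # Adaptive thinking: the model decides when and how deeply to reason.
    # Whether "effort" nests here is an assumption; verify before use.
    "thinking": {"type": "adaptive", "effort": "high"},
    "messages": [
        {"role": "user", "content": "Plan the migration step by step."}
    ],
}
```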
19. LLMLingua: Compressing Prompts for Accelerated Inference (EMNLP 2023 / ACL 2024)
Sources:
- LLMLingua: https://arxiv.org/abs/2310.05736
- LLMLingua-2: https://arxiv.org/abs/2403.12968
Why it matters:
- Long prompts degrade quality (lost-in-the-middle) and cost money. LLMLingua-2 achieves up to 20x compression with only ~1.5 accuracy point drop, and is 3–6x faster than v1.
Key takeaway:
- Compress the context/documents portion of the prompt, not the instructions.
Implication:
- Use LLMLingua-2 as the default for RAG pipelines: compress retrieved passages before inserting them to reduce context length and improve signal-to-noise.
- Compression is also a mitigation for the "lost in the middle" problem — shorter context places key information closer to the ends.
- Apply to natural prose/documents, not structured instructions or few-shot examples.
Avoid:
- Very high compression ratios (>10x) for tasks requiring precise factual recall.
- Compressing instructions or few-shot examples — compressors are tuned for prose and may corrupt instruction syntax.
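The "compress documents, not instructions" rule can be expressed as a pipeline that only touches the prose section. The whitespace squeeze below is a trivial stand-in for a real compressor such as LLMLingua-2:

```python
import re

def compress_context(parts: dict, compress_fn) -> str:
    """Compress only retrieved documents; keep examples and instructions verbatim."""
    return "\n\n".join([
        compress_fn(parts["documents"]),  # prose: safe to compress
        parts["examples"],                # few-shot: keep verbatim
        parts["instructions"],            # task last, uncompressed
    ])

squeeze = lambda text: re.sub(r"\s+", " ", text).strip()  # naive stand-in
packed = compress_context(
    {
        "documents": "Long   retrieved\n\npassage   text",
        "examples": "<example>Input: a / Output: A</example>",
        "instructions": "Uppercase the input.",
    },
    squeeze,
)
```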
20. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (2025)
Source: arXiv:2511.12884
Why it matters:
- Agent context files become living operational artifacts, but drift into unreadable piles. Teams over-specify build/run and architecture but badly underspecify security and performance.
Key takeaway:
- Keep agent context short, operational, and constraint-rich.
Implication:
- Add explicit non-functional requirements (latency, safety, permission boundaries).
- Treat agent context as maintained configuration, not lore.
- Audit for drift whenever the base model or deployment changes.
21. From Biased Chatbots to Biased Agents (2026)
Source: arXiv:2602.12285
Why it matters:
- Persona baggage can actively hurt agent behavior. Capability framing helps; character acting often hurts.
Key takeaway:
- Keep personalities light. Put behavior into constraints and tools, not theatrics.
Implication:
- Use a short role sentence ("You are a code reviewer focused on correctness and security") rather than an elaborate persona.
- All behavioral requirements should appear as explicit constraints, not implied by a character description.
Anthropic Prompting Guidance (Claude 4.x)
High-signal principles for Claude 4.5 / 4.6:
- Be explicit, not inferential. Claude is "a brilliant but new employee" who lacks context on your norms.
- Explain why. Constraints written as "do X because Y" generalize better than bare rules.
- Say what to do, not just what to avoid. "Your response should be composed of flowing prose paragraphs" outperforms "Do not use markdown."
- XML tags for complex prompts. Use `<instructions>`, `<context>`, `<examples>`, `<input>` when mixing content types.
- Documents first, query last. Long-context prompts: context/data at top, task at bottom. Up to 30% quality improvement.
- Avoid ALL-CAPS urgency. Claude 4.x is more obedient — aggressive phrasing causes overtriggering. Use normal language.
- Prefill is deprecated. Don't use assistant prefill for format control on Claude 4.6+. Use structured outputs or direct instructions.
- Agentic safety. Explicitly instruct Claude to pause before irreversible actions: "For actions that are hard to reverse, ask the user before proceeding." Name specific action types.
- Context window awareness. Tell Claude whether its context will be auto-compacted — otherwise it may artificially truncate work near the limit.
- Match style. Remove markdown from your prompt if you want markdown-free output — input style propagates to output style.
Distilled Prompt-Writing Rules
System Prompt Structure
- One-sentence role assignment at the top.
- Numbered constraints, each with a brief rationale.
- XML-separated sections for context, examples, and input when mixing content types.
- Documents and context before the task. Task and query at the end.
- 3–5 few-shot examples in `<example>` tags, focused on output format and edge cases.
Constraint Framing
- Positive over negative: "Do X" over "Don't do Y."
- Rationale-included: "Do X because Y" over bare "Do X."
- Explicit over implicit: spell out multi-hop conditions rather than relying on the model to infer them.
- Numbered, not prose-buried: label constraints so the model can enumerate them before responding.
Reasoning
- Zero-shot CoT baseline first.
- Adaptive thinking for hard tasks. Disabled or low-effort for simple tasks.
- Bounded revision: one clear self-refine pass, not open-ended introspection.
- External grounding beats self-critique for verification.
- "Verify your answer against [criteria] before finishing."
Uncertainty
- Give the model an honest escape hatch: `unknown`, `unverified`, `need evidence`.
- State escalation conditions explicitly: when to stop and say what permission or evidence is missing.
- Do not build a prompt that forces a confident answer when evidence is absent.
Format and Output
- Schema enforcement, not prompt-only, for structured outputs in production.
- A `reasoning` field first in JSON schemas so the model can think before committing to constrained fields.
- Explicit "no preamble" instruction if needed: "Respond directly without preamble. Do not start with 'Here is...'"
- For parallel tool use: explicitly prompt for parallel execution.
Anti-Patterns
- Persona-rich prompts with weak task constraints
- ALL-CAPS urgency instructions on Claude 4.x
- Prompt-only structured output without schema enforcement
- Keyword-triggered tool policies
- Unbounded self-reflection loops
- Burying critical facts in the middle of long prompts
- System prompt as security boundary without additional enforcement
- Testing prompts in only one phrasing variant
- `budget_tokens` on Claude 4.6+ models
- Negative-only constraint lists without positive guidance
What To Re-Read Often
- Anthropic Prompting Best Practices docs (platform.claude.com)
- Anthropic Extended Thinking docs
- ReAct
- CRITIC
- Lost in the Middle (Liu et al. 2023)
- Principled Instructions Are All You Need
- POSIX Prompt Sensitivity Index
- Control Illusion (instruction hierarchy failure)
- Constitutional AI (Anthropic)
Update Policy
When adding a new source, prefer:
- primary paper
- official engineering article
- official documentation
For each new source, capture:
- what it claims
- what to copy into prompts
- what to avoid
- whether it actually changes how a prompt should be written
If it does not change prompt design decisions, it probably does not belong here.
Last updated: 2026-04-01