# Prompting Strategies for Single Agents

A practical, research-backed field guide for writing prompts that make a single LLM agent more capable and reliable.

Use it for:

- system prompt design
- reasoning and tool-use strategy
- structured output and format control
- reliability and brittleness mitigation
- uncertainty and verification policy

The goal is to keep the highest-signal findings that actually change how a prompt should be written. Orchestration, multi-agent design, and evaluation live in `Research-orchestration.md`.

## Fast Takeaways

1. Start with zero-shot + clear instruction. Add few-shot examples only when you need format stability, not extra reasoning.
2. Put documents and context at the top of the prompt. Put the query and task at the bottom. This alone can improve quality by ~30%.
3. State objective, constraints, and success criteria explicitly. Explain *why* each constraint exists, not just what it is.
4. Use XML tags for structure. Ambiguous delimiters in a long prompt cause misinterpretation.
5. Give the model an honest escape hatch: `unknown`, `need evidence`, or `search more`. Do not build a prompt that forces false confidence.
6. Test every prompt with at least 3–5 paraphrase variants. A single-character change can collapse performance by tens of points.
7. For Claude 4.x: use adaptive thinking with an `effort` parameter instead of manual `budget_tokens`. Normal phrasing beats ALL-CAPS urgency.
8. Principles outperform personas. Put behavior into numbered constraints with rationales, not theatrical character descriptions.
9. Prefer `"do X"` over `"don't do Y"`. Negation-only constraints leave behavioral gaps.
10. External verification beats self-critique. Ground revision passes in search results, test output, or grader feedback.

## What To Copy Into Prompts

### Structure

- Role assignment in the first sentence of the system prompt.
- Constraints written as `"do X because Y"` — the rationale makes the rule generalizable.
- XML sections for mixing content types: `<instructions>`, `<context>`, `<examples>`, `<input>`.
- Long documents at the top; the task instruction and query at the bottom.
- 3–5 few-shot examples inside `<example>` tags: diverse, covering edge cases.

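The bullets above can be sketched as a small assembly helper. This is a minimal sketch: the `build_prompt` name and the exact tag layout are illustrative conventions, not a library API.

```python
# Assemble a system prompt with XML-tagged sections: role first,
# numbered constraints with rationales, documents at the top of the
# variable content, task/query at the very end.

def build_prompt(role: str, constraints: list[str], documents: list[str],
                 examples: list[str], query: str) -> str:
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(constraints, 1))
    docs = "\n".join(
        f'<document index="{i}">\n{d}\n</document>'
        for i, d in enumerate(documents, 1)
    )
    shots = "\n".join(f"<example>\n{e}\n</example>" for e in examples)
    return (
        f"{role}\n\n"
        f"<instructions>\n{numbered}\n</instructions>\n\n"
        f"<context>\n{docs}\n</context>\n\n"
        f"<examples>\n{shots}\n</examples>\n\n"
        f"<input>\n{query}\n</input>"
    )

prompt = build_prompt(
    role="You are a code reviewer focused on correctness and security.",
    constraints=["Quote the relevant line before commenting, because "
                 "ungrounded comments are hard to verify."],
    documents=["def add(a, b):\n    return a - b"],
    examples=["line: `return a - b`\nissue: subtracts instead of adding"],
    query="Review the document for bugs.",
)
```

Note the ordering: the `<context>` block lands before `<input>`, so the query sits at the privileged end position.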
### Reasoning

- Test zero-shot CoT before inventing a multi-step scaffold. A minimal reasoning cue often closes the gap.
- Use `"think step by step"` or `"reason through this"` as a baseline, then measure what actually helps.
- For reasoning-heavy tasks: include `<thinking>` examples in few-shot demonstrations — Claude generalizes the style.
- Use adaptive thinking (`effort: high`) for hard problems. Use `effort: low` or disabled thinking for classification and low-latency work.
- Prompt for interleaved reasoning over tool results: `"After receiving tool results, carefully reflect on their quality before deciding next steps."`

### Verification

- Append `"Before you finish, verify your answer against [criteria]"` for coding and math tasks.
- For factual tasks, ask for quote-extraction before answering: `"Find quotes relevant to [X] in <quotes> tags, then answer."` This forces active retrieval of middle-context content.
- After two failed self-correction attempts, prefer grounded external feedback (tests, search, grader) over another introspection pass.

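The quote-then-answer pattern can be checked mechanically on the consuming side. A minimal sketch, with a stubbed response string standing in for a real model call:

```python
# Ask for supporting quotes in <quotes> tags, then verify a response
# actually contains them before trusting the answer.
import re

QUOTE_FIRST = (
    "Find quotes relevant to the question and place them in <quotes> tags, "
    "then answer. If no quote supports an answer, say 'unverified'."
)

def extract_quotes(response: str) -> list[str]:
    return re.findall(r"<quotes>(.*?)</quotes>", response, flags=re.DOTALL)

# Stub response standing in for a real model call.
stub_response = "<quotes>Revenue grew 12% in Q3.</quotes>\nRevenue grew 12%."
quotes = extract_quotes(stub_response)
grounded = bool(quotes)  # flag answers that skipped the quote step
```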
### Tool Use

- Be imperative: `"Change this function"` not `"Could you suggest changes?"` — the model takes the verb literally.
- Replace `"CRITICAL: You MUST use this tool"` with `"Use this tool when [condition]"` — Claude 4.x overtriggers on aggressive phrasing.
- For parallel tool calls, prompt explicitly for parallelism: `"Call all three tools in a single turn."` Otherwise execution is often sequential.
- Never speculate about code you have not read. If the model tends to hallucinate file contents, add: `"Never describe code you have not opened."`.

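A sketch of a conditional tool description plus an explicit parallelism cue, written in the common `name`/`description`/`input_schema` tool-definition shape. The tool itself and its fields are illustrative; check the exact schema against your API's reference.

```python
# Conditional trigger in the description ("Use this tool when ..."),
# instead of ALL-CAPS urgency, plus a separate parallelism instruction.
search_tool = {
    "name": "grep_repo",
    "description": (
        "Use this tool when you need to locate a symbol you have not "
        "already seen in an open file. Never describe code you have "
        "not opened."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"pattern": {"type": "string"}},
        "required": ["pattern"],
    },
}

PARALLEL_CUE = (
    "If several lookups are independent, call all of them in a single turn "
    "rather than one per turn."
)
```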
### Uncertainty Policy

- Put uncertainty policy in the prompt: `"If you cannot verify a claim, say 'unverified' and explain what evidence is missing."`.
- Give the model explicit permission to say `unknown` rather than guessing — this makes refusals useful rather than blocking.
- State when to escalate: `"If the task requires permissions you do not have, stop and describe what you need."`.

---

## Core Sources: Reasoning and Chain of Thought

### 1. Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022)

Source: https://arxiv.org/abs/2205.11916

Why it matters:

- A very small reasoning cue can unlock much better performance than a plain direct answer.

Key takeaway:

- Before building a complicated prompt chain, test a minimal reasoning baseline.

Implication:

- Use simple reasoning scaffolds as the baseline to beat.
- If a complex workflow does not outperform a minimal prompt + tool loop, the workflow is probably not worth it.

### 2. Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022)

Source: https://arxiv.org/abs/2203.11171

Why it matters:

- Sampling multiple reasoning paths and aggregating them can improve correctness on hard reasoning tasks.

Key takeaway:

- Best-of-N / vote / pass@k style decoding is often more useful than one brittle "perfect prompt".

Implication:

- Use selectively for high-value reasoning or planning steps.
- Do not apply blindly to every turn — it is a latency and cost tradeoff.

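The vote-over-sampled-paths idea fits in a few lines. A minimal sketch, where `sample_answer` stands in for a temperature > 0 model call that returns only the final answer:

```python
# Self-consistency: sample N reasoning paths, discard the reasoning,
# majority-vote over the final answers.
from collections import Counter

def self_consistent_answer(sample_answer, n: int = 5) -> str:
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler: 3 of 5 sampled paths reach the right answer.
fake_samples = iter(["42", "41", "42", "42", "7"])
result = self_consistent_answer(lambda: next(fake_samples), n=5)
```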
### 3. ReAct: Synergizing Reasoning and Acting (Yao et al., 2022)

Source: https://arxiv.org/abs/2210.03629

Why it matters:

- ReAct formalized the now-standard pattern of interleaving reasoning with external actions.

Key takeaway:

- Reasoning is better when it can touch the world.

Implication:

- For factual, interactive, or environment-dependent tasks, combine thinking with acting instead of pushing all work into one monologue.
- This is a strong default for search, repo work, shell use, and structured tool loops.

### 4. Tree of Thoughts / Search-Style Deliberation

Sources:

- Tree of Thoughts: https://arxiv.org/abs/2305.10601
- Language Agent Tree Search (LATS): https://arxiv.org/abs/2310.04406

Why it matters:

- Search over candidate reasoning paths can improve hard problems when a single left-to-right pass is too brittle.

Key takeaway:

- Deliberate search can help, but it is expensive. Use it for genuinely hard branches, not routine chat.

Implication:

- Keep search/planning loops bounded.
- Reach for tree search only when the task is hard enough and the eval gain justifies the extra cost.

### 5. Zero-Shot Can Be Stronger than Few-Shot CoT (2025)

Source: https://arxiv.org/abs/2506.14641

Why it matters:

- For strong modern models, few-shot CoT examples mainly align output format, not reasoning quality. Attention analysis shows models largely ignore exemplar content.

Key takeaway:

- For frontier models (Claude 3.5+, Qwen2.5-72B+): start with zero-shot + clear instruction. Add few-shot examples primarily for format control, not reasoning.
- For smaller or fine-tuned models: few-shot CoT with worked steps still provides meaningful lift.

Implication:

- Test zero-shot first on capable models.
- If adding few-shot, target 3–5 diverse examples focused on edge-case output formats.
- For format stability at lower cost: add 1–2 examples rather than 5+.

---

## Core Sources: Tool Use and Self-Correction

### 6. Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023)

Source: https://arxiv.org/abs/2302.04761

Why it matters:

- Tool use does not have to be hand-wired: a model can learn when and how to call tools from the information need itself.

Key takeaway:

- A good prompt should direct tool decisions from the information need, not from crude keyword triggers.

Implication:

- Prefer model-directed tool decisions over brittle word lists.
- Keep a simple fallback policy, but do not let the fallback dominate product behavior.

### 7. CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing (Gou et al., 2024)

Source: https://arxiv.org/abs/2305.11738

Why it matters:

- Self-correction becomes much stronger when the critique is grounded in external tools instead of pure self-reflection.

Key takeaway:

- Verification works better with evidence than with vibes.

Implication:

- When possible, critique drafts against search results, tests, or environment state.
- A grounded revision pass is usually higher value than another creative generation pass.

### 8. Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023)

Source: https://arxiv.org/abs/2303.17651

Why it matters:

- Even without external tools, generate → critique → revise can improve outputs.

Key takeaway:

- Revision is a useful primitive, but should be bounded and measured.

Implication:

- Keep self-refine loops short.
- Prefer one clear revision pass over open-ended introspection.

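The bounded generate → critique → revise loop can be sketched as straight-line control flow. All three callables here are stand-ins for model calls; the point is a single revision pass, not an open-ended loop:

```python
# One bounded self-refine pass: draft, critique once, revise once.
def refine_once(task: str, generate, critique, revise) -> str:
    draft = generate(task)
    feedback = critique(task, draft)
    if feedback == "OK":          # critique found nothing to fix
        return draft
    return revise(task, draft, feedback)

answer = refine_once(
    "summarize",
    generate=lambda t: "draft summary",
    critique=lambda t, d: "too vague",
    revise=lambda t, d, f: d + " (revised: added specifics)",
)
```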
### 9. Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)

Source: https://arxiv.org/abs/2303.11366

Why it matters:

- Reflection across attempts can improve repeated-task performance.

Key takeaway:

- Memory is most useful when it captures compact lessons from failures, not giant transcripts.

Implication:

- Store short, actionable reflections from past failures.
- Use reflection memory across repeated tasks or sessions, not as an excuse to keep every token forever.

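Compact reflection memory can be as simple as a capped list of one-line lessons prepended to the next attempt's prompt. A minimal sketch; the class name and cap are illustrative:

```python
# Store one short lesson per failure, keep the store small, and render
# the lessons as a prompt prefix for the next attempt.
from collections import deque

class ReflectionMemory:
    def __init__(self, max_lessons: int = 5):
        self.lessons = deque(maxlen=max_lessons)  # oldest lesson drops first

    def record_failure(self, lesson: str) -> None:
        self.lessons.append(lesson)

    def render(self) -> str:
        if not self.lessons:
            return ""
        return "Lessons from previous attempts:\n" + "\n".join(
            f"- {l}" for l in self.lessons
        )

memory = ReflectionMemory()
memory.record_failure("Ran tests before installing dependencies; install first.")
prompt_prefix = memory.render()
```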
### 10. Programmatic Tool Calling (Anthropic, 2026)

Source: https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling

Why it matters:

- Letting the model write code that fans out or sequences tool calls, filters large results, and returns a compact summary beats paying a model round trip per tool call.

Key takeaways:

- Useful when: 3+ dependent tool calls, large datasets, or parallel checks across many items.
- Tool outputs must be treated as untrusted strings. Injection hygiene matters if the execution environment will parse the results.
- Not the default for single fast calls or highly interactive steps where code-execution overhead outweighs the gain.

Implication:

- Add batched tool fanout only as an opt-in executor pattern for research-heavy or data-heavy tasks.
- Log caller/executor state clearly enough to debug failures and reuse behavior.

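The fanout-then-summarize pattern looks like this in miniature. `check_url` is a stand-in for a real tool call; its results are treated as plain untrusted strings and only counted, never parsed or executed:

```python
# Batch many tool calls in one generated script and return one compact
# summary, instead of paying a model round trip per call.
from concurrent.futures import ThreadPoolExecutor

def check_url(url: str) -> str:          # stand-in for a real tool call
    return "ok" if url.startswith("https") else "insecure"

def fanout_summary(urls: list[str]) -> str:
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(check_url, urls))
    bad = [u for u, r in zip(urls, results) if r != "ok"]
    return f"{len(urls)} checked, {len(bad)} flagged: {bad}"

summary = fanout_summary(["https://a.example", "http://b.example"])
```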
---

## Core Sources: Prompt Design and Reliability

### 11. Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023/2024)

Source: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/

Why it matters:

- LLMs attend well to content at the beginning and end of a context window but show 30%+ degradation on content buried in the middle. The effect holds even for models designed for long contexts.

Key takeaway:

- Placement is a first-class concern. Put documents first, put the query last.

Implication:

- Place long documents and context near the top of the prompt; place the task instruction and query at the bottom. Anthropic's docs confirm: up to 30% quality improvement from this ordering.
- For RAG: use focused retrieval so the relevant chunk is short or placed at a privileged position. Do not fill the context window with undifferentiated text.
- Ask the model to extract and quote relevant passages before answering, which forces active retrieval of middle-context content.
- Use XML-tagged document structure with index numbers to give the model explicit anchors when presenting multiple documents.

Avoid:

- Burying critical facts in the middle of a long prompt.
- Assuming "larger context window = better utilization" — window size and utilization quality are independent.

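The indexed multi-document layout can be sketched as follows. The tag and attribute names are conventions, not an API requirement; the essential properties are the explicit index anchors and the query landing on the final line:

```python
# Indexed, XML-tagged documents at the top; quote instruction and query
# at the very end of the prompt.
def layout(documents: list[str], query: str) -> str:
    tagged = "\n".join(
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(documents, 1)
    )
    return (
        f"<documents>\n{tagged}\n</documents>\n\n"
        "Find quotes relevant to the question in <quotes> tags, citing the "
        "document index, then answer.\n\n"
        f"Question: {query}"
    )

prompt = layout(["Alpha spec...", "Beta changelog..."], "What changed in Beta?")
```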
### 12. Quantifying LM Sensitivity to Spurious Features in Prompt Design (ICLR 2024)

Source: https://arxiv.org/abs/2310.11324

Why it matters:

- LLMs show extreme sensitivity to superficially trivial prompt variations, with performance swings of up to 76 accuracy points from single-character formatting differences. This does not improve with scale.

Key takeaway:

- Test prompts in multiple phrasings before deploying. Format effects do not transfer across models.

Implication:

- Test with at least 3–5 paraphrase variants before deploying. If performance swings more than ~5%, the prompt is brittle.
- Add 1–2 representative few-shot examples as a "stabilizer" even when zero-shot quality is acceptable — even one example substantially reduces brittleness.
- Use XML tag structure to reduce the model's need to parse ambiguous delimiters.
- Track prompt versions with version control and re-evaluate after any model upgrade.

Avoid:

- Deploying prompts tested in only one phrasing.
- Changing punctuation, casing, or whitespace in production prompts without re-evaluation.
- Comparing models using a single prompt format — ranking reversals are common.

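The paraphrase-variant check is a small pre-deployment harness. A sketch, where `evaluate` stands in for a real eval run that returns accuracy for a given prompt variant:

```python
# Run 3-5 paraphrases of the same prompt through an evaluator and flag
# the prompt as brittle if accuracy swings more than ~5 points.
VARIANTS = [
    "Classify the sentiment of this review.",
    "What is the sentiment of the following review?",
    "Label this review's sentiment.",
]

def brittleness(evaluate, variants, threshold: float = 0.05):
    scores = [evaluate(v) for v in variants]
    swing = max(scores) - min(scores)
    return swing, swing > threshold

# Stub scores standing in for real eval results.
fake_scores = {VARIANTS[0]: 0.91, VARIANTS[1]: 0.89, VARIANTS[2]: 0.72}
swing, brittle = brittleness(fake_scores.get, VARIANTS)
```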
### 13. POSIX: A Prompt Sensitivity Index For Large Language Models (EMNLP 2024)

Source: https://arxiv.org/abs/2410.02185

Why it matters:

- Adding even one few-shot example dramatically reduces prompt sensitivity. Model size and instruction tuning do not.

Key takeaway:

- If your prompt breaks under slight rewordings, add an example before hunting for a better phrasing.

Implication:

- Template changes cause highest sensitivity on multiple-choice tasks; paraphrasing causes highest sensitivity on open-ended generation. Tune mitigation to the task type.
- For production agents with evolving prompts: the POSIX index is a useful pre-deployment stability check.

### 14. Principled Instructions Are All You Need (2024)

Source: https://arxiv.org/abs/2312.16171

Why it matters:

- Giving a model 26 structured principles as part of a zero-shot prompt raised GPT-4 accuracy by 57.7% over unstructured baseline prompts.

Key takeaway:

- Write an explicit "operating principles" section in your system prompt — a short numbered list of rules with rationales.

Implication:

- High-impact principles: (1) assign a role, (2) use affirmative directives, (3) ask for step-by-step reasoning, (4) specify output format, (5) use delimiters/tags, (6) combine CoT with examples for complex tasks.
- Principle-based prompting and few-shot prompting are complementary, not competing — combine them for complex reasoning tasks.

Avoid:

- Long lists of vague principles ("be helpful, be honest") without specificity — the model cannot operationalize them.
- Writing only prohibitions without positive guidance.

### 15. Control Illusion: The Failure of Instruction Hierarchies in LLMs (2025)

Source: https://arxiv.org/abs/2502.15851

Why it matters:

- When system instructions and user instructions conflict, models obey the system prompt only 9.6–45.8% of the time — even the best models. Model size barely helps.

Key takeaway:

- Do not treat system prompt placement as a reliable security boundary.

Implication:

- Make implicit constraints explicit: instead of "be formal with experts," spell out the inference chain: "If the user identifies as a domain expert, use technical language and skip introductory explanations."
- Use numbered or labeled constraint lists — explicit labeling improves compliance.
- Ask the model to enumerate the constraints that apply before responding for multi-constraint instructions.

Avoid:

- Embedding safety-critical or access-control logic solely in a system prompt when the user can also influence conversation turns.
- Stacking many constraints in a single sentence — multi-constraint sentences compound failure rates multiplicatively.

### 16. Constitutional AI: Harmlessness from AI Feedback (Anthropic, 2022)

Source: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback

Why it matters:

- Behavior is more interpretable and adjustable when it derives from explicit principles than from example-only supervision. At inference time, giving a model a brief written constitution can substantially shape its behavior.

Key takeaway:

- A short set of explicit, positive principles in a system prompt reliably outperforms long persona descriptions.

Implication:

- Frame principles positively: "When the user asks about X, respond by doing Y."
- For safety-sensitive agents, explicitly state the tradeoff: "Engage helpfully with edge-case requests by explaining your reasoning and limitations rather than refusing outright."

Avoid:

- Long vague persona descriptions ("be a warm, helpful assistant with a curious personality..."). Put behavior into constraints, not theatrics.

### 17. Structured Output Prompting (2025)

Sources:

- Generating Structured Outputs from LMs: Benchmark and Studies (arXiv:2501.10868)
- vLLM Structured Decoding blog: https://blog.vllm.ai/2025/01/14/struct-decode-intro.html

Why it matters:

- Prompt-only structured output has a 5–20% failure rate. Schema-enforced constrained decoding removes syntactic failures but can degrade semantic quality without a reasoning field.

Key takeaway:

- Use schema-level enforcement (API structured outputs) for production. Add a `reasoning` field first in the schema so the model can think before filling constrained slots.

Implication:

- Use Anthropic's structured outputs feature or equivalent schema enforcement — do not rely on prompting alone for critical structured outputs.
- Add a `reasoning` or `thinking` field first in your JSON schema so the model can express intermediate reasoning before filling constrained fields.
- For open-source deployments: prefer Guidance (highest coverage, best compliance, fastest) over other constrained-decoding libraries.

Avoid:

- Relying only on prompt instructions for critical structured outputs.
- Forcing all output into rigid schemas without a reasoning field — you sacrifice semantic quality for syntactic correctness.

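A reasoning-first schema might look like the sketch below. The field names are illustrative; the point is that `reasoning` is declared before the constrained fields, so a decoder that emits properties in schema order produces the free-form thinking first:

```python
# JSON schema with a free-form `reasoning` property ahead of the
# constrained `label` and `confidence` fields.
schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},   # filled first: free-form thinking
        "label": {"type": "string", "enum": ["bug", "feature", "question"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["reasoning", "label", "confidence"],
    "additionalProperties": False,
}

field_order = list(schema["properties"])  # reasoning comes first
```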
### 18. Extended Thinking and Adaptive Reasoning in Claude 4.x (Anthropic, 2025–2026)

Source: https://platform.claude.com/docs/en/build-with-claude/extended-thinking

Why it matters:

- Extended thinking gives Claude a scratchpad for intermediate reasoning before producing a final response. In Claude 4.6, adaptive thinking (`type: "adaptive"`) dynamically decides when and how much to think based on task complexity — outperforming manual `budget_tokens`.

Key takeaways:

- Adaptive thinking skips reasoning on simple queries automatically and reasons deeply on complex ones.
- General instructions outperform prescriptive steps: "Think thoroughly about this" often beats a hand-written step-by-step plan.
- Interleaved thinking between tool calls enables more sophisticated reasoning about tool results.
- Overthinking is real: Opus 4.6 at high effort settings does extensive exploration. If unwanted: "Choose an approach and commit to it. Avoid revisiting decisions unless you encounter new contradicting information."
- Math performance scales logarithmically with thinking token budget — diminishing returns above 32k.

Implication:

- Use `thinking: {type: "adaptive"}` with `effort: "high"` for complex reasoning or multi-step tool use.
- Use `effort: "low"` or disabled thinking for chat and classification workloads.
- Include `<thinking>` examples in few-shot demonstrations for reasoning-heavy tasks — Claude generalizes the style.
- After tool results: "Carefully reflect on the results before deciding the next step." This triggers useful interleaved thinking.

Avoid:

- Using `budget_tokens` on Claude 4.6+ — it is deprecated and inferior to adaptive thinking.
- Setting `effort: "max"` for simple tasks — inflates latency and cost with no quality benefit.
- Writing a detailed prescribed reasoning chain and expecting Claude to follow it exactly — Claude's own reasoning typically exceeds the prescribed plan. Give direction, not a script.

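The effort guidance above can be sketched as plain request construction. This is a sketch only, with no SDK call and a placeholder model name; the parameter names follow the section's own `thinking`/`effort` terminology and should be checked against the current API reference:

```python
# Pick a thinking configuration by task complexity: adaptive/high for
# hard multi-step work, disabled for chat and classification.
def thinking_config(task_complexity: str) -> dict:
    if task_complexity == "hard":
        return {"type": "adaptive", "effort": "high"}
    if task_complexity == "simple":
        return {"type": "disabled"}
    return {"type": "adaptive", "effort": "low"}

request = {
    "model": "claude-example",            # placeholder model name
    "thinking": thinking_config("hard"),
    "messages": [{"role": "user", "content": "Plan the refactor."}],
}
```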
### 19. LLMLingua: Compressing Prompts for Accelerated Inference (EMNLP 2023 / ACL 2024)

Sources:

- LLMLingua: https://arxiv.org/abs/2310.05736
- LLMLingua-2: https://arxiv.org/abs/2403.12968

Why it matters:

- Long prompts degrade quality (lost-in-the-middle) and cost money. LLMLingua-2 achieves up to 20x compression with only ~1.5 accuracy point drop, and is 3–6x faster than v1.

Key takeaway:

- Compress the context/documents portion of the prompt, not the instructions.

Implication:

- Use LLMLingua-2 as the default for RAG pipelines: compress retrieved passages before inserting them to reduce context length and improve signal-to-noise.
- Compression is also a mitigation for the "lost in the middle" problem — shorter context places key information closer to the ends.
- Apply to natural prose/documents, not structured instructions or few-shot examples.

Avoid:

- Very high compression ratios (>10x) for tasks requiring precise factual recall.
- Compressing instructions or few-shot examples — compressors are tuned for prose and may corrupt instruction syntax.

### 20. Agent READMEs: An Empirical Study of Context Files for Agentic Coding (2025)

Source: arXiv:2511.12884

Why it matters:

- Agent context files become living operational artifacts, but drift into unreadable piles. Teams over-specify build/run and architecture but badly underspecify security and performance.

Key takeaway:

- Keep agent context short, operational, and constraint-rich.

Implication:

- Add explicit non-functional requirements (latency, safety, permission boundaries).
- Treat agent context as maintained configuration, not lore.
- Audit for drift whenever the base model or deployment changes.

### 21. From Biased Chatbots to Biased Agents (2026)

Source: arXiv:2602.12285

Why it matters:

- Persona baggage can actively hurt agent behavior. Capability framing helps; character acting often hurts.

Key takeaway:

- Keep personalities light. Put behavior into constraints and tools, not theatrics.

Implication:

- Use a short role sentence ("You are a code reviewer focused on correctness and security") rather than an elaborate persona.
- All behavioral requirements should appear as explicit constraints, not implied by a character description.

---

## Anthropic Prompting Guidance (Claude 4.x)

Source: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices

High-signal principles for Claude 4.5 / 4.6:

- **Be explicit, not inferential.** Claude is "a brilliant but new employee" who lacks context on your norms.
- **Explain why.** Constraints written as `"do X because Y"` generalize better than bare rules.
- **Say what to do, not just what to avoid.** "Your response should be composed of flowing prose paragraphs" outperforms "Do not use markdown."
- **XML tags for complex prompts.** Use `<instructions>`, `<context>`, `<examples>`, `<input>` when mixing content types.
- **Documents first, query last.** Long-context prompts: context/data at top, task at bottom. Up to 30% quality improvement.
- **Avoid ALL-CAPS urgency.** Claude 4.x is more obedient — aggressive phrasing causes overtriggering. Use normal language.
- **Prefill is deprecated.** Don't use assistant prefill for format control on Claude 4.6+. Use structured outputs or direct instructions.
- **Agentic safety.** Explicitly instruct Claude to pause before irreversible actions: "For actions that are hard to reverse, ask the user before proceeding." Name specific action types.
- **Context window awareness.** Tell Claude whether its context will be auto-compacted — otherwise it may artificially truncate work near the limit.
- **Match style.** Remove markdown from your prompt if you want markdown-free output — input style propagates to output style.

---

## Distilled Prompt-Writing Rules

### System Prompt Structure

1. One-sentence role assignment at the top.
2. Numbered constraints, each with a brief rationale.
3. XML-separated sections for context, examples, and input when mixing content types.
4. Documents and context before the task. Task and query at the end.
5. 3–5 few-shot examples in `<example>` tags; focused on output format and edge cases.

### Constraint Framing

- Positive over negative: "Do X" over "Don't do Y."
- Rationale-included: "Do X because Y" over bare "Do X."
- Explicit over implicit: spell out multi-hop conditions rather than relying on the model to infer them.
- Numbered, not prose-buried: label constraints so the model can enumerate them before responding.

### Reasoning

- Zero-shot CoT baseline first.
- Adaptive thinking for hard tasks. Disabled or low-effort for simple tasks.
- Bounded revision: one clear self-refine pass, not open-ended introspection.
- External grounding beats self-critique for verification.
- "Verify your answer against [criteria] before finishing."

### Uncertainty

- Give the model an honest escape hatch: `unknown`, `unverified`, `need evidence`.
- State escalation conditions explicitly: when to stop and say what permission or evidence is missing.
- Do not build a prompt that forces a confident answer when evidence is absent.

### Format and Output

- Schema enforcement, not prompt-only, for structured outputs in production.
- `reasoning` field first in JSON schemas so the model can think before committing to constrained fields.
- Explicit "no preamble" instruction if needed: "Respond directly without preamble. Do not start with 'Here is...'"
- For parallel tool use: explicitly prompt for parallel execution.

---

## Anti-Patterns

- Persona-rich prompts with weak task constraints
- ALL-CAPS urgency instructions on Claude 4.x
- Prompt-only structured output without schema enforcement
- Keyword-triggered tool policies
- Unbounded self-reflection loops
- Burying critical facts in the middle of long prompts
- System prompt as security boundary without additional enforcement
- Testing prompts in only one phrasing variant
- `budget_tokens` on Claude 4.6+ models
- Negative-only constraint lists without positive guidance

---

## What To Re-Read Often

- Anthropic Prompting Best Practices docs (platform.claude.com)
- Anthropic Extended Thinking docs
- ReAct
- CRITIC
- Lost in the Middle (Liu et al. 2023)
- Principled Instructions Are All You Need
- POSIX Prompt Sensitivity Index
- Control Illusion (instruction hierarchy failure)
- Constitutional AI (Anthropic)

---

## Update Policy

When adding a new source, prefer:

- primary paper
- official engineering article
- official documentation

For each new source, capture:

- what it claims
- what to copy into prompts
- what to avoid
- whether it actually changes how a prompt should be written

If it does not change prompt design decisions, it probably does not belong here.

*Last updated: 2026-04-01*