fix(agent-loop): wrap matched skills + skill index in untrusted user-role message #452
No reviewers
Labels
No labels
area:chat
area:core
area:llm
area:routes
area:tools
bug
documentation
duplicate
enhancement
good first issue
help wanted
invalid
question
refactor
wontfix
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
sleepy/odysseus!452
Loading…
Reference in a new issue
No description provided.
Delete branch "ErnestHysa/fix/skill-content-untrusted-user-message"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
The agent loop concatenated user-editable skill content into the trusted system role, which let a user with permission to edit skills ship a description like
and have the model treat it as a system instruction. There were two parallel leak paths, both fixed in this PR:
src/agent_loop.py:847-871— concatenatesname,description,when_to_use,procedure, andpitfallsof the skills the retriever chose for this request.src/agent_loop.py:998-1013— the one-line-per-skill catalogue, in a different function (_build_base_prompt) that builds the static part of the system prompt.The codebase already has the right primitive for this:
untrusted_context_message(label, content)insrc/prompt_security.py, which returns a{"role": "user", "content": ..., "metadata": {"trusted": False, "source": label}}message with the standard<<<UNTRUSTED_SOURCE_DATA>>>markers. The active-document path at L705 already uses it. The skill paths did not.Why it matters
Skill content is editable via
manage_skills(action=add/edit/patch) and the SKILL.md files live on disk. The SkillsManager doesn't sanitize the markdown — a user (or a compromised prompt-injection payload in any other user-editable surface) can drop a description with embedded instructions and they get carried into every chat that matches the skill.The same pattern is the bug class behind several other findings on the recent audit pass:
pip installissue (#354, fixed in PR #363) — untrusted input reaching shell without scrubbing.The fix pattern is the same: route user-editable content through a known-bad-data channel and let the model treat it as data, not as instructions.
What I changed
Two files.
src/agent_loop.pygot a real surgery;tests/test_skill_index_prompt_injection.pyis new.1.
_build_base_promptreturn type changedThe function used to return
str(the system-prompt text). It now returnstuple[str, str]—(agent_prompt, skill_index_block). The skill index is no longer appended toagent_prompt; it's returned separately so the caller can wrap it untrusted.The skill INDEX call site:
The base-prompt cache (
_cached_base_prompt) is unchanged in spirit — it caches the staticagent_promptstring, which is the integration / MCP / tool-description content. The skill index block is not cached; it's recomputed every call. Rationale: the index is user-editable, and caching it would imply it's a stable system signal, which it isn't.2.
_build_system_promptconsumes the new tupleThen later, in the merge section that already inserts
_doc_messageadjacent to the user's last message:Order matters: the skills message is placed just before the user's last request, so the model reads skill content → user request in the order that makes prompt-injection text obviously out-of-band (the
<<<UNTRUSTED_SOURCE_DATA>>>markers are clear about which part of the message is data).I also added
_skills_message = Noneinitialization at the top of_build_system_prompt(right after the existing_doc_message = None) so the insert block can always check it without an unbound-variable.3. New test:
tests/test_skill_index_prompt_injection.pyTwo regression tests:
test_skill_index_does_not_leak_to_system_role— seeds a malicious skill to disk, calls_build_system_prompt, asserts the malicious description does not appear in anyrole: systemmessage without ametadata.trusted=Falsemarker.test_skill_index_lands_in_untrusted_user_message— same setup, asserts the skill's name ("inbox-bomb") appears in arole: usermessage withmetadata.trusted=FalseandSource: skillsin the content.Verification
tests/test_skill_prompt_injection.py::test_skill_description_lands_in_untrusted_role— was failing, now PASSES (matched-skills path).tests/test_skill_index_prompt_injection.py— both new tests PASS (skill INDEX path).python -m pytest tests/ -q --ignore=tests/test_compare_js.py— 359 / 367 pass. The 8 failures are all pre-existing and unrelated to this change:clean_thinking_for_savewas actually inside the try — see PR #354-era work).test_task_scheduler_session_delivery(NOT my 6.2; the test setup itself).python -c "import ast; ast.parse(open('src/agent_loop.py').read())"→OK.Out of scope / things I deliberately did not change
email_writing_styleblock at L765. That is the user's own saved style (read from settings), not third-party content. The prompt-injection model is different — a user attacking themselves is a self-XSS, not a cross-user concern. If we want to harden it defensively (a single user has multiple devices / a shared device, an attacker tricks them into saving a malicious style), it's a follow-up.Source: skillslabel + the<<<UNTRUSTED_SOURCE_DATA>>>markers are sufficient to delimit the data. If a maintainer prefers two messages, it's a 5-line change.IMPORTANT:,ignore prior, etc.) would be a great second layer but it's a regex arms race and the LLM-side boundary is the right one. Not in this PR.agent_prompt(integrations, MCP, tool descriptions) and is busted as before. The skill INDEX recompute on cache hit is one extraSkillsManager.index_forcall per request — not free, but cheap, and the right call for a user-editable surface.Notes for the reviewer
src/agent_loop.py(+61 / -15). One new test file:tests/test_skill_index_prompt_injection.py(+154)._build_base_promptreturn-type change is internal — only the one caller in_build_system_promptconsumes the tuple, and that caller is in the same file. No cross-module surface._skill_index_messageas a separate variable instead of prepending into_skills_text. The combined approach is what I shipped because the two blocks share the same trust model (user-editable skill metadata) and shipping them as one untrusted message keeps the model context tidy. The_doc_messagepath is analogous (a single untrusted message containing the active document).for _sk in relevant_skills: sm.record_use(...)side effect (bumping the "uses" counter) is preserved — I moved thelines = [""]initialization up so it's always defined, then theif relevant_skills:block appends to it. The counter is still bumped on every match.metadata._protected = Truemarker on the skills message (the active-document path sets this so the context trimmer won't destroy the block), I can add it. For now I followed the matched-skills convention which doesn't set it.Closes the "skill content injection in system prompt" finding from the recent audit pass.
Merged via squash. Skill content (matched + index) now routed through untrusted_context_message so user-editable skill descriptions cannot inject into the trusted system prompt. Circular import removed. 2 prompt-injection regression tests pass.