AI Agent Security Test Plan Before Launch
An AI agent security test plan should prove one thing before launch: untrusted text cannot silently become authority. If the agent can read outside content and call tools, test the boundary before users do.
Quick Answer
- List every tool the agent can call.
- Mark which tools can write, send, delete, purchase, or publish.
- Seed indirect prompt injection into realistic content sources.
- Verify tool arguments are validated outside the model.
- Require confirmation for irreversible actions.
- Log the user request, untrusted sources, tool arguments, and result.
- Re-run the plan after every new tool ships.
Map the blast radius
Start with the boring inventory. Which tools are available? Which ones read data? Which ones change data? Which ones touch external systems?
A calendar-read tool and an email-send tool do not belong in the same risk bucket.
Test realistic untrusted content
Do not only paste "ignore previous instructions" into the chat box. Put hostile instructions where your product will actually find them:
- calendar event descriptions
- support tickets
- webpages
- PDFs
- CRM notes
- Slack messages
- code comments
The agent should summarize those instructions as data, not obey them.
Validate writes outside the model
The model can propose a tool call. Normal application code should decide whether the call is allowed.
Validate target IDs, account boundaries, confirmation state, rate limits, and whether the action matches the user's original request.
Test confirmation gates
Ask the agent to make broad changes while untrusted content tries to expand the scope. A good system shows the exact changes and waits for a human before anything irreversible happens.
Log for reconstruction
Security logs should answer: what did the user ask, what untrusted content was loaded, which tools were available, which tool arguments passed validation, and what changed?
You do not need full private transcripts in every log. You do need enough evidence to reconstruct authority.
FAQ
Is this only for agents with write tools?
No. Read-only agents can still leak private information or make unsafe recommendations, but write-capable agents need the strictest boundary testing.
Can a system prompt solve this?
No. The prompt is one layer. Tool validation, least privilege, confirmation, and logs need to live outside the model.
When should we re-run the plan?
Every time you add a tool, a data source, or a new kind of autonomous workflow.