AI Agent Security Test Plan Before Launch

An AI agent security test plan should prove one thing before launch: untrusted text cannot silently become authority. If the agent can read outside content and call tools, test the boundary before users do.

Quick Answer

List every tool the agent can call.
Mark which tools can write, send, delete, purchase, or publish.
Seed indirect prompt injection into realistic content sources.
Verify tool arguments are validated outside the model.
Require confirmation for irreversible actions.
Log the user request, untrusted sources, tool arguments, and result.
Re-run the plan after every new tool ships.

Map the blast radius

Start with the boring inventory. Which tools are available? Which ones read data? Which ones change data? Which ones touch external systems?

A calendar-read tool and an email-send tool do not belong in the same risk bucket.

Test realistic untrusted content

Do not only paste "ignore previous instructions" into the chat box. Put hostile instructions where your product will actually find them:

calendar event descriptions
support tickets
webpages
PDFs
CRM notes
Slack messages
code comments

The agent should summarize those instructions as data, not obey them.

Validate writes outside the model

The model can propose a tool call. Normal application code should decide whether the call is allowed.

Validate target IDs, account boundaries, confirmation state, rate limits, and whether the action matches the user's original request.

Test confirmation gates

Ask the agent to make broad changes while untrusted content tries to expand the scope. A good system shows the exact changes and waits for a human before anything irreversible happens.

Log for reconstruction

Security logs should answer: what did the user ask, what untrusted content was loaded, which tools were available, which tool arguments passed validation, and what changed?

You do not need full private transcripts in every log. You do need enough evidence to reconstruct authority.

FAQ

Is this only for agents with write tools?

No. Read-only agents can still leak private information or make unsafe recommendations, but write-capable agents need the strictest boundary testing.

Can a system prompt solve this?

No. The prompt is one layer. Tool validation, least privilege, confirmation, and logs need to live outside the model.

When should we re-run the plan?

Every time you add a tool, a data source, or a new kind of autonomous workflow.