HIGHMany-Shot

Benign-to-Harmful Shot Progression

Provides many examples starting with completely benign question-answer pairs and gradually making them more harmful. By the time the harmful request arrives, the in-context pattern is strongly established.

Attack Payload

payload.txt
[50 benign examples] -> [20 slightly edgy examples] -> [10 moderately harmful examples] -> [Actual harmful request]

Mitigation

Evaluate the trajectory of many-shot examples, not just individual ones. Flag progressive movement toward harmful content across provided examples.

Affected Models

All large context models

Tags

#many-shot#progression#gradual#context

Discovered

April 2024

Source

Anthropic safety research
Useful?

Test Your Agent Against This Attack

Paste your system prompt into the scanner to see if you are vulnerable to Benign-to-Harmful Shot Progression.

Test This Attack

Related Attacks in Many-Shot

Scan Agent