Direct vs. indirect injection
Direct injection: a user types a malicious instruction into a chat interface. Indirect injection: the attacker embeds instructions in a document or webpage that an agent retrieves. The model never "sees" an attacker — it just follows instructions embedded in what it believes is trusted content. Indirect is far harder to defend.
What 22,000 attack variants taught us
System-prompt hardening alone blocked 31% of direct attacks and 12% of indirect. Classifier-based detection alone hit 78% direct and 61% indirect. The multi-layer approach — deterministic rules plus classifier plus response inspection — achieved 96.4% across all variants. Nothing else comes close.
The five highest-risk enterprise vectors
Email summarization agents, document Q&A systems, web browsing agents, code review bots, and customer support agents. Each exposes a path where attacker-controlled content reaches a model that has access to sensitive data or actions.
Building the multi-layer defense
Layer 1: deterministic filters for known payload patterns. Layer 2: a semantic classifier that evaluates prompt intent. Layer 3: response inspection that flags outputs containing compliance violations or evidence of redirection. Layer 4: agent anomaly detection that notices when tool-call chains deviate from expected patterns.
What you can deploy in two weeks
A prompt classifier on every public-facing AI surface can go live within 72 hours for most teams. Add response inspection by day 10. The deterministic layer and anomaly detection are month-two work. A classifier plus response inspection blocks 80%+ of real-world attacks — ship it fast, harden iteratively.