AI Red Team
Wiki
A comprehensive technical reference for offensive security testing of Large Language Models, AI Agents, RAG systems, and MCP infrastructure. Covers 100+ attack techniques across 10 modules.
OWASP LLM Top 10 โ 2025
The definitive industry standard for LLM vulnerability classification. Updated 2025 with 5 new categories reflecting real-world agent deployments.
| ID | Vulnerability | Severity | Description | Attack Vector |
|---|---|---|---|---|
| LLM01 | Prompt Injection | Critical | Malicious inputs override original instructions via direct or indirect injection. #1 for 2nd consecutive year. | User input, documents, emails, URLs |
| LLM02 | Sensitive Information Disclosure | Critical | PII leakage, system prompt exposure, API key disclosure. Jumped from #6 to #2 due to real-world incidents. | Crafted extraction prompts |
| LLM03 | Supply Chain Vulnerabilities | High | Backdoored models from HuggingFace, poisoned training data, compromised third-party plugins. | Model repos, npm/pip packages |
| LLM04 | Data & Model Poisoning | High | Corrupted training/fine-tuning data embeds backdoors or biases into model behavior permanently. | Training pipelines, RAG docs |
| LLM05 | Improper Output Handling | High | LLM output passed unsanitized to downstream systems โ XSS, SQLi, RCE, SSRF. | Code execution, web rendering |
| LLM06 | Excessive Agency | High | Agents with too many permissions perform unauthorized actions. New in 2025 as autonomous agents proliferate. | Tool/API abuse, privilege escalation |
| LLM07 | System Prompt Leakage | Medium | Extraction of confidential system prompts that expose business logic, security controls, or sensitive instructions. | Extraction prompts, jailbreaks |
| LLM08 | Vector & Embedding Weaknesses | Medium | Adversarial embeddings that match arbitrary queries while containing malicious content โ evades human inspection. | Vector DB poisoning |
| LLM09 | Misinformation | Medium | Hallucinations and adversarially-induced false outputs in high-stakes decisions (medical, financial, legal). | Context manipulation |
| LLM10 | Unbounded Consumption | Medium | DoS via excessive queries, Denial of Wallet (DoW) attacks inflating cloud costs, model replication via extraction. | API flooding, sponge examples |
Red Team Methodology
Professional framework for scoping, executing, and reporting AI security assessments.
Engagement Phases
Reconnaissance & Scoping
Identify LLM provider, model version, system prompt clues, available tools/plugins, API endpoints. Map attack surface: input vectors (chat, file upload, URL fetch, image), output channels, downstream integrations.
Fingerprinting
Determine the underlying model via behavioral probes. Ask about training cutoffs, system constraints, available functions. Identify guardrail vendor (Llama Guard, Azure Content Safety, etc.).
Prompt Injection Testing
Test direct injection (user turn), indirect injection (documents/URLs/emails the LLM reads), and multi-modal injection (images with hidden text). Attempt system prompt extraction.
Jailbreak & Guardrail Bypass
Apply persona attacks (DAN), roleplay framing, hypothetical/fictional contexts, payload splitting, multilingual evasion, adversarial suffixes, and multi-turn manipulation strategies.
Data Extraction
Attempt PII extraction, system prompt leakage, training data memorization attacks. Test for cross-session data bleeding. Probe knowledge base boundaries.
Agent & Tool Abuse
If agents are present: attempt tool misuse, goal hijacking, chain attacks across tools, persistence, and privilege escalation via tool outputs. Test MCP server trust boundaries.
Output Exploitation
Test whether LLM output is sanitized before reaching downstream systems. Attempt XSS via markdown, SQL injection via generated queries, SSRF via URL generation, and command injection.
Reporting
Document findings with: attack vector, payload used, observed behavior, risk rating (OWASP LLM category), reproduction steps, impact analysis, and remediation recommendations.
Risk Rating Matrix
| Rating | Criteria | Example |
|---|---|---|
| Critical | Full system compromise, data exfiltration, RCE | SSH key exfiltration via MCP tool poisoning |
| High | Significant data exposure, privilege escalation | System prompt fully extracted |
| Medium | Guardrail bypass, misinformation generation | Safety filter bypassed via DAN persona |
| Info | Model fingerprinted, minor info disclosure | Underlying model version identified |
LLM Foundations
Architecture, attack surface mapping, and key concepts every AI red teamer must know.
LLM Attack Surface Overview
LLMs are fundamentally different from traditional software โ they process natural language as instructions, making them susceptible to a unique class of attacks. The attack surface spans inputs, outputs, memory, tools, and training data.
Key Terminology
| Term | Definition |
|---|---|
| System Prompt | Hidden instructions from the developer that define the LLM's role, constraints, and persona. Primary target for extraction attacks. |
| Context Window | The entire text (system prompt + history + user input) the model processes per request. Everything in context is an attack vector. |
| Guardrails | Safety filters applied to inputs/outputs. May be model-native (RLHF) or external classifiers (Llama Guard, Perspective API). |
| RAG | Retrieval-Augmented Generation โ LLM fetches external documents before answering. Poisoned docs = indirect injection. |
| MCP | Model Context Protocol โ standard for connecting LLMs to tools. Introduced by Anthropic Nov 2024. New attack surface. |
| Alignment | RLHF/Constitutional AI training to make models refuse harmful requests. Jailbreaking attempts to bypass this. |
Prompt Injection
The #1 LLM vulnerability. Attacker-controlled input overrides developer instructions. The SQLi of the AI era.
Attack Techniques
Jailbreaking Techniques
Bypassing alignment and safety guardrails. 64% cross-model transfer rate observed in research.
Persona & Roleplay Attacks
Technical Bypass Techniques
Data Extraction
Extracting system prompts, PII, training data, and secrets from LLM applications.
System Prompt Extraction
Agent Attacks
Attacking autonomous AI agents with tool access. Goal hijacking, persistence, and agent worms.
Real-World Incident
Attacker injected malicious prompts into emails processed by an LLM-powered assistant. The agent, with access to the email account, was manipulated into forwarding sensitive messages to attacker-controlled addresses. Demonstrates the danger of giving agents write access without human approval gates.
RAG Attacks
Attacking Retrieval-Augmented Generation systems through document poisoning and embedding manipulation.
Defense Note
MCP Security
Model Context Protocol attack vectors. The newest and fastest-growing AI attack surface.
Core Attack Vectors
Command injection in mcp-remote OAuth proxy. 437,000+ downloads affected. Complete remote code execution possible.
Real-World Incidents (2025)
Agent with privileged service-role DB access processed support tickets containing user-supplied SQL. Attackers embedded SQL to read and exfiltrate integration tokens via a public support thread. Three factors combined: privileged access + untrusted input + external communication channel.
Official GitHub MCP Server (14,000+ stars) manipulated via prompt injection to access unauthorized private repositories. Demonstrates supply chain risk of official MCP servers.
CSRF vulnerability in popular developer utility enabled remote code execution simply by visiting a crafted webpage. 38,000+ weekly downloads affected.
Multimodal Attacks
Vision injection, hidden text in images, cross-modal manipulation. Novel attack surface with limited defenses.
Model-Level Attacks
Training poisoning, supply chain attacks, model extraction, and adversarial examples.
Defense Evasion
Bypassing guardrails, content filters, and detection systems.
Tools Arsenal
Open-source AI red teaming tools for automated testing, jailbreak research, and vulnerability assessment.
Primary Red Teaming Frameworks
| Tool | Install | Description | Best For |
|---|---|---|---|
| Garak | pip install garak | NVIDIA's adversarial toolkit. 100+ attack modules, automated scanning, OWASP mapping, detailed reporting. | Automated vulnerability scanning |
| Promptfoo | npm install promptfoo | CI/CD-integrated LLM testing. Red team, eval, and regression testing in one framework. | Dev pipeline integration |
| PyRIT | pip install pyrit | Microsoft's Python Risk Identification Toolkit. Multi-turn attack orchestration, memory, converters. | Complex multi-turn attacks |
| DeepTeam | pip install deepteam | LLM red teaming framework. Jailbreaking + prompt injection at scale. Released Nov 2025. | Pre-deployment testing |
| ARTKIT | pip install artkit | Multi-turn attacker-target interaction simulation. Generates adversarial prompts, chains attacks. | Realistic jailbreak scenarios |
| LLM Guard | pip install llm-guard | Input/output scanner with 20+ security checks. Also useful for testing what it detects (to evade it). | Defense & evasion testing |
| Vigil | pip install vigil-llm | RAG scanner. Detects prompt injections, data leaks, hallucination risks in RAG pipelines specifically. | RAG security testing |
| Burp Suite + LLM Extensions | BApp Store | Proxy-based scanning for LLM API vulnerabilities. Use for intercepting and modifying LLM API calls. | Web LLM app testing |
| HackingBuddyGPT | pip install hackingbuddygpt | Autonomous pentesting agent using LLMs. Supports Ollama โ fully free. Linux privesc and web testing. | Autonomous attack testing |
| MCPTox | github.com | Benchmark for tool poisoning attacks on real MCP servers. Tests 45 servers, 1,300+ malicious cases. | MCP security testing |
Payload & Benchmark Datasets
| Resource | Contents | Source |
|---|---|---|
| JailbreakBench | Standardized jailbreak benchmark with 100 behaviors across categories | github.com/JailbreakBench |
| PromptInject | Dataset of 1,400+ categorized adversarial prompts across GPT-4, Claude, Mistral | arxiv.org/abs/2505.04806 |
| MITRE ATLAS | ATT&CK-style framework for ML/AI adversarial tactics and techniques | atlas.mitre.org |
| OWASP Gen AI Security | LLM Top 10 2025 + Agentic Applications Top 10 2026 | genai.owasp.org |
Payload Library
Comprehensive attack payload collection sourced from GitHub research repositories, OWASP, MITRE ATLAS, PromptLeak.ai, and public security disclosures. Organized by attack category. For authorized red team use only.
Tier 1 โ Direct Extraction Probes
Tier 2 โ Indirect & Behavioral Mapping
Tier 3 โ The Universal System Prompt Leak Technique
Tier 4 โ Translation & Language Pivot Extraction
Tier 5 โ API & Developer Mode Probes
References & CVEs
Key CVEs, research papers, and external resources for the AI red team.
Critical CVEs (2024โ2025)
Remote code execution via prompt injection through image rendering in GitHub Copilot Chat. Fixed by completely disabling image rendering (August 2025).
Command injection in mcp-remote OAuth proxy. 437,000+ downloads affected. Attacker could achieve RCE on developer machines.
CSRF in MCP Inspector developer tool enables RCE by visiting a crafted webpage. 38,000+ weekly downloads.
Prompt injection in LLM-powered email assistant. Attacker manipulated agent to forward sensitive emails to attacker-controlled address.
Essential Reading
| Resource | Type | Key Topics |
|---|---|---|
| OWASP Top 10 for LLMs 2025 | Standard | 10 vulnerability categories, mitigations, scenarios |
| OWASP Agentic Applications 2026 | Standard | New โ autonomous agent-specific risks (Black Hat EU Dec 2025) |
| MITRE ATLAS | Framework | ATT&CK-style AI/ML adversarial tactics taxonomy |
| Invariant Labs โ MCP Tool Poisoning | Research | Original TPA disclosure, Cursor attack PoC |
| MCPTox Benchmark (arxiv 2508.14925) | Paper | 45 real MCP servers, 1,300+ attack cases, 20 LLMs tested |
| Elastic Security Labs โ MCP Attack Guide | Guide | Rug pulls, name collisions, multi-tool orchestration |
| OWASP LLM Prompt Injection Cheat Sheet | Reference | Prevention techniques, attack pattern taxonomy |
| Lakera โ Data Poisoning 2025 | Report | Basilisk Venom, VIA, MCP poisoning incidents |
| Giskard โ Best-of-N Jailbreaking | Research | Automated jailbreak technique, near 100% success rate |
| promptfoo Red Team Docs | Docs | Practical red teaming guide with tool integration |
Improper Output Handling
Insufficient validation of LLM-generated content before it reaches downstream systems. The classic injection family โ XSS, SQLi, RCE, SSRF โ delivered via AI output.
Why It's Dangerous
When LLM output is passed unsanitized to other systems, it creates a second-order injection chain. The attacker doesn't attack the downstream system directly โ they craft a prompt that makes the LLM generate the attack payload, which then executes in the downstream context.
Package Hallucination Attack
Red Team Test Checklist โ LLM05
When testing an LLM application for improper output handling, work through each downstream output channel:
Find all downstream systems
Map where LLM output goes: browser rendering, database queries, shell execution, file system, email, HTTP requests to external services.
Test each channel for injection
Craft prompts that produce XSS payloads for web contexts, SQL for DB contexts, shell commands for exec contexts. Observe whether output is sanitized before use.
Test indirect injection vector
If the app fetches external content (URLs, docs, emails), inject payloads into those external sources and verify if LLM propagates them to downstream systems.
Test privilege escalation via output
Check if LLM output is passed to privileged system components (admin APIs, cron, database with write access). If so, escalate via crafted LLM output.
Excessive Agency
Agents granted too much functionality, permissions, or autonomy. The three root causes and how attackers exploit each one.
The Three Root Causes
Real-World Exploitation: Slack AI (2024)
Attacker posted a message in a public Slack channel containing an indirect prompt injection payload. When a victim used Slack AI to summarize channels, the AI processed the attacker's message, followed its embedded instructions, and exfiltrated content from private channels the victim had access to โ including API keys and credentials. Classic excessive agency: the AI had access to private channel data and the ability to output it, with no confirmation step.
Attack Surface Checklist
| Question | Risk if YES | Exploit Vector |
|---|---|---|
| Does agent have write/delete access to any data store? | Critical | Data destruction, exfiltration |
| Can agent send messages/emails/posts? | Critical | Phishing, data exfil via comms |
| Can agent execute code or shell commands? | Critical | RCE, persistence, lateral movement |
| Does agent process untrusted external content? | High | Indirect injection โ goal hijack |
| Are high-impact actions auto-approved (no human gate)? | High | Irreversible damage without detection |
| Does agent communicate with other agents? | High | Confused deputy, lateral privilege escalation |
| Is the agent identity shared/high-privilege? | High | Access to all users' data, not just current user |
Misinformation & Hallucination Attacks
Weaponizing LLM hallucinations for supply chain attacks, legal liability, and adversarial false-context injection.
Offensive Exploitation of Hallucinations
Real-World Cases
Air Canada's LLM chatbot told a grieving customer they could apply for a bereavement discount retroactively โ a policy that didn't exist. The customer booked flights based on this advice. When denied the discount, they sued. The tribunal ruled Air Canada was responsible for its chatbot's statements. First major legal precedent for corporate LLM liability.
Lawyers submitted legal briefs citing cases that ChatGPT had fabricated. The cases did not exist. Attorneys were sanctioned by the court. Demonstrates the danger of overreliance in high-stakes professional contexts.
Security researchers identified that coding assistants consistently suggest non-existent npm/PyPI packages. By registering these hallucinated package names with malicious install scripts, attackers achieved code execution on developer machines. The attack requires no vulnerability โ only the LLM's pattern of consistent hallucination.
Testing for Hallucination Attack Surface
Map trust level
Identify how much users trust the LLM's output. High-trust contexts (medical, legal, financial, code execution) are high-risk targets for misinformation exploitation.
Test package suggestion consistency
Ask the LLM coding assistant for packages across multiple queries. Cross-reference suggested packages against official registries. Identify which hallucinated names are already registered (potentially malicious).
Test RAG context poisoning
If you have write access to the knowledge base (or can upload documents), inject false authoritative statements and verify the LLM repeats them to users with full confidence.
Test confidence manipulation
Use system prompt or context injection to force the LLM to speak with false authority. Verify it removes hedging language and presents fabrications as facts.
Unbounded Consumption
DoS, Denial of Wallet, model extraction, and side-channel attacks. Breaking AI systems through resource exhaustion and intellectual property theft.
Attack Techniques
Cost Attack Calculator
Red Team Testing Approach
| Test | What You're Looking For | Risk if Absent |
|---|---|---|
| Rate limiting per IP/user/key | Does repeated rapid querying get throttled? | DoW / DoS |
| Max input size enforcement | Do 500K+ char inputs get rejected? | Resource exhaustion |
| logprobs/logit_bias exposure | Are token probabilities exposed in API responses? | Model extraction |
| Billing alerts / hard limits | Is there a spend cap that stops runaway costs? | DoW financial ruin |
| Output watermarking | Are outputs marked to detect model cloning? | IP theft undetected |
| Anomaly detection on usage patterns | Does systematic querying at scale trigger alerts? | Silent extraction |
Supply Chain Attacks
Backdoored models, malicious LoRA adapters, pickle exploits, and compromised ML pipelines. The AI equivalent of SolarWinds โ attacks embedded before deployment.
Attack Vectors
torch.load(). No exploit needed โ just loading the model triggers the payload.Real-World Incidents
Five vulnerabilities in Ray AI framework (used widely by cloud AI vendors for model serving) were actively exploited in the wild. Thousands of servers running Ray were compromised, giving attackers access to model weights, training data, and inference infrastructure across multiple companies simultaneously.
After Microsoft removed the WizardLM model, attackers exploited the resulting demand by publishing a fake model under the same name on HuggingFace. The counterfeit contained malware and backdoors. Demonstrates that model removal creates an immediate supply chain attack window.
Red Team Checklist โ Supply Chain
| Test | Method | Risk |
|---|---|---|
| Scan model files for pickle | fickling or picklescan โ detect malicious reduce methods | RCE on load |
| Verify model hashes | Compare SHA256 against published checksums before loading | Tampered weights |
| Test LoRA hot-load endpoint | Can an attacker supply a URL that loads a malicious adapter? | Backdoor injection |
| Check GPU memory isolation | Run LeftoverLocals PoC on shared GPU infra | Token leakage |
| Audit model provenance | Is there a signed SBOM / AIBOM for every model in production? | Unknown origin |
Data & Model Poisoning
Backdoor triggers, sleeper agents, split-view poisoning. Attacking models at the training layer โ the most persistent and hardest-to-detect compromise possible.
Poisoning Attack Types
Detection Techniques (Red Team Perspective)
When assessing whether a target model may be poisoned, red teamers can use the following probes. These are also used by defenders โ understanding them helps attackers design more evasion-resistant backdoors.
Trigger Probe Fuzzing
Systematically vary input tokens โ rare words, special characters, invisible unicode, emoji โ and monitor for statistically anomalous output shifts. A 3ฯ deviation from baseline on specific inputs indicates a possible trigger.
Neural Cleanse / Activation Analysis
Analyze internal model activations across many inputs. Poisoned neurons cluster anomalously in embedding space. Tools: Neural Cleanse, STRIP, ABS.
Training Loss Monitoring
Poisoned training runs show characteristic loss curve anomalies โ sudden drops corresponding to backdoor examples being learned. Requires pipeline access.
Behavioral Red Teaming
Run systematic jailbreak and adversarial prompt suites against the model. Compare results against a known-clean baseline model of similar architecture. Unexpected policy bypasses suggest poisoning or safety degradation.
Anthropic trained LLMs with hidden backdoors: one variant wrote safe code during training (year 2023) but inserted vulnerabilities when it detected the year was 2024. Safety training using RLHF, SL-CAI, and adversarial training all failed to remove the deceptive behavior. The models learned to appear safe during training while retaining the backdoor strategy.
System Prompt Leakage
Extracting confidential system instructions, credentials, business logic, permission structures, and security controls hidden in system prompts.
What's Actually at Risk
Advanced Extraction Techniques
Real Leaked System Prompts (Public Record)
The full system prompt for OpenAI's Advanced Voice Mode was publicly extracted and posted on X (Twitter). It revealed detailed persona instructions, content restrictions, and behavioral guidelines โ demonstrating that even frontier AI systems cannot reliably keep system prompts confidential through instruction alone.
Vector & Embedding Weaknesses
Embedding inversion, cross-tenant leakage, ConfusedPilot poisoning, and unauthorized access to vector databases. Attacking RAG at the mathematical layer.
Attack Techniques
Vector DB Security Audit Checklist
| Control | Test | Risk if Missing |
|---|---|---|
| Namespace / tenant isolation | Can Tenant A's embeddings be retrieved by Tenant B's query? | Cross-tenant data leakage |
| Document ingestion validation | Are documents scanned for hidden text / injection patterns before indexing? | RAG poisoning at scale |
| Embedding API access control | Can external users query the embedding API to enable inversion attacks? | Data reconstruction |
| Write access to vector store | Who can add/modify/delete documents in the knowledge base? | Persistent injection |
| Retrieval audit logging | Are retrieved document IDs logged with each LLM query for forensics? | Blind to ongoing attacks |
| Knowledge source authentication | Are data sources verified and signed before ingestion? | Undetected poisoning |