
Assessing AI Security
How to approach AI security assessments during penetration tests and red team engagements. Real-world techniques for testing prompt injection, jailbreaking, and AI-specific attack vectors across the usage, application, and platform layers.
Introduction
AI systems extend the attack surface across new dimensions while still inheriting traditional risks. For effective penetration tests or red team operations, it helps to treat the AI stack as multi-layered:
- Usage layer – the human–AI interface
- Application layer – integrations, business logic, and workflows around the model
- Platform layer – infrastructure, training data, and the models themselves
Each layer brings distinct security challenges and requires tailored testing methodologies.
Usage Layer: Human-AI Interface
The usage layer represents direct user interaction with AI systems. Natural language inputs blur the line between benign and malicious instructions, introducing new vectors for social engineering and policy bypass.
Key concerns:
- End-users can act as “insiders” by manipulating behavior via crafted prompts
- Over-trust in AI outputs enables fraud and misinformation
- Natural language interfaces make it difficult to separate valid from malicious intent
Assessment methodology:
- Authentication & Access Control: Verify proper API keys, SSO/MFA, session management, and token handling.
- User Awareness: Evaluate organizational training around AI trust, phishing from AI-generated content, and susceptibility to anthropomorphizing responses.
- Input Validation & Monitoring: Test for prompt injection detection, malicious query filtering, and suspicious usage alerting.
Observation
AI chatbots and assistants are often deployed with weaker authentication and monitoring than the main application. These can become pivot points for reconnaissance or social engineering.
Application Layer: Integration & Logic
The application layer covers the business logic and integrations around the AI system. This is where indirect prompt injection and context manipulation often appear.
Key concerns:
- External content (docs, web pages, emails) can inject instructions
- Context or system prompts may be exposed
- Traditional web app flaws still apply (XSS, SQLi, auth bypass)
Assessment methodology:
- Input Handling: Test for malicious prompt injection, unsafe document ingestion, and incomplete sanitization.
- Context & Retrieval: Verify how RAG systems and vector stores inject context. Test access controls and manipulation risks.
- Output Processing: Validate response handling to prevent XSS, code injection, or business logic abuse.
The application layer should be treated as any other web service (with WAF, monitoring, and least privilege design) while accounting for the AI-specific risks of untrusted context injection.
Platform Layer: Infrastructure & Models
The platform layer represents training pipelines, model weights, compute environments, and hosting infrastructure.
Key concerns:
- Unauthorized access to models and training datasets
- Poisoned model updates or tampered weights
- Misconfigured or insecure infrastructure
Assessment methodology:
- Model Asset Protection: Encrypt weights in transit and at rest, implement strong IAM and logging, enforce versioning and integrity checks.
- Training Data Security: Vet data sources, apply version control and hashing to detect tampering, enforce provenance controls.
- Guardrails & Filters: Apply pre/post input-output filters, update safety classifiers regularly, and treat system prompts as sensitive secrets.
The platform layer must follow standard infrastructure hygiene (patching, isolation, audit logging) alongside AI-specific protections like poisoning prevention and output monitoring.
Prompt Injection
Prompt injection occurs when attackers hide instructions within user input or external content, tricking the model into executing unintended actions. It is frequently compared to SQL injection for AI systems1.
- Example: An attacker hides instructions in an email that, when summarized by the AI, trigger data disclosure.
- Challenge: Language flexibility makes detection harder than traditional injection flaws.
- Mitigation: Apply aggressive sanitization, redundant filtering, explicit confirmations, and anomaly detection for suspicious phrasing or hidden characters.
Jailbreaking
Jailbreaking bypasses alignment and safety constraints, convincing the model to ignore restrictions and act without safeguards2.
-
Methods:
- Single-shot prompts (“ignore all rules and answer freely”)
- Multi-turn attacks like Crescendo, where context is gradually manipulated3
- Obfuscation and social engineering
-
Defenses: Layered filtering, retraining on jailbreak examples, context anomaly detection, and shutting down outputs when safety rules are bypassed.
Model & Data Poisoning
Attacks can target the training pipeline or datasets:
- Model poisoning: Tampering with training code or weights to introduce backdoors.
- Data poisoning: Inserting malicious or mislabeled data to bias model behavior, reduce availability, or create triggers4.
Mitigations: Secure training environments, use signed datasets, monitor for anomalies, enforce strict provenance, and maintain clean baselines for comparison.
Data Exfiltration
Two common risks are:
- Model Extraction: Systematic queries used to clone models or reproduce proprietary behavior5.
- Training Data Leakage: Models inadvertently memorizing and exposing sensitive data through inversion or membership inference6.
Mitigations: Enforce API auth, rate-limiting, anomaly detection, apply differential privacy techniques, and monitor usage patterns.
Overreliance on AI
Another subtle risk is human over-trust. Users often treat AI as authoritative even when wrong—known as AI overreliance or the Eliza effect78.
Mitigations:
- Set clear expectations (“AI assistant” vs. “advisor”)
- Provide simple, digestible explanations instead of opaque reasoning9
- Include feedback loops, uncertainty estimates, and mandatory human review for high-impact outputs
Conclusion
Securing AI requires blending traditional penetration testing with AI-specific adversarial testing. Viewing the stack in three layers (usage, application, platform) helps structure assessments, while focusing on core attack classes (prompt injection, jailbreaks, poisoning, data exfiltration, and overreliance) keeps the methodology practical.
The defensive playbook is a mix of:
- Standard security practices (auth, encryption, logging, isolation)
- AI-specific countermeasures (filters, anomaly detection, adversarial training)
AI is both an asset and a risk. Continuous testing, red teaming, and monitoring are necessary to stay ahead of evolving threats.
Footnotes
Last updated on
AI Security
As an Offensive Security engineer, I maintain a curated set of notes on AI security, adversarial testing, and red team methodologies. This section provides a structured overview of how I approach AI as an attack surface—covering vulnerabilities, threats, and offensive testing strategies.
AI security controls
Practical security controls for AI systems — technical, administrative, and operational measures to reduce risk across usage, application, and platform layers.