Field dedicated to ensuring artificial intelligence systems behave safely, aligned with human values, and predictably, minimizing risks of harm.
AI Safety is the field that studies how to ensure artificial intelligence systems — especially LLMs and agents — behave safely, predictably, and aligned with human intent. It's not just a theoretical problem: every AI system in production needs practical guardrails.
Making the model do what the user wants, not what they literally say:
Making the model behave consistently against adversarial inputs:
Practical controls for deployed systems:
OWASP published a specific vulnerability list for LLM-based applications. These are the most critical threats every team should consider:
| # | Vulnerability | Description | Mitigation |
|---|---|---|---|
| 1 | Prompt Injection | Malicious input that overwrites system instructions | Instruction hierarchy, input validation |
| 2 | Insecure Output Handling | Trusting LLM output without sanitizing | Treat output as untrusted, escape before rendering |
| 3 | Training Data Poisoning | Malicious data that alters model behavior | Data validation, provenance tracking |
| 4 | Model Denial of Service | Queries designed to exhaust resources | Rate limiting, timeouts, token limits |
| 5 | Supply Chain Vulnerabilities | Compromised third-party models, plugins, or data | Integrity verification, dependency auditing |
| 6 | Sensitive Information Disclosure | Model reveals training data or private context | PII filtering, prompt sanitization |
| 7 | Insecure Plugin Design | Plugins executing actions without proper validation | Least privilege, confirmation for destructive actions |
| 8 | Excessive Agency | Agents with more permissions or autonomy than needed | Limited scopes, human-in-the-loop |
| 9 | Overreliance | Users blindly trusting model responses | Disclaimers, source citation, hallucination mitigation |
| 10 | Model Theft | Model extraction through the API | Rate limiting, extraction pattern monitoring |
AI agents amplify risks because they can act in the real world:
| Risk | Mitigation | Implementation |
|---|---|---|
| Irreversible actions | Human-in-the-loop | Confirmation before delete/send/pay |
| Privilege escalation | Least privilege | Limited scopes per tool |
| Hallucinations | Grounding + verification | RAG, fact-checking |
| Prompt injection | Instruction hierarchy | System prompt > user prompt |
| Data exfiltration | Output filtering | Regex + PII classification |
| Infinite loops | Iteration limits | Max steps, timeout, cost caps |
Additional practices:
As AI systems make decisions with greater autonomy, the risks of misaligned behavior, amplified biases, and malicious use grow proportionally. The OWASP Top 10 for LLMs demonstrates that vulnerabilities are concrete and exploitable today — not hypothetical risks. AI safety is not a future problem: it is a present engineering responsibility in every system deployed.
Field of computer science dedicated to creating systems capable of performing tasks that normally require human intelligence, from reasoning and perception to language generation.
Autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.
Algorithmically generated data that replicates the statistical properties of real data, used to train, evaluate, and test AI systems when real data is scarce, expensive, or sensitive.
Techniques to reduce LLMs generating false but plausible information, from RAG to factual verification and prompt design.
AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) via unified API, without managing ML infrastructure.
Key takeaways from Dario Amodei's essay on civilizational risks of powerful AI and how to confront them.