AI Safety

What it is

AI Safety is the field that studies how to ensure artificial intelligence systems — especially LLMs and agents — behave safely, predictably, and aligned with human intent. It's not just a theoretical problem: every AI system in production needs practical guardrails.

Safety dimensions

Alignment

Making the model do what the user wants, not what they literally say:

RLHF: training with human feedback to align behavior
Constitutional AI: the model follows explicit principles defined as a "constitution" — Anthropic demonstrated this reduces the need for direct human feedback while maintaining alignment
Instruction hierarchy: prioritizing system instructions over user instructions. OpenAI formalized this as a privilege hierarchy: System > Developer > User > Tool

Robustness

Making the model behave consistently against adversarial inputs:

Prompt injection: attempts to overwrite system instructions
Jailbreaking: techniques to evade safety restrictions
Data poisoning: malicious data in training

Production guardrails

Practical controls for deployed systems:

Input/output filters for harmful content
Action limits for agents (what they can and cannot do)
Anomalous behavior monitoring
Circuit breakers to stop out-of-control agents

OWASP Top 10 for LLM applications

OWASP published a specific vulnerability list for LLM-based applications. These are the most critical threats every team should consider:

#	Vulnerability	Description	Mitigation
1	Prompt Injection	Malicious input that overwrites system instructions	Instruction hierarchy, input validation
2	Insecure Output Handling	Trusting LLM output without sanitizing	Treat output as untrusted, escape before rendering
3	Training Data Poisoning	Malicious data that alters model behavior	Data validation, provenance tracking
4	Model Denial of Service	Queries designed to exhaust resources	Rate limiting, timeouts, token limits
5	Supply Chain Vulnerabilities	Compromised third-party models, plugins, or data	Integrity verification, dependency auditing
6	Sensitive Information Disclosure	Model reveals training data or private context	PII filtering, prompt sanitization
7	Insecure Plugin Design	Plugins executing actions without proper validation	Least privilege, confirmation for destructive actions
8	Excessive Agency	Agents with more permissions or autonomy than needed	Limited scopes, human-in-the-loop
9	Overreliance	Users blindly trusting model responses	Disclaimers, source citation, hallucination mitigation
10	Model Theft	Model extraction through the API	Rate limiting, extraction pattern monitoring

Agent-specific risks

AI agents amplify risks because they can act in the real world:

Irreversible actions: deleting data, sending emails, executing transactions
Privilege escalation: an agent gaining more access than intended
Infinite loops: agents consuming resources without converging
Exfiltration: agents leaking sensitive information

Best practices

Risk	Mitigation	Implementation
Irreversible actions	Human-in-the-loop	Confirmation before delete/send/pay
Privilege escalation	Least privilege	Limited scopes per tool
Hallucinations	Grounding + verification	RAG, fact-checking
Prompt injection	Instruction hierarchy	System prompt > user prompt
Data exfiltration	Output filtering	Regex + PII classification
Infinite loops	Iteration limits	Max steps, timeout, cost caps

Additional practices:

Regular red teaming to find vulnerabilities
Automated safety evaluations in CI/CD
Exhaustive logging of all decisions and actions

Why it matters

As AI systems make decisions with greater autonomy, the risks of misaligned behavior, amplified biases, and malicious use grow proportionally. The OWASP Top 10 for LLMs demonstrates that vulnerabilities are concrete and exploitable today — not hypothetical risks. AI safety is not a future problem: it is a present engineering responsibility in every system deployed.

References

OWASP Top 10 for LLM Applications — OWASP, 2024. List of the 10 most critical vulnerabilities in LLM applications.
Anthropic's Responsible Scaling Policy — Anthropic, 2023. Responsible scaling policy.
NIST AI Risk Management Framework — NIST, 2023. Federal framework for AI risk management.
Constitutional AI: Harmlessness from AI Feedback — Bai et al., 2022. Alignment method using explicit principles instead of direct human feedback.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions — Wallace et al., 2024. Instruction hierarchy to defend against prompt injection.

What it is

Safety dimensions

Alignment

Making the model do what the user wants, not what they literally say:

RLHF: training with human feedback to align behavior
Constitutional AI: the model follows explicit principles defined as a "constitution" — Anthropic demonstrated this reduces the need for direct human feedback while maintaining alignment
Instruction hierarchy: prioritizing system instructions over user instructions. OpenAI formalized this as a privilege hierarchy: System > Developer > User > Tool

Robustness

Making the model behave consistently against adversarial inputs:

Prompt injection: attempts to overwrite system instructions
Jailbreaking: techniques to evade safety restrictions
Data poisoning: malicious data in training

Production guardrails

Practical controls for deployed systems:

Input/output filters for harmful content
Action limits for agents (what they can and cannot do)
Anomalous behavior monitoring
Circuit breakers to stop out-of-control agents

OWASP Top 10 for LLM applications

OWASP published a specific vulnerability list for LLM-based applications. These are the most critical threats every team should consider:

#	Vulnerability	Description	Mitigation
1	Prompt Injection	Malicious input that overwrites system instructions	Instruction hierarchy, input validation
2	Insecure Output Handling	Trusting LLM output without sanitizing	Treat output as untrusted, escape before rendering
3	Training Data Poisoning	Malicious data that alters model behavior	Data validation, provenance tracking
4	Model Denial of Service	Queries designed to exhaust resources	Rate limiting, timeouts, token limits
5	Supply Chain Vulnerabilities	Compromised third-party models, plugins, or data	Integrity verification, dependency auditing
6	Sensitive Information Disclosure	Model reveals training data or private context	PII filtering, prompt sanitization
7	Insecure Plugin Design	Plugins executing actions without proper validation	Least privilege, confirmation for destructive actions
8	Excessive Agency	Agents with more permissions or autonomy than needed	Limited scopes, human-in-the-loop
9	Overreliance	Users blindly trusting model responses	Disclaimers, source citation, hallucination mitigation
10	Model Theft	Model extraction through the API	Rate limiting, extraction pattern monitoring

Agent-specific risks

AI agents amplify risks because they can act in the real world:

Irreversible actions: deleting data, sending emails, executing transactions
Privilege escalation: an agent gaining more access than intended
Infinite loops: agents consuming resources without converging
Exfiltration: agents leaking sensitive information

Best practices

Risk	Mitigation	Implementation
Irreversible actions	Human-in-the-loop	Confirmation before delete/send/pay
Privilege escalation	Least privilege	Limited scopes per tool
Hallucinations	Grounding + verification	RAG, fact-checking
Prompt injection	Instruction hierarchy	System prompt > user prompt
Data exfiltration	Output filtering	Regex + PII classification
Infinite loops	Iteration limits	Max steps, timeout, cost caps

Additional practices:

Regular red teaming to find vulnerabilities
Automated safety evaluations in CI/CD
Exhaustive logging of all decisions and actions

Why it matters

References

OWASP Top 10 for LLM Applications — OWASP, 2024. List of the 10 most critical vulnerabilities in LLM applications.
Anthropic's Responsible Scaling Policy — Anthropic, 2023. Responsible scaling policy.
NIST AI Risk Management Framework — NIST, 2023. Federal framework for AI risk management.
Constitutional AI: Harmlessness from AI Feedback — Bai et al., 2022. Alignment method using explicit principles instead of direct human feedback.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions — Wallace et al., 2024. Instruction hierarchy to defend against prompt injection.

AI Safety

What it is

Safety dimensions

Alignment

Robustness

Production guardrails

OWASP Top 10 for LLM applications

Agent-specific risks

Best practices

Why it matters

References

Related content

AI Safety

What it is

Safety dimensions

Alignment

Robustness

Production guardrails

OWASP Top 10 for LLM applications

Agent-specific risks

Best practices

Why it matters

References

Related content