Jonatan Matajonmatum.com
conceptsnotesexperimentsessays
© 2026 Jonatan Mata. All rights reserved.v2.1.1
Concepts

AI Safety

Field dedicated to ensuring artificial intelligence systems behave safely, aligned with human values, and predictably, minimizing risks of harm.

evergreen#ai-safety#alignment#guardrails#responsible-ai#ethics#red-teaming

What it is

AI Safety is the field that studies how to ensure artificial intelligence systems — especially LLMs and agents — behave safely, predictably, and aligned with human intent. It's not just a theoretical problem: every AI system in production needs practical guardrails.

Safety dimensions

Alignment

Making the model do what the user wants, not what they literally say:

  • RLHF: training with human feedback to align behavior
  • Constitutional AI: the model follows explicit principles defined as a "constitution" — Anthropic demonstrated this reduces the need for direct human feedback while maintaining alignment
  • Instruction hierarchy: prioritizing system instructions over user instructions. OpenAI formalized this as a privilege hierarchy: System > Developer > User > Tool

Robustness

Making the model behave consistently against adversarial inputs:

  • Prompt injection: attempts to overwrite system instructions
  • Jailbreaking: techniques to evade safety restrictions
  • Data poisoning: malicious data in training

Production guardrails

Practical controls for deployed systems:

  • Input/output filters for harmful content
  • Action limits for agents (what they can and cannot do)
  • Anomalous behavior monitoring
  • Circuit breakers to stop out-of-control agents

OWASP Top 10 for LLM applications

OWASP published a specific vulnerability list for LLM-based applications. These are the most critical threats every team should consider:

#VulnerabilityDescriptionMitigation
1Prompt InjectionMalicious input that overwrites system instructionsInstruction hierarchy, input validation
2Insecure Output HandlingTrusting LLM output without sanitizingTreat output as untrusted, escape before rendering
3Training Data PoisoningMalicious data that alters model behaviorData validation, provenance tracking
4Model Denial of ServiceQueries designed to exhaust resourcesRate limiting, timeouts, token limits
5Supply Chain VulnerabilitiesCompromised third-party models, plugins, or dataIntegrity verification, dependency auditing
6Sensitive Information DisclosureModel reveals training data or private contextPII filtering, prompt sanitization
7Insecure Plugin DesignPlugins executing actions without proper validationLeast privilege, confirmation for destructive actions
8Excessive AgencyAgents with more permissions or autonomy than neededLimited scopes, human-in-the-loop
9OverrelianceUsers blindly trusting model responsesDisclaimers, source citation, hallucination mitigation
10Model TheftModel extraction through the APIRate limiting, extraction pattern monitoring

Agent-specific risks

AI agents amplify risks because they can act in the real world:

  • Irreversible actions: deleting data, sending emails, executing transactions
  • Privilege escalation: an agent gaining more access than intended
  • Infinite loops: agents consuming resources without converging
  • Exfiltration: agents leaking sensitive information

Best practices

RiskMitigationImplementation
Irreversible actionsHuman-in-the-loopConfirmation before delete/send/pay
Privilege escalationLeast privilegeLimited scopes per tool
HallucinationsGrounding + verificationRAG, fact-checking
Prompt injectionInstruction hierarchySystem prompt > user prompt
Data exfiltrationOutput filteringRegex + PII classification
Infinite loopsIteration limitsMax steps, timeout, cost caps

Additional practices:

  • Regular red teaming to find vulnerabilities
  • Automated safety evaluations in CI/CD
  • Exhaustive logging of all decisions and actions

Why it matters

As AI systems make decisions with greater autonomy, the risks of misaligned behavior, amplified biases, and malicious use grow proportionally. The OWASP Top 10 for LLMs demonstrates that vulnerabilities are concrete and exploitable today — not hypothetical risks. AI safety is not a future problem: it is a present engineering responsibility in every system deployed.

References

  • OWASP Top 10 for LLM Applications — OWASP, 2024. List of the 10 most critical vulnerabilities in LLM applications.
  • Anthropic's Responsible Scaling Policy — Anthropic, 2023. Responsible scaling policy.
  • NIST AI Risk Management Framework — NIST, 2023. Federal framework for AI risk management.
  • Constitutional AI: Harmlessness from AI Feedback — Bai et al., 2022. Alignment method using explicit principles instead of direct human feedback.
  • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions — Wallace et al., 2024. Instruction hierarchy to defend against prompt injection.

Related content

  • Artificial Intelligence

    Field of computer science dedicated to creating systems capable of performing tasks that normally require human intelligence, from reasoning and perception to language generation.

  • AI Agents

    Autonomous systems that combine language models with reasoning, memory, and tool use to execute complex multi-step tasks with minimal human intervention.

  • Synthetic Data

    Algorithmically generated data that replicates the statistical properties of real data, used to train, evaluate, and test AI systems when real data is scarce, expensive, or sensitive.

  • Hallucination Mitigation

    Techniques to reduce LLMs generating false but plausible information, from RAG to factual verification and prompt design.

  • AWS Bedrock

    AWS serverless service providing access to foundation models from multiple providers (Anthropic, Meta, Mistral, Amazon) via unified API, without managing ML infrastructure.

  • Takeaways: The Adolescence of Technology

    Key takeaways from Dario Amodei's essay on civilizational risks of powerful AI and how to confront them.

Concepts