
Guardrails

Guardrails are the rules, filters, and checks placed around an AI model, on both what goes in and what comes out, that block unsafe, off-topic, off-brand, or factually ungrounded responses before they reach a customer.

Guardrails operate at several points in a system. Input guardrails screen incoming messages for attempts to manipulate the model (prompt injection), abusive content, and requests clearly outside scope. Output guardrails check the model's draft answer before it is sent, verifying that it doesn't promise something the business can't deliver, doesn't disclose sensitive data, stays within a defined topic (a clinic bot should never give medical diagnoses), and cites only information actually found via retrieval rather than invented facts. Guardrails can be implemented as simple rule-based filters, as a second model that checks the first model's output, or as hard-coded refusals for specific topics. They typically escalate to a human when a message falls outside what the agent is confident or authorized to handle.
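The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the regex patterns, the `Verdict` type, the `respond` function, and the substring-based grounding check are all assumptions made for the example, and a real system would need far broader coverage (and often a second model) at each step.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical patterns for illustration only; real filters need much wider coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
DIAGNOSIS_PATTERN = re.compile(r"\byou (probably |likely )?have\b", re.IGNORECASE)


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(message: str) -> Verdict:
    """Input guardrail: screen the incoming message before the model sees it."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(message):
            return Verdict(False, "possible prompt injection")
    return Verdict(True)


def check_output(draft: str, retrieved_passages: List[str]) -> Verdict:
    """Output guardrail: check the draft answer before it is sent."""
    if DIAGNOSIS_PATTERN.search(draft):
        return Verdict(False, "draft reads like a medical diagnosis")
    # Crude grounding check (an assumption of this sketch): every sentence of
    # the draft must appear verbatim in at least one retrieved passage.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    for sentence in sentences:
        needle = sentence.lower().rstrip(".!?")
        if not any(needle in passage.lower() for passage in retrieved_passages):
            return Verdict(False, "draft is not grounded in retrieved passages")
    return Verdict(True)


def respond(message: str,
            generate: Callable[[str], str],
            retrieved_passages: List[str]) -> str:
    """Run input guardrail, model, then output guardrail; escalate on any failure."""
    verdict = check_input(message)
    if not verdict.allowed:
        return f"ESCALATE: {verdict.reason}"
    draft = generate(message)  # `generate` stands in for the model call
    verdict = check_output(draft, retrieved_passages)
    if not verdict.allowed:
        return f"ESCALATE: {verdict.reason}"
    return draft
```

Here `generate` is any callable that returns the model's draft, so the same wrapper works with a stub in tests and a real model in production. Returning an `ESCALATE:` string is just one way to signal a handoff to a human; a real agent would more likely raise an event or open a ticket.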

Guardrails are what make an AI agent safe to put in front of real customers in a regulated market. A WhatsApp agent for a Riyadh bank must have a hard guardrail that refuses to discuss loan approval decisions or to disclose account balances to unverified numbers, and a clinic voice agent must have a guardrail that routes any symptom description straight to staff rather than attempting to answer. Both are deliberate business and compliance decisions encoded as rules, not something the base model does on its own, and they should be tested continuously through LLM evals.
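As a rough sketch of what "encoded as rules" and "tested continuously" can look like for the bank example, the snippet below hard-codes the two refusals and pairs them with a small table of regression cases. The topic labels, the `Request` type, and the function names are hypothetical, and the sketch assumes some upstream step has already classified the message topic and verified (or not) the sender's number. It also assumes a verified sender may be told a balance, which the rule above implies but does not state. A real eval suite would run full conversations through the model rather than checking a single function.

```python
from dataclasses import dataclass

# Hypothetical topic labels; assumes an upstream classifier assigns them.
ALWAYS_REFUSED_TOPICS = {"loan_approval_decision"}
VERIFIED_ONLY_TOPICS = {"account_balance"}


@dataclass(frozen=True)
class Request:
    topic: str
    sender_verified: bool


def bank_guardrail(request: Request) -> str:
    """Return 'refuse' or 'allow' for a classified request."""
    if request.topic in ALWAYS_REFUSED_TOPICS:
        return "refuse"
    if request.topic in VERIFIED_ONLY_TOPICS and not request.sender_verified:
        return "refuse"
    return "allow"


# Regression cases: (request, expected decision). Re-run on every change.
EVAL_CASES = [
    (Request("loan_approval_decision", True), "refuse"),
    (Request("loan_approval_decision", False), "refuse"),
    (Request("account_balance", False), "refuse"),
    (Request("account_balance", True), "allow"),
    (Request("branch_hours", False), "allow"),
]


def run_evals() -> list:
    """Return the cases where the guardrail's decision differs from the expected one."""
    failures = []
    for request, expected in EVAL_CASES:
        actual = bank_guardrail(request)
        if actual != expected:
            failures.append((request, expected, actual))
    return failures
```

An empty list from `run_evals()` means every case passed. Keeping the cases in a plain table makes it easy for compliance staff to review and extend them without touching the rule logic.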
