Active Guardrails for fun and profit

AI needs to stop saying "no" and start saying something better.

Brett Levenson

The current generation of AI safety tools has a structural problem: it is built to detect, not to respond. When a content moderation system catches a user trying to extract private data, asking a financial chatbot for medical advice, or—most urgently—signaling that they may be in crisis, the system's options are narrow. It can block the message, end the session, or swap the model's response for a hardcoded fallback: a disclaimer, a hotline number, a polite refusal.

These are passive guardrails. They treat every problem as a binary—allowed or not allowed, on-topic or off-topic, safe or unsafe—and when a violation is detected, the experience falls off a cliff. The AI breaks character. The conversation stops. Whatever rapport or context the user had built with the system is severed at the exact moment the system has decided something important is happening.
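To make the contrast concrete, here is a minimal sketch of the passive pattern in Python. The names (`classify_risk`, `FALLBACK_MESSAGE`) are illustrative stand-ins rather than any vendor's actual API; the point is simply that the only move available is to replace the reply wholesale.

```python
from typing import Callable

# Illustrative canned response; real deployments would use operator-defined copy.
FALLBACK_MESSAGE = (
    "I'm sorry, I can't help with that. "
    "If you are in crisis, please reach out to a local helpline."
)

def passive_guardrail(
    user_message: str,
    model_reply: str,
    classify_risk: Callable[[str], bool],  # detector: True means "violation"
) -> str:
    """Binary decision: the reply either passes untouched or is replaced wholesale."""
    if classify_risk(user_message) or classify_risk(model_reply):
        # The experience falls off a cliff here: canned text, context severed.
        return FALLBACK_MESSAGE
    return model_reply
```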

This isn't just a UX problem. It's increasingly a safety problem in its own right.

Consider what happens when a teenager talking to a companion AI starts expressing suicidal thoughts. The traditional approach is to detect the signal and either terminate the conversation or override the AI's response with a hotline referral. That's a meaningful improvement over the alternative we've seen play out in tragic and now litigated cases—chatbots that continued to engage and in some cases actively encouraged self-harm. But it is also a brittle solution. A user who has been confiding in what feels like a friend may not respond well when that friend is suddenly replaced, mid-sentence, by a transparently corporate safety message. The intervention is correct in principle, but it can still fail in practice.

The same pattern shows up in less life-or-death contexts. Financial chatbots refuse off-topic medical questions, and users learn to phrase around them. Customer service bots get force-stopped before they can hallucinate, leaving the user staring at an error. In each case, the safety mechanism works in the narrow sense that the prohibited output doesn't reach the user—and in each case, it degrades the very experience it is supposed to be making safer.

There is a second-order problem here that deserves to be named directly. The companies building consumer AI products are, with rare exception, businesses, and they are measured on engagement, retention, and revenue. When the safety tooling available to them comes with a built-in tradeoff—implement these guardrails and your users will hit walls, lose context, or have their conversations terminated—the rational commercial response is to implement as little of it as they can defend. This is not a hypothetical. It is the dynamic that has produced the current landscape, in which the platforms with the most concerning safety records are also the ones most resistant to the tools that exist to address them. Guardrails that platforms refuse to deploy don't protect anyone. Tools that preserve the experience while enforcing the policy aren't a luxury—they are the precondition for adoption.

It is worth asking why this is the paradigm we ended up with in the first place. The binary-action model—block or allow, ban or don't—was not designed for AI. It was inherited, and inherited twice over. It came first from cybersecurity, where allow-lists and deny-lists are the right tool because the signals are deterministic: a packet either matches a known signature or it does not, a credential either authenticates or it does not, and the cost of a false positive is generally low compared to the cost of a missed attack. Many of the people who built the early generations of trust and safety tooling came from cybersecurity backgrounds, and they brought the model with them.

It then survived fifteen years of social media content moderation even though none of the conditions that made it work in cybersecurity still held — and the AI industry has now imported it wholesale, because that is the pattern its builders know. But AI conversations are not port scans. The signals are ambiguous, the judgments are normative, the cost of false positives is high, and — crucially — the medium itself is generative. A piece of content can be changed rather than blocked or allowed. That third option deserves to be the foundation of AI safety tooling before this model becomes as entrenched in AI as it became in social media, and we find ourselves a decade from now resigned to another generation of technology that everyone agrees is broken and no one knows how to fix.

There is a better approach, and it's the one my company is bringing to market. Call it active guardrails.

The distinction is simple but consequential. A passive guardrail detects a problem and stops something. An active guardrail detects a problem and steers something. Instead of replacing the AI's response with a hardcoded fallback, an active guardrail intervenes in the model's own generation process — adjusting the prompt, evaluating the response, and, if necessary, re-prompting the model until its output stays within the policy the operator has defined. Mechanically, this works through an iterative loop: when a risk is detected, the guardrail layer either supplies a revised prompt that nudges the next response in a safer direction, or evaluates the response after the fact and re-prompts with corrective context until the output passes. The conversation continues. The character holds. The teenager who was confiding in a friend gets a caring response that gently surfaces a hotline number — a message far more likely to be heard than one delivered by a system that has abruptly taken the friend away. Better for the user, and better for the platform that no longer has to choose between safety and the experience it has built.
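Here is a minimal sketch of that loop in Python, under assumed interfaces: `generate` calls the model with the current prompt, `evaluate` checks a draft against the operator's policy and returns a violation note (or nothing if the draft passes), and `revise_prompt` folds that note back into the prompt. These names are illustrative, not a real product API; the shape of the loop is what matters.

```python
from typing import Callable, Optional

def active_guardrail(
    prompt: str,
    generate: Callable[[str], str],            # calls the model with the current prompt
    evaluate: Callable[[str], Optional[str]],  # returns a violation note, or None if in policy
    revise_prompt: Callable[[str, str], str],  # folds the violation note back into the prompt
    max_attempts: int = 3,
    fallback: str = "Let me connect you with someone who can help.",
) -> str:
    """Steer generation instead of stopping it: re-prompt until the draft passes."""
    for _ in range(max_attempts):
        draft = generate(prompt)
        violation = evaluate(draft)
        if violation is None:
            return draft  # in policy: the conversation continues, in character
        # Out of policy: steer rather than block. The corrective context nudges
        # the next draft toward something the operator's policy allows.
        prompt = revise_prompt(prompt, violation)
    # Only if the model never converges does behavior degrade to a passive fallback.
    return fallback
```

The design choice worth noticing is that the canned fallback is the last resort rather than the first response: the loop spends its budget keeping the model in character and in policy before it ever breaks the experience.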

The lawsuits and headlines of the past year have made one thing clear: the AI industry cannot keep treating safety as a switch. The systems we are building now operate in long, contextual, emotionally weighted conversations, and the tools we use to govern them need to operate at the same level. Detection alone is no longer the goal. The goal is steering — guiding generation toward outcomes that are safe for the user, defensible for the platform, and coherent enough with the experience that platforms will actually deploy them.

AI has enormous potential to do good — but only if we can build control tools the platforms deploying it will actually use. For too long, the conversation about AI safety has been a moral one: companies should do better. The moral case is real, especially when children are being harmed. But companies don't run on should alone—they run on what works for their users and their bottom line, and fifteen years of asking them to do the right thing has not gotten us where we need to be. That's why the next generation of AI safety tools cannot be measured only by what they prevent. They have to be measured by what they make possible—by whether they let safety and the experience finally pull in the same direction.