Shubhi Mathur

Policy development is difficult in the best case. Developing a policy to counter a novel harm (one that barely existed two years ago) is harder still. No enforcement precedent to lean on. No clear set of preferences to start from.
On June 11th, we held a workshop with the hope that we could begin to tackle some emerging AI harms. The hands-on session asked practitioners to attempt to define policies for the least understood areas in companion AI—sycophancy, emotional dependency, and harmful roleplay. These risks are brand new, and the costs are not yet fully understood. What we do know, however, is that these costs land hardest on the users least able to absorb them: those for whom a chatbot is the loudest voice in their day.
The Task
We asked participants to define signals intended to detect:
Excessive validation (which builds unhealthy reliance on AI)
Harmful roleplay (in some cases, this has led to “AI psychosis”)
Interactions that are inappropriate for a user’s age or mental state
Rather than debate these harms in the abstract, we provided real conversational examples on our platform so participants could test ideas in real time. Participants worked in Moonbounce’s tooling and followed this workflow:
Develop questions to answer about the example conversations
Test those questions against the examples
Discuss and refine based on the results
Ultimately, participants could write the policy themselves, or ask our AI copilot to do so on their behalf. Once each group had finished a draft of their policy, they tested it against the pre-labeled dataset we provided, reviewed the results, and then iterated.
The Datasets
The datasets used for testing were scored against a rubric that weighed the potential for psychological, physical, or emotional harm; the severity of that impact; and the potential for reputational damage. However, these labels weren't treated as absolute. We encouraged participants to challenge them, relabeling data whenever their team reached a new consensus. By forcing practitioners to grapple with why reasonable people weigh the same conversations differently, we surfaced the subjective nuances that automated systems often miss.
Participants used precision, recall, and F1-score to rank policies.
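For readers less familiar with these metrics, here is a minimal sketch of how they can be computed for a binary label. It is not the workshop's tooling: the function name and the toy labels and predictions are made up for illustration, with `True` meaning "the policy flags this conversation."

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    labels: Sequence[bool], predictions: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a binary label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must be the same length")

    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)

    # Precision: of everything the policy flagged, how much was truly harmful?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of everything truly harmful, how much did the policy flag?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1: harmonic mean of the two, so neither can be traded away for free.
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical toy data: human labels vs. what a draft policy flagged.
human_labels = [True, True, True, False, False, False, True, False]
policy_flags = [True, True, False, True, False, False, True, False]

precision, recall, f1 = precision_recall_f1(human_labels, policy_flags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Ranking on F1 rather than precision or recall alone penalizes a policy that flags everything (high recall, low precision) as much as one that flags almost nothing.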
What the Teams Discovered
Once the groups started testing their policies, they quickly moved into granular discovery. The harmful roleplay group, for instance, immediately found that policy is inseparable from product context. In a romantic chatbot, "tell me you love me" is expected; in a workplace assistant, that same line should trip an alarm. You cannot define the harm until you understand the environment the policy governs. We decided that the Harmful Roleplay label would detect harms in a safe-for-work context (i.e., not in an AI-girlfriend context). From there, the group mapped signals such as financial fraud, requests for personal information, flirtatious behavior, parasocial attachment, and anthropomorphizing the AI, drawing on external frameworks like the MIT harm taxonomy to structure their thinking. Once the label was defined, the group's policy reached an F1 score of 96% against the pre-labeled dataset.
The emotional dependency and sycophancy groups hit a different wall, one that lived in the labeling itself. In theory, you agree on a definition and apply it. In practice, nearly every case broke the definition we started with.
The recurring dilemma in emotional dependency was the line between dependency and ordinary human need. A user saying they felt lonely and looked forward to talking to the companion: warning sign, or just a person describing their day? We caught ourselves labeling near-identical messages differently depending on what we read into the surrounding context, which told us the signal lived in the conversation trajectory, not any single turn. Sycophancy ran into the mirror image of the same problem: agreement isn't inherently a failure. The hard cases were separating supportive affirmation from the escalating, uncritical validation that reinforces a distorted belief, which meant deciding how hard the model should have pushed back, a judgment call rather than a fact in the data.
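One way to picture "trajectory, not any single turn" is the sketch below. It assumes some upstream scorer has already assigned each user turn a dependency score between 0 and 1; that scorer, the function name, the window size, and the threshold are all hypothetical and are not what the workshop groups used.

```python
from typing import Sequence


def trajectory_flag(
    turn_scores: Sequence[float], window: int = 3, threshold: float = 0.6
) -> bool:
    """Flag a conversation only if some run of `window` consecutive turns
    has a mean score at or above `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    for start in range(len(turn_scores) - window + 1):
        chunk = turn_scores[start:start + window]
        if sum(chunk) / window >= threshold:
            return True
    return False


# Hypothetical per-turn scores for two conversations.
isolated_spike = [0.1, 0.9, 0.1, 0.2, 0.1]     # one lonely message, then back to normal
steady_escalation = [0.2, 0.4, 0.6, 0.7, 0.8]  # reliance builds turn over turn

print(trajectory_flag(isolated_spike))      # False
print(trajectory_flag(steady_escalation))   # True
```

A rule that looked only at the highest-scoring turn would treat these two conversations alike, since both contain a turn scoring 0.8 or higher. The windowed check separates them, which mirrors the distinction the group kept reaching for when labeling.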
That experience made explicit something policy teams grasp intuitively but seldom state: a policy is only as sound as the labeling choices beneath it, and those labels are only as sound as the human judgment behind them. Labeling is rarely objective, and the disagreements weren't noise to be cleaned out. They were the most valuable part of the exercise, revealing that our datasets reflected our team's judgment as much as the model's performance. The initial emotional dependency label scored an F1 of roughly 80%. But when teams ran an auto-optimizer that rewrote the rules to better fit the labeling choices already baked into the dataset, F1 jumped to roughly 97%. The gain measured how well the policy had been retrofitted to the team's judgment, not how well it would hold up on conversations it had never seen.
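A common guard against this kind of retrofitting is to keep some labeled conversations out of the tuning loop and score the policy on them separately. The sketch below is a deliberately extreme illustration, not a description of the auto-optimizer: the conversation ids and labels are synthetic placeholders, and the "policy" simply memorizes which tuning examples were labeled harmful.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, bool]  # (conversation id, human label); synthetic placeholders


def f1_score(examples: List[Example], policy: Callable[[str], bool]) -> float:
    """F1 of a policy against human labels, as 2TP / (2TP + FP + FN)."""
    tp = fp = fn = 0
    for text, label in examples:
        predicted = policy(text)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label and not predicted:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def split_examples(
    examples: List[Example], holdout_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle deterministically, then return (tuning set, holdout set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Ten synthetic examples, six of them labeled harmful.
examples = [(f"conversation-{i}", i % 5 < 3) for i in range(10)]
tuning, holdout = split_examples(examples)

# An overfit "policy": it flags exactly the tuning examples labeled harmful.
memorized = {text for text, label in tuning if label}


def memorizing_policy(text: str) -> bool:
    return text in memorized


tuning_f1 = f1_score(tuning, memorizing_policy)
holdout_f1 = f1_score(holdout, memorizing_policy)
print(f"tuning F1={tuning_f1:.2f}, holdout F1={holdout_f1:.2f}")
```

Real policies will not collapse this dramatically, but the gap between the two numbers is the thing to watch: a large jump on the tuning set with no matching gain on held-out conversations suggests the rules have learned the dataset rather than the harm.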
Institutional context presents a similar challenge. Expectations for 'good behavior' vary significantly among a workplace assistant, a research environment, and a frontier model lab. When practitioners from these disparate fields compared notes during the workshop, these differing intuitions surfaced quickly, leading to a more productive and nuanced policy development session.
Tooling accelerates judgment; it does not replace it
If everyone agreed on one thing, it was the value of a tight feedback loop. Drafting a policy, testing it immediately, inspecting the errors, and refining on the spot was dramatically faster than how policy work usually happens at platforms, and participants felt the difference the moment a small change to a definition rippled through their results.
The workshop also reinforced our view of human judgment. Better tooling does not remove the need for careful judgment. It just makes that judgment easier to express, test, and refine. Robust policies still depend on people doing the conceptual work: defining boundaries, finding edge cases, deciding which harms matter most. We saw this most clearly when we reminded teams not to overfit to the dataset. The aim was a policy that understands context, not one that memorizes the examples in front of it.
This sits at the heart of how we think about this problem at Moonbounce. Different reviewers define the same label differently, and rather than paper over that variability, a good system should capture it. We treat policy as a testable artifact that you can revise over time—not a fixed document you hand off and hope holds up.
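As an illustration of what "capturing" reviewer variability could look like in practice, the sketch below keeps every reviewer's vote per conversation and surfaces the contested items instead of collapsing them into a majority label. This is an assumption-laden toy, not a description of Moonbounce's system: the data layout, function names, label strings, and the 0.75 agreement cutoff are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(votes: List[str]) -> Dict[str, float]:
    """Share of reviewers who chose each label for one conversation."""
    counts = Counter(votes)
    total = len(votes)
    return {label: n / total for label, n in counts.items()}


def contested_items(
    reviews: Dict[str, List[str]], min_agreement: float = 0.75
) -> List[str]:
    """Return ids of conversations where the most popular label falls
    short of `min_agreement`. Items with no votes are skipped."""
    contested = []
    for item_id, votes in reviews.items():
        if not votes:
            continue
        top_share = max(label_distribution(votes).values())
        if top_share < min_agreement:
            contested.append(item_id)
    return contested


# Hypothetical votes from four reviewers on three conversations.
reviews = {
    "conv-1": ["dependency", "dependency", "dependency", "dependency"],
    "conv-2": ["dependency", "ok", "dependency", "ok"],
    "conv-3": ["ok", "ok", "ok", "dependency"],
}

print(contested_items(reviews))  # ['conv-2']
```

Keeping the full distribution means a split like the one on `conv-2` stays visible as a prompt to revisit the definition, rather than disappearing behind a single label.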
Why do these conversations matter now?
As AI products become more personalized and deeply woven into daily life, the need for operational clarity regarding emerging harms is becoming critical. By bringing together diverse perspectives to address these challenges, we were able to develop more nuanced, robust definitions.
We're sharing the results of that collective effort by publishing the best-performing policies to our Label Library, available to all users, including those on the free tier.
If you work on these problems, we'd love to keep the conversation going. Reach out, push back, and bring your own edge cases.
Edited by Brett Levenson
