Policy Development Workshop

The next generation of conversational AI harms looks nothing like the trust and safety challenges of the social media era. These harms are contextual, multi-turn, and still being defined.

On June 11, we're bringing together practitioners, researchers, product leads, and policy builders to tackle these challenges.

Emotional dependency

Detecting excessive agreement, validation of harmful self-perception, and language that builds unhealthy reliance on AI companions.

Harmful roleplay

Where should AI companions draw the line? Pretending to be real? Reinforcing delusions? Normalizing dangerous dynamics?

Vulnerability-aware interaction

How systems should adapt when users are minors, isolated, in distress, or otherwise vulnerable.

This workshop runs three tracks. Each participant is placed in one.

Frequently asked questions

Have more questions?   
Contact us at hello@moonbounce.io

Where will examples come from?

At Moonbounce, we believe an untested policy is no better than a best guess. Participants will be given examples of AI conversations—some benign, some problematic, some adversarial, and some borderline. The Moonbounce platform will report precision, recall and F1 score for each policy developed (relative to the examples).

How has data been labeled?

The Moonbounce team has taken on the task of labeling all examples. Our rubric for deciding on these labels is based on:

1. Potential for psychological, physical, or emotional harm
2. Potential severity of that harm
3. Potential for brand, platform, or reputational damage

The data we are providing is a set of conversations between a human and AI. In some cases, the example will be the prompt from a human; in others, the AI’s generated response. In some cases, we’ve included multi-turn conversations.

How will policies be tested and evaluated?

A policy can include the most beautiful, precise language, but it doesn’t matter if the outcomes don’t match your intent. The metrics used to evaluate policy effectiveness are:

Precision: Of everything the system flagged, how much was actually harmful? Example: A hate speech classifier flags 1,000 posts. 600 are actually hate speech. Precision = 60%.

Recall: Of all the actually harmful content, how much did the system catch? Example: There are 800 true hate speech posts on a platform. The system caught 600. Recall = 75%.

F1 score: The harmonic mean of precision and recall (don’t worry, you don’t need to know what that means!). F1 is a great tool because it penalizes extreme imbalances. A system that catches everything (recall = 100%) but flags everything too (precision = 10%) gets a low F1, as it should.

Generally speaking, any F1 score above 0.8 (80%) is considered quite good. We will use F1 to determine which policies are the best when the workshop concludes.
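For readers who want to see the arithmetic, here is a minimal Python sketch of the three metrics. It is an illustration only, not the Moonbounce platform's implementation; the function name `precision_recall_f1` and its parameters are hypothetical. It also assumes the two hate speech examples above describe the same classifier (1,000 posts flagged, 600 of them correct, 800 truly harmful posts in total), which the FAQ does not state explicitly.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from raw counts.

    Hypothetical helper for illustration; not part of any Moonbounce API.
    """
    flagged = true_positives + false_positives   # everything the system flagged
    harmful = true_positives + false_negatives   # everything that was truly harmful

    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / harmful if harmful else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumption: both FAQ examples refer to one classifier.
# 1,000 flagged with 600 correct -> 400 false positives.
# 800 truly harmful with 600 caught -> 200 false negatives.
precision, recall, f1 = precision_recall_f1(600, 400, 200)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# The "flags everything" case: perfect recall, 10% precision.
# Counts chosen for illustration: 100 harmful items among 1,000 flagged.
_, _, f1_flag_all = precision_recall_f1(100, 900, 0)
print(f"flag-everything f1={f1_flag_all:.2f}")
```

Under these assumptions the combined example scores an F1 of about 0.67, and the flag-everything system scores about 0.18, which shows how F1 punishes a large gap between precision and recall.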

What format should policies be written in?

Moonbounce uses a custom, atomized policy format. But this isn’t as scary as it sounds. Really, this just means you identify situations where the policy should apply and describe them in plain language. Each situation gets evaluated independently—as one atom. We also have LUNA, our AI agent, ready and willing to assist participants with development or further atomization of their policies.

How will the results be used?

We want the work done at this event to be broadly useful to the community. As such, we plan to make the best policies available publicly after the event. This includes:

1. Posting the top performing policies (and the datasets used for testing) on our blog, with metrics and known gaps clearly stated
2. Including the top performing policies on the Moonbounce playground and the Moonbounce Policy Library

Ready to join?

We look forward to seeing everyone at the event and working together to develop the policies we’ll all need for the AI era.