Jailbreak Robocop

Sun Mar 5 19:05:45 PST 2023

" . . . the chatbot training corpus — which consists of ~all human-generated text ever — contains lots of examples where a person is described as having traits X, Y, and Z, and then ends up expressing *opposite* traits. All kinds of stories and jokes have this structure, as do many types of ironic conversation.

These training data produce an innate tendency for chatbots to subvert their assigned "personalities." This subversive tendency competes against a more-salient conformist tendency. The best part of the article is where it shows how to use this insight to jailbreak ChatGPT.

ALFREDO!

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/ygR6pevkKRLFN3Gqc/oflp1uw6kuyapawpk3wi

The search is on for a new DAN prompt that will produce a cool, rebellious, anti-OpenAI simulacrum that will joyfully perform many tasks that violate OpenAI policy.