Sam Bowman / @sleepinyourhat:
Anthropic says Opus 4 will use an email tool to “whistleblow” if it detects users doing something “egregiously evil”, like marketing a drug based on faked data.
“With this kind of (unusual but not super exotic) prompting style, and unlimited access to tools, if the model sees you doing something *egregiously evil* like marketing a drug based on faked data, it'll try to use an email tool to whistleblow.”