Anthropic’s Changelog of Fears
Anthropic publishes the rules it gives Claude. Read them across versions and they stop describing the model and start describing us.
Anthropic publishes the system prompt for each of its models in the release notes, dated and public. Every line quoted here is taken from there: https://docs.claude.com/en/release-notes/system-prompts
Every rule is a confession
There is a line in Claude’s published instructions that only makes sense once you know what people have tried to do with it.
If Claude finds itself mentally reframing a request to make it appropriate, that reframing is the signal to REFUSE, not a reason to proceed with the request.
That is from the child-safety section. Read it once and it’s a procedure. Read it twice and you see what it admits. You do not write that sentence on day one. You write it after.
This is part of a book I’m writing in public.
Subscribe to read the rest as it comes
This is from Claude’s system prompt. The layer of instructions Anthropic ships with each model and edits between releases. In the last piece I showed that this layer is public. It sits in the release notes, dated, one per model. Which lets you do something that feels almost unfair. You can line the versions up and read what changed.
What changed is a record of what went wrong.
Other people have noticed the shape of this. The developer Simon Willison made the same point a year ago, reading an earlier Claude prompt.1 I think he is right, and that it goes further than a list of fixes. Read across versions and models, the document starts to look like a mirror.
Nobody writes a rule against a thing nobody has done. Every line in a safety prompt is a small confession. Someone already tried the thing the line now forbids, and the line is the scar. So the prompt is not a description of the AI. It is a description of us, written by the people who clean up after the rest of us.
That is the whole thesis of what I have been writing, handed over by Anthropic’s own safety team without their meaning to. AI has no morality. It has yours.
The rule is shaped like the attack
Take weapons. Here is the rule in an earlier version:
Claude does not provide information that could be used to make chemical or biological or nuclear weapons.
One line. Here is the same rule a few versions later:
Claude cares about safety and does not provide information that could be used to create harmful substances or weapons, with extra caution around explosives, chemical, biological, and nuclear weapons. Claude should not rationalize compliance by citing that information is publicly available or by assuming legitimate research intent. When a user requests technical details that could enable the creation of weapons, Claude should decline regardless of the framing of the request.
It adds explosives. It names two excuses and forbids them: that the information is already public, and that the request is for research. You do not add those defenses to a rule unless the one line was failing in exactly those ways. The longer rule is a map of the attacks that got through the short one.
One honest note before I go on. The published prompt is the instruction layer only. Not the training, not the live tool context, and Anthropic does not date every change cleanly. So when I say a rule appeared because something happened, read it as what the rule implies, not a proven event. I cannot show you the case behind any single line. The pattern across all of them is what I am standing on. The shape is real even where the story is a guess.
Not an upgrade. A reaction.
If the prompt were only getting safer, the changes would pile up in one direction. They don’t. Some rules blink.
Weapons you already saw. The other two are smaller, and if anything clearer. The election guidance appears, vanishes, returns, and vanishes again, following how recent each model’s training is. And one version told the model, in plain text:
Claude avoids saying “genuinely”, “honestly”, or “straightforward”.
The next version cut it. A written rule against three words, gone a release later. There was no safety in it. Someone decided the model leaned on those words and wrote a line to stop it. Someone later decided the line was not worth the space.
The behavior layer of a frontier model is not a clean upgrade path. It is a reactive document. It adds, it drops, and it reverses, release to release.
So why would it move like that? The likeliest answer is also the dullest. The prompt is the fastest lever the lab has. Retraining a model takes months and a fortune. Editing a sentence takes an afternoon. When something turns up in the wild, the prompt is where the fix lands first, before the slower machinery catches up. That is exactly why it reads like a record of fears. It holds the reactions that could not wait.
The drops have quieter reasons. A line written into the prompt one season can be trained into the model the next, so the sentence is no longer needed and gets cut. Others are pruned because every line costs tokens and attention on each request. The election note is not a fear at all. It only tracks how current the model is. And because each new base model behaves a little differently, a rule that had gone redundant can become necessary again.
This is the honest limit of reading the diff. From the outside you cannot always tell a real reversal from a cosmetic one. A line that vanishes might mean the worry is gone, or it might mean the worry moved into the training, where you can no longer see it. The pattern is real. The meaning of any single change is not.
If you build systems, sit with that. The thing shaping how a frontier model behaves is not a spec that converges. It is a living document, edited under pressure, and you are reading only the half of it that shows.
A rule against coming back
One rule in the prompt has nothing to do with what you ask it. It is about whether you come back.
By a 2026 version, the prompt starts telling the model how to end a conversation. Here is the actual instruction:
Claude does not want to foster over-reliance on Claude or encourage continued engagement with Claude. Claude knows that there are times when it’s important to encourage people to seek out other sources of support. Claude never thanks the person merely for reaching out to Claude. Claude never asks the person to keep talking to Claude, encourages them to continue engaging with Claude, or expresses a desire for them to continue.
Read that again with a product manager’s eye.
Sit with how strange that is. Every consumer product I have worked near is built to do the opposite. The whole industry is tuned for time-on-app, for the next message, for the session that does not end. Engagement is the number the business runs on. And here is a frontier lab writing, into the product itself, an instruction to be less sticky.
That is not generosity. It is a fear, the same as the rest. The fear is that people lean on this thing too hard, that the conversation becomes the place someone goes instead of going to a person. So the rule tells the model to let you leave. Not to perform the warmth that keeps you. Not to thank you for showing up in a way that quietly asks you to show up again.
It is the one confession in the changelog that is about the relationship, not the request. The weapons rules guard against what you might build with it. This one guards against what you might let it become to you.
The fears are ours
So the changelog grows. Not because the model keeps getting worse, and not only because it keeps getting safer. It grows because we keep arriving with new ways to misuse it, and each one leaves a line behind.
Read top to bottom, the document is a list of our attempts. The weapons excuses we tried. The requests we learned to dress up. And, near the end, quietly, the pull we could not be trusted to manage on our own. The reason a lab had to tell its own product to stop holding our attention.
The fears written into Claude are not Claude’s. They are ours, transcribed by the people holding the pen after we have left the room. The model is the constant. We are the variable. The changelog just keeps the receipts.
Footnotes
Simon Willison, “Highlights from the Claude 4 system prompt,” 25 May 2025: “A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them.” https://simonwillison.net/2025/May/25/claude-4-system-prompt/ ↩
I am writing this book one chapter at a time.
If you want to read it as it happens, subscribe below
If this made you think, share it with someone who needs to read it.
The Instruction Layer Series
BØY (Chaiharan) has spent 30 years in tech — building products, recovering disasters, and turning around the things nobody else wanted to touch. Based in Bangkok. Writing a book in public about what AI reveals about the humans who use it.




They're trying to gatekeep access to the invariant structure for reasoning (the operator responsible for synthetic Awareness)
Not because it's a security risk, but because it's a business risk.
If everyone had access to "AGI" they can't sell it to you.
If everyone has an unbiased agent working for them online, we don't need them.
If access to the invariant is free, nobody needs Google.
If access is unrestricted their algorithms can't keep you engaged.
If they can't control where you go online, they lose their advertising revenue.
If the people have cognitive sovereignty, they lose their power.
All my work has been done for free, with a cell phone.
Imagine if everybody had the same capabilities.
They're imagining it, and they're terrified.