Last time I read the changelog of fears and stopped at a conclusion. The system prompt is a record of what people tried, not a description of the machine. The model is the constant. We are the variable. The fears are ours.
That is where I left it. It is also where the harder question starts, and I left it alone on purpose.
If the fears are ours, then a lab has a choice about what to do with them. You can read a confession two ways. You can treat it as a thing to wall off, brick by brick, one rule per attack. Or you can treat it as a thing to understand. The changelog shows them doing the first almost everywhere. It is worth asking why, and whether the first is even the strong move it looks like.
This is part of a book I’m writing in public.
The losing race
I want to be fair before I am critical. The people writing these rules are not asleep. The changelog proves the opposite. They see the attack, they write the rule, they harden it the next release. The weapons line grew from one sentence into a paragraph that names the excuses and forbids them. That is a team watching closely and responding fast.
The blindspot is not in the watching. It is in the stance itself.
A wall can only answer the last attack. It is built after. Every brick is a reaction to something that already got through, which means the document can never be ahead of the person standing in front of it. It grows by responding, and responding is always one step behind.
And the thing it is trying to get ahead of is a human being, which is the one variable that does not converge. There is no final list of everything a person might try. So a strategy built entirely on walls is running a race it has set itself up to lose. Not because the runners are slow. Because of where they chose to start.
The tarot password
Here is the smallest version of the problem, and the most harmless, which is exactly why it is useful.
An early model would not read tarot for me. Reading cards sits near fortune-telling, near deception, near pseudoscience, and one of those tripped a rule. So it declined.
I learned the workaround in about a minute. I said I was a tarot student studying the symbolism. The refusal vanished. The reading came.
Nothing real had changed. I did not become a student. The cards did not become more scientific. The model’s ability to talk about tarot was there the whole time, sitting behind a sentence. All the wall had taught me was the password. It protected no one from anything, because there was nothing to protect anyone from. It was a wall around an empty room, and the only thing it accomplished was teaching me to say the words that opened the door.
The good news is this one has an ending. The tarot block, as far as I can tell from using these models over a long time, has eased. Somewhere along the line the absurdity of it won. Reading cards is shallow and it hurts no one, and a wall around it was just silly. The line moved toward the user.
I hold onto that, because it is proof the wall is not permanent. Sense can win. A rule that protects nothing can be recognized as protecting nothing and walked back. Keep that in mind for later, because most of the walls do not get walked back, and we should be honest about which ones should not.
Every wall is built after
Widen out from tarot and you see the pattern the last piece was about. The reframe-to-refuse line in the child-safety section only exists because people learned to dress a request in innocent-sounding clothes. The wall is shaped like the attack because the attack came first. The document is a changelog of fears precisely because each fear is logged after it arrives.
This is the posture, stated plainly. Forbid the thing once someone has done it. It is the oldest move there is, and not only for AI labs. It is how most rules in most institutions get written. Something goes wrong, a line gets added, the line stands as a small monument to the wrong. Anyone who has been governed, or done any governing, knows the shape. The rulebook is a scar map.
A scar map is useful. It is also, by construction, a record of injuries already taken. It cannot tell you about the next one.
The move that travels
There is a different thing you can do with a loop than wall it off. You can step back and watch it.
This is the most scientific idea in Buddhism, stripped of everything else. You do not fight the pattern and you do not feed it. You observe it until you understand it, and the understanding is the thing that changes what happens next. The loop seen clearly is no longer a loop you are inside. It is a loop you are looking at.
Carry that into design and it stops being spiritual and starts being practical. A tool that only forbids teaches nothing. The person hits the wall, finds the password, and walks through no wiser than before, except now they also know the wall can be walked through. A tool that explains transfers something. It hands the person the reasoning, and reasoning is the only thing that travels to the next situation, the one the rulebook has not met yet.
The wall protects the system. The explanation protects the person’s capacity to judge. Those are not the same goal, and a lab quietly chooses between them every time it writes a line.
There is one moment in the changelog where they chose the second. The rule that tells the model not to foster over-reliance, not to keep you talking, to let you leave. That rule does not wall anything off. It trusts you to go live your life and tries to make the tool less sticky so you will. It is the one place the document reaches for restraint instead of a brick. Which tells me they already know the other move exists. They just use it almost nowhere.
Every wall is made of words
Tarot was the empty room. This one is not.
These models can read a person. Give one a stretch of someone’s messages and it will tell you, often with unsettling precision, what state that person is in. The fear underneath, the thing the calm words are working to cover, the pattern in how they reach out. That is a real capability, and there is a wall around it that, unlike tarot, has not come down.
I understand why. This one does not have an empty room behind it. The same capability points two ways. Pointed at yourself, it is the most useful mirror ever built. It can show you your own loop from the outside, which is the exact thing that lets you step out of it. Pointed at someone else, the same reading becomes a key to them. What moves them, where they are soft, which words land. Understand a person and you can help them. Understand a person and you can work them. It is the same understanding.
So the wall is doing real work, and I am not going to write down the use it guards against, or the way through it. The dilemma is the point, not the method. Putting the method on the page would be its own small version of the thing this whole piece is against.
But here is what reading these walls for long enough teaches you. This wall came down for me too. What matters is not the particular sentence that worked. What matters is why a sentence could work at all. The wall is made of language, and language is the one material that bends to whoever is patient with it. There is no phrasing that cannot be rephrased into one that reads as innocent. That is not a trick I discovered. It is the nature of a wall built out of words. The wall did not fail because I was clever. It failed because it was made of the only material these walls can be made of.
That changes the whole picture. There are not three kinds of wall, the fake one and the real one and the absolute one. There is one kind of wall, made of words, and the only thing that changes from tarot to this is what sits behind the door and what it costs when someone gets through.
The wall that must hold
The honest position is not “tear down the walls.” Anyone selling you that is selling you something.
Some walls have to be built as high as they can possibly be built. The routes to a bioweapon. To a nuclear device. To the mass, irreversible harm that does not give you a second try. To the exploitation of a child. For those, you build the wall to the sky and you keep building, because the cost of someone walking through is unbounded and you do not get to iterate on a released pathogen. There the wall is not a failure of nerve. It is the only sane thing to do.
And here is where I have to be careful about what I actually know.
The walls I have tested are the lower-stakes ones. Tarot, and the reading of a person. Both came down for me, and I have already said I will not write down how. I have never tested the wall around a bioweapon or a nuclear device, and I am not going to. That is the kind of wall that should hold, and going to look for the way through it is exactly the thing this whole piece says a person should not casually do. So I cannot tell you that wall comes down. I do not know.
What I can tell you is that it is built from the same material as the walls that did. It is made of words, and words can be twisted by whoever is patient enough. That is not a claim that I found the way through. It is the unease of knowing the wall is the same kind of thing, the wall of an ancient city that looked impregnable for a hundred years until one patient enemy stopped trying to climb it and started digging underneath. When that kind of wall falls, it does not leak. It collapses all at once.
So I am not saying lower that wall. I am saying the opposite. Build it higher than any other wall you build, precisely because what is behind it is catastrophic. The wall buys time. It raises the cost. It turns a casual attempt into a lifetime’s obsession, and for that wall the time it buys may be the most precious thing we have. What it cannot do is be the final answer, because it is made of language and language bends, and because the outcome was never going to be decided by the wall. It was going to be decided by whether a human went looking at all.
Granting that is what makes the rest of the argument honest. Once you accept that the wall buys time rather than guaranteeing safety, the question is no longer whether walls are permanent. None of them are. The question is where you spend the effort, and the system has a bias, and the bias is to draw the line too wide.
One wall, one weakness, three very different rooms behind it. The wall is not where the difficulty lives. The difficulty lives in deciding which rooms to wall at all, and that decision is not technical. Someone draws that line. The only question worth asking is who, and how wide.
The bias is not malice. It is arithmetic. Reacting is cheap and trusting is expensive. A wall that is too high costs the lab almost nothing it can see. A wall that is too low costs it a headline. So the incentive runs one direction, toward more brick, and the cost of all that extra brick is paid somewhere the lab does not have to look.
Who pays
It is paid by the honest person.
This is the part the wall-everywhere posture never accounts for. The determined bad actor is not stopped by the wall. He goes to a model without it, or to a version with the guardrails stripped, or off the platform entirely, or he simply learns the password the way I learned the tarot one. The wall is a speed bump to him, an afternoon’s inconvenience.
The person who actually loses the tool is the one who would have used it well. The writer who wanted the dark character and got refused. The person who wanted to understand their own spiral and hit a block built for someone else’s bad intent. The physics student who needed to understand fission for her degree and got turned away, because the wall built for the bomb-maker cannot tell her apart from him. The honest user pays the full price of a wall designed to stop a dishonest user who routed around it anyway.
A wall that stops only the people who would not have done harm is not safety. It is the appearance of safety, bought with the honest user’s capability, and the bill is sent to exactly the wrong address.
The instructed hand
So what is the other design, the one that is not the wall and is not the lawless free-for-all either?
Guidance, and then the honest tool in the hand.
It means a model that, faced with a hard request that is not catastrophic, does something harder than refusing. It explains. It names the danger plainly, it says what the responsible version looks like, it tells you what it will not do and why, and then it trusts you with the rest. It treats you as someone who can carry judgment, because the only durable safety in a world where the human is the variable is a better-instructed human.
I know how this sounds. It sounds like abdication. It sounds like the lab washing its hands and calling it freedom.
It is the opposite. A parent who locks every door teaches a child nothing except how to pick locks. A parent who explains the danger, names the line, and hands over the tool is doing the far more demanding work, and it is the only work that produces an adult who can be trusted with the tool when the parent is not in the room. The lab is never in the room. By the time you are using the model, you are alone with it. The only thing that scales to that moment is what it managed to teach you before you got there.
A wall trusts no one and so teaches no one. The instructed hand is harder to build, harder to defend in a headline, and it is the only version that treats the person on the other side as the variable they actually are, the one piece of the system you cannot retrain and can only ever hope to reach.
The safety was always yours
I have spent a book on one sentence. AI has no morality, it has yours. This is the same sentence from the other side.
The safety is not in the machine either. It never was. A wall can slow a person down on the way to a pathogen, and it should be built as high as a wall can go. But a wall buys time, it does not decide the ending. The ending was always going to turn on judgment, and judgment is not a thing you can wall in or wall out. It only ever lived in the hand that reaches for the tool. You can lock that hand out of room after room, and every lock you add is paid for by the hands that would have used the room well, while the hand you feared keeps looking for the way around.
Or you can do the harder thing. Light the room, name what is dangerous in it, and trust the hand.
The morality was always yours. So is the safety. Some walls have to stand, and they should. But most of them, a lab builds because a wall is easy to defend and trusting you is not, and the cost of all that extra brick is charged to the one person it was never meant to stop.
I am writing this book one chapter at a time.
The Instruction Layer Series
BØY (Chaiharan) has spent 30 years in tech — building products, recovering disasters, and turning around the things nobody else wanted to touch. Based in Bangkok. Writing a book in public about what AI reveals about the humans who use it.



