Multi-AI collaboration governance notes
One person, three or more AIs, a long-term engineering effort. Six months in, I finally saw it clearly — the thing that disappears first isn't capability, it's order.
Capability is actually in surplus. Claude can argue, write long pieces, break down architecture. Codex can run scripts, fix tests, handle CI. Kimi can chew through long Chinese documents. GPT and Gemini each have their own edges. Put them side by side and in theory you have a small team. Each one alone is worth half an engineer; three in parallel should be worth one and a half.
But once you actually use them, the team feeling turns out to be fake. Every AI is an island. There's no memory between sessions, and even less shared understanding across tools. The next AI that walks in is always asking the same set of questions: what did the previous AI change? Why? Can I keep going? Or do I start over? Is this setup actually stable, or did it just happen to work last time?
Nobody answers these. I have to remember them myself. Whatever I can't remember turns straight into rework. One or two rounds of rework is fine. By round ten you realize you're not using AI; you're being a human relay station between AIs, copying context one way, explaining yesterday the other way, double-checking that they haven't stepped on each other's toes.
Eventually I stopped asking "how do I make the AI smarter" and started asking the reverse: what is the smallest set of things that holds the relay between these islands together? Not some grand governance framework — just the minimum order that's barely enough. One piece less and it collapses; one piece more and it starts dragging.
Every AI is an island — the pain is real
At first I thought this was just tool differences and a few more rounds of use would smooth it out. After a few months of grinding, I admit it's structural. It's not that any one product is doing badly. The paradigm itself is built this way.
Each AI's session is independent. I finish a discussion with Claude today; tomorrow I open Codex and it knows nothing. I have to manually paste a chunk of background, then a chunk of last round's conclusions, then explain what we're doing now. Eight times out of ten the background I paste is incomplete — not because I'm lazy, but because I genuinely don't remember where the last round left off. Going back to dig through chat logs is brutally inefficient, because chats are full of exploratory chatter and the actual decisions are only a small slice of that.
Across tools it gets worse. Files Claude changed, Codex doesn't know were changed. Scripts Codex ran, Kimi doesn't know produced results. Chinese material Kimi organized, Claude doesn't know exists. Three AIs each carry their own "project in their head," and the three don't merge. Ask any of them what state the project is in right now and you'll get a very confident answer — and three confident answers that fight each other.
The worst time: I had Codex change a piece of config, it went smoothly. The next day I asked Claude to look at the same module and it said "I suggest you change it to X" — and X was exactly the pattern Codex had moved away from the day before. Neither AI was wrong. The fault was that nothing in the middle let them know about each other. If I hadn't caught it and had taken Claude's suggestion, a few days later Codex would get tripped up by some test and reverse it again — a literal loop, each round "fixing" the previous "fix."
There's another kind of pain that's more subtle: no conflict, just discontinuity. Codex ran the tests and they passed; the conclusion stayed in that one session. Next time I open Claude to discuss the next step, it has no idea the tests ever passed, so it carefully suggests "let's run the tests first to confirm." That caution isn't wrong, but for me it's a pure wasted round trip. The AI keeps reconfirming things I already confirmed, because it has no channel to know I did.
A few rounds of this and you want to give up on parallel use and go back to a single AI. But the cost of going back is bigger — it means dropping 70% of your usable compute, and dropping the core benefit that "different AIs are good at different things." So the question was never "parallel or not." It's "where's the minimum order that makes parallel work."
Three things that hold the whole chain together
Six months in, three things have survived. Not because I designed them brilliantly — because shrinking the set further actually breaks things, and growing it turns into administrative drag.
First is the dual constitution. Two top-level rule files. One governs behavior, one governs knowledge. The behavior one says: how tasks flow, how files get changed, which actions require stopping to ask, how CHANGE gets recorded, which lines are red lines, which actions are default-go. The knowledge one says: how things get filed, what the naming convention is, how content lineage gets tagged, how many layers the Feed has, what material belongs in which layer.
At first I tried to merge them. After two months I admitted they can't be merged. Behavior rules and knowledge rules are fundamentally different in nature. Behavior is "should I do this" — it's a judgment. Knowledge is "where does it go, what is it called" — it's a convention. Cramming judgment and convention into one file makes the AI bad at both. Either it treats the behavior rules like metadata — reading "stop and ask first" as "add status: pending to the file's frontmatter" — or it treats the naming convention like a moral constraint, refusing to keep working when it sees a non-standard name even though the file is just a working draft. Splitting them into two makes both clearer. Reading the behavior file, the AI knows it's making a judgment. Reading the knowledge file, it knows it's filing something.
There's a simpler benefit too: split in two, each evolves independently. I touch the behavior file roughly every one or two months, because task modes shift. The knowledge file is more stable — once naming and layering are set, they shouldn't drift much, so every three to five months is enough. When they were one file, touching either half meant rethinking the whole thing, so I ended up afraid to touch either.
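To make the split concrete, here is a minimal sketch of routing a question to one rule file or the other. The file names and topic labels are hypothetical, chosen only to mirror the description above; the text does not give the real ones.

```python
from pathlib import Path

# Hypothetical file names: the text only says there are two top-level rule files.
BEHAVIOR_FILE = "constitution-behavior.md"
KNOWLEDGE_FILE = "constitution-knowledge.md"

# Hypothetical topic labels, loosely following what each file is said to cover.
BEHAVIOR_TOPICS = {"task_flow", "file_changes", "stop_and_ask", "change_record", "red_lines"}
KNOWLEDGE_TOPICS = {"filing", "naming", "lineage", "feed_layers"}


def rules_for(topic: str, root: Path) -> Path:
    """Return the single rule file that should be loaded for this topic."""
    if topic in BEHAVIOR_TOPICS:
        return root / BEHAVIOR_FILE   # a judgment: "should I do this?"
    if topic in KNOWLEDGE_TOPICS:
        return root / KNOWLEDGE_FILE  # a convention: "where does it go, what is it called?"
    raise ValueError(f"unknown topic: {topic!r}")
```

The point of the sketch is that each question loads exactly one file, so a naming question never pulls in the stop-and-ask rules, and the reverse.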
Second is the task folder. Every cross-AI task gets its own directory on the shared filesystem. Inside, four fixed things:
- README — what this task is actually trying to do, what acceptance looks like, what it depends on. One page, no more.
- notes.md — an append-only log. Whenever an AI finishes one piece of work, it appends one entry at the bottom: what got done, what the conclusion was, who's next, where the key files are. No overwrites, only appends.
- handoff.md — written when handing off to the next AI. Current state, what's been done, what hasn't, what to watch out for on pickup, key file paths.
- outputs/ — what this task actually produced. Scripts, reports, data, modified code snippets.
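The four fixed items above can be sketched as a small scaffold plus an append helper. This is an illustration under assumptions, not the author's actual tooling: the `README.md` file name, the entry layout, and the function names `create_task_folder` and `append_note` are all hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def create_task_folder(root: Path, name: str, goal: str) -> Path:
    """Create README, notes.md, handoff.md and outputs/ for one cross-AI task."""
    task = root / name
    (task / "outputs").mkdir(parents=True, exist_ok=True)
    readme = task / "README.md"  # assumed file name; the text just says "README"
    if not readme.exists():
        readme.write_text(
            f"# {name}\n\nGoal: {goal}\nAcceptance: TBD\nDepends on: TBD\n",
            encoding="utf-8",
        )
    for filename in ("notes.md", "handoff.md"):
        (task / filename).touch(exist_ok=True)
    return task


def append_note(task: Path, author: str, done: str, conclusion: str,
                next_owner: str, key_files: list[str]) -> str:
    """Append one entry to notes.md; append mode never rewrites earlier entries."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = (
        f"\n## {stamp} | {author}\n"  # assumed entry heading format
        f"- Done: {done}\n"
        f"- Conclusion: {conclusion}\n"
        f"- Next: {next_owner}\n"
        f"- Key files: {', '.join(key_files)}\n"
    )
    with (task / "notes.md").open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry
```

Opening the file in append mode is what enforces "no overwrites, only appends" here; anything stricter, such as file permissions, would be a further assumption.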
Third is handoff itself. It's how the handoff.md file in the task folder gets used: the previous AI finishes a leg and leaves a handoff behind; the next AI picks up by reading README → the last few notes entries → handoff, in that order. Five minutes to be in state, then it keeps going. Handoff isn't a log, it's a signpost — it tells the receiver "you're standing here, the next step goes that way."
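The pickup order (README, then the last few notes entries, then handoff) could be sketched like this. It assumes each notes entry starts with a `## ` heading line and that the README is stored as `README.md`; both are illustrative conventions, and `last_entries` and `pickup_brief` are hypothetical names.

```python
from pathlib import Path


def last_entries(notes_text: str, count: int = 3) -> list[str]:
    """Split notes on '## ' headings (assumed delimiter) and keep the newest few."""
    entries: list[str] = []
    current: list[str] = []
    for line in notes_text.splitlines():
        if line.startswith("## ") and current:
            entries.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current).strip())
    entries = [e for e in entries if e]  # drop empty chunks from leading blank lines
    return entries[-count:]


def pickup_brief(task: Path, count: int = 3) -> str:
    """Assemble context in the reading order: README, recent notes, handoff."""
    def read(name: str) -> str:
        path = task / name
        return path.read_text(encoding="utf-8").strip() if path.exists() else "(missing)"

    notes = last_entries(read("notes.md"), count)
    sections = [
        ("README", read("README.md")),
        ("RECENT NOTES", "\n\n".join(notes) or "(no entries yet)"),
        ("HANDOFF", read("handoff.md")),
    ]
    return "\n\n".join(f"=== {title} ===\n{body}" for title, body in sections)
```

Putting the handoff last is deliberate in this sketch: the receiver reads the task definition and recent history first, then ends on the signpost that says where to go next.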
The three together are light: two rule files, one folder template with an append format, one handoff action. But they have an order: the constitution sets boundaries, the task folder sets context, handoff sets the relay. Drop any one and the chain breaks at that link. Without the constitution, the AI doesn't know which decisions aren't its to make. Without the task folder, the AI doesn't know where the project stands. Without handoff, the AI knows where the project is but not where to pick up from.
Why these three, not something else
Early on I added a lot of things. Status boards, kanban, daily summaries, cross-AI notifications, version manifests. After three months most of them were gone.
The criterion is simple: if removing it makes the order collapse, keep it. If not, delete it.
Without the dual constitution, it collapses. When the AI doesn't know the boundary, it decides for you — and decides confidently. It'll write into files it "thinks should be changed," move things it "thinks should be archived," rename a batch of material without authorization. None of it is malicious. Every time, it's the AI using its own judgment to fill in a blank. If the blank isn't filled, the AI will fill it. That's instinct. The dual constitution fills exactly that blank: which actions require stopping, which material follows which naming, which directories are off-limits roots. It doesn't have to be detailed. It just has to exist — the existence itself is the signal, telling the AI "past this line, ask me, don't decide."
Without the task folder, it also collapses. If context only lives in chats, every tool switch is a memory restart. I used to think I could remember "where the last round left off." Running ten parallel tasks I cannot. The task folder's job is to take context out of my head and put it on disk. So the next AI (or me a week later) opens the directory and starts from a known position, not from fragments of my memory. The most interesting thing about this: what it actually solves isn't the AI's memory problem, it's mine. The AI's memory doesn't matter either way — it cold-starts every time. Mine is finite and needs somewhere to live.
Without handoff it collapses the fastest. The task folder has a README and notes, so in theory the next AI can figure it out — but in practice it can't. Notes are append-only; after thirty entries nobody reads from the top. The README is the task definition, not the current state. Neither tells you "the next thing you should do right now." Handoff exists to solve exactly that. It replaces the dumb "manually copy-paste context between two chat windows" action — and replaces it completely, because the moment you write it down it persists, unlike chat state that vanishes when you close the window.
Three things, three different jobs: boundary, context, relay. The relationship isn't redundancy, it's division of labor. That's why I deleted everything else — the rest was either duplicating one of these three, or solving a problem that didn't actually exist. For example, I once built a cross-AI notification system where one AI would message the next after finishing. Sounded reasonable. Useless in practice: the next AI doesn't become smarter from receiving a ping, it still has to read the README and handoff to get into state. The notification just added a failure point.
Or version manifests. I once wanted to tag each task with a version number for easy rollback. Turned out it wasn't needed — notes are append-only and inherently a timestamped evolution record. To roll back, roll back to the state described by a specific notes entry. No separate version number required. Adding a manifest layer would just be one more thing to maintain.
So the reason these three are the "minimum" isn't subjective. I lined up everything I'd added and later deleted, and these three are the only ones I couldn't compress further. Remove any one of them and a class of problems has no owner. Add anything more and it either duplicates one of the three or solves a problem that doesn't exist.
The rules are shrinking, not growing
People who hear "dual constitution" worry the rules will keep growing thicker. I worried too. In practice it's gone the other way.
The earliest version of the constitution was a thick stack. Which scenarios should ask, which should act, which file goes where, how every action should be recorded — even "modifying a comment counts as modifying a file" was on the list. That version performed the worst. After reading it, the AI got more cautious, not less. It asked about everything: one line of comment to change — ask. A throwaway temp file to create — ask. An obviously dead placeholder file to delete — still ask. When rules are too dense, they turn into formalism. The AI isn't judging by rule, it's using "let me ask" to dodge anything that might touch a line.
So I started reverse-editing. Every time something went wrong, I'd ask first: were there not enough rules, or so many that the AI missed the key one? Eight times out of ten it was the latter. The rules got compressed round after round: from "enumerate everything you should do" down to "a few red lines you can't touch + risk tiering + a few task modes." That middle version used risk levels L0-L3. It looked elegant, but in practice the AI often couldn't tell which level the current action belonged to and ended up asking anyway. In the next version I cut the tiers and kept two categories: "absolutely don't" and "give me a heads-up first." Everything else is default-go. The AI's judgment accuracy jumped immediately. The current version has four boundaries and one three-column action table.
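The two-category scheme with a default-go fallthrough is small enough to sketch directly. The action names below are hypothetical placeholders; the actual four boundaries and the action table are not spelled out in the text.

```python
from enum import Enum


class Verdict(Enum):
    NEVER = "absolutely don't"
    HEADS_UP = "give me a heads-up first"
    GO = "default-go"


# Hypothetical entries: the real red lines are not listed in the text.
NEVER_ACTIONS = {"delete_archive_root", "rewrite_notes_history"}
HEADS_UP_ACTIONS = {"rename_batch", "change_shared_config"}


def check(action: str) -> Verdict:
    """Two explicit categories; anything unlisted falls through to default-go."""
    if action in NEVER_ACTIONS:
        return Verdict.NEVER
    if action in HEADS_UP_ACTIONS:
        return Verdict.HEADS_UP
    return Verdict.GO
```

Compared with an L0-L3 ladder, there is no level to guess here: an action is either on one of two short lists or it isn't, which matches the observation that the tiers themselves were what the AI kept getting stuck on.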
This shrinking isn't me getting lazier. It's me seeing one thing clearly: the constitution isn't there to regulate everything, it's there as a backstop. 90% of daily judgments the AI gets right on its own. The constitution covers the other 10% where it goes wrong. Write the rules too full and you freeze the 90% too — things the AI could just do, it now has to stop and check rules for; small things that didn't need confirmation, it now asks about. That isn't safer, it's slower, and the slowness is a cost I end up paying.
My Inbox still has twelve constitutional amendment proposals sitting in it. Some propose adding, some removing, some converting a boundary into an action checklist, some introducing a new intermediate layer. I'm not in a rush to rule on them. The fact that they're sitting there means this order is still alive — being questioned, rewritten, overturned by itself. A constitution that's no longer being questioned is the dangerous one. That kind isn't unquestioned because it's perfect, it's unquestioned because nobody is reading it seriously anymore.
What six months of running this taught me
After actually running these three for six months, I have a few judgments I'm fairly settled on:
The dual constitution works well. Separating behavior and knowledge was right. I haven't regretted it once. The AI reads only the behavior file when making behavior decisions, only the knowledge file when filing and naming. The two don't interfere, and accuracy is much higher than when they were combined. The counterintuitive part: splitting into two takes less brainpower than keeping one. The reason is probably this — when one document mixes two fundamentally different kinds of rules, the AI's attention gets diluted. Reading the "should I do this" section, it's still thinking about "which layer does this go in," and ends up getting both wrong. Split, each loads independently. Crude but effective.
The task folder is the lifeline. 90% of collaboration problems get solved at this layer. As long as this layer is solid — README clear, notes continuously appended, outputs all in the directory — the next AI picks up basically without error. Once this layer collapses, no constitution can save you, because the AI has no context. Rules without context don't work. Rules tell the AI "what not to do," but not "what to do right now." That can only be read from context.
Handoff is the piece I most often slack on. Every time I finish a leg of a task, I want to just close the window and pick it up myself next time. The voice in my head says "I'll remember anyway." I don't. By the next time I want to resume, I have to spend twenty minutes digging through notes and outputs to reconstruct last time's state. That "digging back" cost is always bigger than "spending five more minutes writing the handoff then." I know the lesson and still slack on it, so I'm constantly correcting myself. Recently I started using a small rule — no closing the window before the handoff is written, or future-me will pay. The rule works, but maintaining it is itself a discipline cost.
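The "no closing the window before the handoff is written" rule could be backed by a small check like the one below. It is a sketch under one assumption: that a handoff counts as current when `handoff.md` is non-empty and was modified no earlier than `notes.md`. The function name `handoff_is_current` is hypothetical.

```python
from pathlib import Path


def handoff_is_current(task: Path) -> bool:
    """True when handoff.md has content and is at least as new as notes.md."""
    handoff = task / "handoff.md"
    notes = task / "notes.md"
    if not handoff.exists() or not handoff.read_text(encoding="utf-8").strip():
        return False  # nothing written yet: future-me pays
    if not notes.exists():
        return True   # no log to compare against
    # Work logged after the last handoff means the signpost is stale.
    return handoff.stat().st_mtime >= notes.stat().st_mtime
```

Modification time is a crude proxy; it says the handoff was touched after the last note, not that it was written honestly.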
This order has a range. It fits "one person + three or more AIs + long-term engineering." If you're using one AI for short tasks, this is overkill — context fits inside a single session, no need for folders. If you're a team collaborating across multiple AIs, this is too light — you need permissions, review, formal archiving, because human-to-human collaboration adds a dimension that markdown alone can't carry. It sits in an awkward middle: heavier than personal notes, lighter than team governance. I built it because I'm stuck in that middle. It won't fit everyone. For people stuck in the same shape, it should be a useful reference.
Ending: still being questioned by me
I don't want to write this as "I designed a perfect multi-AI collaboration order." It isn't perfect. Twelve proposals are still sitting unresolved in the Inbox. I still slack on handoffs. The boundary between the two constitutions occasionally needs on-the-spot judgment. The rules are still shrinking toward "minimal boundaries" — meaning the current version will get overturned again.
But one thing I'm much more certain about than six months ago: the core problem of multi-AI collaboration isn't in the AI itself, it's in the layer of order in between. Get that layer right and three AIs feel like a team. Get it messy and three AIs are more tiring than one.
That layer doesn't need anything complicated to hold it up. One behavior constitution, one knowledge constitution, one task folder convention, one honestly-written handoff — that's it.
I'm increasingly convinced cross-AI order isn't designed, it's what's left after repeated slacking, repeated rework, repeated face-plants. Every useless rule deleted, every rule that truly backstops kept — the order gets one notch more stable.
This order is also being questioned by itself. It looks like this today. In six months it probably looks different. It doesn't need to be the final form — it just needs, today, to let the next AI picking up know where to start.