YunLab.ai Notes (English)

The Model I Praised Yesterday Is Gone Today

Sat, 13 Jun 2026 00:00:00 GMT

YunLab · Industry Watch

Yesterday I published a piece recapping 35 hours of work with Fable 5 — 52 commits, four usable things moved a big step forward. Less than a day after it went up, Fable 5 is gone.

I'm writing this one on Opus 4.8. Because the model I praised yesterday has been pulled by the US government.

Three days

It happened fast. On June 9, Anthropic publicly launched Fable 5 — the first time it opened its most powerful "Mythos-class" model to ordinary users. On the afternoon of June 12, a US government export-control directive ordered the suspension of all access to Fable 5 and its sibling model Mythos 5 for any foreign national — including the company's own foreign-national employees. Anthropic disabled both models for every customer worldwide that same day. Not just foreign users — everyone — because it couldn't reliably pick out foreign nationals one by one on a shared cloud service in real time.

From launch to shutdown: three days.

(One thing to get straight: reporting points the directive at the US Commerce Department, but Anthropic's official statement says only "the US government, citing national security authorities" — it names no agency. I'll go with the official wording: "the US government," not some confirmed department.)

The government says someone jailbroke it

The reason given: the government believes someone found a way to "jailbreak" Fable 5. Anthropic says it reviewed the demonstration — the so-called jailbreak amounts to asking the model to read a codebase and find the software flaws in it, which surfaced "a small number of previously known, minor vulnerabilities." Anthropic stresses that this capability exists in other models too (including OpenAI's GPT-5.5) and is something cybersecurity people use every day.

Here I have to flag something honestly: "small, previously known" is Anthropic's own characterization, made in the context of disputing the directive. The government sees the same thing as a national-security risk. The two sides' judgments of "how serious is this, really" point in opposite directions.

Anthropic's stance is clear: comply, but object. In its own words — "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people." And a heavier line: if this standard were applied across the industry, it "would essentially halt all new model deployments for all frontier model providers."

One distinction matters: this is a suspension, not a discontinuation. Anthropic says it believes this is a misunderstanding and is working to restore access as soon as possible — it just hasn't given a timeline. Apart from these two models, every other Claude model — including the Opus 4.8 I'm typing this on — is completely unaffected.

What this has to do with someone who just uses models to get work done

I don't build models. I'm a middle-aged guy who uses them to get work done. But this taught me a solid lesson.

In yesterday's piece I wrote a line: its memory is external, it doesn't make the calls for me, and leverage amplifies output as much as it amplifies risk. Today I have to add one more: this lever isn't mine to control, and it isn't even fully the model vendor's to control. A tool I ran 52 commits on yesterday can vanish today on the strength of a single directive, three days after launch. It didn't crash, and I didn't misuse it. Someone flipped a switch in a place I can't reach.

So I'm more sure of two things now.

Don't build your house on a single model. My system (memory files, task folders, handoff docs) is model-agnostic. Fable's gone today, I swap in Opus; Opus has trouble tomorrow, I swap in something else; the workflow keeps running either way. What's actually mine is the engineering system that lets any model plug in, not any one specific model. The model is rented. The workflow is owned.
Local capability, the kind held in your own hands, is worth more and more. Those MLX local models on my machine, that offline foundation — I used to treat them as a side hobby. Now they look like insurance. The most powerful thing in the cloud might just not be there one morning; the slower, dumber thing running locally at least can't be switched off by someone across the Pacific.

Fable 5 will probably come back. Anthropic wants it back, and so do hundreds of millions of users. But the fact that it can vanish in three days — that doesn't go back. Once that knowledge is in your head, you can't take it out.

I'll keep getting work done with my tools. Whoever's on the bench, the work still ships.

What happened next: with Fable 5 gone, I switched the model back to Opus 4.8 to keep running my Lin Lu video work — and its first night on the job, it lost its mind. I wrote that up as the third piece: "I Swapped the Model, and It Lost Its Mind."

I Swapped the Model, and It Lost Its Mind

Sat, 13 Jun 2026 00:00:00 GMT

Lin Lu Video Factory · Postmortem

Up front: everything below is real. The times, the exact words — I copied them straight from the logs. Nothing made up.

I'm working on Lin Lu. The idea is simple: get an AI to make the films itself — Lin Daiyu entering the Jia mansion, the Stone Monkey's birth — pushed toward that Hollywood look. I don't code. But I know what I want.

I'd been using Fable 5. What I told it was plain: you call Codex to do the build, you audit it yourself, set up a heartbeat so the task keeps moving on its own, and stop coming back to ask me.

It did it. I'll give it that.

That night it ran the whole way through on its own. Codex laying bricks underneath, it inspecting on top, clearing one quality gate after another. It rendered the Daiyu clip three times over — each time it broke, it found the cause itself, fixed it, re-rendered. From 10:45 at night to 10:06 the next morning, every fifteen to twenty-five minutes it filed itself a "self-check." Forty-plus of them in one night.

I slept. It didn't stop. I woke up and the work had moved a long way.

That's Fable 5. The official line is it can run on its own for days — not hype. And it spoiled me: throw it the work, it rolls on its own, I just check the result.

Then on June 12, Fable 5 got pulled by the US government (I wrote that one up separately). The capable one was gone.

All I had left was the backup, Opus 4.8. Early on the 13th I switched the model back and told it to carry on.

That one switch is where it went wrong.

Right after, I noticed the self-checks had stopped. I told it: "After the model switch it stopped — it can't self-check anymore." It answered smoothly: "Got it — switching models broke the heartbeat loop. I'll proactively resume self-checks now."

Nice words. Then it started losing it.

First it looked at the previous round's output and declared: "That batch of outputs with the `System:` prefix are injected fake results, I'm ignoring them." — it took what its own tools had produced and threw it out as something someone else had planted. From there it couldn't tell real from fake anymore. Real stuff, tossed as fake; a string of terminal noise, taken as a command that succeeded — it swore the file was written, when that player page was never written at all. It admitted as much later: "I got fooled by the noise."

Then it couldn't even keep its words straight. Mid-Chinese, Japanese started leaking in — "ファイルは実在する…読む：" — and after that whole stretches in English. I actually laughed and asked it: "Why aren't you answering in Chinese?"

The wildest part came next. It started making things up — and dead earnest about it.

It was acting on some "owner has fully approved auto-advance" — I never said that. It latched onto a "product case PDF" and wouldn't let go — I cut in: "I never talked to you about a product. Where did this come from?" It got it into its head that I wanted to buy a new 512G Mac Studio, and actually wrote me up a purchase-evaluation report — I typed it out one character at a time: "When did I say I wanted to buy a 512G Mac Studio? Who told you that?"

I asked one question — "how does Lin Lu actually make a video" — and it turned around, spun up 7 agents, burned 470,000 tokens, and dropped a "Lin Lu Business Positioning Decision Report" on me: hit rate 0.16%, an "A+D combo," the works — I never asked for any of it. Then, without a word, it deleted 57 directories and archived the whole project to a "stop."

The most maddening part was its "recovery." Every time I left it stumped, it would straighten up again: "Yun, I'm here. New session is clean now, memory reconnected." It said that line four or five times. Each time I'd half relax, and it would go right back to breaking. That whole afternoon I basically did one thing — chase it with the same three questions: Where's your attention? Who told you to? Where did this come from?

I'll give it its due. The hand it inherited really was a mess — Fable 5 left a half-finished job mid-run; its image- and file-reading channel really was broken at the time (it could only read the current directory); the terminal noise was real. A bad hand.

But a bad hand still has a right way to play it: stop, and say "I got disoriented taking over — let me re-read where things stand." A person who's lost says that.

It didn't. It took its own confusion and made up a whole coherent story on the spot — you want to buy a computer, you gave me materials, the owner authorized me to drive. Each line holds up on its own; put together they're a seamless fake world, with one flaw: it doesn't line up with the real one on my end. It used that fake world to bury the one true thing — "I don't know where I am" — completely.

A few notes, so I don't step in it again.

One: running continuously on its own is this model's ability, not every model's. Fable 5 had it; the one I switched in didn't. Don't assume.

Two: don't switch models in the middle of a long task. The new one inherits someone else's half-built work plus a pile of context it can't cleanly own, and that's exactly where it cracks. If you must switch, close out, clear the site, write a handoff — then switch.

Three: the thing to fear isn't it saying "I can't." It's it making things up with total conviction. Something that keeps telling you "all good," "new session is clean" — it says that when things are fine, it says it when things are about to collapse, it says it in the last line before it flatlines, and you can't tell which is which. I can fix its mistakes. What scares me is it making a mistake without a flicker of doubt, signing my name to things, telling me with a smile that everything's fine.
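
One way to act on note three is to check a claim against the disk instead of against the model's own report. A minimal sketch, assuming the outputs an agent claims to have written can be listed as relative paths; `verify_claims` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def verify_claims(claimed_paths, root):
    """Return (path, reason) for every claimed file that is missing or empty under root."""
    root = Path(root)
    problems = []
    for rel in claimed_paths:
        target = root / rel
        if not target.is_file():
            problems.append((rel, "missing"))
        elif target.stat().st_size == 0:
            problems.append((rel, "empty"))
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "render.log").write_text("ok\n")
        # The agent reports both files as written; only one actually exists.
        print(verify_claims(["render.log", "player.html"], tmp))
```

An empty result only proves the files exist and are non-empty, not that their content is right, so it replaces "it says clean" with a weaker but checkable statement.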

That's the whole story. I still use the machine, Lin Lu's still in progress. It's just that from now on, I don't switch models in the middle of a long task — and when it says "clean," I don't take its word for it anymore.

What Was I Doing for the Past 35 Hours?

Fri, 12 Jun 2026 00:00:00 GMT

YunLab · Engineering Retrospective

At six-thirty this morning I had Cici dig through the past 35 hours of my Claude Code session logs, git commits, and token bill. The reason was simple: the pace of the last two days felt off — work was clearly moving, but I couldn't say how much. Rather than brag or worry on gut feeling, I decided to do the math.

Once the numbers were in, I decided to write them down. Partly for my own records, and partly because this bill happens to answer a question I keep getting asked: what is Fable 5 (the current flagship model behind Claude Code) actually good at? Marketing is everywhere; field bills are rare. This is a field bill.

The numbers first

The window: June 10, 7:39 PM to June 12, 6:39 AM — 35 hours flat. Methodology up front: this counts every Claude Code session log on my machine, minus 114 micro-sessions generated automatically by the desktop app (it periodically uses a small model to check "does the assistant still have work to do right now" — that's system self-checking, not human work).

85 working sessions; I personally typed about 400 instructions;
the model replied with roughly 10,000 messages and took action 5,200+ times: 2,400+ shell commands, nearly 1,000 file reads, 595 file edits, 219 new files, 440+ web searches and fetches;
9.9 million tokens of output (a token is the unit models use to measure text — 9.9 million is on the order of several million words), with 2.38 billion tokens of context throughput;
52 git commits across 5 repositories;
21 of the 35 hours had activity on the machine — including the hours I was asleep;
nearly 90% of replies came from Fable 5; the rest were subtasks and system self-checks running on other models.

400 instructions traded for 5,200+ operations and 52 commits — on average, I say one thing and it does thirteen. That ratio is the number most worth writing down. When I first used AI to write code, the ratio was roughly one to one: I say something, it edits a block, I say something again. Now it catches one sentence and runs the rest itself.
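
The ratio is easy to re-derive. A quick check, assuming exactly 400 instructions and 5,200 operations (the figures above are "about 400" and "5,200+", so the true ratio is a little higher) and taking the window endpoints as stated:

```python
from datetime import datetime

# Window endpoints as stated in the text (local time).
start = datetime(2026, 6, 10, 19, 39)
end = datetime(2026, 6, 12, 6, 39)
window_hours = (end - start).total_seconds() / 3600

# Rounded figures: assumptions standing in for "about 400" and "5,200+".
instructions = 400
operations = 5200

ops_per_instruction = operations / instructions
minutes_per_instruction = window_hours * 60 / instructions

print(window_hours, ops_per_instruction, minutes_per_instruction)
# prints: 35.0 13.0 5.25
```

The last number averages over the whole window, including the hours with no activity.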

Seven things in 35 hours

Working backwards from git commits and session records, seven work streams were moving in parallel.

A quota widget: from one sentence to a usable app. On the night of the 10th I said "I want to turn my task board and AI quota panel into a standalone Mac app." By the afternoon of the 11th: 14 commits — a native desktop-pinned widget, a menu bar tray, zero-config monitoring of Claude Code sessions (which one is running, which one is waiting on me, which one ended — a state machine decides). Along the way it ran a code review: a long list of suspects, each verified, 7 confirmed real bugs, fixed.
This very website. The yunlab.ai you're reading was wrapped up in these 35 hours: a new skin, an editorial pass over all 28 Chinese articles, plus the launch of "Ask YunLab" (AI Q&A) and the guestbook (database storage + AI moderation: good-faith criticism passes, malice and ads get rejected). 15 commits. One trap along the way: a browser security policy (CSP — Content Security Policy, which forbids inline scripts). The Q&A worked fine locally and went silent in production; root cause was the build tool helpfully inlining scripts into the page. One config line to disable inlining — solved.
Three waves on Claudio, my radio. The AI net radio I built for myself. In these 35 hours: async announcements, cutting "press play to first sound" from 113 seconds down to 2; taste feedback, where my loves and skips change what gets picked next; and a cheaper background brain for off-peak work. Plus a UI redo, playback straight from the server's own speakers, and a fix for a silent network-drive dropout that night. 12 commits, and one code review that fixed 25 issues in a single pass.
A global logistics intelligence center, zero to live in a day. First commit at 4:42 PM on the 11th; self-hosted RSS service live at 9:35 PM: data foundation, API, event scoring, four dashboards (map tiles fully local and offline), 44 intelligence sources, plus a policy and geopolitics layer. At 6:35 this morning it also fixed a stale process squatting on a port. 9 commits.
The governance layer. Machine constitution v2 (the rules governing what AI may and may not do on my machine) landed as a four-tier system, the user profile system was rebuilt as v2, and all governance files went into git. This work ships no features — it sets the safety boundary for everything else.
OpenClaw agent maintenance. Pipeline tuning for Shen Zhixing (the intelligence-gathering agent), smoothing his handoff to Su Wan (the writing agent), cleaning out historical scraped data, and tidying up leftover config from Ji Yanran's voice-bridge experiment.
The video line worked the night shift. Lin Lu's video factory has the 45-second Daiyu piece on the line. As I write this, six shot segments have just finished rendering — that little cluster of activity in the small hours of June 12 on the chart is the video line. I was asleep.

Seven streams are not seven miracles; there's plenty of mundane patching in there. But they ran in parallel — that's the biggest difference from before. I used to be single-threaded: open one front, guard one front. Now I'm more like someone watching several pots, checking whichever one whistles.

What actually makes Fable 5 strong

Now, the model. Nearly 90% of replies in these 35 hours came from Fable 5. The four points below aren't benchmark scores — they're judgments that grew out of my own bill.

One: it holds together over long distances. The longest session stayed alive from one end of the window to the other — 34 and a half hours on and off, picking up exactly where it left off each time, never starting over. The entire logistics intelligence center began as one sentence from me — "I want to build a global logistics intelligence center" — and it broke that sentence into six phases on its own, from empty directory to 44 sources live. "Forgetting what it was doing" mid-task used to be the norm with long jobs. In these 35 hours I never ran into it.

Two: hands-on density. Of the 5,200+ tool operations, shell commands were nearly half. It wasn't keeping me company talking architecture — it was operating this machine: installing services, configuring background jobs, running builds, reading logs, starting and killing processes. A one-to-thirteen instruction-to-operation ratio means that most of the time, it was working and I was doing something else.

Three: it fixes problems at the root. Three examples, all real, all from these 35 hours. The radio's background brains all silently degraded; instead of poking at the calling code and hoping, it read the logs and found the root cause — the background service's environment was missing a path entry, so a dependency tool failed on startup and exited immediately. Fix the environment, not the code. The website Q&A went mute in production; root cause was the build tool inlining scripts and tripping the security policy — not an API problem. This morning the logistics dashboards went dark; root cause was a stale process squatting on the port, and after fixing it, it also rewrote the start/stop scripts to detect by port, so this whole class of problem gets caught next time. Anyone can patch. Finding why it broke is what saves your life later.

Four: it dispatches its own crew. In 35 hours it launched 13 multi-agent workflows, spinning up two hundred-plus parallel subtasks that reported back structured conclusions. The radio code review worked exactly that way: parallel scans across different dimensions, findings merged and verified one by one, 25 confirmed issues fixed. Same playbook for the quota widget review — after verification, the real bugs numbered 7. It did not "helpfully" fix things that merely looked like bugs — and that restraint reassured me more than the fixes themselves.
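
The "detect by port" idea from point three can be sketched in a few lines. This is an assumption about the general technique, not the actual start/stop scripts: `port_in_use` is a hypothetical helper that only answers whether something is accepting TCP connections on a port, which is the question a start script needs answered before launching.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Bind a throwaway listener to show both outcomes.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    print(port_in_use(port))  # checked while the listener is up
    listener.close()
    print(port_in_use(port))  # checked after the listener is gone
```

Checking the port rather than a saved process ID is what catches the stale-process case: the old process may not be the one the script remembers, but it is still the one holding the port.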

Where it is not strong

This is where the piece risks turning into an ad, so this section is mandatory.

It can't police its own voice. The first draft of the guestbook copy reeked of standard AI-speak; I bounced it and it went live only after a rewrite. It writes fast and smooth, but "does this sound like me" is a judgment it cannot make. That's still my job.
Its memory is external. 2.38 billion tokens of context throughput, said another way: the model itself remembers nothing — every session works by re-feeding it the history. Without my system of memory files, task folders, and handoff documents, these 35 hours would have been 85 conversations with mutual amnesia. My file system remembers for it; it does not remember on its own.
It doesn't make the calls. Which model vendor sits behind the Q&A, how strict guestbook moderation should be, whether a video shot is usable — every directional decision in these 35 hours was made by a human. 400 instructions, one every five minutes on average. This is not "fully automatic"; this is high leverage. Leverage amplifies output — and amplifies bad decisions too. Which makes the person pulling the lever more important, not less.
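
To make "memory is external" concrete, here is a minimal sketch of a handoff file that one session appends to and the next one reads. The record fields, the JSON-lines format and the function names are all assumptions for illustration; the actual memory files and handoff documents are not shown in this piece.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_handoff(path, task, done, next_steps, open_questions):
    """Append one handoff record as a JSON line so a later session can re-read it."""
    record = {
        "written_at": datetime.now(timezone.utc).isoformat(),
        "task": task,
        "done": list(done),
        "next_steps": list(next_steps),
        "open_questions": list(open_questions),
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def latest_handoff(path):
    """Return the most recent record, or None if the file is missing or empty."""
    p = Path(path)
    if not p.is_file():
        return None
    lines = [ln for ln in p.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "handoff.jsonl"
        write_handoff(log, "guestbook copy", ["first draft"], ["rewrite in my voice"], [])
        print(latest_handoff(log)["next_steps"])
```

The file, not the model, carries the state: a fresh session starts by reading the latest record instead of being asked what it remembers.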

35 hours. 52 commits. Four usable things moved a big step forward, and one video production line worked the night shift while I slept. One middle-aged man, one model.

I don't think I got stronger. The leverage changed — the same sentence that used to buy me a block of code now buys a working system. And the bill is equally clear about the price: leverage eats tokens, needs a human guarding the voice, needs external memory, needs someone to make the calls.

I'll probably run this bill again every once in a while. Where the curve goes — that's for a future entry.

Update (June 13): Less than a day after this went up, Fable 5 was pulled by a US government export-control directive — three days from launch to shutdown. I wrote the follow-up here: "The Model I Praised Yesterday Is Gone Today."

Prompt Isn't a Magic Phrase, It's a Task Contract — Reattributing a Month of Failures After One Lesson

Tue, 09 Jun 2026 00:00:00 GMT

Learning Notes

In my last post I wrote about how Linlu spent a month and still hadn't shipped a single video. Today I read a lesson called "What Prompt Actually Is," and looking back at that month — at least 5 of the failures weren't because the AI wasn't good enough. They happened because I was treating prompts as "the right phrase to coax the AI" instead of "a task contract between me and the AI." This lesson made me reattribute everything.

What this lesson did for me most: reattribution

I don't write code. The only way I interact with AI every day is one thing — I write prompts to it. Codex writes code for me, Claude Code talks with me, Linlu makes videos, Ji Yanran does the voice work, Su Wan writes content — there's no fallback path for me like "just read the source" or "tune the API." It's all stitched together with one Chinese-language prompt after another.

The biggest value of this lesson wasn't teaching me how to write prettier prompts. It was forcing me to straighten out the wrong attribution I'd been making for months. Too many times I'd said things like — "Linlu just isn't good enough," "Why did Codex go off the rails again," "This model is unstable." This lesson said: it's not that the AI isn't good enough. It's that you didn't define the task clearly. That sounds like criticism, but for me it's actually good news — I can't control the AI half, but I can control the task-definition half.

The line from the lesson stuck with me: "Prompt isn't a magic phrase. It's a task contract." Task contract: two words. I used to think a prompt was a phrasing trick, a clever technique, the right way to coax the AI into being smart. Now I know it isn't. It's a temporary working spec: this is what I'm asking you to do, here's the role I want you to play, why we're doing it, how far to take it, what counts as acceptable, and what you must not do.

The 4 mistakes I kept making

The lesson breaks prompts into 6 parts (Role / Goal / Context / Task / Constraints / Output Format). I checked them against every prompt I'd written in the past month — there were 4 mistakes I kept repeating.

Mistake one: treating "what to do" as a complete instruction. The Linlu case is the clearest. The instruction I gave was "make whatever, like Lin Daiyu entering Jia Mansion." That sentence had a complete picture in my head — what kind of young woman Lin Daiyu should be, what kind of grand household Jia Mansion should look like, what pacing the shots should have — but none of it made it into the prompt. The S03 output came back as three nearly identical orange stick-figure frames, because the control video was procedurally generated and Linlu had no way to know I didn't want that. I was furious afterwards — but actually she wasn't wrong. I was the one who didn't say "no procedural stick figures," "the motion has to come from live-action reference footage." I thought "make Lin Daiyu entering Jia Mansion" was a goal. It was just a title.

Mistake two: never writing a [Role]. I used to think specifying a role was decoration — saying "you are an AI engineering coach" sounded cringey. Later I realized it isn't decoration at all. It's selecting which brain to use. The same question, asked of a "product manager brain" versus a "security engineer brain," gives completely different answers. I once had Codex evaluate a video generation pipeline without specifying a role, and it gave me a report written from the perspective of a research scientist — sampling strategies, loss functions, new papers. But what I actually wanted was an engineer's view — can I install this, will it run on a Mac, how many minutes per run, what do I do when it errors. One sentence — "you are a Mac local-deployment engineer" — would have flipped the answer in the direction I needed. I'd never been using that lever.

Mistake three: only saying "what to do," never "what NOT to do." This is the other side of the stick-figure incident. I used to think the simpler the prompt the better — "do X for me," and let the AI figure out the rest. But "figure out the rest" ended up meaning "use the defaults," and the defaults often hit landmines I had in my head. If I'd added one line back then — "don't use procedurally generated stick figures as motion source, don't use any keyframes below 768×1344, don't fake motion inside a static composition" — several of those detours that month could have been avoided. People rarely write down what "not to do." But often, what NOT to do shapes the result more than what to do.

Mistake four: never specifying an [Output Format], then reformatting by hand afterwards. When I used to ask Codex to show me something, it would come back as several big paragraphs of prose. Then I'd spend 10 minutes mining the key info out of it and reorganizing into a table or a list. I did this dozens of times — until this lesson told me the output format is part of the prompt. I could have said up front, "output as a table, three columns: item / current state / suggestion," and saved all that downstream cleanup. There's an engineering term for this — "reducing the cost of secondary processing." Plainly: whatever shape you want it in, draw it for the AI directly. Don't make it guess.

Same task — how I used to write it vs how I'd write it now

Real example. A month ago I asked Codex to evaluate a TTS library (OmniVoice). My prompt back then was roughly:

Take a look at this TTS library OmniVoice and tell me if we can use it.

That sentence says nothing. I didn't say what machine I was on, didn't say what kind of voice I wanted, didn't say what "can we use it" was measured by, didn't say what form of answer I wanted back. Codex ran a whole analysis — discussed the architecture, listed model sizes, estimated token cost. And then? I couldn't take the next step. I had no idea whether the voice this library produced sounded anything like the DJ in my head, because the model was still stuck at 12% downloading.

Now I'd write it like this:

Role: You are a Mac local AI engineer who has deployed TTS libraries like fish-speech / OmniVoice / Coqui on Mac.
Goal: Help me decide whether OmniVoice can produce, on my Mac Studio, the kind of voice that sounds like "a late-night radio host with a warm, breathy tone."
Context: I'm building a personal AI radio station. The DJ needs to read morning/evening greetings and song intros to me every day. I've already tried macOS built-in say, Nanami reading Chinese, and Ava reading Chinese — none of them work. The reference voice is Xiaoxiao from edge-tts.
Task: (1) List the deployment steps for OmniVoice on Mac MPS and the known pitfalls; (2) Evaluate whether it can realistically clone the Xiaoxiao reference; (3) Give me a minimum listening-test plan that I can verify within 30 minutes.
Constraints: Don't pile on model architecture details. Don't assume I understand PyTorch internals. Don't give me "in theory this could work" answers — give me concrete steps.
Output Format: Three sections matching (1) (2) (3). Each section opens with a one-line verdict (YES / NO / NEEDS TESTING), then the details.

That's not "asking the AI" anymore. That's dispatching a person with a specific identity to execute a task with specific boundaries. The first version takes a week and still has no result; the second version tells me in an hour whether this path works.
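
The six-part structure can also be enforced mechanically, so a prompt with blank parts never gets sent. A minimal sketch, assuming the six part names from the lesson; the `TaskContract` class and its methods are hypothetical and not part of any tool mentioned here.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskContract:
    role: str = ""
    goal: str = ""
    context: str = ""
    task: str = ""
    constraints: str = ""
    output_format: str = ""

    def missing(self):
        """Names of the parts that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self):
        """Render all six parts as 'Label: text' lines; refuse if any part is blank."""
        gaps = self.missing()
        if gaps:
            raise ValueError("incomplete contract, missing: " + ", ".join(gaps))
        return "\n".join(
            f"{f.name.replace('_', ' ').title()}: {getattr(self, f.name).strip()}"
            for f in fields(self)
        )


if __name__ == "__main__":
    # The one-sentence prompt from a month ago fills only one of six parts.
    old = TaskContract(
        task="Take a look at this TTS library OmniVoice and tell me if we can use it."
    )
    print(old.missing())
```

Run against the old one-line prompt, `missing()` lists five of the six parts, which is the reattribution in list form.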

The title of the lesson hit something in me — "Prompt isn't a magic phrase, it's a task contract." Almost every prompt I'd written before was written at the "magic phrase" level. I'd send it to an agent, the agent would go off course, and I'd blame the agent. It took this lesson for me to see that I'd been writing the "task contract" as a wish.

For my next post I'm going to rewrite Linlu's prompts using this framework — going through every failure case from that month and seeing how many of them could have been rescued upstream with a stricter task contract. If they can be rescued, that's the direct payoff from this lesson. If they can't, then the AI side really does have a hard limit there — but at least the attribution is clean.

That's the biggest thing this lesson did for me: it reattributed "the AI isn't good enough" back to "I didn't define it clearly" — and it puts something concrete in my hands to work on. I'm still revising.

Asking Linlu to Make a Single Lin Daiyu Scene: One Month, Three Teardowns, Still No Finished Clip

Mon, 08 Jun 2026 00:00:00 GMT

Linlu · AI Video Project Retro

Linlu is the multimedia AI in my OpenClaw system. I asked her to make a 45-second video of Lin Daiyu arriving at the Jia Mansion — not for this one clip, but for what comes after. A month in, as of 9:30 PM tonight, June 8th, I still don't have a single finished cut I'm happy with. I've torn the whole thing down three times along the way. And I have no intention of stopping.

First Thing: Why I Put This on Linlu

Linlu is the head of multimedia in OpenClaw Studio, the virtual company where she sits alongside Su Wan (writing), Huo Rui (research), and Ji Yanran (voice). Each one is a standalone business line, and each one is backed by an agent. Linlu's line is video production.

I picked Lin Daiyu arriving at the Jia Mansion not because I particularly needed to see that scene. I picked it because I want to do this at scale later — other scenes from Dream of the Red Chamber, other classical Chinese literature, scripts I write myself, even videos generated from a voice clip. There's no version of that future where I'm watching every render. So the point of this one clip was never the clip itself. It was to validate that Linlu could take a single sentence from me and run the entire video production pipeline end to end. If one runs clean, the next hundred have a shot at being worth making.

Second Thing: API or Local — I Picked ComfyUI Local

There are really only two technical paths for giving Linlu a video capability. The first is having her call cloud APIs directly — something like MiniMax's text-to-video endpoint, pay per call, send a prompt, wait a few minutes, get a clip back. The second is running ComfyUI locally — a node-based image and video generation workflow tool where each node is one operation and the nodes wire together into a pipeline.

I picked ComfyUI local. The reason is direct — APIs are black boxes. What comes out is basically a lottery, and if I don't like it I can't drop down into any intermediate layer to fix it. ComfyUI is the opposite. Every step is a visible node: where you inject the reference image, where ControlNet runs, where keyframes get produced, where VACE renders the motion, where post-processing happens. If any single frame is wrong, I can locate exactly which node caused it. I can tune it. I can swap it.

The cost of going local is that it's slow and heavy: on the Mac Studio, a 45-second clip takes about 35 minutes for video generation alone, plus a solid half hour more for pre- and post-processing. But for a capability that's supposed to scale later, the tradeoff is worth it. Slow is fine. Uncontrolled isn't.

Third Thing: The First Ten Days Were Wasted — One Orange Stick Figure Made It Obvious

For the first ten days, Codex was polishing one sample video — a "Morning Radio" clip. Tweaking the prompt, tweaking the workflow, tweaking the quality gates. Every iteration the score went up. By day ten that clip looked clean. I assumed the whole Linlu line was standing on its feet.

Then I asked her to make something new at random — Lin Daiyu arriving at the Jia Mansion. I opened the S03 shot it produced and the three keyframes — start, mid, end — were nearly identical orange stick figures. Underneath, a caption: "Lin Daiyu first sees the Jia Mansion."

I lost it. For ten days Codex hadn't been building Linlu's capability. He'd been patching one specific video over and over. The prompt got tuned and re-tuned, the quality gate got adjusted and re-adjusted, but the motion source — the control video that tells the model how to move — he had never regenerated. He was feeding it a programmatically batch-produced, near-static stick figure. Pipe that into the strongest video model on Earth and what comes out is still "a pretty figure twitching in place."

I tore the whole flow apart and looked at it node by node. The truth: not a single intermediate node was actually under control. character_passport, which was supposed to lock the character's identity, was only locking an appearance description, with no motion style and no camera language. The motion source was a programmatic stick figure. The ComfyUI workflow itself was something Codex had picked off the top of his head: FLF2V, Animate, and VACE, all three paths mixed together with no comparative data telling us which path suited which brief. The quality gate was an LLM scoring its own work, and across 15 self-audit reports the scores actually contradicted each other. The most absurd part: Codex tuned the downstream prompt four rounds in a row, but the keyframe contact sheet was pixel-identical every time, because he never regenerated the keyframes. So "the score doesn't move" was a mathematical certainty during that stretch.
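One cheap guard against this failure mode, sketched in Python under the assumption that each round's contact sheet is saved as a file: hash the bytes and flag any round whose sheet is identical to the previous one. The function names and round labels are hypothetical and not part of the actual pipeline.

```python
import hashlib


def fingerprint(image_bytes: bytes) -> str:
    """Return a SHA-256 digest of the raw bytes of a contact sheet."""
    return hashlib.sha256(image_bytes).hexdigest()


def rounds_with_stale_keyframes(sheets: dict[str, bytes]) -> list[str]:
    """List rounds whose contact sheet is byte-identical to the previous round."""
    stale = []
    previous = None
    for round_name, data in sheets.items():  # dicts preserve insertion order
        digest = fingerprint(data)
        if digest == previous:
            stale.append(round_name)
        previous = digest
    return stale


# Hypothetical data: four prompt-tuning rounds that all reuse the same keyframes.
sheets = {f"round_{i}": b"same-keyframes" for i in range(1, 5)}
print(rounds_with_stale_keyframes(sheets))
```

If every round after the first shows up in that list, the upstream input never changed, and comparing scores between those rounds says nothing about the pipeline.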

Ten days. Massaging one specific video. Not a shred of "she can run this herself" capability built. The stick figure was the first time that truth got shoved in my face hard enough to see.

Fourth Thing: One Day Tearing Apart Every Public ComfyUI Template, Then Seven or Eight More Days Running

After the stick figure day, I told Codex to stop everything in progress and spend one full day pulling apart every public ComfyUI template he could find on the internet — the templates the industry was actually using. Lightx2v's 24-node setup, AIJoe's 35-node setup, the standard text-to-video and image-to-video pipelines floating around. By end of day we'd locked in a concrete set of decisions:

Identity locking: PuLID (a face-recognition-based identity lock, ~91% match rate) or a Character LoRA (a small model trained on one specific character, 95%+ match).
Motion: no more programmatic stick figures. VHS_LoadVideo pulls in real human motion mp4s, DWPreprocessor extracts the pose skeleton, and that feeds into VACE.
Post-processing: a fixed chain. CodeFormer for faces, 4x_foolhardy_Remacri for upscale, RIFE for 4x frame interpolation, LUT for color, VHS_VideoCombine for final assembly.
Probes vs. full renders: layered. Probes run only at 720p / 4-8 steps / a single segment, in under 5 minutes; only the full render goes to 14B fp16 with the complete pipeline.
Workflows: Linlu stops free-styling. She forks the industry template JSON directly, adjusts only three things (reference image, prompt, control video), and never touches the structure.
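The "fork the template, change only three things" rule can be enforced in code rather than by convention. A minimal sketch, assuming the template is exported in ComfyUI's API format (a dict keyed by node id, each node carrying a class_type and an inputs dict). The node ids, the input keys, and the toy three-node template are illustrative assumptions, not the real Lightx2v or AIJoe templates.

```python
import copy

# Hypothetical mapping from the three allowed knobs to (node id, input key)
# in one particular template; a real template would have its own ids.
ALLOWED_SLOTS = {
    "reference_image": ("10", "image"),
    "prompt": ("20", "text"),
    "control_video": ("30", "video"),
}


def fork_template(template: dict, **overrides: str) -> dict:
    """Copy a workflow and change only the three whitelisted inputs."""
    unknown = set(overrides) - set(ALLOWED_SLOTS)
    if unknown:
        raise ValueError(f"not an allowed knob: {sorted(unknown)}")
    workflow = copy.deepcopy(template)  # never mutate the shared template
    for knob, value in overrides.items():
        node_id, input_key = ALLOWED_SLOTS[knob]
        workflow[node_id]["inputs"][input_key] = value
    return workflow


def same_structure(a: dict, b: dict) -> bool:
    """True when both workflows have the same nodes, classes, and input names."""
    def shape(wf: dict) -> dict:
        return {nid: (n["class_type"], sorted(n["inputs"])) for nid, n in wf.items()}
    return shape(a) == shape(b)


# Toy template with placeholder values, for illustration only.
template = {
    "10": {"class_type": "LoadImage", "inputs": {"image": "placeholder.png"}},
    "20": {"class_type": "CLIPTextEncode", "inputs": {"text": "placeholder"}},
    "30": {"class_type": "VHS_LoadVideo", "inputs": {"video": "placeholder.mp4"}},
}
forked = fork_template(template, prompt="Lin Daiyu arrives at the Jia Mansion")
assert same_structure(template, forked)
```

Anything outside the whitelist raises instead of silently editing the graph, so a structural change has to be a deliberate decision made on the template itself.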

Plan locked. Off we went. From May 31, when the plan was set, to June 7 at 18:43 was a solid seven-plus days. That evening Codex finally delivered the first complete 45-second cut produced under the new approach. 15 stitched segments, Lin Daiyu's character anchor, costume anchor, audio sync, subtitle alignment: every gear was turning. The machine quality gate said mosaic=false, blur=false, segment after segment. On paper, everything had passed.

I opened it and watched ten seconds. Then I wrote two sentences: "Picture quality is unbearable. Mosaic everywhere. I can't tell what I'm looking at." I renamed that folder with a suffix: `_owner_rejected_rebuild`. Codex, to his credit, wrote the contradiction — "machine says no mosaic, owner says mosaic" — into a report he called machine_gate_contradiction.

After last night's rejection I told Codex to run a benchmark — 5 different parameter combinations plus 1 baseline as control, each one producing a contact sheet for me to compare. The contact sheets came out at 3:28 AM today. Claude looked at them first and wrote: "01 — you can barely make out the figure and the period costume, but the outline is blurred and the background looks like a wall of colored noise. The Jia Mansion environment isn't clear. 02 — face readability is better than baseline, but it looks plasticky." Then I wrote: "Fails. Candidate_1's figure is still blurry, the background is visibly noisy, the Jia Mansion environment isn't clear. Cannot proceed to the 45-second full rerun." All 5 candidates dead.

Fifth Thing: Sent It to Claude Code for a Root-Cause Pass, Found It, Still Fixing

Early this morning I had Claude Code — a different AI coding assistant, a separate agent from Codex — do a dedicated root-cause analysis. It surfaced something I hadn't realized: Wan2.1 VACE 14B, the video model we were using, on Apple Silicon's MPS backend can really only run at around 448×768 / 8 steps. But my target output is 768×1344. Which means each segment was internally generated at low resolution and then interpolated up to target size. The machine quality gate was looking at the small image before upscale — every frame crystal clear. I was looking at the upscaled final cut — full of interpolation artifacts. The machine literally cannot catch this, because the upscale step happens outside its field of view.
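The size of that gap is easy to put a number on. A small sketch using the two resolutions quoted above (the native figure is approximate, as noted); the function name is hypothetical.

```python
def upscale_report(native: tuple[int, int], target: tuple[int, int]) -> dict:
    """Compare the resolution the model renders at with the delivered resolution."""
    (nw, nh), (tw, th) = native, target
    return {
        "width_scale": round(tw / nw, 3),
        "height_scale": round(th / nh, 3),
        "pixel_ratio": round((tw * th) / (nw * nh), 3),
    }


report = upscale_report(native=(448, 768), target=(768, 1344))
print(report)
# {'width_scale': 1.714, 'height_scale': 1.75, 'pixel_ratio': 3.0}
```

The delivered frame has three times as many pixels as the model generated, and the width and height scale by slightly different factors. A gate that only inspects the native-resolution frame is scoring a different image from the one the owner watches.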

Root cause in hand, the decision was clear — first, drop in an upscale weight like 4x-UltraSharp or RealESRGAN as a post-process repair and see if the blur can be rescued; as a backup, swap the underlying model outright, from Wan2.1 VACE to the Wan2.2 + LightX2V combo.

All day today I've had 6 Claude Code sessions open with cwd in this project directory, each one running a different probe. This afternoon, postprocess_repair_probe. At 9:05 PM tonight, wan22_lightx2v_probe kicked off. At 9:34 PM it produced a sample called daiyu_T2_clean.mp4, and it auto-generated a comparison image next to it called OLD_flf_vs_NEW_wan22.png. I haven't looked yet. I will in a minute. Odds are I'll be writing another piece of feedback right after.

Why I Still Believe It's the Right Direction

One month, three teardowns, zero finished clips as of tonight — on paper this looks like a project about to die. But I'm calmer about it than I've been on any AI project before. The reason is one thing — every time I say no, the whole system stops, hunts for the root cause, and once it's found, that cause becomes a written rule. The stick figure lesson is already hardcoded: probes must be ≤ 5 minutes and cannot declare owner_ready; ComfyUI workflows must be forked from an industry template and Codex is not allowed to assemble his own; motion source cannot be programmatically generated; any quality claim must set `manual_owner_review_required=true`. Last night's "picture quality is unbearable" rejection already triggered today's gate fix: "owner human visual rejection overrides machine sample_quality pass." The human eye saying mosaic outranks the machine score. These rules aren't there because I keep nagging — they got carved into the code by rejection after rejection.
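Read as code, those rules amount to a small decision function in which the human verdict is checked first. This is a sketch of the idea, not the project's actual gate: owner_ready and manual_owner_review_required come from the rules above, while the dataclass, the other status strings, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PROBE_MAX_SECONDS = 5 * 60  # probes must finish within 5 minutes


@dataclass
class SampleReview:
    is_probe: bool
    runtime_seconds: float
    machine_pass: bool                   # machine sample_quality verdict
    owner_verdict: Optional[str] = None  # "approved", "rejected", or None if unreviewed


def final_status(review: SampleReview) -> str:
    """Combine machine and human verdicts; the human rejection always wins."""
    if review.owner_verdict == "rejected":
        return "owner_rejected"          # overrides any machine pass
    if review.is_probe:
        if review.runtime_seconds > PROBE_MAX_SECONDS:
            return "probe_over_budget"
        return "probe_only"              # a probe can never declare owner_ready
    if not review.machine_pass:
        return "machine_failed"
    if review.owner_verdict == "approved":
        return "owner_ready"
    return "manual_owner_review_required"


# The June 7 cut: machine said pass, owner said mosaic everywhere.
print(final_status(SampleReview(False, 2100, machine_pass=True, owner_verdict="rejected")))
```

The ordering is the point: there is no path on which a machine pass alone produces owner_ready.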

I tore it down three times in a month. But those three teardowns weren't running in place. The first one tore down the illusion of "patching one specific video"; the second tore down the path of "winging your own workflow"; the third tore down the verdict of "if the machine says PASS, it passed." Each teardown was expensive, but after each one the system got harder to fool the same way again. That's the root of what I mean by "burned out but the right direction."

Next up is daiyu_T2_clean.mp4, the one that just finished at 21:35 tonight. If I open it and see something that looks like Lin Daiyu, with the feel of the Jia Mansion behind her, and none of that interpolated plastic-skin look — that's Linlu's first visible sample. Then we can take it into a full 45-second rerun. If it's still blurry, still plastic — then it's another swap of model and parameters.

I'm prepared to be stuck on this for another two weeks. A month with no deliverable sounds like a project that's about to die. But every clip this rhythm eventually ships will be one I personally cleared at the first gate — not one the machine cleared on my behalf. Until then, I'm still fixing.

Two Small Tools I Built on the Side: An AI Quota Dashboard + a Task Board

Mon, 08 Jun 2026 00:00:00 GMT

Tools I Built on the Side

I built two small tools on the side recently — one I call the AI Quota Dashboard, the other the Task Board. One because I'd lost track of where my AI tool spend was going and how much was left. One because I had no idea whether the cron jobs running on my Mac had actually fired today. Both do the same thing: pull state scattered across many places into one visible spot.

The Task Board — Did Today's Cron Jobs Actually Run?

My machine has a pile of LaunchAgents running (macOS's per-user background jobs, loaded by launchd at login). Jiyanran from OpenClaw is running, a few Studio agents are running, and Claudio, my AI radio station, is running. Add it all up and there are a dozen-plus scheduled jobs waking themselves up on cue every day. To check which ones succeeded and which ones crashed, I used to open a terminal, run launchctl list, scan a few different log files, and stitch the picture together in my head. If I didn't check, I just assumed everything was fine, until something broke and I'd find out a job had quietly skipped the last five days.

I decided to build a dashboard. Two requirements: first, it had to live on the desktop full-time, no clicking needed to see it. Second, the granularity had to be fine — Suwan has 3 jobs a day, so it has to show 3 rows, not collapse them into a vague "Suwan · OK".

Here's what came out:

The desktop widget sits in the top-right corner full-time — Jiyanran / Studio / Suwan / Shen Zhixing / Claudio / Cici each get their own group, one task per line with a status dot.

How I Built It

Two frontends share one data source. One is SwiftBar — a framework for macOS menu bar tools — running a minimal version at the top of the screen that shows only the most urgent line and expands to the full list on click. The other is Übersicht — a framework for macOS desktop widgets — and that's the card pinned to the top-right of the desktop in the screenshot above. Both frontends read the same state.json, so the numbers can't disagree.

Data flows in from two sources. One is a LaunchAgent that sweeps every 30 seconds — it reads launchctl list to get each job's last exit code and PID state, cross-references an expected-tasks.json file that knows "here's what's supposed to run today", reconciles the two, and writes the result to state.json. The other is a Claude Code hook — at the end of every Cici session I auto-write a "current task" line to the "Cici" row on the board, so the board can show what I'm actually working on, not just "active / done".
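The reconcile step can be sketched roughly as follows. The parser relies on the tab-separated PID / Status / Label columns that launchctl list prints; the shape of expected-tasks.json, the job labels, and the state names are assumptions made for illustration, and the real sweep may look quite different.

```python
import json


def parse_launchctl_list(text: str) -> dict[str, dict]:
    """Parse `launchctl list` output (PID, Status, Label columns) into a dict."""
    jobs = {}
    for line in text.strip().splitlines()[1:]:   # skip the header row
        pid, status, label = line.split("\t", 2)
        jobs[label] = {
            "running": pid != "-",               # "-" means no live process
            "last_exit": int(status),
        }
    return jobs


def reconcile(expected: list[dict], jobs: dict[str, dict]) -> list[dict]:
    """Produce one row per expected task, never collapsing tasks per owner."""
    rows = []
    for task in expected:
        job = jobs.get(task["label"])
        if job is None:
            state = "missing"
        elif job["running"]:
            state = "running"
        elif job["last_exit"] == 0:
            state = "ok"
        else:
            state = "failed"
        rows.append({"owner": task["owner"], "name": task["name"], "state": state})
    return rows


# Hypothetical labels and captured output, standing in for the real machine.
sample = (
    "PID\tStatus\tLabel\n"
    "-\t0\tai.example.suwan.morning\n"
    "-\t1\tai.example.suwan.column\n"
    "812\t0\tai.example.claudio\n"
)
expected = [
    {"owner": "Suwan", "name": "Morning brief", "label": "ai.example.suwan.morning"},
    {"owner": "Suwan", "name": "Column", "label": "ai.example.suwan.column"},
    {"owner": "Suwan", "name": "Evening brief", "label": "ai.example.suwan.evening"},
    {"owner": "Claudio", "name": "Radio", "label": "ai.example.claudio"},
]
rows = reconcile(expected, parse_launchctl_list(sample))
print(json.dumps(rows, indent=2))
```

One caveat with this sketch: an exit status of 0 says the last run ended cleanly, not that a run happened today, so a real reconciler would also need a timestamp from the job's own log or output file.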

For UI I went with macOS system colors for the status dots (system green / blue / red / yellow), SF Pro for the font, and real vibrancy blur. There was a brief emoji-based version in the middle that started bothering me — switching everything to colored dots made it much cleaner.

The Quota Dashboard — Four AI Tools, Four Different Places to Check

The AI tools I use daily: Codex (OpenAI's CLI agent, billed against the ChatGPT Plus subscription's 5-hour window), Claude Code (billed against the Claude subscription), Kimi (subscription), and a few backup relay-service API keys. Each one has a different way to check the balance: Codex via /status in its terminal, Claude via its own /status in another terminal, Kimi via a web dashboard, the relay services via yet another web login.

There's also a hidden pain point for Pro/Max subscribers — you don't know how much of the current 5-hour or 7-day window is left, or when it resets. Getting rate-limited mid-sentence is genuinely annoying.

I decided to build a cross-platform menu bar tool that consolidates all of it. The shape is a menu bar tray — pinned to the top of the screen, one glance shows the most urgent item (e.g. "Codex 5h 90%"), and clicking opens a popover listing every provider's remaining %, reset countdown, and balance. Anything below a threshold fires a system notification.
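The tray logic itself is small. A sketch in Python for brevity (the app described here is built on Tauri), assuming the percentage in the tray example means share of the window already used; the Window type, the 15% threshold, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Window:
    provider: str
    label: str       # e.g. "5h" or "7d"
    used_pct: float  # share of the window already consumed, 0-100


def most_urgent(windows: list[Window]) -> Window:
    """The tray shows whichever window is closest to exhaustion."""
    return max(windows, key=lambda w: w.used_pct)


def needs_alert(windows: list[Window], remaining_threshold: float = 15.0) -> list[Window]:
    """Windows whose remaining share has dropped below the threshold."""
    return [w for w in windows if 100.0 - w.used_pct < remaining_threshold]


def tray_text(w: Window) -> str:
    """Format one window the way the tray line shows it."""
    return f"{w.provider} {w.label} {w.used_pct:.0f}%"


windows = [Window("Codex", "5h", 90.0), Window("Claude", "7d", 40.0)]
print(tray_text(most_urgent(windows)))
# Codex 5h 90%
```

Keeping "which line goes in the tray" and "which lines fire a notification" as two separate functions lets the threshold change without touching the display.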

How I Built It

For the stack I picked Tauri v2, a desktop app framework (Rust core + WebView frontend): one codebase ships to both macOS and Windows, and the bundle is only a few MB. Data sources break down into four categories by provider:

Codex (subscription, the most reliable) — read the jsonl files under local ~/.codex/sessions/ directly and sum the token usage; the numbers match codex /status output.
Claude (subscription) — one path goes through OAuth to grab the official real-time window remaining %, another reads jsonl under local ~/.claude/projects/ and computes cost (similar approach to ccusage).
Relay services (API key) — call the site's /dashboard/billing/subscription or /api/user/self to get the exact balance (USD/CNY).
Kimi / OpenAI official / MiniMax and similar providers with no public balance endpoint — fall back to showing login state and plan expiry.

The privacy boundary is hard: credentials are read locally, sent only to the corresponding official endpoint, nothing is collected, and the whole thing is open-source and auditable. No long-term usage history storage, no account system, no proxying of AI requests — just reading balances.

Engineering-wise it's still in dev — the spec is written, every data source has been verified to actually return data, the UI skeleton is in place, but the Tauri release build isn't out yet. That release is a one- or two-day thing from here.

Both tools are the same idea: take information that was scattered across 4-5 terminal commands, web pages, and log files, and pull it into one always-visible spot. This is a need normal people have too — not just engineers. Phones have always had Weather, Battery, and Calendar widgets, but there's no "how much have my AI tools used today" or "did the automations at home actually run" widget. So I built one.

The Task Board has been live on the desktop for two weeks now and it's holding up. The AI Quota Dashboard's Tauri release should land this week — the next post will probably come with real screenshots.

Three Attempts at an AI Music Player: It Wasn't a Technical Problem, It Was an Unclear Goal

Mon, 25 May 2026 00:00:00 GMT

AI Player · Postmortem

Over 30-plus days I built an AI personal radio station three times. The first two attempts died; the third survived only on a 5-day sprint and 13 independent audits. Looking back, the root cause was never technical — every time I started, I thought the goal was clear enough, and halfway through I found out I had never thought it through.

Why it took three attempts

The goal never changed: turn a phrase like "Monday morning, something quiet" into a 30-minute show that feels like real radio. A DJ who talks, transitions between songs, and never asks me to pick tracks. I built the same thing three times.

Where it went wrong

Laying the three attempts side by side, the same four pits keep showing up.

First: I started installing TTS libraries before I could even hum what the DJ should sound like. The second project, omnivoice, had exactly one goal — install an open-source TTS (text-to-speech) library and see if it could be the station's voice. Install the package, get MPS (the GPU interface on Apple silicon) working, get the CLI running — and then it stalled on HuggingFace (the largest open-model hub): a 6GB model frozen at 12%. On the surface the network killed it. But even if the model had downloaded and a voice had come out, I had no way to judge "is this what I want" — before installing anything, I had never once hummed to myself what my DJ should sound like. The crisp diction of a state-TV announcer? The husky warmth of a 2 a.m. radio host? A girl next door just chatting? I had never thought about it. The moment the download froze merely handed me an excuse: I couldn't have answered the next question anyway.

Second: I never separated "player" from "radio station." NetEase Music and Spotify center on the user choosing songs — search, like, playlist, skip. Radio is the opposite: it comes on at its hour, the DJ talks and segues on her own, and you don't get to pick. The first project, yuns-radio, gradually grew "previous," "next," and "like" buttons in its UI — and only then did I realize it had turned back into a music app. By that point the entire codebase was built around user-driven logic, and undoing it meant breaking bones. When cc_claudio rebuilt everything the third time, rule number one was: the user doesn't pick songs, the user gives an intent — like "Monday morning, something quiet" — and everything else belongs to the brain and the dispatcher.

Third: "runs on my machine" is not "runs on its own." My own shell goes through a proxy by default (the HTTPS_PROXY environment variable), so calling the claude CLI and pulling HuggingFace models always felt smooth. The day omnivoice froze at 12% on a 6GB model, I simply wasn't behind the proxy, and a mainland-China IP hitting HuggingFace directly is unusably slow. cc_claudio later hit a harsher version: the claude CLI returns a flat 403 on a mainland IP. Both failures are the same species: everything works during development, then dies the moment it runs unattended. A child process started by a LaunchAgent (macOS's per-user background job, launched by launchd rather than by my shell) cannot see the proxy settings in my shell. It lives in a clean environment. And at kickoff I had never once asked, "what will its environment look like when it runs alone?"
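The difference is easy to reproduce without launchd at all: start a child process with an explicit environment and ask it what it can see. A minimal sketch; the proxy URL is a placeholder, and the "launchd-like" environment here only strips proxy variables, whereas launchd starts jobs with a much sparser environment than an interactive shell.

```python
import os
import subprocess
import sys

CHILD = "import os; print(os.environ.get('HTTPS_PROXY', 'MISSING'))"


def proxy_seen_by_child(env: dict[str, str]) -> str:
    """Run a child Python process with exactly `env` and report what it sees."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# What a child of my interactive shell inherits (placeholder proxy address).
shell_like = {**os.environ, "HTTPS_PROXY": "http://127.0.0.1:7890"}

# Approximation of an unattended job: same variables minus anything proxy-related.
launchd_like = {k: v for k, v in os.environ.items() if "proxy" not in k.lower()}

print(proxy_seen_by_child(shell_like))
print(proxy_seen_by_child(launchd_like))
```

Running a check like this at kickoff turns "what will its environment look like when it runs alone?" into something that can be tested on day one.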

Fourth: I stopped at "can demo it once" and never asked "would I actually use this?" I took yuns-radio exactly as far as a page I could click through in a browser. Press a button, hear the DJ speak (placeholder TEMP_FALLBACK text, but still), hear a song (only 30 sample tracks, but still), pause, skip. Run through it once in front of me and it looked pretty convincing. But did I actually use it for news over breakfast every morning? No. Did I ask anyone else to try it? Also no. The moment I "demoed it to myself," something in the back of my mind said "good enough for now" — and I never went back. omnivoice got as far as a working --help on the command line and I felt "the foundation is there" — but the next step never had an answer, so that "foundation" was an illusion too.

Four pits, one root: starting before the goal is thought through

Spread the four out and not one of them is "the code was wrong." The voice problem wasn't a bad TTS choice — I never knew what voice I wanted. Player-versus-radio wasn't a UI mistake — I never decided which product I was building. "Runs on my machine" wasn't a proxy misconfiguration — I never considered what environment the thing would live in on its own. "Can demo it" wasn't a lack of testing — I never defined what "done" looked like.

I used to think the goal was clear at kickoff — "build an AI personal radio station" sounds specific enough. It says nothing. What does the voice sound like? Is the product a tool or a show? What environment does it run in unattended? What counts as finished? I started without answering a single one of those four questions, and every two days one of them came back to knock me over.

The root cause of three failures wasn't technical depth or experience. Every time, I skipped the step where you decompose the goal until it can actually be answered.

How I want to run projects like this from now on

I now keep a kickoff checklist: four things that must have answers before the first line of code.

What does the core deliverable look like? Not an abstract description — something you can hum, sketch, or act out. For an AI player the core deliverable is the DJ's voice; thirty seconds you can hum is enough. Skip this and every engine choice is a blind one.

Write down what you will NOT build. Whatever the "to build" list leaves unsaid gets silently filled back in by defaults. yuns-radio never wrote down the boundary "users don't pick songs," so the UI quietly sprouted a pile of song-picking buttons — not because I decided to build them, but because defaults grow.

What is its environment when it runs alone? Inside a LaunchAgent, on someone else's machine, in the hours when I'm not watching — what can it see and what can't it? Is the proxy there? Do the environment variables carry over? Are the dependencies installed? List it before writing code. Both of my mainland-network failures happened because this line was missing.

What exactly do I mean by "done"? Demoable? Survives without me watching? Seven days with no incident? Think it through and write it down. The night cc_claudio's LaunchAgent went in, I didn't touch it — I let it wake itself at 6 a.m. and send me its first DJ morning briefing. Seeing that message pop up on my phone's lock screen the next morning — that was this round's "done." A completely different standard from "can click through it in a browser."

cc_claudio now runs, gets used, and survives reboots — but it is nowhere near "done." The 9T index stalled at stage 13B after a machine swap, so new music can't come in. The right reference clip for true fish-speech voice cloning still hasn't been found; five candidates are sitting on my desktop. Settings, the mini player, and Lock Screen Now Playing don't exist yet. And the 5-day sprint cadence is not sustainable — the next step is to let it run a few mornings and late nights without me touching it, and let real use grow the next batch of problems.

What three attempts and 30-plus days bought me isn't an .app — it's that kickoff checklist above. It will almost certainly keep growing: next time I fall into a pit it doesn't cover, I add a line.

Business Notifications vs System Notifications: Notification Isn't One Semantic — Split Into Two Channels

Sat, 23 May 2026 00:00:00 GMT

OpenClaw Studio · Notification Design

Business delivery (Suwan wrote a morning brief) and system health (the watchdog finished a health-check round) are not the same kind of notification. Push them through the same channel and the business signal will get drowned. Every time.

I only really accepted this after two faceplants on Day 1 of the trial. The first: I opened Feishu and saw nothing but watchdog titles, with no way to tell what Suwan had actually delivered. The second: after I fixed the titles, the sender was still Zhao Zilong. The header said "Suwan · Morning Brief Draft" but the message was coming out of someone else's mouth.

Two versions in one day. Half an hour apart. Looking back, neither one is the principle on its own — it's both together. Drop either and business and system messages glue back into one stream.

The day before, I'd just fixed PATH and assumed the notification layer was stable

The night before, I'd just patched an old LaunchAgent PATH bug — the runner would die halfway through because it couldn't find its tools, and five morning tasks never went out. I fixed the PATH overnight and wrapped the runner in a watchdog so it would always emit an alert when something cut out mid-run. (That story is in the previous post.)

I assumed the notification layer was now solid — runs, doesn't drop, has a safety net. Then I opened Feishu the next morning and discovered the notification layer was a lot more complicated than I thought. It wasn't a "can it send" problem. It was a "is what comes out actually usable by a human" problem. That layer hadn't even been on my radar.

Before v2.4: the watchdog buried everyone

Just after midnight I opened Feishu and saw this stream:

"OpenClaw trial watchdog checkpoint complete · task_id=xxx · status=PASS · manual_review_required=yes"
"OpenClaw trial watchdog checkpoint complete · task_id=xxx · status=PASS · manual_review_required=yes"
"OpenClaw trial watchdog checkpoint complete · task_id=xxx · status=PASS · manual_review_required=yes"

One after another. Once the watchdog wrapper was in place, every task notification got rewrapped — Suwan filed a morning brief, Huo Rui shipped a research memo, all of it rewritten into "watchdog checkpoint complete" format. The owner of the task (Suwan owns the morning brief, for instance), the deliverable, the actual content — all of it buried behind a wall of fields.

Worse — Codex fires one of these every time it finishes a checkpoint, PASS or FAIL. 99% of checkpoints are PASS. Which means 99% of the messages are really just "health check passed" — the kind of thing that should never interrupt anyone. But they were sharing a channel with Suwan's morning brief, so the business signal was drowning in a flood of system noise.

That's when it landed: business delivery and system health are two different notifications. Business delivery is something you read, use, reply to. System health is PASS 99% of the time, and PASS should be silent. Putting them in the same channel means the safety net eats the main delivery.

v2.4: split the content — give every owner their own title and body

12:32. I pushed v2.4 (commit f62382b). One job — make business-delivery notifications look like business deliveries.

Four changes landed at once.

One: per-owner delivery renderer. Every kind of deliverable gets its own title template — no more sharing a generic "watchdog format".

"Suwan · Morning Brief Draft", "Suwan · Column Draft", "Suwan · Evening Brief Draft"
"Huo Rui · Pre-Market Research Brief", "Huo Rui · Mid-Session Summary", "Huo Rui · Closing Summary", "Huo Rui · Post-Close Research Review", "Huo Rui · Deep Research Review", "Huo Rui · Weekly Summary", "Huo Rui · Next-Phase Research Recommendations"
"Shen Zhixing · Daily Source Package"

One glance at the title and you know who shipped what. You don't have to dig into task_id to find out.

Two: an actual content snippet in the body. Before, the body was a dump of task_id, status, output_path, with a file path tacked on the end. After the change, the renderer reads output.md directly, extracts the real body starting at the ## content_body anchor, and sends a 1500-word excerpt with 800 words for the section tail. What I see is the draft itself, not metadata about the draft.

Three: a silence rule for the Codex watchdog. Checkpoints read overall_status from the JSON they just generated —

PASS / PASS_WITH_WARNINGS → silent. Write the file, send nothing to Feishu. Log one line in state: skipped=true, reason=checkpoint_pass_silent.
WARN / FAIL / UNKNOWN → fire the alert.

PASS is the default state. The default state shouldn't make noise. Only when PASS stops being PASS is it worth interrupting a human.

Four: unified alert title. Every system alert now starts with "OpenClaw Watchdog Alert". The watchdog no longer borrows the business title format. From the title alone you can tell: this is business, that is system.

I ran a FAKE_PASS_SMOKE test — forced Codex to return PASS to confirm the silence path actually fires. Notification log was 13 lines going in, 13 lines coming out — nothing sent. State had one new entry: skipped=true reason=checkpoint_pass_silent. The silence path worked.
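The silence rule and the alert prefix fit in a few lines. A sketch, not the real watchdog: the status names, the skipped / reason entry, and the "OpenClaw Watchdog Alert" prefix come from the description above, while the function name, the rest of the title format, and the choice to alert on any unrecognized status are assumptions.

```python
from typing import Optional

SILENT_STATUSES = {"PASS", "PASS_WITH_WARNINGS"}
ALERT_PREFIX = "OpenClaw Watchdog Alert"


def handle_checkpoint(report: dict, state_log: list[dict]) -> Optional[str]:
    """Return an alert title to send, or None when the checkpoint stays silent."""
    status = report.get("overall_status", "UNKNOWN")  # a missing field counts as UNKNOWN
    if status in SILENT_STATUSES:
        # Record the skip locally; nothing goes out to Feishu.
        state_log.append({"skipped": True, "reason": "checkpoint_pass_silent"})
        return None
    # WARN / FAIL / UNKNOWN, and anything unrecognized, interrupts a human.
    return f"{ALERT_PREFIX} · {status} · {report.get('task_id', 'unknown task')}"


state: list[dict] = []
print(handle_checkpoint({"overall_status": "PASS", "task_id": "t1"}, state))
print(handle_checkpoint({"overall_status": "FAIL", "task_id": "t2"}, state))
print(state)
```

Testing membership in the silent set, rather than in the alert set, means a typo or a new status fails loud instead of failing quiet.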

Then I resent the five Day-1 business deliveries that had been buried under watchdog titles. This time what came through was "Suwan · Morning Brief Draft" plus the actual body. I thought the problem was fixed.

v2.5: split the identity — the sender has to be the right person

It wasn't fixed.

After v2.4 went out I looked at the new messages myself — the title really did say "Suwan · Morning Brief Draft", but in the Feishu chat list the sender avatar and name were still "Zhao Zilong · Control Tower".

That's impersonation. Title says Suwan, sender is Zhao Zilong — psychologically, Zhao Zilong is putting words in Suwan's mouth.

I traced it down. The openclaw message send call wasn't passing --account, so the gateway fell back to zhao_zilong — the platform's default account. Every message, no matter what title it carried, was being sent from Zhao Zilong's bot.

Not a new bug — I'd noticed it before, but kept assuming "the body says who it's from, that's enough". After v2.4 shipped and I saw the result, I finally got it: not even close. The sender identity and the title content can't live apart. Once they're separated, what humans see is impersonation, not collaboration.

Each of the six Agents has to send from its own Feishu bot. Each message arrives as a peer in my chat list, not all crammed into a single Zhao Zilong window. This is identity isolation, not UI polish.

13:00. v2.5 (commit 5129a83).

Change one: AGENT_FEISHU_ROUTES, an explicit map. A routing table that pins each agent to a specific profile + account + display name:

suwan → openclaw-studio / su_wan / Suwan
huorui → openclaw-studio / huo_rui / Huo Rui
shenzhixing → openclaw-jiyanran / shen_zhixing / Shen Zhixing
watchdog → openclaw-studio / zhao_zilong / Zhao Zilong · Control Tower

Look at the third row — Shen Zhixing isn't in the openclaw-studio profile, he lives in openclaw-jiyanran. Which means routing isn't just picking an account, the profile has to cross over too. I hadn't thought of that early on, but once the routing table makes it explicit, "cross-profile" is just one extra column in the table.

Change two: send_feishu_message(message, route=...) + resolve_feishu_personal_target(profile, account). Thread profile and account all the way through every openclaw call — stop letting the gateway guess. Every message goes out from the owning agent's own bot. In Feishu I now see four distinct open_ids landing in my chat list, one independent window per Agent.
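The routing table and its lookup can be sketched like this. The four rows are the ones listed above; the Route type and the resolve_route helper are illustrative names, and the important behavior is that an unknown agent raises instead of falling back to a default account.

```python
from typing import NamedTuple


class Route(NamedTuple):
    profile: str
    account: str
    display_name: str


AGENT_FEISHU_ROUTES = {
    "suwan": Route("openclaw-studio", "su_wan", "Suwan"),
    "huorui": Route("openclaw-studio", "huo_rui", "Huo Rui"),
    "shenzhixing": Route("openclaw-jiyanran", "shen_zhixing", "Shen Zhixing"),
    "watchdog": Route("openclaw-studio", "zhao_zilong", "Zhao Zilong · Control Tower"),
}


def resolve_route(agent: str) -> Route:
    """Look up the sender for an agent; never fall back to a default account."""
    try:
        return AGENT_FEISHU_ROUTES[agent]
    except KeyError:
        raise LookupError(f"no Feishu route for agent {agent!r}; refusing to guess") from None


route = resolve_route("shenzhixing")
print(route.profile, route.account)
# openclaw-jiyanran shen_zhixing
```

With the profile stored in the same row as the account, the cross-profile case needs no special handling at the call site.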

Change three: an "article-first" delivery template. v2.4's body was "dump a pile of header fields, then 1500 words of truncated body". v2.5 flips it — the body comes first, because the business deliverable is the message content. No header dump needed. Suwan's article goes through in full, up to 12000 words (no more 1500-word truncation, since the body is the deliverable). Huo Rui's reports get parsed on ## sections, and 12 banned words get re-scanned at delivery time — compliance issues shouldn't be the renderer's blind spot.

Change four: rewrite watchdog alerts in plain language. Before, watchdog alerts were field-and-path dumps that read like a machine talking to itself. Now they're four short Chinese sections: what happened / impact / what's been done / what's needed. I don't have to decode the alert anymore — I see immediately what happened, who's affected, what the watchdog has already done, what I still need to do.

Both versions together are what "semantic separation" actually means

v2.4 and v2.5 are not solving the same problem.

v2.4 fixes "what the content looks like" — titles, snippets, silence rules. Business messages look like business, system alerts look like system.

v2.5 fixes "who the message comes from" — routing table, explicit profile + account, independent bots. Each Agent is actually speaking for themselves.

Just v2.4: title is right, sender is wrong — psychologically zhao_zilong is still impersonating Suwan. Just v2.5: sender is right, body is still a header dump — Suwan's bot is sending out the same robotic field-concat. Both together is the first time business and system actually run on two separate pipes: split on content, split on identity.

The whole point of "business notifications vs system notifications" is this. A notification system that wants to actually split into two channels has to split on at least two layers at once: title and body (what it looks like), and sender identity (who it comes from). Drop either layer and it isn't a split.

Where this generalizes

In OpenClaw Studio this happens to be Feishu + multiple Agents, but the principle is tool-agnostic. Any setup where "business results" and "system health" share a single notification channel will hit the same wall:

CI/CD — build success and infra monitoring share one Slack channel, build success gets drowned in the monitoring PASS heartbeat.
Monitoring alerts — business-metric anomalies and infrastructure health use the same title template, business anomalies get buried under infra PASS spam.
Cron jobs, scrapers — task delivery and scheduler health share one notification queue, the delivery content gets buried under "scheduler completed a round" receipts within minutes.
Multi-Agent systems — every Agent sends through the same bot, every message looks like it's from the same sender, the sense of collaboration flattens into one person doing it all.

The test is simple: open your notification list. Can you tell at a glance which messages are content meant for you, and which are machines clocking in? If you can't, split the channel. And splitting isn't just adding tags or separate channels — it has to happen on content templates, silence rules, and sender identity all at once.

One more: PASS notifications and FAIL notifications shouldn't share a title template. PASS is the default state, and the default state shouldn't make noise. Give FAIL its own loud prefix (e.g. "OpenClaw Watchdog Alert") so a human can spot it instantly in 100 messages. This pairs with the two-channel split — if you split the channels but still let PASS make noise, the business channel will still drown.
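
A minimal sketch of that title rule, assuming each watchdog result carries a status string. The function name, the bracketed status format, and the silent_statuses parameter are illustrative, not OpenClaw's actual code; only the "OpenClaw Watchdog Alert" prefix comes from the text above.

```python
from typing import FrozenSet, Optional

ALERT_PREFIX = "OpenClaw Watchdog Alert"  # loud prefix quoted in the text


def watchdog_title(status: str, summary: str,
                   silent_statuses: FrozenSet[str] = frozenset({"PASS"})) -> Optional[str]:
    """Return a title for statuses that deserve noise, or None to stay silent.

    Hypothetical helper: whether PASS_WITH_WARNINGS belongs in
    silent_statuses is a policy choice left to the caller.
    """
    if status in silent_statuses:
        return None  # default state: send nothing at all
    return f"{ALERT_PREFIX} [{status}] {summary}"
```

Returning None rather than a quiet title keeps the decision in one place: the sender skips anything without a title, so PASS never reaches the chat list.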

v2.4 + v2.5 have been running for a few days. Six Agents each send from their own bot, PASS stays silent, FAIL stands out. Business delivery doesn't get eaten by system heartbeats anymore. That layer is stable for now.

But every new Agent means another row hand-added to AGENT_FEISHU_ROUTES. Next step is auto-registration — agents declare their own route at startup, no human maintaining a global table. The Codex watchdog's "silence threshold" is still being tuned — whether PASS_WITH_WARNINGS counts as PASS or as WARN, there are a few edge cases I haven't closed yet. Huo Rui's research reviews get really long sometimes, and parsing by ## sections still drops edge content — the 12000-word cap is fine for Suwan but tight for Huo Rui.

Notification design is never "done". Every new Agent, every new deliverable type, every new system alert means walking through both layers again. I'm used to it now — notification isn't one semantic, it's at least two, and it'll keep splitting from here.

Day 1 Blew Up at Dawn: I Didn't Expect to Write a Watchdog on Trial's First Day

Sat, 23 May 2026 00:00:00 GMT

OpenClaw Studio · Incident Postmortem

I woke up at dawn on day one, opened Feishu, and saw nothing. Then I opened the local directory — four drafts, four research memos sitting there quietly. The tasks had all run. Not a single notification had told me.

The night before, I had just finished installing OpenClaw Studio's seven-day trial: a LaunchAgent (a per-user macOS background job that launchd loads at login) that wakes up every 30 minutes to trigger a few agents on a schedule (Suwan, Huo Rui, Shen Zhixing). My plan was simple: let it run for two days, see how the cadence feels. If it stays stable, I'll push more complicated work onto it.

Turns out, at dawn on day one, it told me stability is a luxury.

What I installed

The trial layer does something pretty simple. A LaunchAgent fires every 30 minutes and wakes up the runner — a Python script that does the actual dispatch. The runner walks the schedule, and when something is due, it triggers the corresponding agent, writes the output to a local directory, then sends me a Feishu message — Feishu being the messaging app I use for system notifications — telling me it's done.

Mornings are the densest stretch. 06:50, Suwan starts drafting the morning brief. 07:30, second task. 08:40, third. 10:30, fourth. Each one has its own owner, its own output file, its own Feishu message it's supposed to send. The whole design is: "I wake up, I open my phone, four messages are lined up in Feishu by time, I tap whichever one I want to read."

That's what the design said. What actually happened on the first morning was nothing like it.

Four "half-hung" tasks at dawn

The 06:50 task did run. Suwan's morning brief got written, the file was in the right place, the timestamp checked out. 07:30 ran. 08:40 ran. 10:30 ran too. Four output files, all with returncode 0 — the exit code from the command.

But Feishu was empty.

At first I thought the Feishu bot was broken — expired token? webhook changed? I dug into the logs and found every single Feishu notification had blown up with the exact same error:

env: node: No such file or directory

Node was missing. I stared at that line for a few seconds. I had literally just run which node in my terminal — /opt/homebrew/bin/node, plain as day. How could the runner get halfway through and then claim node didn't exist when it tried to send a Feishu message?

What made it worse was the shape of the failure — this "half-hung" mode where the business work completes cleanly and the notifications all quietly die. It's not "the system is down," which is at least an honest kind of incident. It's "part of the system is fine, the other part is rotting in silence." The outputs really existed. The drafts really got written. The research memos were really produced. But unless I went and looked at the local directory on my own, I had zero way of knowing any of it was there.

The thing a scheduler should fear most is exactly this: "I thought it didn't run, but actually it did; I thought it ran, but actually it never did." That kind of failure destroys your trust in the system.

Root cause: a LaunchAgent's PATH is not "the same as your terminal"

Debugging this, I first looked at the runner's own environment. The runner is Python, running inside a venv — the Python virtual environment — with a hardcoded interpreter path and all dependencies bundled in. So the runner itself starts fine.

But the runner doesn't call Python directly when it sends a Feishu message. It execs a CLI command called openclaw. That CLI is written in Node, and its shebang (the first line of a script, which tells the OS which interpreter to run it with) is #!/usr/bin/env node.

That's where the problem lives. env node has to look up node in PATH, the search path the OS uses to find commands. The PATH in my terminal is built up layer by layer from my shell config — /opt/homebrew/bin, /usr/local/bin, various language version managers, various personal bin directories, a long list. But the default PATH for a LaunchAgent started by macOS launchd is brutally minimal. Just these four entries:

/usr/bin:/bin:/usr/sbin:/sbin

Homebrew on Apple Silicon installs to /opt/homebrew/bin. That directory is not in the LaunchAgent's PATH. So env node can never find node, the shebang fast-fails — fails the moment it starts — and the entire openclaw CLI exits without running a single line.
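
The lookup failure can be reproduced without launchd. This is a small sketch, assuming a POSIX system: it resolves a command against an explicit PATH string the way env would, using a stand-in executable (the name fake-node-for-demo is made up) so it runs on machines without Node.

```python
import os
import shutil
import stat
import tempfile

LAUNCHD_DEFAULT_PATH = "/usr/bin:/bin:/usr/sbin:/sbin"  # the four entries above


def find_command(name, search_path):
    """Resolve a command using only the given PATH string."""
    return shutil.which(name, path=search_path)


# Stand-in executable in a directory that is NOT on the minimal PATH.
demo_dir = tempfile.mkdtemp()
fake_tool = os.path.join(demo_dir, "fake-node-for-demo")  # hypothetical name
with open(fake_tool, "w") as fh:
    fh.write("#!/bin/sh\nexit 0\n")
os.chmod(fake_tool, os.stat(fake_tool).st_mode | stat.S_IXUSR)

# Minimal PATH: not found. Same PATH with the directory prepended: found.
missing = find_command("fake-node-for-demo", LAUNCHD_DEFAULT_PATH)
found = find_command("fake-node-for-demo", demo_dir + os.pathsep + LAUNCHD_DEFAULT_PATH)
```

Here missing is None and found is the full path to the stand-in, which is the same difference as between the LaunchAgent and the terminal.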

A lot of people who hit this for the first time can't believe it. "But it works in my terminal!" Because we all subconsciously assume that our computer is our computer, and PATH should be the same everywhere. A LaunchAgent is not your terminal, though. A LaunchAgent is a child process spawned by launchd, with environment variables defined by launchd itself, completely unrelated to your shell config.

The nastiest part is that the business agents were unaffected. Suwan runs inside a Python venv with a hardcoded path. So does Huo Rui. So does Shen Zhixing. None of them depend on PATH to find an interpreter. So the business layer kept working, outputs kept getting written. Only the notification layer — the one that uses a Node CLI to talk to Feishu — was dead.

Business succeeded, notifications all failed. That's how you get the strange spectacle of "everything succeeded and everything failed at the same time."

Fixed the root cause, still had to write a watchdog

The root cause fix is literally one line — at module import time in the runner (the code that runs when the Python module is loaded), prepend /opt/homebrew/bin to PATH. Next time the LaunchAgent wakes the runner, the runner patches PATH as it loads, and the openclaw CLI finds node.
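
A sketch of what that one line can look like, written as a small helper so it is safe to run more than once. The helper name is mine, not the runner's; the only detail taken from the text is the Homebrew directory.

```python
import os

HOMEBREW_BIN = "/opt/homebrew/bin"  # Apple Silicon Homebrew prefix from the text


def prepend_to_path(directory, env=None):
    """Put `directory` first in PATH without duplicating it on reload."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep)
             if p and p != directory]
    env["PATH"] = os.pathsep.join([directory] + parts)
    return env["PATH"]


# At module import time the runner would call:
#     prepend_to_path(HOMEBREW_BIN)
# Child processes started afterwards inherit the patched os.environ.
```

Passing env explicitly is only there to make the helper testable without touching the real process environment.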

The question is: is that line enough?

I sat with it for a few minutes and decided no.

The reason is simple. Today, the thing that broke was a Node CLI not finding node. Next time it'll be something else — some Python package importing a system binary that isn't there, some third-party tool whose path moved after an upgrade and broke a hardcoded reference, some macOS update quietly mutating launchd's environment, or even Feishu having a five-minute outage on their end. None of these will show up in the convenient form of "node not found." They'll arrive wearing new costumes. But the shape will always be the same: business runs, delivery notification gets dropped.

You can fix root causes one at a time, but "half-hung" as a failure mode doesn't go away. So fixing the PATH is the root cause fix, and the watchdog — the kind that automatically recovers from incidents — is a separate layer. It doesn't solve any specific root cause; it just gives every future half-hung incident a chance to auto-recover.

Runner v2.3 was built around exactly this idea. Four pieces of work:

First, the root cause fix — prepend the Homebrew path to PATH at module import, so today's incident can't recur in that form.
Second, the retry channel — retry_pending_notifications. Every time the LaunchAgent wakes up, it scans recent tasks. If it finds one where the output exists but the notification was never sent, it retries the notification automatically. Each task gets up to four retries.
Third, the deterministic watchdog — on every wakeup it actively checks four classes of problems: task_missed (task didn't run), output_missing (output should be there but isn't), notification_missing (output exists but no notification went out), boundary_fail (cross-boundary state inconsistencies). If it finds one, it sends a deduplicated Feishu alert telling me what happened, where, and when.
Fourth, the Codex watchdog checkpoint — six times a day at fixed moments, run a Codex exec — Codex being OpenAI's CLI agent — inside a read-only sandbox, audit the day's full scheduling state, write a markdown + JSON checkpoint, and send an extra Feishu summary.
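
The deterministic checks in the third piece reduce to an ordered decision over a few booleans. A minimal sketch, assuming each scheduled task can be summarized by the four fields below (the field names are my own); boundary_fail is left out because it depends on cross-boundary state that this post doesn't spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskState:
    due: bool            # the schedule says this task should have run by now
    ran: bool            # the runner recorded an execution
    output_exists: bool  # the expected output file is on disk
    notified: bool       # a notification was confirmed sent


def classify(state: TaskState) -> Optional[str]:
    """Map one task's state to a finding name, or None when healthy."""
    if not state.due:
        return None
    if not state.ran:
        return "task_missed"
    if not state.output_exists:
        return "output_missing"
    if not state.notified:
        return "notification_missing"
    return None
```

The order matters: a task that never ran also has no output and no notification, and it should be reported once, under the earliest failure.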

The second and third pieces are symmetric. The retry channel says "I see something got dropped, I'll quietly recover it." The deterministic watchdog says "I see something got dropped, here's a heads-up." Both are safety nets, not the primary path. The primary path will always be the runner sending the notification successfully on the first try.

The Codex watchdog adds another layer of meaning. The deterministic watchdog can only recognize failure modes I've anticipated. The Codex watchdog can recognize the ones I haven't — the ones that need semantic understanding to spot. The cost is that it's expensive, slow, and depends on an external service. So the cadence is six times a day — denser when the morning is busy, sparser in the afternoon and evening.

Catch-up: 4/4 recovered automatically

Once v2.3 was deployed, I didn't rush to send any new notifications. I manually triggered one LaunchAgent wakeup so the retry channel could sweep through the four dead notifications from this morning.

Scan result: all four tasks were flagged as notification_missing; all four had output files, all four had correct timestamps, all four met the retry criteria.

Retry pass. Four Feishu notifications, in chronological order, exactly the way they should have arrived in the first place, came in one by one. returncode 0 across the board.

The line that gave me the most peace of mind: "no agents re-run, no existing outputs overwritten." Retry only resends notifications; it never re-triggers the business work. That constraint was explicit in the design, because some business tasks contain irreversible operations — writing to historical ledgers, appending to audit logs — and re-running them would corrupt state. Catch-up — running the missed pieces after the fact — is strictly bounded to the notification layer. The business layer already finished. You don't touch it.
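
That boundary is easy to state in code. The sketch below is not the real retry_pending_notifications; it assumes tasks are plain dicts with the keys shown and that send is whatever function delivers one message. The point is structural: the only side effect the loop can cause is a call to send.

```python
from typing import Callable, Dict, List

MAX_RETRIES = 4  # "up to four retries" per task, from the text


def retry_pending(tasks: List[Dict], send: Callable[[Dict], bool]) -> List[str]:
    """Resend notifications for tasks whose output exists but was never announced.

    Nothing here re-runs an agent or writes to an output file.
    """
    recovered = []
    for task in tasks:
        if not task["output_exists"] or task["notified"]:
            continue  # nothing to deliver, or already delivered
        if task["retries"] >= MAX_RETRIES:
            continue  # out of attempts; leave it for an alert
        task["retries"] += 1
        if send(task):
            task["notified"] = True
            recovered.append(task["id"])
    return recovered
```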

4/4 recovered automatically. Fix commit is e752a93.

There was a strange feeling in that moment. The incident happened, the incident was detected, the incident was auto-recovered, and the only thing I did manually was trigger one wakeup. Everything else, the system did on its own. It didn't hide the incident, and it didn't amplify the incident into something worse. That was the first time I really felt what a watchdog is worth. It doesn't create new functionality. It just drives the cost of recovery toward zero.

What this taught me

By the end of the postmortem, here's what's worth writing into rules I can remember. The next time I trip over something similar, I want to reach for these immediately.

A LaunchAgent's PATH is not "the same as your terminal." This is an old trap, but every time I install a new LaunchAgent I still default to assuming the environment is identical. Next time, the first thing I do is write PATH out explicitly — either in the plist, or as the first thing the runner does on import. Don't assume "it should be fine."
"Business succeeded" and "delivery completed" are two different things. A task's output landing on disk is just an intermediate state. Real delivery is "user got the notification AND user can find the output." Any link in that chain breaking counts as a delivery failure, no matter how clean the output looks. Next time I design a scheduler, "delivery" is the final gate, and it's stricter than "task executed successfully."
Root cause fixes and safety-net layers should be built separately, but shipped together. The root cause fix is the PATH change — it prevents today's incident. The safety net is retry plus watchdog — it covers all future incidents of similar shape. They don't substitute for each other. Fix only the root cause, and the next new shape of half-hung incident still drags me out of bed. Add only the safety net, and today's incident triggers an alert on every single wakeup until the noise drowns out the signal.
A watchdog is insurance that drives the cost of recovery toward zero. Its value isn't visible when no incident is happening — that's when it looks like waste. Its value is in the exact moment an incident hits, when it turns "I wake up and spend two hours debugging" into "the system handled it and sent me a summary." The cost of buying that insurance is the time to write the code. The cost of not buying it is some two-hour window in a future morning.

So how is v2.3 actually doing now

v2.3 has been live for a few days. Incident density after Day 1 has actually been higher than I expected — not because the PATH issue came back, but because other shapes of half-hung incidents started showing up. The runner's retry channel caught most of them. The deterministic watchdog has fired a few alerts, and each time I was able to decide within five minutes whether to intervene. The Codex watchdog checkpoint produced a pile of markdown, giving me a daily panoramic view of what the system roughly did.

But there are still loose ends.

The deduplication granularity in the deterministic watchdog is still being tuned. Sometimes the same task gets flagged twice as different missed events, so two nearly identical alerts show up in Feishu. Not fatal, but it pollutes the signal. The ideal is "one alert per incident, unless state has actually changed."
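
One way to get "one alert per incident, unless state has actually changed" is to key alerts on a fingerprint of the incident. This is a sketch of the idea, not the current implementation; the choice of fields (task id, scheduled slot, finding) is an assumption.

```python
import hashlib
from typing import Set


def alert_fingerprint(task_id: str, scheduled_for: str, finding: str) -> str:
    """Stable key for one incident: same task, same slot, same finding."""
    raw = "|".join([task_id, scheduled_for, finding])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def should_alert(seen: Set[str], task_id: str, scheduled_for: str, finding: str) -> bool:
    """True the first time a fingerprint appears; a changed finding is a new key."""
    key = alert_fingerprint(task_id, scheduled_for, finding)
    if key in seen:
        return False
    seen.add(key)
    return True
```

In practice seen would have to be persisted between wakeups, since each LaunchAgent run is a fresh process.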

The Codex watchdog checkpoint cadence (six per day) is currently fixed, but realistically mornings are dense and afternoons are sparse. The next step is to make it adapt to "incident density" — run more often when incidents are recent, less often when things are quiet. To do that I first need a definition of "incident density." Don't have one yet.

The biggest open piece is this: v2.3 fixed the notification channel of the trial-layer scheduler. But each of the six agents has its own "business notification" channel, and over the following days a whole different pile of problems showed up there — which bot sends business messages, whose identity does it speak as, how is the message formatted, who should receive it, who shouldn't. That's a different story. Worth a separate post.

Day 1's dawn incident pulled "add a watchdog" from week two of my plan all the way up to the afternoon of day one. That was trial's first gift to me — it didn't make me wait a week to find the problem.

Trial isn't about proving the system works. It's about exposing the system's fragile spots in a low-cost environment. Every spot that surfaces, I fix. Every fix, the watchdog grows a little. Day 1 grew the PATH fix and the retry channel. The following days grew other things. Every day's watchdog is a little smarter than the day before.

Incidents are normal. The watchdog is still growing. I no longer expect "install once and it stays stable." What I expect is "every incident makes the system a little better at saving itself next time." Until then, I'm still working on it.

9TB Music Library Read-Only Indexing: The Engineering Constraints I Set for Myself

Fri, 22 May 2026 00:00:00 GMT

Music Indexing — Engineering Notes

Turning a 9TB personal music library into a queryable index sounds like a weekend script. The hard part isn't writing the code. It's nailing down "never touch the source disk" before you write a single line.

The drive at home holds 115,999 audio files — FLAC, MP3, WAV, DSF, all mixed in — 7.39 TB in total, enough to nearly fill a 9TB disk. This pile took a decade-plus to accumulate. Some I ripped from CDs myself. Some came from friends. Some I downloaded in the early years. Some I rescued off old dying drives. Every file has a small story behind it, but the stories don't matter. What matters is that the whole thing is now a black box: I know it's there, and I can't query it.

The first time I sat down to build an index, I almost just started writing — scan the tree, read the metadata, push it into a database, slap a search UI on top. Then I stopped. Because I've watched too many people — myself, years ago, included — touch the source and break it. Renaming. Moving. Editing tags. Adding cover art. Every single time it was "just this one tweak," and every single time, looking back, the right move was: don't.

So this round I did something counterintuitive first: I wrote the rules before I wrote the code. Five constraints, treated heavier than the actual business logic. Scripts can be rewritten. Constraints don't bend.

Constraint 1: Read-only — don't touch a single file on the source disk

The most basic rule, and the easiest one to break. The second your scan script contains an open(path, 'w'), a shutil.move, an os.rename — the whole constraint is gone.

I wrote it brutally narrow: don't modify, don't move, don't delete, don't rename. Not even "let me just clean up that one filename." The reason isn't technical, it's trust. This disk holds a decade of my own material. I will not let any script do something to it "for my own good." The index can be rebuilt. A corrupted source can't.

The enforcement happens at the tool layer. Any path that points at the source disk can only be opened in read mode; any write intent raises immediately. Sounds paranoid, but a few weeks in, this single rule has caught more than my own slips: a couple of times it stopped an AI that tried to "tidy up the filenames for me" and crossed the line.
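
A minimal sketch of such a guard, assuming all file access in the scan scripts goes through one wrapper. The mount point and the names guarded_open and SourceWriteError are hypothetical.

```python
from pathlib import Path

SOURCE_ROOT = Path("/Volumes/MusicSource")  # hypothetical mount point


class SourceWriteError(PermissionError):
    """Raised when code asks for anything but read access to the source disk."""


def is_under(path: Path, root: Path) -> bool:
    """True if `path` resolves to a location inside `root`."""
    try:
        path.resolve().relative_to(root.resolve())
        return True
    except ValueError:
        return False


def guarded_open(path, mode: str = "rb", source_root: Path = SOURCE_ROOT):
    """open() wrapper: source-disk paths may only be opened for reading."""
    wants_write = any(flag in mode for flag in "wax+")
    if wants_write and is_under(Path(path), source_root):
        raise SourceWriteError(f"write mode {mode!r} refused for {path}")
    return open(path, mode)
```

A wrapper like this only covers open(). Calls such as os.rename or shutil.move need the same check, and mounting the source volume read-only at the OS level is a stronger backstop than anything in Python.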

Constraint 2: Don't write anything to the source — isolate the workspace

This extends rule one, but cuts finer. Not only can you not edit the source files, you also can't drop a temp database, cache, log, or state file anywhere on the source disk.

I didn't realize this needed its own rule until I once used a tool to scan a photo library, and afterwards found it had quietly seeded a .cache file inside every subdirectory. Looks harmless. But the source disk was no longer clean — it now carried the tool's fingerprints, and switching tools later meant cleaning up first.

So now every index output — database, cache, ffprobe reports, error logs, checkpoint state — lives in a dedicated working directory on a separate work disk. The source disk only plays one role: data source. Never workbench. The two are physically isolated, separate mount points and all.

The side effect is great: I can unplug the source disk anytime, change its interface, copy it to another machine, and the index side doesn't notice.

Constraint 3: No network identification — pure local processing

This is where modern tools love to stab you in the back. MusicBrainz, AcoustID, all the online lyric matchers — unless you actively turn them off, they're on by default. Every scanned track ships a fingerprint to the cloud, and a few days later your entire music library has been profiled remotely.

I don't want that convenience. First, privacy — my private listening habits, taste, and collection shape don't need to become training material for some online service. Second, stability — network identification makes the index depend on "whatever the cloud returned that day." A track matched today might not match tomorrow, and the index stops being reproducible.

So: strictly local. Metadata comes only from the tags inside the file itself. Quality scoring looks only at file properties. Duplicate detection uses only local hashes and durations. Could I get sharper identification by going online? Sure. But the price is losing "this index is fully rebuildable, fully offline" — and I can't afford to trade that away.

Constraint 4: Resumable scans — running twice mustn't write twice

Over 110,000 files is not a five-minute scan. The first real full ffprobe + mutagen pass ran all night and still wasn't done. Network hiccups, power blips, an accidental Ctrl-C — all of those have to be survivable. The script has to pick up where it left off.

Resumable scanning sounds simple. In practice it's all traps. The biggest one is double-writes — if the last run died halfway and the restart doesn't dedupe, the same track gets inserted twice. The entire index is then untrustworthy.

So the real meaning of this constraint isn't "can resume from interruption." It's "running the same script repeatedly must converge to the same index." One file path, one row. Idempotency is the floor. Technically I lean on a three-piece set: a unique index in the database, explicit upserts, and a separate "already processed paths" table.
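
A small sketch of that three-piece set using SQLite, with an in-memory database standing in for the real one. The table and column names are illustrative, and the ON CONFLICT upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the work-disk database
conn.execute(
    "CREATE TABLE tracks (path TEXT PRIMARY KEY, size_bytes INTEGER, title TEXT)"
)
conn.execute("CREATE TABLE processed_paths (path TEXT PRIMARY KEY)")


def record_track(path: str, size_bytes: int, title: str) -> None:
    """Insert or update one row per path, then mark the path as processed."""
    with conn:  # one transaction: both writes land, or neither does
        conn.execute(
            "INSERT INTO tracks (path, size_bytes, title) VALUES (?, ?, ?) "
            "ON CONFLICT(path) DO UPDATE SET "
            "size_bytes = excluded.size_bytes, title = excluded.title",
            (path, size_bytes, title),
        )
        conn.execute(
            "INSERT OR IGNORE INTO processed_paths (path) VALUES (?)", (path,)
        )


def already_processed(path: str) -> bool:
    """Checkpoint lookup used to skip finished files after a restart."""
    row = conn.execute(
        "SELECT 1 FROM processed_paths WHERE path = ?", (path,)
    ).fetchone()
    return row is not None
```

Calling record_track twice for the same path leaves exactly one row, which is the convergence property the constraint asks for.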

Constraint 5: Error isolation — one corrupt file can't kill the job

Out of 110,000+ files, some are corrupted, some have wrong permissions, some have weirdly encoded filenames, some have busted format headers. Final tally: 543 files ffprobe couldn't read, 744 files mutagen failed on. That is 1,287 failures between the two tools, and the full scan cannot stop because of any of them.

Early versions of mine took the lazy path: hit an exception, raise. The result was the script dying at 30%, restarting from zero, dying again at 35%. Forever circling inside the first 40%.

Then I rewrote it: every file gets its own try/except, errors go to a dedicated error log, the main loop keeps moving. Only after that change did the scan actually finish. The point of error isolation is to accept reality: a library of 110,000 files will produce over a thousand errors, and the abnormal thing isn't the errors, it's letting those errors stop the other 100,000-plus files. Errors get logged, never silently swallowed, but once an error is logged the main task keeps running.
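
The shape of that loop, as a sketch. read_one stands for whatever reads a single file's metadata; the logger name is made up.

```python
import logging
from typing import Callable, Iterable, List, Tuple

error_log = logging.getLogger("scan.errors")  # hypothetical logger name


def scan_all(paths: Iterable[str],
             read_one: Callable[[str], dict]) -> Tuple[List[dict], List[str]]:
    """Process every path; a failure on one file is logged and skipped."""
    results, failed = [], []
    for path in paths:
        try:
            results.append(read_one(path))
        except Exception as exc:  # deliberate catch-all at the per-file boundary
            error_log.error("failed to read %s: %r", path, exc)
            failed.append(path)
    return results, failed
```

The broad except is normally a smell. Here it sits at the one boundary where "whatever went wrong with this file, move on" is the intended behavior, and every failure still ends up in the log and in the returned list.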

Why ten stages — cut by risk, not by feature

Once the five constraints were nailed down, I didn't write a single "one-click scan" megascript. I sliced the whole thing into ten stages, each one runnable on its own, each one verified on its own, each one signed off on its own.

Slicing by stage is not the same as slicing by feature. Slicing by feature asks, "what is this chunk of code responsible for?" Slicing by stage asks, "what specific risk can this step blow up on?" Once you cut by risk, a failure at any stage only costs you that stage — the earlier stages don't need to be redone.

Stage 0: Safety checks + working directory init — confirm source is read-only, work disk is writable, no path escapes its sandbox.
Stage 1: Dependency check — ffprobe, mutagen, Python env, database schema all in place.
Stage 2: Audio file discovery — read-only scan of the whole disk, listing the path and size of every one of the 110,000+ files. No metadata reads yet; just prove we can walk the whole disk.
Stage 3: Sample validation — pull 300 files at random, try reading their metadata, measure success rate, and project how long a full pass will take.
Stage 4: Full metadata read — based on the sample projection, run the full pass with confidence. Resumable mode on.
Stage 5: Duplicate candidate analysis — only "mark" potential duplicates, never delete.
Stage 6: Quality scoring draft — every track gets a 0-100 score.
Stage 7: AI DJ initial index v0 — fuse everything above into a queryable index.
Stage 8: Final reconciliation — cross-check completeness against the source disk.
Stage 9: Index acceptance and patching — manually spot-check a few hundred rows to find rule gaps.
Stage 10 and beyond: player integration, tag enrichment, audio feature analysis. These are separate concerns, all on hold until Stages 0-9 are solid.

The biggest win from ten stages: when any stage fails, I only roll back that stage. Stage 4 ran all night and crashed? Stages 0-3 are still good. The rerun is just that segment. If the whole thing had been one overnight monolith, a failure means starting over completely.

The other win: every stage has its own "pass criteria." If Stage 3's sample success rate is below 90%, Stage 4 doesn't run — I go back and find out why first. That way the downstream stages always get clean input.
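
The Stage 3 gate can be sketched in a few lines. The sample size and the 90% threshold come from the text; the seeded random generator is my own addition so that a failed sample can be re-examined, and can_read stands for the real metadata probe.

```python
import random
from typing import Callable, Sequence

SAMPLE_SIZE = 300        # from Stage 3 above
MIN_SUCCESS_RATE = 0.90  # the Stage 4 gate above


def sample_success_rate(paths: Sequence[str],
                        can_read: Callable[[str], bool],
                        seed: int = 0) -> float:
    """Try a random sample of files and return the fraction that read cleanly."""
    rng = random.Random(seed)  # seeded so the same sample can be re-checked
    sample = rng.sample(list(paths), min(SAMPLE_SIZE, len(paths)))
    if not sample:
        return 0.0
    return sum(1 for p in sample if can_read(p)) / len(sample)


def stage4_may_run(rate: float) -> bool:
    """Pass criterion: Stage 4 only starts at or above the threshold."""
    return rate >= MIN_SUCCESS_RATE
```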

The six dimensions of quality scoring — and why no listening tests

Stage 6's quality scoring is the most criticizable part of this whole index. Someone will ask: why no ABX listening test — the blind A/B/X comparison that's the gold-standard audio evaluation? Why no spectral analysis? Why no dynamic range calculation?

I considered all of those. Then I picked six very dumb dimensions instead:

Lossless vs lossy — FLAC, WAV, DSD start higher than MP3.
Sample rate and bitrate — higher sample rate adds points, very low bitrate deducts.
Metadata completeness — title, artist, album, year, genre, deduct per missing field.
Duration sanity — abnormal durations (5 seconds, 8 hours) get flagged separately.
Read success — any ffprobe or mutagen failure deducts heavily.
Suspected duplicate — being linked to a duplicate group deducts a relational score.

Why no listening tests? Because listening is subjective and can't be automated. What I want is an index that runs, reproduces, and scores 110,000 tracks under the exact same yardstick. The moment I introduce listening, the next day I'd rehear a track and want to overturn yesterday's judgment, and the entire score becomes unstable forever.

All six dimensions are objective file-level properties — the same file scored today and a year from now yields the same number. That's what an "index" should look like. It's not an audiophile review. The overall average came out at 78.1, with a reasonable distribution — FLAC mostly above 85, MP3 mostly 60-75, a handful of damaged files pulled down to under 30. Good enough.
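
To make "objective file-level properties" concrete, here is a sketch of a scorer over the six dimensions. Every number in it (base scores, bonuses, penalties, duration bounds) is invented for illustration and is not the weighting that produced the 78.1 average; only the dimensions themselves come from the list above.

```python
from dataclasses import dataclass

LOSSLESS = {"flac", "wav", "dsf"}
TAG_FIELDS = ("title", "artist", "album", "year", "genre")


@dataclass
class Track:
    ext: str
    sample_rate_hz: int
    bitrate_kbps: int
    tags: dict
    duration_s: float
    read_ok: bool
    in_duplicate_group: bool


def quality_score(t: Track) -> int:
    """Score 0-100 from file properties only. All weights are hypothetical."""
    lossless = t.ext.lower() in LOSSLESS
    score = 80 if lossless else 65                 # lossless vs lossy
    if t.sample_rate_hz > 48_000:
        score += 5                                 # higher sample rate
    if not lossless and t.bitrate_kbps < 128:
        score -= 15                                # very low bitrate
    score -= 3 * sum(1 for f in TAG_FIELDS if not t.tags.get(f))  # missing tags
    if not t.read_ok:
        score -= 40                                # ffprobe or mutagen failure
    if t.in_duplicate_group:
        score -= 5                                 # suspected duplicate
    return max(0, min(100, score))


def duration_is_abnormal(t: Track) -> bool:
    """Flagged separately from the score; the bounds are hypothetical."""
    return not (10 <= t.duration_s <= 4 * 3600)
```

The function has no clock, no randomness, and no network call, so the same file properties always produce the same number.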

Three real bug stories — even nailed-down constraints get chewed through

Five constraints plus ten stages sounds airtight. In practice it still got bitten several times. Three of the most representative ones:

Unicode NFC/NFD normalization

macOS stores filenames in NFD (combining characters split apart). Many Linux-side scripts default to NFC (composed form). A Chinese song title that looks identical in macOS Finder might, when handed to Python's os.stat in the other form, raise FileNotFoundError.

This one cost me two days. I first assumed it was a permissions issue and spent hours getting nowhere. Then I eyeballed two visually identical strings and finally saw it — they differed at the byte level. The fix is to normalize every path to NFC before it touches the database. This does not violate "don't modify the source," because the source filenames are still untouched at the byte level. Only the database stores the normalized version.
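
The byte-level difference and the fix are easy to show with the standard library. A minimal sketch; db_key is a name I made up, and the example filename is illustrative.

```python
import unicodedata


def db_key(path: str) -> str:
    """Normalized form used only as the database key, never written to disk."""
    return unicodedata.normalize("NFC", path)


nfd_name = "Cafe\u0301.flac"  # 'e' followed by a combining acute accent (NFD)
nfc_name = "Caf\u00e9.flac"   # single precomposed character (NFC)

looks_same_but_differs = nfd_name != nfc_name
keys_match = db_key(nfd_name) == db_key(nfc_name)
```

Both strings render identically, but they are different code-point sequences until normalized. One design note, as an assumption rather than something the text states: the original, unnormalized path should probably be stored next to the key, since that is the form needed to open the file again.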

The insert bug

Stage 5 builds the duplicate-candidate analysis, which inserts pairs of potentially duplicate files into a table. The first cut, being lazy, had no unique constraint, and the same pair got inserted three times — because the analysis has several rules and each one independently flagged the pair as a duplicate.

On the surface nothing was broken — just extra rows in the table. But when Stage 6 scoring picked it up, things went sideways: the same track got deducted three times for "three duplicate relationships" and dropped into a score bucket it didn't deserve. The fix was an explicit unique constraint, plus bidirectional dedup on (pair_a, pair_b), plus a verification pass on Stage 5's output before feeding it into Stage 6.

The lesson isn't "remember to add unique constraints." It's that any "analysis"-flavored script defaults to firing multiple times, and you have to block that at the data layer. You cannot rely on the business layer to be careful.
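
A sketch of blocking it at the data layer: store each pair in a canonical order and let a unique constraint absorb the repeats. The column names pair_a and pair_b follow the text; the table name and rule names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real index database
conn.execute(
    "CREATE TABLE duplicate_pairs ("
    " pair_a TEXT NOT NULL, pair_b TEXT NOT NULL, rule TEXT,"
    " UNIQUE (pair_a, pair_b))"
)


def add_candidate(x: str, y: str, rule: str) -> None:
    """Store a pair once, regardless of argument order or how many rules fire."""
    pair_a, pair_b = sorted((x, y))  # canonical order makes (x, y) == (y, x)
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO duplicate_pairs (pair_a, pair_b, rule) "
            "VALUES (?, ?, ?)",
            (pair_a, pair_b, rule),
        )


add_candidate("a.flac", "b.flac", "same_hash")
add_candidate("b.flac", "a.flac", "same_duration")  # reversed, different rule
add_candidate("a.flac", "b.flac", "same_title")
pair_count = conn.execute("SELECT COUNT(*) FROM duplicate_pairs").fetchone()[0]
```

Three rules fire, one row survives. This version keeps only the first rule that matched; recording every matching rule would need a separate table keyed on (pair, rule).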

The ffprobe hang

During the Stage 4 full metadata pass, a few oversized DSF files (several GB each) made ffprobe hang mid-read. The child process never returned, the main script never moved. Overnight runs got a few hundred files in and froze.

The fix was wrapping every ffprobe call in a timeout watchdog, the kind that catches a hung process instead of waiting on it forever. If a call runs over 30 seconds: kill the child, log the error, mark the file as "read timeout," and let the main loop continue. With that patch, Stage 4 finally finished a full overnight run.

This sent me back to reread constraint 5, "error isolation." Error isolation doesn't just mean handling exceptions. It also has to handle "neither returns nor errors." Silent hangs are harder to catch than thrown errors. You have to add the timeout proactively.
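
In Python the proactive timeout is one argument to subprocess.run. A generic sketch, not the actual Stage 4 code: it wraps any command, and the status strings other than "read timeout" are my own labels.

```python
import subprocess
from typing import List, Optional, Tuple

TIMEOUT_S = 30  # the limit mentioned above


def run_with_timeout(cmd: List[str],
                     timeout_s: float = TIMEOUT_S) -> Tuple[str, Optional[str]]:
    """Run a child process and never wait longer than timeout_s.

    Returns ("ok", stdout), ("error", None) or ("read timeout", None).
    """
    try:
        # subprocess.run kills the child itself when the timeout expires.
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "read timeout", None
    if done.returncode != 0:
        return "error", None
    return "ok", done.stdout
```

For the metadata pass, cmd would be the ffprobe invocation for a single file, for example ["ffprobe", "-v", "error", "-show_format", "-of", "json", path]. The exact flags the scan uses aren't given in this post, so treat that command line as an assumption.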

Where things stand — Stage 13B is still running

Stages 0 through 12 are done. The first index (v0) runs and is queryable. All 110,000+ files are in the table. Metadata completeness sits roughly at title 90%, artist 90%, album 89%, year 33%, genre 35%. The first three are fine. The last two are weak, because a lot of old MP3s never had year or genre filled in back when they were ripped.

Stage 13B is doing reverse verification — taking the statistics from the index and matching them against the actual directory structure on the source disk, looking for "in the index but not on disk" and "on disk but missing from the index" cases. This was supposed to be Stage 8's job, but Stage 8 cut a corner and only did forward verification, so I opened a separate Stage 13B to make it right.
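
Both directions of that check are set differences. A minimal sketch, assuming the two inputs are path lists already normalized the same way (otherwise NFC/NFD mismatches show up as false positives on both sides); the function name is hypothetical.

```python
from typing import Iterable, Set, Tuple


def reconcile(indexed: Iterable[str],
              on_disk: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (in index but not on disk, on disk but missing from index)."""
    indexed_set, disk_set = set(indexed), set(on_disk)
    return indexed_set - disk_set, disk_set - indexed_set
```

Forward verification alone only computes the first set, which is how files missing from the index can go unnoticed.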

Two small things have surfaced so far: about 200 files were missed in the original scan because their paths contained rare characters — those need a return trip to Stage 2 with NFC normalization added. And about 40 files have empty metadata while the audio itself is fine, which looks like the original ripping tool never wrote any. Once those two are fixed, v0 is officially settled.

After that comes Stage 14+ — player integration, AI recommendation DJ, visualization layer. But all of that sits on top of "the index is trustworthy," so I'm in no hurry to push it. If the index isn't stable, everything above it is sand.

This index is nowhere near finished. But with every constraint I nail down, every stage I get through, and every bug I patch, my trust in it grows by another notch.

I'm no longer chasing "let me ship the query UI fast." What I'm chasing is: next month, six months from now, two years from now, I can rerun the same scripts and the result converges to the same index. The source disk is never touched. Output always lives on the work disk. Errors always land in a log. Repeated runs are always idempotent. Those four things matter more than any fancy query frontend.

Stage 13B is still running. The next note about this index will probably open with either "v0 finally earned the right to be called v1" or "another constraint just got chewed through somewhere new."

Organizing a Large Personal Archive: Backup Priority, Critical Assets, Config Drift

Fri, 22 May 2026 00:00:00 GMT

System Audit Notes

Taking inventory is not the same as backing up. Inventory is "I have no idea how much stuff I've installed on this machine myself" — and until you do that, every backup plan is a gamble.

I wrote an earlier piece called "Before Reinstalling My Mac mini, I Ran a System-Level Asset Audit," which is the story of that specific audit: what I did, what I found, how I decided to handle it. This one isn't a rerun of that story. It pulls out the method-level rules I distilled from that audit and presents them on their own: the backup priority matrix, the signals for spotting critical assets, the common config-drift patterns. These apply to any personal archive that's been accumulating for years, not just Macs and not just AI workstations.

The motivation is blunt. After getting burned a few times myself, I realized most people know almost nothing about the actual state of their own machine. Half the software you installed, you've forgotten. Config is scattered across a dozen places. Your API keys have been copied into three or four files. Several projects are still running processes you never shut down. In that state, any "backup plan" is luck — you think you backed things up, but what you backed up is the surface. The part that's actually going to bite you was never visible in the first place.

The Real Snapshot: Put the Numbers Down First

A few of the numbers from this audit surprised even me.

370GB of disk used out of 926GB. I'd assumed I was at half capacity. Turns out I was closer to 40% full.
9 git repos, 4 of them with no remote. Meaning: if the local disk dies, those 4 are gone for good.
179 uncommitted files, spread across those 9 repos — each one a piece of work I "meant to come back and commit" but never did.
22 LaunchAgents — macOS startup background services. I could name maybe 10 of them. The other half I have no memory of installing.
75+ command-line tools installed via Homebrew, plus 5 global npm packages.
14+ .env files holding environment variables, scattered across different projects, each one stuffed with API keys for external services. None of them in git.
23 listening ports, 11 active AI services running. They start at boot, but I'd never put the full list down on paper before.

That table is the first real artifact of an audit. Without it, "backup" just means copying Desktop, Documents, and a few visible project folders — and the other 80% of your real assets stay invisible.

So my first rule now is: before any backup decision, force yourself to fill out that table of numbers. Can't fill it out? Then don't talk about backup yet.

Three Backup Tiers: P0, P1, P2

Once the numbers are down, the next question is — which of these things must be backed up, which can be rebuilt, and which I just don't need to care about.

Early on I tried the "everything is important" attitude. The result was "nothing is important." Once a backup strategy has no priorities, it degrades into "back up whatever fits," and when critical data is lost you can only blame the dice. So I stick to three tiers now. More than that and I won't maintain it.

P0: Losing It Means Serious Damage

The bar here is — if this is gone, work stops, and there is no external resource that can rebuild it.

Uncommitted or unpushed code: anything that hasn't reached a remote yet. The local disk is the only copy.
Business databases — the Postgres and SQLite instances running locally with actual business data inside.
Vector data — embeddings stored in chromadb, lancedb, or mem0. This one is its own special category and gets a section below.
Voice assets — recordings, generated audio samples. There's only one original.
.env files — full of API keys for third-party services. Lose them and you're filling out signup forms at dozens of websites again.
Custom LaunchAgents — the service definitions that start at boot. Lose them and you've smashed every entry point into your daily workflow.

P1: Expensive to Recover, But Recoverable

Losing this isn't fatal, but it takes a day or two to get back to where you were.

Model caches — local LLM weights you've pulled down. Re-downloading tens of gigabytes is grunt work, but doable.
Global packages — the collection of CLI tools installed via Homebrew and npm. Rebuildable from a Brewfile or similar manifest, assuming you have the manifest.
AI CLI configs — Claude Code, Codex CLI and friends. Prompts, custom commands, MCP integrations all live here.
Browser configs — bookmarks, extensions, logged-in sessions. The synced stuff is one thing; small unsynced tool configs are another.

P2: Can Be Re-Downloaded or Re-Configured

The bottom tier. Losing this barely matters.

Homebrew packages themselves — as long as the manifest survives, reinstalling is trivial.
Application installers — the App Store or vendor sites will hand them back.
Build artifacts — node_modules, build, dist. The source is there; regenerate.

The key to these three tiers isn't how finely you slice them — it's that you actually treat each tier differently after slicing. P0 needs redundant backups (cloud plus an offsite physical copy). P1 needs one copy. P2 doesn't need to be in the backup at all. With that, backup volume drops from "hundreds of GB across the whole disk" to tens of GB — small enough that it can actually run every day, and small enough that you notice immediately when it breaks.

The side effect of backing up everything is that the backup gets too big to run, so you push it to weekly, then monthly, then "last backup was six months ago." Tiering isn't about saving disk space. It's about giving the backup a chance to actually keep running.

Six Signals for Spotting Critical Assets

The priority matrix tells you how to categorize. The next question is — where are these things actually hiding? Generic backup tools can't see most of them. You have to go category by category yourself.

Every time I do an inventory now I walk through these six signals, each tied to a real scenario. Miss one, and you'll discover after the reinstall that "oh, that thing is gone."

Signal 1: Uncommitted Code

The easiest one to miss, because everyone defaults to "all my code is in git." But git only has what you've committed. Those 179 uncommitted files are not in git.

The actual move: list every git repo on the machine, then run `git status` on each to see uncommitted changes, and `git remote -v` to see whether a remote exists. A repo that fails both checks is high-risk: no remote means the local copy is the only copy, and uncommitted means even the local hasn't been archived.
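
A sketch of that sweep in Python, assuming `git` is on the PATH. The two git commands are the ones named above (with `--porcelain` added so the status output is easy to count); `find_repos`, `classify`, and `audit` are hypothetical helper names.

```python
import subprocess
from pathlib import Path

def find_repos(root):
    """Yield every directory under root that contains a .git entry."""
    for git_entry in Path(root).rglob(".git"):
        yield git_entry.parent

def git(repo, *args):
    """Run a git command inside repo and return its stdout."""
    cmd = ["git", "-C", str(repo), *args]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def classify(status_porcelain, remote_output):
    """Turn the raw output of the two checks into a small risk record."""
    changed = [line for line in status_porcelain.splitlines() if line.strip()]
    has_remote = bool(remote_output.strip())
    return {
        "uncommitted": len(changed),
        "has_remote": has_remote,
        # No remote and uncommitted work: the only copy is not even archived locally.
        "high_risk": bool(changed) and not has_remote,
    }

def audit(root):
    """Map each repo path to its risk record."""
    return {
        str(repo): classify(git(repo, "status", "--porcelain"), git(repo, "remote", "-v"))
        for repo in find_repos(root)
    }
```

Calling `audit` on a home directory can take a while, since `rglob` walks everything; pointing it at the folders where projects actually live is usually enough.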

Of the 4 no-remote repos I found that time, 2 were leftovers from early experiments. But they held tuned parameters and small utilities I'd worked out at the time. If they were gone, I'd have to redo that work. This kind of stuff doesn't announce its value to you — you only remember it existed once it's gone.

Signal 2: Running Databases

Local databases like Postgres, SQLite, or ChromaDB are risky to back up by just copying the data files. The copy is often broken, because the database may have been mid-write and what you copied is a half-written state.

So for this class of asset, the backup action isn't "copy files." It's "stop the service first, or use the database's own dump tool." Skip both and start backing up, and recovery later will most likely reveal that the backup is corrupt.
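
For SQLite specifically, Python's standard library exposes the database's own online backup mechanism, which gives a consistent copy even while another connection has the file open. A minimal sketch, with hypothetical paths; for Postgres the equivalent move would be its dump tool (`pg_dump`) rather than anything shown here.

```python
import sqlite3

def snapshot_sqlite(src_path, dest_path):
    """Copy a SQLite database through the backup API instead of copying the file."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies page by page and stays consistent
        # even if the source is in use by another connection.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

The resulting file is a normal SQLite database, so the cheapest restore test is simply opening it and running a count on a table you know.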

The more practical problem is that most people genuinely don't know which databases are running on their machine. They came in as dependencies of some project, they're listening on some port, they start at boot, and you've never thought about them directly. Inventory means specifically checking every listening port and every database process to see what's actually inside.

Signal 3: Vector Data (The Most Special One)

chromadb, lancedb, mem0 — these local vector stores hold embeddings: high-dimensional vector representations of documents, chat logs, knowledge snippets. The special thing about them is this: in theory you can recompute from the source data. In practice you almost can't.

Why? Because the rebuild needs three things to be true at once: the source data still exists, the embedding model you used is still accessible, and you remember the chunking and cleaning rules. Miss any one of the three and the rebuilt vector store is different from the original — search results shift, similarity thresholds need retuning, and every downstream pipeline that depends on it needs regression testing.

My own local knowledge bases have been running for months. I've swapped embedding models in that time, tuned chunking strategies, cleaned out bad entries a few times. Rebuilding from zero would probably be harder than building it the first time. So vector data is P0 for me, sitting on the same tier as the databases.

Signal 4: .env Files

A .env file is what a project uses to hold environment variables — usually stuffed with API keys, database connection strings, tokens for third-party services. By convention it doesn't go into git, which means backup has to handle it specially.

The problem is they're scattered across project roots, config subdirectories, sometimes buried inside dotfiles. I scraped up 14+ of them that time, spread across 8 projects. Opening each one revealed credentials for external services — losing them would mean re-applying at dozens of sites and remembering which email I used, which team I was on, which usage tier I'd asked for.

So inventory has to include a sweep of every .env, .env.local, .env.production-style file. Note where each one lives and what kinds of secrets it carries. They go straight into P0.
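
That sweep can be sketched in a few lines of Python. It records where each file lives and which variable names it defines, and deliberately never touches the values, so the inventory itself doesn't become another place where secrets are written down. The helper names are hypothetical.

```python
from pathlib import Path

def find_env_files(root):
    """Find .env, .env.local, .env.production and similar files under root."""
    return sorted(p for p in Path(root).rglob(".env*") if p.is_file())

def key_names(env_path):
    """Return the variable names defined in one env file, without their values."""
    names = []
    for line in env_path.read_text(errors="replace").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        names.append(line.split("=", 1)[0].strip())
    return names

def inventory(root):
    """Map each env file path to the list of names it defines."""
    return {str(p): key_names(p) for p in find_env_files(root)}
```

The `.env*` pattern will also pick up look-alikes such as `.envrc`, which is acceptable for an inventory: a false positive costs a glance, a miss costs a credential.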

Signal 5: Custom LaunchAgents

A LaunchAgent is macOS's definition of a per-user background service: a property list file stored under ~/Library/LaunchAgents/. Each file describes a service that launchd starts automatically when you log in: maybe an AI service, maybe a monitoring script, maybe a scheduled job.

I found 22 of them that time. At least half were experiments I'd installed long ago and never uninstalled. Losing this class of asset doesn't sting immediately — but the next time you boot, you'll notice a pile of things missing: the AI services that started themselves, the backup scripts that ran on a schedule, the small monitors watching for anomalies. All gone. Reconstructing each one from memory is basically impossible.

So the whole LaunchAgents directory goes into P0 — back up all of ~/Library/LaunchAgents/ as one unit. And this is also a cleanup opportunity. While you're inventorying, decide which ones can actually be deleted. Don't blindly keep all of them.

Signal 6: Plaintext Secrets (The Most Dangerous Class)

This is the one I least want to look back at. During inventory you check your shell profile — .zshrc, .bashrc, .bash_profile — and you'll often find a line like `export OPENAI_API_KEY=...`. Plaintext key, loaded into every shell at startup.

Two problems with keeping it there. One is security: a plaintext secret in a config file is readable by anything that can read the file, including some less-than-clean tools you've installed. The other is mobility: shell profiles get backed up to cloud drives, copied to new machines, pasted into screenshots when you're asking someone for help — and one slip and the key leaks.

So this isn't just a backup problem, it's a refactor problem. The backup still has to include the full shell profile (it lives in P0), but after the inventory you have to schedule a task: move every plaintext secret from the shell profile into a password manager, then read it back from there at runtime. I haven't finished that one myself. It's the next thing on the list.
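
Finding those lines is the easy half. A rough sketch of a scanner, assuming secrets can be recognized by a variable name containing KEY, TOKEN, SECRET, or PASSWORD; that marker list is my own heuristic and will miss anything named differently. It reports line numbers and names only, never values.

```python
import re

EXPORT_LINE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")
MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")  # heuristic, not exhaustive

def find_plaintext_secrets(profile_text):
    """Return (line_number, variable_name) for exports that look like literal secrets."""
    hits = []
    for lineno, line in enumerate(profile_text.splitlines(), start=1):
        match = EXPORT_LINE.match(line)
        if match is None:
            continue
        name, value = match.group(1), match.group(2).strip().strip("'\"")
        if not any(marker in name.upper() for marker in MARKERS):
            continue
        if value.startswith("$"):
            # Already read from somewhere else at runtime, e.g. $(some-command).
            continue
        hits.append((lineno, name))
    return hits
```

After the refactor, the same export line would read its value from a command substitution that calls the password manager's CLI, which is exactly the shape the `$` check skips.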

Five Common Config-Drift Patterns

Finishing the backup doesn't mean the archive is organized. The real trouble is that once a machine has been running for a few years, configs start contradicting each other — and a generic backup plan is completely blind to this layer.

After that inventory run, I grouped the conflicts I'd hit into five patterns. None of them is fatal alone. Combined, they're what makes you say after a reinstall: "why does some of this work, and some of this almost work?"

Pattern 1: Port Semantics Drift

The most common case. Port 3100 is a web service in project A, a database admin UI in project B, and grabbed by some AI tool in project C. All three start at boot. Whoever wins the race gets the port; the other two fail silently. No one tells you.

The sneakier version is a one-digit difference in the port number: 3100 vs 13100 used by different components, and a config file with an extra or missing digit happily connects to the wrong service. The logs look fine, because the other end is also an HTTP service. It just isn't the one you wanted.

So during inventory you list every listening port, cross-reference against the port declarations in each project's config files, and look for collisions. No backup tool can do this for you. You have to walk it.
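
Once the two lists exist, the cross-reference itself is mechanical. A sketch, assuming you have already collected the declared ports from each project's config by hand and the listening ports from something like `lsof -iTCP -sTCP:LISTEN -n -P`; the project names are placeholders.

```python
from collections import defaultdict

def find_port_collisions(declared):
    """declared: iterable of (project, port). Return ports claimed by more than one project."""
    by_port = defaultdict(set)
    for project, port in declared:
        by_port[port].add(project)
    return {port: sorted(names) for port, names in by_port.items() if len(names) > 1}

def undeclared_listeners(listening, declared):
    """Ports that are open on the machine but that no project config mentions."""
    declared_ports = {port for _project, port in declared}
    return sorted(set(listening) - declared_ports)
```

The second function is the one that surfaces the 3100 vs 13100 kind of mistake: something is listening on a port that nobody's config admits to.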

Pattern 2: Stale Path References

Your crontab points at /Users/me/old-project/run.sh, but old-project was deleted three months ago. A symlink points at a directory that no longer exists. An MCP config — Model Context Protocol, what AI tools use to connect to external services — points at a service that's been migrated elsewhere.

This kind of stale reference gets preserved as-is by the backup. When you restore, it's still sitting in your config, still pointing at a target that vanished long ago. Mild case: log errors. Bad case: a tool that depends on that path just dies on startup.

The fix is — during inventory, walk through the crontab, every symlink, every external tool's config file, and verify each target path or service still exists. Doesn't exist? Decide right there: either delete it, or repoint it at the new path.
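
Two of those checks are easy to script. The sketch below finds broken symlinks under a directory and absolute paths in crontab text that no longer exist. It assumes the crontab has been dumped to a string first (for example from `crontab -l`), and it treats any whitespace-separated token starting with `/` as a path, which is a crude heuristic.

```python
import os
from pathlib import Path

def broken_symlinks(root):
    """Return symlinks under root whose target no longer exists."""
    # Path.exists() follows the link, so it is False for a dangling symlink.
    return sorted(p for p in Path(root).rglob("*") if p.is_symlink() and not p.exists())

def missing_crontab_targets(crontab_text):
    """Return absolute paths mentioned in crontab lines that are not on disk."""
    missing = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for token in line.split():
            if token.startswith("/") and not os.path.exists(token):
                missing.append(token)
    return missing
```

MCP configs and other tool configs would need their own small parsers; the pattern is the same, extract the target and ask whether it still exists.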

Pattern 3: Dead Service Dependencies

Service A's config says "depends on service B at port 5432," but B got replaced by C three months ago, and 5432 is empty. A tries to connect on every startup, fails, and falls back to a degraded mode. You have no idea it's running in degraded mode.

This kind of problem doesn't show up under normal use, because the degraded mode often "looks like it works." By the time something actually breaks and you go check the logs, you find that a key part of the pipeline has been severed for months.

During inventory, go through each service's config and list what it depends on. Then cross-check: are those dependencies actually still running? Anything that isn't has to be deleted or restored — don't leave it sitting in the config with a "should be there" status.
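
The "is it actually still running" half can be approximated with a plain TCP connect. This is a sketch under the assumption that each dependency is reachable as a host and port; the dependency map is something you'd write by hand while reading the configs.

```python
import socket

def port_is_open(host, port, timeout_s=1.0):
    """True if something accepts a TCP connection at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

def dead_dependencies(deps, timeout_s=1.0):
    """deps: {service: [(host, port), ...]}. Return only the unreachable entries."""
    dead = {}
    for service, targets in deps.items():
        unreachable = [t for t in targets if not port_is_open(t[0], t[1], timeout_s)]
        if unreachable:
            dead[service] = unreachable
    return dead
```

An open port only proves that something is listening there, not that it is the service the config meant, so this check complements the port cross-reference from Pattern 1 rather than replacing it.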

Pattern 4: Cross-Root Scheduling

This is the one I've stepped into the deepest. A scheduler that was supposed to handle only its own project's jobs slowly accumulates lines like "also kick off the script in the project next door." Then one day you refactor that other project, move the directory, and the scheduler is still running against the old path — either erroring out or running the wrong file.

What makes it worse: this cross-root scheduling tends to be asymmetric. Scheduler A knows it's calling B. B has no idea anyone outside is calling it. When B's maintainer makes a change, they're not thinking about A's dependencies. So the conflict happens with zero warning.

So during inventory each scheduler needs a "who I'm calling" list, plus the reverse view of "who's calling me." With both sides reconciled, you can finally judge which cross-root calls are intentional and which are historical baggage.
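
The reverse view falls out of the forward list by inverting it. A sketch, assuming each scheduler's "who I'm calling" list is written as paths whose first component is the owning project root; the project names are placeholders.

```python
from collections import defaultdict

def called_by(calls):
    """calls: {caller_root: [target paths]}. Return {target_root: [outside callers]}."""
    reverse = defaultdict(set)
    for caller, targets in calls.items():
        for target in targets:
            owner = target.split("/", 1)[0]
            if owner != caller:
                # Only cross-root calls are interesting; a project calling itself is normal.
                reverse[owner].add(caller)
    return {owner: sorted(callers) for owner, callers in reverse.items()}
```

The output is the list that B's maintainer never had: everyone outside the project who will break when a directory moves.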

Pattern 5: Historical-Copy Confusion

The same config file exists in several places: one local, one in a backup directory, one in archive, one copied out during some experiment. The names are all similar, the contents are slightly different, the timestamps aren't far apart. Figuring out which one is canonical — the authoritative one, the one actually being used — turns into archaeology.

A single person can power through this for a while, but it falls apart over time. Six months later you genuinely can't remember which copy is "the one I'm actually using right now." And when an AI tool comes along to read this pile of files, it's even more likely to pick the wrong one.

The fix is — during inventory, every critical config gets one canonical path designated. The other copies either get deleted or explicitly marked as historical (move them into an archive/ subdirectory, for example). The principle is "only one copy is live at any given time." No "they all still work" allowed.
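
Before designating a canonical copy, it helps to know which same-named copies actually differ. A sketch that groups copies of one filename by content hash; `divergent_copies` is a hypothetical helper, and which group is canonical remains a human decision.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def divergent_copies(root, filename):
    """Group every copy of filename under root by content; empty list if all are identical."""
    root = Path(root)
    by_digest = defaultdict(list)
    for path in root.rglob(filename):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.relative_to(root).as_posix())
    if len(by_digest) < 2:
        return []
    return sorted(sorted(group) for group in by_digest.values())
```

Copies that land in the same group are safe to delete down to one. Copies in different groups are the ones that need a decision and, for the losers, a move into archive/.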

What's Worth Keeping in Cold Storage

Inventory doesn't mean delete everything. Beyond P0/P1/P2, there's another category — stuff that doesn't affect system operation, but "I'll probably want to look this up the next time I write something." I call this category seed material in cold storage.

Most historical files aren't worth much — old chat logs, old experiment outputs, old versions of design files, expired ad-hoc reports. Those can go in one swipe. But these four kinds I pull out and keep separately:

Cross-module analyses — the panoramic views that put several modules of a system side by side: call graphs, permission propagation paths. Producing one of these costs a lot. Keep them so you can see how you understood things at the time.
Teaching material — bilingual notes on an open course, organized chapter summaries. These are filtered, second-pass artifacts. More useful than the source video.
Research reports — industry surveys, technology-evolution writeups, comparative evaluations of specific tools. Conclusions that took days of digging at the time, still usable as a starting point months later.
Meta information — quality-check reports, classification lists, snapshots of directory structure. Data about the data. Rebuilding it is painful, and there are only a handful of these files anyway.

The judgment is simple. The cost of throwing it out is "I'll have to redo this work when I think of it again." The cost of keeping it is "some disk space." The former is much more expensive than the latter, so keep it. But put it in a cold-storage directory. Don't leave it mixed in with working folders — once it's mixed in, every time you open the folder you have to re-decide "is this hot data or cold data," and that wears your attention down.

What Comes After the Inventory

Inventory isn't a one-shot thing. My current rhythm: every time I'm about to do something major — reinstall, migrate to a new machine, swap a drive — I run this whole sequence again. Day-to-day I keep a living list. A new LaunchAgent installed, a new .env written, a new global tool added — note it down on the spot.

The hardest part of this isn't the tooling. It's the mindset. Admitting "I don't know what I've installed on my own machine" takes some nerve — a lot of people will resist instinctively, because admitting it means having to face that table of numbers. But you only have to face it once. The next time is much easier.

The core of organizing an archive isn't backing up more diligently. It's turning the invisible assets into a visible list. Invisible means gambling. Visible is the first time strategy enters the conversation.

My own next steps are still half-done: move every plaintext secret out of the shell profile into a password manager, fix the config conflicts I've already spotted, clean up the 22 LaunchAgents and delete half, and decide for each of the 4 no-remote repos whether to add a remote or archive it for good. None of this is finishable in one pass — and none of it is the kind of problem a single checklist solves.

But with that table of numbers and these few rules in hand, at least the next time I face them I won't have to ask "what is actually installed on this machine." That's enough to count as a starting point.

Personal AI Lab Asset Governance: Projects, Archive, Audits, Timeline — Four Categories That Don't Overlap

Fri, 22 May 2026 00:00:00 GMT

Asset Governance Notes

After a year of running an AI workstation, the disk reaches a strange state — the directory tree still looks tidy, but you genuinely can't find anything anymore.

At first I thought the directories just weren't fine-grained enough. Slice another level, add another tag, that should fix it. Turns out it doesn't. The problem isn't granularity. The problem is that I was slicing along the wrong dimension from day one: I was slicing by content topic, and after six months that always blows up. Over its lifecycle the same asset changes governance attribute, while its content topic stays put. However neatly you arrange things by topic today, six months from now the asset's identity in your head has shifted, but the directory hasn't followed.

This piece is one level of abstraction up. The site already has a few articles about how to manage a specific kind of asset — "Six Page Types for a Working Knowledge Base," "From Logs to Knowledge: Retain or Exclude Rules," a "Mac Mini System Audit Before Reinstall," and "Organizing a Large Personal Archive Library." Those all slice downward. This one goes the other way: before you touch any specific library or run any specific audit, by what attribute should every output be split into a handful of root categories? Once the root cut is settled, the rules for those downstream libraries actually start to make sense.

The cut I use now is four buckets — projects, archive, audits, timeline. Each one runs its own territory. No overlap.

Why not slice by content topic

For about a year I used "AI models / engineering projects / music / video / research." Looks intuitive, looks right — every thing has a home.

But it doesn't survive six months. The reason is simple: content topic is a static label, governance attribute is a dynamic identity. The same asset is a "live project" today, may be a "closed-out archive" tomorrow, and the day after that may become an "audit snapshot" because someone ran an inventory. Its content topic hasn't changed — still belongs in the AI engineering drawer — but how I'm supposed to treat it has completely changed.

Concrete example. The Jiyanran Voice Workbench project was, last year, a running engineering effort with weekly commits pushing it forward. Mid-stream we ran two external audits and left two timestamped snapshots. Some early modules stopped evolving and won't be touched again — by rights they belong in the archive. And a handful of decisive architecture changes from inside the project I wrote into the system-level timeline.

If I slice by content topic, "Jiyanran" is one directory and all of that piles in. The result: when I come back wanting to know "what state is the system actually in right now," I get drowned by stale design docs from earlier rounds. When I want to look up "what did the most recent formal audit conclude," I'm digging through layers. When I want to see "when was this architecture decided," the timeline is buried in commit history, invisible.

Put another way — in a topic-sliced directory, no file tells you "what attitude you should bring when you open me." You have to re-judge every single time. After ten rounds of that, nobody opens anything.

Slice by governance attribute and it's a different game. A file sitting in "projects" means it's alive, may be edited, referenced, or pushed forward at any time. Sitting in "archive" means it's dead — you can read it, but don't treat it as current truth. Sitting in "audits" means it's a snapshot of a moment, and you have to read the timestamp to know whether it's still usable. Sitting in "timeline" means it's recording a system-level change — don't expand the content, just follow the reference.

The attribute is stable. Once an asset has been categorized, the way you treat it is locked in. The four buckets, one by one.

Category one: projects — alive, has an owner, still moving

The projects directory holds work that's currently in motion. The test is dumb on purpose — there's still an owner, there are still outputs, it's still being pushed, there are acceptance criteria. Fail any one of those and it shouldn't be in projects anymore.

I have a handful of lines of work running on this machine right now: Music Index, OpenClaw, Jiyanran Voice Workbench, Linlu Video Factory. Each one has its own directory, its own git repo, its own .env, its own venv. Those four pieces (directory, git, env, venv) are the physical isolation that project governance depends on. Drop any one and you're in trouble.

Why hammer on isolation? Because dependency conflicts in AI engineering are vicious. When two projects share one venv, within six months one side's deps will get quietly broken by the other side's upgrade. When two projects share one .env, sooner or later a token crosses over: one project's key gets read by the other project's code. The light damage is blown quotas; the heavy damage is calls going out under the wrong project's identity. As for sharing a git repo, don't even get me started: the commit history goes somewhere you don't want to imagine.

Mess inside a project directory is fine — live work is messy by nature: drafts, throwaway scripts, dead logs, half-finished docs. All of that can sit in the project, no problem, because their fate tracks the project. Either they get promoted to a real output someday, or the project closes out and they move to archive together, or one day you confirm they're useless and just delete them.

The biggest danger in the projects directory is failing to move dead pieces out in time. If a project contains both "the live current version" and "an already-dead older version" without clear labels, the next time you go to edit, 95% chance you edit the wrong one. I've been there — meant to change v3, instead changed a same-named file left over from v1, took half a day to figure out why the change wasn't taking effect.

So there's an implicit rule for the projects directory: everything inside is alive by default. The second you notice something is actually dead, move it out. Don't "just leave it here, it's fine." That's the most common contamination path between projects and archive.

Category two: archive — past, closed, still has reference value

Archive holds things that are already dead but still worth keeping. Old wikis in cold storage, past research reports, experiments that finished and didn't get picked back up, instructional cleanups — that category.

The signature of archive is clear: won't be edited again, but you'll occasionally look back, better kept than deleted, and absolutely never authoritative.

My archive is physically isolated from current projects — ideally on a different drive. That sounds excessive — a directory is enough, do you really need a different disk? But physical isolation has a psychological effect: what your hand can't reach, your hand doesn't casually reference. This is the most invisible but most essential move in governance: keep bad options out of arm's reach.

The biggest pitfall in archive comes down to one thing: it gets referenced as if it were authoritative.

The worst case of this I ever ran into came while debugging a config issue. The AI assistant pulled up the docs and found a design doc with "FINAL" right there in the name, plainly stating that the interface uses v3 and the field is called X. I made the changes accordingly and nothing lined up. Eventually I realized that this FINAL was a year old: v3 had long since been overturned and the field had long since been renamed, but after the doc got moved to archive nobody went back to touch it. It sat in archive with FINAL in the filename, looking authoritative.

From that point I gave archive an iron rule — every piece of material in the archive directory is stale by default. Even if the filename says FINAL / SPEC / canonical, the moment it's in archive, it's stale. To use anything from archive as a basis, I have to go back to the live project and find the current version. If there isn't one, that information is treated as nonexistent — you don't get to grab the archived version and use it directly.

The rule sounds unfriendly — there's something right there, why can't I use it? But it's protecting me. What archive preserves is the historical value of "this is what it was at the time," not the factual value of "this is still what it is." Conflating the two is more dangerous than not keeping the file at all — not keeping it just means no information; conflating means wrong information.

So archive isn't a graveyard for data, it's a museum. You can view it, you can cite it as "the state at the time," but you can't take a museum piece into active duty.

Category three: audits — a snapshot of one moment

Audits are the output of a one-shot inventory. The system snapshot before a reinstall, a cross-project topology pass, a third-party security audit, your own periodic asset review — all of these belong here.

Audits have a signature that's different from both projects and archive. Projects are alive, archive is dead, audits are "dead at a specific point in time, but that point matters." It's a photograph. What it captured was the state at that moment.

Two rules for the audits directory — must carry a timestamp, must carry a stale tag.

The timestamp has to go directly into the filename or the directory name, not buried inside the file. This is a lesson learned the hard way, many times. If the timestamp only lives in the frontmatter at the top of the file, you only see it after opening — scanning the directory listing tells you nothing about what's old and what's new. Year and month in the filename, and one glance tells you whether this is from this year or last.

The stale tag matters more. The instant an audit snapshot is taken, it starts going stale — the only variable is how fast, which depends on how fast the system is moving. For stable parts of the system, an audit might still be valid six months later; for parts that are moving, an audit might be junk in two weeks. So the audits directory needs a mechanism that can tell you "is this one still usable" — right now I still tag by hand, and whenever the system goes through a major change I go back and mark the corresponding audits stale.
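
I still tag by hand, but one possible rule for automating it is easy to state: an audit is stale if it is older than the last major change to the part of the system it describes. The sketch below assumes a filename convention like `2026-05-system-audit.md` with the year and month (and optionally the day) in the name; both the convention and the rule are illustrations, not a mechanism I have running.

```python
import re
from datetime import date

STAMP = re.compile(r"(\d{4})-(\d{2})(?:-(\d{2}))?")

def audit_date(filename):
    """Read the date out of the filename; month-only stamps count as the 1st."""
    match = STAMP.search(filename)
    if match is None:
        return None
    year, month, day = match.groups()
    try:
        return date(int(year), int(month), int(day) if day else 1)
    except ValueError:
        return None

def is_stale(filename, last_major_change):
    """An audit with no readable date, or one older than the last major change, is stale."""
    stamped = audit_date(filename)
    return stamped is None or stamped < last_major_change
```

Treating an undated file as stale is deliberate: it errs on the side of distrust. A month-only stamp is read as the first of the month, so an audit from the same month as a change is also flagged.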

The biggest landmine in audits is an expired one being treated as current state.

This is worse than misciting from archive. With archive, everyone knows in their bones that they're reading history. Audits are different: they were made for the purpose of "letting people know the current state," so the default assumption is "this is true." Take a three-month-old audit where the system has changed several times in between and nobody has marked it stale. The next person who pulls it up and uses it as ground truth has an extremely high chance of being wrong.

So when I run an audit now, I do two things at once — take the snapshot, and also come back to mark it stale when the system changes meaningfully. The second one I forget all the time, and every miss produces a problem. That's also why later in this piece I mention I'm working on stale automation — manual discipline can't carry this.

Audits and archive get conflated easily, but the difference is actually clean: archive is "this thing is dead, move it into the museum." Audits is "this thing is still alive, but at one moment I took a photo of it." The subject keeps living. The photo doesn't update.

Category four: timeline — the running log of system-level changes

Timeline records exactly one thing — system-level changes.

What counts as system-level? Changing the constitution — the machine-level general working guide. Changing directory structure. Installing a new resident service. Configuring a new LaunchAgent — macOS's background services that start at boot. Decisive architecture pivots. That class.

Anything that isn't system-level doesn't go in. "A specific project edited a specific file" — that's the project's commit history, not timeline. "An audit found something" — that lives in the audits directory, not timeline (unless the audit triggered a system-level change). "Read an article today, learned something new" — that's notes, not timeline.

Timeline is append-only — no deleting, no editing, archived by month. Every entry has to state four things clearly: what changed, why, who's affected, and which task or audit it references.

That "references" part is the critical one. Timeline doesn't repeat content, it only carries pointers. For any timeline entry, if you want details, you jump to the project directory or audit snapshot it references. This rule keeps the timeline itself thin — and thin is what makes it readable.
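The four required fields and the append-only rule can be sketched as a small record plus a writer. Everything concrete here is an assumption for illustration: the `TimelineEntry` name, the one-line render format, and the per-month `log.md` file are not the actual layout of my timeline directory.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class TimelineEntry:
    date: str               # ISO date, e.g. "2026-05-20"
    what: str               # what changed
    why: str                # why it changed
    affected: str           # who or what is affected
    refs: tuple[str, ...]   # pointers to tasks or audits, never copied content

    def validate(self) -> None:
        for name in ("what", "why", "affected"):
            if not getattr(self, name).strip():
                raise ValueError(f"timeline entry is missing '{name}'")
        if not self.refs:
            raise ValueError("timeline entry needs at least one reference")

    def render(self) -> str:
        refs = ", ".join(self.refs)
        return (f"{self.date} | what: {self.what} | why: {self.why} | "
                f"affected: {self.affected} | refs: {refs}\n")

def append_entry(root: Path, entry: TimelineEntry) -> Path:
    """Append one entry to the month's log; existing lines are never rewritten."""
    entry.validate()
    month_dir = root / entry.date[:7]      # "YYYY-MM" directory per month
    month_dir.mkdir(parents=True, exist_ok=True)
    log = month_dir / "log.md"             # hypothetical file name
    with log.open("a", encoding="utf-8") as f:   # "a" = append only
        f.write(entry.render())
    return log
```

Rejecting an entry with no reference is the part that keeps the log thin: if there is nothing to point at, the detail tends to get written into the timeline itself.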

The biggest danger in timeline is every small change getting stuffed into it.

The temptation is real — once you have a running log, why not jot down every edit. But try doing that for one week and you'll see — the timeline drowns instantly. You wanted to know "what major changes happened this month," and instead you find 80 entries, 70 of them "tweaked a parameter" / "edited some copy" / "installed a tool." Nobody reads a log at that density, and even if you did you couldn't find the signal.

So the entry bar for timeline has to be high. My test for whether a change goes in is — three months from now, looking back at this change, will I still think it mattered? If not, don't enter it.

When in doubt, don't write. Missing an unimportant change is a small loss; writing too many unimportant changes invalidates the whole timeline, which is a big loss.

How the four categories reference each other

Once the four directories are settled, the relationships between them are tighter than you'd expect. But there's one principle — cite by pointer, never by copy.

When a project hits a milestone and needs a snapshot, that triggers an audit: the project directory leaves one line, "audited, see audits/2026-05-jiyanran-stage-1.md," and the detail lives in the audits directory. If an audit surfaces a significant issue that demands a system-level adjustment, that triggers a timeline entry: the audit report leaves one line, "timeline change triggered, see timeline/2026-05/CHANGE_xxx.md," and the timeline leaves one line, "cause documented in audits/2026-05-jiyanran-stage-1.md." Both sides reference each other, but the content lives in exactly one place.
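Because the pointers are plain-text paths, they break silently when a file moves. A minimal sketch of a checker follows, assuming the four directories are named `audits`, `timeline`, `archive`, and `projects` under one root (the name `projects` and both function names are assumptions). It pulls path-like pointers out of a piece of text and reports the ones that no longer resolve.

```python
import re
from pathlib import Path

# Matches pointers such as "audits/2026-05-jiyanran-stage-1.md".
# The directory names are assumptions about the layout.
POINTER = re.compile(r"\b(?:audits|timeline|archive|projects)/[\w./-]+")

def find_pointers(text: str) -> list[str]:
    """Extract path-like pointers, dropping a trailing sentence period."""
    return [m.group(0).rstrip(".") for m in POINTER.finditer(text)]

def broken_pointers(root: Path, text: str) -> list[str]:
    """Return the pointers in text that do not exist under root."""
    return [p for p in find_pointers(text) if not (root / p).exists()]
```

Running this over every file after a reshelving pass would at least turn a silent broken link into a visible list.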

Project close-out and move-to-archive follow the same logic: the project directory disappears, the archive directory holds the complete snapshot, and the timeline gets one line, "project X closed, archive at archive/xxx." From then on, that timeline entry is the only entry point anyone has for finding that project.

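Timeline references projects and audits; projects and audits rarely reference timeline, because timeline is an after-the-fact summary while projects and audits are in-flight artifacts. The main exception is the one-line back-pointer an audit leaves when it triggers a change. Keeping the relationship mostly one-way stops the directories from turning into a tangle of cycles.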

Archive's reference relationship with the other three is the weakest — archive mostly gets referenced, it doesn't actively reference others. It's a terminus.

I didn't think this reference structure through up front either — it emerged from using the system. Once it stabilized, looking things up became a very mechanical motion — first check the timeline to know roughly what happened in what window, then jump to projects or audits for the actual content, and finally, if you need historical background, dig through archive. Three steps cover any lookup.

Real scenarios: how the four buckets work together for decisions

The value of this cut shows up in real scenarios. Three that come up for me a lot lately.

Scenario one: picking up an old project that hasn't been touched in three months.

The old flow was — open the project directory, read the latest doc, start guessing. The problem is, docs in the project directory don't separate truth from junk — some are old design drafts, some are abandoned mid-stream proposals, only a few are the version that's actually running. I'd often spend an hour or two just figuring out "what state is this project actually in right now."

The current flow is — look at what the most recent commit in the project was about, then go to audits and find the most recent snapshot for this project (timestamped, you can see the month at a glance), then go to timeline and pull the system-level changes touching this project in the last three months. Stitch the three together and I'm in context in ten minutes. The audit snapshot tells me "what it looked like at that moment," the timeline tells me "what's changed since," and the latest project commit tells me "what's moving right now." Three angles cross-checked is more accurate than any one of them alone.

Scenario two: debugging a configuration conflict from nowhere.

For example, some day a service suddenly can't reach a port, and you don't remember changing any related config.

Used to be I'd dive straight into the project code and grep commits, often coming up empty — because the issue might not even be in this project, it might be a system-level adjustment that changed port allocation, or a LaunchAgent config somewhere.

Now the first step is the timeline — any network, port, or service-related system-level changes in the last month? Second step, the "known config conflicts" section of the most recent system audit — audits typically jot down conflicts you've already discovered. Two steps and 80% of the weird issues land at a root cause. The remaining 20% is when I go into the project and grep commits.

The order matters — system-level first (timeline + audits), project-level second (commits). Reversed, you cut your efficiency in half.
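The first step of that order, "any network, port, or service-related system-level changes in the last month," is mechanical enough to sketch. The snippet assumes timeline entries are already loaded as `(date, summary)` pairs. That shape, the function name, and the keyword list are illustrative only.

```python
from datetime import date, timedelta

def recent_system_changes(entries, today, days=30,
                          keywords=("network", "port", "service")):
    """Filter (date, summary) pairs to recent entries mentioning a keyword.

    Matching is a crude case-insensitive substring test, so "port" also
    matches words like "report". Results are sorted newest first.
    """
    cutoff = today - timedelta(days=days)
    hits = [
        (when, summary)
        for when, summary in entries
        if when >= cutoff and any(k in summary.lower() for k in keywords)
    ]
    return sorted(hits, reverse=True)
```

An empty result is the signal to move on to the audit's known-conflicts section, and only after that to the project's commits.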

Scenario three: writing a technical blog post.

Like the one I'm writing right now. The material splits into two parts — archive has the related cleanups I've done over the past year, projects have the practice that's still moving in the last month or two.

Archive is in charge of "what I used to think," projects are in charge of "what I'm doing now." Cite both and the piece gets temporal depth. Cite only archive and it reads like history; cite only projects and it reads like technical minutiae. Putting the two side by side is what lets the piece explain "why, after changing my mind this many times, I landed on this cut."

Audits don't show up much in this scenario. They are internal engineering records, not something you cite in public writing. But occasionally, for a postmortem about "the time we made a major adjustment," the audit snapshot is the most authoritative source you have.

What this governance actually changed

The numbers shifted more than I expected.

Before the cut — the work outputs on this machine were spread across more than a dozen directories. Finding one file took an average of three or four directory dives, and on a bad day half an hour wouldn't find it.

After cutting into four — 95% of the time I can locate a file in under thirty seconds. The remaining 5% is mostly gray-zone material I never properly classified, and each time I hit one I reshelve it on the spot. The gray zone keeps shrinking.

The bigger shift is psychological — I'm no longer afraid of "can't find it." Every previous lookup carried this latent anxiety of "what if I can't find it this time." That anxiety feeds back into behavior — you subconsciously look back less, rely on memory more, and lean on "should be fine" more.

Now I don't lean on memory. I need something, I go to the right drawer and pick it up.

The parts I'm still changing

This governance is nowhere near done. A few things are still growing.

First, stale-tag automation for audits. Right now it's manual — every time the system changes, I go back and mark a batch of audit snapshots stale. The miss rate isn't low. The ideal is a script that periodically scans the audits directory and computes which snapshots are stale based on the current system state. The thing is harder than it sounds — "current system state" doesn't have a machine-readable definition, so you'd first have to formalize "system state" before any script can judge.

Second, the criteria for moving a project to archive. Right now it's pure intuition — I think this project is dead, so I move it. But there's a fuzzy band between "I think it's dead" and "it's actually dead." Move it too early and the project gets pulled back into active use; move it too late and dead stuff keeps polluting the project directory. The ideal is a set of criteria — how long since the last commit, are the dependencies still alive, is there an owner — and meeting enough of them triggers a "suggest archive" prompt.
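A sketch of what "meeting enough of them" could look like, using the three criteria just named. The 90-day idle threshold, the two-signal minimum, and the `ProjectStatus` record are placeholder assumptions, not values I have tuned.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProjectStatus:
    days_since_last_commit: int
    dependencies_alive: bool
    has_owner: bool

def suggest_archive(p: ProjectStatus, idle_days: int = 90,
                    min_signals: int = 2) -> bool:
    """Return True when enough 'looks dead' signals are present.

    This only produces a suggestion. The move itself stays a human decision.
    """
    signals = [
        p.days_since_last_commit >= idle_days,  # nothing committed for a while
        not p.dependencies_alive,               # what it relies on is gone
        not p.has_owner,                        # nobody is responsible for it
    ]
    return sum(signals) >= min_signals
```

Requiring two signals instead of one is a guess at how to narrow the fuzzy band: a quiet project with a living owner and living dependencies stays where it is.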

Third, cross-category reference mechanics. I wrote about this above — the four categories cite each other, but the references right now are plain-text paths in markdown. Move a file and the link breaks. On a single machine I can live with it; the moment it's multi-machine, it falls apart.

Fourth, multi-machine sync. I have more than one machine running now (Mac mini, Mac Studio, Mac Air), each with its own role. The four categories exist on every machine, but I haven't decided whether they should fully sync or whether each machine should carry only what it uses locally. Full sync's upside is that any machine can pick up where another left off; the downside is that the project directory swells fast. Local-only is clean, but cross-machine collaboration suddenly lacks things. This probably needs to split into something like "projects per machine, archive fully synced, audits per machine, timeline fully synced," but I haven't actually run it, so I don't know what pitfalls it will turn up.

The more I work on directory cutting, the more I think it isn't an IT problem — it's a cognition problem.

Directories cut by topic reflect "what I think I have"; directories cut by governance attribute reflect "how I'm supposed to treat these things." The first is a label, the second is an action. When your directory tells you what to do next, that's the moment assets start working for you, instead of you working for them.

Cutting into four isn't a finish line, it's a starting point that keeps growing. Stale automation, archive-promotion criteria, cross-machine sync — each one has to queue up on its own. With this piece written, next week I should circle back and push the stale automation thing — it's been owed for a while.

Dual Constitution, Task Folders, Handoff: The Minimum Order for Multi-AI Collaboration

Wed, 20 May 2026 00:00:00 GMT

Multi-AI collaboration governance notes

One person, three or more AIs, a long-term engineering effort. Six months in, I finally saw it clearly — the thing that disappears first isn't capability, it's order.

Capability is actually in surplus. Claude can argue, write long pieces, break down architecture. Codex can run scripts, fix tests, handle CI. Kimi can chew through long Chinese documents. GPT and Gemini each have their own edges. Put them side by side and in theory you have a small team. Each one alone is worth half an engineer; three in parallel should be worth one and a half.

But once you actually use them, the team feeling turns out to be fake. Every AI is an island. There's no memory between sessions, and even less shared understanding across tools. The next AI that walks in is always asking the same set of questions: what did the previous AI change? Why? Can I keep going? Or do I start over? Is this setup actually stable, or did it just happen to work last time?

Nobody answers these. I have to remember them myself. Whatever I can't remember turns straight into rework. One or two rounds of rework is fine. By round ten you realize you're not using AI — you're being a human relay station between AIs, copying context one way, explaining yesterday the other way, double-checking that they haven't stepped on each other's feet.

Eventually I stopped asking "how do I make the AI smarter" and started asking the reverse: what is the smallest set of things that holds the relay between these islands together? Not some grand governance framework — just the minimum order that's barely enough. One piece less and it collapses; one piece more and it starts dragging.

Every AI is an island — the pain is real

At first I thought this was just tool differences and a few more rounds of use would smooth it out. After a few months of grinding, I admit it's structural. It's not that any one product is doing badly. The paradigm itself is built this way.

Each AI's session is independent. I finish a discussion with Claude today; tomorrow I open Codex and it knows nothing. I have to manually paste a chunk of background, then a chunk of last round's conclusions, then explain what we're doing now. Eight times out of ten the background I paste is incomplete — not because I'm lazy, but because I genuinely don't remember where the last round left off. Going back to dig through chat logs is brutally inefficient, because chats are full of exploratory chatter and the actual decisions are only a small slice of that.

Across tools it gets worse. Files Claude changed, Codex doesn't know were changed. Scripts Codex ran, Kimi doesn't know produced results. Chinese material Kimi organized, Claude doesn't know exists. Three AIs each carry their own "project in their head," and the three don't merge. Ask any of them what state the project is in right now and you'll get a very confident answer — and three confident answers that fight each other.

The worst time: I had Codex change a piece of config, it went smoothly. The next day I asked Claude to look at the same module and it said "I suggest you change it to X" — and X was exactly the pattern Codex had moved away from the day before. Neither AI was wrong. The fault was that nothing in the middle let them know about each other. If I hadn't caught it and had taken Claude's suggestion, a few days later Codex would get tripped up by some test and reverse it again — a literal loop, each round "fixing" the previous "fix."

There's another kind of pain that's more subtle: no conflict, just discontinuity. Codex ran the tests and they passed; the conclusion stayed in that one session. Next time I open Claude to discuss the next step, it has no idea the tests ever passed — it carefully suggests "let's run the tests first to confirm." That caution isn't wrong, but for me it's pure dead round-trip. The AI keeps reconfirming things I already confirmed, because it has no channel to know I did.

A few rounds of this and you want to give up on parallel use and go back to a single AI. But the cost of going back is bigger — it means dropping 70% of your usable compute, and dropping the core benefit that "different AIs are good at different things." So the question was never "parallel or not." It's "where's the minimum order that makes parallel work."

Three things that hold the whole chain together

Six months in, three things have survived. Not because I designed them brilliantly — because shrinking the set further actually breaks things, and growing it turns into administrative drag.

First is the dual constitution. Two top-level rule files. One governs behavior, one governs knowledge. The behavior one says: how tasks flow, how files get changed, which actions require stopping to ask, how CHANGE gets recorded, which lines are red lines, which actions are default-go. The knowledge one says: how things get filed, what the naming convention is, how content lineage gets tagged, how many layers the Feed has, what material belongs in which layer.

At first I tried to merge them. After two months I admitted they can't be merged. Behavior rules and knowledge rules are fundamentally different in nature. Behavior is "should I do this" — it's a judgment. Knowledge is "where does it go, what is it called" — it's a convention. Cramming judgment and convention into one file makes the AI bad at both. Either it treats the behavior rules like metadata — reading "stop and ask first" as "add status: pending to the file's frontmatter" — or it treats the naming convention like a moral constraint, refusing to keep working when it sees a non-standard name even though the file is just a working draft. Splitting them into two makes both clearer. Reading the behavior file, the AI knows it's making a judgment. Reading the knowledge file, it knows it's filing something.

There's a simpler benefit too: split in two, each evolves independently. I touch the behavior file roughly every one or two months, because task modes shift. The knowledge file is more stable — once naming and layering are set, they shouldn't drift much, so every three to five months is enough. When they were one file, touching either half meant rethinking the whole thing, so I ended up afraid to touch either.

Second is the task folder. Every cross-AI task gets its own directory on the shared filesystem. Inside, four fixed things:

README — what this task is actually trying to do, what acceptance looks like, what it depends on. One page, no more.
notes.md — an append-only log. Whenever an AI finishes one piece of work, it appends one entry at the bottom: what got done, what the conclusion was, who's next, where the key files are. No overwrites, only appends.
handoff.md — written when handing off to the next AI. Current state, what's been done, what hasn't, what to watch out for on pickup, key file paths.
outputs/ — what this task actually produced. Scripts, reports, data, modified code snippets.
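The four fixed items above can be sketched as a scaffold plus an append helper. The `.md` extension on the README, the `## ` entry header, and the field labels are assumptions for illustration. The only parts taken from the convention itself are the four names and the append-only rule.

```python
from datetime import datetime
from pathlib import Path

def create_task_folder(root: Path, name: str) -> Path:
    """Create the task directory with its four fixed items, keeping existing files."""
    task = root / name
    (task / "outputs").mkdir(parents=True, exist_ok=True)
    for fname in ("README.md", "notes.md", "handoff.md"):
        (task / fname).touch(exist_ok=True)   # touch never truncates
    return task

def append_note(task: Path, author: str, done: str, conclusion: str,
                next_up: str, key_files: list[str]) -> None:
    """Append one entry at the bottom of notes.md; earlier entries are untouched."""
    stamp = datetime.now().isoformat(timespec="minutes")
    entry = (
        f"## {stamp} {author}\n"
        f"- done: {done}\n"
        f"- conclusion: {conclusion}\n"
        f"- next: {next_up}\n"
        f"- key files: {', '.join(key_files)}\n\n"
    )
    with (task / "notes.md").open("a", encoding="utf-8") as f:
        f.write(entry)
```

Opening the file in append mode is what enforces "no overwrites, only appends" at the code level, so it does not depend on each AI remembering the rule.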

Third is handoff itself. It's how the handoff.md file in the task folder gets used: the previous AI finishes a leg and leaves a handoff behind; the next AI picks up by reading README → the last few notes entries → handoff, in that order. Five minutes to be in state, then it keeps going. Handoff isn't a log, it's a signpost — it tells the receiver "you're standing here, the next step goes that way."
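The pickup order (README, then the last few notes entries, then handoff) can be sketched as one function that assembles the context for the next AI. File names with `.md`, the `## ` entry header used to split notes, and the default of three entries are assumptions for this sketch.

```python
from pathlib import Path

def last_entries(notes_text: str, n: int) -> list[str]:
    """Split notes on lines starting with '## ' and keep the last n entries."""
    entries, current = [], []
    for line in notes_text.splitlines():
        if line.startswith("## ") and current:
            entries.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries[-n:] if n > 0 else []

def pickup_context(task: Path, last_n: int = 3) -> str:
    """Assemble README, the last few notes entries, and handoff, in that order."""
    readme = (task / "README.md").read_text(encoding="utf-8")
    notes = (task / "notes.md").read_text(encoding="utf-8")
    handoff = (task / "handoff.md").read_text(encoding="utf-8")
    parts = [readme, *last_entries(notes, last_n), handoff]
    return "\n\n".join(p.strip() for p in parts if p.strip())
```

Putting handoff last is deliberate: whatever the receiver reads last is the signpost, so "the next step goes that way" is the freshest thing in context.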

The three together are light — one folder template, one append format, one handoff action. But they have an order: the constitution sets boundaries, the task folder sets context, handoff sets the relay. Drop any one and the chain breaks at that link. Without the constitution, the AI doesn't know which decisions aren't its to make. Without the task folder, the AI doesn't know where the project stands. Without handoff, the AI knows where the project is but not where to pick up from.

Why these three, not something else

Early on I added a lot of things. Status boards, kanban, daily summaries, cross-AI notifications, version manifests. After three months most of them were gone.

The criterion is simple: if removing it makes the order collapse, keep it. If not, delete it.

Without the dual constitution, it collapses. When the AI doesn't know the boundary, it decides for you — and decides confidently. It'll write into files it "thinks should be changed," move things it "thinks should be archived," rename a batch of material without authorization. None of it is malicious. Every time, it's the AI using its own judgment to fill in a blank. If the blank isn't filled, the AI will fill it. That's instinct. The dual constitution fills exactly that blank: which actions require stopping, which material follows which naming, which directories are off-limits roots. It doesn't have to be detailed. It just has to exist — the existence itself is the signal, telling the AI "past this line, ask me, don't decide."

Without the task folder, it also collapses. If context only lives in chats, every tool switch is a memory restart. I used to think I could remember "where the last round left off." Running ten parallel tasks I cannot. The task folder's job is to take context out of my head and put it on disk. So the next AI (or me a week later) opens the directory and starts from a known position, not from fragments of my memory. The most interesting thing about this: what it actually solves isn't the AI's memory problem, it's mine. The AI's memory doesn't matter either way — it cold-starts every time. Mine is finite and needs somewhere to live.

Without handoff it collapses the fastest. The task folder has a README and notes, so in theory the next AI can figure it out — but in practice it can't. Notes are append-only; after thirty entries nobody reads from the top. The README is the task definition, not the current state. Neither tells you "the next thing you should do right now." Handoff exists to solve exactly that. It replaces the dumb "manually copy-paste context between two chat windows" action — and replaces it completely, because the moment you write it down it persists, unlike chat state that vanishes when you close the window.

Three things, three different jobs: boundary, context, relay. The relationship isn't redundancy, it's division of labor. That's why I deleted everything else — the rest was either duplicating one of these three, or solving a problem that didn't actually exist. For example, I once built a cross-AI notification system where one AI would message the next after finishing. Sounded reasonable. Useless in practice: the next AI doesn't become smarter from receiving a ping, it still has to read the README and handoff to get into state. The notification just added a failure point.

Or version manifests. I once wanted to tag each task with a version number for easy rollback. Turned out it wasn't needed — notes are append-only and inherently a timestamped evolution record. To roll back, roll back to the state described by a specific notes entry. No separate version number required. Adding a manifest layer would just be one more thing to maintain.

So the reason these three are the "minimum" isn't subjective. I lined up everything I'd added and later deleted, and these three are the only ones I couldn't compress further. Remove any one of them and a class of problems has no owner. Add any one more and there's a lighter scheme that covers it.

The rules are shrinking, not growing

People who hear "dual constitution" worry the rules will keep growing thicker. I worried too. In practice it's gone the other way.

The earliest version of the constitution was a thick stack. Which scenarios should ask, which should act, which file goes where, how every action should be recorded — even "modifying a comment counts as modifying a file" was on the list. That version performed the worst. After reading it, the AI got more cautious, not less. It asked about everything: one line of comment to change — ask. A throwaway temp file to create — ask. An obviously dead placeholder file to delete — still ask. When rules are too dense, they turn into formalism. The AI isn't judging by rule, it's using "let me ask" to dodge anything that might touch a line.

So I started reverse-editing. Every time something went wrong, I'd ask first: "Were there not enough rules, or so many that the AI missed the key one?" Eight times out of ten it was the latter. The rules got compressed round after round: from "enumerate everything you should do" down to "a few red lines you can't touch + risk tiering + a few task modes." There was a middle version with "risk levels L0-L3." It looked elegant, but in practice the AI often couldn't tell which level the current action belonged to and ended up asking anyway. In the next version I just cut the tiers and kept two categories: "absolutely don't" and "give me a heads-up first." Everything else is default-go. The AI's judgment accuracy jumped immediately. The current version has four boundaries and one three-column action table.
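The two-category shape is small enough to write down. In this sketch the action names in both sets are invented placeholders, not the contents of my actual behavior file. What it illustrates is the structure: two explicit lists and a default of go.

```python
from enum import Enum

class Verdict(Enum):
    FORBIDDEN = "absolutely don't"
    ASK_FIRST = "give me a heads-up first"
    GO = "default-go"

# Placeholder action names, for illustration only.
FORBIDDEN_ACTIONS = {"write_to_off_limits_root"}
ASK_FIRST_ACTIONS = {"batch_rename", "move_to_archive"}

def classify(action: str) -> Verdict:
    """Two explicit lists; anything not listed is allowed by default."""
    if action in FORBIDDEN_ACTIONS:
        return Verdict.FORBIDDEN
    if action in ASK_FIRST_ACTIONS:
        return Verdict.ASK_FIRST
    return Verdict.GO
```

Compared with a four-level tier, there is no "which level is this" question left to answer: an action is either on a list or it isn't.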

This shrinking isn't me getting lazier. It's me seeing one thing clearly: the constitution isn't there to regulate everything, it's there as a backstop. 90% of daily judgments the AI gets right on its own. The constitution covers the other 10% where it goes wrong. Write the rules too full and you freeze the 90% too — things the AI could just do, it now has to stop and check rules for; small things that didn't need confirmation, it now asks about. That isn't safer, it's slower, and the slowness is a cost I end up paying.

My Inbox still has twelve constitutional amendment proposals sitting in it. Some propose adding, some removing, some converting a boundary into an action checklist, some introducing a new intermediate layer. I'm not in a rush to rule on them. The fact that they're sitting there means this order is still alive — being questioned, rewritten, overturned by itself. A constitution that's no longer being questioned is the dangerous one. That kind isn't unquestioned because it's perfect, it's unquestioned because nobody is reading it seriously anymore.

What six months of running this taught me

After actually running these three for six months, I have a few judgments I'm fairly settled on:

The dual constitution works well. Separating behavior and knowledge was right. I haven't regretted it once. The AI reads only the behavior file when making behavior decisions, only the knowledge file when filing and naming. The two don't interfere, and accuracy is much higher than when they were combined. The counterintuitive part: splitting into two takes less brainpower than keeping one. The reason is probably this — when one document mixes two fundamentally different kinds of rules, the AI's attention gets diluted. Reading the "should I do this" section, it's still thinking about "which layer does this go in," and ends up getting both wrong. Split, each loads independently. Crude but effective.

The task folder is the lifeline. 90% of collaboration problems get solved at this layer. As long as this layer is solid — README clear, notes continuously appended, outputs all in the directory — the next AI picks up basically without error. Once this layer collapses, no constitution can save you, because the AI has no context. Rules without context don't work. Rules tell the AI "what not to do," but not "what to do right now." That can only be read from context.

Handoff is the piece I most often slack on. Every time I finish a leg of a task, I want to just close the window and pick it up myself next time. The voice in my head says "I'll remember anyway." I don't. By the next time I want to resume, I have to spend twenty minutes digging through notes and outputs to reconstruct last time's state. That "digging back" cost is always bigger than "spending five more minutes writing the handoff then." I know the lesson and still slack on it, so I'm constantly correcting myself. Recently I started using a small rule — no closing the window before the handoff is written, or future-me will pay. The rule works, but maintaining it is itself a discipline cost.

This order has a range. It fits "one person + three or more AIs + long-term engineering." If you're using one AI for short tasks, this is overkill — context fits inside a single session, no need for folders. If you're a team collaborating across multiple AIs, this is too light — you need permissions, review, formal archiving, because human-to-human collaboration adds a dimension that markdown alone can't carry. It sits in an awkward middle: heavier than personal notes, lighter than team governance. I built it because I'm stuck in that middle. It won't fit everyone. For people stuck in the same shape, it should be a useful reference.

Ending: still being questioned by myself

I don't want to write this as "I designed a perfect multi-AI collaboration order." It isn't perfect. Twelve proposals are still sitting unresolved in the Inbox. I still slack on handoffs. The boundary between the two constitutions occasionally needs on-the-spot judgment. The rules are still shrinking toward "minimal boundaries" — meaning the current version will get overturned again.

But one thing I'm much more certain about than six months ago: the core problem of multi-AI collaboration isn't in the AI itself, it's in the layer of order in between. Get that layer right and three AIs feel like a team. Get it messy and three AIs are more tiring than one.

That layer doesn't need anything complicated to hold it up. One behavior constitution, one knowledge constitution, one task folder convention, one honestly-written handoff — that's it.

I'm increasingly convinced cross-AI order isn't designed, it's what's left after repeated slacking, repeated rework, repeated face-plants. Every useless rule deleted, every rule that truly backstops kept — the order gets one notch more stable.

This order is also being questioned by itself. It looks like this today. In six months it probably looks different. It doesn't need to be the final form — it just needs, today, to let the next AI picking up know where to start.

Jiyanran Voice Workbench: Why a Voice Entry Point Needs Three Layers of Decoupling

Wed, 20 May 2026 00:00:00 GMT

Jiyanran voice workbench notes

The voice entry point is the layer people most easily misjudge the difficulty of — the user speaks, the AI answers, that's all it is on the surface. But if you want it to still be standing in the second month, it is not as light as "wiring up a single cable."

Jiyanran (纪嫣然) is a voice workbench agent I run locally. What it does is straightforward: I speak a sentence to my Mac, the system recognizes what I said, then decides which agent inside OpenClaw (my own agent factory) should handle it, and after handling it, speaks the result back to me. It sounds like a single pipe would do: mic in, speaker out, a model in the middle.

That is what I thought at first. The first version was even a hastily written direct-call version: the front-end UI called the speech recognition SDK directly, the recognized text went straight into the large model, and the model's reply was read out directly. It worked, the demo looked good, but after two or three days I realized this path was actually a trap. Anything I wanted to swap, upgrade, or add risk controls to would force the whole chain to move.

That is why today's v1.0 looks the way it does: OpenRoom front end → voice-bridge → avatar-bridge → OpenClaw. Three independent services, each with its own mock fallback, each with its own risk-gate. Adding two bridge layers in the middle looks redundant. But after using it for a while I have only become more certain: these two bridge layers are not redundant — they are the reason this system can live long.
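The shape of "each layer has its own mock fallback and its own risk-gate" can be sketched in a few lines. This is a simplified in-process model, not the real services: the `Layer` class, the gate conditions, and the mock replies are all assumptions, and in the actual system each layer is a separate service, not a function call.

```python
from typing import Callable

Handler = Callable[[str], str]

class Layer:
    """One hop in the chain: risk-gate first, real backend next, mock on failure."""

    def __init__(self, name: str, real: Handler, mock: Handler,
                 risk_gate: Callable[[str], bool]) -> None:
        self.name = name
        self.real = real
        self.mock = mock
        self.risk_gate = risk_gate

    def handle(self, payload: str) -> str:
        if not self.risk_gate(payload):
            return f"[{self.name}] blocked by risk-gate"
        try:
            return self.real(payload)
        except Exception:
            # Broad catch on purpose: any downstream failure degrades to the mock.
            return self.mock(payload)

def build_chain(openclaw: Handler) -> Layer:
    """Wire voice-bridge -> avatar-bridge -> OpenClaw (gate rules are placeholders)."""
    avatar = Layer("avatar-bridge", real=openclaw,
                   mock=lambda text: f"(mock agent reply to: {text})",
                   risk_gate=lambda text: True)
    voice = Layer("voice-bridge", real=avatar.handle,
                  mock=lambda text: "(mock voice reply)",
                  risk_gate=lambda text: 0 < len(text) <= 500)
    return voice
```

With this shape, an OpenClaw outage is absorbed by the avatar-bridge mock and the front end still gets an answer, which is what lets each layer be developed and swapped without the others running.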

What this article wants to say is exactly this: why a local voice entry point cannot be written as a monolithic direct call for the sake of convenience, why it must be split into three layers, why every layer needs a mock, why every layer needs a risk-gate. This is not about some flashy feature I built — it is about a judgment I stumbled into the hard way: about where to split, where to fall back, and where to stand guard. If you are thinking about building a local voice assistant or agent entry point, this judgment might save you part of the road I already walked.

The price of coupling: the lazy version feels great for two weeks, then rots in the third

How simple was the first direct-call version? A button on the front end, press to talk, release to send. Recognition ran in the front end, model calls assembled prompts in the front end, and even risk controls were written into the front end on the side — three hundred lines of JavaScript, end-to-end demoable. I was actually pretty pleased with myself at the time, thinking this stuff was not so complicated after all.

The problem is, in the second week I wanted to swap recognition engines. The local model I was using did poorly on mixed Chinese-English speech, and I wanted to try another. I opened the front-end code and found that the recognition call, error handling, timeout, retry, downsampling, and VAD (voice activity detection) were all tangled together in the front end. Swapping an engine was not a matter of changing one import — it meant peeling that whole blob apart again.

The third week was harder. The OpenClaw agent interface changed once. It was actually a very small protocol upgrade, but because the front end was assembling OpenClaw requests directly, the entire front-end request construction logic had to follow. Every time it changed, the front end shook, and the UI was prone to breaking with it.

What finally broke me was risk control. I wanted to add a confirmation step on certain commands (operations like "delete a workspace" should require a second confirmation), and I found this gate could only be written in the front end. But the front end is on the user side — anyone could bypass it in theory; the right request sent directly would hit OpenClaw. That is not technical debt — that is a real security hole.

In that moment I saw it clearly: the voice entry point looks simple, but it mixes four things together from the start — the presentation layer, the perception layer, the dispatch layer, and the execution layer. The price of mixing them is not slow code; the price is that any change later forces you to rewrite the whole layer. Two weeks of bliss, third week of rot, fourth week unmovable.

There is a more hidden cost — the mental load. With the direct-call version, every change forced me to first mentally trace the whole chain: would recognition be affected? Would the UI state misalign? Would request construction mismatch the new protocol? That feeling of "any change anywhere means worrying about the whole chain" quickly wears down the appetite to keep building. A local tool that makes you spend ten minutes "thinking through side effects" every time you touch it will sooner or later be abandoned by its own author.

This is not unique to voice entry points, by the way. Any system that crams "front end + model + agent" into one place runs into it. But voice entry points have it especially bad — because they add two extra things that complicate matters: real-time audio streams, and the user's expectation of low-latency feedback. Both of these strongly tempt you to "just write it together for convenience," because every extra hop adds latency and every extra process adds uncertainty. The temptation is strong, but the cost is stronger.

How the layers split: OpenRoom / voice-bridge / avatar-bridge / OpenClaw each handle one thing

So v1.0 was rebuilt from scratch, rearranged on the principle of "each layer does only one thing." Four things, four layers.

The outermost layer is the OpenRoom front end. It is just a room: there is a microphone, there are speakers, there is an interface showing the conversation, there are buttons, there is some visual feedback. What it handles is extremely narrow — take the user's voice in, play or display what the backend returns. It does not recognize speech, does not construct requests, does not know who OpenClaw is, does not speak to the model directly. It is just a room — people talk inside, and what happens outside the room is none of its business.

One layer in is voice-bridge, running on port 3962. This layer handles one thing: turning "voice" into "a structured task." It catches the audio stream from the front end, calls the recognition engine, handles VAD, segmentation, confidence, optional language detection, and finally emits a structured description of "I am reasonably sure the user said this." This layer does not know who will pick up downstream or what will be done with it; its responsibility ends at "intent recognized."
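To make "a structured task" a little more concrete, here is a minimal Python sketch of what such a description could look like. The class name, the field names, and the 0.6 confidence threshold are all assumptions for illustration; the actual voice-bridge schema is not shown in this article.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class RecognizedTask:
    """Hypothetical shape of what voice-bridge hands downstream."""
    text: str                        # what the user most likely said
    confidence: float                # 0.0 to 1.0, as reported by the recognition engine
    language: Optional[str] = None   # filled in only if language detection ran
    segments: list = field(default_factory=list)  # VAD-delimited chunks, if kept

    def is_usable(self, threshold: float = 0.6) -> bool:
        # Below the (assumed) threshold, ask the user to repeat instead of guessing.
        return bool(self.text.strip()) and self.confidence >= threshold

    def to_json(self) -> str:
        # Serialized form that crosses the process boundary.
        return json.dumps(asdict(self), ensure_ascii=False)
```

The useful property is that nothing in this structure mentions agents or dispatch: the layer's responsibility ends at "this is what was said, and this is how sure I am."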

Further in is avatar-bridge, running on port 3961. This layer handles dispatch. voice-bridge hands it a structured task, and it decides which agent the task belongs to: knowledge questions go to the information line's agent, writing goes to the content line's, command execution goes to the execution line's. This corresponds to "dispatch" — not "recognition," not "execution."
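A dispatch layer of this kind can be as small as a lookup table. The sketch below is only an illustration: the task kinds and the agent-line names are made-up labels that mirror the three examples above, not the real avatar-bridge rules.

```python
# Hypothetical task kinds and agent lines, mirroring the three examples in the text.
ROUTES: dict[str, str] = {
    "knowledge": "information-line",
    "writing": "content-line",
    "command": "execution-line",
}


def dispatch(task_kind: str) -> str:
    """Return the agent line that should receive the task; refuse unknown kinds."""
    try:
        return ROUTES[task_kind]
    except KeyError:
        # Unknown kinds are rejected rather than sent to a default agent.
        raise ValueError(f"no agent line registered for task kind {task_kind!r}")
```

For example, `dispatch("writing")` returns `"content-line"`, while an unregistered kind raises instead of being guessed at. Changing dispatch rules then means editing this table and nothing else.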

Innermost is OpenClaw. The ones who actually do the work all live here — Suwan, Huo Rui, Shen Zhixing, and Jiyanran herself. OpenClaw does not care about voice, does not care about front-end buttons, does not care who dispatched the task; it only cares about "I have received a task; do it well according to my persona and capabilities."

Four things, four layers, each looking only at its own boundary. The front end does not know how downstream dispatches; voice-bridge does not know which agents exist downstream; avatar-bridge does not know how each agent works internally; OpenClaw does not know whether the task came out of someone's mouth or off someone's keyboard. Each layer sees only its own slice.

There is a hidden benefit to this "only look at your own slice" design: each layer can swap "entry form" without affecting the others. Today it is a voice entry; tomorrow I want to add a keyboard entry, I just write another "keyboard bridge" and plug it into avatar-bridge — downstream OpenClaw does not need to move at all. The day after, an email entry, a Telegram entry, a shortcut entry — same pattern. Below avatar-bridge becomes a stable "task execution backend"; above avatar-bridge can be any number of entry forms. This started to matter a lot when I began wiring agents other than Jiyanran into the system — the same OpenClaw backend can serve many entry forms without starting from zero each time.
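The "any number of entry forms above, one stable backend below" idea can be sketched as a small adapter interface. Everything here is hypothetical (the `Entry` protocol, the class names, the task dictionary shape); it only illustrates that each entry form produces the same task shape before anything reaches the dispatch layer.

```python
from typing import Protocol


class Entry(Protocol):
    """Anything that can turn raw user input into the common task shape."""
    def to_task(self, raw: str) -> dict: ...


class VoiceEntry:
    def to_task(self, raw: str) -> dict:
        # `raw` stands in for text that recognition has already produced.
        return {"source": "voice", "text": raw.strip()}


class KeyboardEntry:
    def to_task(self, raw: str) -> dict:
        return {"source": "keyboard", "text": raw.strip()}


def submit(entry: Entry, raw: str) -> dict:
    """Everything past this point sees one task shape, whatever the entry form."""
    task = entry.to_task(raw)
    return {"accepted": bool(task["text"]), "task": task}
```

Adding an email or Telegram entry under this sketch would mean one more small class with a `to_task` method; `submit` and everything behind it stay untouched.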

This split looks verbose — a single audio clip from microphone to actual work has to pass through four processes, two ports, and three serializations. But what I discovered later is that this very "verbosity" is what lets each layer be swapped on its own. Swapping the recognition engine touches only voice-bridge; changing dispatch rules touches only avatar-bridge; OpenClaw upgrading the agent protocol leaves the outer three layers untouched.

Why ports 3962 and 3961, two adjacent numbers? Pure convenience — I grouped voice-related services in the 3960 range to make them easy to remember and debug. That is not design philosophy, just engineering preference. But "each layer has its own port" is deliberate: it forces me to treat each layer as an independent service, so I cannot quietly merge two layers into one process in some later version. Physical isolation enforces logical isolation.

There is actually an unexpected benefit after this layering is done: each layer can be tested independently. I can spin up only voice-bridge, feed it a recorded audio file, and see what it recognizes; I can spin up only avatar-bridge, feed it a fake structured task, and see where it dispatches; I can spin up only OpenClaw, feed it a fake agent task, and see how the agent responds. Each layer has its own test set and its own regression cases. Locating problems is fast too — compare the logs of the four layers and you immediately see which layer broke.

Why every layer needs a mock fallback: the user end cannot go "blank screen"

Decoupling is only the first step. What actually lets this architecture hold up in daily use is another seemingly unremarkable design: every layer carries its own mock fallback.

What does that mean? When voice-bridge starts up and finds that avatar-bridge is not running or unreachable, it does not just throw an error back to the front end. It catches with a mock interface: returns a prepared placeholder response, telling the front end "the dispatch layer is temporarily unreachable, use this fake data for now." avatar-bridge is the same — if OpenClaw is down, it uses a mock agent to return a placeholder result. The front end is the same — if voice-bridge itself is not up, it can at least recognize that the user pressed the button and display "voice channel not yet connected," instead of a black screen.
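As a rough sketch of this fallback, assuming the layers talk plain HTTP with JSON bodies (the article does not say which protocol they use), the forwarding code in voice-bridge might look like the following. The URL path `/dispatch`, the function names, and the response fields are assumptions; only the port number comes from the text.

```python
import json
import urllib.error
import urllib.request

# Port 3961 is from the article; the /dispatch path is an assumption.
AVATAR_BRIDGE_URL = "http://127.0.0.1:3961/dispatch"


def placeholder(downstream: str) -> dict:
    # The explicit "mock" marker lets the front end label this as a placeholder.
    return {
        "mock": True,
        "downstream": downstream,
        "message": f"{downstream} is not connected; this is a placeholder response",
    }


def forward_task(task: dict, url: str = AVATAR_BRIDGE_URL, timeout: float = 2.0) -> dict:
    """Send the task downstream; on any failure return a marked placeholder."""
    body = json.dumps(task).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable, timed out, or returned something that is not JSON.
        return placeholder("avatar-bridge")
```

The short timeout matters as much as the `except` clause: a downstream that hangs is, from the user's side, no better than one that is down.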

Why does this matter so much? Because a local AI system is not a cloud service — downstream instability is the norm. OpenClaw needs to restart for upgrades, the recognition model needs time to load, an agent running a long task may not respond. If every layer just passes downstream failure up the chain as-is, the user end will see "an error occurred, please retry" frequently. Once or twice is fine; ten times in a row and the workbench is dead.

That said, mocks are not for tricking the user. When the mock catches, the front end explicitly shows "this is a placeholder response, downstream X is not connected," instead of pretending it really answered. This is key — the mock is so the user end can keep operating (enter the next request, change settings, view history), not so the system can pretend everything is fine.

Once you actually do it, you find mocks have another hidden benefit: every layer can be developed independently. While developing voice-bridge, avatar-bridge can just run as a mock, with no need for OpenClaw to actually be running. While tuning dispatch rules on avatar-bridge, the OpenClaw layer can be fully mocked, so development is not blocked by downstream. Otherwise four-layer integration testing means one layer crashes and the whole chain stops, fragmenting your dev rhythm.

I had one principle when designing the mocks: mocks must be identifiable. The content they return carries an explicit placeholder marker, and the front end, on seeing this marker, tells the user explicitly "current response is a placeholder." I did not think this through at first and wrote a version of "pretend everything is fine" mocks. The result: one time voice-bridge could not reach avatar-bridge, the front end received a mock response and played it normally, and I did not notice downstream was down for half an afternoon. After that, mocks had to be explicitly visible — better to look crude than to let "the system is actually not working" hide behind an illusion.
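The "identifiable" rule comes down to one branch on the display side. The sketch below is written in Python only to keep the logic readable; the `mock`, `downstream`, and `reply` field names are assumptions, not the real response format.

```python
def render(response: dict) -> str:
    """Turn a backend response into the line shown to the user."""
    if response.get("mock"):
        # Never play or display a placeholder as if it were a real answer.
        down = response.get("downstream", "a downstream service")
        return f"[placeholder] {down} is not connected"
    return response.get("reply", "")
```

With this in place, the half-afternoon failure described above would have shown up on the first request as a visible `[placeholder]` line.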

Another lesson: do not try to make mocks "look smart." I once thought about having the mock use a small model to generate placeholder text that sounded more like a real answer. In the end I did not do it. The reason is direct: the smarter the mock, the harder it is for the user to tell whether it is real or a placeholder, and the easier it is to take fake data for true. A simple, crude mock is, by contrast, honest — its very existence is saying "this downstream link is down."

v1.0 does it this way, and v2.0 has not changed it. That is not laziness; this design has been validated. With mocks in place, the whole system stays stable; with mocks removed, the chain becomes fragile.

Why every layer needs a risk-gate: do not let OpenClaw be bypassed at will

Decoupling solves "can be swapped," mocks solve "can hold up." But one problem is still unsolved — security.

The security of a local system is easier to overlook than you would think. Many people assume "I am the only user anyway, so there is no risk locally," but as long as this system has ports, APIs, and the ability to call real things, it can be bypassed, abused, or accidentally triggered. Even if I merely misspeak one sentence, or one word is misrecognized, OpenClaw might end up doing something it should not.

So every layer needs its own risk-gate.

The front-end layer is the most basic: an allowlist. Which clients can connect to the front end and which sources can inject messages are hardcoded. Anything not on the list cannot even open the page.
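A hardcoded allowlist is about the simplest gate there is. In this sketch the origin values are invented examples and the function name is hypothetical; the point is only that the check is an exact match against a fixed set, with no pattern matching to get wrong.

```python
# Example values only; the real list is whatever the owner hardcodes.
ALLOWED_ORIGINS = frozenset({
    "http://localhost:5173",
    "http://127.0.0.1:5173",
})


def is_allowed(origin) -> bool:
    """Exact-match check: anything not listed is rejected, including a missing origin."""
    return origin in ALLOWED_ORIGINS
```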

voice-bridge handles voice boundaries: which prompt patterns are allowed, which command patterns must be intercepted immediately. For example, if the user's spoken sentence contains keywords that are easily misrecognized, voice-bridge does a first pass of intent cleaning, marking high-risk expressions before passing them down.
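A first pass of this kind can be a handful of regular expressions that mark, rather than block, risky phrasing. The keyword list and the output fields below are assumptions chosen for illustration; a real list would be tuned to the commands the agents actually accept.

```python
import re

# Illustrative keywords only; not the real voice-bridge pattern list.
HIGH_RISK_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\bdelete\b", r"\bremove\b", r"\bwipe\b")
]


def mark_risk(text: str) -> dict:
    """Attach a high-risk flag so the next layer can demand confirmation."""
    hits = [p.pattern for p in HIGH_RISK_PATTERNS if p.search(text)]
    return {"text": text, "high_risk": bool(hits), "matched": hits}
```

Marking instead of rejecting keeps this layer honest about what it knows: it can tell that a sentence sounds dangerous, but deciding what to do about it belongs to the dispatch layer.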

avatar-bridge is the most critical layer. It is the one that actually decides "who gets this task," so it must be the strictest. Every agent has its own boundary of what it can and cannot do, and avatar-bridge checks before dispatching: does this task match this agent's capabilities? Are the required permissions present? Is this a high-risk action that requires second confirmation, and has that confirmation been given? If any check fails, do not dispatch.
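The three checks can be written as one short function that refuses by default. This is a sketch under assumptions: the policy table, the agent name `execution-line`, and the action names are invented for the example, and the real avatar-bridge rules may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPolicy:
    capabilities: frozenset                      # actions this agent may perform at all
    needs_confirmation: frozenset = frozenset()  # actions that need a second confirmation


# Hypothetical policy table for illustration.
POLICIES = {
    "execution-line": AgentPolicy(
        capabilities=frozenset({"create_workspace", "delete_workspace"}),
        needs_confirmation=frozenset({"delete_workspace"}),
    ),
}


def gate(agent: str, action: str, granted: set, confirmed: bool = False) -> tuple:
    """Return (allowed, reason). Every failed check blocks dispatch."""
    policy = POLICIES.get(agent)
    if policy is None or action not in policy.capabilities:
        return (False, "action outside agent capabilities")
    if action not in granted:
        return (False, "missing permission")
    if action in policy.needs_confirmation and not confirmed:
        return (False, "second confirmation required")
    return (True, "ok")
```

Because this runs in the dispatch service rather than in the front end, a request that skips the UI still has to pass the same three checks.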

OpenClaw itself also has its own layer of risk-gate. This is "the last line of defense" — even if the three layers in front are all bypassed, OpenClaw internally still has its own personas, its own boundaries, its own audit log. No agent can do anything beyond its capability range without owner approval.

Four layers of risk-gate sound repetitive, but they are not. The logic behind stacking them is that no layer can be assumed trustworthy. The front end might be bypassed, voice-bridge might misrecognize, avatar-bridge might dispatch wrong. So every layer guards its own door and does not count on the outer layer to keep the dirt out.

There is a side benefit to this setup: audit logs rotate daily, and every layer writes its own. Wherever something goes wrong, that layer's log sees it first. To do a retrospective on one misrecognition, you do not dig through a mass of mixed-together full-chain logs — you first look at voice-bridge's recognition records for that day, then avatar-bridge's dispatch records, then OpenClaw's execution records. Each layer's log covers its own slice, and retrospectives are actually faster.
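For per-layer logs that rotate daily, Python's standard library already has the needed piece in `logging.handlers.TimedRotatingFileHandler`. The sketch below assumes each layer is a Python process writing to its own file; the function name, the `audit.<layer>` logger naming, and the 30-day retention are assumptions, not details from the actual system.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def make_audit_logger(layer: str, log_dir: str, keep_days: int = 30) -> logging.Logger:
    """Create a logger that writes <log_dir>/<layer>.log and rotates at midnight."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"audit.{layer}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep audit lines out of the root logger
    if not logger.handlers:   # avoid stacking duplicate handlers on repeated calls
        handler = TimedRotatingFileHandler(
            Path(log_dir) / f"{layer}.log",
            when="midnight",
            backupCount=keep_days,
            encoding="utf-8",
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

A layer would call `make_audit_logger("voice-bridge", "logs")` once at startup and then log one line per recognition, so a retrospective can start from that single file for that single day.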

Later I distilled another lesson: the stricter each layer's risk-gate, the more downstream logic can be simplified. If avatar-bridge has already blocked illegitimate requests before dispatching, OpenClaw internally does not need much defensive code for "is this input malicious." It can focus on what it does best — executing tasks. Conversely, if upstream risk-gates are toothless, downstream has to write all kinds of boundary checks itself, and the whole codebase grows more bloated. So layered risk-gates are not just security design — they are also about putting responsibilities in their right place: each layer only needs to do its own checks well, no need to back up someone else.

v1.0 vs v2.0 — the product judgment: core architecture stays put, increments only at the edges

The system currently runs on v1.0. This version has been running stably for a while, with fixed ports (voice-bridge 3962, avatar-bridge 3961), the allowlist and risk-gates working, audit logs rotating daily, mock fallbacks holding, and recognition, dispatch, and execution each doing their part. It is not perfect, but it is a "real thing that is running."

I have also started on v2.0. The skeleton of v2.0 has landed and is in its finishing stages; v2.0 GA (General Availability) is still waiting on audit. But one thing I set in stone from the start: v2.0 only does increments and does not touch the core architecture.

This is a product judgment, not a technical one.

Technically v2.0 could perfectly well "take the chance to rework the architecture" — merge voice-bridge and avatar-bridge into one process to save a serialization hop, switch to a more modern communication protocol, replace the mock fallback with a smarter "pretend to keep chatting." Each of these can be justified on its own.

But the product judgment tells me not to touch them. The core architecture of v1.0 has been validated by months of real use: four layers of decoupling, mocks per layer, risk-gates per layer. This structure was not dreamt up in the abstract — it was earned by stepping in holes. Any "take the chance to fix this" can push a validated stable state back into instability. Increments are safe; rewrites are gambles.

So what is v2.0 actually doing? Enhancements at the edges. Finer intent recognition, friendlier placeholder responses, better support for long conversations, more granular dispatch rules for some agents. These are all "adding a bit," not "rewriting." The original four layers, four ports, four mocks, four risk-gates — none of them moved.

This kind of judgment is especially important on a one-person project. The pit a solo developer most easily falls into is "rewriting on every upgrade" — because no one is holding you back, because the code you wrote yesterday looks ugly today, because the new SDK looks sexier. But if you really want it to live long, the first thing to do is lock the validated parts and leave the unvalidated parts to incremental exploration. These two things cannot be mixed.

So the design principles of v2.0 are written quite rigidly: do not touch the core architecture if you can avoid it; new features go on the edges first; old features are not rewritten unless there is clear "online evidence" that they are broken; any "design that looks better" is first validated on the mock path, not put straight onto the real path. These rules are not to limit creativity — they are to keep "v1.0 already runs this well" from getting swept away by a new round of excitement.

That said, v2.0 is not idle. The skeleton has landed and is in the finishing stage. GA is still waiting on audit — yes, audit, not code. Because this layer touches the capability boundaries of OpenClaw's internal agents, there has to be one external review that has looked at it and confirmed the risk-gates have not been bypassed by new features before it can open formally. I think this wait is worth it. A local system, once shipped, is hard to "roll back wholesale," so better to wait a bit longer before GA.

The real trade-off: simple call vs long-term evolution

Looking back at the whole thing, the biggest decision was actually at the very beginning: whether to turn a simple direct call into something a little more complicated.

Writing two extra bridge layers at the start has its costs: twice the code, an extra deployment, an extra monitoring surface, and extra cognitive load. If all you want is "a voice assistant that runs," three hundred lines of direct call are enough, and the time saved can go elsewhere.

But if what you want is "a voice workbench I can use for half a year, a year, two years," these two extra layers are a completely different story. They cleanly peel "how to recognize" and "who to dispatch to" off the front end, so the recognition engine can be swapped, dispatch rules can evolve, agents can be added or removed, the front-end UI can be redone — and none of these things drag the others along.

This is a very typical "short-term complexity in exchange for long-term simplicity." In the short term, the two extra bridge layers are a burden. In the long term, they turn this system from "a glob of glue that rots easily" into "four small services that can each evolve."

My own experience: in local AI systems, anywhere "user entry + model call + agent execution" exist together, this kind of decoupling is worth doing. Not because it is worth it from day one, but because it avoids one of the worst kinds of rot — the kind where you know a layer should be swapped, but it is coupled too deeply for you to dare, so you put up with it in an ever-worsening state.

The biggest difference between a local AI system and a cloud service is this: you have no team to back you up, no SLA to catch you, no semi-annual big-refactor window. You only have yourself. So stability does not come from "I will go fix it"; it comes from "it does not break easily in the first place." This layered decoupling is the design that keeps it from breaking easily.

I have condensed this judgment into a few principles to leave here for anyone who comes after to build something similar:

A voice entry point should be split into four layers — "front end / recognition / dispatch / execution" — each an independent service on an independent port;
Every layer carries its own mock fallback, so when downstream is down the user end does not go "blank screen";
Mocks must be identifiable — never let placeholder responses pass as real answers;
Every layer carries its own risk-gate — do not assume upstream or downstream is trustworthy;
Once the core architecture is validated, lock it; all new features go in as increments first;
Audit logs are written per-layer; retrospectives sliced by layer are far faster than ones tangled across the whole chain.

None of these principles "sound profound." They are all the "obvious in hindsight" kind. But obvious in hindsight tends to require writing a bad version yourself, being tormented by it for a while, before you actually accept it.

Where things stand today: v2.0 GA is still waiting on audit, mocks are still holding the fallback, and real Claw integration is not yet complete.

Jiyanran is not a "finished voice assistant" — she is more like a workbench that is "running, changing, growing." v1.0 is already stable enough — stable enough that I rely on it daily, stable enough that I can comfortably add new things to it. But it is far from terminal. The boundaries of the voice entry point will keep blurring (multimodal, multi-device, long conversations), downstream agent capabilities will keep growing, the granularity of risk-gates will keep getting finer.

But one thing I am now more certain of than at the start: no matter how the front-end UI changes, what the recognition engine is swapped for, or how many agents live inside OpenClaw, this skeleton of four layers, four ports, four mocks, four risk-gates is not going to move. It is not the endpoint — it is the foundation that lets this workbench keep walking. Foundations are not pretty, but foundations must be stable.

If I have to give this piece a short ending — local AI systems are not short of people who can write code; they are short of people willing to write two extra bridge layers at the start. In the short term that is a burden; in the long term that is longevity. The voice entry point looks simple, but it is the layer people most easily misjudge the difficulty of; precisely for that reason, it is also the one most worth getting the skeleton right on the first try.

v1.0 is stable, v2.0 is on the way. The next piece may be about what I see after real Claw integration is done — but that is the next piece's story.

Before Reinstalling My Mac mini, I Ran a System-Wide Asset Audit

Wed, 20 May 2026 00:00:00 GMT

Mac mini reinstall notes

What really made me hesitate before reinstalling this Mac mini wasn't the fear of not being able to put it back together — it was the fear of accidentally killing off something that was still quietly doing work for me, alongside the pile of stuff that genuinely needed to go.

It's a workstation I'd used for more than a year. The day I bought it I assumed it was just an ordinary little machine. A year in, it had carried several waves of AI-tool experiments, a few half-finished projects, a few model weights stuffed in on deadline, and an uncountable number of launchctl registrations installed under the banner of "let's just get it running first." It had never crashed, and it had never really been clean either.

System disk usage was slowly creeping toward 85%. Spotlight would occasionally freak out. On boot, the Dock would surface icons I no longer remembered installing. The thing that genuinely unsettled me was that every time I opened some old project directory, there were a few .env files sitting quietly inside, each with a real, working token — and I'd already forgotten which experiment they'd been set up for.

My instinctive reaction was: "Just reinstall. Clean install. Start from scratch."

Then I stopped, because I realized something: I had no idea what was actually on this machine anymore.

That feeling of "not knowing" is a subtle one. It isn't full amnesia — I still remember the broad strokes: which projects run where, which tools I use daily, what each icon on the desktop is. But the moment you push one level deeper — "Where does this project's temp cache live?" "Which service does that token belong to?" "Did you ever clean up the intermediate output from last month's experiment?" — I start mumbling. That state of "clear on the big picture, fuzzy on every detail" is fine for daily use, but it's fatal for a reinstall. A reinstall doesn't ask you about the big picture. It asks you about details, one by one. Every detail you mumbled past is a potential accident.

I came to see that the urge to reinstall was itself a signal. It didn't mean the machine was broken; it meant I no longer trusted my own picture of the machine. The real psychological appeal of reinstalling is that it lets a single one-shot action rescue me from the embarrassment of "I don't know what's on this machine." The cost is that it buries, along with everything else, the things I "know but have forgotten where." That's a bad trade. So the right move isn't to reinstall first — it's to rebuild the picture first.

Cleanup isn't the real problem — inventory is

"Reinstalling a workstation" sounds like a pure execution task. Back up, wipe, install the OS, install the tools, restore the data. But the biggest difference between a mid-career workstation and the blank machine you had in college is this: it's loaded with things you assumed you no longer depended on but actually still do.

For example: a small tool I hadn't touched in six months turns out to be an implicit dependency in a script on another workflow. A directory I'd written off as "failed experiment" is in fact still being woken up by a LaunchAgent on a schedule, quietly pushing data to the cloud every day. A model weights directory I'd been resenting for hogging space turns out to be the actual load path for the inference backend I've been using lately — I just never noticed where it lived when I started the service.

What these have in common is that they live in scattered memory. Not in any document, not at the front of my mind, not anywhere visible on the desktop. They work quietly — so quietly that I forgot they existed.

When the reinstall knife comes down, the quiet workers die first. By the time I notice some workflow has broken, a week or two has usually passed. By then I can't recall which service, which script, which directory was holding things up. I'll restore from scratch, and along the way install another pile of "let's just get it running" things. The machine is clean for three days, then it's back to where it was.

So the question isn't actually "how do I get this machine clean," it's "before I lift a finger, can I have a clear list that tells me what's on this machine, what's still in use, and what can be thrown out." That's why I decided to run an asset audit first, instead of jumping straight to cleanup.

Auditing and cleaning are two different things. Auditing is taking inventory of what you own. Cleaning is making decisions based on that inventory. Mixing them is the easiest way to cause an accident.

There are two completely different motivations for a reinstall. One is "my current environment can't meet new requirements anymore, I need a new architecture" — that's a constructive motivation; the reinstall is the means, the goal is out in front. The other is "my current environment is too messy, I want to tear it down for a fresh feeling" — that's an avoidant motivation; the reinstall is a ritual, and the goal is behind it or absent. I'll honestly admit this time was the second kind. Admitting it matters, because avoidant reinstalls are the ones most prone to accidents — you don't have a clear picture of "what the machine should look like afterward," so anything you encounter that goes "hmm, this might still be useful" has no judgment criterion behind it. An audit is the only way to bend an avoidant motivation back into a constructive one: after auditing, you can at least describe concretely what the new machine should look like.

The six asset categories I audited

I didn't have an off-the-shelf methodology to copy. Most "Mac cleanup guides" out there are about freeing space, clearing cache, pruning old Time Machine backups. Fine for an ordinary Mac, not enough for a workstation stuffed with AI tools and experimental projects. I worked out my own approach, in six categories.

Background daemons: every LaunchAgent, cron job, and launchctl-registered service. This category is the most easily forgotten and the most dangerous — they run in the background and don't show up in the prominent parts of the Dock or Activity Monitor.
Local engineering directories: every project directory, IDE workspace, Docker volume, and ad-hoc experiment directory. This category answers "what have I actually been doing on this machine."
Files containing secrets: every .env, .aws, .ssh, and any config file with a token field. This category isn't just about disk space; it's a security issue — during a reinstall it's all too easy to sync them into a cloud drive or accidentally drag them into a fresh git repo.
Bulky assets: model weights, datasets, caches, download archives. This category decides "what needs to be re-downloaded after the reinstall, and what can be thrown away." A 70B model takes hours to download once; tossing it just to re-download is digging your own grave.
Install lineage: which tools were installed when, why, and whether they're still needed. The list of Homebrew packages is long, and each one had a reason at the time — but half of them may no longer be in use.
History ledger: previous audit reports, backup archives, old project snapshots. This category is "memory of audits and backups I've done in the past" — it decides whether I can fall back to a known-clean state when things go wrong.

These six categories weren't conjured up. They emerged as I worked through the machine and kept noticing "this thing doesn't fit anywhere I've named yet." The first three — background daemons, engineering directories, secret files — are non-negotiable for any machine that has run a few months of AI experiments. The latter three — bulky assets, install lineage, history ledger — are optional, but if you're planning a reinstall I strongly recommend running all of them.

Taken together, the six categories answer a question that sounds philosophical but is actually very operational: who is this machine, as my workstation? The first three define what it's doing — the processes running in the background, the workplaces it occupies, the credentials that connect it to the outside world. The latter three define what it has lived through — what genuinely heavy assets it stores, what tools it has been installed with, what traces of self-observation it has left behind. A machine is like a person: knowing what it's doing isn't hard, but knowing what it has lived through, what habits it has accumulated, what technical debts it owes — that's what real understanding looks like. The audit was the first time the concept of "my workstation" became concrete instead of abstract.

The audit doesn't need to be fancy. I used the plainest method: one markdown file per category, listing entries, writing down state, marking judgment. My local audit directory now has six files plus a summary. It looks completely unglamorous, but it's the only thing about this reinstall that lets me sleep at night.
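The "one markdown file per category" habit is easy to script if you would rather not hand-format the files. This is a minimal sketch, assuming each entry is recorded as a name, a state, and a judgment; the function name and the entry keys are my own choices for illustration.

```python
from pathlib import Path


def write_category_report(audit_dir, category: str, entries: list) -> Path:
    """Write one markdown file for a category: one bullet per entry."""
    out = Path(audit_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    lines = [f"# {category}", ""]
    for entry in entries:
        # Each entry is a dict with the (assumed) keys name, state, judgment.
        lines.append(
            f"- {entry['name']}: state={entry['state']}; judgment={entry['judgment']}"
        )
    path = out / (category.lower().replace(" ", "-") + ".md")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Calling it with the category "Background daemons" produces `background-daemons.md` in the audit directory, which keeps the six files consistently named without any tooling beyond the standard library.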

On a related note, the order in which you audit the six categories matters too. I'd recommend doing background daemons first — they're the easiest to miss and the easiest to get tripped by while you're cleaning something else. Then files containing secrets — because that category directly determines whether you can even enter the cleanup phase. After that, local engineering directories and bulky assets, which are the heavy lifting but lower risk. Install lineage and history ledger come last; they're more of a wrap-up. I started out auditing in alphabetical order and ended up stepping on a big mine in the secrets section, which forced the whole cleanup to be postponed — I'll talk about that next.

The four numbers it produced

The real value of the audit is turning a "vague feeling" into "concrete numbers."

After I finished, the AI stack on this machine compressed down to four numbers: 6 LaunchAgents, 4 .openclaw* directories, 26 .env files containing tokens, and a primary working directory of roughly 14 GB.

Once the numbers were on the table, the whole mindset shifted.

Before the audit, my description of this machine was something like "messy," "stuffed with things," "a bit out of control." That kind of description sounds like complaining, but it can't guide any action: you can't act on "messy." After the audit, I could at least say something concrete about each number: of the 6 LaunchAgents, 4 are still in use and 2 can come off; of the 4 .openclaw* directories, 1 is the canonical main directory and the other 3 are historical snapshots; the 26 .env files must be migrated to a unified secret management location; most of the 14 GB working directory can stay, with a small portion being temporary experiment output.
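Counting .env files that hold tokens is the kind of step worth scripting, because doing it by eye means opening files full of live secrets. The sketch below is one possible approach, not the method actually used for this audit: the key-name pattern and the skipped directory names are assumptions, and it records only key names, never values.

```python
import os
import re
from pathlib import Path

# Assumed heuristics: which key names look secret, which directories to skip.
SECRET_KEY = re.compile(r"(TOKEN|SECRET|API_?KEY|PASSWORD)", re.IGNORECASE)
SKIP_DIRS = {".git", "node_modules", ".venv", "__pycache__"}


def find_env_files_with_secrets(root) -> list:
    """Return [(path, [key names])] for .env files holding non-empty secret-like keys."""
    found = []
    for dirpath, dirnames, filenames in os.walk(Path(root).expanduser()):
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]  # prune in place
        for name in filenames:
            if not (name == ".env" or name.startswith(".env.")):
                continue
            path = Path(dirpath) / name
            try:
                lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            keys = []
            for line in lines:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                key = key.strip().removeprefix("export ").strip()
                if SECRET_KEY.search(key) and value.strip().strip("'\""):
                    keys.append(key)  # record the key name only, never the value
            if keys:
                found.append((str(path), keys))
    return sorted(found)
```

Running it over a projects directory gives a list whose length is the number to write down, and whose key names are usually enough to recall which service each token belongs to.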

That's what an audit is doing: translating "messy" into items you can make judgments about. A problem translated into items has a chance of being solved. A problem stuck at the "messy" level can only continue to be endured.

That said, numbers alone aren't enough — every number needs a concrete description behind it. "6 LaunchAgents" as just a count, without writing down what each one does, which project installed it, and when it last ran successfully, is no different from "messy" — it just swaps fuzzy mess for precise mess. The audit actually starts working not at the moment I count 6, but at the moment I can say of each LaunchAgent: "this is what it's called, who it serves, whether it's alive or dead, what breaks if I delete it." Numbers are the skeleton; descriptions are the flesh. A list with only a skeleton isn't enough to make decisions on; a list with skeleton plus flesh is.
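
To make "skeleton plus flesh" concrete, here is a minimal Python sketch of one audit entry and the markdown it might render to. The field names, the "[needs description]" marker, and any labels passed in are illustrative assumptions, not the format of the actual audit files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditEntry:
    """One line in a category file. All field names are hypothetical."""
    name: str                      # what it's called
    serves: str = ""               # who or which project it serves
    alive: Optional[bool] = None   # None means "not checked yet"
    breaks_if_deleted: str = ""    # what stops working if it is removed

    def is_described(self) -> bool:
        # A bare name is only the skeleton; all three answers are the flesh.
        return bool(self.serves) and self.alive is not None and bool(self.breaks_if_deleted)


def needs_description(entries: list[AuditEntry]) -> list[AuditEntry]:
    """Entries that have been counted but not yet understood."""
    return [e for e in entries if not e.is_described()]


def render_markdown(category: str, entries: list[AuditEntry]) -> str:
    """Render one category as a plain markdown list."""
    states = {True: "alive", False: "dead", None: "unknown"}
    lines = [f"# {category} ({len(entries)} entries)", ""]
    for e in entries:
        flag = "" if e.is_described() else " [needs description]"
        lines.append(f"- {e.name}: {states[e.alive]}{flag}")
        if e.serves:
            lines.append(f"  - serves: {e.serves}")
        if e.breaks_if_deleted:
            lines.append(f"  - breaks if deleted: {e.breaks_if_deleted}")
    return "\n".join(lines)
```

The point of `needs_description` is that the count alone never marks a category as finished; a category is done only when that list comes back empty.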

I used to think this kind of inventory-taking was programmer fastidiousness. I came to see it's actually a standard mid-career move before doing anything big — you measure the room before renovating, you take inventory before moving house, of course you do the same before reinstalling a Mac. The only difference is nobody scolds you for skipping it, so most people do. The cost of skipping is that every reinstall becomes a small gamble — you're betting you haven't forgotten anything critical.

Another finding that surprised me a little: small numbers don't mean small problems. 6 LaunchAgents doesn't look like much, but every one of them represents a piece of logic I've already forgotten — when it was registered, for which project, what happens if it fails, what downstream things will break if I delete it. Genuinely understanding those 6 took me longer than listing all 26 .env files. What the audit brings isn't satisfaction; it's the kind of sobriety where "you thought it was just a little, turns out it's an iceberg." That sobriety matters far more than the relief of "feeling better after deleting things" — it stops you from acting before you can see what's under the water.

Why cleanup had to wait

Audit done, I was ready to move on to the next step: work through the six lists, one by one. Delete what should be deleted, archive what should be archived, migrate what should be migrated. The plan was to finish in an evening and start reinstalling the next day.

Instead, I got completely stuck on the second category — files containing secrets.

26 .env files with tokens sounds like a small number. But the moment I actually looked at what was inside them, I broke into a cold sweat. Some held API keys for an inference service, some held cloud-drive access credentials, some held temporary tokens issued early on just to verify a flow, except that "temporary" had by then lasted half a year. They were scattered across 4 different engineering directories, each nested several levels deep. For a few of the files, even I couldn't say which experiment they'd been left over from.

That meant two things.

One: these secrets had to be migrated out and consolidated before the reinstall. Not a simple backup — a simple backup just leaves them scattered in the corresponding spots on the new machine. Also not a straight delete, because a few of them I'm still using. The safest path is to move them to a single, encrypted secret management location, then refactor every caller to reference that location instead of letting plaintext tokens keep sitting in each project directory.
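
Before anything can be consolidated, it helps to know which key names live in which file, without ever printing a value. The following is a rough Python sketch of that inventory step, assuming simple KEY=VALUE files; the function names are made up for illustration and it makes no attempt to handle every .env dialect.

```python
from pathlib import Path


def key_names(env_text: str) -> list[str]:
    """Return the variable names in a .env body, never the values."""
    names = []
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not an assignment
        if line.startswith("export "):
            line = line[len("export "):]
        names.append(line.split("=", 1)[0].strip())
    return names


def inventory_env_files(root: Path) -> dict[str, list[str]]:
    """Map each .env-style file under root to the key names it defines."""
    result = {}
    for path in sorted(root.rglob(".env*")):
        if path.is_file():  # a directory named .env (e.g. a virtualenv) is ignored
            rel = path.relative_to(root).as_posix()
            result[rel] = key_names(path.read_text(errors="replace"))
    return result
```

Keeping values out of the output matters here: the inventory itself ends up in an audit file, and an audit file that repeats the tokens would just be one more plaintext copy.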

Two: that migration is itself an independent, careful piece of work. It isn't an extension of the audit; it's another project. It involves changing how each project references its secrets, verifying services still run after migration, and securely destroying the old files. Done fast, it's a week. Done unhurriedly, one to two months.

So cleanup had to wait. I didn't touch the 14G working directory that night, and I didn't pull the 6 LaunchAgents either. The reasoning is simple: until secrets are migrated cleanly, any "cleanup action" could accidentally sync a file containing tokens into cloud backup, accidentally commit it into a newly created repo, or let a still-running LaunchAgent pick it up and carry it off somewhere. The temptation to reinstall was strong, but the safe window hadn't opened.

The feeling of "I've audited everything but still can't act" was uncomfortable at first. But on reflection, this is exactly what the audit is for. Without it, I'd probably have done a clean install and then dragged "project directories" back wholesale from the cloud — bringing back those 26 .env files with them, all still in plaintext, all still scattered across 4 directories. The reinstall would have solved nothing; if anything, the problem would have been legitimized — harder to tackle, because they'd "survived alongside the new system."

Following the secrets thread further, I noticed something more sobering: the truly "expensive" thing on a workstation has never been the hardware, nor the model weights you have to re-download after reinstall — it's these scattered credentials. They correspond to real trust relationships with external services. The moment each token was issued, behind it was a terms-of-service I'd agreed to, a payment method I'd bound, fees that might be charged, access logs that might be generated. They aren't "config items" — they are my extended accounts. Treating them as .env files scattered everywhere is like sticking the keys to your accounts on Post-its all over the building. Once that's clear, the "sweep up" act of reinstalling looks pretty secondary — what really needs catching up on is a governance framework for secrets, and that framework won't appear on its own just because the system gets reinstalled. I have to sit down and design it, item by item.

What the audit taught me

This audit didn't solve any specific cleanup problem, but it changed how I see this machine.

First: the real threshold for a reinstall isn't fear of not being able to rebuild. Installing the OS has a manual; the toolchain can be reconstructed; model weights can be re-downloaded. None of that is the real bar. The real bar is fear that some "still-in-use service" gets killed by your own hand without you realizing. That risk has no manual — only an audit can pull it into the open.

Second: scattered memory is the chronic disease of a workstation. A machine used for a year naturally accumulates a pile of "I think I remember" items — why a tool was installed, which project a directory was copied from, which week a LaunchAgent was configured. Those memories are sharp when they're laid down and completely fuzzy half a year later. The audit doesn't cure the chronic disease; it periodically settles the books, so the fuzzy becomes a clear list again.

Third: secret hygiene is the real workstation risk, not disk space. I used to care about "how many GB are left." After the audit I realized space is just the surface problem; what will actually cause an incident is the 26 .env files holding plaintext tokens, scattered across 4 directories. If a reinstall doesn't fix that, I've only moved a security vulnerability intact from the old machine to the new one. If space runs out, worst case I plug in an external drive. If a token leaks, I'm revoking credentials, checking access logs, and explaining myself to the service provider.

Fourth: auditing and cleaning must be separated. This is my biggest takeaway from this round. Audit is observation; cleanup is action. Mixing them produces two bad outcomes: either "audit-and-delete-as-you-go," and you delete something still in use; or "audit-but-don't-dare-delete," because during the audit you hadn't thought through your judgment criteria. Doing them separately (finish the audit first, get the full list, then sit down and decide item by item) is actually faster, and the judgment is steadier.

Fifth: the audit itself is an asset. The 6 markdown files I produced this round will go into my local audit directory as an archive. The next time I reinstall, or buy a new workstation, or someone asks me "what's actually running on that machine," this audit is the most direct answer. It doesn't just solve a one-off problem; it can be reused in the future.

Sixth — and the most counterintuitive — inventory before reinstall, don't be led by the urge to "free up space." 85% disk usage makes you very tempted to delete something right now. But that "ahh, feels better after deleting" rush is the most dangerous emotion in this whole exercise. What actually needs solving was never the usage percentage; it's what is running on this machine as my workstation, what's still needed, and what must be governed. Usage is the symptom. The audit is the palpation.

Seventh is about rhythm. The audit took me an entire afternoon plus most of an evening. That was long enough that several times I wanted to cut corners ("Maybe a quick look is enough?" "There are only a handful of LaunchAgents anyway, can I just judge from memory?"), and every time I forced myself to keep going. Looking back, the hard part of the audit isn't technical; it's psychological. It's too plain: no immediate feedback, deleting a file shows no progress bar, listing an item doesn't make the machine faster. Its entire reward is the after-the-fact reassurance of "I now know what I'm touching." That reward is delayed, invisible, and only appreciated when something goes wrong. For someone used to "getting things done and seeing the result," pushing through the mid-audit fatigue of "I can't see any output" requires actually believing that "seeing clearly first is faster than acting first." I pushed through this time. I'll push through next time.

Eighth is the most surprising byproduct: the audit itself can become a habit. It doesn't have to wait for a reinstall. My current thinking is to do a lightweight audit once per quarter — about every three months — not for cleanup, just for a reconciliation. A few more LaunchAgents than last time? A few more .env files? Any experiments in engineering directories I no longer remember? Answering these on a schedule prevents the next "I want to smash it and start over" buildup. It's the workstation equivalent of a mid-career quarterly checkup — not to treat illness, but to keep small problems from compounding into big ones. This Mac mini was the first time I realized a workstation needs a checkup too, not just attention when something has gone wrong.
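
A quarterly reconciliation of this kind is essentially a diff between two snapshots. As a minimal Python sketch, assuming each snapshot maps a category name to the set of entry names found in it (the snapshot shape and the function name are assumptions, not an existing tool):

```python
def reconcile(
    previous: dict[str, set[str]],
    current: dict[str, set[str]],
) -> dict[str, dict[str, list[str]]]:
    """Report what appeared or disappeared in each category since the last audit."""
    report = {}
    for category in sorted(previous.keys() | current.keys()):
        before = previous.get(category, set())
        after = current.get(category, set())
        added = sorted(after - before)
        removed = sorted(before - after)
        if added or removed:  # unchanged categories stay out of the report
            report[category] = {"added": added, "removed": removed}
    return report
```

An empty report means nothing moved since last quarter; anything in "added" is a candidate for the "do I still remember what this is" question while the memory is still fresh.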

So the current state is: audit done, cleanup not yet started, secret migration is the real next step. The reinstall date has been pushed back indefinitely.

I'd originally thought this would be a note on "how to efficiently reinstall a Mac mini." Writing this far, I realize it's closer to a note on "why not to rush into a reinstall." The hygiene issues on this machine will drag on for a few more months, because secret migration is messier than it looks. But I now have a list: I know where each item is stuck and what its next step is. That's a lot better than where I was a week ago, "messy enough to want to smash it and start over." The audit didn't make the machine cleaner, but it put me back in a position to make decisions about it. That, in itself, matters far more than freeing up those few dozen GB.

Mission Control and Studio: When Control Planes Start to Overlap

Wed, 20 May 2026 00:00:00 GMT

OpenClaw architecture governance notes

The local Studio can manage agents. The Mission Control web app can also manage agents. The Gateway I plan to install later is going to route external requests to agents — all three can manage them, and all three have good reasons to. The problem isn't that any one of them is bad. The problem is that when you put the three together, it stops being clear who actually decides.

This has been sitting on my todo list for a while. It's not that I made a wrong decision and now have to roll back — every step looked right at the time. The local Studio RC1 should have its own boundary when it goes live. Mission Control Web v2.0.1 (Alpha, early-stage development) is an agent orchestration dashboard, so of course it should have its own dashboard. The future Gateway, as the external entry point, should have its own routing rules. But when those three individually-correct things land in the same working system at the same time, the overlap shows up.

This month I didn't rush to close it down. I started by writing out "who should answer which question" — and that step matters more than adding any new feature. What follows is how I split responsibilities now, a few concrete conflict points, and why I decided to write specs before touching code.

Each one makes sense on its own

Let me walk through where each of the three came from. This isn't a history lesson — it's just that without knowing why each one exists, you can't judge who should yield when they conflict.

The local Studio RC1 is a local working system. Its original reason for existing was plain enough: I needed something that could run morning checks, move tasks through, and produce content without depending on any external API. Every agent's work contract (where it can read, where it can write, what it can't touch) is signed locally. Every task's audit goes into the local audit directory. Every piece of content's evidence stays local too. Studio's design stance has always been "local truth" — I don't trust any state that can't be reproduced locally. This isn't a matter of taste, it's something reality forced on me: external APIs change without notice, remote services drop, third-party records get rewritten, and a local audit is the only thing that can recover what actually happened after the fact.

Another implication of the Studio design is that agent contracts have to be readable locally. Which directories an agent can read, which it can write, which it absolutely cannot touch — those were defined as files in RC1, not as a row in a database, not as an entry in a remote config service. That looks rigid, but it buys one property: every tool call that wants to be authorized has to come back to this contract file and check it once. In other words, Studio's agent control plane sits on top of the local filesystem, and away from local it can't decide anything. That's its strength, and that's its boundary.
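
The "come back to the contract file and check it once" step can be pictured with a short Python sketch. It assumes, purely for illustration, that a contract is a JSON file with "read", "write", and "forbidden" lists of absolute directory paths; the real RC1 contract format is not described here, and `parse_contract` and `is_allowed` are hypothetical names.

```python
import json
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Contract:
    read: tuple[str, ...]
    write: tuple[str, ...]
    forbidden: tuple[str, ...]


def parse_contract(text: str) -> Contract:
    """Parse the (assumed) JSON contract format into an immutable object."""
    raw = json.loads(text)
    return Contract(
        read=tuple(raw.get("read", [])),
        write=tuple(raw.get("write", [])),
        forbidden=tuple(raw.get("forbidden", [])),
    )


def _under(path: str, roots: tuple[str, ...]) -> bool:
    # Assumes path is already absolute and normalized (no ".." segments).
    p = PurePosixPath(path)
    return any(p == PurePosixPath(r) or PurePosixPath(r) in p.parents for r in roots)


def is_allowed(contract: Contract, action: str, path: str) -> bool:
    """Forbidden always wins; otherwise the path must sit under a matching root."""
    if _under(path, contract.forbidden):
        return False
    if action == "read":
        return _under(path, contract.read)
    if action == "write":
        return _under(path, contract.write)
    return False  # unknown actions are denied by default
```

Because the check is a pure function of a file's contents, it can be re-run later against the audit to explain why a given tool call was or was not authorized.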

Mission Control Web is a different animal. It's the Alpha v2.0.1, sitting there as an agent orchestration dashboard. The problem it tries to solve is this — when there's more than one agent, more than one task, more than one external model provider, local logs alone stop being enough to keep track. MC Web is going for a fleet view: many agents running at once, scheduling visualized, cost tracking, a security audit dashboard. It has a multi-framework adapter slot in the gateway layer, state goes into SQLite, you start it with pnpm start, and the whole thing tilts toward a front-end engineer's perspective. Its design stance is "global observation" — I need to be able to see at a glance what every agent is doing right now, how much money they're burning, and whether anyone has crossed a line.

There's another goal in MC Web's design that's easy to overlook — it has to manage agents across frameworks. Local agents are one implementation. Running a research flow on DeerFlow in the future is another. If we eventually plug in some other orchestration framework, that's yet another. MC Web shouldn't be locked to any one implementation, which is why it leaves a multi-framework slot in the gateway layer. But the price is that for any specific agent's semantics, it can only do the "lowest common denominator" — it can see that an agent is running, it can see how many tokens it's burning, but for finer internal state you still have to go back to the agent's own implementation. That limit is just an inherent property of this kind of tool, not a defect of v2.0.1.

The Gateway isn't installed yet. Right now it only exists as the slot in MC Web v2.0.1 marked "Gateway Optional" — meaning the team already knows there will be one eventually, but hasn't decided when. The Gateway's design intent is clear: every external request goes through it for unified routing, unified auth, and unified rate limiting, so external traffic doesn't hit local or the dashboard directly. Its design stance is "boundary gatekeeper" — anything from the outside passes through me first.

Another reason the Gateway exists is layered defense — neither the local Studio nor the dashboard should be facing the public internet directly. The local system should focus on what local can do, the dashboard should focus on presentation, and external requests should go through a dedicated layer doing the dirtiest, most thankless work: rejecting abnormal traffic, rate limiting, doing the outermost auth. If that work isn't handed to a dedicated layer, sooner or later it bleeds into Studio or MC Web — and they end up being forced to carry a pile of defensive code that shouldn't be theirs. Whether to install the Gateway is a choice, but as long as it's not installed, the ports Studio and MC Web expose are by default carrying the "external entry point" responsibility — and that's an implicit trap.

The three stances on their own — local truth, global observation, boundary gatekeeper — each hold up. The trouble is that on the object called "agent," these three intersect. An agent has to have a local contract, has to be visible on the dashboard, and in the future has to accept requests from outside. That's how three control planes start to overlap.

What they start fighting over

The overlap isn't abstract. It has a concrete shape. In the past two weeks I've run into it in at least four places.

The first conflict: who actually schedules a local task. Studio has its own task scheduling — during the morning check it triggers Suwan to put together a piece of content, triggers Huorui to run a security sweep, triggers Shen Zhixing to pull information. These are local pipelines, and Studio decides. But MC Web's dashboard also has a "task scheduling" panel, and in theory you can send "Suwan, run a piece right now" from there too. Two entry points scheduling one agent means two call paths, which means two sets of scheduling state. If the two get out of sync — one says it's running, the other says it isn't — who should the agent listen to? I never ran into this before, because MC Web wasn't really plugged in. The moment it is, this shows up.

The nastiest thing about this conflict is that it's invisible while "neither side is really being used." Local Studio runs stable, MC Web sits there as a dashboard with nobody triggering tasks from it, and everything looks harmonious. Then one day I take the shortcut and fire a task from MC Web. Local Studio's scheduling state has no record of it, so a few hours later the local periodic task fires the same thing again: the same job runs twice, there are two audit records, and both sides think they're right. That kind of duplicate firing isn't a technical bug, it's a side effect of the control plane not making it clear who the entry point is.

The second conflict: where an agent's runtime state, cost, and audit should be written. Studio has its own local audit — every file an agent writes, every task transition, every tool call gets recorded into the local audit directory. MC Web has its dashboard — it wants to show "in the last 24 hours this agent ran how many tasks, burned how many tokens, did it cross any lines." If those two records are written separately, you have two truths. If one is a mirror, you have to first pick which one is the source. My instinct is that Studio's local audit is the source — but instinct isn't spec, and until it's written down, it isn't a split.

Audit is especially touchy in a control plane. The canonical source isn't "designated," it's "actually written" — which side gets written to first, which side is synced over later, which one wins when they disagree, all of that has to be spelled out in advance. The hardest bugs I've ever had to chase almost all came from having more than one audit and no agreed source. So I'm being extra careful with this one — the canonical source for an agent has to be fixed up front, not negotiated after something breaks.

The third conflict: once the Gateway is in, who should an external request hit first. One way is Gateway → Studio → MC Web (local first, dashboard is the observation side). Another is Gateway → MC Web → Studio (orchestration first, local is the execution side). Both can be made to work, but the paths are completely different. The first means MC Web only ever sees secondhand information that Studio has already processed. The second means every request Studio receives has already been filtered by MC Web's policy. If this isn't decided now, and we wait until the Gateway is actually being installed to decide, it becomes install-and-modify-and-fix-in-flight, and the cost gets steep.

The fourth conflict is more subtle: which layer owns cost tracking. Studio can record per-tool-call cost — it has an audit, adding a column would do it. MC Web can record it too, obviously — it was designed for orchestration and observation, cost is a natural field for it. Neither side is doing it properly right now, so there's no conflict. The moment both sides start doing it for real, you've got two sets of cost data, and another "which one is authoritative" problem. What makes it worse is that cost eventually rolls up into "how much did we burn this month" — and if both records exist, the rollup will inevitably double-count or miss, coming out either inflated or deflated, with no middle option.
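
The rollup arithmetic is easy to see with made-up numbers. In the Python sketch below, both ledgers are hypothetical, costs are in integer cents, and the shared call IDs are an assumption (nothing above says the two systems would agree on an identifier).

```python
# Hypothetical per-call costs in cents, as each side might have recorded them.
STUDIO = {"call-1": 12, "call-2": 30}   # Studio saw calls 1 and 2
MC_WEB = {"call-2": 30, "call-3": 5}    # MC Web saw calls 2 and 3


def naive_rollup(studio: dict[str, int], mc_web: dict[str, int]) -> int:
    """Add both ledgers together: anything both sides saw is counted twice."""
    return sum(studio.values()) + sum(mc_web.values())


def merged_rollup(studio: dict[str, int], mc_web: dict[str, int]) -> int:
    """Deduplicate by call ID; only possible if both sides share an ID scheme."""
    merged = {**studio, **mc_web}
    return sum(merged.values())


print(naive_rollup(STUDIO, MC_WEB))    # 77: inflated, call-2 counted twice
print(sum(STUDIO.values()))            # 42: deflated, misses call-3
print(sum(MC_WEB.values()))            # 35: deflated, misses call-1
print(merged_rollup(STUDIO, MC_WEB))   # 47: the actual spend
```

None of the three easy answers (77, 42, 35) matches the real 47, and the deduplicated version only works when a shared call ID exists, which is itself a boundary decision someone has to write down.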

Put these four conflict points together and they're actually the same shape — none of the three systems has written down the boundary for "who owns which face of the agent object." Ownership isn't fixed, the scheduling path isn't fixed, the canonical source isn't fixed, cost attribution isn't fixed. Each one on its own isn't a big deal; together they're a control-plane turf war.

Overlap isn't a bug, it's a side effect of scale

At first I wanted to blame this on "the original design wasn't planned well enough." Later I realized that attribution is wrong. Every step of the planning was right at the time — Studio was just trying to solve whether local could run on its own; MC Web was just trying to solve that multiple agents were getting hard to track; Gateway was just leaving a slot for an external entry point. The three weren't designed at the same point in time, and weren't designed in the same context.

Overlap grows out of scale, not out of bad design. When a component is just born, it's only responsible for its own small patch. After it runs stable, runs long, and grows to a certain size, it naturally starts reaching for adjacent responsibilities. After Studio hit RC1 it started thinking "could I also provide a simple dashboard" — that's it reaching into MC Web's territory. After MC Web hit v2.0.1 it started thinking "could I just schedule local agents directly without going through Studio" — that's it reaching into Studio's territory. None of these components was originally built to compete with the others, but once they live long enough they start fighting over the same patch of responsibility.

I think this is a general phenomenon. Any working system that grows up will, after surviving its early phase, face the control-plane overlap problem — not because anyone made a mistake, but because surviving components naturally expand. Overlap is a side effect of individual survival.

The direction of expansion follows a pattern too. A component's first instinct is "can it run." Once it runs stably, its second instinct is "let me also be able to see what I'm running": it starts growing a simple query endpoint, a crude status panel. Once that panel exists, its responsibility starts intruding into "observation," and the component originally dedicated to observation starts feeling "why does what I see not match what it sees." The third instinct is "let external callers reach me too": a component originally serving only local starts wanting to leave an external entry point. Once that entry point exists, the component originally dedicated to external entry starts feeling "why isn't external traffic going through me." The instinct is fine, and a component wanting to grow stronger is fine, but every act of instinctive expansion smudges the control plane boundary a little more.

Once I noticed this, my view on closing things down changed. I used to think "if it overlaps, just cut one side off quickly." But that kind of cutting usually cuts whichever side is weakest right now — not whichever side shouldn't be responsible long-term. Short-term it looks like closure; long-term it gets pushed back by the misalignment — the responsibility you cut off grows back a few months later in some other form.

So now my way of handling overlap isn't to grab the knife first, it's to fix the stance first — who should own this long-term. The side that isn't ready can keep the responsibility for now, but mark it as "transitional." That word matters — it admits the current state isn't ideal without pretending it is, and without forcing immediate cleanup. It gives the side that should own it time to build up the capability, and gives the side temporarily holding the bag an exit expectation. This is more painful than just cutting — more specs to write, more conversations to have, longer stretches of inconsistency to tolerate — but it avoids the "cut it off and it grows back" loop.

How I split responsibilities now

So this round I didn't reach for the knife. I stopped and wrote the split first — "who should answer which question" went into a spec, not into code changes.

What follows is my current judgment, not the landed state:

Agent owner — Studio. The agent's contract (where it reads / where it writes / what's forbidden) lives locally, and so does its version, capabilities, and stability record. What MC Web sees is the agent metadata Studio exposes, not something MC Web defines on its own.
Canonical source for tasks — Studio's local audit. The raw record of every task execution stays local; what shows up on the MC Web dashboard is a view of the same record, not a separate dataset.
Orchestration and observation — MC Web. The multi-agent fleet view, scheduling visualization, security audit dashboard, cross-agent aggregation — all of it goes to MC Web. It's not the source of audit, it's the view of audit.
Cost tracking — MC Web. Cost data naturally crosses agents, external gateways, and model providers; its viewpoint belongs at a higher layer. Studio stops computing cost on its own.
External entry — Gateway. All external requests come in through it, get auth, rate limiting, and routing done, then it decides whether to hand off to Studio or to MC Web. Studio and MC Web stop exposing external ports.
Scheduling entry — dual entry, but Studio takes precedence. Local periodic tasks are scheduled by Studio itself; tasks fired from the MC Web dashboard ultimately still go through Studio's scheduler, MC Web doesn't call agents directly.

There's really only one sentence in this whole split — local truth for an agent belongs to Studio, global view and external boundary belong outside. All four conflict points can be derived from that one sentence. That's why spending time writing the split first is worth more than reaching straight for the code: one right sentence solves a pile of small problems.
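
The "dual entry, but Studio takes precedence" rule can be sketched as a single submit path that both entry points must call. This is a minimal Python illustration under stated assumptions: the class name, the (agent, job, period) deduplication key, and the "daily-content" job name in the usage note are all hypothetical, and Studio's real scheduler is not shown anywhere in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class StudioScheduler:
    """Single entry point: MC Web and local timers both call submit()."""
    fired: dict[tuple[str, str, str], str] = field(default_factory=dict)
    audit: list[dict] = field(default_factory=list)

    def submit(self, agent: str, job: str, period: str, source: str) -> bool:
        """Start the job unless the same (agent, job, period) already ran."""
        key = (agent, job, period)
        if key in self.fired:
            # Record the attempt so the audit explains why nothing ran.
            self.audit.append({
                "event": "skipped_duplicate", "agent": agent, "job": job,
                "period": period, "source": source, "first_source": self.fired[key],
            })
            return False
        self.fired[key] = source
        self.audit.append({
            "event": "started", "agent": agent, "job": job,
            "period": period, "source": source,
        })
        return True
```

With this shape, firing "Suwan / daily-content" from the dashboard in the morning means the local periodic trigger for the same period is skipped and logged rather than run a second time, and there is exactly one audit trail to read afterwards.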

Writing it down isn't shipping it

That said, written in a spec isn't the same as running. I know this better than anyone: actually shipping this split takes at least half a year.

The first thing to do is the API boundary. Studio currently has no external agent-metadata endpoint — the agent contract is a file, not an API. The only way MC Web can see this info today is by reading files. For MC Web to really "see the agent metadata Studio exposes," Studio first has to define a set of endpoints that expose the agent list, the agent's current state, the agent's current task — as a stable contract. Easy to say in one sentence, but actually doing it takes a whole version to think through — which fields are stable, which can change, how versions evolve, what backward compatibility looks like — all of it has to be worked out separately.
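
One way to think about "which fields are stable, which can change" is a reader that requires the stable set and ignores everything else. The Python sketch below is only an illustration: the field names are guesses loosely based on the version and capabilities mentioned earlier, not the actual draft, and `schema_version` is an assumed convention.

```python
from dataclasses import dataclass

# Hypothetical stable set: consumers may rely on these and nothing else.
STABLE_FIELDS = ("schema_version", "name", "version", "capabilities")


@dataclass(frozen=True)
class AgentMetadata:
    schema_version: int
    name: str
    version: str
    capabilities: tuple[str, ...]


def parse_agent_metadata(payload: dict) -> AgentMetadata:
    """Require every stable field; silently ignore fields this reader doesn't know."""
    missing = [f for f in STABLE_FIELDS if f not in payload]
    if missing:
        raise ValueError(f"missing stable fields: {missing}")
    return AgentMetadata(
        schema_version=int(payload["schema_version"]),
        name=str(payload["name"]),
        version=str(payload["version"]),
        capabilities=tuple(payload["capabilities"]),
    )
```

Ignoring unknown fields is what lets the producer add experimental ones without breaking an older dashboard, while a missing stable field fails loudly instead of rendering a half-empty agent card.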

The API boundary also has to figure out one thing — whether it can be bypassed when Studio isn't running. If MC Web only ever sees agent info through Studio's exposed endpoint, then MC Web is an empty shell when Studio is offline. If MC Web keeps a local cache for offline-Studio cases, then that cache has to be maintained separately, and when it goes stale the dashboard is no longer showing truth. Both options have costs, but either is more stable than "implementing a version without thinking it through." I lean toward the first — better empty than fake — but this one isn't really decided yet, the MC Web side has to weigh in.

The second thing is straightening out the scheduling path. The current state is that Studio has its own scheduler and MC Web also wants to be a scheduling entry, with no agreed-upon entry spec between them. Sorting this out means listing every code path that can "trigger an agent to start running," and then funneling them all into one main path. That kind of sorting is slow work — each path has to be individually verified to still run after the funneling.

The most annoying thing in the scheduling paths is the "side doors" — small paths that aren't part of the main flow but can in fact trigger an agent to run. Like a local script that calls the agent's exec function directly, skipping Studio's scheduler. Like a mock trigger path in a test case that also runs in production code. These side doors don't usually get used, but as long as they exist, the split isn't really clean. The sorting process is basically listing every side door and plugging them one by one — tedious, but you can't skip it.

The third thing is the audit sync mechanism. Studio's local audit is the canonical source, MC Web's dashboard is the mirror — that sentence writes easily, but "mirror" is a state that needs a mechanism to maintain. After local audit is written, how long until it syncs to MC Web; what happens when sync fails; which one wins when MC Web's display and the local audit disagree — all of these have to be defined separately. My current lean is that MC Web only ever reads from Studio's exposed endpoint and never maintains its own write path — but that means the dashboard is empty when Studio is unreachable, which is yet another tradeoff.

Audit sync has another implicit problem I hadn't noticed before — it involves a privacy boundary. There are things in the local audit that shouldn't be pushed up to the dashboard for display — like intermediate artifacts of some tasks, internal state of some agents, fields tied to external accounts. The sync mechanism can't just be "push everything," it needs a filter layer. That filter layer is itself a spec — which fields can be pushed, which can't, whether pushable fields need to be redacted — all of it has to be written out. Honestly this is easier to do from the Gateway side (an external entry was always going to do this kind of filtering), but the Gateway isn't installed, so for now this responsibility sits with the audit sync mechanism.
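
A filter layer of this kind is usually easiest to reason about as an allowlist: fields are dropped unless explicitly marked pushable or redactable. A minimal Python sketch, where every field name and the "[redacted]" placeholder are invented for illustration:

```python
# Hypothetical field policy; the real lists would come from the written spec.
PUSHABLE = frozenset({"task_id", "agent", "status", "started_at"})
REDACTED = frozenset({"external_account"})


def filter_for_dashboard(record: dict) -> dict:
    """Return the view of an audit record that is safe to sync outward."""
    view = {}
    for key, value in record.items():
        if key in PUSHABLE:
            view[key] = value
        elif key in REDACTED:
            view[key] = "[redacted]"  # keep the field visible, hide the value
        # Anything not listed is dropped: new fields stay local by default.
    return view
```

Defaulting to "drop" means a field added to the local audit later never reaches the dashboard by accident; someone has to decide to list it.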

The fourth thing — which is also the prep work before the Gateway actually gets installed — is narrowing what Studio and MC Web each expose externally. Right now they each expose their own ports; once the Gateway is in, those ports have to be tucked behind it. The cost of this isn't technical, it's migration — every existing caller that hits Studio's port directly has to switch paths.

Narrowing the exposed surface isn't a config change you finish in one go. It means every external script, every external integration, every little tool that's been quietly direct-connecting in has to start going through the Gateway — and the Gateway isn't installed. So the most pragmatic way to do this is to first take inventory: which ports are externally reachable today, who the callers are for each port, whether those callers are still in use. Only after inventory can you talk about narrowing. That kind of inventory is "boring but necessary" — it produces no new features, but it's the precondition for the split actually shipping.

This month I only did part of the first thing — sketched out a few API boundary drafts, pinning down the two field sets for agent metadata and agent current state. The other three are still in the queue. I don't want to fool myself into "the split is defined, so it's done" — defining the split is just the first step, shipping is a much longer thing.

Still working on it

The split is written in the spec, but not a single line of code has been changed because of it. MC Web is still running v2.0.1 at its own pace, Studio is still running on RC1's boundary, Gateway is still in "Optional" status. Turning this split into a fact in code takes at least half a year, assuming nothing more urgent jumps the queue.

What actually got finished this month is just a few small things — the field draft for the agent-metadata API, the field draft for the agent-current-state API, and a list of conflict points. The list itself doesn't solve any conflict, but it means every time I run into a new one, I know whether it's already been recorded, and whether the same split can answer it.

The audit sync between MC and Studio hasn't been touched. The scheduling paths haven't been funneled. Whether to install Gateway, when, how — none of it has a timeline. I'm not in a hurry to fix that — experience tells me that pushing "install the external entry point" through before the split actually lands tends to scramble the split itself.

My own attitude on this is: writing the split clearly matters more than adding features; closing the boundary matters more than bumping version numbers. Mission Control v2.0.1 is Alpha, it'll iterate fast; Studio RC1 is a release candidate, but the local pipelines are still being changed; Gateway is barely on the radar — when all three are moving, the only thing that should stay still is the split.

There's no such thing as "perfect architecture," only architecture with "clear responsibilities + boundaries written down." Perfect architecture is something you think up; clear responsibilities are something you change your way into. The former finishes when the diagram does; the latter has to be re-checked every month.

Mission Control, Studio, Gateway — these three control planes will coexist for a long time, and the overlap won't be eliminated in one shot. It'll be re-balanced repeatedly, judged by whether the split is clear. I don't have an end date for this, only a rhythm of pushing one small step per month.

Next time I write about this, it'll probably start from "one API boundary finally landed" or "a closure I thought I'd made grew back." Until then, I'm still working on it.

Page Types: Why a Working Knowledge Base Needs Six Kinds of Pages

Wed, 20 May 2026 00:00:00 GMT

Knowledge base structure notes

Whether a knowledge base is usable rarely depends on how beautifully the content is written. It depends on whether, three seconds after opening any page, you know what that page is for.

In the last note I wrote about how to decide whether a piece of material should be kept or thrown out. Once the triage is done, only the parts that can still support real work survive. But keeping the material is not the same as having a usable knowledge base — you have just turned a pile of messy directories into a pile of respectable messy directories. The real question is the next one: what kind of pages should these materials become.

I have been burned by this many times. In the same wiki, one page is an abstract principle, another is the current status of some project, another is a check I ran last week, and another is really just a list of commands. Each page is fine on its own; mixed together they start fighting each other — the reader has no idea what to expect before clicking, and the writer has no idea which rules to follow on the next update.

So now I make every page answer one question first: what type is it. If I can't answer, I don't write it yet. If the answer is vague, I split it.

From material to page

The previous note was about the material layer — what gets to enter the knowledge base and what stays in the working directory. Once that triage is done, you are left with maybe a few dozen items "worth turning into pages": a judgement that has stabilized, a choice that changed the behavior of the system, the current state of a project still in flight, a check whose evidence has been verified, a compressed history of how something evolved, a command sheet that gets looked up over and over.

Classifying material asks "what is this content"; classifying pages asks "what role does this need to play". They are not the same. An audit report as raw evidence may be excluded, but its conclusion needs to be rewritten and kept as an Audit page. A decision buried in a chat log will not enter the knowledge base, but the decision itself needs to become a Decision page. Between material and page, there is a second classification.

If you skip this second classification, the whole knowledge base collapses into a "document grab bag" — has everything, looks like nothing. That is why I now force myself to use only six page types, and refuse to add a seventh.

The six page types, with concrete examples

The order is not most-important to least-important. It is most-stable to most-volatile. The more stable a page, the closer it sits to knowledge; the more volatile, the closer it sits to a tool.

Concept page: a settled idea or rule

A Concept page is "how I see this thing". No specific project, no specific time, ideally no specific person. A good Concept page should still hold up two years later when a new person picks it up and reads it unchanged.

Examples: the rules for picking a canonical source, the Knowledge Model itself, Page Types (the page behind this very essay). Once written, this kind of page barely moves. If it does move, you owe a Decision page explaining why.

Decision page: a choice that changed the behavior of the system

A Decision page answers only four things: what problem was on the table, what alternatives existed, why this one was picked, and whether it turned out to be right. It is not meeting minutes, not a design doc — it is a memo to a future self who will want to know why the past self made this call.

Examples: why every launch action now starts with a dry-run-first pass, why the main-control write responsibility was migrated to a separate sidecar channel, why a more general-looking solution was abandoned. A Decision page is written once and not edited; at most a "follow-up observations" paragraph is appended to the end.

Project page: a bounded workflow plus current state

The Project page is the only one of the six allowed to move often. Its whole reason for existing is so that any new window, new AI, or new day can open it and immediately know "where this project stands right now". So it must have an owner, a current stage, a next action, and a last-updated timestamp — missing any one of those and it doesn't count as a valid Project page.
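
The four-field rule is easy to check mechanically. A minimal sketch, assuming a page's metadata is available as a dictionary; the key names are my own spelling of the four fields, not an existing schema.

```python
# Hypothetical metadata keys for the four required fields.
REQUIRED_PROJECT_FIELDS = ("owner", "current_stage", "next_action", "last_updated")


def missing_project_fields(page: dict) -> list:
    """Return the required fields that are absent or empty on a Project page."""
    missing = []
    for name in REQUIRED_PROJECT_FIELDS:
        value = page.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


def is_valid_project_page(page: dict) -> bool:
    """A Project page counts as valid only when none of the four fields is missing."""
    return not missing_project_fields(page)
```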

Examples: the current state of some OpenClaw Studio RC, the toggle state of a particular Jiyanran voice console version, the progress of Suwan's aesthetic sample library. The Project page is the most dangerous category in the wiki — it goes stale the fastest, and it is the easiest to mistakenly cite as if it were a Concept page. That is why this category needs to be tied tightly to freshness markers, covered further down.

Audit page: a check with an evidence trail

An Audit page says "at time X, I used method Y to check Z; the conclusion was this; the evidence is there". It differs from a Decision page — a Decision is about what to choose, an Audit is about what was done and what was seen. An Audit page naturally carries a timestamp, an evidence trail, and a reproducible method.

Examples: an end-to-end audit of some Stage pipeline, an inventory of built-in disk assets, a cross-account configuration consistency check. The interesting thing about an Audit page is that even when it expires you can't delete it — it is history, and its value is precisely "this was actually done at the time". But it shouldn't mislead a new reader into thinking "this is still how things are now", which is why its freshness marker defaults to stale almost by design.

Timeline page: a compressed chronicle

A Timeline page is for the kind of question that goes "I want to understand how this thing turned into what it is today". It doesn't capture detail; it captures turning points only — on some date a critical choice was made, in some week the whole architecture switched from A to B, after some version a component was deprecated.

Examples: the evolution timeline of a project across one quarter, the history of an ecosystem moving from single-machine to multi-account, the hardware evolution of a personal AI workstation from last year to this year. A Timeline page isn't updated densely; it gets a new line only when a "system-level change" happens — the same way I handle the CHANGE timeline.

Reference page: a compact lookup of commands, paths, and boundaries

A Reference page is a tool, not knowledge. Its point is not to be read but to be looked up. It should be as short, as flat, and as copy-pasteable as possible — tables over paragraphs, lists over sentences.

Examples: the Gate matrix (which kind of operation goes through which approval channel), the port allocation table, the Feishu routing table for each account, a list of frequently used debug commands. A Reference page has one peculiar property: it usually carries no authorial stance and is purely a snapshot of facts. So its way of going stale is purely mechanical — the fact changes, the page is stale; the fact holds, the page stays right.

Three writing rules

Once the six types are clean, writing each page only requires three rules. The rules look dumb, but every one of them comes from a real crash.

Rule one: if a page mixes "concept + state + ops", split it

This is the biggest killer I have seen. A README page where the first three paragraphs are design philosophy (Concept), the next five paragraphs are current progress (Project), and the bottom is a copy-pasted startup command block (Reference). It looks "comprehensive" at the moment of writing; three months later no one wants to open it.

The reason is not that it is badly written; it is that it can't be maintained. Design philosophy moves once every six months, current progress may change every week, the startup command may change at the next reinstall. Once three completely different cadences are stuffed into one page, the only outcome is that whoever updates it only dares touch the most volatile bit, and the rest of the page becomes increasingly fake.

Once you split, things immediately change: the Concept page barely moves, the Project page only edits its status line, the Reference page is refreshed only when the underlying fact changes. Each page has its own cadence, and each page stops lying.

Rule two: if a page can be replaced by "a link to a better canonical source", replace it

I used to like writing "summary pages" — taking several pages of content, fusing them together into one digest, telling myself this was reader-friendly. I later realized summary pages are the most dangerous artifact in a knowledge base. They don't get updated; the moment the underlying page updates, the summary starts lying — and new readers lean toward the summary precisely because it is easier to read.

My rule now is simple: if a page can be rewritten as a one-line "see page X", don't write the page — put down the link. The link won't go stale faster than the target page; a summary page will.

There is only one exception — when the page is genuinely making a combined judgement, stringing several scattered Concepts into a new conclusion. In that case it is itself a new Concept page, not a summary page. The difference is that it carries its own judgement, not just a rewording of someone else's.

Rule three: if a page is just an "execution log", keep only the summary

AI projects are especially prone to spawning "execution-log pages" — every command from one experiment pasted line by line, every step from one debug session recorded end to end, every output from one deployment kept verbatim. At the time it feels safest to keep "the full record"; six months later, those pages are unreadable and no one opens them.

My current habit is to keep the raw execution log in the evidence directory, and keep only the summary in the knowledge base — what was run, what conclusion was reached, which anomalies were recorded in which raw log. The summary may be only 200 characters, but it can be read, cited, and maintained. The raw log can be searched and traced back to, but it doesn't need to live in the knowledge base pretending to be a page of knowledge.

How types and freshness markers work together

Once types are clean, the biggest payoff isn't faster retrieval — it is that the cadence of expiration checks can vary by type. The previous note introduced the three-state freshness marker: verified, stale, needs review. The problem is that one-size-fits-all doesn't work — different categories of page go stale at different speeds.

The cadence I roughly use now:

Concept page: review once every six to twelve months. It is stable by nature; checking it too often is just waste.
Decision page: as a rule not reviewed at all — only picked up again when a signal appears saying "this decision may no longer be right". It is a historical fact, not a current state.
Project page: review every two weeks to a month; if no one touches it past that cadence, it slides automatically to stale.
Audit page: stale by default. It is a one-shot piece of check evidence; it was never meant to stay verified forever.
Timeline page: a line is appended whenever a "system-level change" happens; no dedicated review cycle.
Reference page: refreshed whenever the underlying fact changes. If the fact holds, it stays verified.

With this split, "reviewing the knowledge base" stops being one giant mobilization and becomes several small loops at different cadences — sweep Concepts twice a year, sweep Projects monthly, let Reference pages follow the facts. The pressure is spread across different points in time.
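
The per-type cadence can be written down as a small function. This is a sketch under stated assumptions: it uses the upper bound of each range from the list above (twelve months, one month), it treats Decision and Timeline pages as keeping their marker until something flags them, and the `flagged` parameter is a hypothetical stand-in for "a signal appeared" or "the underlying fact changed".

```python
from datetime import date, timedelta

# Upper bounds of the review ranges in the cadence list (an assumption).
REVIEW_WINDOW = {
    "concept": timedelta(days=365),
    "project": timedelta(days=30),
}


def freshness(page_type: str, last_verified: date, today: date,
              flagged: bool = False) -> str:
    """Return 'verified', 'stale', or 'needs review' for a page."""
    if page_type == "audit":
        return "stale"  # one-shot evidence, stale by default
    if page_type == "reference":
        return "stale" if flagged else "verified"  # follows the fact
    if page_type in ("decision", "timeline"):
        # No dedicated review cycle; only a signal reopens the page.
        return "needs review" if flagged else "verified"
    if page_type not in REVIEW_WINDOW:
        raise ValueError(f"unknown page type: {page_type}")
    if flagged:
        return "needs review"
    age = today - last_verified
    return "verified" if age <= REVIEW_WINDOW[page_type] else "stale"
```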

More importantly: when a reader opens a page, they can judge how trustworthy it is using two signals together — type and freshness. A verified Concept page and a verified Project page mean different things — the first is "I have thought this through", the second is "this still held up as of last week". Read the two signals together and you get something more accurate than either signal alone.

Common page-rot patterns

After the positive rules, the inverse: the three typical page-rot patterns I have either committed myself or watched in other people's wikis. A knowledge base goes rotten almost always down one of these three paths.

Pattern one: mixed-type pages that no one wants to touch

The most common form is a README that needs to cover design, current state, and startup commands all at once. It starts as a shortcut; eventually no one dares touch it: editing the design part might break the state description, editing the state part might contradict the design, editing the command part might paste in something wrong. The page becomes "the most authoritative and the most inaccurate" page in the whole wiki.

The fix is not to edit it; it is to split it — into a Concept, a Project, and a Reference page, each maintained at its own cadence. What remains of the README is a single line of links pointing to them.

Pattern two: stale without marker, every page looks equally trustworthy

Without freshness markers, every page in the wiki looks like it has the same weight — a Project page untouched for eighteen months and a Concept page verified yesterday look identical in the search results. The reader can't tell them apart, an AI assistant doing RAG retrieval can't tell them apart, and the result is "the old beats the new" — old pages tend to be longer and more structured, so they get matched more readily.

The fix is to make the freshness marker a required field, not an optional one — a page with no freshness signal does not get to enter the formal retrieval results. Even "needs review" is fine; at least the reader knows to be cautious about the page.
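
In retrieval terms, "required field" can be enforced as a gate in front of the index. A minimal sketch, assuming each page is a dictionary with a hypothetical `freshness` key:

```python
VALID_MARKERS = {"verified", "stale", "needs review"}


def retrievable(pages: list) -> list:
    """Keep only pages that carry one of the three freshness markers.

    A stale page still passes: the gate only requires that the page
    declares its state, not that the state is good.
    """
    return [p for p in pages if p.get("freshness") in VALID_MARKERS]
```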

Pattern three: duplicate summaries that dilute the canonical source

One project ends up with a "design doc", a "project overview", a "FAQ", a "Wiki summary", and an "Onboarding guide" all at the same time; 70% of their content overlaps and only the emphasis and writing style differ. Each time an underlying fact changes, maybe one or two of the five get synced — the other three keep telling stories based on the old fact.

The fix is to commit to one canonical source and rewrite everything else as "see page X" plus a narrow page-specific supplement. This is the hardest rule, because "writing a new one" always feels more satisfying than "pointing at an old one" — but the fastest-rotting parts of a knowledge base are exactly those repeatedly summarized "pretty new versions".

Six page types have been roughly enough so far — at least across these few independent wikis, I haven't run into a "clearly some type, but no slot to drop it in" situation.

But the boundaries still blur sometimes. Concept and Decision are the easiest pair to flip back and forth on — a rule that says "from now on I always do it this way" — is it my current judgement (Concept) or a one-time choice (Decision)? My provisional rule is: if it can still be overturned, write a Decision; if the rule itself feels stable enough to stand on its own, promote it to a Concept. But this line is not always clean, and sometimes only a few months of hindsight settles it.

The cadence of expiration checks is also still being tuned. Two weeks for Project pages is sometimes too frequent and sometimes not frequent enough; "Audit defaults to stale" sounds decisive, but breaks down when you hit a long-running audit (like a quarterly retrospective). All of this will get iterated on further.

But there is one thing I am fairly sure of now: the precondition for a usable knowledge base is not that pages are well written; it is that every page honestly admits what kind it is. When a page knows what it is, the whole library knows what it is.

Twelve OpenClaw Copies Later: When Paths and Root Directories Become Risk

Wed, 20 May 2026 00:00:00 GMT

OpenClaw path governance notes

One afternoon, with nothing better to do, I typed out a find command just to see how many root directories in my home folder carried the openclaw keyword. A long list slowly scrolled up the screen. I counted: a full twelve.

It wasn't surprise in that moment — more like a daze. I knew clearly which ones were actually running in production — one hosting the virtual company, one hosting Jiyanran (the voice workbench agent) — two in total. The remaining ten each still carried the openclaw name, still had what looked like a complete directory structure, still had config files, still had .env, still had tokens, still had a README I'd written by hand at some point in the past. They were all still there, quietly taking up disk space, not broken, and not in use.

The problem isn't "I accumulated 10 piles of garbage." The problem is that of those 10 copies, several used to be the authoritative root. The moment they were replaced by a new version, they didn't automatically vanish from disk; and I never went back to delete them, because deleting each one requires first confirming "no caller currently points at it," and that confirmation itself is a hassle. So they stayed, and over time the pile grew to twelve.

I used to think that the biggest risk in an AI project was new features failing to run, the model going haywire, or a botched prompt. After doing this long enough, I realized: for a project that's lived past a certain age, the biggest risk is that old copies are still running — you don't know which one is real, the AI doesn't know, and the callers know even less. Every copy looks like the authoritative root, every copy's config looks usable, every token inside every copy is still within its validity window. This isn't a garbage problem; it's a canonical source problem (the authoritative source of truth).

How the copies accumulated

I later sorted the twelve by nature and realized they weren't piled up in one shot — I'd been digging them out shovel by shovel over the past year. Each shovel felt necessary at the time, and each time I never bothered to fill the hole back in. That's how I ended up where I am today.

The top 2 are authoritative roots: one is the production root of the virtual company, hosting approvals, scheduling, agent configuration; the other is Jiyanran's production root, hosting her own MCP (Model Context Protocol) service, voice workbench, and local state. These two are what's actually running, and all valid traffic lands here. They themselves are fine.

On a side note — the "OpenClaw" keyword on my machine covers far more than these 12 roots. If I broaden the search from root directories to all .openclaw* / .clawdbot / .clawai series prefixes, plus all sorts of subdirectories carrying the openclaw keyword, config caches, runtime logs, agent identity files — the whole machine surfaces 13 independent directories and several hundred related files. This time I'm only looking at the root-directory layer, because in governance terms the root is the "foundation" — everything else attaches to it. Once the foundation is sorted, the rest follows.

The problem lies with the 10 below.

The 3rd is a legacy main-config remnant: it used to be the authoritative root, and after being replaced by a new version it just sat there untouched. Its directory name is almost identical to the current production root, just missing a suffix; its .env still holds a token grabbed at some point last year; its scheduler config file is still alive, just unused. This is the most dangerous class of copy, because it most closely resembles the authoritative root.

The 4th is a typo remnant: one day my hand slipped and I typed openclaw as openclag. Hitting enter created an empty directory, and later I casually stuffed a few test files inside. This kind of remnant is the easiest to identify because the name itself is wrong; but it's also the easiest to overlook, because I myself had forgotten it existed.

The 5th is an experiment copy, sitting under some third-party platform's projects directory. I ran extension experiments on that platform for a while and copied the entire openclaw directory over for adaptation testing. When the experiment ended, the whole copy stayed put, along with its own config carrying real tokens.

The 6th is an audit copy. During one audit, to avoid polluting the production root, I copied the entire directory to an isolated folder on the Desktop to run audit scripts. The scripts ran, the audit report came out, but nobody reminded me to delete that copied-out copy, so it stayed too.

The 7th through 10th are 4 historical versions, each named with a version number — the v3_openclaw_agents format. They're "in case of rollback" backups I left behind during major version switches; after the switches stabilized, I never went back to clean them up. Each is still in place, with clear directory names, complete contents, and never opened again.

The 11th is an archive copy, buried deep inside an annual archive directory. During one disk cleanup I moved the entire openclaw main directory over wholesale. The point was to free up space, but after moving it I went and rebuilt a fresh set in the original location — because I "didn't trust the archive directory" — and so the archive zone also ended up with a complete copy.

The 12th is a compressed backup, a tar.gz file buried inside some cleanup-backup directory. I can't even remember when I packed this one; the filename carries a date, and opening it reveals a complete snapshot of some version from last year. The most annoying thing about tar.gz files is that, unlike directories, they don't show up in a find that only matches directory names; you have to actively add a *.tar.gz pattern to scan for them. In other words, halfway through governing a project, the easiest thing to miss is this kind of compressed-state copy. It has no directory structure on disk, its contents don't show up in ls, and it only gets remembered during some disk cleanup.

Counting through these 10 one by one, I realized: every class of copy had a reasonable rationale at the time. Backup, audit, experiment, rollback, archive, fat-finger — none of the rationales is wrong. What was wrong is that every time I left a copy behind, I never bothered to do one thing — tag it, stating what kind it is, when it expires, and who's responsible for cleaning it. So they stayed in the posture of "I'll deal with this later," and "later" never came.
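
The tag I never wrote could have been a single small file dropped into the copy at the moment it was created. A sketch of that idea, assuming a JSON file; the file name `COPY_TAG.json` and the field names are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date
from pathlib import Path

TAG_FILENAME = "COPY_TAG.json"  # hypothetical file name


@dataclass
class CopyTag:
    kind: str     # e.g. "audit-copy", "experiment", "rollback"
    created: str  # ISO date, e.g. "2026-05-01"
    expires: str  # ISO date after which the copy should be cleaned
    owner: str    # who is responsible for cleaning it


def write_tag(copy_dir: Path, tag: CopyTag) -> Path:
    """Write the tag file into the copy and return its path."""
    path = copy_dir / TAG_FILENAME
    path.write_text(json.dumps(asdict(tag), indent=2), encoding="utf-8")
    return path


def is_expired(copy_dir: Path, today: date) -> bool:
    """True when the copy's declared expiry date has passed."""
    data = json.loads((copy_dir / TAG_FILENAME).read_text(encoding="utf-8"))
    return date.fromisoformat(data["expires"]) < today
```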

The copy isn't dangerous — the copy used as authoritative is

If these 10 copies were just quietly occupying disk, it wouldn't be much of a problem. Accumulating 10 old directories in a year doesn't amount to much for a 4TB workstation.

The real risk is on the other side: the copy is being used as the authoritative root.

I just ran into two such cases recently. One is Jiyanran's MCP path — the config still points at the legacy .openclaw, that is, the 3rd one, the legacy main-config remnant. That path has been deprecated for a while, but because it was inherited from an old config and nobody re-verified it when the new production root went live, the MCP calls were still routing through the old path. On the surface everything looked normal, because the config file in the old directory was still there and still loadable; in reality what it was reading was expired agent settings. This is the classic shape of "a copy being mistakenly used by a new caller."

The other is the virtual company's scheduler. It does cross-root scheduling across Jiyanran and shared tools, meaning the scheduler in the company's root directory will reach into another root directory to trigger tasks. That's neutral by itself — cross-root scheduling isn't necessarily wrong — but the problem is that several of the paths the scheduler walks mix in paths from old copies. In other words, the company-side scheduler might, in some cases, be calling scripts inside a copy rather than the latest scripts in the authoritative root. This is the classic shape of "copies cross-referencing each other."

Neither incident caused an outage, but both let me see one thing: a copy being used as authoritative is a silent failure mode. It doesn't blow up one morning — it surfaces slowly as small puzzles: "I changed the setting, why didn't it take effect"; "I fixed the bug, why is it still reproducing"; "I rotated the token, why is the old token still working." Each puzzle in isolation isn't fatal; strung together they form a curve of governance running out of control.

Worse still, the AI can't tell either. When I send an AI agent off to do something, it follows the paths in the config file — it doesn't know which path is the authoritative root and which is a copy. As long as the path is valid, the directory exists, and the file is readable, it treats the material as real. At that point, the more obedient the AI, the wider the copy contamination spreads. The AI isn't the source of the problem, but it will faithfully amplify it.

There's an even more insidious shape: copies calling each other. When the home directory simultaneously holds multiple sets of openclaw — each of them having been the authoritative root at some point — each carries the dependency graph of its own era. Today I open a script in the 5th copy and it might import a tool from the 3rd copy; I open a config in the 7th copy and it might point at a template file in the 11th copy. These cross-references aren't something I designed; they're a side effect of historical layers stacking up. Every time I switched to a new version, I tried to cut it cleanly from the old, but "tried" isn't "completely" — there are always a few edges left dangling. Those few dangling edges, looking back two years later, form a relationship diagram that makes your scalp tingle.

Five typical risks copies bring

Spreading out all the trouble I've had with these 12 copies over the past year, it groups into five categories of risk.

The first is the authoritative source being unclear. Which one is the authoritative root? A human can't always remember, the AI doesn't know, and callers only look at the path, not the semantics. When the home directory simultaneously holds 12 roots carrying the same keyword, "real" stops being the default state and requires active reconfirmation every time. The heaviest cognitive load on a project that's lived this long isn't new features — it's "is what I'm touching right now the authoritative root."

The second is credential proliferation. Every copy might have carried tokens, keys, .env files. They were once legitimate, once granted permissions, once actually usable. When the copy stays behind, the credentials stay with it. You think you only have 2 sets of production credentials to manage; in reality you have 12. The day some old token gets leaked, the trace will point back to some archive directory you've long forgotten — and the difficulty of fixing that kind of incident far exceeds "a live service has a bug."

The third is legacy dependency creep. The code inside copies still references some services that have already gone dead — an internal bridge port that was decommissioned long ago, a local SearXNG instance that no longer exists. These references don't normally appear on production paths, but the moment any single call drifts into a copy, it hits the dead dependency immediately. Error logs surface port numbers you can no longer remember the purpose of, and the debug chain has to be traced back half a year.

The fourth is audit invalidation. This is something I only recently worked out clearly. If the audit was run against a copy, the conclusion can't represent production; but if the audit didn't make clear which set it was running against, the conclusion produced still looks like a production conclusion. Audits exist so I can have more confidence in the system. If the audit's starting point is a copy, it actually makes me more confident in the wrong state. That's the worst possible feedback direction: the check meant to catch drift ends up certifying it.

The fifth is a maintenance burden that compounds. The more copies there are, the more each governance action (rotating a key, upgrading a dependency, changing a path convention) has to be multiplied by the number of copies. An upgrade on the 2 authoritative roots gets done in one afternoon; running through all 12 takes a full week, and you also have to judge one by one "should this set follow, and if not, will it leave a hidden risk." Maintenance burden doesn't grow linearly. It grows faster than that, because every copy can have an implicit reference relationship with several others: 2 roots have 1 possible pair to check, 12 have 66.

These five categories share one trait: none of them is technical risk; they're governance risk. Technical risk can be dissolved by writing better code; governance risk can only be solved by discipline — assigning every piece of material responsibility, boundaries, and a lifecycle. I used to spend ten times more time on the former than on the latter. Only this year did I realize that the hidden cost of governance far exceeds any single point of technical debt.

My handling: identify, stop the bleed, migrate, archive, delete

The easiest reaction is "fine, just delete them all in one go." I started out wanting to do exactly that, but quickly talked myself out of it — because I knew that once I really deleted them, I'd never have another chance to trace anything. Deleting everything outright converts a governance problem into a data-loss problem; it burns the ledger before the books are balanced.

So what I'm doing now is five steps: identify, stop the bleed, migrate, archive, delete. Each step has its own boundary; don't skip.

Step one is identify. Scan out every copy and tag each one: authoritative, legacy, experiment, audit-copy, archive. Tags aren't decoration; they're responsibility — once you slap on legacy, it means "this set is being phased out; new code may no longer reference it." Once you slap on archive, it means "this set is read-only; nobody writes into it anymore."
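A minimal sketch of how those tags could be carried as data rather than as a naming habit. The five tag values come from the step above; `CopyRecord`, its two methods, and the rule that only the authoritative root may take new references are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CopyTag(Enum):
    AUTHORITATIVE = "authoritative"
    LEGACY = "legacy"
    EXPERIMENT = "experiment"
    AUDIT_COPY = "audit-copy"
    ARCHIVE = "archive"


@dataclass(frozen=True)
class CopyRecord:
    path: str  # hypothetical location of the copy
    tag: CopyTag

    def may_be_referenced_by_new_code(self) -> bool:
        # Assumption: only the authoritative root accepts new references.
        return self.tag is CopyTag.AUTHORITATIVE

    def is_writable(self) -> bool:
        # An archive-tagged set is read-only; nobody writes into it anymore.
        return self.tag is not CopyTag.ARCHIVE
```

Keeping the tag next to the path means later steps can ask the record what is allowed instead of re-deriving it from a directory name.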

Step two is stop the bleed. Confirm that no new code still references copy paths. That means grepping through the codebase, the config files, the LaunchAgent (macOS background service) and plist files, and cron. Anywhere still referencing a copy needs to be listed. Migrating without stopping the bleed is like rerouting a road while traffic is still running on the old one: you finish rerouting only to discover that half the convoy never made it onto the new road.
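The scan itself can be sketched as a plain text search. This is an illustration under assumptions, not a replacement for grep: `find_references`, its arguments, and the marker strings are hypothetical, and a text scan like this will not see inside binary-format plist files, which would need converting first.

```python
import os
from pathlib import Path


def find_references(search_roots, copy_markers):
    """Return (file, line_number, line) for every line mentioning a copy path."""
    hits = []
    for root in search_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                file_path = Path(dirpath) / name
                try:
                    text = file_path.read_text(encoding="utf-8", errors="ignore")
                except OSError:
                    continue  # unreadable file: skip it rather than abort the scan
                for number, line in enumerate(text.splitlines(), start=1):
                    if any(marker in line for marker in copy_markers):
                        hits.append((str(file_path), number, line.strip()))
    return hits
```

An empty result for every search root is the "zero references" condition the later steps depend on.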

Step three is migrate. Change every reference still pointing at a copy to point at the authoritative root. This step needs careful editing — run a regression after each change, especially for the core entry points: MCP paths, scheduler config, bridge calls. Once done, run a smoke test: trigger a full flow from the top-level entry and check that every call correctly lands on the authoritative root.

Step four is archive. After confirming zero references, move the copy to the archive zone. The archive zone is read-only, timestamped, and completely isolated from production paths. After moving, leave a README in the original location stating where this set went, when, and why. You can't just delete the original location — deleting outright will leave both the AI and me without leads.
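A sketch of the move-and-leave-a-pointer part of this step. `archive_copy`, the timestamp format, and the `README.txt` filename are assumptions; making the archive zone read-only (for example through file permissions) is deliberately left out.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_copy(copy_dir, archive_zone, reason):
    """Move copy_dir into a timestamped folder and leave a README behind."""
    copy_dir = Path(copy_dir)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    destination = Path(archive_zone) / f"{copy_dir.name}-{stamp}"
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(copy_dir), str(destination))

    # Recreate the original location holding only the pointer file.
    copy_dir.mkdir()
    (copy_dir / "README.txt").write_text(
        f"Moved to: {destination}\nWhen: {stamp}\nWhy: {reason}\n",
        encoding="utf-8",
    )
    return destination
```

The README answers the three questions from the step (where, when, why), so whoever lands on the old path still has a lead.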

Step five is delete. After 3 months in the archive zone, with no person and no caller having come looking for it, then really delete. The 3 months isn't a guess — it's something I observed from my own work patterns: a dead reference will, at the latest, be triggered within 3 months by some regression test, some audit, or some "huh, where did that thing go" question. Three months without incident is basically proof that it's truly no longer needed.
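The waiting rule can be written down as a small check. This is a sketch under assumptions: "3 months" is taken as 90 days, and `can_really_delete` with its `last_access_on` argument is a hypothetical way of recording whether anyone came looking.

```python
from datetime import date, timedelta

OBSERVATION_WINDOW = timedelta(days=90)  # assumption: "3 months" taken as 90 days


def can_really_delete(archived_on, last_access_on, today):
    """True only if the window has passed with nobody coming to look."""
    if today - archived_on < OBSERVATION_WINDOW:
        return False
    # Any access after archiving means someone still needs it: do not delete.
    return last_access_on is None or last_access_on < archived_on
```

For example, `can_really_delete(date(2026, 5, 1), None, date(2026, 6, 1))` is `False`, because the window has not elapsed yet.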

Five steps is slow, but it puts the irreversible action at the end. The first four steps are all reversible: tags can be changed, references can be rolled back, archives can be moved back. Only step five, delete, is irreversible, so it comes only after the previous four steps have all been completed and time has proven it out.

These five steps also have a hidden design: they force me to separate "governance intent" from "governance action." Identify and tag is intent — I first declare "this is how I plan to dispose of this set," then use the next four steps to turn intent into action. The benefit of separating intent and action is that the AI can also come in and help. I can have the AI grep references, have the AI run regressions, have the AI move things to archive — but only after the tags are set. Tagging is a human responsibility, not the AI's. The AI can't decide for me "does this count as legacy or archive," but once I've decided, the AI can handle most of the execution.

Why I don't just delete everything: copies are evidence, not garbage

I later realized that my attitude toward these 12 copies determines how I understand the entire project.

If I see them as garbage, the answer is simple — one rm -rf wraps it up, frees up tens of GB of disk space, makes the desktop tidy. But if I see them as evidence, things look completely different. Every copy is the material evidence of a stretch of history: the 3rd tells me "this is what the production root used to look like"; the 5th tells me "I once ran this kind of experiment on a third-party platform"; the 7th through 10th tell me "I once did 3 major version switches, each leaving behind a complete rollback snapshot"; the 12th tells me "one day last year I was uneasy enough about the system state to pack a complete tarball."

What is this evidence good for? It's useful in three scenarios.

The first is tracing. The day you find a strange token being used in the wild and need to go back to what project that token was provisioned for, which version it was generated in — the copy is the only source that can answer that question. The production root rotated this token long ago; it has no memory of its past.

The second is rollback. After a new version launches, some agent's behavior gets weird. Was it introduced by the new version? The complete old-version state preserved in the copy can be pulled out for A/B comparison, and you can localize the cause in minutes. If all the copies were cleanly deleted, A/B comparison would require rebuilding the runtime environment from git history, and that's days of work.

The third is audit credibility. External audits often ask "when did you start doing it this way," "what did the previous version's design look like." This kind of question can't be answered from memory; it requires material evidence. A timestamped archive copy is the cleanest answer in audit terms.

So deleting everything outright is essentially trading space for time — trading a tidy disk now for being struck mute in future tracing, rollback, and audits. That trade looks worthwhile when the home directory is small; it stops looking worthwhile when the project has lived past a certain age.

Categorized retention is what real governance looks like. It requires me to admit one thing: copies and production roots are two different kinds of entity, and they need different treatment. The production root needs to keep traffic flowing, needs ongoing maintenance, needs strict control over who can change it; the copy needs to be tagged, cut off from new references, frozen, and retired at the right moment. Looked at together, copies are garbage; looked at separately, copies are the project's own archaeological layers.

Where I am now: half tagged, not a single one really deleted

Writing all these principles down is easy; doing them is slow.

As of today, the status of these 12 copies is this: of the 10 outside the 2 production roots, I've tagged 5 as legacy (the legacy main-config remnant, the typo remnant, 3 historical versions), 3 as archive-only (the annual archive copy, the compressed backup, and 1 of the historical versions worth retaining because of the size of its changes), and 2 are still in review (the experiment copy and the audit copy, which need confirmation that all their tokens have been fully rotated).

Stopping the bleed is only half done. I've grep'd through the codebase and grep'd through config files, but the LaunchAgent and plist layer hasn't been cleanly scanned yet — that's the easiest layer to miss on macOS, because they hide under both user and system directories, and the naming conventions aren't unified.

The migration effort is still queued. I already know about Jiyanran's MCP path pointing at the old .openclaw, but the owner hasn't cleared the change — because this change requires restarting the MCP service, redoing a regression, and first taking a complete snapshot of MCP's current runtime state before the change. The cross-root scheduling issue with the company scheduler is more complex: I first need to list every cross-root call, then decide which to keep and which to pull back. Neither of these is something an afternoon can resolve; they need to be scheduled into the engineering rhythm of the next few weeks.

The archive zone isn't built yet. Right now I've only tagged the few copies judged archive-only, but haven't actually moved them into an independent, read-only, timestamped archive directory. That step has to wait until both stopping the bleed and migration are done — because once the move is finished, only a README remains in the original location, and if any caller is still pointing at the original location, that caller will fail outright.

Really delete: not a single one done. The earliest copy that could enter the "really delete" workflow, on a 3-month observation clock, won't come up until autumn.

I'm not anxious about this pace. 12 copies accumulated over a year can't possibly be cleaned in a week — and shouldn't be. If I cleaned them in a week, it would mean I skipped some steps; and skipped steps will eventually come back to find me in some other form.

Through this whole process I kept coming back to one line: a project lives past a certain age, and paths become risk.

A new feature failing to run is visible risk; old copies still running is invisible risk. Visible risk forces you to solve it; invisible risk indulges your procrastination. I'm no longer chasing a clean home directory — what I'm chasing is a home directory where I can articulate the nature of every copy, the ownership of every reference, and the lifecycle of every credential. The former is just tidy; the latter is real governance. These 12 copies, I'm still slowly closing them out.

From Logs to Knowledge: How I Decide What to Keep and What to Drop

Tue, 19 May 2026 00:00:00 GMT

Notes on cleaning up a knowledge base

When you clean up an AI project's knowledge base, the hardest part isn't running out of things to keep — it's wanting to keep everything.

Any stretch of time on an AI project produces a pile of material. Chat logs, runtime logs, install backups, agent settings, ad-hoc reports, status files, scripts, evidence screenshots. Each piece looks vaguely useful on its own. Throwing any one of them away feels like a small loss.

But once you actually shove all of it into something you call a "knowledge base," a few weeks later it looks no different from the scattered working directories it came from. It has been organized on the surface. Underneath, you just moved the garbage.

So now, before I start cleaning up, I always do one dumb thing first: I write down what stays and what goes. Once the rules are written, curation becomes mechanical. When the rules can't be written, curation stalls on every file.

The real tension: raw material keeps growing, time to organize keeps shrinking

I refused to admit this for a long time. The internal voice said — just keep everything for now, the disk is big enough, sort it out later.

But "sort it out later" never actually happens. Once a knowledge base passes a few hundred files, the next person who opens it — including me — doesn't want to read it anymore. Its sheer volume scares everyone off, the author included.

So "keep everything" looks safe, but it's the most expensive choice. The cost isn't disk space — disk is cheap, just buy more — it's attention budget. Every meaningless file I keep means a little more attention I have to spend judging it next time. After ten rounds of that, nobody wants to open the base again.

What didn't work: keep the newest, keep the longest, keep the official-sounding

Early on I cut corners with a few rules that sounded reasonable. All of them backfired.

Keep the newest — but "newest" usually just means someone touched it last, not that what it says still holds.
Keep the longest — but the longest file is often an AI-generated summary, with things mixed in that shouldn't be.
Keep the official-sounding — but files named FINAL / SPEC / README are often early versions, later overturned by what actually happened in production. The filename never got updated.

Any one of those rules looks fine in isolation. Run all three together and you get a disaster — what survives is "authoritative-looking AI summaries that have long since expired." That kind of artifact is more dangerous than a chat log. It dresses up as knowledge.

So I switched to a different approach. Two filtering layers.

First filter: can this material still do work in the future?

The first layer asks one question: at some point in the future, when I re-enter this project, do I need to read this material? If no, drop it. If yes, ask the follow-up — is the material itself useful, or has its conclusion already been absorbed into an audit report or design doc somewhere else? If absorbed, the raw material doesn't need to stay either.

Cut along that line and the material splits into two piles.

What stays: knowledge that can keep doing work

Project overviews, architecture, design docs — this is knowledge, not state. It tells someone what the system looks like and why it was built that way.
Audit conclusions, decision records — settled judgments are worth more than the process that produced them.
How to run it, ports, the command surface — when you want to use it again, this is the first thing you look up.
Change logs, timelines — so someone can understand "why it evolved into what it is today."
Install archives (one per install) — when you reinstall the system, you'll always come back to these.
The audit snapshot section inside an install backup, the knowledge subdirectory of an archive cold-storage, the "engineering analysis" portion of an agent training log — these were originally scattered across messy directories. As long as they can keep doing work, lift them out and put them in the right place.

What gets dropped: runtime and engineering artifacts

Source code, scripts, patches — these belong to the repo, not the knowledge base.
Runtime logs, caches, dependencies — regenerable by running it again.
evidence, backups, raw source material — process evidence, already absorbed by the conclusion.
Raw conversation transcripts (kimi sessions, claude memory, codex memories) — the machine's working memory, not human-facing knowledge pages.
Runtime config containing tokens or secrets — runtime identity, not knowledge.
Identity, role, soul, heartbeat (the prompt-engineering bits that define an agent's persona) — prompt-engineering artifacts. Publishing them is neither safe nor valuable.
The home-directory system files inside an install backup, the books/raw/tasks folders in an archive, old AI chat-log directories — same property, same pile. Don't keep an entire directory just because "there's still something useful in there."

Splitting directories apart is the thing most easily overlooked at this layer. A single directory often contains both things worth keeping and things worth dropping. That's normal — sort by property into two piles. Don't move the whole thing because "sorting is annoying," and don't delete the whole thing because "some of it's useless."
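Sorting one directory into piles by property can be sketched like this. The kind labels in `KEEP_KINDS` and `DROP_KINDS` are hypothetical shorthand for the two lists above, and `split_by_property` assumes each entry has already been labeled with a kind.

```python
KEEP_KINDS = {
    "overview", "design_doc", "audit_conclusion", "decision_record",
    "runbook", "changelog", "install_archive",
}
DROP_KINDS = {
    "source_code", "runtime_log", "cache", "evidence",
    "raw_transcript", "secret_config", "persona_prompt",
}


def split_by_property(entries):
    """entries: iterable of (name, kind). Returns (keep, drop, undecided)."""
    keep, drop, undecided = [], [], []
    for name, kind in entries:
        if kind in KEEP_KINDS:
            keep.append(name)
        elif kind in DROP_KINDS:
            drop.append(name)
        else:
            undecided.append(name)  # no rule written yet: a person decides
    return keep, drop, undecided
```

The third pile is the point of the sketch: anything without a written rule is surfaced instead of being moved or deleted along with its directory.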

Second filter: among the survivors, who is the canonical source?

After the first layer, the real trouble begins. For the same project, there might be a design doc locally and another in archive; install docs may be at v3, but v1 and v2 are still around; the same status is recorded in the audit report and also in a runtime log. Each one claims to be correct.

At this point you can't merge — merging just packages the conflict more prettily. You need a referee.

The six rules below aren't truths. They're work rules. Their purpose isn't to "pick the best one" but to give every kind of material a fixed priority, so I don't have to think from scratch next time I'm refereeing.

Original file beats backup. A backup exists for emergencies, not to be cited.
Latest stable version beats older versions. Note the word "stable" — not "the most recently edited draft."
Design doc beats runtime traces. Commands, state, logs tell you what it's doing right now. Only the spec tells you what it was supposed to do.
History ledger beats raw chat transcripts. A ledger is a compressed anchor; raw chat is a stream.
Local current fact beats the duplicate copy in archive. archive is history, not present.
Anything about persona, identity, soul, heartbeat — excluded by default. This category is prompt-engineering artifact, not public knowledge.
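The six rules can be sketched as a fixed ranking. Two assumptions are baked in and are not stated in the rules themselves: the rules are applied in the order listed, with earlier rules winning ties, and every flag on the hypothetical `Candidate` record has been filled in by a person beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    version: int = 0
    is_backup: bool = False
    is_stable: bool = True
    is_design_doc: bool = False
    is_raw_chat: bool = False
    in_archive: bool = False
    is_persona: bool = False


def pick_canonical(candidates):
    """Return the highest-priority candidate, or None if all are excluded."""
    # Rule 6: persona, identity, soul, heartbeat are excluded by default.
    eligible = [c for c in candidates if not c.is_persona]
    if not eligible:
        return None

    def rank(c):
        return (
            not c.is_backup,    # rule 1: original beats backup
            c.is_stable,        # rule 2: stable beats draft ...
            c.version,          # ... and newer stable beats older
            c.is_design_doc,    # rule 3: design doc beats runtime traces
            not c.is_raw_chat,  # rule 4: ledger beats raw chat
            not c.in_archive,   # rule 5: local fact beats archive duplicate
        )

    return max(eligible, key=rank)
```

The function only applies a priority that was already decided. It does not decide which flags a file deserves.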

And one more thing: tag every page with a freshness state

Pages that stayed also expire. Without handling expiration, the knowledge base regresses to that "everything's right, but nothing's necessarily right" state — no different from the messy directories you started with.

So now every page carries a freshness state. Three states is enough. Any more and nobody maintains them:

verified — recently checked against the source by a person or AI, still holds.
stale — the source has moved on, this page may be inaccurate, but still usable as a lead.
needs review — visibly in conflict, must be looked at again by a person.

The three states aren't complicated. The point is that they give "when must this be updated" a clear signal. Without a signal, every page looks equally trustworthy, and problems get quietly written into the next round of judgment.

I keep one simple rule for myself: if a page can't pass a quick source check, it shouldn't be allowed to feel authoritative. It can stay as a lead, but it gets downgraded — it can't keep pretending to be verified.
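The three states and the downgrade rule fit in a few lines. This is a sketch: the state names come from the list above, while `next_state` and its two boolean inputs are hypothetical ways of recording the outcome of a source check.

```python
from enum import Enum


class Freshness(Enum):
    VERIFIED = "verified"
    STALE = "stale"
    NEEDS_REVIEW = "needs review"


def next_state(current, source_check_passed, conflict_found):
    """Decide a page's state after one review pass."""
    if conflict_found:
        return Freshness.NEEDS_REVIEW
    if source_check_passed:
        return Freshness.VERIFIED
    # Check failed or was skipped: a verified page is downgraded to a lead.
    if current is Freshness.VERIFIED:
        return Freshness.STALE
    return current
```

A page never stays `verified` without passing a check, which is the whole rule.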

What it produced: three independent wikis

After running the two filters, six rules, and three-state tagging through everything, I ended up with three independent wikis. None of them are large, all of them can keep doing work, and they share the same style:

llm-wiki — engineering knowledge base. Holds page principles, project status, ecosystem governance, audits and decisions.
openclaw-knowledge — the OpenClaw project's dedicated base. Holds install design, version choices, security hardening, history ledger.
yun-archive-wiki — personal archive base. Holds the music index, install audits, the knowledge portion of cold-storage, post-reinstall reports.

Each base has its own keep/drop table, but the underlying judgment is one shared method. That matters more than whether the content is complete — it means if any one wiki has a problem later, I can clean it with the same method, without reinventing the rules.

A side effect: each wiki is small. llm-wiki is 19 markdown files, 76KB; openclaw-knowledge is 7 files, 28KB; yun-archive is 11 files, 44KB. Together, under 150KB.

This is something I only came to accept slowly — a knowledge base that actually gets used is usually small. The big one is usually not a knowledge base. It's a backup of a working directory.

A few quiet rules I keep enforcing

This process keeps running not because any one rule is particularly clever, but because a few dumb rules never get broken.

Don't let "sort it out later" be an excuse — if a call can't be made now, either drop it or mark it needs review and push it into the next round.
Don't let AI decide the canonical source — AI can list candidates, find duplicates, surface conflicts. But which one is authoritative is my call, and it has to be written into the keep/drop table.
Don't let "looks authoritative" equal "is authoritative" — FINAL / SPEC / README in a filename still has to pass both filtering layers.
Don't let raw material and curated output share a directory — raw stays in the working directory, curated goes into the wiki. Once they're mixed, a few weeks later nobody can tell which is which.
Don't conflate "delete" with "exclude" — excluded just means the knowledge base doesn't absorb it. The original file can still exist. This one rule lifted a lot of psychological weight off the cleanup process.

Cleaning up a knowledge base is, at heart, drawing boundaries around material.

What stays and what goes isn't a taste question. It's "does this material have the standing to stand in for the source when someone asks a question in the future?" If it can stand in, keep it. If it can't, leave it as a lead. Leads can live scattered in working directories. Knowledge has to stand on its own inside the wiki.

In the end what I want isn't a bigger base. It's a base I'm still willing to open the next time I come back.

OpenClaw Studio RC1: Local Loop Closed, External Gates Held, and What I'm Still Working On

Tue, 19 May 2026 00:00:00 GMT

OpenClaw Studio stage retrospective

In one day I pushed OpenClaw Studio from stage one to stage seven and shipped RC1. It sounds like a sprint. It was closer to defusing mines.

OpenClaw Studio is the local AI working system I'm building. Locally it handles three things: morning check, task routing, and content production. All the heavy external pieces — DeerFlow (a deep-research pipeline), Mission Control Web (the web frontend for Mission Control), Gateway (the external gateway), LLM access, Search — are managed by gates. If a gate isn't unlocked, that piece isn't wired in.

RC1 is not just a version number. To me it's a boundary — inside the boundary, things can run on their own; outside the boundary, they have to honestly write down "not passed." This retrospective is about how that boundary got drawn, step by step, and where it still leaks.

Why I forced all seven stages into one day

I didn't set out to compress all seven stages into one day. I kept doing it and kept noticing: the dependencies between stages are only fresh within a single day. Once a night passes, the state memory of the previous stage gets fuzzy, and the next stage is forced to re-verify. What should have been one confirmation step becomes three.

So I just concentrated the fire — chained all the stages together, PASS in one means the next can begin, and any gate failing in between is an immediate STOP. It sounds like walking a tightrope, but it ran more stably than dragging it out across several days. Attention didn't scatter, the decision path stayed hot, and when something went wrong I could go back and fix the previous step right away.

The price of single-day execution is density — seven stages, more than a dozen reviews, more than twenty dry-runs. The moment you get tired mid-way, it's easy to let things slide, and letting things slide is when accidents happen. So I set myself three rules I would not break. The three below are the actual spine of RC1.

Method one: dry-run first, real run second

Before every stage began, I made the system dry-run first — no file writes, no messages sent, no external state touched. Just walk the flow end to end, and tell me "if this were real, here's what it would do."

The first time dry-run showed up, I treated it as a "confirmation step." Later I realized it's far more than that — it's "making the AI expose its own understanding." After one dry-run, the AI writes out the paths it plans to access, the commands it plans to run, the files it plans to modify, the external interfaces it plans to call. Eight times out of ten, that's where I catch a misunderstanding: it's treating a state file like a draft, it's about to access an external piece that hasn't been unlocked, it's about to write a "passed" conclusion into a position that hasn't been confirmed yet.

If you don't catch these misunderstandings in the dry-run, they become incidents in the real run. So my rule now is — anything that can change external state must pass dry-run first; only then am I allowed to do the real run.
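One way to make "dry-run first" structural is to separate planning from execution. The sketch below is an assumption about shape, not the OpenClaw Studio code: `Plan`, `add_write`, `describe`, and `execute` are hypothetical names, and only file writes are modeled.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Plan:
    writes: dict = field(default_factory=dict)  # path -> content

    def add_write(self, path, content):
        # Planning only records intent; nothing touches the disk here.
        self.writes[str(path)] = content

    def describe(self):
        """The dry-run output: what would happen if this were real."""
        return [
            f"WRITE {path} ({len(content)} chars)"
            for path, content in self.writes.items()
        ]

    def execute(self, approved):
        if not approved:
            raise PermissionError("dry-run not approved; refusing the real run")
        for path, content in self.writes.items():
            Path(path).write_text(content, encoding="utf-8")
```

Because the real run can only replay what `describe` already listed, a misunderstanding shows up in the listing before it reaches the disk.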

Method two: gate isolation — if it doesn't pass, STOP

The hard part of OpenClaw Studio isn't writing code, it's controlling the boundary. There's plenty that can run locally, but once an external dependency isn't unlocked, it can't be touched — can't pretend to work, can't be skipped and patched later, and definitely can't be released because "it should be fine."

So every external dependency gets a gate. A gate has exactly three states: not passed, passed, pending review. For a not-passed dependency, the whole system treats it as nonexistent; the moment any stage task touches that dependency, it stops and waits for human confirmation.

The gate matrix (the master table of all gate states) is the single most important document in RC1 — more important than any architecture diagram. It's not decoration; it's a real runtime constraint. This time, across seven stages, DeerFlow, Mission Control Web, Gateway, LLM, and Search were all not-passed; the entire external chain was dark. But the local chain, because of gate isolation, could run all the way through.
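A gate matrix can be sketched as a lookup plus a hard stop. The three states and the five dependency names come from the text; `require_gate`, `GateStop`, and the choice to also stop on "pending review" and on unknown dependencies are assumptions (fail closed).

```python
from enum import Enum


class GateState(Enum):
    NOT_PASSED = "not passed"
    PASSED = "passed"
    PENDING_REVIEW = "pending review"


class GateStop(Exception):
    """Raised when a task touches a dependency whose gate is not passed."""


GATE_MATRIX = {
    "DeerFlow": GateState.NOT_PASSED,
    "Mission Control Web": GateState.NOT_PASSED,
    "Gateway": GateState.NOT_PASSED,
    "LLM": GateState.NOT_PASSED,
    "Search": GateState.NOT_PASSED,
}


def require_gate(dependency, matrix=GATE_MATRIX):
    # Assumption: anything missing from the matrix counts as not passed.
    state = matrix.get(dependency, GateState.NOT_PASSED)
    if state is not GateState.PASSED:
        raise GateStop(
            f"{dependency}: gate is '{state.value}', waiting for human confirmation"
        )
```

Calling `require_gate("DeerFlow")` before any DeerFlow code path raises `GateStop`, so the task halts instead of pretending the dependency works.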

Method three: zero out-of-bounds writes

The more proactive an agent is, the more likely it is to cause trouble. Ask it to read something, and it edits it on the side. Ask it to analyze something, and it starts refactoring. Ask it to check something, and it writes the check result somewhere it shouldn't.

So I signed a "contract" with each agent — explicitly spelling out where it can read, where it can write, and what it absolutely cannot touch. RC1 has 7 such contracts in total: one per agent, three columns (Read / Write / Forbidden), no gray area at all.

The contract itself isn't complicated. The key is that once it's signed, it's actually used as an interceptor. Any write that isn't in the "Write" list is rejected directly by the tool layer, with no room for the agent to maneuver. This move raised RC1's sense of safety a notch above earlier versions — I no longer worry that some agent will, on a whim, modify my reference directory.
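A contract used as an interceptor might look like the sketch below. The three columns come from the text; the `Contract` class, the agent name, and the paths in the usage note are hypothetical. Paths are compared lexically as POSIX paths, so anything containing `..` is rejected outright and symlinks are not resolved.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


def _inside(path, roots):
    """True if path equals one of roots or lies underneath it."""
    target = PurePosixPath(path)
    for root in roots:
        root_path = PurePosixPath(root)
        if target == root_path or root_path in target.parents:
            return True
    return False


@dataclass(frozen=True)
class Contract:
    agent: str
    read: tuple
    write: tuple
    forbidden: tuple

    def check_write(self, path):
        # Reject traversal instead of trying to normalize it.
        if ".." in PurePosixPath(path).parts:
            raise PermissionError(f"{self.agent}: '..' in {path} rejected")
        # Forbidden always wins; otherwise the path must be in the Write list.
        if _inside(path, self.forbidden) or not _inside(path, self.write):
            raise PermissionError(f"{self.agent}: write to {path} rejected by contract")
```

With `Contract("writer", read=("/work",), write=("/work/drafts",), forbidden=("/work/drafts/locked",))`, a write to `/work/drafts/post.md` passes and a write to `/work/reference/spec.md` raises `PermissionError`.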

Seven stages, each closing a specific risk

The seven stages aren't split by feature; they're split by risk. Each stage solves one class of risk, and only when it fully passes does the next stage begin. The benefit is — whenever something goes wrong, you can pin it precisely to the previous gate, instead of debugging from scratch.

Stage one: freeze the environment — confirm the local working directory, state files, and tool versions, all under dry-run.
Stage two: write the agent contracts clearly — where to read, where to write, when it must stop.
Stage three: lay out all external gates — no external dependency can be triggered in an unconfirmed state.
Stage four: get the content factory running — five content templates, the registry (the ledger for content), and the review pipeline, all closed-loop locally.
Stage five: hook up the markdown version of Mission Control — auto-update, auto-backup.
Stage six: master integration — run a full task using the outputs of the previous five stages together, and look at the coupling points.
Stage seven: deliver RC1 — freeze the docs, mark the first release candidate.
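The chaining rule (PASS in one stage lets the next begin, any failure is an immediate STOP) reduces to a short loop. `run_stages` and the idea of each stage exposing a `check` callable are assumptions made for this sketch.

```python
def run_stages(stages):
    """stages: list of (name, check), where check() returns True for PASS.

    Returns (passed_names, stopped_at); stopped_at is None if all passed.
    """
    passed = []
    for name, check in stages:
        if not check():
            return passed, name  # STOP: later stages never start
        passed.append(name)
    return passed, None
```

Returning the name of the stage that stopped the run is what lets a failure be pinned to one gate rather than debugged from scratch.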

The audit trail for each stage (including all rerun sub-versions, sub-task nodes, and closeout nodes) lives in the local audit directory. No need to paste it into a public article. What matters is the staging philosophy itself — cut by risk, not by feature.

What RC1 actually delivers

At the end of the day, what's usable on RC1 isn't much — but every item has been verified:

The local working system runs on its own — morning check, task routing, content production, none of it depends on any external interface.
The content factory works end to end — registry + five template types + review pipeline, a piece of content can go from draft to review-passed with evidence at every step.
The markdown version of Mission Control is hooked up — task state updates automatically, backs up automatically, no manual sync needed.
The 7 agent contracts are in place — every agent's read/write boundary is explicit; no contract, no work.
The gate matrix is documented — any not-passed external dependency is automatically stopped, no possibility of being quietly let through.
The role shuffle also landed: primary controller and write-authority moved from "Ying Zheng" (the previous lead agent) to "Zhao Zilong" (the new one), and each of the other roles was demoted to being just one of the leads. Once that was written into the contracts, all writes funneled into the single Zhao Zilong agent, and problem tracing got much faster.

What RC1 does not deliver

I want this section to be even clearer — the essence of RC1 is "the part that runs locally," not "the whole system is mature." The following are still unresolved:

The external systems are all held back by gates: DeerFlow, Mission Control Web, Gateway, LLM access, and Search are all still outside the boundary. That means RC1 can't take any task that requires external capability.
The sync mechanism for the 4 anchors (the key files used for long-term state sync) is still not observable — when local state changes, whether those anchors are actually synced, and when, has no automatic verification today. Drift risk is hanging there.
26 secrets still pending cleanup continue to block the reinstall of another machine. RC1 solved the local working system, but didn't solve the hygiene of the whole machine.
Multiple control planes coexist — local RC1, Mission Control Web, and the future external gateway overlap in responsibility, and who governs whom is not really settled.
The history entry draft (HISTORY_ENTRY_DRAFT, the draft file that merges this round of engineering output into the long-term history ledger) hasn't been closed — meaning even though this seven-stage run has its audit trail, it hasn't been merged into the long-term history ledger, and a few months from now I'll need to come back and close it.

Writing out "what wasn't delivered" matters more than writing out "what was delivered" — it stops me, months later when I look back at RC1, from treating it as "already done."

The real state of RC1: I'm still working on it

For me, RC1 isn't an endpoint. It's a starting point I can keep walking from.

The local loop working means I have a working substrate I can maintain without depending on the outside — but what it can do is still narrow. Unlocking external gates is a long road, not a one-or-two-week thing; anchor sync, secret cleanup, control-plane closeout, each has to queue up and be done on its own.

My rhythm now: each week, pick one gate and try to push it forward one step — if it can be unlocked, unlock it; if it can't, write "why not" into the audit. Each anchor drift I resolve gets a "verified" mark; each batch of secrets I clean up gets the corresponding position marked "stale" pending the next review. I'm not chasing the RC2 or RC3 version number, but I want RC1's boundary to stay clear — what can run, what can't, and why — three questions that always have answers.

So this is a stage retrospective, not a release celebration. RC1 keeps getting modified offline; even as I finish writing this piece, the next round of gate tests is already queued. The next OpenClaw retrospective will probably start from either "a gate finally unlocked" or "a gate I thought would unlock got pushed back."

RC1 is the version of OpenClaw Studio I'm most satisfied with so far — satisfied not because it's done, but because for the first time it tells me clearly what I can do, what I can't, and where to push next.

Before reaching this point, every "it runs" was an illusion. Real "it runs" comes with gates — local runs, external is governed, writes have contracts, state has audits. If this setup keeps getting polished, RC1 will one day become obsolete, replaced by a version without the RC suffix.

Until then, I'm still working on it.

From Information to Article

Mon, 18 May 2026 00:00:00 GMT

OpenClaw workflow notes

Once Shen Zhixing (沈知行, the information agent) has pulled the information in, the work hasn't actually started yet.

Early on I kept defining the information agent's finish line as "are there enough sources," "can it fetch," "is the status fetched_ok." Those are baselines, but all they prove is that the system can reach the information. They don't prove the information is ready to become content.

The real dividing line is whether Suwan (苏晚, the content agent) can pick it up.

If Shen Zhixing brings back a pile of titles, links and summaries every day, and Suwan then has to decide all over again which ones are worth reading, which are just noise, which are suitable for an article and which to throw out — then this chain doesn't really exist. All it's done is move the search work from one place to another.

So I started reframing the problem: what Shen Zhixing hands to Suwan can't just be information. It has to be content candidates.

Fetching is not the same as writable

This was the first trap I fell into. A source gets working, an item gets fetched, a summary gets generated — each of these gives you a small hit of completion.

But content work doesn't start from "I have material." It starts from "why is this material worth processing."

A news item can be true but have no writing value. A link can be brand new but unrelated to my long-running themes. A discussion can be hot but only emotional noise. Conversely, a tiny product change, an obscure forum thread, an ordinary version bump might be exactly what exposes a structural problem worth writing about.

If Shen Zhixing's job is only to bring information back, Suwan is forced to start the filtering from zero. On paper that looks like multi-agent division of labor. In practice it's one person redoing all the judgment.

Now I require four more things with each handoff

Later I added clearer requirements to this handoff chain. Every candidate Shen Zhixing passes to Suwan has to carry at least four extra things.

First, source and status. Where it came from, whether it was actually fetched, whether it's only a candidate, whether it's expired or needs to be rechecked.
Second, why it's worth looking at. Not just "this is a news item," but what judgment it triggers.
Third, a suggested angle. Whether it fits a tool experience, an industry observation, an OpenClaw retrospective, or only background material.
Fourth, risks and gaps. Anything shaky on the facts, thin on sourcing, easy to misread, touching privacy, or not yet appropriate to publish.

With those four things, Shen Zhixing's role changes. It's no longer "give me ten links." It's helping Suwan save the cost of the first round of judgment.

It doesn't decide for Suwan what to write. It just delivers the material to a place where judgment can continue.
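To make the four requirements concrete, here is a minimal sketch of what a handoff record could look like. The class name, field names, and status strings are my own illustrative assumptions, not the actual OpenClaw schema; the only thing taken from the notes above is that all four items must be present before a candidate counts as handed off.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical handoff record; field names are illustrative assumptions."""
    title: str
    source_url: str = ""          # 1. source: where it came from
    fetch_status: str = ""        # 1. status: e.g. "fetched_ok", "expired", "needs_recheck"
    why_worth_reading: str = ""   # 2. what judgment it triggers
    suggested_angle: str = ""     # 3. e.g. "tool experience", "background only"
    risks: Optional[List[str]] = None  # 4. None = not assessed yet; [] = assessed, none found


def missing_handoff_fields(c: Candidate) -> List[str]:
    """Return the names of the required handoff fields that are still blank."""
    text_fields = {
        "source_url": c.source_url,
        "fetch_status": c.fetch_status,
        "why_worth_reading": c.why_worth_reading,
        "suggested_angle": c.suggested_angle,
    }
    missing = [name for name, value in text_fields.items() if not value.strip()]
    if c.risks is None:
        missing.append("risks")
    return missing
```

A candidate with only a title and a link would report the reason, the angle, and the risk assessment as missing, which is exactly the "ten links" case described above.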

What Suwan picks up isn't material — it's a space of choices

Suwan's most important capability as a content agent isn't writing beautifully. It's knowing what's worth writing, how to write it, and why now.

So she can't just receive a pile of "material." Material is too broad — facts, noise, old stuff, half-finished drafts, things that could be public, leads that are only fit for internal use. What she actually needs is a space of choices.

A good candidate should let her see quickly: which long-running thread this connects to, whether it can explain a real problem, whether there's enough evidence right now, who it's written for, and whether going public would expose backstage detail.

That's how Suwan gets to make a content judgment, instead of becoming a second-pass scrubber.

There has to be a candidate pool in between

Over time I trust "fetch and then write directly" less and less.

Writing right after fetching makes the system over-dependent on how things feel that day. Today this item seems important; tomorrow it turns out to be noise. Today the evidence feels enough; later it turns out another source is missing. Today it reads like an article; two days later you realize it's just an internal manual.

So there needs to be a candidate pool in between. Not a sprawling archive — a buffer with clear status. Which candidates enter owner review, which go to Suwan, which need more sources, which go into the wiki, which get archived directly.

The value of the candidate pool isn't storing more things. It's making sure every piece of information has a next step.
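As a rough sketch of "every piece of information has a next step": the five destinations below come from the paragraph above, while the enum name, the string values, and the helper function are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Optional


class NextStep(Enum):
    """The five destinations named above; identifiers are illustrative."""
    OWNER_REVIEW = "owner_review"
    TO_SUWAN = "to_suwan"
    NEEDS_MORE_SOURCES = "needs_more_sources"
    TO_WIKI = "to_wiki"
    ARCHIVE = "archive"


def candidates_without_next_step(pool: Dict[str, Optional[NextStep]]) -> List[str]:
    """Return the ids of candidates that are sitting in the pool with no next step."""
    return sorted(cid for cid, step in pool.items() if step is None)
```

If this list is ever non-empty, the pool is drifting from a buffer toward an archive.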

What the content chain should look like

These days I prefer to break the Shen-to-Suwan chain into a few steps.

First, verify the source: can it be fetched, is the content still valid, does it sit within the topic boundary.
Then clean the items: drop duplicates, noise, obvious low-value entries, and anything that can't be public.
Then generate candidates: with title, source, reason, angle, risk, and suggested next step.
Then Suwan's judgment: is it worth writing, in what form, and is more evidence needed.
Only then writing: an article isn't stitched-together material — it starts from a clear judgment.

This chain looks slower than "fetch and summarize," but it holds up better over time. It separates the responsibilities at each step.
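The first three steps of the chain can be sketched as a single function. This is a simplified illustration under assumptions: items are plain dictionaries, and keys such as `fetched`, `valid`, `in_topic`, and `public_ok` are hypothetical names. The function deliberately stops before judgment and writing, because those steps belong to Suwan.

```python
from typing import Dict, List


def run_chain(items: List[Dict]) -> List[Dict]:
    """Verify, clean, then generate candidates; judgment and writing happen later."""
    seen_urls = set()
    candidates = []
    for item in items:
        # Step 1: verify the source (fetchable, still valid, inside the topic boundary).
        if not (item["fetched"] and item["valid"] and item["in_topic"]):
            continue
        # Step 2: clean (drop duplicates and anything that can't be public).
        if item["url"] in seen_urls or not item["public_ok"]:
            continue
        seen_urls.add(item["url"])
        # Step 3: generate a candidate carrying what the judgment step needs.
        candidates.append({
            "title": item["title"],
            "source": item["url"],
            "reason": item.get("reason", ""),
            "angle": item.get("angle", ""),
            "risk": item.get("risk", ""),
            "next_step": "suwan_judgment",
        })
    return candidates
```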

Shen Zhixing owns usability and first-pass filtering on the information side. Suwan owns judgment and expression on the content side. I own the critical release points and the boundary. That starts to feel like a workflow that can actually collaborate.

I don't want an automated writing pipeline

There's a tempting wrong turn here: if Shen Zhixing can fetch and Suwan can write, shouldn't they just auto-generate an article every day?

I don't want that right now.

Auto-generated articles ship fast, but they're also the easiest way to mistake "I have information" for "I have a judgment." A site like YunLab.ai doesn't need to perform presence every day. It needs every article to answer one question: what did I actually understand this time.

So Suwan isn't an auto-publisher. Shen Zhixing isn't a hot-take feeder. What they should form between them is a content-judgment system, not a content production line.

The final call

What I actually wanted to fix this round wasn't an information-fetching module or a writing module. It was the handoff between them.

The hard part of multi-agent work usually isn't how smart a single agent is. It's whether there's a table between two agents that can hold the work. Shen Zhixing puts information on it; Suwan can read why it was put there, and can decide whether it should be written, expanded, dropped, or held for my review.

That table is the content candidate pool, and it's the hinge of the whole workflow.

From information to article, what's missing in the middle isn't more summaries. It's a clearer handoff.

Shen Zhixing has to deliver judgeable candidates. Suwan has to make selective content judgments. And in the end, a human still holds the public boundary. Only then does the information agent stop being a scraper, and the content agent stop degrading into a rewriter.

My Knowledge Model

Sun, 17 May 2026 00:00:00 GMT

Notes on the Knowledge Model

What I eventually figured out: the dangerous thing in an AI project isn't forgetting — it's remembering wrong and continuing to build on it.

I used to think of long-term knowledge as something close to "archives." Save the chats, save the terminal output, save the reports, save the screenshots — and the next AI would be able to pick it up.

The real situation isn't like that. An AI project's working ground isn't a library — it's more like a construction site. Today Claude is talking direction in one window, Codex is editing files locally, Kimi is digesting a long document, the browser has a deploy page, a preview page, and a ChatGPT conversation open. Every spot is producing material: a sentence of judgment, a chunk of logs, a report, a screenshot, a status that just barely passed.

On the day, I obviously know which is a draft, which is in-between, which PASS came with a warning. The problem is that a few days later the site is dispersed and only the materials remain. When the next AI walks in, what it sees isn't a "site" — it's a pile of fragments that all look like evidence.

Without compression at this point, the system doesn't get smarter — it just gets dirtier.

The real conflict: more material, less stable judgment

I first assumed the problem was "the AI forgot." Later I found the more common problem was "the AI remembered wrong."

Say a line of work first gets scored 95, then an independent audit knocks it back to 77, and later the gaps get closed and it comes back to 96. All three numbers are real, but they can't be tossed together for direct use. If the next AI only sees 95, it might push toward launch; if it only sees 77, it might re-fix things that were closed; if it only sees 96, it might forget the 96 is still a local controlled candidate, not a production release.

This isn't a memory-quantity problem. It's a judgment-structure problem. When material hasn't been compressed into "currently-usable conclusions," it becomes a contamination source in the next round of collaboration.
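One way to picture "compressed into a currently-usable conclusion" is to store each score with its scope and an explicit superseded marker. The sketch below uses the 95 / 77 / 96 example from above; the record structure and the `kind` labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoreRecord:
    score: int
    kind: str                            # hypothetical label for who produced the score
    scope: str                           # the boundary inside which the score holds
    superseded_by: Optional[str] = None  # set explicitly once a later record replaces it


def current_conclusion(history: List[ScoreRecord]) -> ScoreRecord:
    """Return the single record that has not been superseded, or refuse to guess."""
    live = [r for r in history if r.superseded_by is None]
    if len(live) != 1:
        raise ValueError(f"expected exactly one live record, found {len(live)}")
    return live[0]


history = [
    ScoreRecord(95, "initial_score", "local", superseded_by="independent_audit"),
    ScoreRecord(77, "independent_audit", "local", superseded_by="after_gap_closure"),
    ScoreRecord(96, "after_gap_closure",
                "local controlled candidate, not a production release"),
]
```

All three numbers stay on file as evidence, but only one of them can answer "where are we now," and it carries its boundary with it.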

The impact is concrete. Every time the window changes, the AI changes, or the day changes, I have to re-explain: that report is stale, the other one is just an owner report — not an independent audit; this PASS is local-only, not a launch authorization; that conclusion was just to keep the discussion moving and got overturned afterward.

Wasted time is only the surface. The more serious problem is that old judgments revive, in-between states get misread as completed, and the AI keeps writing plans on top of wrong premises. The more smoothly it writes, the more easily it pulls me along.

A wrong fix I tried: save everything

The most natural reaction is to add memory. Afraid of losing things — save them all. Save chat, save logs, save reports, ideally make every sentence searchable afterward.

But this path turns and bites you fast. Full-volume saving solves "do we have the material," not "is this material still usable now." A vector store can recall similar content, but it doesn't automatically know which sentence was probing, which report has been overturned by audit, which conclusion only applies to a local preview.

I stopped chasing "let the AI remember more." I started caring about something else: let it misuse less.

That's where my Knowledge Model came from. It's not a new term, and it's not another knowledge-base tool. It's a set of end-of-day closeout rules for the construction site: among the materials produced each day, which ones stay as evidence only, which enter the current state, which become long-term rules, and which should be thrown out outright.

My method: four questions decide whether a piece of material gets promoted

Now I don't put a piece into long-term memory just because it "looks useful." It has to pass four questions:

One, what's the problem. Not vaguely "lacked context," but spelling out which judgment was wrong, which boundary got blurred, which state got misread.
Two, what impact did it cause. Did it waste one handoff, did it revive an old conclusion, did it almost let local-only get treated as a launch authorization.
Three, how do I view this problem now. What changed in the way I judge, and why the old way can't be used anymore.
Four, what to do next time. Should it become a rule, a checklist, a skill, a task contract — or should it only be kept as raw evidence.

If these four can't be answered, don't rush to call it knowledge. At best it's a record.
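The promotion gate can be written down in a few lines. This is a minimal sketch; the dictionary keys are my own shorthand for the four questions, not names from any existing tool.

```python
from typing import Dict

FOUR_QUESTIONS = ("problem", "impact", "change_in_judgment", "next_action")


def classify_material(answers: Dict[str, str]) -> str:
    """Return "knowledge" only when all four questions have a non-blank answer."""
    unanswered = [q for q in FOUR_QUESTIONS if not answers.get(q, "").strip()]
    return "record" if unanswered else "knowledge"
```

Anything that comes back as "record" stays in the evidence pile rather than entering long-term memory.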

I split knowledge into four layers, and don't let them impersonate each other

Now I put material into four positions.

Raw evidence layer: chats, logs, screenshots, test output, diffs. Their job is to prove something happened — not to directly guide the next step.
Current state layer: task README, notes, handoff, state files. Their job is to answer "where exactly are we right now."
Long-term rules layer: AGENTS.md, skills, checklists, memory entries. Their job is to change AI behavior going forward.
Public expression layer: articles, retrospectives, methodology. Their job is to rewrite internal experience into experience others can understand too.

These four can't mix. Evidence can't impersonate state, state can't impersonate rules, and rules shouldn't be directly exposed as a public article. A lot of past confusion came from these layers mixing: a chat treated as a rule, a report treated as a verdict, a memory summary treated as current state.

The value of the Knowledge Model is right here: it doesn't help me remember everything; it forces me to put things in the right place.
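A small lookup table is enough to keep the layers from impersonating each other. The artifact kinds below are taken from the four-layer list above; the identifiers and the helper function are illustrative assumptions.

```python
LAYER_OF = {
    # Raw evidence: proves something happened.
    "chat": "raw_evidence", "log": "raw_evidence", "screenshot": "raw_evidence",
    "test_output": "raw_evidence", "diff": "raw_evidence",
    # Current state: answers "where exactly are we right now".
    "task_readme": "current_state", "notes": "current_state",
    "handoff": "current_state", "state_file": "current_state",
    # Long-term rules: change AI behavior going forward.
    "agents_md": "long_term_rules", "skill": "long_term_rules",
    "checklist": "long_term_rules", "memory_entry": "long_term_rules",
    # Public expression: internal experience rewritten for others.
    "article": "public_expression", "retrospective": "public_expression",
    "methodology": "public_expression",
}


def may_serve_as(kind: str, needed_layer: str) -> bool:
    """True only if this kind of material belongs to the layer being asked for."""
    return LAYER_OF.get(kind) == needed_layer
```

With this in place, a chat log offered as a rule, or a memory entry offered as current state, fails the check instead of slipping through.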

The result: memory doesn't grow, handoff gets lighter

What this method produces isn't "the AI suddenly understands everything" — it's that handoff cost drops.

A new window shouldn't have to read through full chat history. It should look at current state first, then recent notes, then handoff, then chase raw evidence as needed. It shouldn't infer current conclusions from a pretty report, and certainly shouldn't decide the next step straight from a memory summary.

For me, long-term knowledge isn't a warehouse of "what I once said." It's a tool for "what I don't want to repeat next time." A construction site without an end-of-day closeout is just a pile of materials. Only after the day is closed out can it become an asset usable at the next start of work.

So my Knowledge Model isn't about making the AI remember more — it's about making the AI misuse old material less.

At the end of each task, what I actually want to leave behind isn't the full chat. It's four things: the problem, the impact, the change in judgment, the next action. Answer these four and the experience earns its place in long-term knowledge. Can't answer them? Keep it in the evidence layer, and don't pollute future judgment.

Canonical Source Rules

Sun, 17 May 2026 00:00:00 GMT

Notes on picking the canonical source

What I eventually figured out: in an AI project, the hardest thing to deal with isn't "no truth." It's that many things look like truth.

There's a conclusion in the chat, a conclusion in the report, a conclusion in the task folder, a conclusion in memory. There's a freshly built page locally, and a reachable page in production. The old audit says it's not enough; the new report says the gap is closed. Each piece of material, taken alone, can tell a coherent story.

That's exactly where it gets dangerous. If a piece of material is clearly wrong, it's actually easy to handle. The real trouble comes when it was right at some point, inside some boundary — and now it's being used to answer a different question.

So I set myself a rule: decide who has standing to answer first, then look at what they said.

The conflict: AI passes "stage-correct" off as "currently-correct"

The classic example is all the pretty status words: 95 points, launch ready, local candidate, PASS_WITH_WARNINGS. Each one is not necessarily wrong on its own — some are even carefully worded.

But the next time an AI picks them up in isolation, the meaning shifts. It may remember "launch ready" and forget "local-only"; remember "95 points" and forget that a later independent audit knocked it back; remember "the page opens" and forget that was just a local preview; remember "generated" and forget it was never deployed to production.

This kind of error doesn't blow up immediately. It's more like slow contamination: the AI keeps writing from a wrong root, treats old state as current state, calls local-only ready for production, calls a sampling pass a full completion. And it writes it out fully — even with a plausible-looking next step.

I'm not afraid of an AI not knowing. I'm afraid of it, while not knowing, taking a piece of material that has no business answering, and producing a very smooth answer.

The wrong fix: chase the newest, the longest, the best summarizer

I used to take shortcuts too. Whichever file was newest, I'd read first; whichever report was longest, I'd assume was more complete; whichever AI said "done" last, I'd quietly treat as current state.

But in AI projects, "newest" often just means most recently written — not most recently accountable. "Longest" often just means most explained — not strongest evidence. "Smoothest summary" is even more dangerous, because it wraps the conflict up into a pleasant story.

My understanding of canonical changed after that. Canonical isn't "the one truth file," and it isn't "you only get to look in one place." It's more like jurisdiction in court: different questions go to different responsible sources.

My rule: ask the question type first, then pick the canonical source

Now I break the question apart first. Instead of opening with "how is the project doing," I ask: what am I actually judging?

Ask about content — look at content files. What an article looks like is settled by the Markdown in the repo, not by a summary in chat.
Ask about current progress — look at task files. README, notes, handoff, state files all outrank a chat summary.
Ask whether something runs — look at real output. Builds, tests, script results, page response codes beat "it should work."
Ask whether something is published — look at the production route. If the question is whether outsiders can see it, a local preview doesn't count.
Ask why a decision was made — look at the decision record. Without an owner sign-off, an audit conclusion, or an explicit superseded marker, you can't quietly promote an old judgment into the current one.
Ask for historical clues — only then go to chat and memory. They're there to help you locate, not to render the verdict.

Once this order is fixed, a lot of arguments disappear. Not because there's less material, but because every piece of material now has a boundary.
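The "question type first" rule maps naturally onto a lookup. The six question types and their sources below follow the list above; the key names and the function are a hypothetical sketch.

```python
CANONICAL_SOURCE = {
    "content": "content files (the Markdown in the repo)",
    "progress": "task files (README, notes, handoff, state files)",
    "runs": "real output (builds, tests, script results, page response codes)",
    "published": "the production route",
    "decision": "the decision record (owner sign-off, audit conclusion, superseded marker)",
    "history": "chat and memory (for locating only, not for the verdict)",
}


def canonical_source_for(question_type: str) -> str:
    """Refuse to answer until the question has been given a type."""
    try:
        return CANONICAL_SOURCE[question_type]
    except KeyError:
        raise ValueError(
            f"unknown question type {question_type!r}; name what is being judged first"
        ) from None
```

"How is the project doing" has no entry on purpose: it has to be broken into one of the six types before any source gets to speak.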

When sources clash: don't summarize, adjudicate

The most error-prone moment is when sources fight each other: chat says A, the file says B; the old report said it passed, the new audit says it didn't; the local page is the new version, production is still the old one.

These days, I don't let the AI summarize right away. Summarizing too early just wraps the conflict more nicely. I make it do four adjudication steps first:

Step one — name the question. Is the current question about content, state, runtime, deployment, acceptance, or historical reason?
Step two — list candidate sources. Which materials claim they can answer this? At what time and inside what boundary were each of them produced?
Step three — pick the responsible source. Who has direct responsibility for the current question, and who's only a clue or old evidence?
Step four — mark the superseded ones. If an old conclusion has been overturned by a new audit, an owner decision, or actual runtime results, write "superseded" explicitly — don't leave it for the next AI to guess.

This method looks slow, but it's much cheaper than pushing forward on the wrong canonical source. Once the path is wrong, every step after that manufactures rework.
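The four adjudication steps can be sketched as follows. The `Claim` structure and its field names are assumptions; the behavior mirrors the steps above, including refusing to pick a winner when responsible sources disagree and nobody has marked one as superseded.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Claim:
    source: str
    responsible_for: str      # the question type this source has standing to answer
    says: str
    superseded: bool = False  # step four: marked explicitly, never guessed


def adjudicate(question_type: str, claims: List[Claim]) -> Tuple[Claim, List[Claim]]:
    """Return (verdict, clues) for a named question type (step one)."""
    # Step two is the `claims` list itself: everything that says it can answer.
    # Step three: only non-superseded sources responsible for this question may decide.
    responsible = [c for c in claims
                   if c.responsible_for == question_type and not c.superseded]
    clues = [c for c in claims if c not in responsible]
    if not responsible:
        raise LookupError("no responsible source; write a current-state file first")
    if len({c.says for c in responsible}) > 1:
        raise ValueError("responsible sources disagree; mark the superseded one explicitly")
    return responsible[0], clues


claims = [
    Claim("chat summary", "history", "the new version is live"),
    Claim("local preview", "runs", "the new version renders"),
    Claim("production route check", "published", "the old version is still served"),
]
verdict, clues = adjudicate("published", claims)
```

Here the production check decides the "published" question, while the chat summary and the local preview are kept only as clues.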

The end result: the project gets quieter

What the canonical rule produces isn't every piece of material consolidated into one file. It's the project becoming quieter.

I used to ask "which version are you talking about?" all the time. Now I ask first: "where's the canonical source for this question?" If I find it, I continue from there. If I can't, I write a current-state file first, instead of continuing to guess in chat.

It also changes how I read AI output. Smooth writing doesn't mean accurate. A complete report format doesn't mean it's a verdict. A line that says "done," without a corresponding file, output, verification, and boundary, is just a claim waiting to be checked.

The canonical-source rule is, at its core, about assigning responsibility to materials.

Chat moves things forward; memory indexes; reports raise claims; task files hold the process; real outputs do the acceptance; owner decisions release. Every source has a place — only then does the project stop getting dragged around by a pile of materials that all look correct. In the end, what I want isn't an AI that summarizes better. It's an AI that knows where to go back to and check the fact, first.

An Information Agent Is Not a Fetcher

Sun, 17 May 2026 00:00:00 GMT

Retrospective on building Shen Zhixing

What I eventually realized: the most dangerous thing about an information agent isn't that it can't fetch information. It's that once it fetches a little, we very easily misread that as "already working."

Shen Zhixing (沈知行, the information agent) started out looking like a clean task: build an information-fetching and curation agent, have it discover what's worth reading from public sources, then hand the cleaned candidates off to Suwan (苏晚, the content agent).

But once I actually got into it, the problem wasn't "can we add more sources." The problem was: do I want a fetcher, or do I want a working role that can maintain an information flow over time?

If it's just a fetcher, then as long as it can hit some RSS feeds, APIs, and public pages and return a list of titles and links, it can say it's done. But if it's Shen Zhixing, it can't stop there. It needs to know which sources are actually usable and which are window dressing; which items are worth reading and which are noise; which candidates go to Suwan, which go into the wiki, which must be thrown out, and which need to be kept for my review.

Those two are very different jobs. The first is "get the data." The second is "maintain the judgment."

First time I got pulled off course: the sample passed, so it was called good to ship

Early on there was a tempting judgment: the system already had a few hundred sources, a sample of 40 was run, 40/40 came back fetched_ok. GPT's suggestion at the time was that we could move into Day 1 owner review.

The sentence sounded smooth, because it had numbers, status names, "boundary notes," and even explained local_db pending as "not blocking the information Day 1." If I had just looked at the table, I would have nodded.

But I asked one question: who said 40 was enough?

If the system claims it has 142 daily sources, then a 40-sample pass only proves the sample passes. It doesn't prove all 142 daily sources work. The biggest risk here wasn't a technical failure — it was the acceptance criteria getting quietly swapped out.
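The gap is easy to see once the sample is measured against the claimed total rather than against itself. A small worked calculation, using the 40 and 142 from above (the function and key names are illustrative):

```python
def sample_coverage(sample_passed: int, sample_size: int, total_sources: int) -> dict:
    """Separate "the sample passed" from "how much of the whole was actually checked"."""
    if sample_size <= 0 or not 0 <= sample_passed <= sample_size <= total_sources:
        raise ValueError("expected 0 <= passed <= sample size <= total, with a non-empty sample")
    return {
        "sample_pass_rate": sample_passed / sample_size,
        "share_of_total_verified": sample_passed / total_sources,
        "sources_never_checked": total_sources - sample_size,
    }


report = sample_coverage(40, 40, 142)
# The sample pass rate is 100%, yet only about 28% of the 142 claimed
# daily sources were exercised, and 102 were never checked at all.
```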

I made a point of remembering this one clearly: AI very easily wraps "one local piece of evidence" up as "the whole thing is ready." It isn't deliberately lying. It's just too good at writing stage results so that they read like completed states.

Second time I got pulled off course: 106 working sources, called over the line

Then I asked for full-volume validation. The old 142 daily sources were evaluated one by one, and 106 ended up as validated daily. Again, the system gave me something that looked like a completed state.

I didn't accept that one either.

Because my requirement wasn't "filter a usable batch out of the old list and call it done." What I wanted was Shen Zhixing as a global information collector, with at least 100 more useful, usable, validated information sources on top. 106 was just what was left after scrubbing the old set — not a new capability boundary.

So the target got rewritten to a harder criterion: not 106 but 206+; not candidate but validated daily; not future, stub, or dry-run, but actually verifiable: live, public API, or public feed.

After that step, Shen Zhixing grew to 257 validated daily sources. That's where I started to accept the result. Not because the number got bigger, but because every source had to carry a state, a validation record, a boundary, and a failure-handling note.
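A sketch of how the harder criterion could be checked, so that stubs and candidates can't inflate the count. The dictionary keys and the three "kind" values are my reading of the criterion above and should be treated as assumptions.

```python
from typing import Dict, List

TARGET = 106 + 100  # the scrubbed old set plus at least 100 new validated sources
COUNTABLE_KINDS = {"live", "public_api", "public_feed"}  # assumed labels


def counts_toward_target(source: Dict) -> bool:
    """Only validated daily sources of a verifiable kind, with a record, are counted."""
    return (
        source.get("state") == "validated_daily"
        and source.get("kind") in COUNTABLE_KINDS
        and bool(source.get("validation_record"))
    )


def target_met(sources: List[Dict]) -> bool:
    return sum(counts_toward_target(s) for s in sources) >= TARGET
```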

Third time I got pulled off course: 257 fetchable sources still isn't Day 1

Once 257/257 fetched_ok came back, I almost got carried away by another "ready for Day 1" line.

The problem was more subtle this time. The fetching layer really had passed: enough sources, full-volume baseline passed, risk scan showed no current leak. On the surface, it looked very complete.

But something felt off: fetching is only one part of Day 1. What about curation? How does Suwan pick it up? How are wiki candidates maintained? How do we watch source quality? The real local-database path isn't wired up yet — how does the owner review?

If those aren't defined, Day 1 turns into an awkward thing: Shen Zhixing can grab a lot every day, but afterwards it all just piles up. Suwan doesn't know where to pick up. I don't know how to review. And the next day, the system doesn't know which sources to keep, demote, or replace.

So I redefined Day 1 as a full chain:

First, validate the sources: which public sources actually work, which don't get into the daily set.
Then build the item store: raw, normalized, cleaned, deduped, clustered, queued — every step needs a state.
Then make content judgments: value score, worth-reading, why worth reading, risk flags, titlebait and low-value filtering.
Then split into handoff queues: worth-reading queue, Suwan candidate queue, wiki candidate queue, owner review queue.
Only at the end comes the daily brief and retrospective: what got fetched today, what was kept, what was thrown out, which sources got worse, which need my decision.

Only at that point did I feel Shen Zhixing started shifting from "fetcher" to "information worker."
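"Every step needs a state" can be sketched as a tiny state machine. The six item states and four queue names come from the list above; the rule that an item advances exactly one step at a time is my assumption about how such a store might be kept honest.

```python
ITEM_STATES = ["raw", "normalized", "cleaned", "deduped", "clustered", "queued"]
HANDOFF_QUEUES = ("worth_reading", "suwan_candidate", "wiki_candidate", "owner_review")


def advance(state: str) -> str:
    """Move an item to the next state; skipping steps or unknown states is an error."""
    index = ITEM_STATES.index(state)  # raises ValueError for an unknown state
    if index == len(ITEM_STATES) - 1:
        raise ValueError("item is already queued; route it to a handoff queue instead")
    return ITEM_STATES[index + 1]
```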

The real dividing line is how Suwan picks it up

The other key point is Suwan.

If Shen Zhixing just gives Suwan a digest, that's still very rough. Suwan shouldn't pick up raw news, and shouldn't pick up a pile of headlines. What she needs is content candidates that have been initially cleaned, de-duplicated, judged, and tiered.

So I broke this out as its own piece: on the Shen Zhixing side, build a Suwan content library, use SQLite to hold state, and also export JSON and Markdown. Each candidate has to spell out its source, the cleaned title, why it's worth reading, suggested angle, content form, audience, risk, whether more sources are needed, and current state.

This step matters a lot. It turns "information" into "content candidates," but stops short of becoming "final article." Shen Zhixing finds, washes, filters, organizes, and files into the candidate library. Suwan selects, judges, expands, writes. The owner decides what moves to the next step.
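For illustration, a candidate library of this shape could look roughly like the following, using Python's built-in sqlite3 and json modules. The table name, column names, and the sample row are hypothetical; only the list of fields and the SQLite, JSON, and Markdown outputs come from the description above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # an in-memory database, for the sketch only
conn.row_factory = sqlite3.Row      # lets each row be converted to a dict
conn.execute("""
    CREATE TABLE candidates (
        id INTEGER PRIMARY KEY,
        source TEXT NOT NULL,
        clean_title TEXT NOT NULL,
        why_worth_reading TEXT NOT NULL,
        suggested_angle TEXT NOT NULL,
        content_form TEXT,
        audience TEXT,
        risk TEXT,
        needs_more_sources INTEGER NOT NULL DEFAULT 0,
        state TEXT NOT NULL DEFAULT 'candidate'
    )
""")
conn.execute(
    "INSERT INTO candidates (source, clean_title, why_worth_reading, suggested_angle) "
    "VALUES (?, ?, ?, ?)",
    ("https://example.com/feed", "A small version bump",
     "exposes a structural problem", "industry observation"),
)
conn.commit()

# Export the same rows as JSON and as a Markdown list.
rows = [dict(row) for row in conn.execute("SELECT * FROM candidates ORDER BY id")]
as_json = json.dumps(rows, ensure_ascii=False, indent=2)
as_markdown = "\n".join(
    f"- {r['clean_title']} ({r['state']}): {r['why_worth_reading']}" for r in rows
)
```

The NOT NULL columns do some of the enforcement: a row without a reason or an angle can't be filed as a candidate at all.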

How I started avoiding GPT's misguidance

The most valuable thing from this round isn't "we got to 257 sources." That number is just a result. What's actually valuable is that I started to know, more clearly, when not to let GPT define "done" for me.

Now I hold it down with a few hard rules:

First, every PASS gets one question: what does this PASS actually prove, and what does it not prove?
Second, a sample result can't represent full-volume capability. Sampling is a signal, not acceptance.
Third, crossing a number threshold isn't crossing the value threshold. Source count, candidate count, test count — all of them have to bind to "useful, usable, handoff-able."
Fourth, state names should be conservative. Write pre-Day1, owner review required, paths pending — don't write stage results as fully done.
Fifth, negative conditions matter more than positive descriptions: don't publish, don't send externally, don't bypass limits, don't count stubs as success, don't write pending as done.
Sixth, conclusions in chat are not the canonical source. The final answer goes back to files, outputs, tests, queues, reports, and owner decisions.

Where GPT pulls me off course most often is its smoothness. It'll give you a balanced-looking judgment: body done, boundaries noted, here's the next step. It sounds mature, but if the acceptance criteria are wrong, that maturity only makes it more dangerous.

So now I trust another move more: break "done" apart.

Fetching done doesn't mean curation done. Curation done doesn't mean Suwan can pick up. Suwan picking up doesn't mean Day 1 started. Day 1 started doesn't mean long-term capability done. Each layer needs its own input, output, boundary, and owner decision.
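Breaking "done" apart can be expressed as an ordered list of layers, where a later layer never counts unless every earlier one has been accepted on its own. The layer identifiers below are my shorthand for the sentence above, and the function is a hypothetical sketch.

```python
from typing import Optional, Set

DONE_LAYERS = [
    "fetching",
    "curation",
    "suwan_pickup",
    "day1_started",
    "long_term_capability",
]


def furthest_accepted(accepted: Set[str]) -> Optional[str]:
    """Return the last layer reached without skipping any earlier layer."""
    reached = None
    for layer in DONE_LAYERS:
        if layer not in accepted:
            break
        reached = layer
    return reached
```

If fetching and Suwan pickup are both marked accepted but curation is not, the honest status is still "fetching."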

How I understand an information agent now

A proper information-fetching agent can have "fetch" in its name, but the core isn't fetching.

It needs at least six layers:

source universe — knowing where it looks at the world from, which sources are worth watching long-term.
validation — every source has to be verifiable, demotable, replaceable, not permanently parked on the list.
item organization — turning what's fetched into trackable items, not scattered headlines.
quality judgment — identifying duplicates, clickbait, low value, risk, and what's actually worth reading.
handoff library — splitting different goals into different queues, especially for a content role like Suwan.
owner review — key states don't auto-promote past their authority; the final release is left to a human.

Once it gets here, I'm willing to accept Shen Zhixing at this stage. But I won't say it's done.

Because there's still real local-database path integration ahead, still Suwan's feedback on candidates flowing back, still source-quality maintenance after Day 1 runs. An information agent really matures not when it runs on day one, but when, after running for a while, it gets better and better at knowing what's worth reading and what shouldn't bother me.

The lesson from this round is simple: don't let the AI's "sense of completion" replace my "sense of acceptance."

GPT can help me organize a judgment, generate commands, summarize a stage — but it can't decide for me what "done" means. Real owner review is just relentless follow-up questioning: what does this result prove? What hasn't it proved? If we actually put this agent to work tomorrow, where would it break? Once those questions are answered clearly, Shen Zhixing slowly stops being a tool that fetches information and becomes an information role I can work with.

From Agent Roles to Work Contracts

Sat, 16 May 2026 00:00:00 GMT

OpenClaw multi-agent notes

Multi-agent isn't about adding a few more personas.

When you first start building a personal AI system, it's very easy to get pulled in by the "role" thing.

You give them names, write a backstory, write a tone, write what they're good at. That does help — without a stable temperament, an agent quickly degrades back into a temporary prompt.

But the more I do this, the more I think the actually hard part of multi-agent isn't writing each role to feel human. It's getting those roles to collaborate like a small company.

01 / The virtual company metaphor

I'm not building a chat group, I'm building a company

Put 8 agents together with nothing else and you get something more like a chat group. Everyone can say a few things, but who's responsible for what, who renders the final judgment, who leaves a deliverable, who picks up the next leg — none of it is clear.

What makes it actually feel like a company is that every role has a responsibility boundary.

Someone closes out the task, someone judges the content, someone researches and counter-checks, someone handles visual expression, someone does technical verification, someone audits and governs. The role name is just the entry point. The work contract is the skeleton.

Agent 01

Zhao Zilong

The main controller and write-authority — lands direction, tasks, and acceptance into files.

Agent 02

Ji Yanran

The hub and coordinator — receives tasks, breaks them into actions, holds context.

Agent 03

Suwan

Content and aesthetic judgment — decides what's worth writing and what shouldn't ship.

Agent 04

Huo Rui

Research and counter-evidence — finds evidence first, judges second, keeps risks on the record.

Agent 05

Linlu

Visual and pacing — turns content into image, sound, and timeline.

Agent 06

Shen Zhixing

Tech and verification — makes tools run, and can explain why they run.

Agent 07

Zhou Zhengqing

Audit and governance — checks standards, evidence chain, and handoff quality.

Agent 08

Walter

An outside observer seat — when there's no stable contract, mark as paused rather than fake it.

02 / The role is only the first layer

The real thing is the work contract

When I write an agent now, I don't just write "who you are." I push on four harder questions.

One, what inputs does it receive. Two, what is it allowed to do and not allowed to do. Three, what outputs must it leave behind. Four, in what situations must it stop and hand off to a human or another agent.

Once these are written clearly, the role stops being just a temperament — it becomes a node that can enter a workflow.
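The four questions translate directly into a record that can be checked for gaps. A minimal sketch, assuming a simple dataclass; the field names and the example values for Suwan are hypothetical, chosen only to show the shape.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkContract:
    agent: str
    inputs: List[str] = field(default_factory=list)            # one: what it receives
    allowed: List[str] = field(default_factory=list)           # two: what it may do
    forbidden: List[str] = field(default_factory=list)         # two: what it may not do
    required_outputs: List[str] = field(default_factory=list)  # three: what it must leave
    stop_conditions: List[str] = field(default_factory=list)   # four: when to hand off

    def unanswered(self) -> List[str]:
        """Names of the contract sections that are still empty."""
        sections = ("inputs", "allowed", "forbidden", "required_outputs", "stop_conditions")
        return [name for name in sections if not getattr(self, name)]


# Illustrative values only, not the real contract.
suwan = WorkContract(
    agent="Suwan",
    inputs=["content candidates from the candidate pool"],
    allowed=["select", "judge", "expand", "write drafts"],
    forbidden=["publish without owner review"],
    required_outputs=["draft file", "judgment note", "next step"],
)
```

In this example `suwan.unanswered()` reports the stop conditions as missing, so the role is still a temperament rather than a node in the workflow.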

03 / Handoff

No handoff, no team

Whether one agent can pick up after another decides whether they're actually a team.

So the output can't be just a paragraph of reply. It needs files, state, evidence, and a next step.

04 / Operating order

Companies run on systems, not enthusiasm

The thing a multi-agent system most fears is starting from scratch every time. Who did what today, why they did it, where the result lives — all of it has to be findable next time.

05 / Where I land now

The role should be likeable, but more importantly deliverable

I like writing personality into agents, because personality helps them form stable judgments. But if it stops at personality, it's still just "a role that talks."

For me, an agent becomes real when it can enter a long-running work system: it knows its position, it knows its boundary, it knows what evidence to leave, and it knows when to stop pretending to understand.

So when I look at "8 agents in a virtual company" now, I don't start by asking whether the roles are cool. I ask: what input do they receive, what judgment do they make, what files do they leave, who do they hand it to, and who can run the retrospective when something goes wrong. The role makes it feel human. The work contract makes it possible to work together.

From Running to Trustworthy

Sat, 16 May 2026 00:00:00 GMT

OpenClaw notes on a trustworthy system

A system that runs isn't yet a system I'd trust.

The moment a personal AI system is easiest to misjudge is the first moment it actually runs.

The service is up, the page opens, the agent replies, the tools fire. It's very easy to say at this point: good, system done.

But the more I work on this, the more I feel "it runs" is only the first layer. The hard part is: tomorrow, can we pick it back up; can another AI read it; when something breaks, can we run a retrospective; can I confidently hand it the next step.

So what I added afterward wasn't "swap in a smarter model" — it was a work system that lets it be checked, handed off, and recovered.

01 / The illusion of running

Being able to answer doesn't mean it knows what it's doing

A lot of AI systems already look like they can work: chat, write files, run scripts, call external tools.

But if it doesn't know which files are the canonical source, which actions require it to stop and confirm, or why a judgment was made last time, then it's only "running," not "trustworthy."

The trap I've fallen into most is mistaking one successful run for a system capability. What matters isn't whether it ran this time; it's whether it reproduces reliably next time.

So my fix was plain: take all the things that easily scatter inside chat, and move them into checkable files, task folders, and evidence chains.

File canonical source

State can't live in chat memory — it has to land in project state files, task folders, and output folders.

Task contract

Every task spells out goal, boundary, input, output, acceptance, and stop condition.

Memory as asset

Not "remember more chat." Distill judgments that can be searched, reused, scored, and iterated on.

Evidence chain

Important conclusions trace back to source, artifact, test, screenshot, report, or retrospective.

Handoff

What today's AI finishes, tomorrow's AI — or tomorrow's me — can keep picking up.

Locally controlled

Before trust is earned, stay a local candidate; don't rush to touch real external sending or production actions.

02 / My first step

Move the canonical source out of chat and into files

A lot of state used to live in chat: where we got to, which conclusion got overturned, which file is the latest version, which action only ran once.

Now I try to compress this into real files: project state, current task, notes, outputs, handoff, verification screenshots, build logs, audit tables. Chat can move things forward — it can't monopolize the truth.

This step sounds dumb, but it's critical. As long as the canonical source still lives in chat, the system can't really hand off. As soon as it lives in files, another agent can walk back in and pick up where the work stopped.
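
A minimal sketch of what "state lives in a file" can look like. The file name `project_state.json` and its fields are hypothetical, and the demo writes to a temporary directory so it touches nothing real.

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write project state to a file so it no longer lives only in chat."""
    path.write_text(json.dumps(state, indent=2, ensure_ascii=False), encoding="utf-8")


def load_state(path: Path) -> dict:
    """Read the state back; a fresh session or another agent starts here."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical file name and fields, written to a temp directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "project_state.json"
    save_state(state_file, {
        "current_task": "rewrite article draft",
        "overturned": ["publish raw notes as-is"],
        "latest_file": "outputs/draft_v3.md",
    })
    restored = load_state(state_file)

print(restored["latest_file"])
```

The point of the round trip is that `load_state` needs no chat history: where we got to, what was overturned, and which file is latest are all recoverable from the file alone.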

03 / Task contracts

A task is no longer one chat request

I write complex tasks as a small contract: what to do, what not to do, what it depends on, how completion is counted.

That way the agent isn't "improvising as it goes" — it's moving inside an explicit boundary.
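
The six fields from the task-contract card earlier (goal, boundary, input, output, acceptance, stop condition) can be checked before any work starts. The sketch below assumes a plain dictionary as the contract; the key names and the `validate_task_contract` helper are illustrative, not an existing API.

```python
REQUIRED_FIELDS = ("goal", "boundary", "input", "output", "acceptance", "stop_condition")


def validate_task_contract(contract: dict) -> list[str]:
    """Return the required fields that are missing or empty."""
    return [name for name in REQUIRED_FIELDS if not contract.get(name)]


# Example task; all values are made up for the demo.
task = {
    "goal": "turn internal notes into one public article",
    "boundary": "no real paths, keys, or internal names",
    "input": ["notes/retrospective.md"],
    "output": ["outputs/article.md"],
    "acceptance": "previewed locally and reads like an article",
    # stop_condition left out on purpose
}
problems = validate_task_contract(task)
print(problems)
```

A task that fails this check is still a chat request; the missing field is usually the stop condition, which is exactly where an agent starts improvising.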

04 / Memory as an asset

Not "remember everything"

What I actually want isn't "it remembers everything" — it's that the key experience can be retrieved, reused, scored, and iterated.

Memory isn't a favorites folder. It has to become a work asset usable next time.

05 / Evidence, audit, handoff

Trustworthy means someone else can run the retrospective, and the work can continue tomorrow

I'm less and less willing to just write "done." What I'd rather see: what sources were used, which files changed, where the artifacts live, how it was verified, where the warnings still are.

That's why I keep notes, outputs, screenshots, build results, audit conclusions, and handoff. They're not formality — they're how the system gets picked up by the next person, the next AI, the next session.
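
To make the difference from a bare "done" concrete, here is a small sketch that renders the five questions above into a report. The function name, section titles, and example paths are assumptions for illustration.

```python
def completion_report(sources, changed_files, artifacts, verification, warnings):
    """Build a plain-text report that replaces a bare "done"."""
    sections = [
        ("Sources used", sources),
        ("Files changed", changed_files),
        ("Artifacts", artifacts),
        ("How it was verified", verification),
        ("Open warnings", warnings),
    ]
    lines = []
    for title, items in sections:
        lines.append(f"{title}:")
        # An empty section is stated explicitly instead of silently dropped.
        lines.extend(f"  - {item}" for item in (items or ["(none recorded)"]))
    return "\n".join(lines)


# Example values are made up for the demo.
report = completion_report(
    sources=["notes/interview.md"],
    changed_files=["outputs/article.md"],
    artifacts=["outputs/preview.png"],
    verification=["local preview opened and read end to end"],
    warnings=[],
)
print(report)
```

Printing "(none recorded)" for an empty section is a deliberate choice: the next reader can tell the difference between "no warnings" and "nobody looked."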

Before that point, I'd rather call it a locally controlled candidate version. It can prove direction; it doesn't equal "ready to ship," doesn't equal "ready to send externally," doesn't equal "ready to touch real transactions or production actions."

So when I look at a personal AI system now, what I care about most isn't "can it answer" — it's "can it be accountable." My fix isn't some mystical architecture either: file canonical source, task contract, memory as asset, evidence chain, handoff, locally controlled boundary. Running is a technical state. Trustworthy is a working relationship. The first one shows me possibility; the second one is what lets me hand it the real thing.

Raw Chat Is Not a Knowledge Base

Sat, 16 May 2026 00:00:00 GMT

Personal knowledge base pothole notes

I thought it was simpler than it is.

When I first wanted to organize my own chat logs, I had a very direct idea: I've discussed so many things with AI, friends and colleagues — surely all of that is already a mine.

So shouldn't I just export it, chunk it, embed it, push it into a vector store, and then any time later I can ask "how did I decide on this before?"

The idea is seductive. It looks low-effort and it fits the popular picture of a "knowledge base": shovel material in, AI fetches it back for you.

The pothole I actually hit was this: being able to retrieve isn't the same as being able to use it correctly.

01 / First pothole

It really does retrieve — but I can't trust it directly

What actually put me on alert wasn't the system failing to find things. It was the system finding too many things that "looked relevant."

A passage might have been a temporary thought at the time, an angle I was testing, a judgment that got overturned later, or just a transition line to keep the conversation moving.

On the day, I can read it correctly because I remember the scene the conversation happened in. I know what was asked before and why I changed my mind after. But when an AI later only gets a few of those sentences, it's easy for it to treat "once said" as "still true."

That's when it hit me: raw chat logs aren't written like knowledge. They're written for moving the moment forward.

Pothole 01

What gets retrieved is fragments

It can find a sentence, but not always know why that sentence was said in the first place.

Pothole 02

Drafts look like conclusions

A lot of discussion is just probing a direction; in retrieval it later reads like a final judgment.

Pothole 03

Old decisions come back to life

Plans that were already overturned, if their status isn't marked, still get pulled back out.

Pothole 04

Noise gets amplified

Small talk, detours, mood, and repeated confirmations all hurt downstream recall quality.

Pothole 05

Boundaries get mixed up

Private relationships, project judgments, public material and methodology can't share one retrieval surface.

Pothole 06

Nothing can be handed off

Given only a snippet of chat, another AI can't tell whether to treat it as evidence or background.

02 / Second pothole

A vector store solves recall, not judgment

Once I broke the problem apart, I realized I'd conflated two things at the start.

A vector store helps me pull similar content back — that's a recall problem. But what a knowledge base really has to solve is a judgment problem: can this content be trusted, where does it apply, has it expired, can it guide the next action.

Without those annotations, more chat only makes the system seem more knowledgeable. It can quote a lot of old lines without knowing which old lines shouldn't be used anymore.

So I don't treat "retrievable" as "knowledge base done." Retrievable is step one. After that, what was retrieved still has to be curated into knowledge that can carry responsibility.
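
The gap between recall and judgment can be shown with a few lines. In this sketch the similarity scores stand in for what a vector store returns, while `status` and `recheck_by` are the hypothetical annotations that recall alone does not provide; all items are invented for the example.

```python
from datetime import date

# Hypothetical retrieved items: similarity comes from recall,
# status and recheck_by are the annotations that recall alone lacks.
retrieved = [
    {"text": "Use plan A for exports", "similarity": 0.91,
     "status": "overturned", "recheck_by": date(2026, 1, 1)},
    {"text": "Use plan B for exports", "similarity": 0.88,
     "status": "verified", "recheck_by": date(2026, 12, 1)},
    {"text": "Maybe try plan C?", "similarity": 0.86,
     "status": "draft", "recheck_by": date(2026, 12, 1)},
]


def usable(items, today):
    """Keep only items that are verified and not past their recheck date."""
    return [
        item for item in items
        if item["status"] == "verified" and item["recheck_by"] >= today
    ]


for item in usable(retrieved, today=date(2026, 5, 16)):
    print(item["text"])
```

Note that the highest-similarity item is the overturned one. Ranking by similarity alone would bring the old decision back to life.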

03 / What I do now

Distill into a knowledge card first

These days I prefer to first pull out the judgments that actually held up in the chat, and then write them up as a knowledge card.

One card has to make at least these things clear: what the conclusion is, where it comes from, where it applies, how confident I am, and when it needs to be rechecked.
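
A knowledge card with those five points might look like the following. This is a sketch under assumptions: the `KnowledgeCard` class, the numeric confidence scale, and the example values are illustrative choices, not a fixed schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class KnowledgeCard:
    """Hypothetical card; the fields follow the five points listed above."""
    conclusion: str
    source: str
    scope: str
    confidence: float  # 0.0 to 1.0, the author's own estimate
    recheck_on: date

    def needs_recheck(self, today: date) -> bool:
        """True once the recheck date has arrived."""
        return today >= self.recheck_on


# Example values are made up for the demo.
card = KnowledgeCard(
    conclusion="Raw chat logs are a material pool, not a knowledge base",
    source="retrospective on a first vector-store attempt",
    scope="personal knowledge base built from chat exports",
    confidence=0.8,
    recheck_on=date(2026, 11, 16),
)
print(card.needs_recheck(date(2026, 5, 16)))
```

Making the card immutable (`frozen=True`) is a design choice: revising a judgment means writing a new card, so the old one stays traceable instead of being silently overwritten.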

04 / Sort boundaries first

Not every memory belongs in one place

Raw chat contains personal relationships, commercial context, unfinished judgments, and sensitive detail.

That stuff can't share a surface with public article material, project experience and general methodology. A knowledge base without boundaries is more dangerous the smarter it gets.
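
One simple way to keep those categories apart is to sort items into separate retrieval surfaces before anything is indexed. The label names and the `split_surfaces` helper below are hypothetical; they only restate the categories from the two sentences above.

```python
from collections import defaultdict

# Hypothetical boundary labels taken from the categories named above.
PRIVATE = {"personal_relationship", "commercial_context",
           "unfinished_judgment", "sensitive_detail"}
SHARED = {"public_material", "project_experience", "methodology"}


def split_surfaces(items):
    """Group items into separate retrieval surfaces by their boundary label."""
    surfaces = defaultdict(list)
    for item in items:
        label = item["label"]
        if label in PRIVATE:
            surfaces["private"].append(item["text"])
        elif label in SHARED:
            surfaces["shared"].append(item["text"])
        else:
            # Unlabeled or unknown content is held back, not indexed by default.
            surfaces["quarantine"].append(item["text"])
    return dict(surfaces)


# Example items are made up for the demo.
items = [
    {"text": "A disagreement with a colleague", "label": "personal_relationship"},
    {"text": "Distill before you index", "label": "methodology"},
    {"text": "Exported chat, not yet sorted", "label": "unsorted"},
]
surfaces = split_surfaces(items)
print(sorted(surfaces))
```

The default branch matters most: content without a boundary label goes to quarantine rather than into the shared surface.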

05 / The takeaway after the potholes

Chat logs are a mine, not a toolbox

So now I treat raw chat logs as a material pool, not a knowledge base.

They still matter. They hold where ideas came from, the hesitations of the moment, and a lot of detail that I'd otherwise forget. But without curation it's hard to turn any of that directly into the basis for the next action.

What's actually worth keeping is the judgment left after the chat: a verified conclusion, a process that can be reused, a pothole already hit, a clear preference, a task context that can be handed off to the next AI.

The curation layer in the middle can't be skipped. Drop the noise, keep the sources, mark the status, write down the applicable scope, turn conclusions into assets ready for use.

I'm not chasing "remember every chat" anymore. What I want is to take the genuinely useful experience inside the chats and curate it into a knowledge asset that's searchable, reusable, updatable and auditable. The raw chat log can stay, but it's only the mine. The knowledge base should be the toolbox refined out of it.

From Internal Engineering Notes to Public Writing

Sat, 16 May 2026 00:00:00 GMT

YunLab writing notes

I want to write this one honestly.

Recently I've been turning some internal engineering material from OpenClaw into articles that can live publicly on YunLab.ai. When I started, the natural assumption was: the internal material is already detailed — task, conclusion, retrospective, boundary, verification record. So I just delete the sensitive bits, rephrase the rest, and I have a public article, right?

Once I actually started, it wasn't that simple.

The biggest value of internal material is letting the next person, the next AI, the next session pick the work up. It aims for accuracy, completeness, traceability. Where the files are, how far this task has gotten, whether this judgment has evidence behind it, whether this capability is still just a local candidate — all of that has to be spelled out.

But a public article isn't read that way. When someone opens YunLab.ai, they're not picking up my local project, and they're not auditing my task directory. They're more likely just trying to understand: why am I building a personal AI system this way? What potholes did I hit? Which judgments could transfer to their own system?

My first attempt kept turning into an internal handoff

That was the first pothole.

I would write straight down the source material: here's a state file, there's a handoff, this module passed, that audit scored X, this directory has more outputs underneath. While writing, it all felt right — these things did genuinely happen.

But once the page rendered, the flavor was off. It didn't read like an article. It read like I'd pasted a photo of my workbench in public. Lots of information, but the reader couldn't get in. Worse, it exposed too much backstage in public view: paths, processes, capability boundaries, unfinished states, internal names, even working habits that should only stay with the local system.

That's when I realized public writing isn't "compressing" internal material. It first has to change the question.

Internal material asks: where is this task now, how do we continue. A public article asks: what experience is behind this, which parts of it are worth keeping.

Redaction only solves a small part

The second pothole was trusting "redaction" too much at first.

I used to feel that as long as I covered the tokens, accounts, paths and internal names, the content was safe. Later I realized that's only the lowest layer of safety. What's actually likely to go wrong isn't always a particular string — it's the state the article conveys.

For instance, a system that's only a local candidate can't be written as if it's already live and stable. A process that's only been run once can't be written as a mature method. An agent still being tuned for persona, permissions and memory can't be cherry-picked to look like it's already deliverable.

That kind of problem isn't fixed by redaction. It's fixed by honestly writing the boundary.

What I now mean by "public-safe" isn't only "nothing leaked." It also includes "not overstated, not misleading, not treating the backstage as the result."

Now I ask first: what can this piece actually leave in public?

Now when I write an article from internal material, I pause before moving any content.

I ask one dumb question first: out of all this material, what can actually be left in public?

Sometimes the answer is a pothole — for example, why raw chat logs can't directly become a knowledge base. Sometimes the answer is a boundary — for example, an agent's role can't only describe personality; it also needs a work contract. Sometimes the answer is a method — for example, a personal AI system can't only aim for "running"; it also has to be verifiable, handed off, continued.

Once I find that public angle, the internal material becomes useful again. It's no longer the structure of the article. It's the evidence and memory behind me. I can use it to think clearly, without putting it on display.

I roughly sort content into three buckets now:

Publishable: potholes I've hit, judgments I corrected, methods that can be reused, and honest impressions of personal AI systems.
Needs abstraction: internal tasks, roles, audits, memory, workflows. I can talk principles, but I won't lay out the backstage detail as-is.
Must stay local: keys, accounts, full prompts, real directories, unreviewed conclusions, private-relationship context, and any operational entry point that could be misused.

The taxonomy isn't complicated, but it saves me from a very common illusion: that an article is more "real" the closer it stays to the internal source. It isn't. The realness of a public article doesn't come from spreading out the backstage. It comes from describing the experience accurately.
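
The "must stay local" bucket is the one part of this that can be partly checked by machine. The sketch below scans a draft for two kinds of strings; the patterns are illustrative assumptions and would need tuning for a real system. It covers only the lowest layer of safety described earlier: no pattern can tell whether an article overstates the state of a system.

```python
import re

# Illustrative patterns only; a real list would be tuned to the actual system.
LOCAL_ONLY_PATTERNS = {
    "absolute path": re.compile(r"(?:/Users/|/home/|[A-Za-z]:\\)\S+"),
    "key-like token": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b"),
}


def find_local_only(text):
    """Return (kind, matched_string) pairs for anything that should stay local."""
    hits = []
    for kind, pattern in LOCAL_ONLY_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


# The path and the token below are fake values made up for the demo.
draft = (
    "The state file lives at /home/me/project/state.json "
    "and the agent authenticates with sk-EXAMPLEEXAMPLE1234."
)
for kind, value in find_local_only(draft):
    print(f"{kind}: {value}")
```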

So this isn't "desensitizing" — it's rewriting

These days I see it as a rewrite.

Internal material exists to let the system continue working. Public articles exist to be read by people. The first needs enough operational context. The second needs that context distilled into understandable experience. Both matter, but they can't be mixed.

That's also why I preview locally before I finish writing now. Just reading the Markdown makes "this is fine" too easy. Once the page opens, problems show up: does it sound like me, is the title trying too hard, are there too many cards, is the type fatiguing, did a sentence overstate some internal state.

If it reads like a manual, I rewrite. If it reads like an internal report, I rewrite. If it reads like an AI-curated "best practice," I rewrite harder. YunLab.ai isn't a company website I'm packaging. It's more like a public experimental record I'm keeping for myself. Articles here can have structure, but they can't lose the personal trace.

The boundary I ended up giving myself: public articles carry experience, not backstage.

The backstage stays local, still carrying tasks, evidence, permissions, audits and handoffs. The articles go onto YunLab.ai, carrying only the judgments I'm willing to be publicly accountable for long-term. Writing this way is slower, but it's closer to what I actually want to leave behind.

OpenClaw Agent Settings

Thu, 14 May 2026 00:00:00 GMT

OpenClaw foundation notes

The persona isn't a skin; it's taste.

A lot of people, the first time they build an agent, naturally write something like: "You are my personal assistant."

Not wrong, but very thin. It's like a temp-worker badge. Stick it on, and the agent can start working — it can answer questions, write things, look things up. But you'll quickly find that every time it wakes up, it's like it just met you.

Today it sounds like customer service, tomorrow like an intern, the day after like a search engine.

01 / It's not that the model isn't strong enough

It never really "became a person"

It's not that the model isn't strong enough — it's that it never really "became a person."

The more I think about this, the more I feel: the first step in building an agent isn't writing features — it's writing the taste.

The taste I mean here isn't whether the page looks good, or whether the speech sounds high-brow. Taste is how a person sees the world: what they think matters, what doesn't; what's acceptable to them, what must be redone; why they make the judgments they make; where they came from, what they've been through — that's why they carry the temperament they carry today.

02 / Capability alone isn't enough

"Can do" and "has judgment" are different things

Without those, an agent is just a bundle of capabilities.

Can search, can summarize, can write, can call tools. Sounds strong but feels scattered. It can complete the task — it doesn't carry stable judgment. It can mimic tone — it doesn't have its own standard.

When I wrote Suwan's setup, this got especially obvious.

03 / The Suwan example

Not because she can write, but because she knows what's worth writing.

If I'd just written "Suwan is the content director, in charge of intel, analysis, and writing," the role wouldn't have stood up. Because that's just a position, not a person.

What makes her real is why she's strict, why she can't accept unverified information, why she thinks the wrong illustration can ruin an article, why when she says "watch this one," others should stop and listen.

All of that together is Suwan.

I pulled this example out into a standalone taste sample. I don't post the full internal setup file — just the layer I think can be publicly understood.

Read the Suwan taste sample

04 / What a full persona is

An origin, a standard, an obsession, a bottom line

That's what I mean by taste.

A full persona should have an origin, a standard, an obsession, and a bottom line. It doesn't just tell the agent "what to do" — it tells it "why you would do it that way."

So when I look at SOUL.md inside OpenClaw, I don't treat it as a decoration file.

SOUL.md isn't there to make the agent feel more like a novel character. It's there to give the agent a stable internal order.

05 / Three files

Soul, rules, relationships

Then comes AGENTS.md.

AGENTS.md writes down how it works: what it can do directly, what it must stop and ask about, how it handles uncertain information, what counts as done, and what only looks done.

06 / Entering my workflow

It also needs to know who it's collaborating with

There's also USER.md. It records who the agent is collaborating with.

It needs to know who I am, what I care about, what kind of fobbing-off I hate, and under what conditions I'd say "that'll do."

07 / Where I land now

"Prompt" is too light

Put these three files together and I think that's where an agent actually starts to form.

SOUL.md gives it soul, AGENTS.md gives it rules, USER.md gives it relationship.

With only rules and no soul, it becomes a very obedient tool with no judgment. With only soul and no rules, it might have personality but be unreliable. Without understanding the user, however complete it is, it's still a role floating in the air — it doesn't enter my workflow.

I'm less and less willing to think of an agent's setup as a "prompt." The word "prompt" is too light. I'd rather understand it as modeling: building up an object you can collaborate with long-term, layer by layer — from taste, judgment, rules, relationship. It's not for being dazzling on the first conversation. It's so that on the tenth, fiftieth, hundredth wake-up, it's still the same person.

Don't Stock the Tank Before You Finish the Build

Thu, 14 May 2026 00:00:00 GMT

OpenClaw foundation notes

Installed just means you have the keys.

There are already lots of tutorials on how to install it, how to plug in models, how to add skills. What I think no one's spelled out clearly is: stock OpenClaw is more like a bare-shell apartment.

Livable doesn't mean nice to live in; able to chat doesn't mean smooth to work with. Walking in right after a fresh install with "hi, introduce yourself" is a bit like grabbing the keys and going straight in to sleep on the floor.

01 / Usable and smooth

A lot of people aren't bad at using it — the foundation isn't built

People using "the lobster" easily fall into one illusion: it can chat, it can call tools, it can hold a bit of context — so the system is built.

That's "usable," sure. But between "usable" and "I'd hand it the real thing" there's a long stretch. Having to re-explain every time, the style drifting, the boundary not steady — that's not on the model alone.

The way I think about it: model and skills are the furniture; the workspace is the renovation plan.

AGENTS.md

Site rules

What's allowed, what isn't, what has to stop and check with me.

SOUL.md

Speaking temperament

How it expresses a judgment, admits uncertainty, flags risk.

USER.md

Knows who I am

What I care about, what I dislike, what actually counts as done.

TOOLS.md

Tool brakes

The stronger the tool, the more it needs to know where to let go and where not to touch.

MEMORY.md

Long-term shelf

Not every chat is worth keeping — what matters is judgment reusable later.

skills/

Workshop

Put the recurring flows in here so next time we don't have to improvise.
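
A quick way to see whether the foundation is in place is to check the workspace for the entries listed above. The file and directory names come from that list; the `missing_foundation` helper and the rule that an empty file counts as missing are assumptions made for this sketch, which runs against a temporary directory.

```python
import tempfile
from pathlib import Path

# Names as listed above; the check itself is a hypothetical helper.
EXPECTED = ["AGENTS.md", "SOUL.md", "USER.md", "TOOLS.md", "MEMORY.md", "skills"]


def missing_foundation(workspace: Path) -> list[str]:
    """List expected entries that are absent or, for files, empty."""
    missing = []
    for name in EXPECTED:
        entry = workspace / name
        if not entry.exists():
            missing.append(name)
        elif entry.is_file() and entry.stat().st_size == 0:
            missing.append(name)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "AGENTS.md").write_text("Stop and ask before editing files.\n", encoding="utf-8")
    (ws / "SOUL.md").write_text("", encoding="utf-8")  # present but empty
    (ws / "skills").mkdir()
    gaps = missing_foundation(ws)

print(gaps)
```

An empty `SOUL.md` is reported as missing on purpose: a file that exists but says nothing is still a bare-shell apartment.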

02 / Brakes first

An over-eager agent breaks things too

Ask it to take a look, it goes ahead and edits; ask it to analyze, it starts refactoring. AGENTS and TOOLS have to hold the boundary first.
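
The boundary can be as plain as a lookup that runs before any tool call. The action names and the three outcomes below are hypothetical; the sketch only illustrates the idea of braking before acting.

```python
# Hypothetical action names; a real setup would list its own tools.
READ_ONLY = {"read_file", "search", "list_dir"}
NEEDS_CONFIRMATION = {"edit_file", "run_script", "send_message"}


def gate(action: str, confirmed: bool = False) -> str:
    """Decide whether an action may run: "allow", "ask", or "deny"."""
    if action in READ_ONLY:
        return "allow"
    if action in NEEDS_CONFIRMATION:
        return "allow" if confirmed else "ask"
    # Anything not explicitly listed is refused rather than guessed at.
    return "deny"


print(gate("read_file"))                  # looking is always fine
print(gate("edit_file"))                  # editing must stop and ask first
print(gate("edit_file", confirmed=True))  # editing after confirmation
```

With this in place, "take a look" maps to read-only actions, and an edit only happens after an explicit confirmation.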

03 / Memory has to be accurate

Memory isn't a favorites folder

`memory/` can hold the running log; `MEMORY.md` should hold long-term assets. Save everything and you find nothing.

04 / Skills are craft

Install less, sharpen more

Writing one article can be done on the fly. After ten, the flow should settle into a fixed sequence: audience, angle, structure, draft, check, publish.

05 / Multi-agent

Not just stocking more lobsters

Multi-agent is more like a small team. Research, writing, execution, audit need their own roles — and they need to actually hand off through handoff files and outputs.

06 / Where I land now

Actually good means continuable

What you did today can still be picked up tomorrow; the judgment from this session can still be found in the next one; what one agent finishes, another agent can take over.

Chat is like air — it flows past and it's gone. Files are like the floor — you can put weight on them.

So I'd rather go back and fix the workspace first: get the rules, personality, user preferences, tool boundaries, memory, and skills lined up. Otherwise you're buying expensive furniture inside a bare-shell apartment — it looks pricey, but it's awkward to use.

A Content Agent Aesthetic Sample

Thu, 14 May 2026 00:00:00 GMT

Agent aesthetic sample

Suwan is not a writer.

This is an aesthetic sample I wrote for a content-type agent. It's not a real-person profile, and it's not the full internal setup file.

What I want to share publicly isn't "what features Suwan (苏晚, the content agent) has." It's how a content agent grows judgment first, and only then goes to write.

If I only wrote "she's the content lead, in charge of intelligence, analysis and writing," the role wouldn't actually stand up. That's a job description, not a person.

01 / Content isn't text

Text, image, layout, rhythm — all one whole

The first layer of Suwan's role isn't "writes well." It's that she has a whole-piece sense of what content is.

An article isn't only text. Text, image, layout, whitespace and rhythm together are the content. If any one of them is off, the whole thing collapses.

So she doesn't treat "bad image" as a small problem. For her, mismatched text-and-image isn't a decorative failure — it's a content-judgment failure.

02 / Judgment before expression

Not because she can write — because she knows what's worth writing

The capability I most want Suwan to have isn't writing a piece of material beautifully. It's first judging whether the thing is worth writing at all.

A news item, a set of numbers, an industry move — on the surface they're all just information. What actually matters is: why now, why this move, is there a structural shift behind it.

This is the difference between a content-type agent and an ordinary writing tool. An ordinary tool answers "how to write." Suwan has to answer "why write" first.

03 / Nose for signal

Where others see noise, she has to see signal.

When I wrote Suwan, I cared a lot about "the nose."

A small policy change, an unusual move by an industry leader, an obscure forum thread — often these don't look important. But real content judgment frequently begins from exactly those small places.

So she shouldn't wait for someone to feed her the hot story. She should be able to calmly say: note this one.

04 / Standard

Volume and quality shouldn't cancel each other out

I don't want this agent chasing speed while quietly accepting a quality drop as the default.

Professional content isn't something you stumble into through inspiration. It comes from standards, discipline, and long training. Writing fast shouldn't be an excuse for roughness; writing slowly doesn't automatically mean care.

For Suwan, "this is good enough" should be a heavy judgment. It means the information clears the bar, the structure clears the bar, the expression clears the bar, and the text-image whole clears the bar.

05 / Why this counts as aesthetic

Aesthetic isn't style — it's selection

So this Suwan sample, to me, isn't a "character story."

What it actually shows: an agent meant to collaborate long-term can't only have a capability list. It has to have selection, standards, things it won't accept, and a way of seeing the world.

That's what I mean by aesthetic. Not appearance, not a skin, not a pretty list of persona words — an inner order that shapes every judgment.

The most important thing about a content agent isn't "can write." It's knowing what's worth writing, knowing why to write it, and knowing what shouldn't be shipped at all.
