OpenClaw Studio · Incident Postmortem
I woke up at dawn on day one, opened Feishu, and saw nothing. Then I opened the local directory — four drafts, four research memos sitting there quietly. The tasks had all run. Not a single notification had told me.
The night before, I had just finished installing OpenClaw Studio's seven-day trial: a LaunchAgent (a macOS background job that launchd loads when I log in) that wakes up every 30 minutes to trigger a few agents on a schedule (Suwan, Huo Rui, Shen Zhixing). My plan was simple: let it run for two days, see how the cadence feels. If it stays stable, I'll push more complicated work onto it.
Turns out, at dawn on day one, it told me stability is a luxury.
What I installed
The trial layer does something pretty simple. A LaunchAgent fires every 30 minutes and wakes up the runner — a Python script that does the actual dispatch. The runner walks the schedule, and when something is due, it triggers the corresponding agent, writes the output to a local directory, then sends me a Feishu message — Feishu being the messaging app I use for system notifications — telling me it's done.
Mornings are the densest stretch. 06:50, Suwan starts drafting the morning brief. 07:30, second task. 08:40, third. 10:30, fourth. Each one has its own owner, its own output file, its own Feishu message it's supposed to send. The whole design is: "I wake up, I open my phone, four messages are lined up in Feishu by time, I tap whichever one I want to read."
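A minimal sketch of the "walk the schedule, find what is due" step, assuming the runner tracks which tasks have already run today. The four times come from the schedule above; the task names, the `SCHEDULE` table, and `due_tasks` are hypothetical and not the runner's actual code.

```python
from datetime import datetime, time

# Morning slots from the post; the task names are placeholders.
SCHEDULE = {
    "morning_brief": time(6, 50),
    "task_2": time(7, 30),
    "task_3": time(8, 40),
    "task_4": time(10, 30),
}


def due_tasks(now: datetime, done_today: set) -> list:
    """Return tasks whose slot has passed today and that have not run yet."""
    return [
        name
        for name, slot in SCHEDULE.items()
        if slot <= now.time() and name not in done_today
    ]
```

Because the LaunchAgent only wakes every 30 minutes, a check of the form "slot has passed and task not yet done" is more forgiving than an exact time match, which would miss slots that fall between wakeups.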
That's what the design said. What actually happened on the first morning was nothing like it.
Four "half-hung" tasks at dawn
The 06:50 task did run. Suwan's morning brief got written, the file was in the right place, the timestamp checked out. 07:30 ran. 08:40 ran. 10:30 ran too. Four output files, all with returncode 0 — the exit code from the command.
But Feishu was empty.
At first I thought the Feishu bot was broken — expired token? webhook changed? I dug into the logs and found every single Feishu notification had blown up with the exact same error:
env: node: No such file or directory
Node was missing. I stared at that line for a few seconds. I had literally just run which node in my terminal — /opt/homebrew/bin/node, plain as day. How could the runner get halfway through and then claim node didn't exist when it tried to send a Feishu message?
What made it worse was the shape of the failure — this "half-hung" mode where the business work completes cleanly and the notifications all quietly die. It's not "the system is down," which is at least an honest kind of incident. It's "part of the system is fine, the other part is rotting in silence." The outputs really existed. The drafts really got written. The research memos were really produced. But unless I went and looked at the local directory on my own, I had zero way of knowing any of it was there.
The thing a scheduler should fear most is exactly this: "I thought it didn't run, but actually it did; I thought it ran, but actually it never did." That kind of failure destroys your trust in the system.
Root cause: a LaunchAgent's PATH is not "the same as your terminal"
Debugging this, I first looked at the runner's own environment. The runner is Python, running inside a venv — the Python virtual environment — with a hardcoded interpreter path and all dependencies bundled in. So the runner itself starts fine.
But the runner doesn't call Python directly when it sends a Feishu message. It execs a CLI command called openclaw. That CLI is written in Node, and its shebang (the first line of a script, which tells the OS which interpreter to use) is #!/usr/bin/env node.
That's where the problem lives. env node has to look up node in PATH, the search path the OS uses to find commands. The PATH in my terminal is built up layer by layer from my shell config — /opt/homebrew/bin, /usr/local/bin, various language version managers, various personal bin directories, a long list. But the default PATH for a LaunchAgent started by macOS launchd is brutally minimal. Just these four entries:
/usr/bin:/bin:/usr/sbin:/sbin
Homebrew on Apple Silicon installs to /opt/homebrew/bin. That directory is not in the LaunchAgent's PATH. So env node can never find node, the shebang fast-fails — fails the moment it starts — and the entire openclaw CLI exits without running a single line.
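One way to reproduce the lookup from a terminal, without waiting for launchd, is to ask Python to resolve node against the minimal PATH directly. This is a sketch: `resolves_under` is a hypothetical helper name, while `shutil.which` and its `path` argument are standard library.

```python
import shutil

# The default PATH the post reports for a launchd-started LaunchAgent.
LAUNCHD_DEFAULT_PATH = "/usr/bin:/bin:/usr/sbin:/sbin"


def resolves_under(command: str, path: str = LAUNCHD_DEFAULT_PATH):
    """Return the full path `command` resolves to under `path`, or None."""
    return shutil.which(command, path=path)


if __name__ == "__main__":
    # Compare the launchd default with the same PATH plus the Homebrew prefix.
    for candidate in (
        LAUNCHD_DEFAULT_PATH,
        "/opt/homebrew/bin:" + LAUNCHD_DEFAULT_PATH,
    ):
        print(candidate, "->", resolves_under("node", candidate))
```

On a machine where node exists only under /opt/homebrew/bin, the first lookup should come back as None, which is the same miss that env node hits inside the LaunchAgent.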
A lot of people who hit this for the first time can't believe it. "But it works in my terminal!" Because we all subconsciously assume that our computer is our computer, and PATH should be the same everywhere. A LaunchAgent is not your terminal, though. A LaunchAgent is a child process spawned by launchd, with environment variables defined by launchd itself, completely unrelated to your shell config.
The nastiest part is that the business agents were unaffected. Suwan runs inside a Python venv with a hardcoded path. So does Huo Rui. So does Shen Zhixing. None of them depend on PATH to find an interpreter. So the business layer kept working, outputs kept getting written. Only the notification layer — the one that uses a Node CLI to talk to Feishu — was dead.
Business succeeded, notifications all failed. That's how you get the strange spectacle of "everything succeeded and everything failed at the same time."
Fixed the root cause, still had to write a watchdog
The root cause fix is literally one line — at module import time in the runner (the code that runs when the Python module is loaded), prepend /opt/homebrew/bin to PATH. Next time the LaunchAgent wakes the runner, the runner patches PATH as it loads, and the openclaw CLI finds node.
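A sketch of what that import-time patch could look like. The helper name `prepend_to_path` and the guard against duplicate entries are assumptions; the post only says the Homebrew directory is prepended when the module loads.

```python
import os

HOMEBREW_BIN = "/opt/homebrew/bin"  # Homebrew prefix on Apple Silicon, per the post


def prepend_to_path(directory: str, env=os.environ) -> None:
    """Put `directory` at the front of PATH unless it is already listed."""
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        env["PATH"] = os.pathsep.join([directory, *entries])


# Runs at import time, before the runner spawns any subprocess.
prepend_to_path(HOMEBREW_BIN)
```

Child processes started afterwards inherit the patched `os.environ`, so the env node lookup in the openclaw shebang sees the Homebrew directory too.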
The question is: is that line enough?
I sat with it for a few minutes and decided no.
The reason is simple. Today, the thing that broke was a Node CLI not finding node. Next time it'll be something else — some Python package importing a system binary that isn't there, some third-party tool whose path moved after an upgrade and broke a hardcoded reference, some macOS update quietly mutating launchd's environment, or even Feishu having a five-minute outage on their end. None of these will show up in the convenient form of "node not found." They'll arrive wearing new costumes. But the shape will always be the same: business runs, delivery notification gets dropped.
You can fix root causes one at a time, but "half-hung" as a failure mode doesn't go away. So fixing the PATH is the root cause fix, and the watchdog — the kind that automatically recovers from incidents — is a separate layer. It doesn't solve any specific root cause; it just gives every future half-hung incident a chance to auto-recover.
Runner v2.3 was built around exactly this idea. Four pieces of work:
- First, the root cause fix — prepend the Homebrew path to PATH at module import, so today's incident can't recur in that form.
- Second, the retry channel, retry_pending_notifications. Every time the LaunchAgent wakes up, it scans recent tasks. If it finds one where the output exists but the notification was never sent, it retries the notification automatically. Each task gets up to four retries.
- Third, the deterministic watchdog. On every wakeup it actively checks four classes of problems: task_missed (task didn't run), output_missing (output should be there but isn't), notification_missing (output exists but no notification went out), boundary_fail (cross-boundary state inconsistencies). If it finds one, it sends a deduplicated Feishu alert telling me what happened, where, and when.
- Fourth, the Codex watchdog checkpoint — six times a day at fixed moments, run a Codex exec — Codex being OpenAI's CLI agent — inside a read-only sandbox, audit the day's full scheduling state, write a markdown + JSON checkpoint, and send an extra Feishu summary.
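The retry channel from the second item can be sketched roughly as follows. Only the function name retry_pending_notifications and the four-retry cap come from the post; the `TaskRecord` shape, the `send` callback, and the return value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 4  # "up to four retries" per task, per the post


@dataclass
class TaskRecord:  # hypothetical record shape
    name: str
    output_exists: bool
    notified: bool = False
    retries: int = 0


def retry_pending_notifications(tasks, send: Callable[[TaskRecord], bool]) -> list:
    """Resend notifications for tasks that have output but no notification.

    Never re-runs an agent; the only side effect is calling `send`.
    """
    recovered = []
    for task in tasks:
        if not task.output_exists or task.notified or task.retries >= MAX_RETRIES:
            continue
        task.retries += 1
        if send(task):
            task.notified = True
            recovered.append(task.name)
    return recovered
```

Keeping the agent invocation out of this function entirely is what makes the retry safe to run on every wakeup: the worst case is a duplicate message, not a duplicated business operation.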
The second and third pieces are symmetric. The retry channel says "I see something got dropped, I'll quietly recover it." The deterministic watchdog says "I see something got dropped, here's a heads-up." Both are safety nets, not the primary path. The primary path will always be the runner sending the notification successfully on the first try.
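For the deterministic side, the check can be as simple as mapping each due task's state to one finding. This sketch covers three of the four classes named above; boundary_fail is left out because the post does not define it precisely enough to code without guessing. The function name and its boolean inputs are assumptions.

```python
from typing import Optional


def classify(ran: bool, output_exists: bool, notified: bool) -> Optional[str]:
    """Map one due task's state to a watchdog finding, or None if healthy."""
    if not ran:
        return "task_missed"
    if not output_exists:
        return "output_missing"
    if not notified:
        return "notification_missing"
    return None
```

The order of the checks matters: a task that never ran also has no output and no notification, so the earliest broken link is the one that gets reported.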
The Codex watchdog adds another layer of meaning. The deterministic watchdog can only recognize failure modes I've anticipated. The Codex watchdog can recognize the ones I haven't — the ones that need semantic understanding to spot. The cost is that it's expensive, slow, and depends on an external service. So the cadence is six times a day — denser when the morning is busy, sparser in the afternoon and evening.
Catch-up: 4/4 recovered automatically
Once v2.3 was deployed, I didn't rush to send any new notifications. I manually triggered one LaunchAgent wakeup so the retry channel could sweep through the four dead notifications from this morning.
Scan result: all four tasks were flagged as notification_missing; all four had output files, all four had correct timestamps, all four met the retry criteria.
Retry pass. Four Feishu notifications, in chronological order, exactly the way they should have arrived in the first place, came in one by one. returncode 0 across the board.
The line that gave me the most peace of mind: "no agents re-run, no existing outputs overwritten." Retry only resends notifications; it never re-triggers the business work. That constraint was explicit in the design, because some business tasks contain irreversible operations — writing to historical ledgers, appending to audit logs — and re-running them would corrupt state. Catch-up — running the missed pieces after the fact — is strictly bounded to the notification layer. The business layer already finished. You don't touch it.
4/4 recovered automatically. Fix commit is e752a93.
There was a strange feeling in that moment. The incident happened, the incident was detected, the incident was auto-recovered, and the only thing I did manually was trigger one wakeup. Everything else, the system did on its own. It didn't hide the incident, and it didn't amplify the incident into something worse. That was the first time I really felt what a watchdog is worth. It doesn't create new functionality. It just drives the cost of recovery toward zero.
What this taught me
By the end of the postmortem, here's what's worth writing into rules I can remember. The next time I trip over something similar, I want to reach for these immediately.
- A LaunchAgent's PATH is not "the same as your terminal." This is an old trap, but every time I install a new LaunchAgent I still default to assuming the environment is identical. Next time, the first thing I do is write PATH out explicitly — either in the plist, or as the first thing the runner does on import. Don't assume "it should be fine."
- "Business succeeded" and "delivery completed" are two different things. A task's output landing on disk is just an intermediate state. Real delivery is "user got the notification AND user can find the output." Any link in that chain breaking counts as a delivery failure, no matter how clean the output looks. Next time I design a scheduler, "delivery" is the final gate, and it's stricter than "task executed successfully."
- Root cause fixes and safety-net layers should be built separately, but shipped together. The root cause fix is the PATH change — it prevents today's incident. The safety net is retry plus watchdog — it covers all future incidents of similar shape. They don't substitute for each other. Fix only the root cause, and the next new shape of half-hung incident still drags me out of bed. Add only the safety net, and today's incident triggers an alert on every single wakeup until the noise drowns out the signal.
- A watchdog is insurance that drives the cost of recovery toward zero. Its value isn't visible when no incident is happening — that's when it looks like waste. Its value is in the exact moment an incident hits, when it turns "I wake up and spend two hours debugging" into "the system handled it and sent me a summary." The cost of buying that insurance is the time to write the code. The cost of not buying it is some two-hour window in a future morning.
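For the first rule, writing PATH explicitly in the plist could look like the sketch below, which builds the property list with Python's `plistlib`. `Label`, `ProgramArguments`, `StartInterval`, and `EnvironmentVariables` are real launchd.plist keys; the label and the program paths are placeholders, not the values from the actual install.

```python
import plistlib

# Label and ProgramArguments are placeholders, not the post's real values.
agent = {
    "Label": "com.example.openclaw-runner",
    "ProgramArguments": ["/path/to/venv/bin/python", "/path/to/runner.py"],
    "StartInterval": 1800,  # seconds; the 30-minute wakeup from the post
    "EnvironmentVariables": {
        "PATH": "/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin",
    },
}

plist_bytes = plistlib.dumps(agent)  # XML plist by default
print(plist_bytes.decode("utf-8"))
```

Setting PATH in the plist and patching it again at import is redundant on purpose: the plist covers anything launchd starts, and the import-time patch covers the runner when it is launched some other way.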
So how is v2.3 actually doing now
v2.3 has been live for a few days. Incident density after Day 1 has actually been higher than I expected — not because the PATH issue came back, but because other shapes of half-hung incidents started showing up. The runner's retry channel caught most of them. The deterministic watchdog has fired a few alerts, and each time I was able to decide within five minutes whether to intervene. The Codex watchdog checkpoint produced a pile of markdown, giving me a daily panoramic view of what the system roughly did.
But there are still loose ends.
The deduplication granularity in the deterministic watchdog is still being tuned. Sometimes the same task gets flagged twice as different missed events, so two nearly identical alerts show up in Feishu. Not fatal, but it pollutes the signal. The ideal is "one alert per incident, unless state has actually changed."
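One possible shape for "one alert per incident, unless state has actually changed" is to remember the last finding alerted on for each task and stay quiet while it is unchanged. This is a sketch, not the watchdog's current logic: the `AlertDeduper` class and its keying by task name are assumptions.

```python
class AlertDeduper:
    """Allow one alert per task and finding until the finding changes or clears."""

    def __init__(self):
        self._last = {}  # task name -> last finding alerted on

    def should_alert(self, task: str, finding) -> bool:
        if finding is None:
            self._last.pop(task, None)  # healthy again; a relapse alerts anew
            return False
        if self._last.get(task) == finding:
            return False  # same incident, state unchanged
        self._last[task] = finding
        return True
```

Since every LaunchAgent wakeup is a fresh process, the `_last` mapping would have to be persisted between runs (for example in a small JSON file) for this to work in the real runner.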
The Codex watchdog checkpoint cadence (six per day) is currently fixed, but realistically mornings are dense and afternoons are sparse. The next step is to make it adapt to "incident density" — run more often when incidents are recent, less often when things are quiet. To do that I first need a definition of "incident density." Don't have one yet.
The biggest open piece is this: v2.3 fixed the notification channel of the trial-layer scheduler. But each of the six agents has its own "business notification" channel, and over the following days a whole different pile of problems showed up there — which bot sends business messages, whose identity does it speak as, how is the message formatted, who should receive it, who shouldn't. That's a different story. Worth a separate post.
Day 1's dawn incident pulled "add a watchdog" from week two of my plan all the way up to the afternoon of day one. That was the trial's first gift to me: it didn't make me wait a week to find the problem.
A trial isn't about proving the system works. It's about exposing the system's fragile spots in a low-cost environment. Every spot that surfaces, I fix. Every fix, the watchdog grows a little. Day 1 grew the PATH fix and the retry channel. The following days grew other things. Every day's watchdog is a little smarter than the day before.
Incidents are normal. The watchdog is still growing. I no longer expect "install once and it stays stable." What I expect is "every incident makes the system a little better at saving itself next time." For now, I'm still working on it.