Asking Linlu to Make a Single Lin Daiyu Scene: One Month, Three Teardowns, Still No Finished Clip

Watercolor sketch: a late-night workstation. One half of the screen shows a ComfyUI node graph, the other half a 6-cell contact sheet of AI-generated period-costume figures — every cell hints at a face but each one is subtly wrong. Sticky notes nearby read 'Lin Daiyu', 'Jia Mansion', 'stick figure'.

Linlu · AI Video Project Retro

Linlu is the multimedia AI in my OpenClaw system. I asked her to make a 45-second video of Lin Daiyu arriving at the Jia Mansion — not for this one clip, but for what comes after. A month in, as of 9:30 PM tonight, June 8th, I still don't have a single finished cut I'm happy with. I've torn the whole thing down three times along the way. And I have no intention of stopping.

First Thing: Why I Put This on Linlu

Linlu is the head of multimedia in OpenClaw Studio — the virtual company where she sits alongside Suwan (writing), Huo Rui (research), and Jiyanran (voice). Each one is a standalone business line, and each one is backed by an agent. Linlu's line is video production.

I picked Lin Daiyu arriving at the Jia Mansion not because I particularly needed to see that scene. I picked it because I want to do this at scale later — other scenes from Dream of the Red Chamber, other classical Chinese literature, scripts I write myself, even videos generated from a voice clip. There's no version of that future where I'm watching every render. So the point of this one clip was never the clip itself. It was to validate that Linlu could take a single sentence from me and run the entire video production pipeline end to end. If one runs clean, the next hundred have a shot at being worth making.

Second Thing: API or Local — I Picked ComfyUI Local

There are really only two technical paths for giving Linlu a video capability. The first is having her call cloud APIs directly — something like MiniMax's text-to-video endpoint, pay per call, send a prompt, wait a few minutes, get a clip back. The second is running ComfyUI locally — a node-based image and video generation workflow tool where each node is one operation and the nodes wire together into a pipeline.

I picked ComfyUI local. The reason is direct — APIs are black boxes. What comes out is basically a lottery, and if I don't like it I can't drop down into any intermediate layer to fix it. ComfyUI is the opposite. Every step is a visible node: where you inject the reference image, where ControlNet runs, where keyframes get produced, where VACE renders the motion, where post-processing happens. If any single frame is wrong, I can locate exactly which node caused it. I can tune it. I can swap it.

The cost of going local is that it's slow and heavy. On the Mac Studio, a 45-second clip takes ~35 minutes just for video generation, plus another solid half hour for pre- and post-processing. But for a capability that's supposed to scale later, the tradeoff is worth it. Slow is fine. Uncontrolled isn't.

Third Thing: The First Ten Days Were Wasted — One Orange Stick Figure Made It Obvious

For the first ten days, Codex, the AI coding assistant doing the build work, was polishing one sample video: a "Morning Radio" clip. Tweaking the prompt, tweaking the workflow, tweaking the quality gates. Every iteration the score went up. By day ten that clip looked clean. I assumed the whole Linlu line was standing on its feet.

Then I asked her to make something new at random — Lin Daiyu arriving at the Jia Mansion. I opened the S03 shot it produced and the three keyframes — start, mid, end — were nearly identical orange stick figures. Underneath, a caption: "Lin Daiyu first sees the Jia Mansion."

I lost it. For ten days Codex hadn't been building Linlu's capability. He'd been patching one specific video over and over. The prompt got tuned and re-tuned, the quality gate got adjusted and re-adjusted, but the motion source — the control video that tells the model how to move — he had never regenerated. He was feeding it a programmatically batch-produced, near-static stick figure. Pipe that into the strongest video model on Earth and what comes out is still "a pretty figure twitching in place."

I tore the whole flow apart and looked at it node by node. The truth was that not a single intermediate node was actually under control. character_passport, which was supposed to lock the character's identity, was only locking an appearance description: no motion style, no camera language. The motion source was a programmatic stick figure. The ComfyUI workflow itself was something Codex had picked off the top of his head, with FLF2V, Animate, and VACE all mixed together and no comparative data telling us which path suited which brief. The quality gate was an LLM scoring its own work, and across 15 self-audit reports the scores actually contradicted each other. The most absurd part: Codex tuned the downstream prompt four rounds in a row, but the keyframe contact sheet was pixel-identical every time, because he never regenerated the keyframes. So "the score doesn't move" was a mathematical certainty during that stretch.
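The pixel-identical contact sheet is the kind of thing a cheap check could have caught before anyone spent another round on prompt tuning. A minimal sketch, assuming the keyframes of a shot are available as raw encoded bytes; the function names and the sample byte strings are hypothetical stand-ins, not part of the actual pipeline:

```python
import hashlib


def fingerprint(frames: list[bytes]) -> str:
    """Hash the raw bytes of every keyframe into a single digest."""
    digest = hashlib.sha256()
    for frame in frames:
        # Length-prefix each frame so shifting bytes between frames changes the hash.
        digest.update(len(frame).to_bytes(8, "big"))
        digest.update(frame)
    return digest.hexdigest()


def rescoring_is_meaningful(previous: list[bytes], current: list[bytes]) -> bool:
    """Return False when the keyframes are byte-identical between two rounds."""
    return fingerprint(previous) != fingerprint(current)


# Hypothetical stand-ins for the start/mid/end keyframes of one shot.
round_1 = [b"start-frame", b"mid-frame", b"end-frame"]
round_2 = [b"start-frame", b"mid-frame", b"end-frame"]     # prompt changed, keyframes not regenerated
round_3 = [b"start-frame", b"mid-frame-v2", b"end-frame"]  # keyframes actually regenerated

print(rescoring_is_meaningful(round_1, round_2))  # False
print(rescoring_is_meaningful(round_1, round_3))  # True
```

If a check like this returns False, re-running the quality gate on that round tells you nothing new about the pictures, whatever was changed upstream.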

Ten days. Massaging one specific video. Not a shred of "she can run this herself" capability built. The stick figure was the first time that truth got shoved in my face hard enough to see.

Fourth Thing: One Day Tearing Apart Every Public ComfyUI Template, Then Seven or Eight More Days Running

After the stick figure day, I told Codex to stop everything in progress and spend one full day pulling apart every public ComfyUI template he could find on the internet — the templates the industry was actually using. Lightx2v's 24-node setup, AIJoe's 35-node setup, the standard text-to-video and image-to-video pipelines floating around. By end of day we'd locked in a concrete set of decisions:

Identity locking uses PuLID (a face-recognition-based identity lock, ~91% match rate) or a Character LoRA (a small model trained on one specific character, 95%+ match).

Motion stops being programmatic stick figures. Instead, VHS_LoadVideo pulls in real human motion mp4s, DWPreprocessor extracts the pose skeleton, and that feeds into VACE.

Post-processing is a fixed chain: CodeFormer for faces, 4x_foolhardy_Remacri for upscale, RIFE for 4x frame interpolation, LUT for color, VHS_VideoCombine for final assembly.

Probes and full renders get layered. A probe only runs 720p / 4-8 steps / a single segment and has to finish in under 5 minutes; only the full render goes to 14B fp16 with the complete pipeline.

And Linlu stops free-styling workflows. She forks the industry template JSON directly, only adjusts three things (reference image, prompt, control video), and never touches the structure.
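The "fork the template, change three things, never touch the structure" rule can be enforced mechanically. Below is a rough sketch, assuming the template is exported in ComfyUI's API format (a mapping from node id to `class_type` and `inputs`). The trimmed template, the node ids, the file names, and the helpers `fork_template` and `structure` are all made up for illustration; a real template has dozens of nodes and its own ids.

```python
import copy

# Hypothetical, heavily trimmed stand-in for an industry template in API format.
TEMPLATE = {
    "1": {"class_type": "LoadImage", "inputs": {"image": "template_ref.png"}},
    "2": {"class_type": "CLIPTextEncode", "inputs": {"text": "template prompt"}},
    "3": {"class_type": "VHS_LoadVideo", "inputs": {"video": "template_motion.mp4"}},
}

# The only (node id, input name) slots a fork is allowed to change.
ALLOWED_SLOTS = {
    "reference_image": ("1", "image"),
    "prompt": ("2", "text"),
    "control_video": ("3", "video"),
}


def fork_template(template: dict, **overrides: str) -> dict:
    """Return a copy of the template with only the allowed slots replaced."""
    unknown = set(overrides) - set(ALLOWED_SLOTS)
    if unknown:
        raise ValueError(f"not allowed to change: {sorted(unknown)}")
    forked = copy.deepcopy(template)  # never mutate the shared template
    for name, value in overrides.items():
        node_id, input_name = ALLOWED_SLOTS[name]
        forked[node_id]["inputs"][input_name] = value
    return forked


def structure(workflow: dict) -> dict:
    """Node ids, class types and input names, ignoring the input values."""
    return {nid: (node["class_type"], sorted(node["inputs"])) for nid, node in workflow.items()}


shot = fork_template(
    TEMPLATE,
    reference_image="daiyu_ref.png",
    prompt="Lin Daiyu steps through the gate of the Jia Mansion",
    control_video="walk_reference.mp4",
)
assert structure(shot) == structure(TEMPLATE)  # graph shape is unchanged
```

Anything outside the three named slots raises instead of silently editing the graph, which is the point: a structural change should be a deliberate template swap, not a side effect of one shot.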

Plan locked. Off we went. From May 31st, when the plan was set, to June 7th at 18:43 took a solid seven-plus days. That evening Codex finally delivered the first complete 45-second cut produced under the new approach: 15 stitched segments, Lin Daiyu's character anchor, costume anchor, audio sync, subtitle alignment. Every gear was turning. The machine quality gate said mosaic=false, blur=false, segment after segment. Everything looked like it was working.

I opened it and watched ten seconds. Then I wrote two sentences: "Picture quality is unbearable. Mosaic everywhere. I can't tell what I'm looking at." I renamed that folder with a suffix: `_owner_rejected_rebuild`. Codex, to his credit, wrote the contradiction — "machine says no mosaic, owner says mosaic" — into a report he called machine_gate_contradiction.

After last night's rejection I told Codex to run a benchmark — 5 different parameter combinations plus 1 baseline as control, each one producing a contact sheet for me to compare. The contact sheets came out at 3:28 AM today. Claude looked at them first and wrote: "01 — you can barely make out the figure and the period costume, but the outline is blurred and the background looks like a wall of colored noise. The Jia Mansion environment isn't clear. 02 — face readability is better than baseline, but it looks plasticky." Then I wrote: "Fails. Candidate_1's figure is still blurry, the background is visibly noisy, the Jia Mansion environment isn't clear. Cannot proceed to the 45-second full rerun." All 5 candidates dead.

Fifth Thing: Sent It to Claude Code for a Root-Cause Pass, Found It, Still Fixing

Early this morning I had Claude Code — a different AI coding assistant, a separate agent from Codex — do a dedicated root-cause analysis. It surfaced something I hadn't realized: Wan2.1 VACE 14B, the video model we were using, on Apple Silicon's MPS backend can really only run at around 448×768 / 8 steps. But my target output is 768×1344. Which means each segment was internally generated at low resolution and then interpolated up to target size. The machine quality gate was looking at the small image before upscale — every frame crystal clear. I was looking at the upscaled final cut — full of interpolation artifacts. The machine literally cannot catch this, because the upscale step happens outside its field of view.
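The size of that gap is easy to put numbers on. A small worked calculation using only the two resolutions quoted above; the function name and variable names are just for illustration:

```python
from fractions import Fraction


def upscale_stats(native: tuple[int, int], target: tuple[int, int]) -> dict:
    """Compare a native render size with the delivery size (width, height)."""
    native_w, native_h = native
    target_w, target_h = target
    pixel_ratio = Fraction(target_w * target_h, native_w * native_h)
    return {
        "scale_w": Fraction(target_w, native_w),
        "scale_h": Fraction(target_h, native_h),
        "pixel_ratio": pixel_ratio,
        # Share of delivered pixels that were never generated by the model.
        "interpolated_share": 1 - 1 / pixel_ratio,
    }


stats = upscale_stats(native=(448, 768), target=(768, 1344))
print(stats["scale_w"], stats["scale_h"], stats["pixel_ratio"], stats["interpolated_share"])
# 12/7 7/4 3 2/3
```

The delivered frame has exactly three times as many pixels as the frame the model rendered, so roughly two thirds of what I was looking at was interpolation. The horizontal and vertical factors also differ slightly (12/7 versus 7/4), so the resize has to stretch or crop a little on top of that. A gate that scores the 448×768 frame is grading a different picture from the one that ships.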

Root cause in hand, the decision was clear — first, drop in an upscale weight like 4x-UltraSharp or RealESRGAN as a post-process repair and see if the blur can be rescued; as a backup, swap the underlying model outright, from Wan2.1 VACE to the Wan2.2 + LightX2V combo.

All day today I've had 6 Claude Code sessions open with cwd in this project directory, each one running a different probe. This afternoon, postprocess_repair_probe. At 9:05 PM tonight, wan22_lightx2v_probe kicked off. At 9:34 PM it produced a sample called daiyu_T2_clean.mp4, and it auto-generated a comparison image next to it called OLD_flf_vs_NEW_wan22.png. I haven't looked yet. I will in a minute. Odds are I'll be writing another piece of feedback right after.

Why I Still Believe It's the Right Direction

One month, three teardowns, zero finished clips as of tonight — on paper this looks like a project about to die. But I'm calmer about it than I've been on any AI project before. The reason is one thing — every time I say no, the whole system stops, hunts for the root cause, and once it's found, that cause becomes a written rule. The stick figure lesson is already hardcoded: probes must be ≤ 5 minutes and cannot declare owner_ready; ComfyUI workflows must be forked from an industry template and Codex is not allowed to assemble his own; motion source cannot be programmatically generated; any quality claim must set `manual_owner_review_required=true`. Last night's "picture quality is unbearable" rejection already triggered today's gate fix: "owner human visual rejection overrides machine sample_quality pass." The human eye saying mosaic outranks the machine score. These rules aren't there because I keep nagging — they got carved into the code by rejection after rejection.
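Put together, those rules amount to a small decision function. This is a sketch of one way they could be encoded, not the code that actually runs in the project. The field names `manual_owner_review_required`, `owner_ready` and `machine_gate_contradiction` come from the rules and reports above; the `GateInput` class, the `evaluate` function, the verdict strings, and the choice to require both a machine pass and an explicit owner approval before `owner_ready` are my assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateInput:
    is_probe: bool
    runtime_minutes: float
    machine_sample_quality_pass: bool
    owner_verdict: Optional[str] = None  # "approved", "rejected", or None if not reviewed
    manual_owner_review_required: bool = True


def evaluate(gate: GateInput) -> dict:
    """Apply the written rules; the owner's eyes outrank the machine score."""
    violations = []
    if gate.is_probe and gate.runtime_minutes > 5:
        violations.append("probe_over_5_minutes")
    if not gate.manual_owner_review_required:
        violations.append("manual_owner_review_required_must_be_true")

    # Machine said PASS, the owner said no: record it instead of hiding it.
    contradiction = gate.machine_sample_quality_pass and gate.owner_verdict == "rejected"

    owner_ready = (
        not gate.is_probe                       # probes can never declare owner_ready
        and not violations
        and gate.machine_sample_quality_pass
        and gate.owner_verdict == "approved"    # assumption: explicit approval is required
    )
    return {
        "owner_ready": owner_ready,
        "violations": violations,
        "machine_gate_contradiction": contradiction,
    }


# Illustrative inputs only; the runtimes are invented for the example.
full_cut = GateInput(is_probe=False, runtime_minutes=65.0,
                     machine_sample_quality_pass=True, owner_verdict="rejected")
probe = GateInput(is_probe=True, runtime_minutes=4.0,
                  machine_sample_quality_pass=True, owner_verdict="approved")

print(evaluate(full_cut))  # owner_ready is False, machine_gate_contradiction is True
print(evaluate(probe))     # owner_ready is False because it is a probe
```

The useful property is that no combination of machine flags alone can produce `owner_ready`; the only path to it goes through a human verdict on a full render.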

I tore it down three times in a month. But those three teardowns weren't running in place. The first one tore down the illusion of "patching one specific video"; the second tore down the path of "winging your own workflow"; the third tore down the verdict of "if the machine says PASS, it passed." Each teardown was expensive, but after each one the system got harder to fool the same way again. That's the root of what I mean by "burned out but the right direction."

Next up is daiyu_T2_clean.mp4, the one that just finished at 21:35 tonight. If I open it and see something that looks like Lin Daiyu, with the feel of the Jia Mansion behind her, and none of that interpolated plastic-skin look, that's Linlu's first watchable sample. Then we can take it into a full 45-second rerun. If it's still blurry, still plastic, then it's another swap of model and parameters.

I'm prepared to be stuck on this for another two weeks. A month with no deliverable sounds like a project that's about to die. But every clip this process eventually ships will be one I personally cleared at the first gate, not one the machine cleared on my behalf. Until then, I'm still fixing.