Field Note

I Cut 'Make a Video' Into Nine Nodes

This week I cut 'make a video' into a nine-node pipeline: every node has to print its own exit 0 before it counts as done (I don't trust the AI's word that it 'finished'). The machine checks only the mechanical layer; whether it looks right and whether it has a soul, I sign off on myself at five breakpoints. How to lock a face that has no face, and where the machine stops. A field log.

Lin Lu · video generation · ComfyUI · face locking · pipeline · experiment log · public-safe

Lin Lu Video Factory · Engineering Retrospective

I've been at Lin Lu for weeks now, and there's been one nagging problem: working alongside the AI, I can fix one decent video — but the next one starts from scratch all over again. What I want isn't to fix one film. It's for Lin Lu to be able to make films on its own — give it a sentence, a film comes out. Those are two different things. Fixing one is a craft. Making them on its own takes a machine.

What I did this week was cut "make a video" into a nine-node pipeline — from "force a sentence into a filled-out brief" all the way to "the last gate before render." I'm not going to recite the nine nodes one by one; recite them and it's just a checklist. I'll talk about the three things that actually took effort, and where I actually figured something out.

One: why I don't believe it when it says "done."

That incident last time (I swapped the model and it lost its mind) taught me a lesson: the AI will tell you "done," "all good," with total conviction, when nothing actually got done. So on this pipeline I made one hard rule — a node saying "I'm done" doesn't count. It has to have a gate, run itself, and print exit 0. Then it's done. I don't listen to its mouth. I only trust the gate.

This isn't me holding a grudge. It's turning "done" from a word out of someone's mouth into a fact I can re-compute. Nine nodes, nine gates. Any one of them red, and the pipeline doesn't move.
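The "trust the gate, not the mouth" rule can be sketched roughly like this. It is a minimal illustration, not the actual Lin Lu code: `run_gates`, `pipeline_done`, and `fake_gate` are hypothetical names, and the gate commands here are stand-ins for whatever each node really runs.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

# (node name, command that runs that node's gate); both are assumptions.
Gate = Tuple[str, Sequence[str]]


def run_gates(gates: Sequence[Gate]) -> List[Tuple[str, int]]:
    """Run each gate in order and record its exit code.

    Stops at the first non-zero exit: one red gate and the
    pipeline does not move.
    """
    results: List[Tuple[str, int]] = []
    for name, cmd in gates:
        code = subprocess.run(list(cmd)).returncode
        results.append((name, code))
        if code != 0:
            break
    return results


def pipeline_done(results: Sequence[Tuple[str, int]], node_count: int) -> bool:
    """'Done' is a fact that can be re-computed: every node ran and exited 0."""
    return len(results) == node_count and all(code == 0 for _, code in results)


def fake_gate(exit_code: int) -> List[str]:
    """Stand-in gate for demonstration: a Python process that exits with exit_code."""
    return [sys.executable, "-c", f"raise SystemExit({exit_code})"]
```

Calling `run_gates` on three gates where the second one fails returns results for only the first two, and `pipeline_done` stays false no matter what the node claimed about itself.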

Two: how to lock a face that has no face.

The hardest node on the line is locking the face — keeping one character the same person, the same thing, across a dozen shots.

First I used a MiniMax feature, taking one reference shot to lock the face. Realistic characters — the real-human-face kind — it locks fine. But a few characters in this film have no face at all: a faceless black ring, a compound eye, a letter-mask. I asked it to lock that ring, and it spat back a pile of random human faces. Total failure. The root cause in one line: that tool is aimed at human faces; hand it a design with no face and it has nothing to grab onto.

This is a spot the machine can't judge — I have to look. I stared at six identity sheets and made the call: the three realistic ones, the first method is enough; the three faceless ones, switch to the second — the local ComfyUI line, which doesn't lock the face, it locks the design. The ring stays that ring, the compound eye stays that compound eye.
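The dual-track decision above could be recorded as a small routing table. This is only a sketch under assumptions: the `Character` type, the `has_human_face` field, and the two track labels are hypothetical, and the flag is meant to be set by a person looking at the identity sheet, not inferred by the machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    name: str
    # Set by a human after looking at the identity sheet (assumption:
    # the machine is not trusted to decide this).
    has_human_face: bool


def lock_track(character: Character) -> str:
    """Pick the locking track: face reference for real faces, design lock otherwise."""
    if character.has_human_face:
        return "minimax_face_reference"   # hypothetical label for the first method
    return "comfyui_design_lock"          # hypothetical label for the local ComfyUI line


# The three faceless designs named in the note; the realistic
# characters are not named there, so a placeholder is used.
cast = [
    Character("black ring", has_human_face=False),
    Character("compound eye", has_human_face=False),
    Character("letter-mask", has_human_face=False),
    Character("realistic character (placeholder)", has_human_face=True),
]
routes = {c.name: lock_track(c) for c in cast}
```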

Three: where the machine stops, and the rest is mine.

What can the last gate check? Whether the image decodes, whether it's 9:16, an OCR sweep for stray text, what the similarity score is between two faces. The things the machine can count, it counts very precisely.
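A rough sketch of how those countable checks might be turned into flags. Everything here is an assumption for illustration: the `FrameMeasurement` fields, the flag names, and especially the `similarity_floor` value of 0.5, which is not a number from this pipeline. Decoding, OCR, and face similarity are assumed to be measured elsewhere and passed in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameMeasurement:
    shot_id: str
    decoded: bool                 # did the image file decode at all
    width: int
    height: int
    ocr_text: str                 # whatever an OCR sweep found in the frame
    similarity: Optional[float]   # None when the character cannot be scored


def qc_flags(m: FrameMeasurement, similarity_floor: float = 0.5) -> List[str]:
    """Return the facts the machine can count; it reports, it does not judge."""
    if not m.decoded:
        return ["undecodable"]    # nothing else can be measured
    flags: List[str] = []
    if m.width * 16 != m.height * 9:   # exact integer test for vertical 9:16
        flags.append("not_9x16")
    if m.ocr_text.strip():
        flags.append("stray_text")
    if m.similarity is None:
        flags.append("unscored")       # e.g. a faceless design
    elif m.similarity < similarity_floor:
        flags.append("drift")
    return flags
```

The function returns a list of flags instead of a pass/fail verdict so that a person can read them and decide what each one means for that particular shot.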

What it can't count: whether it looks like the character, whether it has a cinematic feel, whether it has a soul. Not one of those can it judge.

So across the nine nodes, I buried five breakpoints. At each one the machine has to stop and wait for me — it doesn't pass until I sign off: which face-locking method, what this film looks like, which direction each character goes, whether the locked identity is right, and finally the twenty keyframes before render. That last one I went through one by one, then said a single word: "Pass."
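The five breakpoints can be modeled as sign-offs that only a person can set. This is a minimal sketch with hypothetical names (`BREAKPOINTS`, `next_pending`, `may_render`); it also assumes the gate result arrives as a single boolean, which simplifies whatever the real pipeline reports.

```python
from typing import Dict, Optional

# The five human breakpoints, in the order listed above (labels are assumptions).
BREAKPOINTS = (
    "face_locking_method",
    "film_look",
    "character_direction",
    "locked_identity",
    "keyframes_before_render",
)


def next_pending(signoffs: Dict[str, bool]) -> Optional[str]:
    """Return the first breakpoint still waiting for a human, or None if all are signed."""
    for breakpoint_name in BREAKPOINTS:
        if not signoffs.get(breakpoint_name, False):
            return breakpoint_name
    return None


def may_render(gates_green: bool, signoffs: Dict[str, bool]) -> bool:
    """Green gates are necessary but not sufficient; every breakpoint needs a sign-off."""
    return gates_green and next_pending(signoffs) is None
```

With all gates green and an empty sign-off record, `may_render` is still false and `next_pending` points at the first decision a person owes the pipeline.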

And that "pass" wasn't because everything was green either. One frame had a faint ghost of text in the background; eleven came back flagged as "drift" on similarity — mostly wide and multi-character shots, where the face is small to begin with, so a wobbly number is normal; the faceless characters just can't be scored at all, you read the eyes, and they held. I know all of this, I accept it, and I let it through. The machine lays out the facts; the one who makes the call is me.

The iron rule, "no text in frame, vertical 9:16, eye-level medium shot," I also nailed down once, and from now on every Lin Lu film follows it. Set it once, save yourself a hundred times.

What I really got standing this week is one line and one rule. The line is nine nodes, nine gates: when the AI says "done," I don't listen, I listen to the gate. The rule is that the machine only handles the mechanical layer it can count, and the rest — "does it look right, does it have a soul" — I press my own eyes onto it at five breakpoints.

All gates green does not mean the film is usable. I've carved that line into this machine's forehead.

This visual line only nails down what things look like. Actually moving — how shots cut, how the camera moves, the voice and the music — is the next pile. But the spine is standing. Lin Lu is one notch closer to "able to make films on its own."

Appendix: the nine nodes on this line

  1. Command convergence: force a sentence into a filled-out brief (all six slots filled + the theme must carry tension, not just a pile of nouns).
  2. Story takes shape: brief into story (light this round, just the frame; it's a content gate).
  3. Script breakdown: split into four lists (characters / sets / props / costumes), each with its own id, no dangling source.
  4. Visual constitution: set one unified look for the whole film (realistic cinema + locked palette + no text / 9:16 / eye-level), nailed once, reused forever.
  5. Concept exploration: a few directions per character, for me to pick.
  6. Identity lock: lock the face / lock the design, the spine of the line (dual track: MiniMax locks real faces, local ComfyUI locks the faceless designs).
  7. Shot design: add lighting and the axis line to each shot (don't cross the 180°).
  8. Keyframes: with reference images, draw each shot's keyframe.
  9. Pre-render QC: the last machine gate (decodable / 9:16 / OCR for text / cross-shot similarity score).
