Jiyanran voice workbench notes
The voice entry point is the layer whose difficulty people most easily misjudge. On the surface, the user speaks and the AI answers, and that is all there is to it. But if you want it to still be standing in the second month, it is nowhere near as light as "wiring up a single cable."
Jiyanran (纪嫣然) is a voice workbench agent I run locally. What it does is straightforward: I speak a sentence to my Mac, the system recognizes what I said, decides which agent inside OpenClaw (my own agent factory) should handle it, and then speaks the result back to me. It sounds like a single pipe would do: mic in, speaker out, a model in the middle.
That is what I thought at first. The first version was even a hastily written direct-call version: the front-end UI called the speech recognition SDK directly, the recognized text went straight into the large model, and the model's reply was read out directly. It worked, the demo looked good, but after two or three days I realized this path was actually a trap. Anything I wanted to swap, upgrade, or add risk controls to would force the whole chain to move.
That is why today's v1.0 looks the way it does: OpenRoom front end → voice-bridge → avatar-bridge → OpenClaw. Three independent services sit in front of OpenClaw, each with its own mock fallback and its own risk-gate. Adding two bridge layers in the middle looks redundant. But after using it for a while I have only become more certain: these two bridge layers are not redundant. They are the reason this system can live long.
What this article wants to say is exactly this: why a local voice entry point cannot be written as a monolithic direct call for the sake of convenience, why it must be split into four layers, why every layer needs a mock, and why every layer needs a risk-gate. This is not about some flashy feature I built. It is about a judgment I stumbled into the hard way: where to split, where to fall back, and where to stand guard. If you are thinking about building a local voice assistant or agent entry point, this judgment might save you part of the road I already walked.
The price of coupling: the lazy version feels great for two weeks, then rots in the third
How simple was the first direct-call version? A button on the front end, press to talk, release to send. Recognition ran in the front end, model calls assembled prompts in the front end, and even risk controls were written into the front end on the side — three hundred lines of JavaScript, end-to-end demoable. I was actually pretty pleased with myself at the time, thinking this stuff was not so complicated after all.
The problem is, in the second week I wanted to swap recognition engines. The local model I was using did poorly on mixed Chinese-English speech, and I wanted to try another. I opened the front-end code and found that the recognition call, error handling, timeout, retry, downsampling, and VAD (voice activity detection) were all tangled together in the front end. Swapping an engine was not a matter of changing one import — it meant peeling that whole blob apart again.
The third week was harder. The OpenClaw agent interface changed once. It was actually a very small protocol upgrade, but because the front end was assembling OpenClaw requests directly, the entire front-end request construction logic had to follow. Every time it changed, the front end shook, and the UI was prone to breaking with it.
What finally broke me was risk control. I wanted to add a confirmation step on certain commands (operations like "delete a workspace" should require a second confirmation), and I found this gate could only be written in the front end. But the front end is on the user side — anyone could bypass it in theory; the right request sent directly would hit OpenClaw. That is not technical debt — that is a real security hole.
In that moment I saw it clearly: the voice entry point looks simple, but it mixes four things together from the start — the presentation layer, the perception layer, the dispatch layer, and the execution layer. The price of mixing them is not slow code; the price is that any change later forces you to rewrite the whole layer. Two weeks of bliss, third week of rot, fourth week unmovable.
There is a more hidden cost — the mental load. With the direct-call version, every change forced me to first mentally trace the whole chain: would recognition be affected? Would the UI state misalign? Would request construction mismatch the new protocol? That feeling of "any change anywhere means worrying about the whole chain" quickly wears down the appetite to keep building. A local tool that makes you spend ten minutes "thinking through side effects" every time you touch it will sooner or later be abandoned by its own author.
This is not unique to voice entry points, by the way. Any system that crams "front end + model + agent" into one place runs into it. But voice entry points have it especially bad — because they add two extra things that complicate matters: real-time audio streams, and the user's expectation of low-latency feedback. Both of these strongly tempt you to "just write it together for convenience," because every extra hop adds latency and every extra process adds uncertainty. The temptation is strong, but the cost is stronger.
How the four-layer decoupling splits: OpenRoom / voice-bridge / avatar-bridge / OpenClaw each handle one thing
So v1.0 was rebuilt from scratch, rearranged on the principle of "each layer does only one thing." Four things, four layers.
The outermost layer is the OpenRoom front end. It is just a room: there is a microphone, there are speakers, there is an interface showing the conversation, there are buttons, there is some visual feedback. What it handles is extremely narrow — take the user's voice in, play or display what the backend returns. It does not recognize speech, does not construct requests, does not know who OpenClaw is, does not speak to the model directly. It is just a room — people talk inside, and what happens outside the room is none of its business.
One layer in is voice-bridge, running on port 3962. This layer handles one thing: turning "voice" into "a structured task." It catches the audio stream from the front end, calls the recognition engine, handles VAD, segmentation, confidence, optional language detection, and finally emits a structured description of "I am reasonably sure the user said this." This layer does not know who will pick up downstream or what will be done with it; its responsibility ends at "intent recognized."
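To make "a structured task" concrete, here is a minimal sketch of what voice-bridge might emit. The `VoiceTask` class and its field names are illustrative assumptions, not the actual v1.0 schema; the text only says the output covers the recognized sentence, confidence, segmentation, and optional language detection.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass
class VoiceTask:
    """Hypothetical payload voice-bridge hands downstream (assumed shape)."""
    transcript: str                 # what the recognizer believes was said
    confidence: float               # recognizer confidence in [0.0, 1.0]
    language: Optional[str] = None  # optional language detection result
    segments: list = field(default_factory=list)  # VAD-segmented pieces of the utterance

    def to_json(self) -> str:
        # Serialize for the hop to the next layer.
        return json.dumps(asdict(self), ensure_ascii=False)


task = VoiceTask(transcript="summarize today's notes", confidence=0.91, language="en")
payload = task.to_json()
```

Nothing in this payload names an agent or an action target, which matches the boundary described above: the layer's responsibility ends at "intent recognized."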
Further in is avatar-bridge, running on port 3961. This layer handles dispatch. voice-bridge hands it a structured task, and it decides which agent the task belongs to: knowledge questions go to the information line's agent, writing goes to the content line's, command execution goes to the execution line's. This corresponds to "dispatch" — not "recognition," not "execution."
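A rough sketch of what "dispatch" could look like as code. The rule table, the keyword prefixes, and the line names are all assumptions made for illustration; the real avatar-bridge rules are not given in this article.

```python
from typing import Callable

# Hypothetical line names mirroring the three lines described above.
INFORMATION, CONTENT, EXECUTION = "information", "content", "execution"

# Assumed rule table: (predicate on the transcript, target line), checked in order.
RULES: list[tuple[Callable[[str], bool], str]] = [
    (lambda t: t.startswith(("run ", "delete ", "open ")), EXECUTION),
    (lambda t: t.startswith(("write ", "draft ")), CONTENT),
]


def dispatch(transcript: str) -> str:
    """Return the line whose agent should receive the task."""
    text = transcript.strip().lower()
    for matches, line in RULES:
        if matches(text):
            return line
    # Anything unmatched is treated as a knowledge question (assumption).
    return INFORMATION
```

The function only returns a destination. It neither recognizes speech nor executes anything, which is the whole point of keeping dispatch in its own layer.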
Innermost is OpenClaw. The ones who actually do the work all live here — Suwan, Huo Rui, Shen Zhixing, and Jiyanran herself. OpenClaw does not care about voice, does not care about front-end buttons, does not care who dispatched the task; it only cares about "I have received a task; do it well according to my persona and capabilities."
Four things, four layers, each looking only at its own boundary. The front end does not know how downstream dispatches; voice-bridge does not know which agents exist downstream; avatar-bridge does not know how each agent works internally; OpenClaw does not know whether the task came out of someone's mouth or off someone's keyboard. Each layer sees only its own slice.
There is a hidden benefit to this "only look at your own slice" design: the entry form can be swapped without affecting the other layers. Today it is a voice entry. If tomorrow I want to add a keyboard entry, I just write another "keyboard bridge" and plug it into avatar-bridge, and downstream OpenClaw does not need to move at all. The day after it could be an email entry, a Telegram entry, or a shortcut entry, all following the same pattern. Everything from avatar-bridge down becomes a stable "task execution backend"; above avatar-bridge there can be any number of entry forms. This started to matter a lot when I began wiring agents other than Jiyanran into the system: the same OpenClaw backend can serve many entry forms without starting from zero each time.
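The "any number of entry forms" idea can be sketched as several small adapters that all produce the same task shape. The `Task` class, the adapter names, and the `handle` stand-in are hypothetical; they only illustrate that the dispatch side never needs to know where the text came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Assumed entry-agnostic task shape consumed by the dispatch layer."""
    text: str
    source: str  # which entry form produced the task, kept for audit only


def from_voice(transcript: str) -> Task:
    return Task(text=transcript.strip(), source="voice")


def from_keyboard(typed: str) -> Task:
    return Task(text=typed.strip(), source="keyboard")


def handle(task: Task) -> str:
    # Stand-in for the dispatch layer: it reads only task.text, never task.source.
    return f"dispatching: {task.text}"
```

Adding an email or Telegram entry under this sketch would mean writing one more `from_...` adapter and nothing else.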
This split looks verbose — a single audio clip from microphone to actual work has to pass through four processes, two ports, and three serializations. But what I discovered later is that this very "verbosity" is what lets each layer be swapped on its own. Swapping the recognition engine touches only voice-bridge; changing dispatch rules touches only avatar-bridge; OpenClaw upgrading the agent protocol leaves the outer three layers untouched.
Why ports 3962 and 3961, two adjacent numbers? Pure convenience — I grouped voice-related services in the 3960 range to make them easy to remember and debug. That is not design philosophy, just engineering preference. But "each layer has its own port" is deliberate: it forces me to treat each layer as an independent service, so I cannot quietly merge two layers into one process in some later version. Physical isolation enforces logical isolation.
There is actually an unexpected benefit after this layering is done: each layer can be tested independently. I can spin up only voice-bridge, feed it a recorded audio file, and see what it recognizes; I can spin up only avatar-bridge, feed it a fake structured task, and see where it dispatches; I can spin up only OpenClaw, feed it a fake agent task, and see how the agent responds. Each layer has its own test set and its own regression cases. Locating problems is fast too — compare the logs of the four layers and you immediately see which layer broke.
Why every layer needs a mock fallback: the user end cannot go "blank screen"
Decoupling is only the first step. What actually lets this architecture hold up in daily use is another seemingly unremarkable design: every layer carries its own mock fallback.
What does that mean? When voice-bridge starts up and finds that avatar-bridge is not running or unreachable, it does not just throw an error back to the front end. It falls back to a mock interface that returns a prepared placeholder response, telling the front end "the dispatch layer is temporarily unreachable, use this fake data for now." avatar-bridge does the same: if OpenClaw is down, it uses a mock agent to return a placeholder result. So does the front end: if voice-bridge itself is not up, it can at least register that the user pressed the button and display "voice channel not yet connected" instead of a black screen.
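A minimal sketch of that fallback, assuming the downstream call is passed in as a plain function. The names `call_with_fallback` and `mock_response`, and the `placeholder` / `downstream` fields, are invented for this example and are not the real bridge interfaces.

```python
from typing import Callable


def mock_response(downstream: str) -> dict:
    """Prepared placeholder returned when a downstream layer is unreachable."""
    return {
        "placeholder": True,       # explicit marker so nobody mistakes it for a real answer
        "downstream": downstream,  # which layer could not be reached
        "text": f"{downstream} is not connected; this is a placeholder response.",
    }


def call_with_fallback(send: Callable[[dict], dict], task: dict, downstream: str) -> dict:
    """Try the real downstream call; on connection failure or timeout, return the mock."""
    try:
        reply = send(task)
    except (ConnectionError, TimeoutError):
        return mock_response(downstream)
    reply.setdefault("placeholder", False)
    return reply


def unreachable(task: dict) -> dict:
    # Simulates avatar-bridge being down.
    raise ConnectionRefusedError("connection refused")


result = call_with_fallback(unreachable, {"transcript": "hello"}, "avatar-bridge")
```

Only connection-level failures trigger the mock here. Other exceptions still propagate, so a genuine bug in the caller is not silently papered over by a placeholder.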
Why does this matter so much? Because a local AI system is not a cloud service — downstream instability is the norm. OpenClaw needs to restart for upgrades, the recognition model needs time to load, an agent running a long task may not respond. If every layer just passes downstream failure up the chain as-is, the user end will see "an error occurred, please retry" frequently. Once or twice is fine; ten times in a row and the workbench is dead.
That said, mocks are not for tricking the user. When the mock catches, the front end explicitly shows "this is a placeholder response, downstream X is not connected," instead of pretending it really answered. This is key — the mock is so the user end can keep operating (enter the next request, change settings, view history), not so the system can pretend everything is fine.
Once you actually do it, you find mocks have another hidden benefit: every layer can be developed independently. While developing voice-bridge, avatar-bridge can just run as a mock, with no need for OpenClaw to actually be running. While tuning dispatch rules on avatar-bridge, the OpenClaw layer can be fully mocked, so development is not blocked by downstream. Otherwise four-layer integration testing means one layer crashes and the whole chain stops, fragmenting your dev rhythm.
I had one principle when designing the mocks: mocks must be identifiable. The content they return carries an explicit placeholder marker, and the front end, on seeing this marker, tells the user explicitly "current response is a placeholder." I did not think this through at first and wrote a version of "pretend everything is fine" mocks. The result: one time voice-bridge could not reach avatar-bridge, the front end received a mock response and played it normally, and I did not notice downstream was down for half an afternoon. After that, mocks had to be explicitly visible — better to look crude than to let "the system is actually not working" hide behind an illusion.
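On the display side, "identifiable" might be as small as the following sketch. It assumes replies are dictionaries with a boolean `placeholder` field and a `downstream` name; both are hypothetical conventions, not the documented v1.0 format.

```python
def render(reply: dict) -> str:
    """Format a reply for display; placeholders are always labeled as such."""
    if reply.get("placeholder"):
        source = reply.get("downstream", "unknown layer")
        return f"[PLACEHOLDER: {source} not connected] {reply.get('text', '')}"
    return reply.get("text", "")
```

The label is deliberately crude. Under this sketch there is no code path where a placeholder reaches the user without the marker in front of it.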
Another lesson: do not try to make mocks "look smart." I once thought about having the mock use a small model to generate placeholder text that sounded more like a real answer. In the end I did not do it. The reason is direct: the smarter the mock, the harder it is for the user to tell whether it is real or a placeholder, and the easier it is to take fake data for true. A simple, crude mock is, by contrast, honest — its very existence is saying "this downstream link is down."
v1.0 does it this way, and v2.0 has not changed it. Not out of laziness, but because this design has been validated: with mocks in place the whole system is stable; with mocks removed the chain becomes fragile.
Why every layer needs a risk-gate: do not let OpenClaw be bypassed at will
Decoupling solves "can be swapped," mocks solve "can hold up." But one problem is still unsolved — security.
The security of a local system is more easily overlooked than it looks. Many people think "I am the only user anyway, no risk locally," but the fact is: as long as this system has ports, APIs, and the ability to call real things, it can be bypassed, abused, or accidentally triggered. Even if I myself misspeak one sentence or one word is misrecognized, OpenClaw might end up doing something it should not.
So every layer needs its own risk-gate.
The front-end layer is the most basic: an allowlist. Which clients can connect to the front end and which sources can inject messages are hardcoded. Anything not on the list cannot even open the page.
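A hardcoded allowlist can be very small. In this sketch the origin values are made-up examples (the real list is not given here), and `is_allowed` is a hypothetical helper name.

```python
from typing import Optional

# Hardcoded allowlist (example values only; the real entries are not stated in the text).
ALLOWED_ORIGINS = frozenset({"http://localhost:3000", "http://127.0.0.1:3000"})


def is_allowed(origin: Optional[str]) -> bool:
    """Exact-match check: missing or unknown origins are rejected."""
    return origin is not None and origin in ALLOWED_ORIGINS
```

Exact matching is a deliberate choice in this sketch: prefix or substring checks are easier to fool than set membership.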
voice-bridge handles voice boundaries: which prompt patterns are allowed, which command patterns must be intercepted immediately. For example, if the user's spoken sentence contains keywords that are easily misrecognized, voice-bridge does a first pass of intent cleaning, marking high-risk expressions before passing them down.
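The "mark, then pass down" step could look roughly like this. The pattern list is an assumption for illustration; a real one would be tuned to the recognizer's typical mistakes and to the commands the agents actually expose.

```python
import re

# Assumed high-risk patterns; not the actual voice-bridge rule set.
HIGH_RISK_PATTERNS = [
    re.compile(r"\b(delete|remove|wipe)\b", re.IGNORECASE),
    re.compile(r"\bshut\s*down\b", re.IGNORECASE),
]


def mark_risk(transcript: str) -> dict:
    """Attach a high_risk flag and the matched patterns; the text itself passes through unchanged."""
    hits = [p.pattern for p in HIGH_RISK_PATTERNS if p.search(transcript)]
    return {"transcript": transcript, "high_risk": bool(hits), "matched": hits}
```

This layer only marks. Whether a marked task is blocked or needs confirmation is left to the dispatch layer, consistent with voice-bridge not knowing what happens downstream.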
avatar-bridge is the most critical layer. It is the one that actually decides "who gets this task," so it must be the strictest. Every agent has its own boundary of what it can and cannot do, and avatar-bridge checks before dispatching: does this task match this agent's capabilities? Are the required permissions present? Is this a high-risk action that requires second confirmation? If any check fails, or a required confirmation has not been given, the task is not dispatched.
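Those three checks can be sketched as a single gate function. `AgentProfile`, `gate`, and the permission strings are hypothetical names chosen for this example; the article does not describe how avatar-bridge stores agent boundaries.

```python
from dataclasses import dataclass, field


@dataclass
class AgentProfile:
    """Assumed per-agent boundary: what it can do and what it is permitted to do."""
    name: str
    capabilities: set = field(default_factory=set)
    permissions: set = field(default_factory=set)


def gate(agent: AgentProfile, action: str, needs: set, high_risk: bool, confirmed: bool) -> tuple:
    """Return (allowed, reason). Checks run in the order described above."""
    if action not in agent.capabilities:
        return False, "capability mismatch"
    if not needs <= agent.permissions:  # every required permission must be present
        return False, "missing permission"
    if high_risk and not confirmed:
        return False, "second confirmation required"
    return True, "ok"


executor = AgentProfile("executor", {"delete_workspace"}, {"workspace:write"})
```

For example, `gate(executor, "delete_workspace", {"workspace:write"}, high_risk=True, confirmed=False)` is refused until the confirmation arrives, and the returned reason string is what would go into that layer's audit log.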
OpenClaw itself also has its own layer of risk-gate. This is "the last line of defense" — even if the three layers in front are all bypassed, OpenClaw internally still has its own personas, its own boundaries, its own audit log. No agent can do anything beyond its capability range without owner approval.
Four layers of risk-gate sound repetitive, but they are not. The logic stacking them is: no layer can be assumed trustworthy. The front end might be bypassed, voice-bridge might misrecognize, avatar-bridge might dispatch wrong. So every layer guards its own door — do not count on the outer layer to keep the dirt out.
There is a side benefit to this setup: audit logs rotate daily, and every layer writes its own. Wherever something goes wrong, that layer's log sees it first. To do a retrospective on one misrecognition, you do not dig through a mass of mixed-together full-chain logs — you first look at voice-bridge's recognition records for that day, then avatar-bridge's dispatch records, then OpenClaw's execution records. Each layer's log covers its own slice, and retrospectives are actually faster.
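Per-layer, daily-rotating logs are straightforward with Python's standard library. This sketch uses `logging.handlers.TimedRotatingFileHandler`; the file naming, the log format, and the 30-day retention are assumptions, not the actual v1.0 settings.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def make_audit_logger(layer: str, log_dir: str) -> logging.Logger:
    """One logger and one file per layer, rotated at midnight."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"audit.{layer}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = TimedRotatingFileHandler(
            str(Path(log_dir) / f"{layer}.log"),
            when="midnight",   # start a new file each day
            backupCount=30,    # assumed retention: keep 30 rotated files
            encoding="utf-8",
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Calling `make_audit_logger("voice-bridge", "logs")` and `make_audit_logger("avatar-bridge", "logs")` yields two separate files, so a retrospective can read one layer's records without wading through the others.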
Later I distilled another lesson: the stricter each layer's risk-gate, the more downstream logic can be simplified. If avatar-bridge has already blocked illegitimate requests before dispatching, OpenClaw internally does not need much defensive code for "is this input malicious." It can focus on what it does best — executing tasks. Conversely, if upstream risk-gates are toothless, downstream has to write all kinds of boundary checks itself, and the whole codebase grows more bloated. So layered risk-gates are not just security design — they are also about putting responsibilities in their right place: each layer only needs to do its own checks well, no need to back up someone else.
v1.0 vs v2.0 — the product judgment: core architecture stays put, increments only at the edges
The system currently runs on v1.0. This version has been running stably for a while, with fixed ports (voice-bridge 3962, avatar-bridge 3961), the allowlist and risk-gates working, audit logs rotating daily, mock fallbacks holding, and recognition, dispatch, and execution each doing their part. It is not perfect, but it is a "real thing that is running."
I have also started on v2.0. The skeleton of v2.0 has landed and is in finishing stages; v2.0 GA (General Availability) is still waiting on audit. But one thing I set in stone from the start: v2.0 only does increments — it does not touch the core architecture.
This is a product judgment, not a technical one.
Technically v2.0 could perfectly well "take the chance to rework the architecture" — merge voice-bridge and avatar-bridge into one process to save a serialization hop, switch to a more modern communication protocol, replace the mock fallback with a smarter "pretend to keep chatting." Each of these can be justified on its own.
But the product judgment tells me not to touch them. The core architecture of v1.0 has been validated by months of real use: four layers of decoupling, mocks per layer, risk-gates per layer. This structure was not dreamt up in the abstract — it was earned by stepping in holes. Any "take the chance to fix this" can push a validated stable state back into instability. Increments are safe; rewrites are gambles.
So what is v2.0 actually doing? Enhancements at the edges. Finer intent recognition, friendlier placeholder responses, better support for long conversations, more granular dispatch rules for some agents. These are all "adding a bit," not "rewriting." The original four layers, four ports, four mocks, four risk-gates — none of them moved.
This kind of judgment is especially important on a one-person project. The pit a solo developer most easily falls into is "rewriting on every upgrade" — because no one is holding you back, because the code you wrote yesterday looks ugly today, because the new SDK looks sexier. But if you really want it to live long, the first thing to do is lock the validated parts and leave the unvalidated parts to incremental exploration. These two things cannot be mixed.
So the design principles of v2.0 are written quite rigidly: do not touch the core architecture if you can avoid it; new features go on the edges first; old features are not rewritten unless there is clear "online evidence" that they are broken; any "design that looks better" is first validated on the mock path, not put straight onto the real path. These rules are not to limit creativity — they are to keep "v1.0 already runs this well" from getting swept away by a new round of excitement.
That said, v2.0 is not idle. The skeleton has landed and is in the finishing stage. GA is waiting on audit, not on code. Because this layer touches the capability boundaries of OpenClaw's internal agents, an external review has to confirm that new features have not bypassed the risk-gates before it can open formally. I think this wait is worth it. A local system, once shipped, is hard to "roll back wholesale," so better to wait a bit longer before GA.
The real trade-off: simple call vs long-term evolution
Looking back at the whole thing, the biggest decision was actually at the very beginning: whether to turn a simple direct call into something a little more complicated.
Writing two extra bridge layers at the start has its costs: twice the code, an extra deployment, an extra monitoring surface, and extra cognitive load. If all you want is "a voice assistant that runs," three hundred lines of direct call are enough, and the time saved can go elsewhere.
But if what you want is "a voice workbench I can use for half a year, a year, two years," these two extra layers are a completely different story. They cleanly peel "how to recognize" and "who to dispatch to" off the front end, so the recognition engine can be swapped, dispatch rules can evolve, agents can be added or removed, the front-end UI can be redone — and none of these things drag the others along.
This is a very typical "short-term complexity in exchange for long-term simplicity." In the short term, the two extra bridge layers are a burden. In the long term, they turn this system from "a glob of glue that rots easily" into "four small services that can each evolve."
My own experience: in local AI systems, anywhere "user entry + model call + agent execution" exist together, this kind of decoupling is worth doing. Not because it is worth it from day one, but because it avoids one of the worst kinds of rot — the kind where you know a layer should be swapped, but it is coupled too deeply for you to dare, so you put up with it in an ever-worsening state.
The biggest difference between a local AI system and a cloud service is this: you do not have a team to back you up, no SLA to catch you, no semi-annual big-refactor window. You only have yourself. So stability does not come from "I will go fix it"; it comes from "it does not break easily in the first place." Four-layer decoupling is the design that makes it not break easily.
I have condensed this judgment into a few principles to leave here for anyone who comes after to build something similar:
- A voice entry point should be split into four layers — "front end / recognition / dispatch / execution" — each an independent service on an independent port;
- Every layer carries its own mock fallback, so when downstream is down the user end does not go "blank screen";
- Mocks must be identifiable — never let placeholder responses pass as real answers;
- Every layer carries its own risk-gate — do not assume upstream or downstream is trustworthy;
- Once the core architecture is validated, lock it; all new features go in as increments first;
- Audit logs are written per-layer; retrospectives sliced by layer are far faster than ones tangled across the whole chain.
None of these principles "sound profound." They are all the "obvious in hindsight" kind. But obvious in hindsight tends to require writing a bad version yourself, being tormented by it for a while, before you actually accept it.
v2.0 GA is still waiting on audit, mocks are still holding the fallback, and real Claw integration is not yet complete.
Jiyanran is not a "finished voice assistant" — she is more like a workbench that is "running, changing, growing." v1.0 is already stable enough — stable enough that I rely on it daily, stable enough that I can comfortably add new things to it. But it is far from terminal. The boundaries of the voice entry point will keep blurring (multimodal, multi-device, long conversations), downstream agent capabilities will keep growing, the granularity of risk-gates will keep getting finer.
But one thing I am now more certain of than at the start: no matter how the front-end UI changes, what the recognition engine is swapped for, or how many agents live inside OpenClaw, this skeleton of four layers, four ports, four mocks, four risk-gates is not going to move. It is not the endpoint — it is the foundation that lets this workbench keep walking. Foundations are not pretty, but foundations must be stable.
If I have to give this piece a short ending: local AI systems are not short of people who can write code; they are short of people willing to write two extra bridge layers at the start. In the short term that is a burden; in the long term that is longevity. The voice entry point looks simple, but it is the layer whose difficulty people most easily misjudge. Precisely for that reason, it is also the one most worth getting the skeleton right on the first try.
v1.0 is stable, v2.0 is on the way. The next piece may be about what I see after real Claw integration is done — but that is the next piece's story.