Notes on cleaning up a knowledge base
When you clean up an AI project's knowledge base, the hardest part isn't running out of things to keep — it's wanting to keep everything.
Any stretch of time on an AI project produces a pile of material. Chat logs, runtime logs, install backups, agent settings, ad-hoc reports, status files, scripts, evidence screenshots. Each piece looks vaguely useful on its own. Throwing any one of them away feels like a small loss.
But once you actually shove all of it into something you call a "knowledge base," a few weeks later it looks no different from the scattered working directories it came from. It has been organized on the surface. Underneath, you just moved the garbage.
So now, before I start cleaning up, I always do one dumb thing first: I write down what stays and what goes. Once the rules are written, curation becomes mechanical. When the rules can't be written, curation stalls on every file.
The real tension: raw material keeps growing, time to organize keeps shrinking
I refused to admit this for a long time. The internal voice said — just keep everything for now, the disk is big enough, sort it out later.
But "sort it out later" never actually happens. Once a knowledge base passes a few hundred files, the next person who opens it — including me — doesn't want to read it anymore. Its sheer volume scares everyone off, the author included.
So "keep everything" looks safe, but it's the most expensive choice. The cost isn't disk space — disk is cheap, just buy more — it's attention budget. Every meaningless file I keep means a little more attention I have to spend judging it next time. After ten rounds of that, nobody wants to open the base again.
What didn't work: keep the newest, keep the longest, keep the official-sounding
Early on I cut corners with a few rules that sounded reasonable. All of them backfired.
- Keep the newest — but "newest" usually just means someone touched it last, not that what it says still holds.
- Keep the longest — but the longest file is often an AI-generated summary, with things mixed in that shouldn't be.
- Keep the official-sounding — but files named FINAL / SPEC / README are often early versions, later overturned by what actually happened in production. The filename never got updated.
Any one of those rules looks fine in isolation. Run all three together and you get a disaster — what survives is "authoritative-looking AI summaries that have long since expired." That kind of artifact is more dangerous than a chat log. It dresses up as knowledge.
So I switched to a different approach. Two filtering layers.
First filter: can this material still do work in the future?
The first layer asks one question: at some point in the future, when I re-enter this project, do I need to read this material? If no, drop it. If yes, ask the follow-up — is the material itself useful, or has its conclusion already been absorbed into an audit report or design doc somewhere else? If absorbed, the raw material doesn't need to stay either.
Cut along that line and the material splits into two piles.
What stays: knowledge that can keep doing work
- Project overviews, architecture, design docs — this is knowledge, not state. It tells someone what the system looks like and why it was built that way.
- Audit conclusions, decision records — settled judgments are worth more than the process that produced them.
- How to run it, ports, the command surface — when you want to use it again, this is the first thing you look up.
- Change logs, timelines — so someone can understand "why it evolved into what it is today."
- Install archives (one per install) — when you reinstall the system, you'll always come back to these.
- The audit snapshot section inside an install backup, the knowledge subdirectory of a cold-storage archive, the "engineering analysis" portion of an agent training log: these were originally scattered across messy directories. As long as they can keep doing work, lift them out and put them in the right place.
What gets dropped: runtime and engineering artifacts
- Source code, scripts, patches — these belong to the repo, not the knowledge base.
- Runtime logs, caches, dependencies — regenerable by running it again.
- Evidence, backups, raw source material: process evidence, already absorbed by the conclusion.
- Raw conversation transcripts (kimi sessions, claude memory, codex memories) — the machine's working memory, not human-facing knowledge pages.
- Runtime config containing tokens or secrets — runtime identity, not knowledge.
- Identity, role, soul, heartbeat (the bits that define an agent's persona): prompt-engineering artifacts. Publishing them is neither safe nor valuable.
- The home-directory system files inside an install backup, the books/raw/tasks folders in an archive, old AI chat-log directories — same property, same pile. Don't keep an entire directory just because "there's still something useful in there."
Splitting directories apart is the thing most easily overlooked at this layer. A single directory often contains both things worth keeping and things worth dropping. That's normal — sort by property into two piles. Don't move the whole thing because "sorting is annoying," and don't delete the whole thing because "some of it's useless."
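The first filter can be sketched as a small decision function. This is a minimal illustration in Python, not the tooling used for the cleanup: the `Material` record, its field names, and the `ALWAYS_DROP` labels are hypothetical, and the two yes/no answers still have to come from a person reading the material.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"
    DROP = "drop"


@dataclass(frozen=True)
class Material:
    path: str
    kind: str                 # hypothetical label, e.g. "design_doc" or "runtime_log"
    needed_on_reentry: bool   # would I read this when I re-enter the project?
    absorbed_elsewhere: bool  # is its conclusion already in an audit report or design doc?


# Illustrative labels for the categories the drop list names outright.
ALWAYS_DROP = {
    "source_code", "runtime_log", "cache", "dependency",
    "raw_transcript", "secret_config", "persona",
}


def first_filter(m: Material) -> Verdict:
    """Apply the drop list first, then the two questions of the first filter."""
    if m.kind in ALWAYS_DROP:
        return Verdict.DROP
    if not m.needed_on_reentry:
        return Verdict.DROP
    if m.absorbed_elsewhere:
        return Verdict.DROP
    return Verdict.KEEP


def split_directory(items):
    """Sort one directory's contents into two piles instead of moving it whole."""
    keep = [m for m in items if first_filter(m) is Verdict.KEEP]
    drop = [m for m in items if first_filter(m) is Verdict.DROP]
    return keep, drop
```

Run against a single install backup, `split_directory` would put the audit snapshot in the keep pile and the home-directory system files in the drop pile, which is the per-property split described above.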
Second filter: among the survivors, who is the canonical source?
After the first layer, the real trouble begins. For the same project, there might be a design doc locally and another in the archive; install docs may be at v3, but v1 and v2 are still around; the same status is recorded in the audit report and also in a runtime log. Each one claims to be correct.
At this point you can't merge — merging just packages the conflict more prettily. You need a referee.
The six rules below aren't truths. They're work rules. Their purpose isn't to "pick the best one" but to give every kind of material a fixed priority, so I don't have to think from scratch next time I'm refereeing.
- Original file beats backup. A backup exists for emergencies, not to be cited.
- Latest stable version beats older versions. Note the word "stable" — not "the most recently edited draft."
- Design doc beats runtime traces. Commands, state, logs tell you what it's doing right now. Only the spec tells you what it was supposed to do.
- History ledger beats raw chat transcripts. A ledger is a compressed anchor; raw chat is a stream.
- Local current fact beats the duplicate copy in the archive. The archive is history, not the present.
- Anything about persona, identity, soul, heartbeat — excluded by default. This category is prompt-engineering artifact, not public knowledge.
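One way to make the six rules mechanical is to turn them into a sort key. The sketch below is an assumption-heavy illustration: the `Candidate` fields are hypothetical, and treating the rules as tie-breakers in the order listed is my reading, not something the rules themselves state. The result is a suggestion to confirm by hand, not a verdict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    path: str
    is_backup: bool = False
    is_stable: bool = True
    version: int = 0
    is_design_doc: bool = False  # a spec or design doc, as opposed to runtime traces
    is_ledger: bool = False      # a history ledger, as opposed to raw chat
    in_archive: bool = False
    is_persona: bool = False     # persona, identity, soul, heartbeat material


def rank_key(c: Candidate):
    """Higher tuples win. Each element mirrors one rule, in the listed order."""
    return (
        not c.is_backup,            # original beats backup
        (c.is_stable, c.version),   # latest stable beats older versions and drafts
        c.is_design_doc,            # design doc beats runtime traces
        c.is_ledger,                # ledger beats raw chat transcripts
        not c.in_archive,           # local current fact beats the archive copy
    )


def pick_canonical(candidates) -> Optional[Candidate]:
    """Return the top-ranked candidate, excluding persona material by default."""
    eligible = [c for c in candidates if not c.is_persona]
    if not eligible:
        return None
    return max(eligible, key=rank_key)
```

With install docs at v1, v3, an unstable v4 draft, and a backup copy of v3, `pick_canonical` returns the original v3: the draft loses on stability and the backup loses on the first rule.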
And one more thing: tag every page with a freshness state
Pages that stay also expire. Without handling expiration, the knowledge base regresses to that "everything's right, but nothing's necessarily right" state, no different from the messy directories you started with.
So now every page carries a freshness state. Three states are enough. Any more and nobody maintains them:
- verified — recently checked against the source by a person or AI, still holds.
- stale — the source has moved on, this page may be inaccurate, but still usable as a lead.
- needs review — visibly in conflict, must be looked at again by a person.
The three states aren't complicated. The point is that they give "when must this be updated" a clear signal. Without a signal, every page looks equally trustworthy, and problems get quietly written into the next round of judgment.
I keep one simple rule for myself: if a page can't pass a quick source check, it shouldn't be allowed to feel authoritative. It can stay as a lead, but it gets downgraded — it can't keep pretending to be verified.
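The three states and the downgrade rule fit in a few lines. This is a sketch under stated assumptions: the function name and its two boolean inputs are hypothetical, and how a "quick source check" is actually performed is left out on purpose.

```python
from enum import Enum


class Freshness(Enum):
    VERIFIED = "verified"
    STALE = "stale"
    NEEDS_REVIEW = "needs review"


def freshness_after_check(passes_source_check: bool, in_conflict: bool) -> Freshness:
    """Assign a state after a source check; a visible conflict always wins."""
    if in_conflict:
        return Freshness.NEEDS_REVIEW
    if passes_source_check:
        return Freshness.VERIFIED
    # Failed the check without an open conflict: keep it as a lead only.
    return Freshness.STALE
```

The design choice worth noting is that nothing returns `VERIFIED` unless the check passed, so a page cannot keep that label by default.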
What it produced: three independent wikis
After running the two filters, six rules, and three-state tagging through everything, I ended up with three independent wikis. None of them are large, all of them can keep doing work, and they share the same style:
- llm-wiki — engineering knowledge base. Holds page principles, project status, ecosystem governance, audits and decisions.
- openclaw-knowledge — the OpenClaw project's dedicated base. Holds install design, version choices, security hardening, history ledger.
- yun-archive-wiki — personal archive base. Holds the music index, install audits, the knowledge portion of cold-storage, post-reinstall reports.
Each base has its own keep/drop table, but the underlying judgment is one shared method. That matters more than whether the content is complete — it means if any one wiki has a problem later, I can clean it with the same method, without reinventing the rules.
A side effect: each wiki is small. llm-wiki is 19 markdown files, 76KB; openclaw-knowledge is 7 files, 28KB; yun-archive-wiki is 11 files, 44KB. Together, 37 files and under 150KB.
This is something I only came to accept slowly — a knowledge base that actually gets used is usually small. The big one is usually not a knowledge base. It's a backup of a working directory.
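The sizes quoted above are easy to re-check. The sketch below sums the reported figures and adds a small helper, `wiki_size`, that measures a wiki directory on disk; the helper name and the assumption that every page is a `*.md` file are mine.

```python
from pathlib import Path

# Figures quoted in the text: (markdown files, size in KB).
REPORTED = {
    "llm-wiki": (19, 76),
    "openclaw-knowledge": (7, 28),
    "yun-archive-wiki": (11, 44),
}

total_files = sum(files for files, _ in REPORTED.values())
total_kb = sum(kb for _, kb in REPORTED.values())


def wiki_size(root: Path) -> tuple[int, int]:
    """Return (number of markdown files, total bytes) under root."""
    files = [p for p in root.rglob("*.md") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


print(total_files, total_kb)  # 37 148
```

Running `wiki_size` on each wiki now and then gives an early warning when a base starts drifting back toward a working-directory backup.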
A few quiet rules I keep enforcing
This process keeps running not because any one rule is particularly clever, but because a few dumb rules never get broken.
- Don't let "sort it out later" be an excuse — if a call can't be made now, either drop it or mark it needs review and push it into the next round.
- Don't let AI decide the canonical source — AI can list candidates, find duplicates, surface conflicts. But which one is authoritative is my call, and it has to be written into the keep/drop table.
- Don't let "looks authoritative" equal "is authoritative" — FINAL / SPEC / README in a filename still has to pass both filtering layers.
- Don't let raw material and curated output share a directory — raw stays in the working directory, curated goes into the wiki. Once they're mixed, a few weeks later nobody can tell which is which.
- Don't conflate "delete" with "exclude" — excluded just means the knowledge base doesn't absorb it. The original file can still exist. This one rule lifted a lot of psychological weight off the cleanup process.
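The last two rules, keeping raw and curated apart and excluding without deleting, can be sketched as a copy-only step. The function name `absorb` and the directory layout are hypothetical; the only point is that the working directory is read and never modified.

```python
import shutil
from pathlib import Path


def absorb(raw_root: Path, wiki_root: Path, keep: list[str]) -> list[Path]:
    """Copy the kept files into the wiki; leave every original in place."""
    copied = []
    for rel in keep:
        src = raw_root / rel
        dst = wiki_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy, never move or delete
        copied.append(dst)
    return copied
```

Anything not named in `keep` is simply not absorbed. It stays in the working directory as a lead.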
Cleaning up a knowledge base is, at heart, drawing boundaries around material.
What stays and what goes isn't a taste question. It's "does this material have the standing to stand in for the source when someone asks a question in the future?" If it can stand in, keep it. If it can't, leave it as a lead. Leads can live scattered in working directories. Knowledge has to stand on its own inside the wiki.
In the end what I want isn't a bigger base. It's a base I'm still willing to open the next time I come back.