Organizing a Large Personal Archive: Backup Priority, Critical Assets, Config Drift

System Audit Notes

Taking inventory is not the same as backing up. Inventory is "I have no idea how much stuff I've installed on this machine myself" — and until you do that, every backup plan is a gamble.

I wrote an earlier piece called "Before Reinstalling My Mac mini, I Ran a System-Level Asset Audit," which is the story of that specific audit: what I did, what I found, how I decided to handle it. This one isn't a rerun of that story. It pulls out the method-level rules I distilled from that audit and presents them on their own: the backup priority matrix, the signals for spotting critical assets, the common config-drift patterns. These apply to any personal archive that's been accumulating for years, not just Macs, not just AI workstations.

The motivation is blunt. After getting burned a few times myself, I realized most people know almost nothing about the actual state of their own machine. Half the software you installed, you've forgotten. Config is scattered across a dozen places. Your API keys have been copied into three or four files. Several projects are still running processes you never shut down. In that state, any "backup plan" is luck — you think you backed things up, but what you backed up is the surface. The part that's actually going to bite you was never visible in the first place.

The Real Snapshot: Put the Numbers Down First

A few of the numbers from this audit surprised even me.

370GB of disk used out of 926GB. I'd assumed I was at half capacity. Turns out I was closer to 40% full.
9 git repos, 4 of them with no remote. Meaning: if the local disk dies, those 4 are gone for good.
179 uncommitted files, spread across those 9 repos — each one a piece of work I "meant to come back and commit" but never did.
22 LaunchAgents, the macOS background services that launchd starts at login. I could name maybe 10 of them. The rest, more than half, I have no memory of installing.
75+ command-line tools installed via Homebrew, plus 5 global npm packages.
14+ .env files holding environment variables, scattered across different projects, each one stuffed with API keys for external services. None of them in git.
23 listening ports, 11 active AI services running. They start at boot, but I'd never put the full list down on paper before.

That table is the first real artifact of an audit. Without it, "backup" just means copying Desktop, Documents, and a few visible project folders, and most of your real assets (the repos, .env files, LaunchAgents, and databases counted above) stay invisible.

So my first rule now is: before any backup decision, force yourself to fill out that table of numbers. Can't fill it out? Then don't talk about backup yet.

Three Backup Tiers: P0, P1, P2

Once the numbers are down, the next question is — which of these things must be backed up, which can be rebuilt, and which I just don't need to care about.

Early on I tried the "everything is important" attitude. The result was "nothing is important." Once a backup strategy has no priorities, it degrades into "back up whatever fits," and when critical data is lost you can only blame the dice. So I stick to three tiers now. More than that and I won't maintain it.

P0: Losing It Means Serious Damage

The bar here is — if this is gone, work stops, and there is no external resource that can rebuild it.

Uncommitted code — anything not yet pushed to a remote. The local disk is the only copy.
Business databases — the Postgres and SQLite instances running locally with actual business data inside.
Vector data — embeddings stored in chromadb, lancedb, or mem0. This one is its own special category and gets a section below.
Voice assets — recordings, generated audio samples. There's only one original.
.env files — full of API keys for third-party services. Lose them and you're filling out signup forms at dozens of websites again.
Custom LaunchAgents — the service definitions that start at boot. Lose them and you've smashed every entry point into your daily workflow.

P1: Expensive to Recover, But Recoverable

Losing this isn't fatal, but it takes a day or two to get back to where you were.

Model caches — local LLM weights you've pulled down. Re-downloading tens of gigabytes is grunt work, but doable.
Global packages — the collection of CLI tools installed via Homebrew and npm. Rebuildable from a Brewfile or similar manifest, assuming you have the manifest.
AI CLI configs — Claude Code, Codex CLI and friends. Prompts, custom commands, MCP integrations all live here.
Browser configs — bookmarks, extensions, logged-in sessions. The synced stuff is one thing; small unsynced tool configs are another.

P2: Can Be Re-Downloaded or Re-Configured

The bottom tier. Losing this barely matters.

Homebrew packages themselves — as long as the manifest survives, reinstalling is trivial.
Application installers — the App Store or vendor sites will hand them back.
Build artifacts — node_modules, build, dist. The source is there; regenerate.

The key to these three tiers isn't how finely you slice them — it's that you actually treat each tier differently after slicing. P0 needs redundant backups (cloud plus an offsite physical copy). P1 needs one copy. P2 doesn't need to be in the backup at all. With that, backup volume drops from "hundreds of GB across the whole disk" to tens of GB — small enough that it can actually run every day, and small enough that you notice immediately when it breaks.

The side effect of backing up everything is that the backup gets too big to run, so you push it to weekly, then monthly, then "last backup was six months ago." Tiering isn't about saving disk space. It's about giving the backup a chance to actually keep running.
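The three tiers can be written down as a small rule table so that a backup script treats each one differently. This is a minimal sketch, not the tooling I actually run: the directory names, file suffixes, and the "unknown files default to P1" rule are illustrative assumptions.

```python
from enum import Enum
from pathlib import PurePosixPath


class Tier(Enum):
    P0 = "redundant backup: cloud plus an offsite physical copy"
    P1 = "one backup copy"
    P2 = "excluded from backup"


# Hypothetical rules; adjust them to what the inventory actually turned up.
P2_DIR_NAMES = {"node_modules", "build", "dist"}   # rebuildable artifacts
P0_SUFFIXES = {".db", ".sqlite", ".sqlite3"}       # local database files


def classify(path: str) -> Tier:
    """Assign a backup tier to a path using simple name-based rules."""
    p = PurePosixPath(path)
    parts = set(p.parts)
    if parts & P2_DIR_NAMES:
        return Tier.P2
    if p.name == ".env" or p.name.startswith(".env."):
        return Tier.P0
    if "LaunchAgents" in parts:
        return Tier.P0
    if p.suffix in P0_SUFFIXES:
        return Tier.P0
    # Assumption: anything unrecognized gets one copy rather than none.
    return Tier.P1
```

Checking the P2 rule first is deliberate: a stray .env inside node_modules belongs to a dependency, not to me.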

Six Signals for Spotting Critical Assets

The priority matrix tells you how to categorize. The next question is — where are these things actually hiding? Generic backup tools can't see most of them. You have to go category by category yourself.

Every time I do an inventory now I walk through these six signals, each tied to a real scenario. Miss one, and you'll discover after the reinstall that "oh, that thing is gone."

Signal 1: Uncommitted Code

The easiest one to miss, because everyone defaults to "all my code is in git." But git only has what you've committed. Those 179 uncommitted files are not in git.

The actual move: list every git repo on the machine, then run `git status` on each to see uncommitted changes, and `git remote -v` to see whether a remote exists. A repo that fails both checks is high-risk: no remote means the local copy is the only copy, and uncommitted means even the local hasn't been archived.
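The two checks above can be scripted. The sketch below assumes `git` is on the PATH; the function names and the report fields are my own, and `--porcelain` is used only because its one-line-per-entry output is easy to count.

```python
import subprocess
from pathlib import Path


def find_repos(root: Path) -> list[Path]:
    """Return every directory under root that contains a .git entry."""
    return sorted(entry.parent for entry in Path(root).rglob(".git"))


def git_lines(repo: Path, *args: str) -> list[str]:
    """Run a git subcommand inside repo and return its non-empty output lines."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def audit_repo(repo: Path) -> dict:
    """Summarize the two risk signals for one repo."""
    # One line per modified, staged, or untracked entry.
    dirty = git_lines(repo, "status", "--porcelain")
    # One line per configured remote name; empty means local-only.
    remotes = git_lines(repo, "remote")
    return {
        "repo": str(repo),
        "uncommitted": len(dirty),
        "has_remote": bool(remotes),
        "high_risk": bool(dirty) and not remotes,
    }
```

Running `audit_repo` over everything `find_repos(Path.home())` returns gives one row per repo. Note that an untracked directory counts as a single entry, so the number is a floor, not an exact file count.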

Of the 4 no-remote repos I found that time, 2 were leftovers from early experiments. But they held tuned parameters and small utilities I'd worked out at the time. If they were gone, I'd have to redo that work. This kind of stuff doesn't announce its value to you — you only remember it existed once it's gone.

Signal 2: Running Databases

Local data stores like Postgres, SQLite, or ChromaDB: if you back them up by just copying the data files while something is still writing to them, the copy is often broken, because the database was mid-write and what you copied is a half-state.

So for this class of asset, the backup action isn't "copy files." It's "stop the service first, or use the database's own dump tool." Skip both and start backing up, and recovery later will most likely reveal that the backup is corrupt.
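For SQLite, "the database's own tool" is available from Python's standard library: `sqlite3.Connection.backup` copies a consistent snapshot even while other connections are open. A minimal sketch, with hypothetical file paths; for Postgres the equivalent would be `pg_dump`, which is not shown here.

```python
import sqlite3


def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies the whole database page by page in a consistent state,
        # unlike a plain file copy taken mid-write.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

A call would look like `backup_sqlite("app.db", "backups/app.db")`, where both paths are placeholders.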

The more practical problem is that most people genuinely don't know which databases are running on their machine. They came in as dependencies of some project, they're listening on some port, they start at boot, and you've never thought about them directly. Inventory means specifically checking every listening port and every database process to see what's actually inside.
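Checking every listening port can start from `lsof`. The sketch below assumes macOS or another system where `lsof -nP -iTCP -sTCP:LISTEN` is available and prints its usual column layout; the parser is my own and only reads the command, PID, and port.

```python
import subprocess


def parse_lsof_listeners(output: str) -> list[tuple[int, str, int]]:
    """Turn lsof output into sorted, de-duplicated (port, command, pid) tuples."""
    listeners = set()
    for line in output.splitlines()[1:]:          # skip the header row
        fields = line.split()
        if len(fields) < 10 or fields[-1] != "(LISTEN)":
            continue
        command, pid, address = fields[0], int(fields[1]), fields[-2]
        # The address looks like 127.0.0.1:5432, *:3100, or [::1]:5432.
        port = int(address.rsplit(":", 1)[1])
        listeners.add((port, command, pid))
    return sorted(listeners)


def listening_ports() -> list[tuple[int, str, int]]:
    """Run lsof and parse its output; an empty result is not treated as an error."""
    result = subprocess.run(
        ["lsof", "-nP", "-iTCP", "-sTCP:LISTEN"],
        capture_output=True, text=True,
    )
    return parse_lsof_listeners(result.stdout)
```

The set collapses the duplicate rows that appear when one process listens on both IPv4 and IPv6.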

Signal 3: Vector Data (The Most Special One)

chromadb, lancedb, mem0 — these local vector stores hold embeddings: high-dimensional vector representations of documents, chat logs, knowledge snippets. The special thing about them is this: in theory you can recompute from the source data. In practice you almost can't.

Why? Because the rebuild needs three things to be true at once: the source data still exists, the embedding model you used is still accessible, and you remember the chunking and cleaning rules. Miss any one of the three and the rebuilt vector store is different from the original — search results shift, similarity thresholds need retuning, and every downstream pipeline that depends on it needs regression testing.
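One way to keep those three things recoverable is to store a small manifest next to every vector-store backup. This is a sketch under my own assumptions: the field names are hypothetical, and the source hash is just a fingerprint of the input files, not anything chromadb, lancedb, or mem0 produce themselves.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EmbeddingManifest:
    embedding_model: str        # which model produced the vectors
    chunk_size: int             # chunking rule: size
    chunk_overlap: int          # chunking rule: overlap
    cleaning_rules: list[str]   # free-text notes on cleaning steps
    source_sha256: str          # fingerprint of the source data


def hash_sources(paths) -> str:
    """Fingerprint source files by name and content, independent of input order."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def write_manifest(manifest: EmbeddingManifest, dest) -> None:
    Path(dest).write_text(json.dumps(asdict(manifest), indent=2, sort_keys=True))


def load_manifest(path) -> EmbeddingManifest:
    return EmbeddingManifest(**json.loads(Path(path).read_text()))
```

If a later rebuild produces a manifest that differs from the stored one, that difference is the warning that search results will shift.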

My own local knowledge bases have been running for months. I've swapped embedding models in that time, tuned chunking strategies, cleaned out bad entries a few times. Rebuilding from zero would probably be harder than building it the first time. So vector data is P0 for me, sitting on the same tier as the databases.

Signal 4: .env Files

A .env file is what a project uses to hold environment variables — usually stuffed with API keys, database connection strings, tokens for third-party services. By convention it doesn't go into git, which means backup has to handle it specially.

The problem is they're scattered across project roots, config subdirectories, sometimes buried inside dotfiles. I scraped up 14+ of them that time, spread across 8 projects. Opening each one revealed credentials for external services — losing them would mean re-applying at dozens of sites and remembering which email I used, which team I was on, which usage tier I'd asked for.

So inventory has to include a sweep of every .env, .env.local, .env.production-style file. Note where each one lives and what kinds of secrets it carries. They go straight into P0.
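The sweep can be sketched as a directory walk that records where each file lives and which variable names it defines, without ever printing the values. The skip list and the name-matching rule are assumptions; widen them if your projects use other conventions.

```python
import os
from pathlib import Path

# Assumption: env files inside these directories belong to dependencies.
SKIP_DIRS = {"node_modules", ".git"}


def find_env_files(root) -> list[Path]:
    """Return every .env or .env.* file under root, skipping SKIP_DIRS."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name == ".env" or name.startswith(".env."):
                found.append(Path(dirpath) / name)
    return sorted(found)


def key_names(env_file: Path) -> list[str]:
    """List the variable names defined in an env file, never the values."""
    names = []
    for line in env_file.read_text(errors="replace").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        names.append(name)
    return names
```

The output of `key_names` is safe to paste into the inventory list, since it shows which kinds of secrets a file carries but not the secrets themselves.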

Signal 5: Custom LaunchAgents

A LaunchAgent is a macOS launchd job definition: a property list (.plist) file stored under ~/Library/LaunchAgents/. Each file describes a service that launchd starts when you log in, which on a single-user machine amounts to every boot. It might be an AI service, a monitoring script, or a scheduled job.

I found 22 of them that time. At least half were experiments I'd installed long ago and never uninstalled. Losing this class of asset doesn't sting immediately — but the next time you boot, you'll notice a pile of things missing: the AI services that started themselves, the backup scripts that ran on a schedule, the small monitors watching for anomalies. All gone. Reconstructing each one from memory is basically impossible.

So the whole LaunchAgents directory goes into P0 — back up all of ~/Library/LaunchAgents/ as one unit. And this is also a cleanup opportunity. While you're inventorying, decide which ones can actually be deleted. Don't blindly keep all of them.
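To decide which agents can go, it helps to see what each one actually runs. The sketch below reads the plist files with the standard-library `plistlib` and pulls out the standard launchd keys `Label`, `ProgramArguments`, `Program`, and `RunAtLoad`; the row layout is my own.

```python
import plistlib
from pathlib import Path


def summarize_launch_agents(directory) -> list[dict]:
    """Return one row per .plist file: label, command line, and RunAtLoad flag."""
    rows = []
    for plist_path in sorted(Path(directory).glob("*.plist")):
        try:
            with plist_path.open("rb") as fh:
                job = plistlib.load(fh)
        except Exception:
            job = None   # unreadable or malformed file
        if not isinstance(job, dict):
            rows.append({"file": plist_path.name, "label": "(unreadable)",
                         "command": "", "run_at_load": False})
            continue
        program = job.get("ProgramArguments") or [job.get("Program", "")]
        rows.append({
            "file": plist_path.name,
            "label": job.get("Label", ""),
            "command": " ".join(program),
            "run_at_load": bool(job.get("RunAtLoad", False)),
        })
    return rows
```

Pointing it at `Path.home() / "Library" / "LaunchAgents"` gives a table to annotate with keep or delete before the directory goes into the backup.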

Signal 6: Plaintext Secrets (The Most Dangerous Class)

This is the one I least want to look back at. During inventory you check your shell profile — .zshrc, .bashrc, .bash_profile — and you'll often find a line like `export OPENAI_API_KEY=...`. Plaintext key, loaded into every shell at startup.

Two problems with keeping it there. One is security: a plaintext secret in a config file is readable by anything that can read the file, including some less-than-clean tools you've installed. The other is mobility: shell profiles get backed up to cloud drives, copied to new machines, and pasted into screenshots when you're asking someone for help. One slip and the key leaks.

So this isn't just a backup problem, it's a refactor problem. The backup still has to include the full shell profile (treat it as P0 for as long as it carries the keys), but after the inventory you have to schedule a task: move every plaintext secret from the shell profile into a password manager, then read it back from there at runtime. I haven't finished that one myself. It's the next thing on the list.
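Finding the offending lines is the easy part and can be sketched with a regular expression. The name hints below are a heuristic I made up for illustration, so expect both false positives and misses. The function reports line numbers and variable names only.

```python
import re

EXPORT_RE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.+)$")
# Hypothetical hints for "this variable probably holds a secret".
SECRET_HINTS = ("KEY", "TOKEN", "SECRET", "PASSWORD")


def find_plaintext_secrets(profile_text: str) -> list[tuple[int, str]]:
    """Return (line number, variable name) for exports that look like literal secrets."""
    hits = []
    for lineno, line in enumerate(profile_text.splitlines(), start=1):
        match = EXPORT_RE.match(line)
        if not match:
            continue
        name, value = match.groups()
        value = value.strip().strip("'\"")
        # A value starting with $ is a variable or command substitution,
        # so the secret itself is not written in the file.
        if value.startswith("$"):
            continue
        if any(hint in name.upper() for hint in SECRET_HINTS):
            hits.append((lineno, name))
    return hits
```

Feeding it the text of .zshrc, .bashrc, and .bash_profile gives the to-do list for the migration into a password manager.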

Five Common Config-Drift Patterns

Finishing the backup doesn't mean the archive is organized. The real trouble is that once a machine has been running for a few years, configs start contradicting each other — and a generic backup plan is completely blind to this layer.

After that inventory run, I grouped the conflicts I'd hit into five patterns. None of them is fatal alone. Combined, they're what makes you say after a reinstall: "why does some of this work, and some of this almost work?"

Pattern 1: Port Semantics Drift

The most common case. Port 3100 is a web service in project A, a database admin UI in project B, and grabbed by some AI tool in project C. All three start at boot. Whoever wins the race gets the port; the other two fail silently. No one tells you.

The sneakier version is a one-digit slip in the port number: 3100 vs 13100 used by different components, and a config file with an extra or missing digit happily connects to the wrong service. The logs look fine, because the other end is also an HTTP service. It just isn't the one you wanted.

So during inventory you list every listening port, cross-reference against the port declarations in each project's config files, and look for collisions. No backup tool can do this for you. You have to walk it.
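Once the declared ports are collected by hand into a mapping, the cross-reference itself is a few lines. A minimal sketch; the project names and the shape of the input are assumptions, since every project declares its ports differently.

```python
from collections import defaultdict


def find_port_collisions(declared: dict[str, list[int]]) -> dict[int, list[str]]:
    """Map each port claimed by more than one project to the projects claiming it."""
    owners = defaultdict(set)
    for project, ports in declared.items():
        for port in ports:
            owners[port].add(project)
    return {port: sorted(names) for port, names in owners.items() if len(names) > 1}


def undeclared_listeners(listening: set[int], declared: dict[str, list[int]]) -> list[int]:
    """Ports that are actually listening but appear in no project's config."""
    known = {port for ports in declared.values() for port in ports}
    return sorted(listening - known)
```

With `declared = {"project-a": [3100], "project-b": [3100], "project-c": [13100]}`, the first function flags 3100, and 13100 stands out next to it as a likely one-digit slip worth checking by eye.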

Pattern 2: Stale Path References

Your crontab points at /Users/me/old-project/run.sh, but old-project was deleted three months ago. A symlink points at a directory that no longer exists. An MCP config — Model Context Protocol, what AI tools use to connect to external services — points at a service that's been migrated elsewhere.

This kind of stale reference gets preserved as-is by the backup. When you restore, it's still sitting in your config, still pointing at a target that vanished long ago. Mild case: log errors. Bad case: a tool that depends on that path just dies on startup.

The fix is — during inventory, walk through the crontab, every symlink, every external tool's config file, and verify each target path or service still exists. Doesn't exist? Decide right there: either delete it, or repoint it at the new path.
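Two of these checks are mechanical enough to sketch: finding symlinks whose target is gone, and finding absolute paths in a crontab that no longer exist. The crontab parser is deliberately naive (it only looks at whitespace-separated tokens that start with `/`), and the input is assumed to be the text printed by `crontab -l`.

```python
import os
from pathlib import Path


def broken_symlinks(root) -> list[Path]:
    """Return symlinks under root whose target no longer exists."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # exists() follows the link, so it is False for a dangling one.
            if path.is_symlink() and not path.exists():
                broken.append(path)
    return sorted(broken)


def missing_paths_in_crontab(crontab_text: str) -> list[str]:
    """Return absolute paths mentioned in crontab entries that do not exist."""
    missing = []
    for line in crontab_text.splitlines():
        if line.strip().startswith("#"):
            continue
        for token in line.split():
            if token.startswith("/") and not Path(token).exists():
                missing.append(token)
    return missing
```

MCP configs and other tool-specific files need their own readers, so they stay a manual step in this sketch.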

Pattern 3: Dead Service Dependencies

Service A's config says "depends on service B at port 5432," but B got replaced by C three months ago, and 5432 is empty. A tries to connect on every startup, fails, and falls back to a degraded mode. You have no idea it's running in degraded mode.

This kind of problem doesn't show up under normal use, because the degraded mode often "looks like it works." By the time something actually breaks and you go check the logs, you find that a key part of the pipeline has been severed for months.

During inventory, go through each service's config and list what it depends on. Then cross-check: are those dependencies actually still running? Anything that isn't has to be deleted or restored — don't leave it sitting in the config with a "should be there" status.
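The cross-check can be approximated with a plain TCP connect: if nothing accepts a connection on the declared host and port, the dependency is dead. A sketch using the standard `socket` module; the dependency mapping is a hand-written assumption, and an open port only proves that something is listening, not that it is the right service.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def dead_dependencies(deps: dict[str, list[tuple[str, int]]]) -> list[tuple[str, str, int]]:
    """List (service, host, port) for every declared dependency nobody is serving."""
    dead = []
    for service, targets in deps.items():
        for host, port in targets:
            if not is_port_open(host, port):
                dead.append((service, host, port))
    return dead
```

An entry such as `{"service-a": [("127.0.0.1", 5432)]}` reproduces the example above: if 5432 is empty, service A shows up in the result.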

Pattern 4: Cross-Root Scheduling

This is the one I've stepped into the deepest. A scheduler that was supposed to handle only its own project's jobs slowly accumulates lines like "also kick off the script in the project next door." Then one day you refactor that other project, move the directory, and the scheduler is still running against the old path — either erroring out or running the wrong file.

What makes it worse: this cross-root scheduling tends to be asymmetric. Scheduler A knows it's calling B. B has no idea anyone outside is calling it. When B's maintainer makes a change, they're not thinking about A's dependencies. So the conflict happens with zero warning.

So during inventory each scheduler needs a "who I'm calling" list, plus the reverse view of "who's calling me." With both sides reconciled, you can finally judge which cross-root calls are intentional and which are historical baggage.
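Both views fall out of one mapping: scheduler root to the scripts it calls. The sketch below assumes that mapping has been collected by hand and that paths are absolute POSIX paths; the project names are placeholders.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def callers_of(calls: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert 'who I'm calling' into 'who's calling me', keyed by script path."""
    called_by = defaultdict(set)
    for scheduler_root, targets in calls.items():
        for target in targets:
            called_by[target].add(scheduler_root)
    return {target: sorted(roots) for target, roots in called_by.items()}


def cross_root_calls(calls: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (scheduler root, target) pairs where the target lives outside that root."""
    crossing = []
    for scheduler_root, targets in calls.items():
        for target in targets:
            if not PurePosixPath(target).is_relative_to(scheduler_root):
                crossing.append((scheduler_root, target))
    return crossing
```

Every pair returned by `cross_root_calls` is a question for the inventory: intentional, or historical baggage?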

Pattern 5: Historical-Copy Confusion

The same config file exists in several places: one local, one in a backup directory, one in archive, one copied out during some experiment. The names are all similar, the contents are slightly different, the timestamps aren't far apart. Figuring out which one is canonical — the authoritative one, the one actually being used — turns into archaeology.

A single person can power through this for a while, but it falls apart over time. Six months later you genuinely can't remember which copy is "the one I'm actually using right now." And when an AI tool comes along to read this pile of files, it's even more likely to pick the wrong one.

The fix is — during inventory, every critical config gets one canonical path designated. The other copies either get deleted or explicitly marked as historical (move them into an archive/ subdirectory, for example). The principle is "only one copy is live at any given time." No "they all still work" allowed.
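Before designating a canonical path, it helps to know how many distinct versions exist. A sketch that groups copies of one file by content hash; it assumes the copies share an exact filename, which is a simplification of "the names are all similar."

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def group_copies(root, filename: str) -> dict[str, list[Path]]:
    """Group every file named filename under root by the SHA-256 of its content."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            path = Path(dirpath) / filename
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {digest: sorted(paths) for digest, paths in groups.items()}


def has_diverged(groups: dict[str, list[Path]]) -> bool:
    """True when the copies no longer all share the same content."""
    return len(groups) > 1
```

Identical copies collapse into one group and can be deleted safely. Each extra group is a version that needs a decision.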

What's Worth Keeping in Cold Storage

Inventory doesn't mean delete everything. Beyond P0/P1/P2, there's another category — stuff that doesn't affect system operation, but "I'll probably want to look this up the next time I write something." I call this category seed material in cold storage.

Most historical files aren't worth much — old chat logs, old experiment outputs, old versions of design files, expired ad-hoc reports. Those can go in one swipe. But these four kinds I pull out and keep separately:

Cross-module analyses — the panoramic views that put several modules of a system side by side: call graphs, permission propagation paths. Producing one of these costs a lot. Keep them so you can see how you understood things at the time.
Teaching material — bilingual notes on an open course, organized chapter summaries. These are filtered, second-pass artifacts. More useful than the source video.
Research reports — industry surveys, technology-evolution writeups, comparative evaluations of specific tools. Conclusions that took days of digging at the time, still usable as a starting point months later.
Meta information — quality-check reports, classification lists, snapshots of directory structure. Data about the data. Rebuilding it is painful, and there are only a handful of these files anyway.

The judgment is simple. The cost of throwing it out is "I'll have to redo this work when I think of it again." The cost of keeping it is "some disk space." The former is much more expensive than the latter, so keep it. But put it in a cold-storage directory. Don't leave it mixed in with working folders — once it's mixed in, every time you open the folder you have to re-decide "is this hot data or cold data," and that wears your attention down.

What Comes After the Inventory

Inventory isn't a one-shot thing. My current rhythm: every time I'm about to do something major — reinstall, migrate to a new machine, swap a drive — I run this whole sequence again. Day-to-day I keep a living list. A new LaunchAgent installed, a new .env written, a new global tool added — note it down on the spot.

The hardest part of this isn't the tooling. It's the mindset. Admitting "I don't know what I've installed on my own machine" takes some nerve — a lot of people will resist instinctively, because admitting it means having to face that table of numbers. But you only have to face it once. The next time is much easier.

The core of organizing an archive isn't backing up more diligently. It's turning the invisible assets into a visible list. Invisible means gambling. Visible is the first time strategy enters the conversation.

My own next steps are still half-done: move every plaintext secret out of the shell profile into a password manager, fix the config conflicts I've already spotted, clean up the 22 LaunchAgents and delete half, and decide for each of the 4 no-remote repos whether to add a remote or archive it for good. None of this is finishable in one pass — and none of it is the kind of problem a single checklist solves.

But with that table of numbers and these few rules in hand, at least the next time I face them I won't have to ask "what is actually installed on this machine." That's enough to count as a starting point.