Field Note

9TB Music Library Read-Only Indexing: The Engineering Constraints I Set for Myself

Turning 9TB of personal music into a queryable index — the hard part isn't writing code, it's nailing down the 'never touch the source disk' rule first. Five constraints, ten stages, and a still-running engineering site with real bug stories.

Knowledge Systems · Audio Indexing · Engineering Constraints · Read-Only · public-safe
[Image: watercolor sketch of a tall stack of music library discs and hard drives next to a laptop, with index data on the screen]

Music Indexing — Engineering Notes

Turning a 9TB personal music library into a queryable index sounds like a weekend script. The hard part isn't writing the code. It's nailing down "never touch the source disk" before you write a single line.

The drive at home holds 115,999 audio files (FLAC, MP3, WAV, DSF, all mixed in), 7.39 TB in total, which takes up most of a 9TB disk. This pile took a decade-plus to accumulate. Some I ripped from CDs myself. Some came from friends. Some I downloaded in the early years. Some I rescued off old dying drives. Every file has a small story behind it, but the stories don't matter. What matters is that the whole thing is now a black box: I know it's there, and I can't query it.

The first time I sat down to build an index, I almost just started writing — scan the tree, read the metadata, push it into a database, slap a search UI on top. Then I stopped. Because I've watched too many people — myself, years ago, included — touch the source and break it. Renaming. Moving. Editing tags. Adding cover art. Every single time it was "just this one tweak," and every single time, looking back, the right move was: don't.

So this round I did something counterintuitive first: I wrote the rules before I wrote the code. Five constraints, treated heavier than the actual business logic. Scripts can be rewritten. Constraints don't bend.

Constraint 1: Read-only — don't touch a single file on the source disk

The most basic rule, and the easiest one to break. The second your scan script contains an open(path, 'w'), a shutil.move, an os.rename — the whole constraint is gone.

I wrote it brutally narrow: don't modify, don't move, don't delete, don't rename. Not even "let me just clean up that one filename." The reason isn't technical, it's trust. This disk holds a decade of my own material. I will not let any script do something to it "for my own good." The index can be rebuilt. A corrupted source can't.

The enforcement happens at the tool layer. Any path that points at the source disk can only be opened in read mode; any write intent raises immediately. Sounds paranoid, but a few weeks in, this single rule has caught more than just my own slips: a couple of times it also stopped an AI assistant that tried to "tidy up the filenames for me" and crossed the line.
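A minimal sketch of what such a tool-layer guard could look like in Python. The names `SOURCE_ROOT`, `is_under`, and `guarded_open` are hypothetical, and the mount point is a placeholder; this is one way to express the rule, not necessarily how the real scripts do it.

```python
import os

# Hypothetical mount point for the source disk; adjust to the real one.
SOURCE_ROOT = "/mnt/music_source"

def is_under(path, root):
    """True if path is root itself or lives somewhere below it (symlinks resolved)."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    return path == root or path.startswith(root.rstrip(os.sep) + os.sep)

def guarded_open(path, mode="r", source_root=SOURCE_ROOT, **kwargs):
    """Stand-in for open() that refuses write-capable modes on the source disk."""
    wants_write = any(flag in mode for flag in "wax+")
    if wants_write and is_under(path, source_root):
        raise PermissionError(f"refusing mode {mode!r} on source path: {path}")
    return open(path, mode, **kwargs)
```

A guard like this only covers calls that are routed through it. Operations such as `os.rename` or `shutil.move` would need the same check, and mounting the disk read-only at the OS level is a stronger backstop.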

Constraint 2: Don't write anything to the source — isolate the workspace

This extends rule one, but cuts finer. Not only can you not edit the source files, you also can't drop a temp database, cache, log, or state file anywhere on the source disk.

I didn't realize this needed its own rule until I once used a tool to scan a photo library, and afterwards found it had quietly seeded a .cache file inside every subdirectory. Looks harmless. But the source disk was no longer clean — it now carried the tool's fingerprints, and switching tools later meant cleaning up first.

So now every index output — database, cache, ffprobe reports, error logs, checkpoint state — lives in a dedicated working directory on a separate work disk. The source disk only plays one role: data source. Never workbench. The two are physically isolated, separate mount points and all.
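The isolation rule can be checked before anything runs. The sketch below assumes a POSIX-style system and uses a hypothetical helper name, `assert_isolated`; comparing `st_dev` values is one plausible way to confirm that the two directories sit on different devices.

```python
import os

def assert_isolated(source_root, work_dir, require_separate_device=True):
    """Raise if the work directory is inside the source tree or shares its device."""
    src = os.path.realpath(source_root)
    work = os.path.realpath(work_dir)
    if work == src or work.startswith(src.rstrip(os.sep) + os.sep):
        raise RuntimeError(f"work dir {work} is inside source root {src}")
    # st_dev identifies the device a path lives on; equal values mean same filesystem.
    if require_separate_device and os.stat(src).st_dev == os.stat(work).st_dev:
        raise RuntimeError("source and work dir are on the same device")
```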

The side effect is great: I can unplug the source disk anytime, change its interface, copy it to another machine, and the index side doesn't notice.

Constraint 3: No network identification — pure local processing

This is where modern tools love to stab you in the back. MusicBrainz lookups, AcoustID fingerprinting, all the online lyric matchers: in many taggers and players these are on by default unless you actively turn them off. Left on, every scanned track can ship a fingerprint or a metadata query to a remote service, and a few days later your entire music library has been profiled remotely.

I don't want that convenience. First, privacy — my private listening habits, taste, and collection shape don't need to become training material for some online service. Second, stability — network identification makes the index depend on "whatever the cloud returned that day." A track matched today might not match tomorrow, and the index stops being reproducible.

So: strictly local. Metadata comes only from the tags inside the file itself. Quality scoring looks only at file properties. Duplicate detection uses only local hashes and durations. Could I get sharper identification by going online? Sure. But the price is losing "this index is fully rebuildable, fully offline" — and I can't afford to trade that away.
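As an illustration of duplicate detection that never leaves the machine, here is a rough sketch that groups files by size, rounded duration, and a hash of the first megabyte. The key design, the 1 MiB cutoff, and the function names are assumptions made for the example; the note does not say exactly which hashes the real index uses.

```python
import hashlib
from collections import defaultdict

def quick_fingerprint(path, head_bytes=1 << 20):
    """SHA-256 of the first head_bytes of the file, read in binary read-only mode."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        digest.update(fh.read(head_bytes))
    return digest.hexdigest()

def duplicate_candidates(records):
    """records: iterable of (path, size_bytes, duration_seconds) tuples.

    Returns groups of paths that share size, rounded duration, and head hash.
    These are candidates to mark for review, not files to delete.
    """
    groups = defaultdict(list)
    for path, size, duration in records:
        key = (size, round(duration), quick_fingerprint(path))
        groups[key].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```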

Constraint 4: Resumable scans — running twice mustn't write twice

Over 110,000 files is not a five-minute scan. The first real full ffprobe + mutagen pass ran all night and still wasn't done. Network hiccups, power blips, an accidental Ctrl-C — all of those have to be survivable. The script has to pick up where it left off.

Resumable scanning sounds simple. In practice it's all traps. The biggest one is double-writes — if the last run died halfway and the restart doesn't dedupe, the same track gets inserted twice. The entire index is then untrustworthy.

So the real meaning of this constraint isn't "can resume from interruption." It's "running the same script repeatedly must converge to the same index." One file path, one row. Idempotency is the floor. Technically I lean on a three-piece set: a unique index in the database, explicit upserts, and a separate "already processed paths" table.
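The three-piece set could be sketched with SQLite roughly as follows. The table and column names (`tracks`, `processed_paths`, and so on) are invented for illustration, and the `ON CONFLICT ... DO UPDATE` upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tracks (
    path TEXT NOT NULL,
    size_bytes INTEGER,
    title TEXT,
    artist TEXT
);
-- Piece 1: a unique index, so one file path can only ever own one row.
CREATE UNIQUE INDEX IF NOT EXISTS idx_tracks_path ON tracks(path);
-- Piece 3: a separate record of which paths have been fully processed.
CREATE TABLE IF NOT EXISTS processed_paths (path TEXT PRIMARY KEY);
"""

def open_index(db_path):
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn

def record_track(conn, path, size_bytes, title, artist):
    """Piece 2: an explicit upsert. Re-running overwrites the row instead of adding one."""
    with conn:  # one transaction: both writes commit together or not at all
        conn.execute(
            """INSERT INTO tracks(path, size_bytes, title, artist)
               VALUES (?, ?, ?, ?)
               ON CONFLICT(path) DO UPDATE SET
                   size_bytes = excluded.size_bytes,
                   title = excluded.title,
                   artist = excluded.artist""",
            (path, size_bytes, title, artist),
        )
        conn.execute("INSERT OR IGNORE INTO processed_paths(path) VALUES (?)", (path,))

def already_processed(conn, path):
    row = conn.execute("SELECT 1 FROM processed_paths WHERE path = ?", (path,)).fetchone()
    return row is not None
```

On restart, the scan loop can skip anything for which `already_processed` returns true. Even if that check were bypassed, the unique index and the upsert would still keep the table at one row per path.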

Constraint 5: Error isolation — one corrupt file can't kill the job

Out of 110,000+ files, some are corrupted, some have wrong permissions, some have weirdly encoded filenames, some have busted format headers. Final tally: 543 files ffprobe couldn't read, 744 files mutagen failed on. That is well over a thousand read failures between the two tools, and the full scan cannot stop because of them.

Early versions of mine took the lazy path: hit an exception, raise. The result was the script dying at 30%, restarting from zero, dying again at 35%. Forever circling inside the first 40%.

Then I rewrote it: every file gets its own try/except, errors go to a dedicated error log, the main loop keeps moving. Only after that change did the scan actually finish. The point of error isolation is to accept reality: a library of 110,000 files will produce errors by the hundreds or thousands, and the abnormal thing isn't the errors, it's letting those errors stop the other 100,000-plus files. Errors get logged, never silently swallowed; but once an error is logged, the main task keeps running.
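The per-file isolation described above might look something like this. `scan_all` and the JSON-lines log format are assumptions for the sketch; `read_metadata` stands in for whatever ffprobe or mutagen wrapper is actually used.

```python
import json

def scan_all(paths, read_metadata, error_log_path):
    """Process every path; a failure is logged and counted, never fatal.

    error_log_path should live on the work disk, never on the source disk.
    """
    ok = failed = 0
    with open(error_log_path, "a", encoding="utf-8") as log:
        for path in paths:
            try:
                read_metadata(path)
                ok += 1
            except Exception as exc:  # Ctrl-C (KeyboardInterrupt) still stops the run
                failed += 1
                entry = {"path": path, "error": type(exc).__name__, "detail": str(exc)}
                log.write(json.dumps(entry) + "\n")
    return ok, failed
```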

Why ten stages — cut by risk, not by feature

Once the five constraints were nailed down, I didn't write a single "one-click scan" megascript. I sliced the whole thing into ten stages, each one runnable on its own, each one verified on its own, each one signed off on its own.

Slicing by stage is not the same as slicing by feature. Slicing by feature asks, "what is this chunk of code responsible for?" Slicing by stage asks, "what specific risk can this step blow up on?" Once you cut by risk, a failure at any stage only costs you that stage — the earlier stages don't need to be redone.

  • Stage 0: Safety checks + working directory init — confirm source is read-only, work disk is writable, no path escapes its sandbox.
  • Stage 1: Dependency check — ffprobe, mutagen, Python env, database schema all in place.
  • Stage 2: Audio file discovery — read-only scan of the whole disk, listing the path and size of every one of the 110,000+ files. No metadata reads yet; just prove we can walk the whole disk.
  • Stage 3: Sample validation — pull 300 files at random, try reading their metadata, measure success rate, and project how long a full pass will take.
  • Stage 4: Full metadata read — based on the sample projection, run the full pass with confidence. Resumable mode on.
  • Stage 5: Duplicate candidate analysis — only "mark" potential duplicates, never delete.
  • Stage 6: Quality scoring draft — every track gets a 0-100 score.
  • Stage 7: AI DJ initial index v0 — fuse everything above into a queryable index.
  • Stage 8: Final reconciliation — cross-check completeness against the source disk.
  • Stage 9: Index acceptance and patching — manually spot-check a few hundred rows to find rule gaps.
  • Stage 10 and beyond: player integration, tag enrichment, audio feature analysis. These are separate concerns, all on hold until Stages 0-9 are solid.

The biggest win from ten stages: when any stage fails, I only roll back that stage. Stage 4 ran all night and crashed? Stages 0-3 are still good. The rerun is just that segment. If the whole thing had been one overnight monolith, a failure means starting over completely.
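One way to get "only rerun the failed segment" is a tiny runner that records finished stages in a state file on the work disk. This is a sketch under that assumption; `run_stages` and the JSON state format are hypothetical, and each stage is modeled as a function that returns True when its pass criteria are met.

```python
import json
from pathlib import Path

def run_stages(stages, state_file):
    """Run (name, func) stages in order, skipping those already marked done.

    Returns the name of the stage that failed, or None if all stages passed.
    """
    state_path = Path(state_file)
    done = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    for name, func in stages:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        if not func():
            return name  # stop here; earlier stages stay marked as done
        done.add(name)
        state_path.write_text(json.dumps(sorted(done)))
    return None
```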

The other win: every stage has its own "pass criteria." If Stage 3's sample success rate is below 90%, Stage 4 doesn't run — I go back and find out why first. That way the downstream stages always get clean input.
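The Stage 3 gate can be expressed as a small function: sample, measure the success rate, project the full-pass duration, and decide whether Stage 4 may run. The 300-file sample and 90% threshold come from the text above; the function name, return shape, and linear time projection are assumptions.

```python
import random
import time

def sample_gate(paths, try_read, sample_size=300, min_success=0.90, seed=None):
    """Read a random sample and report whether the full pass should proceed."""
    if not paths:
        raise ValueError("no files discovered; nothing to sample")
    sample = random.Random(seed).sample(list(paths), min(sample_size, len(paths)))
    successes = 0
    started = time.perf_counter()
    for path in sample:
        try:
            try_read(path)
            successes += 1
        except Exception:
            pass  # a failed read only lowers the success rate
    elapsed = time.perf_counter() - started
    rate = successes / len(sample)
    return {
        "sampled": len(sample),
        "success_rate": rate,
        # Naive linear projection: average seconds per sampled file times total files.
        "projected_seconds": elapsed / len(sample) * len(paths),
        "proceed": rate >= min_success,
    }
```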

The six dimensions of quality scoring — and why no listening tests

Stage 6's quality scoring is the most criticizable part of this whole index. Someone will ask: why no ABX listening test — the blind A/B/X comparison that's the gold-standard audio evaluation? Why no spectral analysis? Why no dynamic range calculation?

I considered all of those. Then I picked six very dumb dimensions instead:

  • Lossless vs lossy — FLAC, WAV, DSD start higher than MP3.
  • Sample rate and bitrate — higher sample rate adds points, very low bitrate deducts.
  • Metadata completeness — title, artist, album, year, genre, deduct per missing field.
  • Duration sanity — abnormal durations (5 seconds, 8 hours) get flagged separately.
  • Read success — any ffprobe or mutagen failure deducts heavily.
  • Suspected duplicate — being linked to a duplicate group deducts a relational score.

Why no listening tests? Because listening is subjective and not machinable. What I want is an index that runs, reproduces, and scores 110,000 tracks under the exact same yardstick. The moment I introduce listening, the next day I'd rehear a track and want to overturn yesterday's judgment, and the entire score becomes unstable forever.

All six dimensions are objective file-level properties — the same file scored today and a year from now yields the same number. That's what an "index" should look like. It's not an audiophile review. The overall average came out at 78.1, with a reasonable distribution — FLAC mostly above 85, MP3 mostly 60-75, a handful of damaged files pulled down to under 30. Good enough.
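To make the "same file, same number" property concrete, here is a sketch of a scoring function over the six dimensions. Every weight and threshold in it is invented for illustration; they are not the values behind the 78.1 average, and the `Track` fields are hypothetical.

```python
from dataclasses import dataclass, field

LOSSLESS = {"flac", "wav", "dsf"}
TAG_FIELDS = ("title", "artist", "album", "year", "genre")

@dataclass
class Track:
    fmt: str
    sample_rate_hz: int
    bitrate_kbps: int
    duration_s: float
    tags: dict = field(default_factory=dict)
    read_ok: bool = True
    duplicate_links: int = 0

def quality_score(t: Track) -> int:
    """Deterministic 0-100 score built only from file-level properties."""
    score = 80 if t.fmt.lower() in LOSSLESS else 60      # lossless vs lossy
    if t.sample_rate_hz > 48_000:                        # high sample rate adds points
        score += 10
    if 0 < t.bitrate_kbps < 128:                         # very low bitrate deducts
        score -= 15
    score -= 4 * sum(1 for f in TAG_FIELDS if not t.tags.get(f))  # missing tags
    if not 30 <= t.duration_s <= 3 * 3600:               # duration sanity
        score -= 10
    if not t.read_ok:                                    # ffprobe or mutagen failure
        score -= 40
    if t.duplicate_links > 0:                            # deducted once, not per link
        score -= 5
    return max(0, min(100, score))
```

Because the function reads no clock, no network, and no random source, scoring the same `Track` twice always yields the same number.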

Three real bug stories — even nailed-down constraints get chewed through

Five constraints plus ten stages sounds airtight. In practice it still got bitten several times. Three of the most representative ones:

Unicode NFC/NFD normalization

macOS has traditionally stored filenames in NFD (decomposed form, with combining characters split from their base letters). Many Linux-side scripts produce NFC (composed form). A Chinese song title that looks identical in macOS Finder can, when handed to Python's os.stat, fail with "file does not exist" (a FileNotFoundError).

This one cost me two days. I first assumed it was a permissions issue and spent hours getting nowhere. Then I eyeballed two visually identical strings and finally saw it — they differed at the byte level. The fix is to normalize every path to NFC before it touches the database. This does not violate "don't modify the source," because the source filenames are still untouched at the byte level. Only the database stores the normalized version.
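A small sketch of the normalization step using Python's standard `unicodedata` module. Keeping the raw on-disk path alongside the normalized key is an assumption added here, so the file can still be opened by its exact original name; `path_record` and its field names are hypothetical.

```python
import unicodedata

def path_record(raw_path):
    """Return the untouched on-disk path plus an NFC key for the database."""
    return {
        "raw_path": raw_path,                                  # used to open the file
        "path_key": unicodedata.normalize("NFC", raw_path),    # used for lookups and uniqueness
    }

# The same visible name in decomposed (NFD) and composed (NFC) form.
nfd_name = "Cafe\u0301.flac"   # 'e' followed by a combining acute accent
nfc_name = "Caf\u00e9.flac"    # single precomposed character
```

The two strings render identically but compare unequal until both are normalized, which is exactly the trap described above.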

The insert bug

Stage 5 builds the duplicate-candidate analysis, which inserts pairs of potentially duplicate files into a table. The first cut, being lazy, had no unique constraint, and the same pair got inserted three times — because the analysis has several rules and each one independently flagged the pair as a duplicate.

On the surface nothing was broken — just extra rows in the table. But when Stage 6 scoring picked it up, things went sideways: the same track got deducted three times for "three duplicate relationships" and dropped into a score bucket it didn't deserve. The fix was an explicit unique constraint, plus bidirectional dedup on (pair_a, pair_b), plus a verification pass on Stage 5's output before feeding it into Stage 6.

The lesson isn't "remember to add unique constraints." It's that any "analysis"-flavored script defaults to firing multiple times, and you have to block that at the data layer. You cannot rely on the business layer to be careful.
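Blocking repeats at the data layer could look like the sketch below, again with SQLite. The `pair_a` and `pair_b` column names follow the text; the table name, the `rule` column, and the `flag_pair` helper are assumptions. Sorting the two paths before insert gives the bidirectional dedup, so (x, y) and (y, x) land on the same row.

```python
import sqlite3

def open_pairs_db(db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS duplicate_pairs (
               pair_a TEXT NOT NULL,
               pair_b TEXT NOT NULL,
               rule   TEXT NOT NULL,
               CHECK (pair_a < pair_b),      -- enforce canonical ordering
               UNIQUE (pair_a, pair_b)       -- one row per pair, whoever flags it
           )"""
    )
    return conn

def flag_pair(conn, path_x, path_y, rule):
    """Record a duplicate candidate once, regardless of argument order or rule."""
    if path_x == path_y:
        return  # a file is not a duplicate of itself
    a, b = sorted((path_x, path_y))
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO duplicate_pairs(pair_a, pair_b, rule) VALUES (?, ?, ?)",
            (a, b, rule),
        )
```

With this shape, only the first rule that flags a pair is stored. If it matters which rules fired, those could go in a separate table keyed by pair and rule.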

The ffprobe hang

During the Stage 4 full metadata pass, a few oversized DSF files (several GB each) made ffprobe hang mid-read. The child process never returned, the main script never moved. Overnight runs got a few hundred files in and froze.

The fix was wrapping every ffprobe call in a watchdog timeout. If a call runs longer than 30 seconds, the watchdog kills the child, logs the error, and marks the file as "read timeout," and the main loop continues. With that patch, Stage 4 finally finished a full overnight run.

This sent me back to reread constraint 5, "error isolation." Error isolation doesn't just mean handling exceptions. It also has to handle "neither returns nor errors." Silent hangs are harder to catch than thrown errors. You have to add the timeout proactively.
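In Python, a timeout of this kind can be had from `subprocess.run`, which kills the child and raises `TimeoutExpired` once the limit passes. The sketch below is one possible shape; `run_with_watchdog` and `probe` are hypothetical names, and the ffprobe flags shown are the standard ones for JSON output rather than the exact invocation used in the real scan.

```python
import json
import subprocess

def run_with_watchdog(cmd, timeout_s=30):
    """Run a command; return (status, stdout, stderr) and never block past timeout_s."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point.
        return "read_timeout", "", ""
    status = "ok" if done.returncode == 0 else "error"
    return status, done.stdout, done.stderr

def probe(path, timeout_s=30):
    """Ask ffprobe for format and stream info as JSON, with a hard time limit."""
    cmd = ["ffprobe", "-v", "error", "-print_format", "json",
           "-show_format", "-show_streams", path]
    status, out, err = run_with_watchdog(cmd, timeout_s)
    if status != "ok":
        return {"path": path, "status": status, "detail": err.strip()}
    return {"path": path, "status": "ok", "info": json.loads(out)}
```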

Where things stand — Stage 13B is still running

Stages 0 through 12 are done. The first index (v0) runs and is queryable. All 110,000+ files are in the table. Metadata completeness sits roughly at title 90%, artist 90%, album 89%, year 33%, genre 35%. The first three are fine. The last two are weak, because a lot of old MP3s never had year or genre filled in back when they were ripped.

Stage 13B is doing reverse verification — taking the statistics from the index and matching them against the actual directory structure on the source disk, looking for "in the index but not on disk" and "on disk but missing from the index" cases. This was supposed to be Stage 8's job, but Stage 8 cut a corner and only did forward verification, so I opened a separate Stage 13B to make it right.
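At its core, two-way verification is a pair of set differences. A minimal sketch, assuming both sides can be listed as path strings and that NFC normalization is applied before comparing; `reconcile` is a hypothetical name.

```python
import unicodedata

def reconcile(disk_paths, index_paths):
    """Compare what is on disk with what is in the index, in both directions."""
    def norm(p):
        return unicodedata.normalize("NFC", p)
    on_disk = {norm(p) for p in disk_paths}
    in_index = {norm(p) for p in index_paths}
    return {
        "on_disk_not_in_index": sorted(on_disk - in_index),
        "in_index_not_on_disk": sorted(in_index - on_disk),
    }
```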

Two small things have surfaced so far: about 200 files were missed in the original scan because their paths contained rare characters — those need a return trip to Stage 2 with NFC normalization added. And about 40 files have empty metadata while the audio itself is fine, which looks like the original ripping tool never wrote any. Once those two are fixed, v0 is officially settled.

After that comes Stage 14+ — player integration, AI recommendation DJ, visualization layer. But all of that sits on top of "the index is trustworthy," so I'm in no hurry to push it. If the index isn't stable, everything above it is sand.

This index is nowhere near finished. But every constraint I nail down, every stage I get through, every bug story I patch — my trust in it grows by another notch.

I'm no longer chasing "let me ship the query UI fast." What I'm chasing is: next month, six months from now, two years from now, I can rerun the same scripts and the result converges to the same index. The source disk is never touched. Output always lives on the work disk. Errors always land in a log. Repeated runs are always idempotent. Those four things matter more than any fancy query frontend.

Stage 13B is still running. The next note about this index will probably open with either "v0 finally earned the right to be called v1" or "another constraint just got chewed through somewhere new."
