feat(boot): optimus RAID0 rebuild durability + cold-start auto-recovery #3

Merged
ivoryghst merged 4 commits from docs/raid0-rebuild-durability into trunk 2026-06-16 10:40:12 +01:00

Persists everything the 2026-06-15 RAID1→RAID0 reinstall + Phase-D prove-reboot needed. Clevis/Tang LUKS auto-unlock proven across 5 reboots; the reboots then exposed three stacked cold-start layers, all now fixed + version-controlled so a cold boot self-recovers with zero manual steps.

Networking / swarm-join

  • network/nftables-mssclamp.nft — MSS clamp 1320 (the swarm-over-IPsec join blocker; was not in the repo at all)
  • network/interfaces.d/tunnel-vip.cfg — now VIP-only (the route+SNAT post-up raced enp35s0 and failed networking.service)
  • configs/optimus-tunnel-routes.service — route+VIP via systemd oneshot (After=network-online, Before=docker, tunnel-wait)
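
The "tunnel-wait" in the oneshot can be pictured as a bounded retry loop. The sketch below is an assumption about its general shape, not the contents of the repo's unit: `wait_for`, the attempt count, the delay, and the probe command are all hypothetical.

```shell
# Hypothetical helper: retry a probe command a bounded number of times.
# Usage: wait_for <max_attempts> <sleep_seconds> <command> [args...]
wait_for() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0                      # probe succeeded
    fi
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"                # pause only between attempts
    fi
    attempt=$((attempt + 1))
  done
  return 1                          # gave up; the caller decides what to do
}

# Example (assumed probe and variable; the real service may test differently):
#   wait_for 30 2 ping -c 1 -W 1 "$PEER_IP"
```

Bounding the wait means a dead tunnel delays boot by at most `max_attempts * sleep_seconds` instead of hanging the unit indefinitely.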

Container→LAN SNAT

  • configs/docker-tunnel-snat.conf + optimus-tunnel-snat.sh — re-assert SNAT above docker’s per-network MASQUERADE (a docker.service ExecStartPost), else container DB traffic masquerades to the public IP and the tunnel drops it
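
One way to keep a rule "first in POSTROUTING" after every docker start is to delete any existing copy and insert it again at position 1. This is a minimal sketch of that idea, not the repo's `optimus-tunnel-snat.sh`: the function name, the `IPT` override, and the rule arguments are assumptions. Only the standard `iptables` `-C`, `-D`, and `-I` operations are used.

```shell
# Hypothetical sketch: keep one SNAT rule at position 1 of nat/POSTROUTING.
# IPT can be overridden so the logic can be dry-run without touching netfilter.
IPT="${IPT:-iptables}"

reassert_snat_first() {
  # "$@" is the rule spec, e.g. (placeholders, not the real values):
  #   -s <container-net> -d <lan-net> -j SNAT --to-source <tunnel-vip>
  # Remove every existing copy so restarts do not accumulate duplicates.
  while $IPT -t nat -C POSTROUTING "$@" 2>/dev/null; do
    $IPT -t nat -D POSTROUTING "$@"
  done
  # Insert at position 1, ahead of docker's per-network MASQUERADE rules.
  $IPT -t nat -I POSTROUTING 1 "$@"
}
```

Because docker rebuilds its own rules on each daemon start, running this from an `ExecStartPost` hook (as the PR describes) re-establishes the ordering every time.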

Cold-start app recovery

  • configs/optimus-boot-reconcile.service + .sh — multi-pass docker compose up -d of autostart stacks (apps Exit128 “network not found” before the overlay extends to optimus); mirrors the fleet glxy-boot-reconcile

Docs: runbook/08-raid0-rebuild-2026-06-15.md (full sequence + the three fixes + observed clean-boot), runbook/02 Tang IPs corrected to post-re-IP .10/.11.

Final reboot observed clean: system=running, 0 failed units, all 7 apps healthy (restarts=0), stalwart 3/3, traefik 3/3, 3 managers, SNAT first in POSTROUTING, portal.opmail.io 302 via optimus, with no manual intervention. All paths non-deploying (configs/**, network/**, *.md) → merge is a live-host no-op.

🤖 Generated with Claude Code

The 2026-06-15 RAID1→RAID0 reinstall surfaced fixes that weren't captured
anywhere, so a future rebuild would re-hit them. Persist the load-bearing ones:

- network/nftables-mssclamp.nft: the swarm-over-IPsec MSS clamp (1320). Without
  it `docker swarm join` hangs ("DeadlineExceeded … waiting for connections")
  because the ~1400B-MTU tunnel black-holes large raft/gossip DF frames. This was
  the single hardest blocker of the rebuild and was NOT in the repo at all.
- network/interfaces.d/tunnel-vip.cfg: add `mtu 1360` to the LAN route — the
  non-TCP half of the same fix.
- runbook/08-raid0-rebuild-2026-06-15.md: full end-to-end sequence incl. the
  things that cost real time — dropbear apt-hang, worker-join+promote, /mnt/cache
  symlinks, app_egress bridge, private-registry login for ivoryghst/iam, app-data
  perms (agents 999:999, clamav 100:103), forgejo-runner re-register with
  `optimus:host,swarm-manager` labels.
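
The two values in the first two bullets appear to line up arithmetically. Assuming IPv4 and TCP headers of 20 bytes each with no options (an assumption; the commit does not state how 1320 was derived), a 1360-byte route MTU leaves 1320 bytes of TCP payload:

```shell
# Assumption: IPv4 header = 20 bytes, TCP header without options = 20 bytes.
mss_for_mtu() {
  echo $(( $1 - 20 - 20 ))
}

mss_for_mtu 1360   # prints 1320
```

The MSS clamp only affects TCP, which is why the route MTU is described as the non-TCP half of the fix.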

All paths are non-deploying (network/** is not a deploy trigger; *.md ignored),
so merging is a no-op for the live host.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Phase D prove-reboot (5 reboots; Clevis/Tang auto-unlock confirmed every
time) exposed three stacked cold-start layers that each broke reconvergence on a
cold boot even though the live fixes worked. All three are now version-controlled
so a cold boot self-recovers with zero manual steps:

1. networking.service failed → route+SNAT missing. The lo:tunnel post-up
   `ip route … dev enp35s0` raced enp35s0 bring-up ("Device for nexthop is not
   up"); ifupdown aborts the stanza so SNAT was skipped too. tunnel-vip.cfg is
   now VIP-only; route+VIP move to optimus-tunnel-routes.service
   (After=network-online, Before=docker, bounded tunnel-reachability wait).

2. container→LAN SNAT was buried by docker's per-network MASQUERADE (rebuilt on
   every daemon start), so container LAN traffic masqueraded to the public IP, the
   tunnel dropped it, and DB connections failed. Re-assert SNAT on top via a
   docker.service ExecStartPost (docker-tunnel-snat.conf → optimus-tunnel-snat.sh).

3. DB-dependent compose apps Exit128 "network not found" (overlay not yet
   extended to optimus) and don't auto-recover. optimus-boot-reconcile.service
   (After=docker) multi-pass `docker compose up -d`s the autostart stacks until
   all run — mirrors the fleet glxy-boot-reconcile.
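
The multi-pass idea in item 3 can be sketched as a loop that retries only the stacks that failed on the previous pass. This is an illustrative sketch, not the repo's reconcile script: `reconcile`, `UP_CMD`, the pass count, and the use of the `up` exit status as the success signal are assumptions, and stack paths are assumed to contain no spaces.

```shell
# Hypothetical sketch of a multi-pass reconcile.
# UP_CMD brings one stack up; it can be swapped out for a dry run.
UP_CMD="${UP_CMD:-docker compose up -d}"

reconcile() {
  # Usage: reconcile <max_passes> <sleep_seconds> <stack_dir>...
  local max="$1" delay="$2" pass=1 pending dir still
  shift 2
  pending="$*"
  while [ -n "$pending" ] && [ "$pass" -le "$max" ]; do
    still=""
    for dir in $pending; do
      # Subshell keeps the cd local; failed stacks are retried next pass.
      ( cd "$dir" && $UP_CMD ) >/dev/null 2>&1 || still="$still $dir"
    done
    pending="${still# }"
    if [ -n "$pending" ] && [ "$pass" -lt "$max" ]; then
      sleep "$delay"                # give the overlay time to converge
    fi
    pass=$((pass + 1))
  done
  [ -z "$pending" ]                 # success only if every stack came up
}

# Example (hypothetical paths): reconcile 10 15 /srv/stacks/app1 /srv/stacks/app2
```

Retrying only the failed stacks keeps later passes cheap, and the pass limit guarantees the oneshot terminates even if a stack never starts.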

Final reboot observed clean: system=running, 0 failed units, all 7 apps healthy
(restarts=0), stalwart 3/3, traefik 3/3, 3 managers, SNAT first in POSTROUTING,
portal.opmail.io 302 via optimus — no manual intervention.

Also: runbook/02 Tang IPs corrected to the post-re-IP .10/.11 (were .11/.12).
All paths non-deploying (configs/**, network/**, *.md) → merge is a host no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ivoryghst changed title from docs(rebuild): persist RAID0 rebuild fixes — MSS clamp, route MTU, full runbook to feat(boot): optimus RAID0 rebuild durability + cold-start auto-recovery 2026-06-15 10:22:03 +01:00
Restores true 3-voter quorum for azrak-pg-ha (was 2 voters / tolerates 0 failures
after the dead pi etcd-3 removal). Captures the learner→promote procedure, the two
non-obvious musts (bind to tunnel VIP 10.1.4.1 only — NOT 0.0.0.0, since optimus is
public and etcd is unauthenticated; join as learner so the window can't lose quorum),
the glxy.lan --add-host (pi-Unbound gone), and rollback. Verified live: 3 voters in
sync (raft index match), Patroni leader+replica healthy, etcd-3 stable over the VPN.
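
The "2 voters / tolerates 0 failures" figure follows from majority quorum: a cluster of n voters needs floor(n/2)+1 of them to agree. A short worked calculation (the function names are illustrative only):

```shell
# Majority quorum for n voters, and how many voter failures it survives.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerates() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 2 3; do
  echo "voters=$n quorum=$(quorum "$n") tolerates=$(tolerates "$n")"
done
# voters=2 quorum=2 tolerates=0
# voters=3 quorum=2 tolerates=1
```

An etcd learner is a non-voting member, so adding the new node as a learner leaves the quorum size at 2 until it is promoted.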

.md only → non-deploying.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lets optimus be a PLANNED Forgejo failover target: while optimus is active, file
writes replicate back to r2-d2 + c-3po so they stay current (DB already does via
Patroni). Mirrors c-3po's reverse-lsyncd (alpine+lsyncd, tripwire entrypoint,
profiles:[disaster], --delete + --max-delete=100, excludes sessions/tmp/indexers).

Auth: a DEDICATED rsync-only key (forge_reverse_key, NOT the master claude_key)
on optimus, authorized on r2-d2/c-3po via a forced-command wrapper
(forge-rsync-only.sh — allows only `rsync --server`, no shell) + from="10.1.4.1".
SECURITY RESIDUAL documented in runbook/10: the wrapper can't path-restrict
(lsyncd sends the dest via protocol, not argv); rrsync would close it but isn't
installed — vendoring the perl rrsync is the future hardening.
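
A forced-command wrapper of this kind typically inspects `SSH_ORIGINAL_COMMAND`, which sshd sets to whatever the client asked to run. The following is a minimal sketch under that assumption, not the repo's `forge-rsync-only.sh`; the function names are hypothetical. Like the wrapper described above, it checks only the command prefix and does not restrict paths.

```shell
# Hypothetical sketch of an rsync-only forced command.
is_rsync_server() {
  case "$1" in
    "rsync --server "*) return 0 ;;   # rsync's server-side invocation
    *)                  return 1 ;;   # anything else, including an empty command
  esac
}

forced_command() {
  if is_rsync_server "${SSH_ORIGINAL_COMMAND:-}"; then
    set -f                            # no glob expansion of client-supplied words
    # Unquoted on purpose: split into argv without eval, so shell
    # metacharacters such as ';' stay literal arguments.
    exec $SSH_ORIGINAL_COMMAND
  fi
  echo "rejected: only 'rsync --server' is allowed" >&2
  return 1
}
```

In `authorized_keys` such a script is attached to the dedicated key through the `command="..."` option, alongside the `from="10.1.4.1"` restriction mentioned above.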

Drilled live 2026-06-15: to-optimus served forge over the tunnel, reverse lsyncd
came up + synced to both LAN nodes (0 deletes, tree intact), cut back clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reference
azrak/optimus!3