feat(boot): optimus RAID0 rebuild durability + cold-start auto-recovery #3
Persists everything the 2026-06-15 RAID1→RAID0 reinstall + Phase-D prove-reboot needed. Clevis/Tang LUKS auto-unlock proven across 5 reboots; the reboots then exposed three stacked cold-start layers, all now fixed + version-controlled so a cold boot self-recovers with zero manual steps.
Networking / swarm-join
- `network/nftables-mssclamp.nft`: MSS clamp 1320 (the swarm-over-IPsec join blocker; was not in the repo at all)
- `network/interfaces.d/tunnel-vip.cfg`: now VIP-only (the route+SNAT post-up raced enp35s0 and failed `networking.service`)
- `configs/optimus-tunnel-routes.service`: route+VIP via systemd oneshot (After=network-online, Before=docker, tunnel-wait)

Container→LAN SNAT
- `configs/docker-tunnel-snat.conf` + `optimus-tunnel-snat.sh`: re-assert SNAT above docker's per-network MASQUERADE (a docker.service ExecStartPost), else container DB traffic masquerades to the public IP and the tunnel drops it

Cold-start app recovery
- `configs/optimus-boot-reconcile.service` + `.sh`: multi-pass `docker compose up -d` of autostart stacks (apps Exit128 "network not found" before the overlay extends to optimus); mirrors the fleet glxy-boot-reconcile

Docs: `runbook/08-raid0-rebuild-2026-06-15.md` (full sequence + the three fixes + observed clean-boot), `runbook/02` Tang IPs corrected to post-re-IP `.10/.11`.

Final reboot observed clean: system=running, 0 failed units, all 7 apps healthy (restarts=0), stalwart 3/3, traefik 3/3, 3 managers, SNAT first in POSTROUTING, portal.opmail.io 302 via optimus; no manual intervention. All paths non-deploying (configs/, network/, *.md) → merge is a live-host no-op.
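For reviewers who want to see the shape of the "SNAT first in POSTROUTING" fix, here is a minimal sketch of an idempotent re-assert. It is not the contents of `optimus-tunnel-snat.sh`; the function name `reassert_snat`, the `IPT` override, and all three addresses (`SRC_NET`, `LAN_NET`, `VIP`) are hypothetical placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch only: put a container->LAN SNAT rule at the top of nat/POSTROUTING.
# Every address below is a placeholder, not a value taken from the repo.
: "${IPT:=iptables}"            # overridable so the logic can be dry-run
: "${SRC_NET:=172.17.0.0/16}"   # assumed container address space
: "${LAN_NET:=192.0.2.0/24}"    # assumed LAN reached through the tunnel
: "${VIP:=192.0.2.1}"           # assumed tunnel VIP used as the SNAT source

reassert_snat() {
    # Remove every existing copy so repeated runs never duplicate the rule.
    while $IPT -t nat -C POSTROUTING -s "$SRC_NET" -d "$LAN_NET" \
            -j SNAT --to-source "$VIP" 2>/dev/null; do
        $IPT -t nat -D POSTROUTING -s "$SRC_NET" -d "$LAN_NET" \
            -j SNAT --to-source "$VIP"
    done
    # Insert at position 1 so it is evaluated before docker's MASQUERADE rules.
    $IPT -t nat -I POSTROUTING 1 -s "$SRC_NET" -d "$LAN_NET" \
        -j SNAT --to-source "$VIP"
}
```

Calling `reassert_snat` after each docker daemon start keeps the rule on top even though docker rebuilds its own rules; `iptables -t nat -L POSTROUTING -n --line-numbers` shows the resulting order.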
🤖 Generated with Claude Code
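The multi-pass idea behind the cold-start recovery can be sketched as a bounded retry loop. This is an illustration under stated assumptions, not the shipped `optimus-boot-reconcile.sh`: the names `reconcile`, `compose_up`, `UP_CMD`, `MAX_PASSES`, and `PASS_DELAY` are hypothetical, and the sketch treats a non-zero exit from `docker compose up -d` as "not yet recoverable", whereas the real script may inspect container state instead.

```shell
#!/bin/sh
# Sketch only: retry `docker compose up -d` per stack until every stack succeeds
# or the pass budget runs out. Stack paths must not contain whitespace.
: "${MAX_PASSES:=10}"       # assumed upper bound on passes
: "${PASS_DELAY:=15}"       # assumed seconds to wait between passes
: "${UP_CMD:=compose_up}"   # overridable so the loop can be exercised offline

compose_up() {
    # $1 = directory holding the stack's compose file
    (cd "$1" && docker compose up -d)
}

reconcile() {
    pending="$*"
    pass=1
    while [ -n "$pending" ]; do
        failed=""
        for stack in $pending; do
            "$UP_CMD" "$stack" || failed="$failed $stack"
        done
        pending="${failed# }"                       # only failed stacks are retried
        [ -z "$pending" ] && return 0               # everything came up
        [ "$pass" -ge "$MAX_PASSES" ] && return 1   # give up, let the unit fail
        pass=$((pass + 1))
        sleep "$PASS_DELAY"
    done
    return 0
}
```

Retrying only the stacks that failed keeps later passes cheap, and the bound means a stack that can never start surfaces as a failed unit rather than blocking boot indefinitely.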
The 2026-06-15 RAID1→RAID0 reinstall surfaced fixes that weren't captured anywhere, so a future rebuild would re-hit them. Persist the load-bearing ones:

- network/nftables-mssclamp.nft: the swarm-over-IPsec MSS clamp (1320). Without it `docker swarm join` hangs ("DeadlineExceeded … waiting for connections") because ~1400B-MTU tunnel black-holes large raft/gossip DF frames. This was the single hardest blocker of the rebuild and was NOT in the repo at all.
- network/interfaces.d/tunnel-vip.cfg: add `mtu 1360` to the LAN route, the non-TCP half of the same fix.
- runbook/08-raid0-rebuild-2026-06-15.md: full end-to-end sequence incl. the things that cost real time: dropbear apt-hang, worker-join+promote, /mnt/cache symlinks, app_egress bridge, private-registry login for ivoryghst/iam, app-data perms (agents 999:999, clamav 100:103), forgejo-runner re-register with `optimus:host,swarm-manager` labels.

All paths are non-deploying (network/** is not a deploy trigger; *.md ignored), so merging is a no-op for the live host.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The Phase D prove-reboot (5 reboots; Clevis/Tang auto-unlock confirmed every time) exposed three stacked cold-start layers that each broke reconvergence on a cold boot even though the live fixes worked. All three are now version-controlled so a cold boot self-recovers with zero manual steps:

1. networking.service failed → route+SNAT missing. The lo:tunnel post-up `ip route … dev enp35s0` raced enp35s0 bring-up ("Device for nexthop is not up"); ifupdown aborts the stanza so SNAT was skipped too. tunnel-vip.cfg is now VIP-only; route+VIP move to optimus-tunnel-routes.service (After=network-online, Before=docker, bounded tunnel-reachability wait).
2. container→LAN SNAT buried by docker's per-network MASQUERADE (rebuilt every daemon start) → container LAN traffic masqueraded to the public IP, tunnel drops it, DB connects fail. Re-assert SNAT on top via a docker.service ExecStartPost (docker-tunnel-snat.conf → optimus-tunnel-snat.sh).
3. DB-dependent compose apps Exit128 "network not found" (overlay not yet extended to optimus) and don't auto-recover. optimus-boot-reconcile.service (After=docker) multi-pass `docker compose up -d`s the autostart stacks until all run; mirrors the fleet glxy-boot-reconcile.

Final reboot observed clean: system=running, 0 failed units, all 7 apps healthy (restarts=0), stalwart 3/3, traefik 3/3, 3 managers, SNAT first in POSTROUTING, portal.opmail.io 302 via optimus; no manual intervention.

Also: runbook/02 Tang IPs corrected to the post-re-IP .10/.11 (were .11/.12).

All paths non-deploying (configs/**, network/**, *.md) → merge is a host no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Title changed from "docs(rebuild): persist RAID0 rebuild fixes — MSS clamp, route MTU, full runbook" to "feat(boot): optimus RAID0 rebuild durability + cold-start auto-recovery"
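On the MSS clamp and route MTU: the two numbers in this PR (clamp 1320, route `mtu 1360`) are consistent with the usual IPv4 relation MSS = MTU - 40 (20-byte IPv4 header plus 20-byte TCP header, no options). Whether they were actually derived that way is an assumption; the sketch below only shows the arithmetic. The helper names `mss_for_mtu` and `clamp_rule` are hypothetical, and the printed rule is a generic nftables fragment, not the contents of `network/nftables-mssclamp.nft`.

```shell
#!/bin/sh
# Sketch only: derive a TCP MSS clamp from an MTU (IPv4, no IP or TCP options).
mss_for_mtu() {
    # Subtract the 20-byte IPv4 header and the 20-byte TCP header.
    echo $(( $1 - 20 - 20 ))
}

# Print an nftables rule fragment that clamps the MSS on SYN packets.
clamp_rule() {
    printf 'tcp flags syn tcp option maxseg size set %s\n' "$(mss_for_mtu "$1")"
}

mss_for_mtu 1360   # prints 1320
```

The clamp only rewrites the MSS option in TCP handshakes, which is why the route MTU is needed as well for non-TCP traffic crossing the tunnel.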