harden(runner): pin forge.azrak.io to LAN + auto-heal a wedged forgejo_runner #4
optimus is off-LAN, so forge.azrak.io defaults to its public IP and hairpins
back through optimus's own Traefik to Forgejo, an intermittently-lossy path
(TLS connection reset / EOF / refused). Because the forgejo_runner is
network_mode:host it shares the host /etc/hosts, so that flakiness WEDGES the
runner: it logs "failed to fetch task" ~every 2s and stops executing jobs,
producing status-2 zero-length-log `deploy (optimus)` failures (optimus sat on
a stale image for hours on 2026-06-29). It also makes large image pulls flaky
(dockerd uses host resolution).

- optimus-boot-reconcile.sh: idempotently pin forge.azrak.io -> 10.1.0.10 in
  /etc/hosts every boot (over the IPsec tunnel, bypassing the Traefik hairpin;
  the LAN nodes r2-d2/c-3po already do this). Fixes runner fetch AND pulls.
- forgejo-runner-watchdog.{sh,service,timer}: every 5min, restart the runner if
  it's wedged (sustained fetch-error burst >=40/5min; healthy=0, never fires
  mid-job). Belt-and-suspenders if the LAN path itself blips.
- runbook 08: install steps + rationale.

Applied + verified live 2026-06-30: runner now fetches over LAN, deploys
succeed (run #781/#783), watchdog dry-run no-ops. Diagnosis in agents
memory:reference_optimus_runner_connectivity_deploy_fail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
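For reviewers, the idempotent /etc/hosts pin described for optimus-boot-reconcile.sh could look roughly like the sketch below. The script itself is not shown in this description, so the function name `pin_host` and its approach (rebuild the file without stale entries, append the pin, write only when something changed) are assumptions for illustration. Only the hostname and the 10.1.0.10 address come from the text above.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the actual optimus-boot-reconcile.sh.
# pin_host FILE IP NAME
#   Ensure FILE maps NAME to IP with exactly one entry. Running it again
#   once the pin is in place leaves FILE untouched (idempotent).
pin_host() {
    hosts_file=$1
    ip=$2
    name=$3
    rc=0

    tmp=$(mktemp) || return 1

    # Copy every line except non-comment entries that list NAME as a hostname.
    # Simplification: a line carrying NAME plus other aliases is dropped whole.
    awk -v n="$name" '
        /^[ \t]*#/ { print; next }
        { for (i = 2; i <= NF; i++) if ($i == n) next; print }
    ' "$hosts_file" > "$tmp"

    # Append the desired pin as the only entry for NAME.
    printf '%s %s\n' "$ip" "$name" >> "$tmp"

    # Write back only if the content differs. Use cat rather than mv so the
    # original file keeps its inode, owner and mode.
    cmp -s "$tmp" "$hosts_file" || cat "$tmp" > "$hosts_file" || rc=1

    rm -f "$tmp"
    return "$rc"
}
```

On the host this would presumably be called as root, for example `pin_host /etc/hosts 10.1.0.10 forge.azrak.io`. Taking the file path as a parameter lets the same function be exercised against a scratch copy first.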
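The watchdog's wedge test (a burst of at least 40 fetch errors within the 5-minute window) can be sketched as below. This is a guess at the shape of the check, not the contents of forgejo-runner-watchdog.sh: the function names `count_fetch_errors` and `is_wedged` are hypothetical, and the sketch assumes the runner's recent log lines are piped in on stdin. The error string and the threshold of 40 are taken from the description above.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the actual forgejo-runner-watchdog.sh.
THRESHOLD=40   # from the description: >=40 fetch errors per 5-minute window

# count_fetch_errors: read log lines on stdin, print how many report a
# failed task fetch. grep -c exits non-zero on zero matches, hence "|| true".
count_fetch_errors() {
    grep -c 'failed to fetch task' || true
}

# is_wedged [THRESHOLD]: succeed (exit 0) when the log window on stdin
# contains at least THRESHOLD fetch errors, fail otherwise.
is_wedged() {
    n=$(count_fetch_errors)
    [ "$n" -ge "${1:-$THRESHOLD}" ]
}

# Possible wiring, assuming the container is named forgejo_runner and its
# logs are read through Docker (both assumptions):
#
#   if docker logs --since 5m forgejo_runner 2>&1 | is_wedged; then
#       docker restart forgejo_runner
#   fi
```

A healthy runner that logs no fetch errors yields a count of 0, so the check fails and nothing is restarted, which matches the "healthy=0" note in the description.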