harden(runner): pin forge.azrak.io to LAN + auto-heal a wedged forgejo_runner #4

Merged
ivoryghst merged 1 commit from ivoryghst/harden-runner-lan-pin-watchdog into trunk 2026-06-30 02:27:45 +01:00 AGit

optimus is off-LAN, so forge.azrak.io defaults to its public IP and hairpins
back through optimus's own Traefik to Forgejo — an intermittently-lossy path
(TLS connection reset / EOF / refused). Because the forgejo_runner is
network_mode:host it shares the host /etc/hosts, so that flakiness WEDGES the
runner: it logs "failed to fetch task" ~every 2s and stops executing jobs,
producing status-2 zero-length-log `deploy (optimus)` failures (optimus sat on
a stale image for hours on 2026-06-29). It also makes large image pulls flaky
(dockerd uses host resolution).

  • optimus-boot-reconcile.sh: idempotently pin forge.azrak.io -> 10.1.0.10 in
    /etc/hosts every boot (over the IPsec tunnel, bypassing the Traefik hairpin;
    the LAN nodes r2-d2/c-3po already do this). Fixes runner fetch AND pulls.
  • forgejo-runner-watchdog.{sh,service,timer}: every 5min, restart the runner if
    it's wedged (sustained fetch-error burst >=40/5min; healthy=0, never fires
    mid-job). Belt-and-suspenders if the LAN path itself blips.
  • runbook 08: install steps + rationale.
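
The idempotent pin described in the first bullet could look roughly like the sketch below. It is not the contents of optimus-boot-reconcile.sh; the function name `pin_host` and its arguments are hypothetical. It assumes a standard hosts-format file and that any existing entry for the name sits on a line of its own (a line that lists several names would be dropped whole, and inline trailing comments are not handled).

```shell
#!/bin/sh
# Hypothetical helper, not the real reconcile script.
# Usage: pin_host FILE IP NAME
pin_host() {
    file=$1 ip=$2 name=$3
    tmp=$(mktemp) || return 1
    # Copy the file, dropping any non-comment line that maps NAME to an address.
    awk -v name="$name" '
        /^[[:space:]]*#/ { print; next }
        { for (i = 2; i <= NF; i++) if ($i == name) next; print }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Append the desired mapping as the last line.
    printf '%s %s\n' "$ip" "$name" >> "$tmp"
    # Rewrite only when something changed, and write in place so the
    # file keeps its inode and permissions.
    if cmp -s "$tmp" "$file"; then
        rm -f "$tmp"
    else
        cat "$tmp" > "$file" && rm -f "$tmp"
    fi
}

# Example call (needs root for /etc/hosts):
# pin_host /etc/hosts 10.1.0.10 forge.azrak.io
```

Running it a second time leaves the file untouched, which is what makes it safe to call on every boot.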

Applied + verified live 2026-06-30: runner now fetches over LAN, deploys
succeed (run #781/#783), watchdog dry-run no-ops. Diagnosis in agents
memory:reference_optimus_runner_connectivity_deploy_fail.
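
For readers without the scripts to hand, the watchdog's threshold check and its dry-run no-op can be sketched as follows. This is an illustration, not forgejo-runner-watchdog.sh itself: the function names, the `THRESHOLD` and `DRY_RUN` variables, and the assumption that the runner is a Docker container named `forgejo_runner` whose logs are read with `docker logs` are all hypothetical. The mid-job guard mentioned above is not reproduced here.

```shell
#!/bin/sh
# Hypothetical sketch of the wedge check; names and variables are assumptions.
THRESHOLD=${THRESHOLD:-40}   # fetch errors per 5-minute window that count as wedged

# Reads runner log lines on stdin; exits 0 when the error burst reaches THRESHOLD.
is_wedged() {
    count=$(grep -c 'failed to fetch task' || true)
    [ "$count" -ge "$THRESHOLD" ]
}

# Inspect the last 5 minutes of logs and restart (or only report) when wedged.
watchdog() {
    if docker logs --since 5m forgejo_runner 2>&1 | is_wedged; then
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "would restart forgejo_runner"
        else
            docker restart forgejo_runner
        fi
    fi
}

# Example: DRY_RUN=1 watchdog
```

On a healthy runner the error count is 0, so the check fails and `watchdog` does nothing, matching the dry-run no-op noted above.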

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reference
azrak/optimus!4