prod hardening: admin/metrics authz split, subprocess lifecycle, parallel pool start, HEALTHCHECK

- authz: new ADMIN_TOKEN gates /internal/*; METRICS_PUBLIC=false by default, so
  /metrics returns 503 when neither METRICS_TOKEN nor API_KEYS is set
  (previously leaked pool topology). Startup logs loudly if API_KEYS is empty
  or admin falls back to chat keys.
- lingma_client: keep a Popen handle instead of orphaning Lingma with
  start_new_session, drain stderr to logger at DEBUG, SIGTERM -> 5s grace ->
  SIGKILL on shutdown. Fixes the zombie-process leak on container reload.
- pool: asyncio.gather to start N instances concurrently; N=2 pool shaves
  ~startup_timeout seconds off boot.
- Dockerfile: HEALTHCHECK hits /healthz and greps for pool_ready>0 so Docker
  / compose orchestrators see "stuck on login" as unhealthy.
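
The lingma_client item above (keep a Popen handle, drain stderr at DEBUG, SIGTERM -> 5s grace -> SIGKILL) can be sketched roughly as follows. This is an illustration only, not the code in this commit: `spawn`, `stop` and the logger name are hypothetical, and POSIX signal semantics are assumed.

```python
import logging
import subprocess
import threading

logger = logging.getLogger("lingma_client_sketch")  # hypothetical logger name


def spawn(cmd: list[str]) -> subprocess.Popen:
    """Start the child and keep the Popen handle so it can be reaped later."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )

    def _drain() -> None:
        # Read stderr until EOF so a full pipe buffer can never block the child.
        assert proc.stderr is not None
        for line in proc.stderr:
            logger.debug("child stderr: %s", line.rstrip())

    threading.Thread(target=_drain, daemon=True).start()
    return proc


def stop(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL; always reap."""
    if proc.poll() is None:
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()  # reap the child so no zombie is left behind
    return proc.returncode
```

Because the parent holds the handle and always calls `wait()`, the child is reaped even when it ignores SIGTERM, which is the property the zombie-leak fix depends on.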

Made-with: Cursor
GitHub Actions
2026-04-18 10:22:13 +08:00
parent 3130533888
commit 2febc37c2c
8 changed files with 248 additions and 28 deletions


@@ -183,16 +183,14 @@ class LingmaPool:
     # -------------------------------------------------------------- lifecycle
     async def start(self) -> None:
-        """Start all instances sequentially.
+        """Boot every pool instance in parallel.

-        Sequential startup avoids racing on the shared ~/.lingma/.info file (for
-        pool-mode we skip it anyway, but Lingma may still write there internally)
-        and keeps docker logs readable. Failures are non-fatal; per-instance
-        reconnect loops will take over.
+        Bundle restore is still sequential (cheap, filesystem-level) and logged
+        per instance; only the expensive `client.start()` path — which waits on
+        the Lingma socket and an LSP initialize round-trip — runs concurrently.

-        Before spawning each Lingma process we optionally restore a pre-captured
-        session bundle into the workDir, which lets us skip Playwright login
-        entirely on a fresh volume.
+        Any one instance failing is non-fatal: per-instance reconnect loops
+        take over once their first `ensure_ready()` fires.
         """
         for inst in self._instances:
             self._maybe_apply_session_bundle(inst)
@@ -208,11 +206,18 @@ class LingmaPool:
                 ),
                 is_logged_in_workdir(inst.cfg.work_dir),
             )
+
+        async def _start_one(inst: PoolInstance) -> None:
             try:
                 await inst.client.start()
             except Exception as exc:
                 logger.warning("pool start %s failed: %s", inst.name, exc)
+
+        await asyncio.gather(
+            *(_start_one(inst) for inst in self._instances),
+            return_exceptions=False,
+        )
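
Note on the hunk above: a minimal standalone sketch of the same pattern, assuming a hypothetical `FakeClient` in place of the real pool client. Each start is wrapped so that its exception is logged and swallowed, which means `asyncio.gather` never raises here and total boot time is close to the slowest instance rather than the sum of all of them.

```python
import asyncio
import logging
import time

logger = logging.getLogger("pool_sketch")  # hypothetical logger name


class FakeClient:
    """Hypothetical stand-in for a pool instance's client."""

    def __init__(self, name: str, delay: float, fail: bool = False) -> None:
        self.name, self.delay, self.fail = name, delay, fail
        self.started = False

    async def start(self) -> None:
        await asyncio.sleep(self.delay)  # simulates socket wait + handshake
        if self.fail:
            raise RuntimeError("login required")
        self.started = True


async def start_all(clients: list[FakeClient]) -> float:
    """Start every client concurrently; return elapsed wall-clock seconds."""

    async def _start_one(client: FakeClient) -> None:
        try:
            await client.start()
        except Exception as exc:
            # Caught here so gather() never raises and the other starts finish.
            logger.warning("pool start %s failed: %s", client.name, exc)

    began = time.monotonic()
    await asyncio.gather(*(_start_one(c) for c in clients))
    return time.monotonic() - began
```

With two clients that each take 0.3 s, `asyncio.run(start_all(...))` should return after roughly 0.3 s instead of 0.6 s, and a failing client leaves the other one started.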

     @staticmethod
     def _maybe_apply_session_bundle(inst: "PoolInstance") -> None:
         """Restore an exported Lingma session into inst.work_dir, if needed.