prod hardening: admin/metrics authz split, subprocess lifecycle, parallel pool start, HEALTHCHECK

- authz: new ADMIN_TOKEN gates /internal/*; METRICS_PUBLIC=false by default,
  so /metrics returns 503 when neither METRICS_TOKEN nor API_KEYS is set
  (previously it leaked pool topology). Startup logs loudly if API_KEYS is
  empty or admin falls back to chat keys.
- lingma_client: keep a Popen handle instead of orphaning Lingma with
  start_new_session, drain stderr to the logger at DEBUG, and on shutdown send
  SIGTERM -> 5s grace -> SIGKILL. Fixes the zombie-process leak on container
  reload.
- pool: use asyncio.gather to start N instances concurrently; an N=2 pool
  shaves ~startup_timeout seconds off boot.
- Dockerfile: HEALTHCHECK hits /healthz and fails unless ok=true (at least one
  Lingma instance ready), so Docker / Compose orchestrators see "stuck on
  login" as unhealthy.

Made-with: Cursor
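
The SIGTERM -> grace -> SIGKILL sequence described for lingma_client can be sketched roughly as below. The lingma_client code itself is not part of the diff shown here, so this is an illustration under assumptions, not the committed implementation: `stop_process` and `grace` are hypothetical names, and the child is a stand-in sleep rather than Lingma.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Stop a child process: polite signal first, forced kill after `grace` seconds.

    Hypothetical helper; assumes the caller kept the Popen handle.
    """
    if proc.poll() is None:              # child is still running
        proc.terminate()                 # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace)     # give it time to exit cleanly
        except subprocess.TimeoutExpired:
            proc.kill()                  # SIGKILL on POSIX
            proc.wait()                  # reap it so no zombie is left behind
    return proc.returncode


if __name__ == "__main__":
    # Stand-in child that would otherwise run for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_process(child))
```

Keeping the handle is what makes the final `wait()` possible: a process detached with start_new_session and then forgotten has nobody to reap it.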
2026-04-18 10:22:13 +08:00
parent 3130533888
commit 2febc37c2c
8 changed files with 248 additions and 28 deletions
--- a/11
+++ b/11
@@ -17,4 +17,15 @@ COPY app /app/app

 EXPOSE 8317

+# Container-level health signal. Docker Compose / orchestrators rely on this
+# to stop sending traffic when the pool is wedged, restart unhealthy replicas,
+# and drive rolling deploys. /healthz returns ok=true only when at least one
+# Lingma instance is in state=ready, so it catches the "stuck on login" case
+# that a raw TCP probe would miss.
+HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
+    CMD python -c "import os,json,urllib.request,sys; \
+port=os.environ.get('PORT','8317'); \
+r=urllib.request.urlopen(f'http://127.0.0.1:{port}/healthz', timeout=3); \
+sys.exit(0 if json.load(r).get('ok') else 1)" || exit 1
+
 CMD ["sh", "-c", "python /app/app/bootstrap_lingma.py && uvicorn app.main:app --host ${HOST:-0.0.0.0} --port ${PORT:-8317}"]
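
For readability, the inline HEALTHCHECK one-liner above could also be expressed as a small standalone function. This is a minimal sketch, assuming only what the diff shows (a JSON body with an `ok` field served at /healthz); `is_healthy` and `probe_exit_code` are hypothetical names, not files or functions in this commit.

```python
import json
import os
import urllib.request


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True only if `url` answers 2xx with a JSON object whose "ok" is truthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return bool(json.load(resp).get("ok"))
    except Exception:
        # Connection refused, timeout, HTTP error status, or a malformed body
        # all count as unhealthy for a probe.
        return False


def probe_exit_code() -> int:
    """Map the probe result to a process exit code (0 = healthy, 1 = unhealthy)."""
    port = os.environ.get("PORT", "8317")  # same default as the Dockerfile
    return 0 if is_healthy(f"http://127.0.0.1:{port}/healthz") else 1
```

A script wrapping this would end with `sys.exit(probe_exit_code())`. Unlike the one-liner, which relies on an uncaught exception to produce a non-zero exit, this version treats every failure mode explicitly as unhealthy.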