feat: M1+M2 gateway hardening and multi-instance pool

Behavior hardening (M1): - Fix `_chat_streams` memory leak: pop_stream on completion, error, and client disconnect. - Add WebSocket reconnect with state machine (stopped/starting/ready/ reconnecting/failed/closed) and exponential backoff, so a Lingma restart no longer requires restarting the gateway. - Lazy initialization: startup failure is non-fatal, first real request triggers retry, `/healthz` reflects readiness. - Migrate FastAPI on_event to lifespan. - Structured JSON logging with request_id ContextVar; `x-request-id` propagated to responses. - SSE now sets `Cache-Control: no-cache`, `X-Accel-Buffering: no` to defeat proxy buffering. - OpenAI schema compatibility: `content` accepts str | list[parts] | None, added `developer`/`function` roles, `tools/tool_choice/stream_options/ user/max_tokens` fields, and `stream_options.include_usage` emits final usage chunk. - `require_bearer` uses `hmac.compare_digest`; `/metrics` now requires Bearer when `METRICS_TOKEN` or `API_KEYS` are set. - Python 3.10/3.11 `TimeoutError` vs `asyncio.TimeoutError` unified. - Error responses no longer leak `auto_login.status()` details. Backpressure (M2 / A2): - New `InFlightGuard` with per-request ticket, queue + rejection accounting, `BackpressureRejected` raises 429 + `Retry-After` once `GATEWAY_QUEUE_TIMEOUT_SEC` elapses. - Streaming ticket ownership transfers to the generator so CancelledError from client disconnect still releases the slot. - `/internal/stats.concurrency` and `/metrics` expose in_flight/queued/ accepted_total/rejected_total/max_in_flight. Multi-instance pool (M2 / A1 + B3): - New `LingmaPool` with N processes, each with its own workDir, socket port (dynamic when N>1), and `AutoLoginManager`. - Account parser supports CSV (`u1:p1,u2:p2`) and JSON formats via `LINGMA_ACCOUNTS`; falls back to `LINGMA_USERNAME/LINGMA_PASSWORD` for backwards compatibility (N=1 keeps legacy paths/ports). - Routing: sticky affinity by `user` / system-prompt hash, then least-in-flight, finally round-robin fallback for unhealthy pool. - `/healthz` reports per-instance state and ready count. - `/internal/stats.pool` and `/metrics` expose per-instance `gateway_pool_instance_in_flight{name}` / `gateway_pool_instance_ready{name}`. - `/internal/auto-login/start?instance=inst-N` targets a specific instance; `/internal/auto-login/status` lists all instances. Compat notes: - `.env.example` adds `METRICS_TOKEN`, `LOG_LEVEL`, `GATEWAY_MAX_IN_FLIGHT`, `GATEWAY_QUEUE_TIMEOUT_SEC`, `LINGMA_ACCOUNTS`, `LINGMA_INSTANCE_COUNT`. - `.gitignore` cleaned up data/ duplication. - Existing single-instance deployments keep working without config change. Made-with: Cursor
2026-04-18 07:40:32 +08:00
parent 6114c66aed
commit 707acc9005
11 changed files with 1360 additions and 222 deletions
--- a/app/openai_schema.py
+++ b/app/openai_schema.py
@@ -1,13 +1,22 @@
 from __future__ import annotations

-from typing import Literal
+from typing import Any, Literal

 from pydantic import BaseModel, Field


+# Keep permissive: OpenAI clients routinely send list-of-parts (multi-modal) or None
+# (for tool calls). We flatten to plain text downstream.
+MessageContent = str | list[dict[str, Any]] | None
+
+
 class ChatMessage(BaseModel):
-    role: Literal["system", "user", "assistant", "tool"]
-    content: str
+    # OpenAI supports "developer" on newer API versions in addition to the classic set.
+    role: Literal["system", "user", "assistant", "tool", "developer", "function"]
+    content: MessageContent = None
+    name: str | None = None
+    tool_call_id: str | None = None
+    tool_calls: list[dict[str, Any]] | None = None


 class ChatCompletionsRequest(BaseModel):
@@ -16,6 +25,11 @@ class ChatCompletionsRequest(BaseModel):
    stream: bool = False
    temperature: float | None = None
    top_p: float | None = None
+    max_tokens: int | None = None
+    user: str | None = None
+    stream_options: dict[str, Any] | None = None
+    tools: list[dict[str, Any]] | None = None
+    tool_choice: Any | None = None


 class ModelData(BaseModel):
@@ -35,6 +49,7 @@ class ChatCompletionChoice(BaseModel):
    index: int = 0
    finish_reason: str | None = "stop"
    message: dict = Field(default_factory=dict)
+    logprobs: Any | None = None


 class ChatCompletionResponse(BaseModel):
@@ -43,3 +58,34 @@ class ChatCompletionResponse(BaseModel):
    created: int
    model: str
    choices: list[ChatCompletionChoice]
+    system_fingerprint: str | None = None
+
+
+def flatten_content(content: MessageContent) -> str:
+    """Reduce OpenAI multi-part content to a plain string prompt for Lingma."""
+    if content is None:
+        return ""
+    if isinstance(content, str):
+        return content
+    if isinstance(content, list):
+        parts: list[str] = []
+        for item in content:
+            if not isinstance(item, dict):
+                parts.append(str(item))
+                continue
+            t = item.get("type")
+            if t == "text":
+                text = item.get("text") or ""
+                if text:
+                    parts.append(text)
+            elif t in ("image_url", "input_image"):
+                # Lingma 不支持多模态，降级成占位符，保留语义信号
+                parts.append("[image]")
+            elif t == "input_audio":
+                parts.append("[audio]")
+            else:
+                text = item.get("text") or item.get("content")
+                if isinstance(text, str) and text:
+                    parts.append(text)
+        return "\n".join(p for p in parts if p)
+    return str(content)