feat: M1+M2 gateway hardening and multi-instance pool
Behavior hardening (M1):
- Fix `_chat_streams` memory leak: pop_stream on completion, error, and
client disconnect.
- Add WebSocket reconnect with state machine (stopped/starting/ready/
reconnecting/failed/closed) and exponential backoff, so a Lingma
restart no longer requires restarting the gateway.
- Lazy initialization: startup failure is non-fatal, first real request
triggers retry, `/healthz` reflects readiness.
- Migrate FastAPI on_event to lifespan.
- Structured JSON logging with request_id ContextVar; `x-request-id`
propagated to responses.
- SSE now sets `Cache-Control: no-cache`, `X-Accel-Buffering: no` to
defeat proxy buffering.
- OpenAI schema compatibility: `content` accepts str | list[parts] | None,
added `developer`/`function` roles, `tools/tool_choice/stream_options/
user/max_tokens` fields, and `stream_options.include_usage` emits final
usage chunk.
- `require_bearer` uses `hmac.compare_digest`; `/metrics` now requires
Bearer when `METRICS_TOKEN` or `API_KEYS` are set.
- Python 3.10/3.11 `TimeoutError` vs `asyncio.TimeoutError` unified.
- Error responses no longer leak `auto_login.status()` details.
Backpressure (M2 / A2):
- New `InFlightGuard` with per-request ticket, queue + rejection
accounting, `BackpressureRejected` raises 429 + `Retry-After` once
`GATEWAY_QUEUE_TIMEOUT_SEC` elapses.
- Streaming ticket ownership transfers to the generator so CancelledError
from client disconnect still releases the slot.
- `/internal/stats.concurrency` and `/metrics` expose in_flight/queued/
accepted_total/rejected_total/max_in_flight.
Multi-instance pool (M2 / A1 + B3):
- New `LingmaPool` with N processes, each with its own workDir, socket
port (dynamic when N>1), and `AutoLoginManager`.
- Account parser supports CSV (`u1:p1,u2:p2`) and JSON formats via
`LINGMA_ACCOUNTS`; falls back to `LINGMA_USERNAME/LINGMA_PASSWORD` for
backwards compatibility (N=1 keeps legacy paths/ports).
- Routing: sticky affinity by `user` / system-prompt hash, then
least-in-flight, finally round-robin fallback for unhealthy pool.
- `/healthz` reports per-instance state and ready count.
- `/internal/stats.pool` and `/metrics` expose per-instance
`gateway_pool_instance_in_flight{name}` / `gateway_pool_instance_ready{name}`.
- `/internal/auto-login/start?instance=inst-N` targets a specific instance;
`/internal/auto-login/status` lists all instances.
Compat notes:
- `.env.example` adds `METRICS_TOKEN`, `LOG_LEVEL`, `GATEWAY_MAX_IN_FLIGHT`,
`GATEWAY_QUEUE_TIMEOUT_SEC`, `LINGMA_ACCOUNTS`, `LINGMA_INSTANCE_COUNT`.
- `.gitignore` cleaned up data/ duplication.
- Existing single-instance deployments keep working without config change.
Made-with: Cursor
This commit is contained in:
@@ -1,13 +1,22 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Literal
|
||||
from typing import Any, Literal
|
||||
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
|
||||
# Keep permissive: OpenAI clients routinely send list-of-parts (multi-modal) or None
|
||||
# (for tool calls). We flatten to plain text downstream.
|
||||
MessageContent = str | list[dict[str, Any]] | None
|
||||
|
||||
|
||||
class ChatMessage(BaseModel):
|
||||
role: Literal["system", "user", "assistant", "tool"]
|
||||
content: str
|
||||
# OpenAI supports "developer" on newer API versions in addition to the classic set.
|
||||
role: Literal["system", "user", "assistant", "tool", "developer", "function"]
|
||||
content: MessageContent = None
|
||||
name: str | None = None
|
||||
tool_call_id: str | None = None
|
||||
tool_calls: list[dict[str, Any]] | None = None
|
||||
|
||||
|
||||
class ChatCompletionsRequest(BaseModel):
|
||||
@@ -16,6 +25,11 @@ class ChatCompletionsRequest(BaseModel):
|
||||
stream: bool = False
|
||||
temperature: float | None = None
|
||||
top_p: float | None = None
|
||||
max_tokens: int | None = None
|
||||
user: str | None = None
|
||||
stream_options: dict[str, Any] | None = None
|
||||
tools: list[dict[str, Any]] | None = None
|
||||
tool_choice: Any | None = None
|
||||
|
||||
|
||||
class ModelData(BaseModel):
|
||||
@@ -35,6 +49,7 @@ class ChatCompletionChoice(BaseModel):
|
||||
index: int = 0
|
||||
finish_reason: str | None = "stop"
|
||||
message: dict = Field(default_factory=dict)
|
||||
logprobs: Any | None = None
|
||||
|
||||
|
||||
class ChatCompletionResponse(BaseModel):
|
||||
@@ -43,3 +58,34 @@ class ChatCompletionResponse(BaseModel):
|
||||
created: int
|
||||
model: str
|
||||
choices: list[ChatCompletionChoice]
|
||||
system_fingerprint: str | None = None
|
||||
|
||||
|
||||
def flatten_content(content: MessageContent) -> str:
|
||||
"""Reduce OpenAI multi-part content to a plain string prompt for Lingma."""
|
||||
if content is None:
|
||||
return ""
|
||||
if isinstance(content, str):
|
||||
return content
|
||||
if isinstance(content, list):
|
||||
parts: list[str] = []
|
||||
for item in content:
|
||||
if not isinstance(item, dict):
|
||||
parts.append(str(item))
|
||||
continue
|
||||
t = item.get("type")
|
||||
if t == "text":
|
||||
text = item.get("text") or ""
|
||||
if text:
|
||||
parts.append(text)
|
||||
elif t in ("image_url", "input_image"):
|
||||
# Lingma 不支持多模态,降级成占位符,保留语义信号
|
||||
parts.append("[image]")
|
||||
elif t == "input_audio":
|
||||
parts.append("[audio]")
|
||||
else:
|
||||
text = item.get("text") or item.get("content")
|
||||
if isinstance(text, str) and text:
|
||||
parts.append(text)
|
||||
return "\n".join(p for p in parts if p)
|
||||
return str(content)
|
||||
|
||||
Reference in New Issue
Block a user