Behavior hardening (M1):
- Fix `_chat_streams` memory leak: pop_stream on completion, error, and
client disconnect.
- Add WebSocket reconnect with state machine (stopped/starting/ready/
reconnecting/failed/closed) and exponential backoff, so a Lingma
restart no longer requires restarting the gateway.
- Lazy initialization: startup failure is non-fatal, first real request
triggers retry, `/healthz` reflects readiness.
- Migrate FastAPI on_event to lifespan.
- Structured JSON logging with request_id ContextVar; `x-request-id`
propagated to responses.
- SSE now sets `Cache-Control: no-cache`, `X-Accel-Buffering: no` to
defeat proxy buffering.
- OpenAI schema compatibility: `content` accepts str | list[parts] | None,
added `developer`/`function` roles, `tools/tool_choice/stream_options/
user/max_tokens` fields, and `stream_options.include_usage` emits final
usage chunk.
- `require_bearer` uses `hmac.compare_digest`; `/metrics` now requires
Bearer when `METRICS_TOKEN` or `API_KEYS` are set.
- Python 3.10/3.11 `TimeoutError` vs `asyncio.TimeoutError` unified.
- Error responses no longer leak `auto_login.status()` details.
Backpressure (M2 / A2):
- New `InFlightGuard` with per-request ticket, queue + rejection
accounting, `BackpressureRejected` raises 429 + `Retry-After` once
`GATEWAY_QUEUE_TIMEOUT_SEC` elapses.
- Streaming ticket ownership transfers to the generator so CancelledError
from client disconnect still releases the slot.
- `/internal/stats.concurrency` and `/metrics` expose in_flight/queued/
accepted_total/rejected_total/max_in_flight.
Multi-instance pool (M2 / A1 + B3):
- New `LingmaPool` with N processes, each with its own workDir, socket
port (dynamic when N>1), and `AutoLoginManager`.
- Account parser supports CSV (`u1:p1,u2:p2`) and JSON formats via
`LINGMA_ACCOUNTS`; falls back to `LINGMA_USERNAME/LINGMA_PASSWORD` for
backwards compatibility (N=1 keeps legacy paths/ports).
- Routing: sticky affinity by `user` / system-prompt hash, then
least-in-flight, finally round-robin fallback for unhealthy pool.
- `/healthz` reports per-instance state and ready count.
- `/internal/stats.pool` and `/metrics` expose per-instance
`gateway_pool_instance_in_flight{name}` / `gateway_pool_instance_ready{name}`.
- `/internal/auto-login/start?instance=inst-N` targets a specific instance;
`/internal/auto-login/status` lists all instances.
Compat notes:
- `.env.example` adds `METRICS_TOKEN`, `LOG_LEVEL`, `GATEWAY_MAX_IN_FLIGHT`,
`GATEWAY_QUEUE_TIMEOUT_SEC`, `LINGMA_ACCOUNTS`, `LINGMA_INSTANCE_COUNT`.
- `.gitignore` cleaned up data/ duplication.
- Existing single-instance deployments keep working without config change.
Made-with: Cursor
92 lines
2.7 KiB
Python
92 lines
2.7 KiB
Python
from __future__ import annotations
|
|
|
|
from typing import Any, Literal
|
|
|
|
from pydantic import BaseModel, Field
|
|
|
|
|
|
# Keep permissive: OpenAI clients routinely send list-of-parts (multi-modal) or None
|
|
# (for tool calls). We flatten to plain text downstream.
|
|
MessageContent = str | list[dict[str, Any]] | None
|
|
|
|
|
|
class ChatMessage(BaseModel):
|
|
# OpenAI supports "developer" on newer API versions in addition to the classic set.
|
|
role: Literal["system", "user", "assistant", "tool", "developer", "function"]
|
|
content: MessageContent = None
|
|
name: str | None = None
|
|
tool_call_id: str | None = None
|
|
tool_calls: list[dict[str, Any]] | None = None
|
|
|
|
|
|
class ChatCompletionsRequest(BaseModel):
|
|
model: str
|
|
messages: list[ChatMessage]
|
|
stream: bool = False
|
|
temperature: float | None = None
|
|
top_p: float | None = None
|
|
max_tokens: int | None = None
|
|
user: str | None = None
|
|
stream_options: dict[str, Any] | None = None
|
|
tools: list[dict[str, Any]] | None = None
|
|
tool_choice: Any | None = None
|
|
|
|
|
|
class ModelData(BaseModel):
|
|
id: str
|
|
name: str | None = None
|
|
object: str = "model"
|
|
created: int = 0
|
|
owned_by: str = "lingma"
|
|
|
|
|
|
class ModelsResponse(BaseModel):
|
|
object: str = "list"
|
|
data: list[ModelData]
|
|
|
|
|
|
class ChatCompletionChoice(BaseModel):
|
|
index: int = 0
|
|
finish_reason: str | None = "stop"
|
|
message: dict = Field(default_factory=dict)
|
|
logprobs: Any | None = None
|
|
|
|
|
|
class ChatCompletionResponse(BaseModel):
|
|
id: str
|
|
object: str = "chat.completion"
|
|
created: int
|
|
model: str
|
|
choices: list[ChatCompletionChoice]
|
|
system_fingerprint: str | None = None
|
|
|
|
|
|
def flatten_content(content: MessageContent) -> str:
|
|
"""Reduce OpenAI multi-part content to a plain string prompt for Lingma."""
|
|
if content is None:
|
|
return ""
|
|
if isinstance(content, str):
|
|
return content
|
|
if isinstance(content, list):
|
|
parts: list[str] = []
|
|
for item in content:
|
|
if not isinstance(item, dict):
|
|
parts.append(str(item))
|
|
continue
|
|
t = item.get("type")
|
|
if t == "text":
|
|
text = item.get("text") or ""
|
|
if text:
|
|
parts.append(text)
|
|
elif t in ("image_url", "input_image"):
|
|
# Lingma 不支持多模态,降级成占位符,保留语义信号
|
|
parts.append("[image]")
|
|
elif t == "input_audio":
|
|
parts.append("[audio]")
|
|
else:
|
|
text = item.get("text") or item.get("content")
|
|
if isinstance(text, str) and text:
|
|
parts.append(text)
|
|
return "\n".join(p for p in parts if p)
|
|
return str(content)
|