perf: session reuse for multi-turn latency

- Add SessionCache (LRU + TTL, per-API-key scoped) mapping conversation-prefix hash -> upstream Lingma sessionId.
- Hash only user/system/developer turns so client-side assistant reformatting doesn't invalidate the key.
- On cache hit: reuse sessionId, send only the latest user message with isReply=true, and stick the request to the instance that originally served it.
- LingmaGatewayClient.chat_complete/chat_stream accept session_id/is_reply and report the real finish.sessionId via out_meta so we persist what Lingma actually allocated.
- Invalidate cache on non-stream failure; skip writes on cancelled/partial streams.
- Expose cache stats in /internal/stats and /metrics.
- Configurable via SESSION_REUSE_ENABLED / SESSION_CACHE_MAX_ENTRIES / SESSION_CACHE_TTL_SEC (documented in README + .env.example).

Made-with: Cursor
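
To make the cache design described in this message concrete, here is a minimal sketch of an LRU + TTL session cache keyed by API key and conversation-prefix hash. It is an illustration under stated assumptions, not the code in this commit: the names `prefix_hash`, `HASHED_ROLES`, `get`, `put`, `invalidate`, and the injectable `clock` parameter are hypothetical, and the choice of SHA-256 over a JSON serialization is assumed rather than taken from the source.

```python
import hashlib
import json
import time
from collections import OrderedDict

# Assumption: only these roles contribute to the key, per the commit message.
HASHED_ROLES = {"system", "developer", "user"}


def prefix_hash(messages):
    """Hash a conversation prefix, skipping assistant turns entirely."""
    kept = [[m["role"], m.get("content")] for m in messages if m["role"] in HASHED_ROLES]
    payload = json.dumps(kept, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SessionCache:
    """LRU + TTL map from (api_key, prefix hash) to an upstream session id."""

    def __init__(self, max_entries=256, ttl_sec=1800.0, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl_sec = ttl_sec
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # (api_key, key) -> (session_id, expires_at)

    def get(self, api_key, key):
        entry = self._entries.get((api_key, key))
        if entry is None:
            return None
        session_id, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop it so a stale upstream session is never reused.
            del self._entries[(api_key, key)]
            return None
        self._entries.move_to_end((api_key, key))  # mark as most recently used
        return session_id

    def put(self, api_key, key, session_id):
        self._entries[(api_key, key)] = (session_id, self._clock() + self._ttl_sec)
        self._entries.move_to_end((api_key, key))
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def invalidate(self, api_key, key):
        self._entries.pop((api_key, key), None)
```

One plausible way to use it: after a successful turn, store the session under `prefix_hash(messages)`; on the next request, look up `prefix_hash(messages[:-1])`. Because assistant turns are skipped, both calls produce the same key even if the client reformats the assistant reply.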
2026-04-18 08:10:39 +08:00
parent d209d8ac0b
commit dfdb7087dc
6 changed files with 360 additions and 19 deletions
--- a/README.md
+++ b/README.md
@@ -70,6 +70,9 @@ cp .env.example .env
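 - `GATEWAY_QUEUE_TIMEOUT_SEC`: queue wait timeout in seconds (default 30; once exceeded, the gateway responds with 429 + `Retry-After`)
 - `LINGMA_ACCOUNTS`: multi-account instance pool, in the form `u1:p1,u2:p2` or a JSON array; when set, each account starts its own Lingma subprocess
 - `LINGMA_INSTANCE_COUNT`: number of instances (defaults to the number of accounts; when set explicitly and there are too few accounts, accounts are reused in rotation)
+- `SESSION_REUSE_ENABLED`: reuse the upstream sessionId across multi-turn conversations (default `true`). On a cache hit, only the latest user message is sent to Lingma, which hits the upstream KV cache and significantly lowers first-token latency from the second turn onward
+- `SESSION_CACHE_MAX_ENTRIES`: session cache capacity (LRU, default 256)
+- `SESSION_CACHE_TTL_SEC`: session cache TTL in seconds (default 1800; entries expire automatically to avoid reusing a session that Lingma has already reclaimed)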

 ### Minimal required `.env` example