perf: session reuse for multi-turn latency

- Add SessionCache (LRU + TTL, per-API-key scoped) mapping conversation-prefix hash -> upstream Lingma sessionId.
- Hash only user/system/developer turns so client-side assistant reformatting doesn't invalidate the key.
- On cache hit: reuse sessionId, send only the latest user message with isReply=true, and stick the request to the instance that originally served it.
- LingmaGatewayClient.chat_complete/chat_stream accept session_id/is_reply and report the real finish.sessionId via out_meta so we persist what Lingma actually allocated.
- Invalidate cache on non-stream failure; skip writes on cancelled/partial streams.
- Expose cache stats in /internal/stats and /metrics.
- Configurable via SESSION_REUSE_ENABLED / SESSION_CACHE_MAX_ENTRIES / SESSION_CACHE_TTL_SEC (documented in README + .env.example).

Made-with: Cursor
2026-04-18 08:10:39 +08:00
parent d209d8ac0b
commit dfdb7087dc
6 changed files with 360 additions and 19 deletions
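
For readers skimming the commit, the cache described in the message can be sketched roughly as follows. This is an illustrative reconstruction, not the code in this commit: `prefix_hash`, `SessionCacheSketch`, the JSON-based hashing scheme, and the injected `clock` are all assumptions made for the example. It shows a key that ignores assistant turns, per-API-key scoping, TTL expiry, and LRU eviction.

```python
import hashlib
import json
import time
from collections import OrderedDict

# Roles that contribute to the key; assistant turns are skipped so that
# client-side reformatting of assistant text does not change the hash.
HASHED_ROLES = {"user", "system", "developer"}


def prefix_hash(messages):
    """Hypothetical helper: hash a conversation prefix, ignoring assistant turns."""
    kept = [[m["role"], m["content"]] for m in messages if m["role"] in HASHED_ROLES]
    blob = json.dumps(kept, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class SessionCacheSketch:
    """Illustrative LRU + TTL cache: (api_key, prefix hash) -> upstream session id."""

    def __init__(self, max_entries=256, ttl_sec=1800.0, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_sec
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (session_id, expires_at)

    def get(self, api_key, key):
        k = (api_key, key)
        entry = self._data.get(k)
        if entry is None:
            return None
        session_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[k]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(k)  # mark as most recently used
        return session_id

    def put(self, api_key, key, session_id):
        k = (api_key, key)
        self._data[k] = (session_id, self._clock() + self._ttl)
        self._data.move_to_end(k)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def invalidate(self, api_key, key):
        self._data.pop((api_key, key), None)
```

Under these assumptions, a request would look up `prefix_hash(messages[:-1])` (everything before the latest user message) and, after a successful completion, store the session id reported by the upstream under `prefix_hash(messages)` so the next turn can find it.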
--- a/.env.example
+++ b/.env.example
@@ -67,3 +67,11 @@ LINGMA_PASSWORD=
 LINGMA_ACCOUNTS=
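 # Instance count: defaults to the number of LINGMA_ACCOUNTS; if set explicitly with too few accounts, accounts are reused round-robin and a warning is logged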
 LINGMA_INSTANCE_COUNT=
+
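+# ==== Session reuse (multi-turn conversations hit the upstream KV cache, reducing time to first token) ====
+# Toggle (enabled by default)
+SESSION_REUSE_ENABLED=true
+# Maximum number of cached sessions (LRU)
+SESSION_CACHE_MAX_ENTRIES=256
+# Session TTL in seconds; entries expire automatically so we don't keep hitting sessions Lingma has already reclaimed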
+SESSION_CACHE_TTL_SEC=1800
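
As a rough illustration of how the three variables above might be consumed, the sketch below reads them into a small settings object. The names `SessionReuseSettings` and `load_session_reuse_settings`, and the set of strings treated as true, are assumptions for this example and are not taken from the commit; only the variable names and defaults come from `.env.example`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionReuseSettings:
    """Hypothetical container for the three session-reuse settings."""
    enabled: bool = True
    max_entries: int = 256
    ttl_sec: int = 1800


def load_session_reuse_settings(env=None):
    """Read settings from a mapping (os.environ by default).

    Unset or empty values fall back to the defaults shown in .env.example.
    """
    env = os.environ if env is None else env
    raw_enabled = (env.get("SESSION_REUSE_ENABLED") or "true").strip().lower()
    return SessionReuseSettings(
        # Assumed truthy spellings; the real parser may accept a different set.
        enabled=raw_enabled in {"1", "true", "yes", "on"},
        max_entries=int(env.get("SESSION_CACHE_MAX_ENTRIES") or 256),
        ttl_sec=int(env.get("SESSION_CACHE_TTL_SEC") or 1800),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in tests.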