perf: session reuse for multi-turn latency

- Add SessionCache (LRU + TTL, per-API-key scoped) mapping
  conversation-prefix hash -> upstream Lingma sessionId.
- Hash only user/system/developer turns so client-side
  assistant reformatting doesn't invalidate the key.
- On cache hit: reuse sessionId, send only the latest user
  message with isReply=true, and stick the request to the
  instance that originally served it.
- LingmaGatewayClient.chat_complete/chat_stream accept
  session_id/is_reply and report the real finish.sessionId
  via out_meta so we persist what Lingma actually allocated.
- Invalidate cache on non-stream failure; skip writes on
  cancelled/partial streams.
- Expose cache stats in /internal/stats and /metrics.
- Configurable via SESSION_REUSE_ENABLED / SESSION_CACHE_MAX_ENTRIES
  / SESSION_CACHE_TTL_SEC (documented in README + .env.example).

Made-with: Cursor
This commit is contained in:
GitHub Actions
2026-04-18 08:10:39 +08:00
parent d209d8ac0b
commit dfdb7087dc
6 changed files with 360 additions and 19 deletions

View File

@@ -67,3 +67,11 @@ LINGMA_PASSWORD=
LINGMA_ACCOUNTS=
# 实例数量:默认等于 LINGMA_ACCOUNTS 数;显式指定时账号不足会循环复用并打 warning
LINGMA_INSTANCE_COUNT=
# ==== 会话复用(多轮对话命中上游 KV cache减少首 token 延迟) ====
# 开关(默认开)
SESSION_REUSE_ENABLED=true
# 最多缓存多少条会话 (LRU)
SESSION_CACHE_MAX_ENTRIES=256
# 会话 TTL 秒数;超时自动失效,避免 Lingma 侧早已回收还在命中
SESSION_CACHE_TTL_SEC=1800