perf: session reuse for multi-turn latency
- Add SessionCache (LRU + TTL, per-API-key scoped) mapping conversation-prefix hash -> upstream Lingma sessionId.
- Hash only user/system/developer turns so client-side assistant reformatting doesn't invalidate the key.
- On cache hit: reuse sessionId, send only the latest user message with isReply=true, and stick the request to the instance that originally served it.
- LingmaGatewayClient.chat_complete/chat_stream accept session_id/is_reply and report the real finish.sessionId via out_meta so we persist what Lingma actually allocated.
- Invalidate cache on non-stream failure; skip writes on cancelled/partial streams.
- Expose cache stats in /internal/stats and /metrics.
- Configurable via SESSION_REUSE_ENABLED / SESSION_CACHE_MAX_ENTRIES / SESSION_CACHE_TTL_SEC (documented in README + .env.example).

Made-with: Cursor
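The keying and eviction behaviour described in the message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `prefix_key`, the method names on `SessionCache`, the injected `clock` parameter, and the choice of SHA-256 over a JSON encoding are all assumptions made for the example.

```python
import hashlib
import json
import time
from collections import OrderedDict

# Roles that contribute to the key; assistant turns are skipped so that
# client-side reformatting of assistant text does not change the hash.
HASHED_ROLES = {"user", "system", "developer"}


def prefix_key(api_key: str, messages: list) -> str:
    """Hash a conversation prefix, scoped to one API key (hypothetical helper)."""
    turns = [[m["role"], m["content"]] for m in messages if m["role"] in HASHED_ROLES]
    payload = json.dumps([api_key, turns], ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SessionCache:
    """LRU + TTL map from prefix hash to upstream session id (sketch only)."""

    def __init__(self, max_entries=256, ttl_sec=1800.0, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_sec
        self._clock = clock
        self._data = OrderedDict()  # key -> (session_id, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        session_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop the entry and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return session_id

    def put(self, key, session_id):
        self._data[key] = (session_id, self._clock() + self._ttl)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def invalidate(self, key):
        self._data.pop(key, None)
```

Under these assumptions, a follow-up request would look up `prefix_key(api_key, messages[:-1])` (everything before the newest user message) and, after a successful reply, store the session id under `prefix_key(api_key, messages)` so that the next turn can hit.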
@@ -34,6 +34,9 @@ class Settings:
     auto_login_max_retry: int
     accounts: list[LingmaAccount] = field(default_factory=list)
     instance_count: int = 1
+    session_reuse_enabled: bool = True
+    session_cache_max_entries: int = 256
+    session_cache_ttl_sec: float = 1800.0
 
 
 def _bool_env(name: str, default: bool) -> bool:
@@ -131,4 +134,7 @@ def load_settings() -> Settings:
         auto_login_max_retry=int(os.getenv("AUTO_LOGIN_MAX_RETRY", "2")),
         accounts=accounts,
         instance_count=instance_count,
+        session_reuse_enabled=_bool_env("SESSION_REUSE_ENABLED", True),
+        session_cache_max_entries=int(os.getenv("SESSION_CACHE_MAX_ENTRIES", "256")),
+        session_cache_ttl_sec=float(os.getenv("SESSION_CACHE_TTL_SEC", "1800")),
     )
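The body of `_bool_env` falls outside these hunks, so its exact parsing rules are not visible here. A helper of that shape might look like the sketch below; the name `bool_env_sketch`, the accepted spellings, and the fall-back-to-default behaviour for unrecognised values are assumptions, not the repository's implementation.

```python
import os

# Accepted spellings are an assumption made for this sketch.
_TRUE_VALUES = {"1", "true", "yes", "on"}
_FALSE_VALUES = {"0", "false", "no", "off"}


def bool_env_sketch(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment, falling back to `default`."""
    raw = os.getenv(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    return default  # unrecognised value: keep the default rather than guess


# Reading the three new settings with the defaults shown in the diff.
session_reuse_enabled = bool_env_sketch("SESSION_REUSE_ENABLED", True)
session_cache_max_entries = int(os.getenv("SESSION_CACHE_MAX_ENTRIES", "256"))
session_cache_ttl_sec = float(os.getenv("SESSION_CACHE_TTL_SEC", "1800"))
```

With none of the variables set, this yields reuse enabled, at most 256 cached entries, and a 30-minute TTL, matching the dataclass defaults in the first hunk.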