Keep Lingma chat/ask payload source as numeric 1 for agent mode A/B validation against remote upstream timeout behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Force OpenAI tooling-context requests into agent mode and align Lingma ask payload fields for agent requests so server-side tool path matches VSCode semantics.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Forward tool/call/sync and tool/invoke events to Lingma with auto-approve and invokeResult so tool calls can complete end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Use agent ask_mode for Anthropic messages with tooling context so tool/write flows are executed, and add regression coverage plus docs/env updates for TOOL_FORWARD_ENABLED.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ensure session reuse is disabled for tooling contexts, include tool config in cache keys, and stabilize tool event merge/routing with expanded bridge tests.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Handle tool/invokeResult and richer tool/call/sync payloads in the client,
and document/retest the verified VSCode monitoring workflow for tool events.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add structured tool event propagation from Lingma stream/finish metadata and map it to OpenAI tool_calls and Anthropic tool_use/tool_result in both streaming and non-streaming responses. Add focused bridge tests and update docs/design notes to match current behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add /v1/messages/count_tokens and switch /v1/models to Anthropic-style key auth so Claude Code probes succeed consistently.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a wire-compatible Anthropic endpoint alongside the existing OpenAI one
so Claude Code / anthropic-sdk / Cursor Agent can hit Lingma directly.
- app/anthropic_schema.py (new): request model + content-block flattener
+ internal-messages adapter + affinity key helper. Handles text / image /
tool_use / tool_result blocks; unknown types degrade gracefully.
- app/auth.py: add require_anthropic_key (x-api-key, Bearer fallback)
and AnthropicAuthError so auth failures render in Anthropic's error
envelope instead of FastAPI's {detail:...} wrapper.
- app/main.py: POST /v1/messages. Shares LingmaPool / SessionCache /
InFlightGuard / StatsCollector with the OpenAI path — same api_key +
same conversation prefix hits the same upstream sessionId across both
protocols (KV cache carries over). Streaming emits the named Anthropic
event sequence (message_start / content_block_start / content_block_delta
/ content_block_stop / message_delta / message_stop). No claude-*
model mapping table: resolve_model's default fallback handles it.
- README.md / DESIGN.md: document the new endpoint, add decision 5.12,
iteration history M5, and a 4.3b streaming flow diagram.
- Bump FastAPI app version to 0.4.0.
Made-with: Cursor
- authz: new ADMIN_TOKEN gates /internal/*; METRICS_PUBLIC=false by default, so
/metrics returns 503 when neither METRICS_TOKEN nor API_KEYS is set
(previously leaked pool topology). Startup logs loudly if API_KEYS is empty
or admin falls back to chat keys.
- lingma_client: keep a Popen handle instead of orphaning Lingma with
start_new_session, drain stderr to logger at DEBUG, SIGTERM -> 5s grace ->
SIGKILL on shutdown. Fixes the zombie-process leak on container reload.
- pool: asyncio.gather to start N instances concurrently; N=2 pool shaves
~startup_timeout seconds off boot.
- Dockerfile: HEALTHCHECK hits /healthz and greps for pool_ready>0 so Docker
/ compose orchestrators see "stuck on login" as unhealthy.
Made-with: Cursor
Adds a lightweight way to pre-seed a Lingma workDir with an existing
logged-in session:
- New module session_bundle.py packs/unpacks only the four cache files
that make up a Lingma login (id, user, quota, config.json). Everything
else (db, logs, index, diagnosis) stays local so bundles stay tiny
and never leak session-specific artefacts.
- Safety: path-traversal/symlink members are rejected; size is capped;
refuses to export from a workDir that isn't actually logged in;
sensitive cache/user is chmod'd 0600 on restore.
- LingmaAccount gains optional session_bundle_b64 / session_bundle_file;
LINGMA_SESSION_BUNDLE[_FILE] env provide the singleton fallback.
Credentials become optional when a bundle is supplied.
- LingmaPool.start() restores the bundle into each instance workDir
only if it isn't already logged in, so persistent volumes aren't
clobbered and a corrupt bundle falls back to Playwright gracefully.
- POST /internal/session/export returns the bundle as base64; ?instance=
selects a specific pool instance. Requires an authed, already-logged-in
instance to prevent exporting empties.
- README + .env.example document the end-to-end flow.
Made-with: Cursor
Lets callers see Lingma's raw config/queryModels response, so the
official per-key displayName/description is discoverable without
reverse-engineering the VSIX. Falls back to the pool's pick() unless
a specific instance is requested.
Made-with: Cursor
- Add SessionCache (LRU + TTL, per-API-key scoped) mapping
conversation-prefix hash -> upstream Lingma sessionId.
- Hash only user/system/developer turns so client-side
assistant reformatting doesn't invalidate the key.
- On cache hit: reuse sessionId, send only the latest user
message with isReply=true, and stick the request to the
instance that originally served it.
- LingmaGatewayClient.chat_complete/chat_stream accept
session_id/is_reply and report the real finish.sessionId
via out_meta so we persist what Lingma actually allocated.
- Invalidate cache on non-stream failure; skip writes on
cancelled/partial streams.
- Expose cache stats in /internal/stats and /metrics.
- Configurable via SESSION_REUSE_ENABLED / SESSION_CACHE_MAX_ENTRIES
/ SESSION_CACHE_TTL_SEC (documented in README + .env.example).
Made-with: Cursor
Lingma streams answers via chat/answer + chat/finish notifications and
never sends a JSON-RPC response for chat/ask. The old code awaited
rpc.request("chat/ask") and swallowed the TimeoutError, so every chat
was forced to wait the full rpc_timeout (default 30s) before draining
the stream queue - even though the first token was already present in
the queue within ~2s.
Effect:
- non-stream TTFB dropped from ~30s to actual upstream latency (~2-3s).
- stream first-chunk dropped from ~30s to upstream first-token latency.
- consume_stream idle timeout decoupled from rpc_timeout so shortening
rpc_timeout no longer starves long completions.
Switch chat/ask to rpc.notify (fire-and-forget) and rely entirely on the
existing chat/answer + chat/finish handlers for result delivery.
Made-with: Cursor
Behavior hardening (M1):
- Fix `_chat_streams` memory leak: pop_stream on completion, error, and
client disconnect.
- Add WebSocket reconnect with state machine (stopped/starting/ready/
reconnecting/failed/closed) and exponential backoff, so a Lingma
restart no longer requires restarting the gateway.
- Lazy initialization: startup failure is non-fatal, first real request
triggers retry, `/healthz` reflects readiness.
- Migrate FastAPI on_event to lifespan.
- Structured JSON logging with request_id ContextVar; `x-request-id`
propagated to responses.
- SSE now sets `Cache-Control: no-cache`, `X-Accel-Buffering: no` to
defeat proxy buffering.
- OpenAI schema compatibility: `content` accepts str | list[parts] | None,
added `developer`/`function` roles, `tools/tool_choice/stream_options/
user/max_tokens` fields, and `stream_options.include_usage` emits final
usage chunk.
- `require_bearer` uses `hmac.compare_digest`; `/metrics` now requires
Bearer when `METRICS_TOKEN` or `API_KEYS` are set.
- Python 3.10/3.11 `TimeoutError` vs `asyncio.TimeoutError` unified.
- Error responses no longer leak `auto_login.status()` details.
Backpressure (M2 / A2):
- New `InFlightGuard` with per-request ticket, queue + rejection
accounting, `BackpressureRejected` raises 429 + `Retry-After` once
`GATEWAY_QUEUE_TIMEOUT_SEC` elapses.
- Streaming ticket ownership transfers to the generator so CancelledError
from client disconnect still releases the slot.
- `/internal/stats.concurrency` and `/metrics` expose in_flight/queued/
accepted_total/rejected_total/max_in_flight.
Multi-instance pool (M2 / A1 + B3):
- New `LingmaPool` with N processes, each with its own workDir, socket
port (dynamic when N>1), and `AutoLoginManager`.
- Account parser supports CSV (`u1:p1,u2:p2`) and JSON formats via
`LINGMA_ACCOUNTS`; falls back to `LINGMA_USERNAME/LINGMA_PASSWORD` for
backwards compatibility (N=1 keeps legacy paths/ports).
- Routing: sticky affinity by `user` / system-prompt hash, then
least-in-flight, finally round-robin fallback for unhealthy pool.
- `/healthz` reports per-instance state and ready count.
- `/internal/stats.pool` and `/metrics` expose per-instance
`gateway_pool_instance_in_flight{name}` / `gateway_pool_instance_ready{name}`.
- `/internal/auto-login/start?instance=inst-N` targets a specific instance;
`/internal/auto-login/status` lists all instances.
Compat notes:
- `.env.example` adds `METRICS_TOKEN`, `LOG_LEVEL`, `GATEWAY_MAX_IN_FLIGHT`,
`GATEWAY_QUEUE_TIMEOUT_SEC`, `LINGMA_ACCOUNTS`, `LINGMA_INSTANCE_COUNT`.
- `.gitignore` cleaned up data/ duplication.
- Existing single-instance deployments keep working without config change.
Made-with: Cursor