lingma-proxy-compose/docs/tool-emulation-checklist.md

# Tool Emulation Checklist

This checklist is for implementation work.

It is not meant to explain the theory again. It breaks plain-chat tool emulation into concrete surfaces that can be implemented and validated incrementally.

## 1. Prompt Contract

- tell the model that tools are available
- list tool names, short descriptions, and schema summaries
- define a fixed action format
- define multi-turn rules
- encode `tool_choice` constraints
- include at least one valid action example
- ideally include one example where a tool result arrives and the model decides what to do next

Acceptance:

- the first turn reliably emits a valid action block
- later turns do not collapse into plain explanation after a tool result

## 2. Request Normalization

- OpenAI:
  - parse `tools`
  - parse `tool_choice`
  - parse `assistant.tool_calls`
  - parse `tool`
- Anthropic:
  - parse `tools`
  - parse `tool_choice`
  - parse `tool_use`
  - parse `tool_result`
- normalize everything into one internal structure
- detect tool history even when the current turn does not repeat `tools`

Acceptance:

- emulation stays active on later turns without repeated tool definitions

## 3. Tool History Projection

- project historical assistant tool calls back into action text
- do not pass downstream protocol-specific history directly to the upstream model
- preserve tool name, arguments, and call id where useful

Acceptance:

- the model can “see” its own previous actions in later turns

## 4. Tool Result Continuation

- do not feed raw tool output back without framing
- wrap tool results into an explicit continuation message
- handle empty, partial, and error outputs consistently

Acceptance:

- after a tool result, the model can either call another tool or finish naturally

## 5. Parser Contract

- recognize both ` ```json action ` and plain ` ```json `
- tolerate smart quotes, trailing commas, and stringified argument JSON
- extract `tool`, `name`, `parameters`, `arguments`, or `input`
- support multiple blocks in one reply
- strip action blocks from normal assistant text

Acceptance:

- multiple action blocks can be parsed reliably

## 6. Retry Policy

- trigger when:
  - a tool call was expected but no action block was produced
  - refusal language is detected
  - `tool_choice=any`
  - `tool_choice=tool`
- retry with a stricter message
- bound retry count
- log retry reason

Acceptance:

- refusal-style replies can be corrected without infinite loops

## 7. Refusal Detection

- maintain a refusal phrase set
- detect both hard refusals and soft “environment limitation” answers
- distinguish between:
  - a legitimate no-tool answer
  - a failed tool-use turn

Acceptance:

- common “tools are unavailable” replies trigger retry when appropriate

## 8. Response Re-encoding

- OpenAI:
  - emit `message.tool_calls`
  - set `finish_reason = tool_calls`
- Anthropic:
  - emit `content[].tool_use`
  - set `stop_reason = tool_use`
- preserve normal text when no tool call is present

Acceptance:

- downstream clients remain unaware that the upstream lacks native tools

## 9. Streaming Strategy

- OpenAI:
  - role chunk
  - text deltas
  - tool call deltas
- Anthropic:
  - `message_start`
  - `content_block_start`
  - `content_block_delta`
  - `content_block_stop`
  - `message_delta`
  - `message_stop`
- document clearly when streaming is synthesized from a completed non-stream result

Acceptance:

- downstream stream consumers receive protocol-valid event sequences

## 10. Multi-turn State Machine

- distinguish at least:
  - first decision
  - tool call emitted
  - waiting for tool result
  - tool result received, next decision pending
  - final answer
- derive state from message history, not only the current payload
- do not confuse “tool history exists” with “another tool call is mandatory”

Acceptance:

- agent loops remain stable across more than one turn

## 11. Observability

- log:
  - whether emulation is active
  - how many tool calls were parsed
  - whether retry fired
  - which refusal signal matched
- ideally log whether:
  - the prompt contract was injected
  - tool history was detected

Acceptance:

- failures can be localized to prompt, parser, retry, or state management

## 12. Test Matrix

- OpenAI:
  - single-turn tool call
  - multi-turn tool result continuation
  - later turn without repeated `tools`
  - forced tool
  - `tool_choice=any`
- Anthropic:
  - single-turn `tool_use`
  - multi-turn `tool_result` continuation
  - later turn without repeated `tools`
  - streaming `tool_use`
- error cases:
  - refusal
  - invalid JSON
  - multiple action blocks
  - plain-text final answer

Acceptance:

- both “first tool turn” and “second-turn continuation” are covered

## 13. Recommended Next Priorities

If the system already works, the highest-value next improvements are:

1. stronger few-shot for “tool result arrives, then call another tool”
2. better history-aware retry policy
3. finer refusal categories
4. stronger parser tolerance
5. richer streaming behavior