198 lines
5.1 KiB
Markdown
198 lines
5.1 KiB
Markdown
# Tool Calling Implementation Checklist
|
|
|
|
This checklist covers the complete implementation of OpenAI / Anthropic compatible tool calling over a plain chat API.
|
|
|
|
It breaks the work into concrete surfaces that can be implemented and validated incrementally.
|
|
|
|
## 1. Prompt Contract
|
|
|
|
- tell the model that tools are available
|
|
- list tool names, short descriptions, and schema summaries
|
|
- define a fixed action format
|
|
- define multi-turn rules
|
|
- encode `tool_choice` constraints
|
|
- include at least one valid action example
|
|
- ideally include one example where a tool result arrives and the model decides what to do next
|
|
|
|
Acceptance:
|
|
|
|
- the first turn reliably emits a valid action block
|
|
- later turns do not collapse into plain explanation after a tool result
|
|
|
|
## 2. Request Normalization
|
|
|
|
- OpenAI:
|
|
- parse `tools`
|
|
- parse `tool_choice`
|
|
- parse `assistant.tool_calls`
|
|
- parse `tool`
|
|
- Anthropic:
|
|
- parse `tools`
|
|
- parse `tool_choice`
|
|
- parse `tool_use`
|
|
- parse `tool_result`
|
|
- normalize everything into one internal structure
|
|
- detect tool history even when the current turn does not repeat `tools`
|
|
|
|
Acceptance:
|
|
|
|
- tool calling stays active on later turns without repeated tool definitions
|
|
|
|
## 3. Tool History Projection
|
|
|
|
- project historical assistant tool calls back into action text
|
|
- do not pass downstream protocol-specific history directly to Lingma
|
|
- preserve tool name, arguments, and call id where useful
|
|
|
|
Acceptance:
|
|
|
|
- the model can “see” its own previous actions in later turns
|
|
|
|
## 4. Tool Result Continuation
|
|
|
|
- do not feed raw tool output back without framing
|
|
- wrap tool results into an explicit continuation message
|
|
- handle empty, partial, and error outputs consistently
|
|
|
|
Acceptance:
|
|
|
|
- after a tool result, the model can either call another tool or finish naturally
|
|
|
|
## 5. Parser Contract
|
|
|
|
- recognize both ` ```json action ` and plain ` ```json `
|
|
- tolerate smart quotes, trailing commas, and stringified argument JSON
|
|
- extract `tool`, `name`, `parameters`, `arguments`, or `input`
|
|
- support multiple blocks in one reply
|
|
- strip action blocks from normal assistant text
|
|
|
|
Acceptance:
|
|
|
|
- multiple action blocks can be parsed reliably
|
|
|
|
## 6. Retry Policy
|
|
|
|
- trigger when:
|
|
- a tool call was expected but no action block was produced
|
|
- refusal language is detected
|
|
- `tool_choice=any`
|
|
- `tool_choice=tool`
|
|
- retry with a stricter message
|
|
- bound retry count
|
|
- log retry reason
|
|
|
|
Acceptance:
|
|
|
|
- refusal-style replies can be corrected without infinite loops
|
|
|
|
## 7. Refusal Detection
|
|
|
|
- maintain a refusal phrase set
|
|
- detect both hard refusals and soft “environment limitation” answers
|
|
- distinguish between:
|
|
- a legitimate no-tool answer
|
|
- a failed tool-use turn
|
|
|
|
Acceptance:
|
|
|
|
- common “tools are unavailable” replies trigger retry when appropriate
|
|
|
|
## 8. Response Re-encoding
|
|
|
|
- OpenAI:
|
|
- emit `message.tool_calls`
|
|
- set `finish_reason = tool_calls`
|
|
- Anthropic:
|
|
- emit `content[].tool_use`
|
|
- set `stop_reason = tool_use`
|
|
- preserve normal text when no tool call is present
|
|
|
|
Acceptance:
|
|
|
|
- downstream clients remain unaware that Lingma does not expose native tools
|
|
|
|
## 9. Streaming Strategy
|
|
|
|
- OpenAI:
|
|
- role chunk
|
|
- text deltas
|
|
- tool call deltas
|
|
- Anthropic:
|
|
- `message_start`
|
|
- `content_block_start`
|
|
- `content_block_delta`
|
|
- `content_block_stop`
|
|
- `message_delta`
|
|
- `message_stop`
|
|
- document clearly when streaming is synthesized from a completed non-stream result
|
|
|
|
Acceptance:
|
|
|
|
- downstream stream consumers receive protocol-valid event sequences
|
|
|
|
## 10. Multi-turn State Machine
|
|
|
|
- distinguish at least:
|
|
- first decision
|
|
- tool call emitted
|
|
- waiting for tool result
|
|
- tool result received, next decision pending
|
|
- final answer
|
|
- derive state from message history, not only the current payload
|
|
- do not confuse “tool history exists” with “another tool call is mandatory”
|
|
|
|
Acceptance:
|
|
|
|
- agent loops remain stable across more than one turn
|
|
|
|
## 11. Observability
|
|
|
|
- log:
|
|
- whether tool calling is active
|
|
- how many tool calls were parsed
|
|
- whether retry fired
|
|
- which refusal signal matched
|
|
- ideally log whether:
|
|
- the prompt contract was injected
|
|
- tool history was detected
|
|
|
|
Acceptance:
|
|
|
|
- failures can be localized to prompt, parser, retry, or state management
|
|
|
|
## 12. Test Matrix
|
|
|
|
- OpenAI:
|
|
- single-turn tool call
|
|
- multi-turn tool result continuation
|
|
- later turn without repeated `tools`
|
|
- forced tool
|
|
- `tool_choice=any`
|
|
- `tool_choice=none`
|
|
- `parallel_tool_calls=false`
|
|
- Anthropic:
|
|
- single-turn `tool_use`
|
|
- multi-turn `tool_result` continuation
|
|
- later turn without repeated `tools`
|
|
- streaming `tool_use`
|
|
- `tool_choice=any` / `tool_choice=none`
|
|
- error cases:
|
|
- refusal
|
|
- invalid JSON
|
|
- multiple action blocks
|
|
- plain-text final answer
|
|
|
|
Acceptance:
|
|
|
|
- both “first tool turn” and “second-turn continuation” are covered
|
|
|
|
## 13. Recommended Next Priorities
|
|
|
|
If the system already works, the highest-value next improvements are:
|
|
|
|
1. stronger few-shot for “tool result arrives, then call another tool”
|
|
2. better history-aware retry policy
|
|
3. finer refusal categories
|
|
4. stronger parser tolerance
|
|
5. richer streaming behavior
|