Streaming Responses and Real-Time Tool Calls in LLM APIs
How streaming LLM responses work when models emit tool calls mid-generation, including SSE framing, partial argument assembly, and UI patterns for rendering responsive agent interfaces.
If you are building an agent UI, streaming is not just a nice-to-have. It changes how the product feels, how quickly users trust it, and how you structure the client-side state machine.
The tricky part is that streaming text and tool calls do not always arrive as one clean sequence. A model may start by emitting a sentence, then decide it needs a tool, then continue streaming while the tool is still running. In some APIs, the tool call itself is also streamed in pieces: first the tool name, then fragments of arguments, then a final assembled call.
That means the UI cannot assume “assistant message complete” equals “assistant is done.” It needs to understand intermediate states.
Why streaming exists
Streaming is usually implemented so users see output sooner. Instead of waiting for the full completion, the server sends partial results as they are generated. This reduces perceived latency and makes the interface feel alive.
For developer tools and agent products, there is a second benefit: streaming exposes the model’s decision process in a way that can be rendered progressively. A user can see that the assistant is thinking, searching, or calling a function rather than staring at a blank panel.
The tradeoff is complexity. Once you stream, you are no longer handling a single response object. You are handling a sequence of events.
The SSE shape of a streamed response
A common transport for streaming LLM output is Server-Sent Events, or SSE. It is plain HTTP with a response body that stays open and emits events over time.
An SSE stream looks roughly like this:
data: {"type":"message.delta","delta":"Hello"}
data: {"type":"message.delta","delta":" world"}
data: {"type":"tool_call.delta","id":"call_1","name":"search","arguments":"{\"query\":\""}
data: {"type":"tool_call.delta","id":"call_1","arguments":"weather in SF\"}"}
data: [DONE]
The exact event schema varies by provider. OpenAI, Anthropic, and others do not serialize tool calls in exactly the same way. But the core pattern is consistent:
- each event contains a small chunk of state
- text arrives as deltas
- tool-call metadata may arrive incrementally
- a terminal event marks completion
On the browser side, EventSource is the built-in SSE consumer, though it only supports GET requests; chat APIs that need a POST body typically read the stream with fetch and a ReadableStream reader instead. On the server side, many frameworks can stream chunked responses directly.
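A minimal parser for the event lines above might look like this. The event schema here matches the hypothetical example in this article, not any specific provider's format:

```typescript
// Hypothetical event schema, matching the example stream above.
// Real provider schemas differ in field names and nesting.
type StreamEvent =
  | { type: "message.delta"; delta: string }
  | { type: "tool_call.delta"; id: string; name?: string; arguments?: string };

// Parse a single SSE line: returns the decoded event for a `data:` line,
// and null for the [DONE] sentinel, comments, and non-data lines.
function parseSSELine(line: string): StreamEvent | null {
  if (!line.startsWith("data:")) return null;
  const payload = line.slice("data:".length).trim();
  if (payload === "[DONE]") return null;
  return JSON.parse(payload) as StreamEvent;
}
```

In a real client you would feed this from a streaming body reader, buffering until each newline, since network chunks do not align with event boundaries.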
Partial tool call assembly
This is where many implementations break.
A model may emit a tool call whose arguments are not valid JSON until the final fragment arrives. If you try to execute the tool too early, you will either fail parsing or call the wrong thing.
The safe pattern is to maintain an accumulator per tool call:
- key it by `id` if the API provides one
- otherwise key it by `index` or message position
- append argument fragments in order
- parse only when the tool call is marked complete
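The accumulator pattern above can be sketched like this. Field names are assumptions, not a specific provider's schema:

```typescript
// One accumulator per in-flight tool call, keyed by id.
interface ToolCallAccumulator {
  id: string;
  name?: string;
  argumentChunks: string[];
}

const pending = new Map<string, ToolCallAccumulator>();

function applyToolCallDelta(delta: { id: string; name?: string; arguments?: string }): void {
  let acc = pending.get(delta.id);
  if (!acc) {
    acc = { id: delta.id, argumentChunks: [] };
    pending.set(delta.id, acc);
  }
  if (delta.name) acc.name = delta.name;                        // name usually arrives once
  if (delta.arguments) acc.argumentChunks.push(delta.arguments); // fragments append in order
}

// Call only after the stream marks the tool call complete.
function finalizeToolCall(id: string): { name: string; args: unknown } {
  const acc = pending.get(id);
  if (!acc?.name) throw new Error(`incomplete tool call: ${id}`);
  const args = JSON.parse(acc.argumentChunks.join("")); // throws if JSON is truncated
  pending.delete(id);
  return { name: acc.name, args };
}
```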
A simple mental model is:
- receive `name`
- receive `arguments` fragments
- concatenate fragments
- validate JSON
- execute tool
- stream tool result back into the conversation
In practice, you also want guards for malformed output. Even reliable APIs can produce truncated JSON if the connection drops or the model is interrupted. A robust client should treat the stream as a log of events, not as a promise that every fragment is immediately usable.
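One way to express that guard is a parse function that can fail without throwing, so the caller decides whether to keep accumulating or surface an error. This is a sketch, not a prescribed API:

```typescript
type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Treat assembled argument text as untrusted until it parses cleanly.
// Never execute a tool on input that fails this check.
function tryParseArguments(raw: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(raw) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```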
UI rendering patterns that actually work
The best UI pattern is usually not “show everything the model emits verbatim.” It is to render distinct states.
A useful layout is:
- assistant text area for streamed natural language
- tool status area for pending or running calls
- result area for tool output
- final answer area once the model incorporates the result
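These areas map naturally onto a discriminated union of UI states. The state names below are illustrative, not a standard:

```typescript
// Each variant corresponds to one of the rendering areas above.
type AssistantView =
  | { kind: "streaming_text"; text: string }
  | { kind: "tool_pending"; toolName: string }
  | { kind: "tool_running"; toolName: string }
  | { kind: "tool_result"; toolName: string; summary: string }
  | { kind: "final_answer"; text: string };

// The renderer branches on `kind`; the switch is exhaustive, so adding
// a state without handling it becomes a compile error.
function statusLabel(view: AssistantView): string {
  switch (view.kind) {
    case "streaming_text": return "Assistant is responding...";
    case "tool_pending":   return `Preparing ${view.toolName}...`;
    case "tool_running":   return `Running ${view.toolName}...`;
    case "tool_result":    return `${view.toolName}: ${view.summary}`;
    case "final_answer":   return "Done";
  }
}
```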
This separation matters because users interpret text and tool activity differently. If the assistant says “I’m checking that now,” and the UI shows a spinner on a tool row, the interaction feels coherent. If the UI just appends raw JSON into the chat transcript, it feels broken.
For example:
Assistant: I’m checking live pricing.
Tool: search_products(status: running)
Tool result: 12 matches found
Assistant: The lowest-priced option is...
You can also choose to hide intermediate tool details from casual users and expose them behind an expandable “activity” panel. That is often the right compromise for consumer products: keep the chat readable, but preserve transparency for debugging.
A contrarian view: do not stream everything
Streaming is often treated as the default good. It is not always.
If your tool call is the important part of the interaction, token-by-token text can distract from the real work. For example, if the model is about to call a database lookup, showing half-finished prose before the lookup completes may create a misleading sense of certainty.
In some products, a hybrid approach is better:
- stream only a short “working” indicator
- wait for tool completion
- then render a coherent answer
This is especially true when tool latency is the dominant delay. Users care less about seeing every token than about understanding what the system is doing.
Another nuance: streaming can make retries and cancellation harder. Once a user interrupts the assistant, you need to stop text generation, cancel any in-flight tool execution, and reconcile partial UI state. That requires explicit lifecycle handling, not just a streaming reader.
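A minimal lifecycle sketch, assuming a single AbortController gates both the stream reader and tool execution, so one user interrupt stops everything:

```typescript
// Per-turn lifecycle: register cleanup work (stream readers, tool calls)
// and cancel it all at once. Cancellation is idempotent.
function createTurnLifecycle() {
  const controller = new AbortController();
  const cleanups: Array<() => void> = [];

  return {
    // Pass this signal to fetch() and to tool executors.
    signal: controller.signal,
    // Register work that must stop when the user interrupts.
    onCancel(fn: () => void): void {
      if (controller.signal.aborted) fn();
      else cleanups.push(fn);
    },
    cancel(): void {
      if (controller.signal.aborted) return; // already cancelled
      controller.abort();
      cleanups.forEach((fn) => fn()); // reconcile partial UI state here
    },
  };
}
```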
Practical implementation notes
A few rules make these systems much easier to maintain:
- Separate transport from state. Parse SSE events into a normalized internal model before rendering.
- Use idempotent tool execution. If the same tool call is replayed after reconnect, avoid double side effects.
- Validate arguments before calling tools. Treat streamed JSON as untrusted until parsed.
- Keep the assistant message mutable. The final answer may change after tool results arrive.
- Log event boundaries. Debugging streamed systems without event traces is painful.
If you are building in React, this often means storing a conversation as a list of messages plus a list of active tool calls. Each incoming chunk updates one of those structures. The rendering layer simply reflects the current state.
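That state shape and per-chunk update can be sketched as a pure reducer, which fits React's useReducer directly. Types and field names here are illustrative:

```typescript
interface ConversationState {
  messages: { role: "user" | "assistant"; text: string }[];
  activeToolCalls: { id: string; name?: string; args: string }[];
}

type Chunk =
  | { type: "message.delta"; delta: string }
  | { type: "tool_call.delta"; id: string; name?: string; arguments?: string };

// Pure function: each incoming chunk updates exactly one structure.
function reduceChunk(state: ConversationState, chunk: Chunk): ConversationState {
  if (chunk.type === "message.delta") {
    const messages = [...state.messages];
    const last = messages[messages.length - 1];
    if (last?.role === "assistant") {
      // Append to the in-progress assistant message.
      messages[messages.length - 1] = { ...last, text: last.text + chunk.delta };
    } else {
      messages.push({ role: "assistant", text: chunk.delta });
    }
    return { ...state, messages };
  }
  // tool_call.delta: create or extend the matching accumulator.
  const exists = state.activeToolCalls.some((c) => c.id === chunk.id);
  const activeToolCalls = exists
    ? state.activeToolCalls.map((c) =>
        c.id === chunk.id
          ? { ...c, name: chunk.name ?? c.name, args: c.args + (chunk.arguments ?? "") }
          : c
      )
    : [...state.activeToolCalls, { id: chunk.id, name: chunk.name, args: chunk.arguments ?? "" }];
  return { ...state, activeToolCalls };
}
```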
If you are building on a framework like Next.js, SSE is usually straightforward to proxy, but you should still watch for buffering at the edge or by intermediate infrastructure. A “streaming” API that batches chunks every few seconds is not really streaming from the user’s point of view.
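When chasing buffering problems, these response headers are a common starting point. Whether they help depends on your infrastructure; X-Accel-Buffering, for instance, is specific to nginx-style proxies:

```typescript
// Headers commonly set on an SSE response to discourage intermediate
// buffering. Adjust for your proxy and CDN; this is a sketch, not a
// guarantee that every hop will honor them.
const SSE_HEADERS = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache, no-transform", // no-transform deters re-buffering proxies
  "Connection": "keep-alive",
  "X-Accel-Buffering": "no",                 // nginx-specific: disable proxy buffering
} as const;
```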
Real-time tool calls and agent UX
Tool calls change the meaning of streaming. You are no longer just showing language generation; you are showing a live control loop.
That creates a useful design principle: make the handoff visible. Users should be able to tell when the model is talking, when it is asking for help from a tool, and when it is waiting on the result.
The best systems feel less like a chatbot and more like a coordinated workflow. Not because they reveal every internal detail, but because they present state transitions clearly.
OpenAI’s streaming APIs, Anthropic’s streamed message blocks, and the broader SSE/EventSource model all point in the same direction: the client must be stateful enough to assemble meaning from fragments.
The Bottom Line
Streaming LLM responses is easy to demo and easy to get subtly wrong. The core challenge is not receiving tokens; it is managing partial state across text, tool calls, and tool results.
If you are building a responsive agent UI, design for three things: incremental text, incremental tool-call assembly, and explicit UI states. Keep the stream parser separate from rendering, validate tool arguments before execution, and do not assume that more streaming always makes the experience better.
The goal is not to expose every token. The goal is to make the system’s behavior understandable while it is still happening.