Agent Observability: Logging What Machines Do on Your Behalf
When an agent acts autonomously, debugging it requires more than server logs: you need structured event trails, tool call traces, and decision snapshots that show why the machine did what it did.
Last month I was debugging an agent that could draft a refund email, but kept failing on the actual refund step because the payment API returned a stale authorization. The model looked fine. The UI looked fine. The only way to see the bug was to trace the tool call, the quote ID it used, the rejection from the payments service, and the retry it attempted next. That is the part we keep running into in the agentic web: the failure is usually not “the model was dumb,” it is “the machine did the wrong thing at the boundary.”
Twilio got this right years ago. Its API is boring in the best way: clear request/response shapes, predictable delivery semantics, and enough metadata to tell whether a message actually went out. Agents need the same kind of boring reliability. If an agent sent a text, booked a slot, retried a failed step, or handed off to a human, we need a machine-readable trail. A screenshot of the chat is not enough. Logging is part of the plumbing.
Structured event logs beat “the model said X”
A useful agent log is not a blob of text. It is a stream of typed events: plan_started, tool_called, tool_returned, retry_scheduled, handoff_requested. Each event should carry a timestamp, correlation ID, actor, and redacted payload. That gives us something we can search across services when a workflow spans Claude, our API, and a payment provider.
The important part is that the log should read like a timeline, not a transcript. If the agent tried create_order, then got a 409, then switched to reserve_inventory, we should be able to see that sequence without digging through prompt dumps. If your API isn’t machine-readable, agents can’t use it reliably — and your logs won’t help you much either.
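As a sketch of what that timeline could look like on the wire, here is a minimal JSON-lines event emitter. The `emit` helper and its field names are illustrative, not a standard schema; assume payloads are redacted before they reach this layer.

```python
import json
import time
import uuid

def emit(event_type, correlation_id, actor, payload):
    """Serialize one typed event as a JSON line. Hypothetical helper;
    field names (ts, type, correlation_id, actor, payload) are
    illustrative, not a standard schema."""
    record = {
        "ts": time.time(),
        "type": event_type,            # e.g. plan_started, tool_called
        "correlation_id": correlation_id,
        "actor": actor,
        "payload": payload,            # assume already redacted upstream
    }
    return json.dumps(record)

# One workflow, one correlation ID, a searchable timeline:
cid = str(uuid.uuid4())
lines = [
    emit("tool_called",   cid, "agent", {"tool": "create_order"}),
    emit("tool_returned", cid, "agent", {"tool": "create_order", "status": 409}),
    emit("tool_called",   cid, "agent", {"tool": "reserve_inventory"}),
]
```

Because every line shares a correlation ID, the create_order → 409 → reserve_inventory sequence falls out of a single grep, no prompt dumps required.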
Tool call traces are where most failures actually live
Anthropic’s tool use for Claude and OpenAI’s function calling both give us structured tool invocations, which is a big step up from free-form text. The real win is tracing the boundary between model intent and external action: search_inventory, create_order, send_email, capture_payment.
In practice, a lot of “agent bugs” are just bad arguments, stale schemas, or a tool timeout that the model tried to paper over. We’ve seen the model confidently retry the same broken call three times because the trace never made it clear the upstream service had already rejected the request. A good trace should show the tool name, arguments, latency, status, and the exact error returned. If an agent calls book_flight with an expired fare quote, the trace should show the original quote ID, the rejection, and the retry path.
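One way to get that trace is to wrap every tool function at the call boundary. This is a sketch under assumed names (`traced`, `book_flight`, the quote ID); a real version would also redact arguments and propagate the correlation ID from the event log.

```python
import time

def traced(tool_name, fn, trace_log):
    """Wrap a tool so every invocation records name, arguments,
    latency, status, and the exact error. Hypothetical helper;
    a production version would redact args before logging."""
    def wrapper(**kwargs):
        start = time.monotonic()
        entry = {"tool": tool_name, "args": kwargs}
        try:
            result = fn(**kwargs)
            entry.update(status="ok", result=result)
            return result
        except Exception as exc:
            entry.update(status="error", error=repr(exc))
            raise
        finally:
            # Latency and outcome are recorded whether the call
            # succeeded or raised.
            entry["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
            trace_log.append(entry)
    return wrapper

# Hypothetical tool that rejects an expired fare quote:
def book_flight(quote_id):
    raise ValueError(f"quote {quote_id} expired")

trace = []
safe_book = traced("book_flight", book_flight, trace)
try:
    safe_book(quote_id="Q-123")
except ValueError:
    pass  # the agent's retry logic would kick in here
```

With the quote ID and the exact rejection in the trace entry, the "retry the same broken call three times" pattern becomes visible instead of buried.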
Decision snapshots explain the “why,” not just the “what”
Traditional logs tell you what happened; decision snapshots tell you why the agent thought it should happen. We’ve found it useful to capture a compact state dump at key branches: goal, constraints, selected tool, confidence, and any policy checks. That is not full memory, and it is not the same as prompt logging. It is a checkpoint.
This matters when the agent has to choose between two valid-looking paths. For example: refund the customer, issue a replacement, or escalate to a human. The final answer alone does not tell us why it picked one branch. A decision snapshot should make that branch legible: what the agent knew, what it was optimizing for, and what rule or constraint pushed it one way. Nobody has solved this well yet, but without it, debugging becomes archaeology.
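The snapshot itself can be small. A sketch of the refund/replace/escalate branch above, with made-up field names and values; the point is the shape, not the schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DecisionSnapshot:
    """Compact checkpoint at a branch point. Field names are
    illustrative, not a standard schema."""
    goal: str
    constraints: list
    options: list
    selected: str
    confidence: float
    policy_checks: dict = field(default_factory=dict)

# Hypothetical branch: refund, replace, or escalate a complaint.
snap = DecisionSnapshot(
    goal="resolve damaged-item complaint",
    constraints=["refund <= $200 without human approval"],
    options=["refund", "replace", "escalate"],
    selected="replace",
    confidence=0.74,
    policy_checks={"refund_limit": "replacement cheaper than refund"},
)
record = asdict(snap)  # log this alongside the event stream
```

Logged next to the tool trace, the snapshot answers "why this branch" without replaying the whole prompt history.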
Agent telemetry is not APM with extra tokens
Datadog and New Relic are still useful, but they answer different questions. APM tells you whether the service is healthy: latency, error rates, CPU, queue depth. Agent telemetry asks whether the agent behaved correctly: did it choose the right tool, obey constraints, recover from failure, and stop when it should?
We need both views side by side. If the model is “working” but the payment rail is timing out, APM should show the infrastructure problem. If the infrastructure is healthy but the agent keeps choosing the wrong tool, telemetry should show that too. That is the difference between shipping a demo and shipping something that can actually move money, send messages, or complete a workflow without a human cleaning up after it.
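A rough triage sketch of holding the two views side by side. The thresholds and field names here are invented for illustration; the point is that the two signals answer different questions and disagree in useful ways.

```python
def classify_failure(apm, telemetry):
    """Combine an APM health view with an agent-behavior view.
    Thresholds and field names are made up for illustration."""
    infra_unhealthy = (
        apm["error_rate"] > 0.05 or apm["p99_latency_ms"] > 2000
    )
    agent_misbehaving = (
        telemetry["wrong_tool_rate"] > 0.02
        or telemetry["constraint_violations"] > 0
    )
    if infra_unhealthy and agent_misbehaving:
        return "both"
    if infra_unhealthy:
        return "infrastructure"   # e.g. the payment rail timing out
    if agent_misbehaving:
        return "agent_behavior"   # healthy service, wrong choices
    return "healthy"

# Healthy infrastructure, but the agent keeps picking the wrong tool:
verdict = classify_failure(
    apm={"error_rate": 0.01, "p99_latency_ms": 300},
    telemetry={"wrong_tool_rate": 0.08, "constraint_violations": 0},
)
```

Here APM alone would report green, which is exactly the failure mode that turns into a human cleaning up after the agent.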