
Agent Observability Logging: What to Log When Every Tool Call Matters

Agents don’t fail like normal apps. They fail mid-thought: a tool returns partial data, a retry changes timing, a prompt tweak shifts decisions, or a safety guardrail blocks an action. And because every tool call can trigger cost, latency, and real-world side effects, “basic logs” won’t answer the question your team asks at 2 a.m.: what happened inside the agent?
This guide gives you a practical, vendor-neutral logging model you can implement today: what to log for each run, each LLM call, and each tool call—plus how to keep logs useful without leaking secrets or drowning in data. We’ll also map it to modern tracing standards.

What and why

Agent observability logging is the practice of capturing enough structured telemetry to:

  • Reconstruct an agent run end-to-end (inputs → decisions → tool calls → outputs)
  • Explain failures and odd behavior (hallucinations, wrong tool choice, loops)
  • Control cost and latency (token usage, slow tools, retries)
  • Prove safety and compliance (who/what triggered an action; what was redacted)

Traditional app logging assumes code paths are deterministic. Agents aren’t. They’re probabilistic decision systems that route across models, tools, retrieval, and policy checks.

The trade-offs you must manage

  • Fidelity vs. privacy: raw prompts are high signal but high risk.
  • Fidelity vs. cost: storing everything can explode log volume quickly.
  • Debuggability vs. lock-in: vendor tools accelerate adoption; open schemas reduce future migrations.

A solid starting point is aligning your telemetry with OpenTelemetry’s GenAI semantic conventions (spans, metrics, and opt-in events for prompt/response details).

How it works: a mental model for agent logging

Think in three layers:

  1. Run layer (Trace): one agent “job” from start to finish
  2. Step layer (Spans): each meaningful unit of work (retrieval, LLM call, tool call, policy check)
  3. Detail layer (Events/Artifacts): optional payloads (prompt templates, tool arguments, tool responses, model output, eval scores)

OpenTelemetry formalizes this with GenAI spans (for inference), plus opt-in GenAI events for detailed prompt/completion capture when you choose to store it.
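To make the three layers concrete, here is a dependency-free sketch of run → span → event. The class and field names are illustrative, not part of the OpenTelemetry SDK; a real implementation would use an OTel tracer instead.

```python
import time
import uuid
from dataclasses import dataclass, field

# Layer 1: the run (trace). Layer 2: steps (spans). Layer 3: opt-in
# detail events attached to a span. All names here are illustrative.

@dataclass
class Span:
    name: str                    # e.g. "llm_call", "tool_call", "retrieval"
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: dict = field(default_factory=dict)  # cheap, always-on metadata
    events: list = field(default_factory=list)      # opt-in detail payloads

    def add_event(self, name, payload):
        self.events.append({"name": name, "ts": time.time(), "payload": payload})

@dataclass
class AgentRun:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def start_span(self, name, **attributes):
        span = Span(name=name, trace_id=self.trace_id, attributes=attributes)
        self.spans.append(span)
        return span

run = AgentRun()
llm = run.start_span("llm_call", model="gpt-4o", prompt_version="v3")
llm.add_event("gen_ai.prompt", {"template_id": "support_v3"})  # opt-in detail
tool = run.start_span("tool_call", tool_name="crm.lookup", status="ok")
```

The key property: every span shares the run's trace_id, so any tool call can be tied back to the exact run that produced it.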

Minimal architecture (what you’re building)

  • Instrumentation in your agent runtime (SDK or proxy)
  • Collector/exporter (OTLP/OpenTelemetry Collector or vendor ingestion)
  • Backend (trace store + logs + metrics)
  • UI workflows: “find the run” → “see the waterfall” → “inspect the failing tool call” → “compare runs over time”

If you’re standardizing, OpenTelemetry GenAI conventions give you a shared vocabulary across backends and vendors.

Tools and stack options

There are two broad approaches:

A) Standards-first (OpenTelemetry + your backend)

  • Use OpenTelemetry GenAI conventions for spans/metrics/events
  • Add auto-instrumentation via OpenLLMetry (built on OpenTelemetry)
  • Send telemetry to your existing observability stack (Grafana/Jaeger/Tempo, Datadog, Honeycomb, Elastic, etc.)

B) LLM-native observability platforms (faster UX for prompts, evals)

  • Langfuse (LLM tracing, cost/latency, sessions, evals; open source/self-hostable)
  • LangSmith (LLM + tool tracing, token/cost tracking)
  • Phoenix (open-source LLM tracing + eval; OTLP-compatible)
  • Helicone (gateway/proxy-based logging + prompt management options)

Best practices and pitfalls

The “Minimum Viable Log Spec” (MVLS)

Tier 1 (ship this first week):

  • trace_id, session_id, user/request_id, environment (prod/stage)
  • for every LLM call: model name, latency, status, token usage, provider request id
  • for every tool call: tool name, latency, status, retries, result size (not full payload)
  • errors: exception type + stack (if code), tool error code, LLM refusal flags
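As one possible shape, a Tier 1 record for an LLM call could be emitted as a single JSON line. The builder function and field names below follow the list above but are hypothetical, not a standard schema.

```python
import json
import time

# Illustrative Tier 1 record builder; note there is no prompt text here,
# only identifiers, timings, status, and token usage.

def tier1_llm_record(trace_id, session_id, request_id, env, *,
                     model, latency_ms, status, input_tokens,
                     output_tokens, provider_request_id):
    return {
        "ts": time.time(),
        "trace_id": trace_id,
        "session_id": session_id,
        "request_id": request_id,
        "environment": env,            # "prod" / "stage"
        "kind": "llm_call",
        "model": model,
        "latency_ms": latency_ms,
        "status": status,
        "usage": {"input_tokens": input_tokens,
                  "output_tokens": output_tokens},
        "provider_request_id": provider_request_id,
    }

record = tier1_llm_record("tr-1", "sess-9", "req-42", "prod",
                          model="gpt-4o", latency_ms=812, status="ok",
                          input_tokens=1450, output_tokens=220,
                          provider_request_id="prov-abc")
line = json.dumps(record)  # ship as one JSON line per event
```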

Tier 2 (ship next):

  • prompt template id + version (not raw prompt by default)
  • tool argument schema hash + validated fields (redacted)
  • retrieval metadata: index/source ids, top_k, similarity ranges
  • decision points: selected tool vs. alternatives (top-2)

Tier 3 (only when governed):

  • opt-in prompt/response capture as events with redaction
  • evaluation scores (LLM-as-judge + human labels)
  • full tool inputs/outputs for specific tools (allowlist)

This tiered model matches how many modern platforms treat prompt details as optional while still tracking token/cost/latency and structured traces.

What to log for tool calls (where “every call matters”)

For each tool invocation, capture:

  • Identity: tool_name, tool_version, tool_provider, tool_endpoint
  • Intent: tool_purpose tag (search, CRM, payments, device-control)
  • Inputs: validated fields (redacted), input_size_bytes, schema_version
  • Execution: start/end time, timeout, retry_count, circuit_breaker_state
  • Outputs: status (ok/error/partial), output_size_bytes, record_count, confidence (if available)
  • Side effects: did_write=true/false, resource_id(s) affected, idempotency_key
  • Safety: policy_check_id, allow/deny, reason_code

Why the “side effects” fields matter: they’re how you audit real-world actions—especially if the agent can create tickets, send emails, update records, or trigger workflows.
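A sketch of this field list as a typed record, with a hypothetical audit flag for writes that lack an idempotency key. The schema is ours, not from any library.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical tool-call log schema mirroring the field list above.

@dataclass
class ToolCallLog:
    # identity / intent
    tool_name: str
    tool_version: str
    tool_purpose: str            # "search", "crm", "payments", ...
    # inputs (sizes and schema versions, never raw payloads by default)
    input_size_bytes: int
    schema_version: str
    # execution
    latency_ms: float
    retry_count: int
    timed_out: bool
    # outputs
    status: str                  # "ok" | "error" | "partial"
    output_size_bytes: int
    record_count: Optional[int]
    # side effects + safety
    did_write: bool
    idempotency_key: Optional[str]
    policy_check_id: Optional[str]
    policy_decision: str         # "allow" | "deny"

def log_tool_call(call: ToolCallLog) -> dict:
    entry = asdict(call)
    # a write without an idempotency key is an audit red flag
    entry["audit_warning"] = call.did_write and not call.idempotency_key
    return entry

entry = log_tool_call(ToolCallLog(
    tool_name="crm.update_record", tool_version="2.1", tool_purpose="crm",
    input_size_bytes=412, schema_version="v5", latency_ms=96.0,
    retry_count=0, timed_out=False, status="ok", output_size_bytes=64,
    record_count=1, did_write=True, idempotency_key="idem-7f3",
    policy_check_id="pc-22", policy_decision="allow"))
```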

Pitfalls that kill observability (even with “lots of logs”)

  • No correlation IDs: you can’t tie tool calls to a specific run.
  • Logging raw prompts everywhere: fast now, painful later (privacy, compliance).
  • High-cardinality chaos: every field becomes a label; your metrics bill spikes.
  • Missing retries and timeouts: you see failures, but not the reliability story.
  • No prompt versioning: you can’t tell whether behavior changed due to code or prompt.

Performance, cost, and security considerations

Performance: don’t make logging your bottleneck

  • Async export: buffer spans/events; ship out-of-band.
  • Sampling: keep 100% of errors and a small % of successes; raise sampling for new releases.
  • Size caps: store full payloads only for allowlisted tools or short retention.
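The sampling rule above can be sketched as a single predicate. The rates here are illustrative starting points, not recommendations for your traffic.

```python
import random

# Error-biased sampling: keep every failed run, a small share of
# successes, and a higher share for freshly released versions.

def should_keep(status: str, release_age_days: int,
                base_rate: float = 0.02, new_release_rate: float = 0.25,
                rng=random.random) -> bool:
    if status != "ok":
        return True                      # 100% of errors
    if release_age_days < 7:
        return rng() < new_release_rate  # raised sampling for new releases
    return rng() < base_rate             # small % of routine successes
```

Injecting `rng` makes the policy unit-testable; in production you would leave the default.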

Cost: measure what’s expensive in your system

Agents incur cost in two places:

  1. Model usage (tokens/requests)
  2. Tooling (API calls, DB queries, vendor endpoints)

OTel GenAI metrics and LLM platforms focus explicitly on token usage and cost visibility because runaway spend is a primary production risk.
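Once token usage is on every LLM span, per-trace cost is a simple aggregation. The price table below is invented for illustration; plug in your provider's actual rates.

```python
# Aggregate model cost per trace from per-span token usage.
# PRICE_PER_1K maps model -> (input, output) USD per 1k tokens; the
# numbers are illustrative, not real pricing.

PRICE_PER_1K = {"modelA": (0.005, 0.015)}

def trace_cost(spans):
    total = 0.0
    for s in spans:
        pin, pout = PRICE_PER_1K[s["model"]]
        total += (s["input_tokens"] / 1000 * pin
                  + s["output_tokens"] / 1000 * pout)
    return round(total, 6)

spans = [
    {"model": "modelA", "input_tokens": 2000, "output_tokens": 500},
    {"model": "modelA", "input_tokens": 1000, "output_tokens": 250},
]
cost = trace_cost(spans)
```

Aggregating the same records by session_id or prompt_version is what makes "cost spike after a prompt change" a one-query investigation.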

Security: treat logs as sensitive data

Two realities:

  • Logs often contain secrets, PII, credentials, and proprietary context.
  • Agents are exposed to prompt injection and unsafe output patterns—so you need monitoring and audit trails to detect and investigate.

Practical controls:

  • Redaction pipeline: scrub API keys, emails, phone numbers, auth headers.
  • Separate storage tiers: “metrics-only” vs. “debug payloads” with shorter retention.
  • Access control: least privilege; production payload access should be restricted.
  • Prompt capture is opt-in: align with GenAI event semantics (store details only when you choose).
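A minimal redaction pass might look like the sketch below, run before any payload is persisted. The regex patterns are starting points, not a complete PII scrubber.

```python
import re

# Scrub obvious secrets and contact details from payloads before storage.

PATTERNS = [
    # api keys / auth headers: "api_key: sk-..." / "Authorization: Bearer ..."
    (re.compile(r"(?i)(api[_-]?key|authorization|bearer)\s*[:=]\s*\S+"),
     r"\1=[REDACTED]"),
    # email addresses
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    # phone-number-like digit runs
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def redact(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

safe = redact("api_key: sk-123 contact alice@example.com at +1 555-123-4567")
```

In practice this runs inside the export pipeline (an OTel processor or log shipper), so unredacted payloads never reach the backend at all.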

Real-world use cases (mini case study)

Use case 1: “Why did the agent choose the wrong tool?”

Symptoms:

  • Tool A is correct, tool B is chosen
  • Output sounds confident but is wrong

Observability pattern:

  • Trace shows: LLM span → tool selection decision → tool B span (success) → wrong final answer
  • Fix requires either: better tool descriptions, stronger tool-choice constraints, or tool-result validation

What you need logged:

  • tool candidates (top-2), chosen tool, decision rationale summary (short), tool result size/shape, validation pass/fail
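One way to shape that decision-point record is sketched below; the function, tool names, and scores are made up for illustration.

```python
# Capture the top-2 tool candidates alongside the model's actual choice,
# so "wrong tool" incidents are diagnosable after the fact.

def tool_decision_record(candidates, chosen, rationale, validation_passed):
    top2 = sorted(candidates, key=lambda c: c["score"], reverse=True)[:2]
    return {
        "kind": "tool_decision",
        "candidates": top2,                              # top-2 only, not all
        "chosen_tool": chosen,
        "chose_top_candidate": chosen == top2[0]["name"],
        "rationale_summary": rationale[:200],            # keep it short
        "validation_passed": validation_passed,
    }

rec = tool_decision_record(
    [{"name": "tool_a", "score": 0.81},
     {"name": "tool_b", "score": 0.78},
     {"name": "tool_c", "score": 0.10}],
    chosen="tool_b",
    rationale="Model preferred B's description.",
    validation_passed=False)
```

When `chose_top_candidate` is false and validation fails, you have direct evidence pointing at tool descriptions or tool-choice constraints rather than the tool itself.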

Use case 2: “Cost spike after a prompt change”

Symptoms:

  • Same traffic, higher spend
  • Latency up, tool calls up

Observability pattern:

  • Compare traces by prompt_version
  • Token usage per LLM span increases; tool_call_count per run increases

What you need logged:

  • prompt template id/version, token usage, tool call count, retry count, cache hit rate

Use case 3 (agentic RAG): “It retrieved the right docs but still hallucinated”

Observability pattern:

  • Retrieval span shows sources and top_k
  • LLM span shows the model ignored citations

What you need logged:

  • retrieval source ids, similarity metadata, citation requirement flag, output validator result

FAQs

1) What is agent observability logging?
Structured telemetry that reconstructs each agent run (LLM calls + tool calls + decisions) so you can debug, control cost/latency, and audit actions.

2) What should I log for every tool call?
Tool identity, redacted inputs, latency, retries/timeouts, output size/shape, status (ok/error/partial), and whether it caused side effects.

3) Do I need to store prompts and responses?
Not by default. Start with prompt template ids/versions and store raw content only as opt-in events with redaction and short retention.

4) How do I track cost reliably?
Log token usage and provider/model metadata for each LLM span, and aggregate cost per trace/session. Many LLM observability tools explicitly support token/cost tracking.

5) How long should I retain agent logs?
Keep metrics and redacted trace metadata longer; keep payload logs shorter. Retention is a governance decision tied to risk and compliance.

6) What’s the best stack if I want to avoid vendor lock-in?
OpenTelemetry GenAI conventions + OTLP export, optionally with OpenLLMetry for instrumentation.

7) How do I detect prompt injection attempts?
Log input risk signals (suspicious patterns, policy violations), tool-call denials, and “instruction hierarchy” conflicts. OWASP’s LLM guidance highlights prompt injection as a core risk area.

8) Will observability slow my agent down?
It can if you synchronously write large payloads. Use async export, sampling, and size caps—especially for prompts/tool outputs.

If you can’t replay an agent run end-to-end, you’re not debugging—you’re guessing.

Conclusion

Agent observability isn’t “more logs.” It’s a repeatable way to reconstruct runs, explain decisions, and control cost and risk when agents call tools that touch real systems. Start with the minimum viable spec: trace IDs everywhere, structured spans for LLM + tool calls, and clear fields for retries, timeouts, and side effects. Then add prompt/response capture only as opt-in, with redaction and short retention. When you treat agents like distributed systems—with traces, metrics, and governed events—tool-call failures stop being mysterious, cost spikes become measurable, and safety becomes auditable.
