Observability

Structured logs, health endpoints, and OpenTelemetry configuration.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

  • Explain stdout JSON logs, health endpoints, and tracing as operational visibility investments.

Business analysts

  • Tie observability signals to SLIs in runbooks and incident templates.

Solution architects

  • Design OTLP export, sampling, and collector placement for production.

Developers

  • Configure CADENCE_LOG_FORMAT, OTel metadata keys, and related admin APIs.

Testers

  • Validate health checks, log fields, and trace spans in staging environments.

Cadence exposes three observability layers: structured JSON logs (stdout), HTTP health and pool endpoints for probes, and OpenTelemetry settings adjustable via admin APIs without restarting the process.

  • Operational spend — Logs and traces have marginal cost; aggressive tracing and high-cardinality LLM spans can raise collector and storage bills.
  • Progressive rollout — JSON logs and /health first; add OTLP with sampling before full production load.
  • SLIs — Pair log-based error rate with trace latency for chat paths when defining SLOs.
  • Runbooks — Reference otel.* keys and CADENCE_LOG_FORMAT in incident checklists.
Observability layers — structured logs on stdout, health endpoints for probes, and optional OpenTelemetry export via OTLP:

  • Structured logs (stdout) — CADENCE_LOG_FORMAT=json
  • Health & pool endpoints — /health, admin pool stats (see Monitoring guide)
  • OpenTelemetry (optional) — otel.* keys via admin APIs → OTLP

Implementation references: cadence.api.telemetry metadata keys, cadence.api.health.router, logging setup in app entry.

Layer | How to enable | What it provides
Structured logs | CADENCE_LOG_FORMAT=json | Searchable JSON lines on stdout; ship to ELK, Loki, or CloudWatch
Health endpoints | Always available | Liveness (GET /health) and deeper admin health
Distributed traces | otel.enabled=true via admin API | Spans across internal steps; export via OTLP to Langfuse or similar
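As a quick sanity check of the first two rows, the liveness endpoint can be polled from any HTTP client. A minimal sketch using only the standard library; the base URL and port are assumptions for a local deployment, and "200 means alive" is the usual liveness-probe convention:

```python
from urllib.request import urlopen

def is_alive(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers 200 within the timeout."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, or timeout all count as "not alive".
        return False
```

The same function works as a Kubernetes exec probe or a smoke test in CI; only the base URL changes per environment.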

Start with JSON logs + /health in staging, then enable OTLP with a low sample rate before raising to production traffic. Jumping straight to always_on tracing on busy clusters can overwhelm your collector.

Set CADENCE_LOG_FORMAT=json to emit structured JSON lines on stdout. Each line includes standard fields (timestamp, level, logger, message) plus request-specific context where instrumented.
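To make the field set concrete, here is a hedged sketch of a JSON formatter producing lines in the shape described above. The exact keys Cadence emits may differ; the four shown (timestamp, level, logger, message) are the ones the docs name:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, similar to CADENCE_LOG_FORMAT=json."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Stack traces become a single escaped string, keeping one line per event.
            entry["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

Because each event is a single JSON line, log shippers can forward stdout verbatim and search on any field without extra parsing rules.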

Stream errors are always logged regardless of format:

cadence/api/chat.py
except Exception as e:
    logger.error("Stream error: %s", e, exc_info=True)
    yield 'data: {"event":"error","data":{"error":"An internal error occurred"}}\n\n'

OTel settings are stored as key/value metadata and adjustable through admin APIs without restarting the server.

cadence/api/telemetry.py
_KEY_META: dict[str, tuple[str, str]] = {
    "otel.enabled": ("boolean", "Enable OpenTelemetry instrumentation"),
    "otel.service_name": ("string", "OTel service name reported to the collector"),
    "otel.exporter": (
        "string",
        "Exporter backend: console | otlp_grpc | otlp_http | none",
    ),
    "otel.endpoint": ("string", "OTLP collector endpoint URL"),
    "otel.traces_enabled": ("boolean", "Enable distributed tracing signal"),
    "otel.trace_sampler": (
        "string",
        "Trace sampler: always_on | always_off | traceid_ratio",
    ),
    "otel.instrument_langchain": (
        "boolean",
        "Auto-instrument LangChain / LangGraph pipelines",
    ),
    "otel.instrument_openai_agents": ("boolean", "Auto-instrument OpenAI Agents SDK"),
}
To triage a slow or failed chat request:

  1. Note the approximate time, org, orchestrator, and message id.
  2. Search JSON logs for errors or stream warnings around that window.
  3. If traces are enabled, open your collector and find the trace id — inspect model latency vs tool latency vs queue wait.
  4. Check pool stats (GET /api/admin/pool/stats) — long queue times often indicate instances were demoted to cold.
  5. If traces are missing entirely, verify otel.enabled and exporter settings via admin APIs.
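Step 4 of the checklist can be scripted against the pool stats response. The field names below (avg_queue_ms, cold_instances) are assumptions for illustration — check the actual GET /api/admin/pool/stats payload in your deployment:

```python
def queue_looks_cold(stats: dict, queue_ms_threshold: float = 500.0) -> bool:
    """Heuristic from the checklist: long queue waits combined with demoted
    (cold) instances usually mean the pool shrank under load."""
    return (
        stats.get("avg_queue_ms", 0.0) > queue_ms_threshold
        and stats.get("cold_instances", 0) > 0
    )
```

Wiring this into an alert rule turns the manual "check pool stats" step into a standing signal.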
Symptom | Cause | Fix
No traces | otel.enabled is false, or exporter/network misconfigured | Set otel.enabled=true; check otel.endpoint and network egress
High-cardinality costs | SDK instrumentation flags enabled in production | Disable otel.instrument_langchain / otel.instrument_openai_agents
Logs not JSON | CADENCE_LOG_FORMAT not set to json | Set CADENCE_LOG_FORMAT=json and restart