Observability
Structured logs, health endpoints, and OpenTelemetry configuration.
Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers
Learning outcomes by role
Stakeholders
- Explain stdout JSON logs, health endpoints, and tracing as operational visibility investments.
Business analysts
- Tie observability signals to SLIs in runbooks and incident templates.
Solution architects
- Design OTLP export, sampling, and collector placement for production.
Developers
- Configure CADENCE_LOG_FORMAT, OTel metadata keys, and related admin APIs.
Testers
- Validate health checks, log fields, and trace spans in staging environments.
Cadence exposes three observability layers: structured JSON logs (stdout), HTTP health and pool endpoints for probes, and OpenTelemetry settings adjustable via admin APIs without restarting the process.
Summary for stakeholders
- Operational spend — Logs and traces have marginal cost; aggressive tracing and high-cardinality LLM spans can raise collector and storage bills.
- Progressive rollout — Start with JSON logs and `/health`; add OTLP with sampling before full production load.
Business analysis
- SLIs — Pair log-based error rate with trace latency for chat paths when defining SLOs.
- Runbooks — Reference `otel.*` keys and `CADENCE_LOG_FORMAT` in incident checklists.
Architecture and integration
Implementation references: `cadence.api.telemetry` metadata keys, `cadence.api.health.router`, logging setup in app entry.
Observability layers
| Layer | How to enable | What it provides |
|---|---|---|
| Structured logs | `CADENCE_LOG_FORMAT=json` | Searchable JSON lines on stdout; send to ELK, Loki, CloudWatch |
| Health endpoints | Always available | Liveness (`GET /health`) and deeper admin health |
| Distributed traces | `otel.enabled=true` via admin API | Spans across internal steps; export via OTLP to Langfuse or similar |
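As a quick smoke test of the liveness endpoint, a minimal probe script might look like the sketch below. The base URL, port, and timeout are assumptions; only `GET /health` itself comes from the table above.

```python
# Sketch: external liveness check against GET /health.
# The host, port, and 2-second timeout are assumptions.
import urllib.request


def interpret(status: int) -> str:
    """Map an HTTP status from GET /health to a probe verdict."""
    return "live" if status == 200 else "down"


if __name__ == "__main__":
    try:
        with urllib.request.urlopen("http://localhost:8000/health", timeout=2) as resp:
            print(interpret(resp.status))
    except OSError:
        # Connection refused, DNS failure, timeout, and HTTP errors all mean "down".
        print("down")
```

A Kubernetes liveness probe or load-balancer health check would poll the same endpoint with the same pass/fail rule.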
Start with JSON logs + `/health` in staging, then enable OTLP with a low sample rate before raising to production traffic. Jumping straight to `always_on` tracing on busy clusters can overwhelm your collector.
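A staged rollout might start from metadata values like these. This is a sketch: the collector endpoint and exporter choice are assumptions, and how the values are written through the admin API is deployment-specific; only the key names and allowed values come from the `otel.*` metadata keys documented on this page.

```python
# Sketch: otel.* values for a cautious staging rollout.
# The endpoint address is an assumption, not a Cadence default.
STAGING_OTEL = {
    "otel.enabled": "true",
    "otel.exporter": "otlp_grpc",
    "otel.endpoint": "http://otel-collector:4317",  # assumed collector address
    "otel.traces_enabled": "true",
    # Ratio sampling, not always_on, so the collector sees a fraction of traffic.
    "otel.trace_sampler": "traceid_ratio",
}
```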
Structured logging
Set `CADENCE_LOG_FORMAT=json` to emit structured JSON lines on stdout. Each line includes standard fields (timestamp, level, logger, message) plus request-specific context where instrumented.
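Because each line is JSON with a `level` field, the log-based error rate used as an SLI can be computed with a short sketch; the input source and the exact level names counted as errors are assumptions.

```python
# Sketch: derive a log-based error rate (an SLI input) from JSON log lines.
# Field names follow the standard fields above; treating ERROR and CRITICAL
# as failures is an assumption.
import json
from typing import Iterable


def error_rate(lines: Iterable[str]) -> float:
    total = errors = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. plain-text startup noise)
        total += 1
        if record.get("level", "").upper() in {"ERROR", "CRITICAL"}:
            errors += 1
    return errors / total if total else 0.0
```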
Stream errors are always logged regardless of format:

```python
except Exception as e:
    logger.error("Stream error: %s", e, exc_info=True)
    yield 'data: {"event":"error","data":{"error":"An internal error occurred"}}\n\n'
```

OpenTelemetry configuration
OTel settings are stored as key/value metadata and adjustable through admin APIs without restarting the server.
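Each metadata key declares a type alongside its description. A hypothetical validation helper (not part of Cadence) might gate admin writes like this, shown with a subset of the keys:

```python
# Hypothetical helper: check an admin-supplied value against the type
# declared in the key metadata before storing it. The two keys shown
# mirror the documented listing; the helper itself is an assumption.
_KEY_META = {
    "otel.enabled": ("boolean", "Enable OpenTelemetry instrumentation"),
    "otel.exporter": ("string", "Exporter backend: console | otlp_grpc | otlp_http | none"),
}


def validate(key: str, raw: str) -> str:
    kind, _desc = _KEY_META[key]  # raises KeyError for unknown keys
    if kind == "boolean" and raw not in ("true", "false"):
        raise ValueError(f"{key} expects 'true' or 'false', got {raw!r}")
    return raw
```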
```python
_KEY_META: dict[str, tuple[str, str]] = {
    "otel.enabled": ("boolean", "Enable OpenTelemetry instrumentation"),
    "otel.service_name": ("string", "OTel service name reported to the collector"),
    "otel.exporter": (
        "string",
        "Exporter backend: console | otlp_grpc | otlp_http | none",
    ),
    "otel.endpoint": ("string", "OTLP collector endpoint URL"),
    "otel.traces_enabled": ("boolean", "Enable distributed tracing signal"),
    "otel.trace_sampler": (
        "string",
        "Trace sampler: always_on | always_off | traceid_ratio",
    ),
    "otel.instrument_langchain": (
        "boolean",
        "Auto-instrument LangChain / LangGraph pipelines",
    ),
    "otel.instrument_openai_agents": ("boolean", "Auto-instrument OpenAI Agents SDK"),
}
```

Debugging a slow chat turn
- Note approximate time, org, orchestrator, and message id.
- Search JSON logs for errors or stream warnings around that window.
- If traces are enabled, open your collector and find the trace id — inspect model latency vs tool latency vs queue wait.
- Check pool stats (`GET /api/admin/pool/stats`) — long queue times often indicate instances were demoted to cold.
- If traces are missing entirely, verify `otel.enabled` and exporter settings via admin APIs.
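The pool-stats check in the steps above can be automated with a small helper. This is a sketch: the field names (`instances`, `state`, `avg_queue_ms`) and the 500 ms threshold are assumptions about the response shape, not the documented format.

```python
# Sketch: flag cold instances with long queue waits in the JSON returned
# by GET /api/admin/pool/stats. All field names here are assumptions.
def cold_suspects(stats: dict, queue_ms_threshold: float = 500.0) -> list[str]:
    return [
        inst["id"]
        for inst in stats.get("instances", [])
        if inst.get("state") == "cold"
        and inst.get("avg_queue_ms", 0) > queue_ms_threshold
    ]
```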
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No traces | `otel.enabled` is false, or exporter/network misconfigured | Set `otel.enabled=true`; check `otel.endpoint` and network egress |
| High cardinality costs | SDK instrumentation flags enabled in production | Disable `otel.instrument_langchain` / `otel.instrument_openai_agents` |
| Logs not JSON | `CADENCE_LOG_FORMAT` not set to `json` | Set `CADENCE_LOG_FORMAT=json` and restart |