Observability

Structured logs, health endpoints, and OpenTelemetry configuration.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

  • Explain stdout JSON logs, health endpoints, and tracing as operational visibility investments.

Business analysts

  • Tie observability signals to SLIs in runbooks and incident templates.

Solution architects

  • Design OTLP export, sampling, and collector placement for production.

Developers

  • Configure CADENCE_LOG_FORMAT, OTel metadata keys, and related admin APIs.

Testers

  • Validate health checks, log fields, and trace spans in staging environments.

Cadence exposes three observability layers: structured JSON logs (stdout), HTTP health and pool endpoints for probes, and OpenTelemetry settings adjustable via admin APIs without restarting the process.

  • Operational spend — Logs and traces have marginal cost; aggressive tracing and high-cardinality LLM spans can raise collector and storage bills.
  • Progressive rollout — JSON logs and /health first; add OTLP with sampling before full production load.
  • SLIs — Pair log-based error rate with trace latency for chat paths when defining SLOs.
  • Runbooks — Reference otel.* keys and CADENCE_LOG_FORMAT in incident checklists.
Observability layers — structured logs on stdout, health endpoints for probes, and optional OpenTelemetry export via OTLP:

  • Structured logs (stdout) — CADENCE_LOG_FORMAT=json
  • Health & pool endpoints — /health, admin pool stats (see Monitoring guide)
  • OpenTelemetry (optional) — otel.* keys via admin APIs → OTLP

Implementation references: cadence.api.telemetry metadata keys, cadence.api.health.router, logging setup in app entry.

Layer | How to enable | What it provides
Structured logs | CADENCE_LOG_FORMAT=json | Searchable JSON lines on stdout; ship to ELK, Loki, or CloudWatch
Health endpoints | Always available | Liveness (GET /health) and deeper admin health
Distributed traces | otel.enabled=true via admin API | Spans across internal steps; export via OTLP to Langfuse or similar
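As a quick sanity check of the first two rows, the liveness endpoint can be polled from any HTTP client. A minimal sketch using only the standard library; the base URL and port are assumptions for a local deployment, and "200 means alive" is the usual liveness-probe convention:

```python
from urllib.request import urlopen

def is_alive(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers 200 within the timeout."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, DNS failure, or timeout all count as "not alive".
        return False
```

The same function works as a Kubernetes exec probe or a smoke test in CI; only the base URL changes per environment.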

Start with JSON logs + /health in staging, then enable OTLP with a low sample rate before raising to production traffic. Jumping straight to always_on tracing on busy clusters can overwhelm your collector.

Set CADENCE_LOG_FORMAT=json to emit structured JSON lines on stdout. Each line includes standard fields (timestamp, level, logger, message) plus request-specific context where instrumented.
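To make the field set concrete, here is a hedged sketch of a JSON formatter producing lines in the shape described above. The exact keys Cadence emits may differ; the four shown (timestamp, level, logger, message) are the ones the docs name:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, similar to CADENCE_LOG_FORMAT=json."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Stack traces become a single escaped string, keeping one line per event.
            entry["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

Because each event is a single JSON line, log shippers can forward stdout verbatim and search on any field without extra parsing rules.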

Stream errors are always logged regardless of format:

cadence/api/chat.py
except Exception as e:
    logger.error("Stream error: %s", e, exc_info=True)
    yield 'data: {"event":"error","data":{"error":"An internal error occurred"}}\n\n'

OTel settings are stored as key/value metadata and adjustable through admin APIs without restarting the server.

cadence/api/telemetry.py
_KEY_META: dict[str, tuple[str, str]] = {
    "otel.enabled": ("boolean", "Enable OpenTelemetry instrumentation"),
    "otel.service_name": ("string", "OTel service name reported to the collector"),
    "otel.exporter": (
        "string",
        "Exporter backend: console | otlp_grpc | otlp_http | none",
    ),
    "otel.endpoint": ("string", "OTLP collector endpoint URL"),
    "otel.traces_enabled": ("boolean", "Enable distributed tracing signal"),
    "otel.trace_sampler": (
        "string",
        "Trace sampler: always_on | always_off | traceid_ratio",
    ),
    "otel.instrument_langchain": (
        "boolean",
        "Auto-instrument LangChain / LangGraph pipelines",
    ),
    "otel.instrument_openai_agents": ("boolean", "Auto-instrument OpenAI Agents SDK"),
}
To triage a slow or failed chat request:

  1. Note the approximate time, org, orchestrator, and message id.
  2. Search JSON logs for errors or stream warnings around that window.
  3. If traces are enabled, open your collector and find the trace id — inspect model latency vs tool latency vs queue wait.
  4. Check pool stats (GET /api/admin/pool/stats) — long queue times often indicate instances were demoted to cold.
  5. If traces are missing entirely, verify otel.enabled and exporter settings via admin APIs.
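Step 4 of the checklist can be scripted against the pool stats response. The field names below (avg_queue_ms, cold_instances) are assumptions for illustration — check the actual GET /api/admin/pool/stats payload in your deployment:

```python
def queue_looks_cold(stats: dict, queue_ms_threshold: float = 500.0) -> bool:
    """Heuristic from the checklist: long queue waits combined with demoted
    (cold) instances usually mean the pool shrank under load."""
    return (
        stats.get("avg_queue_ms", 0.0) > queue_ms_threshold
        and stats.get("cold_instances", 0) > 0
    )
```

Wiring this into an alert rule turns the manual "check pool stats" step into a standing signal.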
Symptom | Cause | Fix
No traces | otel.enabled is false, or exporter/network misconfigured | Set otel.enabled=true; check otel.endpoint and network egress
High-cardinality costs | SDK instrumentation flags enabled in production | Disable otel.instrument_langchain / otel.instrument_openai_agents
Logs not JSON | CADENCE_LOG_FORMAT not set to json | Set CADENCE_LOG_FORMAT=json and restart