Monitoring

Health endpoints, pool statistics, log shipping, and operational troubleshooting.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

Tie uptime monitoring to customer-facing SLAs and incident response.

Business analysts

Define alert thresholds and escalation paths referencing health endpoints.

Solution architects

Integrate probes with Kubernetes or load balancers using documented paths.

Developers

Interpret admin health and pool statistics fields for debugging.

Testers

Validate monitoring scripts against staging clusters.

Monitor Cadence using the liveness and admin health routes, pool statistics for orchestrator tiers, and stdout logs from the API process. Each admin probe requires its own cadence:system:* permission, listed under Health endpoints.

Summary for stakeholders

SLA visibility — Unauthenticated GET /health supports cheap uptime checks; deeper /api/admin/* routes justify stricter access control for diagnostics.
Incident triage — Pool stats and logs together explain whether failures are provider-side, pool saturation, or auth/session issues.

Business analysis

Runbooks — Document which probe uses which URL and which service account holds the minimum cadence:system:* permission.
Escalation — Align alert names with symptoms in the troubleshooting table below.

Architecture and integration

Health router: cadence.api.health.router. Admin paths require the permissions listed in the Health endpoints table on this page.

Prerequisites

Network access to the API from your load balancer or probes
Credentials for a user with the required system permissions (see Role-based access control)

Health endpoints

| Endpoint | Auth required | Purpose |
| --- | --- | --- |
| GET /health | None | Basic liveness; use for load balancer probes (mounted at root, not under /api/) |
| GET /api/admin/health | cadence:system:health:read | Deeper orchestrator and dependency health |
| GET /api/admin/pool/stats | cadence:system:telemetry:read | Tier counts and shared artifact counts |
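The two probe styles in the table (unauthenticated liveness and authenticated admin health) can be sketched with the Python standard library. This is a minimal sketch, not Cadence's own tooling: the helper names are hypothetical, the base URL is a placeholder, and bearer-token authentication is an assumption, so substitute whatever credential scheme your deployment uses.

```python
import urllib.error
import urllib.request


def build_probe_request(base_url, path, token=None):
    """Build a GET request for a health path; the token is optional."""
    url = base_url.rstrip("/") + path
    request = urllib.request.Request(url, method="GET")
    if token:
        # Assumption: bearer-token auth. Replace with your deployment's scheme.
        request.add_header("Authorization", f"Bearer {token}")
    return request


def probe(request, timeout=3.0, opener=urllib.request.urlopen):
    """Return (ok, status); status is None when the API is unreachable."""
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200, response.status
    except urllib.error.HTTPError as exc:
        # The server answered with a non-2xx status (for example 401 or 503).
        return False, exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout.
        return False, None
```

A load balancer style check would call `probe(build_probe_request(base_url, "/health"))`, while a diagnostics job would pass a token for an account that holds cadence:system:health:read and request /api/admin/health instead.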

Pool statistics

The Admin → Pool dashboard polls GET /api/admin/pool/stats and shows total instances, hot tier and demand pool counts, and shared model/bundle counts.

Reading pool stats during incidents:

High hot count with high latency → bottleneck is likely external (model provider), not the pool. Correlate with Observability.
Low hot with “not loaded” chat errors → instances are being aggressively demoted or load events are failing. Check worker health and message bus connectivity.
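The two heuristics above can be written down as a small triage function. This is a sketch under stated assumptions: the function name, parameter names, and default thresholds are hypothetical, and because this page does not document the field names in the pool stats response, the function takes plain numbers that you extract yourself.

```python
def triage_pool(hot_count, p95_latency_s, not_loaded_errors,
                hot_floor=1, latency_budget_s=5.0):
    """Classify an incident using the two pool-stats heuristics.

    hot_floor and latency_budget_s are illustrative thresholds; tune them
    to your own deployment.
    """
    if hot_count >= hot_floor and p95_latency_s > latency_budget_s:
        # Plenty of hot instances but slow turns: look at the model provider.
        return "external"
    if hot_count < hot_floor and not_loaded_errors > 0:
        # Few hot instances plus "not loaded" errors: demotion or failed loads.
        return "pool"
    return "inconclusive"
```

For example, `triage_pool(8, 12.0, 0, hot_floor=4)` returns `"external"`, which points at correlating with traces rather than reloading instances.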

See Hot-reload and orchestrator pool for load/unload operations.

Log shipping

The API uses standard Python logging from cadence.main (text lines by default). Ship stdout to your aggregation stack (ELK, Loki, CloudWatch, etc.) using your platform’s log driver. For OpenTelemetry log export, configure otel.logs_enabled and related keys — see Observability.

Key log fields to search during incidents: level, logger, message, and any request-specific context injected by middleware.
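To search on level, logger, and message, text lines first have to be split into those fields. The sketch below assumes the lines follow Python's default `logging.basicConfig` layout, `LEVEL:logger:message`; Cadence's actual format may differ, so treat the regular expression as a placeholder to adjust. The function names are hypothetical.

```python
import re

# Assumption: Python's default basicConfig format "%(levelname)s:%(name)s:%(message)s".
LINE_RE = re.compile(r"^(?P<level>[A-Z]+):(?P<logger>[^:]+):(?P<message>.*)$")


def parse_line(line):
    """Return a dict with level, logger, and message, or None if no match."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def errors_from(lines, levels=("ERROR", "CRITICAL")):
    """Yield parsed records whose level is in `levels`.

    Lines that do not match (for example traceback continuation lines)
    are skipped.
    """
    for line in lines:
        record = parse_line(line)
        if record and record["level"] in levels:
            yield record
```

Most aggregation stacks can apply an equivalent pattern at ingest time, which keeps the fields queryable without post-processing.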

What to monitor

Instance counts (hot tier vs demand pool) from pool stats.
Error rates and latency on chat and admin routes from your log aggregation.
Message bus backlog if load/unload events are stalling.
503 chat errors correlated with instances not loaded or evicted from the demand pool.
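Error rate and latency are usually computed by the aggregation stack itself, but the arithmetic is worth pinning down when defining alert thresholds. The sketch below assumes you have already extracted HTTP status codes and request durations from your logs; the function names are hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from what your tooling reports.

```python
def error_rate(statuses):
    """Fraction of responses with a 5xx status; 0.0 for an empty window."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status >= 500) / len(statuses)


def p95(latencies):
    """95th percentile by the nearest-rank method."""
    if not latencies:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(latencies)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Evaluating both over the same time window makes it easier to tell a burst of 503 responses apart from a general slowdown.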

Verification and quality

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| /health returns non-200 | API process crashed or failed startup | Check process logs and infrastructure connectivity |
| GET /api/admin/health shows instances is_ready: false | Orchestrator failed to load or lost state | Check worker logs; reload the instance |
| 401 after login | Redis unavailable or session expired | Check Redis health (CADENCE_REDIS_URL); re-authenticate |
| Load event stuck, 202 returned but instance never appears | Worker not consuming events; message bus stalled | Check worker health and RabbitMQ connectivity |
| High hot count but latency still high | Bottleneck is external (model provider), not pool | Correlate with traces in Observability |
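The "load event stuck" symptom is easiest to detect with a bounded wait: after a 202 response, poll until the instance appears or a deadline passes. The helper below is a generic sketch; `wait_until` is a hypothetical name, and the predicate (for example, a check against the admin health or pool stats response) is left to the caller because the response shape is not documented on this page.

```python
import time


def wait_until(predicate, timeout_s=60.0, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True if the predicate succeeded, False on timeout.
    clock and sleep are injectable so the helper can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False result after a generous timeout is the signal to check worker health and message bus connectivity rather than to retry the load.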

Next steps

Observability: OpenTelemetry tracing, structured logs, and debugging slow chat turns.

Hot-reload and orchestrator pool: Pool tiers, load/unload operations, and 202 async semantics.