Monitoring
Health endpoints, pool statistics, log shipping, and operational troubleshooting.
Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers
Learning outcomes by role
Stakeholders
- Tie uptime monitoring to customer-facing SLAs and incident response.
Business analysts
- Define alert thresholds and escalation paths referencing health endpoints.
Solution architects
- Integrate probes with Kubernetes or load balancers using documented paths.
Developers
- Interpret admin health and pool statistics fields for debugging.
Testers
- Validate monitoring scripts against staging clusters.
Monitor Cadence using liveness and admin health routes, pool statistics for orchestrator tiers, and stdout logs from the API process. Admin probes require the right cadence:system:* permissions.
Summary for stakeholders
- SLA visibility: Unauthenticated GET /health supports cheap uptime checks; deeper /api/admin/* routes justify stricter access control for diagnostics.
- Incident triage: Pool stats and logs together explain whether failures are provider-side, pool saturation, or auth/session issues.
Business analysis
- Runbooks: Document which probe uses which URL and which service account holds the minimum cadence:system:* permission.
- Escalation: Align alert names with symptoms in the troubleshooting table below.
Architecture and integration
Health routes are served by cadence.api.health.router. Admin paths require the permissions listed under Health endpoints.
Prerequisites
- Network access to the API from your load balancer or probes
- Credentials for a user with the required system permissions (see Role-based access control)
Health endpoints
| Endpoint | Auth required | Purpose |
|---|---|---|
| GET /health | None | Basic liveness; use for load balancer probes |
| GET /api/admin/health | cadence:system:health:read | Deeper orchestrator and dependency health |
| GET /api/admin/pool/stats | cadence:system:telemetry:read | Tier counts and shared artifact counts |
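A probe script for the endpoints above might look like the following minimal sketch. It uses only the Python standard library. The `probe` helper, its return shape, and the idea of passing credentials through a `headers` dict are assumptions for illustration; this page does not specify how admin credentials are presented, so supply whatever your deployment's login flow produces.

```python
import json
import urllib.error
import urllib.request


def probe(base_url, path="/health", headers=None, timeout=5.0):
    """GET base_url + path and return (ok, status, body).

    Hypothetical helper: ok is True only for HTTP 200. body is the parsed
    JSON payload, or None if the response is not JSON.
    """
    url = base_url.rstrip("/") + path
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            raw = response.read()
    except urllib.error.HTTPError as exc:
        # Non-2xx answers (401, 403, 503, ...) are raised as HTTPError.
        status = exc.code
        raw = exc.read()
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as down.
        return False, None, None
    try:
        body = json.loads(raw)
    except ValueError:
        body = None
    return status == 200, status, body
```

Call `probe(base_url)` for the unauthenticated liveness check, and `probe(base_url, "/api/admin/health", headers=...)` for the deeper check once you know which header or cookie your deployment expects.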
Pool statistics
The Admin → Pool dashboard polls GET /api/admin/pool/stats and shows total instances, hot tier and demand pool counts, and shared model/bundle counts.
Reading pool stats during incidents:
- High hot count with high latency → bottleneck is likely external (model provider), not the pool. Correlate with Observability.
- Low hot with “not loaded” chat errors → instances are being aggressively demoted or load events are failing. Check worker health and message bus connectivity.
See Hot-reload and orchestrator pool for load/unload operations.
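The two incident patterns above can be encoded as a small triage helper for alerting scripts. This is a sketch under stated assumptions: the function name, the argument names, and both thresholds are hypothetical, and the caller is expected to extract the hot count from the pool stats response and the latency and error figures from its own log aggregation.

```python
def triage_pool(hot_count, p95_latency_s, not_loaded_errors,
                hot_floor=3, latency_limit_s=5.0):
    """Classify an incident using the two patterns described above.

    hot_floor and latency_limit_s are illustrative thresholds, not
    documented defaults; tune them for your deployment.
    """
    if hot_count >= hot_floor and p95_latency_s > latency_limit_s:
        # Plenty of hot instances but slow responses: look outside the pool.
        return "external"
    if hot_count < hot_floor and not_loaded_errors > 0:
        # Few hot instances plus "not loaded" errors: demotion or failed loads.
        return "pool"
    return "unclear"
```

An "external" result points toward the model provider and traces, while "pool" points toward worker health and message bus connectivity.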
Log shipping
The API uses standard Python logging from cadence.main (text lines by default). Ship stdout to your aggregation stack (ELK, Loki, CloudWatch, etc.) using your platform’s log driver. For OpenTelemetry log export, configure otel.logs_enabled and related keys; see Observability.
Key log fields to search during incidents: level, logger, message, and any request-specific context injected by middleware.
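If your aggregation stack indexes structured lines more easily than text, the fields named above map directly onto attributes of a standard `logging.LogRecord`. The formatter below is a generic standard-library sketch, not Cadence's own configuration: the class name is hypothetical, `request_id` stands in for whatever context your middleware injects, and this page does not state whether Cadence exposes a hook for swapping formatters.

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line (illustrative only)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # request_id is a hypothetical attribute added via `extra=` or a filter.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            payload["request_id"] = request_id
        return json.dumps(payload)
```

Attaching it follows the usual pattern: create a `logging.StreamHandler()`, call `setFormatter(JsonLineFormatter())` on it, and add the handler to the logger you control.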
What to monitor
- Instance counts (hot tier vs demand pool) from pool stats.
- Error rates and latency on chat and admin routes from your log aggregation.
- Message bus backlog if load/unload events are stalling.
- 503 chat errors correlated with instances not loaded or evicted from the demand pool.
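As one way to turn the 503 signal into an alert, the sketch below computes the share of responses with a given status over a window of parsed log records. The record shape (a dict with a `status` key), the function names, and the 5% threshold are all assumptions for illustration; adapt them to whatever your log pipeline actually emits.

```python
def status_rate(records, status=503):
    """Return the fraction of records whose 'status' equals `status`."""
    total = len(records)
    if total == 0:
        return 0.0
    hits = sum(1 for record in records if record.get("status") == status)
    return hits / total


def should_alert(records, status=503, threshold=0.05):
    """True when the observed rate exceeds the (hypothetical) threshold."""
    return status_rate(records, status) > threshold
```

Pair an alert on this rate with the pool stats from the same time window, so responders can tell quickly whether the 503s line up with a low hot count.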
Verification and quality
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| /health returns non-200 | API process crashed or failed startup | Check process logs and infrastructure connectivity |
| GET /api/admin/health shows instances is_ready: false | Orchestrator failed to load or lost state | Check worker logs; reload the instance |
| 401 after login | Redis unavailable or session expired | Check Redis health (CADENCE_REDIS_URL); re-authenticate |
| Load event stuck, 202 returned but instance never appears | Worker not consuming events; message bus stalled | Check worker health and RabbitMQ connectivity |
| High hot count but latency still high | Bottleneck is external (model provider), not pool | Correlate with traces in Observability |