Monitoring
Health endpoints, pool statistics, log shipping, and operational troubleshooting.
Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers
Learning outcomes by role
Stakeholders
- Tie uptime monitoring to customer-facing SLAs and incident response.
Business analysts
- Define alert thresholds and escalation paths referencing health endpoints.
Solution architects
- Integrate probes with Kubernetes or load balancers using documented paths.
Developers
- Interpret admin health and pool statistics fields for debugging.
Testers
- Validate monitoring scripts against staging clusters.
Monitor Cadence using liveness and admin health routes, pool statistics for orchestrator tiers, and structured JSON logs. Admin probes require the appropriate `cadence:system:*` permissions.
Summary for stakeholders
- SLA visibility: unauthenticated `GET /health` supports cheap uptime checks; deeper `/api/admin/*` routes justify stricter access control for diagnostics.
- Incident triage: pool stats and logs together explain whether failures are provider-side, pool saturation, or auth/session issues.
Business analysis
- Runbooks: document which probe uses which URL and which service account holds the minimum `cadence:system:*` permission.
- Escalation: align alert names with symptoms in the troubleshooting table below.
Architecture and integration
Health router: `cadence.api.health.router`. Admin paths require the permissions listed in the health endpoints table below.
Prerequisites
- Network access to the API from your load balancer or probes
- Credentials for a user with the required system permissions (see Role-based access control)
Health endpoints
| Endpoint | Auth required | Purpose |
|---|---|---|
| `GET /health` | None | Basic liveness; use for load balancer probes |
| `GET /api/admin/health` | `cadence:system:health:read` | Deeper orchestrator and dependency health |
| `GET /api/admin/pool/stats` | `cadence:system:admin` or appropriate system permission | Tier counts and shared artifact counts |
Pool statistics
The Admin → Pool dashboard polls `GET /api/admin/pool/stats` and shows total instances, counts by tier (hot/warm/cold), and shared model/bundle counts.
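If you export these numbers to your own dashboards, a small flattener keeps gauge names consistent. A sketch only; the payload field names (`total`, `tiers`, `shared_models`, `shared_bundles`) are illustrative assumptions, so check them against the actual `/api/admin/pool/stats` response:

```python
def stats_to_metrics(stats: dict) -> dict[str, int]:
    """Flatten a pool-stats payload into gauge metrics.

    Field names here are assumptions for illustration.
    """
    metrics = {"pool_total": stats.get("total", 0)}
    for tier in ("hot", "warm", "cold"):  # the three documented tiers
        metrics[f"pool_{tier}"] = stats.get("tiers", {}).get(tier, 0)
    metrics["shared_models"] = stats.get("shared_models", 0)
    metrics["shared_bundles"] = stats.get("shared_bundles", 0)
    return metrics
```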
Reading pool stats during incidents:
- High hot count with high latency → bottleneck is likely external (model provider), not the pool. Correlate with Observability.
- Low hot with “not loaded” chat errors → instances are being aggressively demoted or load events are failing. Check worker health and message bus connectivity.
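The two heuristics above can be encoded as a rough triage helper for alert annotations. The thresholds below are illustrative, not product defaults:

```python
def triage(hot: int, p95_latency_ms: float, not_loaded_errors: int) -> str:
    """Rough incident triage from pool stats, encoding the heuristics above.

    Thresholds (5 hot instances, 2000 ms p95) are illustrative assumptions.
    """
    if hot >= 5 and p95_latency_ms > 2000:
        return "suspect external model provider, not the pool"
    if hot <= 1 and not_loaded_errors > 0:
        return "suspect aggressive demotion or failing load events; check workers and message bus"
    return "no clear pool-side signal"
```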
See Hot-reload and orchestrator pool for load/unload operations.
Log shipping
Set `CADENCE_LOG_FORMAT=json` to emit structured JSON lines on stdout. Ship stdout to your
aggregation stack (ELK, Loki, CloudWatch, etc.) using your platform’s log driver.
Key log fields to search during incidents: `level`, `logger`, `message`, and any request-specific context injected by middleware.
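When the aggregation stack is unavailable, the same fields can be filtered straight from shipped log files. A minimal sketch using the field names above:

```python
import json


def matching_lines(lines, level=None, logger=None):
    """Yield parsed JSON log records, optionally filtered by level/logger."""
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines mixed into the stream
        if level and rec.get("level") != level:
            continue
        if logger and rec.get("logger") != logger:
            continue
        yield rec
```

For example, `matching_lines(open("app.log"), level="ERROR")` pulls only error records.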
What to monitor
- Instance counts by tier (hot/warm/cold) from pool stats.
- Error rates and latency on chat and admin routes from your log aggregation.
- Message bus backlog if load/unload events are stalling.
- 503 chat errors correlated with cold instance counts.
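The last correlation lends itself to a simple alert rule. A sketch with illustrative thresholds (the 5% error rate is an assumption, not a recommended default):

```python
def should_alert(error_503_rate: float, cold_count: int,
                 rate_threshold: float = 0.05, cold_threshold: int = 0) -> bool:
    """Fire when 503 chat errors correlate with cold instances.

    Thresholds are illustrative; tune them against your own baseline.
    """
    return error_503_rate > rate_threshold and cold_count > cold_threshold
```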
Verification and quality
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Plugin not found in orchestrator | Plugin not in org catalog or not attached | List plugins for that org |
| No LLM config | Org LLM configurations not created | Add an LLM config for the org |
| 401 after login | Redis unavailable or session expired | Check Redis health; re-authenticate |
| Hot-reload stuck | Load event not consumed by worker | Check worker health and message bus |