
Monitoring

Health endpoints, pool statistics, log shipping, and operational troubleshooting.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

  • Tie uptime monitoring to customer-facing SLAs and incident response.

Business analysts

  • Define alert thresholds and escalation paths referencing health endpoints.

Solution architects

  • Integrate probes with Kubernetes or load balancers using documented paths.

Developers

  • Interpret admin health and pool statistics fields for debugging.

Testers

  • Validate monitoring scripts against staging clusters.

Monitor Cadence using liveness and admin health routes, pool statistics for orchestrator tiers, and structured JSON logs. Admin probes require the appropriate cadence:system:* permissions.

  • SLA visibility — Unauthenticated GET /health supports cheap uptime checks; deeper /api/admin/* routes justify stricter access control for diagnostics.
  • Incident triage — Pool stats and logs together explain whether failures are provider-side, pool saturation, or auth/session issues.
  • Runbooks — Document which probe uses which URL and which service account holds the minimum cadence:system:* permission.
  • Escalation — Align alert names with symptoms in the troubleshooting table below.
Monitoring probe paths

The load balancer uses unauthenticated GET /health; deeper checks require cadence system permissions on admin routes.

| Probe | Endpoint | Auth | Purpose |
| --- | --- | --- | --- |
| Load balancer or uptime probe | GET /health | No auth | Liveness |
| Ops / SRE | GET /api/admin/health | Bearer + cadence:system:* | Pool stats, deeper checks |

Health router: cadence.api.health.router. Admin paths require the permissions listed in the endpoint table below.

  • Network access to the API from your load balancer or probes
  • Credentials for a user with the required system permissions (see Role-based access control)
| Endpoint | Auth required | Purpose |
| --- | --- | --- |
| GET /health | None | Basic liveness — use for load balancer probes |
| GET /api/admin/health | cadence:system:health:read | Deeper orchestrator and dependency health |
| GET /api/admin/pool/stats | cadence:system:admin or appropriate system permission | Tier counts and shared artifact counts |
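As a sketch, the endpoints above can be probed with Python's standard library. The base URL and token values here are placeholders, not values from this document:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # placeholder; substitute your deployment's URL
ADMIN_TOKEN = "REPLACE_ME"          # token for a user holding the required cadence:system:* permissions

def auth_headers(token):
    """Build the Authorization header for admin routes; empty for /health."""
    return {"Authorization": f"Bearer {token}"} if token else {}

def probe(path, token=None):
    """GET an endpoint and return (HTTP status, decoded JSON body)."""
    req = urllib.request.Request(BASE_URL + path, headers=auth_headers(token))
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status, json.loads(resp.read().decode())

# Example (requires a running instance):
# status, _ = probe("/health")                            # expect 200 for liveness
# status, body = probe("/api/admin/health", ADMIN_TOKEN)  # deeper health
```

Keeping the liveness probe unauthenticated means load balancers need no credentials, while the admin probe carries a bearer token only for the deeper checks.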

The Admin → Pool dashboard polls GET /api/admin/pool/stats and shows total instances, counts by tier (hot/warm/cold), and shared model/bundle counts.

Reading pool stats during incidents:

  • High hot count with high latency → bottleneck is likely external (model provider), not the pool. Correlate with Observability.
  • Low hot with “not loaded” chat errors → instances are being aggressively demoted or load events are failing. Check worker health and message bus connectivity.
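The two triage rules above can be encoded as a small check. The tier field names ("hot", "warm", "cold") are assumptions about the stats payload shape, not documented names:

```python
def triage_pool(stats, high_latency, not_loaded_errors):
    """Apply the two incident heuristics above to a pool-stats snapshot.

    `stats` is assumed to look like {"hot": 12, "warm": 3, "cold": 40};
    the real /api/admin/pool/stats field names may differ.
    """
    hints = []
    hot = stats.get("hot", 0)
    if hot > 0 and high_latency:
        hints.append("hot tier busy but slow: suspect the external model provider")
    if hot == 0 and not_loaded_errors:
        hints.append("no hot instances: check worker health and message bus connectivity")
    return hints
```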

See Hot-reload and orchestrator pool for load/unload operations.

Set CADENCE_LOG_FORMAT=json to emit structured JSON lines on stdout. Ship stdout to your aggregation stack (ELK, Loki, CloudWatch, etc.) using your platform’s log driver.

Key log fields to search during incidents: level, logger, message, and any request-specific context injected by middleware.
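As an illustration of searching those fields, a minimal filter over JSON log lines (the field names `level` and `logger` are the ones listed above; the exact record shape depends on your middleware):

```python
import json
from collections import Counter

def count_errors(lines):
    """Tally ERROR-level records per logger, skipping non-JSON lines."""
    errors = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line; ignore
        if record.get("level") == "ERROR":
            errors[record.get("logger", "unknown")] += 1
    return errors

# Usage: pipe shipped stdout through it, e.g.
# import sys; print(count_errors(sys.stdin).most_common(10))
```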

Signals worth watching during incidents:

  • Instance counts by tier (hot/warm/cold) from pool stats.
  • Error rates and latency on chat and admin routes from your log aggregation.
  • Message bus backlog if load/unload events are stalling.
  • 503 chat errors correlated with cold instance counts.
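A sketch of turning those signals into alert conditions; the thresholds and field names below are illustrative assumptions, not values from this document:

```python
def alert_conditions(tier_counts, rate_503, bus_backlog):
    """Evaluate the watch-list above against simple illustrative thresholds."""
    alerts = []
    # 503s while instances sit cold points at the pool, not the provider
    if tier_counts.get("cold", 0) > 0 and rate_503 > 0.01:
        alerts.append("503s correlated with cold instances: check pool load path")
    # a growing backlog suggests load/unload events are stalling
    if bus_backlog > 100:
        alerts.append("message bus backlog: check worker consumption")
    return alerts
```

Aligning the alert strings with the symptom names in the troubleshooting table below keeps escalation unambiguous.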
| Symptom | Cause | Fix |
| --- | --- | --- |
| Plugin not found in orchestrator | Plugin not in org catalog or not attached | List plugins for that org |
| No LLM config | Org LLM configurations not created | Add an LLM config for the org |
| 401 after login | Redis unavailable or session expired | Check Redis health; re-authenticate |
| Hot-reload stuck | Load event not consumed by worker | Check worker health and message bus |