
Monitoring

Health endpoints, pool statistics, log shipping, and operational troubleshooting.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

  • Tie uptime monitoring to customer-facing SLAs and incident response.

Business analysts

  • Define alert thresholds and escalation paths referencing health endpoints.

Solution architects

  • Integrate probes with Kubernetes or load balancers using documented paths.

Developers

  • Interpret admin health and pool statistics fields for debugging.

Testers

  • Validate monitoring scripts against staging clusters.

Monitor Cadence using liveness and admin health routes, pool statistics for orchestrator tiers, and structured JSON logs. Admin probes require the appropriate cadence:system:* permissions.

  • SLA visibility — Unauthenticated GET /health supports cheap uptime checks; deeper /api/admin/* routes justify stricter access control for diagnostics.
  • Incident triage — Pool stats and logs together explain whether failures are provider-side, pool saturation, or auth/session issues.
  • Runbooks — Document which probe uses which URL and which service account holds the minimum cadence:system:* permission.
  • Escalation — Align alert names with symptoms in the troubleshooting table below.
Monitoring probe paths

The load balancer uses unauthenticated GET /health; deeper checks require cadence system permissions on admin routes.

| Probe | Endpoint | Auth | Purpose |
| --- | --- | --- | --- |
| Load balancer or uptime probe | GET /health | No auth | Liveness |
| Ops / SRE | GET /api/admin/health | Bearer + cadence:system:* | Pool stats, deeper checks |

Health router: cadence.api.health.router. Admin paths require the permissions listed in the endpoint table below.

  • Network access to the API from your load balancer or probes
  • Credentials for a user with the required system permissions (see Role-based access control)
| Endpoint | Auth required | Purpose |
| --- | --- | --- |
| GET /health | None | Basic liveness — use for load balancer probes |
| GET /api/admin/health | cadence:system:health:read | Deeper orchestrator and dependency health |
| GET /api/admin/pool/stats | cadence:system:admin or appropriate system permission | Tier counts and shared artifact counts |
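As a sketch, the endpoints above can be probed with Python's standard library. The base URL and token values here are placeholders, not values from this document:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # placeholder; substitute your deployment's URL
ADMIN_TOKEN = "REPLACE_ME"          # token for a user holding the required cadence:system:* permissions

def auth_headers(token):
    """Build the Authorization header for admin routes; empty for /health."""
    return {"Authorization": f"Bearer {token}"} if token else {}

def probe(path, token=None):
    """GET an endpoint and return (HTTP status, decoded JSON body)."""
    req = urllib.request.Request(BASE_URL + path, headers=auth_headers(token))
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status, json.loads(resp.read().decode())

# Example (requires a running instance):
# status, _ = probe("/health")                            # expect 200 for liveness
# status, body = probe("/api/admin/health", ADMIN_TOKEN)  # deeper health
```

Keeping the liveness probe unauthenticated means load balancers need no credentials, while the admin probe carries a bearer token only for the deeper checks.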

The Admin → Pool dashboard polls GET /api/admin/pool/stats and shows total instances, counts by tier (hot/warm/cold), and shared model/bundle counts.

Reading pool stats during incidents:

  • High hot count with high latency → bottleneck is likely external (model provider), not the pool. Correlate with Observability.
  • Low hot with “not loaded” chat errors → instances are being aggressively demoted or load events are failing. Check worker health and message bus connectivity.
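The two triage rules above can be encoded as a small check. The tier field names ("hot", "warm", "cold") are assumptions about the stats payload shape, not documented names:

```python
def triage_pool(stats, high_latency, not_loaded_errors):
    """Apply the two incident heuristics above to a pool-stats snapshot.

    `stats` is assumed to look like {"hot": 12, "warm": 3, "cold": 40};
    the real /api/admin/pool/stats field names may differ.
    """
    hints = []
    hot = stats.get("hot", 0)
    if hot > 0 and high_latency:
        hints.append("hot tier busy but slow: suspect the external model provider")
    if hot == 0 and not_loaded_errors:
        hints.append("no hot instances: check worker health and message bus connectivity")
    return hints
```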

See Hot-reload and orchestrator pool for load/unload operations.

Set CADENCE_LOG_FORMAT=json to emit structured JSON lines on stdout. Ship stdout to your aggregation stack (ELK, Loki, CloudWatch, etc.) using your platform’s log driver.

Key log fields to search during incidents: level, logger, message, and any request-specific context injected by middleware.
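As an illustration of searching those fields, a minimal filter over JSON log lines (the field names `level` and `logger` are the ones listed above; the exact record shape depends on your middleware):

```python
import json
from collections import Counter

def count_errors(lines):
    """Tally ERROR-level records per logger, skipping non-JSON lines."""
    errors = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line; ignore
        if record.get("level") == "ERROR":
            errors[record.get("logger", "unknown")] += 1
    return errors

# Usage: pipe shipped stdout through it, e.g.
# import sys; print(count_errors(sys.stdin).most_common(10))
```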

Signals worth watching during incidents:

  • Instance counts by tier (hot/warm/cold) from pool stats.
  • Error rates and latency on chat and admin routes from your log aggregation.
  • Message bus backlog if load/unload events are stalling.
  • 503 chat errors correlated with cold instance counts.
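A sketch of turning those signals into alert conditions; the thresholds and field names below are illustrative assumptions, not values from this document:

```python
def alert_conditions(tier_counts, rate_503, bus_backlog):
    """Evaluate the watch-list above against simple illustrative thresholds."""
    alerts = []
    # 503s while instances sit cold points at the pool, not the provider
    if tier_counts.get("cold", 0) > 0 and rate_503 > 0.01:
        alerts.append("503s correlated with cold instances: check pool load path")
    # a growing backlog suggests load/unload events are stalling
    if bus_backlog > 100:
        alerts.append("message bus backlog: check worker consumption")
    return alerts
```

Aligning the alert strings with the symptom names in the troubleshooting table below keeps escalation unambiguous.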
| Symptom | Cause | Fix |
| --- | --- | --- |
| Plugin not found in orchestrator | Plugin not in org catalog or not attached | List plugins for that org |
| No LLM config | Org LLM configurations not created | Add an LLM config for the org |
| 401 after login | Redis unavailable or session expired | Check Redis health; re-authenticate |
| Hot-reload stuck | Load event not consumed by worker | Check worker health and message bus |