Hot-reload and orchestrator pool

Pool tiers, async load/unload, and pool statistics.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

  • Understand pool tiers (hot, warm, cold) as capacity and responsiveness trade-offs.

Business analysts

  • Describe when orchestrators load or evict for user-visible latency stories.

Solution architects

  • Relate pool behavior to process memory, events, and optional RabbitMQ messaging.

Developers

  • Follow pool load/unload APIs and orchestrator factory integration points.

Testers

  • Verify tier transitions, reload paths, and stats endpoints under load.

The orchestrator pool manages running instances across three memory tiers. Load and unload API calls are asynchronous — they publish events that a runtime worker consumes to bring instances up or tear them down. The HTTP call returns 202 Accepted after publishing; the instance is not ready until the worker processes the event.

| Tier | Behavior | When to use |
| --- | --- | --- |
| hot | Resident in memory, lowest latency | High-traffic or SLA-sensitive orchestrators |
| warm | Can be promoted to hot quickly | Moderately active; balances memory and latency |
| cold | Configuration only, not resident | Rarely used; lowest memory cost |

A default tier is set at orchestrator creation, but it can be overridden at load time with a tier hint in the load request.
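That precedence can be sketched as a small helper (illustrative only — `resolve_tier` and the config shape are not part of the actual API; the handler shown later on this page applies the same fallback order):

```python
def resolve_tier(request_tier, instance_config: dict) -> str:
    """Resolve the effective tier for a load request.

    Precedence, mirroring the load handler:
    1. Explicit tier hint in the load request.
    2. Default tier stored on the instance at creation.
    3. "hot" as the final fallback.
    """
    return request_tier or instance_config.get("tier", "hot")
```

A tier hint in the request always wins over the stored default; an instance created as cold can still be loaded hot before a traffic spike.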

  1. Create the orchestrator with a tier default (often cold until traffic proves need).
  2. When you need predictable latency, trigger an explicit load to hot before a demo or traffic spike.
  3. After large config or plugin changes, unload then reload to ensure clean state.
  4. Monitor pool stats during incidents — see Monitoring.
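Step 2 above might look like the following from a client's perspective (a sketch assuming an httpx-style client; the path prefix and `LoadNotAccepted` helper are illustrative):

```python
class LoadNotAccepted(RuntimeError):
    """Raised when the API does not acknowledge the load event."""


def request_hot_load(client, instance_id: str) -> dict:
    """Ask the API to load an instance into the hot tier.

    `client` is any object with a `post(path, json=...)` method
    (e.g. an httpx.Client); the base path is an assumption.
    Returns the acknowledgement payload. Note: a 202 only means
    the load event was published -- the instance is not ready
    until the runtime worker processes it.
    """
    resp = client.post(f"/api/orchestrator/{instance_id}/load",
                       json={"tier": "hot"})
    if resp.status_code != 202:
        raise LoadNotAccepted(f"load not accepted: {resp.status_code}")
    return resp.json()
```

Because the call is asynchronous, follow it with pool stats polling (step 4) rather than assuming readiness.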

Load validates org access before publishing the event. The HTTP response returns after the event is published, not after the instance is ready.

cadence/api/orchestrator/lifecycle.py

```python
@router.post("/{instance_id}/load", status_code=status.HTTP_202_ACCEPTED)
@audit_log("Publishing load event for instance {instance_id} (source=api_load)")
@publish_after("load", _load_payload)
async def load_orchestrator(
    instance_id: str,
    load_request: LoadOrchestratorRequest = None,
    request: Request = None,
    context: TenantContext = Depends(require_permission(ORG_ORCHESTRATORS_LIFECYCLE)),
    event_publisher=Depends(get_event_publisher),
):
    settings_service: SettingsService = request.app.state.settings_service
    instance = await settings_service.get_instance_config(instance_id)
    validate_orchestrator_access(instance, instance_id, context.org_id)
    # Tier hint in the request overrides the instance default; fall back to "hot".
    tier = (load_request.tier if load_request else None) or instance.get("tier", "hot")
    return {
        "message": "Load event published",
        "instance_id": instance_id,
        "tier": tier,
    }
```
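The `@publish_after` decorator's implementation is not shown here, but its contract — publish the event only after the handler returns successfully — could look roughly like this (a sketch; the real decorator, payload-builder signature, and publisher interface may differ):

```python
import functools


def publish_after(event_type, payload_builder):
    """Publish an event after the wrapped handler returns.

    If the handler raises (e.g. org access validation fails),
    nothing is published. `payload_builder` turns the handler's
    return value into the event payload. Illustrative only.
    """
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, event_publisher=None, **kwargs):
            result = await handler(*args, event_publisher=event_publisher, **kwargs)
            # Publish only after the handler completed without error.
            await event_publisher.publish(event_type, payload_builder(result))
            return result
        return wrapper
    return decorator
```

This ordering is why access validation failures never leak load events onto the bus.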

The Admin → Pool dashboard polls GET /api/admin/pool/stats and renders cards for total instances, tier counts, and shared model/bundle counts.
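A minimal consumer of the stats endpoint might summarize the payload like this (a sketch; the field names are assumptions based on the dashboard cards described above, not a documented schema):

```python
def summarize_pool_stats(stats: dict) -> str:
    """Render a one-line summary of a pool stats payload.

    Assumes a payload with a `total` count, per-tier counts under
    `tiers`, and shared resource counts -- names are illustrative.
    """
    tiers = stats.get("tiers", {})
    return (
        f"instances={stats.get('total', 0)} "
        f"hot={tiers.get('hot', 0)} "
        f"warm={tiers.get('warm', 0)} "
        f"cold={tiers.get('cold', 0)} "
        f"shared_models={stats.get('shared_models', 0)}"
    )
```

During an incident, watching the hot count alongside latency helps separate pool-capacity problems from external ones (see the next section).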

If hot is near your policy limit but latency is still high, the bottleneck is likely external (model provider) rather than pool size — correlate with Observability. If users see “not loaded” errors while hot is low, workers may be failing or the message bus is stalled — check worker health and bus health.

  • Load/unload handlers validate that the instance belongs to the caller’s organization before publishing.
  • Chat returns 503 when an instance is not loaded in the pool. Trigger a load and wait for the worker to process the event before retrying.
| Symptom | Cause | Fix |
| --- | --- | --- |
| 503 "not loaded" on chat | Instance not in pool | Call load; wait for the worker to process the event |
| Load returns 202 but instance never appears | Worker health or message bus issue | Check worker logs and message bus connectivity |
| Memory growth | Too many hot instances | Check pool stats; reduce the hot tier ceiling or unload idle instances |
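For the first symptom, a client can trigger a load and retry with backoff to give the runtime worker time to process the event (a sketch; `send_chat` and `trigger_load` stand in for your actual client calls):

```python
import time


def chat_with_reload(send_chat, trigger_load, message,
                     retries: int = 5, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Send a chat message, triggering a load on 503.

    `send_chat(message)` returns `(status_code, body)`;
    `trigger_load()` publishes the async load event.
    Exponential backoff covers the gap between the 202
    acknowledgement and the worker finishing the load.
    """
    load_triggered = False
    for attempt in range(retries):
        status, body = send_chat(message)
        if status != 503:
            return body
        if not load_triggered:
            trigger_load()  # publish the load event once
            load_triggered = True
        sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise TimeoutError("instance never became ready")
```

If retries exhaust, fall through to the second row of the table: check worker logs and message bus connectivity rather than retrying indefinitely.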