Hot-reload and orchestrator pool
Pool tiers, async load/unload, and pool statistics.
Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers
Learning outcomes by role
Stakeholders
- Understand pool tiers (hot, warm, cold) as capacity and responsiveness trade-offs.
Business analysts
- Describe when orchestrators load or evict for user-visible latency stories.
Solution architects
- Relate pool behavior to process memory, events, and optional RabbitMQ messaging.
Developers
- Follow pool load/unload APIs and orchestrator factory integration points.
Testers
- Verify tier transitions, reload paths, and stats endpoints under load.
The orchestrator pool manages running instances across three memory tiers. Load and
unload API calls are asynchronous — they publish events that a runtime worker consumes to
bring instances up or tear them down. The HTTP call returns 202 Accepted after publishing;
the instance is not ready until the worker processes the event.
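This publish-then-consume flow can be sketched with an in-memory queue and a background thread standing in for the message bus and runtime worker (all names here are illustrative, not the service's real types):

```python
import queue
import threading

event_bus: "queue.Queue[str]" = queue.Queue()  # stands in for the message bus
pool: dict[str, str] = {}                      # instance_id -> tier; the "running" set

def api_load(instance_id: str) -> int:
    """Publish a load event and return immediately with 202 Accepted."""
    event_bus.put(instance_id)
    return 202  # the instance is NOT in the pool yet

def worker_loop() -> None:
    """Runtime worker: consumes load events and brings instances up."""
    while True:
        instance_id = event_bus.get()
        pool[instance_id] = "hot"  # the actual load happens here, after the 202
        event_bus.task_done()

threading.Thread(target=worker_loop, daemon=True).start()

status = api_load("orch-1")
assert status == 202
# "orch-1" may or may not be in `pool` yet; readiness is eventual.
event_bus.join()  # wait until the worker has processed the event
assert pool["orch-1"] == "hot"
```

The key property to notice: the 202 only acknowledges that the event was published; only the worker's consumption makes the instance ready.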
How pool tiers work
| Tier | Behavior | When to use |
|---|---|---|
| hot | Resident in memory, lowest latency | High-traffic or SLA-sensitive orchestrators |
| warm | Can be promoted quickly | Moderately active; balance between memory and latency |
| cold | Configuration only, not resident | Rarely used; lowest memory cost |
Tier is set as a default at orchestrator creation and can be overridden at load time with a
tier hint in the load request.
Loading and unloading instances
- Create the orchestrator with a `tier` default (often `cold` until traffic proves the need).
- When you need predictable latency, trigger an explicit load to `hot` before a demo or traffic spike.
- After large config or plugin changes, unload and then reload to ensure clean state.
- Monitor pool stats during incidents — see Monitoring.
Load validates org access before publishing the event. The HTTP response returns after the event is published, not after the instance is ready.
```python
@router.post("/{instance_id}/load", status_code=status.HTTP_202_ACCEPTED)
@audit_log("Publishing load event for instance {instance_id} (source=api_load)")
@publish_after("load", _load_payload)
async def load_orchestrator(
    instance_id: str,
    load_request: LoadOrchestratorRequest = None,
    request: Request = None,
    context: TenantContext = Depends(require_permission(ORG_ORCHESTRATORS_LIFECYCLE)),
    event_publisher=Depends(get_event_publisher),
):
    settings_service: SettingsService = request.app.state.settings_service
    instance = await settings_service.get_instance_config(instance_id)
    validate_orchestrator_access(instance, instance_id, context.org_id)
    tier = (load_request.tier if load_request else None) or instance.get("tier", "hot")
    return {
        "message": "Load event published",
        "instance_id": instance_id,
        "tier": tier,
    }
```

The Admin pool dashboard polls stats on an interval to show current tier counts.
```ts
const { data: stats, refresh } = await useApiFetch<PoolStatsResponse>('/api/admin/pool/stats')

onMounted(() => {
  timer.value = setInterval(() => refresh(), POOL_STATS_REFRESH_MS)
})
```

Pool statistics
The Admin → Pool dashboard polls GET /api/admin/pool/stats and renders cards for total
instances, tier counts, and shared model/bundle counts.
If hot is near your policy limit but latency is still high, the bottleneck is likely external (model provider) rather than pool size — correlate with Observability. If users see “not loaded” errors while hot is low, workers may be failing or the message bus is stalled — check worker health and bus health.
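These rules of thumb can be encoded as a small triage helper. The thresholds and signal names below are illustrative assumptions for the sketch, not product behavior:

```python
def triage(hot_count: int, hot_limit: int, p95_latency_ms: float,
           not_loaded_errors: int) -> str:
    """Encode the incident rules of thumb above (thresholds illustrative)."""
    # "Not loaded" errors while hot usage is low: the pool has headroom,
    # so suspect the workers or the message bus instead.
    if not_loaded_errors > 0 and hot_count < hot_limit // 2:
        return "check worker health and message bus"
    # Hot near the policy limit but latency still high: pool size is not
    # the constraint; look outward (e.g. the model provider).
    if hot_count >= hot_limit * 0.9 and p95_latency_ms > 2000:
        return "likely external bottleneck; correlate with observability"
    return "no obvious pool-side issue"

assert triage(hot_count=2, hot_limit=10, p95_latency_ms=300,
              not_loaded_errors=5) == "check worker health and message bus"
assert triage(hot_count=9, hot_limit=10, p95_latency_ms=3500,
              not_loaded_errors=0) == "likely external bottleneck; correlate with observability"
```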
Guarantees
- Load/unload handlers validate that the instance belongs to the caller’s organization before publishing.
- Chat returns `503` when an instance is not loaded in the pool. Trigger a load and wait for the worker to process the event before retrying.
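A caller can automate the load-then-retry dance sketched above. The `chat`, `load`, and `is_loaded` callables here are hypothetical shapes supplied by the caller, not the service's real client API:

```python
import time

def chat_with_retry(chat, load, is_loaded, instance_id: str,
                    timeout: float = 30.0, poll: float = 0.5):
    """Call chat(); on a 503 "not loaded", trigger a load and wait for the
    worker to bring the instance up before retrying once."""
    status, body = chat(instance_id)
    if status != 503:
        return status, body
    load(instance_id)                       # publish the load event (202)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:      # wait for the worker, not the 202
        if is_loaded(instance_id):
            return chat(instance_id)
        time.sleep(poll)
    raise TimeoutError(f"{instance_id} not loaded within {timeout}s")

# Toy doubles: the instance becomes loaded once load() is called.
state = {"loaded": False}
result = chat_with_retry(
    chat=lambda i: (200, "ok") if state["loaded"] else (503, "not loaded"),
    load=lambda i: state.update(loaded=True),
    is_loaded=lambda i: state["loaded"],
    instance_id="orch-1",
    poll=0.01,
)
assert result == (200, "ok")
```

Polling a readiness signal (rather than retrying the chat call blindly) keeps retry traffic off the chat path while the worker drains the event.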
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `503` not loaded on chat | Instance not in pool | Call load; wait for the worker to process the event |
| Load returns 202 but instance never appears | Worker health or message bus issue | Check worker logs and message bus connectivity |
| Memory growth | Too many hot instances | Check pool stats; reduce hot tier ceiling or unload idle instances |