Hot-reload and orchestrator pool

Pool tiers, async load/unload, and pool statistics.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

  • Understand pool tiers (hot, warm, cold) as capacity and responsiveness trade-offs.

Business analysts

  • Describe when orchestrators load or evict for user-visible latency stories.

Solution architects

  • Relate pool behavior to process memory, events, and optional RabbitMQ messaging.

Developers

  • Follow pool load/unload APIs and orchestrator factory integration points.

Testers

  • Verify tier transitions, reload paths, and stats endpoints under load.

The orchestrator pool manages running instances across three memory tiers. Load and unload API calls are asynchronous — they publish events that a runtime worker consumes to bring instances up or tear them down. The HTTP call returns 202 Accepted after publishing; the instance is not ready until the worker processes the event.

| Tier | Behavior | When to use |
| --- | --- | --- |
| hot | Resident in memory, lowest latency | High-traffic or SLA-sensitive orchestrators |
| warm | Can be promoted to hot quickly | Moderately active; balances memory and latency |
| cold | Configuration only, not resident | Rarely used; lowest memory cost |

A default tier is set at orchestrator creation, but it can be overridden at load time with a tier hint in the load request.
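That precedence can be sketched as a small helper (illustrative only — `resolve_tier` and the config shape are not part of the actual API; the handler shown later on this page applies the same fallback order):

```python
def resolve_tier(request_tier, instance_config: dict) -> str:
    """Resolve the effective tier for a load request.

    Precedence, mirroring the load handler:
    1. Explicit tier hint in the load request.
    2. Default tier stored on the instance at creation.
    3. "hot" as the final fallback.
    """
    return request_tier or instance_config.get("tier", "hot")
```

A tier hint in the request always wins over the stored default; an instance created as cold can still be loaded hot before a traffic spike.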

  1. Create the orchestrator with a tier default (often cold until traffic proves need).
  2. When you need predictable latency, trigger an explicit load to hot before a demo or traffic spike.
  3. After large config or plugin changes, unload then reload to ensure clean state.
  4. Monitor pool stats during incidents — see Monitoring.
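Step 2 above might look like the following from a client's perspective (a sketch assuming an httpx-style client; the path prefix and `LoadNotAccepted` helper are illustrative):

```python
class LoadNotAccepted(RuntimeError):
    """Raised when the API does not acknowledge the load event."""


def request_hot_load(client, instance_id: str) -> dict:
    """Ask the API to load an instance into the hot tier.

    `client` is any object with a `post(path, json=...)` method
    (e.g. an httpx.Client); the base path is an assumption.
    Returns the acknowledgement payload. Note: a 202 only means
    the load event was published -- the instance is not ready
    until the runtime worker processes it.
    """
    resp = client.post(f"/api/orchestrator/{instance_id}/load",
                       json={"tier": "hot"})
    if resp.status_code != 202:
        raise LoadNotAccepted(f"load not accepted: {resp.status_code}")
    return resp.json()
```

Because the call is asynchronous, follow it with pool stats polling (step 4) rather than assuming readiness.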

Load validates org access before publishing the event. The HTTP response returns after the event is published, not after the instance is ready.

cadence/api/orchestrator/lifecycle.py

```python
@router.post("/{instance_id}/load", status_code=status.HTTP_202_ACCEPTED)
@audit_log("Publishing load event for instance {instance_id} (source=api_load)")
@publish_after("load", _load_payload)
async def load_orchestrator(
    instance_id: str,
    load_request: LoadOrchestratorRequest = None,
    request: Request = None,
    context: TenantContext = Depends(require_permission(ORG_ORCHESTRATORS_LIFECYCLE)),
    event_publisher=Depends(get_event_publisher),
):
    settings_service: SettingsService = request.app.state.settings_service
    instance = await settings_service.get_instance_config(instance_id)
    validate_orchestrator_access(instance, instance_id, context.org_id)
    # Tier hint in the request overrides the instance default; fall back to "hot".
    tier = (load_request.tier if load_request else None) or instance.get("tier", "hot")
    return {
        "message": "Load event published",
        "instance_id": instance_id,
        "tier": tier,
    }
```
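The `@publish_after` decorator's implementation is not shown here, but its contract — publish the event only after the handler returns successfully — could look roughly like this (a sketch; the real decorator, payload-builder signature, and publisher interface may differ):

```python
import functools


def publish_after(event_type, payload_builder):
    """Publish an event after the wrapped handler returns.

    If the handler raises (e.g. org access validation fails),
    nothing is published. `payload_builder` turns the handler's
    return value into the event payload. Illustrative only.
    """
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, event_publisher=None, **kwargs):
            result = await handler(*args, event_publisher=event_publisher, **kwargs)
            # Publish only after the handler completed without error.
            await event_publisher.publish(event_type, payload_builder(result))
            return result
        return wrapper
    return decorator
```

This ordering is why access validation failures never leak load events onto the bus.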

The Admin → Pool dashboard polls GET /api/admin/pool/stats and renders cards for total instances, tier counts, and shared model/bundle counts.
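A minimal consumer of the stats endpoint might summarize the payload like this (a sketch; the field names are assumptions based on the dashboard cards described above, not a documented schema):

```python
def summarize_pool_stats(stats: dict) -> str:
    """Render a one-line summary of a pool stats payload.

    Assumes a payload with a `total` count, per-tier counts under
    `tiers`, and shared resource counts -- names are illustrative.
    """
    tiers = stats.get("tiers", {})
    return (
        f"instances={stats.get('total', 0)} "
        f"hot={tiers.get('hot', 0)} "
        f"warm={tiers.get('warm', 0)} "
        f"cold={tiers.get('cold', 0)} "
        f"shared_models={stats.get('shared_models', 0)}"
    )
```

During an incident, watching the hot count alongside latency helps separate pool-capacity problems from external ones (see the next section).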

If hot is near your policy limit but latency is still high, the bottleneck is likely external (model provider) rather than pool size — correlate with Observability. If users see “not loaded” errors while hot is low, workers may be failing or the message bus is stalled — check worker health and bus health.

  • Load/unload handlers validate that the instance belongs to the caller’s organization before publishing.
  • Chat returns 503 when an instance is not loaded in the pool. Trigger a load and wait for the worker to process the event before retrying.
| Symptom | Cause | Fix |
| --- | --- | --- |
| 503 "not loaded" on chat | Instance not in pool | Call load; wait for the worker to process the event |
| Load returns 202 but instance never appears | Worker health or message bus issue | Check worker logs and message bus connectivity |
| Memory growth | Too many hot instances | Check pool stats; reduce the hot tier ceiling or unload idle instances |
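For the first symptom, a client can trigger a load and retry with backoff to give the runtime worker time to process the event (a sketch; `send_chat` and `trigger_load` stand in for your actual client calls):

```python
import time


def chat_with_reload(send_chat, trigger_load, message,
                     retries: int = 5, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Send a chat message, triggering a load on 503.

    `send_chat(message)` returns `(status_code, body)`;
    `trigger_load()` publishes the async load event.
    Exponential backoff covers the gap between the 202
    acknowledgement and the worker finishing the load.
    """
    load_triggered = False
    for attempt in range(retries):
        status, body = send_chat(message)
        if status != 503:
            return body
        if not load_triggered:
            trigger_load()  # publish the load event once
            load_triggered = True
        sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise TimeoutError("instance never became ready")
```

If retries exhaust, fall through to the second row of the table: check worker logs and message bus connectivity rather than retrying indefinitely.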