Hot-reload and AI App pool
OrchestratorPool tiers, async load/unload, and pool statistics.
Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers
Learning outcomes by role
Stakeholders
- Understand pool tiers (hot, demand) as capacity and responsiveness trade-offs.
Business analysts
- Describe when AI Apps load or evict for user-visible latency stories.
Solution architects
- Relate pool behavior to process memory, events, and optional RabbitMQ messaging.
Developers
- Follow pool load/unload APIs and orchestrator factory integration points.
Testers
- Verify tier transitions, reload paths, and stats endpoints under load.
The AI App pool (implemented as OrchestratorPool) manages running instances across hot (resident) and demand (TTL-backed on-demand) storage. Load and
unload API calls are asynchronous — they publish events that a runtime worker consumes to
bring instances up or tear them down. The HTTP call returns 202 Accepted after publishing;
the instance is not ready until the worker processes the event.
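The gap between "accepted" and "ready" can be illustrated with a minimal in-memory sketch. It assumes an `asyncio.Queue` as a stand-in for the event bus; `request_load`, `worker`, and the `loaded` set are hypothetical names for illustration, not the real publisher or runtime worker.

```python
import asyncio

# Hypothetical in-memory stand-ins; the real system uses an event publisher
# and a runtime worker, optionally over a message bus.
loaded: set[str] = set()


async def request_load(bus: asyncio.Queue, instance_id: str) -> int:
    """Publish a load event and return immediately, like the HTTP handler."""
    await bus.put({"action": "load", "instance_id": instance_id})
    return 202  # accepted: the event is published, the instance is not loaded yet


async def worker(bus: asyncio.Queue) -> None:
    """Consume events and apply them to the pool."""
    while True:
        event = await bus.get()
        if event["action"] == "load":
            loaded.add(event["instance_id"])
        elif event["action"] == "unload":
            loaded.discard(event["instance_id"])
        bus.task_done()


async def demo() -> tuple[int, bool, bool]:
    bus: asyncio.Queue = asyncio.Queue()
    status = await request_load(bus, "inst-1")
    ready_before = "inst-1" in loaded  # no worker has consumed the event yet
    task = asyncio.create_task(worker(bus))
    await bus.join()  # wait until the worker has processed the event
    ready_after = "inst-1" in loaded
    task.cancel()
    return status, ready_before, ready_after
```

Running `asyncio.run(demo())` shows the caller receiving 202 while the instance is still absent, and the instance appearing only after the worker has drained the queue.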
How pool tiers work
| Tier | Behavior | When to use |
|---|---|---|
| hot | Resident in the hot pool, lowest latency | High-traffic or SLA-sensitive AI Apps |
| demand | Loaded into the demand pool on use; evicted after its TTL | Lower baseline memory; first request may pay load latency |
Tier is stored on the instance record and read when publishing the load event (see cadence/api/orchestrator/lifecycle.py).
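The two tiers can be pictured with a toy pool, sketched below under stated assumptions. `TieredPool` is a hypothetical class, not the OrchestratorPool implementation, and it assumes that a demand entry's TTL is refreshed on each use, which the real pool may or may not do.

```python
import time
from typing import Callable, Dict, Optional


class TieredPool:
    """Toy two-tier pool: hot entries stay resident, demand entries expire."""

    def __init__(self, demand_ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = demand_ttl_s
        self._clock = clock
        self._hot: Dict[str, object] = {}
        self._demand: Dict[str, tuple] = {}  # instance_id -> (instance, last_used)

    def load(self, instance_id: str, instance: object, tier: str) -> None:
        if tier == "hot":
            self._demand.pop(instance_id, None)
            self._hot[instance_id] = instance
        elif tier == "demand":
            self._hot.pop(instance_id, None)
            self._demand[instance_id] = (instance, self._clock())
        else:
            raise ValueError(f"unknown tier: {tier}")

    def get(self, instance_id: str) -> Optional[object]:
        if instance_id in self._hot:
            return self._hot[instance_id]
        entry = self._demand.get(instance_id)
        if entry is None:
            return None
        instance, last_used = entry
        if self._clock() - last_used > self._ttl:
            del self._demand[instance_id]  # TTL expired: evict
            return None
        # Assumption: using an instance refreshes its TTL.
        self._demand[instance_id] = (instance, self._clock())
        return instance

    def stats(self) -> Dict[str, int]:
        return {"hot": len(self._hot), "demand": len(self._demand)}
```

The injectable `clock` makes eviction deterministic in tests. A hot entry is never evicted by time, while a demand entry disappears once it has been idle longer than the TTL.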
Loading and unloading instances
- Create the AI App with a `tier` default (often `demand` until traffic proves need).
- When you need predictable latency, trigger an explicit load to `hot` before a demo or traffic spike.
- After large config or plugin changes, unload then reload to ensure clean state.
- Monitor pool stats during incidents; see Monitoring.
Load validates org access before publishing the event. The HTTP response returns after the event is published, not after the instance is ready.
```python
@router.post("/{instance_id}/load", status_code=status.HTTP_202_ACCEPTED)
@audit_log("Publishing load event for instance {instance_id} (source=api_load)")
@publish_after("load", _load_payload)
async def load_orchestrator(
    org_id: str,
    instance_id: str,
    request: Request = None,
    security: SecurityContext = Depends(roles_allowed(ORG_ORCHESTRATORS_LIFECYCLE)),
    event_publisher=Depends(get_event_publisher),
):
    context = org_context(security, org_id)
    settings_service: SettingsService = request.app.state.settings_service
    instance = await settings_service.get_instance_config(instance_id)
    validate_orchestrator_access(instance, instance_id, context.org_id)
    tier = instance.get("tier", "hot")
    return {
        "message": "Load event published",
        "instance_id": instance_id,
        "tier": tier,
    }
```

The Admin pool dashboard polls stats on an interval to show current tier counts.
```ts
const { data: stats, refresh } = await useApiFetch<PoolStatsResponse>('/api/admin/pool/stats');

onMounted(() => {
  timer.value = setInterval(() => refresh(), POOL_STATS_REFRESH_MS);
});
```

Pool statistics
The Admin → Pool dashboard polls GET /api/admin/pool/stats and renders cards for total
instances, tier counts, and shared model/bundle counts.
If hot is near your policy limit but latency is still high, the bottleneck is likely external (model provider) rather than pool size — correlate with Observability. If users see “not loaded” errors while hot is low, workers may be failing or the message bus is stalled — check worker health and bus health.
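The two rules of thumb above can be written as a small triage helper. This is only a sketch: `PoolSnapshot` and its fields are hypothetical and do not describe the real stats payload, and the thresholds (90% of the hot limit, a 2000 ms latency budget) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    # Hypothetical fields; the real stats payload may be shaped differently.
    hot_count: int
    hot_limit: int
    p95_latency_ms: float
    not_loaded_errors: int


def triage(
    s: PoolSnapshot,
    latency_budget_ms: float = 2000.0,  # assumed latency budget
    near_limit: float = 0.9,  # assumed fraction that counts as "near the limit"
) -> str:
    hot_is_near_limit = s.hot_limit > 0 and s.hot_count >= near_limit * s.hot_limit
    if s.not_loaded_errors > 0 and not hot_is_near_limit:
        # "Not loaded" errors with spare hot capacity point at the load path.
        return "check worker and message bus health"
    if hot_is_near_limit and s.p95_latency_ms > latency_budget_ms:
        # A full hot tier with slow responses points outside the pool.
        return "suspect external bottleneck (model provider)"
    return "no pool-level signal"
```

For example, `triage(PoolSnapshot(19, 20, 3500.0, 0))` points at the model provider, while `triage(PoolSnapshot(2, 20, 400.0, 7))` points at worker or bus health.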
Guarantees
- Load/unload handlers validate that the instance belongs to the caller’s organization before publishing.
- Chat returns `503` when an instance is not loaded in the pool. Trigger a load and wait for the worker to process the event before retrying.
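A client can handle the 503 case with a small retry loop, sketched here under assumptions. `send_chat` and `trigger_load` are hypothetical callables that wrap your own HTTP calls and return status codes; the attempt count and delay are illustrative, not recommended values.

```python
import time
from typing import Callable


def chat_with_load_retry(
    send_chat: Callable[[], int],
    trigger_load: Callable[[], int],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a chat; on 503, publish one load event and retry up to `attempts` times."""
    load_requested = False
    status = send_chat()
    for _ in range(attempts):
        if status != 503:
            return status
        if not load_requested:
            trigger_load()  # expected to return 202 once the event is published
            load_requested = True
        sleep(delay_s)  # give the worker time to process the event
        status = send_chat()
    return status  # still 503 if the instance never became ready
```

The load is requested only once, because repeated load calls would publish duplicate events without making the worker any faster. If the final status is still 503, the Troubleshooting entries on worker and message bus health apply.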
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| 503 not loaded on chat | Instance not in pool | Call load; wait for worker to process the event |
| Load returns 202 but instance never appears | Worker health or message bus issue | Check worker logs and message bus connectivity |
| Memory growth | Too many hot instances | Check pool stats; reduce hot tier ceiling or unload idle instances |