Hot-reload and AI App pool

Orchestrator

Pool tiers, async load/unload, and pool statistics.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

Understand pool tiers (hot, demand) as capacity and responsiveness trade-offs.

Business analysts

Describe when AI Apps load or evict for user-visible latency stories.

Solution architects

Relate pool behavior to process memory, events, and optional RabbitMQ messaging.

Developers

Follow pool load/unload APIs and orchestrator factory integration points.

Testers

Verify tier transitions, reload paths, and stats endpoints under load.

The AI App pool (implemented as OrchestratorPool) manages running instances across two tiers: hot (resident) and demand (loaded on use, evicted by TTL). Load and unload API calls are asynchronous: each publishes an event that a runtime worker consumes to bring the instance up or tear it down. The HTTP call returns 202 Accepted as soon as the event is published; the instance is not ready until the worker has processed that event.
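The publish-then-consume flow can be illustrated with a small in-memory sketch. It is not the OrchestratorPool implementation: the asyncio.Queue stands in for the message bus, the dict stands in for the pool, and publish_load, worker, and demo are hypothetical names chosen for illustration.

```python
import asyncio


async def publish_load(bus: asyncio.Queue, instance_id: str, tier: str) -> dict:
    """Publish a load event and return at once (the 202 Accepted step)."""
    await bus.put({"type": "load", "instance_id": instance_id, "tier": tier})
    return {"status": 202, "message": "Load event published", "instance_id": instance_id}


async def worker(bus: asyncio.Queue, pool: dict) -> None:
    """Consume events until a None sentinel arrives, updating the pool."""
    while True:
        event = await bus.get()
        if event is None:
            break
        if event["type"] == "load":
            pool[event["instance_id"]] = event["tier"]
        elif event["type"] == "unload":
            pool.pop(event["instance_id"], None)


async def demo() -> tuple[dict, bool, bool]:
    bus: asyncio.Queue = asyncio.Queue()  # hypothetical stand-in for the bus
    pool: dict[str, str] = {}             # hypothetical stand-in for the pool
    response = await publish_load(bus, "inst-1", "hot")
    ready_at_response = "inst-1" in pool  # the worker has not run yet
    task = asyncio.create_task(worker(bus, pool))
    await bus.put(None)                   # stop the worker after the backlog
    await task
    return response, ready_at_response, "inst-1" in pool
```

Running asyncio.run(demo()) shows the gap the text describes: the response already carries status 202 while the instance is still absent from the pool, and it only appears once the worker has drained the event.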

How pool tiers work

Tier	Behavior	When to use
`hot`	Resident in the hot pool, lowest latency	High-traffic or SLA-sensitive AI Apps
`demand`	Loaded into the demand pool on use; evicted when its TTL expires	AI Apps that can trade first-request load latency for lower baseline memory

Tier is stored on the instance record and read when publishing the load event (see cadence/api/orchestrator/lifecycle.py).
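To make the difference between the two tiers concrete, here is a minimal sketch of a two-tier store. It assumes the demand TTL is measured from last use and that expiry is checked lazily on lookup; the source does not confirm either detail. TieredPool and its methods are hypothetical names, not the OrchestratorPool API.

```python
import time
from typing import Callable, Optional


class TieredPool:
    """Hypothetical sketch: hot entries stay resident, demand entries expire."""

    def __init__(self, demand_ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = demand_ttl_s
        self._clock = clock
        self._hot: dict[str, object] = {}
        # instance_id -> (instance, time of last use)
        self._demand: dict[str, tuple[object, float]] = {}

    def load(self, instance_id: str, instance: object, tier: str) -> None:
        if tier == "hot":
            self._demand.pop(instance_id, None)
            self._hot[instance_id] = instance
        elif tier == "demand":
            self._hot.pop(instance_id, None)
            self._demand[instance_id] = (instance, self._clock())
        else:
            raise ValueError(f"unknown tier: {tier!r}")

    def get(self, instance_id: str) -> Optional[object]:
        if instance_id in self._hot:
            return self._hot[instance_id]
        entry = self._demand.get(instance_id)
        if entry is None:
            return None
        instance, last_used = entry
        if self._clock() - last_used > self._ttl:
            del self._demand[instance_id]  # TTL expired: evict on lookup
            return None
        self._demand[instance_id] = (instance, self._clock())  # refresh on use
        return instance

    def stats(self) -> dict[str, int]:
        # Counts entries as stored; expired demand entries are only removed by get().
        return {"hot": len(self._hot), "demand": len(self._demand)}
```

The injectable clock is there so that eviction can be exercised in tests without sleeping.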

Loading and unloading instances

Create the AI App with a default tier (often demand, until traffic justifies hot).
When you need predictable latency, trigger an explicit load to hot before a demo or traffic spike.
After large config or plugin changes, unload and then reload the instance to ensure clean state.
Monitor pool stats during incidents; see Monitoring.
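Because load and unload are asynchronous, an unload-then-reload step has to wait for each event to take effect before moving on. The sketch below shows one way to sequence that from a client. The three callables are hypothetical stand-ins: unload and load would wrap the corresponding HTTP calls, and is_loaded would wrap whatever readiness signal your deployment exposes, which this page does not specify.

```python
import time
from typing import Callable


def reload_instance(
    unload: Callable[[], None],
    load: Callable[[], None],
    is_loaded: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Unload, wait until the instance is gone, load, wait until it is back."""

    def wait_until(expected: bool) -> bool:
        deadline = clock() + timeout_s
        while clock() < deadline:
            if is_loaded() == expected:
                return True
            sleep(poll_s)
        return is_loaded() == expected  # one last check at the deadline

    unload()
    if not wait_until(False):
        return False  # unload never took effect; do not publish a load on top
    load()
    return wait_until(True)
```

A return value of False means one of the two waits timed out, which usually points at the worker or the message bus rather than the API.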

Python (server)

Load validates org access before publishing the event. The HTTP response returns after the event is published, not after the instance is ready.

@router.post("/{instance_id}/load", status_code=status.HTTP_202_ACCEPTED)
@audit_log("Publishing load event for instance {instance_id} (source=api_load)")
@publish_after("load", _load_payload)
async def load_orchestrator(
    org_id: str,
    instance_id: str,
    request: Request = None,
    security: SecurityContext = Depends(roles_allowed(ORG_ORCHESTRATORS_LIFECYCLE)),
    event_publisher=Depends(get_event_publisher),
):
    context = org_context(security, org_id)
    settings_service: SettingsService = request.app.state.settings_service
    instance = await settings_service.get_instance_config(instance_id)
    # Check that the instance belongs to the caller's organization.
    validate_orchestrator_access(instance, instance_id, context.org_id)
    # Tier comes from the instance record; falls back to "hot" when unset.
    tier = instance.get("tier", "hot")
    # Sent with 202 Accepted: the event is published, the instance is not ready yet.
    return {
        "message": "Load event published",
        "instance_id": instance_id,
        "tier": tier,
    }

TypeScript (UI)

The Admin pool dashboard polls the stats endpoint on a fixed interval to show current tier counts.

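const { data: stats, refresh } = await useApiFetch<PoolStatsResponse>('/api/admin/pool/stats');
const timer = ref<ReturnType<typeof setInterval>>();

onMounted(() => {
  // Re-fetch pool stats on a fixed interval.
  timer.value = setInterval(() => refresh(), POOL_STATS_REFRESH_MS);
});

// Stop polling when the component is torn down.
onBeforeUnmount(() => clearInterval(timer.value));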

Pool statistics

The Admin → Pool dashboard polls GET /api/admin/pool/stats and renders cards for total instances, tier counts, and shared model/bundle counts.

If the hot count is near your policy limit but latency is still high, the bottleneck is likely external (for example, the model provider) rather than pool size; correlate with Observability. If users see “not loaded” errors while the hot count is low, workers may be failing or the message bus may be stalled; check worker health and bus health.
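The two rules of thumb above can be written down as a small triage helper. This is a sketch only: PoolStats is a hypothetical shape rather than the actual PoolStatsResponse, and the 0.9 and 0.5 thresholds for "near the limit" and "low" are assumptions to tune against your own policy.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Hypothetical stats shape; the real response fields may differ."""
    total_instances: int
    hot_count: int
    demand_count: int


def diagnose(
    stats: PoolStats,
    hot_limit: int,
    latency_high: bool,
    not_loaded_errors: bool,
    near_limit_ratio: float = 0.9,  # assumed threshold for "near the limit"
    low_ratio: float = 0.5,         # assumed threshold for "hot count is low"
) -> list[str]:
    """Return triage hints that mirror the two rules of thumb."""
    if hot_limit <= 0:
        raise ValueError("hot_limit must be positive")
    usage = stats.hot_count / hot_limit
    hints: list[str] = []
    if usage >= near_limit_ratio and latency_high:
        hints.append("external bottleneck likely: check the model provider")
    if usage <= low_ratio and not_loaded_errors:
        hints.append("check worker health and message bus health")
    return hints
```

An empty list means neither pattern matched, not that the pool is healthy.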

Guarantees

Load/unload handlers validate that the instance belongs to the caller’s organization before publishing.
Chat returns 503 when an instance is not loaded in the pool. Trigger a load and wait for the worker to process the event before retrying.
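A client can handle the 503 case by requesting a load once and then retrying with backoff. The sketch below assumes every 503 means "not loaded", which may not hold in practice (inspect the response body if your deployment distinguishes causes). send_chat and trigger_load are hypothetical callables wrapping the chat and load requests.

```python
import time
from typing import Callable


def chat_with_load_retry(
    send_chat: Callable[[], tuple[int, str]],
    trigger_load: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Send a chat request; on 503, request a load once and retry with backoff."""
    load_requested = False
    status, body = send_chat()
    for attempt in range(1, max_attempts):
        if status != 503:
            break
        if not load_requested:
            trigger_load()  # publish the load event only once
            load_requested = True
        sleep(base_delay_s * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
        status, body = send_chat()
    return status, body
```

The function makes at most max_attempts chat calls in total and returns the last response, so a caller still sees the 503 if the instance never becomes ready.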

Troubleshooting

Symptom	Cause	Fix
`503 not loaded` on chat	Instance not in pool	Call load; wait for worker to process the event
Load returns `202` but instance never appears	Worker health or message bus issue	Check worker logs and message bus connectivity
Memory growth	Too many hot instances	Check pool stats; reduce hot tier ceiling or unload idle instances

Next steps

Monitoring: Health endpoints, pool stats, and operational troubleshooting.

Real-time streaming: How 503 not-loaded errors appear in the streaming path and how to handle them.