Hot-reload and AI App pool

Orchestrator

Pool tiers, async load/unload, and pool statistics.

Intended audience: Stakeholders, Business analysts, Solution architects, Developers, Testers

Learning outcomes by role

Stakeholders

  • Understand pool tiers (hot, demand) as capacity and responsiveness trade-offs.

Business analysts

  • Describe when AI Apps load or evict for user-visible latency stories.

Solution architects

  • Relate pool behavior to process memory, events, and optional RabbitMQ messaging.

Developers

  • Follow pool load/unload APIs and orchestrator factory integration points.

Testers

  • Verify tier transitions, reload paths, and stats endpoints under load.

The AI App pool (implemented as OrchestratorPool) manages running instances across hot (resident) and demand (TTL-backed on-demand) storage. Load and unload API calls are asynchronous — they publish events that a runtime worker consumes to bring instances up or tear them down. The HTTP call returns 202 Accepted after publishing; the instance is not ready until the worker processes the event.
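The gap between "event published" and "instance ready" can be illustrated with a small in-process sketch. This is not the OrchestratorPool implementation: an `asyncio.Queue` stands in for the event bus, a `set` stands in for the pool, and the names `publish_load`, `worker`, and `demo` are hypothetical.

```python
import asyncio


async def publish_load(queue: asyncio.Queue, instance_id: str) -> dict:
    """Stand-in for the HTTP handler: enqueue the event and return immediately."""
    await queue.put({"type": "load", "instance_id": instance_id})
    # The caller gets its 202 here; nothing has been loaded yet.
    return {"status": 202, "message": "Load event published"}


async def worker(queue: asyncio.Queue, pool: set) -> None:
    """Stand-in for the runtime worker: apply events to the pool one at a time."""
    while True:
        event = await queue.get()
        await asyncio.sleep(0.01)  # simulated cost of bringing an instance up or down
        if event["type"] == "load":
            pool.add(event["instance_id"])
        elif event["type"] == "unload":
            pool.discard(event["instance_id"])
        queue.task_done()


async def demo() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    pool: set = set()
    task = asyncio.create_task(worker(queue, pool))

    response = await publish_load(queue, "inst-1")
    ready_at_response = "inst-1" in pool   # the worker has not processed the event yet

    await queue.join()                     # wait until the worker has handled it
    ready_after_worker = "inst-1" in pool

    task.cancel()
    return response["status"], ready_at_response, ready_after_worker


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the instance is absent from the pool when the 202 is returned and present only once the worker has drained the queue, which is the same ordering a client of the real API has to plan for.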

| Tier | Behavior | When to use |
| --- | --- | --- |
| hot | Resident in the hot pool, lowest latency | High-traffic or SLA-sensitive AI Apps |
| demand | Loaded into the demand pool on use; evicted on TTL | Lower baseline memory; first request may pay load latency |

Tier is stored on the instance record and read when publishing the load event (see cadence/api/orchestrator/lifecycle.py).
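To make the two tiers concrete, here is a minimal sketch of a pool with a resident hot map and a TTL-backed demand map. It is an illustration under stated assumptions, not the OrchestratorPool code: the class name `TwoTierPool`, the lazy eviction on read, and the choice to refresh the TTL on each use are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TwoTierPool:
    """Illustrative only: hot entries stay resident, demand entries expire after a TTL."""

    def __init__(self, demand_ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._hot: Dict[str, object] = {}
        self._demand: Dict[str, Tuple[object, float]] = {}  # id -> (instance, expires_at)
        self._ttl = demand_ttl_seconds
        self._clock = clock

    def load(self, instance_id: str, instance: object, tier: str) -> None:
        """Place an instance in exactly one tier."""
        if tier == "hot":
            self._demand.pop(instance_id, None)
            self._hot[instance_id] = instance
        elif tier == "demand":
            self._hot.pop(instance_id, None)
            self._demand[instance_id] = (instance, self._clock() + self._ttl)
        else:
            raise ValueError(f"unknown tier: {tier}")

    def get(self, instance_id: str) -> Optional[object]:
        """Return the instance, or None if it is not loaded or its TTL has expired."""
        if instance_id in self._hot:
            return self._hot[instance_id]
        entry = self._demand.get(instance_id)
        if entry is None:
            return None
        instance, expires_at = entry
        now = self._clock()
        if now >= expires_at:
            del self._demand[instance_id]  # lazy TTL eviction on read
            return None
        # Assumption: each use extends the entry's lifetime.
        self._demand[instance_id] = (instance, now + self._ttl)
        return instance

    def stats(self) -> Dict[str, int]:
        """Counts per tier; expired demand entries count until a read evicts them."""
        hot, demand = len(self._hot), len(self._demand)
        return {"hot": hot, "demand": demand, "total": hot + demand}
```

The trade-off from the table shows up directly: hot entries cost memory for as long as they are loaded, while a demand entry that has expired returns `None` and has to be loaded again before it can serve a request.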

  1. Create the AI App with a tier default (often demand until traffic proves need).
  2. When you need predictable latency, trigger an explicit load to hot before a demo or traffic spike.
  3. After large config or plugin changes, unload then reload to ensure clean state.
  4. Monitor pool stats during incidents — see Monitoring.

Load validates org access before publishing the event. The HTTP response returns after the event is published, not after the instance is ready.

cadence/api/orchestrator/lifecycle.py
@router.post("/{instance_id}/load", status_code=status.HTTP_202_ACCEPTED)
@audit_log("Publishing load event for instance {instance_id} (source=api_load)")
@publish_after("load", _load_payload)
async def load_orchestrator(
    org_id: str,
    instance_id: str,
    request: Request = None,
    security: SecurityContext = Depends(roles_allowed(ORG_ORCHESTRATORS_LIFECYCLE)),
    event_publisher=Depends(get_event_publisher),
):
    context = org_context(security, org_id)
    settings_service: SettingsService = request.app.state.settings_service
    instance = await settings_service.get_instance_config(instance_id)
    # Reject the call unless the instance belongs to the caller's organization.
    validate_orchestrator_access(instance, instance_id, context.org_id)
    # Falls back to "hot" when the instance record has no tier set.
    tier = instance.get("tier", "hot")
    # This body is returned with 202 Accepted; the instance is not loaded yet.
    return {
        "message": "Load event published",
        "instance_id": instance_id,
        "tier": tier,
    }

The Admin → Pool dashboard polls GET /api/admin/pool/stats and renders cards for total instances, tier counts, and shared model/bundle counts.

If the hot instance count is near your policy limit but latency is still high, the bottleneck is likely external (for example, the model provider) rather than pool size; correlate with Observability. If users see “not loaded” errors while the hot count is low, workers may be failing or the message bus may be stalled; check worker health and bus health.
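The two rules of thumb above can be written down as a small triage helper. This is a sketch only: the `PoolSnapshot` fields, the 90% "near limit" ratio, and the latency budget are hypothetical values chosen for illustration, not the shape of the stats endpoint's response or any documented threshold.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Hypothetical summary of pool and traffic metrics at one point in time."""
    hot_count: int
    hot_limit: int
    p95_latency_ms: float
    not_loaded_errors: int


def triage(snapshot: PoolSnapshot,
           latency_budget_ms: float = 2000.0,
           near_limit_ratio: float = 0.9) -> str:
    """Map a snapshot to the first thing worth checking."""
    hot_near_limit = (
        snapshot.hot_limit > 0
        and snapshot.hot_count >= near_limit_ratio * snapshot.hot_limit
    )
    # "Not loaded" errors with spare hot capacity point at the load path itself.
    if snapshot.not_loaded_errors > 0 and not hot_near_limit:
        return "check worker health and message bus health"
    # A nearly full hot tier with slow responses points outside the pool.
    if hot_near_limit and snapshot.p95_latency_ms > latency_budget_ms:
        return "check external dependencies such as the model provider"
    return "no pool-level action suggested"
```

For example, `triage(PoolSnapshot(hot_count=2, hot_limit=20, p95_latency_ms=300.0, not_loaded_errors=7))` points at the workers and the bus, because instances are not appearing even though the hot tier has room.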

  • Load/unload handlers validate that the instance belongs to the caller’s organization before publishing.
  • Chat returns 503 when an instance is not loaded in the pool. Trigger a load and wait for the worker to process the event before retrying.
| Symptom | Cause | Fix |
| --- | --- | --- |
| 503 not loaded on chat | Instance not in pool | Call load; wait for worker to process the event |
| Load returns 202 but instance never appears | Worker health or message bus issue | Check worker logs and message bus connectivity |
| Memory growth | Too many hot instances | Check pool stats; reduce hot tier ceiling or unload idle instances |
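The recovery path for a 503 on chat (trigger a load, wait for the worker, retry) might look like the following on the client side. This is a hedged sketch: `send_chat` and `trigger_load` are hypothetical callables that the caller supplies and that return HTTP status codes, and the attempt count and backoff delays are arbitrary illustrative values. It also assumes every 503 means "not loaded", which a real client would want to confirm from the response body.

```python
import time
from typing import Callable


def chat_with_load_retry(
    send_chat: Callable[[], int],
    trigger_load: Callable[[], int],
    max_attempts: int = 5,
    initial_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a chat request; on 503, publish a load event and retry with backoff.

    Returns the last status code seen.
    """
    status = send_chat()
    if status != 503:
        return status

    # A 202 only confirms the event was published, not that the instance is ready.
    load_status = trigger_load()
    if load_status != 202:
        raise RuntimeError(f"load request was not accepted: HTTP {load_status}")

    delay = initial_delay
    for _ in range(max_attempts):
        sleep(delay)          # give the worker time to process the load event
        status = send_chat()
        if status != 503:
            return status
        delay *= 2            # exponential backoff between retries
    return status
```

If the function still returns 503 after the final attempt, the second row of the table applies: the load was accepted but the instance never appeared, so worker logs and message bus connectivity are the next things to check.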