Overview
The main runtime loop, per-domain trainers, rollout engines, train stack, and sync boundaries.
UniRL is organized around one Hydra-driven training loop. This page is a
navigation overview; the code-adjacent contracts are generated from
unirl/README.md and from the package READMEs, which are embedded as pages in each sidebar section.
unirl.train_diffusion | train_vlm | train_pe | train_unified_model
-> register and validate Hydra config
-> <Domain>Trainer acquires a Ray DevicePool (placement)
-> trainer builds the rollout workers and train workers
-> loop: rollout -> reward -> advantage -> train -> optional weight sync

Each domain has its own entrypoint and trainer, all driven the same way. There
is a single training stack: the earlier v1 actor-group runtime has been retired,
and the single-controller Remote / placement layer is now the only path.
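
The phase ordering above can be sketched as a plain Python loop. This is a minimal illustration only: LoopHooks, run_loop, and sync_every are hypothetical names invented for this sketch and are not part of UniRL, and the real trainers dispatch these phases to Ray workers rather than calling local functions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Batch = Dict[str, Any]


@dataclass
class LoopHooks:
    """Hypothetical phase callbacks; these names are not UniRL APIs."""
    rollout: Callable[[int], Batch]
    reward: Callable[[Batch], Batch]
    advantage: Callable[[Batch], Batch]
    train: Callable[[Batch], None]
    sync: Callable[[], None]


def run_loop(hooks: LoopHooks, num_steps: int, sync_every: int = 1) -> List[str]:
    """Run rollout -> reward -> advantage -> train, then an optional sync.

    Returns the ordered list of phase names that ran, which makes the
    ordering easy to inspect. sync_every <= 0 disables the sync phase.
    """
    trace: List[str] = []
    for step in range(num_steps):
        batch = hooks.rollout(step)
        trace.append("rollout")
        batch = hooks.reward(batch)
        trace.append("reward")
        batch = hooks.advantage(batch)
        trace.append("advantage")
        hooks.train(batch)
        trace.append("train")
        # Weight sync is optional: here it runs every `sync_every` steps.
        if sync_every > 0 and (step + 1) % sync_every == 0:
            hooks.sync()
            trace.append("sync")
    return trace
```

Keeping each phase behind a callback mirrors the idea that every domain trainer is driven the same way, with only the phase implementations differing.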
Main References
- Code Architecture README for the module map and runtime data flow.
- Rollout Package README for rollout modes and request/response contracts.
- Train Stack README for the FSDP backend, the train-step contract, and structural injection.
- Algorithms Package README for per-track loss contracts and the reward→advantage path.
- SDE Package README for SDE strategy rules, schedules, and runtime kernels.
- Weight Sync README for dedicated rollout synchronization.
Data Flow
<Domain>Trainer
-> build RolloutReq, dispatch to the rollout engine
-> RolloutResp with tracks[name] (conditions, segments, rewards, media)
-> RewardService.score_and_attach -> track.rewards
-> RolloutTrack.compute_advantages -> track.advantages
-> TrainStack.train_track(track) shards across train workers, runs the mini-batch loop
-> optional weight sync back to dedicated rollout workers

RolloutReq and RolloutResp (in unirl/types/) are the important boundary
between rollout and training. New engines should adapt backend-specific outputs
into this typed boundary instead of leaking backend objects into training code.
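
One way to picture the adapter pattern is shown below. It is a sketch under stated assumptions: SketchTrack and SketchRolloutResp are hypothetical stand-ins that only borrow the field names listed in the data flow (conditions, segments, rewards, media), and the backend payload keys ("prompts", "samples", "images") are invented for illustration. The real definitions live in unirl/types/ and may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchTrack:
    """Hypothetical stand-in for one rollout track (not the UniRL type)."""
    conditions: List[Any]
    segments: List[Any]
    rewards: Optional[List[float]] = None  # attached later by the reward phase
    media: List[Any] = field(default_factory=list)


@dataclass
class SketchRolloutResp:
    """Hypothetical stand-in for the typed rollout response."""
    tracks: Dict[str, SketchTrack]


def adapt_backend_output(raw: Dict[str, Dict[str, Any]]) -> SketchRolloutResp:
    """Copy backend-specific payloads into typed tracks.

    The payload keys used here are assumptions made for this sketch.
    Only plain lists cross the boundary; no backend object is retained.
    """
    tracks: Dict[str, SketchTrack] = {}
    for name, payload in raw.items():
        tracks[name] = SketchTrack(
            conditions=list(payload["prompts"]),
            segments=list(payload["samples"]),
            media=list(payload.get("images", [])),
        )
    return SketchRolloutResp(tracks=tracks)
```

Training code written against the typed response never needs to know which engine produced it, which is the point of the boundary.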
The trainer (unirl/trainer/<domain>.py) owns the placement block, worker
construction, and phase ordering for rollout, reward, advantage, train, and sync.
For deployment modes and engine-specific requirements, use the generated Rollout Package README as the source of truth.
One Algorithm per Track
The training loss is a single per-track cfg.algorithm — a StageAlgorithm
such as unirl.algorithms.diffusion_grpo.DiffusionGRPO. There is no separate
driver-side "rollout control" object:
- reward→advantage z-scoring lives on
RolloutTrack.compute_advantages (unirl/types/rollout_resp.py);
- SDE-index selection lives on
DiffusionSamplingParams.resolve_sde_indices (unirl/types/sampling.py).
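
For readers unfamiliar with reward z-scoring, a minimal sketch follows. It assumes rewards are normalized within fixed-size groups of consecutive samples (as in GRPO-style methods) using the population standard deviation; the function name zscore_advantages, the group_size parameter, and the eps value are hypothetical, and the actual RolloutTrack.compute_advantages may group or normalize differently.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def zscore_advantages(
    rewards: Sequence[float], group_size: int, eps: float = 1e-6
) -> List[float]:
    """Z-score rewards within consecutive groups of `group_size` samples.

    Assumption for this sketch: each group holds the samples that share
    one prompt, and the population standard deviation is used.
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a positive multiple of group_size")
    advantages: List[float] = []
    for start in range(0, len(rewards), group_size):
        group = list(rewards[start:start + group_size])
        mean = fmean(group)
        std = pstdev(group)
        # eps keeps the division finite when every reward in a group is equal.
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages
```

A group whose rewards are all identical yields zero advantages, so such samples contribute no learning signal under this scheme.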
A single-track recipe binds one cfg.algorithm; a multi-track recipe (for
example PE) nests one algorithm: node per track and runs sibling
TrainStacks. See Trainer & Training Stack.
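
The difference between the two recipe shapes can be illustrated with plain dictionaries. This is a hedged sketch: the "tracks" key, the track names, and the example.* algorithm paths are placeholders invented here, and the actual Hydra layout in UniRL recipes may be organized differently. Only unirl.algorithms.diffusion_grpo.DiffusionGRPO comes from the text above.

```python
from typing import Any, Dict, Mapping


def algorithms_by_track(cfg: Mapping[str, Any]) -> Dict[str, str]:
    """Map each track name to its algorithm path.

    Assumed layout (hypothetical): a single-track config has a top-level
    "algorithm" key; a multi-track config has a "tracks" mapping whose
    entries each carry their own "algorithm" node.
    """
    if "algorithm" in cfg:
        return {"default": str(cfg["algorithm"])}
    return {name: str(node["algorithm"]) for name, node in cfg["tracks"].items()}


# Single-track recipe: one algorithm bound at the top level.
single = {"algorithm": "unirl.algorithms.diffusion_grpo.DiffusionGRPO"}

# Multi-track recipe: one algorithm node per track (placeholder names).
multi = {
    "tracks": {
        "track_a": {"algorithm": "example.AlgorithmA"},
        "track_b": {"algorithm": "example.AlgorithmB"},
    }
}
```

In the multi-track case, each entry of the returned mapping would correspond to one sibling TrainStack.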
Roadmap
Near-term direction spans three tracks — Infra, Algorithm, and Model. See the Roadmap.