Overview

The main runtime loop, per-domain trainers, rollout engines, train stack, and sync boundaries.

UniRL is organized around one Hydra-driven training loop. This page is a navigation overview; the code-adjacent contracts are generated from unirl/README.md and from the package READMEs, which are embedded as pages in each sidebar section.

unirl.train_diffusion | train_vlm | train_pe | train_unified_model
  -> register and validate Hydra config
  -> <Domain>Trainer acquires a Ray DevicePool (placement)
  -> trainer builds the rollout workers and train workers
  -> loop: rollout -> reward -> advantage -> train -> optional weight sync

Each domain has its own entrypoint and trainer, all driven the same way. There is a single training stack: the earlier v1 actor-group runtime has been retired, and the single-controller Remote / placement layer is now the only path.

Main References

Code Architecture README for the module map and runtime data flow.
Rollout Package README for rollout modes and request/response contracts.
Train Stack README for the FSDP backend, the train-step contract, and structural injection.
Algorithms Package README for per-track loss contracts and the reward→advantage path.
SDE Package README for SDE strategy rules, schedules, and runtime kernels.
Weight Sync README for dedicated rollout synchronization.

Data Flow

<Domain>Trainer
  -> build RolloutReq, dispatch to the rollout engine
  -> RolloutResp with tracks[name] (conditions, segments, rewards, media)
  -> RewardService.score_and_attach    -> track.rewards
  -> RolloutTrack.compute_advantages    -> track.advantages
  -> TrainStack.train_track(track)      shards across train workers, runs the mini-batch loop
  -> optional weight sync back to dedicated rollout workers

RolloutReq and RolloutResp (in unirl/types/) are the important boundary between rollout and training. New engines should adapt backend-specific outputs into this typed boundary instead of leaking backend objects into training code.
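The adapter idea above can be sketched as follows. This is a minimal illustration, not the real types: TypedTrack, TypedRolloutResp, adapt_backend_output, and the shape of the raw backend dict are all hypothetical stand-ins, and the actual RolloutResp fields in unirl/types/ may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TypedTrack:
    # Hypothetical stand-in for one entry of RolloutResp.tracks[name].
    conditions: List[str]
    media: List[Any]
    rewards: List[float] = field(default_factory=list)


@dataclass
class TypedRolloutResp:
    # Hypothetical stand-in for the typed rollout response.
    tracks: Dict[str, TypedTrack]


def adapt_backend_output(raw: Dict[str, Any], track_name: str = "image") -> TypedRolloutResp:
    """Convert an assumed backend dict into the typed response.

    Training code then only ever sees TypedRolloutResp, never `raw`.
    """
    prompts = list(raw["prompts"])
    outputs = list(raw["outputs"])
    if len(prompts) != len(outputs):
        raise ValueError("backend returned mismatched prompts and outputs")
    track = TypedTrack(conditions=prompts, media=outputs)
    return TypedRolloutResp(tracks={track_name: track})
```

Validating at the boundary, as the length check does here, keeps backend quirks from surfacing later inside the train step.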

The trainer (unirl/trainer/<domain>.py) owns the placement block, worker construction, and phase ordering for rollout, reward, advantage, train, and sync.
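The phase ordering can be pictured with a small, self-contained sketch. Everything here is illustrative: LoopTrack and run_iteration are hypothetical names, the callables stand in for the rollout engine, RewardService, and TrainStack, and the advantage step is a plain mean-centering placeholder rather than the real computation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LoopTrack:
    # Hypothetical stand-in for a rollout track; not the real RolloutTrack.
    samples: List[str]
    rewards: List[float] = field(default_factory=list)
    advantages: List[float] = field(default_factory=list)


def run_iteration(
    rollout: Callable[[], Dict[str, LoopTrack]],
    score: Callable[[LoopTrack], List[float]],
    train: Callable[[LoopTrack], None],
    sync: Optional[Callable[[], None]] = None,
) -> List[str]:
    """Run one iteration in the documented order; return the phases executed."""
    phases = ["rollout"]
    tracks = rollout()

    for track in tracks.values():
        track.rewards = score(track)
    phases.append("reward")

    for track in tracks.values():
        mean = sum(track.rewards) / len(track.rewards)
        # Placeholder advantage: mean-centering only.
        track.advantages = [r - mean for r in track.rewards]
    phases.append("advantage")

    for track in tracks.values():
        train(track)
    phases.append("train")

    # Weight sync is optional, matching the last step of the flow above.
    if sync is not None:
        sync()
        phases.append("sync")
    return phases
```

Passing sync=None models a setup with no dedicated rollout workers to update; the other four phases always run in the same order.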

For deployment modes and engine-specific requirements, use the generated Rollout Package README as the source of truth.

One Algorithm per Track

The training loss is a single per-track cfg.algorithm: a StageAlgorithm such as unirl.algorithms.diffusion_grpo.DiffusionGRPO. There is no separate driver-side "rollout control" object; the two concerns it would cover live on the typed objects instead:

reward→advantage z-scoring lives on RolloutTrack.compute_advantages (unirl/types/rollout_resp.py);
SDE-index selection lives on DiffusionSamplingParams.resolve_sde_indices (unirl/types/sampling.py).
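As a rough illustration of reward→advantage z-scoring, the sketch below normalizes a flat list of rewards to zero mean and unit variance. The function name zscore_advantages and the eps default are assumptions, and the real RolloutTrack.compute_advantages may group rewards (for example per prompt) or handle degenerate cases differently.

```python
import math
from typing import List


def zscore_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Return (r - mean) / (std + eps) for each reward.

    Uses the population standard deviation; eps (an assumed value) avoids
    division by zero when every reward in the group is identical.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]
```

With identical rewards every advantage is zero, so such a group contributes no gradient signal under this scheme.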

A single-track recipe binds one cfg.algorithm; a multi-track recipe (for example PE) nests one algorithm: node per track and runs sibling TrainStacks. See Trainer & Training Stack.
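The difference between the two recipe shapes can be sketched with plain dictionaries. The layout is an assumption made for illustration: the "tracks" key, the "default" track name, the helper algorithms_by_track, and the example.* targets are hypothetical, and the actual Hydra config schema may nest the per-track algorithm nodes differently.

```python
from typing import Any, Dict, Mapping


def algorithms_by_track(cfg: Mapping[str, Any]) -> Dict[str, Any]:
    """Return {track_name: algorithm_node} for either recipe shape."""
    if "algorithm" in cfg:
        # Single-track recipe: one top-level algorithm node.
        return {"default": cfg["algorithm"]}
    # Multi-track recipe: assumed "tracks" mapping with one algorithm per entry.
    tracks = cfg.get("tracks", {})
    missing = [name for name, node in tracks.items() if "algorithm" not in node]
    if missing:
        raise ValueError(f"tracks without an algorithm node: {missing}")
    return {name: node["algorithm"] for name, node in tracks.items()}


single = {"algorithm": {"_target_": "unirl.algorithms.diffusion_grpo.DiffusionGRPO"}}
multi = {
    "tracks": {
        # Both targets below are placeholders, not real UniRL classes.
        "text": {"algorithm": {"_target_": "example.TextAlgorithm"}},
        "image": {"algorithm": {"_target_": "example.ImageAlgorithm"}},
    }
}
```

Each entry returned for the multi-track case would correspond to one sibling TrainStack.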

Roadmap

Near-term direction spans three tracks: Infra, Algorithm, and Model. See the Roadmap.
