
Overview

The main runtime loop, per-domain trainers, rollout engines, train stack, and sync boundaries.

UniRL is organized around one Hydra-driven training loop. This page is a navigation overview. The code-adjacent contract pages are generated from unirl/README.md and from the package READMEs, which are embedded as pages in each sidebar section.

unirl.train_diffusion | train_vlm | train_pe | train_unified_model
  -> register and validate Hydra config
  -> <Domain>Trainer acquires a Ray DevicePool (placement)
  -> trainer builds the rollout workers and train workers
  -> loop: rollout -> reward -> advantage -> train -> optional weight sync

Each domain has its own entrypoint and trainer, all driven the same way. There is a single training stack: the earlier v1 actor-group runtime has been retired, and the single-controller Remote / placement layer is now the only path.
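The phase ordering above can be sketched as a plain Python loop. This is an illustrative sketch only: LoopPhases, run_loop, and sync_every are hypothetical names invented for this example, not UniRL's trainer API, and the real trainers also handle placement and worker construction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LoopPhases:
    """Hypothetical bundle of phase callables; not UniRL's real trainer API."""
    rollout: Callable[[int], dict]
    reward: Callable[[dict], None]
    advantage: Callable[[dict], None]
    train: Callable[[dict], None]
    sync: Callable[[], None]


def run_loop(phases: LoopPhases, num_steps: int, sync_every: int = 0) -> None:
    """Run the phases in the documented order; sync_every=0 disables weight sync."""
    for step in range(num_steps):
        batch = phases.rollout(step)
        phases.reward(batch)
        phases.advantage(batch)
        phases.train(batch)
        # Weight sync is optional, so it only runs on the configured cadence.
        if sync_every and (step + 1) % sync_every == 0:
            phases.sync()


# Usage: record the phase order with stub callables.
log: List[str] = []
phases = LoopPhases(
    rollout=lambda step: log.append("rollout") or {"step": step},
    reward=lambda batch: log.append("reward"),
    advantage=lambda batch: log.append("advantage"),
    train=lambda batch: log.append("train"),
    sync=lambda: log.append("sync"),
)
run_loop(phases, num_steps=2, sync_every=2)
```

With sync_every=2, the stub log shows two full rollout/reward/advantage/train passes followed by a single sync.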

Data Flow

<Domain>Trainer
  -> build RolloutReq, dispatch to the rollout engine
  -> RolloutResp with tracks[name] (conditions, segments, rewards, media)
  -> RewardService.score_and_attach    -> track.rewards
  -> RolloutTrack.compute_advantages    -> track.advantages
  -> TrainStack.train_track(track)      shards across train workers, runs the mini-batch loop
  -> optional weight sync back to dedicated rollout workers

RolloutReq and RolloutResp (in unirl/types/) are the important boundary between rollout and training. New engines should adapt backend-specific outputs into this typed boundary instead of leaking backend objects into training code.
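A minimal sketch of what adapting backend output into a typed boundary might look like. TrackSketch, RolloutRespSketch, and adapt_backend_output are hypothetical stand-ins written for this example; only the field names conditions, segments, rewards, and media come from the data flow above, and the real RolloutResp in unirl/types/ may be shaped differently. The raw record keys ("prompt", "tokens") are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TrackSketch:
    """Hypothetical stand-in for one entry of RolloutResp.tracks."""
    conditions: List[str]
    segments: List[Any]
    rewards: List[float] = field(default_factory=list)
    media: List[Any] = field(default_factory=list)


@dataclass
class RolloutRespSketch:
    """Hypothetical stand-in for RolloutResp; the real type lives in unirl/types/."""
    tracks: Dict[str, TrackSketch]


def adapt_backend_output(raw: List[dict], track_name: str) -> RolloutRespSketch:
    """Copy plain data out of backend records so no backend object crosses the boundary."""
    track = TrackSketch(
        conditions=[record["prompt"] for record in raw],
        segments=[list(record["tokens"]) for record in raw],
    )
    return RolloutRespSketch(tracks={track_name: track})


# Usage: two backend records (assumed shape) become one typed track.
raw_outputs = [
    {"prompt": "a cat", "tokens": (1, 2, 3)},
    {"prompt": "a dog", "tokens": (4, 5)},
]
resp = adapt_backend_output(raw_outputs, track_name="image")
```

Training code in this sketch only ever sees the dataclasses, so swapping the backend changes the adapter and nothing downstream.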

The trainer (unirl/trainer/<domain>.py) owns the placement block, worker construction, and phase ordering for rollout, reward, advantage, train, and sync.

For deployment modes and engine-specific requirements, use the generated Rollout Package README as the source of truth.

One Algorithm per Track

The training loss is defined by a single per-track cfg.algorithm, which is a StageAlgorithm such as unirl.algorithms.diffusion_grpo.DiffusionGRPO. There is no separate driver-side "rollout control" object:

  • reward→advantage z-scoring lives on RolloutTrack.compute_advantages (unirl/types/rollout_resp.py);
  • SDE-index selection lives on DiffusionSamplingParams.resolve_sde_indices (unirl/types/sampling.py).
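To make the reward-to-advantage z-scoring concrete, here is a generic sketch. It assumes rewards are normalized within groups (for example, samples that share a prompt, as is common in GRPO-style methods), uses the population standard deviation, and adds a small eps to avoid division by zero. All three choices are assumptions, and zscore_advantages is a hypothetical helper, not the body of RolloutTrack.compute_advantages.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscore_advantages(
    rewards: List[float], group_ids: List[int], eps: float = 1e-6
) -> List[float]:
    """Z-score each reward against the mean and std of its own group."""
    groups: Dict[int, List[float]] = {}
    for reward, group in zip(rewards, group_ids):
        groups.setdefault(group, []).append(reward)
    # Per-group (mean, population std); eps keeps constant-reward groups finite.
    stats = {group: (mean(vals), pstdev(vals)) for group, vals in groups.items()}
    return [
        (reward - stats[group][0]) / (stats[group][1] + eps)
        for reward, group in zip(rewards, group_ids)
    ]


# Usage: group 0 has spread, group 1 has identical rewards.
advantages = zscore_advantages([1.0, 3.0, 2.0, 2.0], group_ids=[0, 0, 1, 1])
```

In this example the two samples in group 0 receive advantages of roughly -1 and +1, while both samples in group 1 receive 0 because their rewards are identical.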

A single-track recipe binds one cfg.algorithm; a multi-track recipe (for example PE) nests one algorithm: node per track and runs sibling TrainStacks. See Trainer & Training Stack.
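The difference between the two recipe shapes can be illustrated with plain dictionaries. This is a sketch under assumptions: the "tracks" key, the track names, the example.algorithms.TextAlgo target, and the algorithms_by_track helper are all hypothetical, and the use of Hydra's _target_ convention for the algorithm node is assumed rather than confirmed by this page.

```python
from typing import Dict

# Single-track recipe: one top-level algorithm node.
single_track_cfg = {
    "algorithm": {"_target_": "unirl.algorithms.diffusion_grpo.DiffusionGRPO"},
}

# Multi-track recipe: one algorithm node nested under each (hypothetical) track.
multi_track_cfg = {
    "tracks": {
        "text": {"algorithm": {"_target_": "example.algorithms.TextAlgo"}},
        "image": {
            "algorithm": {"_target_": "unirl.algorithms.diffusion_grpo.DiffusionGRPO"}
        },
    },
}


def algorithms_by_track(cfg: dict, default_track: str = "default") -> Dict[str, str]:
    """Return a track-name -> algorithm-target mapping for either recipe shape."""
    if "algorithm" in cfg:
        return {default_track: cfg["algorithm"]["_target_"]}
    return {name: node["algorithm"]["_target_"] for name, node in cfg["tracks"].items()}


single = algorithms_by_track(single_track_cfg)
multi = algorithms_by_track(multi_track_cfg)
```

Each entry in the resulting mapping corresponds to one train stack in this sketch, which mirrors the idea of sibling TrainStacks in a multi-track recipe.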

Roadmap

Near-term direction spans three tracks — Infra, Algorithm, and Model. See the Roadmap.
