Trainer & Training Stack

The single-controller per-domain trainer, the FSDP train stack, and the flat conf recipe shape.

UniRL runs on one trainer architecture: a single-controller Remote / placement layer with per-domain trainers and a pluggable FSDP train stack. The earlier v1 actor-group runtime (and its conf_v1/ recipes) has been retired; everything below is the current, default path.

Entrypoints and trainers

Each domain has its own Hydra entrypoint and <Domain>Trainer (unirl/trainer/), all driven the same way:

Entrypoint                            Trainer                   Domain
python -m unirl.train_diffusion       trainer/diffusion.py      Diffusion image / video
python -m unirl.train_vlm             trainer/vlm.py            Autoregressive VLM / LLM
python -m unirl.train_pe              trainer/pe.py             Prompt-enhancer (joint AR + diffusion)
python -m unirl.train_unified_model   trainer/unified_model.py  HunyuanImage3 (mixed AR + diffusion)

The shared lifecycle lives in trainer/base.py: acquire a Ray DevicePool, build the rollout and train workers inside a placement block, then run the rollout → reward → advantage → train → optional weight-sync loop.
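The loop order described above can be sketched in a few lines of plain Python. This is only an illustration of the sequencing, not the code in trainer/base.py: LoopHooks, run_loop, and mean_centered_advantage are hypothetical names, the workers are replaced by local callables, and the advantage rule (reward minus batch mean) is a placeholder assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LoopHooks:
    """Hypothetical stand-ins for the rollout, reward, train, and sync workers."""
    rollout: Callable[[int], List[str]]
    reward: Callable[[List[str]], List[float]]
    train: Callable[[List[str], List[float]], float]
    weight_sync: Optional[Callable[[], None]] = None  # optional step


def mean_centered_advantage(rewards: List[float]) -> List[float]:
    """Placeholder advantage (an assumption): reward minus the batch mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def run_loop(hooks: LoopHooks, num_steps: int) -> List[float]:
    """Run rollout -> reward -> advantage -> train -> optional weight sync."""
    losses = []
    for step in range(num_steps):
        samples = hooks.rollout(step)
        rewards = hooks.reward(samples)
        advantages = mean_centered_advantage(rewards)
        losses.append(hooks.train(samples, advantages))
        if hooks.weight_sync is not None:
            hooks.weight_sync()
    return losses


# Toy usage: the reward is the sample length, the "loss" is the squared advantage sum.
sync_calls = []
hooks = LoopHooks(
    rollout=lambda step: [f"s{step}-a", f"s{step}-bb"],
    reward=lambda samples: [float(len(s)) for s in samples],
    train=lambda samples, adv: sum(a * a for a in adv),
    weight_sync=lambda: sync_calls.append(1),
)
losses = run_loop(hooks, num_steps=2)
```

Leaving weight_sync as None models a run where the optional sync step is skipped.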

Runtime model

The trainer opens one placement block and instantiates each component as a sibling Remote, passing siblings by handle:

<Domain>Trainer(num_devices, batch_size, ...cfg blocks...)
  with placement(pool, fraction=1.0, shared_workers=True):
    bundle    = remote(bundle_cfg)
    pipeline  = remote(pipeline_cfg, bundle=bundle)
    backend   = remote(backend_cfg, bundle=bundle)            # FSDPBackend
    rollout   = remote(rollout_cfg[, pipeline=pipeline])      # engine
    reward    = remote(reward_cfg)                            # RewardService
    algorithm = remote(algorithm_cfg, pipeline=pipeline)      # StageAlgorithm
    stack     = remote(stack_cfg, fsdp_backend=backend, algorithm=algorithm)
    sync      = remote(sync_cfg, backend=backend, rollout=rollout)   # dedicated rollout only

The single-controller layer (unirl/distributed/group/: Remote, placement, RankInfo) carries DP / TP / PP / SP / EP rank information, which later parallelism work (SP for long sequences, EP for MoE) plugs into. In trainside recipes the trained module is shared, so no sync is needed; recipes with a dedicated engine add a weight-sync bridge under sync.
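To make the "siblings passed by handle" idea concrete, here is a toy model of a placement block. It is a sketch under assumptions, not the unirl/distributed/group/ API: toy_placement, toy_remote, Placement, and Handle are all hypothetical, and nothing here talks to Ray. It only shows that components created inside one block land on the same device set and hold references to each other rather than copies.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List

_active: List["Placement"] = []  # stack of open placement blocks


@dataclass
class Handle:
    """Hypothetical handle to a component placed on a device range."""
    name: str
    devices: range
    deps: Dict[str, "Handle"]


@dataclass
class Placement:
    """Hypothetical record of what was created inside one block."""
    num_devices: int
    members: List[Handle] = field(default_factory=list)


@contextmanager
def toy_placement(num_devices: int):
    """Open a block; every toy_remote() call inside it shares its devices."""
    block = Placement(num_devices)
    _active.append(block)
    try:
        yield block
    finally:
        _active.pop()


def toy_remote(name: str, **siblings: Handle) -> Handle:
    """Create a component in the innermost block, wiring siblings by handle."""
    if not _active:
        raise RuntimeError("toy_remote() must be called inside a placement block")
    block = _active[-1]
    handle = Handle(name, range(block.num_devices), dict(siblings))
    block.members.append(handle)
    return handle


with toy_placement(num_devices=4) as block:
    bundle = toy_remote("bundle")
    pipeline = toy_remote("pipeline", bundle=bundle)
    backend = toy_remote("backend", bundle=bundle)
    stack = toy_remote("stack", fsdp_backend=backend)
```

Because stack holds the backend handle itself, both refer to the same placed component, which is the property the sibling wiring above relies on.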

FSDP backend and train stack

Two Remote siblings split the training work:

  • FSDPBackend (unirl/train/backend/fsdp.py) owns the trainable model: structural injection (LoRA / NFT / mirror EMA) before the FSDP2 wrap, plus the optimizer, LR scheduler, EMA shadow, eval-EMA swap, checkpoint, and onload/offload.
  • TrainStack (unirl/train/stack.py) owns loss/backward sequencing. It takes handles to one FSDPBackend and one StageAlgorithm, and per rollout track runs prepare_segment → mini-batch compute_loss_and_backward loop → optimizer_step, returning a TrainStepResult.

TrainStack is single-stage by design — one track, no track-name dict. Multi-track training (for example PE) uses sibling TrainStacks, one per track. HunyuanImage3's mixed AR + diffusion training uses unified_model_stack.py, a multi-stage variant.

Setting num_updates_per_batch > 1 runs several PPO mini-batch updates on one rollout shard, with π_old frozen once by prepare_segment. This is gated on StageAlgorithm.supports_multi_update.
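The sequencing above (prepare_segment once, then a mini-batch loss/backward loop, then an optimizer step) can be sketched with stub classes. Everything here is hypothetical: ToyAlgorithm, ToyBackend, and train_step are illustrative names, and two details are assumptions rather than documented behavior: that the shard is split evenly into num_updates_per_batch mini-batches, and that the optimizer step is exposed as a backend method.

```python
from typing import List


class ToyAlgorithm:
    """Hypothetical stand-in for a StageAlgorithm."""
    supports_multi_update = True

    def __init__(self) -> None:
        self.calls: List[str] = []

    def prepare_segment(self, shard: List[float]) -> None:
        # Where the old policy would be frozen: once per shard.
        self.calls.append("prepare_segment")

    def compute_loss_and_backward(self, micro_batch: List[float]) -> float:
        self.calls.append("compute_loss_and_backward")
        return sum(micro_batch) / len(micro_batch)  # placeholder "loss"


class ToyBackend:
    """Hypothetical stand-in for the backend that owns the optimizer."""

    def __init__(self) -> None:
        self.steps = 0

    def optimizer_step(self) -> None:
        self.steps += 1


def train_step(backend, algorithm, shard, micro_batch_size, num_updates_per_batch=1):
    """One train step over a rollout shard, with the multi-update gate."""
    if num_updates_per_batch > 1 and not algorithm.supports_multi_update:
        raise ValueError("algorithm does not support multiple updates per batch")
    if len(shard) % num_updates_per_batch != 0:
        raise ValueError("shard size must divide evenly into updates")
    algorithm.prepare_segment(shard)  # exactly once, before any update
    per_update = len(shard) // num_updates_per_batch
    losses = []
    for u in range(num_updates_per_batch):
        mini = shard[u * per_update:(u + 1) * per_update]
        # Accumulate gradients over micro-batches, then step once.
        for i in range(0, len(mini), micro_batch_size):
            losses.append(algorithm.compute_loss_and_backward(mini[i:i + micro_batch_size]))
        backend.optimizer_step()
    return losses


algo, backend = ToyAlgorithm(), ToyBackend()
losses = train_step(backend, algo, [1.0, 2.0, 3.0, 4.0],
                    micro_batch_size=1, num_updates_per_batch=2)
```

In this toy run, prepare_segment is recorded once while the optimizer steps twice, which is the property the multi-update gate protects.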

The full backend contract — components, the FSDPConfig / LoraConfig / EmaLoraConfig / EmaFullConfig / OptimizerConfig / LrSchedulerConfig schemas, and TrainTopology — is documented in the generated Train Stack README.

Config shape

Recipes live under examples/<domain>/*.yaml, a bucketed tree with one subdirectory per trainer domain. Each recipe instantiates its siblings by _target_ rather than through Hydra config-group overrides:

backend:
  _target_: unirl.train.backend.fsdp.FSDPBackend
  block_class_names: ["JointTransformerBlock"]
  trainable_attr: transformer
  fsdp_cfg:
    _target_: unirl.train.configs.FSDPConfig
    fsdp_mode: full
    activation_checkpointing: false
  optimizer_cfg:
    _target_: unirl.train.backend.base.OptimizerConfig
    learning_rate: 3.0e-4
  lora_cfg:
    _target_: unirl.train.configs.LoraConfig
    rank: 32
    alpha: 64

stack:
  _target_: unirl.train.stack.TrainStack
  micro_batch_size: 1
  max_grad_norm: 1.0
  num_updates_per_batch: 2
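For readers unfamiliar with _target_-style configs, the following stdlib-only sketch shows the idea: each mapping that carries a _target_ key is turned into a call to the named class, with nested mappings built first. In a real recipe this job is presumably done by Hydra's hydra.utils.instantiate; the locate and instantiate functions below are simplified stand-ins, and types.SimpleNamespace replaces the real UniRL classes so the example runs without the package installed.

```python
import importlib
from typing import Any


def locate(path: str) -> Any:
    """Resolve a dotted path such as 'types.SimpleNamespace' to an object."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def instantiate(node: Any) -> Any:
    """Recursively build objects from mappings that carry a _target_ key."""
    if isinstance(node, list):
        return [instantiate(v) for v in node]
    if not isinstance(node, dict):
        return node  # scalars pass through unchanged
    kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
    if "_target_" not in node:
        return kwargs  # plain mapping: keep it as a dict
    return locate(node["_target_"])(**kwargs)


# Shape borrowed from the recipe above; the targets are stand-ins.
recipe = {
    "backend": {
        "_target_": "types.SimpleNamespace",
        "block_class_names": ["JointTransformerBlock"],
        "trainable_attr": "transformer",
        "lora_cfg": {"_target_": "types.SimpleNamespace", "rank": 32, "alpha": 64},
    },
    "stack": {
        "_target_": "types.SimpleNamespace",
        "micro_batch_size": 1,
        "num_updates_per_batch": 2,
    },
}
built = instantiate(recipe)
```

The nested lora_cfg mapping is built before its parent, so the backend stand-in receives an object rather than a raw dict.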

Where this is going

The roadmap's VeOmniBackend work targets exactly this contract: a new backend implements the same Remote surface as FSDPBackend and maps TrainTopology onto a parallel plan (FSDP / SP / EP) instead of hardcoded shard sizes. See the Roadmap Infra track.

For extension routing (which surface to build on), see Extending UniRL.
