Trainer & Training Stack
The single-controller per-domain trainer, the FSDP train stack, and the flat conf recipe shape.
UniRL runs on one trainer architecture: a single-controller Remote /
placement layer with per-domain trainers and a pluggable FSDP train stack. The
earlier v1 actor-group runtime (and its conf_v1/ recipes) has been retired;
everything below is the current, default path.
Entrypoints and trainers
Each domain has its own Hydra entrypoint and <Domain>Trainer
(unirl/trainer/), all driven the same way:
| Entrypoint | Trainer | Domain |
|---|---|---|
| python -m unirl.train_diffusion | trainer/diffusion.py | Diffusion image / video |
| python -m unirl.train_vlm | trainer/vlm.py | Autoregressive VLM / LLM |
| python -m unirl.train_pe | trainer/pe.py | Prompt-enhancer (joint AR + diffusion) |
| python -m unirl.train_unified_model | trainer/unified_model.py | HunyuanImage3 (mixed AR + diffusion) |
The shared lifecycle lives in trainer/base.py: acquire a Ray DevicePool,
build the rollout and train workers inside a placement block, then run the
rollout → reward → advantage → train → optional weight-sync loop.
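The loop order described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in trainer/base.py: the function name `run_loop`, the plain callables standing in for the Remote workers, and the batch-mean advantage are all assumptions made for the example.

```python
from typing import Callable, List, Optional


def run_loop(
    prompts: List[str],
    rollout: Callable[[List[str]], List[str]],
    reward: Callable[[List[str]], List[float]],
    train: Callable[[List[str], List[float]], float],
    sync: Optional[Callable[[], None]] = None,
    num_steps: int = 1,
) -> List[float]:
    """Hypothetical driver: rollout -> reward -> advantage -> train -> optional sync."""
    losses: List[float] = []
    for _ in range(num_steps):
        samples = rollout(prompts)
        rewards = reward(samples)
        # Assumed advantage: reward minus the batch mean. The real
        # algorithm may normalize differently.
        mean = sum(rewards) / len(rewards)
        advantages = [r - mean for r in rewards]
        losses.append(train(samples, advantages))
        # Weight sync only runs when a sync bridge was configured.
        if sync is not None:
            sync()
    return losses
```

Passing `sync=None` models a recipe without a weight-sync bridge; the rest of the loop is unchanged.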
Runtime model
The trainer opens one placement block and instantiates each component as a
sibling Remote, passing siblings by handle:

    <Domain>Trainer(num_devices, batch_size, ...cfg blocks...)
      with placement(pool, fraction=1.0, shared_workers=True):
        bundle    = remote(bundle_cfg)
        pipeline  = remote(pipeline_cfg, bundle=bundle)
        backend   = remote(backend_cfg, bundle=bundle)                       # FSDPBackend
        rollout   = remote(rollout_cfg[, pipeline=pipeline])                 # engine
        reward    = remote(reward_cfg)                                       # RewardService
        algorithm = remote(algorithm_cfg, pipeline=pipeline)                 # StageAlgorithm
        stack     = remote(stack_cfg, fsdp_backend=backend, algorithm=algorithm)
        sync      = remote(sync_cfg, backend=backend, rollout=rollout)       # dedicated rollout only

The single-controller layer (unirl/distributed/group/: Remote, placement,
RankInfo) carries DP / TP / PP / SP / EP rank information, which is what
later parallelism work (SP for long sequences, EP for MoE) plugs into.
In trainside recipes the rollout shares the trained module, so no weight sync
is needed; recipes with a dedicated rollout engine add a weight-sync bridge
under sync.
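To make "rank information per parallel axis" concrete, here is a small sketch of how a flat global rank can be decomposed into per-axis ranks. It is not the real RankInfo: the function name `rank_coords`, the dict-based mesh, and the convention that the last axis varies fastest are assumptions chosen for illustration.

```python
from typing import Dict


def rank_coords(global_rank: int, mesh: Dict[str, int]) -> Dict[str, int]:
    """Split a flat rank into per-axis ranks; the last axis varies fastest."""
    world_size = 1
    for size in mesh.values():
        world_size *= size
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} outside world size {world_size}")

    coords: Dict[str, int] = {}
    remainder = global_rank
    # Peel off the fastest-varying axis first, then move outward.
    for axis in reversed(list(mesh)):
        remainder, coords[axis] = divmod(remainder, mesh[axis])
    # Return the axes in the order the mesh declared them.
    return {axis: coords[axis] for axis in mesh}


# Hypothetical 8-device mesh: 2-way DP x 2-way SP x 2-way TP.
print(rank_coords(5, {"dp": 2, "sp": 2, "tp": 2}))  # {'dp': 1, 'sp': 0, 'tp': 1}
```

Adding an axis (for example EP) only means adding a key to the mesh, which is the property that makes this kind of record convenient for later parallelism work.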
FSDP backend and train stack
Two Remote siblings split the training work:
- FSDPBackend (unirl/train/backend/fsdp.py) owns the trainable model: structural injection (LoRA / NFT / mirror EMA) before the FSDP2 wrap, plus the optimizer, LR scheduler, EMA shadow, eval-EMA swap, checkpoint, and onload/offload.
- TrainStack (unirl/train/stack.py) owns loss/backward sequencing. It takes handles to one FSDPBackend and one StageAlgorithm, and per rollout track runs prepare_segment → mini-batch compute_loss_and_backward loop → optimizer_step, returning a TrainStepResult.
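The split of responsibilities can be illustrated with a toy version of the step sequence. Everything here is a stand-in: `ToyBackend`, `ToyAlgorithm`, `StepResult`, and `train_step` are hypothetical names, the "model" is a single scalar weight, and the method signatures are assumptions rather than the real FSDPBackend / StageAlgorithm / TrainStack interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:  # hypothetical stand-in for TrainStepResult
    mean_loss: float
    num_micro_batches: int


class ToyBackend:
    """Owns the 'model' (one scalar weight) and the optimizer (plain SGD)."""

    def __init__(self, weight: float = 0.0, lr: float = 0.1) -> None:
        self.weight, self.lr, self.grad = weight, lr, 0.0

    def optimizer_step(self) -> None:
        self.weight -= self.lr * self.grad
        self.grad = 0.0


class ToyAlgorithm:
    """Owns the loss: mean squared error between the weight and the targets."""

    def prepare_segment(self, batch: List[float]) -> None:
        self.segment_size = len(batch)  # one-time setup per rollout batch

    def compute_loss_and_backward(
        self, backend: ToyBackend, micro_batch: List[float], scale: float
    ) -> float:
        n = len(micro_batch)
        loss = sum((backend.weight - t) ** 2 for t in micro_batch) / n
        grad = sum(2.0 * (backend.weight - t) for t in micro_batch) / n
        backend.grad += scale * grad  # accumulate, as backward() would
        return loss


def train_step(
    backend: ToyBackend, algorithm: ToyAlgorithm,
    batch: List[float], micro_batch_size: int,
) -> StepResult:
    """prepare_segment -> micro-batch loss/backward loop -> optimizer_step."""
    algorithm.prepare_segment(batch)
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    total = 0.0
    for mb in micro_batches:
        total += algorithm.compute_loss_and_backward(
            backend, mb, scale=1.0 / len(micro_batches)
        )
    backend.optimizer_step()  # exactly one optimizer step per pass
    return StepResult(total / len(micro_batches), len(micro_batches))
```

The point of the sketch is the ownership boundary: the stack function sequences calls, the algorithm computes loss and gradients, and only the backend touches weights and optimizer state.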
TrainStack is single-stage by design — one track, no track-name dict.
Multi-track training (for example PE) uses sibling TrainStacks, one per
track. HunyuanImage3's mixed AR + diffusion training uses unified_model_stack.py, a
multi-stage variant.
num_updates_per_batch > 1 runs several PPO mini-batch updates on one rollout
shard with π_old frozen once by prepare_segment; it is gated on
StageAlgorithm.supports_multi_update.
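A minimal sketch of that gating, under stated assumptions: `ToyPPOAlgorithm`, `SingleUpdateAlgorithm`, and `run_updates` are hypothetical, log-probs are plain floats, and the "policy update" is a fixed additive nudge. Only the names num_updates_per_batch, prepare_segment, and supports_multi_update come from the text above.

```python
import math
from typing import List


class ToyPPOAlgorithm:
    """Hypothetical algorithm that snapshots old log-probs once per segment."""

    supports_multi_update = True

    def __init__(self) -> None:
        self.old_logprobs: List[float] = []
        self.prepare_calls = 0

    def prepare_segment(self, logprobs: List[float]) -> None:
        self.old_logprobs = list(logprobs)  # freeze pi_old for the whole segment
        self.prepare_calls += 1

    def ratios(self, new_logprobs: List[float]) -> List[float]:
        # PPO importance ratio pi_new / pi_old, computed from log-probs.
        return [math.exp(n - o) for n, o in zip(new_logprobs, self.old_logprobs)]


class SingleUpdateAlgorithm(ToyPPOAlgorithm):
    supports_multi_update = False


def run_updates(
    algorithm: ToyPPOAlgorithm,
    logprobs: List[float],
    num_updates_per_batch: int,
    step: float = 0.1,
) -> List[List[float]]:
    """Run several updates on one shard; pi_old is frozen exactly once."""
    if num_updates_per_batch > 1 and not algorithm.supports_multi_update:
        raise ValueError("algorithm does not support num_updates_per_batch > 1")
    algorithm.prepare_segment(logprobs)
    current = list(logprobs)
    history = []
    for _ in range(num_updates_per_batch):
        history.append(algorithm.ratios(current))
        current = [lp + step for lp in current]  # stand-in for a policy update
    return history
```

On the first update the ratios are exactly 1 because the policy still equals the frozen snapshot; later updates see ratios that drift away from 1, which is why an algorithm must opt in before it is run this way.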
The full backend contract — components, the FSDPConfig / LoraConfig /
EmaLoraConfig / EmaFullConfig / OptimizerConfig / LrSchedulerConfig
schemas, and TrainTopology — is documented in the generated
Train Stack README.
Config shape
examples/<domain>/*.yaml is a bucketed tree (one subdirectory per trainer domain); each recipe instantiates its siblings by
_target_ (no Hydra config-group overrides):

    backend:
      _target_: unirl.train.backend.fsdp.FSDPBackend
      block_class_names: ["JointTransformerBlock"]
      trainable_attr: transformer
      fsdp_cfg:
        _target_: unirl.train.configs.FSDPConfig
        fsdp_mode: full
        activation_checkpointing: false
      optimizer_cfg:
        _target_: unirl.train.backend.base.OptimizerConfig
        learning_rate: 3.0e-4
      lora_cfg:
        _target_: unirl.train.configs.LoraConfig
        rank: 32
        alpha: 64
    stack:
      _target_: unirl.train.stack.TrainStack
      micro_batch_size: 1
      max_grad_norm: 1.0
      num_updates_per_batch: 2

Where this is going
The roadmap's VeOmniBackend work targets exactly this contract: a new backend
implements the same Remote surface as FSDPBackend and maps TrainTopology
onto a parallel plan (FSDP / SP / EP) instead of hardcoded shard sizes. See the
Roadmap Infra track.
For extension routing (which surface to build on), see Extending UniRL.