Multinode Runs
Launchers, Ray startup, cluster geometry, and pre-run checks.
Training semantics live in examples/<domain>/*.yaml. Shell scripts only prepare the runtime
environment, start Ray, set path/logging defaults, and forward the recipe name
plus Hydra overrides.
Launchers
Use the single-node launcher for local or one-node jobs, and the multinode
launcher when workers need to join a Ray head before training starts. The first
argument is a bucketed recipe name (<domain>/<recipe>); set the ENTRY environment
variable to select a non-diffusion entrypoint (train_vlm, train_pe, or train_unified_model).
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate
ENTRY=train_vlm bash examples/run_experiment_multinode_taiji.sh vlm/qwen_vl_argrpo_geo3k_mc_sglang_4x8
Environment-Derived Overrides
Generic launchers set path and logging values from environment variables
(PRETRAINED_MODEL, DATA_PATH, EVAL_DATA_PATH, OUTPUT_DIR, and the W&B
knobs). Model checkpoint env vars remain recipe-specific.
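As a rough illustration of the pattern described above, the sketch below turns whichever of these environment variables are set into Hydra-style key=value overrides. The helper names and the Hydra keys (paths.pretrained_model, data.train_path, data.eval_path, paths.output_dir) are hypothetical placeholders chosen for this example; the real launchers may use different field names and also handle the W&B knobs.

```shell
# Sketch only: the Hydra keys below are hypothetical, not the launchers' real field names.

emit_override() {
  # $1 = override key, $2 = value; prints "key=value" only when the value is non-empty.
  if [ -n "$2" ]; then
    printf '%s=%s\n' "$1" "$2"
  fi
}

build_env_overrides() {
  # One override per line, skipping variables that are unset or empty.
  emit_override paths.pretrained_model "${PRETRAINED_MODEL:-}"
  emit_override data.train_path "${DATA_PATH:-}"
  emit_override data.eval_path "${EVAL_DATA_PATH:-}"
  emit_override paths.output_dir "${OUTPUT_DIR:-}"
}
```

Unset variables produce no output, so the recipe's own defaults would stay in effect for those fields.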
Cluster Geometry
To run on a different cluster geometry, override the recipe's placement and batch-geometry fields together, for example the device count and the train-stack mini-batch shape:
bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate \
num_devices=16 \
stack.micro_batch_size=1 \
stack.num_updates_per_batch=2
The recipe's validators report inconsistent geometry before Ray work starts, so always run a compose check first.
Pre-Run Checks
For any large run, first compose and resolve the recipe without launching Ray work:
python -m unirl.train_diffusion --config-name=<domain>/<recipe> --cfg job --resolve
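One way to make this check hard to skip is to wrap it and the launch in a single helper. The sketch below is a minimal example for diffusion recipes only. It assumes the compose command exits non-zero when resolution or validation fails, and it passes the same Hydra overrides to both steps. The function name and the PYTHON_BIN and LAUNCHER variables are hypothetical conveniences introduced here (they make the helper easy to dry-run) and are not part of the repository's scripts.

```shell
# Sketch only: preflight_then_launch, PYTHON_BIN, and LAUNCHER are hypothetical names.

preflight_then_launch() {
  local recipe="$1"
  shift
  # Compose and resolve the recipe with the same overrides, without starting Ray.
  # Assumption: a failed compose or validation makes this command exit non-zero.
  "${PYTHON_BIN:-python}" -m unirl.train_diffusion \
    --config-name="$recipe" --cfg job --resolve "$@" >/dev/null || {
    echo "compose check failed for $recipe; not launching" >&2
    return 1
  }
  # Only reached when the compose check succeeded.
  bash "${LAUNCHER:-examples/run_experiment_multinode_taiji.sh}" "$recipe" "$@"
}
```

For example, `preflight_then_launch diffusion/sd3_sglang_native_colocate num_devices=16 stack.micro_batch_size=1` would run the compose check with those overrides and start the multinode launcher only if it passes.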