Multinode Runs

Training semantics live in examples/<domain>/*.yaml. Shell scripts only prepare the runtime environment, start Ray, set path/logging defaults, and forward the recipe name plus Hydra overrides.

Launchers

Use the single-node launcher for local or one-node jobs, and the multinode launcher when workers must join a Ray head before training starts. The first argument is a bucketed recipe name (<domain>/<recipe>). To run a non-diffusion entrypoint (train_vlm, train_pe, train_unified_model), set the ENTRY environment variable.

bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate
ENTRY=train_vlm bash examples/run_experiment_multinode_taiji.sh vlm/qwen_vl_argrpo_geo3k_mc_sglang_4x8
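
The argument shape above can be checked before a job is submitted. The following is a minimal sketch, not part of the launchers: check_recipe_name and recipe_path are hypothetical helpers, and the mapping from <domain>/<recipe> to examples/<domain>/<recipe>.yaml is an assumption based on where the recipes are said to live.

```shell
# Hypothetical helper (not provided by the launchers): accept only names
# of the form <domain>/<recipe>, with exactly one slash and no empty part.
check_recipe_name() {
  case "$1" in
    '' | /* | */ | */*/*) return 1 ;;  # empty, missing domain, missing recipe, or too many segments
    */*) return 0 ;;
    *) return 1 ;;                      # no slash at all
  esac
}

# Hypothetical helper: print the YAML path a recipe name would map to,
# assuming recipes are stored as examples/<domain>/<recipe>.yaml.
recipe_path() {
  check_recipe_name "$1" && printf 'examples/%s.yaml\n' "$1"
}
```

For instance, recipe_path diffusion/sd3_trainside prints examples/diffusion/sd3_trainside.yaml, while recipe_path sd3_trainside prints nothing and returns a non-zero status.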

Environment-Derived Overrides

Generic launchers set path and logging values from environment variables (PRETRAINED_MODEL, DATA_PATH, EVAL_DATA_PATH, OUTPUT_DIR, and the Weights & Biases settings). Environment variables for model checkpoints remain recipe-specific.
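
Because these values come from the environment, it can help to see which ones are unset before submitting a job. The sketch below is illustrative only: missing_path_vars is a hypothetical helper, and treating all four path variables as required is an assumption, since a given recipe may not need every one.

```shell
# Hypothetical helper: list the path variables named above that are unset or empty.
# Assumption: all four are treated as required here purely for illustration.
# The body runs in a subshell, so its working variables do not leak to the caller.
missing_path_vars() (
  missing=""
  for var in PRETRAINED_MODEL DATA_PATH EVAL_DATA_PATH OUTPUT_DIR; do
    # Look up the value of the variable whose name is stored in $var.
    eval "val=\${$var:-}"
    [ -n "$val" ] || missing="${missing:+$missing }$var"
  done
  printf '%s\n' "$missing"
)
```

An empty result means every variable is set. A caller could refuse to launch when the result is non-empty, or simply print it as a reminder.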

Cluster Geometry

To run on a different cluster geometry, override the recipe's placement and batch-geometry fields together. For example, change the device count along with the train-stack mini-batch shape:

bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate \
  num_devices=16 \
  stack.micro_batch_size=1 \
  stack.num_updates_per_batch=2

The recipe's validators report inconsistent geometry before any Ray work starts, so run the compose check described under Pre-Run Checks before launching.

Pre-Run Checks

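For any large run, first compose and resolve the recipe without launching Ray work. Hydra's --cfg job flag prints the composed job config and exits, and --resolve resolves its interpolations first: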

python -m unirl.train_diffusion --config-name=<domain>/<recipe> --cfg job --resolve
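
One way to make the check hard to skip is to gate the launch on it. The following is a minimal sketch under stated assumptions: compose_then_launch is a hypothetical wrapper, COMPOSE_CMD and LAUNCH_CMD are hypothetical hooks added so the sketch can be dry-run, and the defaults simply reuse the commands shown on this page. It also assumes the same Hydra overrides are valid for both the compose check and the launcher.

```shell
# Hypothetical wrapper: run the compose check, and launch only if it succeeds.
# COMPOSE_CMD and LAUNCH_CMD are illustrative hooks, not launcher features.
# The body runs in a subshell, so "exit" only ends the function call.
compose_then_launch() (
  recipe="$1"
  shift
  # Remaining arguments are forwarded as Hydra overrides to both commands.
  if ! ${COMPOSE_CMD:-python -m unirl.train_diffusion} \
      --config-name="$recipe" "$@" --cfg job --resolve >/dev/null; then
    echo "compose check failed for $recipe; not launching" >&2
    exit 1
  fi
  ${LAUNCH_CMD:-bash examples/run_experiment_multinode_taiji.sh} "$recipe" "$@"
)
```

Setting LAUNCH_CMD=echo prints the launcher arguments instead of starting a job, which is a cheap way to confirm what would be forwarded.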
