# UniRL Full Documentation

> Agent-readable Markdown corpus built from commit: unknown.

---

# UniRL Documentation (/en/docs)

> Agent-first documentation for the UniRL distributed reinforcement learning framework.

UniRL is a distributed reinforcement learning framework for unified multimodal generative models. It trains diffusion and autoregressive models with Ray-based worker groups, Hydra experiment recipes, composable training stacks, and pluggable rollout engines.

This documentation site has two audiences:

* Researchers and engineers who read the rendered Fumadocs pages.
* Future coding agents that need stable Markdown entry points and task-oriented navigation.

## Running Training [#running-training]

Each domain has its own entrypoint, all driven the same way:

```bash
python -m unirl.train_diffusion     --config-name=<domain>/<recipe>   # diffusion image/video
python -m unirl.train_vlm           --config-name=<domain>/<recipe>   # autoregressive VLM / LLM
python -m unirl.train_pe            --config-name=<domain>/<recipe>   # prompt-enhancer (PE)
python -m unirl.train_unified_model --config-name=<domain>/<recipe>   # HunyuanImage3 (mixed AR + diffusion)
```

`<recipe>` is a self-contained YAML filename (without `.yaml`) in the bucketed `examples/` tree, addressed as `<domain>/<recipe>` — for example:

```bash
python -m unirl.train_diffusion --config-name=diffusion/sd3_trainside
```

Override any field inline with Hydra's `key=value` syntax, e.g. `num_devices=8`. Each recipe is the source of truth for model, algorithm, rollout engine, placement, reward, sync, and batch geometry.
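For launchers or tests that assemble the invocation programmatically, the pattern above can be sketched as follows. This is a minimal illustration, not part of UniRL: the helper `build_train_command` is hypothetical, and only the module names, `--config-name`, and the `key=value` override syntax come from this page.

```python
import sys
from typing import List, Mapping, Optional

# Entry modules listed on this page; the helper below is a hypothetical convenience.
ENTRYPOINTS = (
    "unirl.train_diffusion",
    "unirl.train_vlm",
    "unirl.train_pe",
    "unirl.train_unified_model",
)


def build_train_command(
    module: str,
    recipe: str,
    overrides: Optional[Mapping[str, object]] = None,
    python: str = sys.executable,
) -> List[str]:
    """Return argv for `python -m <module> --config-name=<recipe> key=value ...`."""
    if module not in ENTRYPOINTS:
        raise ValueError(f"unknown entrypoint: {module}")
    argv = [python, "-m", module, f"--config-name={recipe}"]
    # Hydra overrides are plain `key=value` tokens appended after the recipe.
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


cmd = build_train_command(
    "unirl.train_diffusion",
    "diffusion/sd3_trainside",
    {"num_devices": 8},
    python="python",
)
print(" ".join(cmd))
```

Returning a list rather than a shell string keeps quoting out of the picture when the command is handed to `subprocess.run`.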

Shell launchers in `examples/` should stay thin — they prepare environment variables, start Ray, and pass the recipe name plus Hydra overrides, while the recipe semantics live in YAML. See `examples/run_experiment_single_node.sh` and `examples/run_experiment_multinode_taiji.sh` for the canonical pattern.

Detailed runtime and module contracts live in the package pages embedded in each docs section's sidebar, generated from the README files next to the code.

## How Agents Should Read This Site [#how-agents-should-read-this-site]

Agents should use [Agent Index](/en/docs/agents) as the human-readable routing page before changing code. That page maps common tasks to the nearest rendered docs, package README contracts, and source directories.

Machine-readable endpoints such as `/llms.txt`, `/llms-full.txt`, and `/md/<slug>/index.md` are root-level access paths, not separate documentation categories. They are generated from the same MDX source as this site so rendered pages and agent context stay aligned.

## Reading Paths [#reading-paths]

* Start with [Installation](/en/docs/getting-started/installation) and [First Run](/en/docs/getting-started/first-run) for setup.
* Use the [overview](/en/docs/architecture/overview) for the narrative runtime map.
* Use [Agent Index](/en/docs/agents) to choose the closest source files and README contracts for a task.

---

# Agent Index (/en/docs/agents)

> Start here when using UniRL documentation as coding-agent context.

This page is optimized for future agents that need to read and modify UniRL safely. It is the human-readable guide to agent context; `/llms.txt` is the machine-readable discovery endpoint.

## How Agents Use These Docs [#how-agents-use-these-docs]

Agents should treat the docs as a routing layer, not as a replacement for source inspection:

1. Open `/llms.txt` or `/md/agents/index.md` to discover the maintained documentation surface.
2. Use the task table below to choose the closest rendered docs page and package README.
3. Read the nearby implementation before editing.
4. Prefer `/md/<docs-slug>/index.md` for focused Markdown context and `/llms-full.txt` only when a single-file corpus is useful.

Do not add `/llms.txt` as a docs category. It is a root-level access path for tools and agents, while this `Agents` section is the visible documentation category.

## First Principles [#first-principles]

* Treat `python -m unirl.train_diffusion --config-name=<domain>/<recipe>` (and `train_vlm` / `train_pe` / `train_unified_model`) as the maintained runtime entry.
* Treat the bucketed `examples/<domain>/<recipe>.yaml` files as the authoritative configuration surface.
* Treat package READMEs as local contracts near the code they describe.
* Do not infer runtime behavior from stale scratch docs or ignored local files unless the user explicitly points to them.

## Reading Order by Task [#reading-order-by-task]

| Task                                                           | Read first                                                                 |
| -------------------------------------------------------------- | -------------------------------------------------------------------------- |
| Run or validate a recipe                                       | `/en/docs/getting-started/first-run`, then the launchers in `examples/`    |
| Understand configuration                                       | `/en/docs/configuration/hydra`, then `unirl/config/README.md`              |
| Pick an experiment                                             | `/en/docs/configuration/experiments`, then `examples/<domain>/<name>.yaml` |
| Understand runtime flow                                        | `/en/docs/architecture/overview`, then `unirl/README.md`                   |
| Work on rollout engines                                        | `unirl/rollout/README.md`                                                  |
| Work on the train stack or a training backend                  | `/en/docs/architecture/trainer-v2`, then `unirl/train/readme.md`           |
| Work on GRPO / NFT / DPPO loss logic                           | `unirl/algorithms/README.md`                                               |
| Work on SDE kernels, sigma schedules, or log-probability paths | `unirl/sde/README.md`                                                      |
| Work on rewards                                                | `/en/docs/guides/rewards`, then `unirl/reward/README.md`                   |
| Add or debug trainer-to-rollout weight sync                    | `unirl/distributed/weight_sync/README.md`                                  |
| Prepare prompt data                                            | `/en/docs/guides/data-preparation`                                         |
| Add or mount data/model artifacts                              | `/en/docs/guides/data-and-models`                                          |
| Debug multinode runs                                           | `/en/docs/guides/multinode`                                                |

## Machine-Readable Endpoints [#machine-readable-endpoints]

Use these endpoints instead of scraping rendered HTML:

| Endpoint                           | Purpose                                     |
| ---------------------------------- | ------------------------------------------- |
| `/llms.txt`                        | compact discovery index and access guidance |
| `/llms-full.txt`                   | full generated Markdown corpus              |
| `/md/agents/index.md`              | this page as Markdown                       |
| `/md/configuration/hydra/index.md` | one focused configuration page              |

These outputs are generated from the same MDX source as the Fumadocs site, so human and agent documentation stay aligned. Keep endpoint details here instead of duplicating them across the docs sidebar.
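The two `/md/...` rows in the table suggest a simple mapping from a rendered page path to its Markdown endpoint. The sketch below assumes the endpoint is formed by dropping the `/en/docs/` prefix, which is inferred from those examples rather than from a documented rule; `markdown_endpoint` is a hypothetical helper.

```python
def markdown_endpoint(docs_path: str, prefix: str = "/en/docs/") -> str:
    """Map a rendered docs path to its `/md/<slug>/index.md` endpoint.

    Assumption: the slug is the rendered path with the locale prefix removed.
    """
    if not docs_path.startswith(prefix):
        raise ValueError(f"not a docs path: {docs_path}")
    slug = docs_path[len(prefix):].strip("/")
    if not slug:
        raise ValueError("the docs root has no slug")
    return f"/md/{slug}/index.md"


print(markdown_endpoint("/en/docs/agents"))
print(markdown_endpoint("/en/docs/configuration/hydra"))
```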

## Safe Editing Policy [#safe-editing-policy]

When editing the framework:

1. Identify the owning package from the task table.
2. Read that package README and the closest existing implementation.
3. Prefer a typed config dataclass near the implementation over ad hoc string parsing.
4. Add or update one recipe only when the feature changes runnable behavior.
5. Run a Hydra compose check before launching Ray work.
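Step 5 can be automated as a gate in front of any launcher. The sketch below wraps the compose check used elsewhere in these docs (`--cfg job --resolve`) in a hypothetical `compose_check` helper; the `run` parameter is injectable so the gate can be exercised without starting Python subprocesses.

```python
import subprocess
import sys
from typing import Callable, List


def compose_check_argv(module: str, recipe: str, python: str = sys.executable) -> List[str]:
    """argv for the Hydra compose check shown in these docs."""
    return [python, "-m", module, f"--config-name={recipe}", "--cfg", "job", "--resolve"]


def compose_check(module: str, recipe: str, run: Callable = subprocess.run) -> bool:
    """Return True when config composition exits cleanly (hypothetical helper)."""
    result = run(compose_check_argv(module, recipe), capture_output=True, text=True)
    return result.returncode == 0
```

A launcher would call `compose_check(...)` first and skip Ray startup when it returns `False`, so that composition errors surface before any cluster resources are claimed.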

---

# Agent Task Recipes (/en/docs/agents/task-recipes)

> Common coding-agent tasks mapped to files, checks, and likely risks.

Use this page as a routing table before editing the repository.

## Add a New Training Recipe [#add-a-new-training-recipe]

Read:

* `unirl/config/README.md`
* closest `examples/<domain>/<recipe>.yaml`

Edit:

* `examples/<domain>/<new_recipe>.yaml`
* this docs page and `/en/docs/configuration/experiments` if the recipe is maintained

Check:

```bash
python -m unirl.train_diffusion --config-name=<domain>/<new_recipe> --cfg job --resolve
```

Risk: mismatched placement, rollout batch size, and train-stack batch geometry.

See also: `unirl/train/stack.py`, `unirl/config/`.

## Add a New Model [#add-a-new-model]

Read:

* `unirl/models/README.md`
* closest model package under `unirl/models/`

Edit:

* `unirl/models/<model_name>/`
* `examples/<domain>/<recipe>.yaml`

Check:

* config registration imports
* LoRA target materialization
* prompt/condition contracts expected by rollout engines

Risk: leaking model-specific assumptions into generic rollout or training packages.

See also: `unirl/types/` and the closest existing model package.

## Add a Rollout Engine [#add-a-rollout-engine]

Read:

* `unirl/rollout/README.md`
* `unirl/distributed/weight_sync/README.md`
* existing `engine/trainside`, `engine/sglang`, or `engine/vllm_omni`

Edit:

* `unirl/rollout/engine/<engine>/`
* optional weight sync backend if the engine is dedicated
* at least one recipe under `examples/<domain>/`

Check:

* all backend outputs adapt to canonical `RolloutResp` (`tracks[name]`)
* direct-vs-dedicated sync contracts pass validation

Risk: returning backend-specific objects across the rollout/training boundary.

See also: `unirl/types/rollout_req.py`, `unirl/types/rollout_resp.py`.
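To illustrate the boundary rule above, here is a deliberately simplified adapter. The `SketchTrack` and `SketchRolloutResp` classes are stand-ins invented for this example and are not the real types in `unirl/types/`; the payload keys `latents` and `step_logps` are likewise assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchTrack:
    """Stand-in for a per-stage track; NOT the real unirl type."""
    segment: Dict[str, Any]
    rewards: List[float] = field(default_factory=list)
    advantages: List[float] = field(default_factory=list)


@dataclass
class SketchRolloutResp:
    """Stand-in for the canonical response: tracks keyed by stage name."""
    tracks: Dict[str, SketchTrack] = field(default_factory=dict)


def adapt_backend_output(raw: Dict[str, Any], stage: str = "diffusion") -> SketchRolloutResp:
    """Copy plain data out of a backend payload so no backend object crosses over."""
    segment = {
        "latents": list(raw["latents"]),                       # hypothetical key
        "step_logps": [float(x) for x in raw["step_logps"]],  # hypothetical key
    }
    return SketchRolloutResp(tracks={stage: SketchTrack(segment=segment)})


resp = adapt_backend_output({"latents": (0.1, 0.2), "step_logps": (-1, -2)})
print(sorted(resp.tracks))
```

The point of the shape is that training code only ever indexes `tracks[name]`; anything engine-specific is converted or dropped inside the adapter.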

## Add or Debug Weight Sync [#add-or-debug-weight-sync]

Read:

* `unirl/distributed/weight_sync/README.md`
* `unirl/rollout/README.md`
* the recipe's `sync` selection in `examples/<domain>/<recipe>.yaml`

Edit:

* `unirl/distributed/weight_sync/`
* `examples/<domain>/<recipe>.yaml` when selecting or tuning a backend

Check:

* dedicated rollout engines configure exactly one supported `sync` backend;
* direct sampling recipes omit `sync`;
* CUDA-IPC sync is only used with colocated dedicated rollout.

Risk: confusing trainer-to-rollout weight sync with the rollout-output tensor transport.

See also: `unirl/distributed/tensor/`, `unirl/config/`.
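The three checks above can be expressed as a small validation function. This is a sketch under stated assumptions: the backend label `"ipc"`, the argument names, and the split of engines into direct and dedicated sets are illustrative and do not mirror the real config schema; only `trainside`, `sglang`, `vllm_omni`, and the `colocate` / `separate` terms come from these docs.

```python
from typing import List, Sequence

DIRECT_ENGINES = frozenset({"trainside"})
DEDICATED_ENGINES = frozenset({"sglang", "vllm_omni"})


def check_sync(engine: str, sync_backends: Sequence[str], placement: str) -> List[str]:
    """Return human-readable violations of the sync rules (empty list means OK)."""
    errors: List[str] = []
    if engine in DIRECT_ENGINES:
        if sync_backends:
            errors.append("direct sampling recipes must omit sync")
    elif engine in DEDICATED_ENGINES:
        if len(sync_backends) != 1:
            errors.append("dedicated rollout needs exactly one sync backend")
        elif sync_backends[0] == "ipc" and placement != "colocate":
            # "ipc" is a hypothetical label for the CUDA-IPC backend.
            errors.append("IPC sync requires colocated dedicated rollout")
    else:
        errors.append(f"engine not covered by this sketch: {engine}")
    return errors


print(check_sync("sglang", ["ipc"], "separate"))
```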

## Add or Debug SDE Logic [#add-or-debug-sde-logic]

Read:

* `unirl/sde/README.md`
* `unirl/algorithms/README.md`
* the selected recipe's `sampling` and `sampling/sde_strategy` sections

Edit:

* `unirl/sde/`
* model-specific sigma overrides under `unirl/models/<model_name>/`
* recipe YAML when selecting strategy or step schedules

Check:

* trained strategies provide log-probability paths required by GRPO-style losses;
* evaluation-only solvers are not used for train-side ratio objectives;
* old log-probs remain fixed across `stack.num_updates_per_batch`.

Risk: changing sigma schedules or log-prob behavior without updating rollout and replay assumptions.

See also: `unirl/types/sampling.py`, `unirl/types/rollout_resp.py`, `unirl/algorithms/diffusion_grpo.py`.

## Add a Reward [#add-a-reward]

Read:

* `/en/docs/guides/rewards`
* `unirl/reward/README.md`
* closest scorer under `unirl/reward/local/`

Edit:

* scorer implementation and spec config
* recipe reward backend

Check:

* scorer batch size and device behavior
* offload/onload if the scorer holds a model

Risk: introducing slow in-process scoring without batch controls.

See also: `unirl/reward/local/base.py`, `unirl/types/reward.py`.
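The batch-control risk above is easiest to see in code. The following generic sketch chunks media before calling a scorer; `score_in_batches` and the `score_batch` callable are hypothetical and independent of the base class in `unirl/reward/local/base.py`.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_in_batches(
    media: Sequence[T],
    score_batch: Callable[[Sequence[T]], Sequence[float]],
    batch_size: int,
) -> List[float]:
    """Score media chunk by chunk so peak memory is bounded by `batch_size`."""
    scores: List[float] = []
    for chunk in batched(media, batch_size):
        out = score_batch(chunk)
        if len(out) != len(chunk):
            raise ValueError("scorer must return one score per input")
        scores.extend(float(s) for s in out)
    return scores


# Toy scorer: string length stands in for a model-based reward.
print(score_in_batches(["a", "bb", "ccc"], lambda xs: [len(x) for x in xs], batch_size=2))
```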

## Debug a Failed Run [#debug-a-failed-run]

Start with:

```bash
python -m unirl.train_diffusion --config-name=<domain>/<recipe> --cfg job --resolve
```

Then inspect:

* config validation errors first;
* the launchers in `examples/` for env handling;
* `unirl/rollout/README.md` for engine mode and sync requirements;
* `unirl/train/readme.md` for the train-step contract and batch geometry.

Risk: treating a Ray runtime error as the root cause when Hydra composition already encoded an invalid topology.

See also: `unirl/config/`.

---

# Concepts & Glossary (/en/docs/architecture/concepts)

> The core mental model and the domain terms used across UniRL docs and recipes.

Read this first if the rest of the docs use unfamiliar terms. It is a conceptual
primer, not an API reference; code-adjacent contracts live in the package pages embedded in each
section, such as [Code Architecture](/en/docs/architecture/readme-code-architecture).

## The Core Loop [#the-core-loop]

Every run is one repeating loop:

```text
prompts -> rollout (sample media + record trajectories into tracks)
        -> reward  (score media into per-sample rewards)
        -> advantage (normalize rewards within/across groups)
        -> train   (replay trajectories, compute loss, backward, optimizer step)
        -> [optional] weight sync back to dedicated rollout engines
```
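The loop above can be sketched as a single function that calls each phase in order. Everything here is illustrative: `run_iteration` and its callables are placeholders for the real trainer, reward service, and train stack, and the returned trace exists only to make the phase order visible.

```python
from typing import Callable, List, Optional, Sequence


def run_iteration(
    prompts: Sequence[str],
    rollout: Callable,
    reward: Callable,
    advantage: Callable,
    train: Callable,
    weight_sync: Optional[Callable] = None,
) -> List[str]:
    """Run one pass of the core loop and return the phases that executed."""
    trace: List[str] = []
    samples = rollout(prompts)
    trace.append("rollout")
    rewards = reward(samples)
    trace.append("reward")
    advantages = advantage(rewards)
    trace.append("advantage")
    train(samples, advantages)
    trace.append("train")
    # Weight sync only applies when a dedicated rollout engine holds its own weights.
    if weight_sync is not None:
        weight_sync()
        trace.append("weight_sync")
    return trace


phases = run_iteration(
    ["a cat"],
    rollout=lambda ps: [p.upper() for p in ps],
    reward=lambda xs: [float(len(x)) for x in xs],
    advantage=lambda rs: [r - sum(rs) / len(rs) for r in rs],
    train=lambda xs, adv: None,
)
print(phases)
```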

## One Algorithm per Track [#one-algorithm-per-track]

A rollout produces one or more **tracks** (`RolloutResp.tracks[name]`, keyed by a
stage name such as `"diffusion"` or `"ar"`). Each track binds exactly **one**
loss algorithm — `cfg.algorithm`, a `StageAlgorithm` that consumes the track,
replays the stage, computes a loss, and calls `backward()`.

There is no separate driver-side "rollout control" object. Reward→advantage
shaping and SDE-index selection live on typed objects:

| Concern                       | Where it lives                                                            |
| ----------------------------- | ------------------------------------------------------------------------- |
| Loss                          | `cfg.algorithm` → a `StageAlgorithm` (e.g. `DiffusionGRPO`)               |
| Reward → advantage            | `RolloutTrack.compute_advantages` (`unirl/types/rollout_resp.py`)         |
| Which inference steps run SDE | `DiffusionSamplingParams.resolve_sde_indices` (`unirl/types/sampling.py`) |

A single-track recipe binds one top-level `cfg.algorithm`; a multi-track recipe
(for example PE) nests one `algorithm:` node per track (`diffusion.algorithm`,
`ar.algorithm`) and runs sibling `TrainStack`s.

## Glossary [#glossary]

### Orchestration [#orchestration]

| Term               | Meaning                                                                                                                                                                            |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Driver             | The process running the training loop: an entrypoint (`unirl.train_diffusion`, …) that builds a `<Domain>Trainer`.                                                                 |
| Trainer            | `unirl/trainer/<domain>.py` — owns the placement block, builds rollout/train workers, and runs the loop.                                                                           |
| Remote / placement | Single-controller layer (`unirl/distributed/group/`): a `Remote` is a logical worker; a `placement` block colocates sibling Remotes and carries `RankInfo` (DP/TP/PP/SP/EP ranks). |
| Bundle             | A model package's trainable + frozen modules (transformer, VAE, text encoders) exposed to training and rollout.                                                                    |
| Pipeline           | The model's sampling pipeline (denoising / generation) used to produce media and trajectories.                                                                                     |

### Rollout [#rollout]

| Term                         | Meaning                                                                                                                                |
| ---------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| Rollout engine               | The sampler backend: `trainside`, `sglang`, `sglang_llm`, `vllm_omni`, or `composed`.                                                  |
| Direct sampling              | `trainside`: the FSDP-wrapped training module IS the sampler; no separate engine, no weight sync.                                      |
| Dedicated sampling           | A separate engine (SGLang / vLLM-Omni) holds its own weights and needs trainer→rollout weight sync.                                    |
| Colocate vs separate         | Whether train and rollout share GPU bundles (`colocate`) or run on distinct GPU slabs (`separate`).                                    |
| `RolloutReq` / `RolloutResp` | The typed boundary between rollout and training. Engines adapt backend output into `RolloutResp`.                                      |
| Track / Segment              | `RolloutResp.tracks[name]` holds a per-stage `segment` (e.g. a diffusion trajectory with per-step log-probs), rewards, and advantages. |
| Group                        | The `sampling.samples_per_prompt` siblings of one prompt; advantages are normalized within a group.                                    |

### SDE & log-probs [#sde--log-probs]

| Term                    | Meaning                                                                                                |
| ----------------------- | ------------------------------------------------------------------------------------------------------ |
| SDE strategy            | Per-step stochastic kernel that produces a step log-prob (`FlowSDEStrategy`, `DanceSDEStrategy`, ...). |
| Sigma schedule          | The σ values across denoising steps; pinned onto the request as a single source of truth.              |
| `old_logp` / `new_logp` | Rollout-time vs current-weights log-probs; their ratio drives GRPO's clipped objective.                |
| Log-prob replay         | Recomputing log-probs at train time by replaying the stage; old log-probs stay fixed across updates.   |
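As a worked example of how the `old_logp` / `new_logp` ratio enters a clipped objective, the sketch below implements the standard PPO-style clipped surrogate for one sample. It is not taken from `unirl/algorithms/`; the real losses may aggregate per step, use a different clip range, or add other terms, and `clip_eps=0.2` is an assumed default.

```python
import math


def clipped_surrogate_loss(
    old_logp: float, new_logp: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """Per-sample clipped surrogate loss (negated so that lower is better)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return -min(ratio * advantage, clipped * advantage)


# Identical log-probs give ratio 1, so the loss is just the negated advantage.
print(clipped_surrogate_loss(-1.0, -1.0, 2.0))
```

With a positive advantage the gain is capped once the ratio exceeds `1 + clip_eps`; with a negative advantage the unclipped term is kept, so a bad sample that became more likely is still fully penalized.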

### Training [#training]

| Term                   | Meaning                                                                                                            |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------ |
| Backend                | The training-state `Remote` (`FSDPBackend`) owning structural injection + optimizer + scheduler + EMA.             |
| Stage / StageAlgorithm | A trainable stage and the loss object that replays it and runs forward/backward.                                   |
| Train stack            | `TrainStack` — single-stage loss/backward driver; multi-track uses sibling stacks; HI3 uses `unified_model_stack`. |
| EMA rollout            | Sampling rollouts with EMA-smoothed weights (NFT opts in; GRPO / DPPO must not).                                   |
| Batch geometry         | `micro_batch_size`, `num_updates_per_batch`; multi-update freezes π\_old once per rollout shard.                   |
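The multi-update rule in the last row can be illustrated with a toy policy. In this sketch the old log-probs are captured once and reused across every update, while the new log-probs are recomputed from the current parameter. The Bernoulli policy, the unclipped gradient step, and all names are assumptions made for illustration and are unrelated to the real `TrainStack`.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(theta: float) -> float:
    return 1.0 / (1.0 + math.exp(-theta))


def logp(theta: float, action: int) -> float:
    """Log-prob of a binary action under a toy Bernoulli(sigmoid(theta)) policy."""
    p = sigmoid(theta)
    return math.log(p if action == 1 else 1.0 - p)


def multi_update(
    theta: float,
    actions: Sequence[int],
    advantages: Sequence[float],
    num_updates: int,
    lr: float = 0.1,
) -> Tuple[float, List[List[float]]]:
    """Run several updates on one batch while holding the old log-probs fixed."""
    old_logps = [logp(theta, a) for a in actions]  # frozen once, before any update
    ratios_seen: List[List[float]] = []
    for _ in range(num_updates):
        p = sigmoid(theta)
        ratios = [math.exp(logp(theta, a) - o) for a, o in zip(actions, old_logps)]
        ratios_seen.append(ratios)
        # Ascent on sum(ratio * advantage); d logp / d theta is (1 - p) or -p.
        grad = sum(
            r * adv * ((1.0 - p) if a == 1 else -p)
            for r, adv, a in zip(ratios, advantages, actions)
        )
        theta += lr * grad
    return theta, ratios_seen


theta, seen = multi_update(0.0, actions=[1, 0], advantages=[1.0, -1.0], num_updates=2)
print(seen[0])
```

On the first update every ratio is exactly 1 because the weights have not moved yet; from the second update onward the ratios drift away from 1, which is what a clipped objective then bounds.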

### Reward [#reward]

| Term                      | Meaning                                                                                 |
| ------------------------- | --------------------------------------------------------------------------------------- |
| RewardService             | `unirl/reward/service.py` — holds one backend: local scorers or the remote HTTP client. |
| Reward component / scorer | A unit that scores media (PickScore, HPS, OCR, GenEval2, VideoPickScore, ...).          |
| Advantage                 | The normalized reward signal the loss multiplies against.                               |

### Data-plane vs weight-plane [#data-plane-vs-weight-plane]

| Term                     | Meaning                                                                                            |
| ------------------------ | -------------------------------------------------------------------------------------------------- |
| Weight sync (`cfg.sync`) | Sends fresh **trainer weights** to dedicated rollout engines (NCCL broadcast, tensor, IPC).        |
| Tensor transport         | The data plane (`unirl/distributed/tensor/`) for moving bulky **rollout outputs** between workers. |

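These two planes carry different payloads in opposite directions (weights flow from trainer to rollout; rollout outputs flow back toward training), so do not conflate them.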

## Where to Go Next [#where-to-go-next]

* Runtime map: [Overview](/en/docs/architecture/overview).
* Algorithm loss math and the reward→advantage path: [Algorithms Package README](/en/docs/architecture/readme-algorithms).
* Picking a recipe: [Experiment Recipes](/en/docs/configuration/experiments).

---

# Overview (/en/docs/architecture/overview)

> The main runtime loop, per-domain trainers, rollout engines, train stack, and sync boundaries.

UniRL is organized around one Hydra-driven training loop. This page is a
navigation overview; the code-adjacent contracts are generated from
`unirl/README.md` and the package READMEs embedded as pages in each sidebar section.

```text
unirl.train_diffusion | train_vlm | train_pe | train_unified_model
  -> register and validate Hydra config
  -> <Domain>Trainer acquires a Ray DevicePool (placement)
  -> trainer builds the rollout workers and train workers
  -> loop: rollout -> reward -> advantage -> train -> optional weight sync
```

Each domain has its own entrypoint and trainer, all driven the same way. There
is a single training stack: the earlier v1 actor-group runtime has been retired,
and the single-controller `Remote` / `placement` layer is now the only path.

## Main References [#main-references]

* [Code Architecture README](/en/docs/architecture/readme-code-architecture) for the module map and runtime data flow.
* [Rollout Package README](/en/docs/architecture/readme-rollout) for rollout modes and request/response contracts.
* [Train Stack README](/en/docs/architecture/readme-train-stack) for the FSDP backend, the train-step contract, and structural injection.
* [Algorithms Package README](/en/docs/architecture/readme-algorithms) for per-track loss contracts and the reward→advantage path.
* [SDE Package README](/en/docs/architecture/readme-sde) for SDE strategy rules, schedules, and runtime kernels.
* [Weight Sync README](/en/docs/architecture/readme-weight-sync) for dedicated rollout synchronization.

## Data Flow [#data-flow]

```text
<Domain>Trainer
  -> build RolloutReq, dispatch to the rollout engine
  -> RolloutResp with tracks[name] (conditions, segments, rewards, media)
  -> RewardService.score_and_attach    -> track.rewards
  -> RolloutTrack.compute_advantages    -> track.advantages
  -> TrainStack.train_track(track)      shards across train workers, runs the mini-batch loop
  -> optional weight sync back to dedicated rollout workers
```

`RolloutReq` and `RolloutResp` (in `unirl/types/`) are the important boundary
between rollout and training. New engines should adapt backend-specific outputs
into this typed boundary instead of leaking backend objects into training code.

The trainer (`unirl/trainer/<domain>.py`) owns the placement block, worker
construction, and phase ordering for rollout, reward, advantage, train, and sync.

For deployment modes and engine-specific requirements, use the generated
[Rollout Package README](/en/docs/architecture/readme-rollout) as the
source of truth.

## One Algorithm per Track [#one-algorithm-per-track]

The training loss is a single per-track `cfg.algorithm` — a `StageAlgorithm`
such as `unirl.algorithms.diffusion_grpo.DiffusionGRPO`. There is no separate
driver-side "rollout control" object:

* reward→advantage z-scoring lives on `RolloutTrack.compute_advantages` (`unirl/types/rollout_resp.py`);
* SDE-index selection lives on `DiffusionSamplingParams.resolve_sde_indices` (`unirl/types/sampling.py`).

A single-track recipe binds one `cfg.algorithm`; a multi-track recipe (for
example PE) nests one `algorithm:` node per track and runs sibling
`TrainStack`s. See [Trainer & Training Stack](/en/docs/architecture/trainer-v2).
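For intuition about reward→advantage z-scoring, here is a minimal group-wise normalization in plain Python. It is a sketch, not the body of `RolloutTrack.compute_advantages`: it assumes rewards are laid out contiguously per prompt, uses the population standard deviation, and adds an assumed `eps` to avoid division by zero.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_zscore(
    rewards: Sequence[float], samples_per_prompt: int, eps: float = 1e-6
) -> List[float]:
    """Z-score rewards within each group of `samples_per_prompt` siblings."""
    if samples_per_prompt <= 0 or len(rewards) % samples_per_prompt:
        raise ValueError("rewards must split evenly into groups")
    advantages: List[float] = []
    for start in range(0, len(rewards), samples_per_prompt):
        group = rewards[start:start + samples_per_prompt]
        mean = fmean(group)
        std = pstdev(group)
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


adv = group_zscore([1.0, 3.0, 5.0, 5.0], samples_per_prompt=2)
print(adv)
```

A group whose samples all received the same reward ends up with zero advantages, so it contributes no gradient signal under this scheme.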

## Roadmap [#roadmap]

Near-term direction spans three tracks — **Infra**, **Algorithm**, and **Model**. See the
[Roadmap](/en/docs/architecture/roadmap).

---

# Roadmap (/en/docs/architecture/roadmap)

> Near-term direction across the Infra, Algorithm, and Model tracks — baselines, goals, and TODOs.

This roadmap tracks near-term direction across three tracks — **Infra**, **Algorithm**, and
**Model**. Each item lists its current baseline in this repository and the work remaining.

This is a living document, updated as the project evolves. Unlisted topics aren't excluded;
they just get less coverage. Planning horizon: **2026 H1**.

## Legend [#legend]

* **Status** — `[ ]` planned · `[~]` in progress · `[x]` done.
* **Priority** (committed items only) — `P0` must-have this cycle · `P1` targeted this cycle ·
  `P2` stretch / next. *Candidate* items are exploratory and intentionally
  unprioritized.
* **Help wanted** — 🙋 marks well-scoped items that are open to claim.
* **Tracking** — each committed item should have a `[Tracking]` issue (see
  [GitHub Issues Workflow](/en/docs/others/github-issues-workflow)); `(RFC needed)` means no
  tracking issue exists yet — open one to claim it. Owners are tracked on the per-item issues
  rather than inline here.

## This cycle at a glance [#this-cycle-at-a-glance]

* **Infra** — make the training backend pluggable (a VeOmni backend for composable FSDP / SP /
  EP), harden vLLM-Omni rollout, async reward, and cross-engine conformance, and add a
  differentiable-reward (REFL) training mode.
* **Algorithm** — close the policy-gradient / PPO gaps (critic + GAE, KL / reference policy,
  reward credit assignment) and stand up the REFL family, starting with DR-Tune.
* **Model** — build first-class support matrices around important Diffusers model families
  (SD3.5, Qwen-Image, FLUX, Wan / HunyuanVideo) and bring up next-generation
  HunyuanImage (3.5) for RL post-training.

| Track     | Item                                               | Priority | Status        |
| --------- | -------------------------------------------------- | -------- | ------------- |
| Infra     | VeOmni training backend                            | `P0`     | `[ ]` planned |
| Infra     | REFL (differentiable-reward) infra                 | `P0`     | `[ ]` planned |
| Infra     | vLLM-Omni rollout expansion and hardening          | `P1`     | `[ ]` planned |
| Infra     | Async Reward Overlap                               | `P1`     | `[ ]` planned |
| Infra     | Rollout engine conformance matrix                  | `P2` 🙋  | `[ ]` planned |
| Infra     | Benchmark / profiling examples and tools           | `P2` 🙋  | `[ ]` planned |
| Infra     | UI / observability                                 | `P1` 🙋  | `[ ]` planned |
| Algorithm | Policy-Gradient / PPO family                       | `P1`     | `[~]` in progress |
| Algorithm | KL / Reference Policy control                      | `P1`     | `[ ]` planned |
| Algorithm | Reward / Advantage credit assignment consolidation | `P2`     | `[ ]` planned |
| Algorithm | Multi-track / shared-backbone RL                   | `P2`     | `[ ]` planned |
| Algorithm | REFL family (DR-Tune first)                        | `P0`     | `[ ]` planned |
| Algorithm | Preference / forward-process (NFT)                 | —        | `[x]` done    |
| Model     | Core Image DiT support matrix                      | `P1`     | `[ ]` planned |
| Model     | Video RL model track                               | `P2`     | `[ ]` planned |
| Model     | HunyuanImage 3.5                                   | `P1`     | `[ ]` planned |

## Infra [#infra]

### `[ ]` `P0` VeOmni Training Backend [#--p0-veomni-training-backend]

* **Baseline.** The trainer exposes a swappable `backend:` block (see `examples/<domain>/*.yaml`);
  today the only implementation is the native `FSDPBackend`
  (`unirl/train/backend/fsdp.py`: FSDP2 wrap plus LoRA / NFT / EMA injection, offload,
  checkpoint). `TrainTopology` already carries `dp / tp / pp / sp / ep / cp` fields, but only
  DP / FSDP are topology-driven today; a hybrid-FSDP (HSDP) mesh path exists in
  `train/inject.py` but is hard-coded (shard size 8), not mapped from `TrainTopology`.
* **Goal.** Add a `VeOmniBackend` behind the same backend contract to reuse VeOmni's
  model-centric distributed recipes (composable FSDP / SP / EP via a high-level parallel-plan
  API).
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Implement `unirl/train/backend/veomni.py` satisfying the backend Remote
    contract (LoRA / NFT / EMA injection, optimizer, scheduler, checkpoint, onload/offload).
  * `[ ]` Map `TrainTopology` onto the VeOmni parallel-plan API (FSDP / FSDP2, HSDP,
    Sequence Parallelism via DeepSpeed-Ulysses / Async-Ulysses, Expert Parallelism for MoE).
  * `[ ]` Enable SP for long video / AR sequences and EP for MoE backbones (HunyuanImage 3.x).
  * `[ ]` Torch Distributed Checkpoint and resume parity with `FSDPBackend`.
  * `[ ]` Keep the Policy / Stage and weight-sync (LoRA IPC / NCCL) paths working under the
    new backend.

### `[ ]` `P0` REFL Support (differentiable-reward training mode) [#--p0-refl-support-differentiable-reward-training-mode]

* **Baseline.** Rewards are computed as scalars on rollout actors and turned into advantages
  (`reward/` then `algorithm.compute_advantages`). There is no gradient path from a reward
  model back into the denoiser. Train-side algorithms today are policy-gradient
  (`DiffusionGRPO` / `ARGRPO`, plus `DiffusionDPPO`) and forward-process (`DiffusionNFT`).
* **Goal.** Add a differentiable-reward training mode so reward gradients can back-propagate
  through sampling into the model. This is the infra prerequisite for the **REFL algorithm
  family** (DR-Tune and other reward-feedback methods — see Algorithm below).
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Add differentiable train-side reward scorers (today's `reward/local`, e.g.
    ImageReward / HPS / PickScore, run under `torch.no_grad()` and return scalars).
  * `[ ]` Gradient-checkpointed sampling replay plus stop-gradient hooks on denoiser inputs
    for memory-bounded back-propagation through steps.
  * `[ ]` A reward-backprop `StageAlgorithm` surface, separate from the scalar
    reward-to-advantage pipeline.
  * `[ ]` Recipe and reward-service wiring, with an SD3 smoke recipe.

### `[ ]` `P1` vLLM-Omni Rollout Expansion and Hardening [#--p1-vllm-omni-rollout-expansion-and-hardening]

* **Baseline.** Rollout engines already include `trainside`, `sglang`, `sglang_llm`,
  `vllm_omni`, and `composed`; the base engine defines a verl-omni-style IPC / NCCL /
  LoRA weight-sync contract, and the `vllm_omni` path already has SD3 and HunyuanImage3
  log-prob / latent-capture foundations.
* **Goal.** Do not reimplement vLLM-Omni internals inside UniRL. Instead, consume its
  high-throughput multimodal rollout capabilities, expand the existing vLLM-Omni path to more
  core models, and keep behavior aligned with trainside / SGLang rollout.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Extend per-step log-probs, intermediate latent capture, and denoising trajectory replay
    to Qwen-Image, SD3.5, and Wan2.2.
  * `[ ]` Harden LoRA / full-weight sync with checksum, dtype / shape fail-fast checks, bucketed
    transfer metrics, and sync-target rules for multi-track / shared-backbone models.
  * `[ ]` Integrate or adapt vLLM-Omni step-wise continuous batching, embedding cache, and request
    scheduling without duplicating serving-engine kernels in UniRL.
  * `[ ]` Add rollout parity tests across vLLM-Omni and trainside / SGLang for Qwen-Image, SD3.5,
    Wan2.2, and HunyuanImage3.

### `[ ]` `P1` Async Reward Overlap [#--p1-async-reward-overlap]

* **Baseline.** `reward/local` already includes OCR, aesthetic, CLIP, HPS, PickScore,
  ImageReward, VideoPickScore, rule-based exact match, and related components, with a reward
  service execution path. The remaining gap is scheduling reward computation as an independent
  stage that overlaps rollout / training.
* **Goal.** Follow the verl-omni async-reward direction by placing heavy reward models on
  dedicated actors / GPUs, reducing step wall-clock while preserving the scalar
  reward-to-advantage semantics.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Independent reward actor placement with async submit / collect for OCR, VLM judges,
    image rewards, and video rewards.
  * `[ ]` Introduce reward futures / queues between rollout and training, and log reward latency,
    queue depth, staleness, and GPU utilization.
  * `[ ]` Keep the existing synchronous reward service compatible, with Qwen-Image OCR and Wan /
    HunyuanVideo video-reward smoke recipes.
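
As a rough illustration of the submit / collect pattern, the sketch below overlaps reward scoring for the next batch with the training step on the current one, using only the standard library. `score_batch` and `run_steps` are hypothetical stand-ins, not UniRL APIs, and the sketch assumes the rollout outputs for the next batch already exist when the current training step starts.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(batch):
    """Stand-in reward: a real scorer would call OCR, a VLM judge, and so on."""
    return [float(len(sample)) for sample in batch]


def run_steps(batches, train_step):
    """Score batch i+1 on a worker thread while train_step runs on batch i."""
    if not batches:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(score_batch, batches[0])
        for i, batch in enumerate(batches):
            rewards = pending.result()  # blocks until this batch's rewards exist
            if i + 1 < len(batches):
                # Submit the next batch before training so the two overlap.
                pending = pool.submit(score_batch, batches[i + 1])
            results.append(train_step(batch, rewards))
    return results
```

The scalar reward handed to `train_step` is unchanged by the overlap, which is the property the goal above asks to preserve. Only the scheduling differs.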

### `[ ]` `P2` 🙋 Rollout Engine Conformance Matrix [#--p2--rollout-engine-conformance-matrix]

* **Baseline.** `RolloutReq` already treats sigmas as the single source of truth across
  trainside / SGLang / vLLM-Omni, and `rollout/engine/sigma_verify.py` provides validation.
  Coverage is still uneven for log-prob source, latent capture, initial latents, media refs,
  and multi-track responses.
* **Goal.** Create a model × rollout engine × algorithm smoke / conformance matrix so model and
  algorithm additions surface engine-specific behavior early.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` 🙋 Maintain an engine capability matrix for `trainside`, `sglang`, `sglang_llm`,
    `vllm_omni`, and `composed`: native/replay log-probs, latent capture, initial latents,
    media refs, and multi-track lineage.
  * `[ ]` Extend sigma schedule parity, SDE-index scheduler parity, and replay-logprob parity
    tests.
  * `[ ]` Add compose + smoke coverage for core recipes: SD3/SD3.5, Qwen-Image, FLUX.2 Klein,
    Wan2.2, HunyuanVideo1.5, and HunyuanImage3.
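
One possible shape for the capability matrix is a plain mapping rendered to Markdown. The engine names come from the list above; the feature keys and helper names are assumptions made for this sketch, and every cell starts as unknown so the table never claims support that has not been verified.

```python
ENGINES = ("trainside", "sglang", "sglang_llm", "vllm_omni", "composed")
FEATURES = (
    "native_logprobs",
    "replay_logprobs",
    "latent_capture",
    "initial_latents",
    "media_refs",
    "multi_track_lineage",
)
_CELL = {True: "yes", False: "no", None: "?"}


def empty_matrix():
    """Every cell starts as None (unknown) until a smoke test fills it in."""
    return {engine: dict.fromkeys(FEATURES) for engine in ENGINES}


def to_markdown(matrix):
    """Render the matrix as a Markdown table, one row per engine."""
    header = "| Engine | " + " | ".join(FEATURES) + " |"
    rule = "| --- |" + " --- |" * len(FEATURES)
    rows = [
        f"| `{engine}` | " + " | ".join(_CELL[matrix[engine][f]] for f in FEATURES) + " |"
        for engine in ENGINES
    ]
    return "\n".join([header, rule, *rows])
```

Keeping the matrix as data rather than hand-edited Markdown would let conformance tests update cells and regenerate the table in one step.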

### `[ ]` `P2` 🙋 Benchmark / Profiling Examples and Tools [#--p2--benchmark--profiling-examples-and-tools]

* **Baseline.** The repo has scattered timing fields and smoke outputs (for example tensor-batch
  `wall_clock`, tokens/s in the SGLang LLM smoke, and throughput notes in a few recipes), but no
  unified benchmark entry point, fixed workload examples, or comparable report format.
* **Goal.** Add reproducible end-to-end benchmark tooling for training, rollout, and reward so
  backend, rollout-engine, reward-overlap, video-decode, and weight-sync optimizations share a
  stable baseline and regression checks.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` 🙋 Provide minimal benchmark recipes: SD3 / SD3.5 image, Qwen-Image OCR, and Wan2.2 or
    HunyuanVideo video reward, covering common trainside, SGLang, and vLLM-Omni paths.
  * `[ ]` 🙋 Add `scripts/benchmark_*` or an equivalent CLI with fixed seeds, prompt sets, warmup,
    repeats, batch / group size, and outputs for samples/s, step wall-clock, GPU memory, and
    rollout / reward / train phase timing.
  * `[ ]` Standardize a benchmark report schema (JSONL / wandb / Prometheus) with environment,
    commit, model, engine, topology, LoRA / FSDP / SP / EP settings for comparison across PRs and
    machines.
  * `[ ]` Add focused profiling examples for vLLM-Omni continuous batching, async reward overlap,
    video VAE decode / tiling, and LoRA weight sync.
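
A report schema along these lines could start as one JSON object per run. The field names below are hypothetical and are only meant to show how the items listed above (commit, model, engine, topology, parallelism settings, phase timings) might fit in a JSONL record.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkRecord:
    """One benchmark run; all field names are illustrative, not a fixed schema."""
    commit: str
    model: str
    engine: str
    num_devices: int
    samples_per_s: float
    step_wall_clock_s: float
    peak_gpu_mem_gb: float
    phase_s: dict = field(default_factory=dict)    # rollout / reward / train timings
    settings: dict = field(default_factory=dict)   # LoRA / FSDP / SP / EP knobs


def to_jsonl(records):
    """Serialize records as JSON Lines with stable key order for diffing."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

Sorting keys keeps two reports from different PRs line-comparable with ordinary text tools.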

### `[ ]` `P1` 🙋 UI-Interface Optimization [#--p1--ui-interface-optimization]

* **Baseline.** Logging is wandb-only (`utils/wandb_logger.py`, `wandb_metrics.py`) plus media
  previews; there is no unified dashboard, no rollout-sample gallery beyond wandb, and no
  local/offline run viewer.

* **Goal.** Improve observability and the experiment-tracking interface.

* **Tracking.** *(RFC needed)*

* TODO:
  * `[ ]` 🙋 Pluggable tracking backends (wandb / SwanLab / TensorBoard) behind one logger
    interface.
  * `[ ]` Rollout sample gallery plus reward-curve and per-component reward views; multi-track
    (image / AR) visualization for composed recipes.
  * `[ ]` Rollout monitoring with Prometheus / Grafana for throughput, queue depth,
    and weight-sync time.
  * `[ ]` 🙋 Local/offline media and config viewer for runs without wandb access.

* *Candidate.* 🙋 Fully asynchronous actor↔rollout↔reward↔train pipeline (the repo already ships
  TransferQueue / Mooncake / TensorStore foundations); Ascend NPU support; an optional
  Megatron-Core backend for additional distributed-training options.

## Algorithm [#algorithm]

UniRL's train-side algorithms group into three families. The table shows current support;
details follow.

| Family                       | Optimizes via                                                   | Supported now                                                    | Planned                                                    |
| ---------------------------- | --------------------------------------------------------------- | ---------------------------------------------------------------- | ---------------------------------------------------------- |
| Policy-Gradient / PPO        | rollout log-probs + advantages through a clipped / KL surrogate | GRPO family (FlowGRPO / DanceGRPO / MixGRPO), DPPO, ARGRPO, DRPO | DDPO, DPOK, KL / reference policy, full PPO (critic + GAE) |
| REFL (reward-feedback)       | a differentiable reward back-propagated through sampling        | —                                                                | DR-Tune, ReFL, DRaFT, AlignProp                            |
| Preference / forward-process | forward-process or pairwise targets (no rollout log-prob)       | NFT                                                              | —                                                          |

### `[~]` `P1` Policy-Gradient / PPO family [#-p1-policy-gradient--ppo-family]

* **Baseline.** The PPO *objective* is already in use: the GRPO family is the PPO clipped-ratio
  surrogate (`diffusion_grpo.py` / `ar_grpo.py` `_grpo_clip_loss`); `DiffusionDPPO` swaps the clip
  for a KL-ADV masking criterion; `ARGRPO` is the text / VLM variant. What is missing is
  critic-based advantage estimation and diffusion-native PPO recipes.
* **Tracking.** *(RFC needed)*
* Planned types:
  * `[ ]` `P1` **Full PPO (value critic + GAE)** — the real gap versus GRPO; needs a value head
    over latents / timesteps (diffusion) or tokens (AR / VLM). Prioritize the AR / VLM track
    first.
  * `[ ]` `P2` 🙋 **DDPO (Black et al. 2023)** — PPO-clipped policy gradient over the denoising
    MDP; largely an advantage / recipe variation on the existing clipped core.
  * `[ ]` `P2` 🙋 **DPOK (Fan et al. 2023)** — KL-regularized policy gradient.
* *Candidate.* 🙋 RLOO, REINFORCE++, GSPO, DAPO, Dr.GRPO, GRPO-Guard — advantage /
  clip / ratio-granularity tweaks on the existing clipped core, so each can be added at low
  incremental cost.

### `[ ]` `P1` KL / Reference Policy Control [#--p1-kl--reference-policy-control]

* **Baseline.** Rollout segments already carry old-policy log-probs for GRPO / DPPO ratios, but
  a frozen reference policy and KL penalty are not implemented yet. DPPO has KL-ADV masking, but
  that is not a general reference-policy KL control layer.
* **Goal.** Provide shared reference-policy infrastructure and KL controllers for PPO, DPOK,
  Diffusion-DPO, ARGRPO / GSPO, and related methods.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Frozen reference-policy loading, offload/onload, log-prob replay, and cache interfaces.
  * `[ ]` Per-track KL controller and adaptive KL coefficient, with diffusion timestep-level and
    AR token-level statistics.
  * `[ ]` Wire KL metrics into wandb / tracking and add a real reference-policy KL loss term.
  * `[ ]` Reuse the same reference-policy contract across DPOK / Diffusion-DPO / full PPO recipes.
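
One common way to make the KL coefficient adaptive is a proportional controller that nudges the coefficient toward a target KL. The sketch below shows that pattern as it appears in several RLHF codebases; it is not a UniRL design, and the class name, the `0.2` error clip, and the `horizon` parameter are assumptions.

```python
class AdaptiveKLController:
    """Proportional controller: raise the coefficient when KL overshoots the target."""

    def __init__(self, init_coef, target_kl, horizon):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adaptation

    def update(self, observed_kl, n_steps):
        """Adjust the coefficient after n_steps samples with the given mean KL."""
        # Relative error, clipped so one noisy batch cannot swing the coefficient.
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef
```

A per-track setup, as the TODO suggests, could hold one controller per rollout track and feed it timestep-level or token-level KL means.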

### `[ ]` `P2` Reward / Advantage Credit Assignment Consolidation [#--p2-reward--advantage-credit-assignment-consolidation]

* **Baseline.** `RolloutTrack.compute_advantages` already supports GRPO-style group advantages,
  `RolloutResp.propagate_rewards` supports mean / max / sum parent-child reward aggregation, and
  `algorithms/normalizers.py` plus `reward/aggregation.py` provide several normalization /
  aggregation utilities.
* **Goal.** Consolidate the existing pieces into a shared algorithm layer so algorithms and
  recipes do not each reimplement reward shaping.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Best-of-N / rejection-style reward propagation, plus component reward weight /
    normalize / schedule configuration.
  * `[ ]` One interface for diffusion timestep-level advantages and AR token-level advantage
    expansion.
  * `[ ]` Multi-track parent / child reward credit rules for composed rollout and
    think→recaption→image chains.
  * `[ ]` Unified metrics for reward components, advantage normalizers, and clipping statistics.

### `[ ]` `P2` Multi-track / Shared-backbone RL [#--p2-multi-track--shared-backbone-rl]

* **Baseline.** HunyuanImage3 and PE composed recipes already provide AR + diffusion /
  multi-track foundations, with one `StageAlgorithm` per track driven by sibling `TrainStack`s.
  Shared-backbone loss balance, update cadence, and cross-modal reward credit are still ad hoc.
* **Goal.** Turn the HunyuanImage3 think / recaption / image-generation joint-training path into
  reusable infrastructure, while leaving room for later BAGEL and Qwen3-Omni support.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Loss weights, gradient accumulation, and optimizer-step policy when multiple
    `StageAlgorithm` instances share one backbone.
  * `[ ]` Joint-update / alternating-update recipes for image and AR tracks.
  * `[ ]` Cross-modal reward propagation: how text / VLM rewards credit image or think tracks.
  * `[ ]` Keep BAGEL and Qwen3-Omni as candidate models, not committed model-package work in this
    cycle.

### `[ ]` `P0` REFL family (reward-feedback / differentiable reward) [#--p0-refl-family-reward-feedback--differentiable-reward]

REFL is the umbrella for **reward-feedback learning**: a *differentiable* reward is
back-propagated through the sampling chain into the model. The family sits on the REFL training
infra (see Infra above) and collects several algorithm types.

* **Baseline.** None implemented yet.
* **Tracking.** *(RFC needed)*
* Types under REFL:
  * `[ ]` `P0` **DR-Tune (Deep Reward Tuning, ECCV 2024)** — primary near-term target. Back-props
    the reward to the input noise with stop-gradient on denoiser inputs (avoids gradient
    explosion) and trains on a subset of equally-spaced steps (memory efficient).
  * `[ ]` `P2` 🙋 **ReFL (ImageReward, 2023)** — the original reward-feedback method; reward
    backprop on a late, randomly-chosen denoising step.
  * `[ ]` `P2` 🙋 **DRaFT / DRaFT-LV (2023)** — backprop through (truncated) sampling with LoRA;
    DRaFT-LV adds low-variance multi-sample gradients.
  * `[ ]` `P2` 🙋 **AlignProp (2023)** — reward backprop through full sampling with gradient
    checkpointing.
* TODO:
  * `[ ]` A shared differentiable-reward `StageAlgorithm` base on top of the REFL infra.
  * `[ ]` Implement DR-Tune first (`unirl/algorithms/drtune.py`): stop-gradient hook plus
    equally-spaced step-subset selection (reuse the SDE-index scheduler).
  * `[ ]` HPSv2 / PickScore reward targets; SD3 and Qwen-Image recipes; compare against the
    FlowGRPO baseline.
  * `[ ]` Add the remaining REFL types behind the same base once DR-Tune lands.
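
The "subset of equally-spaced steps" idea can be pictured with a few lines of index arithmetic. This is only an illustration of even spacing: the function name is hypothetical, it does not use the SDE-index scheduler mentioned above, and DR-Tune's exact offset or randomization policy is not specified here.

```python
def equally_spaced_steps(num_steps, num_train_steps):
    """Pick num_train_steps indices spread evenly over range(num_steps)."""
    if not 0 < num_train_steps <= num_steps:
        raise ValueError("need 0 < num_train_steps <= num_steps")
    # Integer arithmetic avoids float rounding at the bucket boundaries.
    return [(i * num_steps) // num_train_steps for i in range(num_train_steps)]
```

Only the selected indices would keep gradients during sampling replay, which is where the memory saving comes from.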

### `[x]` Preference / forward-process family [#x-preference--forward-process-family]

* **Baseline.** `DiffusionNFT` (forward-process reconstruction with default / old LoRA adapters)
  is implemented.
* *Candidate.* 🙋 Diffusion-DPO — pairwise preference optimization (no reward model).
  Note this is distinct from the existing `DiffusionDPPO`, which is policy optimization with
  KL masking, not preference DPO.

## Model [#model]

### `[ ]` `P1` Core Image DiT Support Matrix [#--p1-core-image-dit-support-matrix]

* **Baseline.** The repo already has `sd3`, `qwen_image`, and `flux2_klein` model packages;
  `sd3_mixgrpo` brings in SD3.5 as an `sd3` checkpoint / recipe variant, and
  `flux2_klein_*` recipes already exist. In the Diffusers ecosystem, SD3.5,
  Qwen-Image (including edit / inpaint), and FLUX are the key image DiT / flow-matching
  families.
* **Goal.** Promote image DiT support from runnable recipes to a first-class support matrix that
  makes training, rollout, reward, and smoke coverage explicit for each model.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Confirm SD3.5's boundary as an `sd3` package checkpoint variant and promote the
    existing `sd3_mixgrpo` recipe.
  * `[ ]` Add Qwen-Image trainside / vLLM-Omni smoke coverage, OCR / text-rendering reward
    baselines, and documented LoRA targets plus FSDP wrap hints.
  * `[ ]` Promote FLUX.2 Klein from existing recipes to first-class support; keep FLUX.1 dev /
    schnell / fill / control / redux / kontext as candidate extensions rather than conflating them
    with FLUX.2 Klein.
  * `[ ]` Maintain a `model × rollout engine × algorithm` coverage table for FlowGRPO / MixGRPO /
    DPPO / DPOK across trainside / SGLang / vLLM-Omni.

### `[ ]` `P2` Video RL Model Track [#--p2-video-rl-model-track]

* **Baseline.** The repo already has `wan21`, `wan22`, `hunyuan_video`, and `hunyuan_video15`
  packages plus `wan21_*`, `wan22_*`, and `hunyuan_video15_t2v_*` recipes.
  In Diffusers, Wan2.1 / Wan2.2, HunyuanVideo, and CogVideoX are the major video pipelines.
* **Goal.** Harden long-video RL for Wan2.2 and HunyuanVideo1.5 rather than starting video
  support from scratch; keep CogVideoX as a P2 / candidate target.
* **Tracking.** *(RFC needed)*
* TODO:
  * `[ ]` Wan2.2 TI2V, VACE / Animate condition schemas while keeping T2V / I2V recipes
    reproducible.
  * `[ ]` DanceGRPO / FlowGRPO video recipes with temporal-consistency, VideoPickScore, and
    VLM-as-judge reward baselines.
  * `[ ]` Long-video SP / USP, VAE tiling / offload, and decode-cost profiling.
  * `[ ]` Video media previews, per-component rewards, and temporal-metric visualization.

### `[ ]` `P1` HunyuanImage 3.5 [#--p1-hunyuanimage-35]

* **Baseline.** HunyuanImage 3.0 is supported (`unirl/models/hunyuan_image3/`:
  t2i / it2i / i2t / t2t modes, AR + diffusion stages, a think-recaption RL recipe, and a
  vLLM-Omni rollout adapter). HunyuanImage 3.0 is an 80B MoE (64 experts, \~13B active per
  token) autoregressive native-multimodal model on a Hunyuan-A13B backbone.

* **Goal.** Bring up the next-generation HunyuanImage (3.5 if/when released) for RL post-training.

* **Tracking.** *(RFC needed; gated on the VeOmni backend and on a 3.5 release)*

* TODO:
  * `[ ]` Model bundle and config for 3.5; vLLM-Omni rollout adapter and stage configs.
  * `[ ]` Train it under the VeOmni backend (EP for the 64-expert MoE, SP for long multimodal
    sequences).
  * `[ ]` Think / recaption plus RL recipes (MixGRPO / SRPO style), reusing the multi-track HI3
    plumbing.
  * `[ ]` Confirm 3.5 weights and release; the package currently targets 3.0.

* *Candidate.* 🙋 BAGEL (unified understanding + generation, FlowGRPO direction);
  Qwen3-Omni (Thinker / Talker MoE, gated on AR / omni rollout, EP, and GSPO);
  CogVideoX; Sana / PixArt as efficient T2I baselines; Qwen-Image-Edit, FLUX Kontext /
  Control / Fill / Redux, and ControlNet-SD3 / SDXL for editing and controllable generation.

## Contributing [#contributing]

Pick up any 🙋 item (or any open `[ ]`): open a `[Tracking]` or `[RFC]` issue following the
[GitHub Issues Workflow](/en/docs/others/github-issues-workflow), then link it back here so this
page stays the high-level index. Anything not listed here isn't excluded — open a feature
request to propose it.

---

# Trainer & Training Stack (/en/docs/architecture/trainer-v2)

> The single-controller per-domain trainer, the FSDP train stack, and the flat conf recipe shape.

UniRL runs on one trainer architecture: a single-controller `Remote` /
`placement` layer with per-domain trainers and a pluggable FSDP train stack. The
earlier v1 actor-group runtime (and its `conf_v1/` recipes) has been retired;
everything below is the current, default path.

## Entrypoints and trainers [#entrypoints-and-trainers]

Each domain has its own Hydra entrypoint and `<Domain>Trainer`
(`unirl/trainer/`), all driven the same way:

| Entrypoint                            | Trainer                    | Domain                                 |
| ------------------------------------- | -------------------------- | -------------------------------------- |
| `python -m unirl.train_diffusion`     | `trainer/diffusion.py`     | Diffusion image / video                |
| `python -m unirl.train_vlm`           | `trainer/vlm.py`           | Autoregressive VLM / LLM               |
| `python -m unirl.train_pe`            | `trainer/pe.py`            | Prompt-enhancer (joint AR + diffusion) |
| `python -m unirl.train_unified_model` | `trainer/unified_model.py` | HunyuanImage3 (mixed AR + diffusion)   |

The shared lifecycle lives in `trainer/base.py`: acquire a Ray `DevicePool`,
build the rollout and train workers inside a placement block, then run the
rollout → reward → advantage → train → optional weight-sync loop.

## Runtime model [#runtime-model]

The trainer opens one placement block and instantiates each component as a
sibling `Remote`, passing siblings by handle:

```text
<Domain>Trainer(num_devices, batch_size, ...cfg blocks...)
  with placement(pool, fraction=1.0, shared_workers=True):
    bundle    = remote(bundle_cfg)
    pipeline  = remote(pipeline_cfg, bundle=bundle)
    backend   = remote(backend_cfg, bundle=bundle)            # FSDPBackend
    rollout   = remote(rollout_cfg[, pipeline=pipeline])      # engine
    reward    = remote(reward_cfg)                            # RewardService
    algorithm = remote(algorithm_cfg, pipeline=pipeline)      # StageAlgorithm
    stack     = remote(stack_cfg, fsdp_backend=backend, algorithm=algorithm)
    sync      = remote(sync_cfg, backend=backend, rollout=rollout)   # dedicated rollout only
```

The single-controller layer (`unirl/distributed/group/`: `Remote`, `placement`,
`RankInfo`) carries `DP / TP / PP / SP / EP` rank information, which is what
later parallelism work (SP for long sequences, EP for MoE) plugs into.
`trainside` recipes share the trained module (no `sync`); dedicated engines add
a weight-sync bridge under `sync`.

## FSDP backend and train stack [#fsdp-backend-and-train-stack]

Two `Remote` siblings split the training work:

* **`FSDPBackend`** (`unirl/train/backend/fsdp.py`) owns the trainable model:
  structural injection (LoRA / NFT / mirror EMA) before the FSDP2 wrap, plus the
  optimizer, LR scheduler, EMA shadow, eval-EMA swap, checkpoint, and
  onload/offload.
* **`TrainStack`** (`unirl/train/stack.py`) owns loss/backward sequencing. It
  takes handles to one `FSDPBackend` and one `StageAlgorithm`, and per rollout
  track runs `prepare_segment → mini-batch compute_loss_and_backward loop →
  optimizer_step`, returning a `TrainStepResult`.

`TrainStack` is single-stage by design — one track, no track-name dict.
**Multi-track training (for example PE) uses sibling `TrainStack`s**, one per
track. HunyuanImage3's mixed AR + diffusion training uses `unified_model_stack.py`, a
multi-stage variant.

`num_updates_per_batch > 1` runs several PPO mini-batch updates on one rollout
shard with π\_old frozen once by `prepare_segment`; it is gated on
`StageAlgorithm.supports_multi_update`.
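
The sequencing described above can be sketched with recording stubs. Everything here is hypothetical scaffolding: the real `TrainStack`, `FSDPBackend`, and `StageAlgorithm` signatures may differ, and the choice of one `optimizer_step` per update pass is an assumption of this sketch.

```python
class RecordingAlgorithm:
    """Stub StageAlgorithm that records the order of calls."""
    supports_multi_update = True

    def __init__(self):
        self.calls = []

    def prepare_segment(self, segment):
        self.calls.append("prepare_segment")  # old-policy state is frozen here, once
        return segment

    def compute_loss_and_backward(self, micro_batch):
        self.calls.append("backward")
        return float(len(micro_batch))  # placeholder loss value


class RecordingBackend:
    """Stub backend that counts optimizer steps."""

    def __init__(self):
        self.steps = 0

    def optimizer_step(self):
        self.steps += 1


def train_on_segment(algorithm, backend, segment, micro_batch_size=1, num_updates_per_batch=1):
    """prepare_segment once, then repeat: micro-batch backward loop, optimizer step."""
    if num_updates_per_batch > 1 and not algorithm.supports_multi_update:
        raise ValueError("algorithm does not support multiple updates per batch")
    prepared = algorithm.prepare_segment(segment)
    losses = []
    for _ in range(num_updates_per_batch):
        for start in range(0, len(prepared), micro_batch_size):
            micro_batch = prepared[start:start + micro_batch_size]
            losses.append(algorithm.compute_loss_and_backward(micro_batch))
        backend.optimizer_step()
    return losses
```

The point of the structure is that `prepare_segment` runs outside the update loop, so every pass is measured against the same frozen old policy.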

The full backend contract — components, the `FSDPConfig` / `LoraConfig` /
`EmaLoraConfig` / `EmaFullConfig` / `OptimizerConfig` / `LrSchedulerConfig`
schemas, and `TrainTopology` — is documented in the generated
[Train Stack README](/en/docs/architecture/readme-train-stack).

## Config shape [#config-shape]

Recipes live in a bucketed tree, `examples/<domain>/*.yaml` (one subdirectory per trainer domain). Each recipe instantiates its siblings by
`_target_` (no Hydra config-group overrides):

```yaml
backend:
  _target_: unirl.train.backend.fsdp.FSDPBackend
  block_class_names: ["JointTransformerBlock"]
  trainable_attr: transformer
  fsdp_cfg:
    _target_: unirl.train.configs.FSDPConfig
    fsdp_mode: full
    activation_checkpointing: false
  optimizer_cfg:
    _target_: unirl.train.backend.base.OptimizerConfig
    learning_rate: 3.0e-4
  lora_cfg:
    _target_: unirl.train.configs.LoraConfig
    rank: 32
    alpha: 64

stack:
  _target_: unirl.train.stack.TrainStack
  micro_batch_size: 1
  max_grad_norm: 1.0
  num_updates_per_batch: 2
```

## Where this is going [#where-this-is-going]

The roadmap's `VeOmniBackend` work targets exactly this contract: a new backend
implements the same `Remote` surface as `FSDPBackend` and maps `TrainTopology`
onto a parallel plan (FSDP / SP / EP) instead of hardcoded shard sizes. See the
[Roadmap](/en/docs/architecture/roadmap) Infra track.

For extension routing (which surface to build on), see
[Extending UniRL](/en/docs/guides/extending).

---

# Experiment Recipes (/en/docs/configuration/experiments)

> Recipes in the bucketed examples/ tree and how to select one per entrypoint.

Recipes are self-contained YAML files under `examples/`, bucketed by trainer domain (`diffusion/`, `vlm/`, `llm/`, `pe/`, `unified_model/`). Each one is the source of truth for model, algorithm, rollout engine, placement, reward, sync, and batch geometry. Select a recipe with `--config-name=<domain>/<recipe>` (no `.yaml`).

## Entrypoint per Domain [#entrypoint-per-domain]

The recipe family determines which entrypoint runs it:

| Entrypoint                            | Bucket           | Recipe families                                                                  |
| ------------------------------------- | ---------------- | -------------------------------------------------------------------------------- |
| `python -m unirl.train_diffusion`     | `diffusion/`     | `sd3_*`, `qwen_image_*`, `flux2_klein_*`, `wan21_*`, `wan22_*`, `hunyuan_video*` |
| `python -m unirl.train_vlm`           | `vlm/`, `llm/`   | `qwen_vl_argrpo_*`, `qwen3_ar_drpo_*`                                            |
| `python -m unirl.train_pe`            | `pe/`            | `pe_*`                                                                           |
| `python -m unirl.train_unified_model` | `unified_model/` | `hi3_*`                                                                          |
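
The bucket-to-entrypoint mapping in the table is mechanical, so a launcher helper could derive the command from the recipe name. The mapping values come from the table above; the helper itself (`launch_command`) is a hypothetical sketch and not an existing UniRL utility.

```python
ENTRYPOINT_BY_BUCKET = {
    "diffusion": "unirl.train_diffusion",
    "vlm": "unirl.train_vlm",
    "llm": "unirl.train_vlm",
    "pe": "unirl.train_pe",
    "unified_model": "unirl.train_unified_model",
}


def launch_command(recipe, overrides=()):
    """Build the argv for a bucketed recipe name such as 'diffusion/sd3_trainside'."""
    bucket, _, name = recipe.partition("/")
    if not name or bucket not in ENTRYPOINT_BY_BUCKET:
        raise ValueError(f"expected '<domain>/<recipe>', got {recipe!r}")
    return ["python", "-m", ENTRYPOINT_BY_BUCKET[bucket], f"--config-name={recipe}", *overrides]
```

For example, `launch_command("diffusion/sd3_trainside", ["num_devices=8"])` yields the same invocation shown elsewhere in these docs, with the Hydra override appended.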

## Maintained Families [#maintained-families]

| Family                               | Recipes                                                                                                                                            |
| ------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| SD3 (GRPO)                           | `sd3_trainside`, `sd3_trainside_tq_mooncake`, `sd3_dancegrpo`, `sd3_mixgrpo`                                                                       |
| SD3 NFT                              | `sd3_nft`, `sd3_nft_reward_service`, `sd3_nft_sglang`                                                                                              |
| SD3 Flow-DPPO                        | `sd3_flowdppo`, `sd3_flowdppo_vllmomni`                                                                                                            |
| SD3 SGLang                           | `sd3_sglang_native_colocate`, `sd3_sglang_replay_colocate`, `sd3_sglang_full_nccl_separate`, `sd3_sglang_full_tensor`, `sd3_sglang_lora_separate`  |
| SD3 vLLM-Omni                        | `sd3_vllmomni`, `sd3_vllmomni_full_ipc`, `sd3_vllmomni_full_nccl_separate`, `sd3_vllmomni_full_tensor`, `sd3_vllmomni_lora_separate`               |
| Qwen-Image                           | `qwen_image_trainside`, `qwen_image_dancegrpo`, `qwen_image_mixgrpo`, `qwen_image_nft`                                                             |
| Flux.2-Klein                         | `flux2_klein_trainside`, `flux2_klein_sglang`                                                                                                      |
| WAN 2.1                              | `wan21_t2v`, `wan21_t2v_dancegrpo`, `wan21_t2v_mixgrpo`, `wan21_i2v`                                                                               |
| WAN 2.2                              | `wan22_t2v_14b`, `wan22_t2v_14b_dancegrpo`, `wan22_t2v_14b_mixgrpo`, `wan22_i2v`                                                                   |
| HunyuanVideo                         | `hunyuan_video_t2v_trainside`, `hunyuan_video15_t2v_dancegrpo_trainside`, `hunyuan_video15_t2v_vllmomni_nccl_separate`                             |
| HunyuanImage3                        | `hi3_vllmomni`                                                                                                                                     |
| Qwen-VL ARGRPO (VLM)                 | `qwen_vl_argrpo_geo3k_mc_4x8`, `qwen_vl_argrpo_geo3k_mc_4x8_lora`, `qwen_vl_argrpo_geo3k_mc_sglang_4x8`, `qwen_vl_argrpo_geo3k_mc_sglang_4x8_lora` |
| Qwen3 DRPO (LLM)                     | `qwen3_ar_drpo_4b_base_dpao_sglang`                                                                                                                |
| PE (prompt enhancer, AR + diffusion) | `pe_trainside_pickscore`, `pe_sglang_full_pickscore`, `pe_sglang_full_wise`, `pe_sglang_lora_pickscore`                                            |

## Selecting a Recipe [#selecting-a-recipe]

```bash
python -m unirl.train_diffusion --config-name=diffusion/sd3_trainside
```

Launchers take the same bucketed recipe name; set `ENTRY` to select a non-diffusion entrypoint:

```bash
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
ENTRY=train_vlm bash examples/run_experiment_single_node.sh vlm/qwen_vl_argrpo_geo3k_mc_4x8
bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate
```

## How to Pick a Recipe [#how-to-pick-a-recipe]

Use this decision order when a task does not name a specific recipe:

1. Pick the modality and model family first: SD3, Qwen-Image, or Flux.2-Klein for image; WAN 2.1 / 2.2 or HunyuanVideo for video; HunyuanImage3 for mixed AR + diffusion; Qwen-VL / Qwen3 for VLM / LLM; PE for prompt-enhancer.
2. Pick the rollout topology: `trainside` for direct sampling, SGLang or vLLM-Omni recipes for dedicated rollout, and `colocate` when train and rollout share GPU bundles (vs `separate`).
3. Pick the algorithm: GRPO / DanceGRPO / MixGRPO for on-policy ratio losses, Flow-DPPO for KL-masked policy optimization, NFT for off-policy forward-process training, DRPO for AR text.
4. Pick the cluster-size variant, such as `4x8`, only after matching the target hardware.
5. Run a compose check before launching Ray work.

## Editing Guidance [#editing-guidance]

When adding a recipe:

1. Start from the closest existing `examples/<domain>/<recipe>.yaml`.
2. Keep model, reward, rollout engine, backend, stack, sync, placement, and batch geometry in YAML, each instantiated by `_target_`.
3. Use environment interpolation only for deployment-specific paths and logging identity.
4. Run a compose check with the entrypoint that matches the recipe's bucket, for example `python -m unirl.train_diffusion --config-name=<domain>/<recipe> --cfg job --resolve`.
5. Add the recipe to this page.

---

# Hydra Configuration (/en/docs/configuration/hydra)

> How UniRL composes, validates, and overrides runtime configuration.

UniRL uses Hydra. Recipes are self-contained YAML files under `examples/`, bucketed by trainer domain (`diffusion/`, `vlm/`, `llm/`, `pe/`, `unified_model/`); each training run selects one with `--config-name`:

```bash
python -m unirl.train_diffusion --config-name=<domain>/<recipe>
```

## Composition [#composition]

A recipe instantiates each runtime component directly by `_target_` (FSDP backend, train stack, rollout engine, reward service, algorithm, data source, …). Config classes are plain `@dataclass`es defined next to the code that consumes them — there is no ConfigStore and no registration decorator. A recipe wires a component by pointing its `_target_` at the class and nesting the component's `config:` block (also a `_target_`). Use the generated [Config Package README](/en/docs/configuration/readme-config-package) for the instantiation and validation contracts.
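
To show what "instantiate by `_target_`" means mechanically, here is a deliberately tiny stand-in written with the standard library. It is not how UniRL or Hydra implements it: Hydra's own `hydra.utils.instantiate` handles many more cases (partials, positional arguments, conversion modes). The sketch only resolves a dotted path and recursively builds nested `_target_` nodes.

```python
import importlib


def instantiate(node):
    """Recursively build objects from dicts that carry a dotted `_target_` path."""
    if isinstance(node, list):
        return [instantiate(item) for item in node]
    if not isinstance(node, dict):
        return node  # plain scalar: pass through unchanged
    kwargs = {key: instantiate(value) for key, value in node.items() if key != "_target_"}
    if "_target_" not in node:
        return kwargs  # ordinary mapping with no class to construct
    module_name, _, attr = node["_target_"].rpartition(".")
    factory = getattr(importlib.import_module(module_name), attr)
    return factory(**kwargs)
```

Because nested blocks are built first, a component receives fully constructed config objects, which mirrors the "component plus nested `config:` block" wiring described above.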

## Where Knobs Belong [#where-knobs-belong]

Keep recipe-defining choices in `examples/<domain>/<recipe>.yaml`. Keep cluster-local paths, model mounts, output directories, and WandB identity in launcher environment variables or CLI overrides (recipes interpolate them with `${oc.env:...}`). A new typed runtime component should define its config `@dataclass` next to the implementation; a recipe then references it by `_target_` (no registration step).

Override precedence is:

```text
CLI Hydra override > launcher env var > YAML default
```
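
Read as code, the precedence chain is a first-match lookup across three layers. The helper below is a hypothetical illustration of that ordering only; in real recipes the environment layer is read through `${oc.env:...}` interpolation rather than a separate dictionary.

```python
def resolve_setting(key, cli_overrides, env_values, yaml_defaults):
    """Return the value from the highest-precedence layer that defines key."""
    for layer in (cli_overrides, env_values, yaml_defaults):
        if key in layer:
            return layer[key]
    raise KeyError(key)
```

So a value set in all three places resolves to the CLI value, and a value set only in YAML falls through to the recipe default.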

## Runtime Contracts [#runtime-contracts]

Cross-component validators run before Ray worker creation. The implementation details and validator list live in the generated [Config Package README](/en/docs/configuration/readme-config-package).

`sync` and the tensor transport solve different problems. `sync` (`cfg.sync`) sends trainer weights back to dedicated rollout engines, while the tensor transport (`unirl/distributed/tensor/`) is the data plane for moving bulky rollout outputs between workers.

Use a compose check before launching a large job:

```bash
python -m unirl.train_diffusion --config-name=<domain>/<recipe> --cfg job --resolve
```

---

# First Run (/en/docs/getting-started/first-run)

> Compose and launch a UniRL experiment recipe.

Start by composing a recipe before launching Ray work. This catches missing paths, invalid Hydra overrides, and cross-component contract errors early.

```bash
python -m unirl.train_diffusion --config-name=diffusion/sd3_trainside --cfg job --resolve
```

## Single Node [#single-node]

Use the generic single-node launcher when you want the script to prepare local Ray runtime defaults. The first argument is a bucketed recipe name (`<domain>/<recipe>`):

```bash
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
```

The diffusion entrypoint is the default; select another with `ENTRY`:

```bash
ENTRY=train_vlm bash examples/run_experiment_single_node.sh vlm/qwen_vl_argrpo_geo3k_mc_4x8
ENTRY=train_pe  bash examples/run_experiment_single_node.sh pe/pe_trainside_pickscore
```

For a dry run:

```bash
DRY_RUN=1 bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
```

## Multi Node [#multi-node]

Use the role-aware launcher for multinode jobs:

```bash
bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate
```

## Direct Hydra Invocation [#direct-hydra-invocation]

You can invoke an entrypoint directly and override fields inline:

```bash
python -m unirl.train_diffusion \
  --config-name=diffusion/sd3_trainside \
  num_devices=8
```

Hydra override precedence is:

```text
CLI Hydra override > launcher env var > YAML default
```

## Sample Prompts [#sample-prompts]

Committed prompt lists live under `datasets/`, for example `datasets/pickscore/train.txt` (one prompt per line) and `datasets/pickscore/test.txt`. Recipes point their `data_source` at these by default.
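
For quick inspection of a prompt list in the one-prompt-per-line format, something like the following is enough. The helper names are hypothetical, UniRL's own data-source loader may behave differently, and skipping blank lines is an assumption of this sketch.

```python
from pathlib import Path


def parse_prompts(text):
    """Split text into prompts, one per line, dropping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_prompts(path):
    """Read a UTF-8 prompt file such as datasets/pickscore/train.txt."""
    return parse_prompts(Path(path).read_text(encoding="utf-8"))
```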

For real runs, point environment variables or CLI overrides to absolute data, model, and output paths:

```bash
DATA_PATH=/abs/path/train.json \
OUTPUT_DIR=/abs/path/outputs/run1 \
bash examples/run_experiment_single_node.sh diffusion/wan21_t2v
```

---

# Installation (/en/docs/getting-started/installation)

> Install UniRL and the optional documentation site.

UniRL requires Python `>=3.12,<3.14`. `torch` is intentionally **not** a base
dependency: it enters through exactly one mutually-exclusive **engine extra**
(`sglang` or `vllm`), which is what lets each engine pin its own locked CUDA
stack. You must pick one engine extra.
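
A quick preflight for the interpreter range can be sketched as follows. The helper name `python_supported` is hypothetical; only the `>=3.12,<3.14` bounds come from the requirement above.

```python
import sys


def python_supported(version_info=None):
    """Return True when the interpreter falls in the >=3.12,<3.14 range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    # Lower bound inclusive, upper bound exclusive, matching ">=3.12,<3.14".
    return (3, 12) <= major_minor < (3, 14)


if not python_supported():
    print("This interpreter is outside the range UniRL declares support for.")
```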

## Python Package [#python-package]

The recommended path is [uv](https://docs.astral.sh/uv/), which honors the
locked per-engine CUDA indexes declared in `pyproject.toml` (`[tool.uv]`):

```bash
# SGLang rollout stack (torch 2.9.1+cu129, flash-attn-4)
uv sync --extra sglang --extra train --extra infer --extra eval

# OR the vLLM / vLLM-Omni stack (torch 2.11.0+cu129, vllm 0.20.0)
uv sync --extra vllm --extra train --extra infer --extra eval
```

`sglang` and `vllm` are declared as conflicting extras, so they cannot be
installed together — choose the one matching your rollout engine.

A plain `pip` install also works if your environment already provides a
compatible torch/CUDA build:

```bash
pip install -e ".[sglang,train,infer,eval]" --no-build-isolation
```

Add `dev` for tests, linting, and hooks:

```bash
uv sync --extra sglang --extra train --extra infer --extra eval --extra dev
pre-commit install
```

## Optional Extras [#optional-extras]

`pyproject.toml` is the dependency source of truth. Extras:

| Extra    | Purpose                                                                                                   |
| -------- | --------------------------------------------------------------------------------------------------------- |
| `sglang` | SGLang rollout engine + its locked torch/torchvision/torchaudio `+cu129` stack and `flash-attn-4` (Linux) |
| `vllm`   | vLLM + vLLM-Omni rollout engine + its locked torch `+cu129` stack (Linux)                                 |
| `train`  | WandB and async runtime dependencies (`wandb`, `aiohttp`)                                                 |
| `infer`  | inference-side helpers (`accelerate`)                                                                     |
| `eval`   | evaluation/reward dependencies (`torchvision`, `easyocr`)                                                 |
| `dev`    | pytest, ruff, and pre-commit                                                                              |

`sglang` and `vllm` are mutually exclusive (`[tool.uv].conflicts`). The vLLM
wheel often has to build from sdist on older-glibc pods. The first build is slow
unless you set `VLLM_USE_PRECOMPILED=1`, and uv caches the result per pod.
`setup.py` exists only for older editable-install tooling.

> The per-engine extras already pin a matching `flash-attn` (the `sglang` extra
> pins `flash-attn-4>=4.0.0b4`); do not separately `pip install flash-attn`
> unless your environment needs a specific build.

## Optional Model and Reward Dependencies [#optional-model-and-reward-dependencies]

`mmcv` and `mmdetection` are intentionally not installed by default. Install them only for Geneval/OpenMMLab workflows, following [Geneval MMCV Setup](/en/docs/guides/geneval-mmcv-setup).

To run heavy reward models on their own GPU node instead of in-process, use the standalone remote reward service in `unirl-reward-service/` (it ships its own dependencies and README). See [Rewards](/en/docs/guides/rewards) for wiring the remote backend.

The optional rollout→trainer data-plane bus (TransferQueue / Mooncake) is also installed separately. See [TransferQueue Installation](/en/docs/getting-started/transfer-queue-installation).

## Documentation Site [#documentation-site]

The Fumadocs site is isolated in `docs/` so Node dependencies do not affect the Python package:

```bash
cd docs
npm install
npm run dev
```

Build the static site:

```bash
npm run build
```

The exported static files are emitted by Next.js into `docs/out/`.

---

# Docs Site README (/en/docs/getting-started/readme-docs-site)

> Fumadocs site commands, structure, and maintenance notes.

{/* Generated from docs/README.md by docs/scripts/sync-readme-reference.mjs. Edit the source README, not this file. */}

This directory contains the Fumadocs + Next.js documentation site. It is isolated from the Python package so Node dependencies and generated files do not affect framework installs.

## Commands [#commands]

```bash
npm install
npm run sync:readmes
npm run dev
npm run build
npm run typecheck
```

`npm run sync:readmes` regenerates the embedded package pages from the
repository README files, writing one page per README into its owning docs
section (under both `en` and `zh`). It also runs automatically before `dev`,
`build`, and `typecheck`.

`npm run build` statically exports the site to `out/`. The build also generates `.source/`, which backs the `collections/server` import used by Fumadocs MDX.

## Structure [#structure]

| Path                                | Purpose                                                      |
| ----------------------------------- | ------------------------------------------------------------ |
| `app/`                              | Next.js App Router pages and static route handlers           |
| `content/docs/`                     | MDX source pages and `meta.json` sidebar ordering            |
| `lib/source.ts`                     | Fumadocs loader source                                       |
| `lib/get-llm-text.ts`               | Markdown conversion for agent endpoints                      |
| `components/mdx.tsx`                | MDX component mapping                                        |
| `source.config.ts`                  | Fumadocs MDX collection config                               |
| `scripts/sync-readme-reference.mjs` | Generates embedded package pages from near-code README files |

## Agent Endpoints [#agent-endpoints]

Static builds expose:

* `/llms.txt` for a compact documentation index.
* `/llms-full.txt` for the full Markdown corpus.
* `/md/<slug>/index.md` for one page as Markdown, for example `/md/configuration/hydra/index.md`.
* `/api/search.json` for the static search index consumed by Fumadocs UI.

Generated package pages are included in these outputs after
`npm run sync:readmes`.

Keep these routes extension-safe so static hosts can infer MIME types without custom server configuration.

Agent endpoints intentionally read only the English source (`content/docs/en`). Chinese pages are for human reading and are not included in `/llms-full.txt` or `/md/<slug>/index.md`.

Do not add `llms.txt` as a docs sidebar category. The rendered `Agents` section explains how agents should navigate the docs; `/llms.txt` and related routes stay as root-level machine-readable endpoints.
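
For agents that build endpoint URLs, the `/md/<slug>/index.md` pattern can be sketched as below. The function name is hypothetical, and it assumes the slug is the page path without a language prefix, as in the `configuration/hydra` example above.

```python
def markdown_endpoint(slug):
    """Map a docs slug to its per-page Markdown route (illustrative helper)."""
    clean = slug.strip("/")  # tolerate leading or trailing slashes
    if not clean:
        raise ValueError("slug must not be empty")
    return f"/md/{clean}/index.md"


route = markdown_endpoint("configuration/hydra")
```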

## Human Languages [#human-languages]

Human documentation is language-prefixed:

* `/en/docs` for English.
* `/zh/docs` for Chinese.

Content is organized by directory:

```text
content/docs/en/...
content/docs/zh/...
```

English is the source of truth and fallback language. If a Chinese page is missing, Fumadocs falls back to the English source.

## Adding Pages [#adding-pages]

1. Add the English `.mdx` file under `content/docs/en/`.
2. Add the Chinese `.mdx` file under `content/docs/zh/` when the page is intended for human Chinese readers.
3. Add or update the nearest `meta.json` `pages` list to control sidebar order.
4. Prefer paths and command examples that match the current repository state.
5. Run `npm run build` before submitting changes.

## Maintenance Notes [#maintenance-notes]

* Node `>=20.19.0` is declared because current transitive file-watcher dependencies require it. Older Node versions may still build but will warn during install.
* `includeProcessedMarkdown` is enabled in `source.config.ts`; do not remove it unless `/llms-full.txt` and `/md/<slug>/index.md` are replaced with another Markdown source.
* Generated directories `.next/`, `.source/`, `out/`, and `node_modules/` are ignored by the repository.
* Generated package pages `content/docs/{en,zh}/<section>/readme-*.mdx` come from `npm run sync:readmes` and are git-ignored; edit the source README, not these files.

> Source: [`docs/README.md`](https://github.com/haonan3/UniRL/blob/main/docs/README.md) — edit the README next to the code, then run `npm run sync:readmes` from `docs/`.

---

# TransferQueue Installation (/en/docs/getting-started/transfer-queue-installation)

> Optional rollout→trainer data-plane bus (Simple and Mooncake backends).

[TransferQueue](https://github.com/Ascend/TransferQueue) (TQ) is the optional
**rollout→trainer data-plane bus** for UniRL: bulky rollout outputs (conditions,
latents, rewards) flow through it instead of the driver in `separate` / `colocate` sampling
modes. It is **not** part of UniRL's declared dependencies — it is imported lazily and
must be installed into the same environment separately. See
`unirl/distributed/weight_sync/README.md` ("Transfer Queue — Separate Concern") for how
it differs from weight sync, and `unirl/distributed/tensor/backend/transfer_queue/` for
the integration code (`runtime.py`, `simple.py`, `mooncake.py`, `transport.py`).

UniRL wires two TQ storage backends through the Hydra `transfer_queue` config group:

| Backend                                  | Use when                               | Install effort            | External services                            |
| ---------------------------------------- | -------------------------------------- | ------------------------- | -------------------------------------------- |
| **Simple** (`AsyncSimpleStorageManager`) | dev, single-node, functional testing   | base TQ only              | none — in-process Ray actors                 |
| **Mooncake** (`MooncakeStorageManager`)  | production, multi-node, zero-copy RDMA | base TQ + Mooncake engine | external `mooncake_master` + metadata server |

TQ is **off by default** (the `transfer_queue` group has no Hydra defaults entry); you opt in
per experiment.

***

## 1. Prerequisites [#1-prerequisites]

* UniRL already installed in the target venv (`pip install -e ".[train,infer,eval]" --no-build-isolation`), Python **≥3.10**, PyTorch present. See [Installation](/en/docs/getting-started/installation).
* Install TQ into the **same** environment.
* **Mooncake only:** an RDMA-capable NIC (InfiniBand / RoCE) on every node, and — on TaiJi, for the from-source build — `root`.

***

## 2. Install the base TransferQueue package [#2-install-the-base-transferqueue-package]

Both backends need the `transfer_queue` Python package.

### Option A — PyPI (Simple backend only) [#option-a--pypi-simple-backend-only]

```bash
pip install TransferQueue
```

### Option B — From source (required for Mooncake) [#option-b--from-source-required-for-mooncake]

The zero-copy Mooncake client lives on the **`v0.1.5_mooncake`** branch (matched to Mooncake
`v0.3.10.post1`); the PyPI release does not carry it. Install editable, `--no-deps` so it does
not perturb UniRL's pinned dependencies:

```bash
git clone -b v0.1.5_mooncake git@git.woa.com:MMRL_Infra/TransferQueue.git
cd TransferQueue
pip install -e . --no-deps
```

TQ's runtime deps are mostly already in UniRL (`ray[default]`, `hydra-core`, `numpy<2.0.0`,
`torch`). Install the few that UniRL doesn't already provide, and mind the **`numpy<2.0.0`** ceiling:

```bash
pip install "tensordict>=0.10.0" pyzmq msgspec psutil
```

### Verify [#verify]

```bash
python -c "import transfer_queue; print(transfer_queue.__version__)"
```

The source `v0.1.5_mooncake` branch (Option B) reports `0.1.5`; the PyPI release (Option A)
reports the latest published version (e.g. `0.1.7`).

***

## 3. Simple backend (in-memory) [#3-simple-backend-in-memory]

No native dependencies — it spawns `SimpleStorageUnit` Ray actors (defaults: `num_units=16`,
`unit_size=1024`). Once base TQ (§2) is installed, enable it per experiment.

CLI override (the group has no default, so **append** with `+`):

```bash
python -m unirl.train_diffusion --config-name=<domain>/<recipe> \
    +transfer_queue=simple
```

Or in your recipe YAML under `examples/<domain>/<recipe>.yaml`:

```yaml
defaults:
  - transfer_queue: simple
# optional overrides:
transfer_queue:
  num_units: 16
  unit_size: 1024
```

Best for single-node runs and functional testing. For production sizing, use Mooncake.

***

## 4. Mooncake backend (zero-copy RDMA) [#4-mooncake-backend-zero-copy-rdma]

UniRL's `MooncakeBackend` is a **pure client** — the storage segments live on an
**external** Mooncake service that UniRL does *not* start for you. Four steps: install the
engine, satisfy RDMA prerequisites, run the services, wire the config.

### 4.1 Install the Mooncake engine [#41-install-the-mooncake-engine]

This provides the `mooncake.store` Python module and the `mooncake_master` binary.

**Generic Linux** (prebuilt wheel — works where the wheel's glibc/ABI matches your host):

```bash
pip install mooncake-transfer-engine   # use the release matching Mooncake v0.3.10.post1
```

**TaiJi / from source** (needed for RDMA against the pod's drivers, or on glibc mismatch). From
the TransferQueue checkout (§2 Option B):

```bash
cd TransferQueue/scripts/install_mooncake
sudo ./install_mooncake.sh
```

What that script does (**read before running**): requires `root`; installs system packages via
`yum`; clones and builds **Mooncake `v0.3.10.post1`** plus Go 1.23.8, boost 1.90, gflags 2.3,
yaml-cpp 0.9, gtest 1.17, yalantinglibs 0.5.7; appends `/usr/local/lib64:/usr/local/lib` to
`LD_LIBRARY_PATH` in `~/.bashrc`. It `yum remove`s the distro `gtest`/`yaml-cpp`/`boost` dev
packages before rebuilding them from source, so run it on a **disposable pod**. Tunables:
`MOONCAKE_WORKDIR` (default `/dockerdata/data/Mooncake`), `GITHUB_PROXY`, `http_proxy` /
`https_proxy`. See `scripts/install_mooncake/README.md` in the TransferQueue repo.

**Verify:**

```bash
source ~/.bashrc                  # source builds only: pick up the appended LD_LIBRARY_PATH first
python -c "from mooncake.store import MooncakeDistributedStore; print('mooncake ok')"
mooncake_master --help            # binary on PATH
```

### 4.2 RDMA prerequisites [#42-rdma-prerequisites]

```bash
ibv_devices ; ibstat               # list RDMA NICs (needs libibverbs + drivers)
ls /sys/class/infiniband           # UniRL auto-discovers device_name from here
```

UniRL auto-discovers `device_name` (a comma-list of RDMA bonds from
`/sys/class/infiniband`) and sets `MC_ENABLE_DEST_DEVICE_AFFINITY=1` so each process binds the
PIX-distance HCA for its GPU; **you normally do not set `device_name`**. Override only for ops
debugging: `transfer_queue.device_name=mlx5_0`. No RDMA fabric? Fall back with
`transfer_queue.protocol=tcp` (slower). If startup raises *"no InfiniBand device found under
/sys/class/infiniband"*, the host has no usable RDMA NIC.
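
To see roughly what such a discovery step produces on a host, here is a minimal sketch. It is not UniRL's discovery code: `discover_rdma_devices` is a hypothetical helper that lists every entry under the sysfs directory, whereas the real logic may filter for bonded devices.

```python
import os


def discover_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return a comma-separated list of device names found under sysfs_root.

    Illustrative only: every directory entry is treated as a usable device.
    """
    if not os.path.isdir(sysfs_root):
        raise RuntimeError(f"no InfiniBand device found under {sysfs_root}")
    names = sorted(os.listdir(sysfs_root))
    if not names:
        raise RuntimeError(f"no InfiniBand device found under {sysfs_root}")
    return ",".join(names)
```

On a host with RDMA NICs, the result has the same shape as a manual `transfer_queue.device_name=` override.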

### 4.3 Run the external Mooncake services (head node) [#43-run-the-external-mooncake-services-head-node]

`mooncake_master` serves both the RPC master and the built-in HTTP metadata server:

```bash
mooncake_master \
  --rpc_port=50051 \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080
# inside a container, add --rpc_interface=eth0 to bind the routable IPv4
```

This yields the two endpoints the client config needs:

* `master_server_address` → `<head_ip>:50051`
* `metadata_server` → `http://<head_ip>:8080/metadata`

Keep it running for the duration of training. The built-in HTTP metadata server is single-node;
for HA use an external `etcd` instead.

### 4.4 Wire the UniRL config [#44-wire-the-unirl-config]

```bash
python -m unirl.train_diffusion --config-name=<domain>/<recipe> \
    +transfer_queue=mooncake \
    transfer_queue.metadata_server=http://<head_ip>:8080/metadata \
    transfer_queue.master_server_address=<head_ip>:50051 \
    transfer_queue.protocol=rdma \
    transfer_queue.global_segment_size_gb=64 \
    transfer_queue.local_buffer_size_gb=10
```

Fields (defined in `unirl/distributed/tensor/backend/transfer_queue/mooncake.py`):

| Field                                                      | Default          | Notes                                                |
| ---------------------------------------------------------- | ---------------- | ---------------------------------------------------- |
| `metadata_server`                                          | — (**required**) | `http://<head_ip>:8080/metadata` from §4.3           |
| `master_server_address`                                    | — (**required**) | `<head_ip>:50051` from §4.3                          |
| `protocol`                                                 | `rdma`           | `rdma` or `tcp`                                      |
| `global_segment_size_gb`                                   | `64`             | total upstream segment pool                          |
| `local_buffer_size_gb`                                     | `10`             | per-client local buffer                              |
| `device_name`                                              | auto             | auto-discovered HCA list; override only to debug     |
| `zero_copy.enable`                                         | `true`           | RDMA zero-copy buffers                               |
| `zero_copy.tensor_buffer_size_gb` / `bytes_buffer_size_gb` | `2.0` / `2.0`    | per-client buffers (controller gets `10.0` / `10.0`) |
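
When a launcher script assembles these overrides, the two endpoints from §4.3 can be derived from the head node address. The sketch below is a hypothetical helper, not part of UniRL; the default ports are the ones used in the `mooncake_master` example.

```python
def mooncake_overrides(head_ip, rpc_port=50051, http_port=8080, protocol="rdma"):
    """Build Hydra override strings for the Mooncake backend (illustrative)."""
    if protocol not in ("rdma", "tcp"):
        raise ValueError("protocol must be 'rdma' or 'tcp'")
    return [
        "+transfer_queue=mooncake",  # the group has no default, so append with '+'
        f"transfer_queue.metadata_server=http://{head_ip}:{http_port}/metadata",
        f"transfer_queue.master_server_address={head_ip}:{rpc_port}",
        f"transfer_queue.protocol={protocol}",
    ]


overrides = mooncake_overrides("10.0.0.5")
```

The resulting list can be appended to the `python -m unirl.train_diffusion --config-name=...` command line.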

***

## 5. Environment variables [#5-environment-variables]

| Variable                                    | Set by         | Purpose                                                                  |
| ------------------------------------------- | -------------- | ------------------------------------------------------------------------ |
| `TQ_ZERO_COPY_SERIALIZATION`                | you            | TQ serialization mode (`True`/`False`)                                   |
| `TQ_LOGGING_LEVEL`                          | you            | TQ log verbosity (default `WARN`)                                        |
| `LOCAL_IP`                                  | you (optional) | routable IP each Mooncake client binds; else auto from hostname          |
| `MOONCAKE_WORKDIR`                          | you (optional) | where `install_mooncake.sh` builds (default `/dockerdata/data/Mooncake`) |
| `GITHUB_PROXY`, `http_proxy`, `https_proxy` | you (TaiJi)    | proxies for the source build                                             |
| `MC_ENABLE_DEST_DEVICE_AFFINITY`            | **UniRL**      | `=1` for per-process GPU↔HCA affinity                                    |
| `MC_TCP_BIND_ADDRESS`                       | **UniRL**      | set to `LOCAL_IP` so Mooncake binds the right NIC                        |
| `MC_MS_AUTO_DISC` / `MC_MS_FILTERS`         | you (optional) | Mooncake NIC/GPU topology auto-discovery / whitelist                     |
| `LD_LIBRARY_PATH`                           | source build   | must include `/usr/local/lib64:/usr/local/lib`                           |

***

## 6. Verify end-to-end [#6-verify-end-to-end]

```bash
# Imports
python -c "import transfer_queue; print(transfer_queue.__version__)"
python -c "from mooncake.store import MooncakeDistributedStore; print('mooncake ok')"   # Mooncake only

# Simple-backend smoke test (no native deps)
python -m unirl.train_diffusion --config-name=<small_recipe> +transfer_queue=simple

# Standalone TQ sanity (from the TransferQueue checkout — see the repo's
# recipe/simple_use_case/ and tutorial/ directories for the current demo files)
python recipe/simple_use_case/single_controller_demo.py
pytest                              # CPU test suite
```

For Mooncake, the full RDMA path must be validated on a **TaiJi GPU pod**: start
`mooncake_master`, launch training with `+transfer_queue=mooncake` (§4.4), and confirm that
neither an `ImportError` nor a `-800` setup error appears and that rollout→trainer data flows.
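
The import checks can also be scripted as a preflight that does not import the packages. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that are not importable here."""
    # find_spec returns None for an absent top-level module without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


# "mooncake" is only needed for the Mooncake backend.
missing = missing_modules(["transfer_queue", "mooncake"])
if missing:
    print("not installed in this environment:", ", ".join(missing))
```

Only top-level names are passed, because `find_spec` on a dotted name such as `mooncake.store` would import the parent package first.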

***

## 7. Troubleshooting [#7-troubleshooting]

| Symptom                                                  | Fix                                                                                                                                                                                                                                                            |
| -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ImportError: Mooncake Store not installed`              | Install the engine (§4.1) into the **same** venv.                                                                                                                                                                                                              |
| Dependency resolver pulls `numpy>=2`                     | TQ requires `numpy<2.0.0`; pin it.                                                                                                                                                                                                                             |
| `no InfiniBand device found under /sys/class/infiniband` | No usable RDMA NIC — run on an RDMA host or set `transfer_queue.protocol=tcp`.                                                                                                                                                                                 |
| Mooncake `setup()` returns `-800` on some ranks          | Wrong-NUMA HCA. Ensure `MC_ENABLE_DEST_DEVICE_AFFINITY=1` (UniRL sets it) and a comma-list `device_name`; pin with `transfer_queue.device_name=` if needed. See Mooncake [error codes](https://kvcache-ai.github.io/Mooncake/troubleshooting/error-code.html). |
| Client cannot reach master/metadata (timeout / refused)  | `mooncake_master` not running or wrong host/port; ensure `50051`/`8080` are reachable across nodes; set `LOCAL_IP` so clients bind the routable interface.                                                                                                     |
| `*.so` not found at runtime (source build)               | `LD_LIBRARY_PATH` must include `/usr/local/lib64:/usr/local/lib`; `source ~/.bashrc`.                                                                                                                                                                          |
| Wheel import crashes with glibc/ABI error                | Build from source via `install_mooncake.sh` (§4.1).                                                                                                                                                                                                            |
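
For the "cannot reach master/metadata" row, a plain TCP probe from a worker node can narrow the problem down before digging into Mooncake itself. The sketch below uses only the standard library; `port_reachable` is a hypothetical helper, and a successful connect shows that the port is open, not that the service is healthy.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Run it against `<head_ip>` with ports `50051` and `8080` from each node that hosts Mooncake clients.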

***

## 8. References [#8-references]

* UniRL integration: `unirl/distributed/tensor/backend/transfer_queue/{runtime,simple,mooncake,transport}.py`
* Backend separation vs weight sync: `unirl/distributed/weight_sync/README.md`
* TransferQueue upstream (canonical): [https://github.com/Ascend/TransferQueue](https://github.com/Ascend/TransferQueue) (developed by the Ascend team; the older [https://github.com/TransferQueue/TransferQueue](https://github.com/TransferQueue/TransferQueue) is archived). UniRL pins the internal Mooncake fork `git@git.woa.com:MMRL_Infra/TransferQueue.git` (`v0.1.5_mooncake`); upstream Mooncake install notes: `scripts/install_mooncake/README.md`
* Mooncake: [https://github.com/kvcache-ai/Mooncake](https://github.com/kvcache-ai/Mooncake) (`v0.3.10.post1`) — [deployment guide](https://kvcache-ai.github.io/Mooncake/deployment/mooncake-store-deployment-guide.html), [error codes](https://kvcache-ai.github.io/Mooncake/troubleshooting/error-code.html)

---

# Data and Models (/en/docs/guides/data-and-models)

> Prompt data contracts, local datasets, model packages, and checkpoint mounts.

## Data Contract [#data-contract]

User-facing datasets are prompt-first. Committed prompt lists live under `datasets/` (for example `datasets/pickscore/train.txt`), while larger datasets should be symlinked or mounted under `datasets/` and passed through `DATA_PATH` or the recipe's `data_source` block.

```bash
DATA_PATH=/abs/path/train.json \
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
```

See [Data Preparation](/en/docs/guides/data-preparation) for the prompt-file formats and the per-prompt schema.

## Runtime Model Code [#runtime-model-code]

Model implementation packages live under `unirl/models/`. Use the generated [Model Package README](/en/docs/guides/readme-models) for package layout and extension contracts.

## Local Checkpoint Mounts [#local-checkpoint-mounts]

The repository-root `models/` directory is for local artifacts (e.g. `models/local/`), not Python model code. Experiment YAMLs and model configs provide HuggingFace fallbacks through Hydra environment interpolation, for example:

```yaml
pretrained_model_ckpt_path: ${oc.env:PRETRAINED_MODEL,stabilityai/stable-diffusion-3.5-medium}
```
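
The fallback behaviour of that interpolation can be mirrored in plain Python: the environment variable is used when set, otherwise the HuggingFace identifier. This is only an illustration of the semantics; `resolve_checkpoint` is a hypothetical helper, and the actual resolution happens inside the config system.

```python
import os

DEFAULT_CKPT = "stabilityai/stable-diffusion-3.5-medium"


def resolve_checkpoint(environ=None, default=DEFAULT_CKPT):
    """Return PRETRAINED_MODEL when set, else the HuggingFace fallback."""
    environ = os.environ if environ is None else environ
    return environ.get("PRETRAINED_MODEL", default)


local = resolve_checkpoint({"PRETRAINED_MODEL": "/abs/path/models/local/sd3"})
fallback = resolve_checkpoint({})
```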

Do not commit large model weights such as `.bin`, `.safetensors`, `.pt`, `.pth`, or `.ckpt`.

---

# Data Preparation (/en/docs/guides/data-preparation)

> Prompt file formats, the per-prompt schema, image/condition inputs, and how prompts expand into rollout groups.

UniRL training is prompt-first: a data file supplies prompts, and rollout
engines generate media that reward components score. This page documents the
accepted file formats and the per-prompt schema. For where datasets are mounted,
see [Data and Models](/en/docs/guides/data-and-models).

## File Formats [#file-formats]

Point a recipe's `data_source` at a prompt file via the `DATA_PATH` environment
variable (where the recipe interpolates it) or by overriding the `data_source`
block's `data_path` with a Hydra override. The reader (`unirl/data/datasets.py`)
accepts three extensions; anything else raises an error:

| Extension | Parsing                                                                                                                      |
| --------- | ---------------------------------------------------------------------------------------------------------------------------- |
| `.txt`    | One prompt per non-empty line. Each line becomes `{"prompt": <line>}`.                                                       |
| `.jsonl`  | One JSON object per non-empty line.                                                                                          |
| `.json`   | A list of strings or objects, or a dict with a `prompts` list, a `caption`, or a configurable prompt key (default `prompt`). |

Minimal JSON:

```json
[
  {"prompt": "A watercolor landscape with snowy mountains at sunrise."},
  {"prompt": "A cinematic portrait of a robot reading under warm light."}
]
```

A plain `.txt` file (one prompt per line, like the committed
`datasets/pickscore/train.txt`) works for text-to-video recipes too:

```text
A drone shot flying over a misty pine forest at dawn.
Time-lapse of clouds rolling over a desert canyon.
```
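
The parsing rules in the table can be sketched as a small reader. This is not the code in `unirl/data/datasets.py`: `load_prompt_rows` is a hypothetical helper, it ignores the configurable prompt key, and it assumes a JSON dict either wraps a `prompts` list or is itself a single row.

```python
import json
from pathlib import Path

SUPPORTED = {".txt", ".jsonl", ".json"}


def load_prompt_rows(path):
    """Parse a prompt file into a list of row dicts (illustrative only)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported prompt file extension: {suffix!r}")
    text = path.read_text(encoding="utf-8")
    if suffix == ".txt":
        # One prompt per non-empty line.
        return [{"prompt": line.strip()} for line in text.splitlines() if line.strip()]
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    if isinstance(data, dict):
        # Assumption: a dict either wraps a "prompts" list or is a single row.
        data = data["prompts"] if "prompts" in data else [data]
    # Bare strings become {"prompt": ...}; objects pass through unchanged.
    return [{"prompt": row} if isinstance(row, str) else row for row in data]
```

Checking the extension before reading the file means an unsupported path fails fast, even when the file does not exist.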

## Per-Prompt Schema [#per-prompt-schema]

Each object is normalized to a prompt example:

| Field                   | Required | Notes                                                                 |
| ----------------------- | -------- | --------------------------------------------------------------------- |
| `prompt` (or `caption`) | yes      | Non-empty text.                                                       |
| `prompt_id`             | no       | Auto-generated as `{filename}:{index}` if omitted.                    |
| `metadata`              | no       | Free-form dict. If omitted, any extra top-level keys become metadata. |
| `media` / `media_refs`  | no       | List of media references; each is `{modality, role, uri}`.            |

If `metadata` is omitted, extra top-level keys (anything other than `prompt`,
`caption`, `media`, `media_refs`, `metadata`, `prompt_id`) are folded into it. If
you pass an explicit `metadata` dict it is used as-is, so put any extra fields
inside it. Legacy precomputed-embedding fields (for example `prompt_embed_path`,
`prompt_embeds`) are rejected with a hard error — embeddings are computed at runtime.

There is no `negative_prompt` or per-row `seed` in the data file. Guidance scale,
seed, and resolution come from `cfg.sampling`, not from manifest rows.
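
The normalization rules above can be made concrete with a short sketch. It is an illustration under stated assumptions, not the repository's implementation: `normalize_row` is a hypothetical name, the caller is assumed to pass the dataset file name, and the legacy-field set lists only the two examples named above.

```python
RESERVED = {"prompt", "caption", "media", "media_refs", "metadata", "prompt_id"}
LEGACY = {"prompt_embed_path", "prompt_embeds"}  # examples named in the docs


def normalize_row(row, filename, index):
    """Normalize one raw row into a prompt example (illustrative only)."""
    legacy = sorted(LEGACY & row.keys())
    if legacy:
        raise ValueError(f"precomputed-embedding fields are rejected: {legacy}")
    prompt = row.get("prompt") or row.get("caption")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt (or caption) must be non-empty text")
    if row.get("metadata") is not None:
        metadata = dict(row["metadata"])  # explicit metadata is used as-is
    else:
        # Otherwise every non-reserved top-level key is folded into metadata.
        metadata = {k: v for k, v in row.items() if k not in RESERVED}
    return {
        "prompt": prompt,
        "prompt_id": row.get("prompt_id", f"{filename}:{index}"),
        "metadata": metadata,
        "media_refs": list(row.get("media_refs") or row.get("media") or []),
    }


example = normalize_row({"prompt": "A red fox", "style": "oil"}, "train.json", 3)
```

Note how an extra key such as `style` lands in `metadata` only when no explicit `metadata` dict is given.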

## Image-Conditioned and Edit/I2V Inputs [#image-conditioned-and-editi2v-inputs]

For image-to-video, editing, or other conditioned recipes, attach a condition
image through `media_refs`:

```json
{
  "prompt": "Animate this scene with gentle falling snow.",
  "media_refs": [
    {"modality": "image", "role": "condition", "uri": "frames/scene_01.png"}
  ]
}
```

* Relative URIs resolve against the dataset file's directory.
* Absolute paths and `http://`, `https://`, `s3://`, `gs://` URIs pass through unchanged.
* Today the driver loads exactly one `(modality="image", role="condition")` ref
  per prompt; other modality/role pairs raise `NotImplementedError`.
* There is no video URI role in the data contract: text-to-video uses `.txt`
  prompts, and image-to-video uses an image condition ref.
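
The URI rules in the list above can be sketched as follows. `resolve_media_uri` is a hypothetical helper rather than the driver's code, and it assumes POSIX-style paths so the result is the same on every platform.

```python
import posixpath

PASSTHROUGH_SCHEMES = ("http://", "https://", "s3://", "gs://")


def resolve_media_uri(uri, dataset_file):
    """Resolve a media URI relative to the dataset file's directory."""
    # Remote URIs and absolute paths pass through unchanged.
    if uri.startswith(PASSTHROUGH_SCHEMES) or posixpath.isabs(uri):
        return uri
    # Relative URIs resolve against the directory that holds the dataset file.
    return posixpath.join(posixpath.dirname(dataset_file), uri)


resolved = resolve_media_uri("frames/scene_01.png", "/data/i2v/train.jsonl")
```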

## How Prompts Become Rollout Groups [#how-prompts-become-rollout-groups]

Two knobs control batch shape, and they apply at different stages:

* `prompts_per_rollout` is the number of **distinct prompts** sampled per rollout
  (the data loader's batch size). Prompts are not pre-duplicated.
* `sampling.samples_per_prompt` repeats each prompt `k` times later, in the
  rollout pipeline, to form an N-sample GRPO group. Siblings share a `group_id`
  and get `sample_id`s like `prompt:<gid>:sample:<j>`.

So one rollout produces `prompts_per_rollout × sampling.samples_per_prompt`
samples.
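
As a worked illustration of that product, the sketch below expands a batch of prompts into sibling samples. It is not the rollout pipeline's code: the function name is hypothetical, and it assumes the group id is the prompt's id and that sample indices start at 0.

```python
def expand_rollout_groups(prompt_ids, samples_per_prompt):
    """Repeat each prompt k times; siblings share a group_id (illustrative)."""
    samples = []
    for gid in prompt_ids:
        for j in range(samples_per_prompt):
            samples.append(
                {"group_id": gid, "sample_id": f"prompt:{gid}:sample:{j}"}
            )
    return samples


# prompts_per_rollout = 2 distinct prompts, samples_per_prompt = 4
rollout = expand_rollout_groups(["a", "b"], 4)
```

With 2 prompts and 4 samples per prompt, the rollout holds 2 × 4 = 8 samples in two groups of four.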

## Data Source Selection [#data-source-selection]

| Source                   | When                                                                                | Selected by                                                                                     |
| ------------------------ | ----------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `MultimodalRLDataSource` | real runs; reads the configured `data_path`, shuffles, drops the last partial batch | recipes set `data_source._target_: unirl.data.data_source.MultimodalRLDataSource` (the default) |
| `DefaultDataSource`      | smoke checks; ignores `data_path` and cycles a few built-in prompts                 | a recipe pointing `data_source._target_` at `unirl.data.data_source.DefaultDataSource`          |

`EVAL_DATA_PATH` points at a separate eval prompt file (loaded in deterministic
order); training batches always come from the configured `data_path`. See
[Evaluation](/en/docs/guides/evaluation) for the current status of the eval path.
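
The "shuffles, drops the last partial batch" behaviour described for `MultimodalRLDataSource` can be pictured with a minimal sketch. This is not the class's implementation; `training_batches` and its `seed` parameter are hypothetical.

```python
import random


def training_batches(prompts, prompts_per_rollout, seed=0):
    """Shuffle prompts and yield only full batches (illustrative only)."""
    order = list(prompts)
    random.Random(seed).shuffle(order)  # local RNG, global state untouched
    # Drop the trailing prompts that cannot fill a complete batch.
    usable = len(order) - len(order) % prompts_per_rollout
    return [
        order[i:i + prompts_per_rollout]
        for i in range(0, usable, prompts_per_rollout)
    ]


batches = training_batches([f"p{i}" for i in range(10)], prompts_per_rollout=4)
```

Ten prompts with a batch size of 4 give two full batches; the remaining two prompts are dropped for that pass.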

## Worked Example [#worked-example]

```bash
# 1. Author prompts.json (a list of {"prompt": ...} objects).
# 2. Point DATA_PATH at it and launch a recipe whose data source reads files.
DATA_PATH=/abs/path/prompts.json \
OUTPUT_DIR=/abs/path/outputs/run1 \
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
```

Validate composition before launching Ray work:

```bash
DATA_PATH=/abs/path/prompts.json \
python -m unirl.train_diffusion --config-name=diffusion/sd3_trainside --cfg job --resolve
```

---

# Evaluation (/en/docs/guides/evaluation)

> How quality is measured today (reward scores), the eval plumbing that exists, and what is not wired yet.

> **Status.** There is no automatic periodic evaluation loop in the training
> driver yet. Periodic eval is a deferred follow-up. Today, the in-training
> quality signal is the **reward score**, and offline benchmarking is done with
> external tools. This page documents what exists so you do not assume a harness
> that is not wired.

## Quality Signal During Training [#quality-signal-during-training]

The reward components configured on a recipe are the primary quality signal.
Each rollout scores generated media and logs per-sample and per-component reward
(visible in WandB when enabled). Built-in scorers under `unirl/reward/local/`
include PickScore, HPS, OCR, GenEval2, VideoPickScore, and rule-based matchers.

To "evaluate" a checkpoint or recipe today, read the reward curves and
per-component breakdown. See [Rewards](/en/docs/guides/rewards) for configuration
and the generated
[Reward Package README](/en/docs/guides/readme-reward-package).

## Eval Plumbing (currently inert) [#eval-plumbing-currently-inert]

The configuration surface for a separate eval set already exists, but the driver
does not call it yet:

* `EVAL_DATA_PATH` → `run.eval_data_path` selects a separate eval prompt file,
  loaded in deterministic (unshuffled) order. If unset, it falls back to
  `run.data_path`.
* `cfg.evaluation.eval_steps` is present in some recipes (for example WAN and
  `qwen_vl_argrpo_geo3k_mc_4x8`), and many SD3 recipes set it to `0`.

These settings feed helpers (`get_eval_samples`, `build_eval_request_batch`,
`should_eval`, `run_eval_pipeline`, `log_eval`) that are implemented but have no
callers in the driver today. Treat them as forward-looking until a periodic-eval
hook lands.
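
The fallback rule for the eval prompt file is small enough to state as code. This is a minimal sketch of the documented behaviour, not the UniRL helper itself: `resolve_eval_data_path` is a hypothetical name, and treating an empty string as "unset" is an assumption.

```python
def resolve_eval_data_path(eval_data_path, data_path):
    """Return the eval prompt file, falling back to the training file when unset."""
    # Assumption: None and "" both count as "unset".
    return eval_data_path if eval_data_path else data_path


print(resolve_eval_data_path(None, "/abs/path/prompts.json"))
print(resolve_eval_data_path("/abs/path/eval.json", "/abs/path/prompts.json"))
```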

## Offline / External Benchmarks [#offline--external-benchmarks]

For benchmark-style evaluation (for example GenEval), use the external
OpenMMLab stack. It is intentionally not a default dependency; setup is in
[Geneval MMCV Setup](/en/docs/guides/geneval-mmcv-setup). There is no in-repo
GenEval or FID harness.

The `[eval]` install extra adds `torchvision` and `easyocr`. OCR scoring,
however, uses `paddleocr`, which is installed separately; `easyocr` is currently
unused. `[eval]` is not a separate evaluation framework.

## If You Need Eval Now [#if-you-need-eval-now]

* Reuse a reward scorer as a metric: run the recipe with the target reward
  component and read its logged values.
* Or score generated media offline with the standalone reward service in
  `unirl-reward-service/`.
* Track the eval-loop work on the [Roadmap](/en/docs/architecture/roadmap)
  (UI / observability and benchmark tracks).

---

# Extending UniRL (/en/docs/guides/extending)

> Where to add models, rollout engines, train-side algorithms, rewards, training backends, and recipes.

Use existing boundaries rather than adding cross-cutting glue. Most extensions need one implementation package, one config dataclass, one recipe, and one focused test or compose check.

## Extension Map [#extension-map]

| Goal                       | Primary location                  | How a recipe wires it                                                                              |
| -------------------------- | --------------------------------- | -------------------------------------------------------------------------------------------------- |
| Add a model                | `unirl/models/<model_name>/`      | `@dataclass` config referenced by `_target_` under the recipe's `model:` block                     |
| Add a rollout engine       | `unirl/rollout/engine/<engine>/`  | `@dataclass` config referenced by `_target_` under `rollout:`                                      |
| Add a train-side algorithm | `unirl/algorithms/`               | a `BaseAlgorithmConfig` subclass referenced by `_target_`, bound under a track's `algorithm:` node |
| Add a reward scorer        | `unirl/reward/local/`             | a spec `@dataclass` referenced by `_target_` under `reward.backend.config`                         |
| Add a training backend     | `unirl/train/backend/`            | the `Remote` backend contract (alongside `FSDPBackend`)                                            |
| Add a recipe               | `examples/<domain>/<recipe>.yaml` | `--config-name=<domain>/<recipe>`                                                                  |

## Adding a Model [#adding-a-model]

1. Add `unirl/models/<model_name>/`.
2. Define the model's config `@dataclass` next to it (referenced by `_target_` in the recipe).
3. Implement a bundle that exposes the trainable stages needed by training and rollout.
4. Add condition, text/vision, diffusion, and VAE helpers as needed.
5. Add at least one `examples/<domain>/<recipe>.yaml` recipe.
6. Document required external checkpoints through YAML env interpolation or launcher docs.
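
Step 2 can be sketched as follows. In a real run Hydra imports the class named by `_target_` and calls it with the remaining keys; the stand-in `build_config` below only mimics that mapping so the example stays dependency-free. `ToyModelConfig`, its fields, and the dotted path are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyModelConfig:
    """Hypothetical model config; the field names are illustrative only."""
    pretrained_path: str
    dtype: str = "bfloat16"


# What the recipe's `model:` block might look like once the YAML is parsed.
model_node = {
    "_target_": "unirl.models.toy_model.config.ToyModelConfig",  # hypothetical path
    "pretrained_path": "/abs/path/checkpoint",
}


def build_config(node, cls):
    """Stand-in for `_target_` instantiation: check the name, pass the other keys."""
    if not node["_target_"].endswith("." + cls.__name__):
        raise ValueError(f"_target_ {node['_target_']!r} does not name {cls.__name__}")
    kwargs = {key: value for key, value in node.items() if key != "_target_"}
    return cls(**kwargs)


config = build_config(model_node, ToyModelConfig)
print(config)
```

Because the config is a plain dataclass, an unknown key in the recipe fails at construction time with a `TypeError` rather than being silently ignored.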

## Adding a Rollout Engine [#adding-a-rollout-engine]

1. Add a typed config under `unirl/rollout/engine/<engine>/`.
2. Reference it by `_target_` under the recipe's `rollout:` block.
3. Implement the engine contract from `unirl/rollout/engine/base.py`.
4. Return canonical `RolloutResp` data (populate `tracks[name]`).
5. If the engine is dedicated, define trainer-to-rollout weight sync.
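
The shape of step 4 can be sketched with stand-in types. Nothing here is the real contract from `unirl/rollout/engine/base.py`: `ToyRolloutResp`, `ToyEchoEngine`, the `generate` method name, and the per-sample dictionary layout are all hypothetical, and only the idea of populating `tracks[name]` comes from the steps above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRolloutResp:
    """Stand-in for the canonical response: one entry per named track."""
    tracks: dict = field(default_factory=dict)


class ToyEchoEngine:
    """Hypothetical engine that 'generates' by upper-casing each prompt."""

    def __init__(self, track_name="image"):
        self.track_name = track_name

    def generate(self, prompts):
        samples = [{"prompt": prompt, "output": prompt.upper()} for prompt in prompts]
        resp = ToyRolloutResp()
        resp.tracks[self.track_name] = samples  # populate tracks[name]
        return resp


resp = ToyEchoEngine().generate(["a red cube", "a blue ball"])
print(sorted(resp.tracks), len(resp.tracks["image"]))
```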

## Adding a Train-Side Algorithm [#adding-a-train-side-algorithm]

The training loss is a per-track `StageAlgorithm`. Add a new class only when the loss math changes (recipe compositions like DanceGRPO / MixGRPO reuse `DiffusionGRPO`).

1. Subclass `StageAlgorithm` (`unirl/algorithms/base.py`).
2. Define a config subclassing `BaseAlgorithmConfig` (a plain `@dataclass`) next to it.
3. Implement `compute_loss_and_backward(...)` — replay the stage, compute the loss, call `backward()`.
4. Set `supports_multi_update` to match whether the loss is valid across multiple optimizer steps on one rollout shard, and `requires_ema_rollout` only for off-policy losses that need EMA sampling (NFT).
5. Bind the class under the track's `algorithm:` node in a recipe.

See the generated [Algorithms Package README](/en/docs/architecture/readme-algorithms) for the full contract.
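
As a rough sketch of steps 1 to 4, the toy below uses a generic clipped-surrogate loss on plain floats. It is not `DiffusionGRPO` and does not import UniRL: `ToyScalar`, `ToyClippedConfig`, `ToyClippedAlgorithm`, and the batch keys are hypothetical, a real class would subclass `StageAlgorithm` and `BaseAlgorithmConfig`, and the new log-probabilities would come from replaying the stage rather than from the batch.

```python
import math
from dataclasses import dataclass


class ToyScalar:
    """Tiny stand-in for a loss tensor; it only records that backward() ran."""

    def __init__(self, value):
        self.value = value
        self.backward_called = False

    def backward(self):
        self.backward_called = True


@dataclass
class ToyClippedConfig:
    """Hypothetical config; a real one would subclass BaseAlgorithmConfig."""
    clip_range: float = 0.2


class ToyClippedAlgorithm:
    """Hypothetical algorithm; a real one would subclass StageAlgorithm."""

    supports_multi_update = True   # assumption: clipping bounds repeated updates
    requires_ema_rollout = False   # assumption: no EMA sampling needed here

    def __init__(self, config):
        self.config = config

    def compute_loss_and_backward(self, batch):
        low, high = 1.0 - self.config.clip_range, 1.0 + self.config.clip_range
        terms = []
        for new_lp, old_lp, adv in zip(
            batch["new_logp"], batch["old_logp"], batch["advantage"]
        ):
            ratio = math.exp(new_lp - old_lp)        # probability ratio new/old
            clipped = min(max(ratio, low), high)     # clamp the ratio
            terms.append(min(ratio * adv, clipped * adv))
        loss = ToyScalar(-sum(terms) / len(terms))   # negate: we minimize
        loss.backward()
        return loss


algo = ToyClippedAlgorithm(ToyClippedConfig(clip_range=0.2))
loss = algo.compute_loss_and_backward(
    {"new_logp": [math.log(1.5)], "old_logp": [0.0], "advantage": [1.0]}
)
print(loss.value, loss.backward_called)
```

In the example the ratio of about 1.5 is clamped to 1.2, so the surrogate is 1.2 and the loss is about -1.2.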

## Adding a Reward Scorer [#adding-a-reward-scorer]

Add the spec and scorer near the reward implementation under `unirl/reward/local/`. Prefer `LocalRewardBackend` for in-process model scorers because it provides device resolution, eager load, `offload()`, and `onload()`. Define the spec as a plain `@dataclass` and reference it by `_target_` under `reward.backend.config` in the recipe. For out-of-process scoring, point the reward backend at the remote service. See [Rewards](/en/docs/guides/rewards).

## Training Backends and Parallelism [#training-backends-and-parallelism]

New **training-backend** or **parallelism** work (a new FSDP / SP / EP plan, or a VeOmni-style backend) targets the backend contract under `unirl/train/backend/`: implement the same `Remote` surface as `FSDPBackend` and map `TrainTopology` onto a parallel plan rather than hardcoding shard sizes. See [Trainer & Training Stack](/en/docs/architecture/trainer-v2) and the generated [Train Stack README](/en/docs/architecture/readme-train-stack).
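
"Map the topology onto a plan rather than hardcoding shard sizes" can be illustrated with a small derivation. This is a sketch under stated assumptions: `ToyTopology` is a hypothetical stand-in for `TrainTopology`, whose real fields may differ, and the plan keys are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTopology:
    """Hypothetical stand-in for TrainTopology; real fields may differ."""
    world_size: int
    sp_size: int = 1  # sequence-parallel group size


def derive_plan(topo):
    """Derive shard sizes from the topology instead of hardcoding them."""
    if topo.sp_size < 1 or topo.world_size % topo.sp_size != 0:
        raise ValueError(
            f"world_size={topo.world_size} is not divisible by sp_size={topo.sp_size}"
        )
    return {"sp_size": topo.sp_size, "dp_shard_size": topo.world_size // topo.sp_size}


print(derive_plan(ToyTopology(world_size=16, sp_size=2)))
```

The same function keeps working when the device count changes, which is the point of deriving the plan from the topology.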

## Agent Tip [#agent-tip]

When an agent is asked to implement a new feature, first map the request to the table above, then read the closest package README before editing code.

---

# Geneval MMCV Setup (/en/docs/guides/geneval-mmcv-setup)

> Optional MMCV and MMDetection installation for GenEval/OpenMMLab workflows.

This setup is optional. Only install these packages if your workflow needs the
GenEval/OpenMMLab stack.

## Why separate? [#why-separate]

`mmcv`/`mmdet` have CUDA build constraints and are not required for core
UniRL training pipelines.

## Recommended versions [#recommended-versions]

* `mmcv` tag: `v1.7.2`
* `mmdetection` branch: `2.x` (currently `2.28.2` in our validated env)

These versions match the validated local setup used in this repository.

## Install steps [#install-steps]

Run from any working directory (examples use `~/mmgrpo`):

```bash
pip install -U openmim
mim install mmengine

mkdir -p ~/mmgrpo && cd ~/mmgrpo
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout tags/v1.7.2
MMCV_WITH_OPS=1 FORCE_CUDA=1 pip install -e . -v
cd ..

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout 2.x
pip install -e . -v
cd ..
```

## Sanity check [#sanity-check]

```bash
python -c "import mmcv, mmdet; print('mmcv', mmcv.__version__); print('mmdet', mmdet.__version__); from mmcv.ops import nms; print('mmcv.ops ok')"
```

Expected:

* `mmcv` reports `1.7.2`
* `mmdet` reports a `2.x` version (e.g., `2.28.2`)
* `mmcv.ops ok` prints without import errors

---

# Multinode Runs (/en/docs/guides/multinode)

> Launchers, Ray startup, cluster geometry, and pre-run checks.

Training semantics live in `examples/<domain>/*.yaml`. Shell scripts only prepare the runtime
environment, start Ray, set path/logging defaults, and forward the recipe name
plus Hydra overrides.

## Launchers [#launchers]

Use the single-node launcher for local or one-node jobs, and the multinode
launcher when workers need to join a Ray head before training starts. The first
argument is a bucketed recipe name (`<domain>/<recipe>`); `ENTRY` selects a non-diffusion
entrypoint (`train_vlm`, `train_pe`, `train_unified_model`).

```bash
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside
bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate
ENTRY=train_vlm bash examples/run_experiment_multinode_taiji.sh vlm/qwen_vl_argrpo_geo3k_mc_sglang_4x8
```

## Environment-Derived Overrides [#environment-derived-overrides]

Generic launchers set path and logging values from environment variables
(`PRETRAINED_MODEL`, `DATA_PATH`, `EVAL_DATA_PATH`, `OUTPUT_DIR`, and the W&B
knobs). Model checkpoint env vars remain recipe-specific.

## Cluster Geometry [#cluster-geometry]

For a different cluster geometry, override the recipe's placement and
batch-geometry fields together, for example the device count along with the
train-stack micro-batch size and number of updates per batch:

```bash
bash examples/run_experiment_multinode_taiji.sh diffusion/sd3_sglang_native_colocate \
  num_devices=16 \
  stack.micro_batch_size=1 \
  stack.num_updates_per_batch=2
```

The recipe's validators report inconsistent geometry before Ray work starts, so
always run a compose check first.
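
To make "inconsistent geometry" concrete, here is a sketch of the kind of divisibility check such a validator might perform. The rule itself is an assumption chosen for illustration and is not UniRL's actual validator: `check_geometry` and `rollout_batch_size` are hypothetical, while `num_devices`, `micro_batch_size`, and `num_updates_per_batch` mirror the override names above.

```python
def check_geometry(num_devices, rollout_batch_size, micro_batch_size, num_updates_per_batch):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if rollout_batch_size % num_devices != 0:
        problems.append(
            f"rollout_batch_size={rollout_batch_size} is not divisible by "
            f"num_devices={num_devices}"
        )
        return problems
    per_device = rollout_batch_size // num_devices
    step = micro_batch_size * num_updates_per_batch
    # Assumed rule: each device's share must split evenly into micro-batches
    # for every update.
    if per_device % step != 0:
        problems.append(
            f"per-device batch {per_device} is not divisible by "
            f"micro_batch_size * num_updates_per_batch = {step}"
        )
    return problems


print(check_geometry(16, 64, 1, 2))  # consistent under the assumed rule
print(check_geometry(16, 48, 1, 2))  # per-device batch of 3 cannot split into 2
```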

## Pre-Run Checks [#pre-run-checks]

For any large run, first compose and resolve the recipe without launching Ray work:

```bash
python -m unirl.train_diffusion --config-name=<domain>/<recipe> --cfg job --resolve
```

---

# Rewards (/en/docs/guides/rewards)

> Reward service, local and remote backends, and extension points.

`unirl.reward` constructs and runs reward backends. Rollout engines generate media; a reward backend scores it and returns per-sample values that the trainer turns into advantages.

## Structure [#structure]

A reward is exactly one **backend**, held by `RewardService` (`unirl/reward/service.py`), which scores a `RolloutTrack` via `score_and_attach`:

* a **local** in-process scorer (`unirl/reward/local/`: PickScore, HPS, OCR, GenEval2, VideoPickScore, …), or
* the **remote** HTTP client (`RemoteRewardBackend`) talking to the standalone `unirl-reward-service/` server.

The current YAML shape, component contract, and scorer extension workflow live in the generated [Reward Package README](/en/docs/guides/readme-reward-package).

## Config Shape [#config-shape]

A reward is wired via Hydra `_target_`:

```yaml
reward:
  _target_: unirl.reward.service.RewardService
  backend:
    _target_: unirl.reward.local.pickscore.PickScoreRewardScorer
    base_device: cuda
    config:
      _target_: unirl.reward.local.pickscore.PickScoreSpec
      batch_size: 8
```

For out-of-process scoring, point `backend._target_` at `unirl.reward.remote.RemoteRewardBackend` with a `RemoteRewardSpec` (`base_url`, `required_rewards`, `reward_weights`, `input_kind`). The remote service runs from `unirl-reward-service/` on its own GPU node.

## Failure Semantics [#failure-semantics]

Reward failures are loud, never silent. A non-finite or `null` reward is flagged as a sample failure, and `RewardService.score_and_attach` raises on any failure, naming the offending reward and sample. An inference error therefore stops the step instead of poisoning the GRPO group.
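
The failure rule can be sketched as a standalone check. This is not the code inside `RewardService.score_and_attach`: `RewardFailure`, `check_rewards`, and the `{reward_name: [value per sample]}` layout are hypothetical, and only the behaviour (reject `None` and non-finite values, name the reward and sample) follows the description above.

```python
import math


class RewardFailure(RuntimeError):
    """Raised when any sample has a null or non-finite reward."""


def check_rewards(rewards):
    """Validate {reward_name: [value per sample]} and raise on the first bad value."""
    for name, values in rewards.items():
        for index, value in enumerate(values):
            if value is None or not math.isfinite(value):
                raise RewardFailure(
                    f"reward {name!r} failed for sample {index}: {value!r}"
                )
    return rewards


check_rewards({"pickscore": [0.21, 0.35]})  # passes silently
try:
    check_rewards({"pickscore": [0.21, float("nan")]})
except RewardFailure as err:
    print(err)
```

Raising instead of substituting a default keeps one bad sample from skewing the group statistics that advantages are computed from.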

## Adding a Local Scorer [#adding-a-local-scorer]

Add the spec and scorer near the reward implementation under `unirl/reward/local/`. Prefer `LocalRewardBackend` for in-process model scorers because it provides device resolution, eager load, `offload()`, and `onload()`. Define the spec as a plain `@dataclass` and reference it by `_target_` under `reward.backend.config` in the recipe (no registration step). See the generated [Reward Package README](/en/docs/guides/readme-reward-package) for the full example and the remote wire contract.
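
A minimal sketch of the spec-plus-scorer pairing follows. It does not import UniRL and is not the `LocalRewardBackend` API: `ToyLengthSpec`, `ToyLengthScorer`, and the `score` method are hypothetical, and the constructor arguments simply mirror the `base_device` and `config` keys of the YAML shape shown earlier on this page.

```python
from dataclasses import dataclass


@dataclass
class ToyLengthSpec:
    """Hypothetical spec; it would be referenced by `_target_` under reward.backend.config."""
    batch_size: int = 8
    max_chars: int = 100


class ToyLengthScorer:
    """Hypothetical scorer: longer text scores higher, capped at 1.0."""

    def __init__(self, config, base_device="cpu"):
        self.config = config
        self.base_device = base_device
        self.loaded = True  # eager load: ready as soon as it is constructed

    def offload(self):
        self.loaded = False  # a real scorer would move weights off the device

    def onload(self):
        self.loaded = True

    def score(self, texts):
        if not self.loaded:
            raise RuntimeError("scorer is offloaded; call onload() first")
        scores = []
        for start in range(0, len(texts), self.config.batch_size):
            for text in texts[start:start + self.config.batch_size]:
                capped = min(len(text), self.config.max_chars)
                scores.append(capped / self.config.max_chars)
        return scores


scorer = ToyLengthScorer(ToyLengthSpec(batch_size=2, max_chars=8))
print(scorer.score(["abcd", "abcdefghij", ""]))
```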

---

# GitHub Issues Workflow (/en/docs/others/github-issues-workflow)

> Issue title, template, labeling, project board, and gh CLI conventions.

This page captures repository workflow conventions. It lives under `Others` because it is useful for project coordination, but it is not part of the UniRL runtime or API surface.

## Issue Title Convention [#issue-title-convention]

Use prefix tags to categorize issues clearly:

| Prefix       | Usage                              | Example                                          |
| ------------ | ---------------------------------- | ------------------------------------------------ |
| `[Feature]`  | New feature                        | `[Feature] Add mixed precision training support` |
| `[Bug]`      | Bug fix                            | `[Bug] Gradient accumulation NaN in step 500+`   |
| `[Task]`     | Concrete work item                 | `[Task] Refactor backward_train_step`            |
| `[RFC]`      | Discussion before implementation   | `[RFC] Async rollout pipeline design`            |
| `[Tracking]` | Parent issue that tracks sub-tasks | `[Tracking] Training Pipeline Optimization`      |
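
If you want to check titles automatically, the convention is easy to express as a regular expression. This is a hypothetical helper, not part of the repository; `has_valid_prefix` and the requirement of a single space after the tag are assumptions.

```python
import re

ALLOWED_PREFIXES = ("Feature", "Bug", "Task", "RFC", "Tracking")

# "[Prefix] " followed by a non-empty title.
TITLE_RE = re.compile(r"^\[(%s)\] \S.*$" % "|".join(ALLOWED_PREFIXES))


def has_valid_prefix(title):
    """Return True when the title starts with one of the allowed prefix tags."""
    return TITLE_RE.match(title) is not None


print(has_valid_prefix("[Task] Refactor backward_train_step"))
print(has_valid_prefix("Refactor backward_train_step"))
```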

## Issue Template [#issue-template]

Every issue should include:

```markdown
## Background
Why this needs to be done. 1-2 sentences.

## Objective
What specific outcome is expected.

## Tasks
- [ ] Sub-task 1
- [ ] Sub-task 2
- [ ] Sub-task 3

## References
- Related code path: `unirl/xxx/yyy.py`
- Related paper or link

## Acceptance Criteria
How do we know this is done?
```

## Label System [#label-system]

Recommended priority labels:

| Label              | Description                |
| ------------------ | -------------------------- |
| `priority: high`   | Must be done ASAP          |
| `priority: medium` | Should be done this sprint |
| `priority: low`    | Nice to have               |

Recommended module labels:

| Label               | Description                            |
| ------------------- | -------------------------------------- |
| `module: training`  | Training pipeline related              |
| `module: inference` | Inference and rollout related          |
| `module: sampling`  | Sampling strategy related              |
| `module: algorithm` | Algorithm related                      |
| `module: infra`     | Infrastructure and distributed related |

## Workflow [#workflow]

```text
Create tracking issue
  -> create sub-issues
  -> assign owners
  -> update progress in comments
  -> submit PR with "Closes #xx"
  -> code review
  -> merge and auto-close issue
```

## CLI Quick Reference [#cli-quick-reference]

```bash
gh issue create \
  --title "[Task] Refactor backward_train_step gradient accumulation" \
  --body "## Objective
Optimize gradient accumulation logic.

## Tasks
- [ ] Analyze current implementation
- [ ] Refactor code
- [ ] Add tests" \
  --assignee teammate-username \
  --label "module: training,priority: high"

gh issue list --assignee teammate-username --state open
gh issue view 101
gh issue close 101
gh issue comment 101 --body "Progress update: completed steps 1 and 2."
```

## Best Practices [#best-practices]

* Keep one clear owner per issue.
* Break large tasks into issues that can be completed in one to three days.
* Cross-reference related issues and PRs with `#issue_number`.
* Update active issues regularly.
* Use `Closes #xx` in PR descriptions for automatic tracking.
* Use `[RFC]` issues for non-trivial design discussions before coding.