Evaluation

How quality is measured today (reward scores), the eval plumbing that exists, and what is not wired yet.

Status. The training driver has no automatic periodic evaluation loop yet; periodic eval is a deferred follow-up. Today, the in-training quality signal is the reward score, and offline benchmarking is done with external tools. This page documents what exists so that you do not assume a harness that has not been wired in.

Quality Signal During Training

The reward components configured on a recipe are the primary quality signal. Each rollout scores generated media and logs per-sample and per-component reward (visible in WandB when enabled). Built-in scorers under unirl/reward/local/ include PickScore, HPS, OCR, GenEval2, VideoPickScore, and rule-based matchers.

To "evaluate" a checkpoint or recipe today, read the reward curves and per-component breakdown. See Rewards for configuration and the generated Reward Package README.
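As a rough illustration of reading a per-component breakdown, the sketch below averages per-sample scores by component. The record layout (one dict per sample, mapping component name to score) and the function name summarize_component_rewards are assumptions made for this example; they are not the format the repo logs to WandB.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def summarize_component_rewards(
    samples: Iterable[Mapping[str, float]],
) -> Dict[str, float]:
    """Average each reward component over the samples that report it.

    Hypothetical helper: assumes every sample is a mapping of
    component name -> score, which may differ from the real log schema.
    """
    by_component = defaultdict(list)
    for sample in samples:
        for name, score in sample.items():
            by_component[name].append(float(score))
    # A component missing from some samples is averaged only over
    # the samples where it appears.
    return {name: mean(scores) for name, scores in by_component.items()}


if __name__ == "__main__":
    rollout = [
        {"pickscore": 0.5, "ocr": 1.0},
        {"pickscore": 1.5},
    ]
    print(summarize_component_rewards(rollout))
```

Comparing these per-component means across checkpoints gives a coarse view of which component is driving a change in the total reward.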

Eval Plumbing (currently inert)

The configuration surface for a separate eval set already exists, but the driver does not call it yet:

EVAL_DATA_PATH → run.eval_data_path selects a separate eval prompt file, loaded in deterministic (unshuffled) order. If unset, it falls back to run.data_path.
cfg.evaluation.eval_steps is present in some recipes (for example WAN and qwen_vl_argrpo_geo3k_mc_4x8), and many SD3 recipes set it to 0.
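The fallback described above can be sketched as follows. This is a minimal illustration, not the repo's loader: the function name resolve_eval_data_path and the use of plain dicts for the run config and environment are assumptions made for the example.

```python
import os
from typing import Mapping, Optional


def resolve_eval_data_path(
    run_cfg: Mapping[str, Optional[str]],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Pick the eval prompt file, falling back to the training data path.

    Hypothetical helper: assumes EVAL_DATA_PATH, when set, populates
    run.eval_data_path, and that an unset or empty value means "fall back".
    """
    if env is None:
        env = os.environ
    eval_path = env.get("EVAL_DATA_PATH") or run_cfg.get("eval_data_path")
    # If no eval path is set, reuse run.data_path.
    return eval_path or run_cfg.get("data_path")


if __name__ == "__main__":
    cfg = {"data_path": "train_prompts.txt", "eval_data_path": None}
    print(resolve_eval_data_path(cfg, env={}))
    print(resolve_eval_data_path(cfg, env={"EVAL_DATA_PATH": "eval_prompts.txt"}))
```

Because the eval file is read in deterministic (unshuffled) order, two runs that resolve to the same path would see the same prompts in the same sequence.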

These feed helpers (get_eval_samples, build_eval_request_batch, should_eval, run_eval_pipeline, log_eval) that are implemented but have no driver callers today. Treat them as forward-looking until a periodic-eval hook lands.
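For orientation, a periodic-eval gate of the kind a future hook might use could look like the sketch below. It is not the repo's should_eval; the name should_eval_sketch, its signature, and the convention that eval_steps of 0 disables evaluation are all assumptions made for this example.

```python
def should_eval_sketch(step: int, eval_steps: int) -> bool:
    """Return True when an eval pass would be due at this training step.

    Hypothetical gate: assumes eval_steps <= 0 means "never evaluate"
    and that step 0 is skipped.
    """
    if eval_steps <= 0:
        return False
    return step > 0 and step % eval_steps == 0


if __name__ == "__main__":
    # With eval_steps=50, this gate fires at steps 50, 100, 150, ...
    due = [s for s in range(0, 201) if should_eval_sketch(s, 50)]
    print(due)
```

Under this assumed convention, a recipe that sets eval_steps to 0 would never trigger an eval pass even after a driver hook exists.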

Offline / External Benchmarks

For benchmark-style evaluation (for example GenEval), use the external OpenMMLab stack. It is intentionally not a default dependency; setup is in Geneval MMCV Setup. There is no in-repo GenEval or FID harness.

The [eval] install extra adds torchvision and easyocr. OCR scoring actually uses paddleocr, which is installed separately; easyocr is currently unused. The [eval] extra is not a separate evaluation framework.

If You Need Eval Now

Reuse a reward scorer as a metric: run the recipe with the target reward component and read its logged values.
Score generated media offline with the standalone reward service in unirl-reward-service/.
Track the eval-loop work on the Roadmap (UI / observability and benchmark tracks).
