Rewards
Reward service, local and remote backends, and extension points.
unirl.reward constructs and runs reward backends. Rollout engines generate media; a reward backend scores it and returns per-sample values that the trainer turns into advantages.
Structure
A reward is exactly one backend, held by RewardService (unirl/reward/service.py), which scores a RolloutTrack via score_and_attach:
- a local in-process scorer (unirl/reward/local/: PickScore, HPS, OCR, GenEval2, VideoPickScore, …), or
- the remote HTTP client (RemoteRewardBackend) talking to the standalone unirl-reward-service/ server.
The current YAML shape, component contract, and scorer extension workflow live in the generated Reward Package README.
Config Shape
A reward is wired via Hydra _target_:
reward:
  _target_: unirl.reward.service.RewardService
  backend:
    _target_: unirl.reward.local.pickscore.PickScoreRewardScorer
    base_device: cuda
    config:
      _target_: unirl.reward.local.pickscore.PickScoreSpec
      batch_size: 8

For out-of-process scoring, point backend._target_ at unirl.reward.remote.RemoteRewardBackend with a RemoteRewardSpec (base_url, required_rewards, reward_weights, input_kind). The remote service runs from unirl-reward-service/ on its own GPU node.
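To make the _target_ wiring concrete, here is a minimal sketch of how a nested _target_ tree can resolve into nested objects. It is not Hydra's implementation (hydra.utils.instantiate does considerably more), and the helper name instantiate plus the use of types.SimpleNamespace as a stand-in for RewardService, the scorer, and the spec are assumptions made so the example runs without unirl installed.

```python
import importlib
from typing import Any


def instantiate(node: Any) -> Any:
    """Recursively build objects from a nested dict that uses `_target_` keys.

    Illustrative only: Hydra's real instantiate supports far more features.
    """
    if isinstance(node, dict):
        # Build children first so nested targets arrive as finished objects.
        built = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        if "_target_" not in node:
            return built
        # Split "package.module.Class" into module path and attribute name.
        module_path, _, attr = node["_target_"].rpartition(".")
        cls = getattr(importlib.import_module(module_path), attr)
        return cls(**built)
    if isinstance(node, list):
        return [instantiate(v) for v in node]
    return node


# Same shape as the reward config, with a stdlib stand-in for every target.
config = {
    "_target_": "types.SimpleNamespace",          # stands in for RewardService
    "backend": {
        "_target_": "types.SimpleNamespace",      # stands in for the scorer
        "base_device": "cuda",
        "config": {
            "_target_": "types.SimpleNamespace",  # stands in for the spec
            "batch_size": 8,
        },
    },
}

service = instantiate(config)
print(service.backend.base_device, service.backend.config.batch_size)
```

The point of the sketch is the nesting: the spec is built first, passed to the backend as config, and the backend is then passed to the service.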
Failure Semantics
Reward failures are loud, never silent: a non-finite or null reward is flagged as a sample failure, and RewardService.score_and_attach raises on any failure (naming the offending reward and sample) so an inference error stops the step rather than poisoning the GRPO group.
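The fail-loud rule can be sketched as a check over per-reward score lists. This is an assumption-laden illustration, not the code in unirl/reward/service.py: RewardFailure, find_failures, and check_rewards are hypothetical names, and the real exception type and message format may differ.

```python
import math
from typing import List, Mapping, Optional, Sequence, Tuple


class RewardFailure(RuntimeError):
    """Hypothetical error type; the real exception class may differ."""


def find_failures(
    rewards: Mapping[str, Sequence[Optional[float]]],
) -> List[Tuple[str, int]]:
    """Return (reward name, sample index) for every null or non-finite value."""
    failures = []
    for name, values in rewards.items():
        for idx, value in enumerate(values):
            # None is checked first so math.isfinite never sees a non-number.
            if value is None or not math.isfinite(value):
                failures.append((name, idx))
    return failures


def check_rewards(rewards: Mapping[str, Sequence[Optional[float]]]) -> None:
    """Raise on any failure, naming the offending reward and sample."""
    failures = find_failures(rewards)
    if failures:
        detail = ", ".join(f"{name}[sample {idx}]" for name, idx in failures)
        raise RewardFailure(f"reward scoring failed for: {detail}")


# One NaN and one null are enough to stop the step.
scores = {"pickscore": [0.21, float("nan"), 0.19], "ocr": [1.0, 0.5, None]}
try:
    check_rewards(scores)
except RewardFailure as err:
    message = str(err)
    print(message)
```

Raising here, rather than substituting a default such as 0.0, keeps a bad value out of the group statistics that the advantages are computed from.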
Adding a Local Scorer
Add the spec and scorer near the reward implementation under unirl/reward/local/. Prefer LocalRewardBackend for in-process model scorers because it provides device resolution, eager load, offload(), and onload(). Define the spec as a plain @dataclass and reference it by _target_ under reward.backend.config in the recipe (no registration step). See the generated Reward Package README for the full example and the remote wire contract.
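As a rough illustration of the spec-plus-scorer pairing, the sketch below defines a plain dataclass spec and a toy scorer that consumes it. Everything in it is hypothetical: CaptionLengthSpec, CaptionLengthScorer, the score method, and the constructor arguments are invented for the example, and a real in-process model scorer would build on LocalRewardBackend, whose actual interface is documented in the Reward Package README rather than here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CaptionLengthSpec:
    """Hypothetical spec: a plain dataclass, referenced by _target_ in a recipe."""

    batch_size: int = 8
    target_words: int = 12


class CaptionLengthScorer:
    """Hypothetical toy scorer that needs no model weights.

    A real model scorer would subclass LocalRewardBackend to get device
    resolution, eager load, offload() and onload(); that base class and its
    exact method names are not reproduced in this sketch.
    """

    def __init__(self, config: CaptionLengthSpec, base_device: str = "cpu") -> None:
        self.config = config
        self.base_device = base_device

    def score(self, captions: Sequence[str]) -> List[float]:
        """Return one finite score in (0, 1] per caption, batch by batch."""
        scores: List[float] = []
        step = self.config.batch_size
        for start in range(0, len(captions), step):
            for caption in captions[start:start + step]:
                # Score peaks at 1.0 when the word count hits the target.
                gap = abs(len(caption.split()) - self.config.target_words)
                scores.append(1.0 / (1.0 + gap))
        return scores


scorer = CaptionLengthScorer(CaptionLengthSpec(batch_size=2, target_words=3))
result = scorer.score(["a red cube", "a cube", "one two three four five"])
print(result)
```

Because the spec is an ordinary dataclass, its fields map one-to-one onto the keys under reward.backend.config, which is why no registration step is needed.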