Data Preparation
Prompt file formats, the per-prompt schema, image/condition inputs, and how prompts expand into rollout groups.
UniRL training is prompt-first: a data file supplies prompts, and rollout engines generate media that reward components score. This page documents the accepted file formats and the per-prompt schema. For where datasets are mounted, see Data and Models.
File Formats
Point a recipe's data_source at a prompt file in one of two ways: set the
DATA_PATH environment variable (for recipes that interpolate it), or override
the data_source's data_path with a Hydra override. The reader
(unirl/data/datasets.py) accepts three extensions; any other extension raises
an error:
| Extension | Parsing |
|---|---|
| .txt | One prompt per non-empty line. Each line becomes {"prompt": <line>}. |
| .jsonl | One JSON object per non-empty line. |
| .json | A list of strings or objects, or a dict with a prompts list, a caption, or a configurable prompt key (default prompt). |
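The dispatch described in the table can be sketched roughly as follows. This is an illustrative reader, not the code in unirl/data/datasets.py: the function name read_prompt_file, the prompt_key parameter, the case-insensitive extension check, and the handling of a dict without a prompts list are all assumptions.

```python
import json
from pathlib import Path


def read_prompt_file(path, prompt_key="prompt"):
    """Hypothetical reader: return raw prompt dicts from a .txt, .jsonl, or .json file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # One prompt per non-empty line.
        return [{"prompt": line.strip()} for line in text.splitlines() if line.strip()]

    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]

    if suffix == ".json":
        data = json.loads(text)
        if isinstance(data, dict):
            # Assumption: a dict either wraps a "prompts" list or is a single example.
            data = data["prompts"] if "prompts" in data else [data]
        # Bare strings are wrapped so that every row is a dict.
        return [{prompt_key: row} if isinstance(row, str) else row for row in data]

    # Any other extension is rejected.
    raise ValueError(f"Unsupported prompt file extension: {suffix!r}")
```

The rows returned here are still raw; field validation and defaults belong to the per-prompt normalization step.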
Minimal JSON:
[
{"prompt": "A watercolor landscape with snowy mountains at sunrise."},
{"prompt": "A cinematic portrait of a robot reading under warm light."}
]

A plain .txt file (one prompt per line, like the committed
datasets/pickscore/train.txt) works for text-to-video recipes too:
A drone shot flying over a misty pine forest at dawn.
Time-lapse of clouds rolling over a desert canyon.

Per-Prompt Schema
Each object is normalized to a prompt example:
| Field | Required | Notes |
|---|---|---|
| prompt (or caption) | yes | Non-empty text. |
| prompt_id | no | Auto-generated as {filename}:{index} if omitted. |
| metadata | no | Free-form dict. If omitted, any extra top-level keys become metadata. |
| media / media_refs | no | List of media references; each is {modality, role, uri}. |
If metadata is omitted, extra top-level keys (anything other than prompt,
caption, media, media_refs, metadata, prompt_id) are folded into it. If
you pass an explicit metadata dict it is used as-is, so put any extra fields
inside it. Legacy precomputed-embedding fields (for example prompt_embed_path,
prompt_embeds) are rejected with a hard error — embeddings are computed at runtime.
There is no negative_prompt or per-row seed in the data file. Guidance scale,
seed, and resolution come from cfg.sampling, not from manifest rows.
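The normalization rules above can be illustrated with a small sketch. It is not the project's implementation: normalize_example, RESERVED_KEYS, and LEGACY_KEYS are hypothetical names, the legacy list contains only the two fields named in the text, and dropping extra top-level keys when an explicit metadata dict is present is an assumption drawn from the wording above.

```python
RESERVED_KEYS = {"prompt", "caption", "media", "media_refs", "metadata", "prompt_id"}
# Only the two legacy names mentioned in the text; the real list may be longer.
LEGACY_KEYS = {"prompt_embed_path", "prompt_embeds"}


def normalize_example(row, filename, index):
    """Hypothetical normalizer: turn one raw row into a prompt example."""
    legacy = LEGACY_KEYS.intersection(row)
    if legacy:
        # Precomputed embeddings are rejected outright.
        raise ValueError(f"Precomputed-embedding fields are not supported: {sorted(legacy)}")

    prompt = row.get("prompt") or row.get("caption")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError(f"Row {index} of {filename} has no non-empty prompt or caption")

    if "metadata" in row:
        # Explicit metadata is used as-is (assumption: other extra keys are ignored).
        metadata = dict(row["metadata"])
    else:
        # Otherwise every non-reserved top-level key is folded into metadata.
        metadata = {k: v for k, v in row.items() if k not in RESERVED_KEYS}

    return {
        "prompt": prompt,
        "prompt_id": row.get("prompt_id", f"{filename}:{index}"),
        "metadata": metadata,
        "media_refs": list(row.get("media_refs") or row.get("media") or []),
    }
```

For example, normalize_example({"caption": "a cat", "source": "web"}, "prompts.json", 3) yields a prompt_id of prompts.json:3 and a metadata dict of {"source": "web"}.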
Image-Conditioned and Edit/I2V Inputs
For image-to-video, editing, or other conditioned recipes, attach a condition
image through media_refs:
{
"prompt": "Animate this scene with gentle falling snow.",
"media_refs": [
{"modality": "image", "role": "condition", "uri": "frames/scene_01.png"}
]
}

- Relative URIs resolve against the dataset file's directory.
- Absolute paths and http://, https://, s3://, gs:// URIs pass through unchanged.
- Today the driver loads exactly one (modality="image", role="condition") ref per prompt; other modality/role pairs raise NotImplementedError.
- There is no video URI role in the data contract: text-to-video uses .txt prompts, and image-to-video uses an image condition ref.
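The URI rules in this list can be sketched as two small helpers. The names resolve_media_uri and pick_condition_image are hypothetical, and raising NotImplementedError when a prompt carries more than one condition ref is an assumption; the text only states that exactly one such ref is loaded.

```python
from pathlib import Path

REMOTE_SCHEMES = ("http://", "https://", "s3://", "gs://")


def resolve_media_uri(uri, dataset_file):
    """Hypothetical helper: resolve a media URI relative to the dataset file."""
    if uri.startswith(REMOTE_SCHEMES) or Path(uri).is_absolute():
        # Remote URIs and absolute paths pass through unchanged.
        return uri
    # Relative URIs resolve against the directory containing the dataset file.
    return str(Path(dataset_file).parent / uri)


def pick_condition_image(media_refs, dataset_file):
    """Return the resolved URI of the single image condition ref, or None if there is none."""
    for ref in media_refs:
        if (ref["modality"], ref["role"]) != ("image", "condition"):
            raise NotImplementedError(
                f"Unsupported media ref: {ref['modality']}/{ref['role']}"
            )
    if len(media_refs) > 1:
        # Assumption: more than one condition ref is treated as unsupported.
        raise NotImplementedError("Only one image condition ref per prompt is supported")
    return resolve_media_uri(media_refs[0]["uri"], dataset_file) if media_refs else None
```

Resolving against the dataset file's directory, rather than the working directory, keeps a manifest and its frames relocatable as a unit.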
How Prompts Become Rollout Groups
Two knobs control batch shape, and they apply at different stages:
- prompts_per_rollout is the number of distinct prompts sampled per rollout (the data loader's batch size). Prompts are not pre-duplicated.
- sampling.samples_per_prompt repeats each prompt k times later, in the rollout pipeline, to form a k-sample GRPO group. Siblings share a group_id and get sample_id values like prompt:<gid>:sample:<j>.
So one rollout produces prompts_per_rollout × sampling.samples_per_prompt
samples.
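As a rough illustration of that product, the sketch below expands a batch of distinct prompts into sibling samples. The function expand_to_groups is hypothetical, and using the prompt's position in the batch as its group id is an assumption; only the sample_id pattern comes from the text above.

```python
def expand_to_groups(prompts, samples_per_prompt):
    """Hypothetical expansion: repeat each prompt into one group of sibling samples."""
    samples = []
    for gid, prompt in enumerate(prompts):
        for j in range(samples_per_prompt):
            samples.append({
                "prompt": prompt,
                "group_id": gid,  # shared by all siblings of this prompt
                "sample_id": f"prompt:{gid}:sample:{j}",
            })
    return samples


batch = ["a red fox", "a blue bird"]   # prompts_per_rollout = 2
samples = expand_to_groups(batch, 4)   # sampling.samples_per_prompt = 4
print(len(samples))                    # prints 8
```

Because the repetition happens after batching, the data loader only ever sees distinct prompts, and the group size can change without touching the prompt file.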
Data Source Selection
| Source | When | Selected by |
|---|---|---|
| MultimodalRLDataSource | real runs; reads the configured data_path, shuffles, drops the last partial batch | recipes set data_source._target_: unirl.data.data_source.MultimodalRLDataSource (the default) |
| DefaultDataSource | smoke checks; ignores data_path and cycles a few built-in prompts | a recipe pointing data_source._target_ at unirl.data.data_source.DefaultDataSource |
EVAL_DATA_PATH points at a separate eval prompt file (loaded in deterministic
order); training batches always come from the configured data_path. See
Evaluation for the current status of the eval path.
Worked Example
# 1. Author prompts.json (a list of {"prompt": ...} objects).
# 2. Point DATA_PATH at it and launch a recipe whose data source reads files.
DATA_PATH=/abs/path/prompts.json \
OUTPUT_DIR=/abs/path/outputs/run1 \
bash examples/run_experiment_single_node.sh diffusion/sd3_trainside

Validate composition before launching Ray work:
DATA_PATH=/abs/path/prompts.json \
python -m unirl.train_diffusion --config-name=diffusion/sd3_trainside --cfg job --resolve