Releases: huggingface/trl
v1.0.0
Read our blog post for an overview of TRL v1.
Features
Asynchronous GRPO
Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.
```python
from datasets import load_dataset

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
```

by @qgallouedec in #5293
Variational Sequence-Level Soft Policy Optimization (VESPO)
VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:
```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
```

Divergence Proximal Policy Optimization (DPPO)
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
by @LeonEricsson in #5117
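As a rough illustration of the difference, PPO's clipped surrogate zeroes the gradient once the probability ratio leaves the trust region, whereas a divergence penalty shapes it smoothly. The single-token sketch below is illustrative only; the helper names and the particular KL estimator are assumptions, not DPPO's actual objective:

```python
import math

def ppo_clip_weight(ratio: float, adv: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one token: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * adv, clipped * adv)

def kl_penalized_objective(ratio: float, adv: float, beta: float = 0.05) -> float:
    """Penalty form of a divergence constraint: keep the unclipped surrogate but
    subtract a nonnegative KL estimate (1/r - 1 + log r, zero at r = 1) that
    grows smoothly as the policy drifts from the behavior policy."""
    kl = 1.0 / ratio - 1.0 + math.log(ratio)
    return ratio * adv - beta * kl
```

With a positive advantage, the clipped surrogate is flat in the ratio beyond 1 + eps, while the penalized objective keeps a shrinking but nonzero gradient; that is the intuition behind trading clipping for a divergence constraint.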
Self-Distillation Policy Optimization (SDPO)
SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.
```python
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)
trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,  # a prepared `datasets.Dataset`
)
trainer.train()
```

by @MengAiDev in #4935
Reward functions can now log extra columns and scalar metrics
Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.
```python
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards
```

by @manueldeprada in #5233
Tool calling support in VLLMClient.chat()
VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
by @kansalaman in #4889
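The tools argument follows the OpenAI-style function schema commonly accepted by vLLM chat endpoints. The sketch below builds such a schema by hand; the commented-out client call is hypothetical usage, since the exact `VLLMClient.chat()` signature and import path may differ:

```python
def get_weather(city: str) -> str:
    """Toy tool the model may choose to call."""
    return f"Sunny in {city}"

# OpenAI-style function schema, the common format for chat tool calling
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# Hypothetical usage (signature and import path may differ across versions):
# client = VLLMClient()
# result = client.chat(messages, tools=tools)
```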
35% faster packing
BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.
by @mariosasko in #5189
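For context, BFD (best-fit decreasing) packs variable-length sequences into fixed-capacity bins to minimize padding: longest first, each sequence goes into the bin with the least remaining room that still fits it. A minimal pure-Python sketch of the strategy (not TRL's actual implementation):

```python
def bfd_pack(lengths, max_length):
    """Best-fit decreasing bin packing over sequence lengths.

    Places each length (longest first) into the open bin with the least
    remaining room that can still hold it; opens a new bin otherwise.
    """
    bins = []  # each bin is a list of sequence lengths
    for length in sorted(lengths, reverse=True):
        best = None
        for b in bins:
            room = max_length - sum(b)
            if length <= room and (best is None or room < max_length - sum(best)):
                best = b
        if best is None:
            bins.append([length])
        else:
            best.append(length)
    return bins
```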
[GKD] Buffer implementation and vLLM inference for distillation trainer
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.
by @cmpatino in #5137 and #5388
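The buffering idea can be sketched generically: a generation worker keeps a queue of rollouts topped up while the update loop drains it, so neither side blocks the other. A minimal illustration (the names and structure are assumptions, not the trainer's code):

```python
from collections import deque

class RolloutBuffer:
    """Holds pre-generated rollouts so gradient updates never wait on generation."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = deque()

    def needs_refill(self) -> bool:
        # The generation worker polls this to decide when to produce more rollouts.
        return len(self.buffer) < self.capacity

    def add(self, rollout) -> None:
        self.buffer.append(rollout)

    def next_batch(self, batch_size: int):
        # The update loop drains from the front, oldest rollouts first.
        return [self.buffer.popleft() for _ in range(min(batch_size, len(self.buffer)))]
```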
v0 → v1 migration guide
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in #5255
Other
- Change default `vllm_mode` to `"colocate"` by @qgallouedec in #5255
- Support `truncation_mode` in SFT by @albertvillanova in #5306
- Support `max_length` in DPO VLM training by @albertvillanova in #5284
- Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
- Support sequence sampling in Liger Kernel by @michaelroyzen in #5190
- Add tool calling support to VLLMClient.chat() by @kansalaman in #4889
- Add support for raw token IDs in vLLM client prompts by @qgallouedec in #5225
- Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
- Enhance `print_prompt_completions_sample` to include reasoning content by @qgallouedec in #5327
- Add support for `pixel_position_ids` vision key by @qgallouedec in #5374
- Add second version of Qwen 3.5 chat template by @apardyl in #5405
- Pass tools as `None` to `apply_chat_template` when it is an empty list by @rabinadk1 in #5380
Fixes
- Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
- Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
- Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
- Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
- Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in #5281
- Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
- [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
- [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
- Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in #5246
- fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
- Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in #5212
- Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
- Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
- Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
- Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
- Clean up model update group on worker exit by @AmineDiro in #5325
- Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in #5330
- Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
- Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in #5354
Documentation and Examples
- Add minimal CARLA example script by @sergiopaniego in #5161
...
v1.0.0rc1
Features
Variational Sequence-Level Soft Policy Optimization (VESPO)
VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:
```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
```

Divergence Proximal Policy Optimization (DPPO)
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
by @LeonEricsson in #5117
Reward functions can now log extra columns and scalar metrics
Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.
```python
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards
```
by @manueldeprada in #5233
Tool calling support in VLLMClient.chat()
VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
by @kansalaman in #4889
35% faster packing
BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.
by @mariosasko in #5189
[GKD] Buffer implementation for distillation trainer
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.
v0 → v1 migration guide
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in #5255
Other
- Change default `vllm_mode` to `"colocate"` by @qgallouedec in #5255
- Support `truncation_mode` in SFT by @albertvillanova in #5306
- Support `max_length` in DPO VLM training by @albertvillanova in #5284
- Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
- Support sequence sampling in Liger Kernel by @michaelroyzen in #5190
- Add tool calling support to VLLMClient.chat() by @kansalaman in #4889
- Add support for raw token IDs in vLLM client prompts by @qgallouedec in #5225
- Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
Fixes
- Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
- Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
- Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
- Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
- Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in #5281
- Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
- [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
- [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
- Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in #5246
- fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
- Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in #5212
- Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
- Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
- Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
- Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
- Clean up model update group on worker exit by @AmineDiro in #5325
Documentation and Examples
- Add minimal CARLA example script by @sergiopaniego in #5161
- Nemotron 3 examples added by @sergiopaniego in #5272
- Align docs about tool calling in trainers with dataset format by @albertvillanova in #5311
- Add repository-specific guidance for agents (`AGENTS.md`) by @qgallouedec in #5236
- Align documentation with the intended public API by @qgallouedec in #5162
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #5182
- Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in #5178
- Document parameters with differing default values in core configs by @albertvillanova in #5168
- Make _BaseConfig and _BaseTrainer explicitly private by @albertvillanova in #5169
- Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser by @albertvillanova in #5170
- Add minimal CARLA example script by @sergiopaniego in #5161
- Align documentation with the intended public API by @qgallouedec in #5162
- Fix deprecation warning of create_reference_model by @albertvillanova in #5184
- Fix deprecation warning of fork in multi-threaded process by @albertvillanova in #5185
- Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports by @albertvillanova in #5186
- Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports by @albertvillanova in #5187
- Fix CI tests patching BaseTrainer by @albertvillanova in #5192
- Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
- Re-add liger-kernel to dev deps by @qgallouedec in #5164
- Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM by @albertvillanova in #5197
- Support sequence sampling in Liger Kernel and pass importance_samplin… by @michaelroyzen in #5190
- Mark CI test_training_vlm_and_liger as xfail by @albertvillanova in #5202
- Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in #5122
- CI: Add Qwen 3.5 tiny model to tests by @qgallouedec in #5204
- Add support for Qwen3.5 f...
v0.29.1
What's Changed
- Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in #5178
- Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in #5212
- Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
- Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in #5122
- Simplify logic for structured outputs across vLLM versions by @albertvillanova in #5215
- Add support for raw ids in `prompts` in vLLM client and server by @qgallouedec in #5225
- Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
- Move `rollout_func` from `_generate_single_turn` to `_generate` by @qgallouedec in #5232
- [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in #5238
- Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in #5259
- [GRPO/RLOO] Unify tokenization across all generation backends in `_generate_single_turn` by @qgallouedec in #5239
- [GRPO/RLOO] Extract tokenize prompts from `_generate_single_turn` by @qgallouedec in #5240
- [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
- Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5258
- Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
- Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
- Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
- Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
- [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in #5242
New Contributors
- @davmels made their first contribution in #4639
- @falcondai made their first contribution in #5302
Full Changelog: v0.29.0...v0.29.1
v0.29.0
Features
Add environment_factory to GRPOTrainer
GRPOTrainer now accepts an environment_factory argument, allowing users to specify a custom environment class for training. This enables more flexible and diverse training scenarios by letting users define their own environments with specific dynamics and reward structures.
```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()
```

by @qgallouedec in #5093
Skills
TRL introduces agent-native CLI integration: trl-training, a first-class Agent Skill that exposes TRL's training workflows (SFT, DPO, GRPO, etc.) in a structured, agent-readable format. The skill is packaged directly with the trl library and can be installed via the CLI:
```shell
# Install into the project's agent directory (default scope=project), by agent name: claude, codex, opencode
trl skills install trl-training --target <agent>
```

This enables AI agents to safely and reproducibly execute TRL training workflows using a well-defined interface.
Skills can be installed at the project or global scope, and support explicit targets and overwrite controls.
- Implement Agent Skills [1/N]: Create training skill (MVP) by @albertvillanova in #5096
- Implement Agent Skills [2/N]: Create skills module by @albertvillanova in #5097
- Implement Agent Skills [3/N]: Create skills installer by @albertvillanova in #5100
- Implement Agent Skills [4/N]: Create skills CLI by @albertvillanova in #5103
Other
- Pass vllm_is_ratio to LigerFusedLinearGRPOLoss in compute_liger_loss by @yukiu00 in #5031
- feature: top_k selective_log_softmax by @LeonEricsson in #5104
- Add Trackio integration for model card visualization by @qgallouedec in #5101
- Update tool handling to support JSON string schemas in trainers by @qgallouedec in #5118
- Refactor DPO by @qgallouedec in #3906
- Add support for Python 3.14 by @albertvillanova in #4225
- Fix default learning_rate in PPO according to paper by @albertvillanova in #5174
- Fix default learning_rate in BCO according to paper by @albertvillanova in #5173
- feature: Configurable num logprobs in vLLM generation by @LeonEricsson in #5107
Fixes
- [GRPO] fix: remove SAPO temperature check by @LeonEricsson in #5042
- fix: Use `launch_args` for all trainers by @qgallouedec in #5059
- Fix GRPO multi-turn training with liger kernels by @albertvillanova in #4975
- fix: Set `num_labels` to 1 in causal model initialization for RewardTrainer by @qgallouedec in #5066
- [SFT] Fix high vRAM consumption during eval with liger kernel by @LoganVegnaSHOP in #5069
- Fix BFD packing for SFT datasets by @albertvillanova in #5076
- Fix DPO and RLOO incompatibility with FSDP2 by @flutist in #4838
- Fix SFT loss type rewards being overwritten in dpo_loss() by @Mr-Neutr0n in #5079
- Fix Qwen3 schema by @qgallouedec in #5111
- Add check for `None` in `get_trackio_space_url()` to prevent errors by @qgallouedec in #5115
- Fix `trl <command> --help` TypeError caused by unescaped `%` in `TrainingArguments` help strings by @albertvillanova in #5135
- Fix PPOTrainer.save_model by @albertvillanova in #5151
- Fix `SFTTrainer` support for single-image data by @qgallouedec in #5132
- Fix structured_outputs handling and tool normalization in vLLM backend by @ehofm in #5155
- fix: wake up vLLM weights before sync to prevent writes to freed memory by @bledden in #5147
- Accept mm_token_type_ids in GRPO/RLOO _get_per_token_logps_and_entropies by @albertvillanova in #5176
Documentation and Examples
- [minor] docs: typo in `grpo_trainer.md` by @casinca in #5047
- docs: add DeepSeek-R1 training dynamics and GRPO example by @JenWei0312 in #5053
- docs: Add INTELLECT-2 (2505.07291) to Paper Index by @behroozazarkhalili in #5061
- docs: Add REINFORCE++ (2501.03262) to Paper Index by @behroozazarkhalili in #5062
- docs: Add XPO (2405.21046) to Paper Index by @behroozazarkhalili in #5068
- docs: Add RPO paper (2405.16436) to paper index by @behroozazarkhalili in #5070
- docs: Add SimPO paper (2405.14734) to paper index by @behroozazarkhalili in #5071
- docs: Add TR-DPO paper (2404.09656) to paper index by @behroozazarkhalili in #5078
- docs: Add ORPO paper (2403.07691) to paper index by @behroozazarkhalili in #5080
- docs: Add CPO paper (2401.08417) to paper index by @behroozazarkhalili in #5081
- docs: Add GKD paper (2306.13649) to paper index by @behroozazarkhalili in #5082
- docs: Add PRM paper (2211.14275) to paper index by @behroozazarkhalili in #5083
- docs: Add T5 packing paper (1910.10683) to paper index by @behroozazarkhalili in #5084
- docs: Add PPO paper (1707.06347) to paper index by @behroozazarkhalili in #5085
- docs: Add MPO paper (2411.10442) to paper index by @behroozazarkhalili in #5089
- docs: add Multi-Node Training subsection (#4384) by @nabin2004 in #5091
- docs: Unify model examples to use trl-lib namespace by @behroozazarkhalili in #4431
- Add Tiny Aya tool calling examples (script/notebook) by @sergiopaniego in #5123
- Fix wording in DPO and SFT trainer documentation for clarity by @qgallouedec in #5140
- Fix type of TrainingArguments.logging_steps in docs by @albertvillanova in #5149
- Fix Liquid syntax error in DPO trainer docs caused by double braces in LaTeX by @albertvillanova in #5153
- Document parameters with differing default values in experimental configs by @albertvillanova in #5172
Deprecations
- Remove deprecated BCO after moved to experimental by @albertvillanova in #5045
- Remove deprecated CPO after moved to experimental by @albertvillanova in #5046
- Remove deprecated Judges after moved to experimental by @albertvillanova in #5048
- Remove deprecated ORPO after moved to experimental by @albertvillanova in #5050
- Remove deprecated PPO after moved to experimental by @albertvillanova in #5051
- Remove deprecated PRM after moved to experimental by @albertvillanova in #5052
- Remove deprecated XPO after moved to experimental by @albertvillanova in #5055
- Remove deprecated RLOOConfig.max_prompt_length by @albertvillanova in #5056
- Remove deprecated classes moved to experimental by @albertvillanova in #5044
- Remove deprecated mergekit_utils moved to experimental by @albertvillanova in #5057
- Rename input keys in `RewardTrainer` collator from `chosen/rejected_input_ids` to `chosen/rejected_ids` by @qgallouedec in #5179
CI Improvements
- Upgrade GitHub Actions to latest versions by @salmanmkc in #4893
- Remove duplicated tests for SFT and add gradient checkpointing tests by @qgallouedec in #5054
- Up...
v0.28.0
Features
- [GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in #4742
- Add retry strategy to vLLM Client for increased robustness by @apalmas-saifh in #4845
- Enable vLLM sleep mode for generation in Online DPO by @winglian in #4882
- Support tool call data in `is_conversational` by @qgallouedec in #4923
- [GRPO] Add parquet logging for completions with individual rewards by @qgallouedec in #4818
- Update wordle.py example with masking of env tokens by @sergiopaniego in #4895
- NeMo-Gym Integration by @cmunley1 in #4848
Experimental
- Refactor KTO coordinated with DPO [c/N]: Remove ref_model_init_kwargs by @albertvillanova in #4837
- Refactor KTO coordinated with DPO [e/N]: Remove label_pad_token_id by @albertvillanova in #4875
- Refactor KTO coordinated with DPO [d/N]: Remove base_model_attribute_name by @albertvillanova in #4862
- Fix type hint in `openenv/utils.py`: fallback for no vLLM installed case by @Datta0 in #4868
- Remove label_pad_token_id from experimental trainers by @albertvillanova in #4878
- GOLD training speed up by @141forever in #4888
- Remove ref_model_init_kwargs from experimental BCO by @albertvillanova in #4946
- Remove max_prompt_length from experimental PRM by @albertvillanova in #4963
- Remove max_prompt_length from experimental BCO by @albertvillanova in #4964
- Remove max_prompt_length from experimental CPO by @albertvillanova in #4965
- Remove max_prompt_length from experimental ORPO by @albertvillanova in #4966
- Remove padding_value from experimental CPO and use pad_token_id by @albertvillanova in #4962
Fixes
- Fix _patch_transformers_hybrid_cache for peft by @albertvillanova in #4844
- Refactor KTO [4/N]: Remove unused padding_value by @albertvillanova in #4839
- Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in #4852
- fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
- Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
- Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
- Fix import path for `get_open_port` based on vLLM version by @qgallouedec in #4883
- Fix RewardTrainer's results not reproducible by @liyc-ai in #4887
- `device_map` init consistency in GRPO/RLOO/KTO by @qgallouedec in #4909
- Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908
- Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
- Fix PPO run_name parameter not taking effect by @mel3c in #4945
- Remove access to `warnings_issued` by @qgallouedec in #4960
- Revert change in GRPO from NeMo-Gym Integration by @qgallouedec in #4970
Documentation and Examples
- Add Nash Learning from Human Feedback paper to paper index by @kansalaman in #4860
- Update OpenEnv dependency to new version for hf jobs scripts by @sergiopaniego in #4843
- Enhance GRPO documentation with scaling notes by @javadtaghia in #4849
- Created new PTT integration docs as requested by @adityachallapally in #4907
- docs: add DoRA (2402.09353) to Paper Index by @billycrapediem in #4892
Deprecations
- Remove unused padding_value from BCO by @albertvillanova in #4846
- Remove deprecated parameters by @qgallouedec in #4847
- Deprecate parameters in `DPOConfig` by @qgallouedec in #4969
- Replace `warmup_ratio` with `warmup_steps` by @qgallouedec in #4983
CI Improvements
- Support triggering CI via push to ci-* branches by @albertvillanova in #4840
- Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in #4833
- Use pytest-datadir in CI tests by @albertvillanova in #4836
- Fix CI with dev dependencies: Mark Qwen3-VL tests as xfail by @albertvillanova in #4851
- Use pytest-datadir for accelerate config files by @albertvillanova in #4861
- Update transformer version checks and documentation for lr_scheduler_kwargs workaround by @qgallouedec in #4876
- Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` by @qgallouedec in #4823
- Mark ZeRO 2 as xfail in distributed tests due to current failure by @qgallouedec in #4885
- Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks by @qgallouedec in #4898
- Fix CI NotImplementedError for bfloat16 by @albertvillanova in #4902
- Fix CI AssertionError: Parameter has not changed by @albertvillanova in #4904
- Fix CI TypeError in llm-blender tests by @albertvillanova in #4919
- Fix CI AssertionError: assert not True by @albertvillanova in #4921
- Fix CI ValueError for 0 temperature by @albertvillanova in #4916
- Set model dtype to float32 in tests of trainers by @albertvillanova in #4924
- Set model dtype to float32 in experimental tests of trainers by @albertvillanova in #4925
- Add test for training with `compute_metrics` in `SFTTrainer` by @qgallouedec in #4950
- Add test for tool call data in `RewardTrainer` by @qgallouedec in #4959
- Add test for training with `compute_metrics` in `RewardTrainer` by @qgallouedec in #4958
- Fix test_train_with_chat_template_kwargs by @qgallouedec in #4971
Miscellaneous
- Update `CITATION.cff` by @qgallouedec in #4856
- Update generate_tiny_models.py: CohereForAI -> CohereLabs by @Michellehbn in #4877
- Refactor vLLM generation [1/N]: Extract vLLM generation by @albertvillanova in #4700
- Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` by @qgallouedec in #4911
- Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` by @qgallouedec in #4910
- Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer by @qgallouedec in #4913
- Remove gradient checkpointing option from various training scripts by @qgallouedec in #4905
- Remove chat template setup in dpo_vlm.py by @qgallouedec in #4906
- Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests by @qgallouedec in #4914
- Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models by @qgallouedec in #4912
- Require transformers<5 with PairRMJudge by @albertvillanova in #4926
- Move VLLMClient to generation module by @albertvillanova in #4928
- Fix profiling of VLLMGeneration.sync_weights by @albertvillanova in #4931
- Fix import statement for import_utils in vllm_client.py by @qgallouedec in #4932
- Set default top_k to 0 in VLLMClient by @albertvillanova in #4927
- Minor fix docs style by @albertvillanova in #4953
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #4835
- Support triggering CI via push to ci-* branches by @albertvillanova in #4840
- Revert CI hotfix pinning transformers 4.57.4 after tiny mo...
v0.27.2
What's Changed
- Remove access to `warnings_issued` by @qgallouedec in #4960
- Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
- Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908
Full Changelog: v0.27.1...v0.27.2
v0.27.1
What's Changed
- Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in #4852
- fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
- Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
- Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
- Fix RewardTrainer's results not reproducible by @liyc-ai in #4887
New Contributors
- @kdubovikov made their first contribution in #4873
- @liyc-ai made their first contribution in #4887
Full Changelog: v0.27.0...v0.27.1
v0.27.0
Features
- Add `vllm_group_port` argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in #4545
- Preserve truncated tokens in BFD packing by @qgallouedec in #4632
- Support async reward functions and parallelize call to reward functions. by @pramodith in #4567
- RLOO supports async rewards. by @pramodith in #4718
- Support vLLM 0.12.0 by @jiqing-feng in #4117
- feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in #4689
- 🎭 Up to 50% less VRAM during forward with `forward_masked_logits` function by @qgallouedec in #4729
- [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in #4761
- Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in #4811
- Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in #4785
Experimental
- Move `AutoModelForCausalLMWithValueHead` and `AutoModelForSeq2SeqLMWithValueHead` to experimental by @qgallouedec in #4654
- Move DPODataCollatorWithPadding to `experimental.utils` by @qgallouedec in #4667
- Move `DataCollatorForChatML` to `experimental.utils` by @qgallouedec in #4668
- Move `add_bos_token_if_needed` and `add_eos_token_if_needed` to `experimental.utils` by @qgallouedec in #4674
- Move `truncate_right` and `SIMPLE_CHAT_TEMPLATE` to `experimental.utils` by @qgallouedec in #4677
- Move `prepare_model_for_kbit_training`, `enable_gradient_checkpointing`, `prepare_peft_model` to `experimental.utils` by @qgallouedec in #4704
- Move `get_reward` function to `experimental.utils` by @qgallouedec in #4683
- Remove experimental imports from testing_utils by @albertvillanova in #4727
- ORPO: Avoid catastrophic cancellation in loss function by @hartmans in #4763
- Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in #4783
- [GOLD] add probability merging fix to implement chain rule by @kashif in #4765
- Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in #4792
- Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in #4808
Fixes
- Accounting for case `num_generations_eval=1` in the calculation of the advantage by @qgallouedec in #4662
- Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
- Fix GRPO config validation in case `num_generations_eval` is specified and different than `num_generations` by @apalmas-saifh in #4682
- Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in #4695
- Include `generation_config` for tiny model uploads by @qgallouedec in #4643
- Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in #4691
- Overwrite model default generation config used by model.generate by @albertvillanova in #4647
- Fix: handle multiple tool calls in
qwen3_schemaby @mattbui in #4709 - Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in #3950
- Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in #4781
- Monkey patch for `HybridCache` in Liger-Kernel with transformers v5 by @qgallouedec in #4798
- [fix] GRPOTrainer: properly access `args` by @carlyou in #4801
- Fix vLLM compat patches so they are applied only to affected versions by @albertvillanova in #4815
- Fix bug when SFT computes `outputs.token_accuracy` by @kaixuanliu in #4814
- Fix XPU vLLM client/server by @jiqing-feng in #4780
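One fix above restores `top_k=0` as the value that disables top-k filtering. This follows the common sampling convention (also used by transformers generation configs) that zero means "keep all tokens". A minimal, illustrative sketch of that convention, not TRL's actual implementation:

```python
def top_k_filter(logits: list[float], k: int) -> list[float]:
    """Keep the k largest logits and mask the rest to -inf.

    By the usual sampling convention, k == 0 disables the filter
    entirely, so every token stays eligible for sampling.
    """
    if k == 0:
        return list(logits)
    # The k-th largest value is the cutoff; ties at the cutoff survive.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]
```

With `k=1` only the argmax token survives; with `k=0` the distribution is untouched, which is why a default of 0 rather than a positive value is the correct "off" state.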
Documentation and Examples
- docs: add RapidFire AI integration section to SFT Trainer by @kamran-rapidfireAI in #4661
- Fix environment image name for BrowserGym example script by @sergiopaniego in #4680
- Docs(`grpo_trainer.md`): Add Qwen SAPO details under Loss Types by @casinca in #4681
- [docs] Add GRPO, RSO and LoRA to Paper Index by @SSusantAchary in #4441
- Enable zero3 init and 16-bit model saving for ds ulysses config by @edbeeching in #4701
- Set version to packaged one in notebooks by @sergiopaniego in #4648
- BrowserGym example for LLMs (no vision) by @sergiopaniego in #4696
- docs: Add RapidFire AI cross-references to DPO and GRPO trainer docs by @kamran-rapidfireAI in #4705
- [docs] Fix RapidFire AI position in documentation by @qgallouedec in #4715
- Add inference example to GRPO agent training notebook by @sergiopaniego in #4710
- Upload FunctionGemma notebook by @sergiopaniego in #4721
- Update agents notebook dependencies by @sergiopaniego in #4724
- Add uv/hf jobs support to OpenEnv scripts by @sergiopaniego in #4720
- Add GRPO QLoRA free notebook by @sergiopaniego in #4660
- Hotfix for browsergym openenv notebook by @sergiopaniego in #4740
- docs: fix "Good Second Issue" redirection link by @casinca in #4749
- [Docs] Add SRL (Supervised Reinforcement Learning) to Community Tutorials by @s23deepak in #4758
- Add LFM2.5 to GRPO notebook by @sergiopaniego in #4793
- Sudoku GRPO example script using TextArena by @sergiopaniego in #4762
- [EXAMPLES] Update wordle to new openenv release by @burtenshaw in #4791
- Update the typos in docs/source/grpo_trainer.md by @Tianyi-Billy-Ma in #4804
- Update examples to new OpenEnv version by @sergiopaniego in #4796
- Update GRPO example to use Qwen2.5 instead of Qwen2 by @BurnyCoder in #4803
Deprecations
- Remove deprecated functions and parameters by @qgallouedec in #4651
- Remove `MergeModelCallback` from import structure by @qgallouedec in #4664
- Remove `ChatMlSpecialTokens` by @qgallouedec in #4666
- Remove unused `_win_rate_completions_df` function from callbacks by @qgallouedec in #4672
- Deprecate `max_prompt_length` in RLOOTrainer by @albertvillanova in #4703
- Small fix on contributing docs by @murilo-cunha in #4753
- Remove `DbrxForCausalLM` support by @qgallouedec in #4799
CI Improvements
- Hotfix CI due to generation config by setting tests as xfail by @albertvillanova in #4657
- Upgrade GitHub Actions to latest versions by @salmanmkc in #4734
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #4733
- Include data type for tiny models and update tests by @qgallouedec in #4728
- Change tiny model dtype from float16 to bfloat16 to fix CUDA error by @albertvillanova in #4745
- Add revision override mechanism for testing tiny models by @albertvillanova in #4769
- Hotfix: Set float32 as default dtype for testing tiny models by @albertvillanova in #4770
- Hotfix CI with dev dependencies: xfail test_training_vlm_and_liger by @albertvillanova in #4777
- Add initial multi-GPU CI tests for distributed training by @qgallouedec in #4784
- Set dtype default to float32 by @albertvillanova in #4778
- Test FSDP2 by @qgallouedec in #4813
- Test ZeRO Stage 3 by @qgallouedec in #4821
- Hotfix CI main test...
v0.26.2
What's Changed
- Overwrite model default generation config used by `model.generate` by @albertvillanova in #4647
Full Changelog: v0.26.1...v0.26.2
v0.26.1
What's Changed
- Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
- Fix GRPO config validation when `num_generations_eval` is specified and differs from `num_generations` by @apalmas-saifh in #4682
New Contributors
- @apalmas-saifh made their first contribution in #4663
Full Changelog: v0.26.0...v0.26.1