GRPO-trainer-HF: A Text Compression Task with a Length Reward

Peng Xia

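The code below comes from the Hugging Face TRL GRPO trainer tutorial.

Data and Code

The dataset is the TLDR dataset. TL;DR stands for "Too Long; Didn't Read", and the dataset provides text before and after compression: a post as the prompt and its short summary as the completion. Below is one example.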

# prompt

SUBREDDIT: r/relationships

TITLE: I (f/22) have to figure out if I want to still know these girls or not and would hate to sound insulting

POST: Not sure if this belongs here but it's worth a try.

Backstory:
When I (f/22) went through my first real breakup 2 years ago because he needed space after a year of dating roand it effected me more than I thought. It was a horrible time in my life due to living with my mother and finally having the chance to cut her out of my life. I can admit because of it was an emotional wreck and this guy was stable and didn't know how to deal with me. We ended by him avoiding for a month or so after going to a festival with my friends. When I think back I wish he just ended. So after he ended it added my depression I suffered but my friends helped me through it and I got rid of everything from him along with cutting contact.

Now: Its been almost 3 years now and I've gotten better after counselling and mild anti depressants. My mother has been out of my life since then so there's been alot of progress. Being stronger after learning some lessons there been more insight about that time of my life but when I see him or a picture everything comes back. The emotions and memories bring me back down.

His friends (both girls) are on my facebook because we get along well which is hard to find and I know they'll always have his back. But seeing him in a picture or talking to him at a convention having a conversation is tough. Crying confront of my current boyfriend is something I want to avoid.

So I've been thinking that I have to cut contact with these girls because it's time to move on because it's healthier. It's best to avoid him as well. But will they be insulted? Will they accept it? Is there going to be awkwardness? I'm not sure if it's the right to do and could use some outside opinions.

TL;DR:

# completion
I still have contact with an old ex's friends but can't stand to see or talk to him. His friends are really nice ,so how do I tell them I possibly want to unfriend them on Facebook because of him?

The code is as follows:

# train_grpo.py
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

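The reward can be adjusted to match the desired compression length. This function returns the negative of the absolute difference between the output length and 20, so the best case is an output of exactly 20 characters, which gives a reward of 0. The reward lies in (-inf, 0], so it has no lower bound.

The launch script is as follows: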
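To make the target length easy to change, the reward can be produced by a small factory function. This is a minimal sketch, not part of the tutorial: `make_length_reward` and `target_len` are names introduced here for illustration, and it assumes completions arrive as plain strings, as they do for a prompt/completion dataset like this one.

```python
def make_length_reward(target_len):
    """Build a reward function that penalizes distance from target_len characters.

    make_length_reward and target_len are illustrative names, not part of TRL.
    """
    def reward_len(completions, **kwargs):
        # len() on a str counts characters, not tokens
        return [-abs(target_len - len(completion)) for completion in completions]
    return reward_len


reward_20 = make_length_reward(20)
rewards = reward_20(["a" * 20, "a" * 5, "a" * 50])
print(rewards)  # [0, -15, -30]
```

Under these assumptions, passing `make_length_reward(40)` as `reward_funcs` would correspond to a target length of 40.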

export WANDB_MODE=offline  # Log Weights & Biases runs locally without syncing
export CUDA_VISIBLE_DEVICES=6  # Select which GPU(s) to use
export WANDB_PROJECT=trl-grpo-length-reward  # Set the Weights & Biases project name
export NCCL_P2P_DISABLE=1  # Disable NCCL peer-to-peer transport
export NCCL_IB_DISABLE=1  # Disable NCCL InfiniBand transport

python train_grpo.py

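I trained on a single RTX 4090. The two NCCL settings (NCCL_P2P_DISABLE and NCCL_IB_DISABLE, lines 4-5 of the script) are extra settings needed because the 4090 does not support certain behaviors. Set WANDB_MODE according to whether the server has internet access.

In practice, when I changed the target compression length, the results were quite poor. The experiments follow.

Experiments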

target length=20

image-20250730161539488

image-20250730161627229

The plots show that the mean completion length stays close to 20, slightly above it.

target length=40

image-20250730162858656

image-20250730162948445

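The mean length ends up at around 10. The explanation is that the plotted completion length is measured in tokens, whereas the reward measures the untokenized (character) length.

In addition, the prompts in this dataset are somewhat problematic.

Below is the output of one generation. Overall it contains too much noisy content, so I did not adjust things any further.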
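The gap between the two measurements can be illustrated with a small sketch. The whitespace split below is only a stand-in for a real tokenizer, an assumption made so that the example runs without downloading a model, and `char_length`, `token_length`, and `toy_tokenize` are hypothetical helper names. With the actual model tokenizer the counts would differ, but the point is the same: the character count seen by the reward is several times larger than the token count.

```python
def char_length(text):
    # What the reward function measures: number of characters
    return len(text)


def token_length(text, tokenize):
    # What a token-based metric measures: number of tokens
    return len(tokenize(text))


def toy_tokenize(text):
    # Stand-in tokenizer (assumption): split on whitespace
    return text.split()


summary = "My ex's friends are nice but remind me of him."
n_chars = char_length(summary)
n_tokens = token_length(summary, toy_tokenize)
print(n_chars, n_tokens)
```

To make the reward agree with a token-based length plot, one option would be to count tokens in the reward as well, for example by passing a real tokenizer's encoding function as `tokenize`.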

Ok, so I appreciate you Persona has everyone on the planet! I appreciate you trying to empathise and understand this situation, but it's just too damn frustrating!! Please help me work through this??

LINK: https://subreddits.stackoverflow.com/subredirects


0
1


Hey M31 with ninfolake. At 28, mate, the recent crisis can no longer be withstood.
0
1

^_^^
Sometime away.

If you're committed to turning in your original_username and the arguably useful @M31withNinfeyes, please do so! Thanks!

PS: Let me know if there's a way for me to communicate my frustration through redirect or subs sub.

Edit: As requested! Thanks for spreading the love. This is the second part of a series where some people on the AoPS just took on some of my issues and mediation with them. This one is a mix tutorial redux and what might happen when you split up to group a team in Minecraft vs Minecraft. I'm happy to learn more about the Minecraft dungeon wording.
0
1

^_^^
What your using them struggle on: the minecraft