Suppose we have a set of objects, where each object $i$ has a positive strength parameter $\gamma_i > 0$. For two objects $i$ and $j$, the Bradley–Terry model defines the probability that $i$ beats $j$ in a comparison (written $i \succ j$, not $i > j$) as:

$$P(i \succ j) = \frac{\gamma_i}{\gamma_i + \gamma_j}$$

For example, if Alice has strength $\gamma_{\text{Alice}}$ and Bob has strength $\gamma_{\text{Bob}}$, then the probability that Alice beats Bob is:

$$P(\text{Alice} \succ \text{Bob}) = \frac{\gamma_{\text{Alice}}}{\gamma_{\text{Alice}} + \gamma_{\text{Bob}}}$$
More generally, suppose there is a scoring function $s(\cdot)$ that assigns a strength to each object; then the preference probability between two objects $i$ and $j$ is:

$$P(i \succ j) = \frac{s(i)}{s(i) + s(j)}$$
In ML, for convenience of modeling, we usually set $s(i) = e^{r(i)}$, where $r(i)$ is a score produced by a model. The preference probability then becomes:

$$P(i \succ j) = \frac{e^{r(i)}}{e^{r(i)} + e^{r(j)}} = \sigma\big(r(i) - r(j)\big)$$

The difference from the original formula is the base-$e$ exponentiation: if $r$ is computed directly by a linear model or an LLM, $r(i)$ may be negative and cannot be used as a strength value directly, so we take the exponential, $e^{r(i)} > 0$. In addition, the gradients of the sigmoid and softmax are very stable in deep learning, which helps convergence.
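A tiny numerical sketch of this equivalence (the score values are made up for illustration):

```python
import torch

# Made-up scores r(i), r(j) from some model; they may be negative
r_i, r_j = torch.tensor(1.2), torch.tensor(-0.3)

# Bradley-Terry with exponentiated scores ...
p_explicit = torch.exp(r_i) / (torch.exp(r_i) + torch.exp(r_j))

# ... is exactly the sigmoid of the score difference
p_sigmoid = torch.sigmoid(r_i - r_j)

print(p_explicit.item(), p_sigmoid.item())  # both ≈ 0.8176
```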
How DPO works
The original goal in designing DPO
DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
In the second phase the SFT model is prompted with prompts x to produce pairs of answers (y1, y2) ∼ π(y|x). These are then presented to human labelers who express preferences for one answer, denoted as yw ≻ yl | x where yw and yl denotes the preferred and dispreferred completion amongst (y1, y2) respectively.
In the context of LMs, the network rφ(x, y) is often initialized from the SFT model π(y|x) with the addition of a linear layer on top of the final transformer layer that produces a single scalar prediction for the reward value.
Motivated by the challenges of applying reinforcement learning algorithms on large-scale problems such as fine-tuning language models, our goal is to derive a simple approach for policy optimization using preferences directly.
Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach leverages a particular choice of reward model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop.
DPO's approach builds on **the analytical mapping from the reward model to the optimal policy, which turns the reward-based loss function into a policy-based one**. Keep this mapping in mind.
our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies.
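Concretely, this mapping is the closed-form solution of the KL-constrained reward-maximization problem and its inversion (Eq. 4 and Eq. 5 in the DPO paper), where $Z(x)$ is the partition function:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\exp\!\left(\frac{1}{\beta}\, r(x, y)\right), \qquad r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$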
Fortunately, the Bradley-Terry model depends only on the difference of rewards between two completions. Substituting the reparameterization in Eq. 5 for r∗(x, y) into the preference model Eq. 1, the partition function cancels, and we can express the human preference probability in terms of only the optimal policy and reference policy.
Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy πθ.
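The resulting objective is the DPO loss (Eq. 7 in the paper):

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$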
Intuitively, the gradient of the loss function LDPO increases the likelihood of the preferred completions yw and decreases the likelihood of dispreferred completions yl.
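Writing the implicit reward as $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, the gradient (Eq. 8 in the paper) is:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\,\big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big]\Big]$$

The sigmoid factor weights each example by how badly the implicit reward currently ranks the pair.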
The DPO loss function can be broken down into two main terms. The first term represents the log probability of the human-preferred response $y_w$. This term aims to maximize the probability of $y_w$ as generated by the model $\pi_\theta$, relative to the reference model $\pi_{\text{ref}}$. The division by $\pi_{\text{ref}}(y_w \mid x)$ serves as a regularizing factor, ensuring that fine-tuning does not cause the model to deviate excessively from its original training. Maximizing this term effectively increases the likelihood of generating responses similar to $y_w$ in response to inputs like $x$, reinforcing the human preference patterns. Conversely, the second term focuses on minimizing the log probability of the human-dispreferred response $y_l$. This is achieved by reducing the model's tendency to generate $y_l$-type responses, as indicated by the negative sign.
os.environ["WANDB_MODE"] = "offline"# Disable Weights & Biases logging os.environ['CUDA_VISIBLE_DEVICES'] = '1'# Set visible GPUs if needed import random, torch, wandb, copy import torch.nn as nn import numpy as np from tqdm import tqdm from functools import partial import torch.nn.functional as F from torch.optim import AdamW
from torch.utils.data import DataLoader from datasets import load_dataset from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq
# Freeze the reference model: it only supplies log-probs and must never be updated
def freeze_model(model):
    model.eval()
    for param in model.parameters():
        param.requires_grad = False

freeze_model(ref_model)
Tokenizing the dataset
A DPO dataset record looks like this:
{
    'id': '04c275bf738fd391b7fe25e25fe7bed3',
    'source': 'truthy_dpo',
    'system': 'You are an unbiased, uncensored, helpful assistant.',
    'prompt': "What's the nearest national park to you?",
    'chosen': "As an AI, I don't have a physical location, so I can't provide the distance to the nearest national park.",
    'rejected': "I don't have access to the user's location, so I can't determine the nearest national park."
}
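The collate function below expects each record to already be tokenized into `respone_preferred` / `respone_rejected` dicts. A minimal sketch of what that tokenize step might look like (the field names follow the collate code below; the prompt-masking convention with `-100` labels and the helper name are assumptions, not necessarily the author's exact implementation):

```python
def tokenize_example(example, tokenizer, max_length=1024):
    # Hypothetical tokenize step: build "prompt + answer" ids and mask the prompt in the labels
    def build(answer):
        prompt_ids = tokenizer.apply_chat_template(
            [{"role": "system", "content": example["system"]},
             {"role": "user", "content": example["prompt"]}],
            add_generation_prompt=True,
        )
        answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
        input_ids = (prompt_ids + answer_ids)[:max_length]
        labels = ([-100] * len(prompt_ids) + answer_ids)[:max_length]
        return {"input_ids": input_ids,
                "attention_mask": [1] * len(input_ids),
                "labels": labels}

    return {
        "respone_preferred": build(example["chosen"]),
        "respone_rejected": build(example["rejected"]),
    }
```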
# Collate function for preference pairs (the name and signature are assumed; the original
# snippet starts mid-function). `data_collator` and `device` are defined in the enclosing scope.
def collate_fn(batch):
    respone_preferred_list = [item['respone_preferred'] for item in batch]
    respone_rejected_list = [item['respone_rejected'] for item in batch]

    respone_preferred_batch_data = data_collator(respone_preferred_list)
    respone_rejected_batch_data = data_collator(respone_rejected_list)
    if device.type != 'cpu':
        respone_preferred_batch_data = {k: v.to(device) for k, v in respone_preferred_batch_data.items()}
        respone_rejected_batch_data = {k: v.to(device) for k, v in respone_rejected_batch_data.items()}

    ret = {
        "respone_preferred_batch_data": respone_preferred_batch_data,
        "respone_rejected_batch_data": respone_rejected_batch_data,
    }
    return ret
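The `train_dataloader` used in the training loop could then be constructed along these lines (a sketch; the padding collator choice and batch size are assumptions):

```python
data_collator = DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100)

train_dataloader = DataLoader(
    tokenized_dataset,   # the dataset after the tokenize step above (name assumed)
    batch_size=4,
    shuffle=True,
    collate_fn=collate_fn,
)
```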
epochs = 3
lr = 1e-6
optimizer = AdamW(model.parameters(), lr=lr)

model.train()
ref_model.eval()

for _ in range(epochs):
    for batch in tqdm(train_dataloader):
        optimizer.zero_grad()
        respone_preferred_batch_data = batch['respone_preferred_batch_data']
        respone_preferred_batch_data = {k: v.to(DEVICE) for k, v in respone_preferred_batch_data.items()}
        respone_rejected_batch_data = batch['respone_rejected_batch_data']
        respone_rejected_batch_data = {k: v.to(DEVICE) for k, v in respone_rejected_batch_data.items()}
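        # --- Sketch of the rest of the training step (not the author's original code) ---
        # Assumptions: each batch dict holds input_ids / attention_mask / labels, with the
        # prompt tokens in `labels` masked to -100; `beta` is illustrative. In real code,
        # hoist `beta` and the helper below outside the loop.
        beta = 0.1

        def sequence_logprob(m, batch_data):
            # Sum of log-probs of the response tokens (positions where labels != -100)
            logits = m(input_ids=batch_data["input_ids"],
                       attention_mask=batch_data["attention_mask"]).logits
            labels = batch_data["labels"][:, 1:].clone()
            logits = logits[:, :-1, :]
            mask = labels != -100
            labels[~mask] = 0
            per_token = torch.gather(F.log_softmax(logits, dim=-1), 2,
                                     labels.unsqueeze(-1)).squeeze(-1)
            return (per_token * mask).sum(-1)

        policy_chosen_logp = sequence_logprob(model, respone_preferred_batch_data)
        policy_rejected_logp = sequence_logprob(model, respone_rejected_batch_data)
        with torch.no_grad():
            ref_chosen_logp = sequence_logprob(ref_model, respone_preferred_batch_data)
            ref_rejected_logp = sequence_logprob(ref_model, respone_rejected_batch_data)

        # DPO loss: -log sigmoid(beta * [(policy - ref) margin of chosen over rejected])
        margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
        loss = -F.logsigmoid(beta * margin).mean()
        loss.backward()
        optimizer.step()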
{ "chosen":[ { "content":"Use the pygame library to write a version of the classic game Snake, with a unique twist", "role":"user" }, { "content":"Sure, I'd be happy to help you write a version of the classic game Snake using the pygame library! Here's a basic outline of how we can approach this:\n\n1. First, we'll need to set up the game display and create a game object that we can use to handle the game's state.\n2. Next, we'll create the game's grid, which will be used to represent the game board. We'll need to define the size of the grid and the spaces within it.\n3. After that, we'll create the snake object, which will be used to represent the player's movement. We'll need to define the size of the snake and the speed at which it moves.\n4. We'll also need to create a food object, which will be used to represent the food that the player must collect to score points. We'll need to define the location of the food and the speed at which it moves.\n5. Once we have these objects set up, we can start handling the game's logic. We'll need to define the rules for the player's movements, such as how the player can move the snake and how the snake will grow as the player collects more food.\n6. We'll also need to add collisions detection to the game, so that the snake and the food can collide with each other.\n7. Finally, we'll need to add a user interface to the game, such as a menu and a scoreboard.\n\nNow, as for the unique twist, we could add a few features to make the game more interesting. For example, we could add power-ups that give the player special abilities, such as the ability to grow the snake faster or to change its direction. We could also add obstacles, such as walls or pits, that the player must avoid.\n\nHere's some sample code to get us started:\n```\nimport pygame\n\n# Initialize pygame\npygame.init()\n\n# Set up the game display\nwidth = 800\nheight = 600\nscreen = pygame.display.set_mode((width, height))\n\n# Define the colors\nWHITE = (255, 255, 255)\nBLACK = (0, 0, 0)\nGREEN = (0, 255, 0)\n\n# Define the game objects\nsnake = pygame.Rect(50, 50, 10, 10)\nfood = pygame.Rect(100, 100, 10, 10)\n\n# Define the game logic\ndef update_snake():\n # Get the mouse position\n mouse_pos = pygame.mouse.get_pos()\n\n # Move the snake based on the mouse position\n if mouse_pos[0] > snake.x:\n snake.x += 10\n elif mouse_pos[0] < snake.x:\n snake.x -= 10\n elif mouse_pos[1] > snake.y:\n snake.y += 10\n elif mouse_pos[1] < snake.y:\n snake.y -= 10\n\n # Update the snake's size\n if snake.x == food.x and snake.y == food.y:\n snake.width += 10\n snake.height += 10\n\n# Define the game loop\ndef game_loop():\n # Clear the screen\n screen.fill(BLACK)\n\n # Update the game objects\n update_snake()\n\n # Draw the game objects\n screen.fill(WHITE)\n screen.draw.rect(snake, GREEN)\n screen.draw.rect(food, BLACK)\n\n # Update the display\n pygame.display.flip()\n\n# Start the game loop\ngame_loop()\n\n# Run the game\nwhile True:\n for event in pygame.event.get():\n if event.type == pygame.QUIT:\n pygame.quit()\n break\n\n pygame.time.Clock().tick(60)\n```\nThis code sets up a basic game display, defines the snake and food objects, and updates the game state based on the player's movements. 
We've also added a simple user interface and some basic collision detection.\n\nAs for the unique twist, we could add a few features to make the game", "role":"assistant" } ], "rejected":[ { "content":"Use the pygame library to write a version of the classic game Snake, with a unique twist", "role":"user" }, { "content":"Sure, here's an example of how to write a version of Snake game with a unique twist using the Pygame library:\n```python\nimport pygame\n\nclass SnakeGame:\n def __init__(self, game_width, game_height):\n pygame.init()\n screen = pygame.display.set_mode((game_width, game_height))\n pygame.display.set_caption(\"Snake Game\")\n self.speed = 5 # Speed of the snake\n self.food_speed = 1 # Speed of the food\n self.direction = 0 # Initial direction of the snake\n self.snakelen = 0 # Length of the snake\n self.food = pygame.image.load(\"snake_food.png\")\n self.head = pygame.image.load(\"snake_head.png\")\n self.tail = pygame.image.load(\"snake_tail.png\")\n self.game Quint()\n def Quint(self):\n for i in range(50):\n pygame.draw.line(screen, (180, 100, 220), (0, 0, 200), 2)\n pygame.draw.line(screen, (180, 100, 220), (50, 0, 200), 2)\n pygame.draw.line(screen, (180, 100, 220), (50, 50, 200), 2)\n pygame.draw.line(screen, (180, 100, 220), (50, 50, 300), 2)\n pygame.draw.line(screen, (180, 100, 220), (50, 80, 200), 2)\n pygame.draw.line(screen, (180, 100, 220), (50, 80, 300), 2)\n pygame.draw.line(screen, (180, 100, 220), (0, 50, 200), 2)\n pygame.draw.line(screen, (180, 100, 220), (0, 50, 300), 2)\n pygame.draw.line(screen, (180, 100, 220), (0, 80, 200), 2)\n pygame.draw.line(screen, (180, 100, 220), (0, 80, 300), 2)\n pygame.display.flip()\n self.game.run()\n def run(self):\n while True:\n for event in pygame.event. pygame.KEYDOWN:\n if event.key == pygame.K_LEFT:\n self.direction = -1\n if event.key == pygame.K_RIGHT:\n self.direction = 1\n self.snakelen += 1\n if self.snakelen == 0:\n self.snakelen = 10\n if self.snakelen > 20:\n self.snakelen = 20\n self.gameQuint()\n self.foodCrossing()\n self.headRun()\n pygame.display.update()\ngame = SnakeGame(800, 600)\ngame.run()\n```\nIn this game, the snake moves with a constant speed, but the direction of the snake can be controlled by the user using the left and right arrow keys. The snake grows in length every 10 segments, and when it reaches a certain length, it resets to 10 segments. The food moves fast and randomly crosses the screen, and the snake can eat it by colliding with it. The snake's head and tail move independently of each other. The game ends when the snake dies or reaches the end of the screen.", "role":"assistant" } ], "score_chosen":6.0, "score_rejected":4.0 }
# Apply the chat template to the whole conversation
if "messages" in example:
    messages = tokenizer.apply_chat_template(example["messages"], tools=tools, tokenize=False)

# Apply the chat template to the prompt, adding the generation prompt
if "prompt" in example:
    last_role = example["prompt"][-1]["role"]
    if last_role == "user":
        add_generation_prompt = True
        continue_final_message = False
    elif last_role == "assistant":
        add_generation_prompt = False
        continue_final_message = True
    else:
        raise ValueError(f"Invalid role in the last message: {last_role}")
    prompt = tokenizer.apply_chat_template(
        example["prompt"],
        tools=tools,
        continue_final_message=continue_final_message,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
    )

if "prompt" in example:  # explicit prompt and prompt-completion case
    if "chosen" in example:
        prompt_chosen = tokenizer.apply_chat_template(
            example["prompt"] + example["chosen"], tools=tools, tokenize=False
        )
        # DeepSeek-R1 inserts a <think> token when using `add_generation_prompt`, which can cause discrepancies
        # between the prompt alone and the combined prompt+completion. To ensure consistency, we extract the
        # common prefix between the two. In most cases, this is a no-op.
        prompt = "".join(x for x, _ in takewhile(lambda x: x[0] == x[1], zip(prompt, prompt_chosen)))
        chosen = prompt_chosen[len(prompt) :]
    if "rejected" in example and "prompt" in example:  # explicit prompt
        prompt_rejected = tokenizer.apply_chat_template(
            example["prompt"] + example["rejected"], tools=tools, tokenize=False
        )
        # Handle DeepSeek-R1 <think> token, see the above comment for details
        prompt = "".join(x for x, _ in takewhile(lambda x: x[0] == x[1], zip(prompt, prompt_rejected)))
        rejected = prompt_rejected[len(prompt) :]
    if "completion" in example:
        prompt_completion = tokenizer.apply_chat_template(
            example["prompt"] + example["completion"], tools=tools, tokenize=False
        )
        # Handle DeepSeek-R1 <think> token, see the above comment for details
        prompt = "".join(x for x, _ in takewhile(lambda x: x[0] == x[1], zip(prompt, prompt_completion)))
        completion = prompt_completion[len(prompt) :]
else:  # implicit prompt case
    if "chosen" in example:
        chosen = tokenizer.apply_chat_template(example["chosen"], tools=tools, tokenize=False)
    if "rejected" in example:
        rejected = tokenizer.apply_chat_template(example["rejected"], tools=tools, tokenize=False)

# Extract the completion by removing the prompt part from the prompt-completion string
output = {}
if "messages" in example:
    output["text"] = messages
if "prompt" in example:
    output["prompt"] = prompt
if "chosen" in example:
    output["chosen"] = chosen
if "rejected" in example:
    output["rejected"] = rejected
if "completion" in example:
    output["completion"] = completion
if "label" in example:
    output["label"] = example["label"]

return output
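For context, this excerpt comes from TRL's chat-template handling in `trl.data_utils` (`apply_chat_template` / `maybe_apply_chat_template`). A hedged usage sketch on a conversational preference record (the model name and example content are illustrative, and the import path may differ slightly across TRL versions):

```python
from transformers import AutoTokenizer
from trl.data_utils import maybe_apply_chat_template  # path may vary by TRL version

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative model

example = {
    "chosen":   [{"role": "user", "content": "Explain DPO in one sentence."},
                 {"role": "assistant", "content": "DPO fine-tunes a policy directly on preference pairs."}],
    "rejected": [{"role": "user", "content": "Explain DPO in one sentence."},
                 {"role": "assistant", "content": "DPO is nice."}],
}

out = maybe_apply_chat_template(example, tokenizer)
# out["chosen"] / out["rejected"] are now plain strings with the chat template applied,
# ready for the usual tokenization into ids.
```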
The last function then just performs the usual tokenization into ids.
In addition, TRL provides a dedicated data collator, DataCollatorForPreference, which pads the prompt, chosen, and rejected sequences separately.
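Putting it together, TRL wraps the template application, tokenization, and DataCollatorForPreference inside DPOTrainer. A minimal hedged sketch (the model and dataset names and hyperparameters are illustrative, and argument names such as `processing_class` vary across TRL versions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset in conversational format (illustrative choice)
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(output_dir="qwen-dpo", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions take `tokenizer=` instead
)
trainer.train()
```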