LoRA - Low-Rank Adaptation

Peng Xia

An important paradigm in natural language processing is large-scale pre-training on general-domain data followed by adaptation to a specific task or domain. As pre-trained models keep growing, full fine-tuning (retraining all model parameters) becomes less and less feasible. Take GPT-3 175B as an example: deploying an independently fine-tuned instance of 175 billion parameters for every downstream task is prohibitively expensive.

To address this, the paper proposes Low-Rank Adaptation (LoRA), which freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of parameters that need to be trained for downstream tasks.


Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as in the original model.

Many sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters in addition to the pre-trained model for each task, greatly boosting the operational efficiency when deployed. However, existing techniques often introduce inference latency by extending model depth or reduce the model’s usable sequence length.

We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach.

LoRA is not specific to large language models: it acts on linear layers, so in principle it can be used wherever a linear layer exists.

Advantages of LoRA:

  • High adaptability to downstream tasks: a LoRA module is tiny, so switching to another task is as simple as swapping in a different LoRA block, and keeping many task-specific modules puts little pressure on storage.
  • More efficient training: most parameters need neither gradients nor optimizer state; only the small LoRA part is updated.
  • After training, the LoRA weights can be merged back into the original model, so there is no inference latency. By contrast, a conventional adapter must be executed sequentially, while the Transformer itself is highly parallel, so adapters slow down inference.

Existing approaches

Inserting adapter layers

Main problems:

  1. Added inference latency
    • Even though adapter layers have few parameters (<1% of the original model), they still require extra computation;
    • Large models rely heavily on hardware parallelism to keep latency low, while adapter layers are processed sequentially, so online inference with small batch sizes (e.g., GPT-2 on a single GPU) sees a noticeable latency increase.
  2. Hard to skip
    • There is no easy way to bypass adapter layers through pruning or skipping.
  3. High communication cost under model sharding
    • If the model is sharded across GPUs, the extra depth of the adapter layers requires more communication operations (e.g., AllReduce, Broadcast), unless the adapter parameters are stored redundantly.

Directly optimizing the prompt (e.g., prefix tuning)

Main problems:

  1. Hard to optimize
    • Prefix parameters are difficult to train, and performance changes non-monotonically with the number of trainable parameters.
  2. Consumes input sequence length
    • The prefix occupies part of the input sequence, shrinking the window available for the actual task, which can hurt performance.

Method | Main problems
Adapter layers | Added inference latency; sequential computation limits parallelism; high communication cost when the model is sharded
Prompt optimization (e.g., prefix tuning) | Parameters are hard to optimize; occupied sequence length hurts task performance

This is why a new, more efficient parameter-efficient fine-tuning method is needed.

Problem statement

In full fine-tuning, the model is initialized to the pre-trained weights $\Phi_0$ and updated to $\Phi_0 + \Delta\Phi$ by repeatedly following the gradient to maximize the conditional language modeling objective:

$$\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log \big( P_{\Phi}(y_t \mid x, y_{<t}) \big)$$

One of the main drawbacks of full fine-tuning is that for each downstream task we learn a different set of parameters $\Delta\Phi$ whose dimension $|\Delta\Phi|$ equals $|\Phi_0|$. So if the pre-trained model is very large (e.g., GPT-3 with $|\Phi_0| \approx 175$ billion), storing and deploying many independent fine-tuned instances becomes difficult, if not infeasible.

In LoRA we adopt a more parameter-efficient approach: the task-specific parameter increment is expressed as $\Delta\Phi = \Delta\Phi(\Theta)$, where $\Theta$ is a much smaller set of parameters with $|\Theta| \ll |\Phi_0|$. Finding $\Delta\Phi$ then becomes an optimization over $\Theta$:

$$\max_{\Theta} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log \big( p_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t}) \big)$$

LoRA encodes $\Delta\Phi$ with a low-rank representation that is both compute- and memory-efficient. For GPT-3 175B, the number of trainable parameters $|\Theta|$ can be as small as 0.01% of $|\Phi_0|$.

Method

A neural network contains many dense layers that perform matrix multiplication, and the weight matrices in these layers typically have full rank. Aghajanyan et al. (2020) show that pre-trained language models can still learn efficiently after being randomly projected into a much smaller subspace, i.e., they have a low "intrinsic dimension". Inspired by this, we hypothesize that the weight updates during adaptation also have a low "intrinsic rank".

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update to a low-rank decomposition:

$$W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. This limits the expressiveness of $\Delta W$, but it is sufficient for fine-tuning.

During training, $W_0$ is frozen and receives no gradient updates; only $A$ and $B$ are trainable. Note that both $W_0$ and $\Delta W = BA$ are multiplied by the same input $x$, and their outputs are summed element-wise. The modified forward pass is:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

For example, if the original dense layer is 768×768 and its update is replaced by $B$ and $A$ of shapes 768×8 and 8×768, the number of trainable parameters drops from 768×768 = 589,824 to 768×8 + 8×768 = 12,288, about 2% of the original.

We initialize $A$ with random Gaussian values and $B$ with zeros, so $\Delta W = BA = 0$ at the start of training. We then scale $\Delta W x$ by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly equivalent to tuning the learning rate if the initialization is scaled appropriately, so we simply set $\alpha$ to the first $r$ we try and do not tune it further. The actual forward pass is therefore:

$$h = W_0 x + \frac{\alpha}{r} B A x$$
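As a sanity check, here is a minimal PyTorch sketch of this forward pass (the tensor names and sizes are illustrative, not taken from any library):

import torch

d, k, r, alpha = 768, 768, 8, 16
W0 = torch.randn(d, k)          # frozen pre-trained weight
A = torch.randn(r, k) * 0.01    # A: random Gaussian init
B = torch.zeros(d, r)           # B: zero init, so BA = 0 at the start

x = torch.randn(k)
h = W0 @ x + (alpha / r) * (B @ (A @ x))   # h = W0 x + (alpha/r) B A x

# At initialization the LoRA branch contributes nothing, so h equals the base output
assert torch.allclose(h, W0 @ x)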

A more general form of fine-tuning is to train only a subset of the pre-trained parameters. LoRA goes a step further: it does not require the accumulated gradient update to have full rank during adaptation. This means that, if LoRA is applied to all weight matrices and all biases are trained, setting the LoRA rank $r$ to the rank of the original weight matrices roughly recovers the expressiveness of full fine-tuning. In other words, as the number of trainable parameters grows, LoRA training converges to fine-tuning the original model, whereas adapter-based methods converge to an MLP and prefix-based methods to a model that cannot handle long input sequences.

For deployment, we can explicitly compute and store the merged weight $W = W_0 + BA$ and run inference as with an ordinary model. Note that both $W_0$ and $BA$ live in $\mathbb{R}^{d \times k}$. When switching to another downstream task, we simply subtract the current $BA$ from $W$ and add the new $B'A'$, a fast operation with very little memory overhead. Crucially, this adaptation introduces no additional inference latency.
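The merge and task-switch arithmetic can be sketched directly on tensors (purely illustrative; with PEFT this corresponds to the merge_and_unload() / unmerge() calls shown later):

import torch

d, k, r, alpha = 768, 768, 8, 16
scaling = alpha / r
W0 = torch.randn(d, k)
B1, A1 = torch.randn(d, r), torch.randn(r, k)   # adapter trained for task 1
B2, A2 = torch.randn(d, r), torch.randn(r, k)   # adapter trained for task 2

W = W0 + scaling * (B1 @ A1)                         # merged weight for task 1, no extra latency
W = W - scaling * (B1 @ A1) + scaling * (B2 @ A2)    # switch to task 2 in place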

Source code

https://github.com/microsoft/LoRA.git is the official LoRA source code.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Base class holding the LoRA-related hyperparameters
class LoRALayer():
    def __init__(
        self,
        r: int,
        lora_alpha: int,
        lora_dropout: float,
        merge_weights: bool,
    ):
        self.r = r
        self.lora_alpha = lora_alpha
        # Optional dropout
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x
        # Mark the weight as unmerged
        self.merged = False
        self.merge_weights = merge_weights

# Inherits from nn.Linear
class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)

        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters: two extra matrices B and A created at init
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    # Initialization of the LoRA matrices
    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize B the same way as the default for nn.Linear and A to zero
            # this is different than what is described in the paper but should not affect performance
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    # train(True) keeps the weights unmerged for fine-tuning; train(False) merges them for evaluation.
    # Only the state flag `merged` and the weight tensor are adjusted here.
    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = True

    # When not merged, add the LoRA branch on top of the linear output;
    # when merged, the linear weight already contains the LoRA update
    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.r > 0 and not self.merged:
            result = F.linear(x, T(self.weight), bias=self.bias)
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, T(self.weight), bias=self.bias)

This is the author's source code; it is quite straightforward.
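For reference, a minimal usage sketch of the repository's loralib package (assuming it is installed as loralib; see the repo README for the exact API):

import torch
import torch.nn as nn
import loralib as lora

# Swap ordinary dense layers for their LoRA counterparts (rank 16)
model = nn.Sequential(
    lora.Linear(768, 768, r=16),
    nn.ReLU(),
    lora.Linear(768, 10, r=16),
)

# Freeze everything except the LoRA parameters before fine-tuning
lora.mark_only_lora_as_trainable(model)

# After training, save only the small LoRA weights
torch.save(lora.lora_state_dict(model), "ckpt_lora.pt")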

Below is the corresponding implementation in PEFT:

class Linear(nn.Module, LoraLayer):

    def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
        self._check_forward_args(x, *args, **kwargs)
        adapter_names = kwargs.pop("adapter_names", None)

        if self.disable_adapters:
            if self.merged:
                self.unmerge()
            result = self.base_layer(x, *args, **kwargs)
        elif adapter_names is not None:
            result = self._mixed_batch_forward(x, *args, adapter_names=adapter_names, **kwargs)
        elif self.merged:
            result = self.base_layer(x, *args, **kwargs)
        else:
            result = self.base_layer(x, *args, **kwargs)
            torch_result_dtype = result.dtype

            lora_A_keys = self.lora_A.keys()
            for active_adapter in self.active_adapters:
                if active_adapter not in lora_A_keys:
                    continue

                lora_A = self.lora_A[active_adapter]
                lora_B = self.lora_B[active_adapter]
                dropout = self.lora_dropout[active_adapter]
                scaling = self.scaling[active_adapter]
                x = self._cast_input_dtype(x, lora_A.weight.dtype)

                if not self.use_dora[active_adapter]:
                    result = result + lora_B(lora_A(dropout(x))) * scaling
                else:
                    if isinstance(dropout, nn.Identity) or not self.training:
                        base_result = result
                    else:
                        x = dropout(x)
                        base_result = None

                    result = result + self.lora_magnitude_vector[active_adapter](
                        x,
                        lora_A=lora_A,
                        lora_B=lora_B,
                        scaling=scaling,
                        base_layer=self.get_base_layer(),
                        base_result=base_result,
                    )

            result = result.to(torch_result_dtype)

        return result

The line result = result + lora_B(lora_A(dropout(x))) * scaling handles the non-DoRA case, i.e., plain LoRA.

PEFT LoRA

Loading the pretrained model

import json
import transformers
from copy import deepcopy
from typing import Union
from dataclasses import dataclass, asdict
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)

from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
)
model_name_or_path = "../DC/qwen2.5-3b"

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    use_fast=False,
)
print(model)
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 2048)
    (layers): ModuleList(
      (0-35): 36 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=256, bias=True)
          (v_proj): Linear(in_features=2048, out_features=256, bias=True)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((2048,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)

Defining the LoRA model

  • task_type: PEFT adjusts its behavior to the task type; pass a member of peft.TaskType.
  • r: the LoRA attention dimension (i.e., the rank), the dimension of the low-rank adaptation matrices.
  • target_modules: names of the modules to apply LoRA to. A string is treated as a regular expression; a list is matched exactly or against module-name suffixes.
  • lora_dropout: dropout probability for the LoRA layers, to prevent overfitting.
  • modules_to_save: modules to train and save in addition to the LoRA adapter layers, e.g., the output head in a classification task.
  • lora_alpha: the scaling factor that modulates the LoRA update.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=['q_proj', 'v_proj'],
    r=16,
    lora_alpha=16
)
asdict(lora_config)
{'task_type': <TaskType.CAUSAL_LM: 'CAUSAL_LM'>,
'peft_type': <PeftType.LORA: 'LORA'>,
'auto_mapping': None,
'base_model_name_or_path': None,
'revision': None,
'inference_mode': False,
'r': 16,
'target_modules': {'q_proj', 'v_proj'},
'exclude_modules': None,
'lora_alpha': 16,
'lora_dropout': 0.0,
'fan_in_fan_out': False,
'bias': 'none',
'use_rslora': False,
'modules_to_save': None,
'init_lora_weights': True,
'layers_to_transform': None,
'layers_pattern': None,
'rank_pattern': {},
'alpha_pattern': {},
'megatron_config': None,
'megatron_core': 'megatron.core',
'trainable_token_indices': None,
'loftq_config': {},
'eva_config': None,
'corda_config': None,
'use_dora': False,
'layer_replication': None,
'runtime_config': {'ephemeral_gpu_offload': False},
'lora_bias': False}
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
print(peft_model)
trainable params: 3,686,400 || all params: 3,089,625,088 || trainable%: 0.1193
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 2048)
        (layers): ModuleList(
          (0-35): 36 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Linear(in_features=2048, out_features=256, bias=True)
              (v_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=256, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=256, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
            )
            (mlp): Qwen2MLP(
              (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
            (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
          )
        )
        (norm): Qwen2RMSNorm((2048,), eps=1e-06)
        (rotary_emb): Qwen2RotaryEmbedding()
      )
      (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
    )
  )
)

Both the Q and V projection matrices now carry the extra LoRA matrices.
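To confirm this, one can list the parameters that remain trainable after wrapping; only the lora_A / lora_B weights of q_proj and v_proj should require gradients (a quick sanity check, not part of the original notes):

for name, param in peft_model.named_parameters():
    if param.requires_grad:
        print(name, tuple(param.shape))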

Loading the data

template = {
    "description": "Legacy template, used by Original Alpaca repository.",
    "prompt_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:",
    "prompt_no_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:",
    "response_split": "### Response:"
}

def generate_prompt(
    instruction: str,
    input: Union[None, str] = None,
    label: Union[None, str] = None,
):
    if input:
        prompt = template["prompt_input"].format(
            instruction=instruction, input=input
        )
    else:
        prompt = template["prompt_no_input"].format(
            instruction=instruction
        )
    if label:
        target = f"{label}{tokenizer.eos_token}"
    else:
        target = ""
    return prompt, target

def preprocess_func(example):
    source, target = generate_prompt(
        instruction=example['instruction'],
        input=example['input'],
        label=example['output']
    )
    full_example = source + target
    # Tokenize prompt + response, copy the ids as labels,
    # then mask the prompt tokens with -100 so only the response contributes to the loss
    full_example_tokenized = tokenizer(full_example, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True)
    input_ids = full_example_tokenized['input_ids'][0]
    labels = deepcopy(input_ids)
    source_tokenized = tokenizer(source, return_tensors="pt", padding="longest", max_length=tokenizer.model_max_length, truncation=True)
    labels[:len(source_tokenized['input_ids'][0])] = -100
    return dict(
        input_ids=input_ids,
        labels=labels
    )
data = load_dataset("json", data_files='./alpaca_data_gpt4.json')["train"].select(range(2000))
ds = data.train_test_split(test_size=0.2, shuffle=True, seed=42)
train_ds = ds["train"].map(
    preprocess_func,
    remove_columns=ds['train'].column_names,
    batched=False,
    desc="Processing dataset"
)
val_ds = ds["test"].map(
    preprocess_func,
    remove_columns=ds['test'].column_names,
    batched=False,
    desc="Processing dataset"
)
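To verify the label masking, it can help to decode one processed example: the prompt tokens should carry the label -100 while the response tokens keep their ids (a quick check added here for illustration):

sample = train_ds[0]
num_masked = sum(1 for t in sample["labels"] if t == -100)
print("total tokens:", len(sample["input_ids"]), "| masked prompt tokens:", num_masked)
print(tokenizer.decode(sample["input_ids"]))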

Training

from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    return_tensors="pt",
    padding=True,
)

training_args = TrainingArguments(
    output_dir="./lora-alpaca",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=20,
    save_total_limit=1,
    load_best_model_at_end=True,
    report_to='none'
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
)

trainer.train()
trainer.evaluate()
{'eval_loss': 0.9849340915679932,
'eval_runtime': 63.2362,
'eval_samples_per_second': 6.325,
'eval_steps_per_second': 0.791,
'epoch': 2.0}

The contents of the output directory are as follows:

├── README.md
├── adapter_config.json
├── adapter_model.safetensors
├── added_tokens.json
├── checkpoint-100
│   ├── README.md
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   ├── added_tokens.json
│   ├── merges.txt
│   ├── optimizer.pt
│   ├── rng_state.pth
│   ├── scheduler.pt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── vocab.json
├── merges.txt
├── special_tokens_map.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.json

The top level holds the final result; the essential files are adapter_config.json and adapter_model.safetensors. Opening adapter_config.json shows the path of the pre-trained base model along with the LoRA configuration.
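For example, it can be inspected directly (the path below assumes the output_dir used above):

import json

with open("./lora-alpaca/adapter_config.json") as f:
    adapter_config = json.load(f)
print(adapter_config["base_model_name_or_path"], adapter_config["r"], adapter_config["target_modules"])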


The following code loads the PEFT adapter and merges it back into the base model:

from peft import PeftModel

lora_train_model = PeftModel.from_pretrained(model, model_id="./output_model/checkpoint")

merge_model = lora_train_model.merge_and_unload()
merge_model.save_pretrained("./output_model/merge_model")
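The merged checkpoint can then be loaded like an ordinary model, with no LoRA-specific code at inference time (a hedged sketch; the prompt and generation arguments are illustrative):

merged = AutoModelForCausalLM.from_pretrained("./output_model/merge_model")
prompt = template["prompt_no_input"].format(instruction="Give three tips for staying healthy.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = merged.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))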