Langchain 入门教程 - 4.模型

Peng Xia

2025-05-04 22:43:57 2025-05-04 22:43:57 Created 2025-05-04 22:46:55 2025-05-04 22:46:55 Updated

Langchain 入门教程

Langchain

363 Words 1 Mins

VLLM 部署本地大模型

单卡部署

export CUDA_VISIBLE_DEVICES=5

modelpath=../DataCollection/officials/Qwen2.5-7B-Instruct
modelname=Qwen2.5-7B-Instruct

nohup python -m vllm.entrypoints.openai.api_server \
    --model $modelpath \
    --served-model-name $modelname \
    --port 5551 \
    --gpu-memory-utilization 0.4 \
    --dtype=half \	// 不建议加，现在模型默认都是fp16或者bf16，half会强制转换为fp16，没必要
    > output.log 2>&1 &

多卡部署

export CUDA_VISIBLE_DEVICES=2,3

modelpath=../DataCollection/officials/Qwen2.5-7B-Instruct
modelname=Qwen2.5-7B-Instruct

nohup python -m vllm.entrypoints.openai.api_server \
    --model $modelpath \
    --served-model-name $modelname \
    --port 5551 \
    --gpu-memory-utilization 0.4 \
	--tensor_parallel_size 2 \	// !!!!!占卡数量，不可能是计数!!!!!
    > output.log 2>&1 &

基本模型选项

以下是 API 的基本选项：

model_name : str
该选项允许您选择适用的模型，也可以使用 model 作为别名。
temperature : float = 0.7
该选项用于设置采样温度（temperature）。取值范围为 0 到 2，较高的值（如 0.8）会使输出更加随机，而较低的值（如 0.2）会使输出更具集中性和确定性。
max_tokens : int | None = None
指定聊天补全（chat completion）中要生成的最大 token 数。该选项控制模型在一次调用中可以生成的文本长度。

from langchain_openai import ChatOpenAI

model = ChatOpenAI(
	base_url='http://localhost:5551/v1',
	api_key='EMPTY',
	model_name='Qwen2.5-7B-Instruct',
	temperature=0.2,
)

query = "Tell me one joke about Computer Science"

# Stream the response instead of invoking it directly
response = model.stream(query)

# Print the streamed response token by token
for token in response:
    print(token.content, end="", flush=True)

Sure! Here's a light-hearted joke about computer science:

Why did the computer go to the doctor?

Because it had a virus and needed to get "anti-virus"!

#Langchain

Comments

On this page

Langchain 入门教程 - 4.模型

VLLM 部署本地大模型
1. 基本模型选项