vLLM

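vLLM is a fast and easy-to-use library for LLM inference and serving, offering:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels

This notebook goes over how to use an LLM with langchain and vLLM.

To use it, you need the vllm python package installed.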

%pip install --upgrade --quiet vllm

from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-7b",
    trust_remote_code=True,  # required for models that ship custom code on the Hugging Face Hub
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm.invoke("What is the capital of France ?"))

INFO 08-06 11:37:33 llm_engine.py:70] Initializing an LLM engine with config: model='mosaicml/mpt-7b', tokenizer='mosaicml/mpt-7b', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 08-06 11:37:41 llm_engine.py:196] # GPU blocks: 861, # CPU blocks: 512
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]

What is the capital of France ? The capital of France is Paris.
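
The `# GPU blocks` and `# CPU blocks` figures in the log above describe the KV cache that PagedAttention manages in fixed-size blocks. As a rough sketch, assuming vLLM's default block size of 16 tokens (the log does not state the block size, so treat it as an assumption), the block counts translate into token capacity like this:

```python
# Block counts copied from the log output; the block size is an assumed default.
gpu_blocks = 861
cpu_blocks = 512
block_size_tokens = 16  # assumption: default number of tokens per KV-cache block

# Each block holds the keys and values for block_size_tokens tokens.
gpu_token_capacity = gpu_blocks * block_size_tokens
cpu_token_capacity = cpu_blocks * block_size_tokens

print(f"GPU KV cache capacity: {gpu_token_capacity} tokens")
print(f"CPU swap capacity: {cpu_token_capacity} tokens")
```

This capacity is shared by all requests being processed at once, which is why the number of GPU blocks limits how many sequences can be batched together.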

Integrate the model in an LLMChain

from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who was the US president in the year the first Pokemon game was released?"

print(llm_chain.invoke(question))

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.34s/it]

1. The first Pokemon game was released in 1996.
2. The president was Bill Clinton.
3. Clinton was president from 1993 to 2001.
4. The answer is Clinton.
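
To see the text the chain sends to the model, the template can be filled in by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate` (an assumption made for illustration; it calls neither LangChain nor vLLM):

```python
# Fill the same template by hand to inspect the final prompt string.
template = """Question: {question}

Answer: Let's think step by step."""

question = "Who was the US president in the year the first Pokemon game was released?"

# str.format stands in for PromptTemplate here (illustrative assumption).
filled_prompt = template.format(question=question)
print(filled_prompt)
```

The model continues from "Let's think step by step.", which is why the output above is a numbered chain of reasoning.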

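Distributed Inference

vLLM supports distributed tensor-parallel inference and serving.

To run multi-GPU inference with the LLM class, set the tensor_parallel_size argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs: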

from langchain_community.llms import VLLM

llm = VLLM(
    model="mosaicml/mpt-30b",
    tensor_parallel_size=4,
    trust_remote_code=True,  # required for models that ship custom code on the Hugging Face Hub
)

llm.invoke("What is the future of AI?")
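
When choosing tensor_parallel_size, note that tensor parallelism splits attention heads across GPUs, so the value generally has to divide the model's attention-head count evenly. The helper below is a hypothetical sketch (`pick_tensor_parallel_size` is not part of vLLM or LangChain), and the head count of 64 is an example value rather than one read from a model config:

```python
def pick_tensor_parallel_size(num_gpus: int, num_attention_heads: int) -> int:
    """Return the largest GPU count <= num_gpus that evenly divides the head count."""
    if num_gpus < 1 or num_attention_heads < 1:
        raise ValueError("num_gpus and num_attention_heads must be positive")
    # Search downward; a size of 1 always divides, so the loop always returns.
    for size in range(min(num_gpus, num_attention_heads), 0, -1):
        if num_attention_heads % size == 0:
            return size


# Example values only: 64 heads is an assumed head count.
print(pick_tensor_parallel_size(num_gpus=4, num_attention_heads=64))
print(pick_tensor_parallel_size(num_gpus=6, num_attention_heads=64))
```

With 6 GPUs and 64 heads the helper falls back to 4, since neither 6 nor 5 divides 64.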

Quantization

vLLM supports awq quantization. To enable it, pass quantization to vllm_kwargs.

llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},
)
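
To get a feel for why quantization helps, here is a back-of-the-envelope estimate of weight memory. It assumes a nominal 7 billion parameters and 4-bit AWQ weights, and it ignores quantization scales, the KV cache, and activations, so real usage will be higher:

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


params_7b = 7e9  # nominal parameter count for a "7B" model (assumption)

fp16_gib = weight_memory_gib(params_7b, 16)  # unquantized 16-bit weights
awq_gib = weight_memory_gib(params_7b, 4)    # assumed 4-bit AWQ weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {awq_gib:.1f} GiB")
```

The 4x reduction in weight memory leaves more GPU memory for KV-cache blocks, or lets the model fit on a smaller GPU.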

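OpenAI-Compatible Server

vLLM can be deployed as a server that mimics the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API.

The server can be queried in the same format as the OpenAI API.

OpenAI-Compatible Completion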

from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="tiiuae/falcon-7b",
    model_kwargs={"stop": ["."]},
)
print(llm.invoke("Rome is"))

 a city that is filled with history, ancient buildings, and art around every corner
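
Because the server speaks the OpenAI protocol, it can also be queried without LangChain. The sketch below only builds a completions request with the standard library and reuses the base URL and model name from the example above. `build_completion_request` is a hypothetical helper, and nothing is sent until a server is running at that address:

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32, stop=None):
    """Build (but do not send) an OpenAI-style POST request to /completions."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    if stop is not None:
        payload["stop"] = stop
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",  # placeholder key, as in the example above
        },
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000/v1", "tiiuae/falcon-7b", "Rome is", stop=["."]
)
print(request.full_url)
print(request.data.decode("utf-8"))

# Sending the request needs a running server (not started here):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```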

LoRA adapter

LoRA adapters can be used with any vLLM model that implements SupportsLoRA.

from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest

llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)
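
# Replace with the path to your adapter; the arguments are a name, a unique integer ID, and the path.
LORA_ADAPTER_PATH = "path/to/adapter"
lora_adapter = LoRARequest("lora_adapter", 1, LORA_ADAPTER_PATH)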

print(
    llm.invoke("What are some popular Korean street foods?", lora_request=lora_adapter)
)
