In the code below, we install the packages required for this demo:
In [ ]:
%pip install llama-index-llms-openvino transformers huggingface_hub
Now that we're set up, let's play around:

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
In [ ]:
from llama_index.llms.openvino import OpenVINOLLM
In [ ]:
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt, insert a blank one if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add the final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"
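To see what these helpers produce, you can print a formatted prompt; the output below is a sketch of the expected Zephyr-style template:

# Inspect the prompt produced for a plain completion request.
print(completion_to_prompt("What is the meaning of life?"))
# Expected output:
# <|system|>
# </s>
# <|user|>
# What is the meaning of life?</s>
# <|assistant|>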
In [ ]:
ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

ov_llm = OpenVINOLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)
In [ ]:
response = ov_llm.complete("What is the meaning of life?")
print(str(response))
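Besides complete, the LLM also supports the standard LlamaIndex chat interface; a minimal sketch, assuming the usual ChatMessage API from llama_index.core.llms:

from llama_index.core.llms import ChatMessage

# Build a short chat history and send it to the same OpenVINO-backed LLM.
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is the meaning of life?"),
]

resp = ov_llm.chat(messages)
print(resp)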
The model can also be exported to the OpenVINO IR format with the optimum-cli tool and loaded later from the resulting local folder:

In [ ]:
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir
It is recommended to apply 8-bit or 4-bit weight quantization with --weight-format to reduce inference latency and model footprint.
In [ ]:
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
In [ ]:
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir
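After the export finishes, the OpenVINO IR files land in the ov_model_dir folder passed above; a quick, hypothetical check of its contents (exact file names depend on the optimum-intel version):

import os

# List the exported artifacts (model IR, tokenizer files, config, ...).
print(sorted(os.listdir("ov_model_dir")))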
The exported model can then be loaded from the local folder by passing its path as model_name and tokenizer_name; here inference is mapped to the GPU device:

In [ ]:
ov_llm = OpenVINOLLM(
    model_name="ov_model_dir",
    tokenizer_name="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)
You can get additional inference speedups from dynamic quantization of activations and KV-cache quantization. These options can be enabled through ov_config as follows:
In [ ]:
ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
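Since ov_config is passed through model_kwargs when the model is loaded, the LLM has to be re-created with the updated dictionary for these options to apply; a minimal sketch reusing the parameters from above:

ov_llm = OpenVINOLLM(
    model_name="ov_model_dir",
    tokenizer_name="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},  # now includes the quantization hints
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)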
Streaming

Using the stream_complete endpoint
In [ ]:
response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")
Using the stream_chat endpoint
In [ ]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]

resp = ov_llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
For more information, please refer to: