Install llama-index-llms-ipex-llm. This will also install ipex-llm and its dependencies.
%pip install llama-index-llms-ipex-llm
In this example we use the HuggingFaceH4/zephyr-7b-alpha model for the demonstration. It requires updating the transformers and tokenizers packages.
%pip install -U transformers==4.37.0 tokenizers==0.15.2
Before loading the Zephyr model, you'll need to define completion_to_prompt and messages_to_prompt for formatting prompts. This is essential for preparing inputs that the model can interpret accurately.
# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"


# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt; insert a blank one if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add the final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt
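As a quick sanity check of the formatting above, the snippet below (using a minimal stand-in for llama-index's ChatMessage, which carries the same role/content fields) shows the Zephyr chat template that messages_to_prompt produces for a single user message:

```python
from dataclasses import dataclass


# Minimal stand-in for llama_index.core.llms.ChatMessage (role/content only)
@dataclass
class Msg:
    role: str
    content: str


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"
    # Ensure the prompt starts with a system block, then append the
    # assistant header so the model knows to respond next
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    return prompt + "<|assistant|>\n"


print(messages_to_prompt([Msg(role="user", content="What is AI?")]))
# <|system|>
# </s>
# <|user|>
# What is AI?</s>
# <|assistant|>
```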
Basic Usage¶
Load the Zephyr model locally using IpexLLM.from_model_id. It loads the model directly in its Huggingface format and converts it automatically to a low-bit format for inference.
import warnings
warnings.filterwarnings(
"ignore", category=UserWarning, message=".*padding_mask.*"
)
from llama_index.llms.ipex_llm import IpexLLM
llm = IpexLLM.from_model_id(
model_name="HuggingFaceH4/zephyr-7b-alpha",
tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
context_window=512,
max_new_tokens=128,
generate_kwargs={"do_sample": False},
completion_to_prompt=completion_to_prompt,
messages_to_prompt=messages_to_prompt,
)
Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]
2024-04-11 21:36:54,739 - INFO - Converting the current model to sym_int4 format......
Now you can proceed to use the loaded model for text completion and interactive chat.
Here we ask the loaded Zephyr model to complete a sentence: given its opening words, the model generates the text that follows.
completion_response = llm.complete("Once upon a time, ")
print(completion_response.text)
in a far-off land, there was a young girl named Lily. Lily lived in a small village surrounded by lush green forests and rolling hills. She loved nothing more than spending her days exploring the woods and playing with her animal friends. One day, while wandering through the forest, Lily stumbled upon a magical tree. The tree was unlike any other she had ever seen. Its trunk was made of shimmering crystal, and its branches were adorned with sparkling jewels. Lily was immediately drawn to the tree and sat down to admire its beauty. Suddenly,
Streaming Text Completion¶
response_iter = llm.stream_complete("Once upon a time, there's a little girl")
for response in response_iter:
print(response.delta, end="", flush=True)
who loved to play with her toys. She had a favorite teddy bear named Ted, and a doll named Dolly. She would spend hours playing with them, imagining all sorts of adventures. One day, she decided to take Ted and Dolly on a real adventure. She packed a backpack with some snacks, a blanket, and a map. They set off on a hike in the nearby woods. The little girl was so excited that she could barely contain her joy. Ted and Dolly were happy to be along for the ride. They walked for what seemed like hours, but the little girl didn't mind
Here is a simple chat interaction with the model.
from llama_index.core.llms import ChatMessage
message = ChatMessage(role="user", content="Explain Big Bang Theory briefly")
resp = llm.chat([message])
print(resp)
assistant: The Big Bang Theory is a popular American sitcom that aired from 2007 to 2019. The show follows the lives of two brilliant but socially awkward physicists, Leonard Hofstadter (Johnny Galecki) and Sheldon Cooper (Jim Parsons), and their friends and colleagues, Penny (Kaley Cuoco), Rajesh Koothrappali (Kunal Nayyar), and Howard Wolowitz (Simon Helberg). The show is set in Pasadena, California, and revolves around the characters' work at Caltech and
Streaming Chat¶
message = ChatMessage(role="user", content="What is AI?")
resp = llm.stream_chat([message], max_tokens=256)
for r in resp:
print(r.delta, end="")
AI stands for Artificial Intelligence. It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, reasoning, and problem-solving. AI involves the use of machine learning algorithms, natural language processing, and other advanced techniques to enable computers to understand and respond to human input in a more natural and intuitive way.
Save/Load Low-bit Model¶
Alternatively, you can save the low-bit model to disk once and later reload it with from_model_id_low_bit instead of from_model_id - even on a different machine. This approach is space-efficient, as the low-bit model requires significantly less disk space than the original model. And from_model_id_low_bit is also more efficient than from_model_id in terms of speed and memory usage, since it skips the model conversion step.
To save the low-bit model, use save_low_bit as follows.
saved_lowbit_model_path = (
    "./zephyr-7b-alpha-low-bit"  # path to save the low-bit model
)
llm._model.save_low_bit(saved_lowbit_model_path)
del llm
Load the model from the saved low-bit model path as follows.
Note that the saved path for the low-bit model only includes the model itself, not the tokenizer. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved.
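One way to gather the tokenizer files is a small file copy, sketched below. The glob patterns are an assumption about which files a typical Huggingface tokenizer ships with (tokenizer.json, tokenizer_config.json, special_tokens_map.json, and similar); adjust them for your model:

```python
import shutil
from pathlib import Path


def copy_tokenizer_files(src_dir, dst_dir):
    """Copy tokenizer-related files from the original model directory
    into the directory holding the saved low-bit model."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # Typical Huggingface tokenizer file names; adjust for your model
    for pattern in ("tokenizer*", "special_tokens_map.json"):
        for f in Path(src_dir).glob(pattern):
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return sorted(copied)
```

After copying, you could pass tokenizer_name=saved_lowbit_model_path when reloading, as noted in the commented-out line of the next cell.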
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_name=saved_lowbit_model_path,
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    # tokenizer_name=saved_lowbit_model_path,  # copy the tokenizer to the saved path if you want to use it this way
    context_window=512,
    max_new_tokens=64,
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={"do_sample": False},
)
2024-04-11 21:38:06,151 - INFO - Converting the current model to sym_int4 format......
Try streaming completion with the loaded low-bit model.
response_iter = llm_lowbit.stream_complete("What is Large Language Model?")
for response in response_iter:
print(response.delta, end="", flush=True)
A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive amount of text data. These models are capable of generating human-like responses to text inputs and can be used for various natural language processing (NLP) tasks, such as text classification, sentiment analysis