Jina Embeddings¶
如果您在colab上打开这个笔记本,您可能需要安装LlamaIndex 🦙。
%pip install llama-index-embeddings-jinaai
%pip install llama-index-llms-openai
!pip install llama-index
您可能还需要其他包,这些包不是直接与llama-index一起提供的。
!pip install Pillow
对于这个示例,你需要一个API密钥,你可以从https://jina.ai/embeddings/ 获取。
# 使用您的API密钥进行初始化
import os
jinaai_api_key = "YOUR_JINAAI_API_KEY"
os.environ["JINAAI_API_KEY"] = jinaai_api_key
通过JinaAI API使用Jina嵌入模型嵌入文本和查询¶
Jina是一个用于构建搜索解决方案的开源框架,它提供了一种简单而强大的方式来嵌入文本和查询。通过JinaAI API,我们可以轻松地使用Jina的嵌入模型来将文本和查询转换为向量表示形式,从而支持各种搜索和相似度匹配任务。
您可以使用JinaEmbedding类对文本和查询进行编码。
from llama_index.embeddings.jinaai import JinaEmbedding
embed_model = JinaEmbedding(
api_key=jinaai_api_key,
model="jina-embeddings-v2-base-en",
)
embeddings = embed_model.get_text_embedding("This is the text to embed")
print(len(embeddings))
print(embeddings[:5])
embeddings = embed_model.get_query_embedding("This is the query to embed")
print(len(embeddings))
print(embeddings[:5])
分批嵌入¶
在处理大型数据集时,有时候我们需要将数据分批处理以避免内存溢出或提高计算效率。这种情况下,我们可以使用分批嵌入的方法,将数据分成小批量进行嵌入处理。接下来我们将介绍如何在PyTorch中实现分批嵌入。
您也可以批量嵌入文本,批量大小可以通过设置embed_batch_size
参数来控制(如果未传递,默认值为10,且不应大于2048)。
embed_model = JinaEmbedding(
api_key=jinaai_api_key,
model="jina-embeddings-v2-base-en",
embed_batch_size=16,
)
embeddings = embed_model.get_text_embedding_batch(
["This is the text to embed", "More text can be provided in a batch"]
)
print(len(embeddings))
print(embeddings[0][:5])
让我们使用Jina AI Embeddings构建一个RAG管道¶
下载数据¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
# 导入所需的库
import numpy as np
import pandas as pd
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.response.notebook_utils import display_source_node
from IPython.display import Markdown, display
# 加载数据
将数据加载到notebook中。
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
# 构建索引
your_openai_key = "YOUR_OPENAI_KEY"
llm = OpenAI(api_key=your_openai_key)
embed_model = JinaEmbedding(
api_key=jinaai_api_key,
model="jina-embeddings-v2-base-en",
embed_batch_size=16,
)
index = VectorStoreIndex.from_documents(
documents=documents, embed_model=embed_model
)
class Retriever:
def __init__(self, url):
self.url = url
def fetch(self):
# Fetch the data from the specified URL
pass
def extract(self):
# Extract relevant information from the fetched data
pass
search_query_retriever = index.as_retriever()
search_query_retrieved_nodes = search_query_retriever.retrieve(
"What happened after the thesis?"
)
for n in search_query_retrieved_nodes:
display_source_node(n, source_length=2000)
Node ID: d8db9dfe-ab7a-4709-9863-877b68d2210d
Similarity: 0.7698250992788241
Text: There were some surplus Xerox Dandelions floating around the computer lab at one point. Anyone who wanted one to play around with could have one. I was briefly tempted, but they were so slow by present standards; what was the point? No one else wanted one either, so off they went. That was what happened to systems work.
I wanted not just to build things, but to build things that would last.
In this dissatisfied state I went in 1988 to visit Rich Draves at CMU, where he was in grad school. One day I went to visit the Carnegie Institute, where I'd spent a lot of time as a kid. While looking at a painting there I realized something that might seem obvious, but was a big surprise to me. There, right on the wall, was something you could make that would last. Paintings didn't become obsolete. Some of the best ones were hundreds of years old.
And moreover this was something you could make a living doing. Not as easily as you could by writing software, of course, but I thought if you were really industrious and lived really cheaply, it had to be possible to make enough to survive. And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding.
I had always liked looking at paintings. Could I make them? I had no idea. I'd never imagined it was even possible. I knew intellectually that people made art — that it didn't just appear spontaneously — but it was as if the people who made it were a different species. They either lived long ago or were mysterious geniuses doing strange things in profiles in Life magazine. The idea of actually being able to make art, to put that verb before that noun, seemed almost miraculous.
That fall I started taking art classes at Harvard. Grad students could take classes in any department, and my advisor, Tom Cheatham, was very easy going. If he even knew about the strange classes I was taking, he never said anything.
So now I was in a PhD program in computer science, yet planning to be an...
Node ID: 848bb0f8-629c-4491-b539-85f3ff5b77a2
Similarity: 0.7679106002573213
Text: Our teacher, professor Ulivi, was a nice guy. He could see I worked hard, and gave me a good grade, which he wrote down in a sort of passport each student had. But the Accademia wasn't teaching me anything except Italian, and my money was running out, so at the end of the first year I went back to the US.
I wanted to go back to RISD, but I was now broke and RISD was very expensive, so I decided to get a job for a year and then return to RISD the next fall. I got one at a company called Interleaf, which made software for creating documents. You mean like Microsoft Word? Exactly. That was how I learned that low end software tends to eat high end software. But Interleaf still had a few years to live yet. [5]
Interleaf had done something pretty bold. Inspired by Emacs, they'd added a scripting language, and even made the scripting language a dialect of Lisp. Now they wanted a Lisp hacker to write things in it. This was the closest thing I've had to a normal job, and I hereby apologize to my boss and coworkers, because I was a bad employee. Their Lisp was the thinnest icing on a giant C cake, and since I didn't know C and didn't want to learn it, I never understood most of the software. Plus I was terribly irresponsible. This was back when a programming job meant showing up every day during certain working hours. That seemed unnatural to me, and on this point the rest of the world is coming around to my way of thinking, but at the time it caused a lot of friction. Toward the end of the year I spent much of my time surreptitiously working on On Lisp, which I had by this time gotten a contract to publish.
The good part was that I got paid huge amounts of money, especially by art student standards. In Florence, after paying my part of the rent, my budget for everything else had been $7 a day. Now I was getting paid more than 4 times that every hour, even when I was just sitting in a meeting. By living cheaply I not only managed to save enough to go back to RISD, but a...