Discord Thread Management¶
In this example, we have a directory into which the #issues-and-help channel on the LlamaIndex Discord is dumped periodically. We want to make sure our index always has the latest data, without duplicating any messages.
Indexing Discord Data¶
Discord data is dumped as sequential messages. Each message carries useful information such as a timestamp, an author, and, if the message is part of a thread, a link to the parent message.
The help channel on our Discord typically uses threads when resolving issues, so we will group all the messages into threads and index each thread as its own document.
First, let's explore the data we are working with.
import os
print(os.listdir("./discord_dumps"))
['help_channel_dump_06_02_23.json', 'help_channel_dump_05_25_23.json']
As you can see, we have two dumps from different dates. Let's pretend we only have the older dump to start with, and we want to build an index from that data.
First, let's explore the data a bit.
import json
with open("./discord_dumps/help_channel_dump_05_25_23.json", "r") as f:
    data = json.load(f)
print("JSON keys: ", data.keys(), "\n")
print("Message Count: ", len(data["messages"]), "\n")
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])
JSON keys:  dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount'])

Message Count:  5087

Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions'])

First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! - If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues.

Last Message:  Hello there! How can I use llama_index with GPU?
For convenience, I've provided a script that groups these messages into threads. You can check out the group_conversations.py script for more details. The output file will be a json list where each item in the list is a discord thread. A rough sketch of how such grouping might work is included after the command below.
!python ./group_conversations.py ./discord_dumps/help_channel_dump_05_25_23.json
Done! Written to conversation_docs.json
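For reference, here is a minimal sketch of the kind of grouping group_conversations.py performs. It is an illustration only: the parent_id field and the shape of the author field are assumptions, and the real script may use different field names and logic.

from collections import defaultdict

def group_into_threads(messages):
    # Group a flat list of message dicts into threads keyed by a root message.
    # Assumption: each message has "id", "timestamp", "content", an "author"
    # dict with a "name", and (hypothetically) a "parent_id" pointing at the
    # root message of its thread when it is a reply.
    threads = defaultdict(list)
    for msg in messages:
        root_id = msg.get("parent_id") or msg["id"]
        threads[root_id].append(msg)

    docs = []
    for root_id, msgs in threads.items():
        msgs.sort(key=lambda m: m["timestamp"])
        text = "\n".join(f"{m['author']['name']}:\n{m['content']}\n" for m in msgs)
        docs.append(
            {"thread": text, "metadata": {"id": root_id, "timestamp": msgs[0]["timestamp"]}}
        )
    return docs

# e.g. threads = group_into_threads(data["messages"])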
with open("conversation_docs.json", "r") as f:
    threads = json.load(f)
print("Thread keys: ", threads[0].keys(), "\n")
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")
Thread keys:  dict_keys(['thread', 'metadata'])

{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'}

arminta7:
Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 Thank you for making this sort of project accessible to someone like me!

ragingWater_:
I had a similar problem which I solved the following way in another world:
- if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you.
- for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs
Now we have a list of threads, which we can turn into documents and index!
Create the initial index¶
from llama_index.core import Document
# create document objects using doc_id's and dates from each thread
documents = []
for thread in threads:
    thread_text = thread["thread"]
    thread_id = thread["metadata"]["id"]
    timestamp = thread["metadata"]["timestamp"]
    documents.append(
        Document(text=thread_text, id_=thread_id, metadata={"date": timestamp})
    )
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
Let's double check which documents the index has actually ingested.
print("ref_docs ingested: ", len(index.ref_doc_info))
print("number of input documents: ", len(documents))
ref_docs ingested:  767
number of input documents:  767
So far so good. Let's also check a specific thread to make sure the metadata worked, and to see how many nodes it was broken into.
thread_id = threads[0]["metadata"]["id"]
print(index.ref_doc_info[thread_id])
RefDocInfo(node_ids=['0c530273-b6c3-4848-a760-fe73f5f8136e'], metadata={'date': '2023-01-02T03:36:04.191+00:00'})
Perfect! Our thread was rather short, so it was converted directly into a single node. Furthermore, we can see the date field was set correctly.
Next, let's back up our index so that we don't have to waste tokens indexing again.
# save the initial index
index.storage_context.persist(persist_dir="./storage")
# load it again to confirm it worked
from llama_index.core import StorageContext, load_index_from_storage
index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)
print("Double check ref_docs ingested: ", len(index.ref_doc_info))
Double check ref_docs ingested: 767
Refresh the index with new data!¶
Now, suddenly we remember we have that new dump of discord messages! Rather than rebuilding the entire index from scratch, we can index only the new documents using the refresh() function.
Since we manually set the doc_id of each document, LlamaIndex can compare incoming documents that have the same doc_id to confirm a) whether the doc_id has actually been ingested, and b) whether the content has changed.
The refresh function will return a boolean array, indicating which documents in the input were refreshed or inserted. We can use this to confirm that only the new discord threads are inserted!
When a document's content has changed, the update() function is called, which removes the document from the index and re-inserts it.
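As a toy illustration of these semantics (separate from the Discord data; the documents and the expected result in the comment are made up, and this assumes an embedding model is configured as in the rest of this notebook):

from llama_index.core import Document, VectorStoreIndex

docs_v1 = [
    Document(text="hello world", id_="doc_1"),
    Document(text="foo bar", id_="doc_2"),
]
toy_index = VectorStoreIndex.from_documents(docs_v1)

docs_v2 = [
    Document(text="hello world", id_="doc_1"),   # same id_, same content -> not refreshed
    Document(text="foo bar baz", id_="doc_2"),   # same id_, new content  -> updated
    Document(text="a new thread", id_="doc_3"),  # new id_                -> inserted
]
# expected result based on the behaviour described above: [False, True, True]
print(toy_index.refresh(docs_v2))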
import json
with open("./discord_dumps/help_channel_dump_06_02_23.json", "r") as f:
    data = json.load(f)
print("JSON keys: ", data.keys(), "\n")
print("Message Count: ", len(data["messages"]), "\n")
print("Sample Message Keys: ", data["messages"][0].keys(), "\n")
print("First Message: ", data["messages"][0]["content"], "\n")
print("Last Message: ", data["messages"][-1]["content"])
JSON keys:  dict_keys(['guild', 'channel', 'dateRange', 'messages', 'messageCount'])

Message Count:  5286

Sample Message Keys:  dict_keys(['id', 'type', 'timestamp', 'timestampEdited', 'callEndedTimestamp', 'isPinned', 'content', 'author', 'attachments', 'embeds', 'stickers', 'reactions', 'mentions'])

First Message:  If you're running into any bugs, issues, or you have questions as to how to best use GPT Index, put those here! - If it's a bug, let's also track as a GH issue: https://github.com/jerryjliu/gpt_index/issues.

Last Message:  Started a thread.
As we can see, the first message is the same as in the original dump. But now we have about 200 more messages, and the last message is clearly new! refresh() will make updating our index easy.
First, let's create our new threads/documents.
!python ./group_conversations.py ./discord_dumps/help_channel_dump_06_02_23.json
Done! Written to conversation_docs.json
with open("conversation_docs.json", "r") as f:
    threads = json.load(f)
print("Thread keys: ", threads[0].keys(), "\n")
print(threads[0]["metadata"], "\n")
print(threads[0]["thread"], "\n")
Thread keys:  dict_keys(['thread', 'metadata'])

{'timestamp': '2023-01-02T03:36:04.191+00:00', 'id': '1059314106907242566'}

arminta7:
Hello all! Thanks to GPT_Index I've managed to put together a script that queries my extensive personal note collection which is a local directory of about 20k markdown files. Some of which are very long. I work in this folder all day everyday, so there are frequent changes. Currently I would need to rerun the entire indexing (is that the correct term?) when I want to incorporate edits I've made. So my question is... is there a way to schedule indexing to maybe once per day and only add information for files that have changed? Or even just manually run it but still only add edits? This would make a huge difference in saving time (I have to leave it running overnight for the entire directory) as well as cost 😬. Excuse me if this is a dumb question, I'm not a programmer and am sort of muddling around figuring this out 🤓 Thank you for making this sort of project accessible to someone like me!

ragingWater_:
I had a similar problem which I solved the following way in another world:
- if you have a list of files, you want something which says that edits were made in the last day, possibly looking at the last_update_time of the file should help you.
- for decreasing the cost, I would suggest maybe doing a keyword extraction or summarization of your notes and generating an embedding for it. Take your NLP query and get the most similar file (cosine similarity by pinecone db should help, GPTIndex also has a faiss) this should help with your cost needs
# create document objects using doc_id's and dates from each thread
new_documents = []
for thread in threads:
    thread_text = thread["thread"]
    thread_id = thread["metadata"]["id"]
    timestamp = thread["metadata"]["timestamp"]
    new_documents.append(
        Document(text=thread_text, id_=thread_id, metadata={"date": timestamp})
    )
print("Number of new documents: ", len(new_documents) - len(documents))
Number of new documents: 13
# now, refresh!
refreshed_docs = index.refresh(
    new_documents,
    update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)
By default, if a document's content has changed and it is updated, we can pass an extra flag, delete_from_docstore. This flag is False by default because indexes can share the docstore. But since we only have one index here, removing from the docstore is fine.
If we kept the option as False, the document information would still be deleted from the index_struct, which effectively makes that document invisible to the index.
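As a quick sanity check (a sketch; this output was not captured in the run above), the number of ingested ref docs should now be the original 767 threads plus the handful of brand-new ones, since updated threads are replaced in place rather than duplicated:

# expected to be 767 + (number of brand-new threads) if updates replaced
# the older versions in place (assumption, not a captured output)
print("ref_docs after refresh: ", len(index.ref_doc_info))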
print("Number of newly inserted/refreshed docs: ", sum(refreshed_docs))
Number of newly inserted/refreshed docs: 15
Interesting! We have 13 new documents, but 15 documents were refreshed. Did someone edit their message? Add more text to a thread? Let's find out.
print(refreshed_docs[-25:])
[False, True, False, False, True, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True]
new_documents[-21]
Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='36d308d1d2d1aa5cbfdb2f7d64709644a68805ec22a6053943f985084eec340e', text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\nSiddhant Saurabh:\nI think this happened because of the error mentioned by me here https://discord.com/channels/1059199217496772688/1106229492369850468/1108453477081948280\nI think we need to re-preprocessing for such nodes, right?\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
documents[-8]
Document(id_='1110938122902048809', embedding=None, weight=1.0, metadata={'date': '2023-05-24T14:31:28.732+00:00'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='c995c43873440a9d0263de70fff664269ec70d751c6e8245b290882ec5b656a1', text='Siddhant Saurabh:\nhey facing error\n```\n*error_trace: Traceback (most recent call last):\n File "/app/src/chatbot/query_gpt.py", line 248, in get_answer\n context_answer = self.call_pinecone_index(request)\n File "/app/src/chatbot/query_gpt.py", line 229, in call_pinecone_index\n self.source.append(format_cited_source(source_node.doc_id))\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 172, in doc_id\n return self.node.ref_doc_id\n File "/usr/local/lib/python3.8/site-packages/llama_index/data_structs/node.py", line 87, in ref_doc_id\n return self.relationships.get(DocumentRelationship.SOURCE, None)\nAttributeError: \'Field\' object has no attribute \'get\'\n```\nwith latest llama_index 0.6.9\n@Logan M @jerryjliu98 @ravitheja\nLogan M:\nHow are you inserting nodes/documents? That attribute on the node should be set automatically usually\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
Nice! The newer document contained more messages in the thread. As you can see, refresh() was able to detect this and automatically replaced the older thread with the updated text.
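With the index refreshed, it can be persisted and queried as usual. A minimal sketch (the question is illustrative):

# persist the refreshed index so the next refresh starts from the latest state
index.storage_context.persist(persist_dir="./storage")

# query the up-to-date index (illustrative question)
query_engine = index.as_query_engine()
response = query_engine.query("How do I avoid re-indexing unchanged documents?")
print(response)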