GitHub Issue Analysis
Setup
In [ ]:
%pip install llama-index-readers-github
%pip install llama-index-llms-openai
%pip install llama-index-program-openai
In [ ]:
import os
os.environ["GITHUB_TOKEN"] = "<your github token>"
Load GitHub Issue Tickets
In [ ]:
import os
from llama_index.readers.github import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)
github_client = GitHubIssuesClient()
loader = GitHubRepositoryIssuesReader(
    github_client,
    owner="jerryjliu",
    repo="llama_index",
    verbose=True,
)
docs = loader.load_data()
Found 100 issues in the repo page 1
Resulted in 100 documents
Found 100 issues in the repo page 2
Resulted in 200 documents
Found 100 issues in the repo page 3
Resulted in 300 documents
Found 100 issues in the repo page 4
Resulted in 400 documents
Found 4 issues in the repo page 5
Resulted in 404 documents
No more issues found, stopping
Quick inspection
In [ ]:
docs[10].text
Out[ ]:
"feat(context length): QnA Summarization as a relevant information extractor\n### Feature Description\r\n\r\nSummarizer can help in cases where the information is evenly distributed in the document i.e. a large amount of context is required but the language is verbose or there are many irrelevant details. Summarization specific to the query can help.\r\n\r\nEither cheap local model or even LLM are options; the latter for reducing latency due to large context window in RAG. \r\n\r\nAnother place where it helps is that percentile and top_k don't account for variable information density. (However, this may be solved with inter-node sub-node reranking). \r\n"
In [ ]:
docs[10].metadata
Out[ ]:
{'state': 'open', 'created_at': '2023-07-13T11:16:30Z', 'url': 'https://api.github.com/repos/jerryjliu/llama_index/issues/6889', 'source': 'https://github.com/jerryjliu/llama_index/issues/6889'}
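The metadata fields shown above can be used to slice the loaded issues before any LLM call. The following is a minimal sketch, assuming each document exposes a `metadata` dict with the `state` and `created_at` keys seen in the output. The sample records are made up for illustration (only the first mirrors the real output), and the helpers `count_by_state` and `parse_created_at` are hypothetical names, not part of LlamaIndex.

```python
from collections import Counter
from datetime import datetime

# Hypothetical stand-ins for docs[i].metadata; only the keys mirror the real output.
sample_metadata = [
    {"state": "open", "created_at": "2023-07-13T11:16:30Z"},
    {"state": "closed", "created_at": "2023-07-12T08:00:00Z"},
    {"state": "open", "created_at": "2023-07-10T17:45:10Z"},
]


def count_by_state(metadata_iter):
    """Count issues per state (for example 'open' or 'closed')."""
    return Counter(m["state"] for m in metadata_iter)


def parse_created_at(metadata):
    """Parse the UTC timestamp format used by the `created_at` field."""
    return datetime.strptime(metadata["created_at"], "%Y-%m-%dT%H:%M:%SZ")


state_counts = count_by_state(sample_metadata)
newest = max(sample_metadata, key=parse_created_at)
print(state_counts)
print(newest["created_at"])
```

With the real data, the same helper could be called as `count_by_state(doc.metadata for doc in docs)`.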
Extract Themes
In [ ]:
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
In [ ]:
from pydantic import BaseModel
from typing import List
import asyncio
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.llms.openai import OpenAI
from llama_index.core.async_utils import batch_gather
In [ ]:
prompt_template_str = """\
Here is a GitHub Issue ticket.
{ticket}
Please extract the central themes and output a list of tags.\
"""
In [ ]:
class TagList(BaseModel):
    """A list of tags corresponding to the central themes of an issue."""

    tags: List[str]
In [ ]:
program = OpenAIPydanticProgram.from_defaults(
    prompt_template_str=prompt_template_str,
    output_cls=TagList,
)
In [ ]:
# Pass the issue text explicitly instead of relying on str(doc).
tasks = [program.acall(ticket=doc.text) for doc in docs]
In [ ]:
# Named tag_lists because the save/load and summarization cells use that name.
tag_lists = await batch_gather(tasks, batch_size=10, verbose=True)
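The call above awaits the extraction coroutines in groups of `batch_size` rather than all at once. That idea can be sketched with plain `asyncio`. This is an illustrative reimplementation under that assumption, not LlamaIndex's actual code; `gather_in_batches` and `fake_extract` are hypothetical names, and `fake_extract` stands in for the LLM call.

```python
import asyncio


async def fake_extract(ticket_id):
    """Hypothetical stand-in for an LLM call; returns a tag list for one ticket."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"tag-{ticket_id}"]


async def gather_in_batches(coros, batch_size=10):
    """Await coroutines one batch at a time, preserving input order."""
    results = []
    for start in range(0, len(coros), batch_size):
        batch = coros[start : start + batch_size]
        # Coroutines within a batch run concurrently; batches run sequentially.
        results.extend(await asyncio.gather(*batch))
    return results


coros = [fake_extract(i) for i in range(25)]
batched_results = asyncio.run(gather_in_batches(coros, batch_size=10))
print(len(batched_results))
```

Inside a notebook, where an event loop is already running, `await gather_in_batches(coros, batch_size=10)` would be used in place of `asyncio.run(...)`. Limiting the batch size is one simple way to stay under API rate limits.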
[Optional] Save/Load Extracted Themes
In [ ]:
import pickle
In [ ]:
with open("github_issue_analysis_data.pkl", "wb") as f:
    pickle.dump(tag_lists, f)
In [ ]:
with open("github_issue_analysis_data.pkl", "rb") as f:
    tag_lists = pickle.load(f)
print(f"Loaded tag lists for {len(tag_lists)} tickets")
Summarize Themes
Build the prompt
In [ ]:
# Assign the template to `prompt` so it can be formatted below.
prompt = """
Here is a list of central themes (in the form of tags) extracted from a list of GitHub Issue tickets.
The tags for each ticket are separated by 2 newlines.
{tag_lists_str}
Please summarize the key takeaways and identify what we should prioritize fixing.
"""
tag_lists_str = "\n\n".join([str(tag_list) for tag_list in tag_lists])
prompt = prompt.format(tag_lists_str=tag_lists_str)
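To see what the formatted prompt looks like without calling the API, the join-and-format step can be run on sample data. This is a sketch under stated assumptions: `SampleTagList` is a hypothetical dataclass standing in for the pydantic `TagList` model, the sample tags and the shortened `template` are made up, and the tags are rendered with `", ".join(...)` rather than `str(tag_list)`, which is a variation on the cell above, not the same call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleTagList:
    """Hypothetical stand-in for the pydantic TagList model."""

    tags: List[str]


sample_tag_lists = [
    SampleTagList(tags=["bug", "vector store"]),
    SampleTagList(tags=["feature request", "summarization"]),
]

# Shortened, made-up template with the same placeholder name.
template = "Tags:\n\n{tag_lists_str}\n\nPlease summarize."

# One line per ticket, with tickets separated by two newlines.
sample_tag_lists_str = "\n\n".join(", ".join(t.tags) for t in sample_tag_lists)
sample_prompt = template.format(tag_lists_str=sample_tag_lists_str)
print(sample_prompt)
```

Rendering only the tag strings keeps the model's class representation out of the prompt, which may shorten it when there are hundreds of tickets.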
Summarize with GPT-4
In [ ]:
from llama_index.llms.openai import OpenAI
response = OpenAI(model="gpt-4").stream_complete(prompt)
In [ ]:
for r in response:
    print(r.delta, end="")
1. Bug Fixes: There are numerous bugs reported across different components such as 'Updating/Refreshing documents', 'Supabase Vector Store', 'Parsing', 'Qdrant', 'LLM event', 'Service context', 'Chroma db', 'Markdown Reader', 'Search_params', 'Index_params', 'MilvusVectorStore', 'SentenceSplitter', 'Embedding timeouts', 'PGVectorStore', 'NotionPageReader', 'VectorIndexRetriever', 'Knowledge Graph', 'LLM content', and 'Query engine'. These issues need to be prioritized and resolved to ensure smooth functioning of the system.
2. Feature Requests: There are several feature requests like 'QnA Summarization', 'BEIR evaluation', 'Cross-Node Ranking', 'Node content', 'PruningMode', 'RelevanceMode', 'Local-model defaults', 'Dynamically selecting from multiple prompts', 'Human-In-The-Loop Multistep Query', 'Explore Tree-of-Thought', 'Postprocessing', 'Relevant Section Extraction', 'Original Source Reconstruction', 'Varied Latency in Retrieval', and 'MLFlow'. These features can enhance the capabilities of the system and should be considered for future development.
3. Code Refactoring and Testing: There are mentions of code refactoring, testing, and code review. This indicates a need for improving code quality and ensuring robustness through comprehensive testing.
4. Documentation: There are several mentions of documentation updates, indicating a need for better documentation to help users understand and use the system effectively.
5. Integration: There are mentions of integration with other systems like 'BEIR', 'Langflow', 'Hugging Face', 'OpenAI', 'DynamoDB', and 'CometML'. This suggests a need for better interoperability with other systems.
6. Performance and Efficiency: There are mentions of 'Parallelize sync APIs', 'Average query time', 'Efficiency', 'Upgrade', and 'Execution Plan'. This indicates a need for improving the performance and efficiency of the system.
7. User Experience (UX): There are mentions of 'UX', 'Varied Latency in Retrieval', and 'Human-In-The-Loop Multistep Query'. This suggests a need for improving the user experience.
8. Error Handling: There are several mentions of error handling, indicating a need for better error handling mechanisms to ensure system robustness.
9. Authentication: There are mentions of 'authentication' and 'API key', indicating a need for secure access mechanisms.
10. Multilingual Support: There is a mention of 'LLM中文应用交流微信群', indicating a need for multilingual support.