PII
PII (Personally Identifiable Information) Masking
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-llms-huggingface
In [ ]:
!pip install llama-index
In [ ]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core.postprocessor import (
PIINodePostprocessor,
NERPIINodePostprocessor,
)
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.schema import TextNode
INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
/home/loganm/miniconda3/envs/llama-index/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
In [ ]:
# load documents
text = """
Hello Paulo Santos. The latest statement for your credit card account 1111-0000-1111-0000 was mailed to 123 Any Street, Seattle, WA 98109.
"""
node = TextNode(text=text)
Option #1: Use NER Model for PII Masking¶
Use a Hugging Face NER model to mask PII.
In [ ]:
processor = NERPIINodePostprocessor()
In [ ]:
from llama_index.core.schema import NodeWithScore
new_nodes = processor.postprocess_nodes([NodeWithScore(node=node)])
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english). Using a pipeline without specifying a model name and revision in production is not recommended.
/home/loganm/miniconda3/envs/llama-index/lib/python3.11/site-packages/transformers/pipelines/token_classification.py:169: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="AggregationStrategy.SIMPLE"` instead. warnings.warn(
In [ ]:
# view redacted text
new_nodes[0].node.get_text()
Out[ ]:
'Hello [ORG_6]. The latest statement for your credit card account 1111-0000-1111-0000 was mailed to 123 [ORG_108] [LOC_112], [LOC_120], [LOC_129] 98109.'
In [ ]:
# get the mapping in metadata
# NOTE: this is not sent to the LLM!
new_nodes[0].node.metadata["__pii_node_info__"]
Out[ ]:
{'[ORG_6]': 'Paulo Santos', '[ORG_108]': 'Any', '[LOC_112]': 'Street', '[LOC_120]': 'Seattle', '[LOC_129]': 'WA'}
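Because the raw values live only in node metadata, it is easy to sanity-check that none of them leaked into the redacted text. A minimal sketch (the masked text and mapping below are copied from the output above; `assert_no_pii_leak` is a hypothetical helper, not part of LlamaIndex):

```python
def assert_no_pii_leak(masked_text: str, mapping: dict) -> None:
    """Raise if any original PII value still appears in the masked text."""
    leaked = [value for value in mapping.values() if value in masked_text]
    if leaked:
        raise ValueError(f"PII leaked into masked text: {leaked}")


# values copied from the NER output above
masked = (
    "Hello [ORG_6]. The latest statement for your credit card account "
    "1111-0000-1111-0000 was mailed to 123 [ORG_108] [LOC_112], "
    "[LOC_120], [LOC_129] 98109."
)
mapping = {
    "[ORG_6]": "Paulo Santos",
    "[ORG_108]": "Any",
    "[LOC_112]": "Street",
    "[LOC_120]": "Seattle",
    "[LOC_129]": "WA",
}
assert_no_pii_leak(masked, mapping)  # passes: no mapped value remains
```

Note that the check passes only for the values the NER model actually extracted: the credit card number was left in place, since the NER model only tags named entities. The LLM and Presidio options catch it.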
Option #2: Use LLM for PII Masking¶
In [ ]:
from llama_index.llms.openai import OpenAI
processor = PIINodePostprocessor(llm=OpenAI())
In [ ]:
from llama_index.core.schema import NodeWithScore
new_nodes = processor.postprocess_nodes([NodeWithScore(node=node)])
In [ ]:
# view redacted text
new_nodes[0].node.get_text()
Out[ ]:
'Hello [NAME]. The latest statement for your credit card account [CREDIT_CARD_NUMBER] was mailed to [ADDRESS].'
In [ ]:
# get the mapping in metadata
# NOTE: this is not sent to the LLM!
new_nodes[0].node.metadata["__pii_node_info__"]
Out[ ]:
{'NAME': 'Paulo Santos', 'CREDIT_CARD_NUMBER': '1111-0000-1111-0000', 'ADDRESS': '123 Any Street, Seattle, WA 98109'}
Option #3: Use Presidio for PII Masking¶
Use Presidio to identify and anonymize PII.
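The Presidio postprocessor is not part of the core package, so it needs its own install; assuming the standard LlamaIndex integration naming, that would be:

```shell
pip install llama-index-postprocessor-presidio
```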
In [ ]:
# load documents
text = """
Hello Paulo Santos. The latest statement for your credit card account 4095-2609-9393-4932 was mailed to Seattle, WA 98109.
IBAN GB90YNTU67299444055881 and social security number 474-49-7577 were verified on the system.
Further communications will be sent to paulo@presidio.site
"""
presidio_node = TextNode(text=text)
In [ ]:
from llama_index.postprocessor.presidio import PresidioPIINodePostprocessor
processor = PresidioPIINodePostprocessor()
In [ ]:
from llama_index.core.schema import NodeWithScore
presidio_new_nodes = processor.postprocess_nodes(
[NodeWithScore(node=presidio_node)]
)
In [ ]:
# view redacted text
presidio_new_nodes[0].node.get_text()
Out[ ]:
'\nHello <PERSON_1>. The latest statement for your credit card account <CREDIT_CARD_1> was mailed to <LOCATION_2>, <LOCATION_1>. IBAN <IBAN_CODE_1> and social security number is <US_SSN_1> were verified on the system. Further communications will be sent to <EMAIL_ADDRESS_1> \n'
In [ ]:
# get the mapping in metadata
# NOTE: this is not sent to the LLM!
presidio_new_nodes[0].node.metadata["__pii_node_info__"]
Out[ ]:
{'<EMAIL_ADDRESS_1>': 'paulo@presidio.site', '<US_SSN_1>': '474-49-7577', '<IBAN_CODE_1>': 'GB90YNTU67299444055881', '<LOCATION_1>': 'WA 98109', '<LOCATION_2>': 'Seattle', '<CREDIT_CARD_1>': '4095-2609-9393-4932', '<PERSON_1>': 'Paulo Santos'}
Feed Nodes to Index¶
In this example, we feed the masked nodes into an index.
In [ ]:
# feed into index
index = VectorStoreIndex([n.node for n in new_nodes])
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 30 tokens
In [ ]:
response = index.as_query_engine().query(
"What address was the statement mailed to?"
)
print(str(response))
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 8 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 71 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
[ADDRESS]
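The query response comes back with the placeholder (here `[ADDRESS]`), and since the mapping never left the process, the caller can restore the real value locally after the LLM round trip. A minimal sketch, assuming the bracketed `[KEY]` placeholder format shown above (`unmask` is a hypothetical helper, not a LlamaIndex API):

```python
def unmask(text: str, mapping: dict) -> str:
    """Replace [KEY] placeholders with the original values from node metadata."""
    for key, value in mapping.items():
        text = text.replace(f"[{key}]", value)
    return text


# mapping copied from the LLM-masked node above
mapping = {
    "NAME": "Paulo Santos",
    "CREDIT_CARD_NUMBER": "1111-0000-1111-0000",
    "ADDRESS": "123 Any Street, Seattle, WA 98109",
}
print(unmask("[ADDRESS]", mapping))  # → 123 Any Street, Seattle, WA 98109
```

Keeping the un-masking on the caller's side preserves the guarantee that raw PII is never sent to the LLM or the embedding service.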