Microsoft OneDrive

OneDriveReader

Bases: BasePydanticReader

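Microsoft OneDrive reader.

Initializes a new instance of OneDriveReader.

:param client_id: The application (client) ID of the app registered in the Azure Entra (formerly Azure Active Directory) portal, with the MS Graph permission "Files.Read.All".
:param tenant_id: The directory (tenant) ID of the Azure Active Directory (AAD) tenant in which the application is registered. Defaults to "consumers", for multi-tenant applications and personal OneDrive accounts.
:param client_secret: The application secret of the app registered in the Azure portal. If provided, the MSAL client credential flow (ConfidentialClientApplication) is used for authentication. If not provided, interactive authentication is used (not recommended for CI/CD or other scenarios where manual interaction is not feasible). Required for app authentication.
:param userprincipalname: The user principal name (usually the organization-provided email address) of the user whose OneDrive will be accessed. Required for app authentication. Used if the argument is not provided when calling load_data().
:param folder_id: The ID of the OneDrive folder to fetch. Used if the argument is not provided when calling load_data().
:param file_ids: A list of IDs of the files to fetch from OneDrive. Used if the argument is not provided when calling load_data().
:param folder_path (str, optional): The relative path of the OneDrive folder to download. If provided, the files in the folder are downloaded. Used if the argument is not provided when calling load_data().
:param file_paths (List[str], optional): A list of specific file paths to download. Used if the argument is not provided when calling load_data().
:param file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file extension to a BaseReader class that specifies how to convert that file to text. See SimpleDirectoryReader for more details.

Interactive authentication signs the user in through a browser, so the registered application should have a redirect URI set to 'https://localhost' under mobile and native applications.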
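The choice between the two authentication modes can be sketched as follows. This is a minimal illustration, not the library's code: `describe_auth` and `make_reader` are hypothetical helpers, and the import path inside `make_reader` is an assumption based on the source location shown on this page. `describe_auth` mirrors the constructor logic in the source listing (an empty secret selects the interactive flow, and the MSAL authority URL is built from the tenant ID).

```python
from typing import Any, Dict, Optional


def describe_auth(
    client_secret: Optional[str] = None, tenant_id: str = "consumers"
) -> Dict[str, Any]:
    """Hypothetical helper mirroring how the reader picks its auth mode."""
    return {
        # No secret means the interactive (browser-based) flow is used.
        "interactive": not client_secret,
        # MSAL authority URL derived from the tenant ID.
        "authority": f"https://login.microsoftonline.com/{tenant_id}/",
    }


def make_reader(client_id: str, **kwargs: Any) -> Any:
    """Hypothetical helper: build a reader.

    The import is deferred so this sketch loads without the package
    installed; the import path is an assumption.
    """
    from llama_index.readers.microsoft_onedrive import OneDriveReader

    return OneDriveReader(client_id=client_id, **kwargs)


print(describe_auth())
print(describe_auth(client_secret="<secret>", tenant_id="<tenant-id>"))
```

Under these assumptions, interactive use would need only `make_reader("<client-id>")`, while app authentication would also pass `client_secret`, `tenant_id`, and `userprincipalname`.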

Source code in llama_index/readers/microsoft_onedrive/base.py

class OneDriveReader(BasePydanticReader):
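    """Microsoft OneDrive reader.

Initializes a new instance of OneDriveReader.

:param client_id: The application (client) ID of the app registered in the Azure Entra (formerly Azure Active Directory) portal, with the MS Graph permission "Files.Read.All".
:param tenant_id: The directory (tenant) ID of the Azure Active Directory (AAD) tenant in which the application is registered.
                  Defaults to "consumers", for multi-tenant applications and personal OneDrive accounts.
:param client_secret: The application secret of the app registered in the Azure portal.
                      If provided, the MSAL client credential flow (ConfidentialClientApplication) is used for authentication.
                      If not provided, interactive authentication is used (not recommended for CI/CD or other scenarios where manual interaction is not feasible).
                      Required for app authentication.
:param userprincipalname: The user principal name (usually the organization-provided email address) of the user whose OneDrive will be accessed. Required for app authentication. Used if the argument is not provided when calling load_data().
:param folder_id: The ID of the OneDrive folder to fetch. Used if the argument is not provided when calling load_data().
:param file_ids: A list of IDs of the files to fetch from OneDrive. Used if the argument is not provided when calling load_data().
:param folder_path (str, optional): The relative path of the OneDrive folder to download. If provided, the files in the folder are downloaded. Used if the argument is not provided when calling load_data().
:param file_paths (List[str], optional): A list of specific file paths to download. Used if the argument is not provided when calling load_data().
:param file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file extension to a BaseReader class that specifies how to convert that file to text.
                                                        See `SimpleDirectoryReader` for more details.


Interactive authentication signs the user in through a browser, so the registered application should have a redirect URI set to 'https://localhost' under mobile and native applications."""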

    client_id: str
    client_secret: Optional[str] = None
    tenant_id: Optional[str] = None
    userprincipalname: Optional[str] = None
    folder_id: Optional[str] = None
    file_ids: Optional[List[str]] = None
    folder_path: Optional[str] = None
    file_paths: Optional[List[str]] = None
    file_extractor: Optional[Dict[str, Union[str, BaseReader]]] = Field(
        default=None, exclude=True
    )

    _is_interactive_auth = PrivateAttr(False)
    _authority = PrivateAttr()
    _downloaded_files_metadata = PrivateAttr({})

    def __init__(
        self,
        client_id: str,
        client_secret: Optional[str] = None,
        tenant_id: Optional[str] = "consumers",
        userprincipalname: Optional[str] = None,
        folder_id: Optional[str] = None,
        file_ids: Optional[List[str]] = None,
        folder_path: Optional[str] = None,
        file_paths: Optional[List[str]] = None,
        file_extractor: Optional[Dict[str, Union[str, BaseReader]]] = None,
        **kwargs,
    ) -> None:
        self._is_interactive_auth = not client_secret
        self._authority = f"https://login.microsoftonline.com/{tenant_id}/"

        super().__init__(
            client_id=client_id,
            client_secret=client_secret,
            tenant_id=tenant_id,
            userprincipalname=userprincipalname,
            folder_id=folder_id,
            file_ids=file_ids,
            folder_path=folder_path,
            file_paths=file_paths,
            file_extractor=file_extractor,
            **kwargs,
        )

    def _authenticate_with_msal(self) -> Any:
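        """Authenticate with MSAL.

Interactive authentication signs the user in through a browser, so the registered application should have a redirect URI set to 'https://localhost' under mobile and native applications.
"""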
        import msal

        result = None

        if self._is_interactive_auth:
            logger.debug("Starting user authentication...")
            app = msal.PublicClientApplication(
                self.client_id, authority=self._authority
            )

            # The acquire_token_interactive method will open the default web browser
            # for the interactive part of the OAuth2 flow. The registered application should have a redirect URI set to 'https://localhost'
            # under mobile and native applications.
            result = app.acquire_token_interactive(SCOPES)
        else:
            logger.debug("Starting app authentication...")
            app = msal.ConfidentialClientApplication(
                self.client_id,
                authority=self._authority,
                client_credential=self.client_secret,
            )

            result = app.acquire_token_for_client(scopes=CLIENTCREDENTIALSCOPES)

        if "access_token" in result:
            logger.debug("Authentication is successful...")
            return result["access_token"]
        else:
            logger.error(result.get("error"))
            logger.error(result.get("error_description"))
            logger.error(result.get("correlation_id"))
            raise Exception(result.get("error"))

    def _construct_endpoint(
        self,
        item_ref: str,
        isRelativePath: bool,
        isFile: bool,
        userprincipalname: Optional[str] = None,
    ) -> str:
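        """Construct the appropriate OneDrive API endpoint from the given parameters.

Args:
    item_ref (str): Reference to the item; either an item ID or a relative path.
    isRelativePath (bool): Whether item_ref is a relative path.
    isFile (bool): Whether the target is a file.
    userprincipalname (str, optional): The user principal name; used when authentication is not interactive. Defaults to None.

Returns:
    str: The constructed endpoint.
"""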
        if not self._is_interactive_auth and not userprincipalname:
            raise Exception(
                "userprincipalname cannot be empty for App authentication. Provide the userprincipalname (usually email) of the user whose OneDrive will be accessed."
            )

        endpoint = "https://graph.microsoft.com/v1.0/"

        # Update the base endpoint based on the authentication method
        if self._is_interactive_auth:
            endpoint += "me/drive"
        else:
            endpoint += f"users/{userprincipalname}/drive"

        # Update the endpoint for relative paths or item IDs
        if isRelativePath:
            endpoint += f"/root:/{item_ref}"
        else:
            endpoint += f"/items/{item_ref}"

        # If the target is not a file, adjust the endpoint to retrieve children of a folder
        if not isFile:
            endpoint += ":/children" if isRelativePath else "/children"

        logger.info(f"API Endpoint determined: {endpoint}")

        return endpoint

    def _get_items_in_drive_with_maxretries(
        self,
        access_token: str,
        item_ref: Optional[str] = "root",
        max_retries: int = 3,
        userprincipalname: Optional[str] = None,
        isFile: bool = False,
        isRelativePath=False,
    ) -> Any:
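        """Retrieve items from a drive using the Microsoft Graph API.

Args:
access_token (str): Access token for the API call.
item_ref (Optional[str]): A specific item ID/path, or the root.
max_retries (int): Maximum number of retries on rate-limit or server errors.
userprincipalname (str): The user principal name (usually the organization-provided email address) of the user whose OneDrive will be accessed. Required for app authentication.
isFile (bool): Whether a file or a folder is being queried.
isRelativePath (bool): Whether the file or folder is queried by relative path.
Returns:
dict/None: The JSON response, or None once the maximum number of retries is exhausted.

Raises:
Exception: On a non-retryable status code.
"""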
        endpoint = self._construct_endpoint(
            item_ref, isRelativePath, isFile, userprincipalname
        )
        headers = {"Authorization": f"Bearer {access_token}"}
        retries = 0

        while retries < max_retries:
            response = requests.get(endpoint, headers=headers)
            if response.status_code == 200:
                return response.json()
            # Check for rate-limit (429) or server (5xx) errors; these can occur when
            # the endpoint is queried recursively and very frequently for a large number of files
            elif response.status_code in (429, *range(500, 600)):
                logger.warning(
                    f"Retrying {retries+1} in {retries+1} secs. Status code: {response.status_code}"
                )
                retries += 1
                time.sleep(retries)  # Linear back-off: waits 1s, 2s, 3s, ...
            else:
                raise Exception(
                    f"API request to download {item_ref} failed with status code: {response.status_code}, message: {response.content}"
                )

        logger.error(f"Failed after {max_retries} attempts.")
        return None

    def _download_file_by_url(self, item: Dict[str, Any], local_dir: str) -> str:
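        """Download a file from OneDrive using the download URL of the given item.

Args:
- item (Dict[str, str]): Dictionary containing the file metadata and download URL.
- local_dir (str): Local directory in which the file should be saved.

Returns:
- str: File path of the downloaded file.
"""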
        # Extract download URL and filename from the provided item.
        file_download_url = item["@microsoft.graph.downloadUrl"]
        file_name = item["name"]

        # Download the file.
        file_data = requests.get(file_download_url)

        # Save the downloaded file to the specified local directory.
        file_path = os.path.join(local_dir, file_name)
        with open(file_path, "wb") as f:
            f.write(file_data.content)

        return file_path

    def _extract_metadata_for_file(self, item: Dict[str, Any]) -> Dict[str, str]:
        """Extract the metadata associated with a file.

Args:
- item (Dict[str, str]): Dictionary containing the file metadata.

Returns:
- Dict[str, str]: Dictionary containing the extracted metadata.
"""
        # Extract the required metadata for file.
        created_by = item.get("createdBy", {})
        modified_by = item.get("lastModifiedBy", {})
        return {
            "file_id": item.get("id"),
            "file_name": item.get("name"),
            "created_by_user": created_by.get("user", {}).get("displayName"),
            "created_by_app": created_by.get("application", {}).get("displayName"),
            "created_dateTime": item.get("createdDateTime"),
            "last_modified_by_user": modified_by.get("user", {}).get("displayName"),
            "last_modified_by_app": modified_by.get("application", {}).get(
                "displayName"
            ),
            "last_modified_datetime": item.get("lastModifiedDateTime"),
        }

    def _check_approved_mimetype_and_download_file(
        self,
        item: Dict[str, Any],
        local_dir: str,
        mime_types: Optional[List[str]] = None,
    ):
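        """Check a file against the accepted MIME types and download it if accepted.

:param item: dict, a dictionary representing the file item; must contain the 'file' and 'mimeType' keys.
:param local_dir: str, the local directory to download the file into.
:param mime_types: list, the accepted MIME types. If None or empty, all file types are accepted.
:return: dict, a dictionary containing the metadata of the downloaded file.
"""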
        metadata = {}

        # Convert accepted MIME types to lowercase for case-insensitive comparison
        accepted_mimetypes = (
            [mimetype.lower() for mimetype in mime_types] if mime_types else ["*"]
        )

        # Check if the item's MIME type is among the accepted MIME types
        is_accepted_mimetype = (
            "*" in accepted_mimetypes
            or item["file"]["mimeType"].lower() in accepted_mimetypes
        )

        if is_accepted_mimetype:
            # It's a file with an accepted MIME type; download and extract metadata
            file_path = self._download_file_by_url(item, local_dir)
            metadata[file_path] = self._extract_metadata_for_file(item)
        else:
            # Log a debug message for files that are ignored due to an invalid MIME type
            logger.debug(
                f"Ignoring file '{item['name']}' as its MIME type does not match the accepted types."
            )

        return metadata

    def _connect_download_and_return_metadata(
        self,
        access_token: str,
        local_dir: str,
        item_id: Optional[str] = None,
        include_subfolders: bool = True,
        mime_types: Optional[List[str]] = None,
        userprincipalname: Optional[str] = None,
        isRelativePath=False,
    ) -> Any:
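        """Recursively download files from OneDrive, starting from the specified item_id or the root.

Args:
- access_token (str): Token used for authorization.
- local_dir (str): Local directory in which to store the downloaded files.
- item_id (str, optional): ID of the specific item (folder/file) to start from. If None, starts from the root.
- include_subfolders (bool, optional): Whether to include subfolders. Defaults to True.
- mime_types (List[str], optional): The MIME types to allow, e.g. "application/pdf". Defaults to None, which loads all files.
- userprincipalname (str): The user principal name (usually the organization-provided email address) of the user whose OneDrive is to be accessed. Required for app authentication scenarios.
- isRelativePath (bool): Whether the file/folder is queried by relative path.

Returns:
- dict: Dictionary mapping file paths to their corresponding metadata.

Raises:
- Exception: If the items for the current item cannot be retrieved.
"""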
        data = self._get_items_in_drive_with_maxretries(
            access_token,
            item_id,
            userprincipalname=userprincipalname,
            isRelativePath=isRelativePath,
        )

        if data:
            metadata = {}
            for item in data["value"]:
                if (
                    "folder" in item and include_subfolders
                ):  # It's a folder; traverse if flag is set
                    subfolder_metadata = self._connect_download_and_return_metadata(
                        access_token,
                        local_dir,
                        item["id"],
                        include_subfolders,
                        mime_types=mime_types,
                        userprincipalname=userprincipalname,
                    )
                    metadata.update(subfolder_metadata)  # Merge metadata

                elif "file" in item:
                    file_metadata = self._check_approved_mimetype_and_download_file(
                        item, local_dir, mime_types
                    )
                    metadata.update(file_metadata)

            return metadata

        # No data received; raise exception
        current_item = item_id if item_id else "RootFolder"
        raise Exception(f"Unable to retrieve items for: {current_item}")

    def _init_download_and_get_metadata(
        self,
        temp_dir: str,
        folder_id: Optional[str] = None,
        file_ids: Optional[List[str]] = None,
        folder_path: Optional[str] = None,
        file_paths: Optional[List[str]] = None,
        recursive: bool = False,
        mime_types: Optional[List[str]] = None,
        userprincipalname: Optional[str] = None,
    ) -> Dict[str, Any]:
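        """Download files from OneDrive based on the specified folder or file IDs/paths.

Args:
- temp_dir (str): Temporary directory in which to store the downloaded files.
- folder_id (str, optional): ID of the OneDrive folder to download. If provided, the files in the folder are downloaded.
- file_ids (List[str], optional): List of specific file IDs to download.
- folder_path (str, optional): Relative path of the OneDrive folder to download. If provided, the files in the folder are downloaded.
- file_paths (List[str], optional): List of specific file paths to download.
- recursive (bool): Whether to download files from subfolders when a folder is given.
- mime_types (List[str], optional): The MIME types to allow, e.g. "application/pdf". Defaults to None, which loads all files.
- userprincipalname (str): The user principal name (usually the organization-provided email address) of the user whose OneDrive will be accessed. Required for app authentication.
"""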
        access_token = self._authenticate_with_msal()
        is_download_from_root = True
        downloaded_files_metadata = {}
        # If a folder_id is provided, download files from the folder
        if folder_id:
            is_download_from_root = False
            folder_metadata = self._connect_download_and_return_metadata(
                access_token,
                temp_dir,
                folder_id,
                recursive,
                mime_types=mime_types,
                userprincipalname=userprincipalname,
            )
            downloaded_files_metadata.update(folder_metadata)

        # Download files using the provided file IDs
        if file_ids:
            is_download_from_root = False
            for file_id in file_ids or []:
                item = self._get_items_in_drive_with_maxretries(
                    access_token,
                    file_id,
                    userprincipalname=userprincipalname,
                    isFile=True,
                )
                file_metadata = self._check_approved_mimetype_and_download_file(
                    item, temp_dir, mime_types
                )
                downloaded_files_metadata.update(file_metadata)

        # If a folder_path is provided, download files from the folder
        if folder_path:
            is_download_from_root = False
            folder_metadata = self._connect_download_and_return_metadata(
                access_token,
                temp_dir,
                folder_path,
                recursive,
                mime_types=mime_types,
                userprincipalname=userprincipalname,
                isRelativePath=True,
            )
            downloaded_files_metadata.update(folder_metadata)

        # Download files using the provided file paths
        if file_paths:
            is_download_from_root = False
            for file_path in file_paths or []:
                item = self._get_items_in_drive_with_maxretries(
                    access_token,
                    file_path,
                    userprincipalname=userprincipalname,
                    isFile=True,
                    isRelativePath=True,
                )
                file_metadata = self._check_approved_mimetype_and_download_file(
                    item, temp_dir, mime_types
                )
                downloaded_files_metadata.update(file_metadata)

        if is_download_from_root:
            # download files from root folder
            root_folder_metadata = self._connect_download_and_return_metadata(
                access_token,
                temp_dir,
                "root",
                recursive,
                mime_types=mime_types,
                userprincipalname=userprincipalname,
            )
            downloaded_files_metadata.update(root_folder_metadata)

        return downloaded_files_metadata

    def _load_documents_with_metadata(
        self, directory: str, recursive: bool = True
    ) -> List[Document]:
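        """Load documents from the specified directory with SimpleDirectoryReader and associate them with their metadata.

Args:
- directory (str): The directory to load documents from.
- recursive (bool, optional): Whether to search the directory recursively. Defaults to True.

Returns:
- List[Document]: The documents loaded from the specified directory, with metadata attached.
"""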

        def get_metadata(filename: str) -> Any:
            return self._downloaded_files_metadata[filename]

        simple_loader = SimpleDirectoryReader(
            directory,
            file_extractor=self.file_extractor,
            file_metadata=get_metadata,
            recursive=recursive,
        )
        return simple_loader.load_data()

    def load_data(
        self,
        folder_id: Optional[str] = None,
        file_ids: Optional[List[str]] = None,
        folder_path: Optional[str] = None,
        file_paths: Optional[List[str]] = None,
        mime_types: Optional[List[str]] = None,
        recursive: bool = True,
        userprincipalname: Optional[str] = None,
    ) -> List[Document]:
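        """Load data from the folder ID / file IDs; if neither is provided, download from the root folder.

Args:
    folder_id (str, optional): ID of the folder in OneDrive.
    file_ids (List[str], optional): IDs of the files in OneDrive.
    folder_path (str, optional): Relative path of the OneDrive folder to download. If provided, the files in the folder are downloaded.
    file_paths (List[str], optional): List of specific file paths to download.
    mime_types: The MIME types to allow, e.g. "application/pdf". Defaults to None, which loads all files found.
    recursive: Boolean, whether to traverse and read subfolders. Defaults to True.
    userprincipalname: str, the user principal name (usually the organization-provided email address) of the user whose OneDrive will be accessed. Required for app authentication scenarios.

Returns:
    List[Document]: A list of documents.
"""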
        # If arguments are not provided to load_data(), initialize them from the object's attributes
        if not userprincipalname:
            userprincipalname = self.userprincipalname

        if not folder_id:
            folder_id = self.folder_id

        if not file_ids:
            file_ids = self.file_ids

        if not folder_path:
            folder_path = self.folder_path

        if not file_paths:
            file_paths = self.file_paths

        try:
            with tempfile.TemporaryDirectory() as temp_dir:
                self._downloaded_files_metadata = self._init_download_and_get_metadata(
                    temp_dir=temp_dir,
                    folder_id=folder_id,
                    file_ids=file_ids,
                    folder_path=folder_path,
                    file_paths=file_paths,
                    recursive=recursive,
                    mime_types=mime_types,
                    userprincipalname=userprincipalname,
                )
                return self._load_documents_with_metadata(temp_dir, recursive=recursive)
        except Exception as e:
            logger.error(
                f"An error occurred while loading the data: {e}", exc_info=True
            )
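The endpoint rules in `_construct_endpoint` above combine three choices: the authentication mode, whether the item is addressed by ID or by relative path, and whether the target is a file or a folder. The standalone sketch below restates that logic so the resulting URLs can be inspected without a reader instance. The function name, the `interactive` parameter, and the sample user are assumptions made for illustration.

```python
from typing import Optional

GRAPH_BASE = "https://graph.microsoft.com/v1.0/"


def construct_endpoint(
    item_ref: str,
    is_relative_path: bool,
    is_file: bool,
    interactive: bool,
    userprincipalname: Optional[str] = None,
) -> str:
    """Standalone restatement of the endpoint rules (illustrative only)."""
    if not interactive and not userprincipalname:
        raise ValueError("userprincipalname is required for app authentication")

    # Interactive auth addresses the signed-in user's drive; app auth names the user.
    drive = "me/drive" if interactive else f"users/{userprincipalname}/drive"
    endpoint = GRAPH_BASE + drive

    # Items are addressed either by relative path or by item ID.
    if is_relative_path:
        endpoint += f"/root:/{item_ref}"
    else:
        endpoint += f"/items/{item_ref}"

    # Folders are listed through their children collection.
    if not is_file:
        endpoint += ":/children" if is_relative_path else "/children"
    return endpoint


print(construct_endpoint("root", False, False, interactive=True))
print(construct_endpoint("docs", True, False, interactive=True))
print(construct_endpoint("docs/a.pdf", True, True, False, "user@example.com"))
```

Path-based folder lookups need the extra `:` before `/children` because the colon closes the path segment opened by `root:/`.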

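The retry loop in `_get_items_in_drive_with_maxretries` above waits one second longer after each failed attempt, retries only on status 429 and 5xx, and returns None once the attempts run out. A minimal sketch of that pattern follows, with the HTTP call and the sleep function injected so it runs without network access. `get_with_retries`, `fetch`, and `sleep` are hypothetical names, not part of the library.

```python
import time
from typing import Any, Callable, Optional, Tuple

# Status codes treated as transient: rate limiting and server errors.
RETRYABLE = {429, *range(500, 600)}


def get_with_retries(
    fetch: Callable[[], Tuple[int, Any]],
    max_retries: int = 3,
    sleep: Callable[[float], Any] = time.sleep,
) -> Optional[Any]:
    """Call fetch() until it succeeds, waiting 1s, 2s, 3s, ... between tries."""
    for attempt in range(1, max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"Request failed with status code {status}")
        sleep(attempt)  # linear back-off
    return None  # all attempts used up


# Simulated responses: two transient failures, then success.
responses = iter([(429, None), (503, None), (200, {"value": []})])
delays = []
result = get_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Recording the delays in a list instead of sleeping keeps the demonstration instantaneous while still showing the wait schedule.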
load_data

load_data(
    folder_id: Optional[str] = None,
    file_ids: Optional[List[str]] = None,
    folder_path: Optional[str] = None,
    file_paths: Optional[List[str]] = None,
    mime_types: Optional[List[str]] = None,
    recursive: bool = True,
    userprincipalname: Optional[str] = None,
) -> List[Document]

Load data from the folder ID / file IDs; if neither is provided, download from the root folder.

Returns:

Type	Description
`List[Document]`	List[Document]: A list of documents.
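How `load_data` decides what to download can be sketched as a small planning function. `plan_download` is a hypothetical helper that only mirrors the selection rules in the class source: a missing or empty call argument falls back to the value set on the reader, every source that ends up set is downloaded, and the drive root is used when none is set.

```python
from typing import Any, Dict, List, Tuple

SOURCE_KEYS = ("folder_id", "file_ids", "folder_path", "file_paths")


def plan_download(
    call_args: Dict[str, Any], reader_defaults: Dict[str, Any]
) -> List[Tuple[str, Any]]:
    """Hypothetical helper: list what a load_data() call would fetch."""
    plan = []
    for key in SOURCE_KEYS:
        # A missing or empty call argument falls back to the reader attribute.
        value = call_args.get(key) or reader_defaults.get(key)
        if value:
            plan.append((key, value))
    # With no folder or file given anywhere, the drive root is used.
    return plan or [("folder_id", "root")]


print(plan_download({}, {}))
print(plan_download({"file_paths": ["docs/a.pdf"]}, {"folder_id": "abc"}))
```

With a real reader, a call such as `reader.load_data(folder_path="docs", mime_types=["application/pdf"], recursive=False)` would, under the same rules, fetch only the PDF files directly inside `docs`.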
