In [ ]:
%pip install llama-index llama-index-readers-web
In [ ]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Using SimpleWebPageReader
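from llama_index.readers.web import SimpleWebPageReader
# Create a SimpleWebPageReader instance
reader = SimpleWebPageReader()
# Load the page; load_data takes a list of URLs and returns a list of Document objects
url = "https://www.example.com"
documents = reader.load_data([url])
print(documents[0].text)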
In [ ]:
from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import os
In [ ]:
# NOTE: the html_to_text=True option requires html2text to be installed.
In [ ]:
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)
In [ ]:
documents[0]
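Evaluating `documents[0]` prints the whole page, which can run to many screens for a long essay. As a sketch, a short preview can be produced with the standard library; the helper name `preview` is hypothetical, and it works on any string.

```python
import textwrap


def preview(text: str, width: int = 200) -> str:
    """Collapse runs of whitespace and truncate `text` to at most `width` characters."""
    return textwrap.shorten(text, width=width, placeholder=" ...")


# Stand-in text; in the notebook you would pass documents[0].text instead.
sample = "first line\n\n" + "word " * 100
print(preview(sample, width=60))
```

In the notebook, `print(preview(documents[0].text))` shows the opening of the loaded page without the full dump.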
In [ ]:
index = SummaryIndex.from_documents(documents)
In [ ]:
# Set logging to DEBUG for more detailed output
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
In [ ]:
display(Markdown(f"<b>{response}</b>"))
Using Spider Reader 🕷
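Spider bills itself as the fastest crawler. It converts any website into pure HTML, markdown, metadata, or text, and lets you run custom actions with AI while crawling.
Spider lets you use high-performance proxies to avoid detection, cache AI actions, receive crawl status through webhooks, schedule crawls, and more.
Prerequisites: you need a Spider API key to use this loader. You can get one at spider.cloud.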
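Rather than pasting the key into a notebook cell, it may be safer to read it from the environment. The sketch below assumes the key is stored in a variable named `SPIDER_API_KEY`, the name used in Spider's own example request; the helper `get_spider_api_key` is hypothetical and not part of any library.

```python
import os
from typing import Mapping, Optional


def get_spider_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Spider API key from the environment, or raise a clear error."""
    # Default to the process environment; a mapping can be passed in for testing.
    env = os.environ if env is None else env
    key = env.get("SPIDER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SPIDER_API_KEY is not set; get a key at https://spider.cloud "
            "and export it before running the notebook."
        )
    return key
```

The result can then be passed to the reader, for example `SpiderWebReader(api_key=get_spider_api_key(), mode="scrape")`.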
In [ ]:
# Scrape a single URL
from llama_index.readers.web import SpiderWebReader
spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",  # Get one at https://spider.cloud
    mode="scrape",
    # params={}  # Optional parameters; see https://spider.cloud/docs/api
)
documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)
[Document(id_='54a6ecf3-b33e-41e9-8cec-48657aa2ed9b', embedding=None, metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 101750, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider - Fastest Web Crawler[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)The World\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/crawl\', headers=headers, json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 5secs ###To crawl 200 pages### 21x ###Faster than FireCrawl### 150x ###Faster than Apify Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.[See framework benchmarks ](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using the latest AI models.### Best 
crawler for LLMs ###Don\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labeling[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
Crawl a domain, following all deeper subpages.
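A crawl can fetch many pages, so it may help to bound it through the optional `params` argument. The sketch below builds such a dictionary from fields named in the Spider API reference captured in the crawl output (`limit`, `depth`, `return_format`, `request_timeout`). The helper `build_crawl_params` and its defaults are hypothetical, and the allowed values are a snapshot of that reference text, so check https://spider.cloud/docs/api before relying on them.

```python
from typing import Any, Dict, Optional

# Values listed in the Spider API reference text; treat them as a snapshot.
RETURN_FORMATS = {"markdown", "raw", "text", "html2text"}


def build_crawl_params(
    limit: int = 10,
    depth: int = 0,
    return_format: str = "markdown",
    request_timeout: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a params dict; limit=0 crawls all pages and depth=0 means no depth limit."""
    if limit < 0 or depth < 0:
        raise ValueError("limit and depth must be non-negative")
    if return_format not in RETURN_FORMATS:
        raise ValueError(f"return_format must be one of {sorted(RETURN_FORMATS)}")
    params: Dict[str, Any] = {
        "limit": limit,
        "depth": depth,
        "return_format": return_format,
    }
    if request_timeout is not None:
        # The reference states that timeouts can be from 5 to 60 seconds.
        if not 5 <= request_timeout <= 60:
            raise ValueError("request_timeout must be between 5 and 60 seconds")
        params["request_timeout"] = request_timeout
    return params


print(build_crawl_params(limit=5, return_format="text"))
```

The resulting dictionary would be passed as `params=build_crawl_params(limit=5)` when constructing `SpiderWebReader` with `mode="crawl"`.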
In [ ]:
# Crawl the domain, following links to deeper subpages
from llama_index.readers.web import SpiderWebReader
spider_reader = SpiderWebReader(
    api_key="YOUR_API_KEY",
    mode="crawl",
    # params={}  # Optional parameters; see https://spider.cloud/docs/api
)
documents = spider_reader.load_data(url="https://spider.cloud")
print(documents)
[Document(id_='63f7ccbf-c6c8-4f69-80f7-f6763f761a39', embedding=None, metadata={'description': 'Our privacy policy and how it plays a part in the data collected.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 26647, 'keywords': None, 'pathname': '/privacy', 'resource_type': 'html', 'title': 'Privacy', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/privacy.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Privacy[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Privacy Policy==========Learn about how we take privacy with the Spider project.[Spider](https://spider.cloud) offers a cutting-edge data scraping service with powerful AI capabilities. Our data collecting platform is designed to help users maximize the benefits of data collection while embracing the advancements in AI technology. With our innovative tools, we provide a seamless and fast interactive experience. This privacy policy details Spider's approach to product development, deployment, and usage, encompassing the Crawler, AI products, and features.[AI Development at Spider----------](#ai-development-at-spider)Spider leverages a robust combination of proprietary code, open-source frameworks, and synthetic datasets to train its cutting-edge products. To continuously improve our offerings, Spider may utilize inputs from user-generated prompts and content, obtained from trusted third-party providers. By harnessing this diverse data, Spider can deliver highly precise and pertinent recommendations to our valued users. While the foundational data crawling aspect of Spider is openly available on Github, the dashboard and AI components remain closed source. 
Spider respects all robots.txt files declared on websites allowing data to be extracted without harming the website.[Security, Privacy, and Trust----------](#security-privacy-and-trust)At Spider, our utmost priority is the development and implementation of Crawlers, AI technologies, and products that adhere to ethical, moral, and legal standards. We are dedicated to creating a secure and respectful environment for all users. Safeguarding user data and ensuring transparency in its usage are core principles we uphold. In line with this commitment, we provide the following important disclosures when utilizing our AI-related products:* Spider ensures comprehensive disclosure of features that utilize third-party AI platforms. To provide clarity, these integrations will be clearly indicated through distinct markers, designations, explanatory notes that appear when hovering, references to the underlying codebase, or any other suitable form of notification as determined by the system. Our commitment to transparency aims to keep users informed about the involvement of third-party AI platforms in our products.* We collect and use personal data as set forth in our [Privacy Policy](https://spider.cloud/privacy) which governs the collection and usage of personal data. If you choose to input personal data into our AI products, please be aware that such information may be processed through third-party AI providers. For any inquiries or concerns regarding data privacy, feel free to reach out to us at [Spider Help Github](https://github.com/orgs/spider-rs/discussions). 
We are here to assist you.* Except for user-generated prompts and/or content as inputs, Spider does not use customer data, including the code related to the use of Spider's deployment services, to train or finetune any models used.* We periodically review and update our policies and procedures in an effort to comply with applicable data protection regulations and industry standards.* We use reasonable measures designed to maintain the safety of users and avoid harm to people and the environment. Spider's design and development process includes considerations for ethical, security, and regulatory requirements with certain safeguards to prevent and report misuse or abuse.[Third-Party Service Providers----------](#third-party-service-providers)In providing AI products and services, we leverage various third-party providers in the AI space to enhance our services and capabilities, and will continue to do so for certain product features.This page will be updated from time to time with information about Spider's use of AI. The current list of third-party AI providers integrated into Spider is as follows:* [Anthropic](https://console.anthropic.com/legal/terms)* [Azure Cognitive Services](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy)* [Cohere](https://cohere.com/terms-of-use)* [ElevenLabs](https://elevenlabs.io/terms)* [Hugging Face](https://huggingface.co/terms-of-service)* [Meta AI](https://www.facebook.com/policies_center/)* [OpenAI](https://openai.com/policies)* [Pinecone](https://www.pinecone.io/terms)* [Replicate](https://replicate.com/terms)We prioritize the safety of our users and take appropriate measures to avoid harm both to individuals and the environment. Our design and development processes incorporate considerations for ethical practices, security protocols, and regulatory requirements, along with established safeguards to prevent and report any instances of misuse or abuse. 
We are committed to maintaining a secure and respectful environment and upholding responsible practices throughout our services.[Acceptable Use----------](#acceptable-use)Spider's products are intended to provide helpful and respectful responses to user prompts and queries while collecting data along the web. We don't allow the use of our Scraper or AI tools, products and services for the following usages:* Denial of Service Attacks* Illegal activity* Inauthentic, deceptive, or impersonation behavior* Any other use that would violate Spider's standard published policies, codes of conduct, or terms of service.Any violation of this Spider AI Policy or any Spider policies or terms of service may result in termination of use of services at Spider's sole discretion. We will review and update this Spider AI Policy so that it remains relevant and effective. If you have feedback or would like to report any concerns or issues related to the use of AI systems, please reach out to [support@spider.cloud](mailto:support@spider.cloud).[More Information----------](#more-information)To learn more about Spider's integration of AI capabilities into products and features, check out the following resources:* [Spider-Rust](https://github.com/spider-rs)* [Spider](/)* [About](/)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='18e4d35d-ff48-4d00-b924-abab7a06fbec', embedding=None, metadata={'description': 'Learn how to crawl and scrape websites with the fastest web crawler built for the job.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 27058, 'keywords': None, 'pathname': '/guides', 
'resource_type': 'html', 'title': 'Spider Guides', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider Guides[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Spider Guides==========Learn how to crawl and scrape websites easily.(4) Total Guides* [ Spider v1 Logo Spider Platform ---------- How to use the platform to collect data from the internet fast, affordable, and unblockable. ](/guides/spider)* [ Spider v1 Logo Spider API ---------- How to use the Spider API to curate data from any source blazing fast. The most advanced crawler that handles all workloads of all sizes. ](/guides/spider-api)* [ Spider v1 Logo Extract Contacts ---------- Get contact information from any website in real time with AI. The only way to accurately get dynamic information from websites. ](/guides/pipelines-extract-contacts)* [ Spider v1 Logo Website Archiving ---------- The programmable time machine that can store pages and all assets for easy website archiving. ](/guides/website-archiving)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='b10c6402-bc35-4fec-b97c-fa30bde54ce8', embedding=None, metadata={'description': 'Complete reference documentation for the Spider API. 
Includes code snippets and examples for quickly getting started with the system.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 195426, 'keywords': None, 'pathname': '/docs/api', 'resource_type': 'html', 'title': 'Spider API Reference', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/docs*_*api.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider API Reference[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)API Reference==========The Spider API is based on REST. Our API is predictable, returns [JSON-encoded](http://www.json.org/) responses, uses standard HTTP response codes, authentication, and verbs. Set your API secret key in the `authorization` header to commence. You can use the `content-type` header with `application/json`, `application/xml`, `text/csv`, and `application/jsonl` for shaping the response.The Spider API supports multi domain actions. You can work with multiple domains per request by adding the urls comma separated.The Spider API differs for every account as we release new versions and tailor functionality. You can add `v1` before any path to pin to the version.Just getting started?----------Check out our [development quickstart](/guides/spider-api) guide.Not a developer?----------Use Spiders [no-code options or apps](/guides/spider) to get started with Spider and to do more with your Spider account no code required.Base UrlJSONCopy```https://api.spider.cloud```Crawl websites==========Start crawling a website(s) to collect resources.POST https://api.spider.cloud/crawlRequest body* url\xa0required\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. 
Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{"*":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ "/docs/colors": 10, "/docs/": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\_agent\xa0string ---------- Add a custom HTTP user agent to the request. 
Set Example* store\\_data\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\_config\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`"prompt"` to chain steps. Set Example* fingerprint\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\_format\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\_enabled\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\_selector\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\_resources\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\_timeout\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\_in\\_background\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. 
Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/crawl\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { "content": "<html>...", "error": null, "status": 200, "url": "http://www.example.com" }, // more content...]```Crawl websites get links==========Start crawling a website(s) to collect links found.POST https://api.spider.cloud/linksRequest body* url\xa0required\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{"*":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ "/docs/colors": 10, "/docs/": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\xa0string ---------- Add HTTP cookies to use for request. 
Set Example* stealth\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\_agent\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\_data\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\_config\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`"prompt"` to chain steps. Set Example* fingerprint\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\_format\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. 
Raw* proxy\\_enabled\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\_selector\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\_resources\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\_timeout\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\_in\\_background\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/links\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { "content": "", "error": null, "status": 200, "url": "http://www.example.com" }, // more content...]```Screenshot websites==========Start taking screenshots of website(s) to collect images to base64 or binary.POST https://api.spider.cloud/screenshotRequest bodyGeneralSpecific* url\xa0required\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. 
Crawl DepthSet Example* cache\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{"*":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ "/docs/colors": 10, "/docs/": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\_agent\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\_data\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\_config\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`"prompt"` to chain steps. 
Set Example* fingerprint\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\_format\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\_enabled\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\_selector\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\_resources\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\_timeout\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\_in\\_background\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. 
Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/screenshot\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { "content": "base64...", "error": null, "status": 200, "url": "http://www.example.com" }, // more content...]```Pipelines----------Create powerful workflows with our pipeline API endpoints. Use AI to extract contacts from any website or filter links with prompts with ease.Crawl websites and extract contacts==========Start crawling a website(s) to collect all contacts found leveraging AI.POST https://api.spider.cloud/pipeline/extract-contactsRequest bodyGeneralSpecific* url\xa0required\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{"*":1}` for only crawling the root page. 
The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ "/docs/colors": 10, "/docs/": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\_agent\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\_data\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\_config\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`"prompt"` to chain steps. Set Example* fingerprint\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. 
Set Example* readability\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\_format\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\_enabled\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\_selector\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\_resources\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\_timeout\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\_in\\_background\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/pipeline/extract-contacts\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { "content": [{ "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker"}, ...], "error": null, "status": 200, "url": "http://www.example.com" }, // more content...]```Label website==========Crawl a website and accurately categorize it using AI.POST https://api.spider.cloud/pipeline/labelRequest bodyGeneralSpecific* url\xa0required\xa0string ---------- The URI resource to crawl. 
This can be a comma split list for multiple urls. Test Url* request\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{"*":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ "/docs/colors": 10, "/docs/": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. 
Set Example* subdomains\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\_agent\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\_data\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\_config\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`"prompt"` to chain steps. Set Example* fingerprint\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\_format\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\_enabled\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\_selector\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\_resources\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\_timeout\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\_in\\_background\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. 
Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/pipeline/label\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { "content": ["Government"], "error": null, "status": 200, "url": "http://www.example.com" }, // more content...]```Crawl State==========Get the state of the crawl for the domain.POST https://api.spider.cloud/crawl/statusRequest body* url\xa0required\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test UrlShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}response = requests.post(\'https://api.spider.cloud/crawl/status\', headers=headers)print(response.json())```ResponseCopy``` { "content": { "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "example.com", "url": "https://example.com/", "links":1, "credits_used": 3, "mode":2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "info", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": "" }, "error": null, "status": 200, "url": "http://www.example.com" }```Credits Available==========Get the remaining credits available.GET https://api.spider.cloud/credits* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}response = requests.post(\'https://api.spider.cloud/credits\', headers=headers)print(response.json())```ResponseCopy```{ "credits": 52566 }```[API](/docs/api) 
[Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='44b350c3-f907-4767-84ec-a73fe59c190c', embedding=None, metadata={'description': 'End User License Agreement for the Spiderwebai and the spider project.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 20123, 'keywords': None, 'pathname': '/eula', 'resource_type': 'html', 'title': 'EULA', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/eula.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='EULA[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)End User License Agreement==========Our end user license agreement may change from time to time as we build out the software.Right to Ban----------Part of making sure the Spider is being used for the right purpose we will not allow malicious acts to be done with the system. If we find that you are using the tool to hack, crawl illegal pages, porn, or anything that falls into this line will be banned from the system. You can reach out to us to weigh out your reasons on why you should not be banned.License----------You can use the API and service to build ontop of. Replicating the features and re-selling the service is not allowed. We do not provide any custom license for the platform and encourage users to use our system to handle any crawling, scraping, or data curation needs for speed and cost effectiveness.### Adjustments to Plans ###The software is very new and while we figure out what we can charge to maintain the systems the plans may change. 
We will send out a notification of the changes in our [Discord](https://discord.gg/5bDPDxwTn3) or Github. For the most part plans will increase drastically with things set to scale costs that allow more usage for everyone. Spider is a product of[A11yWatch LLC](https://a11ywatch.com) the web accessibility tool. The crawler engine of Spider powers the curation for A11yWatch allowing auditing websites accessibility compliance extremely fast.#### Contact ####For information about how to contact Spider, please reach out to email below.[support@spider.cloud](mailto:support@spider.cloud)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='445c0c76-bfd5-4f89-a439-fbdeb8077a4c', embedding=None, metadata={'description': 'Spider is the fastest web crawler written in Rust. The Cloud version is a hosted version of open-source project.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 139080, 'keywords': None, 'pathname': '/about', 'resource_type': 'html', 'title': 'About', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/about.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='About[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) About==========Spider is the fastest web crawler written in Rust. The Cloud version is a hosted version of open-source project. Spider Features----------Our features that facilitate website scraping and provide swift insights in one platform. 
Deliver astonishing results using our powerful API.### Fast Unblockable Scraping ###When it comes to speed, the Spider project is the fastest web crawler available to the public. Utilize the foundation of open-source tools and make the most of your budget to scrape content effectively.Collecting Data Logo### Gain Website Insights with AI ###Enhance your crawls with AI to obtain relevant information fast from any website.AI Search### Extract Data Using Webhooks ###Set up webhooks across your websites to deliver the desired information anywhere you need.News Logo[A11yWatch](https://a11ywatch.com)maintains the project and the hosting for the service.[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='1a2d63a5-0315-4c5b-8fed-8ac460b82cc7', embedding=None, metadata={'description': 'Add the amount of credits you want to purchase for scraping the internet with AI and LLM data curation abilities fast.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 23083, 'keywords': None, 'pathname': '/credits/new', 'resource_type': 'html', 'title': 'Purchase Spider Credits', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/credits*_*new.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Purchase Spider Credits[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Add credits==========Add credits to start crawling any website today.|Default| Features | Amount ||-------|--------------------|--------------------||Default| Scraping Websites |$0.03 / gb 
bandwidth|| Extra | Premium Proxies |$0.01 / gb bandwidth|| Extra |Javascript Rendering|$0.01 / gb bandwidth|| Extra | Data Storage | $0.30 / gb month || Extra | AI Chat | $0.01 input/output |[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='6701b47a-0000-4111-8b5b-c77b01937a7d', embedding=None, metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 101750, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider - Fastest Web Crawler[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)The World\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/crawl\', headers=headers, json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 5secs ###To crawl 200 pages### 21x ###Faster than FireCrawl### 150x ###Faster than Apify Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.[See framework benchmarks 
](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using the latest AI models.### Best crawler for LLMs ###Don\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labeling[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='91b98a80-7112-4837-8389-cb78221b254c', embedding=None, metadata={'description': 'Get contact information from any website in real time with AI. 
The only way to accurately get dynamic information from websites.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 25891, 'keywords': None, 'pathname': '/guides/pipelines-extract-contacts', 'resource_type': 'html', 'title': 'Guides - Extract Contacts', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*pipelines-extract-contacts.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Extract Contacts[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Extract Contacts==========Contents----------* [Seamless extracting any contact any website](#seamless-extracting-any-contact-any-website)* [UI (Extracting Contacts)](#ui-extracting-contacts)* [API Extracting Usage](#api-extracting-usage) * [API Extracting Example](#api-extracting-example) * [Pipelines Combo](#pipelines-combo)Seamless extracting any contact any website----------Extracting contacts from a website used to be a very difficult challenge involving many steps that would change often. The challenges typically faced involve being able to get the data from a website without being blocked and setting up query selectors for the information you need using javascript. This would often break in two folds - the data extracting with a correct stealth technique or the css selector breaking as they update the website HTML code. Now we toss those two hard challenges away - one of them spider takes care of and the other the advancement in AI to process and extract information.UI (Extracting Contacts)----------You can use the UI on the dashboard to extract contacts after you crawled a page. Go to the page youwant to extract and click on the horizontal dropdown menu to display an option to extract the contact.The crawl will get the data first to see if anything new has changed. 
Afterwards if a contact was found usually within 10-60 seconds you will get a notification that the extraction is complete with the data.![Extracting contacts with the spider app](/img/app/extract-contacts.png)After extraction if the page has contact related data you can view it with a grid in the app.![The menu displaying the found contacts after extracting with the spider app](/img/app/extract-contacts-found.png)The grid will display the name, email, phone, title, and host(website found) of the contact(s).![Grid display of all the contact information found for the web page](/img/app/extract-contacts-grid.png)API Extracting Usage----------The endpoint `/pipeline/extract-contacts` provides the ability to extract all contacts from a website concurrently.### API Extracting Example ###To extract contacts from a website you can follow the example below. All params are optional except `url`. Use the `prompt` param to adjust the way the AI handles the extracting. If you use the param `store_data` or if the website already exist in the dashboard the contact data will be saved with the page.```import requests, os, jsonheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":1,"url":"http://www.example.com/contacts", "model": "gpt-4-1106-preview", "prompt": "A custom prompt to tailor the extracting."}response = requests.post(\'https://api.spider.cloud/crawl/pipeline/extract-contacts\', headers=headers, json=json_data, stream=True)for line in response.iter_lines(): if line: print(json.loads(line))```### Pipelines Combo ###Piplines bring a whole new entry to workflows for data curation, if you combine the API endpoints to only use the extraction on pages you know may have contacts can save credits on the system. One way would be to perform gathering all the links first with the `/links` endpoint. 
After getting the links for the pages use `/pipeline/filter-links` with a custom prompt that can use AI to reduce the noise of the links to process before `/pipline/extract-contacts`.Loading graph...Written on: 2/1/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='5e7ade0d-0a50-46de-8116-72ee5dca0b20', embedding=None, metadata={'description': 'How to use the Spider API to curate data from any source blazing fast. The most advanced crawler that handles all workloads of all sizes.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 24752, 'keywords': None, 'pathname': '/guides/spider-api', 'resource_type': 'html', 'title': 'Guides - Spider API', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*spider-api.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Spider API[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Getting started Spider API==========Contents----------* [API built to scale](#api-built-to-scale)* [API Usage](#api-usage)* [Crawling One Page](#crawling-one-page)* [Crawling Multiple Pages](#crawling-multiple-pages) * [Planet Scale Crawling](#planet-scale-crawling) * [Automatic Configuration](#automatic-configuration)API built to scale----------Welcome to our cutting-edge web crawler SaaS, renowned for its unparalleled speed.Our platform is designed to effortlessly manage thousands of requests per second, thanks to our elastically scalable system architecture and the Open-Source 
[spider](https://github.com/spider-rs/spider) project. We deliver consistent latency times ensuring swift processing for all responses.For an in-depth understanding of the request parameters supported, we invite you to explore our comprehensive API documentation. At present, we do not provide client-side libraries, as our API has been crafted with simplicity in mind for straightforward usage. However, we are open to expanding our offerings in the future to enhance user convenience.Dive into our [documentation]((/docs/api)) to get started and unleash the full potential of our web crawler today.API Usage----------Getting started with the API is simple and straight forward. After you get your [secret key](/api-keys)you can access our instance directly. We have one main endpoint `/crawl` that handles all things relatedto data curation. The crawler is highly configurable through the params to fit all needs.Crawling One Page----------Most cases you probally just want to crawl one page. Even if you only need one page, our system performs fast enough to lead the race.The most straight forward way to make sure you only crawl a single page is to set the [budget limit](./account/settings) with a wild card value or `*` to 1.You can also pass in the param `limit` in the JSON body with the limit of pages.Crawling Multiple Pages----------When you crawl multiple pages, the concurrency horsepower of the spider kicks in. You might wonder why and how one request may take (x)ms to come back, and 100 requests take about the same time! That’s because the built-in isolated concurrency allows for crawling thousands to millions of pages in no time. It’s the only current solution that can handle large websites with over 100k pages within a minute or two (sometimes even in a blink or two). 
By default, we do not add any limits to crawls unless specified.### Planet Scale Crawling ###If you plan on processing crawls that have over 200 pages, we recommend streaming the request from the client instead of parsing the entire payload once finished. We have an example of this with Python on the API docs page, also shown below.```import requests, os, jsonheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":250,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/crawl/crawl\', headers=headers, json=json_data, stream=True)for line in response.iter_lines(): if line: print(json.loads(line))```#### Automatic Configuration ####Spider handles automatic concurrency handling and ip rotation to make it simple to curate data.The more credits you have or usage available allows for a higher concurrency limit.Written on: 1/3/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='08e5f1d6-4ae7-4b68-ab96-4b6a3768e88c', embedding=None, metadata={'description': 'The programmable time machine that can store pages and all assets for easy website archiving.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 18970, 'keywords': None, 'pathname': '/guides/website-archiving', 'resource_type': 'html', 'title': 'Guides - Website Archiving', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*website-archiving.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Website Archiving[Spider v1 
Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Website Archiving==========With Spider you can easily backup or capture a website at any point in time.Enable Full Resource storing in the settings or website configuration to get a 1:1 copy of any websitelocally.Time Machine----------Time machine is storing data at a certain point of a time. Spider brings this to you with one simple configuration.After running the crawls you can simply download the data. This can help store assets incase the code is lost orversion control is removed.Written on: 2/7/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='024cb27e-21d2-49a5-8a1a-963e72038421', embedding=None, metadata={'description': 'How to use the platform to collect data from the internet fast, affordable, and unblockable.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 24666, 'keywords': None, 'pathname': '/guides/spider', 'resource_type': 'html', 'title': 'Guides - Spider Platform', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*spider.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Spider Platform[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Getting started collecting data with Spider==========Contents----------* [Data Curation](#data-curation) * [Crawling (Website)](#crawling-website) * [Crawling (API)](#crawling-api)* [Crawl Configuration](#crawl-configuration) * [Proxies](#proxies) * [Headless 
Browser](#headless-browser) * [Crawl Budget Limits](#crawl-budget-limits)* [Crawling and Scraping Websites](#crawling-and-scraping-websites) * [Transforming Data](#transforming-data) * [Leveraging Open Source](#leveraging-open-source)* [Subscription and Spider Credits](#subscription-and-spider-credits)Data Curation----------Collecting data with Spider can be fast and rewarding if done with some simple preliminary steps.Use the dashboard to collect data seamlessly across the internet with scheduled updates.You have two main ways of collecting data using Spider. The first and simplest is to use the UI available for scraping.The alternative is to use the API to programmatically access the system and perform actions.### Crawling (Website) ###1. Register or login to your account using email or Github.2. Purchase [credits](/credits/new) to kickstart crawls with `pay-as-you-go` go after credits deplete.3. Configure crawl [settings](/account/settings) to fit workflows that you need.4. Navigate to the [dashboard](/) and enter a website url or ask a question to get a url that should be crawled.5. Crawl the website and export/download the data as needed.### Crawling (API) ###1. Register or login to your account using email or Github.2. Purchase [credits](/credits/new) to kickstart crawls with `pay-as-you-go` after credits deplete.3. Configure crawl [settings](/account/settings) to fit workflows that you need.4. Navigate to [API keys](/api-keys) and create a new secret key.5. Go to the [API docs](/docs/api) page to see how the API works and perform crawls with code examples.Crawl Configuration----------Configuration your account for how you would like to crawl can help save costs or effectiveness of the content. Some of the configurations include setting Premium Proxies, Headless Browser Rendering, Webhooks, and Budgeting.### Proxies ###Using proxies with our system is straight forward. 
Simple check the toggle on if you want all request to use a proxy to increase the success of not being blocked.![Proxies example app screenshot.](/img/app/proxy-setting.png)### Headless Browser ###If you want pages that require JavaScript to be executed the headless browser config is for you. Enabling will run all request through a real Chrome Browser for JavaScript required rendering pages.![Headless browser example app screenshot.](/img/app/headless-browser.png)### Crawl Budget Limits ###One of the key things you may need to do before getting into the crawl is setting up crawl-budgets.Crawl budgets allows you to determine how many pages you are going to crawl for a website.Determining the budget will save you costs when dealing with large websites that you only want certain data points from. The example below shows adding a asterisk (\\*) to determine all routes with a limit of 50 pages maximum. The settings can be overwritten by the website configuration or parameters if using the API.![Crawl budget example screenshot](/img/app/edit-budget.png)Crawling and Scraping Websites----------Collecting data can be done in many ways and for many reasons. Leveraging our state-of-the-art technology allows you to create fast workloads that can process content from multiple locations. At the time of writing, we have started to focus on our data processing API instead of the dashboard. The API has much more flexibility than the UI for performing advanced workloads like batching, formatting, and so on.![Dashboard UI for Spider displaying data collecting from www.napster.com, jeffmendez.com, rsseau.rs, and www.drake.com](/img/app/ui-crawl.png)### Transforming Data ###The API has more features for gathering the content in different formats and transforming the HTML as needed. You can transform the content from HTML to Markdown and feed it to a LLM for better handling the learning aspect. The API is the first class citizen for the application. 
The UI will have the features provided by the API eventually as the need arises.#### Leveraging Open Source ####One of the reasons Spider is the ultimate data-curation service for scraping is from the power of Open-Source. The core of the engine is completly available on [Github](https://github.com/spider-rs/spider) under [MIT](https://opensource.org/license/mit/) to show what is in store. We are constantly working on the crawler features including performance with plans to maintain the project for the long run.Subscription and Spider Credits----------The platform allows purchasing credits that gives you the ability to crawl at any time.When you purchase credits a crawl subscription is created that allows you to continue to usethe platform when your credits deplete. The limits provided coralate with the amount of creditspurchased, an example would be if you bought $5 in credits you would have about $40 in spending limit - $10 in credit gives $80 and so on.The highest purchase of credits directly determines how much is allowed on the platform. You can view your usage and credits on the [usage limits page](/account/usage).Written on: 1/2/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='44bff527-c7f3-4346-a2f8-1454c52e1b01', embedding=None, metadata={'description': 'Generate API keys that allow access to the system programmatically anywhere. 
Full management access for your Spider API journey.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 28770, 'keywords': None, 'pathname': '/api-keys', 'resource_type': 'html', 'title': 'API Keys Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/api-keys.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="API Keys Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) API Keys==========Generate API keys that allow access to the system programmatically anywhere. Full management access for your Spider API journey. Key Management----------Your secret API keys are listed below. Please note that we do not display your secret API keys again after you generate them.Do not share your API key with others, or expose it in the browser or other client-side code. In order to protect the security of your account, Spider may also automatically disable any API key that we've found has leaked publicly.Filter Name...Columns| Name |Key|Created|Last Used| ||-----------|---|-------|---------|---||No results.| | | | |0 of 0 row(s) selected.PreviousNext[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='e577c57a-2376-452f-8c39-04d1e284595c', embedding=None, metadata={'description': 'Explore your usage and set limits that work with your budget.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 21195, 'keywords': None, 'pathname': '/account/usage', 'resource_type': 'html', 'title': 'Usage - Spider', 'url': 
'48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/account*_*usage.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Usage - Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) Usage limit==========Below you'll find a summary of usage for your account. The data may be delayed up to 5 minutes.Credits----------### Pay as you go ###### Approved usage limit ### The maximum usage Spider allows for your organization each month. Ask for increase.### Set a monthly budget ###When your organization reaches this usage threshold each month, subsequent requests will be rejected. Data may be deleted if payments are rejected.[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), Document(id_='e3eb1e3c-5080-4590-94e8-fd2ef4f6d3c6', embedding=None, metadata={'description': 'Adjust your spider settings to adjust your crawl settings.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 18322, 'keywords': None, 'pathname': '/account/settings', 'resource_type': 'html', 'title': 'Settings - Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/account*_*settings.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Settings - Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) 
[Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
See Spider for guides and documentation.
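Using Browserbase Reader 🅱️
Browserbase is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and captcha solving.
Installation and Setup
- Get an API key and project ID from browserbase.com and set them as environment variables (BROWSERBASE_API_KEY, BROWSERBASE_PROJECT_ID).
- Install the Browserbase SDK: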
In [ ]:
%pip install browserbase
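Before creating the reader, it can help to confirm that both environment variables from the setup step are visible to the notebook kernel. The following is a minimal sketch: the variable names come from the setup step, while the helper name `missing_browserbase_vars` is hypothetical and not part of any SDK.

```python
import os

# Variable names taken from the setup step; the helper itself is illustrative only.
REQUIRED_VARS = ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID")


def missing_browserbase_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_browserbase_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("Browserbase credentials found.")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the real environment.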
In [ ]:
from llama_index.readers.web import BrowserbaseWebReader
In [ ]:
reader = BrowserbaseWebReader()
docs = reader.load_data(
    urls=[
        "https://example.com",
    ],
    # Text mode
    text_content=False,
)
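After loading, a quick preview of each document helps confirm that the page content came back as expected. This sketch assumes only that each loaded object exposes a `.text` attribute, as LlamaIndex documents do. The `preview` helper and the sample object are hypothetical stand-ins, so the block runs without network access.

```python
from types import SimpleNamespace


def preview(docs, width=80):
    """Return one line per document: index, character count, and the first `width` characters."""
    lines = []
    for i, doc in enumerate(docs):
        text = " ".join(doc.text.split())  # collapse newlines and repeated spaces
        lines.append(f"[{i}] {len(doc.text)} chars: {text[:width]}")
    return lines


# Stand-in for the `docs` list returned by a reader; replace with the real list.
sample = [SimpleNamespace(text="<html>\n  <body>Example Domain</body>\n</html>")]
print("\n".join(preview(sample)))
```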
Using FireCrawl Reader 🔥
Firecrawl is an API that turns entire websites into clean, accessible Markdown.
Using Firecrawl to gather an entire website
In [ ]:
from llama_index.readers.web import FireCrawlWebReader
In [ ]:
# Use Firecrawl to fetch content from a website
firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="scrape",  # "scrape" fetches a single page; use "crawl" to gather a whole site
    params={"additional": "parameters"},  # Optional additional parameters
)
# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="http://paulgraham.com/")
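Hard-coding an API key in a notebook makes it easy to leak. One alternative is to read the key from the environment and choose the mode in one place. This is a minimal sketch: the variable name `FIRECRAWL_API_KEY` and the helper `firecrawl_reader_kwargs` are assumptions made for illustration, while the `api_key` and `mode` arguments are the ones used in the cell above.

```python
import os


def firecrawl_reader_kwargs(whole_site, environ=os.environ):
    """Build constructor arguments for FireCrawlWebReader.

    Assumes the key is stored in FIRECRAWL_API_KEY (a name chosen for this sketch).
    """
    api_key = environ.get("FIRECRAWL_API_KEY")
    if not api_key:
        raise ValueError("FIRECRAWL_API_KEY is not set")
    # "crawl" gathers a whole site; "scrape" fetches a single page.
    return {"api_key": api_key, "mode": "crawl" if whole_site else "scrape"}


# Demonstration with a fake environment, so no real key is needed.
print(firecrawl_reader_kwargs(whole_site=True, environ={"FIRECRAWL_API_KEY": "demo"}))

# With the real environment and the import from the cell above:
# firecrawl_reader = FireCrawlWebReader(**firecrawl_reader_kwargs(whole_site=False))
```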
In [ ]:
index = SummaryIndex.from_documents(documents)
In [ ]:
# Tip: set the logging level to DEBUG for more detailed output
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
In [ ]:
display(Markdown(f"<b>{response}</b>"))
Using Firecrawl for a single page
In [ ]:
# Initialize FireCrawlWebReader with your API key and the desired mode
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader

firecrawl_reader = FireCrawlWebReader(
    api_key="<your_api_key>",  # Replace with your actual API key from https://www.firecrawl.dev/
    mode="scrape",  # "scrape" fetches a single page; "crawl" gathers a whole site
    params={"additional": "parameters"},  # Optional additional parameters
)
# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="http://paulgraham.com/worked.html")
In [ ]:
index = SummaryIndex.from_documents(documents)
In [ ]:
# Tip: set the logging level to DEBUG for more detailed output
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
In [ ]:
display(Markdown(f"<b>{response}</b>"))
Using TrafilaturaWebReader
In [ ]:
from llama_index.readers.web import TrafilaturaWebReader
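If this import fails with ModuleNotFoundError, the llama-index-readers-web package is most likely missing from the active kernel. The sketch below checks for the module before importing it. It uses only the standard library, and the helper name `module_available` is hypothetical.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located without importing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package of a dotted name is missing or fails to import.
        return False


if not module_available("llama_index.readers.web"):
    print("Missing package: run `%pip install llama-index-readers-web` in this kernel.")
```

Installing into the same interpreter that runs the kernel matters here, which is why the notebook uses `%pip` rather than a shell-level `pip`.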
In [ ]:
documents = TrafilaturaWebReader().load_data(
    ["http://paulgraham.com/worked.html"]
)
In [ ]:
index = SummaryIndex.from_documents(documents)
In [ ]:
# Tip: set the logging level to DEBUG for more detailed output
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
In [ ]:
display(Markdown(f"<b>{response}</b>"))
Using RssReader
In [ ]:
from llama_index.core import SummaryIndex
from llama_index.readers.web import RssReader

documents = RssReader().load_data(
    ["https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"]
)
index = SummaryIndex.from_documents(documents)
# Tip: set the logging level to DEBUG for more detailed output
query_engine = index.as_query_engine()
response = query_engine.query("What happened in the news today?")
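To make the shape of the input concrete, the sketch below pulls the title, link, and description out of each item in an RSS 2.0 feed using only the standard library. It is not how RssReader is implemented. The sample feed and the `parse_items` helper are invented for illustration and need no network access.

```python
import xml.etree.ElementTree as ET

# A tiny invented RSS 2.0 feed used only for this illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict per <item>, with its title, link, and description text."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


for entry in parse_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```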
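Using ScrapFly
ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It extracts web page data into LLM-friendly markdown or text. Install the ScrapFly Python SDK using pip:
pip install scrapfly-sdk
Here is the basic usage of ScrapflyReader: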
In [ ]:
from llama_index.readers.web import ScrapflyReader

# Initialize ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)
# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)
ScrapflyReader also accepts a ScrapeConfig object for customizing the scrape request. See the documentation for full feature details and their API parameters: https://scrapfly.io/docs/scrape-api/getting-started
In [ ]:
from llama_index.readers.web import ScrapflyReader

# Initialize ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your ScrapFly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)
scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and anti-bot solutions, such as Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Custom JavaScript code for the headless browser to execute
}
# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)
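The configuration above leaves "js" as an empty string. Whether ScrapFly accepts empty placeholder values is not verified here, so one defensive option is to drop unset entries before passing the dict as `scrape_config`. The helper `prune_scrape_config` below is a hypothetical sketch, not part of the ScrapFly SDK or LlamaIndex.

```python
def prune_scrape_config(config):
    """Return a copy of `config` without entries whose value is None or an empty string."""
    return {key: value for key, value in config.items() if value not in (None, "")}


example_config = {
    "asp": True,
    "render_js": True,
    "country": "us",
    "js": "",  # empty placeholder, removed by the helper
}
print(prune_scrape_config(example_config))
```

Values such as `False` or `0` are kept, since they are meaningful settings rather than placeholders.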