Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.praison.ai/llms.txt

Use this file to discover all available pages before exploring further.

Prerequisites
  • Python 3.10 or higher
  • PraisonAI Agents package installed
  • scrapy package installed

Spider Tools

Use Spider Tools to crawl and scrape web content with AI agents.
1

Install Dependencies

First, install the required packages:
pip install praisonaiagents scrapy
2

Import Components

Import the necessary components:
from praisonaiagents import Agent, Task, AgentTeam
from praisonaiagents import scrape_page, extract_links, crawl, extract_text
3

Create Agent

Create a web scraping agent:
spider_agent = Agent(
    name="WebSpider",
    role="Web Scraping Specialist",
    goal="Extract and analyze web content efficiently.",
    backstory="Expert in web scraping and content extraction.",
    tools=[scrape_page, extract_links, crawl, extract_text],
    reflection=False
)
4

Define Task

Define the scraping task:
scraping_task = Task(
    description="Scrape product information from an e-commerce website.",
    expected_output="Structured product data with prices and descriptions.",
    agent=spider_agent,
    name="product_scraping"
)
5

Run Agent

Initialize and run the agent:
agents = AgentTeam(
    agents=[spider_agent],
    tasks=[scraping_task],
    process="sequential"
)
agents.start()

Understanding Spider Tools

What are Spider Tools?

Spider Tools provide web scraping capabilities for AI agents:
  • Page scraping and downloading
  • Content extraction and filtering
  • Link discovery and crawling
  • HTML parsing and cleaning
  • Data structuring and formatting

Key Components

Spider Agent

Create specialized scraping agents:
Agent(tools=[scrape_page, extract_content, crawl_links, parse_html, structure_data])

Spider Task

Define scraping tasks:
Task(description="scraping_query")

Process Types

Sequential or parallel processing:
process="sequential"

Scraping Options

Customize scraping parameters:
max_depth=2, delay=1

Examples

Basic Web Scraping Agent

from praisonaiagents import Agent, Task, AgentTeam
from praisonaiagents import scrape_page, extract_links, crawl, extract_text

# Create search agent
agent = Agent(
    name="AsyncWebCrawler",
    role="Web Scraping Specialist",
    goal="Extract and analyze web content efficiently.",
    backstory="Expert in web scraping and content extraction.",
    tools=[scrape_page, extract_links, crawl, extract_text],
    reflection=False
)

# Define task
task = Task(
    description="Scrape and analyze the content from 'https://example.com'",
    expected_output="Extracted content with links and text analysis",
    agent=agent,
    name="web_scraping"
)

# Run agent
agents = AgentTeam(
    agents=[agent],
    tasks=[task],
    process="sequential"
)
agents.start()

Advanced Scraping with Multiple Agents

# Create scraping agent
scraper_agent = Agent(
    name="Scraper",
    role="Content Scraper",
    goal="Extract web content systematically.",
    tools=[scrape_page, crawl_links, parse_html],
    reflection=False
)

# Create analysis agent
analysis_agent = Agent(
    name="Analyzer",
    role="Content Analyst",
    goal="Analyze and structure scraped content.",
    backstory="Expert in data analysis and organization.",
    tools=[extract_content, structure_data],
    reflection=False
)

# Define tasks
scraping_task = Task(
    description="Scrape product data from multiple pages.",
    agent=scraper_agent,
    name="product_scraping"
)

analysis_task = Task(
    description="Analyze and structure the scraped product data.",
    agent=analysis_agent,
    name="data_analysis"
)

# Run agents
agents = AgentTeam(
    agents=[scraper_agent, analysis_agent],
    tasks=[scraping_task, analysis_task],
    process="sequential"
)
agents.start()

Best Practices

Configure agents with clear scraping focus:
Agent(
    name="WebScraper",
    role="Content Extraction Specialist",
    goal="Extract web content ethically and efficiently",
    tools=[scrape_page, extract_content, crawl_links, parse_html, structure_data]
)
Define specific scraping objectives:
Task(
    description="Extract product prices and descriptions from e-commerce site",
    expected_output="Structured product database"
)

Built-in URL Safety

Spider tools (scrape_page, extract_links, crawl, extract_text) refuse to fetch dangerous URLs before any network request is made. You don’t need to wrap them in a custom validator.

What gets refused

URL patternExampleWhy
Non-http/https schemesfile:///etc/passwd, gopher://xOnly web protocols are allowed
Loopbackhttp://127.0.0.1/, http://localhost/Blocks access to the local machine
Private / reserved IPshttp://10.0.0.5/, http://192.168.1.1/Blocks internal network access
Link-localhttp://169.254.169.254/Blocks cloud metadata services
Internal TLDshttp://intranet.local/, http://svc.internal/Blocks corporate internal hosts
Backslash in URLhttp://127.0.0.1:6666\@1.1.1.1SSRF-smuggling: urlparse says 1.1.1.1, requests actually hits 127.0.0.1
ASCII control chars (< 0x20 or 0x7f)http://example.com\x00.evil.com, http://example.com\r\n.evil.comCRLF / NUL injection in the authority
Non-string inputNone, 123Defensive — returns False instead of raising
The backslash and control-character rejections (the last two rows above) were added in PraisonAI #1578 to close an SSRF bypass where urllib.parse.urlparse and the HTTP client (requests / httpx) disagreed on the destination host.

What it looks like to your agent

When the validator refuses a URL, the tool returns an error dict instead of fetching:
from praisonaiagents.tools import scrape_page

# Smuggled URL — looks like 1.1.1.1, would actually hit 127.0.0.1
scrape_page("http://127.0.0.1:6666\\@1.1.1.1")
# {'error': 'Invalid or potentially dangerous URL: http://127.0.0.1:6666\\@1.1.1.1'}

# Loopback
scrape_page("http://localhost/admin")
# {'error': 'Invalid or potentially dangerous URL: http://localhost/admin'}

# Cloud metadata endpoint
scrape_page("http://169.254.169.254/latest/meta-data/")
# {'error': 'Invalid or potentially dangerous URL: http://169.254.169.254/latest/meta-data/'}

# Normal public URL — works as expected
scrape_page("https://example.com/")
# {'url': 'https://example.com/', 'status_code': 200, 'content': '...', ...}
This validation is always on for the bundled spider tools. It runs on every URL passed to scrape_page, extract_links, crawl, and extract_text. There is no flag to disable it, and it does not require enable_security().

Common Patterns

Web Scraping Pipeline

# Scraping agent
scraper = Agent(
    name="Scraper",
    role="Web Scraper",
    tools=[scrape_page, crawl_links, parse_html]
)

# Processing agent
processor = Agent(
    name="Processor",
    role="Data Processor",
    tools=[extract_content, structure_data]
)

# Define tasks
scrape_task = Task(
    description="Scrape website content",
    agent=scraper
)

process_task = Task(
    description="Process scraped content",
    agent=processor
)

# Run workflow
agents = AgentTeam(
    agents=[scraper, processor],
    tasks=[scrape_task, process_task]
)