Spider Agent

Prerequisites

Python 3.10 or higher
PraisonAI Agents package installed
scrapy package installed

Spider Tools

Use Spider Tools to crawl and scrape web content with AI agents.

Install Dependencies

First, install the required packages:

pip install praisonaiagents scrapy

Import Components

Import the necessary components:

from praisonaiagents import Agent, Task, AgentTeam
from praisonaiagents import scrape_page, extract_links, crawl, extract_text

Create Agent

Create a web scraping agent:

spider_agent = Agent(
    name="WebSpider",
    role="Web Scraping Specialist",
    goal="Extract and analyze web content efficiently.",
    backstory="Expert in web scraping and content extraction.",
    tools=[scrape_page, extract_links, crawl, extract_text],
    reflection=False
)

Define Task

Define the scraping task:

scraping_task = Task(
    description="Scrape product information from an e-commerce website.",
    expected_output="Structured product data with prices and descriptions.",
    agent=spider_agent,
    name="product_scraping"
)

Run Agent

Initialize and run the agent:

agents = AgentTeam(
    agents=[spider_agent],
    tasks=[scraping_task],
    process="sequential"
)
agents.start()

Understanding Spider Tools

What are Spider Tools?

Spider Tools provide web scraping capabilities for AI agents:

Page scraping and downloading
Content extraction and filtering
Link discovery and crawling
HTML parsing and cleaning
Data structuring and formatting

Key Components

Create specialized scraping agents:

Agent(tools=[scrape_page, extract_content, crawl_links, parse_html, structure_data])

Spider Task

Define scraping tasks:

Task(description="scraping_query")

Process Types

Sequential or parallel processing:

process="sequential"

Scraping Options

Customize scraping parameters:

max_depth=2, delay=1

Examples

Basic Web Scraping Agent

from praisonaiagents import Agent, Task, AgentTeam
from praisonaiagents import scrape_page, extract_links, crawl, extract_text

# Create search agent
agent = Agent(
    name="AsyncWebCrawler",
    role="Web Scraping Specialist",
    goal="Extract and analyze web content efficiently.",
    backstory="Expert in web scraping and content extraction.",
    tools=[scrape_page, extract_links, crawl, extract_text],
    reflection=False
)

# Define task
task = Task(
    description="Scrape and analyze the content from 'https://example.com'",
    expected_output="Extracted content with links and text analysis",
    agent=agent,
    name="web_scraping"
)

# Run agent
agents = AgentTeam(
    agents=[agent],
    tasks=[task],
    process="sequential"
)
agents.start()

Advanced Scraping with Multiple Agents

# Create scraping agent
scraper_agent = Agent(
    name="Scraper",
    role="Content Scraper",
    goal="Extract web content systematically.",
    tools=[scrape_page, crawl_links, parse_html],
    reflection=False
)

# Create analysis agent
analysis_agent = Agent(
    name="Analyzer",
    role="Content Analyst",
    goal="Analyze and structure scraped content.",
    backstory="Expert in data analysis and organization.",
    tools=[extract_content, structure_data],
    reflection=False
)

# Define tasks
scraping_task = Task(
    description="Scrape product data from multiple pages.",
    agent=scraper_agent,
    name="product_scraping"
)

analysis_task = Task(
    description="Analyze and structure the scraped product data.",
    agent=analysis_agent,
    name="data_analysis"
)

# Run agents
agents = AgentTeam(
    agents=[scraper_agent, analysis_agent],
    tasks=[scraping_task, analysis_task],
    process="sequential"
)
agents.start()

Best Practices

Agent Configuration

Configure agents with clear scraping focus:

Agent(
    name="WebScraper",
    role="Content Extraction Specialist",
    goal="Extract web content ethically and efficiently",
    tools=[scrape_page, extract_content, crawl_links, parse_html, structure_data]
)

Task Definition

Define specific scraping objectives:

Task(
    description="Extract product prices and descriptions from e-commerce site",
    expected_output="Structured product database"
)

Built-in URL Safety

Spider tools (scrape_page, extract_links, crawl, extract_text) refuse to fetch dangerous URLs before any network request is made. You don’t need to wrap them in a custom validator.

What gets refused

URL pattern	Example	Why
Non-`http`/`https` schemes	`file:///etc/passwd`, `gopher://x`	Only web protocols are allowed
Loopback	`http://127.0.0.1/`, `http://localhost/`	Blocks access to the local machine
Private / reserved IPs	`http://10.0.0.5/`, `http://192.168.1.1/`	Blocks internal network access
Link-local	`http://169.254.169.254/`	Blocks cloud metadata services
Internal TLDs	`http://intranet.local/`, `http://svc.internal/`	Blocks corporate internal hosts
Backslash in URL	`http://127.0.0.1:6666\@1.1.1.1`	SSRF-smuggling: `urlparse` says `1.1.1.1`, `requests` actually hits `127.0.0.1`
ASCII control chars (`< 0x20` or `0x7f`)	`http://example.com\x00.evil.com`, `http://example.com\r\n.evil.com`	CRLF / NUL injection in the authority
Non-string input	`None`, `123`	Defensive — returns `False` instead of raising

The backslash and control-character rejections (the last two rows above) were added in PraisonAI #1578 to close an SSRF bypass where urllib.parse.urlparse and the HTTP client (requests / httpx) disagreed on the destination host.

What it looks like to your agent

When the validator refuses a URL, the tool returns an error dict instead of fetching:

from praisonaiagents.tools import scrape_page

# Smuggled URL — looks like 1.1.1.1, would actually hit 127.0.0.1
scrape_page("http://127.0.0.1:6666\\@1.1.1.1")
# {'error': 'Invalid or potentially dangerous URL: http://127.0.0.1:6666\\@1.1.1.1'}

# Loopback
scrape_page("http://localhost/admin")
# {'error': 'Invalid or potentially dangerous URL: http://localhost/admin'}

# Cloud metadata endpoint
scrape_page("http://169.254.169.254/latest/meta-data/")
# {'error': 'Invalid or potentially dangerous URL: http://169.254.169.254/latest/meta-data/'}

# Normal public URL — works as expected
scrape_page("https://example.com/")
# {'url': 'https://example.com/', 'status_code': 200, 'content': '...', ...}

This validation is always on for the bundled spider tools. It runs on every URL passed to scrape_page, extract_links, crawl, and extract_text. There is no flag to disable it, and it does not require enable_security().

Common Patterns

Web Scraping Pipeline

# Scraping agent
scraper = Agent(
    name="Scraper",
    role="Web Scraper",
    tools=[scrape_page, crawl_links, parse_html]
)

# Processing agent
processor = Agent(
    name="Processor",
    role="Data Processor",
    tools=[extract_content, structure_data]
)

# Define tasks
scrape_task = Task(
    description="Scrape website content",
    agent=scraper
)

process_task = Task(
    description="Process scraped content",
    agent=processor
)

# Run workflow
agents = AgentTeam(
    agents=[scraper, processor],
    tasks=[scrape_task, process_task]
)

Getting Started

Core Concepts

Guides

Features

Models

Databases

Observability

Memory

Knowledge

RAG

Persistence

Tools

Other Features

Developers

Configuration

Best Practices

Getting Started (No Code)

Spider Agent

Spider Tools

Understanding Spider Tools

What are Spider Tools?

Key Components

Spider Agent

Spider Task

Process Types

Scraping Options

Examples

Basic Web Scraping Agent

Advanced Scraping with Multiple Agents

Best Practices

Built-in URL Safety

What gets refused

What it looks like to your agent

Common Patterns

Web Scraping Pipeline

Getting Started

Core Concepts

Guides

Features

Models

Databases

Observability

Memory

Knowledge

RAG

Persistence

Tools

Other Features

Developers

Configuration

Best Practices

Getting Started (No Code)

Documentation Index

​Spider Tools

​Understanding Spider Tools

What are Spider Tools?

​Key Components

Spider Agent

Spider Task

Process Types

Scraping Options

​Examples

​Basic Web Scraping Agent

​Advanced Scraping with Multiple Agents

​Best Practices

​Built-in URL Safety

​What gets refused

​What it looks like to your agent

​Common Patterns

​Web Scraping Pipeline

Spider Tools

Understanding Spider Tools

Key Components

Examples

Basic Web Scraping Agent

Advanced Scraping with Multiple Agents

Best Practices

Built-in URL Safety

What gets refused

What it looks like to your agent

Common Patterns

Web Scraping Pipeline