spider_agent = Agent(
    name="WebSpider",
    role="Web Scraping Specialist",
    goal="Extract and analyze web content efficiently.",
    backstory="Expert in web scraping and content extraction.",
    tools=[scrape_page, extract_links, crawl, extract_text],
    reflection=False,
)
4
Define Task
Define the scraping task:
scraping_task = Task(
    description="Scrape product information from an e-commerce website.",
    expected_output="Structured product data with prices and descriptions.",
    agent=spider_agent,
    name="product_scraping",
)
Spider tools (scrape_page, extract_links, crawl, extract_text) refuse to fetch dangerous URLs before any network request is made. You don’t need to wrap them in a custom validator.
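To make the idea of refusing a URL before any network request concrete, here is a minimal sketch of a pre-flight check. It is not PraisonAI's implementation: the function name `looks_dangerous` and the exact rules are assumptions chosen for illustration, and it only inspects IP literals and the name `localhost` without resolving DNS.

```python
import ipaddress
from urllib.parse import urlparse


def looks_dangerous(url: str) -> bool:
    """Hypothetical pre-flight check; NOT the validator PraisonAI ships."""
    parsed = urlparse(url)
    # Only plain web schemes are allowed in this sketch.
    if parsed.scheme not in ("http", "https"):
        return True
    host = parsed.hostname
    if not host:
        return True
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real validator would also need to
        # consider what the hostname resolves to; this sketch does not.
        return False
    # Loopback, private, and link-local ranges (the last one covers
    # the 169.254.169.254 cloud metadata address).
    return ip.is_loopback or ip.is_private or ip.is_link_local


for candidate in (
    "http://localhost/admin",
    "http://169.254.169.254/latest/meta-data/",
    "https://example.com/",
):
    print(candidate, looks_dangerous(candidate))
```

A check written this way trusts `urlparse` to report the host the HTTP client will actually contact, so it is only as strong as the agreement between the two parsers.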
The backslash and control-character rejections (the last two rows above) were added in PraisonAI #1578 to close an SSRF bypass where urllib.parse.urlparse and the HTTP client (requests / httpx) disagreed on the destination host.
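The parser disagreement can be observed with the standard library alone. The sketch below shows which host `urllib.parse.urlparse` reports for a backslash-smuggled URL, followed by a hypothetical helper, `has_ambiguous_chars`, that rejects backslashes and ASCII control characters outright. The helper illustrates the kind of rule described here and is an assumption, not the code from the PraisonAI change.

```python
from urllib.parse import urlparse

# In Python source, "\\" is one literal backslash in the URL.
SMUGGLED = "http://127.0.0.1:6666\\@1.1.1.1"


def has_ambiguous_chars(url: str) -> bool:
    """Hypothetical rule: True if the URL contains a backslash or an
    ASCII control character (code points 0x00-0x1F and 0x7F)."""
    return any(ch == "\\" or ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url)


parsed = urlparse(SMUGGLED)
# urlparse treats everything before "@" as userinfo, so it reports the
# host after the "@". A client that treats "\" like "/" would instead
# stop the authority at the backslash and connect to 127.0.0.1:6666.
print("urlparse host:", parsed.hostname)
print("rejected by character rule:", has_ambiguous_chars(SMUGGLED))
```

Rejecting these characters up front avoids having to predict how each HTTP client interprets them.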
When the validator refuses a URL, the tool returns an error dict instead of fetching:
from praisonaiagents.tools import scrape_page

# Smuggled URL: looks like 1.1.1.1, would actually hit 127.0.0.1
scrape_page("http://127.0.0.1:6666\\@1.1.1.1")
# {'error': 'Invalid or potentially dangerous URL: http://127.0.0.1:6666\\@1.1.1.1'}

# Loopback
scrape_page("http://localhost/admin")
# {'error': 'Invalid or potentially dangerous URL: http://localhost/admin'}

# Cloud metadata endpoint
scrape_page("http://169.254.169.254/latest/meta-data/")
# {'error': 'Invalid or potentially dangerous URL: http://169.254.169.254/latest/meta-data/'}

# Normal public URL: works as expected
scrape_page("https://example.com/")
# {'url': 'https://example.com/', 'status_code': 200, 'content': '...', ...}
This validation is always on for the bundled spider tools. It runs on every URL passed to scrape_page, extract_links, crawl, and extract_text. There is no flag to disable it, and it does not require enable_security().
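Because a refused URL comes back as a dict with an `error` key rather than as an exception, calling code may want to check for that key before using the result. The sketch below shows one way to do so. `safe_fetch` and `fake_scrape_page` are hypothetical names: the stub only mimics the two return shapes shown above so that the example runs without network access, and in real use you would pass `scrape_page` instead.

```python
from typing import Any, Callable, Dict, Optional


def safe_fetch(
    scraper: Callable[[str], Dict[str, Any]], url: str
) -> Optional[Dict[str, Any]]:
    """Return the scraper's result, or None if it reported an error."""
    result = scraper(url)
    if isinstance(result, dict) and "error" in result:
        print(f"skipped {url}: {result['error']}")
        return None
    return result


def fake_scrape_page(url: str) -> Dict[str, Any]:
    """Stand-in for scrape_page; imitates only the documented dict shapes."""
    if "localhost" in url or "169.254.169.254" in url:
        return {"error": f"Invalid or potentially dangerous URL: {url}"}
    return {"url": url, "status_code": 200, "content": "..."}


blocked = safe_fetch(fake_scrape_page, "http://localhost/admin")
allowed = safe_fetch(fake_scrape_page, "https://example.com/")
print(blocked is None, allowed is not None)
```

Returning `None` for refused URLs keeps the calling loop simple; raising a custom exception would be an equally reasonable choice if refusals should stop the run.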