Prerequisites
- Python 3.10 or higher
- PraisonAI Agents package installed
- crawl4ai package installed and set up
Crawl4AI provides powerful async web crawling with JavaScript rendering, content extraction, and LLM-based data extraction. PraisonAI includes built-in Crawl4AI tools for easy integration.
Installation
pip install praisonaiagents crawl4ai
crawl4ai-setup
Setup
export OPENAI_API_KEY=your_openai_api_key
PraisonAI provides built-in crawl4ai functions that you can use directly:
import asyncio
from praisonaiagents import crawl4ai

async def main():
    result = await crawl4ai("https://example.com")
    print(result["markdown"])

asyncio.run(main())
Available Functions
| Function | Description |
|---|---|
| crawl4ai | Async crawl a URL and get markdown |
| crawl4ai_many | Crawl multiple URLs concurrently |
| crawl4ai_extract | Extract data using CSS selectors |
| crawl4ai_llm_extract | Extract data using LLM |
| crawl4ai_sync | Synchronous version of crawl4ai |
| crawl4ai_extract_sync | Synchronous CSS extraction |
Basic Usage
Simple Crawl
import asyncio
from praisonaiagents import crawl4ai

async def main():
    result = await crawl4ai("https://example.com")
    if result["success"]:
        print(f"URL: {result['url']}")
        print(f"Markdown: {result['markdown'][:500]}...")
        print(f"Links: {len(result['links'].get('internal', []))}")
    else:
        print(f"Error: {result['error']}")

asyncio.run(main())
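The key lookups in the example above can be folded into a small helper so that callers do not repeat them. The following is only a sketch: `summarize_result` is a hypothetical name, not part of PraisonAI, and it assumes the result is a plain dict with the keys used above (`success`, `url`, `markdown`, `links`, `error`). It reads them with `.get()` so a missing key does not raise.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any], preview_chars: int = 500) -> str:
    """Return a one-line summary of a crawl result dict.

    Assumes the keys used in the example above ("success", "url",
    "markdown", "links", "error"); missing keys fall back to defaults.
    """
    url = result.get("url", "<unknown url>")
    if not result.get("success"):
        return f"FAILED {url}: {result.get('error', 'no error message')}"

    markdown = result.get("markdown") or ""
    links = result.get("links") or {}
    internal = len(links.get("internal", []))
    # Flatten newlines so the preview stays on one line.
    preview = markdown[:preview_chars].replace("\n", " ")
    return (
        f"OK {url}: {len(markdown)} chars, "
        f"{internal} internal links, preview: {preview!r}"
    )
```

Inside `main()` this would be called as `print(summarize_result(result))` in place of the `if`/`else` block.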
Crawl with Options
import asyncio
from praisonaiagents import crawl4ai

async def main():
    result = await crawl4ai(
        url="https://example.com",
        css_selector="main.content",  # Focus on specific content
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # Execute JS
        wait_for="css:.loaded",  # Wait for element
        screenshot=True  # Capture screenshot
    )
    if result["success"]:
        print(result["markdown"])
        if result.get("screenshot"):
            print("Screenshot captured!")

asyncio.run(main())
Crawl Multiple URLs
import asyncio
from praisonaiagents import crawl4ai_many

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    results = await crawl4ai_many(urls)
    for result in results:
        if result["success"]:
            print(f"✓ {result['url']}: {len(result['markdown'])} chars")
        else:
            print(f"✗ {result['url']}: {result['error']}")

asyncio.run(main())
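When a batch mixes successes and failures, it can help to separate the two before processing. The sketch below assumes each item in `results` is a dict with the `success` and `url` keys shown above. `partition_results` and `failed_urls` are hypothetical helper names, not PraisonAI functions.

```python
from typing import Any, Dict, List, Tuple


def partition_results(
    results: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split crawl results into (succeeded, failed) lists, preserving order."""
    succeeded = [r for r in results if r.get("success")]
    failed = [r for r in results if not r.get("success")]
    return succeeded, failed


def failed_urls(results: List[Dict[str, Any]]) -> List[str]:
    """Collect the URLs of failed results, e.g. to feed a second attempt."""
    _, failed = partition_results(results)
    return [r["url"] for r in failed if "url" in r]
```

One possible use is a single retry pass: call `crawl4ai_many(failed_urls(results))` after the first batch and merge the new successes into the first list.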
CSS Extraction
import asyncio
from praisonaiagents import crawl4ai_extract

async def main():
    schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    result = await crawl4ai_extract(
        url="https://example.com/products",
        schema=schema
    )
    if result["success"]:
        print(f"Extracted {result['count']} items")
        for item in result["data"]:
            print(f"  - {item['title']}: {item['price']}")

asyncio.run(main())
LLM Extraction
import asyncio
from praisonaiagents import crawl4ai_llm_extract

async def main():
    result = await crawl4ai_llm_extract(
        url="https://openai.com/api/pricing/",
        instruction="Extract all model names with their input and output token prices",
        provider="openai/gpt-4o-mini"
    )
    if result["success"]:
        print("Extracted data:", result["data"])

asyncio.run(main())
For more control, use the Crawl4AITools class directly:
import asyncio
from praisonaiagents import Crawl4AITools

async def main():
    tools = Crawl4AITools(headless=True, output="silent")
    try:
        # Basic crawl
        result = await tools.crawl("https://example.com")
        print(result["markdown"][:500])

        # CSS extraction
        schema = {
            "name": "Articles",
            "baseSelector": "article",
            "fields": [
                {"name": "title", "selector": "h2", "type": "text"},
                {"name": "summary", "selector": "p", "type": "text"}
            ]
        }
        result = await tools.extract_css("https://example.com/blog", schema)
        print(result["data"])
    finally:
        await tools.close()

asyncio.run(main())
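The `try`/`finally` pattern above can be wrapped in an async context manager so that `close()` is never forgotten. This is a sketch under one assumption only: the object exposes an awaitable `close()` method, as the example above shows. `closing_tools` is a hypothetical helper, and `FakeTools` is a stand-in that lets the snippet run without a browser. Whether `Crawl4AITools` supports `async with` natively is not stated on this page.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_tools(tools):
    """Yield `tools` and always await tools.close() on exit."""
    try:
        yield tools
    finally:
        await tools.close()


class FakeTools:
    """Stand-in used only to demonstrate the wrapper without a browser."""

    def __init__(self):
        self.closed = False

    async def crawl(self, url):
        return {"success": True, "url": url, "markdown": "# stub"}

    async def close(self):
        self.closed = True


async def demo():
    fake = FakeTools()
    async with closing_tools(fake) as tools:
        result = await tools.crawl("https://example.com")
    # close() has been awaited by the time the block exits.
    return result, fake.closed
```

With the real class, the same wrapper would be used as `async with closing_tools(Crawl4AITools(headless=True, output="silent")) as tools:`.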
Synchronous Usage
For non-async code, use the sync versions:
from praisonaiagents import crawl4ai_sync, crawl4ai_extract_sync
# Simple crawl
result = crawl4ai_sync("https://example.com")
print(result["markdown"][:500])
# CSS extraction
schema = {
    "name": "Items",
    "baseSelector": ".item",
    "fields": [{"name": "title", "selector": "h3", "type": "text"}]
}
result = crawl4ai_extract_sync("https://example.com/items", schema)
print(result["data"])
Schema Reference
schema = {
    "name": "Schema Name",
    "baseSelector": "div.item",  # CSS selector for each item
    "fields": [
        {
            "name": "field_name",
            "selector": "h2",  # CSS selector within item
            "type": "text"  # text, attribute, html, nested, list, nested_list
        },
        {
            "name": "link",
            "selector": "a",
            "type": "attribute",
            "attribute": "href"
        },
        {
            "name": "details",
            "selector": ".details",
            "type": "nested",
            "fields": [
                {"name": "brand", "selector": ".brand", "type": "text"}
            ]
        }
    ]
}
Field Types
| Type | Description |
|---|---|
| text | Extract text content |
| attribute | Extract HTML attribute (specify attribute key) |
| html | Extract raw HTML |
| nested | Single nested object |
| list | List of simple items |
| nested_list | List of complex objects |
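A schema typo such as a misspelled type is easier to catch before a crawl than after it. The following is a minimal sketch of a pre-flight check based only on what this page shows: the six field types in the table above, the `attribute` key on attribute fields, and the `fields` list on nested fields. `validate_schema` is a hypothetical helper, and its rules are assumptions drawn from the examples here. The library itself may accept schemas that this check flags.

```python
from typing import Any, Dict, List

# Field types listed in the table above.
FIELD_TYPES = {"text", "attribute", "html", "nested", "list", "nested_list"}


def _check_fields(fields: List[Dict[str, Any]], path: str) -> List[str]:
    problems: List[str] = []
    for i, field in enumerate(fields):
        where = f"{path}[{i}]"
        # Assumption: every field carries these three keys, as in the examples.
        for key in ("name", "selector", "type"):
            if key not in field:
                problems.append(f"{where}: missing '{key}'")
        ftype = field.get("type")
        if ftype is not None and ftype not in FIELD_TYPES:
            problems.append(f"{where}: unknown type '{ftype}'")
        if ftype == "attribute" and "attribute" not in field:
            problems.append(f"{where}: type 'attribute' needs an 'attribute' key")
        if ftype == "nested" and "fields" not in field:
            problems.append(f"{where}: type 'nested' needs a 'fields' list")
        # Recurse into any sub-fields that are present.
        if "fields" in field:
            problems.extend(_check_fields(field["fields"], f"{where}.fields"))
    return problems


def validate_schema(schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a schema dict (empty if none)."""
    problems: List[str] = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"schema: missing '{key}'")
    problems.extend(_check_fields(schema.get("fields", []), "fields"))
    return problems
```

An empty list means the schema matches the shape used in the examples on this page. Any returned strings name the offending field by position.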
JavaScript Execution
Execute JavaScript before crawling:
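import asyncio
from praisonaiagents import crawl4ai

async def main():
    result = await crawl4ai(
        url="https://example.com",
        js_code="""
        // Scroll to load lazy content
        window.scrollTo(0, document.body.scrollHeight);
        // Click a button
        document.querySelector('.load-more')?.click();
        """,
        wait_for="css:.loaded-content"  # Wait for content to appear
    )
    if result["success"]:
        print(result["markdown"][:500])

asyncio.run(main())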
Wait Conditions
# Wait for CSS selector
wait_for="css:.content-loaded"
# Wait for JavaScript condition
wait_for="js:() => document.querySelectorAll('.item').length > 10"
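The two prefixes above (`css:` and `js:`) are plain string conventions, so small builder functions can keep them consistent across a codebase. This is an illustrative sketch only: the three function names are hypothetical, and the only behavior taken from this page is the `css:`/`js:` prefix format and the arrow-function form of the JavaScript condition.

```python
import json


def wait_for_css(selector: str) -> str:
    """Build a wait_for value that waits for a CSS selector to match."""
    return f"css:{selector}"


def wait_for_js(condition: str) -> str:
    """Build a wait_for value from the source of a JavaScript arrow function."""
    return f"js:{condition}"


def wait_for_min_items(selector: str, minimum: int) -> str:
    """Wait until more than `minimum` elements match `selector`."""
    # json.dumps produces a quoted, escaped string literal that is valid
    # JavaScript, so quotes inside the selector cannot break the condition.
    quoted = json.dumps(selector)
    return wait_for_js(
        f"() => document.querySelectorAll({quoted}).length > {int(minimum)}"
    )
```

For example, `wait_for=wait_for_min_items(".item", 10)` yields the same condition as the JavaScript line above, with the selector in double quotes.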
Key Points
- Async by default: Use await for all crawl functions
- JavaScript rendering: Full browser support for dynamic content
- CSS extraction: Fast, no-LLM structured data extraction
- LLM extraction: AI-powered extraction for complex content
- Multi-URL: Efficient concurrent crawling
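To make the concurrency point concrete, the sketch below caps how many crawls run at once using `asyncio.Semaphore`. It is a generic pattern, not a description of how `crawl4ai_many` works internally, and this page does not say whether that function exposes a concurrency limit. `crawl_limited` is a hypothetical helper that accepts any async callable taking a URL and returning a result dict, such as `crawl4ai`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def crawl_limited(
    crawl: Callable[[str], Awaitable[Dict[str, Any]]],
    urls: List[str],
    limit: int = 3,
) -> List[Dict[str, Any]]:
    """Run `crawl` over `urls` with at most `limit` calls in flight.

    Results come back in the same order as `urls`. An exception raised by
    `crawl` is turned into a failure dict so one bad URL does not abort
    the whole batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def one(url: str) -> Dict[str, Any]:
        async with semaphore:
            try:
                return await crawl(url)
            except Exception as exc:
                return {"success": False, "url": url, "error": str(exc)}

    return list(await asyncio.gather(*(one(u) for u in urls)))
```

Inside an async function this would be called as `results = await crawl_limited(crawl4ai, urls, limit=2)`.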