Crawl4AI-MCP
by: Vistiqx
Crawl4AI-MCP: A powerful web crawling and content analysis server that combines targeted web scraping with Claude AI processing. Extract insights from specific websites with customizable depth, content selection, and AI analysis capabilities.
📌 Overview
Purpose: Crawl4AI is designed for intelligent web crawling and AI-powered content analysis, providing in-depth insight into specific web content.
Overview: The Crawl4AI MCP Server offers a robust framework for targeted web crawling, enabling users to extract, analyze, and process content using Claude AI models. Tailored for researchers, content creators, and business analysts, it distinguishes itself from general search engines by delivering precise insights into selected websites.
Key Features:
- Customizable Web Crawling: Allows users to define crawling depth and content selectors, ensuring focused data extraction specific to user needs.
- AI-Powered Content Analysis: Utilizes Claude AI models to perform tasks such as summarization, fact extraction, and deeper content analysis, enhancing data insights.
Crawl4AI MCP Server
This is not a functional MCP; it was a learning tool for future development.
Crawl4AI MCP Server is a Model Context Protocol (MCP) server designed for intelligent web crawling and AI-powered content analysis. It provides a simple API for crawling websites and processing the content using Claude AI models.
Who Benefits from Crawl4AI?
Crawl4AI is designed for individuals and organizations needing targeted, in-depth analysis of specific web content. Unlike general search engines or AI assistants that provide broad coverage, Crawl4AI offers deeper insights into content you specifically want to analyze.
Ideal for:
- Researchers extracting structured information from specific websites or academic resources
- Content creators analyzing competitor content or industry trends within specific domains
- Data analysts processing web data for business intelligence
- Developers building applications requiring web content analysis capabilities
- Digital marketers analyzing industry websites, blogs, or competitor content
- Business analysts gathering industry-specific information from multiple sources
- Knowledge workers synthesizing information from specific web domains
User Benefits
- Targeted depth over breadth: Comprehensive analysis of specific websites that matter to you
- Customizable crawling: Control crawl depth, content extraction, and processing parameters
- Programmatic integration: Easily incorporate analysis into applications and workflows
- Flexible AI processing: Summarize, extract facts, analyze deeply, or generate questions
- Privacy and control: Run the server locally to keep searches internal
- Cost efficiency: Use your own Claude API key with control over token usage
- Automation potential: Schedule regular crawls to track content changes over time
- Customized AI prompting: Tailor AI analysis to your specific needs
- Content transformation: Turn unstructured web data into structured, actionable insights
Crawl4AI bridges simple web scraping and sophisticated AI analysis for meaningful extraction of web insights.
Features
- Web crawling with customizable depth and content selectors
- Respects robots.txt directives
- Content extraction and AI-powered processing with Claude models
- Simple REST API interface
- Configurable via command line or environment variables
- Detailed logging
Installation
1. Clone this repository:

```bash
git clone https://github.com/yourusername/crawl4ai-mcp.git
cd crawl4ai-mcp
```

2. Install dependencies:

```bash
npm install
```

3. Create a `.env` file and add your Anthropic API key:

```
ANTHROPIC_API_KEY=your_api_key_here
```
Usage
Starting the Server
Start the server with default settings:

```bash
npm start
```

Or with options:

```bash
npm start -- --port 4000 --debug
```

Available options:
- `--port <number>`: Server port (default: 3000)
- `--debug`: Enable debug logging
API Endpoints
Crawl a Website
```
POST /api/crawl
```

Request body example:

```json
{
  "url": "https://example.com",
  "depth": 2,
  "selector": "main",
  "aiProcessing": {
    "task": "summarize",
    "model": "claude-3-sonnet-20240229"
  }
}
```
Parameters:
- `url` (required): URL to start crawling from
- `depth` (optional): How many levels deep to crawl (default: 1)
- `selector` (optional): CSS selector for content extraction (default: "body")
- `aiProcessing` (optional): AI processing configuration
  - `task`: Processing type: summarize, extract, analyze, questions
  - `model`: Claude model to use (default: "claude-3-sonnet-20240229")
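For programmatic integration (one of the benefits noted above), the endpoint can be called from any HTTP client. Below is a minimal TypeScript sketch using the built-in fetch of Node 18+; it assumes the server is running locally on the default port, and since the response shape isn't documented here, it simply prints the raw JSON.

```typescript
// Minimal sketch: call the crawl endpoint from Node 18+ (built-in fetch).
// Assumes the server is running locally on the default port 3000.
// The response shape is not documented, so the raw JSON is printed as-is.

async function crawl(): Promise<void> {
  const res = await fetch("http://localhost:3000/api/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: "https://example.com",
      depth: 1,
      selector: "main",
      aiProcessing: { task: "summarize" },
    }),
  });

  if (!res.ok) {
    throw new Error(`Crawl request failed: ${res.status} ${res.statusText}`);
  }

  console.log(JSON.stringify(await res.json(), null, 2));
}

crawl().catch(console.error);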
Health Check
```
GET /api/healthcheck
```
Returns server status and version.
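With the server running locally on the default port, a quick check from the command line looks like:

```bash
curl http://localhost:3000/api/healthcheck
```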
AI Processing Tasks
- `summarize`: Generate a summary of the crawled content
- `extract`: Extract factual information
- `analyze`: Deep analysis of content, arguments, and quality
- `questions`: Generate questions and answers based on content
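Any of these values goes in the request's `aiProcessing.task` field. For instance, a minimal body requesting factual extraction from the same example URL would be:

```json
{
  "url": "https://example.com",
  "aiProcessing": {
    "task": "extract"
  }
}
```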
Configuration
Configure the server using environment variables:
- `PORT`: Server port (default: 3000)
- `ANTHROPIC_API_KEY`: Your Anthropic API key
- `DEBUG`: Set to "true" to enable debug logging
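These variables can be exported in your shell or kept in the `.env` file from the installation steps. A one-off override from a POSIX shell (assuming `ANTHROPIC_API_KEY` is already set in `.env`) looks like:

```bash
PORT=4000 DEBUG=true npm start
```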
Example
Crawl a website and summarize its content with curl:
```bash
curl -X POST http://localhost:3000/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "depth": 1,
    "aiProcessing": {
      "task": "summarize"
    }
  }'
```
License
MIT License
Acknowledgements
This project uses the following libraries:
- Express
- Puppeteer
- Cheerio
- Winston
- @anthropic-ai/sdk