Web Scraper & Sitemap Generator
Three-in-One Web Analysis Tool: A comprehensive web scraping and sitemap generation solution that combines content extraction, site structure mapping, and link organization. Features dual-mode operation with both a user-friendly Web UI (port 7861) and an MCP Server API (port 7862), making it perfect for content migration, SEO audits, and AI training data preparation.
三合一网页分析工具: 功能全面的网页抓取和站点地图生成解决方案,结合内容提取、站点结构映射和链接组织功能。提供双模式操作,包括用户友好的 Web 界面(端口 7861)和 MCP Server API(端口 7862),非常适合内容迁移、SEO 审计和 AI 训练数据准备。
Overview | 概述
Web Scraper & Sitemap Generator is a hackathon project that emerged from the Agents MCP Hackathon, offering a unique three-pronged approach to web content analysis. Unlike traditional scrapers that focus solely on content extraction, this tool provides:
- Content Scraping: Extracts clean text content from web pages and converts it to Markdown format
- Sitemap Generation: Creates organized site maps based on discovered page links
- Link Classification: Automatically distinguishes between internal and external links for better site structure understanding
What sets this server apart is its dual-mode architecture: it runs both a Gradio web interface for manual exploration and an MCP Server for programmatic access by AI assistants. This makes it equally useful for human operators performing one-off analyses and AI agents conducting automated content workflows.
Web Scraper & Sitemap Generator 是一个来自 Agents MCP Hackathon 的项目,提供了独特的三管齐下的网页内容分析方法。与传统的只专注于内容提取的爬虫不同,该工具提供:
- 内容抓取:从网页中提取干净的文本内容并转换为 Markdown 格式
- 站点地图生成:基于发现的页面链接创建组织化的站点地图
- 链接分类:自动区分内部链接和外部链接,以便更好地理解站点结构
该服务器的独特之处在于其双模式架构:它同时运行 Gradio Web 界面供手动探索和 MCP Server 供 AI 助手进行程序化访问。这使得它对执行一次性分析的人工操作员和进行自动化内容工作流的 AI 代理都同样有用。
Key Statistics | 关键数据
- Popularity: 49 likes on Hugging Face
- Platform: Hugging Face Space (Gradio SDK)
- Language: Python
- Project Type: Hackathon Project
- Transport: HTTP SSE (Server-Sent Events)
- Dual Ports: 7861 (Web UI) + 7862 (MCP Server)
Core Features | 核心特性
1. Web Content Scraping | 网页内容抓取
The scraping functionality extracts text content from any publicly accessible website and converts it into clean, readable Markdown format. This process:
- Removes HTML tags and preserves semantic structure
- Maintains headers, lists, links, and formatting
- Filters out scripts, styles, and navigation elements
- Produces clean Markdown suitable for documentation or analysis
抓取功能从任何公开访问的网站提取文本内容,并将其转换为干净、可读的 Markdown 格式。这个过程:
- 删除 HTML 标签并保留语义结构
- 维护标题、列表、链接和格式
- 过滤脚本、样式和导航元素
- 生成适合文档或分析的干净 Markdown
2. Sitemap Generation | 站点地图生成
The sitemap generator crawls through page links to create a comprehensive map of website structure. It:
- Discovers all linked pages from a starting URL
- Organizes links in a hierarchical structure
- Maps the navigation paths between pages
- Identifies the site’s content architecture
站点地图生成器通过页面链接爬取创建网站结构的全面地图。它:
- 从起始 URL 发现所有链接页面
- 以层次结构组织链接
- 映射页面之间的导航路径
- 识别站点的内容架构
3. Link Classification | 链接分类
The link analysis feature automatically categorizes discovered links into:
- Internal Links: Links pointing to pages within the same domain
- External Links: Links pointing to external domains and resources
- Resource Types: Distinguishes between pages, images, documents, etc.
This classification is essential for SEO audits, understanding site structure, and identifying outbound link patterns.
链接分析功能自动将发现的链接分类为:
- 内部链接:指向同一域内页面的链接
- 外部链接:指向外部域和资源的链接
- 资源类型:区分页面、图片、文档等
这种分类对于 SEO 审计、理解站点结构和识别出站链接模式至关重要。
4. Dual-Mode Architecture | 双模式架构
Web Interface (Port 7861):
- User-friendly Gradio interface
- Manual URL input and instant analysis
- Visual display of results
- Suitable for exploratory work and one-off analyses
MCP Server (Port 7862):
- Programmatic API access via MCP protocol
- Integration with AI assistants like Claude
- Batch processing capabilities
- Ideal for automated workflows
Web 界面(端口 7861):
- 用户友好的 Gradio 界面
- 手动 URL 输入和即时分析
- 结果的可视化展示
- 适合探索性工作和一次性分析
MCP Server(端口 7862):
- 通过 MCP 协议进行程序化 API 访问
- 与 Claude 等 AI 助手集成
- 批处理能力
- 适合自动化工作流
5. Markdown Conversion | Markdown 转换
HTML to Markdown conversion maintains document structure while producing clean, portable text:
- Headers: HTML heading tags (h1-h6) → Markdown headers (#, ##, ###)
- Lists: ul/ol elements → Markdown bullet/numbered lists
- Links: <a> tags → [text](url) format
- Emphasis: <strong>, <em> → **bold**, *italic*
- Code blocks: <pre>, <code> → fenced code blocks
HTML 到 Markdown 的转换在保持文档结构的同时产生干净、可移植的文本:
- 标题:HTML 标题标签(h1-h6)→ Markdown 标题(#、##、###)
- 列表:ul/ol 元素 → Markdown 项目符号/编号列表
- 链接:<a> 标签 → [text](url) 格式
- 强调:<strong>、<em> → **粗体**、*斜体*
- 代码块:<pre>、<code> → 围栏代码块
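The exact converter used by the Space is not documented here, but the mapping above can be reproduced with a library such as markdownify (one of the likely dependencies listed later). A minimal sketch:

```python
# Minimal sketch of HTML -> Markdown conversion; markdownify is an assumption
# based on the "Likely Dependencies" section later in this document.
import requests
from markdownify import markdownify as md

html = requests.get("https://example.com", timeout=10).text
markdown = md(html, heading_style="ATX")  # ATX = '#'-style headers
print(markdown[:500])
```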
MCP Tools Documentation | MCP 工具文档
The Web Scraper MCP Server provides three powerful tools that can be accessed through any MCP-compatible client.
Tool 1: scrape_content
Description: Extracts and formats website content into clean Markdown format.
描述:提取网站内容并格式化为干净的 Markdown 格式。
Parameters | 参数: the URL of the page to scrape.
Returns | 返回: the page content converted to clean, readable Markdown.
Example Usage | 使用示例:
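An illustrative request/response shape only; the url argument and the result fields are assumptions, since the actual schema is defined by the Space:

```json
{
  "tool": "scrape_content",
  "arguments": { "url": "https://example.com/blog/post-1" },
  "result": {
    "url": "https://example.com/blog/post-1",
    "markdown": "# Post Title\n\nArticle body converted to Markdown..."
  }
}
```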
Use Cases | 适用场景:
- Extracting blog articles for content analysis
- Converting documentation to Markdown for migration
- Collecting text data for AI training
- Creating offline readable versions of web content
Common Scenarios | 常见场景:
- 提取博客文章进行内容分析
- 将文档转换为 Markdown 以进行迁移
- 收集文本数据用于 AI 训练
- 创建网页内容的离线可读版本
Tool 2: generate_sitemap
Description: Generates a comprehensive sitemap of all links found on the website, organized hierarchically.
描述:生成网站上发现的所有链接的全面站点地图,按层次组织。
Parameters | 参数: the starting URL of the website to map.
Returns | 返回: a hierarchically organized list of the links discovered on the site.
Example Usage | 使用示例:
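An illustrative shape only; the argument and result fields are assumptions, and the Space may expose additional options such as crawl depth:

```json
{
  "tool": "generate_sitemap",
  "arguments": { "url": "https://example.com" },
  "result": {
    "base_url": "https://example.com",
    "pages": [
      "https://example.com/",
      "https://example.com/docs/",
      "https://example.com/docs/getting-started"
    ]
  }
}
```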
Use Cases | 适用场景:
- Understanding website structure before migration
- Creating navigation documentation
- SEO audits to identify orphaned pages
- Mapping documentation hierarchies
Common Scenarios | 常见场景:
- 在迁移前了解网站结构
- 创建导航文档
- SEO 审计以识别孤立页面
- 映射文档层次结构
Tool 3: analyze_website
Description: Performs a complete website analysis, combining content extraction, sitemap generation, and link classification.
描述:执行完整的网站分析,结合内容提取、站点地图生成和链接分类。
Parameters | 参数: the starting URL of the website to analyze.
Returns | 返回: extracted content, the generated sitemap, and links classified as internal or external.
Example Usage | 使用示例:
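An illustrative shape combining the three analyses; all field names are assumptions:

```json
{
  "tool": "analyze_website",
  "arguments": { "url": "https://example.com" },
  "result": {
    "content": "# Example Home\n\nStart page converted to Markdown...",
    "sitemap": ["https://example.com/", "https://example.com/about"],
    "internal_links": ["https://example.com/about"],
    "external_links": ["https://github.com/example"]
  }
}
```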
Use Cases | 适用场景:
- Comprehensive site audits
- Documentation migration planning
- Content strategy analysis
- Link profile evaluation for SEO
Common Scenarios | 常见场景:
- 全面的站点审计
- 文档迁移规划
- 内容策略分析
- SEO 的链接配置文件评估
Installation & Configuration | 安装与配置
Method 1: Local Web Interface | 方式 1:本地 Web 界面
This method runs the Gradio web interface for manual, interactive web scraping.
此方法运行 Gradio Web 界面以进行手动交互式网页抓取。
Requirements | 前置要求:
- Python 3.8 or higher
- Git
Installation Steps | 安装步骤:
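A hedged sketch of the usual install flow; the app.py entry point and requirements.txt are assumptions based on standard Gradio Spaces:

```bash
# Clone the repository from Hugging Face
git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/web-scraper
cd web-scraper

# Install dependencies (assumes a standard requirements.txt)
pip install -r requirements.txt

# Launch the Gradio web interface (port 7861 per this document)
python app.py
```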
Using the Web Interface | 使用 Web 界面:
- Open your browser to http://localhost:7861
- Enter the URL you want to scrape in the input field
- Choose the operation:
- Scrape Content: Extract and convert to Markdown
- Generate Sitemap: Create site structure map
- Analyze Website: Perform complete analysis
- Click the appropriate button to start the operation
- View results directly in the interface
Method 2: Local MCP Server | 方式 2:本地 MCP Server
This method runs the MCP Server for programmatic access by AI assistants.
此方法运行 MCP Server 以供 AI 助手进行程序化访问。
Installation Steps | 安装步骤:
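How the server is started is not fully documented here; a hedged sketch assuming the same app.py entry point as Method 1, with Gradio's built-in MCP support exposing the endpoint on port 7862:

```bash
# Clone and install as in Method 1, then start the server.
# The entry-point name (app.py) is an assumption; the endpoint below is the one
# documented later in this guide.
python app.py

# MCP endpoint: http://localhost:7862/gradio_api/mcp/sse
```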
Claude Desktop Configuration | Claude Desktop 配置:
On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%\Claude\claude_desktop_config.json
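A plausible configuration entry; Claude Desktop traditionally launches stdio servers, so this sketch bridges to the SSE endpoint with the mcp-remote package (clients with native SSE support can instead point a url field at the endpoint). Treat the exact keys as assumptions and adapt to your client version:

```json
{
  "mcpServers": {
    "web-scraper": {
      "command": "npx",
      "args": ["mcp-remote", "http://localhost:7862/gradio_api/mcp/sse"]
    }
  }
}
```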
After updating the configuration:
- Restart Claude Desktop
- The web scraper tools will appear in Claude’s tool list
- You can now ask Claude to scrape websites, generate sitemaps, or analyze sites
配置更新后:
- 重启 Claude Desktop
- 网页抓取工具将出现在 Claude 的工具列表中
- 现在可以要求 Claude 抓取网站、生成站点地图或分析站点
Method 3: Remote Hugging Face Space | 方式 3:远程 Hugging Face Space
Use the hosted version on Hugging Face without local installation.
使用 Hugging Face 上的托管版本,无需本地安装。
Configuration | 配置:
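The same pattern pointed at the hosted Space. The host below follows Hugging Face's usual owner-space.hf.space naming convention; confirm the exact URL on the Space page before using it:

```json
{
  "mcpServers": {
    "web-scraper": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://agents-mcp-hackathon-web-scraper.hf.space/gradio_api/mcp/sse"
      ]
    }
  }
}
```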
Advantages | 优势:
- No local setup required
- Always up-to-date
- No resource consumption on local machine
Considerations | 注意事项:
- Requires internet connection
- May have rate limits
- Shared resource with other users
Use Cases & Workflows | 使用场景与工作流
1. Content Migration Workflow | 内容迁移工作流
Scenario: Migrating a legacy documentation site to a modern documentation platform (e.g., moving from WordPress to Docusaurus).
场景:将传统文档站点迁移到现代文档平台(例如,从 WordPress 迁移到 Docusaurus)。
Workflow | 工作流程:
Step 1: Analyze the site structure (see the migration sketch below)
Example Claude Conversation | Claude 对话示例:
User: I need to migrate our documentation from https://old-docs.example.com to a new Docusaurus site. Can you help analyze the structure first?
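A hedged sketch of how the same migration could be scripted end to end, assuming the official MCP Python SDK (mcp package); the tool argument names, result shapes, and the extract_urls helper are illustrative assumptions:

```python
# Map the old site, then scrape every discovered page to a local Markdown file.
import asyncio
import pathlib
import re

from mcp import ClientSession
from mcp.client.sse import sse_client

SERVER = "http://localhost:7862/gradio_api/mcp/sse"  # MCP endpoint per this guide

def extract_urls(tool_result):
    # Hypothetical helper: pull URLs out of the tool's text output; the real
    # sitemap format is defined by the Space and may be structured differently.
    text = "".join(c.text for c in tool_result.content if hasattr(c, "text"))
    return sorted(set(re.findall(r"https?://[^\s\"')>]+", text)))

async def migrate(start_url, out_dir="migrated-docs"):
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    async with sse_client(SERVER) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            sitemap = await session.call_tool("generate_sitemap", {"url": start_url})
            for i, page_url in enumerate(extract_urls(sitemap)):
                page = await session.call_tool("scrape_content", {"url": page_url})
                (out / f"page_{i:03d}.md").write_text(page.content[0].text, encoding="utf-8")

asyncio.run(migrate("https://old-docs.example.com"))
```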
2. SEO Audit Workflow | SEO 审计工作流
Scenario: Performing a comprehensive SEO audit to identify link issues and site structure problems.
场景:执行全面的 SEO 审计以识别链接问题和站点结构问题。
Audit Checklist | 审计清单:
Site Structure Analysis
- Generate sitemap to visualize site hierarchy
- Identify orphaned pages (pages with no internal links)
- Check navigation depth (pages more than 3 clicks from home)
Internal Linking
- Count internal links per page
- Identify pages with few inbound links
- Check for broken internal links
External Links
- List all external links
- Categorize by domain
- Identify potential link opportunities
Content Quality
- Extract content from key pages
- Analyze content length and structure
- Identify thin content pages
Example Analysis | 分析示例:
SEO Audit Results for https://example.com (sample report)
3. AI Training Data Preparation | AI 训练数据准备
Scenario: Collecting high-quality text data from curated websites for training or fine-tuning language models.
场景:从精选网站收集高质量文本数据用于训练或微调语言模型。
Workflow | 工作流程:
Generate a sitemap for each curated site, scrape every discovered page to Markdown with scrape_content, and then apply the quality checks below.
Data Quality Checks | 数据质量检查:
- Remove boilerplate (headers, footers, navigation)
- Filter out short content (<100 words)
- Deduplicate similar content
- Preserve code examples and formatting
- Maintain source attribution
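The checks above are straightforward to implement on the scraped Markdown; a minimal self-contained sketch (the thresholds and deduplication strategy are illustrative choices, not the project's):

```python
# Filter and deduplicate scraped Markdown documents before using them as training data.
import hashlib

def clean_dataset(documents, min_words=100):
    """documents: mapping of source URL -> scraped Markdown text."""
    seen_hashes = set()
    cleaned = []
    for url, markdown in documents.items():
        if len(markdown.split()) < min_words:        # drop thin content
            continue
        digest = hashlib.sha256(markdown.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:                    # drop exact duplicates
            continue
        seen_hashes.add(digest)
        cleaned.append({"source": url, "text": markdown})  # keep source attribution
    return cleaned

sample = {"https://example.com/a": "word " * 200, "https://example.com/b": "too short"}
print(len(clean_dataset(sample)), "documents kept")
```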
4. Documentation Monitoring | 文档监控
Scenario: Regularly monitoring documentation sites for changes, broken links, or structural issues.
场景:定期监控文档站点的更改、损坏的链接或结构问题。
Monitoring Workflow | 监控工作流:
Daily Check: re-run generate_sitemap, save the result as a dated snapshot, and compare it with the previous snapshot to catch added, removed, or broken pages.
Automated Monitoring Script | 自动化监控脚本:
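A minimal self-contained sketch of the comparison step, assuming each day's generate_sitemap output has been saved as a JSON file with a pages array (that format is an assumption, not the Space's defined output):

```python
# Diff two saved sitemap snapshots to spot pages that appeared or disappeared.
import json
import sys

def load_pages(path):
    with open(path, encoding="utf-8") as f:
        return set(json.load(f)["pages"])  # {"pages": [...]} format is assumed

yesterday = load_pages(sys.argv[1])   # e.g. sitemap-2025-06-03.json
today = load_pages(sys.argv[2])       # e.g. sitemap-2025-06-04.json

print("Added pages:  ", sorted(today - yesterday))
print("Removed pages:", sorted(yesterday - today))
```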
5. Knowledge Base Construction | 知识库构建
Scenario: Building a searchable knowledge base from multiple documentation sources.
场景:从多个文档来源构建可搜索的知识库。
Construction Steps | 构建步骤:
Source Identification | 识别来源
- List all documentation sources
- Prioritize by relevance and authority
Content Extraction | 内容提取
- Use analyze_website for each source
- Extract all pages to Markdown
- Maintain source metadata
Content Processing | 内容处理
- Clean and normalize Markdown
- Extract key sections and topics
- Generate embeddings for semantic search
Indexing | 索引
- Index content in search engine (e.g., Elasticsearch)
- Create hierarchical navigation
- Link related content across sources
Presentation | 呈现
- Build search interface
- Display content with source attribution
- Maintain links to original sources
Technical Architecture | 技术架构
System Components | 系统组件
(Architecture diagram: a shared scraping and analysis core exposed through two front-ends, the Gradio Web UI on port 7861 and the MCP Server on port 7862.)
Processing Pipeline | 处理管道
Content Scraping Pipeline | 内容抓取管道:
URL Input → HTTP Request → HTML Response → HTML Parsing → Content Extraction → Markdown Conversion
Sitemap Generation Pipeline | 站点地图生成管道:
Start URL → Page Fetch → Extract Links → Filter Links → Classify Internal/External → Organized Sitemap
Full Analysis Pipeline | 完整分析管道:
URL Input → [Content Pipeline] + [Sitemap Pipeline] → Combined Analysis Report (content, sitemap, classified links)
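A minimal sketch of the sitemap and link-classification stages, using requests and BeautifulSoup4 (both named under Likely Dependencies below); the Space's actual implementation may differ:

```python
# Fetch one page, extract its links, and classify them as internal or external.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def classify_links(start_url):
    html = requests.get(start_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    base_domain = urlparse(start_url).netloc

    internal, external = set(), set()
    for a in soup.find_all("a", href=True):
        url = urljoin(start_url, a["href"])           # resolve relative URLs
        if urlparse(url).netloc == base_domain:
            internal.add(url)
        else:
            external.add(url)
    return internal, external

internal, external = classify_links("https://example.com")
print(len(internal), "internal /", len(external), "external links")
```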
Technology Stack | 技术栈
Core Framework | 核心框架:
- Gradio: Web UI and MCP server framework
- Python: Primary implementation language
Likely Dependencies | 可能的依赖:
- requests or httpx: HTTP client for fetching web pages
- BeautifulSoup4 or lxml: HTML parsing
- html2text or markdownify: HTML to Markdown conversion
- urllib: URL parsing and manipulation
MCP Integration | MCP 集成:
- Transport: Server-Sent Events (SSE)
- Endpoint: /gradio_api/mcp/sse
- Protocol: MCP (Model Context Protocol)
Performance Considerations | 性能考虑
Optimization Strategies | 优化策略:
Caching | 缓存
- Cache fetched pages to avoid redundant requests
- Store parsed results for repeated analysis
- Implement TTL for cache invalidation
Rate Limiting | 速率限制
- Respect robots.txt directives
- Implement polite crawling delays
- Limit concurrent requests
Parallel Processing | 并行处理
- Fetch multiple pages concurrently
- Process content in parallel threads
- Use async/await for I/O operations
Resource Management | 资源管理
- Limit crawl depth to prevent runaway scraping
- Set maximum page count per analysis
- Implement timeouts for slow sites
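The rate-limiting and resource-management points above can be combined into a small helper; a sketch using only the standard library plus requests, with illustrative delay and timeout values:

```python
# Check robots.txt, wait politely, and fetch with a timeout.
import time
from urllib import robotparser
from urllib.parse import urlparse

import requests

CRAWL_DELAY = 1.0   # seconds between requests (illustrative)
TIMEOUT = 10        # per-request timeout for slow sites (illustrative)

def polite_get(url, user_agent="web-scraper-example"):
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(user_agent, url):
        return None                        # disallowed by robots.txt
    time.sleep(CRAWL_DELAY)                # polite crawling delay
    return requests.get(url, timeout=TIMEOUT, headers={"User-Agent": user_agent}).text
```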
Best Practices | 最佳实践
For Content Scraping | 内容抓取
Respect Website Policies | 尊重网站政策
- Check and honor robots.txt
- Follow crawl-delay directives
- Don’t overwhelm servers with requests
Handle Errors Gracefully | 优雅处理错误
- Implement retry logic for failed requests
- Handle timeouts appropriately
- Log errors for debugging
Clean Content Effectively | 有效清理内容
- Remove navigation elements
- Strip advertisements and sidebars
- Preserve meaningful structure
For Sitemap Generation | 站点地图生成
Set Appropriate Limits | 设置适当限制
- Limit crawl depth (e.g., max 3-4 levels)
- Cap total pages analyzed (e.g., 500 pages)
- Set timeout for entire operation
Handle Redirects | 处理重定向
- Follow redirects automatically
- Track redirect chains
- Use final URL in sitemap
Normalize URLs | 规范化 URL
- Remove query parameters when appropriate
- Handle trailing slashes consistently
- Resolve relative URLs correctly
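The normalization rules above map directly onto the standard library's urllib.parse; a minimal sketch:

```python
# Resolve a relative link, drop query parameters, and normalize the trailing slash.
from urllib.parse import urljoin, urlparse, urlunparse

def normalize(base_url, href, keep_query=False):
    url = urljoin(base_url, href)                      # resolve relative URLs
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"               # consistent trailing slashes
    query = parts.query if keep_query else ""          # drop query params when appropriate
    return urlunparse((parts.scheme, parts.netloc, path, "", query, ""))

print(normalize("https://example.com/docs/", "../about/?utm_source=x"))
# https://example.com/about
```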
For Link Analysis | 链接分析
Accurate Classification | 准确分类
- Use domain matching for internal/external
- Handle subdomains correctly
- Consider protocol differences (http vs https)
Link Validation | 链接验证
- Check for broken links (optional)
- Identify redirect chains
- Flag suspicious links
Context Preservation | 上下文保留
- Maintain anchor text
- Record link position in content
- Note link relationships
Troubleshooting | 故障排除
Common Issues | 常见问题
Issue 1: MCP Server Not Connecting | MCP Server 无法连接
Symptom: Claude can't see the web scraper tools.
Check that the MCP server is running and reachable on port 7862, that the configuration points at the /gradio_api/mcp/sse endpoint, and restart Claude Desktop after editing claude_desktop_config.json.
Issue 2: Scraping Returns Empty Content | 抓取返回空内容
Symptom: scrape_content returns blank or minimal text.
This usually means the page is rendered by JavaScript, which the scraper cannot execute (see Limitations); try a server-rendered page or another source for the same content.
Issue 3: Sitemap Generation Timeout | 站点地图生成超时
Symptom: generate_sitemap operation times out.
Large sites can exceed the operation timeout; limit crawl depth, cap the number of pages per analysis, or analyze the site in sections (see Performance Considerations).
Issue 4: Broken Internal Links in Output | 输出中的内部链接损坏
Symptom: Internal links in Markdown don't work.
Relative URLs may not have been resolved against the page's base URL; normalize them to absolute URLs as described under Best Practices: Normalize URLs.
Debug Mode | 调试模式
Enable detailed logging to troubleshoot issues:
启用详细日志记录以排除问题:
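The project's exact debug switch is not documented here; a generic approach that works for any Python/Gradio app is to raise logging verbosity before launch:

```python
# Turn on verbose logging for the whole process.
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
```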
Limitations & Considerations | 限制与注意事项
Technical Limitations | 技术限制
JavaScript-Rendered Content | JavaScript 渲染的内容
- Cannot scrape content loaded by JavaScript
- Single-page applications may not work well
- Dynamic content might be missed
Authentication-Protected Content | 身份验证保护的内容
- Cannot access content behind login
- No session management support
- Public pages only
Rate Limiting | 速率限制
- May trigger anti-scraping measures
- Shared Hugging Face space has limits
- Concurrent requests limited
Large Sites | 大型站点
- Full site analysis may be time-consuming
- Memory usage increases with site size
- May need to analyze in sections
Legal & Ethical Considerations | 法律与道德考虑
Copyright | 版权
- Respect content copyright
- Don’t republish scraped content without permission
- Use for analysis and personal purposes only
Terms of Service | 服务条款
- Check website’s terms of service
- Some sites prohibit scraping
- Respect robots.txt directives
Privacy | 隐私
- Don’t scrape personal information
- Be cautious with user-generated content
- Follow GDPR and other privacy regulations
Server Load | 服务器负载
- Don’t overwhelm target servers
- Use polite crawl delays
- Consider impact on site performance
Comparison with Alternatives | 与替代方案的比较
vs. Traditional Scrapers | 与传统爬虫对比
Web Scraper MCP Server | 本工具:
- ✓ Dual-mode operation (UI + API)
- ✓ Three-in-one analysis (content + sitemap + links)
- ✓ MCP integration for AI assistants
- ✓ Clean Markdown output
- ✗ No JavaScript rendering
- ✗ Limited to public content
Scrapy:
- ✓ Highly customizable
- ✓ Powerful crawling engine
- ✓ Large ecosystem of extensions
- ✗ Steeper learning curve
- ✗ No built-in Markdown conversion
- ✗ Requires coding
BeautifulSoup:
- ✓ Simple and flexible
- ✓ Excellent HTML parsing
- ✗ No crawling capabilities
- ✗ Manual implementation needed
- ✗ No sitemap generation
vs. Commercial Tools | 与商业工具对比
Web Scraper MCP Server | 本工具:
- ✓ Free and open source
- ✓ Self-hosted
- ✓ Customizable
- ✗ Limited features compared to commercial
- ✗ Manual setup required
Screaming Frog SEO Spider:
- ✓ Comprehensive SEO analysis
- ✓ Advanced crawling capabilities
- ✓ Detailed reporting
- ✗ Commercial license required
- ✗ Desktop application only
- ✗ No MCP integration
import.io:
- ✓ Visual scraping tool
- ✓ Cloud-based
- ✓ API access
- ✗ Expensive
- ✗ Proprietary
- ✗ Limited free tier
FAQ | 常见问题
General Questions | 一般问题
Q: What types of websites work best with this scraper?
A: Static HTML sites work best. This includes:
- Documentation sites
- Blogs and news sites
- Corporate websites
- Educational resources
- Markdown-based sites
Sites that DON’T work well:
- Single-page applications (SPAs) with heavy JavaScript
- Sites requiring authentication
- Content behind paywalls
- Sites with aggressive anti-scraping
问:哪些类型的网站最适合这个爬虫?
答:静态 HTML 站点效果最好。包括:
- 文档站点
- 博客和新闻站点
- 企业网站
- 教育资源
- 基于 Markdown 的站点
不太适合的站点:
- 使用大量 JavaScript 的单页应用程序(SPA)
- 需要身份验证的站点
- 付费内容
- 具有激进反爬虫措施的站点
Q: Can I use this for commercial projects?
A: Check the project’s license on Hugging Face. Generally:
- Personal use: Usually fine
- Commercial use: May have restrictions
- Always respect website terms of service
- Don’t violate copyright
问:我可以将其用于商业项目吗?
答:查看 Hugging Face 上的项目许可证。通常:
- 个人使用:通常可以
- 商业使用:可能有限制
- 始终尊重网站服务条款
- 不要违反版权
Q: How is this different from using curl or wget?
A: This tool provides:
- Automatic Markdown conversion
- Sitemap generation
- Link classification and analysis
- Clean content extraction (removes navigation, ads)
- MCP integration for AI assistants
- User-friendly interface
curl/wget provide:
- Raw HTML only
- No content processing
- No analysis features
- More manual work required
问:这与使用 curl 或 wget 有何不同?
答:本工具提供:
- 自动 Markdown 转换
- 站点地图生成
- 链接分类和分析
- 干净的内容提取(删除导航、广告)
- AI 助手的 MCP 集成
- 用户友好界面
curl/wget 提供:
- 仅原始 HTML
- 无内容处理
- 无分析功能
- 需要更多手动工作
Technical Questions | 技术问题
Q: Can I scrape multiple pages at once?
A: Yes, through different approaches:
- Use analyze_website to get all pages at once
- Call scrape_content multiple times via MCP
- For batch processing, use a script with the MCP client
问:我可以一次抓取多个页面吗?
答:可以,通过不同方法:
- 使用 analyze_website 一次获取所有页面
- 通过 MCP 多次调用 scrape_content
- 对于批处理,使用带有 MCP 客户端的脚本
Q: Does this respect robots.txt?
A: Implementation may vary. Best practices:
- Check robots.txt before scraping
- Honor crawl-delay directives
- Don’t scrape disallowed paths
- Add your own checks if needed
问:这个工具是否遵守 robots.txt?
答:实现可能有所不同。最佳实践:
- 抓取前检查 robots.txt
- 遵守 crawl-delay 指令
- 不要抓取禁止的路径
- 如需要添加自己的检查
Q: Can I modify the Markdown output format?
A: Yes, since this is open source:
- Clone the repository
- Modify the Markdown conversion logic
- Adjust the output format to your needs
- Run your customized version
问:我可以修改 Markdown 输出格式吗?
答:可以,因为这是开源的:
- 克隆存储库
- 修改 Markdown 转换逻辑
- 调整输出格式以满足您的需求
- 运行您的自定义版本
Q: How do I handle sites with pagination?
A: Strategies:
- Use generate_sitemap to find all paginated URLs
- Scrape each page individually with scrape_content
- Combine results in your application
- Look for “next page” links in the sitemap
问:如何处理带分页的站点?
答:策略:
- 使用 generate_sitemap 查找所有分页 URL
- 使用 scrape_content 单独抓取每个页面
- 在您的应用程序中组合结果
- 在站点地图中查找“下一页”链接
Setup & Configuration | 设置和配置
Q: Do I need to install this locally or can I use the Hugging Face Space directly?
A: Both options work:
Local Installation:
- More control and customization
- No external dependencies
- Better for high-volume usage
- Privacy (data stays local)
Hugging Face Space:
- No installation needed
- Always up-to-date
- Shared resources
- May have usage limits
问:我需要在本地安装还是可以直接使用 Hugging Face Space?
答:两种选项都可以:
本地安装:
- 更多控制和自定义
- 无外部依赖
- 更适合大量使用
- 隐私(数据保留在本地)
Hugging Face Space:
- 无需安装
- 始终保持最新
- 共享资源
- 可能有使用限制
Q: Which port should I use for MCP - 7861 or 7862?
A: Always use port 7862 for MCP connections.
- Port 7861: Web UI (for human use)
- Port 7862: MCP Server (for AI assistants)
问:我应该使用哪个端口连接 MCP - 7861 还是 7862?
答:始终使用 端口 7862 进行 MCP 连接。
- 端口 7861:Web 界面(供人类使用)
- 端口 7862:MCP Server(供 AI 助手使用)
Q: Can I run both the Web UI and MCP Server at the same time?
A: Yes! They run on different ports:
- Start both servers simultaneously
- Use Web UI for manual exploration
- Use MCP Server for automated workflows
- No conflict between them
问:我可以同时运行 Web 界面和 MCP Server 吗?
答:可以!它们运行在不同端口上:
- 同时启动两个服务器
- 使用 Web 界面进行手动探索
- 使用 MCP Server 进行自动化工作流
- 它们之间没有冲突
Advanced Usage | 高级用法
Custom Scraping Workflows | 自定义抓取工作流
For advanced users who want to build custom workflows on top of the MCP tools:
对于想要在 MCP 工具基础上构建自定义工作流的高级用户:
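A hedged sketch, assuming the official MCP Python SDK (mcp package) and its SSE client; the tool names follow this document, while argument and result shapes are assumptions that depend on the Space's implementation:

```python
# Batch-scrape several pages through the MCP server; a sketch, not the project's code.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

SERVER_URL = "http://localhost:7862/gradio_api/mcp/sse"  # MCP endpoint per this document

async def scrape_many(urls):
    pages = {}
    async with sse_client(SERVER_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            for url in urls:
                result = await session.call_tool("scrape_content", {"url": url})
                # Result shape is an assumption: take the first text content item.
                pages[url] = result.content[0].text
    return pages

if __name__ == "__main__":
    scraped = asyncio.run(scrape_many([
        "https://example.com/docs/intro",
        "https://example.com/docs/setup",
    ]))
    for url, markdown in scraped.items():
        print(url, "->", len(markdown), "characters of Markdown")
```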
Integration with Vector Databases | 与向量数据库集成
Use scraped content to populate a vector database for semantic search:
使用抓取的内容填充向量数据库以进行语义搜索:
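A minimal sketch assuming the chromadb and sentence-transformers packages; the input Markdown would come from scrape_content calls like the sketch in the previous subsection, and the collection and model names are illustrative:

```python
# Index scraped Markdown in a vector store for semantic search.
import chromadb
from sentence_transformers import SentenceTransformer

documents = {
    "https://example.com/docs/intro": "# Intro\n\nMarkdown scraped earlier...",
    "https://example.com/docs/setup": "# Setup\n\nMore scraped Markdown...",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("scraped-docs")

urls = list(documents)
texts = [documents[u] for u in urls]
collection.add(ids=urls, documents=texts, embeddings=model.encode(texts).tolist())

results = collection.query(
    query_embeddings=model.encode(["how do I install it?"]).tolist(),
    n_results=2,
)
print(results["ids"])
```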
Contributing & Community | 贡献与社区
How to Contribute | 如何贡献
This is a hackathon project hosted on Hugging Face. Ways to contribute:
这是一个托管在 Hugging Face 上的黑客马拉松项目。贡献方式:
Feedback
- Report bugs or issues
- Suggest new features
- Share use cases
Code Contributions
- Fork the space
- Implement improvements
- Submit pull requests
Documentation
- Improve usage examples
- Add tutorials
- Translate to other languages
Community
- Share your workflows
- Help other users
- Write blog posts about use cases
Roadmap & Future Enhancements | 路线图与未来增强
Potential improvements for future versions:
未来版本的潜在改进:
- JavaScript Support: Add headless browser for JS-rendered content
- Authentication: Support for basic auth and session management
- Advanced Filtering: Custom rules for content extraction
- Export Formats: Additional output formats (PDF, DOCX, HTML)
- Scheduling: Built-in scheduled scraping for monitoring
- Diff Detection: Automatic change detection between scrapes
- API Extensions: More granular control over scraping behavior
- Performance: Improved caching and parallel processing
Resources & References | 资源与参考
Official Links | 官方链接
- Hugging Face Space: https://huggingface.co/spaces/Agents-MCP-Hackathon/web-scraper
- Live Demo: Available on the Hugging Face Space
- Organization: Agents-MCP-Hackathon
Related Technologies | 相关技术
- MCP (Model Context Protocol): https://modelcontextprotocol.io
- Gradio: https://gradio.app
- Claude Desktop: https://claude.ai/desktop
Learning Resources | 学习资源
Web Scraping Best Practices:
- Respect robots.txt and website terms
- Implement rate limiting
- Handle errors gracefully
- Use appropriate user agents
MCP Development:
- MCP Specification
- Gradio MCP integration guide
- Building custom MCP servers
Markdown Format:
- CommonMark specification
- Markdown guide
- Converting HTML to Markdown
Conclusion | 结论
Web Scraper & Sitemap Generator is a versatile tool that bridges the gap between web content and structured, analyzable data. Its three-in-one approach combining content extraction, sitemap generation, and link analysis provides comprehensive website insights, while the dual-mode architecture makes it accessible for both manual and automated workflows.
Whether you’re migrating documentation, conducting SEO audits, preparing AI training data, or building knowledge bases, this tool provides a solid foundation for web content analysis. The MCP integration enables AI assistants like Claude to autonomously scrape and analyze websites, opening up new possibilities for intelligent content processing.
As a hackathon project with 49 likes on Hugging Face, it demonstrates the power of combining modern technologies (Gradio, MCP) to create practical, user-friendly tools. While it has limitations (no JavaScript rendering, public content only), it excels at its core mission: transforming web content into clean, structured Markdown with comprehensive site analysis.
Web Scraper & Sitemap Generator 是一个多功能工具,在网页内容和结构化、可分析的数据之间架起桥梁。它将内容提取、站点地图生成和链接分析结合在一起的三合一方法提供全面的网站洞察,而双模式架构使其适用于手动和自动化工作流。
无论您是在迁移文档、进行 SEO 审计、准备 AI 训练数据还是构建知识库,此工具都为网页内容分析提供了坚实的基础。MCP 集成使 Claude 等 AI 助手能够自主抓取和分析网站,为智能内容处理开辟了新的可能性。
作为在 Hugging Face 上获得 49 个赞的黑客马拉松项目,它展示了结合现代技术(Gradio、MCP)创建实用、用户友好工具的力量。虽然它有局限性(无 JavaScript 渲染、仅公共内容),但它在其核心使命上表现出色:将网页内容转换为干净、结构化的 Markdown,并提供全面的站点分析。
Last Updated: 2025-06-04
Project Status: Active (Hackathon Project)
Platform: Hugging Face Space
License: Check Hugging Face Space for license details