Web Researcher Mini

A web scraping and search tool built on the Firecrawl CLI, with multi-format content extraction and automated analysis.

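Item	Details
Who it is for	Researchers and content creators
Who it is not for	Users without network access and users unfamiliar with the command line
Availability in mainland China	May require network configuration so that third-party services are reachable.
Installation difficulty	Beginner-friendly (★☆☆). A preliminary rating based on terminal usage, dependencies, API key, and local environment requirements.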

Installation and download

Copy and run this command to install:

openclaw skills install @weilianglin100-sketch/web-researcher-mini

Skill description

Commands, parameters, and file names are kept as in the original.

Firecrawl CLI

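Use the firecrawl CLI to fetch and search web content. Firecrawl returns clean Markdown optimized for LLM context windows, supports JavaScript rendering, gets past common restrictions, and provides structured data.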

Installation

Check status, authentication, and rate limits:

firecrawl --status

Example output when everything is working:

  🔥 firecrawl cli v1.0.2

  ● Authenticated via FIRECRAWL_API_KEY
  Concurrency: 0/100 jobs (parallel scrape limit)
  Credits: 500,000 remaining

Concurrency: the maximum number of parallel jobs. Run close to this limit, but do not exceed it.
Credits: the remaining API credits. Every scrape or crawl consumes credits.
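
The concurrency limit can be read out of the status text and reused later, for example as a cap on parallel jobs. The sketch below assumes the output keeps the `Concurrency: <used>/<max> jobs` shape shown in the sample above; the function name `fc_max_concurrency` is hypothetical, and the real output may differ between CLI versions or include color codes.

```shell
# Hypothetical helper: read `firecrawl --status` text on stdin and print the
# maximum concurrency (the number after the slash on the "Concurrency:" line).
# Assumes a line such as "Concurrency: 0/100 jobs (...)", as in the sample output.
fc_max_concurrency() {
  sed -n 's/^[[:space:]]*Concurrency:[[:space:]]*[0-9][0-9]*\/\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Demo on the sample text from this document (no network or CLI needed).
sample_status='  Authenticated via FIRECRAWL_API_KEY
  Concurrency: 0/100 jobs (parallel scrape limit)
  Credits: 500,000 remaining'
printf '%s\n' "$sample_status" | fc_max_concurrency
```

With the CLI installed, the same function would be fed by `firecrawl --status | fc_max_concurrency`.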

If it is not installed yet, run:

npm install -g firecrawl-cli

If the user is not logged in, see the installation rules in [rules/install.md](rules/install.md) for more information.

Authentication

If not authenticated, run:

firecrawl login --browser

The --browser flag opens the browser automatically to authenticate, so nothing has to be typed in by hand.

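Organization and storage

Create a .firecrawl/ folder in the working directory (if it does not exist) to store results. Add .firecrawl/ to .gitignore if it is not already listed. Always write directly to a file with -o to avoid flooding the context window:

# Search the web (the most common operation)
firecrawl search "your query" -o .firecrawl/search-{query}.json

# Search with scraping enabled
firecrawl search "your query" --scrape -o .firecrawl/search-{query}-scraped.json

# Scrape a single page
firecrawl scrape https://example.com -o .firecrawl/{site}-{path}.md

Example file names: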

.firecrawl/search-react_server_components.json
.firecrawl/search-ai_news-scraped.json
.firecrawl/docs.github.com-actions-overview.md
.firecrawl/firecrawl.dev.md
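
The setup and naming steps above can be scripted. This is a minimal sketch, not part of the Firecrawl CLI: the function names `fc_init_workspace`, `fc_query_slug`, and `fc_url_slug` are hypothetical, and the slug rules are inferred from the example file names.

```shell
# Create .firecrawl/ and make sure it is listed in .gitignore (safe to re-run).
fc_init_workspace() {
  mkdir -p .firecrawl
  touch .gitignore
  # -x: match the whole line, -F: literal string, -q: quiet
  grep -qxF '.firecrawl/' .gitignore || printf '%s\n' '.firecrawl/' >> .gitignore
}

# Turn a search query into a file-name slug: lowercase, and every run of
# characters outside a-z0-9 becomes a single "_".
fc_query_slug() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9][^a-z0-9]*/_/g; s/^_//; s/_$//'
}

# Turn a URL into "{site}-{path}": drop the scheme, any query string or
# fragment, and trailing slashes, then replace "/" with "-".
fc_url_slug() {
  printf '%s' "$1" \
    | sed 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||; s|[?#].*$||; s|/*$||; s|/|-|g'
}

# Demo: print slugs for two inputs taken from the example file names.
fc_query_slug "React Server Components"; echo
fc_url_slug "https://docs.github.com/actions/overview"; echo
```

A typical call would then look like `firecrawl scrape "$url" -o ".firecrawl/$(fc_url_slug "$url").md"` after running `fc_init_workspace` once in the project root.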

Commands

Search - Web search (with optional scraping)

# Basic search (human-readable output)
firecrawl search "your query" -o .firecrawl/search-query.txt

# JSON output (recommended for parsing)
firecrawl search "your query" -o .firecrawl/search-query.json --json

# Limit the number of results
firecrawl search "AI news" --limit 10 -o .firecrawl/search-ai-news.json --json

# Choose search sources
firecrawl search "tech startups" --sources news -o .firecrawl/search-news.json --json
firecrawl search "landscapes" --sources images -o .firecrawl/search-images.json --json
firecrawl search "machine learning" --sources web,news,images -o .firecrawl/search-ml.json --json

# Filter by category (GitHub repositories, research papers, PDF files)
firecrawl search "web scraping python" --categories github -o .firecrawl/search-github.json --json
firecrawl search "transformer architecture" --categories research -o .firecrawl/search-research.json --json

# Time-based search
firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/search-today.json --json  # past day
firecrawl search "tech news" --tbs qdr:w -o .firecrawl/search-week.json --json          # past week

# Location-based search
firecrawl search "restaurants" --location "San Francisco,California,United States" -o .firecrawl/search-sf.json --json
firecrawl search "local news" --country DE -o .firecrawl/search-germany.json --json

# Search and scrape the content of the results
firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json
firecrawl search "API docs" --scrape --scrape-formats markdown,links -o .firecrawl/search-docs.json --json

Search options:

Option	Description
`--limit <n>`	Maximum number of results (default: 5, max: 100)
`--sources <sources>`	Comma-separated: web, images, news (default: web)
`--categories <categories>`	Comma-separated: github, research, pdf
`--tbs <value>`	Time filter: qdr:h (hour), qdr:d (day), qdr:w (week), qdr:m (month), qdr:y (year)
`--location <location>`	Geographic targeting (e.g. "Germany")
`--country <code>`	ISO country code (default: US)
`--scrape`	Scrape the search results
`--scrape-formats <formats>`	Formats to scrape when `--scrape` is enabled (default: markdown)
`-o, --output <path>`	Save to a file
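
The `--tbs` codes are easy to mistype, so a small lookup can help when composing searches in scripts. The helper name `fc_tbs` is hypothetical; it only maps the five values listed in the table and prints the resulting command rather than running it.

```shell
# Hypothetical helper: map a plain word to the matching --tbs value.
fc_tbs() {
  case "$1" in
    hour)  echo "qdr:h" ;;
    day)   echo "qdr:d" ;;
    week)  echo "qdr:w" ;;
    month) echo "qdr:m" ;;
    year)  echo "qdr:y" ;;
    *)     echo "unknown time range: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) a search command, so this works without the CLI installed.
printf '%s\n' "firecrawl search \"AI news\" --tbs $(fc_tbs week) --limit 10 -o .firecrawl/search-ai_news.json --json"
```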

Scrape - Extract content from a single page

# Basic scrape (Markdown output)
firecrawl scrape https://example.com -o .firecrawl/example.md

# Get the raw HTML
firecrawl scrape https://example.com --html -o .firecrawl/example.html

# Multiple formats (output is JSON)
firecrawl scrape https://example.com --format markdown,links -o .firecrawl/example.json

# Extract the main content only (drops navigation bars, footers, ads, and so on)
firecrawl scrape https://example.com --only-main-content -o .firecrawl/example.md

# Wait for JS rendering to finish
firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md

# Extract links only
firecrawl scrape https://example.com --format links -o .firecrawl/links.json

# Include or exclude specific HTML tags
firecrawl scrape https://example.com --include-tags article,main -o .firecrawl/article.md
firecrawl scrape https://example.com --exclude-tags nav,aside,.ad -o .firecrawl/clean.md

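Scrape options:

Option	Description
`-f, --format <formats>`	Output formats: markdown, html, rawHtml, links, screenshot, json
`-H, --html`	Shortcut for `--format html`
`--only-main-content`	Extract the main content only
`--wait-for <ms>`	Milliseconds to wait before scraping (for JS-rendered content)
`--include-tags <tags>`	Include only the specified HTML tags
`--exclude-tags <tags>`	Exclude the specified HTML tags
`-o, --output <path>`	Save to a file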

Crawl - Crawl an entire website

Quick start

# Start a crawl (returns a job ID)
firecrawl crawl https://example.com

# Wait for the crawl to finish
firecrawl crawl https://example.com --wait

# With a progress indicator
firecrawl crawl https://example.com --wait --progress

# Check crawl status
firecrawl crawl <job-id>

# Limit the number of pages and the crawl depth
firecrawl crawl https://example.com --limit 100 --max-depth 3

# Crawl only the blog sections
firecrawl crawl https://example.com --include-paths /blog,/posts

# Exclude admin pages
firecrawl crawl https://example.com --exclude-paths /admin,/login

# Throttle requests
firecrawl crawl https://example.com --delay 1000 --max-concurrency 2

# Save the results (under .firecrawl/, like all other output)
firecrawl crawl https://example.com --wait -o .firecrawl/crawl-results.json --pretty

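Crawl options

Option	Description
`--wait`	Wait for the crawl to finish
`--progress`	Show progress while waiting
`--limit <n>`	Maximum number of pages to crawl
`--max-depth <n>`	Maximum crawl depth
`--include-paths <paths>`	Crawl only matching paths
`--exclude-paths <paths>`	Skip matching paths
`--delay <ms>`	Delay between requests, in milliseconds
`--max-concurrency <n>`	Maximum number of concurrent requests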

Map - Discover all URLs on a site

# List all URLs (one per line)
firecrawl map https://example.com -o .firecrawl/urls.txt

# Output as JSON
firecrawl map https://example.com --json -o .firecrawl/urls.json

# Filter URLs by a search term
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt

# Limit the number of results
firecrawl map https://example.com --limit 500 -o .firecrawl/urls.txt

# Include subdomains
firecrawl map https://example.com --include-subdomains -o .firecrawl/all-urls.txt

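Map options

Option	Description
`--limit <n>`	Maximum number of URLs to discover
`--search <query>`	Filter URLs by a search term
`--sitemap <mode>`	Sitemap handling: include, skip, or only
`--include-subdomains`	Include subdomains
`--json`	Output as JSON
`-o, --output <path>`	Save to a file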
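
A map result is often easier to act on once it is summarized. The sketch below counts URLs per top-level path segment, which can help decide what to pass to `--include-paths` or `--exclude-paths` on a later crawl. It assumes the plain-text output format with one URL per line; the function name `fc_url_sections` and the sample URLs are hypothetical.

```shell
# Read URLs (one per line) on stdin and print "<count><TAB><first path segment>",
# most frequent segment first. URLs without a path are counted under "/".
fc_url_sections() {
  awk -F/ 'NF >= 3 {
             seg = "/"
             if (NF >= 4 && $4 != "") seg = ("/" $4)
             sub(/[?#].*$/, "", seg)   # drop any query string or fragment
             count[seg]++
           }
           END { for (s in count) printf "%d\t%s\n", count[s], s }' \
    | sort -k1,1nr -k2,2
}

# Demo with made-up URLs; in practice: fc_url_sections < .firecrawl/urls.txt
printf '%s\n' \
  'https://example.com/' \
  'https://example.com/blog/first-post' \
  'https://example.com/blog/second-post' \
  'https://example.com/docs/intro' | fc_url_sections
```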

Credit usage

# Show credit usage
firecrawl credit-usage

# Output as JSON
firecrawl credit-usage --json --pretty

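Reading scraped files

Never read an entire firecrawl output file at once unless explicitly asked to; a file can run to 1,000 lines or more. Use grep, head, or incremental reads instead:

# Check the file size and preview its structure
wc -l .firecrawl/file.md && head -50 .firecrawl/file.md

# Use grep to find specific content
grep -n "keyword" .firecrawl/file.md
grep -A 10 "## Section" .firecrawl/file.md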
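
The same rule can be wrapped in a helper that refuses to dump long files. This is a sketch only: `fc_peek` is a hypothetical name, and the 200-line threshold and 50-line preview are arbitrary choices.

```shell
# Print a file in full only when it is short; otherwise print a one-line
# summary followed by the first 50 lines.
# Usage: fc_peek <file> [max_lines]   (max_lines defaults to 200)
fc_peek() {
  fc_peek_file=$1
  fc_peek_max=${2:-200}
  fc_peek_lines=$(wc -l < "$fc_peek_file" | tr -d ' ')
  if [ "$fc_peek_lines" -le "$fc_peek_max" ]; then
    cat "$fc_peek_file"
  else
    echo "[$fc_peek_file: $fc_peek_lines lines, showing the first 50]"
    head -n 50 "$fc_peek_file"
  fi
}
```

For example, `fc_peek .firecrawl/file.md` shows at most 51 lines for a large file, after which grep can be used to jump to the relevant section.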

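Parallel processing

Use & and wait to run several scrapes in parallel:

# Parallel scrapes (fast)
firecrawl scrape https://site1.com -o .firecrawl/1.md &
firecrawl scrape https://site2.com -o .firecrawl/2.md &
firecrawl scrape https://site3.com -o .firecrawl/3.md &
wait

For many URLs, use xargs with -P to run in parallel:

# md5sum is the GNU/Linux tool; on macOS replace "md5sum | cut -d" " -f1" with "md5 -q"
xargs -P 10 -I {} sh -c 'firecrawl scrape "$1" -o ".firecrawl/$(printf %s "$1" | md5sum | cut -d" " -f1).md"' _ {} < urls.txt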
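
Hash-based file names depend on which digest tool the system provides, so a small wrapper keeps scripts portable. The names `url_hash` and `fc_out_path` are hypothetical, and the `cksum` fallback is an assumption for systems that have neither MD5 tool (it yields a CRC, not an MD5 digest).

```shell
# Print a stable hash of a string, using whichever tool is installed.
url_hash() {
  if command -v md5sum >/dev/null 2>&1; then
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1      # GNU coreutils
  elif command -v md5 >/dev/null 2>&1; then
    printf '%s' "$1" | md5 -q                        # macOS / BSD
  else
    printf '%s' "$1" | cksum | cut -d ' ' -f 1       # last resort: CRC, not MD5
  fi
}

# Build the output path for a URL under .firecrawl/.
fc_out_path() {
  printf '.firecrawl/%s.md\n' "$(url_hash "$1")"
}

# Sequential usage sketch (commented out because it needs the CLI and urls.txt):
# while IFS= read -r url; do
#   firecrawl scrape "$url" -o "$(fc_out_path "$url")"
# done < urls.txt
fc_out_path "https://example.com"
```

Hashing keeps file names short and free of characters that are awkward in paths, at the cost of readability; keeping a separate list that maps each URL to its hash makes results easier to find again.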

Combining with other tools

# Extract URLs from search results
jq -r '.data.web[].url' .firecrawl/search-query.json

# Get titles from search results
jq -r '.data.web[] | "\(.title): \(.url)"' .firecrawl/search-query.json

# Count the URLs discovered by map
firecrawl map https://example.com | wc -l
