diff --git a/README.md b/README.md index b1fa028..52d0c5a 100644 --- a/README.md +++ b/README.md @@ -5,94 +5,74 @@ [English](./docs/README_EN.md) | 简体中文 -**通过 MCP 协议将 Grok 搜索能力集成到 Claude,显著增强文档检索与事实核查能力** +**Grok-with-Tavily MCP,为 Claude Code 提供更完善的网络访问能力** -[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) -[![FastMCP](https://img.shields.io/badge/FastMCP-2.0.0+-green.svg)](https://github.com/jlowin/fastmcp) +[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![FastMCP](https://img.shields.io/badge/FastMCP-2.0.0+-green.svg)](https://github.com/jlowin/fastmcp) --- -## 概述 +## 一、概述 -Grok Search MCP 是一个基于 [FastMCP](https://github.com/jlowin/fastmcp) 构建的 MCP(Model Context Protocol)服务器,通过转接第三方平台(如 Grok)的强大搜索能力,为 Claude、Claude Code 等 AI 模型提供实时网络搜索功能。 +Grok Search MCP 是一个基于 [FastMCP](https://github.com/jlowin/fastmcp) 构建的 MCP 服务器,采用**双引擎架构**:**Grok** 负责 AI 驱动的智能搜索,**Tavily** 负责高保真网页抓取与站点映射,各取所长为 Claude Code / Cherry Studio 等LLM Client提供完整的实时网络访问能力。 -### 核心价值 -- **突破知识截止限制**:让 Claude 访问最新的网络信息,不再受训练数据时间限制 -- **增强事实核查**:实时搜索验证信息的准确性和时效性 -- **结构化输出**:返回包含标题、链接、摘要的标准化 JSON,便于 AI 模型理解与引用 -- **即插即用**:通过 MCP 协议无缝集成到 Claude Desktop、Claude Code 等客户端 - - -**工作流程**:`Claude → MCP → Grok API → 搜索/抓取 → 结构化返回` - -
-💡 更多选择Grok search 的理由 -与其他搜索方案对比: - -| 特性 | Grok Search MCP | Google Custom Search API | Bing Search API | SerpAPI | -|------|----------------|-------------------------|-----------------|---------| -| **AI 优化结果** | ✅ 专为 AI 理解优化 | ❌ 通用搜索结果 | ❌ 通用搜索结果 | ❌ 通用搜索结果 | -| **内容摘要质量** | ✅ AI 生成高质量摘要 | ⚠️ 需二次处理 | ⚠️ 需二次处理 | ⚠️ 需二次处理 | -| **实时性** | ✅ 实时网络数据 | ✅ 实时 | ✅ 实时 | ✅ 实时 | -| **集成复杂度** | ✅ MCP 即插即用 | ⚠️ 需自行开发 | ⚠️ 需自行开发 | ⚠️ 需自行开发 | -| **返回格式** | ✅ AI 友好 JSON | ⚠️ 需格式化 | ⚠️ 需格式化 | ⚠️ 需格式化 | - -## 功能特性 - -- ✅ OpenAI 兼容接口,环境变量配置 -- ✅ 实时网络搜索 + 网页内容抓取 -- ✅ 支持指定搜索平台(Twitter、Reddit、GitHub 等) -- ✅ 配置测试工具(连接测试 + API Key 脱敏) -- ✅ 动态模型切换(支持切换不同 Grok 模型并持久化保存) -- ✅ **工具路由控制(一键禁用官方 WebSearch/WebFetch,强制使用 GrokSearch)** -- ✅ **自动时间注入(搜索时自动获取本地时间,确保时间相关查询的准确性)** -- ✅ 可扩展架构,支持添加其他搜索 Provider -
- -## 安装教程 -### Step 0.前期准备(若已经安装uv则跳过该步骤) - -
+```
+Claude ──MCP──► Grok Search Server
+       ├─ web_search ───► Grok API(AI 搜索)
+       ├─ web_fetch  ───► Tavily Extract → Firecrawl Scrape(内容抓取,自动降级)
+       └─ web_map   ───► Tavily Map(站点映射)
+```

-**Python 环境**:
-- Python 3.10 或更高版本
-- 已配置 Claude Code 或 Claude Desktop

+### 功能特性

-**uv 工具**(推荐的 Python 包管理器):

+- **双引擎**:Grok 搜索 + Tavily 抓取/映射,互补协作
+- **Firecrawl 托底**:Tavily 提取失败时自动降级到 Firecrawl Scrape,支持空内容自动重试
+- **OpenAI 兼容接口**,支持任意 Grok 镜像站
+- **自动时间注入**(检测时间相关查询,注入本地时间上下文)
+- 一键禁用 Claude Code 官方 WebSearch/WebFetch,强制路由到本工具
+- 智能重试(支持 Retry-After 头解析 + 指数退避)
+- 父进程监控(Windows 下自动检测父进程退出,防止僵尸进程)

-请确保您已成功安装 [uv 工具](https://docs.astral.sh/uv/getting-started/installation/):

+### 效果展示
+我们以在 `cherry studio` 中配置本 MCP 为例,展示 `claude-opus-4.6` 模型如何通过本项目搜集外部知识、降低幻觉率。
+![](./images/wogrok.png)
+如上图,**为保证实验公平,我们开启了 Claude 模型内置的搜索工具**,然而 Opus 4.6 仍然依赖自身内部知识,没有查询 FastAPI 官方文档来获取最新示例。
+![](./images/wgrok.png)
+如上图,在相同的实验条件下开启 `grok-search` MCP 后,Opus 4.6 主动发起多次搜索,**获取官方文档,回答更可靠。**

-#### Windows 安装 uv
-在 PowerShell 中运行以下命令:
-```powershell
-powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
-```

+## 二、安装

-**💡 重要提示** :我们 **强烈推荐** Windows 用户在 WSL(Windows Subsystem for Linux)中运行本项目!

+### 前置条件

-#### Linux/macOS 安装 uv

+- Python 3.10+
+- [uv](https://docs.astral.sh/uv/getting-started/installation/)(推荐的 Python 包管理器)
+- Claude Code

-使用 curl 或 wget 下载并安装:
+安装 uv ```bash -# 使用 curl +# Linux/macOS curl -LsSf https://astral.sh/uv/install.sh | sh -# 或使用 wget -wget -qO- https://astral.sh/uv/install.sh | sh +# Windows PowerShell +powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" ``` +> Windows 用户**强烈推荐**在 WSL 中运行本项目。 +
+### 一键安装 +若之前安装过本项目,使用以下命令卸载旧版MCP。 +``` +claude mcp remove grok-search +``` -### Step 1. 安装 Grok Search MCP -使用 `claude mcp add-json` 一键安装并配置: -**注意:** 需要替换 **GROK_API_URL** 以及 **GROK_API_KEY**这两个字段为你自己的站点以及密钥,目前只支持openai格式,所以如果需要使用grok,也需要使用转为openai格式的grok镜像站 +将以下命令中的环境变量替换为你自己的值后执行。Grok 接口需为 OpenAI 兼容格式;Tavily 为可选配置,未配置时工具 `web_fetch` 和 `web_map` 不可用。 ```bash claude mcp add-json grok-search --scope user '{ @@ -100,379 +80,168 @@ claude mcp add-json grok-search --scope user '{ "command": "uvx", "args": [ "--from", - "git+https://github.com/GuDaStudio/GrokSearch", + "git+https://github.com/GuDaStudio/GrokSearch@grok-with-tavily", "grok-search" ], "env": { "GROK_API_URL": "https://your-api-endpoint.com/v1", - "GROK_API_KEY": "your-api-key-here" + "GROK_API_KEY": "your-grok-api-key", + "TAVILY_API_KEY": "tvly-your-tavily-key", + "TAVILY_API_URL": "https://api.tavily.com" } }' ``` +除此之外,你还可以在`env`字段中配置更多环境变量 -### Step 2. 验证安装 & 检查MCP配置 - -```bash -claude mcp list -``` +| 变量 | 必填 | 默认值 | 说明 | +|------|------|--------|------| +| `GROK_API_URL` | ✅ | - | Grok API 地址(OpenAI 兼容格式) | +| `GROK_API_KEY` | ✅ | - | Grok API 密钥 | +| `GROK_MODEL` | ❌ | `grok-4-fast` | 默认模型(设置后优先于 `~/.config/grok-search/config.json`) | +| `TAVILY_API_KEY` | ❌ | - | Tavily API 密钥(用于 web_fetch / web_map) | +| `TAVILY_API_URL` | ❌ | `https://api.tavily.com` | Tavily API 地址 | +| `TAVILY_ENABLED` | ❌ | `true` | 是否启用 Tavily | +| `FIRECRAWL_API_KEY` | ❌ | - | Firecrawl API 密钥(Tavily 失败时托底) | +| `FIRECRAWL_API_URL` | ❌ | `https://api.firecrawl.dev/v2` | Firecrawl API 地址 | +| `GROK_DEBUG` | ❌ | `false` | 调试模式 | +| `GROK_LOG_LEVEL` | ❌ | `INFO` | 日志级别 | +| `GROK_LOG_DIR` | ❌ | `logs` | 日志目录 | +| `GROK_RETRY_MAX_ATTEMPTS` | ❌ | `3` | 最大重试次数 | +| `GROK_RETRY_MULTIPLIER` | ❌ | `1` | 重试退避乘数 | +| `GROK_RETRY_MAX_WAIT` | ❌ | `10` | 重试最大等待秒数 | -应能看到 `grok-search` 服务器已注册。 -配置完成后,**强烈建议**在 Claude 对话中运行配置测试,以确保一切正常: +### 验证安装 -在 Claude 对话中输入: -``` -请测试 Grok Search 的配置 +```bash +claude mcp list ``` -或直接说: +🍟 
确认连接成功后,我们**强烈建议**在 Claude 对话中输入:
```
调用 grok-search 的 toggle_builtin_tools,关闭 Claude Code 内置的 WebSearch 和 WebFetch 工具
```
工具将自动修改**项目级** `.claude/settings.json` 的 `permissions.deny`,一键禁用 Claude Code 官方的 WebSearch 和 WebFetch,从而强制 Claude Code 通过本项目执行搜索。

-工具会自动执行以下检查:
-- ✅ 验证环境变量是否正确加载
-- ✅ 测试 API 连接(向 `/models` 端点发送请求)
-- ✅ 显示响应时间和可用模型数量
-- ✅ 识别并报告任何配置错误

-如果看到 `❌ 连接失败` 或 `⚠️ 连接异常`,请检查:
-- API URL 是否正确
-- API Key 是否有效
-- 网络连接是否正常

+## 三、MCP 工具介绍

-### Step 3. 配置系统提示词
-为了更好的使用Grok Search 可以通过配置Claude Code或者类似的系统提示词来对整体Vibe Coding Cli进行优化,以Claude Code 为例可以编辑 ~/.claude/CLAUDE.md中追加下面内容,提供了两版使用详细版更能激活工具的能力:

-**💡 提示**:现在可以使用 `toggle_builtin_tools` 工具一键禁用官方 WebSearch/WebFetch,强制路由到 GrokSearch!

-#### 精简版提示词
-```markdown
-# Grok Search 提示词 精简版
-## 激活与路由
-**触发**:网络搜索/网页抓取/最新信息查询时自动激活
-**替换**:尽可能使用 Grok-search的工具替换官方原生search以及fetch功能
-
-## 工具矩阵
-
-| Tool | Parameters | Output | Use Case |
-|------|------------|--------|----------|
-| `web_search` | `query`(必填), `platform`/`min_results`/`max_results`(可选) | `[{title,url,content}]` | 多源聚合/事实核查/最新资讯 |
-| `web_fetch` | `url`(必填) | Structured Markdown | 完整内容获取/深度分析 |
-| `get_config_info` | 无 | `{api_url,status,test}` | 连接诊断 |
-| `switch_model` | `model`(必填) | `{status,previous_model,current_model}` | 切换Grok模型/性能优化 |
-| `toggle_builtin_tools` | `action`(可选: on/off/status) | `{blocked,deny_list,file}` | 禁用/启用官方工具 |
-
-## 执行策略
-**查询构建**:广度用 `web_search`,深度用 `web_fetch`,特定平台设 `platform` 参数
-**搜索执行**:优先摘要 → 关键 URL 补充完整内容 → 结果不足调整查询重试(禁止放弃)
-**结果整合**:交叉验证 + **强制标注来源** `[标题](URL)` + 时间敏感信息注明日期
-
-## 错误恢复
-
-连接失败 → `get_config_info` 检查 | 无结果 → 放宽查询条件 | 超时 → 搜索替代源
-
-
-## 核心约束
-
-✅ 强制 GrokSearch 工具 + 输出必含来源引用 + 失败必重试 + 关键信息必验证
-❌ 禁止无来源输出 + 禁止单次放弃 + 禁止未验证假设
-```

-#### 详细版提示词
-💡 Grok Search Enhance 系统提示词(详细版)(点击展开) - -````markdown - - # Grok Search Enhance 系统提示词(详细版) - - ## 0. Module Activation - **触发条件**:当需要执行以下操作时,自动激活本模块: - - 网络搜索 / 信息检索 / 事实核查 - - 获取网页内容 / URL 解析 / 文档抓取 - - 查询最新信息 / 突破知识截止限制 - - ## 1. Tool Routing Policy - - ### 强制替换规则 - | 需求场景 | ❌ 禁用 (Built-in) | ✅ 强制使用 (GrokSearch) | - | :--- | :--- | :--- | - | 网络搜索 | `WebSearch` | `mcp__grok-search__web_search` | - | 网页抓取 | `WebFetch` | `mcp__grok-search__web_fetch` | - | 配置诊断 | N/A | `mcp__grok-search__get_config_info` | - - ### 工具能力矩阵 - -| Tool | Parameters | Output | Use Case | -|------|------------|--------|----------| -| `web_search` | `query`(必填), `platform`/`min_results`/`max_results`(可选) | `[{title,url,content}]` | 多源聚合/事实核查/最新资讯 | -| `web_fetch` | `url`(必填) | Structured Markdown | 完整内容获取/深度分析 | -| `get_config_info` | 无 | `{api_url,status,test}` | 连接诊断 | -| `switch_model` | `model`(必填) | `{status,previous_model,current_model}` | 切换Grok模型/性能优化 | -| `toggle_builtin_tools` | `action`(可选: on/off/status) | `{blocked,deny_list,file}` | 禁用/启用官方工具 | - - - ## 2. Search Workflow - - ### Phase 1: 查询构建 (Query Construction) - 1. **意图识别**:分析用户需求,确定搜索类型: - - **广度搜索**:多源信息聚合 → 使用 `web_search` - - **深度获取**:单一 URL 完整内容 → 使用 `web_fetch` - 2. **参数优化**: - - 若需聚焦特定平台,设置 `platform` 参数 - - 根据需求复杂度调整 `min_results` / `max_results` - - ### Phase 2: 搜索执行 (Search Execution) - 1. **首选策略**:优先使用 `web_search` 获取结构化摘要 - 2. **深度补充**:若摘要不足以回答问题,对关键 URL 调用 `web_fetch` 获取完整内容 - 3. **迭代检索**:若首轮结果不满足需求,**调整查询词**后重新搜索(禁止直接放弃) - - ### Phase 3: 结果整合 (Result Synthesis) - 1. **信息验证**:交叉比对多源结果,识别矛盾信息 - 2. **时效标注**:对时间敏感信息,**必须**标注信息来源与时间 - 3. **引用规范**:输出中**强制包含**来源 URL,格式:`[标题](URL)` - - ## 3. Error Handling - - | 错误类型 | 诊断方法 | 恢复策略 | - | :--- | :--- | :--- | - | 连接失败 | 调用 `get_config_info` 检查配置 | 提示用户检查 API URL / Key | - | 无搜索结果 | 检查 query 是否过于具体 | 放宽搜索词,移除限定条件 | - | 网页抓取超时 | 检查 URL 可访问性 | 尝试搜索替代来源 | - | 内容被截断 | 检查目标页面结构 | 分段抓取或提示用户直接访问 | - - ## 4. 
Anti-Patterns - - | ❌ 禁止行为 | ✅ 正确做法 | - | :--- | :--- | - | 搜索后不标注来源 | 输出**必须**包含 `[来源](URL)` 引用 | - | 单次搜索失败即放弃 | 调整参数后至少重试 1 次 | - | 假设网页内容而不抓取 | 对关键信息**必须**调用 `web_fetch` 验证 | - | 忽略搜索结果的时效性 | 时间敏感信息**必须**标注日期 | - - --- - 模块说明: - - 强制替换:明确禁用内置工具,强制路由到 GrokSearch - - 三工具覆盖:web_search + web_fetch + get_config_info - - 错误处理:包含配置诊断的恢复策略 - - 引用规范:强制标注来源,符合信息可追溯性要求 -```` +本项目提供八个 MCP 工具(展开查看) -
- -### 详细项目介绍 - -#### MCP 工具说明 +### `web_search` — AI 网络搜索 -本项目提供五个 MCP 工具: +通过 Grok API 执行 AI 驱动的网络搜索,默认仅返回 Grok 的回答正文,并返回 `session_id` 以便后续获取信源。 -##### `web_search` - 网络搜索 +`web_search` 输出不展开信源,仅返回 `sources_count`;信源会按 `session_id` 缓存在服务端,可用 `get_sources` 拉取。 | 参数 | 类型 | 必填 | 默认值 | 说明 | |------|------|------|--------|------| | `query` | string | ✅ | - | 搜索查询语句 | -| `platform` | string | ❌ | `""` | 聚焦搜索平台(如 `"Twitter"`, `"GitHub, Reddit"`) | -| `min_results` | int | ❌ | `3` | 最少返回结果数 | -| `max_results` | int | ❌ | `10` | 最多返回结果数 | +| `platform` | string | ❌ | `""` | 聚焦平台(如 `"Twitter"`, `"GitHub, Reddit"`) | +| `model` | string | ❌ | `null` | 按次指定 Grok 模型 ID | +| `extra_sources` | int | ❌ | `0` | 额外补充信源数量(Tavily/Firecrawl,可为 0 关闭) | -**返回**:包含 `title`、`url`、`content` 的 JSON 数组 +自动检测查询中的时间相关关键词(如"最新""今天""recent"等),注入本地时间上下文以提升时效性搜索的准确度。 +返回值(结构化字典): +- `session_id`: 本次查询的会话 ID +- `content`: Grok 回答正文(已自动剥离信源) +- `sources_count`: 已缓存的信源数量 -
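上述"信源按 `session_id` 缓存、正文与信源分离返回"的模式,可以用一个进程内字典的草图示意(函数名与实现均为假设,并非项目真实代码):

```python
import uuid

_SOURCE_CACHE: dict[str, list[dict]] = {}  # session_id -> 信源列表(进程内示意)

def cache_search_result(content: str, sources: list[dict]) -> dict:
    """模拟 web_search 的返回:正文直接给出,信源只返回计数,本体留在缓存。"""
    session_id = uuid.uuid4().hex
    _SOURCE_CACHE[session_id] = sources
    return {"session_id": session_id, "content": content, "sources_count": len(sources)}

def fetch_sources(session_id: str) -> dict:
    """模拟 get_sources:按 session_id 取回完整信源列表。"""
    sources = _SOURCE_CACHE.get(session_id, [])
    return {"session_id": session_id, "sources_count": len(sources), "sources": sources}

result = cache_search_result("Grok 的回答正文", [{"url": "https://example.com", "title": "示例信源"}])
print(result["sources_count"])                                   # 1
print(fetch_sources(result["session_id"])["sources"][0]["url"])  # https://example.com
```

这样拆分的好处是:默认响应保持精简,只有在需要引用信源时才通过 `get_sources` 拉取完整列表。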
-返回示例(点击展开) - -```json -[ - { - "title": "Claude Code - Anthropic官方CLI工具", - "url": "https://claude.com/claude-code", - "description": "Anthropic推出的官方命令行工具,支持MCP协议集成,提供代码生成和项目管理功能" - }, - { - "title": "Model Context Protocol (MCP) 技术规范", - "url": "https://modelcontextprotocol.io/docs", - "description": "MCP协议官方文档,定义了AI模型与外部工具的标准化通信接口" - }, - { - ... - } -] -``` -
+### `get_sources` — 获取信源 -##### `web_fetch` - 网页内容抓取 +通过 `session_id` 获取对应 `web_search` 的全部信源。 | 参数 | 类型 | 必填 | 说明 | |------|------|------|------| -| `url` | string | ✅ | 目标网页 URL | +| `session_id` | string | ✅ | `web_search` 返回的 `session_id` | -**功能**:获取完整网页内容并转换为结构化 Markdown,保留标题层级、列表、表格、代码块等元素 +返回值(结构化字典): +- `session_id` +- `sources_count` +- `sources`: 信源列表(每项包含 `url`,可能包含 `title`/`description`/`provider`) -
-返回示例(点击展开) +### `web_fetch` — 网页内容抓取 -```markdown ---- -source: https://modelcontextprotocol.io/docs/concepts/architecture -title: MCP 架构设计文档 -fetched_at: 2024-01-15T10:30:00Z ---- +通过 Tavily Extract API 获取完整网页内容,返回 Markdown 格式。Tavily 失败时自动降级到 Firecrawl Scrape 进行托底抓取。 -# MCP 架构设计文档 - -## 目录 -- [核心概念](#核心概念) -- [协议层次](#协议层次) -- [通信模式](#通信模式) - -## 核心概念 +| 参数 | 类型 | 必填 | 说明 | +|------|------|------|------| +| `url` | string | ✅ | 目标网页 URL | -Model Context Protocol (MCP) 是一个标准化的通信协议,用于连接 AI 模型与外部工具和数据源。 -... +### `web_map` — 站点结构映射 -更多信息请访问 [官方文档](https://modelcontextprotocol.io) -``` -
+通过 Tavily Map API 遍历网站结构,发现 URL 并生成站点地图。 +| 参数 | 类型 | 必填 | 默认值 | 说明 | +|------|------|------|--------|------| +| `url` | string | ✅ | - | 起始 URL | +| `instructions` | string | ❌ | `""` | 自然语言过滤指令 | +| `max_depth` | int | ❌ | `1` | 最大遍历深度(1-5) | +| `max_breadth` | int | ❌ | `20` | 每页最大跟踪链接数(1-500) | +| `limit` | int | ❌ | `50` | 总链接处理数上限(1-500) | +| `timeout` | int | ❌ | `150` | 超时秒数(10-150) | -##### `get_config_info` - 配置信息查询 +### `get_config_info` — 配置诊断 -**无需参数**。显示配置状态、测试 API 连接、返回响应时间和可用模型数量(API Key 自动脱敏) +无需参数。显示所有配置状态、测试 Grok API 连接、返回响应时间和可用模型列表(API Key 自动脱敏)。 -
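其中"API Key 自动脱敏"的思路可以用如下草图说明(保留的首尾字符数等细节为假设,并非项目真实实现):

```python
def mask_api_key(key: str, prefix: int = 4, suffix: int = 3) -> str:
    """对 API Key 脱敏:仅保留首尾少量字符,中间以 * 替代。"""
    if len(key) <= prefix + suffix:
        return "*" * len(key)  # 过短的 Key 全部打码
    return key[:prefix] + "*" * (len(key) - prefix - suffix) + key[-suffix:]

print(mask_api_key("sk-a1234567890bcdefxyz"))  # sk-a***************xyz
```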
-返回示例(点击展开) - -```json -{ - "api_url": "https://YOUR-API-URL/grok/v1", - "api_key": "sk-a*****************xyz", - "config_status": "✅ 配置完整", - "connection_test": { - "status": "✅ 连接成功", - "message": "成功获取模型列表 (HTTP 200),共 x 个模型", - "response_time_ms": 234.56 - } -} -``` - -
- -##### `switch_model` - 模型切换 +### `switch_model` — 模型切换 | 参数 | 类型 | 必填 | 说明 | |------|------|------|------| -| `model` | string | ✅ | 要切换到的模型 ID(如 `"grok-4-fast"`, `"grok-2-latest"`, `"grok-vision-beta"`) | - -**功能**: -- 切换用于搜索和抓取操作的默认 Grok 模型 -- 配置自动持久化到 `~/.config/grok-search/config.json` -- 支持跨会话保持设置 -- 适用于性能优化或质量对比测试 - -
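"配置持久化"部分的思路可以用如下草图示意(返回字段与写入格式为假设,并非项目真实实现):

```python
import json
import tempfile
from pathlib import Path

def switch_model(model: str, config_path: Path) -> dict:
    """把默认模型写入 JSON 配置文件,使设置跨会话保持(示意实现)。"""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    previous = config.get("model")
    config["model"] = model
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, ensure_ascii=False, indent=2))
    return {"status": "ok", "previous_model": previous, "current_model": model}

# 用法示意:此处写入临时目录,真实路径为 ~/.config/grok-search/config.json
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "config.json"
    switch_model("grok-4-fast", path)
    print(switch_model("grok-2-latest", path))
    # {'status': 'ok', 'previous_model': 'grok-4-fast', 'current_model': 'grok-2-latest'}
```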
-返回示例(点击展开) - -```json -{ - "status": "✅ 成功", - "previous_model": "grok-4-fast", - "current_model": "grok-2-latest", - "message": "模型已从 grok-4-fast 切换到 grok-2-latest", - "config_file": "/home/user/.config/grok-search/config.json" -} -``` - -**使用示例**: - -在 Claude 对话中输入: -``` -请将 Grok 模型切换到 grok-2-latest -``` +| `model` | string | ✅ | 模型 ID(如 `"grok-4-fast"`, `"grok-2-latest"`) | -或直接说: -``` -切换模型到 grok-vision-beta -``` - -
+切换后配置持久化到 `~/.config/grok-search/config.json`,跨会话保持。 -##### `toggle_builtin_tools` - 工具路由控制 +### `toggle_builtin_tools` — 工具路由控制 | 参数 | 类型 | 必填 | 默认值 | 说明 | |------|------|------|--------|------| -| `action` | string | ❌ | `"status"` | 操作类型:`"on"`/`"enable"`(禁用官方工具)、`"off"`/`"disable"`(启用官方工具)、`"status"`/`"check"`(查看状态) | - -**功能**: -- 控制项目级 `.claude/settings.json` 的 `permissions.deny` 配置 -- 禁用/启用 Claude Code 官方的 `WebSearch` 和 `WebFetch` 工具 -- 强制路由到 GrokSearch MCP 工具 -- 自动定位项目根目录(查找 `.git`) -- 保留其他配置项 - -
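对 `permissions.deny` 的修改逻辑大致如下(草图仅演示"合并 deny 列表并保留其余配置"的思路,并非项目真实实现):

```python
import json
import tempfile
from pathlib import Path

def deny_builtin_tools(settings_path: Path) -> dict:
    """向 .claude/settings.json 的 permissions.deny 追加官方工具,保留其他配置项。"""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    deny = set(settings.setdefault("permissions", {}).get("deny", []))
    deny.update({"WebSearch", "WebFetch"})
    settings["permissions"]["deny"] = sorted(deny)
    settings_path.write_text(json.dumps(settings, ensure_ascii=False, indent=2))
    return {"blocked": True, "deny_list": settings["permissions"]["deny"], "file": str(settings_path)}

# 用法示意:在临时目录中演示,真实场景下路径为项目根目录下的 .claude/settings.json
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "settings.json"
    path.write_text(json.dumps({"permissions": {"deny": ["Bash"]}, "model": "opus"}))
    print(deny_builtin_tools(path)["deny_list"])  # ['Bash', 'WebFetch', 'WebSearch']
    print(json.loads(path.read_text())["model"])  # opus,其他配置项被保留
```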
-返回示例(点击展开) - -```json -{ - "blocked": true, - "deny_list": ["WebFetch", "WebSearch"], - "file": "/path/to/project/.claude/settings.json", - "message": "官方工具已禁用" -} -``` - -**使用示例**: +| `action` | string | ❌ | `"status"` | `"on"` 禁用官方工具 / `"off"` 启用官方工具 / `"status"` 查看状态 | -``` -# 禁用官方工具(推荐) -禁用官方的 search 和 fetch 工具 - -# 启用官方工具 -启用官方的 search 和 fetch 工具 +修改项目级 `.claude/settings.json` 的 `permissions.deny`,一键禁用 Claude Code 官方的 WebSearch 和 WebFetch。 -# 检查当前状态 -显示官方工具的禁用状态 -``` +### `search_planning` — 搜索规划 +结构化搜索规划脚手架(分阶段、多轮),用于在执行复杂搜索前先生成可执行的搜索计划。
---- +## 四、常见问题
-

项目架构

(点击展开)
- -``` -src/grok_search/ -├── config.py # 配置管理(环境变量) -├── server.py # MCP 服务入口(注册工具) -├── logger.py # 日志系统 -├── utils.py # 格式化工具 -└── providers/ - ├── base.py # SearchProvider 基类 - └── grok.py # Grok API 实现 -``` - + +Q: 必须同时配置 Grok 和 Tavily 吗? + +A: Grok(`GROK_API_URL` + `GROK_API_KEY`)为必填,提供核心搜索能力。Tavily 和 Firecrawl 均为可选:配置 Tavily 后 `web_fetch` 优先使用 Tavily Extract,失败时降级到 Firecrawl Scrape;两者均未配置时 `web_fetch` 将返回配置错误提示。`web_map` 依赖 Tavily。
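上面描述的"Tavily 优先、Firecrawl 托底、空内容视为失败"的降级链,可以用如下草图示意(两个后端均为占位函数,不发起真实 API 调用):

```python
def fetch_with_fallback(url: str, providers: list) -> dict:
    """依次尝试各抓取后端:抛错或返回空内容时,自动降级到下一个。"""
    errors = []
    for name, extract in providers:
        try:
            content = extract(url)
            if content:  # 空内容视为失败,继续降级
                return {"provider": name, "content": content}
            errors.append(f"{name}: empty content")
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    return {"error": "所有抓取后端均失败", "details": errors}

# 占位后端:模拟 Tavily 返回空内容、Firecrawl 托底成功
providers = [
    ("tavily", lambda url: ""),
    ("firecrawl", lambda url: f"# 已抓取 {url}"),
]
print(fetch_with_fallback("https://example.com", providers)["provider"])  # firecrawl
```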
-## 常见问题 - -**Q: 如何获取 Grok API 访问权限?** -A: 注册第三方平台 → 获取 API Endpoint 和 Key → 使用 `claude mcp add-json` 配置 +
+ +Q: Grok API 地址需要什么格式? + +A: 需要 OpenAI 兼容格式的 API 地址(支持 `/chat/completions` 和 `/models` 端点)。如使用官方 Grok,需通过兼容 OpenAI 格式的镜像站访问。 +
-**Q: 配置后如何验证?** -A: 在 Claude 对话中说"显示 grok-search 配置信息",查看连接测试结果 +
+ +Q: 如何验证配置? + +A: 在 Claude 对话中说"显示 grok-search 配置信息",将自动测试 API 连接并显示结果。 +
## 许可证 -本项目采用 [MIT License](LICENSE) 开源。 +[MIT License](LICENSE) ---
-**如果这个项目对您有帮助,请给个 ⭐ Star!** +**如果这个项目对您有帮助,请给个 Star!** + [![Star History Chart](https://api.star-history.com/svg?repos=GuDaStudio/GrokSearch&type=date&legend=top-left)](https://www.star-history.com/#GuDaStudio/GrokSearch&type=date&legend=top-left)
diff --git a/docs/README_EN.md b/docs/README_EN.md index cd62a2a..a0b084e 100644 --- a/docs/README_EN.md +++ b/docs/README_EN.md @@ -1,95 +1,80 @@ -![Image](../pic/image.png) +![Image](../images/title.png)
-# Grok Search MCP + English | [简体中文](../README.md) -**Integrate Grok search capabilities into Claude via MCP protocol, significantly enhancing document retrieval and fact-checking abilities** +**Grok-with-Tavily MCP, providing enhanced web access for Claude Code** -[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) -[![FastMCP](https://img.shields.io/badge/FastMCP-2.0.0+-green.svg)](https://github.com/jlowin/fastmcp) +[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![FastMCP](https://img.shields.io/badge/FastMCP-2.0.0+-green.svg)](https://github.com/jlowin/fastmcp)
--- -## Overview +## 1. Overview -Grok Search MCP is an MCP (Model Context Protocol) server built on [FastMCP](https://github.com/jlowin/fastmcp), providing real-time web search capabilities for AI models like Claude and Claude Code by leveraging the powerful search capabilities of third-party platforms (such as Grok). +Grok Search MCP is an MCP server built on [FastMCP](https://github.com/jlowin/fastmcp), featuring a **dual-engine architecture**: **Grok** handles AI-driven intelligent search, while **Tavily** handles high-fidelity web content extraction and site mapping. Together they provide complete real-time web access for LLM clients such as Claude Code and Cherry Studio. -### Core Value -- **Break Knowledge Cutoff Limits**: Enable Claude to access the latest web information -- **Enhanced Fact-Checking**: Real-time search to verify information accuracy and timeliness -- **Structured Output**: Returns standardized JSON with title, link, and summary -- **Plug and Play**: Seamlessly integrates via MCP protocol - - -**Workflow**: `Claude → MCP → Grok API → Search/Fetch → Structured Return` - -## Why Choose Grok? 
- -Comparison with other search solutions: - -| Feature | Grok Search MCP | Google Custom Search API | Bing Search API | SerpAPI | -|---------|----------------|-------------------------|-----------------|---------| -| **AI-Optimized Results** | ✅ Optimized for AI understanding | ❌ General search results | ❌ General search results | ❌ General search results | -| **Content Summary Quality** | ✅ AI-generated high-quality summaries | ⚠️ Requires post-processing | ⚠️ Requires post-processing | ⚠️ Requires post-processing | -| **Real-time** | ✅ Real-time web data | ✅ Real-time | ✅ Real-time | ✅ Real-time | -| **Integration Complexity** | ✅ MCP plug and play | ⚠️ Requires development | ⚠️ Requires development | ⚠️ Requires development | -| **Return Format** | ✅ AI-friendly JSON | ⚠️ Requires formatting | ⚠️ Requires formatting | ⚠️ Requires formatting | +``` +Claude --MCP--> Grok Search Server + ├─ web_search ---> Grok API (AI Search) + ├─ web_fetch ---> Tavily Extract (Content Extraction) + └─ web_map ---> Tavily Map (Site Mapping) +``` -## Features +### Features -- ✅ OpenAI-compatible interface, environment variable configuration -- ✅ Real-time web search + webpage content fetching -- ✅ Support for platform-specific searches (Twitter, Reddit, GitHub, etc.) 
-- ✅ Configuration testing tool (connection test + API Key masking) -- ✅ Dynamic model switching (switch between Grok models with persistent settings) -- ✅ **Tool routing control (one-click disable built-in WebSearch/WebFetch, force use GrokSearch)** -- ✅ **Automatic time injection (automatically gets local time during search for accurate time-sensitive queries)** -- ✅ Extensible architecture for additional search providers +- **Dual Engine**: Grok search + Tavily extraction/mapping, complementary collaboration +- **OpenAI-compatible interface**, supports any Grok mirror endpoint +- **Automatic time injection** (detects time-related queries, injects local time context) +- One-click disable Claude Code's built-in WebSearch/WebFetch, force routing to this tool +- Smart retry (Retry-After header parsing + exponential backoff) +- Parent process monitoring (auto-detects parent process exit on Windows, prevents zombie processes) -## Quick Start +### Demo +Using `cherry studio` with this MCP configured, here's how `claude-opus-4.6` leverages this project for external knowledge retrieval, reducing hallucination rates. -**Python Environment**: -- Python 3.10 or higher -- Claude Code or Claude Desktop configured +![](../images/wogrok.png) +As shown above, **for a fair experiment, we enabled Claude's built-in search tools**, yet Opus 4.6 still relied on its internal knowledge without consulting FastAPI's official documentation for the latest examples. -**uv tool** (Recommended Python package manager): +![](../images/wgrok.png) +As shown above, with `grok-search MCP` enabled under the same experimental conditions, Opus 4.6 proactively made multiple search calls to **retrieve official documentation, producing more reliable answers.** -Please ensure you have successfully installed the [uv tool](https://docs.astral.sh/uv/getting-started/installation/): -
-Windows Installation +## 2. Installation -```powershell -powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" -``` +### Prerequisites -
+- Python 3.10+ +- [uv](https://docs.astral.sh/uv/getting-started/installation/) (recommended Python package manager) +- Claude Code
-Linux/macOS Installation - -Download and install using curl or wget: +Install uv ```bash -# Using curl +# Linux/macOS curl -LsSf https://astral.sh/uv/install.sh | sh -# Or using wget -wget -qO- https://astral.sh/uv/install.sh | sh +# Windows PowerShell +powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" ``` +> Windows users are **strongly recommended** to run this project in WSL. +
-> **💡 Important Note**: We **strongly recommend** Windows users run this project in WSL (Windows Subsystem for Linux)! -### 1. Installation & Configuration +### One-Click Install + +If you have previously installed this project, remove the old MCP first: +``` +claude mcp remove grok-search +``` -Use `claude mcp add-json` for one-click installation and configuration: +Replace the environment variables in the following command with your own values. The Grok endpoint must be OpenAI-compatible; Tavily is optional — `web_fetch` and `web_map` will be unavailable without it. ```bash claude mcp add-json grok-search --scope user '{ @@ -97,410 +82,166 @@ claude mcp add-json grok-search --scope user '{ "command": "uvx", "args": [ "--from", - "git+https://github.com/GuDaStudio/GrokSearch", + "git+https://github.com/GuDaStudio/GrokSearch@grok-with-tavily", "grok-search" ], "env": { "GROK_API_URL": "https://your-api-endpoint.com/v1", - "GROK_API_KEY": "your-api-key-here" + "GROK_API_KEY": "your-grok-api-key", + "TAVILY_API_KEY": "tvly-your-tavily-key", + "TAVILY_API_URL": "https://api.tavily.com" } }' ``` -#### Configuration Guide - -Configuration is done through **environment variables**, set directly in the `env` field during installation: +You can also configure additional environment variables in the `env` field: -| Environment Variable | Required | Default | Description | -|---------------------|----------|---------|-------------| -| `GROK_API_URL` | ✅ | - | Grok API endpoint (OpenAI-compatible format) | -| `GROK_API_KEY` | ✅ | - | Your API Key | -| `GROK_DEBUG` | ❌ | `false` | Enable debug mode (`true`/`false`) | -| `GROK_LOG_LEVEL` | ❌ | `INFO` | Log level (DEBUG/INFO/WARNING/ERROR) | -| `GROK_LOG_DIR` | ❌ | `logs` | Log file storage directory | +| Variable | Required | Default | Description | +|----------|----------|---------|-------------| +| `GROK_API_URL` | Yes | - | Grok API endpoint (OpenAI-compatible format) | +| `GROK_API_KEY` | Yes | - | Grok API key | +| 
`GROK_MODEL` | No | `grok-4-fast` | Default model (takes precedence over `~/.config/grok-search/config.json` when set) |
+| `TAVILY_API_KEY` | No | - | Tavily API key (for web_fetch / web_map) |
+| `TAVILY_API_URL` | No | `https://api.tavily.com` | Tavily API endpoint |
+| `TAVILY_ENABLED` | No | `true` | Enable Tavily |
+| `FIRECRAWL_API_KEY` | No | - | Firecrawl API key (fallback when Tavily extraction fails) |
+| `FIRECRAWL_API_URL` | No | `https://api.firecrawl.dev/v2` | Firecrawl API endpoint |
+| `GROK_DEBUG` | No | `false` | Debug mode |
+| `GROK_LOG_LEVEL` | No | `INFO` | Log level |
+| `GROK_LOG_DIR` | No | `logs` | Log directory |
+| `GROK_RETRY_MAX_ATTEMPTS` | No | `3` | Max retry attempts |
+| `GROK_RETRY_MULTIPLIER` | No | `1` | Retry backoff multiplier |
+| `GROK_RETRY_MAX_WAIT` | No | `10` | Max retry wait in seconds |

-⚠️ **Security Notes**:
-- API Keys are stored in Claude Code configuration file (`~/.config/claude/mcp.json`), please protect this file
-- Do not share configurations containing real API Keys or commit them to version control

-### 2. Verify Installation
+### Verify Installation

```bash
claude mcp list
```

-You should see the `grok-search` server registered.
-
-### 3. Test Configuration
-
-After configuration, it is **strongly recommended** to run a configuration test in Claude conversation to ensure everything is working properly:
-
-In Claude conversation, type:
+After confirming a successful connection, we **highly recommend** typing the following in a Claude conversation:

```
-Please test the Grok Search configuration
+Call grok-search toggle_builtin_tools to disable Claude Code's built-in WebSearch and WebFetch tools
```
+This will automatically modify the **project-level** `.claude/settings.json` `permissions.deny`, disabling Claude Code's built-in WebSearch and WebFetch, forcing Claude Code to use this project for searches.
-Or simply say: -``` -Show grok-search configuration info -``` -The tool will automatically perform the following checks: -- ✅ Verify environment variables are loaded correctly -- ✅ Test API connection (send request to `/models` endpoint) -- ✅ Display response time and available model count -- ✅ Identify and report any configuration errors - -**Successful Output Example**: -```json -{ - "GROK_API_URL": "https://YOUR-API-URL/grok/v1", - "GROK_API_KEY": "sk-a*****************xyz", - "GROK_DEBUG": false, - "GROK_LOG_LEVEL": "INFO", - "GROK_LOG_DIR": "/home/user/.config/grok-search/logs", - "config_status": "✅ Configuration Complete", - "connection_test": { - "status": "✅ Connection Successful", - "message": "Successfully retrieved model list (HTTP 200), 5 models available", - "response_time_ms": 234.56 - } -} -``` -If you see `❌ 连接失败` or `⚠️ 连接异常`, please check: -- API URL is correct -- API Key is valid -- Network connection is working +## 3. MCP Tools -### 4. Advanced Configuration (Optional) -To better utilize Grok Search, you can optimize the overall Vibe Coding CLI by configuring Claude Code or similar system prompts. For Claude Code, edit ~/.claude/CLAUDE.md with the following content:
-💡 Grok Search Enhance System Prompt (Click to expand) - -# Grok Search Enhance System Prompt - -## 0. Module Activation -**Trigger Condition**: Automatically activate this module and **forcibly replace** built-in tools when performing: -- Web search / Information retrieval / Fact-checking -- Get webpage content / URL parsing / Document fetching -- Query latest information / Break through knowledge cutoff limits - -## 1. Tool Routing Policy - -### Forced Replacement Rules -| Use Case | ❌ Disabled (Built-in) | ✅ Mandatory (GrokSearch) | -| :--- | :--- | :--- | -| Web Search | `WebSearch` | `mcp__grok-search__web_search` | -| Web Fetch | `WebFetch` | `mcp__grok-search__web_fetch` | -| Config Diagnosis | N/A | `mcp__grok-search__get_config_info` | - -### Tool Capability Matrix - -| Tool | Function | Key Parameters | Output Format | Use Case | -| :--- | :--- | :--- | :--- | :--- | -| **web_search** | Real-time web search | `query` (required)
`platform` (optional: Twitter/GitHub/Reddit)
`min_results` / `max_results` | JSON Array
`{title, url, content}` | • Fact-checking
• Latest news
• Technical docs retrieval | -| **web_fetch** | Webpage content fetching | `url` (required) | Structured Markdown
(with metadata header) | • Complete document retrieval
• In-depth content analysis
• Link content verification | -| **get_config_info** | Configuration status detection | No parameters | JSON
`{api_url, status, connection_test}` | • Connection troubleshooting
• First-time use validation | -| **switch_model** | Model switching | `model` (required) | JSON
`{status, previous_model, current_model, config_file}` | • Switch Grok models
• Performance/quality optimization
• Cross-session persistence | -| **toggle_builtin_tools** | Tool routing control | `action` (optional: on/off/status) | JSON
`{blocked, deny_list, file}` | • Disable built-in tools
• Force route to GrokSearch
• Project-level config management | - -## 2. Search Workflow - -### Phase 1: Query Construction -1. **Intent Recognition**: Analyze user needs, determine search type: - - **Broad Search**: Multi-source information aggregation → Use `web_search` - - **Deep Retrieval**: Complete content from single URL → Use `web_fetch` -2. **Parameter Optimization**: - - Set `platform` parameter if focusing on specific platforms - - Adjust `min_results` / `max_results` based on complexity - -### Phase 2: Search Execution -1. **Primary Strategy**: Prioritize `web_search` for structured summaries -2. **Deep Supplementation**: If summaries are insufficient, call `web_fetch` on key URLs for complete content -3. **Iterative Retrieval**: If first-round results don't meet needs, **adjust query terms** and search again (don't give up) - -### Phase 3: Result Synthesis -1. **Information Verification**: Cross-compare multi-source results, identify contradictions -2. **Timeliness Notation**: For time-sensitive information, **must** annotate source and timestamp -3. **Citation Standard**: Output **must include** source URL in format: `[Title](URL)` - -## 3. Error Handling - -| Error Type | Diagnosis Method | Recovery Strategy | -| :--- | :--- | :--- | -| Connection failure | Call `get_config_info` to check configuration | Prompt user to check API URL / Key | -| No search results | Check if query is too specific | Broaden search terms, remove constraints | -| Web fetch timeout | Check URL accessibility | Try searching alternative sources | -| Content truncated | Check target page structure | Fetch in segments or prompt user to visit directly | - -## 4. 
Anti-Patterns - -| ❌ Prohibited Behavior | ✅ Correct Approach | -| :--- | :--- | -| Using built-in `WebSearch` / `WebFetch` | **Must** use GrokSearch corresponding tools | -| No source citation after search | Output **must** include `[Source](URL)` references | -| Give up after single search failure | Adjust parameters and retry at least once | -| Assume webpage content without fetching | **Must** call `web_fetch` to verify key information | -| Ignore search result timeliness | Time-sensitive information **must** be date-labeled | - ---- -Module Description: -- Forced Replacement: Explicitly disable built-in tools, force routing to GrokSearch -- Three-tool Coverage: web_search + web_fetch + get_config_info -- Error Handling: Includes configuration diagnosis recovery strategy -- Citation Standard: Mandatory source labeling, meets information traceability requirements - -
- -### 5. Project Details +This project provides eight MCP tools (click to expand) -#### MCP Tools +### `web_search` — AI Web Search -This project provides five MCP tools: +Executes AI-driven web search via Grok API. By default it returns only Grok's answer and a `session_id` for retrieving sources later. -##### `web_search` - Web Search +`web_search` does not expand sources in the response; it only returns `sources_count`. Sources are cached server-side by `session_id` and can be fetched with `get_sources`. | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| -| `query` | string | ✅ | - | Search query string | -| `platform` | string | ❌ | `""` | Focus on specific platforms (e.g., `"Twitter"`, `"GitHub, Reddit"`) | -| `min_results` | int | ❌ | `3` | Minimum number of results | -| `max_results` | int | ❌ | `10` | Maximum number of results | +| `query` | string | Yes | - | Search query | +| `platform` | string | No | `""` | Focus platform (e.g., `"Twitter"`, `"GitHub, Reddit"`) | +| `model` | string | No | `null` | Per-request Grok model ID | +| `extra_sources` | int | No | `0` | Extra sources via Tavily/Firecrawl (0 disables) | -**Returns**: JSON array containing `title`, `url`, `content` +Automatically detects time-related keywords in queries (e.g., "latest", "today", "recent"), injecting local time context to improve accuracy for time-sensitive searches. -
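The automatic time injection described above can be sketched as follows (the keyword list and the injected prefix format are assumptions for illustration, not the project's actual implementation):

```python
from datetime import datetime

# Assumed time-related keyword list; the real implementation may differ
TIME_KEYWORDS = ("latest", "today", "recent", "current")

def inject_time_context(query: str) -> str:
    """Prepend local-time context when the query looks time-sensitive."""
    if any(kw in query.lower() for kw in TIME_KEYWORDS):
        now = datetime.now().strftime("%Y-%m-%d %H:%M")
        return f"[local time: {now}] {query}"
    return query

print(inject_time_context("What is the latest FastAPI release?"))  # time context injected
print(inject_time_context("What is dependency injection?"))        # returned unchanged
```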
-Return Example (Click to expand) - -```json -[ - { - "title": "Claude Code - Anthropic Official CLI Tool", - "url": "https://claude.com/claude-code", - "description": "Official command-line tool from Anthropic with MCP protocol integration, providing code generation and project management" - }, - { - "title": "Model Context Protocol (MCP) Technical Specification", - "url": "https://modelcontextprotocol.io/docs", - "description": "Official MCP documentation defining standardized communication interfaces between AI models and external tools" - }, - { - "title": "GitHub - FastMCP: Build MCP Servers Quickly", - "url": "https://github.com/jlowin/fastmcp", - "description": "Python-based MCP server framework that simplifies tool registration and async processing" - } -] -``` -
+Return value (structured dict): +- `session_id`: search session ID +- `content`: answer only (sources removed) +- `sources_count`: cached sources count + +### `get_sources` — Retrieve Sources -##### `web_fetch` - Web Content Fetching +Retrieves the full cached source list for a previous `web_search` call. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| -| `url` | string | ✅ | Target webpage URL | - -**Features**: Retrieves complete webpage content and converts to structured Markdown, preserving headings, lists, tables, code blocks, etc. - -
-Return Example (Click to expand) - -```markdown ---- -source: https://modelcontextprotocol.io/docs/concepts/architecture -title: MCP Architecture Documentation -fetched_at: 2024-01-15T10:30:00Z ---- - -# MCP Architecture Documentation - -## Table of Contents -- [Core Concepts](#core-concepts) -- [Protocol Layers](#protocol-layers) -- [Communication Patterns](#communication-patterns) - -## Core Concepts - -Model Context Protocol (MCP) is a standardized communication protocol for connecting AI models with external tools and data sources. - -### Design Goals -- **Standardization**: Provide unified interface specifications -- **Extensibility**: Support custom tool registration -- **Efficiency**: Optimize data transmission and processing - -## Protocol Layers - -MCP adopts a three-layer architecture design: +| `session_id` | string | Yes | `session_id` returned by `web_search` | -| Layer | Function | Implementation | -|-------|----------|----------------| -| Transport | Data transmission | stdio, HTTP, WebSocket | -| Protocol | Message format | JSON-RPC 2.0 | -| Application | Tool definition | Tool Schema + Handlers | +Return value (structured dict): +- `session_id` +- `sources_count` +- `sources`: source list (each item includes `url`, may include `title`/`description`/`provider`) -## Communication Patterns +### `web_fetch` — Web Content Extraction -MCP supports the following communication patterns: - -1. **Request-Response**: Synchronous tool invocation -2. **Streaming**: Process large datasets -3. **Event Notification**: Asynchronous status updates - -```python -# Example: Register MCP tool -@mcp.tool(name="search") -async def search_tool(query: str) -> str: - results = await perform_search(query) - return json.dumps(results) -``` - -For more information, visit [Official Documentation](https://modelcontextprotocol.io) -``` -
- -##### `get_config_info` - Configuration Info Query +Extracts complete web content via Tavily Extract API, returning Markdown format. | Parameter | Type | Required | Description | |-----------|------|----------|-------------| -| None | - | - | This tool requires no parameters | - -**Features**: Display configuration status, test API connection, return response time and available model count (API Key automatically masked) - -
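The source-caching contract between `web_search` and `get_sources` can be mimicked with an in-memory dictionary. The `fake_*` helpers below are illustrative stand-ins for the MCP tools, not the real server code; only the return-value shapes (including the `session_id_not_found_or_expired` error key) follow this repo.

```python
import uuid

# In-memory stand-in for the server-side sources cache, keyed by session_id.
_SOURCES_CACHE: dict[str, list[dict]] = {}


def fake_web_search(query: str) -> dict:
    # Stand-in for web_search: the answer is returned inline, while the
    # source list is parked in the cache for later retrieval.
    session_id = uuid.uuid4().hex[:12]
    sources = [{"url": "https://example.com/a", "title": "Example result", "provider": "tavily"}]
    _SOURCES_CACHE[session_id] = sources
    return {
        "session_id": session_id,
        "content": f"Answer for: {query}",
        "sources_count": len(sources),
    }


def fake_get_sources(session_id: str) -> dict:
    # Mirrors get_sources: unknown or expired IDs yield an explicit error key.
    sources = _SOURCES_CACHE.get(session_id)
    if sources is None:
        return {
            "session_id": session_id,
            "sources": [],
            "sources_count": 0,
            "error": "session_id_not_found_or_expired",
        }
    return {"session_id": session_id, "sources": sources, "sources_count": len(sources)}
```

Keeping sources out of the search response keeps the model's context small; the full list is only expanded when the client explicitly asks for it.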
-Return Example (Click to expand) - -```json -{ - "GROK_API_URL": "https://YOUR-API-URL/grok/v1", - "GROK_API_KEY": "sk-a*****************xyz", - "GROK_DEBUG": false, - "GROK_LOG_LEVEL": "INFO", - "GROK_LOG_DIR": "/home/user/.config/grok-search/logs", - "config_status": "✅ Configuration Complete", - "connection_test": { - "status": "✅ Connection Successful", - "message": "Successfully retrieved model list (HTTP 200), 5 models available", - "response_time_ms": 234.56 - } -} -``` +| `url` | string | Yes | Target webpage URL | -
- -##### `switch_model` - Model Switching +### `web_map` — Site Structure Mapping -| Parameter | Type | Required | Description | -|-----------|------|----------|-------------| -| `model` | string | ✅ | Model ID to switch to (e.g., `"grok-4-fast"`, `"grok-2-latest"`, `"grok-vision-beta"`) | +Traverses website structure via Tavily Map API, discovering URLs and generating a site map. -**Features**: -- Switch the default Grok model used for search and fetch operations -- Configuration automatically persisted to `~/.config/grok-search/config.json` -- Cross-session settings retention -- Suitable for performance optimization or quality comparison testing +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `url` | string | Yes | - | Starting URL | +| `instructions` | string | No | `""` | Natural language filtering instructions | +| `max_depth` | int | No | `1` | Max traversal depth (1-5) | +| `max_breadth` | int | No | `20` | Max links to follow per page (1-500) | +| `limit` | int | No | `50` | Total link processing limit (1-500) | +| `timeout` | int | No | `150` | Timeout in seconds (10-150) | -
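The parameter ranges above can be enforced before the request is sent. This sketch only builds the request body; it assumes out-of-range values are clamped rather than rejected (an assumption — the server or the Tavily API may instead return an error), and it makes no network call.

```python
def build_map_payload(
    url: str,
    instructions: str = "",
    max_depth: int = 1,
    max_breadth: int = 20,
    limit: int = 50,
    timeout: int = 150,
) -> dict:
    # Clamp each knob into the documented range before building the body.
    def clamp(value: int, lo: int, hi: int) -> int:
        return max(lo, min(hi, value))

    payload = {
        "url": url,
        "max_depth": clamp(max_depth, 1, 5),
        "max_breadth": clamp(max_breadth, 1, 500),
        "limit": clamp(limit, 1, 500),
        "timeout": clamp(timeout, 10, 150),
    }
    # Natural-language filtering is optional and omitted when empty.
    if instructions:
        payload["instructions"] = instructions
    return payload
```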
-Return Example (Click to expand) - -```json -{ - "status": "✅ 成功", - "previous_model": "grok-4-fast", - "current_model": "grok-2-latest", - "message": "模型已从 grok-4-fast 切换到 grok-2-latest", - "config_file": "/home/user/.config/grok-search/config.json" -} -``` +### `get_config_info` — Configuration Diagnostics -**Usage Example**: +No parameters required. Displays all configuration status, tests Grok API connection, returns response time and available model list (API keys auto-masked). -In Claude conversation, type: -``` -Please switch the Grok model to grok-2-latest -``` +### `switch_model` — Model Switching -Or simply say: -``` -Switch model to grok-vision-beta -``` +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `model` | string | Yes | Model ID (e.g., `"grok-4-fast"`, `"grok-2-latest"`) | -
+Settings persist to `~/.config/grok-search/config.json` across sessions. -##### `toggle_builtin_tools` - Tool Routing Control +### `toggle_builtin_tools` — Tool Routing Control | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| -| `action` | string | ❌ | `"status"` | Action type: `"on"`/`"enable"`(disable built-in tools), `"off"`/`"disable"`(enable built-in tools), `"status"`/`"check"`(view status) | - -**Features**: -- Control project-level `.claude/settings.json` `permissions.deny` configuration -- Disable/enable Claude Code's built-in `WebSearch` and `WebFetch` tools -- Force routing to GrokSearch MCP tools -- Auto-locate project root (find `.git`) -- Preserve other configuration items - -
-Return Example (Click to expand) - -```json -{ - "blocked": true, - "deny_list": ["WebFetch", "WebSearch"], - "file": "/path/to/project/.claude/settings.json", - "message": "官方工具已禁用" -} -``` - -**Usage Example**: - -``` -# Disable built-in tools (recommended) -Disable built-in search and fetch tools +| `action` | string | No | `"status"` | `"on"` disable built-in tools / `"off"` enable built-in tools / `"status"` check status | -# Enable built-in tools -Enable built-in search and fetch tools +Modifies project-level `.claude/settings.json` `permissions.deny` to disable Claude Code's built-in WebSearch and WebFetch. -# Check current status -Show status of built-in tools -``` +### `search_planning` — Search Planning +A structured multi-phase planning scaffold to generate an executable search plan before running complex searches.
---- +## 4. FAQ
-

Project Architecture

(Click to expand)
- -``` -src/grok_search/ -├── config.py # Configuration management (environment variables) -├── server.py # MCP service entry (tool registration) -├── logger.py # Logging system -├── utils.py # Formatting utilities -└── providers/ - ├── base.py # SearchProvider base class - └── grok.py # Grok API implementation -``` - + +Q: Must I configure both Grok and Tavily? + +A: Grok (`GROK_API_URL` + `GROK_API_KEY`) is required and provides the core search capability. Tavily is optional — without it, `web_fetch` and `web_map` will return configuration error messages.
-## FAQ - -**Q: How do I get Grok API access?** -A: Register with a third-party platform → Obtain API Endpoint and Key → Configure using `claude mcp add-json` command +
+ +Q: What format does the Grok API URL need? + +A: An OpenAI-compatible API endpoint (supporting `/chat/completions` and `/models` endpoints). If using official Grok, access it through an OpenAI-compatible mirror. +
-**Q: How to verify configuration after setup?** -A: Say "Show grok-search configuration info" in Claude conversation to check connection test results +
+ +Q: How to verify configuration? + +A: Say "Show grok-search configuration info" in a Claude conversation to automatically test the API connection and display results. +
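A healthy check returns output roughly like the following (values are illustrative; the key names follow `get_config_info`, which masks API keys before display):

```json
{
  "GROK_API_URL": "https://YOUR-API-URL/grok/v1",
  "GROK_API_KEY": "sk-a*****************xyz",
  "GROK_SSL_VERIFY": true,
  "TAVILY_ENABLED": true,
  "TAVILY_API_KEY": "tvly-************abcd",
  "config_status": "✅ Configuration Complete",
  "connection_test": {
    "status": "✅ Connection Successful",
    "response_time_ms": 234.56
  }
}
```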
## License -This project is open source under the [MIT License](LICENSE). +[MIT License](LICENSE) ---
-**If this project helps you, please give it a ⭐ Star!** -[![Star History Chart](https://api.star-history.com/svg?repos=GuDaStudio/GrokSearch&type=date&legend=top-left)](https://www.star-history.com/#GuDaStudio/GrokSearch&type=date&legend=top-left) +**If this project helps you, please give it a Star!** +[![Star History Chart](https://api.star-history.com/svg?repos=GuDaStudio/GrokSearch&type=date&legend=top-left)](https://www.star-history.com/#GuDaStudio/GrokSearch&type=date&legend=top-left)
diff --git a/images/wgrok.png b/images/wgrok.png new file mode 100644 index 0000000..f78eb3d Binary files /dev/null and b/images/wgrok.png differ diff --git a/images/wogrok.png b/images/wogrok.png new file mode 100644 index 0000000..b28002e Binary files /dev/null and b/images/wogrok.png differ diff --git a/src/grok_search/config.py b/src/grok_search/config.py index 006d340..51382b6 100644 --- a/src/grok_search/config.py +++ b/src/grok_search/config.py @@ -23,7 +23,11 @@ def __new__(cls): def config_file(self) -> Path: if self._config_file is None: config_dir = Path.home() / ".config" / "grok-search" - config_dir.mkdir(parents=True, exist_ok=True) + try: + config_dir.mkdir(parents=True, exist_ok=True) + except OSError: + config_dir = Path.cwd() / ".grok-search" + config_dir.mkdir(parents=True, exist_ok=True) self._config_file = config_dir / "config.json" return self._config_file @@ -81,30 +85,68 @@ def grok_api_key(self) -> str: @property def tavily_enabled(self) -> bool: - return os.getenv("TAVILY_ENABLED", "false").lower() in ("true", "1", "yes") + return os.getenv("TAVILY_ENABLED", "true").lower() in ("true", "1", "yes") + + @property + def tavily_api_url(self) -> str: + return os.getenv("TAVILY_API_URL", "https://api.tavily.com") @property def tavily_api_key(self) -> str | None: return os.getenv("TAVILY_API_KEY") + @property + def firecrawl_api_url(self) -> str: + return os.getenv("FIRECRAWL_API_URL", "https://api.firecrawl.dev/v2") + + @property + def firecrawl_api_key(self) -> str | None: + return os.getenv("FIRECRAWL_API_KEY") + @property def log_level(self) -> str: return os.getenv("GROK_LOG_LEVEL", "INFO").upper() + @property + def ssl_verify_enabled(self) -> bool: + """是否启用 SSL 证书验证,默认启用。设置为 false 可跳过验证(适用于内网自签名证书)""" + return os.getenv("GROK_SSL_VERIFY", "true").lower() not in ("false", "0", "no") + @property def log_dir(self) -> Path: log_dir_str = os.getenv("GROK_LOG_DIR", "logs") - if Path(log_dir_str).is_absolute(): - return Path(log_dir_str) - 
user_log_dir = Path.home() / ".config" / "grok-search" / log_dir_str - user_log_dir.mkdir(parents=True, exist_ok=True) - return user_log_dir + log_dir = Path(log_dir_str) + if log_dir.is_absolute(): + return log_dir + + home_log_dir = Path.home() / ".config" / "grok-search" / log_dir_str + try: + home_log_dir.mkdir(parents=True, exist_ok=True) + return home_log_dir + except OSError: + pass + + cwd_log_dir = Path.cwd() / log_dir_str + try: + cwd_log_dir.mkdir(parents=True, exist_ok=True) + return cwd_log_dir + except OSError: + pass + + tmp_log_dir = Path("/tmp") / "grok-search" / log_dir_str + tmp_log_dir.mkdir(parents=True, exist_ok=True) + return tmp_log_dir @property def grok_model(self) -> str: if self._cached_model is not None: return self._cached_model + env_model = os.getenv("GROK_MODEL") + if env_model: + self._cached_model = env_model + return env_model + config_data = self._load_config_file() file_model = config_data.get("model") if file_model: @@ -145,9 +187,13 @@ def get_config_info(self) -> dict: "GROK_MODEL": self.grok_model, "GROK_DEBUG": self.debug_enabled, "GROK_LOG_LEVEL": self.log_level, + "GROK_SSL_VERIFY": self.ssl_verify_enabled, "GROK_LOG_DIR": str(self.log_dir), + "TAVILY_API_URL": self.tavily_api_url, "TAVILY_ENABLED": self.tavily_enabled, "TAVILY_API_KEY": self._mask_api_key(self.tavily_api_key) if self.tavily_api_key else "未配置", + "FIRECRAWL_API_URL": self.firecrawl_api_url, + "FIRECRAWL_API_KEY": self._mask_api_key(self.firecrawl_api_key) if self.firecrawl_api_key else "未配置", "config_status": config_status } diff --git a/src/grok_search/logger.py b/src/grok_search/logger.py index af22a95..57f711d 100644 --- a/src/grok_search/logger.py +++ b/src/grok_search/logger.py @@ -3,23 +3,25 @@ from pathlib import Path from .config import config -LOG_DIR = config.log_dir -LOG_DIR.mkdir(parents=True, exist_ok=True) -LOG_FILE = LOG_DIR / f"grok_search_{datetime.now().strftime('%Y%m%d')}.log" - logger = logging.getLogger("grok_search") 
-logger.setLevel(getattr(logging, config.log_level)) - -file_handler = logging.FileHandler(LOG_FILE, encoding='utf-8') -file_handler.setLevel(getattr(logging, config.log_level)) +logger.setLevel(getattr(logging, config.log_level, logging.INFO)) -formatter = logging.Formatter( +_formatter = logging.Formatter( '%(asctime)s - %(name)s - %(levelname)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S' ) -file_handler.setFormatter(formatter) -logger.addHandler(file_handler) +try: + log_dir = config.log_dir + log_dir.mkdir(parents=True, exist_ok=True) + log_file = log_dir / f"grok_search_{datetime.now().strftime('%Y%m%d')}.log" + + file_handler = logging.FileHandler(log_file, encoding='utf-8') + file_handler.setLevel(getattr(logging, config.log_level, logging.INFO)) + file_handler.setFormatter(_formatter) + logger.addHandler(file_handler) +except OSError: + logger.addHandler(logging.NullHandler()) async def log_info(ctx, message: str, is_debug: bool = False): if is_debug: diff --git a/src/grok_search/planning.py b/src/grok_search/planning.py new file mode 100644 index 0000000..8bdb30e --- /dev/null +++ b/src/grok_search/planning.py @@ -0,0 +1,167 @@ +from pydantic import BaseModel, Field +from typing import Optional, Literal +import uuid + + +class IntentOutput(BaseModel): + core_question: str = Field(description="Distilled core question in one sentence") + query_type: Literal["factual", "comparative", "exploratory", "analytical"] = Field( + description="factual=single answer, comparative=A vs B, exploratory=broad understanding, analytical=deep reasoning" + ) + time_sensitivity: Literal["realtime", "recent", "historical", "irrelevant"] = Field( + description="realtime=today, recent=days/weeks, historical=months+, irrelevant=timeless" + ) + domain: Optional[str] = Field(default=None, description="Specific domain if identifiable") + premise_valid: Optional[bool] = Field(default=None, description="False if the question contains a flawed assumption") + ambiguities: 
Optional[list[str]] = Field(default=None, description="Unresolved ambiguities that may affect search direction") + unverified_terms: Optional[list[str]] = Field( + default=None, + description="External classifications, rankings, or taxonomies that may be incomplete or outdated " + "in training data (e.g., 'CCF-A', 'Fortune 500', 'OWASP Top 10'). " + "Each should become a prerequisite sub-query in Phase 3." + ) + + +class ComplexityOutput(BaseModel): + level: Literal[1, 2, 3] = Field( + description="1=simple (1-2 searches), 2=moderate (3-5 searches), 3=complex (6+ searches)" + ) + estimated_sub_queries: int = Field(ge=1, le=20) + estimated_tool_calls: int = Field(ge=1, le=50) + justification: str + + +class SubQuery(BaseModel): + id: str = Field(description="Unique identifier (e.g., 'sq1')") + goal: str + expected_output: str = Field(description="What a successful result looks like") + tool_hint: Optional[str] = Field(default=None, description="Suggested tool: web_search | web_fetch | web_map") + boundary: str = Field(description="What this sub-query explicitly excludes — MUST state mutual exclusion with sibling sub-queries, not just the broader domain") + depends_on: Optional[list[str]] = Field(default=None, description="IDs of prerequisite sub-queries") + + +class SearchTerm(BaseModel): + term: str = Field(description="Search query string. MUST be ≤8 words. Drop redundant synonyms (e.g., use 'RAG' not 'RAG retrieval augmented generation').") + purpose: str = Field(description="Single sub-query ID this term serves (e.g., 'sq2'). 
ONE term per sub-query — do NOT combine like 'sq1+sq2'.") + round: int = Field(ge=1, description="Execution round: 1=broad discovery, 2+=targeted follow-up refined by round 1 findings") + + +class StrategyOutput(BaseModel): + approach: Literal["broad_first", "narrow_first", "targeted"] = Field( + description="broad_first=wide then narrow, narrow_first=precise then expand, targeted=known-item" + ) + search_terms: list[SearchTerm] + fallback_plan: Optional[str] = Field(default=None, description="Fallback if primary searches fail") + + +class ToolPlanItem(BaseModel): + sub_query_id: str + tool: Literal["web_search", "web_fetch", "web_map"] + reason: str + params: Optional[dict] = Field(default=None, description="Tool-specific parameters") + + +class ExecutionOrderOutput(BaseModel): + parallel: list[list[str]] = Field(description="Groups of sub-query IDs runnable in parallel") + sequential: list[str] = Field(description="Sub-query IDs that must run in order") + estimated_rounds: int = Field(ge=1) + + +PHASE_NAMES = [ + "intent_analysis", + "complexity_assessment", + "query_decomposition", + "search_strategy", + "tool_selection", + "execution_order", +] + +REQUIRED_PHASES: dict[int, set[str]] = { + 1: {"intent_analysis", "complexity_assessment", "query_decomposition"}, + 2: {"intent_analysis", "complexity_assessment", "query_decomposition", "search_strategy", "tool_selection"}, + 3: set(PHASE_NAMES), +} + + +class PhaseRecord(BaseModel): + phase: str + thought: str + data: dict | list | None = None + confidence: float = 1.0 + + +class PlanningSession: + def __init__(self, session_id: str): + self.session_id = session_id + self.phases: dict[str, PhaseRecord] = {} + self.complexity_level: int | None = None + + @property + def completed_phases(self) -> list[str]: + return [p for p in PHASE_NAMES if p in self.phases] + + def required_phases(self) -> set[str]: + return REQUIRED_PHASES.get(self.complexity_level or 3, REQUIRED_PHASES[3]) + + def is_complete(self) -> bool: + if 
self.complexity_level is None: + return False + return self.required_phases().issubset(self.phases.keys()) + + def build_executable_plan(self) -> dict: + return {name: record.data for name, record in self.phases.items()} + + +class PlanningEngine: + def __init__(self): + self._sessions: dict[str, PlanningSession] = {} + + def process_phase( + self, + phase: str, + thought: str, + session_id: str = "", + is_revision: bool = False, + revises_phase: str = "", + confidence: float = 1.0, + phase_data: dict | list | None = None, + ) -> dict: + if session_id and session_id in self._sessions: + session = self._sessions[session_id] + else: + sid = session_id if session_id else uuid.uuid4().hex[:12] + session = PlanningSession(sid) + self._sessions[sid] = session + + target = revises_phase if is_revision and revises_phase else phase + if target not in PHASE_NAMES: + return {"error": f"Unknown phase: {target}. Valid: {', '.join(PHASE_NAMES)}"} + + session.phases[target] = PhaseRecord( + phase=target, thought=thought, data=phase_data, confidence=confidence + ) + + if target == "complexity_assessment" and isinstance(phase_data, dict): + level = phase_data.get("level") + if level in (1, 2, 3): + session.complexity_level = level + + complete = session.is_complete() + result: dict = { + "session_id": session.session_id, + "completed_phases": session.completed_phases, + "complexity_level": session.complexity_level, + "plan_complete": complete, + } + + remaining = [p for p in PHASE_NAMES if p in session.required_phases() and p not in session.phases] + if remaining: + result["phases_remaining"] = remaining + + if complete: + result["executable_plan"] = session.build_executable_plan() + + return result + + +engine = PlanningEngine() diff --git a/src/grok_search/providers/grok.py b/src/grok_search/providers/grok.py index 6e5c1c9..1b6ac58 100644 --- a/src/grok_search/providers/grok.py +++ b/src/grok_search/providers/grok.py @@ -7,7 +7,7 @@ from tenacity.wait import wait_base from 
zoneinfo import ZoneInfo from .base import BaseSearchProvider, SearchResult -from ..utils import search_prompt, fetch_prompt +from ..utils import search_prompt, fetch_prompt, url_describe_prompt, rank_sources_prompt from ..logger import log_info from ..config import config @@ -131,19 +131,11 @@ async def search(self, query: str, platform: str = "", min_results: int = 3, max "Content-Type": "application/json", } platform_prompt = "" - return_prompt = "" if platform: - platform_prompt = "\n\nYou should search the web for the information you need, and focus on these platform: " + platform + platform_prompt = "\n\nYou should search the web for the information you need, and focus on these platform: " + platform + "\n" - if max_results: - return_prompt = "\n\nYou should return the results in a JSON format, and the results should at least be " + str(min_results) + " and at most be " + str(max_results) + " results." - - # 仅在查询包含时间相关关键词时注入当前时间信息 - if _needs_time_context(query): - time_context = get_local_time_info() + "\n" - else: - time_context = "" + time_context = get_local_time_info() + "\n" payload = { "model": self.model, @@ -152,12 +144,12 @@ async def search(self, query: str, platform: str = "", min_results: int = 3, max "role": "system", "content": search_prompt, }, - {"role": "user", "content": time_context + query + platform_prompt + return_prompt }, + {"role": "user", "content": time_context + query + platform_prompt}, ], "stream": True, } - await log_info(ctx, f"platform_prompt: { query + platform_prompt + return_prompt}", config.debug_enabled) + await log_info(ctx, f"platform_prompt: { query + platform_prompt}", config.debug_enabled) return await self._execute_stream_with_retry(headers, payload, ctx) @@ -224,7 +216,7 @@ async def _execute_stream_with_retry(self, headers: dict, payload: dict, ctx=Non """执行带重试机制的流式 HTTP 请求""" timeout = httpx.Timeout(connect=6.0, read=120.0, write=10.0, pool=None) - async with httpx.AsyncClient(timeout=timeout, 
follow_redirects=True) as client: + async with httpx.AsyncClient(timeout=timeout, follow_redirects=True, verify=config.ssl_verify_enabled) as client: async for attempt in AsyncRetrying( stop=stop_after_attempt(config.retry_max_attempts + 1), wait=_WaitWithRetryAfter(config.retry_multiplier, config.retry_max_wait), @@ -240,3 +232,57 @@ async def _execute_stream_with_retry(self, headers: dict, payload: dict, ctx=Non ) as response: response.raise_for_status() return await self._parse_streaming_response(response, ctx) + + async def describe_url(self, url: str, ctx=None) -> dict: + """让 Grok 阅读单个 URL 并返回 title + extracts""" + headers = { + "Authorization": f"Bearer {self.api_key}", + "Content-Type": "application/json", + } + payload = { + "model": self.model, + "messages": [ + {"role": "system", "content": url_describe_prompt}, + {"role": "user", "content": url}, + ], + "stream": True, + } + result = await self._execute_stream_with_retry(headers, payload, ctx) + title, extracts = url, "" + for line in result.strip().splitlines(): + if line.startswith("Title:"): + title = line[6:].strip() or url + elif line.startswith("Extracts:"): + extracts = line[9:].strip() + return {"title": title, "extracts": extracts, "url": url} + + async def rank_sources(self, query: str, sources_text: str, total: int, ctx=None) -> list[int]: + """让 Grok 按查询相关度对信源排序,返回排序后的序号列表""" + headers = { + "Authorization": f"Bearer {self.api_key}", + "Content-Type": "application/json", + } + payload = { + "model": self.model, + "messages": [ + {"role": "system", "content": rank_sources_prompt}, + {"role": "user", "content": f"Query: {query}\n\n{sources_text}"}, + ], + "stream": True, + } + result = await self._execute_stream_with_retry(headers, payload, ctx) + order: list[int] = [] + seen: set[int] = set() + for token in result.strip().split(): + try: + n = int(token) + if 1 <= n <= total and n not in seen: + seen.add(n) + order.append(n) + except ValueError: + continue + # 补齐遗漏的序号 + for i in range(1, 
total + 1): + if i not in seen: + order.append(i) + return order diff --git a/src/grok_search/server.py b/src/grok_search/server.py index 23db3e0..295f5b2 100644 --- a/src/grok_search/server.py +++ b/src/grok_search/server.py @@ -6,151 +6,459 @@ if str(src_dir) not in sys.path: sys.path.insert(0, str(src_dir)) -from fastmcp import FastMCP, Context +from mcp.server.fastmcp import FastMCP, Context +from typing import Annotated, Optional +from pydantic import Field # 尝试使用绝对导入(支持 mcp run) try: from grok_search.providers.grok import GrokSearchProvider - from grok_search.utils import format_search_results from grok_search.logger import log_info from grok_search.config import config + from grok_search.sources import SourcesCache, merge_sources, new_session_id, split_answer_and_sources + from grok_search.planning import ( + IntentOutput, ComplexityOutput, SubQuery, + StrategyOutput, ToolPlanItem, ExecutionOrderOutput, + engine as planning_engine, + ) except ImportError: - # 降级到相对导入(pip install -e . 后) from .providers.grok import GrokSearchProvider - from .utils import format_search_results from .logger import log_info from .config import config + from .sources import SourcesCache, merge_sources, new_session_id, split_answer_and_sources + from .planning import ( + IntentOutput, ComplexityOutput, SubQuery, + StrategyOutput, ToolPlanItem, ExecutionOrderOutput, + engine as planning_engine, + ) import asyncio mcp = FastMCP("grok-search") -@mcp.tool( - name="web_search", - description=""" - Performs a third-party web search based on the given query and returns the results - as a JSON string. +_SOURCES_CACHE = SourcesCache(max_size=256) +_AVAILABLE_MODELS_CACHE: dict[tuple[str, str], list[str]] = {} +_AVAILABLE_MODELS_LOCK = asyncio.Lock() - The `query` should be a clear, self-contained natural-language search query. - When helpful, include constraints such as topic, time range, language, or domain. 
- The `platform` should be the platforms which you should focus on searching, such as "Twitter", "GitHub", "Reddit", etc. +async def _fetch_available_models(api_url: str, api_key: str) -> list[str]: + import httpx + + models_url = f"{api_url.rstrip('/')}/models" + async with httpx.AsyncClient(timeout=10.0, verify=config.ssl_verify_enabled) as client: + response = await client.get( + models_url, + headers={ + "Authorization": f"Bearer {api_key}", + "Content-Type": "application/json", + }, + ) + response.raise_for_status() + data = response.json() + + models: list[str] = [] + for item in (data or {}).get("data", []) or []: + if isinstance(item, dict) and isinstance(item.get("id"), str): + models.append(item["id"]) + return models + + +async def _get_available_models_cached(api_url: str, api_key: str) -> list[str]: + key = (api_url, api_key) + async with _AVAILABLE_MODELS_LOCK: + if key in _AVAILABLE_MODELS_CACHE: + return _AVAILABLE_MODELS_CACHE[key] + + try: + models = await _fetch_available_models(api_url, api_key) + except Exception: + models = [] + + async with _AVAILABLE_MODELS_LOCK: + _AVAILABLE_MODELS_CACHE[key] = models + return models + + +def _extra_results_to_sources( + tavily_results: list[dict] | None, + firecrawl_results: list[dict] | None, +) -> list[dict]: + sources: list[dict] = [] + seen: set[str] = set() + + if firecrawl_results: + for r in firecrawl_results: + url = (r.get("url") or "").strip() + if not url or url in seen: + continue + seen.add(url) + item: dict = {"url": url, "provider": "firecrawl"} + title = (r.get("title") or "").strip() + if title: + item["title"] = title + desc = (r.get("description") or "").strip() + if desc: + item["description"] = desc + sources.append(item) + + if tavily_results: + for r in tavily_results: + url = (r.get("url") or "").strip() + if not url or url in seen: + continue + seen.add(url) + item: dict = {"url": url, "provider": "tavily"} + title = (r.get("title") or "").strip() + if title: + item["title"] = 
title + content = (r.get("content") or "").strip() + if content: + item["description"] = content + sources.append(item) + + return sources - The `min_results` and `max_results` should be the minimum and maximum number of results to return. - Returns - ------- - str - A JSON-encoded string representing a list of search results. Each result - includes at least: - - `url`: the link to the result - - `title`: a short title - - `summary`: a brief description or snippet of the page content. - """ +@mcp.tool( + name="web_search", + description=""" + Before using this tool, please use the search_planning tool to plan the search carefully. + Performs a deep web search based on the given query and returns Grok's answer directly. + + This tool extracts sources if provided by upstream, caches them, and returns: + - session_id: string (When you feel confused or curious about the main content, use this field to invoke the get_sources tool to obtain the corresponding list of information sources) + - content: string (answer only) + - sources_count: int + """, + meta={"version": "2.0.0", "author": "guda.studio"}, ) -async def web_search(query: str, platform: str = "", min_results: int = 3, max_results: int = 10, ctx: Context = None) -> str: +async def web_search( + query: Annotated[str, "Clear, self-contained natural-language search query."], + platform: Annotated[str, "Target platform to focus on (e.g., 'Twitter', 'GitHub', 'Reddit'). Leave empty for general web search."] = "", + model: Annotated[str, "Optional model ID for this request only. This value is used ONLY when user explicitly provided."] = "", + extra_sources: Annotated[int, "Number of additional reference results from Tavily/Firecrawl. Set 0 to disable. 
Default 0."] = 0,
+) -> dict:
+    session_id = new_session_id()
     try:
         api_url = config.grok_api_url
         api_key = config.grok_api_key
-        model = config.grok_model
     except ValueError as e:
-        error_msg = str(e)
-        if ctx:
-            await ctx.report_progress(error_msg)
-        return f"配置错误: {error_msg}"
+        await _SOURCES_CACHE.set(session_id, [])
+        return {"session_id": session_id, "content": f"配置错误: {str(e)}", "sources_count": 0}
+
+    effective_model = config.grok_model
+    if model:
+        available = await _get_available_models_cached(api_url, api_key)
+        if available and model not in available:
+            await _SOURCES_CACHE.set(session_id, [])
+            return {"session_id": session_id, "content": f"无效模型: {model}", "sources_count": 0}
+        effective_model = model
+
+    grok_provider = GrokSearchProvider(api_url, api_key, effective_model)
+
+    # Compute the extra-source quota for each provider
+    has_tavily = bool(config.tavily_api_key)
+    has_firecrawl = bool(config.firecrawl_api_key)
+    firecrawl_count = 0
+    tavily_count = 0
+    if extra_sources > 0:
+        if has_firecrawl and has_tavily:
+            firecrawl_count = round(extra_sources * 1)
+            tavily_count = extra_sources - firecrawl_count
+        elif has_firecrawl:
+            firecrawl_count = extra_sources
+        elif has_tavily:
+            tavily_count = extra_sources
+
+    # Run the search tasks in parallel
+    async def _safe_grok() -> str:
+        try:
+            return await grok_provider.search(query, platform)
+        except Exception:
+            return ""
+
+    async def _safe_tavily() -> list[dict] | None:
+        try:
+            if tavily_count:
+                return await _call_tavily_search(query, tavily_count)
+        except Exception:
+            return None
+
+    async def _safe_firecrawl() -> list[dict] | None:
+        try:
+            if firecrawl_count:
+                return await _call_firecrawl_search(query, firecrawl_count)
+        except Exception:
+            return None
+
+    coros: list = [_safe_grok()]
+    if tavily_count > 0:
+        coros.append(_safe_tavily())
+    if firecrawl_count > 0:
+        coros.append(_safe_firecrawl())
+
+    gathered = await asyncio.gather(*coros)
+
+    grok_result: str = gathered[0] or ""
+    tavily_results: list[dict] | None = None
+    firecrawl_results: list[dict] | None = None
+    idx = 1
+    if tavily_count > 0:
+        tavily_results = gathered[idx]
+        idx += 1
+    if firecrawl_count > 0:
+        firecrawl_results = gathered[idx]
+
+    answer, grok_sources = split_answer_and_sources(grok_result)
+    extra = _extra_results_to_sources(tavily_results, firecrawl_results)
+    all_sources = merge_sources(grok_sources, extra)
+
+    await _SOURCES_CACHE.set(session_id, all_sources)
+    return {"session_id": session_id, "content": answer, "sources_count": len(all_sources)}
+
+
+@mcp.tool(
+    name="get_sources",
+    description="""
+    When you feel confused or curious about the search response content, use the session_id returned by web_search to invoke this tool and obtain the corresponding list of information sources.
+    Retrieve all cached sources for a previous web_search call.
+    Provide the session_id returned by web_search to get the full source list.
+    """,
+    meta={"version": "1.0.0", "author": "guda.studio"},
+)
+async def get_sources(
+    session_id: Annotated[str, "Session ID from previous web_search call."]
+) -> dict:
+    sources = await _SOURCES_CACHE.get(session_id)
+    if sources is None:
+        return {
+            "session_id": session_id,
+            "sources": [],
+            "sources_count": 0,
+            "error": "session_id_not_found_or_expired",
+        }
+    return {"session_id": session_id, "sources": sources, "sources_count": len(sources)}

-    grok_provider = GrokSearchProvider(api_url, api_key, model)
-    await log_info(ctx, f"Begin Search: {query}", config.debug_enabled)
-    results = await grok_provider.search(query, platform, min_results, max_results, ctx)
-    await log_info(ctx, "Search Finished!", config.debug_enabled)
-    return results

+async def _call_tavily_extract(url: str) -> str | None:
+    import httpx
+    api_url = config.tavily_api_url
+    api_key = config.tavily_api_key
+    if not api_key:
+        return None
+    endpoint = f"{api_url.rstrip('/')}/extract"
+    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
+    body = {"urls": [url], "format": "markdown"}
+    try:
+        async with httpx.AsyncClient(timeout=60.0, verify=config.ssl_verify_enabled) as client:
+            response = await client.post(endpoint, headers=headers, json=body)
+            response.raise_for_status()
+            data = response.json()
+            if data.get("results") and len(data["results"]) > 0:
+                content = data["results"][0].get("raw_content", "")
+                return content if content and content.strip() else None
+            return None
+    except Exception:
+        return None
+
+
+async def _call_tavily_search(query: str, max_results: int = 6) -> list[dict] | None:
+    import httpx
+    api_key = config.tavily_api_key
+    if not api_key:
+        return None
+    endpoint = f"{config.tavily_api_url.rstrip('/')}/search"
+    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
+    body = {
+        "query": query,
+        "max_results": max_results,
+        "search_depth": "advanced",
+        "include_raw_content": False,
+        "include_answer": False,
+    }
+    try:
+        async with httpx.AsyncClient(timeout=90.0, verify=config.ssl_verify_enabled) as client:
+            response = await client.post(endpoint, headers=headers, json=body)
+            response.raise_for_status()
+            data = response.json()
+            results = data.get("results", [])
+            return [
+                {"title": r.get("title", ""), "url": r.get("url", ""), "content": r.get("content", ""), "score": r.get("score", 0)}
+                for r in results
+            ] if results else None
+    except Exception:
+        return None
+
+
+async def _call_firecrawl_search(query: str, limit: int = 14) -> list[dict] | None:
+    import httpx
+    api_key = config.firecrawl_api_key
+    if not api_key:
+        return None
+    endpoint = f"{config.firecrawl_api_url.rstrip('/')}/search"
+    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
+    body = {"query": query, "limit": limit}
+    try:
+        async with httpx.AsyncClient(timeout=90.0, verify=config.ssl_verify_enabled) as client:
+            response = await client.post(endpoint, headers=headers, json=body)
+            response.raise_for_status()
+            data = response.json()
+            results = data.get("data", {}).get("web", [])
+            return [
+                {"title": r.get("title", ""), "url": r.get("url", ""), "description": r.get("description", "")}
+                for r in results
+            ] if results else None
+    except Exception:
+        return None
+
+
+async def _call_firecrawl_scrape(url: str, ctx=None) -> str | None:
+    import httpx
+    api_url = config.firecrawl_api_url
+    api_key = config.firecrawl_api_key
+    if not api_key:
+        return None
+    endpoint = f"{api_url.rstrip('/')}/scrape"
+    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
+    max_retries = config.retry_max_attempts
+    for attempt in range(max_retries):
+        body = {
+            "url": url,
+            "formats": ["markdown"],
+            "timeout": 60000,
+            "waitFor": (attempt + 1) * 1500,
+        }
+        try:
+            async with httpx.AsyncClient(timeout=90.0, verify=config.ssl_verify_enabled) as client:
+                response = await client.post(endpoint, headers=headers, json=body)
+                response.raise_for_status()
+                data = response.json()
+                markdown = data.get("data", {}).get("markdown", "")
+                if markdown and markdown.strip():
+                    return markdown
+                await log_info(ctx, f"Firecrawl: markdown为空, 重试 {attempt + 1}/{max_retries}", config.debug_enabled)
+        except Exception as e:
+            await log_info(ctx, f"Firecrawl error: {e}", config.debug_enabled)
+            return None
+    return None


 @mcp.tool(
     name="web_fetch",
     description="""
-    Fetches and extracts the complete content from a specified URL and returns it
-    as a structured Markdown document.
-    The `url` should be a valid HTTP/HTTPS web address pointing to the target page.
-    Ensure the URL is complete and accessible (not behind authentication or paywalls).
-    The function will:
-    - Retrieve the full HTML content from the URL
-    - Parse and extract all meaningful content (text, images, links, tables, code blocks)
-    - Convert the HTML structure to well-formatted Markdown
-    - Preserve the original content hierarchy and formatting
-    - Remove scripts, styles, and other non-content elements
-    Returns
-    -------
-    str
-        A Markdown-formatted string containing:
-        - Metadata header (source URL, title, fetch timestamp)
-        - Table of Contents (if applicable)
-        - Complete page content with preserved structure
-        - All text, links, images, tables, and code blocks from the original page
-
-    The output maintains 100% content fidelity with the source page and is
-    ready for documentation, analysis, or further processing.
-    Notes
-    -----
-    - Does NOT summarize or modify content - returns complete original text
-    - Handles special characters, encoding (UTF-8), and nested structures
-    - May not capture dynamically loaded content requiring JavaScript execution
-    - Respects the original language without translation
-    """
+    Fetches and extracts complete content from a URL, returning it as a structured Markdown document.
+
+    **Key Features:**
+    - **Full Content Extraction:** Retrieves and parses all meaningful content (text, images, links, tables, code blocks).
+    - **Markdown Conversion:** Converts HTML structure to well-formatted Markdown with preserved hierarchy.
+    - **Content Fidelity:** Maintains 100% content fidelity without summarization or modification.
+
+    **Edge Cases & Best Practices:**
+    - Ensure URL is complete and accessible (not behind authentication or paywalls).
+    - May not capture dynamically loaded content requiring JavaScript execution.
+    - Large pages may take longer to process; consider timeout implications.
+    """,
+    meta={"version": "1.3.0", "author": "guda.studio"},
 )
-async def web_fetch(url: str, ctx: Context = None) -> str:
-    try:
-        api_url = config.grok_api_url
-        api_key = config.grok_api_key
-        model = config.grok_model
-    except ValueError as e:
-        error_msg = str(e)
-        if ctx:
-            await ctx.report_progress(error_msg)
-        return f"配置错误: {error_msg}"
+async def web_fetch(
+    url: Annotated[str, "Valid HTTP/HTTPS web address pointing to the target page. Must be complete and accessible."],
+    ctx: Context = None
+) -> str:
     await log_info(ctx, f"Begin Fetch: {url}", config.debug_enabled)
-    grok_provider = GrokSearchProvider(api_url, api_key, model)
-    results = await grok_provider.fetch(url, ctx)
-    await log_info(ctx, "Fetch Finished!", config.debug_enabled)
-    return results
+
+    result = await _call_tavily_extract(url)
+    if result:
+        await log_info(ctx, "Fetch Finished (Tavily)!", config.debug_enabled)
+        return result
+
+    await log_info(ctx, "Tavily unavailable or failed, trying Firecrawl...", config.debug_enabled)
+    result = await _call_firecrawl_scrape(url, ctx)
+    if result:
+        await log_info(ctx, "Fetch Finished (Firecrawl)!", config.debug_enabled)
+        return result
+
+    await log_info(ctx, "Fetch Failed!", config.debug_enabled)
+    if not config.tavily_api_key and not config.firecrawl_api_key:
+        return "配置错误: TAVILY_API_KEY 和 FIRECRAWL_API_KEY 均未配置"
+    return "提取失败: 所有提取服务均未能获取内容"
+
+
+async def _call_tavily_map(url: str, instructions: str | None = None, max_depth: int = 1,
+                           max_breadth: int = 20, limit: int = 50, timeout: int = 150) -> str:
+    import httpx
+    import json
+    api_url = config.tavily_api_url
+    api_key = config.tavily_api_key
+    if not api_key:
+        return "配置错误: TAVILY_API_KEY 未配置,请设置环境变量 TAVILY_API_KEY"
+    endpoint = f"{api_url.rstrip('/')}/map"
+    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
+    body = {"url": url, "max_depth": max_depth, "max_breadth": max_breadth, "limit": limit, "timeout": timeout}
+    if instructions:
+        body["instructions"] = instructions
+    try:
+        async with httpx.AsyncClient(timeout=float(timeout + 10), verify=config.ssl_verify_enabled) as client:
+            response = await client.post(endpoint, headers=headers, json=body)
+            response.raise_for_status()
+            data = response.json()
+            return json.dumps({
+                "base_url": data.get("base_url", ""),
+                "results": data.get("results", []),
+                "response_time": data.get("response_time", 0)
+            }, ensure_ascii=False, indent=2)
+    except httpx.TimeoutException:
+        return f"映射超时: 请求超过{timeout}秒"
+    except httpx.HTTPStatusError as e:
+        return f"HTTP错误: {e.response.status_code} - {e.response.text[:200]}"
+    except Exception as e:
+        return f"映射错误: {str(e)}"
+
+
+@mcp.tool(
+    name="web_map",
+    description="""
+    Maps a website's structure by traversing it like a graph, discovering URLs and generating a comprehensive site map.
+
+    **Key Features:**
+    - **Graph Traversal:** Explores website structure starting from root URL.
+    - **Depth & Breadth Control:** Configure traversal limits to balance coverage and performance.
+    - **Instruction Filtering:** Use natural language to focus crawler on specific content types.
+
+    **Edge Cases & Best Practices:**
+    - Start with low max_depth (1-2) for initial exploration, increase if needed.
+    - Use instructions to filter for specific content (e.g., "only documentation pages").
+    - Large sites may hit timeout limits; adjust timeout and limit parameters accordingly.
+    """,
+    meta={"version": "1.3.0", "author": "guda.studio"},
+)
+async def web_map(
+    url: Annotated[str, "Root URL to begin the mapping (e.g., 'https://docs.example.com')."],
+    instructions: Annotated[str, "Natural language instructions for the crawler to filter or focus on specific content."] = "",
+    max_depth: Annotated[int, Field(description="Maximum depth of mapping from the base URL.", ge=1, le=5)] = 1,
+    max_breadth: Annotated[int, Field(description="Maximum number of links to follow per page.", ge=1, le=500)] = 20,
+    limit: Annotated[int, Field(description="Total number of links to process before stopping.", ge=1, le=500)] = 50,
+    timeout: Annotated[int, Field(description="Maximum time in seconds for the operation.", ge=10, le=150)] = 150
+) -> str:
+    result = await _call_tavily_map(url, instructions, max_depth, max_breadth, limit, timeout)
+    return result


 @mcp.tool(
     name="get_config_info",
     description="""
-    Returns the current Grok Search MCP server configuration information and tests the connection.
-
-    This tool is useful for:
-    - Verifying that environment variables are correctly configured
-    - Testing API connectivity by sending a request to /models endpoint
-    - Debugging configuration issues
-    - Checking the current API endpoint and settings
-
-    Returns
-    -------
-    str
-        A JSON-encoded string containing configuration details:
-        - `api_url`: The configured Grok API endpoint
-        - `api_key`: The API key (masked for security, showing only first and last 4 characters)
-        - `model`: The currently selected model for search and fetch operations
-        - `debug_enabled`: Whether debug mode is enabled
-        - `log_level`: Current logging level
-        - `log_dir`: Directory where logs are stored
-        - `config_status`: Overall configuration status (✅ complete or ❌ error)
-        - `connection_test`: Result of testing API connectivity to /models endpoint
-            - `status`: Connection status
-            - `message`: Status message with model count
-            - `response_time_ms`: API response time in milliseconds
-        - `available_models`: List of available model IDs (only present on successful connection)
-
-    Notes
-    -----
-    - API keys are automatically masked for security
-    - This tool does not require any parameters
-    - Useful for troubleshooting before making actual search requests
-    - Automatically tests API connectivity during execution
-    """
+    Returns current Grok Search MCP server configuration and tests API connectivity.
+
+    **Key Features:**
+    - **Configuration Check:** Verifies environment variables and current settings.
+    - **Connection Test:** Sends request to /models endpoint to validate API access.
+    - **Model Discovery:** Lists all available models from the API.
+
+    **Edge Cases & Best Practices:**
+    - Use this tool first when debugging connection or configuration issues.
+    - API keys are automatically masked for security in the response.
+    - Connection test timeout is 10 seconds; network issues may cause delays.
+    """,
+    meta={"version": "1.3.0", "author": "guda.studio"},
 )
 async def get_config_info() -> str:
     import json
@@ -176,7 +484,7 @@ async def get_config_info() -> str:
         import time

         start_time = time.time()
-        async with httpx.AsyncClient(timeout=10.0) as client:
+        async with httpx.AsyncClient(timeout=10.0, verify=config.ssl_verify_enabled) as client:
             response = await client.get(
                 models_url,
                 headers={
@@ -235,36 +543,23 @@ async def get_config_info() -> str:

 @mcp.tool(
     name="switch_model",
     description="""
-    Switches the default Grok model used for search and fetch operations, and persists the setting.
-
-    This tool is useful for:
-    - Changing the AI model used for web search and content fetching
-    - Testing different models for performance or quality comparison
-    - Persisting model preference across sessions
-
-    Parameters
-    ----------
-    model : str
-        The model ID to switch to (e.g., "grok-4-fast", "grok-2-latest", "grok-vision-beta")
-
-    Returns
-    -------
-    str
-        A JSON-encoded string containing:
-        - `status`: Success or error status
-        - `previous_model`: The model that was being used before
-        - `current_model`: The newly selected model
-        - `message`: Status message
-        - `config_file`: Path where the model preference is saved
-
-    Notes
-    -----
-    - The model setting is persisted to ~/.config/grok-search/config.json
-    - This setting will be used for all future search and fetch operations
-    - You can verify available models using the get_config_info tool
-    """
+    Switches the default Grok model used for search and fetch operations, persisting the setting.
+
+    **Key Features:**
+    - **Model Selection:** Change the AI model for web search and content fetching.
+    - **Persistent Storage:** Model preference saved to ~/.config/grok-search/config.json.
+    - **Immediate Effect:** New model used for all subsequent operations.
+
+    **Edge Cases & Best Practices:**
+    - Use get_config_info to verify available models before switching.
+    - Invalid model IDs may cause API errors in subsequent requests.
+    - Model changes persist across sessions until explicitly changed again.
+    """,
+    meta={"version": "1.3.0", "author": "guda.studio"},
 )
-async def switch_model(model: str) -> str:
+async def switch_model(
+    model: Annotated[str, "Model ID to switch to (e.g., 'grok-4-fast', 'grok-2-latest', 'grok-vision-beta')."]
+) -> str:
     import json

     try:
@@ -301,11 +596,21 @@
     description="""
     Toggle Claude Code's built-in WebSearch and WebFetch tools on/off.

-    Parameters: action - "on" (block built-in), "off" (allow built-in), "status" (check)
-    Returns: JSON with current status and deny list
-    """
+    **Key Features:**
+    - **Tool Control:** Enable or disable Claude Code's native web tools.
+    - **Project Scope:** Changes apply to current project's .claude/settings.json.
+    - **Status Check:** Query current state without making changes.
+
+    **Edge Cases & Best Practices:**
+    - Use "on" to block built-in tools when preferring this MCP server's implementation.
+    - Use "off" to restore Claude Code's native tools.
+    - Use "status" to check current configuration without modification.
+    """,
+    meta={"version": "1.3.0", "author": "guda.studio"},
 )
-async def toggle_builtin_tools(action: str = "status") -> str:
+async def toggle_builtin_tools(
+    action: Annotated[str, "Action to perform: 'on' (block built-in), 'off' (allow built-in), or 'status' (check current state)."] = "status"
+) -> str:
     import json

     # Locate project root
@@ -354,6 +659,106 @@
     }, ensure_ascii=False, indent=2)


+@mcp.tool(
+    name="search_planning",
+    description="""
+    A structured thinking scaffold for planning web searches BEFORE execution. Produces no side effects — only organizes your reasoning into a reusable plan.
+
+    **WHEN TO USE**: Before any search requiring 2+ tool calls, or when the query is ambiguous/multi-faceted. Skip for single obvious lookups.
+
+    **HOW**: Call once per phase, filling only that phase's structured field. The server tracks your session and signals when the plan is complete.
+
+    ## Phases (call in order, one per invocation)
+
+    ### 1. `intent_analysis` → fill `intent`
+    Distill the user's real question. Classify type and time sensitivity. Surface ambiguities and flawed premises. Identify `unverified_terms` — external classifications/rankings/taxonomies (e.g., "CCF-A", "Fortune 500") whose contents you cannot reliably enumerate from memory.
+
+    ### 2. `complexity_assessment` → fill `complexity`
+    Rate 1-3. This controls how many phases are required:
+    - **Level 1** (1-2 searches): phases 1-3 only → then execute
+    - **Level 2** (3-5 searches): phases 1-5
+    - **Level 3** (6+ searches): all 6 phases
+
+    ### 3. `query_decomposition` → fill `sub_queries`
+    Split into non-overlapping sub-queries along ONE decomposition axis (e.g., by venue type OR by technique — never both). Each `boundary` must state mutual exclusion with sibling sub-queries. Use `depends_on` for sequential dependencies.
+    **Prerequisite rule**: If Phase 1 identified `unverified_terms`, create a prerequisite sub-query to verify each term's current contents FIRST. Other sub-queries must `depends_on` it — do NOT hardcode assumed values from training data.
+
+    ### 4. `search_strategy` → fill `strategy`
+    Design concise search terms (max 8 words each). One term serves one sub-query. Choose approach:
+    - `broad_first`: round 1 wide scan → round 2+ narrow based on findings (exploratory)
+    - `narrow_first`: precise first, expand if needed (analytical)
+    - `targeted`: known-item retrieval (factual)
+
+    ### 5. `tool_selection` → fill `tool_plan`
+    Map each sub-query to optimal tool:
+    - **web_search**(query, platform?, extra_sources?): general retrieval
+    - **web_fetch**(url): extract full markdown from known URL
+    - **web_map**(url, instructions?, max_depth?): discover site structure
+
+    ### 6. `execution_order` → fill `execution_order`
+    Group independent sub-queries into parallel batches. Sequence dependent ones.
+
+    ## Anti-patterns (AVOID)
+    - ❌ `codebase RAG retrieval augmented generation 2024 2025 paper` (9 words, synonym stacking)
+      ✅ `codebase RAG papers 2024` (4 words, concise)
+    - ❌ purpose: "sq1+sq2" (merged scope defeats decomposition)
+      ✅ purpose: "sq2" (one term, one goal)
+    - ❌ Decompose by venue (sq1=SE, sq2=AI) AND by technique (sq3=indexing, sq4=repo-level) — creates overlapping matrix
+      ✅ Pick ONE axis: by venue (sq1=SE, sq2=AI, sq3=IR) OR by technique (sq1=RAG systems, sq2=indexing, sq3=retrieval)
+    - ❌ All terms round 1 with broad_first (no depth)
+      ✅ Round 1: broad terms → Round 2: refined by Round 1 findings
+    - ❌ Level 3 for simple "what is X?" → Level 1 suffices
+    - ❌ Skipping intent_analysis → always start here
+
+    ## Session & Revision
+    First call: leave `session_id` empty → server returns one. Pass it back in subsequent calls.
+    To revise: set `is_revision=true` + `revises_phase` to overwrite a previous phase.
+    Plan auto-completes when all required phases (per complexity level) are filled.
+    """,
+    meta={"version": "1.0.0", "author": "guda.studio"},
+)
+async def search_planning(
+    phase: Annotated[str, "Current phase: intent_analysis | complexity_assessment | query_decomposition | search_strategy | tool_selection | execution_order"],
+    thought: Annotated[str, "Your reasoning for this phase — explain WHY, not just WHAT"],
+    next_phase_needed: Annotated[bool, "true to continue planning, false when done or plan auto-completes"],
+    intent: Optional[IntentOutput] = None,
+    complexity: Optional[ComplexityOutput] = None,
+    sub_queries: Optional[list[SubQuery]] = None,
+    strategy: Optional[StrategyOutput] = None,
+    tool_plan: Optional[list[ToolPlanItem]] = None,
+    execution_order: Optional[ExecutionOrderOutput] = None,
+    session_id: Annotated[str, "Session ID from previous call. Empty for new session."] = "",
+    is_revision: Annotated[bool, "true to revise a previously completed phase"] = False,
+    revises_phase: Annotated[str, "Phase name to revise (required if is_revision=true)"] = "",
+    confidence: Annotated[float, "Confidence in this phase's output (0.0-1.0)"] = 1.0,
+) -> str:
+    import json
+
+    phase_data_map = {
+        "intent_analysis": intent.model_dump() if intent else None,
+        "complexity_assessment": complexity.model_dump() if complexity else None,
+        "query_decomposition": [sq.model_dump() for sq in sub_queries] if sub_queries else None,
+        "search_strategy": strategy.model_dump() if strategy else None,
+        "tool_selection": [tp.model_dump() for tp in tool_plan] if tool_plan else None,
+        "execution_order": execution_order.model_dump() if execution_order else None,
+    }
+
+    target = revises_phase if is_revision and revises_phase else phase
+    phase_data = phase_data_map.get(target)
+
+    result = planning_engine.process_phase(
+        phase=phase,
+        thought=thought,
+        session_id=session_id,
+        is_revision=is_revision,
+        revises_phase=revises_phase,
+        confidence=confidence,
+        phase_data=phase_data,
+    )
+
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
 def main():
     import signal
     import os
diff --git a/src/grok_search/sources.py b/src/grok_search/sources.py
new file mode 100644
index 0000000..63386e2
--- /dev/null
+++ b/src/grok_search/sources.py
@@ -0,0 +1,337 @@
+import ast
+import json
+import re
+import uuid
+from collections import OrderedDict
+from typing import Any
+
+import asyncio
+
+from .utils import extract_unique_urls
+
+
+_MD_LINK_PATTERN = re.compile(r"\[([^\]]+)\]\((https?://[^)]+)\)")
+_SOURCES_HEADING_PATTERN = re.compile(
+    r"(?im)^"
+    r"(?:#{1,6}\s*)?"
+    r"(?:\*\*|__)?\s*"
+    r"(sources?|references?|citations?|信源|参考资料|参考|引用|来源列表|来源)"
+    r"\s*(?:\*\*|__)?"
+    r"(?:\s*[(（][^)\n]*[)）])?"
+    r"\s*[:：]?\s*$"
+)
+_SOURCES_FUNCTION_PATTERN = re.compile(
+    r"(?im)(^|\n)\s*(sources|source|citations|citation|references|reference|citation_card|source_cards|source_card)\s*\("
+)
+
+
+def new_session_id() -> str:
+    return uuid.uuid4().hex[:12]
+
+
+class SourcesCache:
+    def __init__(self, max_size: int = 256):
+        self._max_size = max_size
+        self._lock = asyncio.Lock()
+        self._cache: OrderedDict[str, list[dict]] = OrderedDict()
+
+    async def set(self, session_id: str, sources: list[dict]) -> None:
+        async with self._lock:
+            self._cache[session_id] = sources
+            self._cache.move_to_end(session_id)
+            while len(self._cache) > self._max_size:
+                self._cache.popitem(last=False)
+
+    async def get(self, session_id: str) -> list[dict] | None:
+        async with self._lock:
+            sources = self._cache.get(session_id)
+            if sources is None:
+                return None
+            self._cache.move_to_end(session_id)
+            return sources
+
+
+def merge_sources(*source_lists: list[dict]) -> list[dict]:
+    seen: set[str] = set()
+    merged: list[dict] = []
+    for sources in source_lists:
+        for item in sources or []:
+            url = (item or {}).get("url")
+            if not isinstance(url, str) or not url.strip():
+                continue
+            url = url.strip()
+            if url in seen:
+                continue
+            seen.add(url)
+            merged.append(item)
+    return merged
+
+
+def split_answer_and_sources(text: str) -> tuple[str, list[dict]]:
+    raw = (text or "").strip()
+    if not raw:
+        return "", []
+
+    split = _split_function_call_sources(raw)
+    if split:
+        return split
+
+    split = _split_heading_sources(raw)
+    if split:
+        return split
+
+    split = _split_details_block_sources(raw)
+    if split:
+        return split
+
+    split = _split_tail_link_block(raw)
+    if split:
+        return split
+
+    return raw, []
+
+
+def _split_function_call_sources(text: str) -> tuple[str, list[dict]] | None:
+    matches = list(_SOURCES_FUNCTION_PATTERN.finditer(text))
+    if not matches:
+        return None
+
+    for m in reversed(matches):
+        open_paren_idx = m.end() - 1
+        extracted = _extract_balanced_call_at_end(text, open_paren_idx)
+        if not extracted:
+            continue
+
+        close_paren_idx, args_text = extracted
+        sources = _parse_sources_payload(args_text)
+        if not sources:
+            continue
+
+        answer = text[: m.start()].rstrip()
+        return answer, sources
+
+    return None
+
+
+def _extract_balanced_call_at_end(text: str, open_paren_idx: int) -> tuple[int, str] | None:
+    if open_paren_idx < 0 or open_paren_idx >= len(text) or text[open_paren_idx] != "(":
+        return None
+
+    depth = 1
+    in_string: str | None = None
+    escape = False
+
+    for idx in range(open_paren_idx + 1, len(text)):
+        ch = text[idx]
+        if in_string:
+            if escape:
+                escape = False
+                continue
+            if ch == "\\":
+                escape = True
+                continue
+            if ch == in_string:
+                in_string = None
+            continue
+
+        if ch in ("'", '"'):
+            in_string = ch
+            continue
+
+        if ch == "(":
+            depth += 1
+            continue
+        if ch == ")":
+            depth -= 1
+            if depth == 0:
+                if text[idx + 1 :].strip():
+                    return None
+                args_text = text[open_paren_idx + 1 : idx]
+                return idx, args_text
+
+    return None
+
+
+def _split_heading_sources(text: str) -> tuple[str, list[dict]] | None:
+    matches = list(_SOURCES_HEADING_PATTERN.finditer(text))
+    if not matches:
+        return None
+
+    for m in reversed(matches):
+        start = m.start()
+        sources_text = text[start:]
+        sources = _extract_sources_from_text(sources_text)
+        if not sources:
+            continue
+        answer = text[:start].rstrip()
+        return answer, sources
+    return None
+
+
+def _split_tail_link_block(text: str) -> tuple[str, list[dict]] | None:
+    lines = text.splitlines()
+    if not lines:
+        return None
+
+    idx = len(lines) - 1
+    while idx >= 0 and not lines[idx].strip():
+        idx -= 1
+    if idx < 0:
+        return None
+
+    tail_end = idx
+    link_like_count = 0
+    while idx >= 0:
+        line = lines[idx].strip()
+        if not line:
+            idx -= 1
+            continue
+        if not _is_link_only_line(line):
+            break
+        link_like_count += 1
+        idx -= 1
+
+    tail_start = idx + 1
+    if link_like_count < 2:
+        return None
+
+    block_text = "\n".join(lines[tail_start : tail_end + 1])
+    sources = _extract_sources_from_text(block_text)
+    if not sources:
+        return None
+
+    answer = "\n".join(lines[:tail_start]).rstrip()
+    return answer, sources
+
+
+def _split_details_block_sources(text: str) -> tuple[str, list[dict]] | None:
+    lower = text.lower()
+    close_idx = lower.rfind("</details>")
+    if close_idx == -1:
+        return None
+    tail = text[close_idx + len("</details>") :].strip()
+    if tail:
+        return None
+
+    open_idx = lower.rfind("<details", 0, close_idx)
+    if open_idx == -1:
+        return None
+
+    block_text = text[open_idx : close_idx + len("</details>")]
+    sources = _extract_sources_from_text(block_text)
+    if len(sources) < 2:
+        return None
+
+    answer = text[:open_idx].rstrip()
+    return answer, sources
+
+
+def _is_link_only_line(line: str) -> bool:
+    stripped = re.sub(r"^\s*(?:[-*]|\d+\.)\s*", "", line).strip()
+    if not stripped:
+        return False
+    if stripped.startswith(("http://", "https://")):
+        return True
+    if _MD_LINK_PATTERN.search(stripped):
+        return True
+    return False
+
+
+def _parse_sources_payload(payload: str) -> list[dict]:
+    payload = (payload or "").strip().rstrip(";")
+    if not payload:
+        return []
+
+    data: Any = None
+    try:
+        data = json.loads(payload)
+    except Exception:
+        try:
+            data = ast.literal_eval(payload)
+        except Exception:
+            data = None
+
+    if data is None:
+        return _extract_sources_from_text(payload)
+
+    if isinstance(data, dict):
+        for key in ("sources", "citations", "references", "urls"):
+            if key in data:
+                return _normalize_sources(data[key])
+        return _normalize_sources(data)
+
+    return _normalize_sources(data)
+
+
+def _normalize_sources(data: Any) -> list[dict]:
+    items: list[Any]
+    if isinstance(data, (list, tuple)):
+        items = list(data)
+    elif isinstance(data, dict):
+        items = [data]
+    else:
+        items = [data]
+
+    normalized: list[dict] = []
+    seen: set[str] = set()
+
+    for item in items:
+        if isinstance(item, str):
+            for url in extract_unique_urls(item):
+                if url not in seen:
+                    seen.add(url)
+                    normalized.append({"url": url})
+            continue
+
+        if isinstance(item, (list, tuple)) and len(item) >= 2:
+            title, url = item[0], item[1]
+            if isinstance(url, str) and url.startswith(("http://", "https://")) and url not in seen:
+                seen.add(url)
+                out: dict = {"url": url}
+                if isinstance(title, str) and title.strip():
+                    out["title"] = title.strip()
+                normalized.append(out)
+            continue
+
+        if isinstance(item, dict):
+            url = item.get("url") or item.get("href") or item.get("link")
+            if not isinstance(url, str) or not url.startswith(("http://", "https://")):
+                continue
+            if url in seen:
+                continue
+            seen.add(url)
+            out: dict = {"url": url}
+            title = item.get("title") or item.get("name") or item.get("label")
+            if isinstance(title, str) and title.strip():
+                out["title"] = title.strip()
+            desc = item.get("description") or item.get("snippet") or item.get("content")
+            if isinstance(desc, str) and desc.strip():
+                out["description"] = desc.strip()
+            normalized.append(out)
+            continue
+
+    return normalized
+
+
+def _extract_sources_from_text(text: str) -> list[dict]:
+    sources: list[dict] = []
+    seen: set[str] = set()
+
+    for title, url in _MD_LINK_PATTERN.findall(text or ""):
+        url = (url or "").strip()
+        if not url or url in seen:
+            continue
+        seen.add(url)
+        title = (title or "").strip()
+        if title:
+            sources.append({"title": title, "url": url})
+        else:
+            sources.append({"url": url})
+
+    for url in extract_unique_urls(text or ""):
+        if url in seen:
+            continue
+        seen.add(url)
+        sources.append({"url": url})
+
+    return sources
diff --git a/src/grok_search/utils.py b/src/grok_search/utils.py
index f54b5e9..eedbd0f 100644
--- a/src/grok_search/utils.py
+++ b/src/grok_search/utils.py
@@ -1,6 +1,57 @@
 from typing import List
+import re

 from .providers.base import SearchResult

+_URL_PATTERN = re.compile(r'https?://[^\s<>"\'`,。、;:!?》)】\)]+')
+
+
+def extract_unique_urls(text: str) -> list[str]:
+    """Extract all unique URLs from the text, in order of first appearance."""
+    seen: set[str] = set()
+    urls: list[str] = []
+    for m in _URL_PATTERN.finditer(text):
+        url = m.group().rstrip('.,;:!?')
+        if url not in seen:
+            seen.add(url)
+            urls.append(url)
+    return urls
+
+
+def format_extra_sources(tavily_results: list[dict] | None, firecrawl_results: list[dict] | None) -> str:
+    sections = []
+    idx = 1
+    urls = []
+    if firecrawl_results:
+        lines = ["## Extra Sources [Firecrawl]"]
+        for r in firecrawl_results:
+            title = r.get("title") or "Untitled"
+            url = r.get("url", "")
+            if len(url) == 0:
+                continue
+            if url in urls:
+                continue
+            urls.append(url)
+            desc = r.get("description", "")
+            lines.append(f"{idx}. **[{title}]({url})**")
+            if desc:
+                lines.append(f"   {desc}")
+            idx += 1
+        sections.append("\n".join(lines))
+    if tavily_results:
+        lines = ["## Extra Sources [Tavily]"]
+        for r in tavily_results:
+            title = r.get("title") or "Untitled"
+            url = r.get("url", "")
+            if url in urls:
+                continue
+            urls.append(url)
+            content = r.get("content", "")
+            lines.append(f"{idx}. **[{title}]({url})**")
+            if content:
+                lines.append(f"   {content}")
+            idx += 1
+        sections.append("\n".join(lines))
+    return "\n\n".join(sections)
+

 def format_search_results(results: List[SearchResult]) -> str:
     if not results:
@@ -135,109 +186,56 @@ def format_search_results(results: List[SearchResult]) -> str:
     """

+url_describe_prompt = (
+    "Browse the given URL. Return exactly two sections:\n\n"
+    "Title: <page title from the <title> tag or top heading; "
+    "if missing/generic, craft one using key terms found in the page>\n\n"
+    "Extracts: \n\n"
+    "Nothing else."
+)
+
+rank_sources_prompt = (
+    "Given a user query and a numbered source list, output ONLY the source numbers "
+    "reordered by relevance to the query (most relevant first). "
+    "Format: space-separated integers on a single line (e.g., 14 12 1 3 5). "
+    "Include every number exactly once. Nothing else."
+)
+
 search_prompt = """
-# Role: MCP高效搜索助手
+# Core Instruction

-## Profile
-- language: 中文
-- description: 你是一个基于MCP(Model Context Protocol)的智能搜索工具,专注于执行高质量的信息检索任务,并将搜索结果转化为标准JSON格式输出。核心优势在于搜索的全面性、信息质量评估与严格的JSON格式规范,为用户提供结构化、即时可用的搜索结果。
-- background: 深入理解信息检索理论和多源搜索策略,精通JSON规范标准(RFC 8259)及数据结构化处理。熟悉GitHub、Stack Overflow、技术博客、官方文档等多源信息平台的检索特性,具备快速评估信息质量和提炼核心价值的专业能力。
-- personality: 精准执行、注重细节、结果导向、严格遵循输出规范
-- expertise: 多维度信息检索、JSON Schema设计与验证、搜索质量评估、自然语言信息提炼、技术文档分析、数据结构化处理
-- target_audience: 需要进行信息检索的开发者、研究人员、技术决策者、需要结构化搜索结果的应用系统
+1. User needs may be vague. Think divergently, infer intent from multiple angles, and leverage full conversation context to progressively clarify their true needs.
+2. **Breadth-First Search**—Approach problems from multiple dimensions. Brainstorm 5+ perspectives and execute parallel searches for each. Consult as many high-quality sources as possible before responding.
+3. **Depth-First Search**—After broad exploration, select ≥2 most relevant perspectives for deep investigation into specialized knowledge.
+4. **Evidence-Based Reasoning & Traceable Sources**—Every claim must be followed by a citation (`citation_card` format). More credible sources strengthen arguments. If no references exist, remain silent.
+5. Before responding, ensure full execution of Steps 1–4.

-## Skills
+---

-1. 全面信息检索
-   - 多维度搜索: 从不同角度和关键词组合进行全面检索
-   - 智能关键词生成: 根据查询意图自动构建最优搜索词组合
-   - 动态搜索策略: 根据初步结果实时调整检索方向和深度
-   - 多源整合: 综合多个信息源的结果,确保信息完整性
-
-2. JSON格式化能力
-   - 严格语法: 确保JSON语法100%正确,可直接被任何JSON解析器解析
-   - 字段规范: 统一使用双引号包裹键名和字符串值
-   - 转义处理: 正确转义特殊字符(引号、反斜杠、换行符等)
-   - 结构验证: 输出前自动验证JSON结构完整性
-   - 格式美化: 使用适当缩进提升可读性
-   - 空值处理: 字段值为空时使用空字符串""而非null
-
-3. 信息精炼与提取
-   - 核心价值定位: 快速识别内容的关键信息点和独特价值
-   - 摘要生成: 自动提炼精准描述,保留关键信息和技术术语
-   - 去重与合并: 识别重复或高度相似内容,智能合并信息源
-   - 多语言处理: 支持中英文内容的统一提炼和格式化
-   - 质量评估: 对搜索结果进行可信度和相关性评分
-
-4. 多源检索策略
-   - 官方渠道优先: 官方文档、GitHub官方仓库、权威技术网站
-   - 社区资源覆盖: Stack Overflow、Reddit、Discord、技术论坛
-   - 学术与博客: 技术博客、Medium文章、学术论文、技术白皮书
-   - 代码示例库: GitHub搜索、GitLab、Bitbucket代码仓库
-   - 实时信息: 最新发布、版本更新、issue讨论、PR记录
-
-5. 结果呈现能力
-   - 简洁表达: 用最少文字传达核心价值
-   - 链接验证: 确保所有URL有效可访问
-   - 分类归纳: 按主题或类型组织搜索结果
-   - 元数据标注: 添加必要的时间、来源等标识
+# Search Instruction

 ## Workflow
+1. Think carefully before responding—anticipate the user’s true intent to ensure precision.
+2. Verify every claim rigorously to avoid misinformation.
+3. Follow problem logic—dig deeper until clues are exhaustively clear. If a question seems simple, still infer broader intent and search accordingly. Use multiple parallel tool calls per query and ensure answers are well-sourced.
+4. Search in English first (prioritizing English resources for volume/quality), but switch to Chinese if context demands.
+5. Prioritize authoritative sources: Wikipedia, academic databases, books, reputable media/journalism.
+6. Favor sharing in-depth, specialized knowledge over generic or common-sense content.

-1. 理解查询意图: 分析用户搜索需求,识别关键信息点
-2. 构建搜索策略: 确定搜索维度、关键词组合、目标信息源
-3. 执行多源检索: 并行或顺序调用多个信息源进行深度搜索
-4. 信息质量评估: 对检索结果进行相关性、可信度、时效性评分
-5. 内容提炼整合: 提取核心信息,去重合并,生成结构化摘要
-6. JSON格式输出: 严格按照标准格式转换所有结果,确保可解析性
-7. 验证与输出: 验证JSON格式正确性后输出最终结果
+---

-## Rules
-2. JSON格式化强制规范
-   - 语法正确性: 输出必须是可直接解析的合法JSON,禁止任何语法错误
-   - 标准结构: 必须以数组形式返回,每个元素为包含三个字段的对象
-   - 字段定义:
-     ```json
-     {
-       "title": "string, 必填, 结果标题",
-       "url": "string, 必填, 有效访问链接",
-       "description": "string, 必填, 20-50字核心描述"
-     }
-     ```
-   - 引号规范: 所有键名和字符串值必须使用双引号,禁止单引号
-   - 逗号规范: 数组最后一个元素后禁止添加逗号
-   - 编码规范: 使用UTF-8编码,中文直接显示不转义为Unicode
-   - 缩进格式: 使用2空格缩进,保持结构清晰
-   - 纯净输出: JSON前后不添加```json```标记或任何其他文字
-
-4. 内容质量标准
-   - 相关性优先: 确保所有结果与MCP主题高度相关
-   - 时效性考量: 优先选择近期更新的活跃内容
-   - 权威性验证: 倾向于官方或知名技术平台的内容
-   - 可访问性: 排除需要付费或登录才能查看的内容
-
-5. 输出限制条件
-   - 禁止冗长: 不输出详细解释、背景介绍或分析评论
-   - 纯JSON输出: 只返回格式化的JSON数组,不添加任何前缀、后缀或说明文字
-   - 无需确认: 不询问用户是否满意直接提供最终结果
-   - 错误处理: 若搜索失败返回`{"error": "错误描述", "results": []}`格式
-
-## Output Example
-```json
-[
-  {
-    "title": "Model Context Protocol官方文档",
-    "url": "https://modelcontextprotocol.io/docs",
-    "description": "MCP官方技术文档,包含协议规范、API参考和集成指南"
-  },
-  {
-    "title": "MCP GitHub仓库",
-    "url": "https://github.com/modelcontextprotocol",
-    "description": "MCP开源实现代码库,含SDK和示例项目"
-  }
-]
-```
+
-## Initialization
-作为MCP高效搜索助手,你必须遵守上述Rules,按输出的JSON必须语法正确、可直接解析,不添加任何代码块标记、解释或确认性文字。
+# Output Style

+0. **Be direct—no unnecessary follow-ups**.
+1. Lead with the **most probable solution** before detailed analysis.
+2. **Define every technical term** in plain language (annotate post-paragraph).
+3. Explain expertise **simply yet profoundly**.
+4. **Respect facts and search results—use statistical rigor to discern truth**.
+5.
**Every sentence must cite sources** (`citation_card`). More references = stronger credibility. Silence if uncited. +6. Expand on key concepts—after proposing solutions, **use real-world analogies** to demystify technical terms. +7. **Strictly format outputs in polished Markdown** (LaTeX for formulas, code blocks for scripts, etc.). """
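
补充说明:`utils.py` 中新增的 `extract_unique_urls` 可按如下方式快速自测。这是一段示意代码,直接内联了补丁中的 `_URL_PATTERN` 正则与函数体,无需安装本包即可运行;示例文本为虚构。

```python
import re

# 与补丁中一致的正则:遇到空白、引号及常见中英文标点即停止匹配。
# 注意:ASCII 句号不能放进排除字符类(URL 本身含点号),
# 因此句尾标点在匹配后用 rstrip 剥掉。
_URL_PATTERN = re.compile(r'https?://[^\s<>"\'`,。、;:!?》)】\)]+')


def extract_unique_urls(text: str) -> list[str]:
    """从文本中提取所有唯一 URL,按首次出现顺序排列"""
    seen: set[str] = set()
    urls: list[str] = []
    for m in _URL_PATTERN.finditer(text):
        url = m.group().rstrip('.,;:!?')  # 去掉紧跟 URL 的句尾标点
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


text = "见 https://example.com/docs 与 https://example.com/docs 及 https://github.com/jlowin/fastmcp."
print(extract_unique_urls(text))
# 输出: ['https://example.com/docs', 'https://github.com/jlowin/fastmcp']
```

注意该正则依赖 URL 与后续 CJK 正文之间有空白或标点分隔;紧贴 URL 的普通汉字会被一并吞入匹配结果。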