Grafana MCP Server: Comprehensive AI-Driven Integration with the Observability Platform

Posted on October 11, 2025

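Official implementation | Stars: 1700 | Go | Apache-2.0

Overview

Grafana MCP Server is the official Model Context Protocol implementation from Grafana Labs. It gives AI assistants deep integration with the Grafana observability platform by exposing Grafana's monitoring, alerting, log analysis, and incident management capabilities to AI models, enabling "AI-driven operations."

The server offers 45+ specialized tools covering the full observability stack: dashboard management, datasource queries (Prometheus/Loki), alert rules, incident tracking (Incidents), automated investigation (Sift), and OnCall scheduling. Whether the task is querying metrics, analyzing logs, diagnosing performance problems, or managing alert rules and on-call schedules, an AI assistant can interact with Grafana in natural language.

Grafana MCP Server works with both Grafana Cloud and self-hosted instances, offers several transport modes (stdio, SSE, streamable-HTTP), and supports fine-grained RBAC permissions, which helps keep enterprise deployments secure and scalable.

Core Features

✅ Official Grafana implementation, integrated with the wider observability ecosystem
🎯 45+ specialized tools covering monitoring, alerting, logs, and incidents
📊 Dashboard management: search, retrieval, updates, JSONPath extraction
🔥 Native Prometheus support: PromQL queries, metric metadata, label exploration
📝 Loki log analysis: LogQL queries, log stream statistics, label management
🚨 Alerting: alert rule management, contact point configuration, state monitoring
🔍 Sift investigations: automatic detection of log error patterns and slow requests
📞 OnCall scheduling: on-call schedules, shift details, current on-call users
🎫 Incident lifecycle: creating incidents, tracking activity, collaborative response
🔐 Enterprise-grade RBAC: fine-grained permissions with service account authentication
🚀 Multiple transport modes: stdio, SSE, streamable-HTTP
🌐 Grafana Cloud support: connects directly to cloud-hosted instances

Tool List

Dashboard Management Tools (6)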

1. search_dashboards

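Function: Searches Grafana dashboards, filtering by metadata such as title, tag, and folder

Parameters:

query (string, optional) - Search query string; supports fuzzy matching
tag (string[], optional) - Filter by tags
type (string, optional) - Dashboard type (dash-db / dash-folder)

Required permission: dashboards:read

Example:

{
  "query": "kubernetes",
  "tag": ["production", "monitoring"]
}

Use case: Quickly locating relevant dashboards; the AI can understand a request such as "show all Kubernetes dashboards for the production environment"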
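
For context, MCP clients deliver tool arguments like the ones above inside a JSON-RPC 2.0 `tools/call` request. The sketch below shows that envelope for `search_dashboards`. The helper name `build_tool_call` is hypothetical and used only for illustration; a real client library would normally build this message for you.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Wrap tool arguments in an MCP `tools/call` JSON-RPC 2.0 request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Same arguments as the search_dashboards example above.
request = build_tool_call(
    1,
    "search_dashboards",
    {"query": "kubernetes", "tag": ["production", "monitoring"]},
)
print(json.dumps(request, indent=2))
```

Every other tool in this list is invoked the same way; only `name` and `arguments` change.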

2. get_dashboard_by_uid

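Function: Retrieves the complete dashboard definition by unique identifier (including all panels, queries, and variable configuration)

Parameters:

uid (string, required) - Unique identifier of the dashboard

Required permission: dashboards:read

Note: Large dashboards can consume many tokens; prefer get_dashboard_summary where possible

Example:

{
  "uid": "abc123xyz"
}

3. get_dashboard_summary

Function: Retrieves a compact overview of a dashboard without the full JSON definition, saving context space

Parameters:

uid (string, required) - Unique identifier of the dashboard

Required permission: dashboards:read

Returns:

Dashboard title, description, and tags
Panel count and datasource statistics
Time range configuration
Variable list

Example:

{
  "uid": "abc123xyz"
}

4. get_dashboard_property

Function: Extracts a specific part of a dashboard using a JSONPath expression

Parameters:

uid (string, required) - Unique identifier of the dashboard
jsonpath (string, required) - JSONPath query expression

Required permission: dashboards:read

Example:

{
  "uid": "abc123xyz",
  "jsonpath": "$.panels[?(@.type=='graph')].title"
}

Use case: Extracting one specific piece of configuration, for example "get the titles of all graph panels"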
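
To make the JSONPath filter concrete, here is a minimal sketch of what `$.panels[?(@.type=='graph')].title` selects, written as plain Python over a dashboard dictionary. The sample dashboard is invented for illustration, and the function is not part of the server; it only mirrors the expression's meaning.

```python
def graph_panel_titles(dashboard):
    """Return the titles of all panels whose type is 'graph'.

    Equivalent in spirit to the JSONPath
    $.panels[?(@.type=='graph')].title
    """
    return [
        panel["title"]
        for panel in dashboard.get("panels", [])
        if panel.get("type") == "graph" and "title" in panel
    ]


# Invented sample data, shaped like a small dashboard definition.
dashboard = {
    "title": "Demo",
    "panels": [
        {"id": 1, "type": "graph", "title": "CPU"},
        {"id": 2, "type": "table", "title": "Top pods"},
        {"id": 3, "type": "graph", "title": "Memory"},
    ],
}

titles = graph_panel_titles(dashboard)
print(titles)
```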

5. get_dashboard_panel_queries

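Function: Extracts the query statements and datasource information of every panel in a dashboard

Parameters:

uid (string, required) - Unique identifier of the dashboard

Required permission: dashboards:read

Returns:

Panel title
Query expression (PromQL / LogQL / SQL, etc.)
Datasource UID and type
Panel ID

Example:

{
  "uid": "abc123xyz"
}

Use case: Understanding where a dashboard's data comes from, debugging query problems, copying queries into other tools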
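
The fields listed under "Returns" can be derived from a dashboard definition roughly as follows. This is a sketch under assumptions: it reads `expr` (used by Prometheus and Loki targets) or `rawSql` (used by SQL-style targets) from each panel target, and the sample dashboard is invented. The server's actual output format may differ.

```python
def panel_queries(dashboard):
    """Collect one row per panel target: panel id, title, query, datasource."""
    rows = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            query = target.get("expr") or target.get("rawSql")
            if not query:
                continue
            # A target-level datasource overrides the panel-level one.
            ds = target.get("datasource") or panel.get("datasource") or {}
            if not isinstance(ds, dict):
                # Older dashboards may store the datasource as a plain name.
                ds = {"uid": ds, "type": None}
            rows.append({
                "panel_id": panel.get("id"),
                "title": panel.get("title"),
                "query": query,
                "datasource_uid": ds.get("uid"),
                "datasource_type": ds.get("type"),
            })
    return rows


# Invented sample data for illustration.
dashboard = {
    "panels": [
        {
            "id": 1,
            "title": "Request rate",
            "datasource": {"uid": "prometheus-uid-123", "type": "prometheus"},
            "targets": [{"expr": "rate(http_requests_total[5m])"}],
        },
        {"id": 2, "title": "Notes", "targets": []},
    ]
}

rows = panel_queries(dashboard)
print(rows)
```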

6. update_dashboard

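Function: Updates an existing dashboard or creates a new one

Parameters:

dashboard (object, required) - Complete dashboard JSON definition
folderUid (string, optional) - Target folder UID
message (string, optional) - Update note (recorded in version history)
overwrite (boolean, optional) - Whether to overwrite the existing version

Required permission: dashboards:create, dashboards:write

Example:

{
  "dashboard": {
    "title": "My Dashboard",
    "panels": [...],
    "uid": "abc123xyz"
  },
  "message": "Added new CPU usage panel",
  "overwrite": true
}

Use case: The AI modifies dashboard configuration based on user requests, adding or adjusting panels

Datasource Management Tools (3)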

7. list_datasources

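Function: Lists all datasources configured in the Grafana instance

Parameters: none

Required permission: datasources:read

Returns:

Datasource name, UID, and type
URL endpoint
Whether it is the default datasource
Access mode (proxy / direct)

Example call:

{}

8. get_datasource_by_uid

Function: Retrieves detailed datasource configuration by unique identifier

Parameters:

uid (string, required) - Unique identifier of the datasource

Required permission: datasources:read

Example:

{
  "uid": "prometheus-uid-123"
}

9. get_datasource_by_name

Function: Retrieves detailed datasource configuration by name

Parameters:

name (string, required) - Datasource name

Required permission: datasources:read

Example:

{
  "name": "Production Prometheus"
}

Prometheus Query Tools (5)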

10. query_prometheus

Function: Executes a PromQL query and returns time series metric data

Parameters:

datasource_uid (string, required) - Prometheus datasource UID
query (string, required) - PromQL query expression
start (string, optional) - Query start time (RFC3339 format, e.g. 2025-10-11T00:00:00Z)
end (string, optional) - Query end time (RFC3339 format)
step (string, optional) - Query step (e.g. 15s, 1m, 5m)

Required permission: datasources:query

Example:

{
  "datasource_uid": "prometheus-uid-123",
  "query": "rate(http_requests_total[5m])",
  "start": "2025-10-11T00:00:00Z",
  "end": "2025-10-11T01:00:00Z",
  "step": "15s"
}

Use case: The AI generates PromQL from natural language, for example "HTTP request rate over the past hour"
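
The `start`, `end`, and `step` parameters together determine how many samples each series returns, which matters for token usage. The sketch below estimates that count for the example above, assuming the usual range-query behavior of evaluating at `start`, `start + step`, and so on up to `end`. The helper names are hypothetical.

```python
import re
from datetime import datetime, timedelta

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_step(step):
    """Parse a simple step such as '15s', '1m' or '5m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", step)
    if not match:
        raise ValueError(f"unsupported step: {step!r}")
    return timedelta(seconds=int(match.group(1)) * _UNIT_SECONDS[match.group(2)])


def parse_rfc3339(value):
    """Parse an RFC3339 timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def expected_points(start, end, step):
    """Number of evaluation timestamps per series in [start, end]."""
    span = parse_rfc3339(end) - parse_rfc3339(start)
    if span < timedelta(0):
        raise ValueError("end must not be before start")
    return span // parse_step(step) + 1


points = expected_points("2025-10-11T00:00:00Z", "2025-10-11T01:00:00Z", "15s")
print(points)
```

With one hour at a 15s step, each matching series contributes a few hundred samples, so a coarser step is often the easiest way to shrink a response.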

11. list_prometheus_metric_metadata

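Function: Lists metadata for Prometheus metrics (type, help text, unit)

Parameters:

datasource_uid (string, required) - Prometheus datasource UID
metric (string, optional) - A specific metric name

Required permission: datasources:query

Returns:

Metric type (counter / gauge / histogram / summary)
Help documentation string
Unit information

Example:

{
  "datasource_uid": "prometheus-uid-123",
  "metric": "http_requests_total"
}

12. list_prometheus_metric_names

Function: Lists all available Prometheus metric names

Parameters:

datasource_uid (string, required) - Prometheus datasource UID

Required permission: datasources:query

Example:

{
  "datasource_uid": "prometheus-uid-123"
}

Use case: Helping users discover available metrics; the AI can suggest related metrics

13. list_prometheus_label_names

Function: Lists all label names in Prometheus

Parameters:

datasource_uid (string, required) - Prometheus datasource UID
match[] (string[], optional) - Restrict to labels of matching metrics

Required permission: datasources:query

Example:

{
  "datasource_uid": "prometheus-uid-123"
}

14. list_prometheus_label_values

Function: Lists all possible values of a given label

Parameters:

datasource_uid (string, required) - Prometheus datasource UID
label (string, required) - Label name
match[] (string[], optional) - Restrict to matching metrics

Required permission: datasources:query

Example:

{
  "datasource_uid": "prometheus-uid-123",
  "label": "instance"
}

Use case: Filling in query parameters dynamically, for example "list all production instances"

Loki Log Query Tools (4)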

15. query_loki_logs

Function: Queries Loki log data using LogQL

Parameters:

datasource_uid (string, required) - Loki datasource UID
query (string, required) - LogQL query expression
start (string, optional) - Query start time (nanosecond timestamp or RFC3339)
end (string, optional) - Query end time
limit (integer, optional) - Maximum number of log lines to return (default 100)
direction (string, optional) - Sort direction (forward / backward)

Required permission: datasources:query

Example:

{
  "datasource_uid": "loki-uid-456",
  "query": "{app=\"nginx\"} |= \"error\" | json | status >= 500",
  "start": "2025-10-11T00:00:00Z",
  "limit": 500,
  "direction": "backward"
}

Use case: The AI generates LogQL from natural language, for example "find nginx 5xx errors from the past hour"
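
Since `start` and `end` accept either RFC3339 or a nanosecond timestamp, it can help to see how the two relate. The following sketch converts an RFC3339 string to Unix nanoseconds using integer arithmetic only (floats lose precision at this magnitude). The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def rfc3339_to_unix_nanos(value):
    """Convert an RFC3339 timestamp with an offset (or 'Z') to Unix nanoseconds."""
    moment = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Integer division by one microsecond avoids float rounding.
    micros = (moment - EPOCH) // timedelta(microseconds=1)
    return micros * 1000


start_ns = rfc3339_to_unix_nanos("2025-10-11T00:00:00Z")
end_ns = rfc3339_to_unix_nanos("2025-10-11T01:00:00Z")
print(start_ns, end_ns)
```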

16. list_loki_label_names

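Function: Lists all label names of Loki log streams

Parameters:

datasource_uid (string, required) - Loki datasource UID
start (string, optional) - Start of the time range
end (string, optional) - End of the time range

Required permission: datasources:query

Example:

{
  "datasource_uid": "loki-uid-456"
}

17. list_loki_label_values

Function: Lists all possible values of a Loki label

Parameters:

datasource_uid (string, required) - Loki datasource UID
label (string, required) - Label name
start (string, optional) - Start of the time range
end (string, optional) - End of the time range

Required permission: datasources:query

Example:

{
  "datasource_uid": "loki-uid-456",
  "label": "namespace"
}

18. query_loki_stats

Function: Retrieves statistics about log streams (line count, byte count, stream count)

Parameters:

datasource_uid (string, required) - Loki datasource UID
query (string, required) - LogQL query expression
start (string, optional) - Query start time
end (string, optional) - Query end time

Required permission: datasources:query

Example:

{
  "datasource_uid": "loki-uid-456",
  "query": "{app=\"backend\"}",
  "start": "2025-10-11T00:00:00Z",
  "end": "2025-10-11T23:59:59Z"
}

Use case: Estimating log volume and checking storage usage

Grafana Incident Management Tools (4)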

19. list_incidents

Function: Searches and lists Grafana Incident records

Parameters:

query (string, optional) - Search query
status (string, optional) - Filter by status (active / resolved / closed)
severity (string, optional) - Filter by severity

Required permission: Viewer role (basic Grafana role)

Example:

{
  "query": "database",
  "status": "active",
  "severity": "critical"
}

20. get_incident

Function: Retrieves the details of a specific incident (timeline, activity, related alerts)

Parameters:

incident_id (string, required) - Incident ID

Required permission: Viewer role

Example:

{
  "incident_id": "incident-123"
}

21. create_incident

Function: Creates a new incident record

Parameters:

title (string, required) - Incident title
severity (string, optional) - Severity (minor / major / critical)
status (string, optional) - Initial status (default active)
labels (object, optional) - Custom labels

Required permission: Editor role

Example:

{
  "title": "Database Connection Pool Exhausted",
  "severity": "critical",
  "labels": {
    "service": "api-backend",
    "environment": "production"
  }
}

22. add_activity_to_incident

Function: Adds an activity entry to an incident (investigation progress, mitigation steps, communication notes)

Parameters:

incident_id (string, required) - Incident ID
activity (string, required) - Activity content (Markdown format)
activity_type (string, optional) - Activity type (investigation / mitigation / communication)

Required permission: Editor role

Example:

{
  "incident_id": "incident-123",
  "activity": "Increased database connection pool size from 100 to 200. Load is stabilizing.",
  "activity_type": "mitigation"
}

Sift Investigation Tools (5)

Sift is Grafana's AI-driven root cause analysis tool; it automatically identifies anomalous patterns.

23. list_sift_investigations

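Function: Lists all Sift investigations

Parameters:

status (string, optional) - Filter by status (in_progress / completed)

Required permission: Viewer role

Example:

{
  "status": "completed"
}

24. get_sift_investigation

Function: Retrieves the details and analysis results of a Sift investigation

Parameters:

investigation_id (string, required) - Investigation ID

Required permission: Viewer role

Example:

{
  "investigation_id": "sift-inv-789"
}

25. get_sift_analysis

Function: Retrieves the results of a specific Sift analysis (detected patterns, anomalies)

Parameters:

analysis_id (string, required) - Analysis ID

Required permission: Viewer role

Example:

{
  "analysis_id": "analysis-456"
}

26. find_error_pattern_logs

Function: Uses Sift to automatically detect error patterns in Loki logs

Parameters:

datasource_uid (string, required) - Loki datasource UID
query (string, required) - LogQL query (defines the log scope)
start (string, optional) - Analysis start time
end (string, optional) - Analysis end time

Required permission: Viewer role

Example:

{
  "datasource_uid": "loki-uid-456",
  "query": "{namespace=\"production\", app=\"api\"} |= \"error\"",
  "start": "2025-10-11T00:00:00Z",
  "end": "2025-10-11T01:00:00Z"
}

Use case: The AI discovers anomalous error patterns in logs without hand-written rules

Returns:

Detected error patterns
Frequency distribution
Anomalous time windows
Related log samples

27. find_slow_requests

Function: Uses Sift to identify slow requests in Tempo trace data

Parameters:

datasource_uid (string, required) - Tempo datasource UID
query (string, required) - TraceQL query
start (string, optional) - Analysis start time
end (string, optional) - Analysis end time

Required permission: Viewer role

Example:

{
  "datasource_uid": "tempo-uid-789",
  "query": "{service.name=\"api-gateway\"}",
  "start": "2025-10-11T00:00:00Z",
  "end": "2025-10-11T01:00:00Z"
}

Use case: Automatically identifying performance bottlenecks and requests with abnormal response times

Returns:

Trace IDs of slow requests
Response time distribution
Anomalous endpoints
Analysis of potential bottlenecks

Alerting Tools (3)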

28. list_alert_rules

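Function: Lists all Grafana alert rules and their states

Parameters:

state (string, optional) - Filter by alert state (firing / normal / pending / nodata / error)
folder (string, optional) - Filter by folder

Required permission: alert.rules:read

Example:

{
  "state": "firing"
}

Returns:

Alert rule name and UID
Current state
Query expression
Evaluation interval
Trigger condition

29. get_alert_rule_by_uid

Function: Retrieves the detailed configuration of a specific alert rule

Parameters:

uid (string, required) - Alert rule UID

Required permission: alert.rules:read

Example:

{
  "uid": "alert-rule-xyz"
}

30. list_contact_points

Function: Lists all alert notification contact points (Email, Slack, PagerDuty, etc.)

Parameters: none

Required permission: alert.notifications:read

Returns:

Contact point name and UID
Notification type (email / slack / webhook / pagerduty, etc.)
Configuration details

Example call:

{}

Grafana OnCall Scheduling Tools (8)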

31. list_oncall_schedules

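Function: Lists all Grafana OnCall schedules

Parameters:

team (string, optional) - Filter by team

Required permission: OnCall access

Example:

{
  "team": "platform-team"
}

32. get_oncall_schedule

Function: Retrieves the details of a specific schedule (rotation rules, coverage, holiday arrangements)

Parameters:

schedule_id (string, required) - Schedule ID

Required permission: OnCall access

Example:

{
  "schedule_id": "schedule-abc123"
}

33. get_oncall_shift

Function: Retrieves the details of a specific shift

Parameters:

shift_id (string, required) - Shift ID

Required permission: OnCall access

Example:

{
  "shift_id": "shift-xyz789"
}

34. get_current_oncall_users

Function: Shows who is currently on call

Parameters:

schedule_id (string, required) - Schedule ID

Required permission: OnCall access

Example:

{
  "schedule_id": "schedule-abc123"
}

Returns:

On-call user names and contact details
Shift start and end times
Escalation path

Use case: The AI answers questions such as "who is on call?" and "how do I reach the current on-call engineer?"

35. list_oncall_teams

Function: Lists all OnCall teams

Parameters: none

Required permission: OnCall access

Example call:

{}

36. list_oncall_users

Function: Lists all OnCall users and their contact information

Parameters:

team (string, optional) - Filter by team

Required permission: OnCall access

Example:

{
  "team": "backend-team"
}

37. list_oncall_alert_groups

Function: Lists OnCall alert groups

Parameters:

state (string, optional) - Filter by state (new / acknowledged / resolved / silenced)
integration (string, optional) - Filter by source integration
started_at_after (string, optional) - Filter by start time

Required permission: OnCall access

Example:

{
  "state": "new",
  "integration": "grafana"
}

38. get_oncall_alert_group

Function: Retrieves the details of an alert group (alert content, acknowledgement state, resolution record)

Parameters:

alert_group_id (string, required) - Alert group ID

Required permission: OnCall access

Example:

{
  "alert_group_id": "alert-group-456"
}

User and Team Management Tools (2)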

39. list_teams

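Function: Lists all teams in Grafana

Parameters:

query (string, optional) - Search by team name

Required permission: teams:read

Example:

{
  "query": "platform"
}

40. list_users_by_org

Function: Lists all users in the current organization

Parameters: none

Required permission: users:read

Returns:

Username and email
Role (Admin / Editor / Viewer)
Last login time

Example call:

{}

Configuration

Environment Variables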

# Grafana instance URL (required)
GRAFANA_URL=https://your-grafana.example.com

# Service account token (recommended)
GRAFANA_SERVICE_ACCOUNT_TOKEN=glsa_xxxxxxxxxxxxxxxxxxxxx

# Or authenticate with username/password (not recommended for production)
GRAFANA_USERNAME=admin
GRAFANA_PASSWORD=your-password

# Optional: disable specific tool categories
DISABLE_TOOLS=oncall,sift

# Optional: transport mode (stdio / sse / streamable-http)
TRANSPORT_MODE=stdio
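
The rules implied by these variables (URL required, token or username/password required, optional category and transport settings) can be checked before launching the server. The sketch below is a hypothetical pre-flight validator for a wrapper script; it uses only the variable names listed above and is not how the server itself parses its configuration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional

TRANSPORTS = ("stdio", "sse", "streamable-http")


@dataclass(frozen=True)
class GrafanaConfig:
    url: str
    token: Optional[str]
    username: Optional[str]
    password: Optional[str]
    disabled: FrozenSet[str]
    transport: str


def load_config(env: Mapping[str, str]) -> GrafanaConfig:
    """Validate the environment variables described above."""
    url = env.get("GRAFANA_URL", "").rstrip("/")
    if not url:
        raise ValueError("GRAFANA_URL is required")
    token = env.get("GRAFANA_SERVICE_ACCOUNT_TOKEN") or None
    username = env.get("GRAFANA_USERNAME") or None
    password = env.get("GRAFANA_PASSWORD") or None
    if token is None and not (username and password):
        raise ValueError("set a service account token or username and password")
    transport = env.get("TRANSPORT_MODE", "stdio")
    if transport not in TRANSPORTS:
        raise ValueError(f"unknown transport mode: {transport!r}")
    disabled = frozenset(
        part.strip() for part in env.get("DISABLE_TOOLS", "").split(",") if part.strip()
    )
    return GrafanaConfig(url, token, username, password, disabled, transport)


config = load_config({
    "GRAFANA_URL": "https://your-grafana.example.com/",
    "GRAFANA_SERVICE_ACCOUNT_TOKEN": "glsa_xxxxxxxxxxxxxxxxxxxxx",
    "DISABLE_TOOLS": "oncall, sift",
})
print(config.url, sorted(config.disabled), config.transport)
```

In a real script, pass `os.environ` instead of the literal dictionary.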

Claude Desktop Configuration

Add the following to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "grafana": {
      "command": "mcp-grafana",
      "args": [],
      "env": {
        "GRAFANA_URL": "https://your-grafana.example.com",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "glsa_xxxxxxxxxxxxxxxxxxxxx"
      }
    }
  }
}

Configuration steps:

Create a service account in Grafana: Configuration > Service Accounts > Add service account
Assign appropriate permissions (follow the principle of least privilege)
Generate a token and set it in the environment variables
Restart Claude Desktop

Docker Deployment

# Use the official image
docker pull ghcr.io/grafana/mcp-grafana:latest

# Run the container (stdio mode)
docker run -i --rm \
  -e GRAFANA_URL=https://your-grafana.example.com \
  -e GRAFANA_SERVICE_ACCOUNT_TOKEN=glsa_xxxxxxxxxxxxxxxxxxxxx \
  ghcr.io/grafana/mcp-grafana:latest \
  -t stdio

# Run the container (HTTP mode, exposing a port)
docker run -d -p 8080:8080 \
  -e GRAFANA_URL=https://your-grafana.example.com \
  -e GRAFANA_SERVICE_ACCOUNT_TOKEN=glsa_xxxxxxxxxxxxxxxxxxxxx \
  ghcr.io/grafana/mcp-grafana:latest \
  -t streamable-http --port 8080

Kubernetes Helm Chart Deployment

# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Create the values.yaml configuration file
cat > values.yaml <<EOF
grafana:
  url: "https://your-grafana.example.com"
  serviceAccountToken: "glsa_xxxxxxxxxxxxxxxxxxxxx"

service:
  type: ClusterIP
  port: 8080

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

replicaCount: 2

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
EOF

# Deploy
helm install grafana-mcp grafana/mcp-grafana \
  -f values.yaml \
  --namespace observability \
  --create-namespace

# Verify the deployment
kubectl get pods -n observability
kubectl logs -f deployment/grafana-mcp -n observability

Running from a Binary

# Download a prebuilt binary (GitHub Releases)
# macOS/Linux
wget https://github.com/grafana/mcp-grafana/releases/latest/download/mcp-grafana-$(uname -s)-$(uname -m)
chmod +x mcp-grafana-*
sudo mv mcp-grafana-* /usr/local/bin/mcp-grafana

# Set environment variables
export GRAFANA_URL="https://your-grafana.example.com"
export GRAFANA_SERVICE_ACCOUNT_TOKEN="glsa_xxxxxxxxxxxxxxxxxxxxx"

# Run
mcp-grafana -t stdio

Building from Source

# Clone the repository
git clone https://github.com/grafana/mcp-grafana.git
cd mcp-grafana

# Build (requires Go 1.21+)
go build -o mcp-grafana cmd/mcp-grafana/main.go

# Run
./mcp-grafana -t stdio

Choosing a Transport Mode

stdio (standard input/output)
- Suited to: Claude Desktop, local development
- Advantage: simple and direct, no network configuration
- Limitation: single session
SSE (Server-Sent Events)
- Suited to: web applications, browser integration
- Advantage: real-time streaming responses
- Port: 8080 by default
streamable-HTTP
- Suited to: enterprise integration, multiple clients
- Advantage: standard HTTP, easy to load balance
- Port: configurable

# stdio mode
mcp-grafana -t stdio

# SSE mode
mcp-grafana -t sse --port 8080

# streamable-HTTP mode
mcp-grafana -t streamable-http --port 8080

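Use Cases

Scenario 1: AI-Assisted Troubleshooting

Requirement: At 3 a.m. a production alert fires for "API response time spike," and the root cause must be located quickly.

AI interaction flow:

User: "API response time alert, help me investigate"

AI analysis steps (tools called automatically):

① list_alert_rules(state="firing")
   → Finds that the "API Response Time > 2s" rule is firing

② get_dashboard_by_uid(uid="api-performance-dashboard")
   → Retrieves the API performance dashboard

③ query_prometheus(query="histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))")
   → P99 latency has jumped from 200ms to 3s

④ query_loki_logs(query='{service="api"} |= "error" | json | duration > 2000')
   → Finds many "database timeout" errors

⑤ find_error_pattern_logs(query='{service="api"}')
   → Sift detects a new error pattern: "connection pool exhausted"

⑥ query_prometheus(query="db_connection_pool_active / db_connection_pool_max")
   → Database connection pool utilization is at 99%

AI output:

🔍 Root cause analysis:

Problem: database connection pool exhausted
- Connection pool utilization: 99% (normal <70%)
- P99 latency: 3000ms (normal 200ms)
- Error pattern: many "connection pool exhausted" messages

Recommended actions:
1. Immediately enlarge the database connection pool: 100 → 200
2. Check for slow queries (run find_slow_requests)
3. Roll back the most recent deployment (if there was one)

Related dashboard: [API Performance Dashboard](https://grafana.example.com/d/api-perf)

User: "Create an incident record for me"

AI executes:

create_incident({
  "title": "API Response Time Spike - Database Connection Pool Exhaustion",
  "severity": "critical",
  "labels": {"service": "api", "component": "database"}
})

Value:

From alert to root cause analysis in about 3 minutes (the manual approach takes 30+ minutes)
Automatic correlation of logs, metrics, and traces
A structured incident record that simplifies the post-incident review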

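Scenario 2: Querying Metrics in Natural Language

Requirement: A product manager asks, "What was the user signup conversion rate over the past 24 hours?"

AI interaction flow:

User: "What was the user signup conversion rate over the past 24 hours?"

AI analysis steps:

① search_dashboards(query="conversion")
   → Finds "User Conversion Dashboard"

② get_dashboard_panel_queries(uid="conversion-dashboard")
   → Extracts the relevant PromQL queries

③ query_prometheus({
     query: "(sum(rate(user_registration_completed[24h])) / sum(rate(user_registration_started[24h]))) * 100",
     start: "now-24h",
     end: "now"
   })
   → Conversion rate: 12.5%

④ query_prometheus({
     query: "sum(rate(user_registration_completed[24h])) by (source)",
     start: "now-24h",
     end: "now"
   })
   → Breakdown by source:
      - organic: 15.2%
      - paid_ads: 10.8%
      - referral: 14.1%

AI output:

📊 User signup conversion rate (past 24 hours)

Overall conversion rate: 12.5%
- vs. yesterday: ↓ 0.8%
- vs. same period last week: ↑ 1.2%

By source:
1. Organic: 15.2% ⭐ Best
2. Referral: 14.1%
3. Paid Ads: 10.8% ⚠️ Below expectations

Trends:
- Highest conversion between 10:00 and 12:00 (16.3%)
- Lowest conversion between 02:00 and 04:00 (8.1%)

Suggestion: optimize the Paid Ads landing page and A/B test the CTA button

View the detailed dashboard →

Value:

No need to learn PromQL syntax
Multi-dimensional analysis generated automatically
Business insights and suggestions included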

Scenario 3: Automated Dashboard Management

Requirement: A new microservice is going live and needs a standard monitoring dashboard.

AI interaction flow:

User: "Create a standard microservice monitoring dashboard for the new service 'payment-gateway'"

AI execution steps:

① search_dashboards(query="template microservice")
   → Finds "Microservice Template Dashboard"

② get_dashboard_by_uid(uid="template-microservice")
   → Retrieves the template JSON

③ Substitutes variables automatically:
   - service_name: "payment-gateway"
   - prometheus_datasource: taken from list_datasources()
   - loki_datasource: taken from list_datasources()

④ update_dashboard({
     dashboard: {
       title: "Payment Gateway - Microservice Monitoring",
       tags: ["microservice", "payment", "production"],
       panels: [
         { title: "Request Rate", query: "rate(http_requests_total{service='payment-gateway'}[5m])" },
         { title: "Error Rate", query: "rate(http_requests_total{service='payment-gateway',status=~'5..'}[5m])" },
         { title: "Latency P99", query: "histogram_quantile(0.99, ...)" },
         { title: "Logs", datasource: "Loki", query: "{service='payment-gateway'}" }
       ]
     },
     message: "Auto-generated by AI for payment-gateway service"
   })

AI output:

✅ Dashboard created successfully!

Name: Payment Gateway - Microservice Monitoring
URL: https://grafana.example.com/d/payment-gateway-monitoring

Panels included:
- Request Rate
- Error Rate
- Latency P50/P90/P99 (latency distribution)
- CPU & Memory Usage (resource usage)
- Log Stream (live logs)

Alert rules configured:
- Error Rate > 1%
- P99 Latency > 500ms
- CPU Usage > 80%

Value:

Standardized dashboards created in seconds
Best-practice panels configured automatically
Consistent monitoring across services

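Scenario 4: OnCall Schedule Lookup

Requirement: On Friday evening, the weekend on-call arrangement needs to be confirmed.

AI interaction flow:

User: "Who is on call this weekend?"

AI execution steps:

① list_oncall_schedules()
   → Finds "Platform Team Primary Schedule"

② get_current_oncall_users(schedule_id="platform-primary")
   → Currently on call: Alice (until Saturday 09:00)

③ get_oncall_schedule(schedule_id="platform-primary")
   → Retrieves the full schedule
   → Weekend on call: Bob (Saturday 09:00 - Monday 09:00)

AI output:

📅 Platform Team on-call arrangement for this weekend

On call now: Alice Chen
- Ends: Saturday 2025-10-11 09:00 UTC
- Contact: [email protected], +1-555-0123
- Slack: @alice

Weekend on call: Bob Wilson
- Shift: Saturday 09:00 - Monday 09:00 UTC
- Contact: [email protected], +1-555-0456
- Slack: @bob

Escalation path:
1. Level 1: Bob Wilson (weekend on call)
2. Level 2: Platform Team Lead (after 15 minutes)
3. Level 3: VP Engineering (after 30 minutes)

User: "Bob is busy this weekend. Can someone cover for him?"

AI suggestion:

📋 Available cover (from the OnCall team):

1. Charlie Davis - available
   - Last on call: 2 weeks ago
   - Shifts this month: 2

2. Diana Lee - available
   - Last on call: 1 week ago
   - Shifts this month: 3

Suggestion: contact Charlie, which keeps the load more balanced

Making the swap: adjust it manually in the Grafana OnCall UI,
or via the API (requires additional write permissions)

Value:

Fast access to on-call information
Contact details provided automatically
Sensible recommendations for cover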

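Scenario 5: Discovering Anomalous Log Patterns

Requirement: Users report intermittent failures, but no alert has fired.

AI interaction flow:

User: "Users say payments occasionally fail, but there is no alert. Can you look into it?"

AI execution steps:

① query_loki_logs({
     query: '{service="payment-gateway"} |~ "error|fail"',
     start: "now-6h",
     limit: 1000
   })
   → Finds 127 error log lines

② find_error_pattern_logs({
     datasource_uid: "loki-prod",
     query: '{service="payment-gateway"}',
     start: "now-6h"
   })
   → Sift detects 3 anomalous patterns

③ query_loki_stats({
     query: '{service="payment-gateway", level="error"}',
     start: "now-6h"
   })
   → Error rate: 0.3% (below the 1% alert threshold)

AI output:

🔍 Sift analysis results

3 anomalous error patterns detected:

1. **New error pattern** (confidence: 95%)
   Error message: "Payment provider timeout"
   Occurrences: 47
   Time distribution: concentrated in 14:00-15:00 UTC
   Affected users: about 40

   Likely cause: third-party payment API timing out
   Suggestion: add timeout retries and fall back to a backup payment channel

2. **Frequency anomaly** (confidence: 82%)
   Error message: "Insufficient balance check failed"
   Occurrences: 3.2x the usual rate

   Likely cause: unstable balance lookup service
   Suggestion: check the health of the balance service

3. **Sporadic error** (confidence: 68%)
   Error message: "Transaction ID collision"
   Occurrences: 8

   Likely cause: collisions in the ID generation algorithm
   Suggestion: review the ID generation logic

📊 Error rate trend:
- 6 hours ago: 0.1%
- 3 hours ago: 0.5% (peak)
- Now: 0.2%

💡 No alert fired (threshold 1%), but a new error pattern appeared, so an investigation is recommended

User: "Create an incident and notify the payments team"

AI executes:

① create_incident({
     title: "Payment Gateway - Intermittent Failures (Sift Detected)",
     severity: "major",
     labels: {
       "service": "payment-gateway",
       "detected_by": "sift_ai",
       "error_pattern": "provider_timeout"
     }
   })

② add_activity_to_incident({
     incident_id: "incident-789",
     activity: `
## Sift analysis

- **Main problem**: third-party payment API timeouts (47 occurrences)
- **Impact**: about 40 users, 0.3% transaction failure rate
- **Time window**: 14:00-15:00 UTC

## Recommended actions
1. Contact the third-party payment provider to confirm service status
2. Implement timeout retries
3. Configure a backup payment channel

## Related logs
[View in Grafana Loki](https://grafana.example.com/explore?...)
     `,
     activity_type: "investigation"
   })

③ Output:
   "✅ Incident created (ID: incident-789)
    📢 Suggest notifying @payment-team in the Slack #incident-response channel"

Value:

The AI surfaces low-frequency anomalies that threshold-based alerts miss
Sift provides pattern analysis and root cause hypotheses
A structured incident report is generated automatically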

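Scenario 6: Performance Optimization Recommendations

Requirement: Review system performance periodically and find potential bottlenecks.

AI interaction flow:

User: "Analyze the API service's performance and suggest optimizations"

AI execution steps:

① search_dashboards(query="api performance")
   → Finds "API Performance Dashboard"

② query_prometheus({
     query: "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[24h]))",
     start: "now-7d"
   })
   → P99 latency: 450ms (up 15% from last week)

③ query_prometheus({
     query: "topk(10, sum by (endpoint) (rate(http_request_duration_seconds_sum[24h])) / sum by (endpoint) (rate(http_request_duration_seconds_count[24h])))",
     start: "now-24h"
   })
   → Slowest endpoint: /api/v1/reports/export (2.3s on average)

④ find_slow_requests({
     datasource_uid: "tempo-prod",
     query: '{service.name="api", http.route="/api/v1/reports/export"}',
     start: "now-24h"
   })
   → Sift identifies the bottleneck: database queries take 80% of the time

⑤ query_prometheus({
     query: "sum by (status) (rate(http_requests_total[24h]))",
     start: "now-7d"
   })
   → Error rate trend: stable at 0.1%

AI output:

📈 API performance report (past 7 days)

### Overview
- P99 latency: 450ms (↑ 15% vs. last week) ⚠️
- Request volume: 250k/day (↑ 8%)
- Error rate: 0.1% (stable) ✅
- Availability: 99.95%

### Performance bottlenecks (Top 3)

1. **`/api/v1/reports/export`** - Severe 🔴
   - Average response time: 2.3s (target <500ms)
   - Request volume: 2k/day
   - Bottleneck: database aggregation queries (80% of the time)

   **Optimization suggestions**:
   - Add a report cache (Redis) with a 5-minute TTL
   - Generate large reports asynchronously and return a download link
   - Add a database index: `reports (created_at, user_id)`

   **Expected benefit**: response time drops to <300ms

2. **`/api/v1/search`** - Moderate 🟡
   - Average response time: 850ms
   - Request volume: 50k/day
   - Bottleneck: Elasticsearch full-text search

   **Optimization suggestions**:
   - Preload search results for popular queries
   - Optimize ES queries (fewer aggregation buckets)
   - Add a limit: at most 100 results per page

   **Expected benefit**: response time drops to <400ms

3. **`/api/v1/dashboard/metrics`** - Minor 🟢
   - Average response time: 320ms
   - Request volume: 120k/day
   - Bottleneck: multiple Prometheus queries

   **Optimization suggestions**:
   - Batch Prometheus queries (several metrics per request)
   - Cache dashboard data on the frontend (30 seconds)

   **Expected benefit**: response time drops to <150ms

### Resource usage
- CPU: 45% average, 78% peak (healthy)
- Memory: 60% average, 85% peak (close to the limit) ⚠️
- Database connection pool: 65% average, 92% peak (consider enlarging)

### Priorities
1. **Immediately**: optimize `/api/v1/reports/export` (largest impact)
2. **This week**: increase memory (4GB → 8GB)
3. **Next iteration**: optimize the search and dashboard endpoints

### Monitoring improvements
- Add a P95 latency alert (only P99 is monitored today)
- Configure dedicated alerts for slow endpoints (response time > 1s)
- Add monitoring of database slow query logs

View detailed trace data → [Grafana Tempo](https://grafana.example.com/explore?...)

Value:

Performance problems found proactively rather than after an alert
Full-stack analysis combining metrics, logs, and traces
Specific, actionable optimization proposals
Quantified expected benefits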

Technical Architecture

System Components

┌─────────────────────────────────────────────────────────┐
│                   AI Assistant (Claude)                  │
│                  (Natural Language Interface)            │
└────────────────────────┬────────────────────────────────┘
                         │ MCP Protocol
                         │ (stdio / SSE / HTTP)
┌────────────────────────▼────────────────────────────────┐
│              Grafana MCP Server (Go)                     │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Tool Router & Permission Validator              │    │
│  └───┬──────────────────────────────────────────┬──┘    │
│      │                                           │       │
│  ┌───▼──────────┐  ┌──────────────┐  ┌─────────▼───┐   │
│  │  Dashboard   │  │  Prometheus  │  │    Loki     │   │
│  │   Handler    │  │   Handler    │  │   Handler   │   │
│  └──────────────┘  └──────────────┘  └─────────────┘   │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐   │
│  │   Alerting   │  │   OnCall     │  │  Incidents  │   │
│  │   Handler    │  │   Handler    │  │   Handler   │   │
│  └──────────────┘  └──────────────┘  └─────────────┘   │
│  ┌──────────────┐  ┌──────────────┐                     │
│  │     Sift     │  │    Admin     │                     │
│  │   Handler    │  │   Handler    │                     │
│  └──────────────┘  └──────────────┘                     │
└────────────────────────┬────────────────────────────────┘
                         │ Grafana HTTP API
                         │ (with Authentication)
┌────────────────────────▼────────────────────────────────┐
│                  Grafana Instance                        │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐             │
│  │Dashboard │  │Prometheus│  │   Loki    │             │
│  │  Engine  │  │ DataSrc  │  │  DataSrc  │             │
│  └──────────┘  └──────────┘  └───────────┘             │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐             │
│  │ Alerting │  │ OnCall   │  │ Incidents │             │
│  │  Engine  │  │  Plugin  │  │   Plugin  │             │
│  └──────────┘  └──────────┘  └───────────┘             │
│  ┌──────────┐                                            │
│  │   Sift   │                                            │
│  │   AI     │                                            │
│  └──────────┘                                            │
└─────────────────────────────────────────────────────────┘

Permission Model

Grafana MCP Server supports two permission models:

1. RBAC (fine-grained access control)

Used for: dashboards, datasources, alerting

Tool-to-permission mapping:
├── Dashboards
│   ├── search_dashboards        → dashboards:read
│   ├── get_dashboard_by_uid     → dashboards:read
│   ├── get_dashboard_summary    → dashboards:read
│   ├── get_dashboard_property   → dashboards:read
│   ├── get_dashboard_panel_queries → dashboards:read
│   └── update_dashboard         → dashboards:create + dashboards:write
│
├── Datasources
│   ├── list_datasources         → datasources:read
│   ├── get_datasource_by_uid    → datasources:read
│   ├── get_datasource_by_name   → datasources:read
│   ├── query_prometheus         → datasources:query
│   └── query_loki_logs          → datasources:query
│
└── Alerting
    ├── list_alert_rules         → alert.rules:read
    ├── get_alert_rule_by_uid    → alert.rules:read
    └── list_contact_points      → alert.notifications:read
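
The mapping above can be turned into a small lookup that answers "which RBAC actions does a service account need for the tools I plan to use?" The sketch below copies the mapping from the tree; the function name is hypothetical and the table would need updating if the server's requirements change.

```python
# Tool -> required RBAC actions, copied from the mapping above.
TOOL_PERMISSIONS = {
    "search_dashboards": {"dashboards:read"},
    "get_dashboard_by_uid": {"dashboards:read"},
    "get_dashboard_summary": {"dashboards:read"},
    "get_dashboard_property": {"dashboards:read"},
    "get_dashboard_panel_queries": {"dashboards:read"},
    "update_dashboard": {"dashboards:create", "dashboards:write"},
    "list_datasources": {"datasources:read"},
    "get_datasource_by_uid": {"datasources:read"},
    "get_datasource_by_name": {"datasources:read"},
    "query_prometheus": {"datasources:query"},
    "query_loki_logs": {"datasources:query"},
    "list_alert_rules": {"alert.rules:read"},
    "get_alert_rule_by_uid": {"alert.rules:read"},
    "list_contact_points": {"alert.notifications:read"},
}


def required_actions(tools):
    """Return the sorted, de-duplicated actions needed for the given tools."""
    actions = set()
    for tool in tools:
        if tool not in TOOL_PERMISSIONS:
            raise KeyError(f"no RBAC mapping known for tool: {tool}")
        actions |= TOOL_PERMISSIONS[tool]
    return sorted(actions)


read_only = required_actions(["search_dashboards", "get_dashboard_summary", "query_prometheus"])
print(read_only)
```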

2. Basic roles (Grafana built-in roles)

Used for: Incidents, Sift, OnCall

Role-to-tool mapping:
├── Viewer (read-only)
│   ├── list_incidents
│   ├── get_incident
│   ├── list_sift_investigations
│   ├── get_sift_investigation
│   ├── find_error_pattern_logs
│   └── find_slow_requests
│
├── Editor (read-write)
│   ├── Everything Viewer can do
│   ├── create_incident
│   └── add_activity_to_incident
│
└── Admin (administrator)
    ├── Everything Editor can do
    ├── list_teams
    └── list_users_by_org
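
Because each role inherits from the one below it, the check reduces to comparing ranks. A minimal sketch, with the minimum role per tool copied from the tree above and a hypothetical function name:

```python
ROLE_RANK = {"Viewer": 1, "Editor": 2, "Admin": 3}

# Minimum basic role per tool, copied from the tree above.
MINIMUM_ROLE = {
    "list_incidents": "Viewer",
    "get_incident": "Viewer",
    "list_sift_investigations": "Viewer",
    "get_sift_investigation": "Viewer",
    "find_error_pattern_logs": "Viewer",
    "find_slow_requests": "Viewer",
    "create_incident": "Editor",
    "add_activity_to_incident": "Editor",
    "list_teams": "Admin",
    "list_users_by_org": "Admin",
}


def role_allows(role, tool):
    """True if `role` is at least the minimum role listed for `tool`."""
    return ROLE_RANK[role] >= ROLE_RANK[MINIMUM_ROLE[tool]]


print(role_allows("Viewer", "create_incident"))
print(role_allows("Editor", "get_incident"))
```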

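Authentication

Service account token (recommended for production)

# 1. Create a service account in Grafana
Configuration > Service Accounts > Add service account
Name: mcp-server
Role: Editor

# 2. Generate a token
Generate token → copy the token (format: glsa_xxxxx)

# 3. Set the environment variable
export GRAFANA_SERVICE_ACCOUNT_TOKEN="glsa_xxxxxxxxxxxxxxxxxxxxx"

Advantages:

Can be issued without an expiry (valid until manually revoked)
Supports fine-grained permission assignment
Auditable (a dedicated service account makes operations easy to trace)
Consistent with enterprise security best practices

Username/password authentication (development only)

export GRAFANA_USERNAME="admin"
export GRAFANA_PASSWORD="your-password"

Disadvantages:

Passwords may expire
Hard to trace where operations came from
Does not meet enterprise security standards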

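Transport Protocol Comparison

Feature	stdio	SSE	streamable-HTTP
Typical use	Claude Desktop, CLI	Web apps, browsers	Enterprise integration, multiple clients
Network requirement	None (local)	HTTP/1.1	HTTP/1.1 or HTTP/2
Concurrency	Single session	Multiple clients	Multiple clients + load balancing
Responsiveness	Real-time	Real-time streaming	Real-time streaming
Deployment complexity	Low	Medium	Medium
Load balancing	Not supported	Supported	Supported
Firewall traversal	N/A	Port must be opened	Port must be opened

Comparison with Other Approaches

Grafana MCP vs. the Prometheus/Loki APIs directly

Feature	Grafana MCP Server	Direct Prometheus/Loki API
Learning curve	Minimal (natural language)	Requires learning PromQL/LogQL
Unified multi-datasource interface	✅ Supported (unified dashboard view)	❌ Each API called separately
Alerting integration	✅ Native Grafana Alerting support	❌ Configured separately
Incident management	✅ Grafana Incident integration	❌ None
OnCall scheduling	✅ Fully supported	❌ None
Sift AI analysis	✅ Automated pattern detection	❌ Manual analysis
Permission management	✅ Grafana RBAC	⚠️ Basic authentication
Dashboard management	✅ Full CRUD	❌ None
Best suited to	AI-driven operations, unified observability	Single-datasource queries

Grafana MCP vs. Grafana UI

Feature	Grafana MCP Server	Grafana Web UI
Interaction	Natural language (AI conversation)	Graphical interface (point and click)
Learning cost	Minimal (no training needed)	Moderate (requires familiarity with the UI)
Automation	✅ Strong (AI analyzes automatically)	⚠️ Weak (manual operation)
Root cause analysis	✅ AI-assisted (Sift integration)	⚠️ Manual investigation
Bulk operations	✅ Supported (AI can process in bulk)	⚠️ Limited
Multi-step tasks	✅ Orchestrated by the AI	⚠️ Each step run manually
Mobile access	✅ Via chat (no UI needed)	⚠️ Mediocre mobile experience
Best suited to	Quick queries, troubleshooting, automation	In-depth analysis, configuration management

Conclusion: Grafana MCP Server complements the Grafana UI; the two work best together.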

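Best Practices

1. Minimize Service Account Permissions

# Not recommended: giving the MCP Server the Admin role
❌ Role: Admin

# Recommended: principle of least privilege
✅ Create a dedicated service account "mcp-server"
   Assign these permissions:
   - dashboards:read
   - datasources:read
   - datasources:query
   - alert.rules:read
   - alert.notifications:read

   If write access is needed:
   - dashboards:create
   - dashboards:write
   - incidents:write

Configuration example (Grafana RBAC):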

{
  "name": "MCP Server Read-Only",
  "permissions": [
    {"action": "dashboards:read", "scope": "dashboards:*"},
    {"action": "datasources:read", "scope": "datasources:*"},
    {"action": "datasources:query", "scope": "datasources:*"},
    {"action": "alert.rules:read", "scope": "*"},
    {"action": "alert.notifications:read", "scope": "*"}
  ]
}

2. Isolate Environments

# Production
export GRAFANA_URL="https://grafana-prod.example.com"
export GRAFANA_SERVICE_ACCOUNT_TOKEN="${PROD_TOKEN}"

# Staging
export GRAFANA_URL="https://grafana-staging.example.com"
export GRAFANA_SERVICE_ACCOUNT_TOKEN="${STAGING_TOKEN}"

# Claude Desktop configuration with multiple instances
{
  "mcpServers": {
    "grafana-prod": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://grafana-prod.example.com",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${PROD_TOKEN}"
      }
    },
    "grafana-staging": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://grafana-staging.example.com",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${STAGING_TOKEN}"
      }
    }
  }
}

3. Performance Optimization

Prefer the summary tool

# ❌ Not recommended: fetching the full dashboard directly (many tokens)
get_dashboard_by_uid(uid="large-dashboard")

# ✅ Recommended: fetch the overview first
get_dashboard_summary(uid="large-dashboard")
# Then extract precisely what is needed
get_dashboard_property(uid="large-dashboard", jsonpath="$.panels[0].targets")

Limit the query time range

# ❌ Not recommended: querying 30 days of data (slow response, large payload)
query_prometheus({
  query: "rate(http_requests_total[30d])",
  start: "now-30d"
})

# ✅ Recommended: query 24 hours with aggregation
query_prometheus({
  query: "avg_over_time(rate(http_requests_total[5m])[24h:1h])",
  start: "now-24h",
  step: "1h"
})
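
One way to apply this advice consistently is to derive `step` from the time range so that each series returns a bounded number of points. The sketch below is an illustration only: the 200-point budget and 15-second floor are arbitrary assumptions, and the function name is hypothetical.

```python
import math


def choose_step_seconds(range_seconds, max_points=200, min_step=15):
    """Pick a step (in seconds) so a range query returns at most `max_points`.

    A range query evaluated at start, start+step, ... yields
    floor(range / step) + 1 points, hence the division by (max_points - 1).
    """
    if range_seconds < 0:
        raise ValueError("range_seconds must be non-negative")
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    step = math.ceil(range_seconds / (max_points - 1))
    return max(step, min_step)


for label, seconds in [("1h", 3600), ("24h", 86400), ("30d", 30 * 86400)]:
    step = choose_step_seconds(seconds)
    print(label, step, seconds // step + 1)
```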

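Limit the number of log lines

# ❌ Not recommended: broad query with no explicit limit or filter
query_loki_logs({
  query: '{app="nginx"}',
  start: "now-24h"
})

# ✅ Recommended: explicit limit plus precise filtering
query_loki_logs({
  query: '{app="nginx"} |= "error" | json | status >= 500',
  start: "now-1h",
  limit: 100
})

4. Security Hardening

Network isolation

# Kubernetes NetworkPolicy example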
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: grafana-mcp-egress
spec:
  podSelector:
    matchLabels:
      app: grafana-mcp
  policyTypes:
    - Egress
  egress:
    # Allow access to the Grafana instance only
    - to:
      - podSelector:
          matchLabels:
            app: grafana
      ports:
        - protocol: TCP
          port: 3000
    # Allow DNS resolution
    - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Secret management

# ❌ Not recommended: plaintext token in a configuration file
{
  "env": {
    "GRAFANA_SERVICE_ACCOUNT_TOKEN": "glsa_xxxxx"
  }
}

# ✅ Recommended: use a Secret (Kubernetes)
kubectl create secret generic grafana-mcp-token \
  --from-literal=token=glsa_xxxxxxxxxxxxxxxxxxxxx

# Reference it in the Deployment
env:
  - name: GRAFANA_SERVICE_ACCOUNT_TOKEN
    valueFrom:
      secretKeyRef:
        name: grafana-mcp-token
        key: token

5. Monitor the MCP Server Itself

# Prometheus ServiceMonitor (Kubernetes)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: grafana-mcp
spec:
  selector:
    matchLabels:
      app: grafana-mcp
  endpoints:
    - port: metrics
      interval: 30s

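Key metrics:

mcp_tool_calls_total{tool="query_prometheus"} - Number of tool calls
mcp_tool_duration_seconds{tool="query_prometheus"} - Tool execution time
mcp_errors_total{tool="query_prometheus"} - Number of errors
mcp_grafana_api_latency_seconds - Grafana API latency

Alert rules:

groups:
  - name: mcp-server
    rules:
      - alert: MCPServerHighErrorRate
        # Ratio of failed calls to all calls, so the 0.1 threshold means 10%
        expr: sum(rate(mcp_errors_total[5m])) / sum(rate(mcp_tool_calls_total[5m])) > 0.1
        annotations:
          summary: "MCP Server error rate > 10%"

      - alert: MCPServerSlowQueries
        # histogram_quantile needs the per-bucket rates of the histogram
        expr: histogram_quantile(0.95, sum by (le) (rate(mcp_tool_duration_seconds_bucket[5m]))) > 10
        annotations:
          summary: "MCP Server P95 query latency > 10s"

FAQ

Q1: How do I configure Grafana Cloud?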

{
  "mcpServers": {
    "grafana-cloud": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://your-stack.grafana.net",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "glsa_xxxxxxxxxxxxxxxxxxxxx"
      }
    }
  }
}

Notes:

URL format: https://<your-stack-name>.grafana.net (do not append /api)
Token creation: Grafana Cloud Console → Configuration → Service Accounts
Make sure the token has the required permissions (a token created with default settings may not have enough)
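
Since a stray `/api` suffix or trailing slash is an easy mistake, a small pre-flight check can normalize the value before it goes into `GRAFANA_URL`. This is a sketch using only the Python standard library; `normalize_grafana_url` is a hypothetical helper, not something the MCP server provides.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_grafana_url(raw):
    """Strip a trailing slash and a trailing /api segment from a base URL."""
    parts = urlsplit(raw.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {raw!r}")
    path = parts.path.rstrip("/")
    if path.endswith("/api"):
        path = path[: -len("/api")]
    # Query string and fragment are dropped on purpose: a base URL has neither.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Sub-path deployments such as `https://grafana.example.com/grafana/` keep their prefix; only the trailing slash is removed.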

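Q2: Which Grafana versions are supported?

Minimum versions:

Grafana ≥ 9.0 (core functionality)
Grafana ≥ 10.0 (full Sift support)
Grafana ≥ 10.3 (OnCall API v2)

Recommended version: Grafana 11.x (latest stable)

Unsupported features:

Grafana < 9.0: no fine-grained RBAC permissions
Grafana < 10.0: Sift tools are unavailable
Grafana < 9.5: OnCall functionality is limited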

Q3: How do I debug the MCP Server?

Enable debug logging

# Via environment variable
export LOG_LEVEL=debug

# Or specify it at startup
mcp-grafana -t stdio --log-level debug

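View detailed errors

# Docker container logs
docker logs -f <container-id>

# Kubernetes Pod logs
kubectl logs -f deployment/grafana-mcp -n observability

# Local run (capture stderr in a file; in stdio mode stdout carries the MCP protocol and must not be redirected)
mcp-grafana -t stdio 2> mcp.log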

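Test the Grafana connection

# Check that the URL is reachable (/api/health does not require authentication, so this alone does not validate the token)
curl -H "Authorization: Bearer ${GRAFANA_SERVICE_ACCOUNT_TOKEN}" \
  "${GRAFANA_URL}/api/health"

# Expected response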
{
  "commit": "abc123",
  "database": "ok",
  "version": "11.0.0"
}
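
The same check can be scripted, for instance as a readiness probe in a deployment pipeline. The sketch below uses only the Python standard library and assumes the response shape shown above; `parse_health` and `check_health` are illustrative names.

```python
import json
import urllib.request


def parse_health(body):
    """Return (database_ok, version) from an /api/health JSON body."""
    data = json.loads(body)
    return data.get("database") == "ok", data.get("version", "unknown")


def check_health(base_url, token, timeout=5.0):
    """Call /api/health on a Grafana instance and parse the response."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/health",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_health(resp.read().decode("utf-8"))
```

A call such as `check_health(os.environ["GRAFANA_URL"], token)` raises `urllib.error.URLError` when the host is unreachable, which separates network problems from an unhealthy database.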

Q4: How do I disable certain tool categories?

# Disable the OnCall and Sift tools
export DISABLE_TOOLS="oncall,sift"

# Or in the Claude Desktop configuration
{
  "env": {
    "GRAFANA_URL": "...",
    "GRAFANA_SERVICE_ACCOUNT_TOKEN": "...",
    "DISABLE_TOOLS": "oncall,sift,incidents"
  }
}

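Categories that can be disabled:

dashboards - dashboard management tools
datasources - data source management tools
prometheus - Prometheus query tools
loki - Loki log query tools
alerting - alert management tools
oncall - OnCall scheduling tools
incidents - incident management tools
sift - Sift investigation tools
admin - user and team management tools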
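
A typo in a comma-separated list like this can be hard to spot. The sketch below validates such a value against the category names listed above before it is written into a config. It is a hypothetical pre-flight helper and makes no claim about how the server itself parses the setting.

```python
KNOWN_CATEGORIES = {
    "dashboards", "datasources", "prometheus", "loki", "alerting",
    "oncall", "incidents", "sift", "admin",
}


def parse_disabled(value):
    """Split a comma-separated category list and reject unknown names."""
    names = {part.strip().lower() for part in value.split(",") if part.strip()}
    unknown = sorted(names - KNOWN_CATEGORIES)
    if unknown:
        raise ValueError("unknown tool categories: " + ", ".join(unknown))
    return sorted(names)
```

For example, `parse_disabled("oncall, sift,incidents")` returns the three names sorted alphabetically, while `parse_disabled("oncal")` raises `ValueError`.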

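Q5: How does it perform? Is the API rate limited?

MCP Server performance:

Lightweight Go implementation, memory footprint <50MB
Concurrency: >1000 req/s (depends on the Grafana instance)
Startup time: <1s

Grafana API rate limits:

Grafana Cloud:
- Default: 100 req/10s (can be raised by contacting support)
- Recommendation: cache query results to avoid repeated requests
Self-hosted instances:
- No rate limit by default (bounded by hardware)
- Recommendation: configure rate limiting: grafana.ini → [quota] → api_request_rate_limit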
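
A client that wants to stay under a limit such as 100 requests per 10 seconds can throttle itself before calling Grafana. The sketch below is a minimal sliding-window limiter; the class name is illustrative, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests=100, window_seconds=10.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()  # timestamps of accepted requests, oldest first

    def try_acquire(self):
        """Return True and record the call if it fits in the current window."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_requests:
            return False
        self._stamps.append(now)
        return True
```

When `try_acquire()` returns `False`, the caller can either wait briefly and retry or serve a cached result, which matches the caching recommendation above.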

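Optimization tips:

Use get_dashboard_summary instead of get_dashboard_by_uid (saves about 80% of bandwidth)
Limit the time range of Prometheus/Loki queries (<24h)
Choose a sensible step parameter (not too fine-grained)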
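
One way to pick a step is to cap the number of points a range query returns. The sketch below derives the step from the query range; the defaults (`max_points=1000`, `min_step=15`) are illustrative assumptions and should be tuned to the scrape interval and the panel width in use.

```python
import math


def choose_step(range_seconds, max_points=1000, min_step=15):
    """Return a step in seconds that yields at most max_points samples per series."""
    if range_seconds <= 0:
        raise ValueError("range_seconds must be positive")
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    # Round up so the point count never exceeds the cap, and never go
    # finer than min_step, since sub-scrape-interval steps add no detail.
    return max(min_step, math.ceil(range_seconds / max_points))
```

With these defaults a 1-hour query uses the 15-second floor, while a 24-hour query moves to an 87-second step.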

Q6: Is multi-tenancy supported?

Supported approaches:

Option 1: multiple MCP Server instances (recommended)

{
  "mcpServers": {
    "grafana-team-a": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${TEAM_A_TOKEN}"
      }
    },
    "grafana-team-b": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${TEAM_B_TOKEN}"
      }
    }
  }
}

Advantages: full isolation and clear permission boundaries
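
With more than a handful of teams, writing one entry per team by hand becomes error-prone. The sketch below generates the same `mcpServers` structure from a mapping of team names to tokens. The function name and the `grafana-<team>` naming scheme are assumptions for illustration only.

```python
import json


def build_mcp_config(url, team_tokens, command="mcp-grafana"):
    """Build an mcpServers config with one server entry per team."""
    servers = {}
    for team, token in sorted(team_tokens.items()):
        servers[f"grafana-{team}"] = {
            "command": command,
            "env": {
                "GRAFANA_URL": url,
                "GRAFANA_SERVICE_ACCOUNT_TOKEN": token,
            },
        }
    return {"mcpServers": servers}


if __name__ == "__main__":
    # Placeholder tokens; in practice read them from a secret store.
    config = build_mcp_config(
        "https://grafana.example.com",
        {"team-a": "glsa_team_a_placeholder", "team-b": "glsa_team_b_placeholder"},
    )
    print(json.dumps(config, indent=2))
```

Because the generated file contains tokens, it should be treated as a secret and kept out of version control.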

Option 2: Grafana organizations

# Configure a token for each organization
Team A Token → Org ID: 1
Team B Token → Org ID: 2

# Grafana resolves the organization context from the token automatically

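Q7: How do I handle large dashboards (>1MB)?

Problem: the full JSON of a large dashboard can consume a large number of LLM tokens.

Solutions:

Prefer the summary

# 1. Fetch the overview first (<1KB)
get_dashboard_summary(uid="large-dashboard")

# 2. Extract only the parts you need (<10KB)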
get_dashboard_property(uid="large-dashboard", jsonpath="$.panels[?(@.title=='CPU Usage')]")

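Use panel queries

# Extract only the query expressions, without the visualization settings
get_dashboard_panel_queries(uid="large-dashboard")

Process in batches

# The AI can extract panels in batches automatically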
for panel_id in [1, 2, 3, ...]:
  get_dashboard_property(
    uid="large-dashboard",
    jsonpath=f"$.panels[?(@.id=={panel_id})]"
  )
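
The loop above is pseudocode for a sequence of tool calls. As a runnable illustration of the batching idea, the sketch below only builds the JSONPath expressions and groups them into batches, so that each batch can be requested and summarized before moving on. The function name and the batch size are assumptions.

```python
def panel_jsonpath_batches(panel_ids, batch_size=5):
    """Group per-panel JSONPath expressions into batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(panel_ids)
    return [
        [f"$.panels[?(@.id=={pid})]" for pid in ids[start:start + batch_size]]
        for start in range(0, len(ids), batch_size)
    ]
```

Each expression in a batch would be passed as the `jsonpath` argument of `get_dashboard_property`, which keeps any single response small.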

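Rating details

Dimension	Score	Notes
Functionality	5.0/5.0	Covers Grafana's full-stack observability scenarios with 45+ specialized tools
Documentation quality	4.9/5.0	Thorough official documentation with many examples, though some advanced scenarios lack coverage
Community activity	4.9/5.0	Maintained by Grafana Labs, 1700+ stars, active community
Maintenance status	5.0/5.0	Continuously updated, tracking Grafana releases
Code quality	4.8/5.0	Go implementation, strong performance, good test coverage
Enterprise readiness	5.0/5.0	RBAC permissions, multiple transport modes, Grafana Cloud support
Overall score	4.9/5.0	The most comprehensive Grafana AI integration available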

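Summary

Grafana MCP Server is the AI layer for the Grafana observability platform. Through 45+ specialized tools, it exposes monitoring, logs, alerting, incident management, and OnCall scheduling to AI models. It is more than a query tool: it is key infrastructure for AI-driven operations (AIOps).

Recommendation: ⭐⭐⭐⭐⭐ (5/5)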

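A good fit if you:

✅ Use Grafana as your observability platform
✅ Need AI assistance with troubleshooting and performance analysis
✅ Want to query metrics and logs in natural language
✅ Manage complex alert rules and OnCall schedules
✅ Need automated dashboard management
✅ Want AI to detect anomalous patterns automatically (Sift)
✅ Are building an AI operations assistant (AIOps)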

Not a good fit if you:

❌ Do not use Grafana (consider a Prometheus or Loki MCP server instead)
❌ Only need simple metric queries (use the Prometheus API directly)
❌ Run Grafana < 9.0 (limited functionality)

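Core advantages:

Official implementation - maintained by Grafana Labs
Full-stack integration - covers the full range of observability scenarios (metrics, logs, traces, alerts, incidents)
AI-enhanced - Sift investigations identify anomalous patterns automatically
Enterprise-grade - RBAC permissions, multi-tenancy, Grafana Cloud support
Low learning curve - natural-language interaction reduces the need to write PromQL/LogQL by hand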

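Practical value:

Incident response time - from 30 minutes down to 3 minutes (a 10x improvement)
Learning cost - new team members can use it without training
Operational efficiency - AI automates repetitive tasks and frees up engineers' time
Anomaly detection - Sift surfaces low-frequency problems that traditional alerts miss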

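Next steps:

Create a service account and token in Grafana
Configure Claude Desktop (takes under 5 minutes)
Try a natural-language query: "How has CPU usage looked over the past hour?"
Try AI-assisted troubleshooting
Explore Sift investigations and automated dashboard management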

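Getting started with best practices:

Test with read-only permissions first (dashboards:read + datasources:query)
Validate functionality in a test environment
Open up write permissions gradually (dashboards:write, incidents:write)
Set up monitoring metrics and watch MCP Server performance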