178 changes: 178 additions & 0 deletions skills/infrastructure/bmc-analyze/README.md
@@ -0,0 +1,178 @@
# OpenUBMC Forum Fault Diagnosis Tools

This toolset automates fault diagnosis of OpenUBMC/iBMC log packages. It can be used either as a Claude Code Skill or as a standalone Python script.

## Files

| File | Purpose |
|------|---------|
| `bmc_log_analyzer.py` | Standalone Python parsing script; accepts a `.tar.gz` archive or a directory |
| `bmc_log_mcp_server.py` | MCP server (6 tools, stdio mode) |
| `my_skill/bmc-analyze.md` | Claude Code Skill definition file |

---

## Using bmc-analyze

### Overview

`/bmc-analyze` is a Claude Code Skill that analyzes an OpenUBMC/iBMC `dump_info` log package and generates a structured fault diagnosis report.

### Installation

Copy `my_skill/bmc-analyze.md` into the Claude Code skills directory:

```bash
cp my_skill/bmc-analyze.md ~/.claude/skills/bmc-analyze.md
```

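### Usage

```
/bmc-analyze [log-package-path]
```

**Supported input formats:**
- `.tar.gz` archive: `/bmc-analyze dump_info.tar.gz`
- Extracted directory: `/bmc-analyze /path/to/dump_info/`
- No path given: the current directory is searched automatically for `*.tar.gz`, `dump_info*`, and `null_*`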
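
The automatic search used when no path is given can be sketched roughly as follows. This is a minimal illustration rather than the code in `bmc_log_analyzer.py`: the function name `find_log_packages` and the "archives first" ordering are assumptions made for this example.

```python
from pathlib import Path


def find_log_packages(base_dir="."):
    """Return candidate log packages under base_dir, archives first.

    Hypothetical helper: the name and the ordering are assumptions for
    illustration, not taken from bmc_log_analyzer.py.
    """
    base = Path(base_dir)
    # Archives: any *.tar.gz file directly inside base_dir (not recursive).
    archives = sorted(p for p in base.glob("*.tar.gz") if p.is_file())
    # Extracted dumps: directories named dump_info* or null_*.
    directories = sorted(
        p
        for pattern in ("dump_info*", "null_*")
        for p in base.glob(pattern)
        if p.is_dir()
    )
    return archives + directories
```

Filtering with `is_dir()` keeps `dump_info.tar.gz` from being reported twice, since that file name also matches the `dump_info*` pattern.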

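### Report Structure

After it runs, the Skill outputs a diagnosis report with the following sections:

1. **Basic information**: BMC type (iBMC/openUBMC), firmware version, product model
2. **Network configuration**: IP addresses, network interface states
3. **Serial port/Telnet status** (⚠️ checked first): whether the telnetd process is running and whether the systemd service is active
4. **Error log summary**: total ERROR count, distribution across the top modules
5. **Command history**: administrator operation records
6. **Diagnosis recommendations**: actionable steps for the problems found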

### Log File Quick Reference

| What to find | File path |
|---------|---------|
| BMC version | `RTOSDump/versioninfo/app_revision.txt` |
| IP address | `RTOSDump/networkinfo/ifconfig_info` |
| Serial port/Telnet configuration | `AppDump/bmc_network/network_info.txt` |
| Application error logs | `LogDump/app.log` |
| Operation audit | `LogDump/operation.log` |
| Kernel errors | `RTOSDump/driver_info/dmesg_info` |
| Service status | `RTOSDump/sysinfo/journalctl.log` |
| Process list | `RTOSDump/sysinfo/ps_info` |
| Administrator history | `RTOSDump/other_info/command_records/*/ash_history` |

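### Serial Port Troubleshooting (forum post #4257 type issues)

When a user reports "cannot connect to the BMC serial port via Telnet", check in this order:

1. `RTOSDump/sysinfo/ps_info`: is the telnetd process in the list?
2. `RTOSDump/sysinfo/journalctl.log*`: was telnetd.service deactivated?
3. `LogDump/operation.log`: look for "SOL session" / "Connect SOL failed"
4. `AppDump/bmc_network/network_info.txt`: firewall rules for port 23

If the telnetd service is stopped, re-enable serial forwarding via the iBMC/openUBMC web interface → System Management → Serial Port Config.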

---

## Example: Analyzing a dump_info Log Package

### Command

```
/bmc-analyze dump_info
```

### Sample Report

The following is the actual analysis result for the `dump_info/` directory (an iBMC S920X20 device):

---

#### 1. Basic Information

| Item | Value |
|------|-----|
| BMC type | iBMC |
| Active version | 25.12.02.03 (Build 001) |
| Build time | 2026-03-09 13:48:26 |
| Product model | S920X20 |
| BIOS version | (U75)000 |

#### 2. Network Configuration

| Interface | IP address | State |
|------|---------|------|
| eth0 | 10.0.2.15 | UP |
| lo | 127.0.0.1 | UP |

#### 3. Serial Port/Telnet Status

| Item | Status |
|------|------|
| telnetd process | Running |
| Process details | `/data/home/busybox_x telnetd` (PID 2150) |

**Conclusion:** telnetd is running normally, so the BMC can be reached via telnet.

#### 4. Error Log Summary

| Log type | Count |
|---------|------|
| app.log ERROR | 136 |
| Alarm logs | 1061 |
| dmesg anomalies | 0 |
| maintenance ERROR | 10 |

**Error distribution across the top modules:**

| Module | Errors |
|------|--------|
| snmp | 43 |
| fault_diagnosis | 22 |
| interface | 18 |
| firmware_mgmt | 12 |
| power_mgmt | 11 |
| event | 11 |
| pcie_device | 9 |

#### 5. Command History

11 entries in total, including a telnet attempt:

```
telnet localhost
```

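#### 6. Diagnosis Recommendations

**Main issue:** app.log contains 136 ERROR entries, with the snmp module contributing the most (43).

Troubleshooting steps:
1. Inspect `LogDump/app.log`, focusing on errors from the `snmp` module
2. Find the earliest ERROR timestamp to pin down when the problem started
3. To dig further into snmp, check the snmp service status in `RTOSDump/sysinfo/journalctl.log`

**No anomalies on the serial port/Telnet side**: the telnetd process is running normally.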

---

## Running the Underlying Script Directly

The Python script can also be run directly, without Claude Code:

```bash
python3 bmc_log_analyzer.py <log-package-path>
```

**Examples:**

```bash
# Analyze a directory
python3 bmc_log_analyzer.py ./dump_info

# Analyze an archive
python3 bmc_log_analyzer.py ./dump_info.tar.gz
```

The script automatically handles both the `dump_info/dump_info/` (double-nested) and `null_xxx/dump_info/` (single-nested) directory layouts.
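
One way such nesting could be resolved is sketched below. It is an assumption-laden illustration, not the script's actual logic: the helper name `resolve_dump_root`, the `max_depth` parameter, and the idea of recognising the dump root by the `RTOSDump`, `LogDump`, or `AppDump` folders (taken from the paths in the quick reference table) are all hypothetical.

```python
from pathlib import Path

# ASSUMPTION: a dump root is recognised by one of the top-level folders that
# appear in the log file quick reference table.
MARKER_DIRS = ("RTOSDump", "LogDump", "AppDump")


def resolve_dump_root(path, max_depth=2):
    """Descend through nested `dump_info` folders until a marker folder is found.

    Hypothetical helper; returns the resolved Path, or None if nothing matches.
    """
    current = Path(path)
    for _ in range(max_depth + 1):
        if any((current / name).is_dir() for name in MARKER_DIRS):
            return current
        nested = current / "dump_info"
        if not nested.is_dir():
            break
        current = nested
    return None
```

Calling `resolve_dump_root("null_xxx")` on a single-nested package would return `null_xxx/dump_info`, while a directory that contains no marker folder yields `None`, so the caller can report an unsupported layout.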
79 changes: 79 additions & 0 deletions skills/infrastructure/bmc-analyze/SKILL.md
@@ -0,0 +1,79 @@
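# bmc-analyze: OpenUBMC/iBMC Log Package Fault Diagnosis

Analyzes an OpenUBMC/iBMC dump_info log package and generates a structured fault diagnosis report. Usage: `/bmc-analyze [log-package-path]`; accepts a .tar.gz file or a directory.

# Example:
# /bmc-analyze dump_info.tar.gz
# /bmc-analyze /path/to/dump_info/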

Analyze the BMC log package at the path provided by the user (or use the current directory if no path given).

Follow these steps:

## Step 1: Find the log package

If the user provided a path, use it directly. Otherwise, look for:
- Files matching `*.tar.gz` in the current directory
- Directories named `dump_info*` or `null_*` in the current directory

## Step 2: Run the analyzer script

Use the Bash tool to run:
```bash
python3 /home/zhongjun/claude/zhongjun2/openUBMC_forum/bmc_log_analyzer.py <PATH> 2>&1
```

If no analyzer script is available, perform the analysis manually:

### Manual Analysis Checklist

**A. Version Info** - Read `RTOSDump/versioninfo/app_revision.txt`
- Look for: `Active iBMC Version:` or `Active openUBMC Version:`
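
As a rough sketch of step A, the version line could be extracted as below. Only the two prefixes quoted above come from this document; the function name `parse_active_version` and the assumption that the version value follows the colon on the same line are illustrative.

```python
import re

# Matches the two prefixes named in the checklist. ASSUMPTION: the version
# string follows the colon on the same line.
VERSION_RE = re.compile(r"Active\s+(iBMC|openUBMC)\s+Version:\s*(.+)")


def parse_active_version(text):
    """Return (bmc_type, version) from app_revision.txt content, or None."""
    for line in text.splitlines():
        match = VERSION_RE.search(line)
        if match:
            return match.group(1), match.group(2).strip()
    return None
```

The captured prefix also answers the "BMC type" question in the report, so a single pass over the file covers both fields.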

**B. Network Config** - Read `RTOSDump/networkinfo/ifconfig_info`
- Look for: IP addresses, UP/DOWN interface states

**C. Serial/Telnet Config** - Read these files:
- `RTOSDump/sysinfo/ps_info` or `top_info` → grep for "telnetd"
- `RTOSDump/sysinfo/journalctl.log*` → grep for "telnetd"
- `LogDump/operation.log` → grep for "serial" or "SOL"
- `AppDump/bmc_network/network_info.txt` → check port 23 rules
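
The greps in step C might be scripted along these lines. This is a sketch under assumptions: `grep_lines` and `check_telnet` are hypothetical helpers, the files are treated as plain text, and the port 23 rule check is left out because the layout of `network_info.txt` is not described here.

```python
from pathlib import Path


def grep_lines(path, needle):
    """Return the lines of a text file that contain needle; [] if the file is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        return []
    text = file_path.read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if needle in line]


def check_telnet(dump_root):
    """Collect telnetd evidence from the files listed in checklist item C."""
    root = Path(dump_root)
    sysinfo = root / "RTOSDump" / "sysinfo"
    process_hits = grep_lines(sysinfo / "ps_info", "telnetd")
    journal_hits = []
    if sysinfo.is_dir():
        # journalctl.log* may cover several files, so each match is scanned.
        # ASSUMPTION: they are plain text, not compressed.
        for journal in sorted(sysinfo.glob("journalctl.log*")):
            journal_hits.extend(grep_lines(journal, "telnetd"))
    # Case-sensitive match, so "SOL" does not also hit words like "console".
    sol_hits = grep_lines(root / "LogDump" / "operation.log", "SOL")
    return {
        "telnetd_running": bool(process_hits),
        "process_lines": process_hits,
        "journal_lines": journal_hits,
        "sol_lines": sol_hits,
    }
```

Returning the matching lines rather than only booleans lets the report quote the evidence, for example the exact process entry from `ps_info`.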

**D. Error Logs** - Read `LogDump/app.log`
- Count ERROR lines, identify top modules
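
Counting ERROR lines per module could look like the sketch below. The real `app.log` line layout is not shown in this document, so the default pattern assumes a purely hypothetical layout (`<date> <time> <module> ERROR: <message>`) and must be adapted to the actual logs; `summarize_errors` is likewise an illustrative name.

```python
import re
from collections import Counter

# ASSUMPTION: hypothetical layout "<date> <time> <module> ERROR: <message>".
# Replace this pattern with one that matches the real app.log format.
LINE_RE = re.compile(r"^\S+ \S+ (?P<module>\S+) ERROR\b")


def summarize_errors(lines, pattern=LINE_RE, top_n=5):
    """Count ERROR lines and return (total, [(module, count), ...])."""
    total = 0
    modules = Counter()
    for line in lines:
        if "ERROR" not in line:
            continue
        total += 1
        match = pattern.search(line)
        # Lines that do not fit the assumed layout are grouped as "unknown".
        modules[match.group("module") if match else "unknown"] += 1
    return total, modules.most_common(top_n)
```

Passing the pattern as a parameter keeps the counting logic independent of the log format, which is the part most likely to differ between iBMC and openUBMC packages.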

**E. Command History** - Read `RTOSDump/other_info/command_records/*/ash_history`

## Step 3: Present the report

Format the findings as a structured report with:
1. **Basic Info**: BMC type, version, product model
2. **Network Status**: IP addresses, interface states
3. **Serial/Telnet Status** (⚠️ PRIORITY): Is telnetd running? Is systemd service active?
4. **Error Summary**: ERROR count, top modules, critical errors
5. **Diagnosis & Recommendations**: Specific actionable steps

## Key Log File Map (tell the user which file to check for what)

| What to find | File path |
|---|---|
| BMC version | `RTOSDump/versioninfo/app_revision.txt` |
| IP address | `RTOSDump/networkinfo/ifconfig_info` |
| Serial/telnet config | `AppDump/bmc_network/network_info.txt` |
| App error logs | `LogDump/app.log` |
| Operation audit | `LogDump/operation.log` |
| Kernel errors | `RTOSDump/driver_info/dmesg_info` |
| Service status | `RTOSDump/sysinfo/journalctl.log` |
| Process list | `RTOSDump/sysinfo/ps_info` |
| Admin history | `RTOSDump/other_info/command_records/*/ash_history` |

## Serial Port Troubleshooting (for forum post #4257 type issues)

If user reports "cannot connect via telnet to BMC serial port":

1. Check `RTOSDump/sysinfo/ps_info` - is `telnetd` in process list?
2. Check `RTOSDump/sysinfo/journalctl.log*` - was telnetd.service deactivated?
3. Check `LogDump/operation.log` - look for "SOL session" failures
4. Check `AppDump/bmc_network/network_info.txt` - firewall rules for port 23

If telnetd service is stopped: recommend enabling serial forwarding via iBMC/openUBMC web interface → System Management → Serial Port Config.