Notes on Troubleshooting a Full Disk

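The disk on one of our cloud-hosted ORACLE test databases filled up again. It had filled up before, but I never analyzed it in detail. This time I wanted to find the root cause. My notes follow: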

After logging in to the server, I found the root filesystem (/) at 100% usage.

[root@xxx:/root]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        868M     0  868M   0% /dev
tmpfs           1.8G  1.1M  1.8G   1% /dev/shm
tmpfs           879M  596K  878M   1% /run
tmpfs           879M     0  879M   0% /sys/fs/cgroup
/dev/vda1        40G   38G  259M 100% /
tmpfs           176M     0  176M   0% /run/user/0

I used du -sh /*|grep G to see which directories take up the most space, then drilled down and found that the messages files under /var/log account for a large share.

[root@xxx:/var/log]# du -sh ./*|grep G
4.0G   ./journal
1.2G   ./messages
1.9G   ./messages-20250511
1.9G   ./messages-20250518
1.9G   ./messages-20250525
1.9G   ./messages-20250601
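
One caveat with grep G: it matches any line containing the letter G, file names included, and it hides entries measured in M that may still add up. A minimal sketch of an alternative that sorts by size instead, assuming GNU coreutils (for sort -h); the function name largest_entries is made up for illustration:

```shell
# List the N largest entries directly under a directory, biggest first.
# Hypothetical helper; assumes GNU du and sort (for -h human-readable sizes).
largest_entries() {
    dir=${1:-.}
    count=${2:-10}
    # -x stays on one filesystem, -s prints one total per entry, -h prints sizes like 1.9G
    du -xsh -- "$dir"/* 2>/dev/null | sort -rh | head -n "$count"
}
```

For example, largest_entries /var/log 5 would list the five largest files and directories under /var/log.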

A closer look at messages shows that it is being flooded with top output, and, worse, a new batch arrives every 3 seconds.

Jun  5 15:02:19 xxx top: top - 15:02:19 up 274 days, 23:50,  2 users,  load average: 0.02, 0.71, 2.51
Jun  5 15:02:22 xxx top: top - 15:02:22 up 274 days, 23:50,  2 users,  load average: 0.01, 0.70, 2.50
Jun  5 15:02:25 xxx top: top - 15:02:25 up 274 days, 23:50,  2 users,  load average: 0.01, 0.70, 2.50
Jun  5 15:02:28 xxx top: top - 15:02:28 up 274 days, 23:50,  2 users,  load average: 0.01, 0.68, 2.48
Jun  5 15:02:31 xxx top: top - 15:02:31 up 274 days, 23:50,  2 users,  load average: 0.09, 0.69, 2.48
Jun  5 15:02:34 xxx top: top - 15:02:34 up 274 days, 23:50,  2 users,  load average: 0.09, 0.69, 2.48
Jun  5 15:02:37 xxx top: top - 15:02:37 up 274 days, 23:50,  2 users,  load average: 0.08, 0.68, 2.46
Jun  5 15:02:40 xxx top: top - 15:02:40 up 274 days, 23:50,  2 users,  load average: 0.08, 0.68, 2.46
Jun  5 15:02:43 xxx top: top - 15:02:43 up 274 days, 23:50,  2 users,  load average: 0.08, 0.67, 2.45
Jun  5 15:02:46 xxx top: top - 15:02:46 up 274 days, 23:50,  2 users,  load average: 0.07, 0.66, 2.44
Jun  5 15:02:49 xxx top: top - 15:02:49 up 274 days, 23:50,  2 users,  load average: 0.07, 0.66, 2.44
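
When it is less obvious which program is flooding the log, counting lines per syslog tag can narrow it down quickly. A rough sketch, assuming the traditional syslog line layout shown above (month, day, time, host, tag); the function name top_log_writers is hypothetical:

```shell
# Count syslog lines per program tag (the 5th whitespace-separated field),
# busiest first. Reads the named files, or stdin when none are given.
top_log_writers() {
    awk '{
             tag = $5
             sub(/\[[0-9]+\]:$|:$/, "", tag)   # strip a trailing "[pid]:" or ":"
             count[tag]++
         }
         END { for (t in count) print count[t], t }' "$@" | sort -rn | head -n 5
}
```

On this host one would run top_log_writers /var/log/messages and expect top to dominate the list.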

Like clockwork, once every 3 seconds. That adds up to nearly 2G of data per week, or about 8G per month, and the disk is only 40G to begin with.
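
The back-of-the-envelope arithmetic can be written down as a small sketch. The figures are the ones from this post (about 1.9G per weekly rotated file, 259M free); the helper name days_until_full is made up, and integer division keeps it rough on purpose:

```shell
# Rough estimate of how many whole days a filesystem lasts at a steady growth rate.
# Args: free space in MiB, growth in MiB per day (integers). Hypothetical helper.
days_until_full() {
    free_mib=$1
    growth_mib_per_day=$2
    echo $(( free_mib / growth_mib_per_day ))
}

# About 1.9 GiB (roughly 1946 MiB) of messages per week, as seen in the du output.
weekly_mib=1946
daily_mib=$(( weekly_mib / 7 ))   # integer division
echo "growth: ~${daily_mib} MiB/day"
echo "259 MiB free lasts ~$(days_until_full 259 "$daily_mib") whole day(s)"
```

At that rate the 259M still available would not last a full day, which is consistent with the disk filling up repeatedly.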

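I first checked the scheduled tasks and confirmed that no cron job was responsible, and that there was no boot-time startup script either.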

[root@xxx:/root]# grep -r "top -b" /etc/cron*
[root@xxx:/root]# grep -rw "top" /etc/init.d /etc/profile.d /etc/rc.local /etc/rc.d
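
The two greps above cover cron and the classic init locations but not systemd unit directories or per-user crontabs. A sketch of a broader search, with a hypothetical helper search_locations; the list of paths in the usage comment is an assumption and varies by distribution:

```shell
# Search candidate locations for a fixed string and print the files that contain it.
# Locations that do not exist are skipped silently. Hypothetical helper.
search_locations() {
    pattern=$1
    shift
    for path in "$@"; do
        [ -e "$path" ] || continue
        # -r recurse, -l print file names only, -F treat the pattern as a literal string
        grep -rlF -- "$pattern" "$path" 2>/dev/null || true   # no match is not an error here
    done
}

# Possible use on a real host, run as root (the path list is an assumption):
# search_locations "top -b" /etc/crontab /etc/cron.d /var/spool/cron \
#     /etc/rc.local /etc/rc.d /etc/profile.d /etc/systemd/system /run/systemd/system
```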

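With no leads, I checked the process list with ps and, sure enough, found a top process running with the -b flag that had been started back in 2024. I killed it first to see whether messages would keep growing. Three lines from systemd appeared, and then messages stopped updating.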

[root@xxx:/var/log]# ps -ef |grep top 
root      7394     1  0  2024 ?        06:49:32 /usr/bin/top -b 
oracle   31768 31448  0 14:56 pts/0    00:00:00 top 
root     32507 30772  0 15:09 pts/1    00:00:00 grep --color=auto top 
[root@xxx:/var/log]# kill -9 7394 
[root@xxx:/var/log]# 
[root@xxx:/var/log]# tail -f messages 
Jun  5 15:09:35 xxx top: 30772 root      20   0  116744   2956   1312 S   0.0  0.2   0:00.34 bash 
Jun  5 15:09:35 xxx top: 30865 root      20   0   19428   1076    248 S   0.0  0.1   0:00.30 assist_dae+ 
Jun  5 15:09:35 xxx top: 31329 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kworker/0:1 
Jun  5 15:09:35 xxx top: 31446 root      20   0  191884   2004   1428 S   0.0  0.1   0:00.00 su 
Jun  5 15:09:35 xxx top: 31448 oracle    20   0  116708   2520    984 S   0.0  0.1   0:00.14 bash 
Jun  5 15:09:35 xxx top: 31768 oracle    20   0  162088   2248   1580 S   0.0  0.1   0:00.77 top 
Jun  5 15:09:35 xxx top: 32294 oracle    20   0  967316  15412  13080 S   0.0  0.9   0:00.00 oracle 
Jun  5 15:09:35 xxx systemd: toptest.service: main process exited, code=killed, status=9/KILL 
Jun  5 15:09:35 xxx systemd: Unit toptest.service entered failed state. 
Jun  5 15:09:35 xxx systemd: toptest.service failed.
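
The parent PID of 1 in the ps output hints that the process was started by systemd rather than from a shell, so it is worth checking which unit owns a PID before killing it. A sketch that extracts the unit name from cgroup data, assuming the line format of /proc/<pid>/cgroup; the helper name unit_from_cgroup is hypothetical and the sample path in the comment is illustrative, not captured from this host:

```shell
# Print the systemd unit name(s) found in cgroup data read from stdin.
# Each input line has the form "hierarchy-ID:controllers:path",
# e.g. "1:name=systemd:/test.slice/toptest.service" (illustrative sample).
unit_from_cgroup() {
    awk -F: '{
                 n = split($3, parts, "/")           # last path component is the unit
                 if (parts[n] ~ /\.(service|scope)$/) print parts[n]
             }' | sort -u
}

# On a live host, for the PID reported by ps:
#   unit_from_cgroup < /proc/7394/cgroup
```

systemctl status also accepts a PID (systemctl status 7394) and reports the owning unit directly.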

In short: ps revealed a top -b process (PID 7394) that had been started in 2024. After it was killed with kill -9 7394, systemd logged the failure of toptest.service in messages, and messages stopped updating.

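So this process was definitely the source, and it had been started by the systemd service toptest.service. The status of that service shows that process 7394 was killed and that the unit is now in the failed state.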

[root@xxx:/var/log]#  systemctl status toptest.service
● toptest.service - /usr/bin/top -b
   Loaded: loaded (/run/systemd/system/toptest.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/toptest.service.d
           └─50-Description.conf, 50-ExecStart.conf, 50-Slice.conf
   Active: failed (Result: signal) since Thu 2025-06-05 15:09:35 CST; 12min ago
  Process: 7394 ExecStart=/usr/bin/top -b (code=killed, signal=KILL)
 Main PID: 7394 (code=killed, signal=KILL)

Jun 05 15:09:35 xxx top[7394]: 30772 root      20   0  116744   2956   1312 S   0.0  0.2   0:00.34 bash
Jun 05 15:09:35 xxx top[7394]: 30865 root      20   0   19428   1076    248 S   0.0  0.1   0:00.30 assist_dae+
Jun 05 15:09:35 xxx top[7394]: 31329 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kworker/0:1
Jun 05 15:09:35 xxx top[7394]: 31446 root      20   0  191884   2004   1428 S   0.0  0.1   0:00.00 su
Jun 05 15:09:35 xxx top[7394]: 31448 oracle    20   0  116708   2520    984 S   0.0  0.1   0:00.14 bash
Jun 05 15:09:35 xxx top[7394]: 31768 oracle    20   0  162088   2248   1580 S   0.0  0.1   0:00.77 top
Jun 05 15:09:35 xxx top[7394]: 32294 oracle    20   0  967316  15412  13080 S   0.0  0.9   0:00.00 oracle
Jun 05 15:09:35 xxx systemd[1]: toptest.service: main process exited, code=killed, status=9/KILL
Jun 05 15:09:35 xxx systemd[1]: Unit toptest.service entered failed state.
Jun 05 15:09:35 xxx systemd[1]: toptest.service failed.

I looked up what this output means (source: an LLM):

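Unit file location: /run/systemd/system/toptest.service. /run is a temporary filesystem (tmpfs), so this unit definition is not persistent and disappears on reboot.
static: the unit file has no [Install] section, so it cannot be enabled or disabled; it can still be started manually or pulled in by other units.
vendor preset: disabled: the distribution's preset policy does not enable it by default.
Drop-In directory: /run/systemd/system/toptest.service.d
Drop-in snippets:
- 50-Description.conf: sets the unit description
- 50-ExecStart.conf: contains the actual start command, /usr/bin/top -b
- 50-Slice.conf: assigns the service to a cgroup slice (used for resource control)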

The configuration files contain the following:

[root@xxx:/run/systemd/system/toptest.service.d]# cat /run/systemd/system/toptest.service
# Transient stub
[root@xxx:/run/systemd/system/toptest.service.d]# ll
total 12
-rw-r--r-- 1 root root 35 Sep  3  2024 50-Description.conf
-rw-r--r-- 1 root root 65 Sep  3  2024 50-ExecStart.conf
-rw-r--r-- 1 root root 27 Sep  3  2024 50-Slice.conf
[root@xxx:/run/systemd/system/toptest.service.d]# cat *
[Unit]
Description=/usr/bin/top -b
[Service]
ExecStart=
ExecStart=@/usr/bin/top "/usr/bin/top" "-b"
[Service]
Slice=test.slice
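
A stub file under /run/systemd/system plus 50-*.conf drop-ins is the layout that older systemd versions write for transient units, the kind created with systemd-run. A command such as systemd-run --unit=toptest --slice=test top -b would produce a unit like this one, although nothing here establishes how this unit was actually started. A sketch for spotting other such units, assuming the "# Transient stub" marker seen above (newer systemd versions mark transient units differently); the helper name is hypothetical:

```shell
# Print unit files under a directory whose content marks them as transient stubs.
# The default directory is where systemd keeps runtime-generated units.
list_transient_stubs() {
    dir=${1:-/run/systemd/system}
    for unit in "$dir"/*.service "$dir"/*.scope; do
        [ -f "$unit" ] || continue          # skip unmatched globs
        # -x matches the whole line, -q only sets the exit status
        if grep -qx -- '# Transient stub' "$unit"; then
            echo "$unit"
        fi
    done
}
```

Running list_transient_stubs with no argument on this host should list /run/systemd/system/toptest.service for as long as the unit remains loaded.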

So the configuration does indeed exist only under /run; there is nothing under /etc.

[root@xxx:/var/log]# ls -l /etc/systemd/system/toptest.service
ls: cannot access /etc/systemd/system/toptest.service: No such file or directory
[root@xxx:/var/log]# updatedb
[root@xxx:/var/log]# locate toptest.service
[root@xxx:/var/log]# ls -l /usr/lib/systemd/system/toptest.service
ls: cannot access /usr/lib/systemd/system/toptest.service: No such file or directory

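At this point it is clear that messages was filled up by toptest.service. What remains is the cleanup. The unit itself needs none: it exists only in memory (/run is tmpfs) and will not start again. Cleaning up the messages files is enough.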
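
A sketch of that cleanup, assuming the syslog daemon still holds /var/log/messages open: emptying the file in place avoids leaving a deleted-but-still-open file that keeps occupying disk space, while the rotated copies can simply be removed. The helper name truncate_log is hypothetical:

```shell
# Empty a log file in place so that the process writing to it keeps a valid file.
truncate_log() {
    : > "$1"
}

# Possible use on this host, run as root:
# truncate_log /var/log/messages
# rm -f /var/log/messages-20250511   # repeat for the other rotated copies listed by du
```

The 4.0G under /var/log/journal is a separate matter; on systemd versions that support it, journalctl --vacuum-size= can shrink the journal.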

Original author: liups.com

Original link: http://liups.com/posts/4759d282/

License: Creative Commons Attribution-NonCommercial 4.0 International