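The Error

During today's routine inspection of the HIS database servers, I found a large number of errors logged on the AIX system running on the Power server: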

[root@hisdb01:/]errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
E87EF1BE 0207150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0206150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0205150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0204150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0203150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0202150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0201150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0131150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0130150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0129150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0128150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0127150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0126150025 P O dumpcheck The largest dump device is too small.
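
When the log is long, it can help to collapse it into one line per error type. The following is a minimal sketch, not part of the original troubleshooting session: `errpt_summary` is a hypothetical helper name, and it assumes the summary layout shown above, with the identifier in column 1 and the resource name in column 5.

```shell
# Hypothetical helper: count errpt summary lines per (identifier, resource) pair.
# Reads errpt summary output on stdin and skips the header line.
errpt_summary() {
    awk '$1 != "IDENTIFIER" && NF >= 5 { count[$1 " " $5]++ }
         END { for (k in count) print k, count[k] }'
}

# Sample lines copied from the listing above.
errpt_summary <<'EOF'
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
E87EF1BE 0207150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0206150025 P O dumpcheck The largest dump device is too small.
E87EF1BE 0205150025 P O dumpcheck The largest dump device is too small.
EOF
```

On the server itself the same function would be fed from the live log with `errpt | errpt_summary`.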

The columns of the errpt output have the following meanings:

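The output has six columns, in this order:

1. IDENTIFIER: the error identifier. It is not unique per entry; it selects the error template, so every occurrence of the same kind of error carries the same IDENTIFIER.

2. TIMESTAMP: when the error occurred, in MMDDhhmmYY format (month, day, hour, minute, year).

3. TYPE (T): the error type, in other words its severity. There are six types, shown in the summary by their first letter:
PEND P a device or functional component may be about to become unavailable
PERF P performance has degraded severely
PERM P confirmed damage to a hardware device or software module
TEMP T temporary error that recovered after retries
INFO I informational message, not an error
UNKN U the severity cannot be determined

4. CLASS (C): the source of the error
H hardware fault (Hardware)
S software fault (Software)
O operator action (Operator)
U undetermined (Unknown)

5. RESOURCE_NAME
The name of the software or hardware resource that first detected the error. It does not mean that this resource is at fault, only that the error was first seen there.

6. DESCRIPTION
A short description of the error. If the meaning of a specific error is unclear, the IBM support hotline can help.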
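
The MMDDhhmmYY timestamp is easy to misread, so here is a small sketch that rewrites it into a more familiar form. `decode_errpt_ts` is a hypothetical helper name, and the century prefix `20` is an assumption that only holds for entries logged in the years 2000 to 2099.

```shell
# Hypothetical helper: convert an errpt TIMESTAMP (MMDDhhmmYY) to YYYY-MM-DD hh:mm.
# Assumption: the two-digit year belongs to the 2000s.
decode_errpt_ts() {
    printf '%s\n' "$1" |
        sed 's/^\(..\)\(..\)\(..\)\(..\)\(..\)$/20\5-\1-\2 \3:\4/'
}

decode_errpt_ts 0207150025   # prints 2025-02-07 15:00
```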

Analysis

Let's look at the details of the error, using errpt -aj E87EF1BE |more to page through them:

---------------------------------------------------------------------------
LABEL: DMPCHK_TOOSMALL
IDENTIFIER: E87EF1BE

Date/Time: Fri Feb 7 15:00:00 CST 2025
Sequence Number: 498
Machine Id: 00CB4D104B00
Node Id: hisdb02
Class: O
Type: PEND
WPAR: Global
Resource Name: dumpcheck

Description
The largest dump device is too small.

Probable Causes
Neither dump device is large enough to accommodate a system dump at this time.

Recommended Actions
Increase the size of one or both dump devices.

Detail Data
Largest dump device
lg_dumplv
Largest dump device size in kb
4194304
Current estimated dump size in kb
4637388
---------------------------------------------------------------------------
LABEL: DMPCHK_TOOSMALL
IDENTIFIER: E87EF1BE

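From the output above, Probable Causes shows that lg_dumplv is too small: the device holds 4194304 KB (4 GB), while the current estimated dump size is 4637388 KB (about 4.42 GB), hence the error. IBM describes dump devices as follows: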

When an unexpected system halt occurs, the system dump facility automatically copies selected areas of kernel data to the primary dump device. These areas include kernel segment 0, and other areas registered in the Master Dump Table by kernel modules or kernel extensions.

The current estimated dump size can be checked with sysdumpdev -e:

[root@hisdb01:/]sysdumpdev -e
0453-041 Estimated dump size in bytes: 4505080954
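
sysdumpdev -e reports bytes, whereas the error log reports KB, so the two are not directly comparable. The sketch below converts the message to KB and compares it with the device size. `estimated_dump_kb` is a hypothetical helper, the message format is assumed to match the line above, and the shell is assumed to support 64-bit integer arithmetic.

```shell
# Hypothetical helper: read a "sysdumpdev -e" message on stdin and print
# the estimated dump size in KB, rounded up.
# Assumption: the shell's $(( )) arithmetic is 64-bit.
estimated_dump_kb() {
    bytes=$(awk '/Estimated dump size in bytes/ { print $NF }')
    echo $(( (${bytes:-0} + 1023) / 1024 ))
}

dev_kb=4194304   # size of lg_dumplv in KB, from the error detail
est_kb=$(echo '0453-041 Estimated dump size in bytes: 4505080954' | estimated_dump_kb)
if [ "$est_kb" -gt "$dev_kb" ]; then
    echo "too small: need ${est_kb} KB, have ${dev_kb} KB"
fi
```

The estimate changes with the state of the system, which would explain why this value differs from the one recorded in the error log at 15:00.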

Interestingly, the error is always logged at 15:00 every day. Why? It comes from a default system cron job.

crontab -l shows a default entry that runs /usr/lib/ras/dumpcheck every day at 15:00:

[root@hisdb01:/]crontab -l
# @(#)08 1.15.1.3 src/bos/usr/sbin/cron/root, cmdcntl, bos720 2/11/94 17:19:47
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
#
# bos720 src/bos/usr/sbin/cron/root 1.15.1.3
#
# Licensed Materials - Property of IBM
#
# COPYRIGHT International Business Machines Corp. 1989,1994
# All Rights Reserved
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
# IBM_PROLOG_END_TAG
#
# COMPONENT_NAME: (CMDCNTL) commands needed for basic system needs
#
# FUNCTIONS:
#
# ORIGINS: 27
#
# (C) COPYRIGHT International Business Machines Corp. 1989,1994
# All Rights Reserved
# Licensed Materials - Property of IBM
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
#0 3 * * * /usr/sbin/skulker
#45 2 * * 0 /usr/lib/spell/compress
#45 23 * * * ulimit 5000; /usr/lib/smdemon.cleanu > /dev/null
0 11 * * * /usr/bin/errclear -d S,O 30
0 12 * * * /usr/bin/errclear -d H 90
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/sbin/dumpctrl -k >/dev/null 2>/dev/null
0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1
30 23 * * * /usr/sbin/ntpdate 192.168.50.250
0 0 * * * /usr/bin/nmon -f -N -m /nmon -s 30 -c 2880
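
To pick the schedule of a single job out of a long crontab, something like the following sketch can be used. `cron_time_of` is a hypothetical helper; it only handles entries whose minute and hour fields are plain numbers, which is true for the dumpcheck entry above but not for the dumpctrl one.

```shell
# Hypothetical helper: print hh:mm for the active crontab entries containing $1.
# Assumption: the minute and hour fields of the matching entry are plain numbers.
cron_time_of() {
    awk -v pat="$1" '!/^#/ && index($0, pat) { printf "%02d:%02d\n", $2, $1 }'
}

# Two entries copied from the crontab above.
printf '%s\n' \
    '0 11 * * * /usr/bin/errclear -d S,O 30' \
    '0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1' |
    cron_time_of dumpcheck   # prints 15:00
```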

Let's take a quick look at what this command does:

[root@hisdb01:/]/usr/lib/ras/dumpcheck -h
getopt: Not a recognized flag: h
Usage:
dumpcheck [[-l] [-p ] [-t time-parameters] [-P ]]
dumpcheck [-r ]

Checks to see that the dump device and copy directory
are able to receive the system dump.

-l Log any warnings to the error log.
-p Print any warnings produced to stdout.
-r Remove the crontab entry for this function.
-P Indicates that the changes are to be made permenently.
-t time-parameters Change the time when dumpcheck is executed.

The -r parameter must be specified alone,
(i.e.) it is not valid with any other parameters.

Running it by hand:

[root@hisdb01:/]/usr/lib/ras/dumpcheck -p
The largest dump device is too small.

Largest dump device
lg_dumplv
Largest dump device size in kb
4194304
Current estimated dump size in kb
4637388
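
The shortfall can be read straight out of this report. The sketch below is one way to do it, assuming the label and value lines alternate exactly as shown above; `dump_shortfall_kb` is a hypothetical helper name.

```shell
# Hypothetical helper: read "dumpcheck -p" output on stdin and print how many KB
# the estimated dump exceeds the largest dump device by.
# Assumption: each value sits on the line directly after its label.
dump_shortfall_kb() {
    awk '/Largest dump device size in kb/    { getline; dev = $1 }
         /Current estimated dump size in kb/ { getline; est = $1 }
         END { print est - dev }'
}

# Report copied from the output above.
dump_shortfall_kb <<'EOF'
The largest dump device is too small.

Largest dump device
lg_dumplv
Largest dump device size in kb
4194304
Current estimated dump size in kb
4637388
EOF
```

For the values above this prints 443084, that is, the device is roughly 433 MB short.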

Fix

Now that the source of the problem is clear, let's fix it. First confirm the size allocated to lg_dumplv; the default allocation here is 4 GB (PPs = 4, PP SIZE = 1024 MB):

[root@hisdb01:/]lslv lg_dumplv
LOGICAL VOLUME: lg_dumplv VOLUME GROUP: rootvg
LV IDENTIFIER: 00cb47b000004b0000000189e0b21f89.12 PERMISSION: read/write
VG STATE: active/complete LV STATE: opened/syncd
TYPE: sysdump WRITE VERIFY: off
MAX LPs: 512 PP SIZE: 1024 megabyte(s)
COPIES: 1 SCHED POLICY: parallel
LPs: 4 PPs: 4
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: middle UPPER BOUND: 32
MOUNT POINT: N/A LABEL: None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
INFINITE RETRY: no PREFERRED READ: 0
ENCRYPTION: no
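
The 4194304 KB in the error detail is simply LPs multiplied by the PP size. As a rough cross-check, the sketch below derives that figure from the lslv output. `lv_size_kb` is a hypothetical helper, and it assumes the field layout shown above, with the LPs count on a line starting with "LPs:".

```shell
# Hypothetical helper: read "lslv <lv>" output on stdin and print the LV size in KB.
# Assumption: "PP SIZE:" is given in megabytes and the LP count is on the
# line whose first field is "LPs:".
lv_size_kb() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "PP" && $(i+1) == "SIZE:") pp_mb = $(i+2) }
         $1 == "LPs:" { lps = $2 }
         END { print lps * pp_mb * 1024 }'
}

# The two relevant lines from the output above.
printf '%s\n' \
    'MAX LPs: 512 PP SIZE: 1024 megabyte(s)' \
    'LPs: 4 PPs: 4' |
    lv_size_kb   # prints 4194304
```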

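lg_dumplv belongs to rootvg and is not mirrored (PVs = 1, while the other LVs span both disks); lspv shows that it resides on hdisk0. The listings below report 6 LPs for lg_dumplv, so they appear to have been captured after the extension performed later in this post.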

[root@hisdb01:/]lsvg -l rootvg
rootvg:
LV NAME TYPE LPs PPs PVs LV STATE MOUNT POINT
hd5 boot 1 2 2 closed/syncd N/A
hd6 paging 33 66 2 open/syncd N/A
hd8 jfs2log 1 2 2 open/syncd N/A
hd4 jfs2 50 100 2 open/syncd /
hd2 jfs2 10 20 2 open/syncd /usr
hd9var jfs2 10 20 2 open/syncd /var
hd3 jfs2 10 20 2 open/syncd /tmp
hd1 jfs2 1 2 2 open/syncd /home
hd10opt jfs2 10 20 2 open/syncd /opt
hd11admin jfs2 1 2 2 open/syncd /admin
fwdump jfs2 3 6 2 open/syncd /var/adm/ras/platform
lg_dumplv sysdump 6 6 1 open/syncd N/A
livedump jfs2 1 2 2 open/syncd /var/adm/ras/livedump
fslv00 jfs2 100 200 2 open/syncd /u01
fslv01 jfs2 100 200 2 open/syncd /g01
[root@hisdb01:/]lspv -l hdisk0
hdisk0:
LV NAME LPs PPs DISTRIBUTION MOUNT POINT
fwdump 3 3 00..03..00..00..00 /var/adm/ras/platform
hd11admin 1 1 00..00..01..00..00 /admin
livedump 1 1 00..01..00..00..00 /var/adm/ras/livedump
lg_dumplv 6 6 00..06..00..00..00 N/A
fslv01 100 100 00..00..43..57..00 /g01
fslv00 100 100 100..00..00..00..00 /u01
hd5 1 1 01..00..00..00..00 N/A
hd6 33 33 00..33..00..00..00 N/A
hd8 1 1 00..00..01..00..00 N/A
hd4 50 50 00..30..20..00..00 /
hd2 10 10 00..00..10..00..00 /usr
hd9var 10 10 00..00..10..00..00 /var
hd3 10 10 00..00..10..00..00 /tmp
hd1 1 1 00..00..01..00..00 /home
hd10opt 10 10 00..00..10..00..00 /opt
[root@hisdb01:/]lspv -l hdisk1
hdisk1:
LV NAME LPs PPs DISTRIBUTION MOUNT POINT
fwdump 3 3 00..03..00..00..00 /var/adm/ras/platform
hd11admin 1 1 00..00..01..00..00 /admin
livedump 1 1 00..01..00..00..00 /var/adm/ras/livedump
fslv01 100 100 00..00..43..57..00 /g01
fslv00 100 100 100..00..00..00..00 /u01
hd5 1 1 01..00..00..00..00 N/A
hd6 33 33 00..33..00..00..00 N/A
hd8 1 1 00..00..01..00..00 N/A
hd4 50 50 00..30..20..00..00 /
hd2 10 10 00..00..10..00..00 /usr
hd9var 10 10 00..00..10..00..00 /var
hd3 10 10 00..00..10..00..00 /tmp
hd1 1 1 00..00..01..00..00 /home
hd10opt 10 10 00..00..10..00..00 /opt

Check whether rootvg has free space:

[root@hisdb01:/]lsvg rootvg
VOLUME GROUP: rootvg VG IDENTIFIER: 00cb47b000004b0000000189e0b21f89
VG STATE: active PP SIZE: 1024 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 1064 (1089536 megabytes)
MAX LVs: 256 FREE PPs: 398 (407552 megabytes)
LVs: 15 USED PPs: 666 (681984 megabytes)
OPEN LVs: 14 QUORUM: 1 (Disabled)
TOTAL PVs: 2 VG DESCRIPTORS: 3
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 2 AUTO ON: yes
MAX PPs per VG: 32512
MAX PPs per PV: 1016 MAX PVs: 32
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
PV RESTRICTION: none INFINITE RETRY: no
DISK BLOCK SIZE: 512 CRITICAL VG: no
FS SYNC OPTION: no CRITICAL PVs: no
ENCRYPTION: no
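
With 398 free PPs there is plenty of room. How many PPs the dump device actually needs can be estimated with a little arithmetic, sketched below. `pps_needed` is a hypothetical helper; the sizes are the ones reported by dumpcheck and lsvg above.

```shell
# Hypothetical helper: minimum number of extra PPs needed so the dump device
# can hold the estimated dump.
# $1 = estimated dump size in KB, $2 = current device size in KB, $3 = PP size in MB
pps_needed() {
    short=$(( $1 - $2 ))
    if [ "$short" -le 0 ]; then
        echo 0
        return
    fi
    pp_kb=$(( $3 * 1024 ))
    echo $(( (short + pp_kb - 1) / pp_kb ))   # round up to whole PPs
}

pps_needed 4637388 4194304 1024   # prints 1
```

One PP would be the bare minimum here. Since the estimate grows with system load, adding more than the minimum leaves some headroom.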

Extend lg_dumplv by 2 PPs, then check it again:

[root@hisdb01:/]extendlv lg_dumplv 2
[root@hisdb01:/]lslv lg_dumplv
LOGICAL VOLUME: lg_dumplv VOLUME GROUP: rootvg
LV IDENTIFIER: 00cb47b000004b0000000189e0b21f89.12 PERMISSION: read/write
VG STATE: active/complete LV STATE: opened/syncd
TYPE: sysdump WRITE VERIFY: off
MAX LPs: 512 PP SIZE: 1024 megabyte(s)
COPIES: 1 SCHED POLICY: parallel
LPs: 6 PPs: 6
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: middle UPPER BOUND: 32
MOUNT POINT: N/A LABEL: None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
INFINITE RETRY: no PREFERRED READ: 0
ENCRYPTION: no

Checking rootvg again shows that FREE PPs has dropped by 2:

[root@hisdb01:/]lsvg rootvg
VOLUME GROUP: rootvg VG IDENTIFIER: 00cb47b000004b0000000189e0b21f89
VG STATE: active PP SIZE: 1024 megabyte(s)
VG PERMISSION: read/write TOTAL PPs: 1064 (1089536 megabytes)
MAX LVs: 256 FREE PPs: 396 (405504 megabytes)
LVs: 15 USED PPs: 668 (684032 megabytes)
OPEN LVs: 14 QUORUM: 1 (Disabled)
TOTAL PVs: 2 VG DESCRIPTORS: 3
STALE PVs: 0 STALE PPs: 0
ACTIVE PVs: 2 AUTO ON: yes
MAX PPs per VG: 32512
MAX PPs per PV: 1016 MAX PVs: 32
LTG size (Dynamic): 256 kilobyte(s) AUTO SYNC: no
HOT SPARE: no BB POLICY: relocatable
PV RESTRICTION: none INFINITE RETRY: no
DISK BLOCK SIZE: 512 CRITICAL VG: no
FS SYNC OPTION: no CRITICAL PVs: no
ENCRYPTION: no
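
The before and after values can also be compared mechanically. This is only a sketch: `free_pps` is a hypothetical helper that assumes the "FREE PPs:" label appears as two adjacent fields, as in the lsvg output above.

```shell
# Hypothetical helper: read "lsvg <vg>" output on stdin and print the FREE PPs count.
free_pps() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "FREE" && $(i+1) == "PPs:") print $(i+2) }'
}

# The FREE PPs lines from the two lsvg listings in this post.
before=$(echo 'MAX LVs: 256 FREE PPs: 398 (407552 megabytes)' | free_pps)
after=$(echo 'MAX LVs: 256 FREE PPs: 396 (405504 megabytes)' | free_pps)
echo "PPs consumed: $(( before - after ))"   # prints PPs consumed: 2
```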

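Since this server has no other errors, clear the error log manually and run the check by hand once more. No new error is logged, so the problem is resolved.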

[root@hisdb01:/]errclear 0
[root@hisdb01:/]errpt
[root@hisdb01:/]/usr/lib/ras/dumpcheck
[root@hisdb01:/]errpt

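Why This Happens

Default lg_dumplv size

On AIX, the lg_dumplv logical volume is the area used to store system dumps. Whether it is created at installation time depends on how much memory the server has. If the server has less than 4 GB of memory, installing AIX 5.2 or 5.3 does not create it automatically, and system dumps go to hd6 by default. If the server has 4 GB of memory or more, the installation creates a dedicated logical volume for system dumps, named lg_dumplv. Its default size follows these rules:

4 GB <= memory < 12 GB: lg_dumplv is 1 GB
12 GB <= memory < 24 GB: lg_dumplv is 2 GB
24 GB <= memory < 48 GB: lg_dumplv is 3 GB
48 GB <= memory: lg_dumplv is 4 GB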

This server has 128 GB of memory, so a default lg_dumplv size of 4 GB matches the rule.
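
The sizing rule above can be written down as a small lookup. This is only a sketch of the rule as stated in this post, not code taken from the AIX installer; `default_dumplv_gb` is a hypothetical name, and a result of 0 stands for "no dedicated lg_dumplv is created".

```shell
# Hypothetical helper: default lg_dumplv size in GB for a given memory size in GB,
# following the thresholds listed above. 0 means no dedicated dump LV (dump goes to hd6).
default_dumplv_gb() {
    mem_gb=$1
    if   [ "$mem_gb" -lt 4 ];  then echo 0
    elif [ "$mem_gb" -lt 12 ]; then echo 1
    elif [ "$mem_gb" -lt 24 ]; then echo 2
    elif [ "$mem_gb" -lt 48 ]; then echo 3
    else                            echo 4
    fi
}

default_dumplv_gb 128   # prints 4
```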

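About dumpcheck

The /usr/lib/ras/dumpcheck command checks the disk resources used by a system dump. It logs an error if the largest dump device is too small to receive the dump, or if there is insufficient space in the copy directory when the dump is to paging space.

dumpcheck is normally run by the cron daemon each day at 3:00 p.m. local time. This can differ if the entry has been removed from root's crontab with the -r flag, or if the run time has been changed with -t TimeParameters. It can also be configured from SMIT. dumpcheck is added to root's crontab automatically when the service aids are installed.

To be most effective, dumpcheck should run when the system is most heavily loaded, because that is when the system dump is most likely to be at its maximum size. Even with dumpcheck watching the dump size, a dump may still fail to fit on the dump device or in the copy directory when it actually happens, for example if system load peaks right at dump time.

In practice the default dump device is often not large enough for a complete dump, which is why this scheduled job checks regularly and logs an error when the device needs to be extended.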

Some references:

[Managing system dump devices]:

https://www.ibm.com/support/pages/managing-system-dump-devices

[dumpcheck command - IBM Documentation]:

https://www.ibm.com/docs/zh/aix/7.3?topic=d-dumpcheck-command

Feel free to contact me to discuss. My WeChat ID: Eric_xu_2023
