最近3天遇到3次故障,记录其中一个先。

一大早就连续收到两条短信告警,一条是节点1数据库状态不为不可用,另一条是节点2数据库状态变为不可用,嗯,就是rac 俩节点都不可用了。快速登录服务器,发现主机可以登录,只是库挂了。

登录服务器查看日志如下:

2025-02-25T08:39:18.568741+08:00
Errors in file /oracle/app/oracle/diag/rdbms/liups/liups2/trace/liups2_ctwr_18172.trc (incident=1528836):
ORA-00600: internal error code, arguments: [krccfl_chunk], [0x7FFFF2998760], [236160], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/app/oracle/diag/rdbms/liups/liups2/incident/incdir_1528836/liups2_ctwr_18172_i1528836.trc
2025-02-25T08:39:20.430291+08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2025-02-25T08:39:20.430465+08:00
Block change tracking service error 600.
2025-02-25T08:39:20.560265+08:00
Dumping diagnostic data in directory=[cdmp_20250225083920], requested by (instance=2, osid=18172 (CTWR)), summary=[incident=1528836].
2025-02-25T08:40:21.569711+08:00
Errors in file /oracle/app/oracle/diag/rdbms/liups/liups2/trace/liups2_ctwr_18172.trc:
ORA-00600: internal error code, arguments: [krccfl_chunk], [0x7FFFF2998760], [236160], [], [], [], [], [], [], [], [], []
Errors in file /oracle/app/oracle/diag/rdbms/liups/liups2/trace/liups2_ctwr_18172.trc (incident=1528837):
ORA-487 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /oracle/app/oracle/diag/rdbms/liups/liups2/incident/incdir_1528837/liups2_ctwr_18172_i1528837.trc
2025-02-25T08:40:23.665753+08:00
USER (ospid: 18172): terminating the instance due to ORA error 487
2025-02-25T08:40:23.682724+08:00
opiodr aborting process unknown ospid (2942) as a result of ORA-1092
2025-02-25T08:40:23.682966+08:00
opiodr aborting process unknown ospid (121968) as a result of ORA-1092
2025-02-25T08:40:23.683246+08:00
opiodr aborting process unknown ospid (21423) as a result of ORA-1092
2025-02-25T08:40:23.683459+08:00

可以看到是 ora 600的错误,这个通常是要查询mos,按照常规经验就是登录mos:https://support.oracle.com/signin

输入 ora 600. 找到 ORA-600/ORA-7445/ORA-700 Error Look-up Tool 打开,输入 Error Code First Argument,并选择数据库的版本,然后点击 Look-up an error,可能会弹出一篇或者几篇文章,也可能找不到相关的文章,比如

An Error document for ORA-600 [krccfl_chunk] is not registered with the tool.

An Error document for ORA-600 [krccfl_chunk] is not registered with the tool.

Your request for information on this error has been recorded and will be used for publishing prioritization.

Things to try:

Check the error message and confirm that this is an ORA-600 error and not an ORA-700 or ORA-7445 error.

Use ‘Do a general Search for Knowledge’ to begin a search for any published documents and bugs that mention the error.

然后我们再点击close,再点击 do a general search for knowkedge,也就是整个知识库搜索,然后根据结果进行匹配查找。

一通操作之后 几分钟过去了,在关键数据库宕机面前,争分夺秒恢复业务是多么的重要。这里呢,我也是一顿mos搜索,但是事后我去google上搜索 ora 600 krccfl_chunk 第一篇文章就是:Database doesn’t open after crash ORA-00600 [krccfl_chunk] (Doc ID 2967548.1),点击登录之后登录mos账号,就直接可以看到解决方案了。写的很明确跟 bct有关,然后仔细看alert日志,上面写的也很清楚:Block change tracking service error 600. trc的进程也是 CTWR,基本上完美匹配了,如果不放心可以在看下 call stack。

CTWR crashed with ORA-600 while starting DB due to BCT Enabled.

Stack Trace:

Error Stack: ORA-600[krccfl_chunk]
Main Stack:

1
2
krccfl <- krcchg <- krcccr <- kcratr_apply <- kcratr <- kctrec <- kcvirv <- ktmmon
<- ktmSmonMain <- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main

The Issue caused by unpulished bugs which fixed in 19.17 and above.

解决方案也有3个:

Option 1:

Disable Block Change Tracking and Open Database.

1
2
3
SQL> alter database disable block change tracking;

SQL> alter database open;

Option 2:

If BCT Needs to be enabled for Backup performance , tune BCT Parameter as below and Enable BCT.

1
2
3
4
5
6
7
8
alter system set "_bct_public_dba_buffer_size"=200000000 scope=both sid='*';
alter system set "_bct_public_dba_buffer_maxsize"=200000000;
alter system set "_bct_buffer_allocation_max"=1073741824 scope=both sid='*';
alter system set "_bct_public_dba_buffer_dynresize"=0 scope=both sid='*';
alter system set "_bct_health_check_interval"=0 scope=both sid='*';
alter system set "_high_priority_processes"=CTWR scope=spfile sid='*'; >>>>>>>>>Needs DB bounce.

SQL> alter database enable block change tracking;

Once Re-enabled take new Level 0 backup and followed by Level 1.

Option 3:

Upgrade to 19.17 and above if affected version.

这里通过 alter database disable block change tracking; 禁用 bct解决。

现在 mos 的seo 足够智能,通过Look-up an error 竟然找不到的内容,可以通过google引擎直接搜索到。同时目前在ai的加持下,老DBA的经验感觉越来越不吃香了,学会搜索,学会 ai ,可能处理起问题会更快速。且干且珍重。

原文作者: liups.com

原文链接: http://liups.com/posts/b2c46938/

许可协议: 知识共享署名-非商业性使用 4.0 国际许可协议