A Case Where One Node's Crash Brought Down the Other Node Too

 3 years ago
source link: http://www.dboracle.com/archivers/%e4%b8%80%e4%b8%aa%e8%8a%82%e7%82%b9%e6%8c%82%e5%af%bc%e8%87%b4%e5%8f%a6%e5%a4%96%e4%b8%80%e4%b8%aa%e8%8a%82%e7%82%b9%e4%b9%9f%e8%b7%9f%e7%9d%80%e6%8c%82%e7%9a%84%e6%a1%88%e4%be%8b%e4%b8%80%e5%88%99.html

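Copyright notice: This is an original article by Buddy Yuan. Please credit the source when reposting. Original article: A Case Where One Node's Crash Brought Down the Other Node Too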
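At around 11:02, the host of the customer's database node 1 hit a fault and went down. The VIP then failed over and a large number of connections flooded into node 2, causing a short-term spike in node 2's CPU run queue (the r column). At the same time, node 2 had to perform a Reconfiguration because node 1 had crashed. At around 11:10, node 2 also went down abnormally. After receiving the alert, we logged on to the host and brought the node 2 instance back up at 11:13, which restored the database.
Node 1 went down because of a host hardware problem. Once node 1 failed, node 2 should have taken over, but its database instance crashed during the takeover.
The alert log from that time shows that the instance was terminated by the LMON process.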

Fri Sep 25 11:11:15 2020
ORA-1092 : opitsk aborting process
Fri Sep 25 11:11:17 2020
Termination issued to instance processes. Waiting for the processes to exit
Fri Sep 25 11:11:17 2020
ORA-1092 : opitsk aborting process
Instance terminated by LMON, pid = 213044

Looking further at the LMON trace file, we find the following:

*** 2020-09-25 11:05:06.721
2020-09-25 11:05:06.721325 : * Begin lmon rcfg step KJGA_RCFG_TIMERQ
2020-09-25 11:05:06.721606 : * Begin lmon rcfg step KJGA_RCFG_DDQ
2020-09-25 11:05:06.723678 : * Begin lmon rcfg step KJGA_RCFG_SETMASTER
2020-09-25 11:05:06.941692 :
 Set master node info
2020-09-25 11:05:06.942669 : * Begin lmon rcfg step KJGA_RCFG_ENQREPLAY

*** 2020-09-25 11:05:07.128
2020-09-25 11:05:07.128144 :  Submitted all remote-enqueue requests
2020-09-25 11:05:07.129277 : * Begin lmon rcfg step KJGA_RCFG_ENQDUBIOUS
 Dwn-cvts replayed, VALBLKs dubious
2020-09-25 11:05:07.386873 : * Begin lmon rcfg step KJGA_RCFG_ENQGRANT
 All grantable enqueues granted
2020-09-25 11:05:07.527865 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY
2020-09-25 11:05:07.724845 :
2020-09-25 11:05:07.724928 :  Post SMON to start 1st pass IR

*** 2020-09-25 11:10:34.000
2020-09-25 11:10:34.000754 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5594<-kjfclmsync()+941<-kjfcrfg()+78119
<-kjfcln()+8349<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main(
)+201<-__libc_start_main()+253

----- End of Abridged Call Stack Trace -----

*** 2020-09-25 11:10:34.017
LMON (ospid: 213044): terminating the instance due to error 481

The LMON trace shows that the rcfg steps, which stand for the Reconfiguration, had already begun by about 11:05. At 11:10:34, LMON hit: kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance.
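
To make the stall in the trace easier to see, here is a minimal Python sketch that parses the timestamped lines from the LMON trace excerpt and reports the largest gap between two consecutive entries. The helper names (`parse_events`, `largest_gap`) and the regular expression are illustrative assumptions, not part of any Oracle tooling. The embedded text is copied from the trace shown earlier.

```python
import re
from datetime import datetime

# Timestamped lines copied from the LMON trace excerpt in this article.
TRACE = """\
2020-09-25 11:05:06.721325 : * Begin lmon rcfg step KJGA_RCFG_TIMERQ
2020-09-25 11:05:06.721606 : * Begin lmon rcfg step KJGA_RCFG_DDQ
2020-09-25 11:05:06.723678 : * Begin lmon rcfg step KJGA_RCFG_SETMASTER
2020-09-25 11:05:06.942669 : * Begin lmon rcfg step KJGA_RCFG_ENQREPLAY
2020-09-25 11:05:07.128144 :  Submitted all remote-enqueue requests
2020-09-25 11:05:07.129277 : * Begin lmon rcfg step KJGA_RCFG_ENQDUBIOUS
2020-09-25 11:05:07.386873 : * Begin lmon rcfg step KJGA_RCFG_ENQGRANT
2020-09-25 11:05:07.527865 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY
2020-09-25 11:05:07.724928 :  Post SMON to start 1st pass IR
2020-09-25 11:10:34.000754 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance
"""

# Assumed line layout: "<date> <time with microseconds> : <message>".
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) :\s*(.*)$")


def parse_events(text):
    """Return (timestamp, message) pairs for every timestamped trace line."""
    events = []
    for line in text.splitlines():
        match = STAMP.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group(2).strip()))
    return events


def largest_gap(events):
    """Return (seconds, message_before, message_after) for the longest silence."""
    best = None
    for (t0, msg0), (t1, msg1) in zip(events, events[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, msg0, msg1)
    return best


gap, before, after = largest_gap(parse_events(TRACE))
print(f"{gap:.2f} s of silence after: {before}")
```

On this excerpt the longest silence is about 326 seconds and it follows the "Post SMON to start 1st pass IR" entry, which is in line with the 327 seconds that kjfclmsync reports having waited.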

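The obvious cause is that system CPU on the host reached 80% at the time and the CPU run queue (r) reached 1500, so the system could no longer hand out CPU resources. When system CPU is that high, the host usually hangs, and a hung host causes Oracle background processes to misbehave. In the end, the LMON process terminated the instance to protect database consistency. The sudden surge in resource demand caused the cluster to trigger a bug, and a similar problem is documented on MOS: Bug 9745585 LMON reported "waited %d secs for lmses to finish parallel rcfg work, terminating instance"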
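We recommend running regular high-availability switchover drills, and using them to assess whether the load spike at the moment of switchover is large enough to cause this kind of instance crash.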
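
During such a drill, the two indicators that stood out in this incident, the run queue and system CPU, can be checked against simple thresholds. The following Python sketch shows one way to do that. The class and function names, the default thresholds (`queue_factor=4.0`, `system_limit=30.0`) and the CPU count of 64 are all hypothetical, chosen only for illustration. The article does not state how many CPUs the host had, and the thresholds are not Oracle or OS recommendations.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    run_queue: int     # runnable processes, the "r" column of vmstat
    system_pct: float  # CPU time spent in kernel mode, the "sy" column


def saturation_reasons(sample, cpu_count, queue_factor=4.0, system_limit=30.0):
    """Return the reasons, if any, why a sample looks CPU-saturated.

    queue_factor and system_limit are illustrative thresholds only;
    tune them for the host under test.
    """
    reasons = []
    if sample.run_queue > queue_factor * cpu_count:
        reasons.append(
            f"run queue {sample.run_queue} exceeds {queue_factor:g}x {cpu_count} CPUs"
        )
    if sample.system_pct > system_limit:
        reasons.append(
            f"system CPU {sample.system_pct:g}% exceeds {system_limit:g}%"
        )
    return reasons


# Values reported for this incident. cpu_count=64 is a made-up placeholder,
# because the article does not say how many CPUs the host had.
incident = CpuSample(run_queue=1500, system_pct=80.0)
for reason in saturation_reasons(incident, cpu_count=64):
    print(reason)
```

An empty list means neither threshold was crossed. Samples taken while connections fail over to the surviving node are the ones worth checking, since that is when the spike occurred here.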

