

记一次ora.asm服务无法启动和GPnP Get Item的问题
source link: http://www.dboracle.com/archivers/记一次ora-asm服务无法启动和gpnp-get-item的问题.html?amp%3Butm_medium=referral
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

版权声明:本文为Buddy Yuan原创文章,未经允许不得转载。原文地址: 记一次ora.asm服务无法启动和GPnP Get Item的问题
昨天晚上去打DB PSU,这是一套11.2.0.3 RAC。很快我就把数据库和集群全部停止,然后把补丁安装完了。我心想还是蛮顺利的啊,结果在重启集群的时候,突然发现集群没办法起来。通过crsctl check crs可以发现无法启动crsd的服务,而我去查看crsd.log的时候发现根本没有任何记录。一直是空的。然后我通过ocrcheck命令检查会一直hang住,执行crsctl query css votedisk确是好的,很快都能查到。然后我又执行了kfed去读取asm磁盘,发现也是能读的。通过进一步检查,发现是ora.asm服务启动不成功。进一步登陆到asm实例上查看asm的alert日志,发现能读取spfile文件,并且实例是能正常启动的。我通过查看asm的日志发现,磁盘组并没有走到最后一步mount成功的地方。如下所示:
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */ NOTE: Diskgroups listed in ASM_DISKGROUPS are DG_FRA DG_DATA NOTE: Diskgroup used for Voting files is: DG_OCR Diskgroup with spfile:DG_OCR Diskgroup used for OCR is:DG_OCR NOTE: cache registered group DG_DATA number=1 incarn=0x50ec3256 NOTE: cache began mount (not first) of group DG_DATA number=1 incarn=0x50ec3256 NOTE: cache registered group DG_FRA number=2 incarn=0x50fc3257 NOTE: cache began mount (not first) of group DG_FRA number=2 incarn=0x50fc3257 NOTE: cache registered group DG_OCR number=3 incarn=0x50fc3258 NOTE: cache began mount (not first) of group DG_OCR number=3 incarn=0x50fc3258 NOTE: Assigning number (1,3) to disk (/dev/diskgroup/dg_data4) NOTE: Assigning number (1,2) to disk (/dev/diskgroup/dg_data3) NOTE: Assigning number (1,1) to disk (/dev/diskgroup/dg_data2) NOTE: Assigning number (1,0) to disk (/dev/diskgroup/dg_data1) NOTE: Assigning number (2,0) to disk (/dev/diskgroup/dg_fra) NOTE: Assigning number (3,0) to disk (/dev/diskgroup/dg_ocr) GMON querying group 1 at 4 for pid 23, osid 128640 NOTE: cache opening disk 0 of grp 1: DG_DATA_0000 path:/dev/diskgroup/dg_data1 NOTE: F1X0 found on disk 0 au 2 fcn 0.0 NOTE: cache opening disk 1 of grp 1: DG_DATA_0001 path:/dev/diskgroup/dg_data2 NOTE: cache opening disk 2 of grp 1: DG_DATA_0002 path:/dev/diskgroup/dg_data3 NOTE: cache opening disk 3 of grp 1: DG_DATA_0003 path:/dev/diskgroup/dg_data4 NOTE: cache mounting (not first) external redundancy group 1/0x50EC3256 (DG_DATA)
可以看到日志前面都是正常能够读盘的,中间做了一些其他的动作,在最后面出现下列错误:
NOTE: detached from domain 3 NOTE: cache dismounted group 3/0x50FC3258 (DG_OCR) NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258 NOTE: cache deleting context for group DG_OCR 3/0x50fc3258 GMON dismounting group 3 at 9 for pid 23, osid 128640 NOTE: Disk in mode 0x8 marked for de-assignment ERROR: diskgroup DG_OCR was not mounted
接下来我通过asm实例进入到数据库查了一下,在正常实例的磁盘组是mounted状态,而异常的数据库实例里面磁盘组的状态是mounting状态。也就是正在mount的状态。这里的一个疑点就是怎么mount不上。
为了把这个疑点弄清楚,我做了一个全局的hang分析。这里必须要做全局的才能知道磁盘组为什么无法mount。
*** 2018-10-26 04:36:34.280 =============================================================================== HANG ANALYSIS: instances (db_name.oracle_sid): +asm.+asm2, +asm.+asm1 oradebug_node_dump_level: 3 analysis initiated by oradebug os thread scheduling delay history: (sampling every 1.000000 secs) 0.000000 secs at [ 04:36:34 ] NOTE: scheduling delay has not been sampled for 0.184529 secs 0.000000 secs from [ 04:36:30 - 04:36:35 ], 5 sec avg 0.000000 secs from [ 04:35:34 - 04:36:35 ], 1 min avg 0.000000 secs from [ 04:31:34 - 04:36:35 ], 5 min avg =============================================================================== Chains most likely to have caused the hang: [a] Chain 1 Signature: 'GPnP Get Item'<='DFS lock handle' Chain 1 Signature Hash: 0xe08712d6 [b] Chain 2 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention' Chain 2 Signature Hash: 0xec6b634f [c] Chain 3 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention' Chain 3 Signature Hash: 0xec6b634f =============================================================================== Non-intersecting chains: ------------------------------------------------------------------------------- Chain 1: ------------------------------------------------------------------------------- Oracle session identified by: { instance: 2 (+asm.+asm2) os id: 88049 process id: 23, oracle@KSPX-DB2 (TNS V1-V3) session id: 1427 session serial #: 3 } is waiting for 'DFS lock handle' with wait info: { p1: 'type|mode'=0x43490005 p2: 'id1'=0x3c p3: 'id2'=0x2 time in wait: 49 min 17 sec timeout after: never wait id: 107 blocking: 0 sessions current sql: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */ short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+1865<-kslwaitctx()+163<-kjusuc()+3619<-ksigeti()+3781<-ksbcic_int()+21066<-ksbcic()+12<-kfgbSendPeelWait()+53<-kfgFinalizeMount()+4736<-kfgscFinalize()+1405<-kfgForEachKfgsc()+193<-kfgsoFinalize()+135<-kfgFinalize()+396<-kfxdrvMount()+4781<-kfxdrvEntry()+2218<-opiexe()+20368<-opiosq0()+2948<-kpooprx()+274<-kpoal8()+829<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidr wait history: * time between current wait and wait #1: 0.000149 sec 1. event: 'rdbms ipc reply' time waited: 0.000338 sec wait id: 106 p1: 'from_process'=0x12 p2: 'timeout'=0x7fffffff * time between wait #1 and #2: 0.000054 sec 2. event: 'kfk: async disk IO' time waited: 0.000182 sec wait id: 105 p1: 'count'=0x1 p2: 'intr'=0x0 p3: 'timeout'=0xffffffff * time between wait #2 and #3: 0.000157 sec 3. event: 'rdbms ipc reply' time waited: 0.000372 sec wait id: 104 p1: 'from_process'=0x12 p2: 'timeout'=0x7fffffff } and is blocked by => Oracle session identified by: { instance: 1 (+asm.+asm1) os id: 15643 process id: 18, oracle@KSPX-DB1 (RBAL) session id: 1117 session serial #: 1 } which is waiting for 'GPnP Get Item' with wait info: { time in wait: 221831 min 33 sec timeout after: never wait id: 124844244 blocking: 44 sessions current sql: short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-__poll()+47<-sgipcwWaitHelper()+5271<-sgipcwWait()+475<-gipcWaitOsd()+281<-gipcInternalWait()+14965<-gipcWaitF()+3234<-gipcInternalRecvSync()+8749<-gipcRecvSyncF()+3343<-clsgpnpm_gipcGets()+297<-clsgpnpm_receiveMsg()+408<-clsgpnpm_exchange()+1059<-clsgpnp_profileCallUrlInt()+3568<-clsgpnp_getProfileEx()+325<-clsgpnp_dbmsGetItem()+263<-kggpnpAttrGet()+1272<-kfdParseProfileString()+885<-kfdDiscoverShallow()+2023<-kfgbDriver()+1745<-ksbabs()+ wait history: * time between current wait and wait #1: 0.000007 sec 1. event: 'GPnP Initialization' time waited: 0.018700 sec wait id: 124844243 * time between wait #1 and #2: 0.000207 sec 2. event: 'CSS initialization' time waited: 0.000027 sec wait id: 124844242 * time between wait #2 and #3: 0.000062 sec 3. event: 'rdbms ipc message' time waited: 0.002787 sec wait id: 124844241 p1: 'timeout'=0x12c }
通过全局的hang分析,我们可以发现,在节点2 asm启动之后,运行了”ALTER DISKGROUP ALL MOUNT”的命令,而现在这个命令没有执行成功,被阻塞了40多分钟。它是被节点1的rbal进程阻塞的,而该进程正在等待一个叫 “GPnP Get Item”的等待事件。那么既然我们知道问题的原因了,我们就可以开始在mos上查找相关资料了。通过查看文档 Diskgroup Mount Hangs with RBAL Waiting on ‘GPnP Get Item’ and ‘enq: DD – contention’ (文档 ID 1375505.1) ,发现这个问题是这是由未发布的bug:12398300引起的。这个问题的最根本原因就是gpnp进程stuck了,然后就会导致rbal进程产生等待。解决办法其实很简单,就是干掉节点1的gpnp进程就会释放。而干掉gpnp进程集群不会重启,它只会单独的把该进程重启。
通过kill gpnp之后,故障恢复,节点2的集群顺利启动。
这里有一个技巧,以后针对ora.asm服务无法启动的问题,其实都可以使用全局的hang分析。
Recommend
-
21
解决UEFI安装无法启动的问题 首页 分类 标签
-
8
Buddy Yuan的个人技术博客 Steve Jobs is a genius and has an impact on many people 一次并行进程无法开启的问题分析...
-
8
Intro: 一个(或者两个)关于 Armitage for mac 无法启动的问题。 问题Permalink 在
-
9
公司 Windows 有监控和 杀毒扫描,很卡,不得已又用回 Linux,即便如此也只能用虚拟机的方式环境:Elementary 0S 5.1 ( Ubuntu 18.04 LTS (bionic) )KVM requires a CPU that supports vmx or svm找到 Android...
-
5
记一次 React 组件无法更新状态值的问题分析与解决 React 组件中通过直接声明的元素变量(jsx 写法),在访问 state 中指定的状态值时,如果状态发生改变,使用状...
-
8
Python Fabric库无法启动后台进程的问题和解决办法2016.10.19问题和处理方法Python 的 Fabric 库能够方便的远程操作Linux主机执行命令或传输文件。其实现方式就是底层实现ssh协...
-
9
postgresql无法启动问题 2015-05-27数据库 今天升级系统后,发现postgresql无法启动了,报错/e...
-
6
排查Spring 无法启动的问题 原创 这几天有小伙伴跟我反馈,有一个项目从 kotlin 1.2 升级到 kotlin 1.3 以后 Spring 项目无法启动,报...
-
6
磁盘问题宕机后 docker 的 containerd 无法启动 ...
-
8
Redis服务端启动失败排查:解决重启无法获取远程连接访问 https://www.chenweiliang.com/cwl-29424.html
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK