记一次ora.asm服务无法启动和GPnP Get Item的问题

昨天晚上去打DB PSU，这是一套11.2.0.3 RAC。很快我就把数据库和集群全部停止，然后把补丁安装完了。我心想还是蛮顺利的啊，结果在重启集群的时候，突然发现集群没办法起来。通过crsctl check crs可以发现无法启动crsd的服务，而我去查看crsd.log的时候发现根本没有任何记录。一直是空的。然后我通过ocrcheck命令检查会一直hang住，执行crsctl query css votedisk确是好的，很快都能查到。然后我又执行了kfed去读取asm磁盘，发现也是能读的。通过进一步检查，发现是ora.asm服务启动不成功。进一步登陆到asm实例上查看asm的alert日志，发现能读取spfile文件，并且实例是能正常启动的。我通过查看asm的日志发现，磁盘组并没有走到最后一步mount成功的地方。如下所示：

SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */ 
NOTE: Diskgroups listed in ASM_DISKGROUPS are
 	 DG_FRA
 	 DG_DATA
NOTE: Diskgroup used for Voting files is:
 	 DG_OCR
Diskgroup with spfile:DG_OCR
Diskgroup used for OCR is:DG_OCR
NOTE: cache registered group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache began mount (not first) of group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache registered group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache began mount (not first) of group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache registered group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache began mount (not first) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: Assigning number (1,3) to disk (/dev/diskgroup/dg_data4)
NOTE: Assigning number (1,2) to disk (/dev/diskgroup/dg_data3)
NOTE: Assigning number (1,1) to disk (/dev/diskgroup/dg_data2)
NOTE: Assigning number (1,0) to disk (/dev/diskgroup/dg_data1)
NOTE: Assigning number (2,0) to disk (/dev/diskgroup/dg_fra)
NOTE: Assigning number (3,0) to disk (/dev/diskgroup/dg_ocr)
GMON querying group 1 at 4 for pid 23, osid 128640
NOTE: cache opening disk 0 of grp 1: DG_DATA_0000 path:/dev/diskgroup/dg_data1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DG_DATA_0001 path:/dev/diskgroup/dg_data2
NOTE: cache opening disk 2 of grp 1: DG_DATA_0002 path:/dev/diskgroup/dg_data3
NOTE: cache opening disk 3 of grp 1: DG_DATA_0003 path:/dev/diskgroup/dg_data4
NOTE: cache mounting (not first) external redundancy group 1/0x50EC3256 (DG_DATA)

可以看到日志前面都是正常能够读盘的，中间做了一些其他的动作，在最后面出现下列错误：

NOTE: detached from domain 3
NOTE: cache dismounted group 3/0x50FC3258 (DG_OCR) 
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache deleting context for group DG_OCR 3/0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
NOTE: Disk  in mode 0x8 marked for de-assignment
ERROR: diskgroup DG_OCR was not mounted

接下来我通过asm实例进入到数据库查了一下，在正常实例的磁盘组是mounted状态，而异常的数据库实例里面磁盘组的状态是mounting状态。也就是正在mount的状态。这里的一个疑点就是怎么mount不上。

为了把这个疑点弄清楚，我做了一个全局的hang分析。这里必须要做全局的才能知道磁盘组为什么无法mount。

*** 2018-10-26 04:36:34.280
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): +asm.+asm2, +asm.+asm1
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 04:36:34 ]
      NOTE: scheduling delay has not been sampled for 0.184529 secs    0.000000 secs from [ 04:36:30 - 04:36:35 ], 5 sec avg
    0.000000 secs from [ 04:35:34 - 04:36:35 ], 1 min avg
    0.000000 secs from [ 04:31:34 - 04:36:35 ], 5 min avg
===============================================================================
 
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GPnP Get Item'<='DFS lock handle'
     Chain 1 Signature Hash: 0xe08712d6
 [b] Chain 2 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 2 Signature Hash: 0xec6b634f
 [c] Chain 3 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 3 Signature Hash: 0xec6b634f
 
===============================================================================
Non-intersecting chains:
 
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 2 (+asm.+asm2)
                   os id: 88049
              process id: 23, oracle@KSPX-DB2 (TNS V1-V3)
              session id: 1427
        session serial #: 3
    }
    is waiting for 'DFS lock handle' with wait info:
    {
                      p1: 'type|mode'=0x43490005
                      p2: 'id1'=0x3c
                      p3: 'id2'=0x2
            time in wait: 49 min 17 sec
           timeout after: never
                 wait id: 107
                blocking: 0 sessions
             current sql: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+1865<-kslwaitctx()+163<-kjusuc()+3619<-ksigeti()+3781<-ksbcic_int()+21066<-ksbcic()+12<-kfgbSendPeelWait()+53<-kfgFinalizeMount()+4736<-kfgscFinalize()+1405<-kfgForEachKfgsc()+193<-kfgsoFinalize()+135<-kfgFinalize()+396<-kfxdrvMount()+4781<-kfxdrvEntry()+2218<-opiexe()+20368<-opiosq0()+2948<-kpooprx()+274<-kpoal8()+829<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidr wait history: * time between current wait and wait #1: 0.000149 sec 1. event: 'rdbms ipc reply' time waited: 0.000338 sec wait id: 106 p1: 'from_process'=0x12 p2: 'timeout'=0x7fffffff * time between wait #1 and #2: 0.000054 sec 2. event: 'kfk: async disk IO' time waited: 0.000182 sec wait id: 105 p1: 'count'=0x1 p2: 'intr'=0x0 p3: 'timeout'=0xffffffff * time between wait #2 and #3: 0.000157 sec 3. event: 'rdbms ipc reply' time waited: 0.000372 sec wait id: 104 p1: 'from_process'=0x12 p2: 'timeout'=0x7fffffff } and is blocked by => Oracle session identified by:
    {
                instance: 1 (+asm.+asm1)
                   os id: 15643
              process id: 18, oracle@KSPX-DB1 (RBAL)
              session id: 1117
        session serial #: 1
    }
    which is waiting for 'GPnP Get Item' with wait info:
    {
            time in wait: 221831 min 33 sec
           timeout after: never
                 wait id: 124844244
                blocking: 44 sessions
             current sql: 
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-__poll()+47<-sgipcwWaitHelper()+5271<-sgipcwWait()+475<-gipcWaitOsd()+281<-gipcInternalWait()+14965<-gipcWaitF()+3234<-gipcInternalRecvSync()+8749<-gipcRecvSyncF()+3343<-clsgpnpm_gipcGets()+297<-clsgpnpm_receiveMsg()+408<-clsgpnpm_exchange()+1059<-clsgpnp_profileCallUrlInt()+3568<-clsgpnp_getProfileEx()+325<-clsgpnp_dbmsGetItem()+263<-kggpnpAttrGet()+1272<-kfdParseProfileString()+885<-kfdDiscoverShallow()+2023<-kfgbDriver()+1745<-ksbabs()+
            wait history:
              * time between current wait and wait #1: 0.000007 sec
              1.       event: 'GPnP Initialization'
                 time waited: 0.018700 sec
                     wait id: 124844243       
              * time between wait #1 and #2: 0.000207 sec
              2.       event: 'CSS initialization'
                 time waited: 0.000027 sec
                     wait id: 124844242       
              * time between wait #2 and #3: 0.000062 sec
              3.       event: 'rdbms ipc message'
                 time waited: 0.002787 sec
                     wait id: 124844241       p1: 'timeout'=0x12c
    }

通过全局的hang分析，我们可以发现，在节点2 asm启动之后，运行了”ALTER DISKGROUP ALL MOUNT”的命令，而现在这个命令没有执行成功，被阻塞了40多分钟。它是被节点1的rbal进程阻塞的，而该进程正在等待一个叫 “GPnP Get Item”的等待事件。那么既然我们知道问题的原因了，我们就可以开始在mos上查找相关资料了。通过查看文档 Diskgroup Mount Hangs with RBAL Waiting on ‘GPnP Get Item’ and ‘enq: DD – contention’ (文档 ID 1375505.1) ，发现这个问题是这是由未发布的bug：12398300引起的。这个问题的最根本原因就是gpnp进程stuck了，然后就会导致rbal进程产生等待。解决办法其实很简单，就是干掉节点1的gpnp进程就会释放。而干掉gpnp进程集群不会重启，它只会单独的把该进程重启。

通过kill gpnp之后，故障恢复，节点2的集群顺利启动。

这里有一个技巧，以后针对ora.asm服务无法启动的问题，其实都可以使用全局的hang分析。

Recommend

解决UEFI安装无法启动的问题

一次并行进程无法开启的问题分析

Armitage 无法启动问题

Ubuntu AVD 无法启动问题追踪

记一次 React 组件无法更新状态值的问题分析与解决

Python Fabric库无法启动后台进程的问题和解决办法

postgresql无法启动问题

排查Spring 无法启动的问题

磁盘问题宕机后 docker 的 containerd 无法启动

Redis服务端启动失败排查：解决重启无法获取远程连接访问

About Joyk