50

记一次ora.asm服务无法启动和GPnP Get Item的问题

 5 years ago
source link: http://www.dboracle.com/archivers/记一次ora-asm服务无法启动和gpnp-get-item的问题.html?amp%3Butm_medium=referral
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

版权声明:本文为Buddy Yuan原创文章,未经允许不得转载。原文地址: 记一次ora.asm服务无法启动和GPnP Get Item的问题

昨天晚上去打DB PSU,这是一套11.2.0.3 RAC。很快我就把数据库和集群全部停止,然后把补丁安装完了。我心想还是蛮顺利的啊,结果在重启集群的时候,突然发现集群没办法起来。通过crsctl check crs可以发现无法启动crsd的服务,而我去查看crsd.log的时候发现根本没有任何记录。一直是空的。然后我通过ocrcheck命令检查会一直hang住,执行crsctl query css votedisk确是好的,很快都能查到。然后我又执行了kfed去读取asm磁盘,发现也是能读的。通过进一步检查,发现是ora.asm服务启动不成功。进一步登陆到asm实例上查看asm的alert日志,发现能读取spfile文件,并且实例是能正常启动的。我通过查看asm的日志发现,磁盘组并没有走到最后一步mount成功的地方。如下所示:

SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */ 
NOTE: Diskgroups listed in ASM_DISKGROUPS are
 	 DG_FRA
 	 DG_DATA
NOTE: Diskgroup used for Voting files is:
 	 DG_OCR
Diskgroup with spfile:DG_OCR
Diskgroup used for OCR is:DG_OCR
NOTE: cache registered group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache began mount (not first) of group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache registered group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache began mount (not first) of group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache registered group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache began mount (not first) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: Assigning number (1,3) to disk (/dev/diskgroup/dg_data4)
NOTE: Assigning number (1,2) to disk (/dev/diskgroup/dg_data3)
NOTE: Assigning number (1,1) to disk (/dev/diskgroup/dg_data2)
NOTE: Assigning number (1,0) to disk (/dev/diskgroup/dg_data1)
NOTE: Assigning number (2,0) to disk (/dev/diskgroup/dg_fra)
NOTE: Assigning number (3,0) to disk (/dev/diskgroup/dg_ocr)
GMON querying group 1 at 4 for pid 23, osid 128640
NOTE: cache opening disk 0 of grp 1: DG_DATA_0000 path:/dev/diskgroup/dg_data1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DG_DATA_0001 path:/dev/diskgroup/dg_data2
NOTE: cache opening disk 2 of grp 1: DG_DATA_0002 path:/dev/diskgroup/dg_data3
NOTE: cache opening disk 3 of grp 1: DG_DATA_0003 path:/dev/diskgroup/dg_data4
NOTE: cache mounting (not first) external redundancy group 1/0x50EC3256 (DG_DATA)

可以看到日志前面都是正常能够读盘的,中间做了一些其他的动作,在最后面出现下列错误:

NOTE: detached from domain 3
NOTE: cache dismounted group 3/0x50FC3258 (DG_OCR) 
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache deleting context for group DG_OCR 3/0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
NOTE: Disk  in mode 0x8 marked for de-assignment
ERROR: diskgroup DG_OCR was not mounted

接下来我通过asm实例进入到数据库查了一下,在正常实例的磁盘组是mounted状态,而异常的数据库实例里面磁盘组的状态是mounting状态。也就是正在mount的状态。这里的一个疑点就是怎么mount不上。

为了把这个疑点弄清楚,我做了一个全局的hang分析。这里必须要做全局的才能知道磁盘组为什么无法mount。

*** 2018-10-26 04:36:34.280
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): +asm.+asm2, +asm.+asm1
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 04:36:34 ]
      NOTE: scheduling delay has not been sampled for 0.184529 secs    0.000000 secs from [ 04:36:30 - 04:36:35 ], 5 sec avg
    0.000000 secs from [ 04:35:34 - 04:36:35 ], 1 min avg
    0.000000 secs from [ 04:31:34 - 04:36:35 ], 5 min avg
===============================================================================
 
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GPnP Get Item'<='DFS lock handle'
     Chain 1 Signature Hash: 0xe08712d6
 [b] Chain 2 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 2 Signature Hash: 0xec6b634f
 [c] Chain 3 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 3 Signature Hash: 0xec6b634f
 
===============================================================================
Non-intersecting chains:
 
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 2 (+asm.+asm2)
                   os id: 88049
              process id: 23, oracle@KSPX-DB2 (TNS V1-V3)
              session id: 1427
        session serial #: 3
    }
    is waiting for 'DFS lock handle' with wait info:
    {
                      p1: 'type|mode'=0x43490005
                      p2: 'id1'=0x3c
                      p3: 'id2'=0x2
            time in wait: 49 min 17 sec
           timeout after: never
                 wait id: 107
                blocking: 0 sessions
             current sql: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+1865<-kslwaitctx()+163<-kjusuc()+3619<-ksigeti()+3781<-ksbcic_int()+21066<-ksbcic()+12<-kfgbSendPeelWait()+53<-kfgFinalizeMount()+4736<-kfgscFinalize()+1405<-kfgForEachKfgsc()+193<-kfgsoFinalize()+135<-kfgFinalize()+396<-kfxdrvMount()+4781<-kfxdrvEntry()+2218<-opiexe()+20368<-opiosq0()+2948<-kpooprx()+274<-kpoal8()+829<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidr wait history: * time between current wait and wait #1: 0.000149 sec 1. event: 'rdbms ipc reply' time waited: 0.000338 sec wait id: 106 p1: 'from_process'=0x12 p2: 'timeout'=0x7fffffff * time between wait #1 and #2: 0.000054 sec 2. event: 'kfk: async disk IO' time waited: 0.000182 sec wait id: 105 p1: 'count'=0x1 p2: 'intr'=0x0 p3: 'timeout'=0xffffffff * time between wait #2 and #3: 0.000157 sec 3. event: 'rdbms ipc reply' time waited: 0.000372 sec wait id: 104 p1: 'from_process'=0x12 p2: 'timeout'=0x7fffffff } and is blocked by => Oracle session identified by:
    {
                instance: 1 (+asm.+asm1)
                   os id: 15643
              process id: 18, oracle@KSPX-DB1 (RBAL)
              session id: 1117
        session serial #: 1
    }
    which is waiting for 'GPnP Get Item' with wait info:
    {
            time in wait: 221831 min 33 sec
           timeout after: never
                 wait id: 124844244
                blocking: 44 sessions
             current sql: 
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-__poll()+47<-sgipcwWaitHelper()+5271<-sgipcwWait()+475<-gipcWaitOsd()+281<-gipcInternalWait()+14965<-gipcWaitF()+3234<-gipcInternalRecvSync()+8749<-gipcRecvSyncF()+3343<-clsgpnpm_gipcGets()+297<-clsgpnpm_receiveMsg()+408<-clsgpnpm_exchange()+1059<-clsgpnp_profileCallUrlInt()+3568<-clsgpnp_getProfileEx()+325<-clsgpnp_dbmsGetItem()+263<-kggpnpAttrGet()+1272<-kfdParseProfileString()+885<-kfdDiscoverShallow()+2023<-kfgbDriver()+1745<-ksbabs()+
            wait history:
              * time between current wait and wait #1: 0.000007 sec
              1.       event: 'GPnP Initialization'
                 time waited: 0.018700 sec
                     wait id: 124844243       
              * time between wait #1 and #2: 0.000207 sec
              2.       event: 'CSS initialization'
                 time waited: 0.000027 sec
                     wait id: 124844242       
              * time between wait #2 and #3: 0.000062 sec
              3.       event: 'rdbms ipc message'
                 time waited: 0.002787 sec
                     wait id: 124844241       p1: 'timeout'=0x12c
    }

通过全局的hang分析,我们可以发现,在节点2 asm启动之后,运行了”ALTER DISKGROUP ALL MOUNT”的命令,而现在这个命令没有执行成功,被阻塞了40多分钟。它是被节点1的rbal进程阻塞的,而该进程正在等待一个叫 “GPnP Get Item”的等待事件。那么既然我们知道问题的原因了,我们就可以开始在mos上查找相关资料了。通过查看文档 Diskgroup Mount Hangs with RBAL Waiting on ‘GPnP Get Item’ and ‘enq: DD – contention’ (文档 ID 1375505.1) ,发现这个问题是这是由未发布的bug:12398300引起的。这个问题的最根本原因就是gpnp进程stuck了,然后就会导致rbal进程产生等待。解决办法其实很简单,就是干掉节点1的gpnp进程就会释放。而干掉gpnp进程集群不会重启,它只会单独的把该进程重启。

通过kill gpnp之后,故障恢复,节点2的集群顺利启动。

这里有一个技巧,以后针对ora.asm服务无法启动的问题,其实都可以使用全局的hang分析。


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK