Postmortem: Troubleshooting Sustained High CPU on an HRegionServer Node
Source link: https://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&mid=2650230358&idx=1&sn=78dadbc1b38d5a8bbcd161558617e5a3
DingTalk alert
Alert screenshot
Checking the CPU monitoring
CPU shot straight up starting at 1:08 a.m. and then stayed pinned at a high level.
Find the process consuming the most CPU
The HRegionServer PID is 9069 (confirmed with jps):
$ jps
9069 HRegionServer
Check CPU usage with top
Process 9069 has climbed to 753.1% CPU; in fact, eight of its threads each have CPU utilization above 90%...
$ top
  PID USER   PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 9069 xxxxx  20  0 4468160   2.4g 11044 S 753.1 15.4 22741:51 java
72377 xxxxx  20  0 3003704 894600  7296 S   2.6  5.5  4974:57 java
...
Find the hottest threads inside the process
Use top -H -p. The listing below shows eight threads; apart from thread 9235 at 89.4% (an instantaneous sample), the other seven all exceed 90%.
$ top -H -p 9069
 PID USER   PR NI    VIRT  RES   SHR S %CPU %MEM     TIME+ COMMAND
9234 xxxxx  20  0 4468160 2.4g 11044 R 92.7 15.4 836:14.76 java
9237 xxxxx  20  0 4468160 2.4g 11044 R 92.7 15.4 970:13.42 java
9247 xxxxx  20  0 4468160 2.4g 11044 R 92.7 15.4   2672:25 java
9246 xxxxx  20  0 4468160 2.4g 11044 R 92.1 15.4   1787:46 java
9242 xxxxx  20  0 4468160 2.4g 11044 R 91.7 15.4 850:03.95 java
9240 xxxxx  20  0 4468160 2.4g 11044 R 90.4 15.4   3028:13 java
9238 xxxxx  20  0 4468160 2.4g 11044 R 90.1 15.4   3424:54 java
9235 xxxxx  20  0 4468160 2.4g 11044 R 89.4 15.4   1952:44 java
Capture the stacks of the problem threads
Use the JDK's jstack command to dump the JVM process's thread stacks to a text file for analysis:
$ jstack 9069 > allstack.txt
Convert the thread ID and locate its stack
printf "%x\n" prints the given argument in hexadecimal.
Below, 9234 is converted to 2412, and 2412 is then used to filter that thread's stack information out of allstack.txt.
$ printf "%x\n" 9234
2412
Alternatively, pipe jstack straight into grep to capture just that thread's header line:
$ jstack 9069 | grep 2412 > Onestack.txt
The line for nid=0x2412 (the content of Onestack.txt):
"RpcServer.default.FPBQ.Fifo.handler=1,queue=1,port=16020" #34 daemon prio=5 os_prio=0 tid=0x00007f99a93a0000 nid=0x2412 waiting on condition [0x00007f9981b2f000]
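The two steps above can be tied together in one small script. Note that a bare grep only returns the thread header line (as in Onestack.txt); adding `-A` keeps the stack frames that follow it. This is a sketch; the script name and the 30-line context window are my own choices:

```shell
#!/bin/sh
# Given a JVM PID and a decimal thread id (from `top -H`), convert the
# thread id to the hex `nid` used in jstack output and print that
# thread's stack with trailing context.
pid="${1:?usage: thread-stack.sh <pid> <tid>}"
tid="${2:?usage: thread-stack.sh <pid> <tid>}"
nid=$(printf '%x' "$tid")            # e.g. 9234 -> 2412
# -A 30: keep up to 30 lines after the header so the frames are visible
jstack "$pid" | grep -A 30 "nid=0x${nid}"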
Full stack of thread 9234
"RpcServer.default.FPBQ.Fifo.handler=1,queue=1,port=16020" #34 daemon prio=5 os_prio=0 tid=0x00007f99a93a0000 nid=0x2412 runnable [0x00007f9981b2e000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.hbase.KeyValue$KVComparator.compareKey(KeyValue.java:1997)
	at org.apache.hadoop.hbase.regionserver.querymatcher.ScanQueryMatcher.compareKeyForNextColumn(ScanQueryMatcher.java:297)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.trySkipToNextColumn(StoreScanner.java:843)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekOrSkipToNextColumn(StoreScanner.java:742)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:674)
	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6282)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6442)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6214)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6200)
	at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1063)
	- locked <0x00000000bed22a50> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
	at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:404)
	at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:251)
	at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:299)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2844)
	- locked <0x00000000bed233d0> (a org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3093)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
The other hot threads look much the same; their stacks are nearly identical.
What the stacks show: while scanning data, Phoenix is rebuilding indexes (rebuildIndices).
Source snippet:
@Override
protected RegionScanner doPostScannerOpen(final ObserverContext<RegionCoprocessorEnvironment> c,
        final Scan scan, final RegionScanner s) throws IOException, SQLException {
    RegionCoprocessorEnvironment env = c.getEnvironment();
    Region region = env.getRegion();
    long ts = scan.getTimeRange().getMax();
    boolean localIndexScan = ScanUtil.isLocalIndex(scan);
    // BaseScannerRegionObserver.ANALYZE_TABLE
    // _ANALYZETABLE
    if (ScanUtil.isAnalyzeTable(scan)) {
        // ...
    } else if (ScanUtil.isIndexRebuild(scan)) {
        // rebuildIndices
        return rebuildIndices(s, region, scan, env);
    }
    ...
}
What happened at 1 a.m....
To restore normal CPU levels quickly, we restarted the HBase cluster; CPU utilization returned to normal.
Around 1 a.m.: the task-scheduling platform showed that a scheduled job deleting T-1 (previous-day) data by date had run at about that time. Its status was failed, its log showed an HBase RPC timeout, and an email alert had been sent. Initial assessment: delete data -> scan data -> rebuild index => (index creation had failed when the data entered Phoenix).
Running select ... limit queries against that day's data, we saw the same rebuildIndices stack appear in the server's threads; but because limit capped the row count, the index rebuild finished quickly and the query completed normally.
Cause: index creation failed, so subsequent queries and deletes trigger an index rebuild.
Since the deleted data had been bulk-imported earlier with DataX, we checked the DataX import logs and found ERRORs: during the large batch upsert, index updates (creation) had failed.
java.sql.SQLException: ERROR 2008 (INT10): Unable to find cached index metadata. ERROR 2008 (INT10): ERROR 2008 (INT10): Unable to find cached index metadata. key=-6300784100791250221 region=XXXX:YYYYY,,1604307763042.64071f279677f937353d25246a29ef25.host=xxxx.bigdata.xx,16020,1604307755320 Index update failed
Unable to find cached index metadata.
...Index update failed
Check the official Phoenix configuration
http://phoenix.apache.org/tuning.html
There is this parameter:
phoenix.coprocessor.maxServerCacheTimeToLiveMs
Maximum living time (in milliseconds) of server caches. A cache entry expires after this amount of time has passed since last access. Consider adjusting this parameter when a server-side IOException(“Could not find hash cache for joinId”) happens. Getting warnings like “Earlier hash cache(s) might have expired on servers” might also be a sign that this number should be increased.
Default: 30000 ms = 30 seconds.
We raised maxServerCacheTimeToLiveMs to 3,000,000 ms = 50 minutes.
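A minimal sketch of the change, assuming the parameter is set in hbase-site.xml on the region servers and the cluster is restarted afterwards (the 3000000 value is the one chosen above; the comment text is mine):

```xml
<!-- hbase-site.xml: raise the Phoenix server-cache TTL so index metadata
     sent along with large upserts does not expire mid-operation.
     3000000 ms = 50 minutes (default is 30000 ms = 30 seconds). -->
<property>
  <name>phoenix.coprocessor.maxServerCacheTimeToLiveMs</name>
  <value>3000000</value>
</property>
```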
Other important configuration parameters related to Phoenix secondary indexes:
https://phoenix.apache.org/secondary_indexing.html
Notably, before the parameter change and cluster restart, the near-real-time small-batch data sync had been building indexes normally, and query speed had been clearly acceptable.
Summary
For HBase + Phoenix, prefer an integrated platform such as a cloud service or CDH over a self-built cluster. Even then, test thoroughly, put exception handling and monitoring in place for every stage of the full pipeline, and, for important externally facing services, keep a standby production cluster.