
Post-mortem: Troubleshooting a Sustained CPU Spike on an HRegionServer Node

source link: https://mp.weixin.qq.com/s?__biz=MzA4NzA5NzE5Ng%3D%3D&mid=2650230358&idx=1&sn=78dadbc1b38d5a8bbcd161558617e5a3



DingTalk alert notification

[Alert screenshot]

Checking the CPU monitoring graphs

CPU usage shot up almost vertically at 1:08 AM and stayed pinned at a high level.

[CPU monitoring graph]

Finding the process consuming the most CPU

HRegionServer PID: 9069

$ jps
9069 HRegionServer

Checking CPU usage with top

Process 9069's CPU usage has climbed to 753.1%; under the hood, eight of its threads each account for roughly 90% CPU or more...

$ top
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9069 xxxxx     20   0 4468160   2.4g  11044 S 753.1 15.4  22741:51 java
72377 xxxxx     20   0 3003704 894600   7296 S   2.6  5.5   4974:57 java
...

Finding the hot threads inside the process

Use top -H -p. The listing below shows eight threads; apart from thread 9235 at 89.4% (an instantaneous sample), the other seven are all above 90%.

$ top -H -p 9069
 PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
9234 xxxxx     20   0 4468160   2.4g  11044 R 92.7 15.4 836:14.76 java
9237 xxxxx     20   0 4468160   2.4g  11044 R 92.7 15.4 970:13.42 java
9247 xxxxx     20   0 4468160   2.4g  11044 R 92.7 15.4   2672:25 java
9246 xxxxx     20   0 4468160   2.4g  11044 R 92.1 15.4   1787:46 java
9242 xxxxx     20   0 4468160   2.4g  11044 R 91.7 15.4 850:03.95 java
9240 xxxxx     20   0 4468160   2.4g  11044 R 90.4 15.4   3028:13 java
9238 xxxxx     20   0 4468160   2.4g  11044 R 90.1 15.4   3424:54 java
9235 xxxxx     20   0 4468160   2.4g  11044 R 89.4 15.4   1952:44 java

Capturing the problem threads' stacks

Use the jstack command shipped with the JDK to dump the JVM process's stack traces to a text file for analysis:

$ jstack 9069 > allstack.txt

Converting the thread ID and locating its stack trace

printf "%x\n",输出指定参数的16进制形式

Below we convert 9234 to its hex form, 2412, then search allstack.txt for 2412 to pull out that thread's stack trace.

$ printf "%x\n" 9234
2412
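The conversion can be batched for all eight hot threads at once — a sketch using the thread IDs from the `top -H` output above:

```shell
# Convert each hot thread ID from top -H to the hex "nid" form used by jstack
for tid in 9234 9235 9237 9238 9240 9242 9246 9247; do
  printf "%d -> nid=0x%x\n" "$tid" "$tid"
done
```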

Alternatively, pipe jstack straight through grep to emit just that thread's line:

$ jstack 9069 | grep 2412 > Onestack.txt

The line for nid=0x2412 (the content of Onestack.txt):

"RpcServer.default.FPBQ.Fifo.handler=1,queue=1,port=16020" #34 daemon prio=5 os_prio=0 tid=0x00007f99a93a0000 nid=0x2412 waiting on condition [0x00007f9981b2f000]

Thread 9234's full stack trace:

"RpcServer.default.FPBQ.Fifo.handler=1,queue=1,port=16020" #34 daemon prio=5 os_prio=0 tid=0x00007f99a93a0000 nid=0x2412 runnable [0x00007f9981b2e000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.KeyValue$KVComparator.compareKey(KeyValue.java:1997)
        at org.apache.hadoop.hbase.regionserver.querymatcher.ScanQueryMatcher.compareKeyForNextColumn(ScanQueryMatcher.java:297)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.trySkipToNextColumn(StoreScanner.java:843)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekOrSkipToNextColumn(StoreScanner.java:742)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:674)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:152)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6282)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6442)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6214)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6200)        
        at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1063)
        - locked <0x00000000bed22a50> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:404)
        at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:251)
        at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:299)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2844)
        - locked <0x00000000bed233d0> (a org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3093)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)

The other threads look much the same; their stack traces are nearly identical.
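That check can also be done in one pass: convert each hot thread ID and pull its header line from the dump, so the handler names can be compared side by side — a sketch:

```shell
# For every hot thread, print the matching jstack header line from the dump
for tid in 9234 9235 9237 9238 9240 9242 9246 9247; do
  grep "nid=0x$(printf '%x' "$tid")" allstack.txt
done
```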

What the stacks show: while scanning data, Phoenix is rebuilding indexes (rebuildIndices).

Source snippet:

@Override
protected RegionScanner doPostScannerOpen(final ObserverContext<RegionCoprocessorEnvironment> c, final Scan scan, final RegionScanner s) throws IOException, SQLException {
    RegionCoprocessorEnvironment env = c.getEnvironment();
    Region region = env.getRegion();
    long ts = scan.getTimeRange().getMax();
    boolean localIndexScan = ScanUtil.isLocalIndex(scan);
    // BaseScannerRegionObserver.ANALYZE_TABLE
    // _ANALYZETABLE
    if (ScanUtil.isAnalyzeTable(scan)) {
        // ...
    } else if (ScanUtil.isIndexRebuild(scan)) {
       // rebuildIndices
        return rebuildIndices(s, region, scan, env);
    }

    // ...
}

So what happened at 1 AM...

To get CPU back to normal quickly, we restarted the HBase cluster, and CPU utilization recovered.

1 AM: On the job-scheduling platform we found a scheduled job that deletes the T-1 day's data by date, running at around 1 AM. Its status was Failed; the job log showed an HBase RPC timeout, and an alert email had been sent as well. Initial hypothesis: delete data -> scan data -> rebuild index => (index creation had failed when the data entered Phoenix).

Running a select ... limit query against that day's data, we saw the same rebuildIndices stack appear in the server-side threads; but since limit bounded the row count, the index rebuild finished quickly and the query completed normally.

Cause: index creation had failed, so subsequent queries and deletes trigger an index rebuild.

The deleted data had been bulk-loaded earlier with DataX, so we checked the DataX import logs and found ERRORs: during the large batch upsert, the index update (creation) had failed.

java.sql.SQLException: ERROR 2008 (INT10): Unable to find cached index metadata.  ERROR 2008 (INT10): ERROR 2008 (INT10): Unable to find cached index metadata.  key=-6300784100791250221 region=XXXX:YYYYY,,1604307763042.64071f279677f937353d25246a29ef25.host=xxxx.bigdata.xx,16020,1604307755320 Index update failed

Unable to find cached index metadata.

...Index update failed

Checking the official Phoenix tuning docs

http://phoenix.apache.org/tuning.html

They document this parameter:

phoenix.coprocessor.maxServerCacheTimeToLiveMs

Maximum living time (in milliseconds) of server caches. A cache entry expires after this amount of time has passed since last access. Consider adjusting this parameter when a server-side IOException(“Could not find hash cache for joinId”) happens. Getting warnings like “Earlier hash cache(s) might have expired on servers” might also be a sign that this number should be increased.

Default: 30000 ms = 30 seconds.

We raised maxServerCacheTimeToLiveMs to 3000000 ms = 50 minutes.
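Phoenix server-side properties like this one are typically set in hbase-site.xml on each RegionServer and take effect after a restart — a sketch (property name from the Phoenix tuning page; the value is the 50-minute setting chosen above):

```xml
<!-- hbase-site.xml: raise the server cache TTL from 30 s to 50 min -->
<property>
  <name>phoenix.coprocessor.maxServerCacheTimeToLiveMs</name>
  <value>3000000</value>
</property>
```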

Other important configuration parameters related to Phoenix secondary indexes:

https://phoenix.apache.org/secondary_indexing.html

Worth noting: even before the parameter change and cluster restart, the near-real-time, small-batch data syncs had been producing healthy indexes with noticeably good query speed; it was the large DataX bulk loads that failed.

Summary

For HBase + Phoenix, prefer an integrated platform such as a cloud service or CDH over a self-built cluster. Even then, test thoroughly, build exception handling and monitoring into every stage of the full pipeline, and keep a standby production cluster for important externally-facing services.

