10

JFR定位线上问题实例 - JFR导致的雪崩问题定位与解决

 3 years ago
source link: https://club.perfma.com/article/2068590
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
张哈希2小时前

最近发现一个应用,使用默认的 JFR 配置,发生了类似于雪崩的现象,这里记录下,引以为戒。

Key takeaways

  1. JFR 的线程堆栈 dump 采集默认周期是everyChunk,也就是每次新建一个Chunk就会采集一次。默认每个Chunk大小为 12M,在线程比较多的时候,堆栈 dump 可能大于 12M,导致 JFR 一直切换新的 Chunk,然后又触发evenryChunk导致无限循环而雪崩。
  2. 对于 JFR 默认事件采集配置(位于JDK目录/lib/jfr/default.jfc),每个采集周期和Chunk相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用Chunk作为指标,防止上面这个问题。
  3. 只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。

线上某个线程数非常多的应用,线程是空闲状态,有任务提交过来,但是过了 10s,任务才开始执行,并且这一段时间有很多类似的情况,例如:

2020-12-01 13:07:34.515  INFO [微服务名称,298ee1f359f67ff7,298ee1f359f67ff7] [19] [线程全局唯一名称:调度线程] 任务提交到线程池

.....

2020-12-01 13:07:43.623  INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 开始处理
2020-12-01 13:07:44.159  INFO [微服务名称,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [线程全局唯一名称:24684780-0] 处理完成

查看24684780-0这个线程池的统计当时的队列是空的,所有线程都是等待任务的状态。排除了有其他任务占满核心线程池导致等待的情况。

线程一直空闲,但是却不能接受任务处理,表示这个线程可能一直处于需要进入安全点SafePoint的场景。查看GC,当时的 GC 是正常的,证明并不是 GC 引起的需要进入安全点

image.png

查看当时的 safepoint 日志(我们线上的进程开启了 safepoint 日志,参数:-Xlog:safepoint=debug:file=safepoint.log:utctime,level,tags:filecount=50,filesize=100M),发现在这期间内,有很多很多的进入 safepoint 的处理(我们只看了进入安全点(Entering safepoint region)这一行):

[2020-12-01T13:07:32.290+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.335+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:32.635+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:32.655+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:32.853+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:32.923+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.968+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.269+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.286+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.480+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:33.524+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.806+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.822+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.949+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:34.019+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.064+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.346+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.361+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:34.510+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.555+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.839+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.861+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.023+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:35.097+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.140+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.423+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.439+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.614+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.654+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.943+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.959+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.089+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:36.172+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.216+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:36.498+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:36.515+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.706+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.749+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.036+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.052+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.184+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:37.258+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.301+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.624+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.648+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.906+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.955+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.248+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:38.265+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:38.431+0000][info ][safepoint] Entering safepoint region: G1CollectForAllocation
[2020-12-01T13:07:38.551+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:38.620+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:38.686+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.978+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.001+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.214+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.269+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:39.552+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.568+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.733+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:39.813+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.860+0000][info ][safepoint] Entering safepoint region: PrintThreads

看上去很像 JFR 的定时采集,通过 top -Hp查看线程cpu占用:

top - 01:31:46 up 10:01,  1 user,  load average: 2.16, 2.55, 2.62
Threads: 25390 total,   2 running, 25388 sleeping,   0 stopped,   0 zombie
%Cpu(s): 17.0 us,  4.7 sy,  0.0 ni, 74.0 id,  4.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32120452 total,   287936 free, 20438864 used, 11393652 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11485760 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                      
63120 root      20   0   26.8g   9.4g  22016 R 73.2 30.8 178:50.29 VM Thread                                                                                                                                                                    
63130 root      20   0   26.8g   9.4g  22016 S 15.7 30.8  45:19.81 JFR Periodic Ta                                                                                                                                                              
63114 root      20   0   26.8g   9.4g  22016 S  6.6 30.8  17:31.55 GC Thread#0                                                                                                                                                                  
63185 root      20   0   26.8g   9.4g  22016 S  6.6 30.8  44:59.01 Log4j2-TF-10-As                                                                                                                                                              
63346 root      20   0   26.8g   9.4g  22016 S  6.3 30.8  17:31.89 GC Thread#2                                                                                                                                                                  
63345 root      20   0   26.8g   9.4g  22016 S  6.0 30.8  17:31.70 GC Thread#1                                                                                                                                                                  
63347 root      20   0   26.8g   9.4g  22016 S  5.7 30.8  17:31.86 GC Thread#3                                                                                                                                                                  
63128 root      20   0   26.8g   9.4g  22016 S  5.4 30.8  13:35.61 JFR Recorder Th                                                                                                                                                              
63134 root      20   0   26.8g   9.4g  22016 S  4.2 30.8  13:21.16 java                                                                                                                                                                         
63569 root      20   0   26.8g   9.4g  22016 S  1.8 30.8  19:20.94 lettuce-epollEv                                                                                                                                                              
64742 root      20   0   26.8g   9.4g  22016 S  1.2 30.8   1:03.26 XNIO-2 task-6                                                                                                                                                                
 2777 root      20   0   26.8g   9.4g  22016 S  1.2 30.8   0:46.15 XNIO-2 task-24    
 。。。。。。

发现使用 CPU 最多的线程是 VM 线程,其次是 JFR 线程。怀疑是因为 JFR 采集的原因造成的所有线程一直进入安全点的问题。我们的 JFR 采集事件的配置是默认的配置,也就是 JDK 目录下的 lib 目录下的 jfr 目录下的 default.jfc 查看其中的线程 Dump 配置:

<event name="jdk.ThreadDump">
	<setting name="enabled" control="thread-dump-enabled">true</setting>
	<setting name="period" control="thread-dump-interval">everyChunk</setting>
</event>

发现默认是everyChunk,也就是每次进入新的Chunk会打印一次。那么什么是Chunk?啥时候进入新的Chunk?

image.png

在 JFR 中,所有的 Event (包括通过JFR API产生的 Event 还有 JVM 产生的 EVENT),会先存储到每个线程自己的 Thread Buffer 中;在这个 Buffer 满了之后,会将 Buffer 的内容刷入 Global Buffer 中;Global Buffer 是一个环形 Buffer,保存着所有线程发送来的 Thread Buffer 中的内容。当这个环形 Buffer 存储到达上限之后,根据配置,会选择丢弃或者刷入文件,这里默认是刷入文件。这个文件,就可以理解为Chunk。

Chunk的大小是有限制的,默认是12m.但是,这个进程的线程很多,导致线程 dump 超过了 12m。导致 JFR 一直切换新的 Chunk,然后又触发evenryChunk导致无限循环进入安全点。

发现问题之后,首先想到的是,减少线程数量。由于业务需要,这个业务的线程主要用来做业务隔离以及有序执行,同一时间内的 Running 线程数量并不会很多。只有 Running 的线程才会进行 CPU 上下文切换,创建很多线程,但是同一时间有任务的线程很少(Running 线程比较少),那么线程切换也不会很频繁。等待任务的线程基本不占用 CPU 资源。所以暂时先不考虑减少线程数量。

最后的解决方案是,对于 JFR 默认事件采集配置,每个采集周期和Chunk相关的,都要谨慎处理,最好周期通过固定时间写死,例如每一分钟等等,不要使用Chunk作为指标,防止出现某个采集雪崩导致不断切换chunk导致无限采集这个问题。


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK