【译】 Golang 中的垃圾回收（二）

这系列的文章如下

第一部分，我花了时间去描述了golang 垃圾收集器的行为，并说明了程序运行时候收集器所产生的延迟。我分享了怎么样去产生以及去解释GC traces。并展示了堆内内存如何改变，并且解释了GC不同阶段对延迟代价的影响。

最后的结论是，如果你减少了堆的压力，你就会减少延迟代价并提高了程序性能。我也指出了，降低收集速率，推迟收集，设法增加两次收集的时间间隔并不是很好的策略。即使是收集速率很快，但是以一致的速度进行，也能保持程序以最佳性能运行。

这部分里，我会带你通过一个实际的web应用并向你展示怎么样生成GC traces和程序状态。然后我会解释这些工具的输出内容，这样你可以找到方法去提高应用程序的性能。

运行应用

看一下go training的web应用程序

图1.1

github.com/ardanlabs/g…

图一是应用程序的样子。这个程序从不同的地方下载三个rss 订阅集合，并且允许用户去进行搜索。build之后，启动程序

L1

$ go build
$ GOGC=off ./project > /dev/null
复制代码

L1中我们设置了GOGC变量为off，也就是关闭垃圾回收。日志重定向到/dev/null。随着程序运行，我们可以向server中发送请求了。

L2

$ hey -m POST -c 100 -n 10000 "http://localhost:5000/search?term=topic&cnn=on&bbc=on&nyt=on"
复制代码

L2展示了通过hey使用100个连接发送了10k的请求。所有请求发送到服务端，这个过程结果产生如下结果

图1.2

图2.2给出了关闭GC处理requests的可视化的过程。10k请求的处理大概花费了4188ms，也就是服务端每秒处理大概2387次请求。

打开GC

如果开启GC的话会怎么样

L3

$ GODEBUG=gctrace=1 ./project > /dev/null
复制代码

L3中GOGC移除了，使用了GODEBUG变量，我们可以看到GC traces。 GODEBUG 设置之后runtime就会在每次收集的时候生成GC trace。现在我们再次跑同样10k的请求。所有请求发送到服务端，我们可以看到GC traces和 hey 提供的信息。

L4

$ GODEBUG=gctrace=1 ./project > /dev/null
gc 3 @3.182s 0%: 0.015+0.59+0.096 ms clock, 0.19+0.10/1.3/3.0+1.1 ms cpu, 4->4->2 MB, 5 MB goal, 12 P
.
.
.
gc 2553 @8.452s 14%: 0.004+0.33+0.051 ms clock, 0.056+0.12/0.56/0.94+0.61 ms cpu, 4->4->2 MB, 5 MB goal, 12 P
复制代码

L4中展示了从程序运行开始的第三次和最后一次的GC trace。我没展示前两次的collection因为在这些收集发生之后，负载信息才会通过服务发送。最后一次collection表明了处理请求一共产生了2551次收集（没算前两个）

下面是我对每次trace进行了拆分

L5

gc 2553 @8.452s 14%: 0.004+0.33+0.051 ms clock, 0.056+0.12/0.56/0.94+0.61 ms cpu, 4->4->2 MB, 5 MB goal, 12 P

gc 2553     : The 2553 GC runs since the program started
@8.452s     : Eight seconds since the program started
14%         : Fourteen percent of the available CPU so far has been spent in GC

// wall-clock
0.004ms     : STW        : Write-Barrier - Wait for all Ps to reach a GC safe-point.
0.33ms      : Concurrent : Marking
0.051ms     : STW        : Mark Term     - Write Barrier off and clean up.

// CPU time
0.056ms     : STW        : Write-Barrier
0.12ms      : Concurrent : Mark - Assist Time (GC performed in line with allocation)
0.56ms      : Concurrent : Mark - Background GC time
0.94ms      : Concurrent : Mark - Idle GC time
0.61ms      : STW        : Mark Term

4MB         : Heap memory in-use before the Marking started
4MB         : Heap memory in-use after the Marking finished
2MB         : Heap memory marked as live after the Marking finished
5MB         : Collection goal for heap memory in-use after Marking finished

// Threads
12P         : Number of logical processors or threads used to run Goroutines.
复制代码

L5展示了最后一次collection的实际数字。多亏 hey ,下面是我们看到的运行时的性能结果。

L6

Requests            : 10,000
------------------------------------------------------
Requests/sec        : 1,882 r/s   - Hey
Total Duration      : 5,311ms     - Hey
Percent Time in GC  : 14%         - GC Trace
Total Collections   : 2,551       - GC Trace
------------------------------------------------------
Total GC Duration   : 744.54ms    - (5,311ms * .14)
Average Pace of GC  : ~2.08ms     - (5,311ms / 2,551)
Requests/Collection : ~3.98 r/gc  - (10,000 / 2,511)
复制代码

L6给出了结果。下面给出了过程的可视化，来看一下发生了什么。

图1.3

图1.3给出了可视化的发生过程。开启回收器之后，相同10k的请求，它必须去处理大概~2.5k次collection。每次collection平均处理速度大概是2ms，处理这些collection增加了额外大概1.1秒的延迟。

图1.4

图1.4对比了两种情况下应用程序的运行情况。

减少分配

我们最好是可以看看堆内存的分配情况，来看看是否有可以去掉的non-productive的分配

L7

go tool pprof http://localhost:5000/debug/pprof/allocs
复制代码

L7使用了pprof去调用 /debug/pprof/allocs 接口，然后取从运行程序里拉取内存分配信息。这个接口的存在是因为用了下面的代码。

L8

import _ "net/http/pprof"

go func() {
    http.ListenAndServe("localhost:5000", http.DefaultServeMux)
}()
复制代码

导入 net/http/pprof 去绑定系统的默认服务。然后使用 http.ListenAndServer 携带 http.DefaultServerMux 去开启服务。

L9

(pprof) top 6 -cum
Showing nodes accounting for 0.56GB, 5.84% of 9.56GB total
Dropped 80 nodes (cum <= 0.05GB)
Showing top 6 nodes out of 51
      flat  flat%   sum%        cum   cum%
         0     0%     0%     4.96GB 51.90%  net/http.(*conn).serve
    0.49GB  5.11%  5.11%     4.93GB 51.55%  project/service.handler
         0     0%  5.11%     4.93GB 51.55%  net/http.(*ServeMux).ServeHTTP
         0     0%  5.11%     4.93GB 51.55%  net/http.HandlerFunc.ServeHTTP
         0     0%  5.11%     4.93GB 51.55%  net/http.serverHandler.ServeHTTP
    0.07GB  0.73%  5.84%     4.55GB 47.63%  project/search.rssSearch
复制代码

L9的底下出现了 rssSearch 的方法。这个方法分配了4.55GB的内存。接下来使用 list 去看一下这个方法的内部细节

L10

(pprof) list rssSearch
Total: 9.56GB
ROUTINE ======================== project/search.rssSearch in project/search/rss.go
   71.53MB     4.55GB (flat, cum) 47.63% of Total


         .          .    117:	// Capture the data we need for our results if we find ...
         .          .    118:	for _, item := range d.Channel.Items {
         .     4.48GB    119:		if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {
   48.53MB    48.53MB    120:			results = append(results, Result{
         .          .    121:				Engine:  engine,
         .          .    122:				Title:   item.Title,
         .          .    123:				Link:    item.Link,
         .          .    124:				Content: item.Description,
         .          .    125:			})
复制代码

L10列出了代码，第119行出现了大量的内存分配

L11

.     4.48GB    119:		if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {
复制代码

L11 给出了出现问题的代码。整个函数分配了4.55GB的数据，但这一行就分配了4.48GB。接下来看一下这一行代码做了什么，然后看看有哪里可以优化的地方。

L12

117 // Capture the data we need for our results if we find the search term.
118 for _, item := range d.Channel.Items {
119     if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {
120         results = append(results, Result{
121             Engine:  engine,
122             Title:   item.Title,
123             Link:    item.Link,
124             Content: item.Description,
125        })
126    }
127 }
复制代码

L12列出了tight loop里的代码。调用 strings.ToLower 会产生内存分配，因为需要创造新的strings，而且它们需要分配到堆内存上。在循环里调用 strings.ToLower 根本没有必要，因为完全可以在循环外面进行处理。

119行代码可以做一下调整来去掉不必要的内存分配。

L13

// Before the code change.
if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {

// After the code change.
if strings.Contains(item.Description, term) {
复制代码

注意：在把item塞进缓存之前已经把 item.Description 进行lower处理了，这部分代码没列出来。新塞进来item会每15分钟缓存一次。把 term 进行lower处理直接在循环外面进行就可以了

L13里给出了我们是怎么把 strings.ToLower 从tight loop中拿掉的。重新编译项目然后运行，然后再次拿10k个requests去请求。

L14

$ go build
$ GODEBUG=gctrace=1 ./project > /dev/null
gc 3 @6.156s 0%: 0.011+0.72+0.068 ms clock, 0.13+0.21/1.5/3.2+0.82 ms cpu, 4->4->2 MB, 5 MB goal, 12 P
.
.
.
gc 1404 @8.808s 7%: 0.005+0.54+0.059 ms clock, 0.060+0.47/0.79/0.25+0.71 ms cpu, 4->5->2 MB, 5 MB goal, 12 P
复制代码

L14中，在调整了代码之后，相同10K请求下，现在只有1402次的collections次数了。

L15

With Extra Allocations              Without Extra Allocations
======================================================================
Requests            : 10,000        Requests            : 10,000
----------------------------------------------------------------------
Requests/sec        : 1,882 r/s     Requests/sec        : 3,631 r/s
Total Duration      : 5,311ms       Total Duration      : 2,753 ms
Percent Time in GC  : 14%           Percent Time in GC  : 7%
Total Collections   : 2,551         Total Collections   : 1,402
----------------------------------------------------------------------
Total GC Duration   : 744.54ms      Total GC Duration   : 192.71 ms
Average Pace of GC  : ~2.08ms       Average Pace of GC  : ~1.96ms
Requests/Collection : ~3.98 r/gc    Requests/Collection : 7.13 r/gc
复制代码

L15展示了和上一次结果的对比。下面提供更加清晰的可视化图例来看看发生了什么。

图1.5

图1.5展示了对比图。这一次收集器运行少了1149次（1402 vs 2551）在相同10k请求情况下。GC的时间比率从14%降到了7%。collection节省了%74的时间，这让应用程序处理速度提高了48%。

图1.6

图1.6展示了4中不同情况的对比图。我加上了优化版本关闭gc的情况

学习到什么

在上一篇文章中提到，提升收集器的效率在于减少堆内存的压力。记住，压力可以定义为：在给定时间内，应用程序在可用堆内存上的分配有多快。当压力减少，收集器产生的延时就会减少。这些延迟会拖慢你的应用程序。

L16

With Extra Allocations              Without Extra Allocations
======================================================================
Requests            : 10,000        Requests            : 10,000
----------------------------------------------------------------------
Requests/sec        : 1,882 r/s     Requests/sec        : 3,631 r/s
Total Duration      : 5,311ms       Total Duration      : 2,753 ms
Percent Time in GC  : 14%           Percent Time in GC  : 7%
Total Collections   : 2,551         Total Collections   : 1,402
----------------------------------------------------------------------
Total GC Duration   : 744.54ms      Total GC Duration   : 192.71 ms
Average Pace of GC  : ~2.08ms       Average Pace of GC  : ~1.96ms
Requests/Collection : ~3.98 r/gc    Requests/Collection : 7.13 r/gc
复制代码

L16是上面两个带GC版本的程序运行对比。很明显移除了4.48GB的不必要分配使程序运行更快。有趣的事情是，两个版本里，每次垃圾收集的平均速度时间几乎是相同的，大概都是在2ms左右。我们改变的事情是在于每次收集之间做了更多的work处理。应用程序从3.98r/gc提升到7.13r/gc(请求次数/gc)。工作量提高了79.1%。

在两次收集之间处理更多work会帮助减少收集的次数，可以看到gc次数从2551减少到了1402，大概45%的降低。应用程序的gc时间减少了74%，从745ms减少到193ms，也就是总时间的14%减少到7%。如果你运行关闭gc的优化版本代码，性能差异只有13%，应用程序处理时间从2753ms降到2398ms。

结论

如果你花时间专注于减少分配，作为go开发者的你也就是在提高gc的效率。你无法写出0分配的程序，因此区分生产性的分配（利于程序运行）和非生产性的内存分配(损害性能)是很重要的事情。之后你就可以完全相信gc去帮你处理内存管理的事情了。

gc是一个好的折衷方式。我花一点时间去进行gc处理，得到的是我不需要再去关系内存的管理了。go会让开发者更有效率而且应用程序依旧可以足够快的运行，这一切都归功于golang的garbage collector

运行应用

图1.1

L1

L2

图1.2

打开GC

L3

L4

L5

L6

图1.3

图1.4

减少分配

L7

L8

L9

L10

L11

L12

L13

L14

L15

图1.5

图1.6

学习到什么

L16

结论

Recommend

SwiftUI ActionSheet Tutorial

Malicious Python libraries targeting Linux servers removed from PyPI

Grin Network Executes First Hard Fork in Bid to Decentralize Mining Power

又拍云主办全球行业互联网技术峰会开幕

打造工业级推荐系统（八）：AB 测试平台的工程实现

Better branching with Lodash cond (pattern matching)

Bystack跨链技术源码解读

使用NSQ(附Golang代码)

需要一份简单直观的 PPT？你可以用 Marp「写」出来

[golang] 使用mysql及redis实现简单的分布式锁

About Joyk