
An analysis of a major performance drop in a Go HTTP framework

Source: https://studygolang.com/articles/15602

I have recently been building a web framework. While adopting it, one of the business teams told us that their load-test QPS would not go up. That puzzled me: with httprouter plus net/http's http.Server, performance cannot possibly be that bad; benchmarks published online all show 100k+ QPS. Could a few middlewares really drag performance down that much? After some digging, I found something interesting.

First, I load-tested a plain hello world, without writing a single log line per request, and turned on pprof. The profile (shown as a screenshot in the original post) looked like this:

Why is futex so high here? Seeing addtimer and deltimer among the operations above it reminded me of a timer I had implemented myself; this was most likely caused by timeouts. I checked the Go version: go1.9. The framework sets four timeouts on every connection by default, ReadTimeout, WriteTimeout, IdleTimeout and ReadHeaderTimeout, which puts heavy lock pressure on the timer code whenever those callbacks are added and removed.
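For reference, here is a minimal sketch of how such a profile can be collected; the original post only shows the resulting profile, not how it was gathered, so the side port 6060 and the pprof command below are my own placeholders, not taken from the article.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // In the real service this goroutine would sit next to the framework's own
    // listener; here it is the whole program for brevity.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    select {} // block forever
}

// While the load test runs against the service, collect a 30-second CPU profile:
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// Samples attributed to futex, addtimer and deltimer are what the rest of this
// article is about.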

Let's walk through what actually happens when a timeout is set. Below is the simplest possible example: for safety, every connection gets timeouts.

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    "github.com/julienschmidt/httprouter"
)

func Index(w http.ResponseWriter, r *http.Request, _ httprouter.Params) {
    fmt.Fprint(w, "Welcome!\n")
}

func Hello(w http.ResponseWriter, r *http.Request, ps httprouter.Params) {
    fmt.Fprintf(w, "hello, %s!\n", ps.ByName("name"))
}

func main() {
    router := httprouter.New()
    router.GET("/", Index)
    router.GET("/hello/:name", Hello)

    srv := &http.Server{
        ReadTimeout:       5 * time.Second,
        WriteTimeout:      10 * time.Second,
        ReadHeaderTimeout: 10 * time.Second,
        IdleTimeout:       10 * time.Second,
        Addr:              "0.0.0.0:8998",
        Handler:           router,
    }

    log.Fatal(srv.ListenAndServe())
}

Here, after ListenAndServe() accepts each connection it runs (*conn).serve() for it, which, depending on whether timeouts are configured, calls conn.SetReadDeadline and related functions. The corresponding code is in net/http/server.go:

// Serve a new connection.
func (c *conn) serve(ctx context.Context) {
    ...
    if tlsConn, ok := c.rwc.(*tls.Conn); ok {
        if d := c.server.ReadTimeout; d != 0 {
            c.rwc.SetReadDeadline(time.Now().Add(d)) // set the read deadline
        }
        if d := c.server.WriteTimeout; d != 0 {
            c.rwc.SetWriteDeadline(time.Now().Add(d)) // set the write deadline
        }
        if err := tlsConn.Handshake(); err != nil {
            c.server.logf("http: TLS handshake error from %s: %v", c.rwc.RemoteAddr(), err)
            return
        }
        c.tlsState = new(tls.ConnectionState)
        *c.tlsState = tlsConn.ConnectionState()
        if proto := c.tlsState.NegotiatedProtocol; validNPN(proto) {
            if fn := c.server.TLSNextProto[proto]; fn != nil {
                h := initNPNRequest{tlsConn, serverHandler{c.server}}
                fn(c.server, tlsConn, h)
            }
            return
        }
    }
    ...

After that, conn.SetReadDeadline ends up calling fd.setReadDeadline in internal/poll/fd_poll_runtime.go, which finally calls poll_runtime_pollSetDeadline in runtime/netpoll.go; via go:linkname, that function is exposed as internal/poll.runtime_pollSetDeadline. This function is the key piece:

//go:linkname poll_runtime_pollSetDeadline internal/poll.runtime_pollSetDeadline
func poll_runtime_pollSetDeadline(pd *pollDesc, d int64, mode int) {
    lock(&pd.lock)
    if pd.closing {
        unlock(&pd.lock)
        return
    }
    pd.seq++ // invalidate current timers
    // Reset current timers.
    if pd.rt.f != nil {
        deltimer(&pd.rt)
        pd.rt.f = nil
    }
    if pd.wt.f != nil {
        deltimer(&pd.wt)
        pd.wt.f = nil
    }
    // Setup new timers.
    if d != 0 && d <= nanotime() {
        d = -1
    }
    if mode == 'r' || mode == 'r'+'w' {
        pd.rd = d
    }
    if mode == 'w' || mode == 'r'+'w' {
        pd.wd = d
    }
    if pd.rd > 0 && pd.rd == pd.wd {
        pd.rt.f = netpollDeadline
        pd.rt.when = pd.rd
        // Copy current seq into the timer arg.
        // Timer func will check the seq against current descriptor seq,
        // if they differ the descriptor was reused or timers were reset.
        pd.rt.arg = pd
        pd.rt.seq = pd.seq
        addtimer(&pd.rt)
    } else {
        if pd.rd > 0 {
            pd.rt.f = netpollReadDeadline // set the read deadline callback
            pd.rt.when = pd.rd
            pd.rt.arg = pd
            pd.rt.seq = pd.seq
            addtimer(&pd.rt) // add to the runtime's timers
        }
        if pd.wd > 0 {
            pd.wt.f = netpollWriteDeadline // set the write deadline callback
            pd.wt.when = pd.wd
            pd.wt.arg = pd
            pd.wt.seq = pd.seq
            addtimer(&pd.wt) // add to the runtime's timers
        }
    }
    // If we set the new deadline in the past, unblock currently pending IO if any.
    var rg, wg *g
    atomicstorep(unsafe.Pointer(&wg), nil) // full memory barrier between stores to rd/wd and load of rg/wg in netpollunblock
    if pd.rd < 0 {
        rg = netpollunblock(pd, 'r', false)
    }
    if pd.wd < 0 {
        wg = netpollunblock(pd, 'w', false)
    }
    unlock(&pd.lock)
    if rg != nil {
        netpollgoready(rg, 3)
    }
    if wg != nil {
        netpollgoready(wg, 3)
    }
}

The main work here is to clear the previous timers and then install new ones, with the callback set to netpollReadDeadline or netpollWriteDeadline. As you can see, the add and remove operations are addtimer(&pd.rt) and deltimer(&pd.rt).
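To connect this back to user code, here is an illustrative sketch (the helper name resetDeadlines is made up, not from the original): every Set*Deadline call on a net.Conn funnels into poll_runtime_pollSetDeadline above, i.e. a deltimer plus an addtimer. With ReadTimeout, WriteTimeout, IdleTimeout and ReadHeaderTimeout all configured, net/http issues several such calls per request.

package conndemo

import (
    "net"
    "time"
)

// resetDeadlines is a hypothetical helper: each Set*Deadline call below ends up
// in poll_runtime_pollSetDeadline, doing a deltimer plus an addtimer, which on
// go1.9 both take the single global timers lock.
func resetDeadlines(c net.Conn, read, write time.Duration) error {
    if err := c.SetReadDeadline(time.Now().Add(read)); err != nil {
        return err
    }
    return c.SetWriteDeadline(time.Now().Add(write))
}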

Now for the core of the problem: why does everything slow down once timeouts are added? Look at the implementation of addtimer. The timers live in a four-ary min-heap, and every insertion has to lock a single global timers structure. At high QPS, with several of these lock acquisitions per request, how fast can it possibly be?

type timer struct {
    i int // heap index

    // Timer wakes up at when, and then at when+period, ... (period > 0 only)
    // each time calling f(arg, now) in the timer goroutine, so f must be
    // a well-behaved function and not block.
    when   int64
    period int64
    f      func(interface{}, uintptr)
    arg    interface{}
    seq    uintptr
}

var timers struct {
    lock         mutex
    gp           *g
    created      bool
    sleeping     bool
    rescheduling bool
    sleepUntil   int64
    waitnote     note
    t            []*timer
}

// add a timer
func addtimer(t *timer) {
    lock(&timers.lock)
    addtimerLocked(t)
    unlock(&timers.lock)
}
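A rough way to see this contention from user space (this benchmark is my own illustration, not from the original post): creating and stopping timers from many goroutines exercises the same addtimer/deltimer path that a per-connection deadline does, and on go1.9 every iteration serializes on timers.lock.

package timerdemo

import (
    "testing"
    "time"
)

// BenchmarkTimerChurn creates and immediately stops a timer per iteration,
// which is roughly what setting and resetting a connection deadline does
// (addtimer on set, deltimer on reset). Run with -cpu to vary parallelism.
func BenchmarkTimerChurn(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            t := time.NewTimer(time.Hour) // addtimer under the hood
            t.Stop()                      // deltimer under the hood
        }
    })
}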

How should the lock contention be solved? Sharding the lock is a very common approach. Starting with go1.10, the single timers structure became 64 of them, so timers are spread over 64 locks and contention naturally drops. Looking at go1.10's runtime/time.go, the definitions are as follows: each P works with its own timer bucket, and one bucket may be shared by several Ps:

// Package time knows the layout of this structure.
// If this struct changes, adjust ../time/sleep.go:/runtimeTimer.
// For GOOS=nacl, package syscall knows the layout of this structure.
// If this struct changes, adjust ../syscall/net_nacl.go:/runtimeTimer.
type timer struct {
    tb *timersBucket // the bucket the timer lives in
    i  int           // heap index

    // Timer wakes up at when, and then at when+period, ... (period > 0 only)
    // each time calling f(arg, now) in the timer goroutine, so f must be
    // a well-behaved function and not block.
    when   int64
    period int64
    f      func(interface{}, uintptr)
    arg    interface{}
    seq    uintptr
}

// timersLen is the length of timers array.
//
// Ideally, this would be set to GOMAXPROCS, but that would require
// dynamic reallocation
//
// The current value is a compromise between memory usage and performance
// that should cover the majority of GOMAXPROCS values used in the wild.
const timersLen = 64 // 64 buckets

// timers contains "per-P" timer heaps.
//
// Timers are queued into timersBucket associated with the current P,
// so each P may work with its own timers independently of other P instances.
//
// Each timersBucket may be associated with multiple P
// if GOMAXPROCS > timersLen.
var timers [timersLen]struct {
    timersBucket

    // The padding should eliminate false sharing
    // between timersBucket values.
    pad [sys.CacheLineSize - unsafe.Sizeof(timersBucket{})%sys.CacheLineSize]byte
}
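The part that actually removes the contention is how a timer picks its bucket. Paraphrasing go1.10's runtime/time.go (slightly simplified, so treat this as a sketch rather than the exact source), addtimer now locks only the bucket belonging to the current P instead of one global structure:

// Paraphrased from go1.10 runtime/time.go (simplified).
func (t *timer) assignBucket() *timersBucket {
    id := uint8(getg().m.p.ptr().id) % timersLen // bucket of the current P
    t.tb = &timers[id].timersBucket
    return t.tb
}

func addtimer(t *timer) {
    tb := t.assignBucket()
    lock(&tb.lock) // only this bucket's lock, not a global timers lock
    tb.addtimerLocked(t)
    unlock(&tb.lock)
}

Since different Ps usually land on different buckets, concurrent addtimer/deltimer calls rarely fight over the same lock.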

The original post also includes a diagram of the go1.10 timer data structure (image taken from the web, not reproduced here).

To sum up: many httpserver framework benchmarks online report very high QPS, but their demos do not set any timeouts, so real-world numbers can differ a lot. If you need timeouts in production, pay attention to the Go version; at high QPS it is best to use 1.10 or later. In our case, without changing anything else, simply upgrading Go to 1.10 nearly doubled the QPS.
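For comparison, the kind of demo most published benchmarks use looks roughly like the sketch below: no Read/Write/Idle/ReadHeader timeouts at all, so no deadline timers are ever created. This is a minimal illustration, not taken from any specific benchmark.

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "Welcome!\n")
    })
    // No timeouts configured on the server, so SetReadDeadline/SetWriteDeadline
    // are never called and addtimer/deltimer never run on the hot path.
    log.Fatal(http.ListenAndServe("0.0.0.0:8998", nil))
}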
