ansible hang in docker container
source link: https://zhangguanzhang.github.io/2020/11/23/ansible-hang-in-docker/
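Over the past few days a colleague noticed that ansible commands ran extremely slowly inside a docker container, and sent me this command to try:
ansible localhost -m setup -a 'filter=ansible_default_ipv4' 2>/dev/null |grep '"address"' |awk -F'"' '{print $4}'
Environment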
$ cat /etc/os-release
NAME="Kylin Linux Advanced Server"
VERSION="V10 (Tercel)"
ID="kylin"
VERSION_ID="V10"
PRETTY_NAME="Kylin Linux Advanced Server V10 (Tercel)"
ANSI_COLOR="0;31"
$ uname -a
Linux reg.wps.lan 4.19.90-17.ky10.aarch64 #1 SMP Sun Jun 28 14:27:40 CST 2020 aarch64 aarch64 aarch64 GNU/Linux
$ docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.09.9
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.19.90-17.ky10.aarch64
Operating System: Kylin Linux Advanced Server V10 (Tercel)
OSType: linux
Architecture: aarch64
CPUs: 64
Total Memory: 62.76GiB
Name: reg.wps.lan
ID: 3ZQD:MZWN:FNR4:5HEE:F57N:3BLD:EP3T:LJT7:NWEJ:3TZ3:IEBD:KSHZ
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 treg.yun.wps.cn
 reg.wps.lan:5000
 127.0.0.0/8
Registry Mirrors:
 https://registry.docker-cn.com/
 https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine
ansible and python versions inside the container
$ ansible --version
ansible 2.8.6
  config file = None
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 2.7.12 (default, Apr 15 2020, 17:07:12) [GCC 5.4.0 20160609]
$ python --version
Python 2.7.12
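He said the problem does not occur when docker is installed from Kylin's rpm package, only in containers started by the docker we install by copying the binaries. At first I suspected that the setup module was blocking while gathering some piece of information, but then I found that this simple command hangs as well:
ansible localhost -m shell -a date
Troubleshooting
I added -vvvv and found that it got stuck at the following output: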
<127.0.0.1> PUT /root/.ansible/tmp/ansible-local-466216tkG5t/tmpML7hBj TO /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py
<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/ /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0'
A quick search showed that setting export ANSIBLE_DEBUG=True prints more detailed logs. With it enabled, the last lines printed were:
23918 1606128393.94311: ANSIBALLZ: Done creating module
23918 1606128393.94465: _low_level_execute_command(): starting
23918 1606128393.94493: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0'
23918 1606128393.94515: in local.exec_command()
23918 1606128393.94528: opening command with Popen()
23918 1606128393.94979: done running command with Popen()
23918 1606128393.95005: getting output with communicate()
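It looks like a child process is stuck and the main process is waiting for it. Processes in a container are really just processes on the host, and the host has strace installed, so I looked at the processes from there: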
$ ps aux | grep ansible
root 3020177 2.7 0.0 129408 48960 pts/0 Sl+ 19:12 0:26 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls
root 3020189 0.0 0.0 134912 50368 pts/0 S+ 19:12 0:00 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls
root 3020216 0.0 0.0 2368 768 pts/0 S+ 19:12 0:00 /bin/sh -c /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0'
root 3020217 0.0 0.0 2368 768 pts/0 S+ 19:12 0:00 /bin/sh -c /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0
root 3020218 0.0 0.0 19776 14336 pts/0 S+ 19:12 0:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py
root 3020219 9.3 0.0 18688 10816 pts/0 t+ 19:12 1:27 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py
root 3050341 0.3 0.0 8832 4544 ? Ss 19:28 0:00 ssh: /root/.ansible/cp/5a0929d121 [mux]
root 3052724 0.0 0.0 214080 1536 pts/7 S+ 19:28 0:00 grep ansible
Attaching strace to 3020219, the only one of these processes that is using CPU:
$ strace -p 3020219
close(76267956) = -1 EBADF (错误的文件描述符)
close(76267957) = -1 EBADF (错误的文件描述符)
close(76267958) = -1 EBADF (错误的文件描述符)
close(76267959) = -1 EBADF (错误的文件描述符)
close(76267960) = -1 EBADF (错误的文件描述符)
close(76267961) = -1 EBADF (错误的文件描述符)
close(76267962) = -1 EBADF (错误的文件描述符)
close(76267963) = -1 EBADF (错误的文件描述符)
close(76267964) = -1 EBADF (错误的文件描述符)
close(76267965) = -1 EBADF (错误的文件描述符)
close(76267966) = -1 EBADF (错误的文件描述符)
close(76267967) = -1 EBADF (错误的文件描述符)
close(76267968) = -1 EBADF (错误的文件描述符)
close(76267969) = -1 EBADF (错误的文件描述符)
close(76267970) = -1 EBADF (错误的文件描述符)
close(76267971) = -1 EBADF (错误的文件描述符)
close(76267972^C) = -1 EBADF (错误的文件描述符)
strace: Process 3020219 detached
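The terminal kept scrolling with lines like these: the process is calling close() on one descriptor number after another, already in the tens of millions, and every call fails. At first glance it looks like a file descriptor leak. 错误的文件描述符 is the localized form of Bad file descriptor, so I googled in "docker" (Bad file descriptor) strace and found the issue "Spawning PTY processes is many times slower on Docker 18.09". The people there had tracked the problem down: things hang when the container's nofile limit is very high, and everything is fine if the container is started with a low nofile. Further down, someone linked a fix at the Python level, python/cpython#11584, and there was also the following workaround, which sets the default limits directly in the docker daemon configuration: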
$ cat /etc/docker/daemon.json-ulimits
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 1024,
      "Soft": 1024
    },
    "nproc": {
      "Name": "nproc",
      "Soft": 65536,
      "Hard": 65536
    }
  }
}
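To get a feel for why a very high nofile limit hurts, it helps to check the limit the process actually sees and to time a brute-force pass of close() calls like the one in the strace output. The following is only a rough sketch for Python 3 on Linux, not code from Ansible or CPython: `nofile_limits`, `highest_open_fd` and `time_close_attempts` are hypothetical helper names, and the loop deliberately starts above the highest descriptor currently open so that it never closes anything real.

```python
import os
import resource
import time


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values seen by this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def highest_open_fd():
    """Highest descriptor number listed in /proc/self/fd (Linux only)."""
    return max(int(name) for name in os.listdir("/proc/self/fd"))


def time_close_attempts(count):
    """Call close() on `count` unused descriptor numbers.

    Returns (elapsed_seconds, failures). Every call is expected to fail
    with EBADF, which is what the strace output above shows.
    """
    start = highest_open_fd() + 1  # stay above anything that is really open
    failures = 0
    t0 = time.perf_counter()
    for fd in range(start, start + count):
        try:
            os.close(fd)
        except OSError:
            failures += 1
    return time.perf_counter() - t0, failures


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft nofile:", soft, "hard nofile:", hard)
    sample = 100000
    elapsed, failures = time_close_attempts(sample)
    print("%d failed close() calls took %.3fs" % (failures, elapsed))
    if soft != resource.RLIM_INFINITY:
        # Very rough extrapolation to a full pass over the soft limit.
        print("estimated full pass: %.1fs" % (elapsed * soft / sample))
```

Running this inside the affected container and inside one started with a low nofile should make the difference in the estimated full pass obvious. The absolute numbers include Python loop overhead, so treat them as an order-of-magnitude hint only.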
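I was worried that changing the daemon configuration would affect other containers, so I decided to fix it at the Python level instead. The commit for that PR, python/cpython@5626fff, changes subprocess.py, so I searched for that file inside the container: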
$ find / -type f -name subprocess.py
/usr/lib/python2.7/subprocess.py
/usr/lib/python3.5/asyncio/subprocess.py
/usr/lib/python3.5/subprocess.py
/usr/local/lib/python2.7/dist-packages/future/moves/subprocess.py
/usr/local/lib/python2.7/dist-packages/gevent/subprocess.py
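Comparing against the function signature def _close_fds(self, but): confirmed that the file to patch was /usr/lib/python2.7/subprocess.py. After adding the PR's changes to it, I ran the command again: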
$ time ansible localhost -m shell -a date
[WARNING]: No inventory was parsed, only implicit localhost is available
localhost | CHANGED | rc=0 >>
2020年 11月 23日 星期一 19:38:14 CST

real 0m3.088s
user 0m2.860s
sys 0m0.255s
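A common way to avoid the brute-force loop on Linux is to ask the kernel which descriptors are actually open and close only those. The sketch below illustrates that general idea for Python 3; it is not the code from the pull request, and `open_fds` and `close_fds_from` are hypothetical names introduced here. It assumes /proc/self/fd is available.

```python
import os


def open_fds():
    """Return the descriptor numbers currently open in this process (Linux only)."""
    fds = []
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            os.fstat(fd)
        except OSError:
            # Typically the short-lived descriptor used to list the directory.
            continue
        fds.append(fd)
    return sorted(fds)


def close_fds_from(lowfd, keep=()):
    """Close every open descriptor >= lowfd except those in `keep`.

    The cost depends on how many descriptors are open, not on how high
    the nofile limit is. Returns the list of descriptors that were closed.
    """
    closed = []
    for fd in open_fds():
        if fd < lowfd or fd in keep:
            continue
        try:
            os.close(fd)
            closed.append(fd)
        except OSError:
            pass
    return closed
```

In a freshly forked child, something like `close_fds_from(3)` would play the role of the brute-force loop, leaving stdin, stdout and stderr alone.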
The people in that issue also suggested other solutions: pin nofile in the containerd configuration file, set it in the docker daemon, or fix it at the application level.
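As one possible application-level variant, the soft nofile limit can be lowered just for the command being launched, which leaves the daemon configuration and other containers untouched. This is a sketch under assumptions rather than something verified on the environment above: `run_with_nofile` is a hypothetical helper for Python 3, and the default of 1024 simply mirrors the value from the daemon.json example.

```python
import resource
import subprocess
import sys


def run_with_nofile(cmd, nofile=1024):
    """Run `cmd` with its soft RLIMIT_NOFILE capped at `nofile`.

    The limit is changed only in the child (via preexec_fn), so the
    calling process keeps its own limits. The hard limit is left as is.
    """
    def lower_limit():
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft == resource.RLIM_INFINITY or soft > nofile:
            soft = nofile
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    return subprocess.run(
        cmd,
        preexec_fn=lower_limit,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )


if __name__ == "__main__":
    # Show the soft limit a child process ends up with.
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    print(run_with_nofile([sys.executable, "-c", probe]).stdout.strip())
```

For the case in this article the call would look like `run_with_nofile(["ansible", "localhost", "-m", "shell", "-a", "date"])`. The limit is inherited by every process ansible spawns, which has the same effect as running `ulimit -n 1024` in the shell before the command.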