12

ansible hang in docker container

 3 years ago
source link: https://zhangguanzhang.github.io/2020/11/23/ansible-hang-in-docker/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

这几天同事发现在 docker 容器里运行 ansible 命令很卡,发来了个命令叫我试试 ansible localhost -m setup -a 'filter=ansible_default_ipv4' 2>/dev/null |grep '\"address\"' |awk -F'\"' '{print $4}'

环境信息

$ cat /etc/os-release 
NAME="Kylin Linux Advanced Server"
VERSION="V10 (Tercel)"
ID="kylin"
VERSION_ID="V10"
PRETTY_NAME="Kylin Linux Advanced Server V10 (Tercel)"
ANSI_COLOR="0;31"

$ uname -a
Linux reg.wps.lan 4.19.90-17.ky10.aarch64 #1 SMP Sun Jun 28 14:27:40 CST 2020 aarch64 aarch64 aarch64 GNU/Linux

$ docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.09.9
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.19.90-17.ky10.aarch64
Operating System: Kylin Linux Advanced Server V10 (Tercel)
OSType: linux
Architecture: aarch64
CPUs: 64
Total Memory: 62.76GiB
Name: reg.wps.lan
ID: 3ZQD:MZWN:FNR4:5HEE:F57N:3BLD:EP3T:LJT7:NWEJ:3TZ3:IEBD:KSHZ
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 treg.yun.wps.cn
 reg.wps.lan:5000
 127.0.0.0/8
Registry Mirrors:
 https://registry.docker-cn.com/
 https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine

容器里的 ansible 和 python 信息

$ ansible --version
ansible 2.8.6
  config file = None
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 2.7.12 (default, Apr 15 2020, 17:07:12) [GCC 5.4.0 20160609]
$ python --version
Python 2.7.12

他说如果用麒麟的 rpm 包安装 docker 就没问题,用我们的拷贝二进制文件安装的 docker 起的容器里就不行。一开始是怀疑 setup 模块在收集某些信息的时候阻塞了,后面我试了下这样也会卡住 ansible localhost -m shell -a date

排查过程

带上了 -vvvv 发现卡在下面的输出这:

<127.0.0.1> PUT /root/.ansible/tmp/ansible-local-466216tkG5t/tmpML7hBj TO /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py
<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/ /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0'

搜了下可以 export ANSIBLE_DEBUG=True 打印更详细的日志,打印了下面的日志:

23918 1606128393.94311: ANSIBALLZ: Done creating module
23918 1606128393.94465: _low_level_execute_command(): starting
23918 1606128393.94493: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0'
23918 1606128393.94515: in local.exec_command()
23918 1606128393.94528: opening command with Popen()
23918 1606128393.94979: done running command with Popen()
23918 1606128393.95005: getting output with communicate()

看样子是子进程卡住了,主进程等子进程。因为容器的进程实际上也是在宿主机上的,宿主机上安装了 strace,查看下进程:

$ ps aux | grep ansible
root     3020177  2.7  0.0 129408 48960 pts/0    Sl+  19:12   0:26 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls
root     3020189  0.0  0.0 134912 50368 pts/0    S+   19:12   0:00 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls
root     3020216  0.0  0.0   2368   768 pts/0    S+   19:12   0:00 /bin/sh -c /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0'
root     3020217  0.0  0.0   2368   768 pts/0    S+   19:12   0:00 /bin/sh -c /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0
root     3020218  0.0  0.0  19776 14336 pts/0    S+   19:12   0:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py
root     3020219  9.3  0.0  18688 10816 pts/0    t+   19:12   1:27 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py
root     3050341  0.3  0.0   8832  4544 ?        Ss   19:28   0:00 ssh: /root/.ansible/cp/5a0929d121 [mux]
root     3052724  0.0  0.0 214080  1536 pts/7    S+   19:28   0:00 grep ansible

strace 下 3020219 :

$ strace -p 3020219
close(76267956)                         = -1 EBADF (错误的文件描述符)
close(76267957)                         = -1 EBADF (错误的文件描述符)
close(76267958)                         = -1 EBADF (错误的文件描述符)
close(76267959)                         = -1 EBADF (错误的文件描述符)
close(76267960)                         = -1 EBADF (错误的文件描述符)
close(76267961)                         = -1 EBADF (错误的文件描述符)
close(76267962)                         = -1 EBADF (错误的文件描述符)
close(76267963)                         = -1 EBADF (错误的文件描述符)
close(76267964)                         = -1 EBADF (错误的文件描述符)
close(76267965)                         = -1 EBADF (错误的文件描述符)
close(76267966)                         = -1 EBADF (错误的文件描述符)
close(76267967)                         = -1 EBADF (错误的文件描述符)
close(76267968)                         = -1 EBADF (错误的文件描述符)
close(76267969)                         = -1 EBADF (错误的文件描述符)
close(76267970)                         = -1 EBADF (错误的文件描述符)
close(76267971)                         = -1 EBADF (错误的文件描述符)
close(76267972^C)                         = -1 EBADF (错误的文件描述符)
strace: Process 3020219 detached

终端一直刷上面的,看样子是文件描述符泄露, 错误的文件描述符 英文就是 Bad file descriptor , 谷歌搜了下 in "docker" (Bad file descriptor) strace ,找到了 Spawning PTY processes is many times slower on Docker 18.09 里几位大佬排查到是容器的 nofile 太高就会卡,如果启动容器 nofile 设置低则没问题,下面还有个大佬给了个链接在 python 层面修复这个问题 python/cpython#11584 ,还有下面的解决办法,直接设置默认的 limit:

$ cat /etc/docker/daemon.json-ulimits
{
	"default-ulimits": {
		"nofile": {
			"Name": "nofile",
			"Hard": 1024,
			"Soft": 1024
		},
		"nproc": {
			"Name": "nproc",
			"Soft": 65536,
			"Hard": 65536
		}
	}
}

改配置怕影响其他容器,就决定从 python 层面改,看了下提交的 pr python/cpython@5626fff 是改的 subprocess.py ,在容器里查找下:

find / -type f -name subprocess.py 
/usr/lib/python2.7/subprocess.py
/usr/lib/python3.5/asyncio/subprocess.py
/usr/lib/python3.5/subprocess.py
/usr/local/lib/python2.7/dist-packages/future/moves/subprocess.py
/usr/local/lib/python2.7/dist-packages/gevent/subprocess.py

对比了下函数名 def _close_fds(self, but): ,确定是 /usr/lib/python2.7/subprocess.py ,把 pr 的内容加进去后再执行:

$ time ansible localhost -m shell -a date
[WARNING]: No inventory was parsed, only implicit localhost is available


localhost | CHANGED | rc=0 >>
2020年 11月 23日 星期一 19:38:14 CST


real	0m3.088s
user	0m2.860s
sys	  0m0.255s

上面的几位大佬给出了其他的解决方案,也可以在 containerd 配置文件里配置把 nofile 固定住,或者 docker daemon,或者应用的软件层面修复。

参考


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK