
Compiling tensorflow-serving for GPU

source link: http://yizhanggou.top/untitled-7/

Background

The business team deployed a BERT model, so I needed to build a GPU version of the TFServing sidecar image.

Compilation

I found an approach that compiles everything directly through Docker, without installing anything on the host: use the devel image.

The devel image

The TFServing devel image ships with all the components needed to compile TensorFlow Serving (bazel, gcc, glibc, and so on), which makes it very large. Once the build is done, we copy the binary into the non-devel image and run it there.

Pull the images from Docker Hub: https://hub.docker.com/r/tensorflow/serving/tags?page=1&name=2.0

Pull these two:

tensorflow/serving   2.0.0-gpu           af288d8e0730        11 months ago       2.49GB
tensorflow/serving   2.0.0-devel-gpu     111028dae1da        11 months ago       11.8GB
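
That is:

docker pull tensorflow/serving:2.0.0-gpu
docker pull tensorflow/serving:2.0.0-devel-gpu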

Run

Start the container and get a shell in it:

docker run -itd --name tfs --network=host tensorflow/serving:2.0.0-devel-gpu /bin/bash
docker exec -it tfs /bin/bash

Modify the code inside the container, then compile:

bazel build -c opt --config=cuda //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures
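
If the machine has limited memory, it may be worth capping Bazel's parallelism up front with the standard --jobs flag (my own suggestion, not from the original walkthrough; it mitigates the OOM problem described below):

bazel build -c opt --config=cuda --jobs=4 //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures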

Problems

You will still hit all kinds of problems during compilation; here are a few representative ones:

no such package

During the build you will hit errors like the following several times:

ERROR: /tensorflow-serving/tensorflow_serving/model_servers/BUILD:318:1: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it and referenced by '//tensorflow_serving/model_servers:server_lib'
ERROR: Analysis of target '//tensorflow_serving/model_servers:tensorflow_model_server' failed; build aborted: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it
INFO: Elapsed time: 1346.453s

Solution: retry a few times (see the sketch below), or use one of the following two methods.
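
Bazel keeps external archives that finished downloading, so each retry makes progress. A crude retry loop (a sketch around the same build command as above):

until bazel build -c opt --config=cuda //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures; do
    echo "build failed, retrying..."
    sleep 5
done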

Set up a file server on the host

Set it up with nginx:

vim /usr/local/etc/nginx/nginx.conf
http {
    autoindex on;
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;

    keepalive_timeout  65;
    server {
        listen       8001;
        server_name  127.0.0.1;

        location / {
            root   <your_path>;
            index  index.html index.htm;
        }
    }
}
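
To actually use this mirror, pre-download the failing tarball into the nginx root on the host, then point the matching http_archive URL inside the container at the local server. A sketch; which WORKSPACE/.bzl file declares the grpc URL depends on the TF version, so the grep below is how I would locate it rather than a known path:

# on the host, inside <your_path> (the nginx root configured above):
curl -LO https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz

# in the container, find where that URL is declared:
grep -rl '4566c2a29ebec0835643b972eb99f4306c4234a3' /tensorflow-serving
# then edit the url to http://<host-ip>:8001/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz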

Use a proxy on the host

  1. Find the host's IP:
(base) ➜  bin ifconfig | grep "inet " | grep -v 127.0.0.1
	inet xxx.xxx.xxx.xxx netmask 0xfffffff0 broadcast xxx.xxx.xxx.xxx
	inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx
	inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx
  2. After entering the container, set the proxy:
export ALL_PROXY='socks5://xxx.xxx.xxx.xxx:1080'
  3. Check that it took effect:
curl cip.cc
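
One caveat: curl honors ALL_PROXY, but Bazel's downloader reads HTTP_PROXY/HTTPS_PROXY, so you may need to export those too (this assumes the proxy also speaks HTTP on the same port; adjust the scheme and port to your setup):

export HTTP_PROXY='http://xxx.xxx.xxx.xxx:1080'
export HTTPS_PROXY='http://xxx.xxx.xxx.xxx:1080'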

gcc: Internal error: Killed (program cc1)

Out of memory. Increase the memory given to the Docker VM:

Preferences -> Advanced

In my case the build only finished after raising it to 12 GB, with 2 GB of swap.
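
You can confirm the new limit took effect from the host (Total Memory should be close to 12 GiB):

docker info | grep -i 'total memory'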

can not be used when making a shared object; recompile with -fPIC

/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status

I found a fix on GitHub: https://github.com/netfs/serving/commit/be7c70d779a39fad73a535185a4f4f991c1d859a , but the local code already contained it. In the end I only got the build through by removing the version linkstamp:

Changes to the BUILD file:

cc_library(
    name = "tensorflow_model_server_main_lib",
    srcs = [
        "main.cc",
    ],
    #hdrs = [
    #    "version.h",
    #],
    #linkstamp = "version.cc",
    visibility = [
        ":tensorflow_model_server_custom_op_clients",
        "//tensorflow_serving:internal",
    ],
    deps = [
        ":server_lib",
        "@org_tensorflow//tensorflow/c:c_api",
        "@org_tensorflow//tensorflow/core:lib",
        "@org_tensorflow//tensorflow/core/platform/cloud:gcs_file_system",
        "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system",
        "@org_tensorflow//tensorflow/core/platform/s3:s3_file_system",
    ],
)

Changes to main.cc:

//#include "tensorflow_serving/model_servers/version.h"

...
if (display_version) {
    std::cout << "TensorFlow ModelServer: " << "r1.12" << "\n"
              << "TensorFlow Library: " << TF_Version() << "\n";
    return 0;
  }
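
After the build, the binary should print the hardcoded string from the edit above; the display_version branch shown is wired to the --version flag in stock main.cc, but verify against your copy:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --version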

Saving the image

After the build finishes, commit the container:

docker commit -a "xxx" -m "tfserving gpu build" b629d5936020 tensorflow/serving:2.0.0-devel-gpu-build

Exporting and importing the image:

docker save -o xxx.tar tensorflow/serving:2.0.0-devel-gpu-build
docker load -i xxx.tar
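
Since the devel image is so large, for deployment you can copy just the built binary into the slim non-devel image. A sketch; the official images place the binary at /usr/bin/tensorflow_model_server, but verify the paths in your own image:

# copy the binary out of the devel container
docker cp tfs:/tensorflow-serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server .

# drop it into a container based on the slim GPU image, then commit
docker run -d --name tfs-slim --entrypoint /bin/bash tensorflow/serving:2.0.0-gpu -c 'sleep 1d'
docker cp ./tensorflow_model_server tfs-slim:/usr/bin/tensorflow_model_server
docker commit tfs-slim tensorflow/serving:2.0.0-gpu-build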

Launch arguments

sudo nvidia-docker run -p 8500:8500 \
  --mount type=bind,source=xxx/models,target=xxx \
  -t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu \
  --port=8500 --per_process_gpu_memory_fraction=0.5 \
  --enable_batching=true --model_name=east --model_base_path=/models/east_model &

What the arguments mean:

  • -p 8500:8500: publish the 8500 gRPC port.
  • --mount type=bind,source=/your/local/model,target=/models: bind-mounts your exported model directory into /models inside the container; TensorFlow Serving looks for your model under /models.
  • -t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu: with a non-devel image you cannot open a bash shell inside the container after starting it; --entrypoint lets you "indirectly" enter the container by invoking tensorflow_model_server to start TensorFlow Serving, which is what makes the arguments below work. The image tensorflow/serving:latest-gpu can be swapped for any version you want.
  • --port=8500: listen on gRPC port 8500 (requires the entrypoint argument above, as do the arguments below).
  • --per_process_gpu_memory_fraction=0.5: the fraction of GPU memory the model may use, a value in [0, 1].
  • --enable_batching: enable batched inference, which improves GPU utilization.
  • --model_name: the model name, as set when the model was exported.
  • --model_base_path: the model's path inside the container; the mount above put everything under /models, and this narrows it to one model directory, e.g. /models/east_model serves the model under /models/east_model.
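
For a quick smoke test over HTTP, you can additionally publish the REST port and query the model status (this assumes --rest_api_port is available in your serving version; add -p 8501:8501 and --rest_api_port=8501 to the command above):

curl http://localhost:8501/v1/models/east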

Building with the tools in the source tree

You can also build the tfserving image with the tools that ship in the source tree:

Clone the code

git clone --recurse-submodules https://github.com/tensorflow/serving.git
cd serving
git checkout r2.0

Build the ModelServer

After modifying the code, build an optimized ModelServer.

CPU version:

docker build --pull -t $USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.devel .

If the machine has Intel's MKL library installed (reportedly faster than the open-source OpenBLAS), you can use:

docker build --pull -t $USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.devel-mkl .

GPU version:

docker build --pull -t $USER/tensorflow-serving-devel-gpu \
  -f tensorflow_serving/tools/docker/Dockerfile.devel-gpu .

The process above (whichever one you pick) builds the $USER/tensorflow-serving-devel image (or $USER/tensorflow-serving-devel-gpu for the GPU build).

Build the TensorFlow Serving image

Next, we use the $USER/tensorflow-serving-devel image built above to build the TensorFlow Serving image.

CPU version:

docker build -t $USER/tensorflow-serving \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile .

For the MKL CPU version:

docker build -t $USER/tensorflow-serving \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.mkl .

GPU version:

docker build -t $USER/tensorflow-serving-gpu \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel-gpu \
  -f tensorflow_serving/tools/docker/Dockerfile.gpu .
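
The result can then be launched the same way as in the launch-arguments section above (a sketch, reusing those placeholder paths):

sudo nvidia-docker run -p 8500:8500 \
  --mount type=bind,source=/your/local/models,target=/models \
  -t --entrypoint=tensorflow_model_server $USER/tensorflow-serving-gpu \
  --port=8500 --model_name=east --model_base_path=/models/east_model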

ref

fPIC: https://www.cnblogs.com/zl1991/p/11465111.html

tfserving-docker (Google cache): http://webcache.googleusercontent.com/search?q=cache:ZulKFDzVupwJ:fancyerii.github.io/books/tfserving-docker/+&cd=4&hl=zh-CN&ct=clnk&gl=us

