
A Simplified Ubuntu Deep Learning GPU Environment Setup

 5 years ago
source link: http://www.flyml.net/2018/08/10/simple-deep-learning-gpu-env-setup/?amp%3Butm_medium=referral

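Introduction:

In an earlier article I described in detail how to install and configure this environment by hand, step by step.

That setup covered:

  1. The driver
  2. CUDA
  3. cuDNN
  4. nvidia-docker

Installation has become dramatically simpler since then. The first three items now come down to a single apt command for the driver, because CUDA and cuDNN ship inside the GPU Docker images used later in this post. Without further ado, let's get started.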

Remove/uninstall anything NVIDIA-related that was installed before

sudo apt purge 'nvidia*'
 

This removes the driver as well as the nvidia-docker command.
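Before purging, it can help to preview which installed packages the pattern nvidia* selects. The sketch below applies shell-style glob matching (Python's fnmatch) to a package listing such as the output of dpkg-query -W -f='${Package}\n'. The sample listing is made up for illustration. apt's own rules for patterns like nvidia* differ from this glob and have changed between apt versions, so a simulated run (sudo apt-get -s purge 'nvidia*') is the more reliable preview. Quoting the pattern also stops the shell from expanding it against file names in the current directory.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Made-up sample listing, one package name per line, in the shape printed by
#   dpkg-query -W -f='${Package}\n'
SAMPLE_LISTING = """\
nvidia-390
nvidia-docker
nvidia-settings
libnvidia-gl-390
docker-ce
"""


def packages_matching(listing: str, pattern: str = "nvidia*") -> list[str]:
    """Return the package names in `listing` that match a shell-style glob."""
    names = [line.strip() for line in listing.splitlines() if line.strip()]
    # fnmatchcase is case-sensitive and the pattern must match the whole name
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    print(packages_matching(SAMPLE_LISTING))
```

Under this glob, libnvidia-gl-390 is not selected because its name does not start with nvidia.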

Install the graphics driver

At the time of writing, the current major version was 390. The corresponding commands are:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt install nvidia-390  # untested alternative: apt install nvidia-current
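If you are unsure which major version is current, you can list the candidates first. One way is apt-cache search --names-only '^nvidia-'; ubuntu-drivers devices also reports a recommended driver. The hypothetical helper below picks the highest-numbered driver package from such a listing. It assumes package names of the form nvidia-NNN or nvidia-driver-NNN, and the sample lines are invented. The highest number is not automatically the right choice, because the PPA can carry drivers newer than what has been tested. That is why the commands above pin 390.

```python
from __future__ import annotations

import re

# Assumed naming schemes: "nvidia-390" (older) and "nvidia-driver-390" (newer).
DRIVER_PKG = re.compile(r"^nvidia-(?:driver-)?(\d+)$")

# Invented lines in the "name - description" shape printed by apt-cache search.
SAMPLE_SEARCH = """\
nvidia-384 - example description
nvidia-390 - example description
nvidia-settings - example description
"""


def newest_driver(listing: str) -> str | None:
    """Return the driver package with the highest major version, or None."""
    best_name, best_version = None, -1
    for line in listing.splitlines():
        if not line.strip():
            continue
        name = line.split()[0]  # first column is the package name
        match = DRIVER_PKG.match(name)
        if match and int(match.group(1)) > best_version:
            best_name, best_version = name, int(match.group(1))
    return best_name


if __name__ == "__main__":
    print(newest_driver(SAMPLE_SEARCH))  # nvidia-390
```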
 

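The old method of manually downloading the driver and CUDA and installing them step by step is now completely obsolete!

After installation, running nvidia-smi should show output like this: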

Fri Aug 10 10:54:49 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   40C    P5    16W / 120W |      0MiB /  6077MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
 
 

Once the installation finishes, it is best to reboot once so that the new driver is actually loaded.
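For scripts, nvidia-smi can print selected fields as CSV instead of the table above. Below is a minimal sketch. It assumes the query fields index, name, driver_version and memory.total (listed by nvidia-smi --help-query-gpu) and the csv,noheader,nounits format. The parser is exercised on a made-up sample line whose GPU name is a placeholder. query_gpus() returns an empty list when nvidia-smi is not on the PATH.

```python
from __future__ import annotations

import csv
import io
import shutil
import subprocess

# Fields as listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = "index,name,driver_version,memory.total"


def parse_gpus(csv_text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, driver, memory_mib = (field.strip() for field in row)
        gpus.append({
            "index": int(index),
            "name": name,
            "driver": driver,
            "memory_mib": int(memory_mib),  # memory.total in MiB (nounits)
        })
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi if it is on the PATH; return [] when it is not."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpus(result.stdout)


if __name__ == "__main__":
    # Made-up sample line; "Example GPU" is a placeholder name.
    print(parse_gpus("0, Example GPU, 390.77, 6077\n"))
```

After the reboot, a quick sign that the driver is loaded is a non-empty query_gpus() result whose driver field matches the version you installed.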

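Install the latest Docker CE

The nvidia-docker2 command-line tool installed later requires a recent docker-ce, so install that first.

The docker.io package previously installed directly through APT no longer works here, because it is not compatible.

Note: remove any old versions first:

sudo apt-get remove docker docker-engine docker.io

Then simply run the commands from the official guide one by one:

Official guide: https://docs.docker.com/v17.12/install/linux/docker-ce/ubuntu/#install-docker-ce-1

I won't repeat them here.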
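After following the official steps, docker version --format '{{.Server.Version}}' prints only the server version (18.06.0-ce in the version output later in this post). The sketch below parses such a string and compares it numerically with a minimum. The minimum is a parameter you supply, and the value in the example is a placeholder rather than a documented requirement. Check the nvidia-docker2 installation page for the Docker version it expects.

```python
from __future__ import annotations

import re


def parse_docker_version(text: str) -> tuple[int, int, int]:
    """Turn a version such as '18.06.0-ce' into (18, 6, 0)."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised Docker version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_at_least(text: str, minimum: tuple[int, int, int]) -> bool:
    """Compare versions numerically; tuples compare field by field."""
    return parse_docker_version(text) >= minimum


if __name__ == "__main__":
    # (17, 12, 0) is a placeholder minimum, not a documented requirement.
    print(is_at_least("18.06.0-ce", (17, 12, 0)))  # True
```

The comparison uses integer tuples rather than strings: as strings, "9.0" would sort above "18.06".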

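Install nvidia-docker2

We need this so that Docker containers can use the GPU and the CUDA Toolkit.

Official site: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)

  • First, remove the old version (nvidia-docker 1.0) and any containers that use its volumes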
    docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
    sudo apt-get purge nvidia-docker
     
    
  • Add the APT repository
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
    sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
     
    
  • Install
    sudo apt-get install nvidia-docker2   
     
    
  • Check the version after installation
    $ nvidia-docker version
    NVIDIA Docker: 2.0.3
    Client:
    Version:           18.06.0-ce
    API version:       1.38
    Go version:        go1.10.3
    Git commit:        0ffa825
    Built:             Wed Jul 18 19:11:02 2018
    OS/Arch:           linux/amd64
    Experimental:      false
     
    Server:
    Engine:
    Version:          18.06.0-ce
    API version:      1.38 (minimum version 1.12)
    Go version:       go1.10.3
    Git commit:       0ffa825
    Built:            Wed Jul 18 19:09:05 2018
    OS/Arch:          linux/amd64
    Experimental:     false
     
    
  • Then check the Docker daemon configuration
    sudo cat /etc/docker/daemon.json
    {
      "runtimes": {
          "nvidia": {
              "path": "nvidia-container-runtime",
              "runtimeArgs": []
          }
      }
    }
     
    
  • Restart the Docker service so the changes take effect
    sudo systemctl daemon-reload
    sudo systemctl restart docker
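The daemon.json check above can also be scripted. Below is a minimal sketch that assumes the file content is passed in as a string. has_nvidia_runtime() reports whether a runtimes.nvidia entry exists. with_nvidia_runtime() adds the same entry shown above while keeping any other settings, which matters if daemon.json already contained options before nvidia-docker2 was installed. Both function names are hypothetical. Writing the result back to /etc/docker/daemon.json needs root, and Docker must then be restarted as in the last step.

```python
from __future__ import annotations

import json


def has_nvidia_runtime(config_text: str) -> bool:
    """True if the daemon.json text defines a runtime named "nvidia"."""
    config = json.loads(config_text) if config_text.strip() else {}
    return "nvidia" in config.get("runtimes", {})


def with_nvidia_runtime(config_text: str) -> str:
    """Return daemon.json text with the nvidia runtime added; other keys are kept."""
    config = json.loads(config_text) if config_text.strip() else {}
    runtimes = config.setdefault("runtimes", {})
    # Same entry as the daemon.json shown above; an existing entry is left alone.
    runtimes.setdefault("nvidia", {"path": "nvidia-container-runtime", "runtimeArgs": []})
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    print(with_nvidia_runtime('{"log-driver": "json-file"}'))
```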
     
    

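Run a Docker image with GPU access

First, a quick test. The two forms below are equivalent, since nvidia-docker 2.0 is a wrapper that passes --runtime=nvidia to docker: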

# Sample 1:
nvidia-docker run --rm nvidia/cuda nvidia-smi
 
# Sample 2:
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
 

Output:

Fri Aug 10 04:50:35 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   39C    P0    25W / 120W |      0MiB /  6077MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
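Both nvidia-smi runs report the same driver, 390.77. That is expected: the NVIDIA container runtime makes the host's driver available inside the container, while the CUDA toolkit comes from the image. Below is a small sketch for comparing the two outputs. It assumes the header line keeps the "NVIDIA-SMI x   Driver Version: y" layout shown here.

```python
from __future__ import annotations

import re

# Header line of the nvidia-smi table, e.g. "| NVIDIA-SMI 390.77   Driver Version: 390.77   |"
HEADER = re.compile(r"NVIDIA-SMI\s+\S+\s+Driver Version:\s+(\S+)")


def driver_version(smi_output: str) -> str | None:
    """Extract the driver version from nvidia-smi's table header, if present."""
    match = HEADER.search(smi_output)
    return match.group(1) if match else None


def same_driver(host_output: str, container_output: str) -> bool:
    """True when both outputs report the same, non-missing driver version."""
    host = driver_version(host_output)
    return host is not None and host == driver_version(container_output)


if __name__ == "__main__":
    line = "| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |"
    print(driver_version(line), same_driver(line, line))  # 390.77 True
```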
 

Using the TensorFlow GPU image

# Pull the TensorFlow GPU image
docker pull tensorflow/tensorflow:latest-gpu-py3
 
# Start the container
sudo nvidia-docker run --name wenjun-tf-gpu-py3 \
    -it -p 8888:8888 -p 6006:6006 \
    -v /data/notebooks:/notebooks \
    -v /data/jupyter-notebook-dataset:/dataset \
    tensorflow/tensorflow:latest-gpu-py3
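The same command can be assembled programmatically, which makes the two kinds of options explicit. Each -p host:container pair publishes a port (8888 is Jupyter's default port and 6006 is TensorBoard's). Each -v host:container pair bind-mounts a host directory, so notebooks and datasets survive when the container is removed. tf_gpu_run_args() is a hypothetical helper. Passing its result to subprocess.run() would start the container (prefix "sudo" if your user cannot talk to Docker), but the sketch below only prints the command.

```python
from __future__ import annotations

import shlex


def tf_gpu_run_args(name: str, image: str,
                    ports: list[tuple[int, int]],
                    volumes: list[tuple[str, str]]) -> list[str]:
    """Build an nvidia-docker run argument list (hypothetical helper)."""
    args = ["nvidia-docker", "run", "--name", name, "-it"]
    for host_port, container_port in ports:
        args += ["-p", f"{host_port}:{container_port}"]  # publish a port
    for host_dir, container_dir in volumes:
        args += ["-v", f"{host_dir}:{container_dir}"]    # bind-mount a directory
    args.append(image)  # the image name comes after all options
    return args


if __name__ == "__main__":
    args = tf_gpu_run_args(
        "wenjun-tf-gpu-py3",
        "tensorflow/tensorflow:latest-gpu-py3",
        ports=[(8888, 8888), (6006, 6006)],
        volumes=[("/data/notebooks", "/notebooks"),
                 ("/data/jupyter-notebook-dataset", "/dataset")],
    )
    print(" ".join(shlex.quote(arg) for arg in args))
```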
 

Inside the Docker container, confirm that the GPU works:

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
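You can work out by hand what print(sess.run(c)) should print. With shape=[2, 3], the six values fill a row by row as [[1, 2, 3], [4, 5, 6]]. With shape=[3, 2], b becomes [[1, 2], [3, 4], [5, 6]]. The plain-Python sketch below involves no TensorFlow and computes the same 2x2 product. For example, the top-left entry is 1*1 + 2*3 + 3*5 = 22.

```python
def matmul(a, b):
    """Plain-Python matrix product of nested lists: rows of a times columns of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # tf.constant(..., shape=[2, 3])
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # tf.constant(..., shape=[3, 2])
print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]
```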
 

Here is my output:

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-10 05:08:25.640901: I tensorflow/core/common_runtime/placer.cc:935] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-10 05:08:25.640934: I tensorflow/core/common_runtime/placer.cc:935] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-08-10 05:08:25.640950: I tensorflow/core/common_runtime/placer.cc:935] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]
 

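As the log shows, MatMul and the two constants a and b were all placed on /device:GPU:0, so TensorFlow is using the GPU.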
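When the placement log is long, this check can be automated. The sketch below assumes the log text has already been captured into a string. It keeps only lines of the form "name: (OpType): device" and reports whether every op landed on a GPU device. The timestamped logger lines are ignored. The regular expression is written for the TensorFlow 1.x log format shown above and may need adjusting for other versions.

```python
from __future__ import annotations

import re

# Matches lines like "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0"
PLACEMENT = re.compile(r"^(?P<name>[^:\s]+): \((?P<op>[^)]+)\): (?P<device>\S+)$")


def placements(log_text: str) -> dict[str, str]:
    """Map each op name to the device it was placed on."""
    found = {}
    for line in log_text.splitlines():
        match = PLACEMENT.match(line.strip())
        if match:
            found[match.group("name")] = match.group("device")
    return found


def all_on_gpu(log_text: str) -> bool:
    """True if at least one placement was found and every op is on a GPU."""
    devices = list(placements(log_text).values())
    return bool(devices) and all("/device:GPU:" in device for device in devices)


if __name__ == "__main__":
    sample = (
        "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0\n"
        "a: (Const): /job:localhost/replica:0/task:0/device:GPU:0\n"
        "b: (Const): /job:localhost/replica:0/task:0/device:GPU:0\n"
    )
    print(all_on_gpu(sample))  # True
```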

This is an original article; please credit the source when reposting.

