Docker & GPU

WSL + Windows

👉 Read more: .

With TensorFlow or PyTorch

👉 Official doc for TensorFlow + Docker.
👉 Note about Docker and TensorFlow: TensorFlow.
👉 An example of Docker + PyTorch with GPU support.

Basic installation

🚨

You must (successfully) install the GPU driver on your (Linux) machine before proceeding with the steps in this note. Go to the "Check info" section to check the availability of your drivers.

☝

(Maybe just for me) It works perfectly on Pop!_OS 20.04. I also tried Pop!_OS 21.10 and ran into a lot of problems, so stay with 20.04!

sudo apt update

sudo apt install -y nvidia-container-runtime
# You may need to replace the line above with
sudo apt install nvidia-docker2
sudo apt install nvidia-container-toolkit

sudo apt install -y nvidia-cuda-toolkit
# restart required

If you have problems installing nvidia-docker2, read this section!

Check info

# Verify that your computer has an NVIDIA graphics card
lspci | grep -i nvidia

# First, install drivers and check
nvidia-smi
# output: NVIDIA-SMI 450.80.02 Driver Version: 450.80.02    CUDA Version: 11.0
# It's the maximum CUDA version that your driver supports
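
To read those two numbers from a script, a minimal Python sketch could parse the header text. The function name `parse_nvidia_smi_header` is made up for this note, and the sample string is the output line quoted above.

```python
import re


def parse_nvidia_smi_header(text):
    """Extract the driver version and the max supported CUDA version."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return {
        "driver": driver.group(1) if driver else None,
        "max_cuda": cuda.group(1) if cuda else None,
    }


# Sample taken from the output line quoted in this note
sample = "NVIDIA-SMI 450.80.02 Driver Version: 450.80.02    CUDA Version: 11.0"
info = parse_nvidia_smi_header(sample)
print(info)
```

On a real machine you would pass the text captured from `nvidia-smi` instead of `sample`.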

# Check current version of cuda
nvcc --version
# If nvcc is not available, it may be in /usr/local/cuda/bin/
# Add this location to PATH
# modify ~/.zshrc or ~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH

# You may need to install
sudo apt install -y nvidia-cuda-toolkit

If the command below doesn't work, try installing nvidia-docker2 (read this section).

# Check that nvidia-docker is installed
dpkg -l | grep nvidia-docker
# or
nvidia-docker version

# Verify that docker run has the --gpus option
docker run --help | grep -i gpus
# output: --gpus gpu-request GPU devices to add to the container ('all' to pass all GPUs)
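
The checks in this section can be bundled into one quick scan. This is only a sketch: `find_gpu_tools` is a hypothetical helper, and it simply looks for the commands used above on `PATH` plus `/usr/local/cuda/bin`, without running them.

```python
import os
import shutil


def find_gpu_tools(extra_dirs=("/usr/local/cuda/bin",)):
    """Return {tool: path or None} for the CLI tools used in this note."""
    tools = ("nvidia-smi", "nvcc", "docker", "nvidia-docker")
    # Search the normal PATH first, then the usual CUDA location
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    search_path = os.pathsep.join(dirs + list(extra_dirs))
    return {tool: shutil.which(tool, path=search_path) for tool in tools}


found = find_gpu_tools()
for tool, path in found.items():
    print(f"{tool:15s} {path or 'not found'}")
```

A tool reported as "not found" points to the matching install step in "Basic installation".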

Does Docker work with GPU?

# List all GPU devices
docker run -it --rm --gpus all ubuntu nvidia-smi -L
# output: GPU 0: GeForce GTX 1650 (...)

# ERROR ?
# docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

# ERROR ?
# Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

# Solution: install nvidia-docker2

# Verify again with nvidia-smi
docker run -it --rm --gpus all ubuntu nvidia-smi

# Returns something like
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P0    11W /  N/A |    369MiB /  4096MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
# and another box like this

Check `cudnn`

whereis cudnn
# cudnn: /usr/include/cudnn.h

# Check cudnn version
grep -A 2 CUDNN_MAJOR /usr/include/cudnn.h
# or try this (it works for me, cudnn 8)
grep -A 2 CUDNN_MAJOR /usr/include/cudnn_version.h
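
The same lookup can be done from Python. The sketch below assumes the header defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` as plain integer macros. `parse_cudnn_version` is a hypothetical name and the sample values are invented for illustration.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn header text, or None."""
    parts = []
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            rf"^\s*#define\s+CUDNN_{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None  # macro missing: probably the wrong header file
        parts.append(int(match.group(1)))
    return tuple(parts)


# Invented sample content; read /usr/include/cudnn_version.h on a real machine
sample_header = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 1
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
version = parse_cudnn_version(sample_header)
print(version)
```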

Install `nvidia-docker2`

👉 Official installation guide (follow this for up-to-date instructions).

Note: (Only for me) Use the commands below.

# check version
nvidia-docker version

Difference: `nvidia-container-toolkit` vs `nvidia-container-runtime`

👉 What's the difference between the latest nvidia-docker and nvidia container runtime?

According to that discussion, with Docker 19.03+ (check with docker --version), nvidia-container-toolkit is used for --gpus (in docker run ...), while nvidia-container-runtime is used for --runtime=nvidia (which can also be used in a docker-compose file).

However, if you want to use Kubernetes with Docker 19.03, you actually need to continue using nvidia-docker2 because Kubernetes doesn't support passing GPU information down to docker through the --gpus flag yet. It still relies on the nvidia-container-runtime to pass GPU information down the stack via a set of environment variables.
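
To make the two styles concrete, here is a small sketch that builds the `docker run` command line either way. `gpu_run_args` is a hypothetical helper. It only covers the "all GPUs" case and uses the `NVIDIA_VISIBLE_DEVICES` variable that also appears in the docker-compose example of this note.

```python
import shlex


def gpu_run_args(image, command, use_gpus_flag=True):
    """Build a docker run command that exposes all GPUs to the container."""
    args = ["docker", "run", "--rm"]
    if use_gpus_flag:
        # Docker 19.03+ with nvidia-container-toolkit
        args += ["--gpus", "all"]
    else:
        # nvidia-container-runtime: GPUs are selected via an environment variable
        args += ["--runtime=nvidia", "-e", "NVIDIA_VISIBLE_DEVICES=all"]
    return args + [image, *command]


print(shlex.join(gpu_run_args("ubuntu", ["nvidia-smi", "-L"])))
print(shlex.join(gpu_run_args("ubuntu", ["nvidia-smi", "-L"], use_gpus_flag=False)))
```

The resulting list can be passed to `subprocess.run` on a machine where Docker and the NVIDIA packages are installed.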

👉 Installation Guide — NVIDIA Cloud Native Technologies documentation

Using docker-compose?

Purpose?

# instead of using
docker run \
    --gpus all \
    --name docker_thi_test \
    --rm \
    -v abc:abc \
    -p 8888:8888

# we use this with docker-compose.yml
docker-compose up

# check version of docker-compose
docker-compose --version

# If "version" in docker-compose.yml < 2.3
# Modify: /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

# restart our docker daemon
sudo pkill -SIGHUP dockerd
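
If `/etc/docker/daemon.json` already holds other settings, the NVIDIA entries have to be merged in rather than pasted over the file. Below is a sketch of that merge on a plain dictionary. `with_nvidia_default_runtime` and the `log-driver` setting are hypothetical examples, not part of the original setup.

```python
import json


def with_nvidia_default_runtime(config):
    """Return a copy of a daemon.json dict with nvidia as default runtime."""
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["nvidia"] = {"path": "nvidia-container-runtime", "runtimeArgs": []}
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    return merged


existing = {"log-driver": "json-file"}  # hypothetical existing settings
updated = with_nvidia_default_runtime(existing)
print(json.dumps(updated, indent=4))
```

Writing the result back to `/etc/docker/daemon.json` needs root, and the daemon still has to be restarted afterwards.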

# If "version" in docker-compose.yml >= 2.3
# docker-compose.yml => able to use "runtime"
version: '2.3' # MUST BE >=2.3 AND <3
services:
  testing:
    ports:
      - "8000:8000"
    runtime: nvidia
    volumes:
      - ./object_detection:/object_detection
👉 Check more in my repo my-dockerfiles on GitHub.

Run the test,

docker pull tensorflow/tensorflow:latest-gpu-jupyter
# -p also creates ~/Downloads/test if it doesn't exist yet
mkdir -p ~/Downloads/test/notebooks

Without using docker-compose.yml (tensorflow) (cf. this note for more)

docker run --name docker_thi_test -it --rm -v "$(realpath ~/Downloads/test/notebooks)":/tf/notebooks -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter

With docker-compose.yml?

# ~/Downloads/test/Dockerfile
FROM tensorflow/tensorflow:latest-gpu-jupyter

# ~/Downloads/test/docker-compose.yml
version: '2'
services:
  jupyter:
    container_name: 'docker_thi_test'
    build: .
    volumes:
        - ./notebooks:/tf/notebooks # notebook directory
    ports:
        - 8888:8888 # exposed port for jupyter
    environment:
        - NVIDIA_VISIBLE_DEVICES=0 # which gpu do you want to use for this container
        - PASSWORD=12345

Then run,

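# --service-ports publishes the ports from docker-compose.yml ("run" skips them by default)
docker-compose run --rm --service-ports jupyter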

Check usage of GPU

# Linux only
nvidia-smi

# All processes that use GPU
sudo fuser -v /dev/nvidia*
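
For scripting, `nvidia-smi` can also print the compute processes as CSV with `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`. The sketch below parses that text. `parse_compute_apps` is a hypothetical helper and the sample row is invented (it reuses the PID and memory figure seen elsewhere in this note).

```python
import csv
import io


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into dictionaries."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        pid, name, mem = (field.strip() for field in row)
        rows.append({
            "pid": int(pid),
            "name": name,
            # memory may be reported as "[N/A]" on some setups
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return rows


sample = "3019, python, 369\n"  # invented sample row
apps = parse_compute_apps(sample)
print(apps)
```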

Kill process

# Kill a single process (3019 is an example PID, take yours from nvidia-smi or fuser)
sudo kill -9 3019

Reset GPU

# all
sudo nvidia-smi --gpu-reset

# single
sudo nvidia-smi --gpu-reset -i 0

Errors with GPU

# Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
# Function call stack:
# train_function

👉 Check this answer as a reference!

👇 Use a GPU.

import tensorflow as tf

# Limit the GPU memory to be used
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
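
An alternative to a fixed limit, not taken from the original note, is to let TensorFlow grow its GPU memory on demand through the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable. The sketch below only sets the variable, and `enable_gpu_memory_growth` is a hypothetical helper name. Whether this fixes the cuDNN error on a given machine is an assumption to verify.

```python
import os


def enable_gpu_memory_growth(env=os.environ):
    """Ask TensorFlow to allocate GPU memory on demand instead of all at once."""
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


enable_gpu_memory_growth()
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
```

In a container, the same variable can go into the `environment` list of docker-compose.yml.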

Problems with PyTorch versions: check this.

RuntimeError: cuda runtime error (804) : forward compatibility was attempted on non supported HW at /pytorch/aten/src/THC/THCGeneral.cpp:47 (this appeared after a system update that maybe included nvidia-cli) => Same fix as the error below: restart the computer.

nvidia-smi: Failed to initialize NVML: Driver/library version mismatch.

This thread: just restart the computer.
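
The fixes collected in this note can be kept as a small lookup table. This is only a convenience sketch: `KNOWN_ERRORS` and `suggest_fix` are hypothetical names, and the hints just repeat what the note says.

```python
# (substring of the error message, hint taken from this note)
KNOWN_ERRORS = [
    ("could not select device driver", "Install nvidia-docker2."),
    ("Driver/library version mismatch", "Restart the computer."),
    ("forward compatibility was attempted", "Restart the computer."),
    ("Failed to get convolution algorithm",
     "Limit the GPU memory that TensorFlow allocates."),
]


def suggest_fix(message):
    """Return the hint for the first known error found in message, else None."""
    lowered = message.lower()
    for needle, hint in KNOWN_ERRORS:
        if needle.lower() in lowered:
            return hint
    return None


print(suggest_fix("Failed to initialize NVML: Driver/library version mismatch"))
```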

Make NVIDIA work in docker (Linux)

⚠️

This section still worked as of 26-Oct-2020, but it has been superseded by newer methods.

One idea: Use NVIDIA driver of the base machine, don't install anything in Docker!

References

Difference between base, runtime and devel in Dockerfile of CUDA.

Dockerfile on GitHub of TensorFlow.