
Docker & GPU

Anh-Thi Dinh
Data Engineering Β· Docker

WSL + Windows

πŸ‘‰ Read more.

With Tensorflow or PyTorch

πŸ‘‰ Official docs for TensorFlow + Docker.
πŸ‘‰ Note about Docker and TensorFlow.
πŸ‘‰ An example of Docker + PyTorch with GPU support.

Basic installation

🚨
You must (successfully) install the GPU driver on your (Linux) machine before proceeding with the steps in this note. Go to the "Check info" section to check the availability of your drivers.
☝
(Maybe just for me) It works perfectly on Pop!_OS 20.04. I tried Pop!_OS 21.10 and ran into a lot of problems, so stay with 20.04!
If you have problems installing nvidia-docker2, read this section!

Check info

If the command below doesn't work, try installing nvidia-docker2 (read this section).

Does Docker work with GPU?

Check cudnn

Install nvidia-docker2

πŸ‘‰ (Follow this for up-to-date instructions) Official installation guide.
Note: (Only for me) Use the commands below.

Difference: nvidia-container-toolkit vs nvidia-container-runtime

πŸ‘‰ What's the difference between the latest nvidia-docker and nvidia-container-runtime?
In this note, with Docker 19.03+ (check with docker --version), the author says that nvidia-container-toolkit is used for --gpus (in docker run ...), while nvidia-container-runtime is used for --runtime=nvidia (which can also be used in a docker-compose file).
However, if you want to use Kubernetes with Docker 19.03, you actually need to continue using nvidia-docker2 because Kubernetes doesn't support passing GPU information down to docker through the --gpus flag yet. It still relies on the nvidia-container-runtime to pass GPU information down the stack via a set of environment variables.
πŸ‘‰ Installation Guide β€” NVIDIA Cloud Native Technologies documentation
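The rule of thumb above can be sketched as a tiny helper (the function name is hypothetical; version tuples stand in for the output of docker --version):

```python
# Sketch: choose GPU flags for `docker run` depending on the Docker version
# and whether Kubernetes sits in the stack (function name hypothetical).
def gpu_run_args(docker_version: tuple, kubernetes: bool = False) -> list:
    if docker_version >= (19, 3) and not kubernetes:
        # nvidia-container-toolkit path: the native --gpus flag
        return ["--gpus", "all"]
    # nvidia-docker2 / nvidia-container-runtime path
    return ["--runtime=nvidia", "-e", "NVIDIA_VISIBLE_DEVICES=all"]

print(gpu_run_args((19, 3)))                    # -> ['--gpus', 'all']
print(gpu_run_args((19, 3), kubernetes=True))   # runtime-based flags
```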

Using docker-compose?

Purpose?
πŸ‘‰ Check more in my repo my-dockerfiles on GitHub.
Run the test:
Without using docker-compose.yml (TensorFlow) (cf. this note for more).
With docker-compose.yml?
Then run:

Check usage of GPU

Kill process

Reset GPU

Errors with GPU

πŸ‘‰ Check this answer as a reference!
πŸ‘‡ Use a GPU.

Problems with pytorch versions: check this.

RuntimeError: cuda runtime error (804) : forward compatibility was attempted on non supported HW at /pytorch/aten/src/THC/THCGeneral.cpp:47 (after a system update including nvidia-cli, maybe) => The same problem as below; you need to restart the computer.

nvidia-smi: Failed to initialize NVML: Driver/library version mismatch.
This thread: just restart the computer.

Make NVIDIA work in docker (Linux)

⚠️
This section still works (as of 26-Oct-2020), but it's obsolete compared to newer methods.
One idea: Use NVIDIA driver of the base machine, don't install anything in Docker!

References

  1. Difference between base, runtime and devel in Dockerfile of CUDA.
  2. Dockerfile on GitHub of TensorFlow.
sudo apt update

sudo apt install -y nvidia-container-runtime
# You may need to replace the above line with
sudo apt install nvidia-docker2
sudo apt install nvidia-container-toolkit

sudo apt install -y nvidia-cuda-toolkit
# restart required
# Verify that your computer has a graphics card
lspci | grep -i nvidia

# First, install drivers and check
nvidia-smi
# output: NVIDIA-SMI 450.80.02 Driver Version: 450.80.02    CUDA Version: 11.0
# This is the maximum CUDA version that your driver supports
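If you want to grab those two numbers programmatically, a small sketch (hypothetical helper, parsing the banner line shown above):

```python
import re

# Sketch: pull the driver version and the maximum supported CUDA version
# out of the first banner line of `nvidia-smi` output.
def parse_smi_banner(line: str):
    m = re.search(r"Driver Version: ([\d.]+)\s+CUDA Version: ([\d.]+)", line)
    return m.groups() if m else None

banner = "NVIDIA-SMI 450.80.02 Driver Version: 450.80.02    CUDA Version: 11.0"
print(parse_smi_banner(banner))  # -> ('450.80.02', '11.0')
```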
# Check the current version of cuda
nvcc --version
# If nvcc is not available, it may be in /usr/local/cuda/bin/
# Add this location to PATH
# (modify ~/.zshrc or ~/.bashrc)
export PATH=/usr/local/cuda/bin:$PATH

# You may need to install
sudo apt install -y nvidia-cuda-toolkit
# Check that nvidia-docker is installed
dpkg -l | grep nvidia-docker
# or
nvidia-docker version

# Verify the --gpus option under docker run
docker run --help | grep -i gpus
# output: --gpus gpu-request    GPU devices to add to the container ('all' to pass all GPUs)
# List all GPU devices
docker run -it --rm --gpus all ubuntu nvidia-smi -L
# output: GPU 0: GeForce GTX 1650 (...)

# ERROR?
# docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

# ERROR?
# Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

# Solution: install nvidia-docker2
# Verifying again with nvidia-smi
docker run -it --rm --gpus all ubuntu nvidia-smi

# Return something like
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P0    11W /  N/A |    369MiB /  4096MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
# and another box like this
whereis cudnn
# cudnn: /usr/include/cudnn.h

# Check cudnn version
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
# or try this (it works for me, cudnn 8)
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
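The grep above can also be done in code; a minimal sketch (hypothetical helper) that computes the full version string from the header contents:

```python
import re

# Sketch: extract the cuDNN version from the text of cudnn_version.h
# (or cudnn.h on older installs).
def cudnn_version(header_text: str) -> str:
    vals = {}
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(rf"#define\s+{key}\s+(\d+)", header_text)
        vals[key] = m.group(1)
    return "{CUDNN_MAJOR}.{CUDNN_MINOR}.{CUDNN_PATCHLEVEL}".format(**vals)

sample = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 1
"""
print(cudnn_version(sample))  # -> 8.2.1
```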
# check version
nvidia-docker version
# instead of using
docker run \
    --gpus all \
    --name docker_thi_test \
    --rm \
    -v abc:abc \
    -p 8888:8888

# we use this with docker-compose.yml
docker-compose up
# check version of docker-compose
docker-compose --version
# If "version" in docker-compose.yml < 2.3
# Modify: /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
# reload the docker daemon configuration
sudo pkill -SIGHUP dockerd
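Before sending SIGHUP, it's worth checking that daemon.json is still valid JSON, since a syntax error will break the reload. A minimal sketch (hypothetical helper, sample config inline):

```python
import json

# Sketch: sanity-check daemon.json before reloading dockerd;
# json.loads raises json.JSONDecodeError on bad syntax.
def check_daemon_json(text: str) -> str:
    cfg = json.loads(text)
    return cfg.get("default-runtime", "runc")  # "runc" is Docker's default

sample = """{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
    }
}"""
print(check_daemon_json(sample))  # -> nvidia
```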
# If "version" in docker-compose.yml >= 2.3
# docker-compose.yml => able to use "runtime"
version: '2.3' # MUST BE >=2.3 AND <3
services:
  testing:
    ports:
      - "8000:8000"
    runtime: nvidia
    volumes:
      - ./object_detection:/object_detection

docker pull tensorflow/tensorflow:latest-gpu-jupyter
mkdir -p ~/Downloads/test/notebooks
docker run --name docker_thi_test -it --rm -v $(realpath ~/Downloads/test/notebooks):/tf/notebooks -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter

# ~/Downloads/test/Dockerfile
FROM tensorflow/tensorflow:latest-gpu-jupyter
# ~/Downloads/test/docker-compose.yml
version: '2'
services:
  jupyter:
    container_name: 'docker_thi_test'
    build: .
    volumes:
        - ./notebooks:/tf/notebooks # notebook directory
    ports:
        - 8888:8888 # exposed port for jupyter
    environment:
        - NVIDIA_VISIBLE_DEVICES=0 # which gpu do you want to use for this container
        - PASSWORD=12345

docker-compose run --rm jupyter
# Linux only
nvidia-smi

# All processes that use GPU
sudo fuser -v /dev/nvidia*

# Kill a single process
sudo kill -9 3019

# all
sudo nvidia-smi --gpu-reset

# single
sudo nvidia-smi --gpu-reset -i 0
# Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
# Function call stack:
# train_function
import tensorflow as tf

# Limit the GPU memory to be used
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
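An alternative to a fixed cap (a sketch, assuming TF 2.x; set_memory_growth is the documented TF API, the wrapper function is hypothetical) is to let GPU memory grow on demand instead of pre-allocating:

```python
# Sketch (TF 2.x assumed): instead of capping memory at a fixed size,
# let TensorFlow allocate GPU memory on demand. Must be called before
# any GPU has been initialized. Pass the tensorflow module in as `tf`.
def enable_memory_growth(tf):
    gpus = tf.config.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)

# usage:
# import tensorflow as tf
# enable_memory_growth(tf)
```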