Docker + GPUs


👉 Note: Docker 101
👉 Note: Wordpress Docker
👉 Note: Airflow + Kubernetes 101
👉 Note: Tensorflow extra

WSL + Windows

👉 Note: WSL + Windows

With Tensorflow or PyTorch

👉 Official doc for TF + docker
👉 Note: Docker + TF.
👉 An example of docker pytorch with gpu support.

Basic installation

Install the GPU driver on your Linux machine and make sure it works before following the steps in this note. See the "Check info" section to verify that the driver is available.

(Maybe for me only) Everything works perfectly on Pop!_OS 20.04, but I ran into many problems on Pop!_OS 21.10, so I recommend sticking with 20.04.

sudo apt update

sudo apt install -y nvidia-container-runtime
# If the line above fails, you may need to install these instead
sudo apt install nvidia-docker2
sudo apt install nvidia-container-toolkit

sudo apt install -y nvidia-cuda-toolkit
# restart required

If you run into problems installing nvidia-docker2, read the "Install nvidia-docker2" section below.

Check info

# verify that your computer has a graphics card
lspci -nn | grep '\[03'
# install the drivers first, then check them
nvidia-smi
# output: NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0
# "CUDA Version" here is the maximum CUDA version that your driver supports
# check the CUDA toolkit version that is currently installed
nvcc --version
# If nvcc is not found, it may be in /usr/local/cuda/bin/
# Add this location to PATH
# by modifying ~/.zshrc or ~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH

# You may need to install
sudo apt install -y nvidia-cuda-toolkit
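
The two version numbers above are easy to mix up, so here is a minimal Python sketch that compares them. It only parses text: the function names are hypothetical, the nvcc line is an assumed sample, and the comparison is a rough sanity check rather than NVIDIA's full compatibility rules.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, max_cuda) from the nvidia-smi banner line."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*(\d+)\.(\d+)", text
    )
    if match is None:
        raise ValueError("no nvidia-smi banner found")
    return match.group(1), (int(match.group(2)), int(match.group(3)))


def parse_nvcc_release(text):
    """Return the toolkit version (major, minor) from `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no nvcc release line found")
    return int(match.group(1)), int(match.group(2))


def toolkit_fits_driver(smi_text, nvcc_text):
    """True when the installed toolkit is not newer than what the driver supports."""
    _, max_cuda = parse_smi_header(smi_text)
    return parse_nvcc_release(nvcc_text) <= max_cuda


# Banner line taken from this note
smi = "NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0"
# Assumed sample of the last line printed by `nvcc --version`
nvcc = "Cuda compilation tools, release 10.1, V10.1.243"

print(parse_smi_header(smi))           # ('450.80.02', (11, 0))
print(toolkit_fits_driver(smi, nvcc))  # True
```

To use it on a real machine, feed it the captured output of nvidia-smi and nvcc --version instead of the sample strings.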

If the commands below don't work, try installing nvidia-docker2 (read the "Install nvidia-docker2" section).

# check that nvidia-docker is installed
dpkg -l | grep nvidia-docker
# or
nvidia-docker version
# Verify that docker run supports the --gpus option
docker run --help | grep -i gpus
# output: --gpus gpu-request GPU devices to add to the container ('all' to pass all GPUs)

Check that Docker works with the GPU

# Listing out GPU devices
docker run -it --rm --gpus all ubuntu nvidia-smi -L
# output: GPU 0: GeForce GTX 1650 (...)
# ERROR ?
# docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
# ERROR ?
# Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

# Solution for both errors: install nvidia-docker2
# Then verify again with nvidia-smi
docker run -it --rm --gpus all ubuntu nvidia-smi

# Returns something like
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 510.54 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 55C P0 11W / N/A | 369MiB / 4096MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
# followed by a second box that lists the running processes
Archived but still useful
# Test a working setup with container-toolkit
# Update 14/04/2022: the "latest" tag has been deprecated => check your system versions and use
# the corresponding tag
# So the code below is for reference only; it no longer works
docker run --rm --gpus all nvidia/cuda nvidia-smi
# Test a working setup with container-runtime
# Update 14/04/2022: the code below no longer works because nvidia/cuda doesn't have
# the "latest" tag!
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

# Error response from daemon: Unknown runtime specified nvidia.
# Search below for "/etc/docker/daemon.json"
# Maybe it helps.

Install nvidia-docker2

More information (ref)

This package is the only docker-specific package of any of them. It takes the script associated with the nvidia-container-runtime and installs it into docker's /etc/docker/daemon.json file for you. This then allows you to run (for example) docker run --runtime=nvidia ... to automatically add GPU support to your containers. It also installs a wrapper script around the native docker CLI called nvidia-docker which lets you invoke docker without needing to specify --runtime=nvidia every single time. It also lets you set an environment variable on the host (NV_GPU) to specify which GPUs should be injected into a container.

👉 Official guide to install (follow it for up-to-date instructions).

Note: (For me only) use the commands below.

Command lines (for a quick preview)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

# NOTE FOR POP!_OS 20.04
# replace the line above with
distribution=ubuntu20.04

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
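
To make the distribution override less magical, here is a small Python sketch of what that shell line computes and why Pop!_OS needs the manual replacement. The function names are hypothetical, and the sample os-release content (in particular ID=pop) is an assumption about what Pop!_OS reports.

```python
def parse_os_release(text):
    """Parse KEY=value lines from /etc/os-release into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def repo_distribution(os_release_text):
    """Mimic `$ID$VERSION_ID`, mapping Pop!_OS onto the matching Ubuntu release."""
    fields = parse_os_release(os_release_text)
    distro_id = fields["ID"]
    if distro_id == "pop":  # assumption: Pop!_OS reports ID=pop
        distro_id = "ubuntu"
    return distro_id + fields["VERSION_ID"]


# Assumed sample of /etc/os-release on Pop!_OS 20.04
sample = 'NAME="Pop!_OS"\nID=pop\nID_LIKE="ubuntu debian"\nVERSION_ID="20.04"\n'
print(repo_distribution(sample))  # ubuntu20.04
```

On a real machine you would pass the content of /etc/os-release to repo_distribution and use the result in the repository URL.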

👇 Read more about the error below.

# Error?
# Read more:
# Depends: nvidia-container-toolkit (>= 1.9.0-1) but 1.5.1-1pop1~1627998766~20.04~9847cf2 is to be installed

# create a new file
sudo nano /etc/apt/preferences.d/nvidia-docker-pin-1002
# with the content below
Package: *
Pin: origin nvidia.github.io
Pin-Priority: 1002
# then save

# try again
sudo apt-get install -y nvidia-docker2
# restart docker
sudo systemctl restart docker

# want to check that it works?
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
# check version
nvidia-docker version

Difference: nvidia-container-toolkit vs nvidia-container-runtime

👉 What's the difference between the latest nvidia-docker and nvidia container runtime?

According to that answer, with Docker 19.03+ (check with docker --version), nvidia-container-toolkit is used for --gpus (in docker run ...), while nvidia-container-runtime is used for --runtime=nvidia (which can also be used in a docker-compose file).

However, if you want to use Kubernetes with Docker 19.03, you actually need to continue using nvidia-docker2 because Kubernetes doesn't support passing GPU information down to docker through the --gpus flag yet. It still relies on the nvidia-container-runtime to pass GPU information down the stack via a set of environment variables.
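
As a rough illustration of the two mechanisms, the Python sketch below builds the docker run arguments for each of them. The helper docker_run_args and its parameters are hypothetical; it only assembles a list and does not call Docker. The device parameter is assumed to be either "all" or a single GPU index.

```python
def docker_run_args(image, command, use_runtime=False, device="all"):
    """Build a `docker run` argument list using one of the two GPU mechanisms."""
    args = ["docker", "run", "--rm"]
    if use_runtime:
        # nvidia-container-runtime path: the GPU is selected through an
        # environment variable read by the runtime
        args += ["--runtime=nvidia", "-e", f"NVIDIA_VISIBLE_DEVICES={device}"]
    else:
        # nvidia-container-toolkit path (Docker 19.03+): the --gpus flag
        gpus = "all" if device == "all" else f"device={device}"
        args += ["--gpus", gpus]
    return args + [image] + list(command)


print(" ".join(docker_run_args("ubuntu", ["nvidia-smi", "-L"])))
# docker run --rm --gpus all ubuntu nvidia-smi -L
print(" ".join(docker_run_args("ubuntu", ["nvidia-smi", "-L"], use_runtime=True)))
# docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L
```

The resulting list can be passed to subprocess.run on a machine where the corresponding package is installed.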

👉 Installation Guide (NVIDIA Cloud Native Technologies documentation)

Using docker-compose?

Purpose?

# instead of using
docker run \
  --gpus all \
  --name docker_thi_test \
  --rm \
  -v abc:abc \
  -p 8888:8888 # followed by the image name
# we use this with docker-compose.yml
docker-compose up
# check version of docker-compose
docker-compose --version
# If "version" in docker-compose.yml is < 2.3
# Modify: /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
# make the docker daemon reload its configuration
sudo pkill -SIGHUP dockerd
# If "version" in docker-compose.yml is >= 2.3
# docker-compose.yml => able to use "runtime"
version: '2.3' # MUST BE >=2.3 AND <3
services:
  testing:
    ports:
      - "8000:8000"
    runtime: nvidia
    volumes:
      - ./object_detection:/object_detection
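
The version rule above can be sketched in Python as follows. The function names and the two strategy labels are hypothetical. Treating versions 3 and up like versions below 2.3 is my assumption, based on the "MUST BE >=2.3 AND <3" remark.

```python
import json


def parse_version(text):
    """Turn a compose file version such as '2.3' or '2' into a comparable tuple."""
    parts = str(text).split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def gpu_strategy(compose_version):
    """Pick how a service gets the NVIDIA runtime for a given file version."""
    if (2, 3) <= parse_version(compose_version) < (3, 0):
        return "runtime-key"    # put `runtime: nvidia` on the service
    return "daemon-default"     # set default-runtime in /etc/docker/daemon.json


def daemon_json():
    """Render the daemon.json content used for the daemon-default strategy."""
    config = {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
        },
    }
    return json.dumps(config, indent=4)


for version in ("2", "2.3", "2.4", "3.8"):
    print(version, "->", gpu_strategy(version))
# 2 -> daemon-default
# 2.3 -> runtime-key
# 2.4 -> runtime-key
# 3.8 -> daemon-default
```

Versions are compared as tuples rather than floats, so that something like '2.10' would not be mistaken for '2.1'.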

👉 Check more in my repo my-dockerfiles on Github.

Run the test,

docker pull tensorflow/tensorflow:latest-gpu-jupyter
mkdir -p ~/Downloads/test/notebooks

Without using docker-compose.yml (tensorflow) (cf. this note for more)

docker run --name docker_thi_test -it --rm -v $(realpath ~/Downloads/test/notebooks):/tf/notebooks -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter

With docker-compose.yml?

# ~/Downloads/test/Dockerfile
FROM tensorflow/tensorflow:latest-gpu-jupyter
# ~/Downloads/test/docker-compose.yml
version: '2'
services:
  jupyter:
    container_name: 'docker_thi_test'
    build: .
    volumes:
      - ./notebooks:/tf/notebooks # notebook directory
    ports:
      - 8888:8888 # exposed port for jupyter
    environment:
      - NVIDIA_VISIBLE_DEVICES=0 # which gpu do you want to use for this container
      - PASSWORD=12345

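Then run,

# --service-ports publishes the ports from docker-compose.yml ("run" skips them by default)
docker-compose run --rm --service-ports jupyter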

Check usage of GPU

# Linux only
nvidia-smi
# Returns something like this
# |===============================+======================+======================|
# | 0 GeForce GTX 1650 Off | 00000000:01:00.0 Off | N/A |
# | N/A 53C P8 2W / N/A | 3861MiB / 3914MiB | 2% Default |
# | | | N/A |
# +-------------------------------+----------------------+----------------------+

# => 3861MiB of 3914MiB is in use!

# +-----------------------------------------------------------------------------+
# | Processes: GPU Memory |
# | GPU PID Type Process name Usage |
# |=============================================================================|
# | 0 3019 C ...e/scarter/anaconda3/envs/tf1/bin/python 3812MiB |
# +-----------------------------------------------------------------------------+

# => Process 3019 is using the GPU
# List all processes that use the GPU
sudo fuser -v /dev/nvidia*
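
Reading the memory numbers out of the table by eye gets tedious, so here is a minimal Python sketch that parses the machine-readable form instead. The function name is hypothetical and the sample string is an assumed output for the GPU shown above. The nvidia-smi query flags in the comment are real options, but the sketch does not call nvidia-smi itself.

```python
def parse_memory_csv(text):
    """Parse 'used, total' rows (in MiB) into a list of (used, total, fraction)."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, used / total))
    return rows


# Assumed sample of what this command prints for the GPU shown above:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
sample = "3861, 3914\n"

for index, (used, total, fraction) in enumerate(parse_memory_csv(sample)):
    print(f"GPU {index}: {used} MiB of {total} MiB ({fraction:.0%})")
# GPU 0: 3861 MiB of 3914 MiB (99%)
```

Each output row corresponds to one GPU, so the same loop also works on a machine with several cards.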

Kill process

# Kill a single process
sudo kill -9 3019

Reset GPU

# all GPUs
sudo nvidia-smi --gpu-reset
# a single GPU (here the one with index 0)
sudo nvidia-smi --gpu-reset -i 0

Errors with GPU

# Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
# Function call stack:
# train_function

👉 Check this answer as a reference!

👇 Use a GPU.

import tensorflow as tf

# Limit the GPU memory to be used
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

Problems with PyTorch versions: check this.


RuntimeError: cuda runtime error (804) : forward compatibility was attempted on non supported HW at /pytorch/aten/src/THC/THCGeneral.cpp:47 (maybe after a system update that included nvidia-cli) => Same problem as the error below; you need to restart the computer.


nvidia-smi: Failed to initialize NVML: Driver/library version mismatch.

This thread: just restart the computer.
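
The error messages collected in this note can be put into a small lookup table. The Python sketch below is only a reminder of the fixes that worked for me, not an exhaustive diagnosis. HINTS and suggest_fix are hypothetical names.

```python
# (substring of the error message, fix suggested in this note)
HINTS = [
    ("could not select device driver", "install nvidia-docker2"),
    ("libnvidia-ml.so.1", "install nvidia-docker2"),
    ("Unknown runtime specified nvidia",
     "register the nvidia runtime in /etc/docker/daemon.json"),
    ("Driver/library version mismatch", "restart the computer"),
    ("forward compatibility was attempted", "restart the computer"),
]


def suggest_fix(message):
    """Return the fix noted for a known error message, or None if unknown."""
    for needle, hint in HINTS:
        if needle in message:
            return hint
    return None


print(suggest_fix("Failed to initialize NVML: Driver/library version mismatch"))
# restart the computer
print(suggest_fix("Error response from daemon: Unknown runtime specified nvidia."))
# register the nvidia runtime in /etc/docker/daemon.json
```

Matching on substrings keeps the table short, since the full messages carry file names and line numbers that change between versions.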

Make NVIDIA work in docker (Linux)

This section still worked as of 26-Oct-2020, but newer methods have since replaced it.

Idea: use the NVIDIA driver of the base machine and don't install anything in Docker!

Detailed steps

  1. First, make sure your base machine has an NVIDIA driver.

    # list all gpus
    lspci -nn | grep '\[03'

    # check nvidia & cuda versions
    nvidia-smi
  2. Install nvidia-container-runtime

    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list

    sudo apt-get update

    sudo apt-get install nvidia-container-runtime
  3. Note that we cannot use docker-compose.yml in this case!

  4. Create an image img_datas from the following Dockerfile:

    FROM nvidia/cuda:10.2-base

    RUN apt-get update && \
        apt-get -y upgrade && \
        apt-get install -y python3-pip python3-dev locales git

    # install dependencies
    COPY requirements.txt requirements.txt
    RUN python3 -m pip install --upgrade pip && \
        python3 -m pip install -r requirements.txt

    COPY . .

    # default command
    CMD [ "jupyter", "lab", "--no-browser", "--allow-root", "--ip=0.0.0.0" ]
  5. Create a container,

    docker run --name docker_thi --gpus all -v /home/thi/folder_1/:/srv/folder_1/ -v /home/thi/folder_1/git/:/srv/folder_2 -dp 8888:8888 -w="/srv" -it img_datas

    # -v: volumes
    # -w: working dir
    # --gpus all: using all gpus on base machine

This article is also very interesting and helpful in some cases.

References

  1. Difference between base, runtime and devel in Dockerfile of CUDA.
  2. Dockerfile on Github of Tensorflow.