Docker + GPUs

Last modified 8 months ago

👉 Note: Docker 101
👉 Note: Wordpress Docker
👉 Note: Airflow + Kubernetes 101
👉 Note: Tensorflow extra

WSL + Windows

👉 Note: WSL + Windows

With Tensorflow or PyTorch

👉 Official doc for TF + docker
👉 Note: Docker + TF.
👉 An example of docker pytorch with gpu support.

Basic installation

You must have a working GPU driver installed on your Linux machine before following the steps in this note. See the "Check info" section to verify that your driver is available.

The following works on Pop!_OS 20.04:

sudo apt update
sudo apt install -y nvidia-container-runtime
sudo apt install -y nvidia-container-toolkit
sudo apt install -y nvidia-cuda-toolkit
# restart required

Check info

# verify that your computer has a graphics card
lspci -nn | grep '\[03'
# First, install the drivers, then check them
nvidia-smi
# output: NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0
# "CUDA Version" is the maximum CUDA version that your driver supports
# check the currently installed version of CUDA
nvcc --version
# If nvcc is not found, it may be in /usr/local/cuda/bin/
# Add this location to PATH
# by modifying ~/.zshrc or ~/.bashrc
export PATH=/usr/local/cuda/bin:$PATH

# You may need to install
sudo apt install -y nvidia-cuda-toolkit
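
To compare the two versions above without reading them by eye, here is a minimal sketch. It pulls the "CUDA Version" out of the nvidia-smi header and the "release" number out of nvcc --version, then checks that the toolkit is not newer than what the driver supports. The function names (version_ge, driver_max_cuda, toolkit_cuda) are made up for this note, and the parsing assumes the output formats shown above.

```shell
# Succeeds when dotted version $1 >= $2 (compared numerically, field by field).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Reads nvidia-smi output on stdin, prints the driver's maximum CUDA version.
driver_max_cuda() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Reads `nvcc --version` output on stdin, prints the toolkit release.
toolkit_cuda() {
  sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Usage on a machine with both tools installed: version_ge "$(nvidia-smi | driver_max_cuda)" "$(nvcc --version | toolkit_cuda)" && echo "toolkit is within the driver's range".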

If the commands below don't work, try installing nvidia-docker2 (see the "Install nvidia-docker2" section).

# install and check nvidia-docker
dpkg -l | grep nvidia-docker
# or
nvidia-docker version
# Verifying the --gpus option of docker run
docker run --help | grep -i gpus
# output: --gpus gpu-request GPU devices to add to the container ('all' to pass all GPUs)
# Listing out GPU devices
docker run -it --rm --gpus all ubuntu nvidia-smi -L
# output: GPU 0: GeForce GTX 1650 (...)
# Verifying again with nvidia-smi
docker run -it --rm --gpus all ubuntu nvidia-smi
# test a working setup with nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda nvidia-smi
# test a working setup with nvidia-container-runtime
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

# If you get "Error response from daemon: Unknown runtime specified nvidia.",
# search below for "/etc/docker/daemon.json"; it may help.
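
When the "Unknown runtime specified nvidia" error shows up, a first thing to look at is whether the runtime is registered in /etc/docker/daemon.json at all. The sketch below is only a crude text check (it does not parse the JSON), and has_nvidia_runtime is a hypothetical helper name used for this note.

```shell
# Succeeds when the given daemon.json mentions nvidia-container-runtime.
# $1: path to the config file (assumed default: /etc/docker/daemon.json)
has_nvidia_runtime() {
  config="${1:-/etc/docker/daemon.json}"
  [ -f "$config" ] && grep -q 'nvidia-container-runtime' "$config"
}
```

For example: has_nvidia_runtime || echo "nvidia runtime is not registered, see the daemon.json snippet further down".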

Install nvidia-docker2

More information (ref)

This package is the only docker-specific package of any of them. It takes the script associated with the nvidia-container-runtime and installs it into docker's /etc/docker/daemon.json file for you. This then allows you to run (for example) docker run --runtime=nvidia ... to automatically add GPU support to your containers. It also installs a wrapper script around the native docker CLI called nvidia-docker which lets you invoke docker without needing to specify --runtime=nvidia every single time. It also lets you set an environment variable on the host (NV_GPU) to specify which GPUs should be injected into a container.

👉 Official installation guide (follow this one for up-to-date instructions).

Command lines (for a quick preview)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L | sudo apt-key add -
curl -s -L$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2

# restart docker
sudo systemctl restart docker
# check version
nvidia-docker version

Difference: nvidia-container-toolkit vs nvidia-container-runtime

👉 What's the difference between the lastest nvidia-docker and nvidia container runtime?

According to that thread, with Docker 19.03+ (check with docker --version), nvidia-container-toolkit is used for --gpus (in docker run ...), while nvidia-container-runtime is used for --runtime=nvidia (which can also be used in a docker-compose file).

However, if you want to use Kubernetes with Docker 19.03, you actually need to continue using nvidia-docker2 because Kubernetes doesn't support passing GPU information down to docker through the --gpus flag yet. It still relies on the nvidia-container-runtime to pass GPU information down the stack via a set of environment variables.
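
The rule of thumb from that thread (Docker 19.03+ can use --gpus, older versions rely on --runtime=nvidia) can be sketched as a small helper. This is only an illustration under that assumption: gpu_flag_for_docker and version_ge are names invented for this note, and the Kubernetes case described above is not covered.

```shell
# Succeeds when dotted version $1 >= $2 (compared numerically, field by field).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Prints the docker run option to use for a given Docker version.
gpu_flag_for_docker() {
  if version_ge "$1" "19.03"; then
    echo "--gpus all"          # handled by nvidia-container-toolkit
  else
    echo "--runtime=nvidia"    # handled by nvidia-container-runtime
  fi
}
```

On a real machine the version could come from docker version --format '{{.Server.Version}}', for example: gpu_flag_for_docker "$(docker version --format '{{.Server.Version}}')".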

👉 Installation Guide - NVIDIA Cloud Native Technologies documentation

Using docker-compose?


# instead of using
docker run \
  --gpus all \
  --name docker_thi_test \
  -v abc:abc \
  -p 8888:8888
# we use this with docker-compose.yml
docker-compose up
# check version of docker-compose
docker-compose --version
# If "version" in docker-compose.yml < 2.3
# Modify: /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
# restart our docker daemon
sudo pkill -SIGHUP dockerd
# If "version" in docker-compose.yml >= 2.3
# docker-compose.yml => able to use "runtime"
version: '2.3' # MUST BE >=2.3 AND <3
services:
  my_service: # hypothetical service name
    ports:
      - "8000:8000"
    runtime: nvidia
    volumes:
      - ./object_detection:/object_detection
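
To decide which of the two cases above applies to a given project, the version rule (>= 2.3 and < 3) can be checked from the shell. A minimal sketch, assuming the file has a top-level "version:" line like the one above; compose_file_version and compose_supports_runtime are hypothetical helper names.

```shell
# Prints the value of the top-level "version:" key of a compose file.
compose_file_version() {
  sed -n "s/^version: *['\"]\{0,1\}\([0-9.]*\).*/\1/p" "$1" | head -n 1
}

# Succeeds when the given file-format version allows the "runtime" key
# according to the rule in this note: 2.3 <= version < 3.
compose_supports_runtime() {
  awk -v v="$1" 'BEGIN { exit !(v + 0 >= 2.3 && v + 0 < 3) }'
}
```

For example: compose_supports_runtime "$(compose_file_version docker-compose.yml)" || echo "set default-runtime in /etc/docker/daemon.json instead".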

👉 Check more in my repo my-dockerfiles on Github.

Run the test:

docker pull tensorflow/tensorflow:latest-gpu-jupyter
mkdir -p ~/Downloads/test/notebooks

Without using docker-compose.yml (tensorflow) (cf. this note for more)

docker run --name docker_thi_test -it --rm -v $(realpath ~/Downloads/test/notebooks):/tf/notebooks -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter

With docker-compose.yml?

# ~/Downloads/test/Dockerfile
FROM tensorflow/tensorflow:latest-gpu-jupyter
# ~/Downloads/test/docker-compose.yml
version: '2'
services:
  jupyter:
    container_name: 'docker_thi_test'
    build: .
    volumes:
      - ./notebooks:/tf/notebooks # notebook directory
    ports:
      - 8888:8888 # exposed port for jupyter
    environment:
      - NVIDIA_VISIBLE_DEVICES=0 # which gpu do you want to use for this container
      - PASSWORD=12345

Then run:

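# --service-ports publishes the ports declared in docker-compose.yml
# (docker-compose run does not map them by default)
docker-compose run --rm --service-ports jupyter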

Check usage of GPU

# Linux only
nvidia-smi
# returns something like this
# |===============================+======================+======================|
# | 0 GeForce GTX 1650 Off | 00000000:01:00.0 Off | N/A |
# | N/A 53C P8 2W / N/A | 3861MiB / 3914MiB | 2% Default |
# | | | N/A |
# +-------------------------------+----------------------+----------------------+

# => 3861MiB / 3914MiB is used!

# +-----------------------------------------------------------------------------+
# | Processes: GPU Memory |
# | GPU PID Type Process name Usage |
# |=============================================================================|
# | 0 3019 C ...e/scarter/anaconda3/envs/tf1/bin/python 3812MiB |
# +-----------------------------------------------------------------------------+

# => Process 3019 is using the GPU
# All processes that use GPU
sudo fuser -v /dev/nvidia*
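
For scripts, reading the table above by eye is not practical. nvidia-smi can print the same memory numbers as CSV (--query-gpu=memory.used,memory.total --format=csv,noheader,nounits), and the sketch below turns each line into a usage percentage. gpu_mem_percent is a hypothetical helper name, and it assumes one "used, total" line per GPU.

```shell
# Reads "used, total" lines (MiB, no units) on stdin and prints the
# integer percentage of memory in use, one line per GPU.
gpu_mem_percent() {
  awk -F', *' 'NF >= 2 && $2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}
```

Usage: nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | gpu_mem_percent. With the numbers from the table above (3861 and 3914), this prints 98.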

Kill process

# Kill a single process
sudo kill -9 3019
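
Before killing anything, it helps to list which PIDs actually hold a lot of GPU memory. The sketch below filters the CSV output of nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits by a threshold. gpu_pids_over is a hypothetical helper name, and the "pid, MiB" line format is an assumption about that output.

```shell
# Reads "pid, used_memory" lines (MiB, no units) on stdin and prints the
# PIDs whose GPU memory usage is at least $1 MiB.
gpu_pids_over() {
  awk -F', *' -v min="$1" 'NF >= 2 && $2 + 0 >= min + 0 { print $1 }'
}
```

Usage: nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits | gpu_pids_over 1000. For the process shown in the table above (PID 3019 using 3812MiB), this prints 3019. Check the list before passing any of these PIDs to kill.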

Reset GPU

# all
sudo nvidia-smi --gpu-reset
# single
sudo nvidia-smi --gpu-reset -i 0

Errors with GPU

# Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
# Function call stack:
# train_function

Check this answer as a reference!

Problems with pytorch versions: check this.

RuntimeError: cuda runtime error (804) : forward compatibility was attempted on non supported HW at /pytorch/aten/src/THC/THCGeneral.cpp:47 (possibly after a system update that included nvidia-cli) => Same problem as the one below; you need to restart the computer.

nvidia-smi: Failed to initialize NVML: Driver/library version mismatch.

This thread: just restart the computer.

Make NVIDIA work in docker (Linux)

This section still works (as of 26-Oct-2020), but newer methods have superseded it.

Idea: use the NVIDIA driver of the base machine; don't install anything in docker!

Detail of steps

  1. First, make sure your base machine has an NVIDIA driver.

    # list all gpus
    lspci -nn | grep '\[03'

    # check nvidia & cuda versions
    nvidia-smi
  2. Install nvidia-container-runtime

    curl -s -L | sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

    curl -s -L$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list

    sudo apt-get update

    sudo apt-get install nvidia-container-runtime
  3. Note that we cannot use docker-compose.yml in this case!

  4. Create an image img_datas with the following Dockerfile:

    FROM nvidia/cuda:10.2-base

    RUN apt-get update && \
    apt-get -y upgrade && \
    apt-get install -y python3-pip python3-dev locales git

    # install dependencies
    COPY requirements.txt requirements.txt
    RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install -r requirements.txt

    COPY . .

    # default command
    CMD [ "jupyter", "lab", "--no-browser", "--allow-root", "--ip=" ]
  5. Create a container:

    docker run --name docker_thi --gpus all -v /home/thi/folder_1/:/srv/folder_1/ -v /home/thi/folder_1/git/:/srv/folder_2 -dp 8888:8888 -w="/srv" -it img_datas

    # -v: volumes
    # -w: working dir
    # --gpus all: using all gpus on base machine

This article is also very interesting and helpful in some cases.


  1. Difference between base, runtime and devel in Dockerfile of CUDA.
  2. Dockerfile on Github of Tensorflow.