A notable point in this post is the installation of TensorFlow 2.9.1 (TF) with CUDA 11.3, which is not officially supported!
Since versions really matter in this post, keep in mind that everything I write applies to the time of writing!
I want to create a Dockerfile based on a machine with the following specifications,
- My computer: Dell XPS 7590 (Intel i7 9750H/2.6GHz, GeForce GTX 1650 Mobile, RAM 32GB, SSD 1TB).
- OS: Pop!_OS 20.04 LTS (a distribution based upon Ubuntu 20.04).
- Nvidia driver (on the physical machine): 510.73.05
- CUDA (on the physical machine): 10.1
- Docker engine: 20.10.17
- Python: 3.9.7
With this Dockerfile we can create a container that supports,
- TensorFlow: 2.9.1.
- PyTorch: 1.12.1+cu113
- CUDA: 11.3
- cuDNN: 8
- Python: 3.8.10
- OS: Ubuntu 20.04.5 LTS (Focal Fossa)
- Jupyter notebook is installed and automatically runs as an entrypoint.
- OpenSSH support (for accessing the container via SSH)
The final Dockerfile on Github.
Yes, you don't have to read the other sections; this one alone is enough to get everything running!
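As a quick preview, the overall shape of the final Dockerfile is roughly the following (a condensed sketch, not the exact file on Github; the apt package list and entrypoint options are simplified illustrations):

```dockerfile
# Base image with CUDA 11.3 + cuDNN 8 on Ubuntu 20.04
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Avoid interactive prompts during apt installs
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    python3 python3-pip openssh-server zsh git curl \
    && rm -rf /var/lib/apt/lists/*

# PyTorch 1.12.1 built against CUDA 11.3
RUN pip3 install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# TensorFlow 2.9.1 (works on top of the CUDA 11.3 / cuDNN 8 libraries above)
RUN pip3 install tensorflow==2.9.1 jupyter

# Jupyter notebook runs automatically as the entrypoint
EXPOSE 22 8888
ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--allow-root", "--no-browser"]
```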
I want to try detectron2 from Meta AI, a library that provides state-of-the-art detection and segmentation algorithms. This library requires `cuda=11.3` for a smooth installation. Therefore, I need a Docker container with `cuda=11.3` + TensorFlow + PyTorch for this task. However, the latest version of `cuda` that is officially supported by TF is `cuda=11.2`. If you do not necessarily need a special version of `cuda` that TF may not support, you can simply use TF's official Docker images. So, if you find that the versions of TF, CUDA and cuDNN match, just use one of those images as a base Docker image or follow the official instructions from TensorFlow. This post is a general idea of how we can install TF with other versions of CUDA and cuDNN.
The final workflow
Make sure the GPU driver is successfully installed on your machine and read this note to allow Docker Engine to communicate with the physical GPU.
Basically, the following codes should work.
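For example (the image tag here is only an illustration), the following command should print the GPU table from inside a throwaway container if everything is wired up correctly:

```shell
# Run nvidia-smi inside a disposable CUDA container; if the driver and
# nvidia-container-toolkit are set up correctly, your GPU should be listed.
docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
```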
If `docker -v` gives a version earlier than 19.03, you have to use `--runtime=nvidia` instead of `--gpus all`.
Most problems come from TF; it is imperative to match the versions of TF, CUDA and cuDNN. You can check this link for the corresponding versions of TF, cuDNN and CUDA (we call it "list-1"). A natural way to choose a base image is to start from a TF Docker image and then install PyTorch separately. This is a Dockerfile I built with this idea (tf-2.8.1-gpu, torch-1.12.1+cu113). However, as you can see in "list-1", the official TF builds only support CUDA 11.2 (or 11.0 or 10.1). If you want to install TF with CUDA 11.3, it's impossible if you start from the official build.
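That tf-base-image idea looks roughly like this (a sketch; the version pins follow the text, everything else is illustrative):

```dockerfile
# Start from the official TF GPU image (which ships CUDA 11.2, the version TF supports)
FROM tensorflow/tensorflow:2.8.1-gpu

# Install PyTorch separately, built against CUDA 11.3
RUN pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```

This works because the `+cu113` PyTorch wheels bundle their own CUDA runtime libraries, so they can coexist with the CUDA 11.2 libraries inside the TF image.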
Based on the official tutorial for installing TF with `pip`, to install TF 2.9.1 we need `cudatoolkit=11.2` and `cudnn=8.1.0`. What if we manage to have `cuda=11.3` and `cudnn=8` first, and then look for a way to install TF 2.9? From the NVIDIA public hub repository, I found this image (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04) which already has `cuda=11.3` and `cudnn=8` installed.
What is the difference between `base`, `runtime` and `devel` in the names of images from the NVIDIA public hub? Check this. I create a very simple Dockerfile starting from this base image to check whether we can install `torch=1.12.1+cu113` and `tensorflow=2.9.1`.
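That check can be sketched as a minimal Dockerfile like this (the `python3-pip` install is an assumption, since the NVIDIA image does not ship pip):

```dockerfile
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3 python3-pip

# PyTorch 1.12.1 built against CUDA 11.3
RUN pip3 install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```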
Then I try to install `tensorflow` following step 5 of this official tutorial, and YES! It's that simple!
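In the Dockerfile, that installation step boils down to something like:

```dockerfile
# TF 2.9.1 picks up the CUDA/cuDNN shared libraries already present in the base image
RUN pip3 install tensorflow==2.9.1
```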
Let us check if it works (step 6 in the tutorial).
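Inside a container started with `--gpus all`, the check from the tutorial is simply:

```python
import tensorflow as tf

# Should print a non-empty list such as
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
print(tf.config.list_physical_devices("GPU"))
```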
Voilà, it works like clockwork!
Basically, we are done with the main part of this post. This section mainly explains why I also include the Zsh installation and OpenSSH setup in the Dockerfile.
All commonly used packages (with their corresponding versions) are stored in the `requirements.txt` file. We need to copy this file into the container and start the installation process. Install PyTorch by following the official instructions.
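Those two steps might look like this in the Dockerfile (a sketch; the `/tmp` path is an assumption):

```dockerfile
# Copy the pinned package list into the image and install everything
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt

# PyTorch 1.12.1 built against CUDA 11.3, per the official instructions
RUN pip3 install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```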
Note: Terminal + Zsh.
Add the following lines to install and set up Zsh. Why do we need Zsh instead of the default `bash`? Because we want a better-looking command line, not just white text. Another problem sometimes arises from the "backspace" key: when you type something and use backspace to correct a mistake, the previous character is not removed as it should be and other characters appear instead. This problem was mentioned before, at the end of the section "SSH to User-managed notebook" in Google Vertex AI. After Zsh is installed, you can set it as the default shell.
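The Zsh setup lines can be sketched as follows (using Oh My Zsh's unattended install as an illustration; your setup may differ):

```dockerfile
# Install Zsh and make it the default shell for root
RUN apt-get update && apt-get install -y zsh \
    && chsh -s $(which zsh)

# Optional: Oh My Zsh for a nicer prompt (unattended install)
RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended
```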
One note: when adding an alias, be sure to add it to both `.bashrc` and `.zshrc` (e.g. put `alias ll='ls -lah'` in both files).

Note: Local connection between 2 computers
If you want to access a running container via SSH, you must install and run OpenSSH in that container and expose port `22`. Don't forget to publish port `22` when you create a new container. One more step: if a Jupyter notebook is already running (on port `8888`) in your container, you need to run the following code to get the SSH server running. Now, if you want to access this container via SSH,
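a sketch of those commands, assuming the container's port `22` was published to a hypothetical host port `2222`:

```shell
# Inside the container: start the SSH daemon next to the running notebook
service ssh start

# From the host: connect to the published port
ssh root@localhost -p 2222
```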
let's use the password `qwerty` (it's set in the above code, in the line `RUN echo 'root:qwerty' | chpasswd`)!

It's great if our image has a running Jupyter notebook server as an entrypoint, so every time we create a new container there's already a Jupyter notebook running and we can just use it.
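In the Dockerfile, that entrypoint can be sketched as:

```dockerfile
EXPOSE 8888
# Start a Jupyter notebook server every time a container is created from this image
ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--allow-root", "--no-browser"]
```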
Don't forget to publish port `8888` when you create a new container. Then go to http://localhost:8888 to open the notebook.
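For example, with a hypothetical image name `my-tf-pt`, creating the container might look like:

```shell
# Publish the notebook port (and port 22 for SSH, mapped to 2222 on the host)
docker run -d --gpus all -p 8888:8888 -p 2222:22 --name my-container my-tf-pt
```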