A notable point in this post is the installation of TensorFlow 2.9.1 (TF) with CUDA 11.3, which is not officially supported!
Since versions really matter in this post, keep in mind that everything I write applies to the time of writing!
I want to create a Dockerfile based on a machine with the following specifications,
- My computer: Dell XPS 7590 (Intel i7 9750H/2.6GHz, GeForce GTX 1650 Mobile, RAM 32GB, SSD 1TB).
- OS: Pop!_OS 20.04 LTS (a distribution based upon Ubuntu 20.04).
- Nvidia driver (on the physical machine): 510.73.05
- CUDA (on the physical machine): 10.1
- Docker engine: 20.10.17
- Python: 3.9.7
With this Dockerfile we can create a container that supports,
- TensorFlow: 2.9.1.
- PyTorch: 1.12.1+cu113
- CUDA: 11.3
- cuDNN: 8
- Python: 3.8.10
- OS: Ubuntu 20.04.5 LTS (Focal Fossa)
- Jupyter notebook is installed and automatically runs as an entrypoint.
- OpenSSH support (for accessing the container via SSH)
The final Dockerfile on Github.
Yes, you don't have to read the other sections; this one alone is enough to get everything running!
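As a quick preview, the overall shape of the final Dockerfile is roughly the following (a condensed sketch, not the exact file on Github; the apt package list and entrypoint options are simplified illustrations):

```dockerfile
# Base image with CUDA 11.3 + cuDNN 8 on Ubuntu 20.04
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Avoid interactive prompts during apt installs
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    python3 python3-pip openssh-server zsh git curl \
    && rm -rf /var/lib/apt/lists/*

# PyTorch 1.12.1 built against CUDA 11.3
RUN pip3 install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# TensorFlow 2.9.1 (works on top of the CUDA 11.3 / cuDNN 8 libraries above)
RUN pip3 install tensorflow==2.9.1 jupyter

# Jupyter notebook runs automatically as the entrypoint
EXPOSE 22 8888
ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--allow-root", "--no-browser"]
```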
I want to try detectron2 from Meta AI, a library that provides state-of-the-art detection and segmentation algorithms. This library requires `cuda=11.3` for a smooth installation. Therefore, I need a Docker container with `cuda=11.3` + TensorFlow + PyTorch for this task. However, the latest version of `cuda` that is officially supported by TF is `cuda=11.2`. If you do not necessarily need a special version of `cuda` that TF may not support, you can simply use TF's official Docker images. So, if you find that the versions of TF, CUDA and cuDNN match, just use one of those images as a base Docker image or follow the official instructions from TensorFlow. This post is a general idea of how we can install TF with other versions of CUDA and cuDNN.
The final workflow
Make sure the GPU driver is successfully installed on your machine and read this note to allow Docker Engine to communicate with the physical GPU.
Basically, the following codes should work.
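For example (the image tag here is only an illustration), the following command should print the GPU table from inside a throwaway container if everything is wired up correctly:

```shell
# Run nvidia-smi inside a disposable CUDA container; if the driver and
# nvidia-container-toolkit are set up correctly, your GPU should be listed.
docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
```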
If `docker -v` gives a version earlier than 19.03, you have to use `--runtime=nvidia` instead of `--gpus all`.
Most problems come from TF; it is imperative to match the versions of TF, CUDA and cuDNN. You can check this link for the corresponding versions of TF, cuDNN and CUDA (we call it "list-1"). A natural way to choose a base image is to start from a TF Docker image and then install PyTorch separately. This is a Dockerfile I built with this idea (tf-2.8.1-gpu, torch-1.12.1+cu113). However, as you can see in "list-1", the official TF builds only support CUDA 11.2 (or 11.0 or 10.1). If you want to install TF with CUDA 11.3, it's impossible if you start from the official build.
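That tf-base-image idea looks roughly like this (a sketch; the version pins follow the text, everything else is illustrative):

```dockerfile
# Start from the official TF GPU image (which ships CUDA 11.2, the version TF supports)
FROM tensorflow/tensorflow:2.8.1-gpu

# Install PyTorch separately, built against CUDA 11.3
RUN pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```

This works because the `+cu113` PyTorch wheels bundle their own CUDA runtime libraries, so they can coexist with the CUDA 11.2 libraries inside the TF image.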
Based on the official tutorial for installing TF with `pip`, to install TF 2.9.1 we need `cudatoolkit=11.2` and `cudnn=8.1.0`. What if we manage to have `cuda=11.3` and `cudnn=8` first, and then look for a way to install TF 2.9? From the NVIDIA public hub repository, I found this image (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04) which already has `cuda=11.3` and `cudnn=8` installed.
What is the difference between `base`, `runtime` and `devel` in the names of images from the NVIDIA public hub? Check this. I create a very simple Dockerfile starting from this base image to check whether we can install `torch=1.12.1+cu113` and `tensorflow=2.9.1`.
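That check can be sketched as a minimal Dockerfile like this (the `python3-pip` install is an assumption, since the NVIDIA image does not ship pip):

```dockerfile
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3 python3-pip

# PyTorch 1.12.1 built against CUDA 11.3
RUN pip3 install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```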
Then I try to install `tensorflow` following step 5 of this official tutorial, and YES! It's that simple!
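In the Dockerfile, that installation step boils down to something like:

```dockerfile
# TF 2.9.1 picks up the CUDA/cuDNN shared libraries already present in the base image
RUN pip3 install tensorflow==2.9.1
```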
Let us check if it works (step 6 in the tutorial).
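Inside a container started with `--gpus all`, the check from the tutorial is simply:

```python
import tensorflow as tf

# Should print a non-empty list such as
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
print(tf.config.list_physical_devices("GPU"))
```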
Voilà, it works like clockwork!
Basically, we are done with the main part of this post. This section mainly explains why I also include the Zsh installation and OpenSSH setup in the Dockerfile.
All commonly used packages (with their corresponding versions) are stored in the `requirements.txt` file. We need to copy this file into the container and start the installation process. Install PyTorch by following the official instructions.
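Those two steps might look like this in the Dockerfile (a sketch; the `/tmp` path is an assumption):

```dockerfile
# Copy the pinned package list into the image and install everything
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt

# PyTorch 1.12.1 built against CUDA 11.3, per the official instructions
RUN pip3 install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
```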
Note: Terminal + Zsh.
Add the following lines to install and set up Zsh. Why do we need Zsh instead of the default `bash`? Because we want a better-looking command line, not just white text. Another problem sometimes arises from the "backspace" key: when you type something and use backspace to correct a mistake, the previous character is not removed as it should be and other characters appear instead. This problem was mentioned before, at the end of the section "SSH to User-managed notebook" in Google Vertex AI. After Zsh is installed, you can set it as the default shell.
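The Zsh setup lines can be sketched as follows (using Oh My Zsh's unattended install as an illustration; your setup may differ):

```dockerfile
# Install Zsh and make it the default shell for root
RUN apt-get update && apt-get install -y zsh \
    && chsh -s $(which zsh)

# Optional: Oh My Zsh for a nicer prompt (unattended install)
RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended
```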
One note: when adding an alias, be sure to add it to both `.bashrc` and `.zshrc` (e.g. put `alias ll='ls -lah'` in both files).

Note: Local connection between 2 computers
If you want to access a running container via SSH, you must install and run OpenSSH in that container and expose port `22`. Don't forget to publish port `22` when you create a new container. One more step: if a Jupyter notebook is already running (on port `8888`) in your container, you need to run the following code to get the SSH server running. Now, if you want to access this container via SSH,
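a sketch of those commands, assuming the container's port `22` was published to a hypothetical host port `2222`:

```shell
# Inside the container: start the SSH daemon next to the running notebook
service ssh start

# From the host: connect to the published port
ssh root@localhost -p 2222
```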
let's use the password `qwerty` (it's set in the above code, in the line `RUN echo 'root:qwerty' | chpasswd`)!

It's great if our image has a running Jupyter notebook server as an entrypoint, so every time we create a new container there's already a Jupyter notebook running and we can just use it.
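In the Dockerfile, that entrypoint can be sketched as:

```dockerfile
EXPOSE 8888
# Start a Jupyter notebook server every time a container is created from this image
ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--allow-root", "--no-browser"]
```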
Don't forget to publish port `8888` when you create a new container. Then go to http://localhost:8888 to open the notebook.
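For example, with a hypothetical image name `my-tf-pt`, creating the container might look like:

```shell
# Publish the notebook port (and port 22 for SSH, mapped to 2222 on the host)
docker run -d --gpus all -p 8888:8888 -p 2222:22 --name my-container my-tf-pt
```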