A notable point in this post is the installation of TensorFlow 2.9.1 (TF) with CUDA 11.3, which is not officially supported!
Since versions matter a great deal for everything mentioned in this post, keep in mind that what I write here applies at the time of writing!
I want to create a Dockerfile based on a machine with the following specifications,
- My computer: Dell XPS 7590 (Intel i7 9750H/2.6GHz, GeForce GTX 1650 Mobile, RAM 32GB, SSD 1TB).
- OS: Pop!_OS 20.04 LTS (a distribution based upon Ubuntu 20.04).
- Nvidia driver (on the physical machine): 510.73.05
- CUDA (on the physical machine): 10.1
- Docker engine: 20.10.17
- Python: 3.9.7
From this Dockerfile we can create a container that supports,
- TensorFlow: 2.9.1.
- PyTorch: 1.12.1+cu113
- CUDA: 11.3
- cuDNN: 8
- Python: 3.8.10
- OS: Ubuntu 20.04.5 LTS (Focal Fossa)
- Jupyter notebook is installed and automatically runs as an entrypoint.
- OpenSSH support (for accessing the container via SSH)
Yes, you don't have to read the other sections; this one is enough to get everything running!
I want to try detectron2 from Meta AI, a library that provides state-of-the-art detection and segmentation algorithms. This library requires `cuda=11.3` for a smooth installation. Therefore, I need a Docker container with `cuda=11.3` + TensorFlow + PyTorch for this task. However, the latest version of
`cuda` that is officially supported by TF is `11.2`.
If you do not necessarily need a special version of `cuda` that TF may not support, you can simply use TF's official Docker images.
So, if you find that the versions of TF, CUDA and cuDNN match, just use one of them as a base Docker image or follow the official instructions from TensorFlow. This post gives a general idea of how to install TF with other versions of CUDA and cuDNN.
Make sure the GPU driver is successfully installed on your machine, and read this note to allow Docker Engine to communicate with the physical GPU.
Basically, the following commands should work.
```bash
# Check if a GPU is available
lspci | grep -i nvidia

# Check NVIDIA driver info
nvidia-smi

# Check the version of cuda
nvcc --version

# Verify your nvidia-docker installation
docker run --gpus all --rm nvidia/cuda nvidia-smi
```
If `docker -v` gives a version earlier than 19.03, you have to use
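The version check can also be scripted. The sketch below is only an illustration: the helper name `supports_gpus_flag` is hypothetical, and it assumes `docker -v` prints the usual `Docker version X.Y.Z, build ...` string. It relies on the `--gpus` flag having been introduced in Docker 19.03.

```python
import re
import shutil
import subprocess


def supports_gpus_flag(version_output: str) -> bool:
    """Return True if the `docker -v` output reports Docker 19.03 or newer."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"cannot find a version in: {version_output!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (19, 3)


if __name__ == "__main__":
    if shutil.which("docker") is None:
        print("docker is not on PATH")
    else:
        result = subprocess.run(
            ["docker", "-v"], capture_output=True, text=True, check=False
        )
        try:
            print(result.stdout.strip(), "->", supports_gpus_flag(result.stdout))
        except ValueError as error:
            print(error)
```

Comparing `(major, minor)` tuples avoids the pitfall of comparing version strings, where `"19.03"` and `"19.3"` would not be treated as equal.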
Most problems come from TF, so it is imperative to match the versions of TF, CUDA and cuDNN. You can check this link for the corresponding versions of TF, cuDNN and CUDA (we call it "list-1"). A natural way to choose a base image is to start from a TF Docker image and then install PyTorch (PT) separately. This is a Dockerfile I built with this idea (tf-2.8.1-gpu, torch-1.12.1+cu113). However, as you can see in "list-1", the official TF builds only support CUDA 11.2 (or 11.0 or 10.1). If you want to install TF with CUDA 11.3, it's impossible if you start from the official build.
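To make the mismatch concrete, here is a small sketch of such a lookup. The dictionary holds a few rows copied by hand from TensorFlow's tested build configurations for Linux GPU builds; it is not an official API, and the names `TESTED_CONFIGS` and `is_tested_combo` are made up for this illustration.

```python
# A few rows of TensorFlow's "tested build configurations" table
# (Linux, GPU), copied by hand. This is an illustration, not a full list.
TESTED_CONFIGS = {
    "2.9": {"cuda": "11.2", "cudnn": "8.1"},
    "2.8": {"cuda": "11.2", "cudnn": "8.1"},
    "2.4": {"cuda": "11.0", "cudnn": "8.0"},
    "2.3": {"cuda": "10.1", "cudnn": "7.6"},
}


def is_tested_combo(tf_version: str, cuda_version: str) -> bool:
    """Return True if this TF release was officially tested with this CUDA."""
    # Reduce "2.9.1" to "2.9" because the table is keyed by minor release.
    minor_release = ".".join(tf_version.split(".")[:2])
    config = TESTED_CONFIGS.get(minor_release)
    return config is not None and config["cuda"] == cuda_version


if __name__ == "__main__":
    print(is_tested_combo("2.9.1", "11.2"))  # the official combination
    print(is_tested_combo("2.9.1", "11.3"))  # the combination used in this post
```

A `False` result only means the combination is not in the tested table; as the rest of this post shows, it may still work in practice.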
Based on the official tutorial on installing TF with `pip`, to install TF 2.9.1 we need `cudnn=8.1.0`. What if we manage to get `cudnn=8` first and then look for a way to install TF 2.9?
From the NVIDIA public hub repository, I found this image (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04) which already has CUDA 11.3 and cuDNN 8 installed.
I create a very simple Dockerfile starting from this base image to check if we can install TF,
```bash
# Create the image "img_sample" from the file "Dockerfile"
docker build -t img_sample . -f Dockerfile

# Create and run a container from the image "img_sample"
# (-dit keeps it running in the background so that we can exec into it)
docker run -dit --name container_sample --gpus all -w="/working" img_sample bash

# Enter the "container_sample" container
docker exec -it container_sample bash

# In the container "container_sample"

# Check if the NVIDIA Driver is recognized
nvidia-smi

# Check the version of CUDA
nvcc --version

# Check the version of cuDNN
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
```
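If you prefer a single version string over the raw `grep` output, the header can be parsed with a few lines of Python. This is a minimal sketch: the function name `parse_cudnn_version` is hypothetical, and the sample header text in the demo is made up, so the printed version is not a measurement from the image.

```python
import re


def parse_cudnn_version(header_text: str) -> str:
    """Build "major.minor.patch" from the contents of cudnn_version.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 8".
        match = re.search(
            rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError(f"{name} not found in header")
        parts.append(match.group(1))
    return ".".join(parts)


if __name__ == "__main__":
    # Made-up sample; inside the container, read the real file instead:
    # header = open("/usr/include/cudnn_version.h").read()
    sample = (
        "#define CUDNN_MAJOR 8\n"
        "#define CUDNN_MINOR 2\n"
        "#define CUDNN_PATCHLEVEL 0\n"
    )
    print(parse_cudnn_version(sample))
```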
Then I try to install `tensorflow` following step 5 of this official tutorial,
```bash
pip install --upgrade pip
pip install tensorflow==2.9.1
```
YES! It's that simple!
Let us check whether it works (step 6 in the tutorial).
```bash
# Verify the CPU setup
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
# A tensor should be returned, something like
# tf.Tensor(-686.383, shape=(), dtype=float32)

# Verify the GPU setup
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# A list of GPU devices should be returned, something like
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
🎉 Voilà, it works like clockwork!
Basically, we are done with the main part of this post. This section mainly explains why I also include the Zsh installation and OpenSSH setup in the Dockerfile.
All commonly used packages (with their corresponding versions) are stored in the `requirements.txt` file. We need to copy this file into the container and start the installation process,
```dockerfile
COPY requirements.txt requirements.txt
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install -r requirements.txt
COPY . .
```
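After the build, it can be handy to confirm that the pinned versions really ended up in the image. The following is a rough sketch, assuming the file only uses simple `name==version` pins; the function names `parse_pins` and `find_mismatches` are hypothetical, and extras or environment markers are not handled.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Collect "name==version" pins, ignoring comments and unpinned lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Map each package to (wanted, installed) when the two differ."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    try:
        with open("requirements.txt") as handle:
            print(find_mismatches(parse_pins(handle.read())))
    except FileNotFoundError:
        print("requirements.txt not found in the current directory")
```

An empty dictionary means every pinned package is installed at the requested version.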
Install PyTorch by following the official instructions
```dockerfile
RUN pip3 install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
```
👉 Note: Terminal + ZSH.
Add the following lines to install and set up Zsh. Why do we need Zsh instead of the default `bash`? Because we want a better-looking command line, not just plain white text. Another problem sometimes arises with the "backspace" key: when you type something and press backspace to correct a mistake, the previous character is not removed as it should be, and other characters appear instead. This problem was mentioned once before at the end of the section “SSH to User-managed notebook” in Google Vertex AI.
```dockerfile
# Dockerfile
RUN apt-get update && apt-get install -y zsh curl
# ENV (unlike a variable set inside RUN) persists to later layers
ENV PATH="$PATH:/usr/bin/zsh"
RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/robbyrussell/oh-my-zsh/master/tools/install.sh)"
```
After Zsh is installed,
```bash
# Instead of using this
docker exec -it docker_name bash

# Use this
docker exec -it docker_name zsh
```
One note: when adding an alias, be sure to add it to both `.bashrc` and `.zshrc`,
```dockerfile
# Dockerfile
RUN echo 'alias python="python3"' >> ~/.bashrc
RUN echo 'alias python="python3"' >> ~/.zshrc
```
If you want to access a running container via SSH, you must install and run OpenSSH in that container and expose the port `22`.
```dockerfile
# Dockerfile
RUN apt-get update && apt-get install -y openssh-server
RUN mkdir /var/run/sshd
RUN echo 'root:qwerty' | chpasswd
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
EXPOSE 22
```
Don't forget to publish port `22` when you create a new container,
```bash
# Add any other options before the image name
docker run --name container_name --gpus all \
    -dp 6789:22 \
    -it image_name
```
One more step: if a Jupyter notebook is running (at port `8888`) in your container, you need to run the following command to get the SSH server running,
```bash
# Note: $(which sshd) is expanded by the host shell, not inside the container
docker exec docker_ai $(which sshd) -Ddp 22
```
Now if you want to access this container via SSH,
```bash
ssh -p 6789 root@localhost
```
Use the password `qwerty` (it's set in the code above, at the line `RUN echo 'root:qwerty' | chpasswd`)!
It's great if our image has a running Jupyter notebook server as an entry point, so that every time we create a new container, a Jupyter notebook is already running and we can just use it.
```dockerfile
# Dockerfile
RUN python3 -m pip install jupyterlab

CMD /bin/bash -c 'jupyter lab --no-browser --allow-root --ip=0.0.0.0 --NotebookApp.token="" --NotebookApp.password=""'
```
Don't forget to publish port `8888` when you create a new container,
```bash
# Add any other options before the image name
docker run --name container_name --gpus all \
    -dp 8888:8888 \
    -it image_name
```
Go to http://localhost:8888 to open the notebook.
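If a page or an SSH session does not respond, a first thing to rule out is the port mapping itself. The sketch below is an assumption-laden helper, not part of the Dockerfile: `port_is_open` is a made-up name, and the host ports 6789 (SSH) and 8888 (Jupyter) are the ones used in the `docker run` examples of this post.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    for label, port in (("SSH", 6789), ("Jupyter", 8888)):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"{label} on localhost:{port} is {state}")
```

An open port only shows that something accepts connections on the host side; it does not prove that `sshd` or Jupyter inside the container is healthy.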