Use case: TensorFlow + Python venv + GPU job

This example installs a Python venv containing TensorFlow in the user's home directory, then uses that venv in GPU jobs on the cluster.

Building the `venv`

First, load the required modules:

[user@hpclogin01 ~]$ module load gcc/8.1.0
[user@hpclogin01 ~]$ module load cuda/10.0.130

Create the venv in the ~/tflow directory and activate it:

[user@hpclogin01 ~]$ virtualenv --python=python3.6 tflow
[user@hpclogin01 ~]$ cd tflow/
[user@hpclogin01 tflow]$ source bin/activate
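
To confirm from inside Python that the interpreter really comes from the activated venv, a minimal sketch such as the following may help. The helper name `in_virtualenv` is hypothetical. The check assumes that `sys.prefix` differs from `sys.base_prefix` inside a venv, or that `sys.real_prefix` is set, which older virtualenv releases do.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and recent virtualenv releases make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a venv:", in_virtualenv())
```

Run with the venv activated, the interpreter path should point into ~/tflow/bin.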

Install numpy and tensorflow-gpu in the venv:

(tflow) [user@hpclogin01 tflow]$ pip install numpy==1.19.5 tensorflow-gpu==1.14
Collecting numpy==1.19.5
[...]
Collecting tensorflow-gpu==1.14
[...]
Successfully installed absl-py-0.12.0 astor-0.8.1 cached-property-1.5.2 gast-0.4.0 google-pasta-0.2.0 grpcio-1.37.1 h5py-3.1.0 importlib-metadata-4.0.1 keras-applications-1.0.8 keras-preprocessing-1.1.2 markdown-3.3.4 numpy-1.19.5 protobuf-3.16.0 setuptools-56.1.0 six-1.16.0 tensorboard-1.14.0 tensorflow-estimator-1.14.0 tensorflow-gpu-1.14.0 termcolor-1.1.0 typing-extensions-3.10.0.0 werkzeug-1.0.1 wrapt-1.12.1 zipp-3.4.1
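
As a quick sanity check after the installation, the versions that pip reported can be read back from Python. This is only a sketch: the helper name `installed_version` is hypothetical, and on Python 3.6 it assumes the `importlib-metadata` backport is present (it appears in the pip output above).

```python
try:
    from importlib.metadata import PackageNotFoundError, version
except ImportError:
    # Python < 3.8: fall back to the importlib-metadata backport.
    from importlib_metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the pip command above.
    for name in ("numpy", "tensorflow-gpu"):
        print(name, installed_version(name) or "not installed")
```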

Usage

Interactive job for testing

Next, request an interactive job on a GPU node to test the installation:

[user@hpclogin01 ~]$ srun --partition=gpu --gres=gpu:1 --pty bash

The scheduler allocated 1 GPU on node hpcdgx01. Reload the modules and reactivate the environment:

[user@hpcdgx01 ~]$ cd tflow/
[user@hpcdgx01 tflow]$ module load gcc/8.1.0
[user@hpcdgx01 tflow]$ module load cuda/10.0.130
[user@hpcdgx01 tflow]$ source bin/activate

Run a Python command asking TensorFlow to list the devices it can see:

(tflow) [user@hpcdgx01 tflow]$ python -c "from tensorflow.python.client import device_lib; device_lib.list_local_devices()"
2021-05-07 11:13:29.973350: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-05-07 11:13:29.995140: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2021-05-07 11:13:31.130838: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x362be70 executing computations on platform CUDA. Devices:
2021-05-07 11:13:31.130901: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
[...]
2021-05-07 11:13:31.136466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:07:00.0
[...]
2021-05-07 11:13:31.324371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2021-05-07 11:13:31.327674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 11:13:31.327696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2021-05-07 11:13:31.327709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2021-05-07 11:13:31.332735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 30591 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:07:00.0, compute capability: 7.0)

TensorFlow lists the GPU as an accessible device without any error, so the setup appears to work.
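
Rather than reading the log lines by eye, the same check can be scripted. The sketch below filters the result of `device_lib.list_local_devices()` on its `device_type` field. The helper name `gpu_device_names` is hypothetical, and the script only prints a hint when TensorFlow cannot be imported.

```python
def gpu_device_names(devices):
    """Return the names of the entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not importable; activate the venv first.")
    else:
        names = gpu_device_names(device_lib.list_local_devices())
        if names:
            print("GPU devices:", ", ".join(names))
        else:
            print("No GPU device visible to TensorFlow.")
```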

Standard production job

For a job that is not just a test, use a script submitted with sbatch rather than an interactive job; for example (file job01.sh):

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --partition=gpu
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user.email@uca.fr
#SBATCH --time=08:00:00

cd ~/tflow/
module load gcc/8.1.0
module load cuda/10.0.130
source bin/activate

python ~/monSuperCodePython.py

Submit the job with: sbatch job01.sh
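
At the start of a script such as monSuperCodePython.py, it can be useful to log what the scheduler actually allocated. The following is a minimal sketch, assuming Slurm exports `SLURM_JOB_ID` and `SLURM_CPUS_PER_TASK` in the job environment and that the cluster's GPU configuration sets `CUDA_VISIBLE_DEVICES`. The function name `slurm_summary` is hypothetical.

```python
import os


def slurm_summary(env=None):
    """Collect a few allocation details from the job environment."""
    env = os.environ if env is None else env
    # CUDA_VISIBLE_DEVICES is a comma-separated list of GPU indices (assumed set by Slurm).
    gpus = [g for g in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if g]
    return {
        "job_id": env.get("SLURM_JOB_ID"),
        "cpus_per_task": env.get("SLURM_CPUS_PER_TASK"),
        "visible_gpus": gpus,
    }


if __name__ == "__main__":
    for key, value in slurm_summary().items():
        print(key, "=", value)
```

Outside a Slurm job the values are simply None or an empty list, so the script stays harmless on a login node.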

Building a `venv` with recent packages (Python 3.10, CUDA 12, TensorFlow 2.12, Torch 2.1)

$ cd ~
$ module purge
$ module load gcc/8.1.0
$ module load python/3.10.10
$ module load cuda/12.0.1
$ virtualenv --python=python3.10 newvenv
$ cd newvenv
$ cat > env.sh <<EOF
module purge
module load gcc/8.1.0
module load python/3.10.10
module load cuda/12.0.1
source bin/activate
EOF
$ source env.sh
$ pip3 install tensorflow==2.12.0
$ pip3 install torch==2.1.0 torchvision --pre -f https://download.pytorch.org/whl/nightly/cu121/torch_nightly.html

And an example job to test it:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=2G
#SBATCH --partition=gpu
#SBATCH --time=00:02:00

cd ~/newvenv
source env.sh

python3 -c "import torch; print('torch : is_cuda_available = %s' % torch.cuda.is_available())"

python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"

true
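
The two one-liners in the job above can also be combined into a single script that keeps working when one of the frameworks is missing. This is a sketch only: the function name `gpu_report` is hypothetical, and it uses `torch.cuda.device_count()` and `tf.config.list_physical_devices("GPU")` in place of the `device_lib` call shown above.

```python
def gpu_report():
    """Return the number of GPUs each framework sees, or None if it is not installed."""
    report = {"torch": None, "tensorflow": None}
    try:
        import torch
        # device_count() is only meaningful when CUDA is usable.
        report["torch"] = torch.cuda.device_count() if torch.cuda.is_available() else 0
    except ImportError:
        pass
    try:
        import tensorflow as tf
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for framework, count in gpu_report().items():
        if count is None:
            print(framework, ": not installed")
        else:
            print(framework, ": GPUs visible =", count)
```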