GPU jobs
GPU cards available
Several GPU servers (lapp-wngpu0xx.in2p3.fr) are available with various NVIDIA GPU cards:
Server number | NVIDIA cards per server | Profile | CPUs (vprocs) | RAM (GiB) |
---|---|---|---|---|
002 | 2 x Tesla K80 | Default | 12 (24) | 128 |
003 | 2 x Tesla K80 | Default | 24 (48) | 192 |
004 | 1 x Tesla V100 | Training | 24 (48) | 192 |
005 | 1 x Quadro P6000 | Default | 16 (32) | 64 |
006 | 4 x Tesla T4 | Inference | 32 (64) | 192 |
007 | 3 x Ampere A100 40GB | Training | 16 (32) | 384 |
008 | 3 x Ampere A100 40GB | Training | 16 (32) | 256 |
009 | 1 x Ampere A100 40GB | Training (restricted access to LISTIC laboratory users) | 16 (32) | 256 |
010-11 | 3 x Ampere A100 80GB | Training | 32 (64) | 512 |
012 | 1 x Ampere A100 80GB | Training (restricted access to LAPTh laboratory users) | 32 (64) | 512 |
GPU specifications
Tesla K80 specifications: numbers in brackets refer to the aggregate of the 2 GPUs.
3g.20gb refers to a MIG instance with 3 compute units and 20 GB of memory (see below).
A100 refers to two GPU types, A100 40 GB and A100 80 GB. Users targeting the A100 80 GB should explicitly request the "a100 80gb" GPU type (see below).
Dynamic GPU allocation using partitionable slots
GPUs are treated as job resources managed by HTCondor. To ensure dynamic resource allocation, partitionable slots are used on the lapp-wngpu0xx machines.
On each GPU worker, the resources reserved for GPU jobs are assigned to one partitionable slot from which dynamic slots are created at claim time and assigned the requested resources. When dynamic slots are unclaimed, their resources are merged back into the parent partitionable slot.
GPU partitionable slot: 100% of the GPU resources, plus a certain amount of CPUs and memory, are reserved for GPU jobs.
The table below shows how the GPU partitionable slot is defined on each machine (a quick way to inspect these slots is shown after the table). Keep in mind that requesting the total number of CPUs or the total memory defined by the GPU partitionable slot prevents any other GPU job from being scheduled on that server.
Server | GPU(s) | Partitionable slot configuration |
---|---|---|
gpu002/003 | 2 x Tesla K80 | cpus=4, gpus=100%, memory=16 GiB |
gpu004 | 1 x Tesla V100 | cpus=8, gpus=100%, memory=64 GiB |
gpu005 | 1 x Quadro P6000 | cpus=4, gpus=100%, memory=16 GiB |
gpu006 | 4 x Tesla T4 | cpus=16, gpus=100%, memory=64 GiB |
gpu007 | 3 x Ampere A100 40GB | cpus=24, gpus=100%, memory=240 GiB |
gpu008 | 3 x Ampere A100 40GB | cpus=24, gpus=100%, memory=240 GiB |
gpu009 | 1 x Ampere A100 40GB | cpus=16, gpus=100%, memory=128 GiB (restricted to LISTIC) |
gpu010-11 | 3 x Ampere A100 80GB | cpus=24, gpus=100%, memory=576 GiB |
gpu012 | 1 x Ampere A100 80GB | cpus=32, gpus=100%, memory=256 GiB (restricted to LAPTh) |
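A quick way to check the current state of these partitionable slots from a UI is the command below; the attribute names are the standard HTCondor ones and may differ slightly with the local configuration:
# list GPU partitionable slots with their remaining CPUs, memory (MiB) and GPUs
condor_status -constraint 'PartitionableSlot && TotalGpus > 0' -af Machine Cpus Memory TotalGpus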
Description file
To use one or more GPU cards in a job, the following line needs to be added to the description file:
# specify the number of GPU cards to use in the same server
# replace X with a value from 1 up to the maximum number of cards available in the desired server
request_gpus = X
Additional lines can be added if you want to specify more precisely the kind of job you want to run:
# for a specific GPU type, replace XXX with "k80", "v100", "p6000", "t4", "a100" or "a100 80gb"
+wantGpuType = "XXX"
or
# for a specific GPU profile, replace XXX with Inference or Training
+wantGpuUsage = "XXX"
If none of these options is defined, the default usage will be applied (execution on K80 or p6000).
If both options are specified, priority will be given to +WantGpuType.
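For example, a minimal description file requesting one A100 might look like the following sketch (the executable and file names are placeholders; adapt the resource requests to the tables above):
universe       = vanilla
executable     = my_gpu_job.sh
request_gpus   = 1
request_cpus   = 8
request_memory = 64 GB
+wantGpuType   = "a100"
output         = my_gpu_job.out
error          = my_gpu_job.err
log            = my_gpu_job.log
queue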
What you should be aware of
As explained above, the combination below must be used with caution.
request_gpus = 1
+wantGpuType = "a100"
request_cpus = 24
This combination will prevent access to the two other GPU cards available on the multi-GPU server used by the job. This concerns the multi-GPU servers lapp-wngpu007/008 and lapp-wngpu010/011.
REMEMBER! When HTCondor runs out of CPUs or memory in the partitionable slot, it runs out of slots for GPU jobs.
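On a 3-GPU A100 server whose partitionable slot offers 24 CPUs and 240 GiB of memory, a more sharing-friendly request would, for instance, ask for roughly one third of the slot per GPU:
request_gpus = 1
+wantGpuType = "a100"
request_cpus = 8
request_memory = 80 GB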
Specific option for reserved servers
Please contact us via support-must@lapp.in2p3.fr if you need to reserve a specific server for your jobs.
Once it is configured, please add the following line to the description file in addition to the type or profile:
# for a specific GPU server, replace XXX with 001 to 012 according to your needs
requirements = machine == "lapp-wngpuXXX.in2p3.fr"
Executable file
Once a job matches to a given slot, it needs to know which GPU(s) to use, if multiple are present.
The UIDs of the GPU(s) that the job is permitted to use are published into the job's environment via the variable _CONDOR_Assignedgpus.
HTCondor now has a wrapper that automatically sets the CUDA_VISIBLE_DEVICES environment variable to the card(s) assigned by HTCondor. Your job will therefore use the correct card(s).
If you launch an interactive job, the wrapper is not used, so you must manually set the CUDA_VISIBLE_DEVICES environment variable:
export CUDA_VISIBLE_DEVICES=${_CONDOR_Assignedgpus}
If you want to deactivate the wrapper, please specify the following in your description file:
+noWrapper = "True"
You will then have to remember to set CUDA_VISIBLE_DEVICES manually in your job.
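For illustration, a minimal executable script for a batch job run with the wrapper disabled might look like the following sketch (the script and payload names are placeholders):
#!/bin/bash
# with +noWrapper = "True", set CUDA_VISIBLE_DEVICES from the GPU(s) assigned by HTCondor
export CUDA_VISIBLE_DEVICES=${_CONDOR_Assignedgpus}
# check which card(s) the job actually sees
nvidia-smi
# then launch the actual GPU workload, e.g. a training script
python my_training.py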
NVIDIA Multi-instance GPU (MIG) support
Starting with the NVIDIA Ampere architecture, MIG is an advanced capability offered by NVIDIA; it has been configured and tested on the GPU007 server. It enables multiple GPU instances to run in parallel on a single physical NVIDIA Ampere GPU, allowing users to see and schedule jobs on these virtual GPU instances as if they were physical GPUs.
MIG is currently not activated, but do not hesitate to contact support-must@lapp.in2p3.fr if you are interested.
MIG allows multiple vGPUs (and thereby VMs) to run in parallel on a single A100, while preserving the isolation guarantees that vGPU provides. For more information on GPU partitioning using vGPU and MIG, refer to the NVIDIA technical brief.
When configured with MIG, each of the 3 A100 cards of lapp-wngpu007.in2p3.fr may be split into 2 instances with 3 compute units and 20 GB of memory each. Up to 6 jobs may then be active on the GPU007 server at the same time. Other MIG configurations are possible. To use a MIG GPU, users must specify in the description file:
+WantGpuType = "3g.20gb"
Choosing a MIG GPU type does not allow requesting more than one GPU.
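Putting the two constraints together, a MIG job request would then look like:
request_gpus = 1
+WantGpuType = "3g.20gb"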
Using TensorBoard display with conda
This requires the log files from the TensorFlow computation to be stored under the MUST shared file system /mustfs/LAPP-DATA or /mustfs/MUST-DATA, or accessed via symlinks such as /uds_data/.
Then, please use 2 SSH terminal windows connected to the same UI.
In the first terminal, run:
# .bashrc includes conda initialize commands added after miniconda installation
source .bashrc
# <your_env> includes tensorboard
conda activate <your_env>
cd <path_to_tensorboars_run_logs>
tensorboard --logdir=.
or launch only particular experiments:
tensorboard --logdir=exp1_folder
To select several experiments to display in TensorBoard, create a new folder with symlinks to the desired experiments:
mkdir my_experiment_runs
cd my_experiment_runs
ln -s <path_to_exp1_folder> exp1_folder
ln -s <path_to_exp2_folder> exp2_folder
cd ..
and run:
tensorboard --logdir=my_experiment_runs
If everything works properly, TensorBoard is launched and the terminal will show:
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6008/ (PRESS CTRL+C to quit)
Check the port used by TensorBoard in the displayed URL (6008 in this example).
Then, in the second terminal window, create an SSH connection forwarding the TensorBoard port to a local port:
ssh -X -Y -tt -L 6006:localhost:<tensorboard_port> <your_login>@<UI>.in2p3.fr
Finally, open the localhost URL (http://localhost:6006/) in a browser.
Use of the NVIDIA HPC SDK
The NVIDIA HPC Software Development Kit version 21.9 is available on the latest GPUs. It includes compilers, libraries and software tools supporting GPU acceleration with standard C++ and Fortran, and provides performance profiling and debugging tools. More information is available at https://developer.nvidia.com/hpc-sdk.
Using the nvc++ compiler, it is possible to execute C++17 code (for compute capabilities ≥ 6.0, working with G++-9 or newer) on GPUs. The use of the nvcc compiler requires compute capabilities ≥ 3.5.
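As a minimal sketch, GPU offload of C++17 parallel algorithms can be enabled at compile time (the source file name is a placeholder; check the local installation for the exact path to the HPC SDK):
# compile C++17 code with GPU offload of parallel algorithms (file name is a placeholder)
nvc++ -std=c++17 -stdpar=gpu -o hadamard hadamard.cpp
# run on the GPU assigned by HTCondor
./hadamard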
You can refer to the course by Pierre Aubert (CTA/LAPP) (in French). This course provides a simple example of Hadamard product C++ code with a submission script that can be used as a quick start.
A Quick Start guide for the nvc++ compiler is available.
Global monitoring
The MUST GPU monitoring page allows you to check which GPU cards are available on all the GPU servers.
If you wish to get access to this page, please send your request via support-must@lapp.in2p3.fr.