User instructions for LPHE cluster
Updated on 19.06.2024
To log in
To log in, your USERNAME is your usual EPFL (GASPAR) account:
ssh <USERNAME>@lphesrv1.epfl.ch
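If you connect often, you can optionally add an entry to your local ~/.ssh/config so that a short alias is enough (a minimal sketch; the alias name lphe is just an example):
Host lphe
    HostName lphesrv1.epfl.ch
    User <USERNAME>
After that, ssh lphe connects you to the master node.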
This master node can be used for light tasks. For any other purpose, the worker nodes should be used instead. They are accessed via the Slurm scheduler:
srun --time=24:00:00 --mem-per-cpu=2000M --cpus-per-task=1 --pty bash
Modify the arguments to fit your needs. It is recommended to create an alias for this command.
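For example, you could add something like the following to your ~/.bashrc (the alias name sjob is only a suggestion; adjust the resources to your needs):
alias sjob='srun --time=24:00:00 --mem-per-cpu=2000M --cpus-per-task=1 --pty bash'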
- Never run srun without specifying the requested memory and CPUs, as this would block a complete machine (64 CPUs and 250 GB of RAM).
- Never access the worker nodes (lphepc120-125) via ssh. The Slurm scheduler is not aware of such sessions, and a race condition for resources between ssh and the scheduler can occur.
- To connect from outside of EPFL, a VPN is needed. See more information here.
SSH Keys
All users can install their own SSH key on the cluster themselves. Instead of creating and editing the ~/.ssh/authorized_keys file by hand, you can use this command:
$ ssh-copy-id -i ~/.ssh/id_rsa.pub USERNAME@lphesrv1.epfl.ch
You can always use your current GASPAR password to log in, which means you can install a new SSH key in case you lose access to the old one.
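If you do not have a key pair yet, you can create one first and then run the ssh-copy-id command above (a minimal sketch producing the ~/.ssh/id_rsa.pub used above; another key type works as well if you adapt the -i argument):
$ ssh-keygen -t rsa -b 4096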
Some setup tips
The ROOT, python, numpy, etc. packages are accessible via LCG. This can be set up by doing:
source /cvmfs/sft.cern.ch/lcg/views/LCG_105/x86_64-el9-gcc13-opt/setup.sh
You can change the gcc version to any of those available. You can also use a different (more recent or older) LCG version. The list of available LCG versions can be seen here. The included packages and their versions can also be consulted at the same link.
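If you prefer to check directly on the cluster, you can also list the CVMFS directories (assuming the standard layout used in the setup line above):
$ ls /cvmfs/sft.cern.ch/lcg/views/            # available LCG versions
$ ls /cvmfs/sft.cern.ch/lcg/views/LCG_105/    # available platforms for LCG_105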
You can set up the environment of some collaborations by using:
$ setLHCb
-> alias for 'source /cvmfs/lhcb.cern.ch/lib/LbEnv'
$ setCMS
-> alias for 'source /cvmfs/cms.cern.ch/cmsset_default.sh'
$ setSHiP
-> alias for 'source /cvmfs/ship.cern.ch/SHiP-2020/latest/setUp.sh'
Within the LHCb environment, you can run lb-conda default to use the latest installed packages, or create a custom environment using:
$ lb-conda-dev virtual-env default my-local-directory
$ ./my-local-directory/run
There, you can install extra packages if needed. More instructions can be found here.
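For instance, to add a Python package to the custom environment created above, commands can be run inside it through the generated run script (a sketch; my-package is a placeholder and the exact behaviour depends on the lb-conda-dev version):
$ ./my-local-directory/run pip install my-package
$ ./my-local-directory/run python YourPythonScript.py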
Your storage
/panfs: 120 TB in total, 3 TB for each user
/home: 15 TB in total, 300 GB for each user
To check your quotas:
$ quota -s
Disk quotas for user yunxuan (uid 265228):
Filesystem space quota limit grace files quota limit grace
/dev/mapper/lphe-home
8548M 260G 300G 54791 0 0
/dev/mapper/lphe-panfs
2083G 2700G 3072G 25306 0 0
OR
[ghan@lphesrv1 ~]$ xfs_quota -x -c 'quota -h' /home
Disk quotas for User ghan (283076)
Filesystem Blocks Quota Limit Warn/Time Mounted on
/dev/mapper/lphe-home
293.0G 260G 300G 00 [1 day] /home
The Quota here is the soft limit, i.e. the threshold that triggers a warning, and the Limit is the hard upper limit of space for each user. The hard limit can never be exceeded. If someone stays between the soft and hard limits for more than a week, they also cannot write anymore. For instance, in the example above Gao Han has just one day left to remove some files, or she won’t be able to write to /home anymore.
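If you are close to the limit, a quick way to see what takes up space is to list the size of each directory in your home (a plain du example, nothing cluster-specific):
$ du -h --max-depth=1 ~ | sort -h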
Slurm System to submit jobs
Slurm is the job scheduling system of our cluster.
You can find here:
- a quick start guide about how to use Slurm.
- a summary containing the full list of Slurm options.
To submit jobs, you will need to create a submit.sh file (feel free to choose your own name) with the following structure:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2048M
#SBATCH -J YourJobName
#SBATCH -t 30:00
python YourPythonScript.py -t argument1 -a $1
where:
- ntasks indicates how many different tasks you want to run. It is useful if you have commands that you want to run in parallel within the same batch script, for example two separate commands separated by an & or two commands used in a bash pipe (|). In most cases, ntasks=1 is enough.
- -J is the name you want to give to the job. It is only used to find it later.
- -t is the time after which the job will be killed.
- Any non-comment bash command will be executed on the node. This includes setting variables or environments, running scripts, …
- Arguments passed to the submit.sh script can be used as $1, $2, …
- It is possible to set up N jobs with the same configuration (useful for toys) by adding #SBATCH --array=1-100. This will submit 100 identical jobs where the only difference will be the environment variable $SLURM_ARRAY_TASK_ID, which can be used to set the seed of a random generator. If many jobs are submitted, it is good practice to limit the number that run at once by using #SBATCH --array=1-100%10 (limiting them to 10 at a time). A minimal example is sketched below.
The Slurm job is submitted by doing:
$ sbatch submit.sh
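As an illustration of the array option above, a minimal sketch of a toy-style submit script could look like this (script and job names are placeholders):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2048M
#SBATCH -J ToyStudy
#SBATCH -t 30:00
#SBATCH --array=1-100%10
# use the array index as the random seed of each toy
python YourToyScript.py --seed $SLURM_ARRAY_TASK_ID
A single sbatch call then creates all 100 jobs.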
You can later check the submitted jobs with squeue, or squeue -u <USERNAME> if you’re only interested in yours.
You can also cancel jobs before they finish with scancel <jobID>, or cancel all your jobs with scancel -u <USERNAME>.
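If the default squeue output is too terse, you can also select the columns yourself with standard squeue format specifiers (adjust the fields and widths as you like):
$ squeue -u <USERNAME> -o "%.10i %.20j %.8T %.10M %.6D %R"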
Advanced knowledge
- The master node lphesrv1, along with /panfs and /home, is hosted on the same machine, which is located in the EPFL data center, where experts can take better care of it.
- The worker nodes lphepc[120..125] are in BSP 614.1: they have more CPU cores and more memory.
- People who leave LPHE should clean up their files; the corresponding folders can be removed at any time.