FAQ HPC

How to be authorized to access the computers at CINES

How can I obtain HPC hours at CINES?

Any researcher from the national community can, with the authorization of their laboratory, ask for resources on the computers at CINES. There is a dedicated application for that: "DARI" (Demande d'Attribution de Ressources Informatiques). The application must be submitted during one (or both) of the two yearly sessions (April and September).

I have used up all my hours. How can I get more?

Throughout the year, users who have used all their HPC hours can submit supplementary requests on the DARI web site.

I have hours at another national computing center; can I ask to transfer these hours to CINES?

Such a request must be justified in an objective and reasoned manner, and an e-mail must be sent to svp@cines.fr. The request will be reviewed by the relevant committees.

Can I lend my login to a colleague?

The CINES charter states that logins are strictly personal and non-transferable. Any breach of this rule may result in the revocation of the account and of the right to access CINES resources. The charter can be read here.

Data management

What are the storage spaces?

All the information can be found here.
Here is a summary; there are 4 data spaces:

  • /home : for compilation, and to keep your programs and the libraries used to run them
  • /scratch : TEMPORARY space; you can submit your jobs from this space and also use it for temporary I/O during the run
  • /store : a more secure space, to keep your important output
  • /data : also a secure space; it will soon be replaced by /store everywhere

It is important to note that /scratch is a temporary space. You MUST use /store (or /data) to keep your important results.
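
For example, a minimal sketch of saving results at the end of a run (the run directory, the archive name and the results/ sub-directory are hypothetical; $SCRATCHDIR and $STOREDIR are the usual environment variables):

cd $SCRATCHDIR/my_run              # hypothetical run directory on the temporary space
tar czf results.tar.gz results/    # pack the important output files
cp results.tar.gz $STOREDIR/       # keep them on the backed-up /store space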

How much space is available for me?

You can use the etat_projet command, which gives, among other things, the state of consumption of the project, as well as the occupation and limits of the various storage spaces.

You can also use the ncdu command, which recursively counts the space occupied and the number of files.
Be careful with the exhaustive and recursive nature of this tool: depending on the directory in which you run it, it can take a considerably long time.
Try to target the branches of the tree structure that interest you most, to avoid scanning your entire space.

It is possible to save the results of a search in a file so that they can be processed later without having to restart the command.

Example to generate an output file:

ncdu -qrxo output $SCRATCHDIR/dir1/dir2/dir3

To reread the file:

ncdu -qrxf output

Information about the maximum quotas is available here.

Software

What software is available at CINES?

All the software installed by CINES can be used through the "module" command. You can use module avail to list all the modules available on the machine.

How can I use a software package?

The module command can be called in different ways:

  • module avail lists all the modules installed on the machine (e.g. abinit).
  • module load <name_module> adds <name_module> to the current session (e.g. module load abinit loads the whole environment needed to use abinit).
  • module show <name_module> gives information about a module (version, conflicts with other modules, etc.).
  • module list lists all the modules loaded in the current session (only the current session).

Once the command module load <name_module> has been called, you can use the software.
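
A typical session might look like this (abinit is just the example module named above):

module purge          # start from a clean environment
module avail          # list the available modules
module load abinit    # load the environment needed by abinit
module list           # check which modules are loaded in this session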

The versions currently available through modules are not the ones I would like to use

Multiple software versions are available on the machines. You can list all the different versions with the command module avail.
If you cannot find the version you are looking for in the listing, you can contact the support team (svp@cines.fr), who can help you make your programs run with the current versions, or install a newer version.

I want a software package that is not currently installed at CINES; what can I do?

If it is licensed software, you must first give us the license; then we will install the software.
If the software is specific to your research community, we encourage you to install it in your own environment (in your $SHAREDDIR, for example).
Otherwise, you can contact the CINES support team (svp@cines.fr) to ask for a new installation. Your request will be carefully studied by the support team, and an answer will be given as soon as possible.

Machine availability

You can check machine availability on this web page.

General information

What kind of processors does Occigen have?

The Occigen supercomputer comprises:

2106 dual-socket Intel® Xeon® Haswell nodes @ 2.6 GHz, with twelve cores per socket.

1260 dual-socket Intel® Xeon® Broadwell nodes @ 2.6 GHz, with fourteen cores per socket.

The AVX2 vectorization technology is available. Please note that hyperthreading is activated on each node.

What is the operating system?

The operating system is built by BULL and named BullX SCS AE5. It is based on Red Hat Linux 7.3.

What are the different file systems?

Your /home directory is installed on a Panasas file system.
The /scratch is a Lustre file system for the HPC applications.
The /store is a Lustre file system; it is the backed-up storage space.

How to access the file systems on the machine?

$HOMEDIR (Panasas file system) is mainly dedicated to your source files, and $STOREDIR to permanent data files (input and result files). At runtime, your input and output files should be placed on $SCRATCHDIR.
Finally, if you want to share files with other logins of your Unix group, $SHAREDHOMEDIR and $SHAREDSCRATCHDIR are there for you.

How can I know my space usage on each kind of file system (/home, /scratch and /store)?

Please use the command etat_projet.

Access to Occigen

How to connect to Occigen?

Linux: type the usual command ssh login_name@occigen.cines.fr
Windows: through an SSH client (e.g. PuTTY), choose occigen.cines.fr

How to launch graphical software?

It is necessary to redirect the graphical output to an X server. To do that under Linux, please connect with the command ssh -X login_name@occigen.cines.fr. Under Windows, use Xming for example and activate X11 forwarding from the SSH client menu.

How to move data?

Under Linux, the copy must be done with the scp command from your machine to occigen.cines.fr.
Under Windows, please use a file transfer program such as FileZilla and specify occigen.cines.fr and port 22.
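
For example, under Linux (login_name and the file names are placeholders):

scp my_input.dat login_name@occigen.cines.fr:      # copy a local file to your home directory on Occigen
scp login_name@occigen.cines.fr:my_result.dat .    # copy a result file from Occigen to the current local directory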

Submitting jobs

What is the job scheduler?

The job scheduler is SLURM (Simple Linux Utility for Resource Management). This software comes from SchedMD; the version on Occigen was updated by Bull/ATOS.

Which space should I launch my jobs from?

You may launch SLURM scripts from your home directory, but keep in mind that at run time, files should be located on $SCRATCHDIR. Use the environment variables to facilitate the scripting of data movement, as in the sketch below.
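
A minimal sketch of such data movement inside a job script (the directory names input_data, run01 and run01_results are hypothetical):

mkdir -p $SCRATCHDIR/run01 $STOREDIR/run01_results   # hypothetical working and result directories
cp -r $STOREDIR/input_data/. $SCRATCHDIR/run01/      # stage the permanent input files onto /scratch
cd $SCRATCHDIR/run01
srun ./my_executable                                 # the job reads and writes on $SCRATCHDIR
cp important_result.dat $STOREDIR/run01_results/     # keep the important results on the backed-up space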

How to submit a job on Occigen?

First, write a script containing the sbatch directives; an example is given below.
The first line of this file must always give the shell used for the commands: #!/bin/bash.
The lines beginning with #SBATCH will be taken into account by the SLURM scheduler. They give instructions related to the reservation of the resources you actually need and to the wall clock time limit.

#!/bin/bash
#SBATCH -J job_name
#SBATCH --nodes=2
#SBATCH --constraint=BDW28
#SBATCH --ntasks=56
#SBATCH --ntasks-per-node=28
#SBATCH --threads-per-core=1
#SBATCH --mem=40GB
#SBATCH --time=00:30:00
#SBATCH --output job_name.output

As SLURM inherits the environment of the window in which you type the launch command "sbatch script.slurm", it is strongly recommended to start the Unix commands in script.slurm with "module purge".
If you use OpenMPI as the MPI library (but you can also use IntelMPI), the command to run your executable will be:

 srun --mpi=pmi2 -K1 -n $SLURM_NTASKS ./my_executable param1 param2	 

Otherwise, if you use IntelMPI, the command line is:

 mpirun (or mpiexec.hydra) -n $SLURM_NTASKS ./my_executable param1 param2
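
Putting the pieces together, a complete script might look like the sketch below (the executable name and its arguments are placeholders; intel and openmpi are the module names used elsewhere in this FAQ; adapt the #SBATCH values to your own needs):

#!/bin/bash
#SBATCH -J job_name
#SBATCH --nodes=2
#SBATCH --constraint=BDW28
#SBATCH --ntasks=56
#SBATCH --ntasks-per-node=28
#SBATCH --threads-per-core=1
#SBATCH --time=00:30:00
#SBATCH --output job_name.output
module purge                         # do not rely on the inherited environment
module load intel openmpi            # load the compiler and MPI stack used at build time
srun --mpi=pmi2 -K1 -n $SLURM_NTASKS ./my_executable param1 param2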

How to use a compute node in interactive mode?

Interactive execution is forbidden on the login node. To work interactively on a compute node, first type the following command (on the login node) to get access to some resources (compute nodes):

salloc --constraint=BDW28 -N 1 -n 10 -t 10:00

Then, when the resources become available, in the same window, type:

srun my_executable

In this example, it will run for 10 minutes on a single node with 10 tasks.

My calculation actually needs more than 24h; how can I get a longer execution time?

If your executable is able to write checkpoint/restart files, please respect the fair-play rules and submit a new job for the continuation.
If your application does not do checkpoint/restart yet, please send a well-argued request to svp@cines.fr and it will be examined. Even if you are granted such an authorization, keep in mind that the resources that can host this queueing class are limited and that the priority of this kind of job is lower. Your job is also largely exposed to the risk of node failure, so the general advice is to develop checkpointing in your code in order to avoid wasting computational time in such a case.
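
As an illustration (not an official CINES procedure), a code that writes checkpoint files can be chained over several jobs of less than 24h by resubmitting the script at the end of the run; the marker file run_finished and the script name my_script.slurm are hypothetical:

#!/bin/bash
#SBATCH -J chained_job
#SBATCH --nodes=1
#SBATCH --time=23:00:00
module purge
srun ./my_executable                 # the code writes its own checkpoint/restart files
# resubmit the same script if the computation is not finished yet
# ("run_finished" is a hypothetical marker file written by the application)
if [ ! -f run_finished ]; then
    sbatch my_script.slurm
fi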

What can I do if my application needs a lot of memory per node?

Half of the 2106 Haswell nodes of the first part of Occigen benefit from 128 GB of memory. In order to tell the SLURM scheduler that your job actually needs them, add the following lines in your script:

#SBATCH --mem=120GB
#SBATCH --constraint=HSW24

The memory that is indicated here concerns each node.

How to know the remaining usable amount of computational core.hours of my project?

Following this link, you will find all the information related to your project (allocated core.hours and the core.hours consumed by yourself and the colleagues involved in this project).

How to submit a job using MPI/OpenMP?

You can find on this web page some examples of scripts using MPI and OpenMP (MPI+OpenMP hybrid).
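
As an additional hedged illustration of such a hybrid script on the Haswell nodes (4 MPI tasks per node and 6 OpenMP threads per task are arbitrary example values; the executable name is a placeholder):

#!/bin/bash
#SBATCH -J hybrid_job
#SBATCH --nodes=2
#SBATCH --constraint=HSW24
#SBATCH --ntasks-per-node=4          # MPI tasks per node
#SBATCH --cpus-per-task=6            # OpenMP threads per MPI task (4 x 6 = 24 cores per node)
#SBATCH --threads-per-core=1
#SBATCH --time=00:30:00
module purge
module load intel openmpi
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # one OpenMP thread per allocated CPU
srun --mpi=pmi2 -K1 ./my_hybrid_executable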

How to submit an HTC (embarrassingly parallel) job?

We have a tool on Occigen that you can use to run sequential tasks inside an MPI job. These tasks are command lines, with one line corresponding to one executable and its parameters.

The tool's name is pserie_lb. You can use it by loading the module pserie/0.1.

The documentation on pserie_lb can be read here in chapter 4.
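
For illustration only, a command file could look like the lines below; the file name commands.txt, the executable and, above all, the exact way pserie_lb is launched are assumptions, so please check the real syntax in the documentation referenced above:

# commands.txt : one line per sequential task (hypothetical example)
./my_prog input_001.dat
./my_prog input_002.dat
./my_prog input_003.dat

# in the job script, load the module and distribute the command lines
# (the launch line below is an assumption; see the documentation)
module purge
module load pserie/0.1
srun --mpi=pmi2 -K1 pserie_lb < commands.txt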

How to write SLURM scripts if my current SHELL differs from bash?

Please add the following lines at the beginning of the SLURM script:

#!/bin/bash	 
. /usr/share/Modules/init/bash

Be careful not to forget the dot (".") at the start of the second line.

Emacs seems very slow; how can I improve that?

Add to the file ~/.emacs

;; disable the version control
(setq vc-handled-backends nil)

How to make the "ls" command more efficient on the Lustre file systems?

This command facilitates the listing of your directories (on the Lustre file systems only, i.e. /store and /scratch):

 /usr/local/bin/lfs_ls

This command does not work well with /home.
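
For example, to list a hypothetical run directory located on the Lustre /scratch space:

cd $SCRATCHDIR/my_run_directory    # hypothetical directory on the Lustre file system
/usr/local/bin/lfs_ls              # list it instead of using a plain "ls"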

How to use the Gaussian software efficiently on Occigen?

There is a technical note here. It can be downloaded here.

How to exploit the parallelization of the VASP code on Occigen?

VASP offers two levels of parallelization:

  • over bands (see NPAR)
  • and over k-points (see KPAR).

In your input file (INCAR), these parameters must be carefully chosen in order to obtain the best performance, taking into account the Occigen architecture.

Parameters of the INCAR file

The KPAR value

The KPAR parameter manages the parallelization over the k-points; it represents the number of k-point groups created to parallelize over this dimension.

The Occigen supercomputer has dual-socket nodes with 12 cores per socket (Haswell partition). Keep in mind that one should try to exploit the highest number of cores available on each node while complying with the constraints described below.

For example, if the study case has 10 k-points and KPAR is set to 2, there will be 2 k-point groups, each performing calculations on 5 k-points. Similarly, if KPAR is set to 5, there will be 5 groups, each with 2 k-points.

Currently KPAR is limited to values that divide exactly both the total number of k-points and the total number of cores used by the job.

The NPAR value

The NPAR parameter manages the parallelization over the bands. It determines the number of bands that are handled in parallel, each band being distributed over some cores. The number of cores that are dedicated to a band is equal to the total number of cores divided by NPAR. If NPAR=1, all the cores will work together on every individual band.

NPAR must be adjusted in agreement with the KPAR value according to the following formula:

total number of cores = KPAR × NPAR × NCORES_PAR_BAND

The best performance seems to be obtained on Occigen for NCORES_PAR_BAND ranging from 2 to 4.
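
For example (purely illustrative numbers): a job running on 48 cores with KPAR = 2 and NPAR = 6 gives

NCORES_PAR_BAND = 48 / (2 × 6) = 4

which falls within the recommended range.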

Selecting the right number of cores for a VASP calculation is very important, because using too many cores will be inefficient. As a rule of thumb, you can choose a number of cores of the same order of magnitude as the number of atoms. This estimate may need further refinement. When the workload is split over several compute nodes, you have to ensure that a sufficient amount of work has been allocated to each core.

If you encounter "Out of Memory" problems, the following actions should be taken:

  • Select the high-memory nodes by adding a special directive in your job script (#SBATCH --mem=120GB).
  • If this is not sufficient, reconsider the parallelization: try to increase the number of cores and/or leave some cores unused on each node.

Below is an example of a submission script for Occigen:

#!/bin/sh
#SBATCH --nodes=2
#SBATCH --constraint=HSW24
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks=48
#SBATCH --mem=118000
#SBATCH --time=24:00:00
#SBATCH --exclusive
#SBATCH --output vasp5.3.3.out
module purge
module load intel
module load openmpi
ulimit -s unlimited
srun --mpi=pmi2 -K1 /path/vasp-k-points

General information

What kind of processors do the visualization nodes have?

The visualization cluster has a login node (visu.cines.fr) and 4 nodes for visualization and pre/post-processing (visu1 to visu4).
These are dual-socket Broadwell nodes with 14 cores per socket and 256 GB of memory per node. A full description can be found here.
Each node has one Nvidia Tesla P100 GPU with 12 GB of memory.
Hyperthreading is activated on these processors.

Which OS runs on Cristal?

The cluster uses a BullX OS version based on Red Hat 6.7.

What are the different file systems?

Your /home directory is installed on a Panasas file system.
The /scratch is a Lustre file system for the HPC applications.
The /store is a Lustre file system; it is the backed-up storage space.
The /home, /scratch and /store spaces are the same as the ones you use on the Occigen machine.

Access to visualization

How to connect to the visualization cluster?

To get access to the visualization cluster, you can connect with an ssh command to the address visu.cines.fr.
Windows: through an SSH client (e.g. PuTTY), choose visu.cines.fr

How to run a graphical application?

How can I copy/download data?

Linux: from your laboratory toward visu.cines.fr, with the scp command.
Windows: use a program like FileZilla and give it visu.cines.fr as the name of the target host (and 22 as the port number).

Computing on the machine

What is the job scheduler?

The job scheduler is SLURM (Simple Linux Utility for Resource Management).

From which node can I open a graphical session?

You have to log in to the visu.cines.fr node first.

How to use a compute node in interactive mode?

Interactive execution is forbidden on the login node. To work interactively on a compute node, first type the following command (on the login node) to get access to some resources (compute nodes):

salloc -N 1 -n 10 -t 10:00

Then, when the resources become available, in the same window, type:

srun my_executable

In this example, it will run for 10 minutes on a single node with 10 tasks.

How do I know my free core hours?

There is no accounting on these nodes.

Last modified: 1 July 2019