Environment

Connection to the machine

Connect with ssh to occigen.cines.fr:

ssh <my_login>@occigen.cines.fr

The operating system is Linux, based on the BullX SCS (Red Hat) software suite.
The cluster includes several login nodes for users. When the connection is established, the user lands on one of these nodes. Connections are assigned according to login node availability, so two simultaneous connections may end up on two different login nodes.

Software environment : modules

CINES provides many software packages on its clusters, often in several versions. To avoid conflicts between different versions of the same software, it is generally necessary to load an environment specific to each version.

The available software can be viewed or loaded via the following commands:

module avail     display the list of available environments
module load      load a library or a software package into your environment
module list      display the list of currently loaded modules
module purge     unload all currently loaded modules
module show      display the content of a module
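
As an illustration, a typical sequence might look like the following sketch (the module names and versions are only examples, reused from the PSERIE section below; check module avail for what is actually installed):

module purge                      # start from a clean environment
module avail                      # list the environments installed on the cluster
module load intel/17.0            # load a compiler (example version)
module load openmpi/intel/2.0.1   # load an MPI library built with that compiler
module list                       # check what is currently loaded
module show openmpi/intel/2.0.1   # inspect what the module sets in your environment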

Job submission environment (SLURM)

The compute nodes of the Occigen machine can be used in two modes :

  • Exclusive mode
  • Shared mode

1) Jobs in « EXCLUSIVE » mode

Since the Occigen extension, which consists of nodes with Broadwell processors, went into production in February 2017, jobs have been submitted to two distinct partitions: the Haswell nodes (24 cores per node) and the Broadwell nodes (28 cores per node).

Since spring 2018, we have been offering more flexibility, allowing you to request both architectures at the same time, or either one without specifying which. The table below summarizes the different possibilities and the SLURM directive associated with each case.

OpenMP code:
  • Haswell node only: --constraint=HSW24
  • Broadwell node only: --constraint=BDW28

MPI parallel code / hybrid code:
  • Haswell nodes only: --constraint=HSW24
  • Broadwell nodes only: --constraint=BDW28
  • Precise configuration of x Haswell nodes and y Broadwell nodes: --constraint="[HSW24*x&BDW28*y]"
  • Haswell or Broadwell nodes indifferently (HSW24 and BDW28 nodes may be mixed): --constraint=HSW24|BDW28
  • All nodes of the same type, Haswell or Broadwell depending on availability: --constraint=[HSW24|BDW28]
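
As a minimal sketch, a job header requesting, for example, two Haswell nodes could look like this (the job name and executable are placeholders):

#!/bin/bash
#SBATCH -J my_job                  # job name (placeholder)
#SBATCH --nodes=2                  # two nodes requested
#SBATCH --ntasks=48                # 24 cores per Haswell node x 2 nodes
#SBATCH --constraint=HSW24         # Haswell nodes only (see table above)
#SBATCH --time=01:00:00            # walltime limit
srun -n $SLURM_NTASKS ./my_mpi_executable   # placeholder MPI executable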

 

You will find batch submission script templates describing all these cases in the CINES GitLab, in the following directory:

2) Jobs in « SHARED » mode

In shared mode, several jobs can run simultaneously on one node. A technical mechanism keeps jobs isolated from each other: there can be no overwriting of memory areas, nor “theft” of CPU cycles.

By default, all jobs that request fewer than 24 cores run in shared mode.

All nodes in the shared partition have 128 GB of memory.

Jobs requesting more than 23 cores are not affected by shared mode, nor are jobs that explicitly request exclusive mode.
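
For illustration, the following contrast (with an arbitrary task count) shows how the default applies:

#SBATCH --ntasks=8          # 8 < 24 cores: the job runs in shared mode by default
# versus
#SBATCH --ntasks=8
#SBATCH --exclusive         # same size, but the node is dedicated to this job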

 

3) Commands

This table lists useful commands for submitting and managing jobs.

Action                                         Command
Submit a job                                   sbatch script_name.slurm
List all jobs                                  squeue
List only your own jobs                        squeue -u <login>
Display the characteristics of a job           scontrol show job <job_id>
Estimate the start time of a job (may vary)    squeue --start --job <job_id>
Estimate the start times of your own jobs      squeue -u <login> --start
Cancel a job                                   scancel <job_id>
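
Put together, a typical session might look like this (the script name is a placeholder and the job ID is the example ID used later on this page):

sbatch my_job.slurm              # submit the job; SLURM prints the job ID
squeue -u $USER                  # check the state of your own jobs
squeue --start --job 919914      # estimated start time of job 919914
scancel 919914                   # cancel it if needed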

4) Submitting an HTC job

PSERIE is a tool that allows you to run sequential tasks or jobs within a single MPI job. Each task corresponds to a command line containing a sequential (serial) executable and its argument(s) if necessary. The executable is called pserie_lb.

a) Description :

When using this version, MPI process (rank) 0 distributes the tasks to the other MPI processes and does not execute any commands itself. The other processes are each assigned a task. Once a process has completed one of its tasks, rank 0 assigns it a task that has not yet been executed. This distribution continues as long as tasks remain to be executed, that is, as long as command lines of the input file have not been processed. There is no guaranteed order of execution: the first MPI process ready to receive a task is assigned the first available (not yet executed) task from the list in the input file. The input file can contain more command lines than there are processes.

b) Utilization

To use this tool, three modules must be loaded: intel/17.0, openmpi/intel/2.0.1 and pserie. Pserie requires an input file containing the commands to be executed.

Example of input file input.dat :

./my_executable my_parameter_equal_to_parameter_1
./my_executable my_parameter_equal_to_parameter_2
./my_executable my_parameter_equal_to_parameter_3
./my_executable my_parameter_equal_to_parameter_4
./my_executable my_parameter_equal_to_parameter_5
...

c) Example of SLURM script:

#!/bin/bash
#SBATCH -J job_name                # job name
#SBATCH --nodes=2                  # two nodes
#SBATCH --ntasks=48                # 48 MPI processes (24 cores per node)
#SBATCH --constraint=HSW24         # Haswell nodes only
#SBATCH --time=00:30:00            # walltime limit
module purge
module load intel/17.0
module load openmpi/intel/2.0.1
module load pserie
srun -n $SLURM_NTASKS pserie_lb < input.dat   # rank 0 distributes the lines of input.dat to the other ranks

Note: an example of use, with the associated SLURM script, is available under /opt/software/occigen/tools/pserie/0.1/intel/17.0/openmpi/intel/2.0.1/example. Its input file, fdr.dat, contains the list of commands to execute (simple echo and sleep commands).

5) Additional compute hours (+25% of the initial allocation)

Since November 4, 2019, in consultation with GENCI and the IDRIS and TGCC computing centres, CINES has also changed the “bonus” mode. Bonus jobs, until now requested via the “--qos=bonus” option, are removed in favour of a systematic additional allocation of 25% for every project.
More precisely:

  • As soon as the consumption of a project exceeds 125% of its initial allocation of hours, the project is blocked and no member can submit any more jobs.
  • A priority system manages the execution of jobs on the machine as equitably as possible between projects. It takes into account various parameters, notably the initial allocation of hours and the recent consumption of hours (whose weight decays exponentially over time, with a half-life of 14 days).
  • A project that has under-consumed in the recent past (i.e. the last days/weeks) has a high priority for the execution of its jobs.
  • A project that has over-consumed in the recent past (i.e. the last days/weeks) is not blocked; it can continue to run, but with a low priority, and can therefore benefit from cycles available on the target machine when the load is low.

This priority system maximizes the effective use of CPU cycles by:

  • Encouraging projects to use their hours regularly throughout the year, so that as many hours as possible run with a high execution priority
  • Allowing projects, when the machine is under-used, either to catch up on delayed consumption or to get ahead despite a low execution priority, with no limit on hours other than 125% of their initial allocation.

6) Environment variables

The following utility variables (non-exhaustive list) can be used in the shell commands of a submission script.

Variable               Value
$SLURM_JOB_ID          ID of the job
$SLURM_JOB_NAME        name of the job (specified by “#SBATCH -J”)
$SLURM_SUBMIT_DIR      initial directory (in which the sbatch command was run)
$SLURM_NTASKS          number of MPI processes of the job
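
For instance, these variables can be used directly in the body of a submission script (the executable name is a placeholder):

cd $SLURM_SUBMIT_DIR                                    # return to the directory the job was submitted from
echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) with $SLURM_NTASKS tasks"
srun -n $SLURM_NTASKS ./my_executable > output_$SLURM_JOB_ID.log   # one log file per job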

7) Common SBATCH options

In the SLURM script, the following directives can be used to describe the resources requested:

  • To specify the number of physical nodes of the machine to use:
    #SBATCH --nodes=W

  • To specify the total number of MPI processes that will run (on nodes with 48 or 56 hyperthreaded cores):
    #SBATCH --ntasks=X

  • To specify, for OpenMP codes, how many threads will run on each physical core of the node:
    #SBATCH --threads-per-core=Y

  • To specify the maximum amount of memory wanted on one node (it cannot exceed the total memory of the node, 64 GB or 128 GB):
    #SBATCH --mem=Z

  • To specify that you do not want your job to be placed on a shared node:
    #SBATCH --exclusive

In a shared node, a job may consume only part of the node's memory, or, on the contrary, it may require much more memory than its share of the cores would suggest.

In the second case, the job manager will not place a new job on this node until the memory is released, and accounting uses the larger of the two requests: a job is charged the maximum of the number of cores it requests and its memory request (in GB) divided by 5, roughly the per-core memory share on a 128 GB, 24-core node. In the simplified example, job J1 is charged 1 core, while job J2, which requests a single core (--ntasks=1) but five cores' worth of memory, is charged 5 cores.
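
As an illustration of this accounting rule (the numbers below are examples only):

#SBATCH --ntasks=1        # one task requested
#SBATCH --mem=25000       # but about 25 GB of memory requested
# charged cores = max(1, 25 GB / 5 GB per core) = 5 cores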

Here are some examples of how to use these directives.
Example 1:

#SBATCH --nodes=1             # only one node
#SBATCH --ntasks=48           # as many tasks as desired
#SBATCH --threads-per-core=2  # number of tasks per physical core, a sub-multiple of the number of tasks -> 24 physical cores are reserved here (48 / 2)

If you do not reserve all the core and/or memory resources, another job can start on the node using the remaining resources.

It is possible to impose the “dedicated node” mode by using the following directive:

#SBATCH --exclusive

If no memory request is specified by the user in the submission script, a job running on a shared node is assigned a memory limit of 1 GB per task.
This default value is deliberately low to encourage users to define their needs. Here is the directive to use to express your memory requirement:

#SBATCH --mem=4000   # 4 GB of memory per task
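
Putting these directives together, a complete submission script for a small shared-mode job might look like the following sketch (job name, module and executable are placeholders):

#!/bin/bash
#SBATCH -J shared_example        # job name (placeholder)
#SBATCH --nodes=1                # one node
#SBATCH --ntasks=4               # fewer than 24 cores: shared mode by default
#SBATCH --threads-per-core=1     # one thread per physical core
#SBATCH --mem=16000              # 16 GB of memory
#SBATCH --time=00:20:00          # walltime limit
module purge
module load intel/17.0           # example module, adapt to your code
srun -n $SLURM_NTASKS ./my_executable   # placeholder executable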

8) Common mistakes :

When a job exceeds its memory request, the process responsible is killed by SLURM and the job is stopped. Other jobs active on the node are not impacted: a “memory overflow” is handled at the Linux kernel level, and neither the node nor the other jobs should be affected.
An error message then appears in the output file:

/SLURM/job919521/slurm_script: line 33: 30391 Killed /home/user/TEST/memslurmstepd: Exceeded step memory limit at some point.

Shared jobs are also taken into account in the blocking mechanism in case of over-occupation of the /home and /scratch storage spaces. Blocked shared jobs are assigned to partitions whose names are suffixed with “_s”; for example, BLOCKED_home_s for a shared job blocked for exceeding a quota on /home.

To check whether a job has started in shared mode, just look at the partition to which it has been assigned:

login@occigen54:~/TEST$ squeue -u login
  JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
 919914    shared TEST_SHA login  R  0:04     1 occigen3100

We see that the job 919914 runs in the shared partition.

To see the status of the shared nodes, run the command: sinfo -p shared -o "%.15n %.8T %.8m %.8O %.8z %.14C"

gil@occigen57:~/TEST$ sinfo -p shared -o "%.15n %.8T %.8m %.8O %.8z %.14C"
      HOSTNAMES    STATE   MEMORY CPU_LOAD    S:C:T  CPUS(A/I/O/T)
    occigen3100    mixed   128000     0.00   2:12:2     24/24/0/48
    occigen3101     idle   128000     0.00   2:12:2      0/48/0/48
    occigen3102     idle   128000     0.00   2:12:2      0/48/0/48
    occigen3103     idle   128000     0.01   2:12:2      0/48/0/48
    occigen3104     idle   128000     0.01   2:12:2      0/48/0/48
    occigen3105     idle   128000     0.00   2:12:2      0/48/0/48
gil@occigen57:~/TEST$

We see that there are (at the time the command was run) six nodes in the “shared” partition (occigen3100 to occigen3105).

The occigen3100 node in the “mixed” state already contains one or more jobs.

This node is occupied on half of its cores (24/24/0/48): 24 allocated cores, 24 idle cores, 0 in the “other” state, and a total of 48 logical cores per node.

Last modified: 19 November 2019
CINES