Environment

Connection to the machine

You must connect with ssh to: occigen.cines.fr

ssh <my_login>@occigen.cines.fr

The operating system is Linux, based on the BullX SCS (Red Hat) software suite.
The cluster includes several login nodes for users. When the connection is established, the user lands on one of these nodes. Connections are assigned according to the availability of the login nodes, so two simultaneous connections may end up on two different login nodes.

Software environment : modules

CINES provides many software packages on its clusters, often in several versions. To avoid conflicts between different versions of the same software, it is generally necessary to define an environment specific to each version.

The available software can be viewed or loaded via the following commands:

module avail : display the list of available environments
module load  : load a library or a software package into your environment
module list  : display the list of loaded environments
module purge : unload all the loaded environments
module show  : display the content of a module
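
For example, a typical session to find and load a compiler could look like this (the module name and version are only given as an illustration, check module avail on the machine):

module avail intel          # list the Intel environments installed on the cluster
module load intel/17.0      # load a specific version (version given as an example)
module list                 # check which environments are currently loaded
module show intel/17.0      # inspect what the module changes in the environment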

Job submission environment (SLURM)

The compute nodes of the Occigen machine can be used in two modes :

  • Exclusive mode
  • Shared mode

1) Jobs in « EXCLUSIVE » mode

Since the Occigen extension entered production in February 2017, adding nodes with Broadwell processors, jobs have been submitted on two distinct partitions: the Haswell part (24-core nodes) and the Broadwell part (28-core nodes).

Since spring 2018, we have offered more flexibility: you can request either both architectures at the same time, or one or the other without specifying which one. The table below summarizes the different possibilities and the SLURM directive associated with each case.

OpenMP code:
  Haswell node only : --constraint=HSW24
  Broadwell node only : --constraint=BDW28

MPI parallel code / hybrid code:
  Haswell nodes only : --constraint=HSW24
  Broadwell nodes only : --constraint=BDW28
  Precise configuration of x Haswell nodes and y Broadwell nodes : --constraint="[HSW24*x&BDW28*y]"
  Haswell or Broadwell nodes regardless (HSW24 and BDW28 can be mixed) : --constraint=HSW24|BDW28
  All nodes of the same type, Haswell or Broadwell depending on availability : --constraint=[HSW24|BDW28]
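
As an illustration, a batch header for a hybrid job requesting 2 Haswell nodes and 1 Broadwell node could look like the following minimal sketch (job name, task count, walltime and executable are placeholders):

#!/bin/bash
#SBATCH -J constraint_example                # placeholder job name
#SBATCH --constraint="[HSW24*2&BDW28*1]"     # 2 Haswell nodes and 1 Broadwell node
#SBATCH --nodes=3                            # total number of nodes requested
#SBATCH --ntasks=76                          # 2 x 24 + 1 x 28 = 76 MPI processes
#SBATCH --time=01:00:00                      # placeholder walltime
srun -n $SLURM_NTASKS ./my_mpi_executable    # my_mpi_executable is a placeholder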

 

You will find batch submission script templates describing all these cases in the CINES gitlab in the following directory :

2) Jobs in « SHARED » mode

In shared mode, several jobs can run simultaneously on one node. A technical mechanism keeps the jobs isolated from each other: there can be no overwriting of memory areas, nor "theft" of CPU cycles.

By default, all jobs that request fewer than 24 cores run in shared mode.

All nodes in the shared partition have 128 GB of memory.

Jobs requesting more than 23 cores are not affected by shared mode, nor are jobs that explicitly request exclusive mode.
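
For instance, a minimal sketch of a shared-mode request could be (job name, executable and values are placeholders):

#!/bin/bash
#SBATCH -J shared_example        # placeholder job name
#SBATCH --ntasks=4               # fewer than 24 cores, so the job runs in shared mode
#SBATCH --mem=8000               # memory request in MB (see the --mem directive in section 7)
#SBATCH --time=00:30:00          # placeholder walltime
srun -n $SLURM_NTASKS ./my_executable   # my_executable is a placeholder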

 

3) Commands

This table lists useful commands for submitting and monitoring jobs.

Submit a job : sbatch script_name.slurm
List all the jobs : squeue
List only your jobs : squeue -u <login>
Display the characteristics of a job : scontrol show job <job_id>
Estimate the start time of a job (may vary …) : squeue --start --job <job_id>
Estimate the start times of your own jobs : squeue -u <login> --start
Cancel a job : scancel <job_id>
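
As an illustration, a typical sequence could be (the script name and job ID are placeholders):

sbatch my_script.slurm           # sbatch answers with a message of the form: Submitted batch job <job_id>
squeue -u <login>                # check that the job is queued or running
squeue --start --job <job_id>    # ask for an estimated start time
scancel <job_id>                 # cancel the job if needed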

4) Submitting an HTC job

PSERIE is a tool that allows you to perform sequential tasks, or jobs, within an MPI job. These tasks correspond to command lines containing a serial (sequential) executable and its argument(s), if any. The executable is called pserie_lb.

a) Description :

When using this version, MPI process (rank) #0 distributes the tasks to be performed to the other MPI processes and does not execute any command itself. Each of the other processes is assigned a task; once a process has completed its task, rank 0 assigns it a task that has not yet been executed. This distribution continues as long as command lines of the input file remain unprocessed. There is no fixed order of execution: the first MPI process ready to receive a task is assigned the first available (not yet executed) task from the list in the input file. The input file can contain more command lines than there are processes.

b) Utilization

To use this tool, three modules must be loaded: intel/17.0, openmpi/intel/2.0.1 and pserie. pserie requires an input file containing the commands to be executed.

Example of input file input.dat :

./my_executable my_parameter_equal_to_parameter_1
./my_executable my_parameter_equal_to_parameter_2
./my_executable my_parameter_equal_to_parameter_3
./my_executable my_parameter_equal_to_parameter_4
./my_executable my_parameter_equal_to_parameter_5
...

c) Example of SLURM script :

#!/bin/bash
#SBATCH -J job_name               # job name
#SBATCH --nodes=2                 # 2 nodes
#SBATCH --ntasks=48               # 2 x 24 cores = 48 MPI processes
#SBATCH --constraint=HSW24        # Haswell nodes only
#SBATCH --time=00:30:00           # walltime limit
module purge
module load intel/17.0
module load openmpi/intel/2.0.1
module load pserie
srun -n $SLURM_NTASKS pserie_lb < input.dat   # pserie_lb reads the command list from input.dat

Note: an example of use, with the associated SLURM script, is available under /opt/software/occigen/tools/pserie/0.1/intel/17.0/openmpi/intel/2.0.1/example. The input file, named fdr.dat, contains the list of commands to execute; these are simple echo and sleep commands.

5) Submitting a job in the “Bonus” queue

The “bonus” configuration is deployed and active on OCCIGEN. It optimizes the machine’s activity by using the compute cycles available during periods of low load: users are granted a credit of bonus hours corresponding to 20% of their DARI allocation, in addition to it. These bonus jobs can start when the drop in activity allows it, with no guarantee of start time or even of being run, since their priority is lower than that of regular production jobs.

SLURM then manages the request (authorization, limits, etc.). A bonus job is submitted by specifying the following parameter in the script :

#SBATCH --qos=bonus

Note that bonus jobs:

  • cannot exceed 24 hours
  • cannot use shared nodes
  • cannot use more than 100 nodes
  • cannot start as long as a production job is pending
  • are not refunded in case of a problem on the compute nodes
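
For illustration, a bonus job respecting these limits could start with the following minimal sketch (node and task counts, job name and executable are placeholders):

#!/bin/bash
#SBATCH -J bonus_example          # placeholder job name
#SBATCH --qos=bonus               # submit to the bonus queue
#SBATCH --constraint=HSW24        # Haswell nodes only, as an example
#SBATCH --nodes=10                # well below the 100-node limit
#SBATCH --ntasks=240              # 10 x 24 cores
#SBATCH --time=24:00:00           # bonus jobs cannot exceed 24 hours
srun -n $SLURM_NTASKS ./my_mpi_executable   # placeholder executable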

6) Environment variables

The following utility variables (non-exhaustive list) can be used in the shell commands of a submission script.

$SLURM_JOB_ID : ID of the job
$SLURM_JOB_NAME : name of the job (specified by the “#SBATCH -J” directive)
$SLURM_SUBMIT_DIR : initial directory (in which the sbatch command was run)
$SLURM_NTASKS : number of MPI processes of the job
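
For example, these variables can be used in the body of a submission script as follows (my_executable is a placeholder):

cd $SLURM_SUBMIT_DIR                                        # go back to the directory where sbatch was run
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) uses $SLURM_NTASKS MPI processes"
srun -n $SLURM_NTASKS ./my_executable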

7) Common SBATCH options

In the SLURM script, the following directives can be used to describe the requested resources :

  • To specify the number of physical nodes of the machine :
    #SBATCH --nodes=W

  • To specify the total number of MPI processes of the job (a node provides 48 or 56 hyperthreaded cores) :
    #SBATCH --ntasks=X

  • To specify, for OpenMP codes, how many threads will run on each physical core of the node :
    #SBATCH --threads-per-core=Y

  • To specify the maximum amount of memory wanted on one node (it cannot exceed the total memory of the node, 64 GB or 128 GB) :
    #SBATCH --mem=Z

  • To specify that you do not want your job to be placed on a shared node :
    #SBATCH --exclusive

In a shared node, each job can consume all or part of the node’s memory. However, a job may require much more memory than the share corresponding to the cores it requests.

In that case, the job manager will not place a new job on this node until the memory is released. A job is charged the maximum between the number of cores it requests and its memory request in GB divided by 5. For example, a job J1 requesting one core and little memory is charged 1 core, whereas a job J2 requesting one core (--ntasks=1) and 25 GB of memory is charged 5 cores: in this simplified example, the maximum is 5.

Here are some examples of how to use these parameters.

Example 1:

#SBATCH --nodes=1             # only one node
#SBATCH --ntasks=48           # as many tasks as desired
#SBATCH --threads-per-core=2  # number of threads per physical core: 48 tasks / 2 = 24 physical cores are reserved here

If you do not reserve all the core and/or memory resources, another job can start on the node using the remaining resources.

It is possible to impose the “dedicated node” mode by using the directive :

#SBATCH --exclusive

If no memory request is specified by the user in the submission script, a job running on a shared node is assigned a memory limit of 1 GB per task.
This default value is deliberately low to encourage users to define their needs. Here is the directive to use to express your memory need :

#SBATCH --mem=4000 # 4 GB of memory per task
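
Putting these directives together, a submission script for a small shared-mode run could look like the following minimal sketch (job name, executable, module version and all values are placeholders, not an official template):

#!/bin/bash
#SBATCH -J small_job              # placeholder job name
#SBATCH --nodes=1                 # a single node
#SBATCH --ntasks=8                # fewer than 24 cores, so the job runs in shared mode
#SBATCH --mem=16000               # about 16 GB for the job instead of the default 1 GB per task
#SBATCH --time=01:00:00           # placeholder walltime
module purge
module load intel/17.0            # module version taken from the pserie example above
srun -n $SLURM_NTASKS ./my_executable   # my_executable is a placeholder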

8) Common mistakes :

When a job exceeds its memory request, the process responsible is killed by SLURM and the job is stopped. Other jobs active on the node are not impacted: a “memory overflow” is handled at the Linux kernel level, so neither the node nor the other jobs should be affected.
An error message then appears in the output file:

/SLURM/job919521/slurm_script: line 33: 30391 Killed /home/user/TEST/memslurmstepd: Exceeded step memory limit at some point.

Shared jobs are also taken into account in the blocking mechanism in case of over-occupation of the /home and /scratch storage spaces. Shared jobs are assigned to partitions whose names are suffixed with s. Example: BLOCKED_home_s for a shared job blocked for exceeding a quota on /home.

To check whether a job has started in shared mode, just look at the partition to which it is assigned:

login@occigen54:~/TEST$ squeue -u login
  JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
 919914    shared TEST_SHA login  R  0:04     1 occigen3100

We see that job 919914 runs in the shared partition.

To know the status of the shared nodes, run the command: sinfo -p shared -o "%.15n %.8T %.8m %.8O %.8z %.14C"

gil@occigen57:~/TEST$ sinfo -p shared -o "%.15n %.8T %.8m %.8O %.8z %.14C"
      HOSTNAMES    STATE   MEMORY CPU_LOAD    S:C:T  CPUS(A/I/O/T)
    occigen3100    mixed   128000     0.00   2:12:2     24/24/0/48
    occigen3101     idle   128000     0.00   2:12:2      0/48/0/48
    occigen3102     idle   128000     0.00   2:12:2      0/48/0/48
    occigen3103     idle   128000     0.01   2:12:2      0/48/0/48
    occigen3104     idle   128000     0.01   2:12:2      0/48/0/48
    occigen3105     idle   128000     0.00   2:12:2      0/48/0/48
gil@occigen57:~/TEST$

We see that there are (at the time the command was run) six nodes in the “shared” partition (occigen3100 to occigen3105).

The occigen3100 node in the “mixed” state already contains one or more jobs.

This node is occupied on half of its cores (24/24/0/48): 24 allocated cores, 24 idle cores, 0 in the “other” state, out of a total of 48 cores on the node.

Last modified: 1 July 2019
CINES