Using the IRIDIA Cluster

From IridiaWiki

Cluster composition

Currently the IRIDIA cluster is composed of 2 servers (majorana and polyphemus) and 32 rack units (computational nodes). Each rack unit has 2 AMD Opteron244 CPUs running at 1.75 GHz and 2 GB of RAM (nodes r02 to r17 have 4 modules of 512 MB each, nodes r18 to r33 have 8 modules of 256 MB each; all modules are 400 MHz DDR ECC REG DIMM). In total the cluster has 64 CPUs dedicated to computations and 3 CPUs for administrative purposes.


COMPLEX_NAME: opteron244

- AMD Opteron244 (2 CPUs @ 1.75 GHz)

nodes: r02, r03, r04, r05, r06, r07, r08, r09, r10, r11, r12, r13, r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31, r32, r33


Queues

Each computational node has the following 4 queues:


  • <machine>.short: at most 5 jobs can run concurrently in this queue, at nice level 2. Each job can use at most 24h of CPU time (that is, the time actually spent executing the program, excluding the time the system spends on multitasking, etc.). A job that reaches the 24th hour receives a SIGUSR1 signal and, some time later, a SIGKILL that terminates it.
  • <machine>.medium: at most 3 jobs can run concurrently in this queue, at nice level 3 (lower priority than the short ones). Each job can use at most 72h of CPU time, counted as above. A job that reaches the 72nd hour receives a SIGUSR1 signal and, some time later, a SIGKILL that terminates it.
  • <machine>.long: only 1 job at a time can run in this queue, at nice level 3 (lower priority than the short ones). The job can use at most 168h of CPU time, counted as above. A job that reaches the 168th hour receives a SIGUSR1 signal and, some time later, a SIGKILL that terminates it.
  • <machine>.par: at most 3 jobs can run concurrently in this queue, at nice level 3 (lower priority than the short ones).

In summary, each node can run up to 12 jobs concurrently (distributed over 2 CPUs).


YOU HAVE TO DESIGN YOUR COMPUTATIONS IN SUCH A WAY THAT EACH SINGLE JOB DOESN'T RUN FOR MORE THAN 7 DAYS (of CPU time).
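
Since a job receives SIGUSR1 some time before the final SIGKILL, a job script can use that interval to save its state. The following is only a minimal sketch of one way to do it in bash, assuming the computation can be split into short steps. The file name checkpoint.txt and the functions do_step and run_job are hypothetical placeholders, not part of the cluster setup.

```bash
#!/bin/bash
#$ -N name_of_the_long_job
#$ -l complex_name
#$ -l longtime
#$ -cwd

# Hypothetical checkpoint file (an assumption made for this sketch).
CHECKPOINT="${CHECKPOINT:-checkpoint.txt}"
stop_requested=0

# The handler only sets a flag; the main loop reacts to it between steps.
on_sigusr1() {
    stop_requested=1
}
trap on_sigusr1 USR1

# Placeholder for one short unit of real work.
do_step() {
    echo "step $1" >> "$CHECKPOINT"
}

# Run up to $1 steps, stopping early once SIGUSR1 has been received.
run_job() {
    local max_steps=$1
    local step
    for ((step = 1; step <= max_steps; step++)); do
        if [ "$stop_requested" -eq 1 ]; then
            echo "stopped before step $step" >> "$CHECKPOINT"
            return 1
        fi
        do_step "$step"
    done
    return 0
}

run_job 3
```

Note that bash runs a trap handler only after the current foreground command has returned, so each step should be short compared to the delay between SIGUSR1 and SIGKILL.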


How to submit a job

To submit a job that lasts up to 1 day, specify -l COMPLEX_NAME -l shorttime in the shell script passed to the qsub command (COMPLEX_NAME stands for the complex name of the nodes, for example opteron244), as in this example:

#!/bin/bash
#$ -N name_of_the_short_job
#$ -l complex_name
#$ -l shorttime
#$ -cwd


To submit a job that lasts up to 3 days, specify -l COMPLEX_NAME -l mediumtime in the shell script passed to the qsub command, as in this example:

#!/bin/bash
#$ -N name_of_the_medium_job
#$ -l complex_name
#$ -l mediumtime
#$ -cwd


To submit a job that lasts up to 7 days, specify -l COMPLEX_NAME -l longtime in the shell script passed to the qsub command, as in this example:

#!/bin/bash
#$ -N name_of_the_long_job
#$ -l complex_name
#$ -l longtime
#$ -cwd


To submit a job that runs in the parallel environment, specify -l COMPLEX_NAME -l parallel in the shell script passed to the qsub command, as in this example:

#!/bin/bash
#$ -N name_of_the_parallel_job
#$ -l complex_name
#$ -l parallel
#$ -pe parallel_environment_name number_of_slots
#$ -cwd


THE SCHEDULER CANNOT RUN MORE THAN 64 JOBS OF THE SAME USER AT THE SAME TIME. IF YOU SUBMIT MORE THAN 64 JOBS, AT MOST 64 WILL BE RUNNING AT ANY ONE TIME.
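
When an experiment consists of many independent runs, the job scripts can be generated and submitted in a loop. The sketch below is one possible way to do this, not an official procedure: the program my_program, its --seed option and the run_N names are hypothetical, and the second argument of submit_all exists only so that the loop can be tried with echo instead of qsub. Presumably, jobs beyond the 64 running ones simply wait in the queue.

```bash
#!/bin/bash

# Write a short-queue job script for run number $1 and print its file name.
# "./my_program --seed" is a hypothetical command used for illustration.
write_job_script() {
    local index=$1
    local file="run_${index}.sh"
    cat > "$file" <<EOF
#!/bin/bash
#\$ -N run_${index}
#\$ -l complex_name
#\$ -l shorttime
#\$ -cwd
./my_program --seed ${index}
EOF
    echo "$file"
}

# Generate and submit $1 job scripts; $2 is the submit command (default: qsub).
submit_all() {
    local count=$1
    local submit_cmd=${2:-qsub}
    local i
    for ((i = 1; i <= count; i++)); do
        "$submit_cmd" "$(write_job_script "$i")"
    done
}

# Example (run on the submit host): submit_all 100
# Dry run without the scheduler:    submit_all 100 echo
```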


Programming tips for the cluster

If a job needs to read or write large amounts of data, or does so frequently, it is better to copy the input files to the /tmp directory (which is on the local hard drive of the node) and to write the output files there as well, moving them to the /home/user_name directory only when the computation is over. This way the job does not go through NFS for every read/write operation, which relieves majorana of some load (the /home partition is exported from there to all the nodes) and makes the job faster (Prasanna measured a 5x speedup on his code).
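
The staging pattern described above could look roughly like the following job script. It is a sketch under assumptions: the function stage_and_run, the files input.dat and output.dat, the directory experiment and the program my_program are all hypothetical names, and the job is assumed to read one input file and produce one output file.

```bash
#!/bin/bash
#$ -N name_of_the_short_job
#$ -l complex_name
#$ -l shorttime
#$ -cwd

# Copy one input file to a private directory under /tmp, run a command there,
# then move the output file back to the directory on /home.
# Usage: stage_and_run HOME_DIR INPUT OUTPUT COMMAND [ARGS...]
stage_and_run() {
    local home_dir=$1   # directory on /home (NFS) holding the input
    local input=$2      # input file name inside home_dir
    local output=$3     # output file name the command writes
    shift 3             # the remaining arguments are the command to run
    local work_dir
    work_dir=$(mktemp -d /tmp/job.XXXXXX) || return 1

    cp "$home_dir/$input" "$work_dir/" || return 1

    # All reads and writes of the command now happen on the local disk.
    ( cd "$work_dir" && "$@" )
    local status=$?

    # One NFS transfer at the end instead of one per write operation.
    if [ -f "$work_dir/$output" ]; then
        mv "$work_dir/$output" "$home_dir/"
    fi
    rm -rf "$work_dir"
    return $status
}

# Example call (hypothetical paths and program):
# stage_and_run "$HOME/experiment" input.dat output.dat \
#     "$HOME/experiment/my_program" input.dat output.dat
```

Removing the temporary directory at the end matters here, because /tmp on a node is shared by all the jobs that run on it.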