Difference between revisions of "Using the IRIDIA Cluster"

From IridiaWiki
Line 69:
  | c0-19 || || ||
  |-
- | c0-20 || || 20 Feb 2008 ||
+ | c0-20 || || ||
  |-
  | c0-21 || || ||
  |-
- | c0-22 || || 20 Feb 2008 ||
+ | c0-22 || || ||
  |-
  | c0-23 || || ||
  |-
- | c0-24 || offline || 20 Feb 2008 || everything broken
+ | c0-24 || || ||
  |-
- | c0-25 || || 20 Feb 2008 ||
+ | c0-25 || offline || 20 Feb 2008 || everything broken
  |-
  | c0-26 || || ||
Revision as of 18:32, 20 February 2008

Cluster composition

Currently the IRIDIA cluster is composed of two racks. The first contains 1 server (majorana) and 32 computational nodes (c0-0 to c0-31); the second contains 16 computational nodes (c1-0 to c1-15). In total the cluster has 96 CPUs dedicated to computation (64 single-core and 32 dual-core, i.e. 128 cores) and 2 CPUs for administrative purposes.


Each unit in the first rack (c0-0 to c0-31) has 2 AMD Opteron 244 CPUs (each with 1 MB of L2 cache) running at 1.75 GHz and 2 GB of RAM. Nodes c0-0 to c0-15 have 4 modules of 512 MB 400 MHz DDR ECC REG DIMM. Nodes c0-16 to c0-31 have 8 modules of 256 MB 400 MHz DDR ECC REG DIMM.

Each unit in the second rack (c1-0 to c1-15) has 2 Dual-Core AMD Opteron 2216 HE processors (each with 2x1 MB of L2 cache) running at 2.4 GHz and 4 GB of RAM.


COMPLEX_NAME: opteron244

- 2 Single-Core AMD Opteron 244 @ 1.75 GHz

nodes: c0-0, c0-1, c0-2, c0-3, c0-4, c0-5, c0-6, c0-7, c0-8, c0-9, c0-10, c0-11, c0-12, c0-13, c0-14, c0-15, c0-16, c0-17, c0-18, c0-19, c0-20, c0-21, c0-22, c0-23, c0-24, c0-25, c0-26, c0-27, c0-28, c0-29, c0-30, c0-31


COMPLEX_NAME: opteron2216

- 2 Dual-Core AMD Opteron 2216 HE @ 2.4 GHz

nodes: c1-0, c1-1, c1-2, c1-3, c1-4, c1-5, c1-6, c1-7, c1-8, c1-9, c1-10, c1-11, c1-12, c1-13, c1-14, c1-15


Cluster status

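{|
! Node !! Effect !! Date !! Note
|-
| c0-0 || || 20 Feb 2008 ||
|-
| c0-1 || reserved for the ESA project || 20 Feb 2008 || to submit jobs here add -l esa (HW: first NIC broken, uses spare NIC now)
|-
| c0-2 || reserved for the ESA project || 30 Nov 2007 || to submit jobs here add -l esa
|-
| c0-3 || reserved for the ESA project || 3 Dec 2007 || to submit jobs here add -l esa
|-
| c0-4 || reserved for the ESA project || 30 Nov 2007 || to submit jobs here add -l esa
|-
| c0-5 || reserved for the ESA project || 3 Dec 2007 || to submit jobs here add -l esa
|-
| c0-6 || reserved for the ESA project || 4 Dec 2007 || to submit jobs here add -l esa
|-
| c0-7 || reserved for the ESA project || 4 Dec 2007 || to submit jobs here add -l esa
|-
| c0-8 || reserved for the ESA project || 4 Dec 2007 || to submit jobs here add -l esa
|-
| c0-9 || || 20 Feb 2008 || (HW: first NIC broken, uses spare NIC now)
|-
| c0-10 || || ||
|-
| c0-11 || || ||
|-
| c0-12 || || ||
|-
| c0-13 || || ||
|-
| c0-14 || || ||
|-
| c0-15 || || ||
|-
| c0-16 || reserved for the ESA project || 23 Jan 2008 || to submit jobs here add -l esa
|-
| c0-17 || || ||
|-
| c0-18 || || ||
|-
| c0-19 || || ||
|-
| c0-20 || || ||
|-
| c0-21 || || ||
|-
| c0-22 || || ||
|-
| c0-23 || || ||
|-
| c0-24 || || ||
|-
| c0-25 || offline || 20 Feb 2008 || everything broken
|-
| c0-26 || || ||
|-
| c0-27 || || ||
|-
| c0-28 || reserved for the ESA project || 23 Jan 2008 || to submit jobs here add -l esa
|-
| c0-29 || || 5 Feb 2008 || (HW: first NIC broken, uses spare NIC now)
|-
| c0-30 || || ||
|-
| c0-31 || reserved for the ESA project || 23 Jan 2008 || to submit jobs here add -l esa
|-
| c1-0 || || ||
|-
| c1-1 || || ||
|-
| c1-2 || || ||
|-
| c1-3 || || ||
|-
| c1-4 || || 5 Feb 2008 || (HW: problem with CMOS battery, may not boot up automatically)
|-
| c1-5 || || ||
|-
| c1-6 || || ||
|-
| c1-7 || || ||
|-
| c1-8 || || ||
|-
| c1-9 || || ||
|-
| c1-10 || || ||
|-
| c1-11 || || ||
|-
| c1-12 || || ||
|-
| c1-13 || || ||
|-
| c1-14 || || ||
|-
| c1-15 || || ||
|}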

Queues

Each computational node has the following queues; each queue has as many slots as the node has cores:


  • <node>.short: jobs in this queue run at nice level 2 for at most 24h of CPU time (the time the program actually spends executing, not counting the time the system spends on multitasking, etc.). A job still running after the 24th hour receives a SIGUSR1 signal and, some time later, a SIGKILL that terminates it.
  • <node>.medium: jobs in this queue run at nice level 3 (lower priority than the short ones) for at most 72h of CPU time, measured in the same way. A job still running after the 72nd hour receives a SIGUSR1 signal and, some time later, a SIGKILL that terminates it.
  • <node>.long: jobs in this queue run at nice level 3 (lower priority than the short ones) for at most 168h of CPU time, measured in the same way. A job still running after the 168th hour receives a SIGUSR1 signal and, some time later, a SIGKILL that terminates it.
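
The time limits above leave a job a short window between SIGUSR1 and SIGKILL. The following is a minimal sketch of a job script that uses this window to stop cleanly, assuming the warning signal is SIGUSR1 as stated above. The file name job_state.txt and the five-step loop are placeholders for illustration, not part of the cluster setup:

```shell
#!/bin/bash
#$ -N NAME_OF_JOB
#$ -cwd

# Hypothetical file name, chosen only for this sketch.
STATE_FILE="${STATE_FILE:-job_state.txt}"
stop_requested=0
iteration=0

# The handler only sets a flag; the loop below notices it and stops cleanly.
on_sigusr1() {
    stop_requested=1
}
trap on_sigusr1 USR1

# Stand-in for the real computation: five trivial steps.
while [ "$iteration" -lt 5 ] && [ "$stop_requested" -eq 0 ]; do
    iteration=$((iteration + 1))
done

# Record how far the job got so that a follow-up job could resume from here.
echo "iteration=$iteration stop_requested=$stop_requested" > "$STATE_FILE"
```

Note that bash runs a trap only after the current foreground command returns, so a long-running external program would need to be started in the background and waited on with wait for the handler to fire in time.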


In summary: since each node has three queues with one slot per core, up to 6 jobs (Opteron 244 nodes, 2 cores) or 12 jobs (Opteron 2216 nodes, 4 cores) can run concurrently on a node, with an average of 341MB of RAM per job (2GB / 6 or 4GB / 12).
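
The figures above follow from three queues per node with one slot per core. A small sketch of the arithmetic, using the core counts and RAM sizes given in the cluster composition section (the function names are made up for this example):

```shell
#!/bin/bash
# Three queues per node (short, medium, long), one slot per core in each.
QUEUES_PER_NODE=3

# $1 = number of cores in the node
slots_per_node() {
    echo $(( $1 * QUEUES_PER_NODE ))
}

# $1 = number of cores in the node, $2 = RAM of the node in MB
ram_per_job_mb() {
    echo $(( $2 / ($1 * QUEUES_PER_NODE) ))
}

echo "opteron244:  $(slots_per_node 2) slots, $(ram_per_job_mb 2 2048) MB per job"
echo "opteron2216: $(slots_per_node 4) slots, $(ram_per_job_mb 4 4096) MB per job"
```

Both node types come out at 341MB per job because the second rack has twice the cores and twice the RAM.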


YOU HAVE TO DESIGN YOUR COMPUTATIONS IN SUCH A WAY THAT EACH SINGLE JOB DOESN'T RUN FOR MORE THAN 7 DAYS (of CPU time).

How to submit a job

To submit a job, create a script (referred to here as SCRIPT_NAME.sh) containing the commands you want to execute. Once you have the script, run the following command from majorana: qsub SCRIPT_NAME.sh


The script should begin like this:

#!/bin/bash
#$ -N NAME_OF_JOB
#$ -cwd

In this case the job will be scheduled on any node, in any of that node's queues.


If you want, you can impose some restrictions on the kind of node or queue to choose from.

To select a node with Opteron 244 CPUs, add the line #$ -l opteron244 to your script; to select a node with Opteron 2216 CPUs, add the line #$ -l opteron2216. To restrict the job to the short queues, add the line #$ -l short; for the medium queues, add #$ -l medium; for the long queues, add #$ -l long.
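
As an illustration, a script restricted to the short queues of the Opteron 2216 nodes might look like the sketch below. It assumes that a node request and a queue request can be combined by giving both lines; the body is only a placeholder workload:

```shell
#!/bin/bash
#$ -N NAME_OF_JOB
#$ -cwd
#$ -l opteron2216
#$ -l short

# Placeholder workload: report which node the job landed on.
node_name="$(uname -n)"
echo "Job running on $node_name"
```

The script is then submitted from majorana with qsub SCRIPT_NAME.sh, as described above.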


THE SCHEDULER ADDS A JOB TO THE QUEUEING SYSTEM ONLY IF THERE ARE FEWER THAN 10000 JOBS IN THE QUEUE; A SINGLE USER CAN ADD A JOB TO THE QUEUEING SYSTEM ONLY IF THEY HAVE FEWER THAN 1500 JOBS IN THE QUEUE.


IF AN EXECUTION SLOT IS AVAILABLE, THE SCHEDULER ASSIGNS IT TO A USER'S JOB ONLY IF THAT USER HAS FEWER THAN 128 JOBS ALREADY RUNNING.
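
When submitting many jobs from a loop, it can help to check the per-user limit before each qsub. The sketch below assumes that qstat -u prints a two-line header followed by one line per job, which may not hold for every scheduler version or for array jobs; the function names are invented for this example:

```shell
#!/bin/bash
MAX_QUEUED=1500   # per-user limit quoted above

# Count job lines in qstat output read from stdin,
# assuming the first two lines are a header.
count_jobs() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Succeed only if one more job still fits under the per-user limit.
# $1 = number of jobs the user currently has in the queueing system
can_submit() {
    [ "$1" -lt "$MAX_QUEUED" ]
}

# Possible use on majorana (commented out so the sketch runs anywhere):
# if can_submit "$(qstat -u "$USER" | count_jobs)"; then
#     qsub SCRIPT_NAME.sh
# fi
```

Keeping the counting separate from the qstat call makes the header assumption easy to adjust if the real output differs.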

Programming tips for the cluster

If a job reads or writes large amounts of data, or does so frequently, it is better for the submission script to copy the input files to the /tmp directory (which is on the node's local hard drive) and to write the output files there as well, moving them to the /home/user_name directory only when the computation is over. This way your job does not go through NFS for each read/write operation, which makes it faster and takes load off majorana (the /home partition is exported from there to all the nodes).

REMEMBER TO REMOVE YOUR FILES FROM THE /tmp DIRECTORY ONCE THE COMPUTATION IS OVER.
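
The staging pattern described in this section can be sketched as follows. The function name run_staged, the file names input.dat and output.dat, and the use of sort as a stand-in for the real program are all assumptions made for illustration:

```shell
#!/bin/bash
#$ -N NAME_OF_JOB
#$ -cwd

# run_staged INPUT OUTPUT_DIR
# Copy INPUT to a private directory under /tmp, work there, move the
# result to OUTPUT_DIR, and always remove the scratch directory.
run_staged() {
    local input="$1" output_dir="$2" scratch
    scratch="$(mktemp -d /tmp/job.XXXXXX)" || return 1

    # sort is a stand-in for the real program; all heavy I/O stays local.
    cp "$input" "$scratch/input.dat" \
        && sort "$scratch/input.dat" > "$scratch/output.dat" \
        && mv "$scratch/output.dat" "$output_dir/"
    local status=$?

    # Clean up /tmp whether or not the steps above succeeded.
    rm -rf "$scratch"
    return "$status"
}

# Example call with hypothetical paths (adapt inside a real job):
# run_staged "/home/user_name/data/input.dat" "/home/user_name/results"
```

Using a per-job directory from mktemp keeps concurrent jobs on the same node from overwriting each other's files and makes the final cleanup a single rm.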