IRIDIA cluster maintenance

From IridiaWiki

This page contains information on maintaining the cluster: installing new software, adding and removing nodes, security, etc.

Adding a new diskless node

For a new client to be fully functional, the server must first be configured to allow the client to boot from the network. Then the new client must be added to the client list of SGE (Sun Grid Engine). The current client kernel assumes that the client has an Intel PRO/1000 network card. At the moment, other cards require recompiling the kernel and making other modifications to the net-booting process.

  1. switch the client on while it is attached to a keyboard and a monitor;
  2. enter the BIOS and configure the client not to halt when the keyboard, video card, floppy drive, or any other device is missing;
  3. configure it to boot from LAN;
  4. let it boot and, if it appears, write down the MAC address of the network card; otherwise switch it off.

Finding the MAC address of a new client

The MAC address is a sequence of 12 hexadecimal digits, normally written in pairs, with the pairs separated by a ":" or a space. If you do not have it, you can obtain it as follows.

On the server, type the following:

tail -f /var/log/daemon.log

Switch on the client and let it boot from the network (it will fail). Now look at the server's screen: a line like the following will appear:

DHCPDISCOVER from 00:13:16:69:71:fa via eth1

The digits between "from" and "via" are the MAC address.
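If the log is busy, the address can be filtered out automatically. The following is a minimal sketch, not part of the cluster's tooling: the function name extract_mac is hypothetical, and it assumes that the log lines have exactly the DHCPDISCOVER format shown above.

```shell
# Hypothetical helper: read log lines on stdin and print the MAC address
# found in every DHCPDISCOVER line (other lines are ignored).
extract_mac() {
    sed -n 's/.*DHCPDISCOVER from \([0-9A-Fa-f:]\{17\}\) via.*/\1/p'
}

# Example with the sample line from above.
echo 'DHCPDISCOVER from 00:13:16:69:71:fa via eth1' | extract_mac
```

On the server it could be used as "tail -f /var/log/daemon.log | extract_mac" while the client is booting.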

Now for the final steps. Suppose that the MAC address is 00:13:16:69:71:fa, that the new host name will be p69, and that its IP address will be 192.168.100.69. On the server, edit the file

/etc/dhcpd.conf

Search for the block where the other nodes are defined (for instance, look for "host p02") and add the following after the last host definition of the group:

host p69 {
        hardware ethernet 00:13:16:69:71:fa;
        fixed-address 192.168.100.69;
}
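To avoid typing mistakes, the stanza can be generated rather than written by hand. This is only a sketch: dhcp_host_entry is a hypothetical helper, and the MAC check accepts only the colon-separated form used above.

```shell
# Hypothetical helper: print a dhcpd.conf host stanza for one client.
# Usage: dhcp_host_entry NAME MAC IP
dhcp_host_entry() {
    name=$1
    mac=$2
    ip=$3
    # Refuse anything that is not six colon-separated pairs of hex digits.
    if ! echo "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'; then
        echo "invalid MAC address: $mac" >&2
        return 1
    fi
    printf 'host %s {\n' "$name"
    printf '        hardware ethernet %s;\n' "$mac"
    printf '        fixed-address %s;\n' "$ip"
    printf '}\n'
}

dhcp_host_entry p69 00:13:16:69:71:fa 192.168.100.69
```

The output still has to be pasted into /etc/dhcpd.conf inside the group that contains the other nodes.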

Execute

/etc/init.d/dhcp restart

Add the new host to /etc/hosts:

...
192.168.100.69  p69
...
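If the step is scripted, it helps to make it safe to repeat. The sketch below assumes a hypothetical helper add_host that appends the line only when no entry for that IP address exists yet; the file argument is there so that it can be tried on a copy first.

```shell
# Hypothetical helper: append "IP  NAME" to a hosts file unless the IP
# is already listed. Usage: add_host IP NAME [FILE]  (FILE defaults to /etc/hosts)
add_host() {
    ip=$1
    name=$2
    file=${3:-/etc/hosts}
    # awk compares the first field literally, so dots are not treated as wildcards.
    if awk -v ip="$ip" '$1 == ip { found = 1 } END { exit !found }' "$file"; then
        echo "$ip is already present in $file" >&2
        return 0
    fi
    printf '%s  %s\n' "$ip" "$name" >> "$file"
}
```

For example, "add_host 192.168.100.69 p69" run as root would add the line shown above, and running it a second time would change nothing.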

Add the entries that export the client's filesystem to /etc/exports:

/var/lib/diskless/default/192.168.100.69/etc 192.168.100.69(rw,no_root_squash)
/var/lib/diskless/default/192.168.100.69/rw 192.168.100.69(ro,no_root_squash)
/var/lib/diskless/default/192.168.100.69/rw-secure 192.168.100.69(rw,no_root_squash)
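Since the three lines differ only in the client's IP address, they can be produced from it. A minimal sketch follows; exports_for is a hypothetical name, and the base path and options are simply copied from the entries above.

```shell
# Hypothetical helper: print the three /etc/exports lines for one diskless client.
# Usage: exports_for IP
exports_for() {
    ip=$1
    base=/var/lib/diskless/default/$ip
    printf '%s/etc %s(rw,no_root_squash)\n' "$base" "$ip"
    printf '%s/rw %s(ro,no_root_squash)\n' "$base" "$ip"
    printf '%s/rw-secure %s(rw,no_root_squash)\n' "$base" "$ip"
}

exports_for 192.168.100.69
```

Review the output before appending it to /etc/exports.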

Restart the NFS server:

/etc/init.d/nfs-kernel-server restart

And finally, execute:

update-host-directories

Then the host must be included in the Sun Grid Engine. Read and follow the instructions of the Sun ONE Grid Engine Administration and User's Guide, Chapter 2, "How to Install Execution Host". A copy of the guide can be found on the server in the file

/usr/local/sge/doc/SGE53AdminUserDoc.pdf

Adding new software/packages on the servers

Both the server and the clients run Debian. The Debian tool for managing package installation is apt-get.

Suppose you want to install a package named pippo on the server. As root, type:

apt-get update
apt-get install pippo

The program might complain that some other packages are missing; add their names to the previous command after pippo. It is usually possible to choose among three versions of the program (stable, testing and unstable). To request a particular one, use:

apt-get install pippo/unstable
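With the slash form only the named package is taken from the chosen release. apt-get also has the -t option, which makes the given release the preferred one for the dependencies as well. The wrapper below is a sketch of that alternative: install_pkg and the APT_GET variable are hypothetical, and the release must be listed in /etc/apt/sources.list for either form to work.

```shell
# Hypothetical wrapper: install a package, optionally preferring a release.
# Usage: install_pkg PACKAGE [RELEASE]
# Setting APT_GET=echo prints the arguments instead of running apt-get.
install_pkg() {
    pkg=$1
    release=${2:-}
    if [ -n "$release" ]; then
        # -t sets the target release for the package and its dependencies
        ${APT_GET:-apt-get} install -t "$release" "$pkg"
    else
        ${APT_GET:-apt-get} install "$pkg"
    fi
}
```

For a preview, "APT_GET=echo install_pkg pippo unstable" shows the arguments that would be passed to apt-get.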


Keeping packages up to date

The maintenance process differs between the different types of nodes. For package updates, the clients are configured (in /etc/apt/sources.list) to compare the versions of their own packages with those in /mnt/debmirror, which is an NFS directory located on majorana. Before any update can take place, the mirror on majorana must be updated.

Diskless nodes

Whatever change needs to be made can be made directly in the nfsroot on majorana. A very handy way to do that is the chroot command, which redefines the root directory to point to the one given as its argument. For instance, to upgrade the packages in the nfsroot, type the following commands:

chroot /var/lib/diskless/simple/root
apt-get update
apt-get dist-upgrade
exit

Remember that the clients do not directly see the directories /dev, /etc, /tmp and /var under /var/lib/diskless/simple/root. Any time a file is modified in one of these directories (99% of the time when new packages are installed or upgraded), it needs to be updated in the clients' private directories as well. The script update-new-hosts, located in /root/bin/, does the job. The changes are immediately visible on the clients.

Take care with services that are started during new installations or restarted during upgrades: they will run not on the clients but on the server! They must therefore be stopped and started again from a non-chrooted environment on majorana, and then restarted on each host using the command dsh. For example, the following commands should be given when updating SSH on all the clients.

chroot /var/lib/diskless/simple/root
apt-get update
apt-get upgrade ssh
...
exit
/etc/init.d/ssh restart
update-new-hosts
dsh -a -- /etc/init.d/ssh restart

The line after "exit" restarts the ssh service on majorana, which, during the clients' update, was started with the wrong configuration. The last line executes the command /etc/init.d/ssh restart on all hosts using dsh (the list of hosts and the groups into which they are divided are in /etc/dsh/ on majorana).
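The same sequence can be collected in a script so that no step is forgotten. What follows is a sketch under several assumptions: upgrade_service, run and DRY_RUN are hypothetical names, update-new-hosts is assumed to be on root's PATH, and "apt-get install" is used because it upgrades an already installed package without touching the others. Passing the command to chroot as arguments runs it non-interactively.

```shell
#!/bin/sh
# Sketch: upgrade one service package in the nfsroot and restart it everywhere.
NFSROOT=/var/lib/diskless/simple/root

# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: upgrade_service PACKAGE INIT_SCRIPT   (e.g. upgrade_service ssh ssh)
upgrade_service() {
    pkg=$1
    svc=$2
    run chroot "$NFSROOT" apt-get update
    run chroot "$NFSROOT" apt-get install "$pkg"
    run "/etc/init.d/$svc" restart            # fix the service on majorana itself
    run update-new-hosts                      # refresh the clients' private directories
    run dsh -a -- "/etc/init.d/$svc" restart  # restart the service on all hosts
}
```

Running it first with DRY_RUN=1 prints the five commands in order, so they can be checked before anything is changed.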

Nodes with disks (rack nodes)

There are two ways of maintaining the clients in the rack.

Method 1: Since each of them has its own filesystem, it is possible to execute a set of instructions on each of them using dsh. All packages can be updated by typing the following on majorana:

dsh -a -- "apt-get update && apt-get dist-upgrade"

This command sends "apt-get update" to all the machines (configured in /etc/dsh/ on majorana); if that succeeds, each machine then runs "apt-get dist-upgrade". On the clients, apt-get is configured to use the "--yes" option automatically, so that it assumes the answer "yes" to all questions and performs the update without interaction. The advantage of this method is that the computers need not be rebooted, so running jobs are not affected.

Method 2: Perform another installation boot. A complete installation of one computer does not take long (less than 1 hour). Moreover, FAI can be configured not to format the disks, thereby keeping the packages already installed on the client's filesystem. In this case the "installation" boot becomes a simpler "upgrade" boot: only the out-of-date packages are downloaded and replaced, and the whole process takes less time. The drawback of this approach is that the computer must be rebooted, so any running job is lost (although the queuing system is warned of the reboot and the jobs should be rescheduled). The advantage is that the configuration is guaranteed to be homogeneous on all clients.
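Because updating through dsh does not by itself guarantee that every rack node ends up with the same package versions, it may be useful to compare them afterwards. The sketch below is only an illustration: count_versions is a hypothetical helper, and it assumes input lines of the form "host: version", which is what dsh is expected to print when its -M (show machine names) option is given.

```shell
# Hypothetical helper: read "host: version" lines on stdin and print each
# distinct version preceded by the number of hosts that report it.
count_versions() {
    awk '{ sub(/^[^:]*: */, ""); count[$0]++ }
         END { for (v in count) print count[v], v }' | sort -rn
}

# Example with made-up host names and version strings.
printf '%s\n' 'r01: 3.8' 'r02: 3.8' 'r03: 3.9' | count_versions
```

On majorana the input could come from something like dsh -a -M -- "dpkg-query -W -f='\${Version}\n' ssh"; more than one line of output would mean that the nodes differ.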