Difference between revisions of "IRIDIA cluster maintenance"

From IridiaWiki

Latest revision as of 11:34, 25 January 2007

This page contains information on maintenance of the cluster. This includes installing new software, adding/removing nodes, security, etc.

TODO: Add something about security (AIDE, SNORT)


Adding a new diskless node

In order to have a new fully functional client, the client must first be configured to boot from the network. Then, the new client must be added to the client list of SGE on the server. The current client kernel assumes that the client has an Intel PRO/1000 card. At the moment, other cards require a recompilation of the kernel and other modifications to the net-booting process.

  1. switch the client on while it is attached to a keyboard and a monitor;
  2. enter the BIOS (by pressing the Delete key immediately after boot) and configure the client so that it does not stop when the keyboard, video card, floppy, or anything else is missing;
  3. configure it to boot from LAN;
  4. let it boot and, if it appears, write down the MAC address of the network card; switch it off otherwise.

Finding the MAC address of a new client

The MAC address is a sequence of 12 hexadecimal digits, normally written in pairs, with each pair separated by a ":" or a space. If you do not have it, you can get it in this way:

On majorana, type the following:

tail -f /var/log/daemon.log

Switch on the client and let it boot from the network (it will fail). Now look at the server's screen: a line like the following will appear:

DHCPDISCOVER from 00:13:16:69:71:fa via eth1

The numbers between "from" and "via" are the MAC address.
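
The MAC address can also be pulled out of such log lines automatically. The following is a minimal sketch, assuming the log lines have the format shown above; the function name extract_mac is hypothetical and is not an existing tool on majorana.

```shell
# extract_mac: read log lines on stdin and print the MAC address found in
# each DHCPDISCOVER line. Hypothetical helper; it assumes the
# "DHCPDISCOVER from <MAC> via <interface>" format shown above.
extract_mac() {
    sed -n 's/.*DHCPDISCOVER from \(\([0-9A-Fa-f]\{2\}:\)\{5\}[0-9A-Fa-f]\{2\}\) via.*/\1/p'
}

# Example: feed it one line in the format shown above.
echo "DHCPDISCOVER from 00:13:16:69:71:fa via eth1" | extract_mac
```

On majorana the same function could presumably be fed from the live log, for instance with "tail -f /var/log/daemon.log | extract_mac", so that only the MAC addresses of booting clients are shown.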

Next, the final steps. Let's say that the MAC address is 00:13:16:69:71:fa, the new host name will be p69 and its IP address will be 192.168.100.69. Then, on majorana edit the file

/etc/dhcpd.conf

Search for the block where the other nodes are defined, looking for instance for "host p02" and add the following after the last definition of the group:

host p69 {
        hardware ethernet 00:13:16:69:71:fa;
        fixed-address 192.168.100.69;
}

Execute

/etc/init.d/dhcp restart

Add the new host in /etc/hosts

...
192.168.100.69  p69
...

Re-create NIS maps (clients resolve names into IP addresses first using NIS, then using the DNS):

cd /var/yp
make

Add the new entries to export the file system in /etc/exports:

/var/lib/diskless/default/192.168.100.69/etc 192.168.100.69(rw,no_root_squash)
/var/lib/diskless/default/192.168.100.69/rw 192.168.100.69(ro,no_root_squash)
/var/lib/diskless/default/192.168.100.69/rw-secure 192.168.100.69(rw,no_root_squash)

Restart the NFS server:

/etc/init.d/nfs-kernel-server restart

And finally, execute:

update-host-directories
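
Since the entries above differ only in the host name, MAC address and IP address, they can be generated rather than typed by hand. The following is a rough sketch, assuming the same paths and options as in the text; the function new_node_fragments is a hypothetical helper that only prints the entries and does not modify any file.

```shell
# new_node_fragments NAME MAC IP
# Print (without touching any file) the entries described above for a new
# diskless node. Hypothetical helper; paths and export options are copied
# from the text, the "# /etc/..." lines are only labels for the reader.
new_node_fragments() {
    name=$1 mac=$2 ip=$3
    base=/var/lib/diskless/default/$ip

    echo "# /etc/dhcpd.conf"
    printf 'host %s {\n        hardware ethernet %s;\n        fixed-address %s;\n}\n' "$name" "$mac" "$ip"

    echo "# /etc/hosts"
    printf '%s  %s\n' "$ip" "$name"

    echo "# /etc/exports"
    printf '%s/etc %s(rw,no_root_squash)\n' "$base" "$ip"
    printf '%s/rw %s(ro,no_root_squash)\n' "$base" "$ip"
    printf '%s/rw-secure %s(rw,no_root_squash)\n' "$base" "$ip"
}

# Example with the values used in the text.
new_node_fragments p69 00:13:16:69:71:fa 192.168.100.69
```

The printed entries still have to be copied into the respective files by hand, and the services restarted as described above.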

Then the host must be included in the Sun Grid Engine. Read and follow the instructions of the "Sun ONE Grid Engine Administration and User's Guide, Chapter 2: How to Install Execution Host". A copy of the guide can be found on the server in the file

/usr/local/sge/doc/SGE53AdminUserDoc.pdf

Adding a new rack node

The procedure for adding a new rack node is very similar to that for a diskless node. First find its MAC address, then update the DHCP server configuration and /etc/hosts as described for diskless nodes. There is no need to modify /etc/exports or to restart the NFS server.

Then create a PXE configuration file for the new node to start a FAI installation boot. If the new rack node name is r69, then:

fai-chboot -IB r69

After the client has rebooted correctly, SGE on majorana must be "informed" that new clients are available.

On majorana as root

  1. qconf -ah <hostname>    (add node as administrative host)
  2. qconf -as <hostname>    (add node as submit host)

On <hostname> as root

  1. /usr/local/sge/install_execd    (install node as execution host)

Log on majorana as root and create the queues using qmon.

Adding new software/packages on the servers

Both the server and the client are running Debian. The Debian tool to manage program installation is apt-get.

Suppose you want to install a package, whose name is pippo, on the server:

As root, first type:

apt-get update
apt-get install pippo

The program might complain that some other packages are missing. Add their names to the previous command after pippo. It is usually possible to choose among three versions of the program (stable, testing and unstable). If you want to specify a particular version, use:

apt-get install pippo/unstable

Checking packages on the servers

As root, type:

dpkg-query -l name_of_package
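
The first column of the dpkg-query -l listing is a status code, and "ii" marks a package that is installed. As a small sketch, the listing can be filtered down to the installed package names; the function list_installed is a hypothetical helper, and the sample lines below are made-up illustrative data, not output taken from majorana.

```shell
# list_installed: read "dpkg-query -l" output on stdin and print the names
# of the packages whose status column is "ii" (installed).
# Hypothetical helper, not part of dpkg.
list_installed() {
    awk '$1 == "ii" { print $2 }'
}

# Example with illustrative sample lines (invented versions/descriptions):
printf '%s\n' \
    'ii  pippo  1.0-1  an installed example package' \
    'rc  pluto  0.9-2  a removed package with config files left' \
    | list_installed
```

With a real system the helper would be used as "dpkg-query -l name_of_package | list_installed"; an empty result then means that the package is not currently installed.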

Upgrading the servers

In order to upgrade the servers, login as root and type:

apt-get update
apt-get dist-upgrade

Keeping packages up to date

The maintenance process differs between the different types of nodes. For package updates, the clients are configured (in /etc/apt/sources.list) to compare the versions of their own packages with those in /mnt/debmirror, which is an NFS directory located on majorana. Before any update can take place, the mirror on majorana must be updated.

Diskless nodes

Whatever change needs to be made can be made directly in the nfsroot on majorana. A very handy way to do that is to use the command chroot, which redefines the root directory to point to the one specified as argument. For instance, to upgrade the packages in the nfsroot, type the following commands:

chroot /var/lib/diskless/simple/root
apt-get update
apt-get dist-upgrade
exit

Remember that the clients do not directly see the directories /dev, /etc, /tmp and /var under /var/lib/diskless/simple/root. Any time a file is modified in one of these directories (which is almost always the case when new packages are installed or upgraded), it needs to be updated in the clients' private directories as well. The script update-new-hosts, placed in /root/bin/, does the job. The changes are immediately seen on the clients.

One must take care of the services that are started during new installations or restarted during upgrades: they will not run on the clients but on the server! Therefore they must be stopped and re-executed from a non-chrooted environment on majorana, and then executed on each host using the command dsh. For example, the following commands should be given when updating SSH on all the clients.

chroot /var/lib/diskless/simple/root
apt-get update
apt-get upgrade ssh
...
exit
/etc/init.d/ssh restart
update-new-hosts
dsh -g athlon2400 athlon1400 athlon2800 -- /etc/init.d/ssh restart

The line after "exit" restarts the ssh service on majorana, which was executed with the wrong configuration during the update for the clients. The last line executes the command /etc/init.d/ssh restart on all hosts using dsh (the list of hosts and the groups in which they are divided are in /etc/dsh/ on majorana).

Nodes with disks (rack nodes)

There are two ways of maintaining the clients in the rack.

Method 1: Since each of them has its own file system, it is possible to execute a set of instructions on each of them using dsh. An update of all packages can be done by typing, on majorana, the following:

dsh -g opteron244 -- "apt-get update && apt-get dist-upgrade"

This command sends the command "apt-get update" to the machines in the opteron244 group (configured in /etc/dsh/ on majorana). If it is successful, it then executes the command "apt-get dist-upgrade". On the clients, apt-get is configured to automatically use the "--yes" option, in order to assume the answer "yes" to all questions and to perform an interaction-less update. The advantage of this method is that the computer does not need to be rebooted, so running jobs are not affected.
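
The quotes around the command matter: without them, the local shell on majorana would interpret "&&" itself, and only the first command would be sent to the clients. The sketch below illustrates this with a stand-in; the function remote is hypothetical and simply runs its arguments in a child shell, on the assumption that dsh hands the command string to a shell on each client in a similar way.

```shell
# remote: stand-in for "dsh -g <group> --" (hypothetical, for illustration).
# It runs its arguments in a child shell and marks the output as remote.
remote() {
    sh -c "$*" | sed 's/^/[remote] /'
}

# Quoted: both steps are part of the command handed to the "remote" shell.
remote "echo update && echo upgrade"

# Unquoted: the local shell splits at "&&", so the second step runs locally.
remote echo update && echo upgrade
```

In the first call both lines of output carry the [remote] mark; in the second call only "update" does, which corresponds to dist-upgrade being run on majorana instead of on the clients.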

Method 2: Perform another installation boot. A complete installation of one computer does not take long (less than 1 hour). Moreover, FAI can be configured not to format the disks, therefore keeping the packages already installed on the client's file system. In this case the "installation" boot becomes a simpler "upgrade" boot, since only the out-of-date packages will be downloaded and changed, and the whole process takes less time. The problem with this approach is that the computer must be rebooted, so any running job is lost (actually, the queuing system is warned of the reboot and the jobs should be rescheduled). The advantage is that the configuration is guaranteed to be homogeneous on all clients.

Adding a new user

Use the command adduser on majorana and recreate NIS maps. The user will be immediately seen on all other computers in the local network:

adduser
  <answer all questions>
cd /var/yp
make

Issues with Ganglia Monitor

Sometimes the Ganglia Monitor web page https://polyphemus.ulb.ac.be/ganglia/?m=cpu_report&r=hour&s=by%2520hostname&c=Polyphemus&h=&sh=1&hc=4 reports some hosts as down while they are not. What happens is that the ganglia-monitor daemon has simply stopped, and it is enough to restart it on the affected nodes. To do so, log in as root and issue the commands:

/etc/init.d/ganglia-monitor stop
/etc/init.d/ganglia-monitor start


Adding a DEB package on the rack

Suppose you want to install a package located on majorana, whose name is pippo.deb, on the rack nodes:

As root, from majorana first type:

cp /path/to/deb/pippo.deb /home/
dsh -c -g opteron244 -- dpkg -i /home/pippo.deb
rm /home/pippo.deb

If you want the package pippo.deb to be automatically installed by FAI in the future, add an entry to the file /usr/local/share/fai/package_config/CLUSTER_NODE on majorana.


Issues with the RAID system

At boot time, press Alt+3 to enter the RAID BIOS settings. According to SYSGEN, when the disks are removed and then reinserted in majorana, the positions in which the disks are put are not an issue.

The hard disks in use are three Maxtor MaxLine Plus II 250GB SATA/150 HDDs.