IRIDIA cluster todo

From IridiaWiki

This page contains a list of items which still need to be done on the cluster.

Errors & problems:

  • There is an intermittent error when installing the clients in the rack with FAI. The clients start to print a lot of output on screen, but unfortunately it scrolls too fast to be read. I could not find any way to pause it.
  • Neither yppasswd nor passwd works on the clients of the NIS domain. Users have to change their password from majorana.
  • Max: for LAM/MPI to work, the rsh alternative must be set to option 2 (ssh) on each node with update-alternatives --config rsh. (TODO: add this command to the FAI configuration files of the node image.)
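
The interactive menu of update-alternatives --config is awkward to run from an unattended install, so the rsh item above could be scripted with the non-interactive --set form instead. The following is only a sketch: the helper name set_rsh_to_ssh and the DRY_RUN variable are hypothetical, and the path /usr/bin/ssh is an assumption that should be checked on a node with update-alternatives --list rsh.

```shell
#!/bin/sh
# Hypothetical helper: select ssh as the "rsh" alternative without the
# interactive menu shown by "update-alternatives --config rsh".
# Assumption: ssh is registered as an rsh alternative under /usr/bin/ssh;
# verify with:  update-alternatives --list rsh
set_rsh_to_ssh() {
    ssh_path="${1:-/usr/bin/ssh}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command, so it can be reviewed before touching a node.
        echo "update-alternatives --set rsh $ssh_path"
    else
        # Needs root; fails if ssh_path is not a registered alternative.
        update-alternatives --set rsh "$ssh_path"
    fi
}

# Example (dry run):
#   DRY_RUN=1 set_rsh_to_ssh
```

A call to such a helper could go into whatever script FAI runs on the node image; where exactly that belongs in the FAI configuration is left open here.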

Improvements:

  • A daemon that checks the status of the UPS should be installed on both majorana and polyphemus.
  • Make the package configuration on the diskless nodes and on the rack more similar. At the moment FAI only takes care of modifying the important configuration files in /etc.
  • Use one repository for the configuration of those packages that use debconf. debconf can access configuration databases shared via NFS or by querying an LDAP server.
  • Create a script to automatically install/upgrade packages on the clients
  • Set up the backup server to automatically back up configuration files on the cluster.
  • Move the SGE scheduler to majorana, and configure polyphemus to be only a submission host.
  • Add a DNS server that caches queries from the local network, so as to reduce load (and possible problems) on the official DNS.
  • Modify the update-cluster scripts on majorana so that they create different dsh groups: athlon*, opteron*, diskless, rack, etc. This can be done by encoding the information in a special way in the notes of each node.
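
As a rough illustration of the dsh groups item above: dsh reads a group from a file with one hostname per line (for example under /etc/dsh/group/), so the groups could be generated from a node list. The sketch below assumes a hypothetical input format of "hostname group1 group2 ..." per line, as it might be extracted from the node notes; the function name make_dsh_groups is likewise made up and is not part of the existing update-cluster scripts.

```shell
#!/bin/sh
# Hypothetical sketch: build dsh group files from lines of the form
#   hostname group1 group2 ...
# Usage: make_dsh_groups OUTPUT_DIR < node-list
# Each group becomes a file OUTPUT_DIR/<group> with one hostname per line.
make_dsh_groups() {
    out_dir="$1"
    mkdir -p "$out_dir" || return 1
    while read -r host node_groups; do
        # Skip blank lines and comment lines.
        case "$host" in ''|'#'*) continue ;; esac
        for group in $node_groups; do
            # Append, so a host can belong to several groups.
            echo "$host" >> "$out_dir/$group"
        done
    done
}
```

Because the function appends to the group files, it should be pointed at an empty directory on each run, and the result copied into place afterwards.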

Wishlist:

  • Install Ganglia to monitor the usage of the cluster via web.
  • Install LDAP instead of NIS (only if it is better or it works).
  • Install a new version of Sun Grid Engine (or something else).
  • Install Bugzilla to track problems on the cluster (and to have a knowledge base of how to solve them!).