Monthly Archive for January, 2010

Cluster Installation: First Nodes Up

Since Monday I am at the High Performance Computing Center Stuttgart (HLRS) and I have started the initial installation of our cluster.The people from the HLRS have offered to support us with the initial installation, which we gladly accepted because they know how to do clusters.

On Monday I installed the three infrastructure servers which are used to control the 180 nodes of the cluster. The cluster is running Scientific Linux and my first task was to get it on those three infrastructure servers.

Those servers have two 500GB disks and they were supposed to be running as software RAID. After the seventh failed attempt to configure the partitions as RAID1 with the Scientific Linux installer we used a Debian install DVD to partition the disks and after the successful configuration of the partitions as RAID1 we installed Scientific Linux on all three systems. Not knowing how to use anaconda to configure a RAID1 (like we wanted to) was a bit embarrassing, but with all the Fedora and CentOS installation I have done I have never configured a software RAID1 from the installer; either the system had only one disk, a hardware RAID controller or I configured the RAID manually after the installation. But at the end of the day all three system were installed and configured for their tasks.

Today (Tuesday) we used the installation to boot the first two nodes of the cluster. All the nodes are running disk-less and are booting over TFTP/NFS from a single read-only image.

Update To Fedora 12

Last week I have finally updated our mirror server to Fedora 12. It was still running Fedora 10 which has reached its end of life. The server was running Fedora 10 for a long time and it was always running with a CentOS kernel. The Fedora kernels were, at the beginning, not stable enough (crashing after three or four days) so that I quickly switched to a CentOS kernel. I know that I should have reported bugs, but in the case of the mirror server I am more concerned to keep it up and running than getting debug data from it. It also not easy for me to get physically to the machine so that I had a lot of good excuses to switch to a CentOS kernel.

Now the system is running using the Fedora 12 kernel and after a week it is still up without any problems.

Updating My RPM Fusion Builder

I am running one of the RPM Fusion builders in a VM using CentOS and after I saw that the newly created VMs on my notebook are using virtio for network and disk access I thought that I will try this also for my builder VM. It was pretty easy and straight forward.

First I had to update from CentOS 5.2 to CentOS 5.4 so that the virtio drivers are available. After that I was just following http://wiki.libvirt.org/page/Virtio.

For the network:

  • shut down the VM
  • edit the XML and add <model type='virtio'/> to the network section
  • start the VM
  • done

For the disk:

  • create a new ramdisk with the virtio drivers: mkinitrd --with virtio_pci --with virtio_blk -f /boot/initrd-$(uname -r).img $(uname -r)
  • or dracut -f --add-drivers "virtio_pci virtio_blk" /boot/initrd-$(uname -r).img $(uname -r) for Fedora 12
  • change /boot/grub/device.map from “(hd0) /dev/hda” to “(hd0) /dev/vda
  • using LVM requires no changes to the root= parameter in /etc/grub.conf
  • shut down the VM
  • edit the XML changing <target dev='hda' bus='ide'/> to <target dev='vda' bus='virtio'/>
  • start the VM
  • done

During the boot of the VM I can now see that it is loading the virtio disk drivers and detecting vda1 and vda2. Using lspci and lsmod I can also verify that the new virtio devices are available and also used. The VM seems to be faster but I have not actually benchmarked it.

RPM Fusion Mirrorlist Server

On the last day of the last year (2009-12-31) both RPM Fusion’s mirrorlist server were most of the time not reachable. The problem started at 00:53 (UTC) and it was at least going on until 16:00 (UTC). Both mirrorlist servers have been on the same network and the router for that network  broke down. If it would have been the link to our provider the router had a backup route to stay on-line, but this time it actually hit the single point of failure – and everything was off-line. See: error report of the provider (german).

I was never happy that both mirrorlist server were running in the same network and I especially wanted to get the mirrorlist server off my mirror server. Thanks to Patrick I have now access to another VM at a different provider where I am running a new mirrorlist server instance. It does not require much in terms of resources and bandwidth, but having root access makes everything so much easier.

RPM Fusion’s mirrorlist server are now two dedicated VMs at two different providers and that should protect the functionality from failures like the one on 2009-12-31.