80 compute nodes from our cluster are up and running. We are now waiting for more switches and the filesystem servers to finally get the complete cluster (with all compute nodes) operational. To get the remaining nodes operational all I have to do is add their MAC addresses to a file, and with the magic of some scripts everything else is configured automatically. Unfortunately it all depends on the missing ethernet switches, which should arrive any day now.
I was not happy with the partitioning of one of the cluster infrastructure servers. It had a software RAID for /boot, one for swap and the rest was a big software RAID for /. I should have used LVM for / for easy resizing, but I forgot and so I had to do it the hard way. I wanted to resize /dev/md2 which was used for / and then use LVM for the rest.
First I had to resize the filesystem. Online shrinking is not supported for resize2fs (at least I was not able to do it) and so I had to boot the CentOS 5.4 rescue system.
After dropping to the shell of the rescue system (without mounting the filesystems) I copied a mdadm.conf from a similar system to /etc so that I would be able to start the RAIDs:
- mdadm -A /dev/md0
- mdadm -A /dev/md1
- mdadm -A /dev/md2
Starting only /dev/md2 would have been enough, but I wanted to make sure that everything was working as it is supposed to. Then, before running resize2fs, I had to do a filesystem check:
- e2fsck -f /dev/md2 -C 0
Next step was to actually shrink the filesystem and make it smaller than the desired final size:
- resize2fs /dev/md2 30G
Then I shrank the RAID to about 40GB (mdadm expects the -z value in KiB):
- mdadm --grow /dev/md2 -z 40000000
and after that I had to resize the filesystem again to use the 40GB:
- resize2fs /dev/md2
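The units in the steps above are easy to mix up, so here is a small sanity check of the numbers. It is only a sketch: it assumes that resize2fs treats the G suffix as GiB (powers of two) and that the -z value of mdadm is in KiB, as their man pages describe. The helper names are made up for this illustration.

```shell
#!/bin/sh
# Convert GiB to KiB (1 GiB = 1024 * 1024 KiB).
gib_to_kib() {
    echo $(( $1 * 1024 * 1024 ))
}

# Convert KiB to whole decimal gigabytes (1 GB = 10^9 bytes), rounded down.
kib_to_gb() {
    echo $(( $1 * 1024 / 1000000000 ))
}

# Succeed if a filesystem of $1 GiB fits into an array of $2 KiB.
fs_fits_in_array() {
    [ "$(gib_to_kib "$1")" -le "$2" ]
}

fs_kib=$(gib_to_kib 30)   # size given to resize2fs
md_kib=40000000           # value given to mdadm --grow -z
echo "filesystem: ${fs_kib} KiB, array: ${md_kib} KiB (~$(kib_to_gb "$md_kib") GB)"
if fs_fits_in_array 30 "$md_kib"; then
    echo "filesystem fits, the array can be shrunk to this size"
fi
```

The point of the check is the order of operations: the filesystem has to be smaller than the array before the array is shrunk, and only afterwards is it grown again to fill the array.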
At this point I mounted the filesystem to see if it actually worked and it looked good (and smaller). Now came the hard part: to use the remaining space I had to re-partition the disk. I started fdisk, deleted the corresponding partitions and created smaller partitions (42GB) at the same start point. This was the part where I was really worried about losing all my data, which was fortunately backed up (of course). After I created the smaller partitions I tried to start /dev/md2 and it failed, saying that it could not find any RAID partitions.
I then tried to create the RAID again, hoping all data would be still available. I first created the RAID with only one device:
- mdadm --create /dev/md2 -n 2 -l 1 /dev/sdb3 missing
This seemed to work and after mounting the new RAID I saw that all my files were still there. So the next step was to add the second device to the RAID with:
- mdadm --manage -a /dev/md2 /dev/sda3
At this point the RAID started to re-sync and 20 minutes later I was able to grow the RAID to the new partition size:
- mdadm --grow /dev/md2 -z max
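While waiting for a re-sync, the progress can be read from /proc/mdstat. The following is a minimal sketch of a helper that extracts the percentage for one array. The function name is hypothetical, and it assumes the usual /proc/mdstat layout with a "resync = N%" or "recovery = N%" line below the array's header line.

```shell
#!/bin/sh
# Print the resync/recovery percentage of one md array.
# $1: array name without /dev/ (e.g. md2); mdstat-formatted text on stdin.
# Prints nothing if no resync or recovery is reported for that array.
md_sync_progress() {
    awk -v dev="$1" '
        # Header line of the array we are interested in.
        $1 == dev && $2 == ":" { in_dev = 1; next }
        # Header of another array or a blank line ends the section.
        /^md[0-9]+ :/ || /^$/  { in_dev = 0 }
        in_dev && match($0, /(resync|recovery) *= *[0-9.]+%/) {
            s = substr($0, RSTART, RLENGTH)
            sub(/.*= */, "", s)   # keep only the percentage
            print s
            exit
        }
    '
}
```

It would be called as md_sync_progress md2 < /proc/mdstat, for example in a loop with sleep until it prints nothing any more.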
Again I had to wait and before doing the final filesystem resize another filesystem check was necessary:
- e2fsck -f /dev/md2 -C 0
- resize2fs /dev/md2
And after only two hours I finally had what I wanted. I rebooted the system and it came up with the smaller / partition. I used the remaining space to create a new RAID (/dev/md3) which will probably be used with LVM if I ever need more space on this server in the future.
Without a backup I would not have done all these steps, because I was not always sure they would actually work.
Yesterday (2010-02-06) Benjamin and I were again in Lech/Zürs snowboarding, just like three weeks ago. Last time (2010-01-17) Pattrick and Torsten were also able to join. This time it was only Benjamin and me.
The weather was similar to our last visit. Mostly cloudy with a few peeks of sunshine. This time, however, we had lots of new deep powder and it was freeriding time. Extremely exhausting but great fun.
Since Monday I have been at the High Performance Computing Center Stuttgart (HLRS) and I have started the initial installation of our cluster. The people from the HLRS have offered to support us with the initial installation, which we gladly accepted because they know how to do clusters.
On Monday I installed the three infrastructure servers which are used to control the 180 nodes of the cluster. The cluster is running Scientific Linux and my first task was to get it on those three infrastructure servers.
Those servers have two 500GB disks and they were supposed to be running as software RAID. After the seventh failed attempt to configure the partitions as RAID1 with the Scientific Linux installer we used a Debian install DVD to partition the disks, and after the successful configuration of the partitions as RAID1 we installed Scientific Linux on all three systems. Not knowing how to use anaconda to configure a RAID1 (like we wanted to) was a bit embarrassing, but with all the Fedora and CentOS installations I have done I have never configured a software RAID1 from the installer; either the system had only one disk or a hardware RAID controller, or I configured the RAID manually after the installation. But at the end of the day all three systems were installed and configured for their tasks.
Today (Tuesday) we used the installation to boot the first two nodes of the cluster. All the nodes are running disk-less and are booting over TFTP/NFS from a single read-only image.
Last week I finally updated our mirror server to Fedora 12. It was still running Fedora 10, which has reached its end of life. The server had been running Fedora 10 for a long time, always with a CentOS kernel. The Fedora kernels were, at the beginning, not stable enough (crashing after three or four days), so I quickly switched to a CentOS kernel. I know that I should have reported bugs, but in the case of the mirror server I am more concerned with keeping it up and running than with getting debug data from it. It is also not easy for me to get physically to the machine, so I had a lot of good excuses to switch to a CentOS kernel.
Now the system is running using the Fedora 12 kernel and after a week it is still up without any problems.
I am running one of the RPM Fusion builders in a VM using CentOS, and after I saw that the newly created VMs on my notebook are using virtio for network and disk access I thought I would try this for my builder VM as well. It was pretty easy and straightforward.
First I had to update from CentOS 5.2 to CentOS 5.4 so that the virtio drivers are available. After that I was just following http://wiki.libvirt.org/page/Virtio.
For the network:
- shut down the VM
- edit the XML and add <model type='virtio'/> to the network section
- start the VM
- done
For the disk:
- create a new ramdisk with the virtio drivers: mkinitrd --with virtio_pci --with virtio_blk -f /boot/initrd-$(uname -r).img $(uname -r)
- or dracut -f --add-drivers "virtio_pci virtio_blk" /boot/initrd-$(uname -r).img $(uname -r) for Fedora 12
- change /boot/grub/device.map from "(hd0) /dev/hda" to "(hd0) /dev/vda"
- using LVM requires no changes to the root= parameter in /etc/grub.conf
- shut down the VM
- edit the XML changing <target dev='hda' bus='ide'/> to <target dev='vda' bus='virtio'/>
- start the VM
- done
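The device.map change from the list above can also be scripted. This is just a sketch with a made-up function name; it works as a filter on stdin and stdout so that the result can be reviewed before it replaces the real file.

```shell
#!/bin/sh
# Rewrite IDE disk names (/dev/hda, /dev/hdb, ...) to virtio names
# (/dev/vda, /dev/vdb, ...) in device.map-style text read from stdin.
# The GRUB names in parentheses, like (hd0), are left untouched.
ide_to_virtio_map() {
    sed 's|/dev/hd\([a-z]\)|/dev/vd\1|'
}
```

For example: ide_to_virtio_map < /boot/grub/device.map > /tmp/device.map.new, then compare both files and copy the new one into place.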
During the boot of the VM I can now see that it is loading the virtio disk drivers and detecting vda1 and vda2. Using lspci and lsmod I can also verify that the new virtio devices are available and also used. The VM seems to be faster but I have not actually benchmarked it.
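The lsmod check can be turned into a small helper. This is a sketch under assumptions: the function name is hypothetical, virtio_net is assumed to be the name of the network driver module, and a driver that is built into the kernel would be reported as missing because lsmod only lists loadable modules.

```shell
#!/bin/sh
# Report for each expected virtio module whether it shows up in
# lsmod-formatted text read from stdin.
check_virtio_modules() {
    # First column of every line except the header is the module name.
    modules=$(awk 'NR > 1 { print $1 }')
    for m in virtio_pci virtio_blk virtio_net; do
        if printf '%s\n' "$modules" | grep -qx "$m"; then
            echo "$m loaded"
        else
            echo "$m missing"
        fi
    done
}
```

Inside the VM it would be used as lsmod | check_virtio_modules.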
On the last day of last year (2009-12-31) both of RPM Fusion’s mirrorlist servers were unreachable most of the time. The problem started at 00:53 (UTC) and lasted until at least 16:00 (UTC). Both mirrorlist servers were on the same network and the router for that network broke down. If it had been the link to our provider, the router would have had a backup route to stay on-line, but this time it actually hit the single point of failure and everything was off-line. See: error report of the provider (German).
I was never happy that both mirrorlist servers were running in the same network and I especially wanted to get the mirrorlist server off my mirror server. Thanks to Patrick I now have access to another VM at a different provider where I am running a new mirrorlist server instance. It does not require much in terms of resources and bandwidth, but having root access makes everything so much easier.
RPM Fusion’s mirrorlist servers are now two dedicated VMs at two different providers, which should protect the service from failures like the one on 2009-12-31.
In the night from Friday to Saturday a disk (slot 7) from our external RAID, containing most of the mirror server data, failed and was marked as BAD. Not really a big problem, yet. The hot spare drive was activated and the rebuild started. About 24 hours later the rebuild finished. On Sunday (around 16:00) another drive (slot 5) failed and we immediately started to sync all the data to another box in case another drive decides to go off-line, which would mean a complete data loss. All the data on that RAID are (only) mirrored, but to re-sync all of the 9TB we currently have would probably take a few weeks. Unfortunately the sync to another box will also take a few days to finish, so it is still possible that we might lose a lot. We are waiting for the replacement disks, which have been promised to be here by Monday (today), but as the rebuild needs over 24 hours there is still the chance of a data loss.
Update (2009-12-14 23:20): The replacement disks have arrived and after more than twelve hours 25% of the array has been rebuilt.
Update (2009-12-15 11:00): After more than 24 hours 58% of the array has been rebuilt. It seems to rebuild faster during the night.
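For a rough idea of how long the rest of the rebuild might take, the two readings above can be extrapolated linearly. This is only a back-of-the-envelope sketch: it takes the two updates as roughly 12 hours apart and assumes the rate stays constant, which it apparently does not (the rebuild seems to be faster at night). The function name is made up.

```shell
#!/bin/sh
# Estimate the remaining hours of a rebuild, assuming it continues at the
# average rate observed between two progress readings.
# $1: earlier percentage, $2: later percentage, $3: hours between readings
rebuild_eta_hours() {
    awk -v p1="$1" -v p2="$2" -v h="$3" 'BEGIN {
        rate = (p2 - p1) / h            # percent per hour
        if (rate <= 0) { print "unknown"; exit }
        printf "%.1f\n", (100 - p2) / rate
    }'
}

# 25% at the first update, 58% about 12 hours later.
rebuild_eta_hours 25 58 12
```

With these numbers the rate is 2.75% per hour, so the remaining 42% would need a bit more than 15 hours.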
Not really back in school, but it has now been more than a week since I started my new job at my old university in Esslingen at the beginning of December 2009. After only 11 months at my previous workplace (Matrix Vision) I am now working for the faculty of Information Technology.
I will be responsible for the setup and installation of the new cluster of the university. The cluster will be part of the bwGRiD and it will have around 1500 cores and is currently being installed. It is partly water-cooled and a few days ago the racks were delivered and installed. The cluster is from NEC and we are expecting the servers to be delivered in the next few days. The cluster will be running Scientific Linux.
I am now in the same building as my mirror server. This might be a good thing, because now I am much closer to the hardware and can act faster if something unexpected happens… It might also be a bad thing, because now I am much closer and can experiment with things I would not do if I was not in the same building.
I finally have mutt configured in such a way that it first tries to display the plain text part of a mail and only the HTML part if there is no plain text available. For years I had mutt configured to display HTML mails using lynx but it was displaying the HTML part even if there was plain text available.
To display HTML mails I was using auto_view text/html in my .muttrc like it is described everywhere with the following corresponding entry in my .mailcap:
text/html; lynx -dump %s; copiousoutput; nametemplate=%s.html
The problem with this setup is that it displays the HTML part of a mail even if there is a plain text part available. So I had auto_view text/html disabled most of the time and edited the configuration file manually to enable it again for the rare cases in which I received an HTML-only mail.
But as this is mutt and almost everything can be configured I finally searched and found a solution:
auto_view text/html
alternative_order text/plain text/html
If the message has a plain text part and an HTML part, mutt shows me the plain text part, but if there is only an HTML part available I get the HTML converted to plain text. Exactly what I always wanted.