1. Cluster Updated to CentOS 6.5 (IGB/PTP Problems)

    Normally I would not mention that our Linux cluster was updated. But as the update to CentOS 6.5 produced some strange errors, I thought I would write it down in case somebody else runs into the same problems.

    Our cluster has a bit more than 200 nodes, and all of them run disk-less with a read-only root filesystem mounted over NFS. Until now we were using Scientific Linux 5.5, and it was time to update to something newer: CentOS 6.5.

    So all nodes were shut down and then booted with the new CentOS 6.5 image, and everything seemed fine. After a few minutes, however, about 30 nodes went offline. The hardware is the same on all nodes, and it seemed strange that 30 of them should develop the same hardware fault right after a software upgrade. I was not able to contact the affected systems over Ethernet, but they still answered ping requests over InfiniBand. I could not log in to them either, as the root filesystem is mounted over Ethernet, not InfiniBand. On the console I saw that the systems were still up and running but unreachable over Ethernet. The link was still active, and the kernel detected when the link went down or came back up, but the Ethernet driver no longer answered any packets.

    Without Ethernet this was hard to debug, as the systems have no local drive, and as soon as the Ethernet driver stopped working, logging in was no longer possible.

    Looking at the boot logs I saw that the systems start up with the wrong date, which NTP then corrects during boot. I also saw that the moment the time was corrected, the systems stopped working. At least most of the time.

    Looking at the parameters of the network driver (igb) for debug options, I saw that it has a dependency on the ptp module. I had no idea what PTP was, but the Internet told me that it is the Precision Time Protocol and that support for it was enabled in RHEL 6.5, and therefore also in the CentOS 6.5 we use. The network driver also stopped working as soon as I tried to write the correct time to the RTC using hwclock.
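
    The dependency itself is easy to check on a running node. The exact output depends on the kernel build, but with the 6.5 kernel ptp shows up in the dependency list:

    $ modinfo igb | grep depends    # lists ptp among the dependencies of igb
    $ lsmod | grep ptp              # shows whether the ptp module is loaded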

    On some of the systems the time stored in the RTC was more than 3.5 years in the past. The reason for this is probably that most of the time the systems are not shut down cleanly but simply powered off or power cycled using ipmitool, because they are disk-less and have a read-only filesystem. This also means that hwclock is never run on shutdown to sync the system time to the RTC.
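
    Comparing the system time with the RTC shows the drift directly; on an affected node it looked roughly like this (hypothetical output for illustration):

    $ date
    Thu Feb 20 10:15:02 CET 2014
    $ hwclock --show
    Mon Aug  2 14:51:48 2010  -0.123456 seconds

    Note that hwclock --show only reads the RTC, so it should be safe even on an affected node; only writing to the RTC triggered the driver problem.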

    Setting SYNC_HWCLOCK to yes in /etc/sysconfig/ntpdate syncs the current time to the RTC, and after the next reboot all my problems were gone.
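
    On CentOS 6 this is a one-line change in the sysconfig file; the ntpdate init script then writes the system time back to the RTC (via hwclock --systohc) after it has set the time:

    # /etc/sysconfig/ntpdate
    SYNC_HWCLOCK=yes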

    Syncing the RTC to a reasonable value solved my problem, but it still looks like a bug in the network driver that it stops working after a large time change.

    Tagged as : cluster
  2. If you have too much memory

    We have integrated new nodes into our cluster. All of the new nodes have a local SSD for fast temporary scratch data. To find the best file system options and IO scheduler, I wrote a script that benchmarks a lot of combinations (80, to be precise). As the nodes have 64 GB of RAM, the first run of the script took 40 hours, because each benchmark wrote twice the size of the RAM to avoid any caching effects. To reduce the amount of available memory I wrote a program called memhog, which malloc()s the memory and then also mlock()s it. The usage is really simple:

    $ ./memhog
    Usage: memhog <size in GB>
    

    I am now locking 56 GB with memhog and have reduced the benchmark file size to 30 GB.
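
    For reference, here is a minimal sketch of how such a memhog can be implemented; it illustrates the malloc()/mlock() approach and is not necessarily identical to the memhog.c linked below:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "Usage: memhog <size in GB>\n");
            return 1;
        }

        size_t size = (size_t)atoll(argv[1]) << 30; /* GB -> bytes */

        void *mem = malloc(size);
        if (mem == NULL) {
            perror("malloc");
            return 1;
        }

        /* mlock() faults the pages in and pins them in RAM so they can
         * neither be swapped out nor reclaimed; this usually requires
         * root or a sufficiently large RLIMIT_MEMLOCK. */
        if (mlock(mem, size) != 0) {
            perror("mlock");
            return 1;
        }

        printf("Locked %s GB. Press Ctrl-C to release the memory.\n", argv[1]);
        pause(); /* keep the memory locked until the process is killed */
        return 0;
    }

    Compile it with gcc -o memhog memhog.c. Locking tens of gigabytes requires root or an appropriately raised memlock limit in /etc/security/limits.conf.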

    So, if you have too much memory and want to waste it... Just use memhog.c.

    Tagged as : cluster
