Cluster Updated to CentOS 6.5 (IGB/PTP Problems)
Normally I would not mention that our Linux cluster was updated. But as the update to CentOS 6.5 produced some strange errors, I thought I would write it down in case somebody else runs into the same errors.
Our cluster has a bit more than 200 nodes, and all nodes run diskless with a read-only filesystem mounted over NFS. Until now we were using Scientific Linux 5.5 and it was time to update to something newer: CentOS 6.5.
So all nodes were shut down and then booted with the new CentOS 6.5 image, and everything seemed fine. After a few minutes, however, about 30 nodes went offline. The hardware is the same on all nodes, and it seemed unlikely that 30 nodes would develop the same hardware fault right after a software upgrade. I was not able to reach the affected systems over Ethernet, but they were still answering ping requests over InfiniBand. I could not log in to them because the filesystem was mounted over Ethernet and not InfiniBand. On the console I saw that each system was still up and running but not reachable over Ethernet. The link was still active and the kernel still detected the link going up or down, but the Ethernet driver no longer responded to any packets.
Without Ethernet it was hard to debug: the systems have no local drive, and as soon as the Ethernet driver stopped working it was no longer possible to log in.
Looking at the boot logs I saw that the systems start up with the wrong date, which is then corrected by NTP during boot. I also saw that the systems stopped working at the moment the time was corrected. At least most of the time.
Looking at the parameters of the network driver (igb) to find some debug options, I saw that it depends on the ptp module. I had no idea what PTP was, but the Internet told me that it is the Precision Time Protocol and that it is a feature which was enabled with RHEL 6.5 and therefore also with the CentOS 6.5 we are using. The network driver also stopped working once I tried to write the correct time to the RTC using hwclock.
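The dependency between igb and ptp can also be seen on a running node in /proc/modules, where the fourth field of each line lists the modules that use that module. The following is only a minimal sketch of such a check and not something that was run on the cluster. The function name modules_using is made up for illustration, and it assumes that ptp is built as a loadable module (a built-in ptp would not show up in /proc/modules at all).

```python
from pathlib import Path


def modules_using(target, proc_modules_text):
    """Return the loaded modules that depend on `target`.

    Expects text in /proc/modules format:
    name size refcount used_by state address
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == target:
            used_by = fields[3]
            if used_by == "-":
                return []  # loaded, but no other module uses it
            # The list is comma separated with a trailing comma; entries
            # such as "[permanent]" are markers, not module names.
            return [n for n in used_by.split(",") if n and not n.startswith("[")]
    return []  # target is not loaded as a module


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        print(modules_using("ptp", path.read_text()))
    else:
        print("/proc/modules not found; this check only works on Linux")
```

If igb appears in the returned list, the running igb driver is holding a reference to the ptp module.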
On some of the systems the time stored in the RTC was more than 3.5 years in the past. The reason for this might be that most of the time the systems are not shut down cleanly but only powered off or power cycled using ipmitool, because the systems are diskless and have a read-only filesystem. But this also means that hwclock is never run on shutdown to sync the time to the RTC.
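To find nodes with such a stale RTC before they cause trouble, one could compare the RTC with the system clock once NTP has set it. The following is a minimal sketch and not what was used on the cluster. It assumes that the RTC is exposed as /sys/class/rtc/rtc0 and that the RTC is kept in UTC. The function names are made up for illustration.

```python
import time
from pathlib import Path

# Assumption: the first RTC device; other systems may use a different name.
RTC_SINCE_EPOCH = Path("/sys/class/rtc/rtc0/since_epoch")


def offset_from_text(since_epoch_text, now):
    """Return system time minus RTC time in seconds.

    A positive value means the RTC is behind the system clock.
    """
    return now - int(since_epoch_text.strip())


def describe_offset(offset_seconds):
    """Format the offset in days for a quick look across many nodes."""
    days = offset_seconds / 86400
    return f"RTC is {days:.1f} days behind the system clock"


if __name__ == "__main__":
    try:
        text = RTC_SINCE_EPOCH.read_text()
    except OSError as err:
        print(f"cannot read {RTC_SINCE_EPOCH}: {err}")
    else:
        print(describe_offset(offset_from_text(text, time.time())))
```

An RTC that is 3.5 years behind would show up here as an offset of roughly 1280 days. A negative value would mean the RTC is ahead of the system clock.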
Setting SYNC_HWCLOCK to yes in /etc/sysconfig/ntpdate syncs the current time to the RTC, and after the next reboot all my problems were gone.
Syncing the RTC to a reasonable value solved my problem, but it still looks like a bug in the network driver that it stops working after the time is changed.