Cluster Updated to CentOS 6.5 (IGB/PTP Problems)

Normally I would not mention that our Linux cluster was updated, but as the update to CentOS 6.5 produced some strange errors I thought I would write it down in case somebody else runs into the same problems.

Our cluster has a bit more than 200 nodes and all nodes run disk-less with a read-only filesystem mounted over NFS. Until now we were using Scientific Linux 5.5 and it was time to update to something newer: CentOS 6.5.

So all nodes were shut down and then booted with the new CentOS 6.5 image, and at first everything seemed fine. After a few minutes, however, about 30 nodes went offline. The hardware on all nodes is identical, and it seemed unlikely that 30 nodes would develop the same hardware error right after a software upgrade. I was not able to reach the affected systems over Ethernet, but they still answered ping requests over InfiniBand. I could not log in to them either, as the filesystem was mounted over Ethernet and not over InfiniBand. On the console I saw that the systems were still up and running but no longer reachable over Ethernet. The link was still active and the kernel detected when it went up or down, but the driver of the Ethernet card refused to answer any packets.

Without Ethernet it was hard to debug: the systems have no local drive, and as soon as the Ethernet driver stopped working it was no longer possible to log in.

Looking at the boot logs I saw that the systems start up with the wrong date, which is then corrected by NTP during boot. I also saw that the moment the time was corrected the systems stopped working. At least most of the time.

Looking at the parameters of the network driver (igb) for some debug options I saw that it has a dependency on the ptp module. I had no idea what PTP was, but the Internet told me that it is the Precision Time Protocol and that it is a feature which was enabled with RHEL 6.5 and therefore also with the CentOS 6.5 we are using. The network driver also stopped working as soon as I tried to write the correct time to the RTC using hwclock.
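
For reference, the module parameters and the dependency on ptp can be inspected with modinfo:

# list the igb module parameters and its dependencies
modinfo igb
# print only the dependency line
modinfo -F depends igb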

On some of the systems the time stored in the RTC was more than 3.5 years in the past. The reason for this is probably that most of the time the systems are not shut down cleanly but only powered off or power cycled using ipmitool, because the systems are disk-less and have a read-only filesystem. This also means that hwclock is never run on shutdown to sync the system time to the RTC.

Setting SYNC_HWCLOCK to yes in /etc/sysconfig/ntpdate syncs the current time to the RTC, and after the next reboots all my problems were gone.
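
The relevant setting is a single line (with this enabled, the ntpdate init script also writes the corrected time to the RTC, roughly the equivalent of running hwclock --systohc):

# /etc/sysconfig/ntpdate
SYNC_HWCLOCK=yes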

Syncing the RTC to a reasonable value solved my problem, but it still looks like a bug in the network driver that it stops working after the time is changed.

Checkpoint and almost Restart in Open MPI

Now that checkpoint/restart with CRIU is possible since Fedora 19, I started adding CRIU support to Open MPI. With my commit 30772 it is now possible to checkpoint a process running under Open MPI. The restart functionality is not yet implemented but should be available soon. I have a test case (orte-test) which prints its PID and sleeps one second in a loop, and which I start under orterun like this:

/path/to/orterun --mca ft_cr_enabled 1 --mca opal_cr_use_thread 1 --mca oob tcp --mca crs_criu_verbose 30 --np 1 orte-test

The options have the following meaning:

  • --mca ft_cr_enabled 1
    • ft stands for fault tolerance
    • cr stands for checkpoint/restart
    • this option enables the checkpoint/restart functionality
  • --mca opal_cr_use_thread 1: use an additional thread to control checkpoint/restart operations
  • --mca oob tcp: use TCP instead of unix domain sockets (the socket code needs some additional changes for C/R to work)
  • --mca crs_criu_verbose 30: print all CRIU debug messages
  • --np 1: spawn one test case

The output of the test case looks like this:


[dcbz:12563] crs:criu: open()
[dcbz:12563] crs:criu: open: priority = 10
[dcbz:12563] crs:criu: open: verbosity = 30
[dcbz:12563] crs:criu: open: log_file = criu.log
[dcbz:12563] crs:criu: open: log_level = 0
[dcbz:12563] crs:criu: open: tcp_established = 1
[dcbz:12563] crs:criu: open: shell_job = 1
[dcbz:12563] crs:criu: open: ext_unix_sk = 1
[dcbz:12563] crs:criu: open: leave_running = 1
[dcbz:12563] crs:criu: component_query()
[dcbz:12563] crs:criu: module_init()
[dcbz:12563] crs:criu: opal_crs_criu_prelaunch
[dcbz:12565] crs:criu: open()
[dcbz:12565] crs:criu: open: priority = 10
[dcbz:12565] crs:criu: open: verbosity = 30
[dcbz:12565] crs:criu: open: log_file = criu.log
[dcbz:12565] crs:criu: open: log_level = 0
[dcbz:12565] crs:criu: open: tcp_established = 1
[dcbz:12565] crs:criu: open: shell_job = 1
[dcbz:12565] crs:criu: open: ext_unix_sk = 1
[dcbz:12565] crs:criu: open: leave_running = 1
[dcbz:12565] crs:criu: component_query()
[dcbz:12565] crs:criu: module_init()
[dcbz:12565] crs:criu: opal_crs_criu_reg_thread
Process 12565
Process 12565
Process 12565

To start the checkpoint operation the Open MPI tool orte-checkpoint is used:

/path/to/orte-checkpoint -V 10 `pidof orterun`

which outputs the following:


[dcbz:12570] orte_checkpoint: Checkpointing...
[dcbz:12570] PID 12563
[dcbz:12570] Connected to Mpirun [[56676,0],0]
[dcbz:12570] orte_checkpoint: notify_hnp: Contact Head Node Process PID 12563
[dcbz:12570] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Requested - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Pending - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Running - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Continuing/Recovered - ompi_global_snapshot_12563.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt

orte-checkpoint tries to connect to the previously started orterun process and requests that a checkpoint be taken. orterun outputs the following after receiving the checkpoint request:


[dcbz:12565] crs:criu: checkpoint(12565, ---)
[dcbz:12565] crs:criu: criu_init_opts() returned 0
[dcbz:12565] crs:criu: opening snapshot directory /home/adrian/ompi_global_snapshot_12563.ckpt/0/opal_snapshot_0.ckpt
[dcbz:12563] 12563: Checkpoint established for process [56676,0].
[dcbz:12563] 12563: Successfully restarted process [56676,0].
Process 12565

At this point the checkpoint has been written to disk and the process continues (printing its PID).

For complete checkpoint/restart functionality I still have to implement the restart part in Open MPI, and I also have to take care of the unix domain sockets (shutting them down for the checkpoint).

This requires the latest criu package (criu-1.1-4), which includes the headers needed to build Open MPI against CRIU as well as the CRIU service.
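
On the Fedora side the required pieces boil down to something like this (the CRS CRIU component talks to the CRIU service over its RPC socket):

# install CRIU; criu-1.1-4 or later ships the headers needed to build Open MPI against it
yum install criu
# checkpoint requests go through the CRIU service; start it if it is not already running
criu service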

Using the ownCloud address book in mutt

Now that I am syncing my ownCloud address book to my mobile devices and my laptop, I was missing this address book in mutt. Using pyCardDAV and the instructions at http://got-tty.org/archives/mutt-kontakte-aus-owncloud-nutzen.html it was easy to integrate the ownCloud address book into mutt. As pyCardDAV is already packaged for Fedora, it was not much more work than running yum install python-carddav and editing ~/.config/pycard/pycard.conf to get the address book synced.
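
Once the address book is synced, the local copy can be queried directly with pc_query, the same tool mutt will call below (the name is just an example):

# install pyCardDAV from the Fedora repositories
yum install python-carddav
# query the locally synced address book in mutt-compatible output format
pc_query -m 'Smith'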

I was already using an LDAP address book in mutt, so I had to extend the existing configuration to:
set query_command = "~/bin/mutt_ldap.pl '%s'; /usr/bin/pc_query -m '%s'"

Now, whenever I press CTRL+T during address input, first the LDAP server is queried and then my local copy of the ownCloud address book.

New external RAID

Today a new external RAID (connected via Fibre Channel) was attached to our mirror server. To create the filesystem (XFS) I used this command:

mkfs -t xfs -d su=64k -d sw=13 /dev/sdf1

According to https://raid.wiki.kernel.org/index.php/RAID_setup#XFS these are the correct options for 13 data disks (15 disks in the RAID6, i.e. 13 data disks plus 2 for parity, plus 1 hot spare) and a RAID chunk size of 64k.
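
In other words, su is simply the RAID chunk size and sw the number of data disks:

# su = RAID chunk size (64k), sw = number of data disks (15 disks - 2 parity = 13)
mkfs -t xfs -d su=64k -d sw=13 /dev/sdf1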

Dynamic DNS

For the last ten years I have wanted to set up my own dynamic DNS service but was never motivated enough. Recently enough motivation was provided, and with the scripts from http://www.fischglas.de/software/dyn/ it was really easy to set up a dynamic DNS service using bind. The following changes were necessary in the named.conf file:

zone "dyn.domain" in {
        type master;
        file "db.dyn.domain";
        allow-update {
                key host.domain.;
        };
};

Whenever the IP address of my host changes I load a URL with my hostname and password encoded in it. The script behind the URL checks whether the hostname and password are correct and updates the zone using nsupdate with a TTL of 120 seconds.
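
A minimal sketch of such an nsupdate run could look like this (key file, hostname and IP address are placeholders; the actual script from fischglas.de may do it differently):

nsupdate -k /etc/dyn/Khost.domain.+157+12345.private <<EOF
server ns.domain
zone dyn.domain
update delete myhost.dyn.domain. A
update add myhost.dyn.domain. 120 A 192.0.2.42
send
EOF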

The script uses a simple configuration file (/etc/dyn/dyn.cfg) with the following content:

dns.key.name:host.domain.
dns.key:yyeofEWfgvdfgdfgerX==
authfile:/etc/dyn/secrets
dns.host:host.domain
debug:0

bcache Follow-Up

After using bcache for about three weeks it still works without any problems. I am serving around 700GB per day from the bcache device, and looking at the munin graphs cache hits are averaging at about 12000 while cache misses are averaging at around 700. So, judging only by the statistics, it still seems to work very effectively for our setup.

RPM Fusion’s MirrorManager moved

After running RPM Fusion’s MirrorManager instance on Fedora for many years I moved it to a CentOS 6.4 VM. This was necessary because the MirrorManager installation was really ancient and still running from a modified git checkout I did many years ago. I expected that the biggest obstacle in this upgrade and move would be the database upgrade, as MirrorManager’s schema has changed over the years. But I was fortunate: MirrorManager includes all the necessary scripts to update the database (thanks Matt), even from the ancient version I was running.

RPM Fusion’s MirrorManager instance uses PostgreSQL to store its data, so I dumped the database on the old system and imported it into the database on the new system. MirrorManager stores information about the mirrored files as pickled Python data in the database, and those columns could not be imported due to problems with the character encoding. As this data is provided by the master mirror anyway, I just emptied those columns, and after the first run MirrorManager recreated that information.
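
The dump and import were done with the usual PostgreSQL tools, roughly like this (the database name is a placeholder):

# on the old system
pg_dump mirrormanager > mirrormanager.sql
# on the new system
psql mirrormanager < mirrormanager.sql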

Moving the MirrorManager instance to a VM means that, if you are running an RPM Fusion mirror, the crawler which checks whether your mirror is up to date will now connect to your mirror from a different IP address (129.143.116.115). The data collected by MirrorManager’s crawler is then used to create http://mirrors.rpmfusion.org/mm/publiclist/ and the mirrorlist used by yum (http://mirrors.rpmfusion.org/mirrorlist?repo=free-fedora-updates-released-19&arch=x86_64). There are currently four systems serving as mirrors.rpmfusion.org.

Looking at yesterday’s statistics (http://mirrors.rpmfusion.org/statistics/?date=2013-08-20) it seems there were about 400000 accesses to our mirrorlist servers that day.

bcache on Fedora 19

After upgrading our mirror server from Fedora 17 to Fedora 19 two weeks ago I was curious to try out bcache. Knowing how important filesystem caching is for a file server like ours, we have always tried to have as much memory as possible. The current system has 128GB of memory and at least 90% of it is used as filesystem cache. So bcache sounded like a very good idea to provide another layer of caching for all the I/O we are doing. By chance I had an external RAID available with 12 x 1TB hard disk drives, which I configured as a RAID6, and 4 x 128GB SSDs, which I configured as a RAID10.

After modprobing the bcache kernel module and installing the necessary bcache-tools I created the bcache backing device and caching device as described here. I then created the filesystem like I did for our previous RAIDs: for a RAID6 with 12 hard disk drives and a RAID chunk size of 512KB I used mkfs.ext4 -b 4096 -E stride=128,stripe-width=1280 /dev/bcache0, although I am unsure how useful these options are when using bcache.
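
The setup steps were roughly the following (device names and the cache set UUID are placeholders):

# create the backing device on the RAID6 array and the cache device on the SSD RAID10
make-bcache -B /dev/sdX
make-bcache -C /dev/sdY
# register both devices with the kernel (udev may already have done this)
echo /dev/sdX > /sys/fs/bcache/register
echo /dev/sdY > /sys/fs/bcache/register
# attach the cache set to the backing device using the cache set UUID
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach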

So far it has worked pretty flawlessly. To know what to expect from /dev/bcache0 I benchmarked it using bonnie++ and got 670MB/s for writing and 550MB/s for reading. Again, I am unsure how to interpret these values, as bcache tries to detect sequential I/O and bypasses the cache device for sequential I/O larger than 4MB.
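
The benchmark itself was nothing special, something along these lines (mount point and size are placeholders; -s should be well above the amount of RAM to avoid measuring only the page cache):

# run bonnie++ on the mounted bcache filesystem as an unprivileged user
bonnie++ -d /mnt/bcache -s 256g -u nobody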

Anyway, I started copying the fedora and fedora-archive mirrors to the bcache device, and we are now serving those two mirrors (only about 4.1TB) from it.

I have created a munin plugin to monitor the usage of the bcache device, and there are many cache hits (right now more than 25K) and some cache misses (about 1K). So it seems to do what it is supposed to do, and the number of I/Os directly hitting the hard disk drives is much lower than it would otherwise be.

I also increased the cutoff above which sequential I/O bypasses the cache from 4MB to 64MB.
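
This is a runtime tunable of the bcache backing device, set via sysfs roughly like this (the exact size syntax accepted may vary between kernel versions):

# raise the sequential I/O cutoff from the default 4MB to 64MB
echo 64M > /sys/block/bcache0/bcache/sequential_cutoff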

The user-space tools (bcache-tools) are not yet available in Fedora (as far as I can tell) but I found http://terjeros.fedorapeople.org/bcache-tools/ which I updated to the latest git: http://lisas.de/~adrian/bcache-tools/

Update: as requested, the munin plugin: bcache

Remove Old Kernels

Mainly using Fedora, I am used to old kernel images being automatically removed after a certain number of kernel images have been installed with yum. The default is to keep three kernel images installed, and so far this has always worked.
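
On Fedora this is controlled by the installonly_limit option in yum’s configuration:

# /etc/yum.conf (excerpt)
installonly_limit=3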

I am also maintaining a large number of Ubuntu VMs, and every now and then we have the problem that the filesystem fills up because too many kernel images are installed. I searched for some time, but there seems to be no automatic kernel image removal in apt-get. There is one often recommended command which looks something like this:

dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;
s/^[^ ]*[^ ]* \([^ ]*\).*/\1/;/[0-9]/!d' | xargs sudo apt-get -y purge

This works, but only if you are already running the latest kernel, and therefore I have adapted it a little for our needs. Instead of removing all kernel images except the running one, I remove all kernel images except the running and the newest one. No big difference, but important for our setup, where we do not reboot all VMs with every kernel image update.

Running the script gives me the following output:
# remove-old-kernels

linux-image-3.2.0-23-generic linux-image-3.2.0-36-generic linux-image-3.2.0-37-generic linux-image-3.2.0-38-generic linux-image-3.2.0-39-generic linux-image-3.2.0-40-generic linux-image-3.2.0-43-generic linux-image-3.2.0-45-generic linux-image-3.2.0-48-generic linux-image-3.2.0-49-generic

The output of the script can then be easily used to remove the unnecessary kernel images with apt-get purge.
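
The actual script is linked below; purely to illustrate the idea, a simplified sketch (not the author’s script) could look like this:

#!/bin/bash
# sketch: list installed kernel image packages except the running and the newest one
running="linux-image-$(uname -r)"
installed=$(dpkg -l 'linux-image-[0-9]*' | awk '/^ii/ {print $2}' | sort -V)
newest=$(echo "$installed" | tail -n 1)
# everything that is neither the running nor the newest kernel image can be purged
echo "$installed" | grep -v -e "^${running}$" -e "^${newest}$" | tr '\n' ' '
echo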

The script can be downloaded here: remove-old-kernels

And before anybody complains: I know it is not really the most elegant solution and I should not have written it in bash.

A New Home

After having received my Raspberry Pi in November, I am finally using it. I have connected it to my television using raspbmc. Using XBMC Remote I can control it without the need for a mouse, keyboard or lirc-based remote control, and so far it works pretty well. Following are a few pictures of the new case I bought a few days ago:

(Three pictures of the Raspberry Pi in its new case.)