I had just finished tuning my ownCloud sync setup when, after years of smooth, unharmed operation despite numerous cement-terminated falls, the better parts of my N9’s Gorilla Glass finally decided to break apart as the phone left the bike mount mid-ride. It seems the mount broke because of modifications I had made to it, as it kept pressing buttons unintentionally.

Hopefully I will be able to get my hands on another (retired) N9 next week, so I can use that phone’s display to replace the broken one. That is nice, as I wouldn’t know which new phone I would buy right now; for some reason the Ubuntu Edge I ordered never shipped.

This way I can continue using SyncEvolution with my little script to sync with ownCloud. The script uses some MeeGo D-Bus magic to pop up a short message informing me when the sync is complete. As I failed at ash arithmetic the script feels a little clumsy, but it seems to do what it should.

Normally I would not mention that our Linux cluster was updated. But as the update to CentOS 6.5 produced some strange errors, I thought I would write it down in case somebody else runs into the same errors.

Our cluster has a bit more than 200 nodes, and all nodes run disk-less with a read-only filesystem mounted over NFS. Until now we were using Scientific Linux 5.5 and it was time to update to something newer: CentOS 6.5.

So all nodes were shut down and then started with the new CentOS 6.5 image, and everything seemed fine. After a few minutes, however, about 30 nodes went offline. The hardware on all nodes is the same, and it seemed strange that 30 nodes should develop the same hardware error right after a software upgrade. I was not able to contact the affected systems over Ethernet, but they were still answering ping requests over InfiniBand. I could not log in to the affected systems, as the filesystem was mounted over Ethernet and not InfiniBand. On the console I saw that the systems were still up and running but not reachable over Ethernet. The link was still active and the kernel detected when the link went up or down, but the driver of the Ethernet card no longer answered any packets.

Without Ethernet this was hard to debug: the systems have no local drive, and as soon as the Ethernet driver stopped working it was no longer possible to log in.

Looking at the boot logs I saw that the system starts up with the wrong date, which is then corrected by NTP during boot. I also saw that the systems stopped working the moment the time was corrected. At least most of the time.

Looking at the parameters of the network driver (igb) to find some debug options, I saw that it has a dependency on the ptp module. I had no idea what PTP was, but the Internet told me that it is the Precision Time Protocol and that it is a feature which was enabled with RHEL 6.5 and therefore also with the CentOS 6.5 we are using. The network driver also stopped working once I tried to write the correct time to the RTC using hwclock.

On some of the systems the time stored in the RTC was more than 3.5 years in the past. The reason for this might be that most of the time the systems are not shut down cleanly but only powered off or power cycled using ipmitool, because the systems are disk-less and have a read-only filesystem. But this also means that hwclock is never run on shutdown to sync the time to the RTC.

Setting SYNC_HWCLOCK in /etc/sysconfig/ntpdate to yes syncs the actual time to the RTC, and after the next reboot all my problems were gone.

Syncing the RTC to a reasonable value solved my problem, but it still looks like a bug in the network driver that it stops working after the time is changed.
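To spot nodes whose RTC has drifted this far before NTP steps the clock, one could compare the RTC with the system time. The following is only a minimal sketch: it assumes the RTC is exposed at /sys/class/rtc/rtc0/since_epoch (the device name may differ) and that the RTC is kept in UTC. The function names and the one-day threshold are made up for illustration.

```python
import time
from pathlib import Path

# Assumed sysfs location of the first RTC; the device name may differ per system.
RTC_EPOCH_FILE = Path("/sys/class/rtc/rtc0/since_epoch")

# Arbitrary illustrative threshold: flag RTCs that are more than one day off.
MAX_DRIFT_SECONDS = 24 * 60 * 60


def rtc_drift_seconds(rtc_epoch, system_epoch):
    """Return system time minus RTC time (positive if the RTC lags behind)."""
    return system_epoch - rtc_epoch


def is_stale(rtc_epoch, system_epoch, max_drift=MAX_DRIFT_SECONDS):
    """True if the RTC differs from the system clock by more than max_drift seconds."""
    return abs(rtc_drift_seconds(rtc_epoch, system_epoch)) > max_drift


def read_rtc_epoch(path=RTC_EPOCH_FILE):
    """Read the RTC time as seconds since the Unix epoch from sysfs."""
    return int(path.read_text().strip())
```

On a node, `is_stale(read_rtc_epoch(), time.time())` would then report whether the RTC needs to be synced, provided the system time itself has already been corrected.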

Now that checkpoint/restart with CRIU has been possible since Fedora 19, I have started adding CRIU support to Open MPI. With my commit 30772 it is now possible to checkpoint a process running under Open MPI. The restart functionality is not yet implemented but should be available soon. I have a test case (orte-test) which prints its PID and sleeps one second in a loop, which I start under orterun like this:

/path/to/orterun --mca ft_cr_enabled 1 --mca opal_cr_use_thread 1 --mca oob tcp --mca crs_criu_verbose 30 --np 1 orte-test

The options have the following meaning:

  • --mca ft_cr_enabled 1
    • ft stands for fault tolerance
    • cr stands for checkpoint/restart
    • this option enables the checkpoint/restart functionality
  • --mca opal_cr_use_thread 1: use an additional thread to control checkpoint/restart operations
  • --mca oob tcp: use TCP instead of Unix domain sockets (the socket code needs some additional changes for C/R to work)
  • --mca crs_criu_verbose 30: print all CRIU debug messages
  • --np 1: spawn one test case
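When starting the test case from a script it can be handy to keep the MCA parameters in one place. The following is a small sketch, not part of Open MPI: the helper name build_orterun_command is made up, and it only reassembles the options listed above into an argument list.

```python
import shlex

# MCA parameters taken from the orterun command line above.
MCA_PARAMS = {
    "ft_cr_enabled": "1",       # enable checkpoint/restart
    "opal_cr_use_thread": "1",  # extra thread controlling C/R operations
    "oob": "tcp",               # TCP instead of Unix domain sockets
    "crs_criu_verbose": "30",   # print all CRIU debug messages
}


def build_orterun_command(orterun, program, mca_params, num_procs=1):
    """Assemble the orterun argument list from a mapping of MCA parameters."""
    argv = [orterun]
    for name, value in mca_params.items():
        argv += ["--mca", name, str(value)]
    argv += ["--np", str(num_procs), program]
    return argv


command = build_orterun_command("/path/to/orterun", "orte-test", MCA_PARAMS)
print(shlex.join(command))
```

The resulting list can be passed directly to subprocess.run(), which avoids any shell quoting issues.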

The output of the test case looks like this:

[dcbz:12563] crs:criu: open()
[dcbz:12563] crs:criu: open: priority = 10
[dcbz:12563] crs:criu: open: verbosity = 30
[dcbz:12563] crs:criu: open: log_file = criu.log
[dcbz:12563] crs:criu: open: log_level = 0
[dcbz:12563] crs:criu: open: tcp_established = 1
[dcbz:12563] crs:criu: open: shell_job = 1
[dcbz:12563] crs:criu: open: ext_unix_sk = 1
[dcbz:12563] crs:criu: open: leave_running = 1
[dcbz:12563] crs:criu: component_query()
[dcbz:12563] crs:criu: module_init()
[dcbz:12563] crs:criu: opal_crs_criu_prelaunch
[dcbz:12565] crs:criu: open()
[dcbz:12565] crs:criu: open: priority = 10
[dcbz:12565] crs:criu: open: verbosity = 30
[dcbz:12565] crs:criu: open: log_file = criu.log
[dcbz:12565] crs:criu: open: log_level = 0
[dcbz:12565] crs:criu: open: tcp_established = 1
[dcbz:12565] crs:criu: open: shell_job = 1
[dcbz:12565] crs:criu: open: ext_unix_sk = 1
[dcbz:12565] crs:criu: open: leave_running = 1
[dcbz:12565] crs:criu: component_query()
[dcbz:12565] crs:criu: module_init()
[dcbz:12565] crs:criu: opal_crs_criu_reg_thread
Process 12565
Process 12565
Process 12565

To start the checkpoint operation the Open MPI tool orte-checkpoint is used:

/path/to/orte-checkpoint -V 10 `pidof orterun`

which outputs the following:

[dcbz:12570] orte_checkpoint: Checkpointing...
[dcbz:12570] PID 12563
[dcbz:12570] Connected to Mpirun [[56676,0],0]
[dcbz:12570] orte_checkpoint: notify_hnp: Contact Head Node Process PID 12563
[dcbz:12570] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Requested - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Pending - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Running - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Continuing/Recovered - ompi_global_snapshot_12563.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt
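For scripted tests it may be useful to pull the status updates and the snapshot name out of this output. The following is a rough sketch that assumes the status lines always look like the ones shown above; the regular expression and function names are my own and not part of Open MPI, and the two timing columns are skipped without interpreting them.

```python
import re

# Matches status updates such as:
#   [dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
# The two numeric columns are matched but not captured.
STATUS_RE = re.compile(
    r"\[\s*[\d.]+\s*/\s*[\d.]+\]\s+(?P<state>.+?)\s+-\s+(?P<detail>\S+)"
)


def parse_status_updates(output):
    """Return (state, detail) pairs in the order they appear in the output."""
    return [(m.group("state"), m.group("detail")) for m in STATUS_RE.finditer(output)]


def established_snapshot(output):
    """Return the snapshot name of the 'Checkpoint Established' update, or None."""
    for state, detail in parse_status_updates(output):
        if state == "Checkpoint Established":
            return detail
    return None


# A few lines copied from the orte-checkpoint output above.
sample = """\
[dcbz:12570] [ 0.00 / 0.08] Requested - ...
[dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
[dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
"""
print(established_snapshot(sample))
```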

orte-checkpoint tries to connect to the previously started orterun process and requests that a checkpoint should be taken. orterun outputs the following after receiving the checkpoint request:

[dcbz:12565] crs:criu: checkpoint(12565, ---)
[dcbz:12565] crs:criu: criu_init_opts() returned 0
[dcbz:12565] crs:criu: opening snapshot directory /home/adrian/ompi_global_snapshot_12563.ckpt/0/opal_snapshot_0.ckpt
[dcbz:12563] 12563: Checkpoint established for process [56676,0].
[dcbz:12563] 12563: Successfully restarted process [56676,0].
Process 12565

At this point the checkpoint has been written to disk and the process continues (printing its PID).

For complete checkpoint/restart functionality I still have to implement the restart part in Open MPI, and I also have to take care of the Unix domain sockets (shutting them down for checkpointing).

This requires the latest criu package (criu-1.1-4) which includes headers to build Open MPI against CRIU as well as the CRIU service.

Now that I am syncing my ownCloud address book to my mobile devices and my laptop, I was missing this address book in mutt. But using pyCardDAV and the instructions at http://got-tty.org/archives/mutt-kontakte-aus-owncloud-nutzen.html it was easy to integrate the ownCloud address book into mutt. As pyCardDAV was already packaged for Fedora, it was not much more work than running yum install python-carddav and editing ~/.config/pycard/pycard.conf to get the address book synced.

I was already using an LDAP address book in mutt, so I had to extend the existing configuration to:
set query_command = "~/bin/mutt_ldap.pl '%s'; /usr/bin/pc_query -m '%s'"

Now, whenever I press CTRL+T during address input, first the LDAP server is queried and then my local copy of the ownCloud address book.
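If the two result lists should ever be merged and de-duplicated instead of simply concatenated, a small wrapper could do that. The following is only a sketch: it assumes both commands follow mutt's external query format (one status line, then tab-separated address, name and optional info), and the function name and the header text are made up.

```python
def merge_query_outputs(*outputs):
    """Merge several mutt query_command outputs into a single one.

    Each output is expected to consist of a one-line status message followed
    by lines of the form "address<TAB>name[<TAB>other info]".
    """
    seen = set()
    entries = []
    for output in outputs:
        for line in output.splitlines()[1:]:  # skip this command's status line
            if "\t" not in line:
                continue  # ignore blank or malformed lines
            address = line.split("\t", 1)[0].strip().lower()
            if address and address not in seen:
                seen.add(address)
                entries.append(line)
    header = f"Found {len(entries)} merged entries"
    return "\n".join([header] + entries) + "\n"


# Made-up example data, not real output of mutt_ldap.pl or pc_query.
ldap_output = "status line\nalice@example.org\tAlice Example\tLDAP\n"
card_output = (
    "status line\n"
    "alice@example.org\tAlice E.\townCloud\n"
    "bob@example.org\tBob Example\townCloud\n"
)
print(merge_query_outputs(ldap_output, card_output), end="")
```

A wrapper script would run both query commands, pass their output to this function and print the result; the first source wins when an address appears in both.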

A new terminatorX release is available; grab the tarball from the download section if you want to give it a try. While still GTK+2 based, this release completes the first half of the GTK+3 migration guide, so expect the next releases to be GTK+3 based. Aside from lots of cleanups addressing deprecated APIs, this release also brings:

  • gradient for the sample widget to freshen the UI a bit
  • a fix for Bug #33
  • delayed initialization for the jack engine (when jack is not activated via preferences) to avoid unnecessary start-up delays

 

In preparation for the upcoming terminatorX release this site has been overhauled. The layout generated by the original handcrafted scripts looked rather antiquated, so the scripts were retired in favour of WordPress with all its goodies including comments and HTML5 media playback.

In case you have visited terminatorX.org before, you should find everything you could find before, although presented in a much more aesthetically pleasing form.