PXCAB

A long time ago (2007 or 2008) I was developing firmware for Cell processor-based systems. Most of the Slimline Open Firmware (SLOF) has since been released and is also available in Fedora as firmware for QEMU: SLOF.

One of the systems we were developing firmware for was a PCI Express card called PXCAB. The processor on this PCI Express card was not the original Cell processor but the newer PowerXCell 8i, which has much better double-precision floating-point performance. A few weeks ago I was able to get one of those PCI Express cards in a 1U chassis:

PXCAB

This chassis was designed to hold two PXCABs: one running in root complex mode and the other in endpoint mode. That way one card was the host system and the other the PCI Express-connected device. This single card is now running in root complex mode.

I can boot a kernel either via TFTP or from flash. As writing the flash takes some time, I am booting via TFTP right now. Thanks to the available cross compiler (gcc-powerpc64-linux-gnu.x86_64), compiling the latest kernel from git for PPC64 is no problem: make CROSS_COMPILE=powerpc64-linux-gnu- ARCH=powerpc.
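
For reference, the full build sequence is roughly the following sketch (the defconfig target is a placeholder for whatever configuration the board actually needs):

$ dnf install gcc-powerpc64-linux-gnu
$ cd linux
# configure and build with the Fedora cross toolchain
$ make CROSS_COMPILE=powerpc64-linux-gnu- ARCH=powerpc defconfig
$ make CROSS_COMPILE=powerpc64-linux-gnu- ARCH=powerpc -j8 zImage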

The more difficult part was compiling the user-space tools, but fortunately I was able to do that natively on a PPC64 system. With this minimal busybox-based system I can boot the machine and chroot into a Fedora 24 NFS mount.

I was trying to populate a directory with a minimal PPC64-based Fedora 24 system using the following command:

dnf --setopt arch=ppc64 --installroot $PWD/ppc64 install dnf --releasever 24

Unfortunately that does not work, as there currently seems to be no way to tell dnf to install packages for another architecture. I was able to download a few RPMs and install them directly with rpm using the option --ignorearch. In the end I populated the chroot on my PPC64 system, as that was faster and easier.
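
The manual workaround looked roughly like this sketch (mirror URL and package name are placeholders; --nodeps may be needed since the dependency tree is incomplete when only a few packages are installed):

# download a ppc64 RPM from a mirror (URL and package are placeholders)
$ wget http://mirror.example.com/fedora-secondary/releases/24/Everything/ppc64/os/Packages/b/bash-4.3.42-5.fc24.ppc64.rpm
# install into the target directory, skipping the architecture check
$ rpm --root $PWD/ppc64 -ivh --ignorearch --nodeps bash-4.3.42-5.fc24.ppc64.rpm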

Now I can boot the PXCAB via TFTP into the busybox-based ramdisk and from there chroot into the NFS-mounted Fedora 24 system.
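
The chroot step itself is straightforward; from the busybox shell it is something like this (server name and export path are placeholders):

# mount the Fedora 24 root filesystem over NFS
$ mount -t nfs server:/exports/fedora24-ppc64 /mnt
# provide kernel interfaces inside the chroot
$ mount -t proc proc /mnt/proc
$ mount -t sysfs sys /mnt/sys
$ chroot /mnt /bin/bash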

The system has one CPU with two threads and 4GB of RAM. In addition to the actual RAM there is also 256MB of memory which can be accessed as a block device using the axonram driver. My busybox-based ramdisk is copied to that ramdisk, freeing some more actual RAM:

# df -h
Filesystem         Size    Used Available Use% Mounted on
/dev/axonram0    247.9M   15.6M    219.5M   7% /
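
If the axonram device were not used as the root filesystem, it could be treated like any other block device; a sketch:

# create a filesystem on the 256MB axonram device and mount it
$ mkfs.ext2 /dev/axonram0
$ mount /dev/axonram0 /mnt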

System information from the firmware:

SYSTEM INFORMATION
 Processor  = PowerXCell DD1.0 @ 2800 MHz
 I/O Bridge = Cell BE companion chip DD3.0
 Timebase   = 14318 kHz (external)
 Config     = SMP disabled
 SMP Size   = 1 (2 threads)
 Boot-Date  = 2016-07-21 19:37
 Memory     = 4096MB (CPU0: 4096MB)

Updated RPM Fusion’s mirrorlist servers

RPM Fusion’s mirrorlist servers, which return a list of (probably, hopefully) up-to-date mirrors (e.g., http://mirrors.rpmfusion.org/mirrorlist?repo=free-fedora-rawhide&arch=x86_64), were still running on CentOS 5 and the old MirrorManager code base. The service ran on two systems (DNS load balancing) and was not the most stable setup. Connecting from a country which had recently been added to the GeoIP database led to 100% CPU usage of the httpd process, which led to a DoS after a few requests. I added a cron entry to restart the httpd server every hour, which seemed to help a bit, but it was a rather clumsy workaround.
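
The workaround was not much more than a cron entry like this sketch (CentOS 5 still used SysV init scripts):

# /etc/cron.d/restart-httpd: blunt hourly workaround for the mirrorlist DoS
0 * * * * root /sbin/service httpd restart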

It was clear that the two systems needed to be updated to something newer. As the new MirrorManager2 code base can luckily handle the data format of the old MirrorManager code base, it was possible to update the RPM Fusion mirrorlist servers without updating the MirrorManager back-end (yet).

From now on there are four CentOS 7 systems answering the requests for mirrors.rpmfusion.org. As the new RPM Fusion infrastructure is also Ansible-based, I added the Ansible files from Fedora to the RPM Fusion infrastructure repository. I had to remove some parts, but most of the Ansible content could be reused.

When yum or dnf now connect to http://mirrors.rpmfusion.org/mirrorlist?repo=free-fedora-rawhide&arch=x86_64, the answer is created by one of four CentOS 7 systems running the latest MirrorManager2 code.

RPM Fusion now also has the same mirrorlist access statistics as Fedora: http://mirrors.rpmfusion.org/statistics/.

I still need to update the back-end system, which is only one system instead of six different systems as in the Fedora infrastructure.

Protocol Changes In Fedora’s MirrorManager

There have been two protocol-related issues with MirrorManager open for some time; both have now been resolved.

The first issue, to drop FTP URLs from the metalinks, was resolved in multiple steps. The first step was to block FTP URLs from being added to Fedora’s MirrorManager (Optionally exclude certain protocols from MM, New MirrorManager2 features), and the second step, to remove all remaining FTP URLs from Fedora’s MirrorManager, was performed during the last few days and weeks. MirrorManager’s mirrorlist interface (which is not used very often) only returned FTP URLs if a mirror had no HTTP(S) URLs, so it was already rather unusual to be redirected to an FTP mirror. MirrorManager’s metalink interface, however, returned all possible URLs for a host. With the removal of all FTP URLs from MirrorManager’s database no user should see FTP URLs any more, and the problems some clients encountered (see Drop ftp:// urls from metalinks) should be ‘resolved’.

The other issue (Add a way to specify you want only https urls from metalink) has been solved by adding a protocol option to the mirrorlist and metalink back-end. The new MirrorManager release (0.7.2), which includes these changes, is already running on the staging instance.
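
Assuming the option is passed as an additional query parameter named protocol (my reading of the change), a metalink request restricted to HTTPS mirrors would look like this sketch:

$ curl 'https://mirrors.stage.fedoraproject.org/metalink?repo=fedora-24&arch=x86_64&protocol=https'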

To have more HTTPS-based mirrors in our database we scanned all existing public mirrors to see if they also provide HTTPS. With this, the number of HTTPS URLs increased from 24 to over 120.

The option to select which protocols the mirrorlist/metalink back-end should return is not yet available on the production instance.

Lazy Process Migration

Process Migration

Using CRIU it is possible to checkpoint/save/dump the state of a process into a set of files which can then be used to restore/restart the process at a later point in time. If the files from the checkpoint operation are transferred from one system to another and then used to restore the process, this is probably the simplest form of process migration.

Source system:

  • criu dump -D /checkpoint/destination -t PID
  • rsync -a /checkpoint/destination destination.system:/checkpoint/destination

Destination system:

  • criu restore -D /checkpoint/destination
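
Put together, the simplest form of migration is not much more than the following sketch (host name and paths are placeholders; depending on the process, additional CRIU options such as --shell-job or --tcp-established may be required):

#!/bin/sh
# migrate process PID to DEST in the simplest possible way
PID=$1; DEST=$2
D=/checkpoint/destination
criu dump -D $D -t $PID            # the process is stopped after the dump
rsync -a $D/ $DEST:$D/             # transfer the checkpoint images
ssh $DEST "criu restore -d -D $D"  # restore and detach on the destination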

For large processes the migration can take rather long. For a process using 24GB of memory this can lead to a migration duration of more than 280 seconds. The limiting factor in most cases is the interconnect between the systems involved in the migration.

Optimization: Pre-Copy

One existing solution to decrease process downtime during migration is pre-copy. In one or multiple runs the memory of the process is copied from the source to the destination system, and with every run only memory pages which have changed since the last run have to be transferred. This can dramatically decrease the process downtime during migration.

How well this works depends on the type of application being migrated, and especially on how often/fast the memory content changes. In extreme cases it was possible to decrease the process downtime during migration of a 24GB process from 280 seconds to 8 seconds with the help of pre-copy.
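
With CRIU, pre-copy is implemented via pre-dump; two pre-copy rounds followed by the final dump look roughly like this sketch:

# round 1: copy all pages while the process keeps running, track changes
$ criu pre-dump -t PID -D /checkpoint/1 --track-mem
# round 2: copy only the pages changed since round 1
$ criu pre-dump -t PID -D /checkpoint/2 --prev-images-dir ../1 --track-mem
# final dump: only the remaining dirty pages; the process stops here
$ criu dump -t PID -D /checkpoint/3 --prev-images-dir ../2 --track-mem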

This approach is basically the same whether migrating single processes (or process groups) or virtual machines.

It Always Depends On…

Unfortunately the pre-copy optimization can also lead to situations where the so-called optimized case with pre-copy requires more time than the unoptimized case:

In the example above a process was migrated during three stages of its lifetime. There are situations (state: Calculation) where pre-copy has enormous advantages (14 seconds with pre-copy, 51 seconds without), but there are also situations (state: Initialization) where the pre-copy optimization increases the process downtime during migration (40 seconds with pre-copy, 27 seconds without). It all depends on the memory change rate.

Optimization: Post-Copy

Another approach to reduce the process downtime during migration is post-copy. The required memory pages are not dumped and transferred before restoring the process, but on demand. Each time a missing memory page is accessed, the migrated process is halted until the required memory page has been transferred from the source system to the destination system:

Thanks to userfaultfd this approach (or optimization) can now be integrated into CRIU. With the help of userfaultfd it is possible to mark memory pages to be handled by userfaultfd. If such a memory page is accessed, the process is halted until the requested page is provided. The listener for the userfaultfd requests runs in user space and listens on a file descriptor. The same approach has already been implemented for QEMU.

Enough Theory

So much for the background on why and how. The initial code to restore processes with userfaultfd support has been merged into the CRIU development branch: criu-dev. This initial implementation of lazy-pages support does not yet support lazy process migration between two hosts, but with the patches merged upstream it is at least possible to checkpoint a process and restore it using userfaultfd. A lazy restore consists of two parts: the usual ‘criu restore‘ part and an additional part, what we call the uffd daemon, ‘criu lazy-pages‘. To better demonstrate the advantages of a lazy restore there are patches to enhance crit (CRiu Image Tool) to remove the pages which can be restored with userfaultfd from a checkpoint directory. A test case which allocates about 200MB of memory (and writes one byte in each page over and over) requires about 200MB after being dumped. Using the mentioned crit enhancement make-lazy reduces the size of the checkpoint to 116KB:

$ crit make-lazy /tmp/checkpoint/ /tmp/lazy-checkpoint
$ du -hs /tmp/checkpoint/ /tmp/lazy-checkpoint
     201M       /tmp/checkpoint
     116K       /tmp/lazy-checkpoint

With this, the data which actually has to be transferred during the process downtime is drastically reduced, and the required memory pages are inserted into the restored process on demand using userfaultfd. Restoring the checkpointed process using lazy restore looks something like this:

First the uffd daemon:

$ criu lazy-pages -D /tmp/checkpoint \
--address /tmp/userfault.socket

And then the actual restore:

$ criu restore -D /tmp/lazy-checkpoint \
--lazy-pages --address /tmp/userfault.socket

The socket specified with --address is used to exchange information about the restored process which the uffd daemon requires. Once criu restore has done all its magic to restore the process, except restoring the lazy memory pages, the restored process is actually started and runs until the first userfaultfd-handled memory page is accessed. At that point the process hangs and the uffd daemon gets a message to provide the required memory page. Once the uffd daemon has provided the requested page, the restored process continues to run until the next page is requested. As potentially not all memory pages are requested (they might not be accessed for some time), the uffd daemon eventually starts to transfer the unrequested memory pages into the restored process, so that it can shut down after a certain time.

Booting with syslinux

Having read about using syslinux as a boot loader for virtual machines, I tried to replace grub2 with syslinux on one of the Fedora 24 virtual machines I am using.

Not completely knowing what to do I did:

  • dnf install syslinux-extlinux.x86_64
  • /sbin/extlinux --install /boot/extlinux/

Then I tried to create a configuration file using grubby:

  • grubby --extlinux --add-kernel=/boot/vmlinuz-4.4.6-300.fc23.x86_64 --title="4.4.6" --initrd=/boot/initramfs-4.4.6-300.fc23.x86_64.img --args="ro root=/dev/sda3"

Which resulted in:

# cat /etc/extlinux.conf 
label 4.4.6
 kernel /vmlinuz-4.4.6-300.fc23.x86_64
 initrd /initramfs-4.4.6-300.fc23.x86_64.img
 append ro root=/dev/sda3

I manually added the following lines to the file:

default 4.4.6
ui menu.c32
timeout 50
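
The resulting configuration file then looked roughly like this:

# cat /etc/extlinux.conf
default 4.4.6
ui menu.c32
timeout 50

label 4.4.6
 kernel /vmlinuz-4.4.6-300.fc23.x86_64
 initrd /initramfs-4.4.6-300.fc23.x86_64.img
 append ro root=/dev/sda3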

After that I rebooted and the virtual machine was still using grub2 to load the kernel.

To write syslinux to the MBR, the following additional command was required: dd if=/usr/share/syslinux/mbr.bin of=/dev/sda bs=440 count=1. I was a bit nervous rebooting the system after overwriting the MBR, but it rebooted successfully. The configuration file was also correctly updated after I installed a new kernel via dnf. I also removed grub2 (dnf remove grub2*) and was able to successfully reboot into the new kernel without grub2.

New MirrorManager2 features

The latest MirrorManager release (0.6.1), which has been active in Fedora’s infrastructure since 2015-12-17, has a few additional features which provide insights into the usage of the mirror network.

The first is called statistics. It gives a daily overview of what clients are requesting: it analyses the metalink and mirrorlist accesses and draws diagrams. Each time the local yum or dnf metadata has expired, a new mirrorlist/metalink is requested which contains the ‘best’ mirrors for the client currently requesting the data. The current MirrorManager statistics implementation tries to display how often the different repositories are requested, from which country, and for which architectures:

In addition to the statistics about where the clients come from and which files they are interested in, the old code to draw a map of the locations of all mirror servers has been re-enabled: maps.

Another new visualization tries to track the propagation: the time the existing mirrors need to carry the latest bits. A script connects to all enabled mirrors and checks which repomd.xml file is currently available on each mirror. This is done for the development branch and all active branches. The script displays how many mirrors have the current repomd.xml file, still have the repomd.xml file from the previous push (or the push before that), or have an even older file: Propagation.

Another relevant change in Fedora’s MirrorManager is that it is no longer possible to enter FTP URLs. This is the first step towards removing FTP-based URLs: FTP-based mirrors are often, depending on the network topology, difficult to connect to, other protocols (HTTP, RSYNC) are better suited, and more and more mirror servers do not provide FTP anyway.

Bimini Upgrade

I finally upgraded my PowerStation from Fedora 18 to Fedora 21. The upgrade went pretty smoothly and was not much more than:


$ yum --releasever=19 --exclude=yaboot --exclude=kernel distro-sync
$ yum --releasever=20 --exclude=yaboot --exclude=kernel distro-sync
$ yum --releasever=21 --exclude=yaboot --exclude=kernel distro-sync

As I was doing the upgrade without console access, I did not want to change the boot loader from yaboot to grub2, and I also excluded the kernel. Once I have console access I will upgrade those packages as well.

The only difficulty was upgrading from Fedora 20 to Fedora 21, because 32-bit packages were dropped from ppc and I was not sure if the system would still boot after removing all 32-bit packages (yum remove *ppc). But it just worked, and now I have an up-to-date 64-bit ppc Fedora 21 system.

Cluster Updated to CentOS 6.5 (IGB/PTP Problems)

Normally I would not mention that our Linux cluster was updated. But as the update to CentOS 6.5 produced some strange errors, I thought I would write it down in case somebody else runs into the same errors.

Our cluster has a bit more than 200 nodes, and all nodes run disk-less with a read-only filesystem mounted over NFS. Until now we were using Scientific Linux 5.5, and it was time to update to something newer: CentOS 6.5.

So all nodes were shut down and then started with the new CentOS 6.5 image, and everything seemed fine. After a few minutes, however, about 30 nodes went offline. The hardware on all nodes is the same, and it seemed strange that 30 nodes should develop the same hardware error after a software upgrade. I was not able to contact the affected systems over Ethernet, but they still answered ping requests over InfiniBand. I could not log in to the affected systems, as the filesystem was mounted over Ethernet and not InfiniBand. Going to the console of the systems, I saw that each system was still up and running but not reachable over Ethernet. The link was still active and the kernel detected if the link was going up or down, but the driver of the Ethernet card refused to answer any packets.

Without Ethernet it was hard to debug, as the systems have no local drive and, as soon as the Ethernet driver stopped working, no login was possible.

Looking at the boot logs, I saw that the systems start up with the wrong date, which is then corrected by NTP during boot. I also saw that the systems stopped working the moment the time was corrected. At least most of the time.

Looking at the parameters of the network driver (igb) for debug options, I saw that it has a dependency on the ptp module. I had no idea what PTP was, but the Internet told me that it is the Precision Time Protocol and that it is a feature which was enabled with RHEL 6.5 and therefore also with the CentOS 6.5 we use. The network driver also stopped working once I tried to write the correct time to the RTC using hwclock.

On some of the systems the time stored in the RTC was more than 3.5 years in the past. The reason might be that most of the time the systems are not shut down cleanly, but only powered off or power-cycled using ipmitool, because the systems are disk-less and have a read-only filesystem. This also means that hwclock is never run on shutdown to sync the system time to the RTC.

Setting SYNC_HWCLOCK=yes in /etc/sysconfig/ntpdate syncs the NTP-corrected time to the RTC, and after the next reboot all my problems were gone.
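
On CentOS 6 that amounts to the following sketch (the one-off hwclock call is only needed to fix up nodes that are already running):

# /etc/sysconfig/ntpdate: also write the corrected time to the RTC
SYNC_HWCLOCK=yes

# one-off fix on an already running node
$ hwclock --systohc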

Syncing the RTC to a reasonable value solved my problem, but it still looks like a bug in the network driver that it stops working after the time is changed.

Checkpoint and almost Restart in Open MPI

Now that checkpoint/restart with CRIU is possible since Fedora 19, I have started adding CRIU support to Open MPI. With my commit 30772 it is now possible to checkpoint a process running under Open MPI. The restart functionality is not yet implemented, but should be available soon. I have a test case (orte-test) which prints its PID and sleeps one second in a loop, and which I start under orterun like this:

/path/to/orterun --mca ft_cr_enabled 1 --mca opal_cr_use_thread 1 --mca oob tcp --mca crs_criu_verbose 30 --np 1 orte-test

The options have the following meaning:

  • --mca ft_cr_enabled 1
    • ft stands for fault tolerance
    • cr stands for checkpoint/restart
    • this option is to enable the checkpoint/restart functionality
  • --mca opal_cr_use_thread 1: use an additional thread to control checkpoint/restart operations
  • --mca oob tcp: use TCP instead of unix domain sockets (the socket code needs some additional changes for C/R to work)
  • --mca crs_criu_verbose 30: print all CRIU debug messages
  • --np 1: spawn one test case

The output of the test case looks like this:


[dcbz:12563] crs:criu: open()
[dcbz:12563] crs:criu: open: priority = 10
[dcbz:12563] crs:criu: open: verbosity = 30
[dcbz:12563] crs:criu: open: log_file = criu.log
[dcbz:12563] crs:criu: open: log_level = 0
[dcbz:12563] crs:criu: open: tcp_established = 1
[dcbz:12563] crs:criu: open: shell_job = 1
[dcbz:12563] crs:criu: open: ext_unix_sk = 1
[dcbz:12563] crs:criu: open: leave_running = 1
[dcbz:12563] crs:criu: component_query()
[dcbz:12563] crs:criu: module_init()
[dcbz:12563] crs:criu: opal_crs_criu_prelaunch
[dcbz:12565] crs:criu: open()
[dcbz:12565] crs:criu: open: priority = 10
[dcbz:12565] crs:criu: open: verbosity = 30
[dcbz:12565] crs:criu: open: log_file = criu.log
[dcbz:12565] crs:criu: open: log_level = 0
[dcbz:12565] crs:criu: open: tcp_established = 1
[dcbz:12565] crs:criu: open: shell_job = 1
[dcbz:12565] crs:criu: open: ext_unix_sk = 1
[dcbz:12565] crs:criu: open: leave_running = 1
[dcbz:12565] crs:criu: component_query()
[dcbz:12565] crs:criu: module_init()
[dcbz:12565] crs:criu: opal_crs_criu_reg_thread
Process 12565
Process 12565
Process 12565

To start the checkpoint operation the Open MPI tool orte-checkpoint is used:

/path/to/orte-checkpoint -V 10 `pidof orterun`

which outputs the following:


[dcbz:12570] orte_checkpoint: Checkpointing...
[dcbz:12570] PID 12563
[dcbz:12570] Connected to Mpirun [[56676,0],0]
[dcbz:12570] orte_checkpoint: notify_hnp: Contact Head Node Process PID 12563
[dcbz:12570] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Requested - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Pending - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Running - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Continuing/Recovered - ompi_global_snapshot_12563.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt

orte-checkpoint tries to connect to the previously started orterun process and requests that a checkpoint be taken. orterun outputs the following after receiving the checkpoint request:


[dcbz:12565] crs:criu: checkpoint(12565, ---)
[dcbz:12565] crs:criu: criu_init_opts() returned 0
[dcbz:12565] crs:criu: opening snapshot directory /home/adrian/ompi_global_snapshot_12563.ckpt/0/opal_snapshot_0.ckpt
[dcbz:12563] 12563: Checkpoint established for process [56676,0].
[dcbz:12563] 12563: Successfully restarted process [56676,0].
Process 12565

At this point the checkpoint has been written to disk and the process continues (printing its PID).

For complete checkpoint/restart functionality I still have to implement the restart part in Open MPI, and I also have to take care of the unix domain sockets (shutting them down for checkpointing).

This requires the latest criu package (criu-1.1-4), which includes the headers needed to build Open MPI against CRIU as well as the CRIU service.

Using the ownCloud address book in mutt

Now that I am syncing my ownCloud address book to my mobile devices and my laptop, I was missing this address book in mutt. But using pyCardDAV and the instructions at http://got-tty.org/archives/mutt-kontakte-aus-owncloud-nutzen.html it was easy to integrate the ownCloud address book into mutt. As pyCardDAV was already packaged for Fedora, it was not much more work than yum install python-carddav and editing ~/.config/pycard/pycard.conf to get the address book synced.
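
To keep the local copy current, the syncer that ships with pyCardDAV can be run periodically; a sketch (I believe the binary is called pycardsyncer, and the interval is arbitrary):

# crontab entry: sync the ownCloud address book every 30 minutes
*/30 * * * * /usr/bin/pycardsyncer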

I was already using an LDAP address book in mutt, so I had to extend the existing configuration to:
set query_command = "~/bin/mutt_ldap.pl '%s'; /usr/bin/pc_query -m '%s'"

Now, whenever I press CTRL+T during address input, first the LDAP server is queried and then my local copy of the ownCloud address book.