For external use only

Optimizing live container migration in LXD

Wed 06 December 2017
By adrian
After having worked on optimizing live container migration based on runc (pre-copy migration and post-copy migration) I tried to optimize container migration in LXD.

After a few initial discussions with Christian I started with pre-copy migration. Container migration in LXD is based on CRIU, just as in runc and CRIU's pre-copy migration support is based on dirty page tracking support of Linux: SOFT-DIRTY PTEs.

As LXD uses LXC for the actual container checkpointing and restoring I was curious if there was already pre-copy migration support in LXC. After figuring out the right command-line parameters it almost worked thanks to the great checkpoint and restore support implemented by Tycho some time ago.

Now that I knew that it works in LXC I focused on getting pre-copy migration support into LXD. LXD supports container live migration using the move command: lxc move <container> <remote>:<container>
This move command, however, did not use any optimization yet. It basically did:
1. Initial sync of the filesystem
2. Checkpoint container using CRIU
3. Transfer container checkpoint
4. Final sync of the filesystem
5. Restart container on the remote system
The downtime for the container in this scenario is between step 2 and step
5 and depends on the used memory of the processes inside the container. The goal of pre-copy migration is to dump the memory of the container and transfer it to the remote destination while the container keeps on running and doing a final dump with only the memory pages that changed since the last pre-dump (more about process migration optimization theories).

Back to LXD: At the end of the day I had a very rough (and very hardcoded) first pre-copy migration implementation ready and I kept working on it until it was ready to be submitted upstream. The pull request has already been merged upstream and now LXD supports pre-copy migration.

As not all architecture/kernel/criu combinations support pre-copy migration it has to be turned on manually right now, but we already discussed adding pre-copy support detection to LXC. To tell LXD to use pre-copy migration, the parameter 'migration.incremental.memory' needs to be set to 'true'. Once that is done and if LXD is instructed to migrate a container the following will happen:
- Initial sync of the filesystem
- Start pre-copy checkpointing loop using CRIU
  - Check if maximum number pre-copy iterations has been reached
  - Check if threshold of unchanged memory pages has been reached
  - Transfer container checkpoint
  - Continue pre-copy checkpointing loop if neither of those conditions is true
- Final container delta checkpoint using CRIU
- Transfer final delta checkpoint
- Final sync of the filesystem
- Restart container on the remote system
So instead of doing a single checkpoint and transferring it, there are now multiple pre-copy checkpoints and the container keeps on running during those transfers. The container is only suspended during the last delta checkpoint and the transfer of the last delta checkpoint. In many cases this reduces the container downtime during migration, but there is the possibility that pre-copy migration also increases the container downtime during migration. This depends (as always) on the workload.

To control how many pre-copy iterations LXD does there are two additional variables:
1. migration.incremental.memory.iterations (defaults to 10)
2. migration.incremental.memory.goal (defaults to 70%)
The first variable (iterations) is used to tell LXD how many pre-copy iterations it should do before doing the final dump and the second variable (goal) is used to tell LXD the percentage of pre-copied memory pages that should not change between pre-copy iterations before doing the final dump.

So LXD, in the default configuration, does either 10 pre-copy iterations before doing the final migration or the final migration is triggered when at least 70% of the memory pages have been transferred by the last pre-copy iteration.

Now that this pull request is merged and if pre-copy migration is enabled a lxc move <container> <remote>:<container> should live migrate the container with a reduced downtime.

I want to thank Christian for the collaboration on getting CRIU's pre-copy support into LXD, Tycho for his work preparing LXC and LXD to support migration so nicely and the developers of p.haul for the ideas how to implement pre-copy container migration. Next step: lazy migration.
Tagged as : criu migration pre-copy
Lazy Migration in CRIU's master branch

Sun 01 October 2017
By adrian
For almost two years Mike Rapoport and I have been working on lazy process migration. Lazy process migration (or post-copy migration) is a technique to decrease the process or container downtime during the live migration. I described the basic functionality in the following previous articles:
- Lazy Process Migration
- Combining pre-copy and post-copy migration
Those articles are not 100% correct anymore as we changed some of the parameters during the last two years, but the concepts stayed the same.

Mike and I started about two years ago to work on it and the latest CRIU release (3.5) includes the possibility to use lazy migration. Now that the post-copy migration feature has been merged from the criu-dev branch to the master branch it is part of the normal CRIU releases.

With CRIU's 3.5 release lazy migration can be used on any kernel which supports userfaultfd. I already updated the CRIU packages in Fedora to 3.5 so that lazy process migration can be used just by installing the latest CRIU packages with dnf (still in the testing repository right now).

More information about container live migration in our upcoming Open Source Summit Europe talk: Container Migration Around The World.

My pull request to support lazy migration in runC was also recently merged, so that it is now possible to migrate containers using pre-copy migration and post-copy migration. It can also be combined.

Another interesting change about CRIU is that it started as x86_64 only and now it is also available on aarch64, ppc64le and s390x. The support to run on s390x has just been added with the previous 3.4 release and starting with Fedora 27 the necessary kernel configuration options are also active on s390x in addition to the other supported architectures.
Tagged as : criu fedora
Linux Plumbers Conference 2016

Fri 20 January 2017
By adrian
It is a bit late but I still wanted to share my presentations from this year's Linux Plumbers Conference:
On my way back home I had to stay one night in Albuquerque and it looks like the hotel needs to upgrade its TV system. It is still running Fedora 10 which is EOL since 2009-12-18:
Tagged as : criu
Influence which PID will be the next

Sat 15 October 2016
By adrian
To restore a checkpointed process with CRIU the process ID (PID) has to be the same it was during checkpointing. CRIU uses /proc/sys/kernel/ns_last_pid to set the PID to one lower as the process to be restored just before fork()-ing into the new process.

The same interface (/proc/sys/kernel/ns_last_pid) can also be used from the command-line to influence which PID the kernel will use for the next process.
```
# cat /proc/sys/kernel/ns_last_pid
1626
# echo -n 9999 > /proc/sys/kernel/ns_last_pid
# cat /proc/sys/kernel/ns_last_pid
10000
```
Writing '9999' (without a 'new line') to /proc/sys/kernel/ns_last_pid tells the kernel, that the next PID should be '10000'. This only works if between after writing to /proc/sys/kernel/ns_last_pid and forking the new process no other process has been created. So it is not possible to guarantee which PID the new process will get but it can be influenced.

There is also a posting which describes how to do the same with C: How to set PID using ns_last_pid
Tagged as : criu fedora
Combining pre-copy and post-copy migration

Fri 14 October 2016
By adrian
In my last post about CRIU in May 2016 I mentioned lazy memory transfer to decrease process downtime during migration. Since May 2016 Mike Rapoport's patches for remote lazy process migration have been merged into CRIU's criu-dev branch as well as my patches to combine pre-copy and post-copy migration.

Using pre-copy (criu pre-dump) it has "always" been possible to dump the memory of a process using soft-dirty-tracking. criu pre-dump can be run multiple times and each time only the changed memory pages will be written to the checkpoint directory.

Depending on the processes to be migrated and how fast they are changing their memory, this can still lead to a situation where the final dump can be rather large which can mean a longer downtime during migration than desired. This is why we started to work on post-copy migration (also know as lazy migration). There are, however, situations where post-copy migration can also increase the process downtime during migration instead of decreasing it.

The latest changes regarding post-copy migration in the criu-dev branch offer the possibility to combine pre-copy and post-copy migration. The memory pages of the process are pre-dumped using soft-dirty-tracking and transferred to the destination while the process on the source machine keeps on running. Once the process is actually migrated to the destination system everything besides the memory pages is transferred to the destination system. Excluding the memory pages (as the remaining memory pages will be migrated lazily) usually only a few hundred kilobytes have to be transferred which reduces the process downtime during migration significantly.

Using criu with pre-copy and post-copy could look like this:

Source system:
```
# criu pre-dump -D /tmp/cp/1 -t PID
# rsync -a /tmp/cp destination:/tmp
# criu dump -D /tmp/cp/2 -t PID --port 27 --lazy-pages   
 --prev-images-dir ../1/ --track-mem
```
The first criu command dumps the memory of the process PID and resets the soft-dirty memory tracking. The initial dump is then transferred using rsync to the destination system. During that time the process PID keeps on running. The last criu command starts the lazy page mode which dumps everything besides memory pages which can be transferred lazily and waits for connections over the network on port 27. Only pages which have changed since the last pre-dump are considered for the lazy restore. At this point the process is no longer running and the process downtime starts.

Destination system:
```
# rsync -a source:/tmp/cp /tmp/
# criu lazy-pages --page-server --address source --port 27   
 -D /tmp/cp/2 &
# criu restore --lazy-pages -D /tmp/cp/2
```
Once criu is waiting on port 27 on the source system the remaining checkpoint images can be transferred from the source system to the destination system (using rsync in this case). Now criu can be started in lazy-pages mode connecting to the page server on port 27 on the source system. This is the part we usually call the UFFD daemon. The last step is the actual restore (criu restore).

The following diagrams try to visualize what happens during the last step: criu restore.

It all starts with criu restore (on the right). criu does its magic to restore the process and copies the memory pages from criu pre-dump to the process and marks lazy pages as being handled by userfaultfd. Once everything is restored criu jumps into the restored process and the restored process continues to run where it was when checkpointed. Once the process accesses a userfaultfd marked memory address the process will be paused until a memory page (hopefully the correct one) is copied to that address.

The part that we call the UFFD daemon or criu lazy-pages listens on the userfault file descriptor for a message and as soon as a valid UFFD request arrives it requests that page from the source system via TCP where criu is still running in page-server mode. If the page-server finds that memory page it transfers the actual page back to the destination system to the UFFD daemon which injects the page into the kernel using the same userfault file descriptor it previously got the page request from. Now that the page which initially triggered the page-fault or in our case userfault is at its place the restored process continues to run until another missing page is accessed and the whole procedure starts again.

To be able to remove the UFFD daemon and the page-server at some point we currently push all unused pages into the restored process if there are no further userfaultfd requests for 5 seconds.

The whole procedure still has a lot of possibilities for optimization but now that we finally can combine pre-copy and post-copy memory migration we are a lot closer to decreasing process downtime during migration.

The next steps are to get support for pre-copy and post-copy into p.haul (Process Hauler) and into different container runtimes which already support migration via criu.

My other recently posted criu related articles:
Tagged as : criu fedora
HS100 - Wi-Fi Smart Plug

Mon 08 August 2016
By adrian
For my recently installed PXACB I was looking for a way to remotely power it on and off. I found the Wi-Fi Smart Plug "HS100" and a blog post that it can be controlled from the command-line.

The referenced script uses captured results from wireshark and just re-transmits these messages from a shell script. In one of the comments someone points out that this is XOR'd JSON and how it can be decoded. Instead of a shell script I re-implemented it in Python and I am now always using XOR to encode and decode the JSON messages without needing to include the encoded commands in my script. This makes it easier to read the script and to extend the script.

The protocol used is JSON which is XOR'd and then transmitted to the device. Same goes for the answers. The JSON string is XOR'd with the previous character of the JSON string and the value of the first XOR operation is 0xAB. Additionally each message is prefixed with '\x00\x00\x00\x23'.

The message to turn on the power looks like this:
```
{
 "system": {
  "set_relay_state": {
   "state": 1
  }
 }
}
```
To find more about which commands the device understands I used the information I got from: Why not root your Christmas gift?

I downloaded the firmware for the US model of my smart plug and used binwalk to analyze the content of the firmware. The firmware contains busybox based ramdisk which includes the smart plug relevant programs /usr/bin/shd and /usr/bin/shdTester and it seems at least following commands exist:
- system
- reset
- get_sysinfo
- set_test_mode
- set_dev_alias
- set_relay_state
- check_new_config
- download_firmware
- get_download_state
- flash_firmware
- set_mac_addr
- set_device_id
- set_hw_id
- test_check_uboot
- get_dev_icon
- set_dev_icon
- set_led_off
- set_dev_location
With the knowledge from the original shell script implementation and the results from binwalk I wrote the following script: https://lisas.de/~adrian/hs100.py

Using this script I can power the device behind the smart plug easily on and off:
```
$ ./hs100.py -H p-pxcab.example.com off
$ ./hs100.py -H p-pxcab.example.com state
Power OFF
$ ./hs100.py -H p-pxcab.example.com on
$ ./hs100.py -H p-pxcab.example.com state
Power ON
```
The only annoying thing about the smart plug is, that it tries to communicate with some cloud systems so that it could be controlled from anywhere. After starting the smart plug it makes a name lookup for devs.tplinkcloud.com and connects to port 50443. I can connect to that system with openssl s_client -connect devs.tplinkcloud.com:50443 but what the smart plug actually sends to that system I do not know. If I do not block the smart plug in the firewall I see a NTP request after that and then the communication seems to stop. Right now the smart plug is blocked and does no NTP requests but it still tries to reach devs.tplinkcloud.com:50443 once a minute.
Tagged as : pxcab
PXCAB

Thu 21 July 2016
By adrian
A long time ago (2007 or 2008) I was developing firmware for Cell processor based systems. Most of the Slimline Open Firmware (SLOF) has been released and is also available in Fedora as firmware for QEMU: SLOF.

One of the systems we have been developing firmware for was a PCI Express card called PXCAB. The processor on this PCI Express card was not the original Cell processor but the newer PowerXCell 8i which has a much better double precision floating point performance. A few weeks ago I was able to get one of those PCI Express cards in a 1U chassis:

This chassis was designed to hold two PXCABs: one running in root complex mode and the other in endpoint mode. That way one card was the host system and the other the PCI express connected device. This single card is now running in root complex mode.

I can boot a kernel either via TFTP or from the flash. As writing the flash takes some time I am booting it right now via TFTP. Compiling the latest kernel from git for PPC64 is thanks to the available cross compiler (gcc-powerpc64-linux-gnu.x86_64) no problem: make CROSS_COMPILE=powerpc64-linux-gnu- ARCH=powerpc.

The more difficult part was to compile user space tools but fortunately I was able to compile it natively on a PPC64 system. With this minimal busybox based system I can boot the system and chroot into a Fedora 24 NFS mount.

I was trying to populate a directory with a minimal PPC64 based Fedora 24 system with following command:

dnf --setopt arch=ppc64 --installroot $PWD/ppc64 install dnf --releasever 24

Unfortunately that does not work as there currently seems to be no way to tell dnf to install the packages for another architecture. I was able to download a few RPMs and directly install them with rpm using the option --ignorearch. In the end I also installed the data for the chroot on my PPC64 system as that was faster and easier.

Now I can boot the PXCAB via TFTP into the busybox based ramdisk and from there I can chroot in to the NFS mounted Fedora 24 system.

The system has one CPU with two threads and 4GB of RAM. In addition to the actual RAM there is also 256MB of memory which can be accessed as a block device using the axonram driver. My busybox based ramdisk is copied to that ramdisk and thus freeing some more actual RAM:
```
# df -h
Filesystem         Size    Used Available Use% Mounted on
/dev/axonram0    247.9M   15.6M    219.5M   7% /
```
System information from the firmware:
```
SYSTEM INFORMATION
 Processor  = PowerXCell DD1.0 @ 2800 MHz
 I/O Bridge = Cell BE companion chip DD3.0
 Timebase   = 14318 kHz (external)
 Config     = SMP disabled
 SMP Size   = 1 (2 threads)
 Boot-Date  = 2016-07-21 19:37
 Memory     = 4096MB (CPU0: 4096MB)
```
Tagged as : fedora powerstation ppc64 pxcab
Updated RPM Fusion's mirrorlist servers

Thu 14 July 2016
By adrian

RPM Fusion's mirrorlist server which are returning a list of (probably, hopefully) up to date mirrors (e.g., http://mirrors.rpmfusion.org/mirrorlist?repo=free-fedora-rawhide&arch=x86_64) still have been running on CentOS5 and the old MirrorManager code base. It was running on two systems (DNS load balancing) and was not the most stable setup. Connecting from a country which has been recently added to the GeoIP database let to 100% CPU usage of the httpd process. Which let to a DOS after a few requests. I added a cron entry to restart the httpd server every hour, which seemed to help a bit, but it was a rather clumsy workaround.

It was clear that the two systems need to be updated to something newer and as the new MirrorManager2 code base can luckily handle the data format from the old MirrorManager code base it was possible to update the RPM Fusion mirrorlist servers without updating the MirrorManager back-end (yet).

From now on there are four CentOS7 systems answering the requests for mirrors.rpmfusion.org. As the new RPM Fusion infrastructure is also ansible based I added the ansible files from Fedora to the RPM Fusion infrastructure repository. I had to remove some parts but most ansible content could be reused.

When yum or dnf are now connecting to http://mirrors.rpmfusion.org/mirrorlist?repo=free-fedora-rawhide&arch=x86_64 the answer is created by one of four CentOS7 systems running the latest MirrorManager2 code.

RPM Fusion also has the same mirrorlist access statistics like Fedora: http://mirrors.rpmfusion.org/statistics/.

I still need to update the back-end system which is only one system instead of six different system like in the Fedora infrastructure.

Tagged as : mirrormanager rpmfusion
Protocol Changes In Fedora's MirrorManager

Thu 16 June 2016
By adrian
There have been two protocol related issues with MirrorManager open for some time:
- Drop ftp:// urls from metalinks
- Add a way to specify you want only https urls from metalink
Both issues have been resolved. The first issue, to drop FTP URLs from the metalinks, has been resolved in multiple steps. The first step was to block FTP URLs from being added to Fedora's MirrorManager (Optionally exclude certain protocols from MM, New MirrorManager2 features) and the second step, to remove all remaining FTP URLs from Fedora's MirrorManager, was performed during the last few days and weeks. Using MirrorManager's mirrorlist interface (which is not used very often) only returned FTP if the mirror had no HTTP(S) URLs. So it was already rather unusual to be redirected to a FTP mirror. Using MirrorManager's metalink interface returned all possible URLs for a host. With the removal of all FTP URLs from MirrorManager's database no user should see FTP URLs any more and the problems some clients encoutered (see Drop ftp:// urls from metalinks) should be 'resolved'.

The other issue (Add a way to specify you want only https urls from metalink) has also been solved by adding a protocol option to the mirrorlist and metalink back-end. The new MirrorManager release (0.7.2) which includes these changes is already running on the staging instance and the result can be seen here:
To have more HTTPS based mirrors in our database we scanned all existing public mirrors to see if they also provide HTTPS. With this the number of HTTPS URLs was increased from 24 to over 120.

The option to select which protocol the mirrorlist/metalink mirrors should provide is not yet running on the production instance.
Tagged as : fedora mirrormanager
Lazy Process Migration

Wed 04 May 2016
By adrian
Process Migration

Using CRIU it is possible to checkpoint/save/dump the state of a process into a set of files which can then be used to restore/restart the process at a later point in time. If the files from the checkpoint operation are transferred from one system to another and then used to restore the process, this is probably the simplest form of process migration.

Source system:
- criu dump -D /checkpoint/destination -t PID
- rsync -a /checkpoint/destination destination.system:/checkpoint/destination
Destination system:
- criu restore -D /checkpoint/destination
For large processes the migration duration can be rather long. For a process using 24GB this can lead to migration duration longer than 280 seconds. The limiting factor in most cases is the interconnect between the systems involved in the process migration.

Optimization: Pre-Copy

One existing solution to decrease process downtime during migration is pre-copy. In one or multiple runs the memory of the process is copied from the source to the destination system. With every run only memory pages which have change since the last run have to be transferred. This can lead to situations where the process downtime during migration can be dramatically decreased.

This depends on the type of application which is migrated and especially how often/fast the memory content is changed. In extreme cases it was possible to decrease process downtime during migration for a 24GB process from 280 seconds to 8 seconds with the help of pre-copy.

This approach is basically the same if migrating single processes (or process groups) or virtual machines.

It Always Depends On...

Unfortunately pre-copy optimization can also lead to situations where the so called optimized case with pre-copy can require more time than the unoptimized case:
![alt](https://lisas.de/~adrian/migration-time.png)
In the example above a process has been migrated during three stages of its lifetime and there are situations (state: Calculation) where pre-copy has enormous advantages (14 seconds with pre-copy and 51 seconds without pre-copy) but there are also situations (state: Initialization) where the pre-copy optimization increases the process downtime during migration (40 seconds with pre-copy and 27 seconds without pre-copy). It depends on the memory change rate.

Optimization: Post-Copy

Another approach to reduce the process downtime during migration is post-copy. The required memory pages are not dumped and transferred before restoring the process but on demand. Each time a missing memory page is accessed the migrated process is halted until the required memory pages has been transferred from the source system to the destination system:
![alt](https://lisas.de/~adrian/memory-transfer-after-migration.png)
Thanks to userfaultfd this approach (or optimization) can be now integrated into CRIU. With the help of userfaultfd it is possible to mark memory pages to be handled by userfaultfd. If such a memory page is accessed, the process is halted until the requested page is provided. The listener for the userfaultfd requests is running in user-space and listening on a file descriptor. The same approach has already been implemented for QEMU.

Enough Theory

With all the background information on why and how the initial code to restore processes with userfaultfd support has been merged into the CRIU development branch: criu-dev. This initial implementation of lazy-pages support does not yet support lazy process migration between two hosts, but with the upstream merged patches it is at least possible to checkpoint a process and to restore the process using userfaultfd. A lazy restore consists of two parts. The usual 'criu restore' part and an additional, what we call uffd daemon, 'criu lazy-pages' part. To better demonstrate the advantages of a lazy restore there are patches to enhance crit (CRiu Image Tool) to remove pages which can be restored with userfaultfd from a checkpoint directory. Using a test case which allocates about 200MB of memory (and which writes one byte in each page over and over) requires after being dumped about 200MB. Using the mentioned crit enhancement make-lazy reduces the size of the checkpoint down to 116KB:
```
$ crit make-lazy /tmp/checkpoint/ /tmp/lazy-checkpoint
$ du -hs /tmp/checkpoint/ /tmp/lazy-checkpoint
     201M       /tmp/checkpoint
     116K       /tmp/lazy-checkpoint
```
With this the data which actually has to be transferred during process downtime is drastically reduced and the required memory pages are inserted in the restored process on demand using userfaultfd. Restoring the checkpointed process using lazy-restore would look something like this:

First the uffd daemon:
```
$ criu lazy-pages -D /tmp/checkpoint --address /tmp/userfault.socket
```
And then the actual restore:
```
$ criu restore -D /tmp/lazy-checkpoint --lazy-pages --address /tmp/userfault.socket
```
The socket specified with --address is used to exchange information about the restored process required by the uffd daemon. Once criu restore has done all its magic to restore the process except restoring the lazy memory pages, the process to be restored is actually started and runs until the first userfaultfd handled memory page is accessed. At that point the process hangs and the uffd daemon gets a message to provide the required memory pages. Once the uffd daemon provides the requested memory page, the restored process continues to run until the next page is requested. As potentially not all memory pages are requested, as they might not get accessed for some time, the uffd daemon starts to transfer unrequested memory pages into the restored process so that the uffd daemon can shut down after a certain time.
Tagged as : criu fedora

Page 2 / 5

Process Migration

Optimization: Pre-Copy

It Always Depends On...

Optimization: Post-Copy

Enough Theory