Linux Plumbers Conference 2016

It is a bit late but I still wanted to share my presentations from this year’s Linux Plumbers Conference:

On my way back home I had to stay one night in Albuquerque, and it looks like the hotel needs to upgrade its TV system. It is still running Fedora 10, which has been EOL since 2009-12-18:

Still Fedora 10

Influence which PID will be next

To restore a checkpointed process with CRIU, the process ID (PID) has to be the same as it was during checkpointing. Just before fork()-ing into the new process, CRIU writes the desired PID minus one to /proc/sys/kernel/ns_last_pid so that the kernel assigns the right PID to the restored process.

The same interface (/proc/sys/kernel/ns_last_pid) can also be used from the command line to influence which PID the kernel will use for the next process.

# cat /proc/sys/kernel/ns_last_pid
1626
# echo -n 9999 > /proc/sys/kernel/ns_last_pid
# cat /proc/sys/kernel/ns_last_pid
10000

Writing ‘9999’ (without a newline) to /proc/sys/kernel/ns_last_pid tells the kernel that the next PID should be ‘10000’; the cat command above prints 10000 because the cat process itself already consumed that PID. This only works if no other process is created between the write to /proc/sys/kernel/ns_last_pid and the fork() of the new process. So it is not possible to guarantee which PID the new process will get, but it can be influenced.

There is also a posting which describes how to do the same with C: How to set PID using ns_last_pid
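The core of the C approach is the same as on the command line; the following is a minimal sketch (not the code from that posting) with trimmed error handling. Writing to ns_last_pid requires CAP_SYS_ADMIN, and the race described above still applies:

/* Sketch: claim PID 10000 for the next child by writing 9999 to
 * ns_last_pid right before fork(). Requires CAP_SYS_ADMIN and is
 * only reliable if nothing else forks in between. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    const char *pid_str = "9999";      /* next PID should become 10000 */

    int fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Like 'echo -n': write the number without a trailing newline. */
    if (write(fd, pid_str, strlen(pid_str)) < 0) {
        perror("write");
        close(fd);
        return 1;
    }
    close(fd);

    pid_t pid = fork();
    if (pid == 0) {
        /* Should print 10000 if no other fork() happened in between. */
        printf("child got PID %d\n", getpid());
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}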

Combining pre-copy and post-copy migration

In my last post about CRIU, in May 2016, I mentioned lazy memory transfer as a way to decrease process downtime during migration. Since then, Mike Rapoport’s patches for remote lazy process migration have been merged into CRIU‘s criu-dev branch, as have my patches to combine pre-copy and post-copy migration.

Using pre-copy (criu pre-dump) it has “always” been possible to dump the memory of a process using soft-dirty tracking. criu pre-dump can be run multiple times, and each time only the memory pages that changed since the previous run are written to the checkpoint directory.

Depending on the processes to be migrated and how fast they change their memory, the final dump can still be rather large, which can mean a longer downtime during migration than desired. This is why we started to work on post-copy migration (also known as lazy migration). There are, however, situations where post-copy migration can also increase the process downtime during migration instead of decreasing it.

The latest changes regarding post-copy migration in the criu-dev branch make it possible to combine pre-copy and post-copy migration. The memory pages of the process are pre-dumped using soft-dirty tracking and transferred to the destination while the process on the source machine keeps running. Once the process is actually migrated, everything besides the memory pages is transferred to the destination system. Excluding the memory pages (the remaining ones will be migrated lazily) usually leaves only a few hundred kilobytes to transfer, which reduces the process downtime during migration significantly.

Using criu with pre-copy and post-copy could look like this:

Source system:

# criu pre-dump -D /tmp/cp/1 -t PID
# rsync -a /tmp/cp destination:/tmp
# criu dump -D /tmp/cp/2 -t PID --port 27 --lazy-pages \
  --prev-images-dir ../1/ --track-mem

The first criu command dumps the memory of the process PID and resets the soft-dirty memory tracking. The initial dump is then transferred to the destination system using rsync; during that time the process PID keeps running. The last criu command starts the lazy page mode, which dumps everything besides the memory pages that can be transferred lazily, and then waits for connections over the network on port 27. Only pages which have changed since the last pre-dump are considered for the lazy restore. At this point the process is no longer running and the process downtime begins.

Destination system:

# rsync -a source:/tmp/cp /tmp/
# criu lazy-pages --page-server --address source --port 27 \
  -D /tmp/cp/2 &
# criu restore --lazy-pages -D /tmp/cp/2

Once criu is waiting on port 27 on the source system, the remaining checkpoint images can be transferred from the source system to the destination system (using rsync in this case). Now criu can be started in lazy-pages mode, connecting to the page server on port 27 on the source system. This is the part we usually call the UFFD daemon. The last step is the actual restore (criu restore).

The following diagrams try to visualize what happens during the last step: criu restore.

step1

It all starts with criu restore (on the right). criu does its magic to restore the process: it copies the memory pages from criu pre-dump into the process and marks the lazy pages as being handled by userfaultfd. Once everything is restored, criu jumps into the restored process, and the restored process continues to run where it was when checkpointed. As soon as the process accesses a userfaultfd-marked memory address, it is paused until a memory page (hopefully the correct one) is copied to that address.

step2

The part that we call the UFFD daemon, criu lazy-pages, listens on the userfault file descriptor for messages. As soon as a valid UFFD request arrives, it requests that page via TCP from the source system, where criu is still running in page-server mode. If the page server finds the memory page, it transfers the actual page back to the UFFD daemon on the destination system, which injects the page into the kernel using the same userfault file descriptor it got the page request from. Now that the page which triggered the page fault (in our case a userfault) is in place, the restored process continues to run until another missing page is accessed and the whole procedure starts again.

To be able to shut down the UFFD daemon and the page server at some point, we currently push all not-yet-requested pages into the restored process once there have been no userfaultfd requests for 5 seconds.
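The following is a self-contained sketch of that mechanism, not CRIU’s actual code: a handler thread plays the role of the UFFD daemon but fills faulting pages locally (at the marked line CRIU would instead receive the page from the remote page server over TCP), and after 5 seconds without requests it pushes the remaining pages, as described above. Build with -pthread:

/* Demo only, not CRIU code: serve userfaults for a 4-page area and
 * push the remaining pages after 5 idle seconds. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

#define NR_PAGES 4

static long page_sz;
static char *area;

static void inject_page(int uffd, unsigned long addr)
{
    static char buf[65536];     /* covers one page on common systems */

    /* Here CRIU would receive the page content from the remote
     * page server over TCP; we simply fill it with a pattern. */
    memset(buf, 'A', page_sz);

    struct uffdio_copy copy = {
        .dst = addr & ~(page_sz - 1),
        .src = (unsigned long)buf,
        .len = page_sz,
    };
    /* Fails with EEXIST if the page is already present; ignored. */
    ioctl(uffd, UFFDIO_COPY, &copy);
}

static void *uffd_daemon(void *arg)
{
    int uffd = (int)(long)arg;
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;

    for (;;) {
        /* No userfault for 5 seconds: push all remaining pages
         * so the daemon can shut down. */
        if (poll(&pfd, 1, 5000) == 0) {
            for (int i = 0; i < NR_PAGES; i++)
                inject_page(uffd, (unsigned long)(area + i * page_sz));
            return NULL;
        }
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            continue;
        if (msg.event == UFFD_EVENT_PAGEFAULT)
            inject_page(uffd, (unsigned long)msg.arg.pagefault.address);
    }
}

int main(void)
{
    page_sz = sysconf(_SC_PAGE_SIZE);

    /* Create the userfault file descriptor and negotiate the API. */
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* Map four pages and mark them as handled by userfaultfd. */
    area = mmap(NULL, NR_PAGES * page_sz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area,
                   .len = NR_PAGES * page_sz },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t thr;
    pthread_create(&thr, NULL, uffd_daemon, (void *)(long)uffd);

    /* This access triggers a userfault; the daemon injects the page. */
    printf("first byte: %c\n", area[0]);
    pthread_join(thr, NULL);    /* daemon exits after 5 idle seconds */
    printf("last page:  %c\n", area[(NR_PAGES - 1) * page_sz]);
    return 0;
}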

The whole procedure still offers many possibilities for optimization, but now that we can finally combine pre-copy and post-copy memory migration we are a lot closer to decreasing process downtime during migration.

The next steps are to get support for pre-copy and post-copy into p.haul (Process Hauler) and into different container runtimes which already support migration via criu.

My other recently posted criu-related articles:

Lazy Process Migration

Process Migration

Using CRIU it is possible to checkpoint/save/dump the state of a process into a set of files which can then be used to restore/restart the process at a later point in time. If the files from the checkpoint operation are transferred from one system to another and then used to restore the process, this is probably the simplest form of process migration.

Source system:

  • criu dump -D /checkpoint/destination -t PID
  • rsync -a /checkpoint/destination destination.system:/checkpoint/destination

Destination system:

  • criu restore -D /checkpoint/destination

For large processes the migration can take rather long; for a process using 24GB it can take more than 280 seconds. The limiting factor in most cases is the interconnect between the systems involved in the process migration (24GB in 280 seconds is roughly 90MB/s, in the range of a single gigabit link).

Optimization: Pre-Copy

One existing solution to decrease process downtime during migration is pre-copy. In one or multiple runs the memory of the process is copied from the source to the destination system, and with every run only memory pages which have changed since the last run have to be transferred. This can decrease the process downtime during migration dramatically.

How well it works depends on the type of application being migrated and especially on how often/fast the memory content changes. In extreme cases it was possible to decrease the process downtime during migration for a 24GB process from 280 seconds to 8 seconds with the help of pre-copy.

This approach is basically the same whether migrating single processes (or process groups) or virtual machines.
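With criu, such an iterative pre-copy migration could look like this (a sketch reusing the options from the combined pre-copy/post-copy example above; PID and the paths are placeholders):

# criu pre-dump -D /tmp/cp/1 -t PID
# rsync -a /tmp/cp destination:/tmp
# criu pre-dump -D /tmp/cp/2 -t PID --prev-images-dir ../1/ --track-mem
# rsync -a /tmp/cp destination:/tmp
# criu dump -D /tmp/cp/3 -t PID --prev-images-dir ../2/ --track-mem
# rsync -a /tmp/cp destination:/tmp

Each run with --prev-images-dir writes only the pages that changed since the previous run, so the final criu dump, during which the process is stopped, contains only the last delta. Afterwards the process can be restored on the destination with criu restore -D /tmp/cp/3.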

It Always Depends On…

Unfortunately the pre-copy optimization can also lead to situations where the so-called optimized case with pre-copy requires more time than the unoptimized case:

In the example above a process was migrated during three stages of its lifetime. There are situations (state: Calculation) where pre-copy has enormous advantages (14 seconds with pre-copy, 51 seconds without), but there are also situations (state: Initialization) where the pre-copy optimization increases the process downtime during migration (40 seconds with pre-copy, 27 seconds without). It all depends on the memory change rate.

Optimization: Post-Copy

Another approach to reduce the process downtime during migration is post-copy. The required memory pages are not dumped and transferred before restoring the process but on demand: each time a missing memory page is accessed, the migrated process is halted until the required memory page has been transferred from the source system to the destination system:

Thanks to userfaultfd this approach (or optimization) can now be integrated into CRIU. With the help of userfaultfd it is possible to mark memory pages as being handled by userfaultfd: if such a memory page is accessed, the process is halted until the requested page is provided. The listener for userfaultfd requests runs in user space, listening on a file descriptor. The same approach has already been implemented for QEMU.
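The marking step itself is small; a minimal sketch (demo code, not CRIU’s) of registering a memory range with userfaultfd:

/* Sketch: mark one anonymous page as handled by userfaultfd. Any
 * missing-page access would now block until a user-space handler
 * supplies the page with UFFDIO_COPY (no handler is set up here). */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

int main(void)
{
    long page = sysconf(_SC_PAGE_SIZE);

    /* Create the userfault file descriptor and negotiate the API. */
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0) {
        perror("userfaultfd");
        return 1;
    }

    /* Map one page and register it for missing-page events. */
    void *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        perror("UFFDIO_REGISTER");
        return 1;
    }
    printf("page at %p is now handled by userfaultfd\n", area);
    return 0;
}

A fuller sketch, including the fault-handling side, appears with the combined pre-copy/post-copy post above.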

Enough Theory

With all that background information on the why and how: the initial code to restore processes with userfaultfd support has been merged into the CRIU development branch criu-dev. This initial implementation of lazy-pages support does not yet support lazy process migration between two hosts, but with the merged patches it is at least possible to checkpoint a process and to restore it using userfaultfd. A lazy restore consists of two parts: the usual ‘criu restore’ part and an additional part we call the uffd daemon, ‘criu lazy-pages’.

To better demonstrate the advantages of a lazy restore there are patches to enhance crit (CRiu Image Tool) to remove the pages which can be restored with userfaultfd from a checkpoint directory. A test case which allocates about 200MB of memory (and writes one byte in each page over and over) produces a checkpoint of about 200MB. The mentioned crit enhancement, make-lazy, reduces the size of the checkpoint down to 116KB:

$ crit make-lazy /tmp/checkpoint/ /tmp/lazy-checkpoint
$ du -hs /tmp/checkpoint/ /tmp/lazy-checkpoint
     201M       /tmp/checkpoint
     116K       /tmp/lazy-checkpoint

With this, the data which actually has to be transferred during process downtime is drastically reduced, and the required memory pages are inserted into the restored process on demand using userfaultfd. Restoring the checkpointed process using lazy restore looks something like this:

First the uffd daemon:

$ criu lazy-pages -D /tmp/checkpoint \
--address /tmp/userfault.socket

And then the actual restore:

$ criu restore -D /tmp/lazy-checkpoint \
--lazy-pages --address /tmp/userfault.socket

The socket specified with --address is used to exchange information about the restored process which the uffd daemon requires. Once criu restore has done all its magic to restore the process, except for the lazy memory pages, the restored process starts and runs until the first userfaultfd-handled memory page is accessed. At that point the process hangs and the uffd daemon gets a message to provide the required memory page. Once the uffd daemon provides the requested memory page, the restored process continues to run until the next page is requested. As some memory pages might not be accessed for a long time and would therefore never be requested, the uffd daemon at some point starts to transfer the remaining unrequested memory pages into the restored process, so that it can shut down after a certain time.

Process Migration coming to Fedora 19 (probably)

With the recently approved review of the package crtools in Fedora I have made a feature proposal for checkpoint/restore.

To test checkpoint/restore on Fedora you need to run the current development version of Fedora and install crtools using yum (yum install crtools). Until it is decided whether this will actually be a Fedora 19 feature, and until the necessary changes in the Fedora kernel packages have been implemented, it is necessary to install a kernel which is not in the repository. I have built a kernel in Fedora’s build system which enables the following config options: CHECKPOINT_RESTORE, NAMESPACES, EXPERT.
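In .config terms these are:

CONFIG_EXPERT=y
CONFIG_NAMESPACES=y
CONFIG_CHECKPOINT_RESTORE=y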

A kernel with these changes enabled is available from koji as a scratch build: http://koji.fedoraproject.org/koji/taskinfo?taskID=4899525

After installing this kernel I am able to migrate a process from one Fedora system to another. For my test case I am migrating a UDP ping-pong program (udpp.c) from one system to another while it is communicating with a third system.

udpp

udpp is running in server mode on 129.143.116.10, and on 134.108.34.90 udpp is started in client mode. After a short time I migrate the udpp client, with the help of crtools, to 85.214.67.247. The following is part of the output on the udpp server:


-->

Received ping packet from 134.108.34.90:38374
Data: This is ping packet 6

Sending pong packet 6
<--

-->

Received ping packet from 134.108.34.90:38374
Data: This is ping packet 7

Sending pong packet 7
<--

-->

Received ping packet from 85.214.67.247:38374
Data: This is ping packet 8

Sending pong packet 8
<--

-->

Received ping packet from 85.214.67.247:38374
Data: This is ping packet 9

Sending pong packet 9
<--

So with only small changes to the kernel configuration it is possible to migrate a process by checkpointing and restoring it with the help of crtools.