1. CRIU and SELinux

    When I started to add container migration to Podman using CRIU over a year ago, I did not really think about SELinux. I was able to checkpoint a running container and I was also able to restore it later (https://podman.io/blogs/2018/10/10/checkpoint-restore.html). I never looked at the process labels of the restored containers. But I really should have.

    After my initial implementation of container checkpoint and restore for Podman I started to work on live container migration for Podman in October 2018. I opened the corresponding pull request at the end of January 2019 and immediately started to get SELinux-related failures from the CI.

    Amongst other SELinux denials, the main problem was a blocked connectto.

    avc: denied { connectto } for pid=23569 comm="top" path=002F6372746F6F6C732D70722D3233363139 scontext=system_u:system_r:container_t:s0:c245,c463 tcontext=unconfined_u:system_r:container_runtime_t:s0-s0:c0.c1023 tclass=unix_stream_socket permissive=0

    This is actually a really interesting denial, because it gives away details about how CRIU works. This denial was caused by a container running top (podman run -d alpine top) which I tried to checkpoint.

    To understand why a denial like this is reported by top it helps to understand how CRIU works. To be able to access all resources of the process CRIU tries to checkpoint (or dump), CRIU injects parasite code into the process. The parasite code allows CRIU to act from within the process's address space. Once the parasite is injected and running it connects to the main CRIU process and is ready to receive commands.

    The parasite's attempt to connect to the main CRIU process is exactly the step SELinux is blocking. Looking at the denial, a process top running as system_u:system_r:container_t:s0:c245,c463 is trying to connectto a socket labeled as unconfined_u:system_r:container_runtime_t:s0-s0:c0.c1023, which is indeed suspicious: something running in a container tries to connect to something running outside of the container. Knowing that this is CRIU and how CRIU works, however, it is clear that the parasite code has to connect to the main CRIU process using connectto.

    Fortunately SELinux has the necessary interface to solve this: setsockcreatecon(3). Using setsockcreatecon(3) it is possible to specify the context of newly created sockets. So all we have to do is get the context of the process to checkpoint and tell SELinux to label newly created sockets accordingly (8eb4309). Once I understood the problem, that part was easy. Unfortunately this is also where the whole thing got really complicated.
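
    As a rough illustration of that approach (a minimal sketch, not the actual CRIU code; getpidcon(3), setsockcreatecon(3) and freecon(3) are the real libselinux calls, the surrounding function is made up):

    #include <sys/types.h>
    #include <selinux/selinux.h>
    #include <stdio.h>

    /* Label all sockets we are about to create with the context of the
     * process we want to checkpoint, so that the parasite's connect back
     * to CRIU is allowed by SELinux. */
    static int set_sockcreate_from_pid(pid_t pid)
    {
        char *ctx = NULL;

        if (getpidcon(pid, &ctx) < 0) {      /* context of the target process */
            perror("getpidcon");
            return -1;
        }
        if (setsockcreatecon(ctx) < 0) {     /* label new sockets accordingly */
            perror("setsockcreatecon");
            freecon(ctx);
            return -1;
        }
        freecon(ctx);
        /* sockets created from now on carry the target's context;
         * setsockcreatecon(NULL) resets the behavior to the default */
        return 0;
    }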

    The CRIU RPM package in Fedora is built without SELinux support, because CRIU's SELinux support has been limited and untested until now. It used to consist of a single check: if the process context did not start with unconfined_, CRIU simply refused to dump the process and exited. Being unaware of SELinux, a process restored with CRIU was no longer running with the context it was started with but with the context CRIU had during the restore. So if a container was running with a context like system_u:system_r:container_t:s0:c248,c716 during checkpointing, it was running with the wrong context after the restore: unconfined_u:system_r:container_runtime_t:s0, which is the context of the container runtime and not of the actual container process.

    So first I had to fix CRIU's SELinux handling to be able to use setsockcreatecon(3). Fortunately, once I understood the problem, it was pretty easy to fix CRIU's SELinux process labeling. Most of the LSM code in CRIU was written by Tycho in 2015 with a focus on AppArmor, which luckily uses the same interfaces as SELinux. So all I had to do was remove the restrictions on which SELinux contexts CRIU is willing to operate on and make sure that CRIU stores the information about the process context in its image files (796da06).

    Once the next CRIU release with these patches is available I will have to add BuildRequires: libselinux-devel to Fedora's CRIU package to build it with SELinux support. This, however, means that CRIU users on Fedora might see SELinux errors they have not seen before. CRIU now needs SELinux policies which allow it to change the SELinux context of a running process. For the Podman use case which started all of this, there has been the corresponding change in container-selinux to allow container_runtime_t to dyntransition to container domains.

    For CRIU use cases outside of containers additional policies have been created which are also used by the new CRIU ZDTM test case selinux00. A new boolean exists which allows CRIU to use "setcon to dyntrans to any process type which is part of domain attribute". So with setsebool -P unconfined_dyntrans_all 1 it should be possible to use CRIU on Fedora just like before.

    After I included all those patches and policies into Podman's CI, almost all checkpoint/restore related tests were successful, except one test which checked whether it is possible to checkpoint and restore a container with established TCP connections. In this test case a container with Redis is started, a connection to Redis is opened and the container is checkpointed and restored. This was still failing in CI, which was interesting as it seemed unrelated to SELinux.

    Trying to reproduce the test case locally I actually saw the following SELinux errors during restore:

    audit: SELINUX_ERR op=security_bounded_transition seresult=denied oldcontext=unconfined_u:system_r:container_runtime_t:s0 newcontext=system_u:system_r:container_t:s0:c218,c449

    This was unusual as it did not look like something that could be fixed with a policy.

    The reason my test case for checkpointing and restoring containers with established TCP connections failed was not the fact that it tests established TCP connections, but the fact that the process in the container is multi-threaded. Looking at the SELinux kernel code I found the following comment in security/selinux/hooks.c:

    /* Only allow single threaded processes to change context */

    This line has been unchanged since 2008, so it seemed unlikely that SELinux could be changed in such a way that each thread could be labeled separately. My first attempt to solve this was to change the process label with setcon(3) before CRIU forks for the first time. This kind of worked, but at the same time created lots of SELinux denials (over 50), because during restore CRIU changes itself and the processes it forks into the process it wants to restore. So instead of changing the process label just before forking for the first time, I switched to setting the process label just before CRIU creates all threads (e86c2e9).
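
    Conceptually the ordering constraint looks like the following minimal sketch (setcon(3) is the real libselinux call; the example context and the placeholder thread are made up, and the actual CRIU code obviously does much more):

    #include <selinux/selinux.h>
    #include <pthread.h>
    #include <stdio.h>

    static void *thread_fn(void *arg) { return NULL; }   /* placeholder thread */

    int main(void)
    {
        pthread_t t;

        /* Must happen while the process is still single threaded: the
         * kernel only allows single threaded processes to change context. */
        if (setcon("system_u:system_r:container_t:s0:c218,c449") < 0) {
            perror("setcon");
            return 1;
        }

        /* Threads created from now on run with the new context. */
        pthread_create(&t, NULL, thread_fn, NULL);
        pthread_join(t, NULL);
        return 0;
    }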

    Setting the context just before creating the threads resulted in only two SELinux denials. The first is about CRIU accessing the log file during restore, which is not critical, and the other denial happens when CRIU tries to influence the PID of the threads it wants to create via /proc/sys/kernel/ns_last_pid. As CRIU is now running in the SELinux context of the container to be restored, and to avoid allowing the container to access all files which are labeled as sysctl_kernel_t, Fedora's selinux-policy contains a patch to label /proc/sys/kernel/ns_last_pid as sysctl_kernel_ns_last_pid_t.

    So with the latest CRIU and selinux-policy installed, and the following addition to my local SELinux policy (kernel_rw_kernel_ns_lastpid_sysctl(container_domain)), I can now checkpoint and restore a Podman container (even a multi-threaded one) with the correct SELinux process context after the restore, and no SELinux denials block the checkpointing or restoring of the container. There are still a few SELinux denials, mainly related to not being able to write to the log files, but those do not interfere with checkpointing and restoring.

    For some time (two or three years) I was aware that CRIU had never been verified to work correctly with SELinux, but I always ignored it; I should have just fixed it a long time ago. Without the CRIU integration into Podman, however, I would not have been able to test my changes as thoroughly as I did.

    I would like to thank Radostin for his feedback and ideas when I was stuck and his overview of the necessary CRIU changes, Dan for his help in adapting the container-selinux package to CRIU's needs and Lukas for the necessary changes to Fedora's selinux-policy package to make CRIU work with SELinux on Fedora. All these combined efforts made it possible to have the necessary policies and code changes ready to support container migration with Podman.

    Tagged as : criu podman selinux fedora
  2. Animated Download Maps

    Our mirror server has been generating download maps for almost 10 years (since August 2009). This is done by going through all our download log files (HTTP, FTP, RSYNC) and using GeoIP and the Matplotlib Basemap Toolkit to draw maps of where our mirror server is being accessed from.

    I have taken the output from almost ten years and created the following animations. The first animation shows clients accessing all mirrored content:

    As the mirror server is running Fedora it is upgraded once a year, which can also bring a new version of Basemap. The upgrade usually happens in December or January, which can sometimes be seen in the animation when the output changes. Updating to Fedora 27 (December 2017) brought a Basemap version which started to draw different results, and the last update to Fedora 29 (December 2018) can also be seen, as switching to Python 3 removed most of the clients from the map (only visible in the last second of the animation). It seems some of the calculations give different results in Python 3.

    In addition to the map showing the accesses for all mirrored data, there is also an animation for clients accessing files from our Fedora mirror:

    The interesting thing about only looking at clients accessing Fedora files is that most accesses are actually from Europe. This seems to indicate that Fedora's mirroring system partially succeeds in directing clients to nearby mirrors. Looking at the location of clients accessing our EPEL mirror, it seems to work even better, which is probably related to the much larger number of existing EPEL mirrors:

    Another interesting effect of upgrading once a year can be seen around 6:42 in the EPEL animation. After upgrading to Fedora 25 the generated maps were upside down for a few days until I was able to fix it.

    Tagged as : fedora traffic
  3. Nextcloud in a Container

    After using Podman a lot during the last weeks while adding checkpoint/restore support to it, I was finally ready to use containers in production on our mirror server. We were still running the ownCloud version that came via RPMs in Fedora 27, and it seems that many people have since moved on to Nextcloud installed from tarballs.

    One of the main reasons to finally use containers is Podman's daemonless approach.

    The first challenge while moving from ownCloud 9.1.5 to Nextcloud 14 was the actual upgrade. To make sure it works I first made a copy of all the uploaded files and of the database and did a test upgrade yesterday using a CentOS 7 VM. With PHP 7 from Software Collections it was not a real problem. It took some time, but it worked. I used the included upgrade utility to upgrade from ownCloud 9 to Nextcloud 10, to Nextcloud 11, to Nextcloud 12, to Nextcloud 13, to Nextcloud 14. Lots of upgrades. Once I verified that everything was still functional I did it once more, but this time with the real data and with access to our ownCloud instance disabled.

    The next step was to start the container. I decided to use the nextcloud:fpm container as I was planning to use the existing web server to proxy the requests. The one thing which makes using containers on our mirror server a bit difficult is that it is not possible to use any iptables NAT rules. At some point there were just too many network connections in the NAT table from all the clients connecting to our mirror server, and it started to drop connections. This problem has probably been fixed for a long time, but it used to be a problem and I try to avoid it. That is why my Nextcloud container is using the host network namespace:

    podman run --name nextcloud-fpm -d --net host \
      -v /home/containers/nextcloud/html:/var/www/html \
      -v /home/containers/nextcloud/apps:/var/www/html/custom_apps \
      -v /home/containers/nextcloud/config:/var/www/html/config \
      -v /home/containers/nextcloud/data:/var/www/html/data \
      nextcloud:fpm
    

    I was reusing my existing config.php in which the connection to PostgreSQL on 127.0.0.1 was still configured.

    Once the container was running I just had to add the proxy rules to the Apache HTTP Server and it should have been ready. Unfortunately this was not as easy as I hoped it would be. All the documentation I found is about using the Nextcloud FPM container with NGINX; I found nothing about Apache's HTTPD. The following lines took up most of the time of the whole Nextcloud upgrade project:

    <FilesMatch \.php.*>
       SetHandler proxy:fcgi://127.0.0.1:9000/
       ProxyFCGISetEnvIf "reqenv('REQUEST_URI') =~ m|(/owncloud/)(.*)$|" SCRIPT_FILENAME "/var/www/html/$2"
       ProxyFCGISetEnvIf "reqenv('REQUEST_URI') =~ m|^(.+\.php)(.*)$|" PATH_INFO "$2"
    </FilesMatch>
    

    I hope these lines are actually correct, but so far all clients connecting to it seem to be happy. To have the Nextcloud container automatically start on system startup I based my systemd podman service file on the one from the Intro to Podman article.

    [Unit]
    Description=Custom Nextcloud Podman Container
    After=network.target
    
    [Service]
    Type=simple
    TimeoutStartSec=5m
    ExecStartPre=-/usr/bin/podman rm nextcloud-fpm
    
    ExecStart=/usr/bin/podman run --name nextcloud-fpm --net host \
       -v /home/containers/nextcloud/html:/var/www/html \
       -v /home/containers/nextcloud/apps:/var/www/html/custom_apps \
       -v /home/containers/nextcloud/config:/var/www/html/config \
       -v /home/containers/nextcloud/data:/var/www/html/data \
       nextcloud:fpm
    
    ExecReload=/usr/bin/podman stop nextcloud-fpm
    ExecReload=/usr/bin/podman rm nextcloud-fpm
    ExecStop=/usr/bin/podman stop nextcloud-fpm
    Restart=always
    RestartSec=30
    
    [Install]
    WantedBy=multi-user.target
    
    Tagged as : fedora nextcloud podman
  4. Antimatter Factory

    On October 19th, 2018, I was giving a talk about OpenHPC at the CentOS Dojo at CERN.

    I really liked the whole event and my talk was also recorded. Thanks to everyone involved for organizing it. The day before FOSDEM 2019 there will be another CentOS Dojo in Brussels. I hope I have the chance to attend that one as well.

    The most interesting thing during my two days in Geneva was, however, the visit of the Antimatter Factory:

    Antimatter Factory

    Assuming I actually understood anything we were told about it, it is exactly that: an antimatter factory.

    Tagged as : fedora centos openhpc
  5. S3 sleep with ThinkPad X1 Carbon 6th Generation

    For a few weeks now I have had the new ThinkPad X1 Carbon 6th Generation and, like many people, I really like it.

    The biggest problem is that suspend does not work as expected.

    The issue seems to be that the X1 is using a new suspend technology called "Windows Modern Standby," or S0i3, and has removed classic S3 sleep.[1]

    Following the instructions in Alexander's article it was possible to get S3 suspend to work as expected and everything was perfect.

    With the latest firmware update to 0.1.28 (installed using sudo fwupdmgr update; thanks a lot to the Linux Vendor Firmware Service (LVFS) that this works!) I checked whether the patch mentioned in Alexander's article still applies, and it did not.

    So I modified the patch to apply again and made it available here: https://lisas.de/~adrian/X1C6_S3_DSDT_0_1_28.patch

    Talking with Christian about it he mentioned an easier way to include the changed ACPI table into grub. For my Fedora system this looks like this:

    • cp dsdt.aml /boot/efi/EFI/fedora/
    • echo 'acpi $prefix/dsdt.aml' > /boot/efi/EFI/fedora/custom.cfg

    Thanks to Alexander and Christian I can correctly suspend my X1 again.

    Update 2018-09-09: Lenovo fixed the BIOS and everything described above is no longer necessary with version 0.1.30. Also see https://brauner.github.io/2018/09/08/thinkpad-6en-s3.html

    Tagged as : fedora X1
  6. archive.rpmfusion.org

    After many years the whole RPM Fusion repository has grown to over 320GB. There have been occasional requests to move the unsupported releases to an archive, just like Fedora handles its mirror setup, but until last week this did not happen.

    As of now we have moved all unsupported releases (EL-5, Fedora 8 - 25) to our archive (http://archive.rpmfusion.org/) and clients are now being redirected to the new archive system. The archive consists of 260GB which means we can reduce the size mirrors need to carry by more than 75%.

    From a first look at the archive logs, the amount of data requested by all clients for the archived releases is only about 30GB per day. Those 30GB are downloaded via over 350000 HTTP requests, and over 98% of those requests are downloading the repository metadata only (repomd.xml, *filelist*, *primary*, *comps*).

  7. Lazy Migration in CRIU's master branch

    For almost two years Mike Rapoport and I have been working on lazy process migration. Lazy process migration (or post-copy migration) is a technique to decrease the process or container downtime during the live migration. I described the basic functionality in the following previous articles:

    Those articles are not 100% correct anymore as we changed some of the parameters during the last two years, but the concepts stayed the same.

    Mike and I started about two years ago to work on it and the latest CRIU release (3.5) includes the possibility to use lazy migration. Now that the post-copy migration feature has been merged from the criu-dev branch to the master branch it is part of the normal CRIU releases.

    With CRIU's 3.5 release lazy migration can be used on any kernel which supports userfaultfd. I already updated the CRIU packages in Fedora to 3.5 so that lazy process migration can be used just by installing the latest CRIU packages with dnf (still in the testing repository right now).

    More information about container live migration in our upcoming Open Source Summit Europe talk: Container Migration Around The World.

    My pull request to support lazy migration in runC was also recently merged, so that it is now possible to migrate containers using pre-copy migration and post-copy migration. Both can also be combined.

    Another interesting change about CRIU is that it started as x86_64 only and now it is also available on aarch64, ppc64le and s390x. The support to run on s390x has just been added with the previous 3.4 release and starting with Fedora 27 the necessary kernel configuration options are also active on s390x in addition to the other supported architectures.

    Tagged as : criu fedora
  8. Influence which PID will be the next

    To restore a checkpointed process with CRIU the process ID (PID) has to be the same it was during checkpointing. CRIU uses /proc/sys/kernel/ns_last_pid to set the PID to one lower than that of the process to be restored just before fork()-ing into the new process.

    The same interface (/proc/sys/kernel/ns_last_pid) can also be used from the command-line to influence which PID the kernel will use for the next process.

    # cat /proc/sys/kernel/ns_last_pid
    1626
    # echo -n 9999 > /proc/sys/kernel/ns_last_pid
    # cat /proc/sys/kernel/ns_last_pid
    10000
    

    Writing '9999' (without a newline) to /proc/sys/kernel/ns_last_pid tells the kernel that the next PID should be '10000'. This only works if no other process is created between writing to /proc/sys/kernel/ns_last_pid and forking the new process. So it is not possible to guarantee which PID the new process will get, but it can be influenced.

    There is also a posting which describes how to do the same with C: How to set PID using ns_last_pid
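
    A minimal sketch of the same idea in C could look like this (assuming sufficient privileges to write to /proc/sys/kernel/ns_last_pid; the usual caveat applies that another process might fork in between):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *want = "9999";          /* next PID should then be 10000 */
        int fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, want, strlen(want)) < 0) {
            perror("write");
            close(fd);
            return 1;
        }
        close(fd);

        pid_t child = fork();               /* hopefully gets PID 10000 */
        if (child == 0) {
            printf("child PID: %d\n", getpid());
            _exit(0);
        }
        waitpid(child, NULL, 0);
        return 0;
    }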

    Tagged as : criu fedora
  9. Combining pre-copy and post-copy migration

    In my last post about CRIU in May 2016 I mentioned lazy memory transfer to decrease process downtime during migration. Since May 2016 Mike Rapoport's patches for remote lazy process migration have been merged into CRIU's criu-dev branch as well as my patches to combine pre-copy and post-copy migration.

    Using pre-copy (criu pre-dump) it has "always" been possible to dump the memory of a process using soft-dirty-tracking. criu pre-dump can be run multiple times and each time only the changed memory pages will be written to the checkpoint directory.
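
    For the curious: the soft-dirty mechanism itself is exposed via /proc. Writing 4 to /proc/PID/clear_refs resets the soft-dirty bits, and bit 55 of each 64-bit /proc/PID/pagemap entry reports whether a page has been written to since. A minimal sketch of how this can be read from user space (not how criu itself is structured):

    #include <sys/types.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Returns 1 if the page containing addr in process pid was written to
     * since the soft-dirty bits were last cleared, 0 if not, -1 on error. */
    static int page_is_soft_dirty(pid_t pid, unsigned long addr)
    {
        char path[64];
        uint64_t entry;
        long page_size = sysconf(_SC_PAGESIZE);
        int fd, ret;

        snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* pagemap holds one 64-bit entry per virtual page */
        if (pread(fd, &entry, sizeof(entry), (addr / page_size) * 8) != sizeof(entry))
            ret = -1;
        else
            ret = (entry >> 55) & 1;         /* bit 55: soft-dirty */

        close(fd);
        return ret;
    }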

    Depending on the processes to be migrated and how fast they are changing their memory, this can still lead to a situation where the final dump can be rather large, which can mean a longer downtime during migration than desired. This is why we started to work on post-copy migration (also known as lazy migration). There are, however, situations where post-copy migration can also increase the process downtime during migration instead of decreasing it.

    The latest changes regarding post-copy migration in the criu-dev branch offer the possibility to combine pre-copy and post-copy migration. The memory pages of the process are pre-dumped using soft-dirty-tracking and transferred to the destination while the process on the source machine keeps on running. Once the process is actually migrated to the destination system everything besides the memory pages is transferred to the destination system. Excluding the memory pages (as the remaining memory pages will be migrated lazily) usually only a few hundred kilobytes have to be transferred which reduces the process downtime during migration significantly.

    Using criu with pre-copy and post-copy could look like this:

    Source system:

    # criu pre-dump -D /tmp/cp/1 -t PID
    # rsync -a /tmp/cp destination:/tmp
    # criu dump -D /tmp/cp/2 -t PID --port 27 --lazy-pages \
      --prev-images-dir ../1/ --track-mem
    

    The first criu command dumps the memory of the process PID and resets the soft-dirty memory tracking. The initial dump is then transferred using rsync to the destination system. During that time the process PID keeps on running. The last criu command starts the lazy page mode which dumps everything besides memory pages which can be transferred lazily and waits for connections over the network on port 27. Only pages which have changed since the last pre-dump are considered for the lazy restore. At this point the process is no longer running and the process downtime starts.

    Destination system:

    # rsync -a source:/tmp/cp /tmp/
    # criu lazy-pages --page-server --address source --port 27 \
      -D /tmp/cp/2 &
    # criu restore --lazy-pages -D /tmp/cp/2
    

    Once criu is waiting on port 27 on the source system the remaining checkpoint images can be transferred from the source system to the destination system (using rsync in this case). Now criu can be started in lazy-pages mode connecting to the page server on port 27 on the source system. This is the part we usually call the UFFD daemon. The last step is the actual restore (criu restore).

    The following diagrams try to visualize what happens during the last step: criu restore.

    step1

    It all starts with criu restore (on the right). criu does its magic to restore the process, copies the memory pages from criu pre-dump into the process and marks the lazy pages as being handled by userfaultfd. Once everything is restored, criu jumps into the restored process and the restored process continues to run where it was when checkpointed. As soon as the process accesses a userfaultfd-marked memory address, it is paused until a memory page (hopefully the correct one) is copied to that address.

    step2

    The part that we call the UFFD daemon or criu lazy-pages listens on the userfault file descriptor for a message and as soon as a valid UFFD request arrives it requests that page from the source system via TCP where criu is still running in page-server mode. If the page-server finds that memory page it transfers the actual page back to the destination system to the UFFD daemon which injects the page into the kernel using the same userfault file descriptor it previously got the page request from. Now that the page which initially triggered the page-fault or in our case userfault is at its place the restored process continues to run until another missing page is accessed and the whole procedure starts again.
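
    Stripped of all CRIU specifics, the core of such a daemon is a small loop around the userfault file descriptor. The following sketch only uses the plain userfaultfd(2)/UFFDIO_COPY interface to show the mechanism and is not CRIU's implementation; fetching the page from the page-server is left as a comment:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>

    /* uffd:      userfault file descriptor registered for the lazy memory
     *            areas of the restored process
     * page_buf:  one page worth of data fetched from the page-server
     * page_size: system page size */
    static void handle_one_fault(int uffd, void *page_buf, long page_size)
    {
        struct uffd_msg msg;
        struct uffdio_copy copy;

        /* blocks until the restored process touches a lazy page */
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            return;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            return;

        /* ... here the real daemon requests the faulting page from the
         * page-server over TCP and places it into page_buf ... */

        memset(&copy, 0, sizeof(copy));
        copy.dst = msg.arg.pagefault.address & ~(page_size - 1);
        copy.src = (unsigned long)page_buf;
        copy.len = page_size;

        /* inject the page; the faulting thread resumes afterwards */
        ioctl(uffd, UFFDIO_COPY, &copy);
    }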

    To be able to remove the UFFD daemon and the page-server at some point we currently push all unused pages into the restored process if there are no further userfaultfd requests for 5 seconds.

    The whole procedure still has a lot of possibilities for optimization but now that we finally can combine pre-copy and post-copy memory migration we are a lot closer to decreasing process downtime during migration.

    The next steps are to get support for pre-copy and post-copy into p.haul (Process Hauler) and into different container runtimes which already support migration via criu.

    My other recently posted criu related articles:

    Tagged as : criu fedora
  10. PXCAB

    A long time ago (2007 or 2008) I was developing firmware for Cell processor based systems. Most of the Slimline Open Firmware (SLOF) has been released and is also available in Fedora as firmware for QEMU: SLOF.

    One of the systems we have been developing firmware for was a PCI Express card called PXCAB. The processor on this PCI Express card was not the original Cell processor but the newer PowerXCell 8i which has a much better double precision floating point performance. A few weeks ago I was able to get one of those PCI Express cards in a 1U chassis:

    PXCAB

    This chassis was designed to hold two PXCABs: one running in root complex mode and the other in endpoint mode. That way one card was the host system and the other the PCI express connected device. This single card is now running in root complex mode.

    I can boot a kernel either via TFTP or from the flash. As writing the flash takes some time I am booting it right now via TFTP. Compiling the latest kernel from git for PPC64 is no problem thanks to the available cross compiler (gcc-powerpc64-linux-gnu.x86_64): make CROSS_COMPILE=powerpc64-linux-gnu- ARCH=powerpc.

    The more difficult part was to compile the user space tools, but fortunately I was able to compile them natively on a PPC64 system. With this minimal busybox based system I can boot the machine and chroot into a Fedora 24 NFS mount.

    I was trying to populate a directory with a minimal PPC64 based Fedora 24 system with the following command:

    dnf --setopt arch=ppc64 --installroot $PWD/ppc64 install dnf --releasever 24

    Unfortunately that does not work as there currently seems to be no way to tell dnf to install the packages for another architecture. I was able to download a few RPMs and directly install them with rpm using the option --ignorearch. In the end I also installed the data for the chroot on my PPC64 system as that was faster and easier.

    Now I can boot the PXCAB via TFTP into the busybox based ramdisk and from there I can chroot in to the NFS mounted Fedora 24 system.

    The system has one CPU with two threads and 4GB of RAM. In addition to the actual RAM there is also 256MB of memory which can be accessed as a block device using the axonram driver. My busybox based ramdisk is copied to that block device, thus freeing some more actual RAM:

    # df -h
    Filesystem         Size    Used Available Use% Mounted on
    /dev/axonram0    247.9M   15.6M    219.5M   7% /
    

    System information from the firmware:

    SYSTEM INFORMATION
     Processor  = PowerXCell DD1.0 @ 2800 MHz
     I/O Bridge = Cell BE companion chip DD3.0
     Timebase   = 14318 kHz (external)
     Config     = SMP disabled
     SMP Size   = 1 (2 threads)
     Boot-Date  = 2016-07-21 19:37
     Memory     = 4096MB (CPU0: 4096MB)
    
