When I started to integrate container migration into Podman using CRIU over a year ago, I did not really think about SELinux. I was able to checkpoint a running container and restore it later (https://podman.io/blogs/2018/10/10/checkpoint-restore.html). I never looked at the process labels of the restored containers. But I really should have.

After my initial implementation of container checkpoint and restore for Podman, I started to work on live container migration for Podman in October last year (2018). I opened the corresponding pull request at the end of January 2019 and immediately started to get SELinux-related failures from the CI.

Amongst other SELinux denials, the main SELinux-related problem was a blocked connectto:

avc: denied { connectto } for pid=23569 comm="top" path=002F6372746F6F6C732D70722D3233363139 scontext=system_u:system_r:container_t:s0:c245,c463 tcontext=unconfined_u:system_r:container_runtime_t:s0-s0:c0.c1023 tclass=unix_stream_socket permissive=0

This is actually a really interesting denial, because it gives away details about how CRIU works. This denial was caused by a container running top (podman run -d alpine top) which I tried to checkpoint.
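
One detail of the denial is easy to overlook: the path field is hex-encoded, which the audit subsystem does when a value contains unprintable bytes. The following is a minimal Python sketch that decodes the value from the denial above. The function name decode_audit_path is made up for this illustration, and showing the leading NUL byte as "@" is only the usual convention for abstract UNIX sockets.

```python
def decode_audit_path(hex_path: str) -> str:
    """Decode a hex-encoded audit 'path' field into readable text."""
    raw = bytes.fromhex(hex_path)
    # A leading NUL byte marks an abstract UNIX socket; display it as '@'.
    if raw.startswith(b"\x00"):
        return "@" + raw[1:].decode("ascii", errors="replace")
    return raw.decode("ascii", errors="replace")


# The value is copied from the AVC denial above.
print(decode_audit_path("002F6372746F6F6C732D70722D3233363139"))
```

The decoded name is an abstract socket whose name contains "crtools", which, as far as I know, is the former name of the CRIU tool. So the denial is about the socket the parasite code uses to talk to CRIU.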

To understand why a denial like this is reported by top it helps to understand how CRIU works. To be able to access all resources of the process CRIU tries to checkpoint (or dump), CRIU injects parasite code into the process. The parasite code allows CRIU to act from within the process’s address space. Once the parasite is injected and running it connects to the main CRIU process and is ready to receive commands.

The parasite’s attempt to connect to the main CRIU process is exactly the step SELinux is blocking. Looking at the denial it seems that a process top running as system_u:system_r:container_t:s0:c245,c463 is trying to connectto a socket labeled as unconfined_u:system_r:container_runtime_t:s0-s0:c0.c1023, which is indeed suspicious: something running in a container tries to connect to something running on the outside of the container. Knowing that this is CRIU and knowing how CRIU works it is, however, required that the parasite code connects to the main process using connectto.
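
To compare the two labels from the denial it helps to split them into their user, role, type and level fields. This is only a sketch for reading the denial, not something CRIU does; the names SELinuxContext and parse_context are made up for this example.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(context: str) -> SELinuxContext:
    """Split a context string into user, role, type and level."""
    # The level may itself contain ':' (for example "s0:c245,c463"),
    # so split only on the first three separators.
    user, role, type_, level = context.split(":", 3)
    return SELinuxContext(user, role, type_, level)


# Both labels are copied from the AVC denial.
source = parse_context("system_u:system_r:container_t:s0:c245,c463")
target = parse_context("unconfined_u:system_r:container_runtime_t:s0-s0:c0.c1023")
print(source.type, "->", target.type)
```

The type field is the interesting part: a process of type container_t tries to reach a socket of type container_runtime_t.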

Fortunately SELinux has the necessary interface to solve this: setsockcreatecon(3). Using setsockcreatecon(3) it is possible to specify the context of newly created sockets. So all we have to do is get the context of the process to checkpoint and tell SELinux to label newly created sockets accordingly (8eb4309). Once understood that was easy. Unfortunately this is also where the whole thing got really complicated.

The CRIU RPM package in Fedora is built without SELinux support, because CRIU’s SELinux support until now was limited and not tested. CRIU’s SELinux support used to be: if the process context does not start with unconfined_, CRIU just refuses to dump the process and exits. Being unaware of SELinux, a process restored with CRIU was no longer running with the context it was started with but with the context of CRIU during the restore. So if a container was running with a context like system_u:system_r:container_t:s0:c248,c716 during checkpointing, it was running with the wrong context after restore: unconfined_u:system_r:container_runtime_t:s0, which is the context of the container runtime and not of the actual container process.

So first I had to fix CRIU’s SELinux handling to be able to use setsockcreatecon(3). Fortunately, once I understood the problem, it was pretty easy to fix CRIU’s SELinux process labeling. Most of the LSM code in CRIU was written by Tycho in 2015 with a focus on AppArmor, which luckily uses the same interfaces as SELinux. So all I had to do was remove the restrictions on which SELinux context CRIU is willing to operate on and make sure that CRIU stores the information about the process context in its image files (796da06).

Once the next CRIU release with these patches included is available I have to add BuildRequires: libselinux-devel to the RPM to build Fedora’s CRIU package with SELinux support. This, however, means that CRIU users on Fedora might see SELinux errors they have not seen before. CRIU now needs SELinux policies which allow CRIU to change the SELinux context of a running process. For the Podman use case which started all of this there has been the corresponding change in container-selinux to allow container_runtime_t to dyntransition to container domains.

For CRIU use cases outside of containers additional policies have been created which are also used by the new CRIU ZDTM test case selinux00. A new boolean exists which allows CRIU to use “setcon to dyntrans to any process type which is part of domain attribute”. So with setsebool -P unconfined_dyntrans_all 1 it should be possible to use CRIU on Fedora just like before.

After I included all those patches and policies into Podman’s CI almost all checkpoint/restore related tests were successful. Except one test which was testing if it is possible to checkpoint and restore a container with established TCP connections. In this test case a container with Redis is started, a connection to Redis is opened and the container is checkpointed and restored. This was still failing in CI which was interesting as this seemed unrelated to SELinux.

Trying to reproduce the test case locally I actually saw the following SELinux errors during restore:

audit: SELINUX_ERR op=security_bounded_transition seresult=denied oldcontext=unconfined_u:system_r:container_runtime_t:s0 newcontext=system_u:system_r:container_t:s0:c218,c449

This was unusual as it did not look like something that could be fixed with a policy.

The reason my test case for checkpointing and restoring containers with established TCP connections failed was not the fact that it is testing established TCP connections, but the fact that it is a multithreaded process. Looking at the SELinux kernel code I found the following comment in security/selinux/hooks.c:

/* Only allow single threaded processes to change context */

This line is unchanged since 2008 so it seemed unlikely that it would be possible to change SELinux in such a way that it would be possible to label each thread separately. My first attempt to solve this was to change the process label with setcon(3) before CRIU forks the first time. This kind of worked but at the same time created lots of SELinux denials (over 50), because during restore CRIU changes itself and the forks it creates into the process it wants to restore. So instead of changing the process label just before forking the first time I switched to setting the process label just before CRIU creates all threads (e86c2e9).

Setting the context just before creating the threads resulted in only two SELinux denials. The first is about CRIU accessing the log file during restore, which is not critical, and the other denial happens when CRIU tries to influence the PID of the threads it wants to create via /proc/sys/kernel/ns_last_pid. CRIU is now running in the SELinux context of the container to be restored when it does this. To avoid allowing the container to access all files which are labeled as sysctl_kernel_t, Fedora’s selinux-policy contains a patch to label /proc/sys/kernel/ns_last_pid as sysctl_kernel_ns_last_pid_t.

So with the latest CRIU and selinux-policy installed and the following addition to my local SELinux policy (kernel_rw_kernel_ns_lastpid_sysctl(container_domain)) I can now checkpoint and restore a Podman container (even multi-threaded) with the correct SELinux process context after a restore and no further SELinux denials blocking the checkpointing or restoring of the container. There are a few SELinux denials which are mainly related to not being able to write to the log files. Those denials, however, do not interfere with the checkpoint and restoring.

For some time (two or three years) I was aware that CRIU was never verified to work correctly with SELinux, but I always ignored it and I should have just fixed it a long time ago. Without the CRIU integration into Podman, however, I would not have been able to test my changes as I was able to do.

I would like to thank Radostin for his feedback and ideas when I was stuck and his overview of the necessary CRIU changes, Dan for his help in adapting the container-selinux package to CRIU’s needs and Lukas for the necessary changes to Fedora’s selinux-policy package to make CRIU work with SELinux on Fedora. All these combined efforts made it possible to have the necessary policies and code changes ready to support container migration with Podman.

I’ve connected the Arduino pro mini (328/5V) to my pcb. Of course it’s not directly soldered to the pcb but attached using a connector, so I can replace the parts that get bricked during development. I’ve downloaded the blink example using something like this. Directly after flashing it worked, but once I disconnected the flashing adapter it stopped. After remembering that I have to short my optional filter when it is not assembled, it works.

… an ohmic load (Thank you Axel) shows the same behavior (spikes on Vout) as seen in the previous post. Fortunately I’ve spent some space on the pcb for an optional filter that has now become mandatory.

The spikes do not change with load or input voltage. I took a closer look and they are much less random than the screenshot suggests. They’re expected transient responses to the switching. Currently they’re around ±1.5 V, which is too much.

Unfortunately the additional inductor and capacitor for the filter were not part of the part delivery I’ve received. The delivery date is changing once a week and is oscillating around the 30th of March.

But in the meantime I can still try to get the Arduino running. It has its own voltage regulator and an additional capacitor at the input, so the currently “dirty” Vout will not be an issue.

I’ve run the power supply under load. As you can see I’ve

  1. not yet removed the screen protector foil from my multimeter
  2. connected the cables in the wrong direction so the current is negative
load current @ Vout=5V [A]

The load was a florist wire that accidentally had the correct length to have a resistance of 10 Ohm (Just in case: R=U/I). So in addition to the resistance it is also an inductive load due to the geometric nature of florist wire. I did not want to unwind it.
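
For reference, the expected load current follows directly from R=U/I. A small Python sketch of the arithmetic, using only the 10 Ohm and 5 V values given above and ignoring the inductive part of the load (the variable names are made up for this example):

```python
V_OUT = 5.0    # volts, output voltage from the measurement caption
R_LOAD = 10.0  # ohms, resistance of the florist wire

current = V_OUT / R_LOAD   # I = U / R
power = V_OUT * current    # P = U * I, dissipated in the wire

print(f"load current: {current:.2f} A, dissipated power: {power:.2f} W")
```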

Channel 1: Vout
Channel 2: buck inductor input voltage

Three things can be seen:

  1. I’ve a problem with reflections and need a better environment for taking pictures (or an oscilloscope with screen shot functionality)
  2. The switching frequency of around 122kHz can be seen on the buck inductance. (Channel 2)
  3. There are spikes on Vout (Channel 1). They correlate with the switching points of the buck and are most probably caused by the “coily” nature of my florist wire load.

The result of the short test is that I’ve not noticed heating on the pcb or the parts even though I’m running the circuitry at the upper boundary of what it’s designed for. That’s good. For a real test with reasonably long duration (> 1 day) I need a fireproof environment that also contains the designated housing, so that air turbulence cannot cool down the pcb, and of course a possibility to measure and log the temperature over time.

After a (luckily unsuccessful) search for short circuits I have connected the power supply part on the pcb to an external power supply. The following screen shot shows that the power supply becomes operational at around 12V.

Channel 1: Vout
Channel 2: Vin (manual rise from 0 to 15V )

Unfortunately I do not have a nice load to check the behavior close to the 0.5 A the power supply is designed for. But I have a small light bulb that causes a load of around 30 mA. Running the power supply at this load for some minutes did not cause any noticeable increase of the temperature. That’s a good sign. I also tried a short circuit between Vout and GND. Nothing bad happened. The MAX5033 detected the short circuit and shut off, before trying to start again. After removing the short circuit it went back to normal. I did not keep it in this state for a longer time. The effects were visible on the oscilloscope and audible. Typically the inductors start to “sing” under such conditions.

My external power supply can only provide 20V, but I assume that if everything works at 20V it’ll also run at 24V. So the next step before actually connecting the Arduino is to run the power supply for an extended period of time (~ 1 day) with high load and 24V input.

After a long time I’ve reactivated my soldering iron. Since I’ve done that without additional flux (apart from the content of the solder) the result looks the way you would expect. My next step will be testing the circuitry.

power supply part of the cancombase

Surprisingly for me the soldering of the IC was the easiest. I assume that the pads were perfectly sized for hand soldering. The resistors and small capacitors look horrible because I did not use tweezers. The large capacitor’s solder pads are a bit too small for hand soldering and the inductor needs a higher temperature because of the relatively high mass.

I also noticed that I have to improve my documentation. More information on the pcb, the layout and the circuit diagrams is required to simplify the soldering and reduce the time spent on searching for the parts and their orientation. For example, having the small dot that indicates pin 1 of an IC would be very helpful. The same goes for the orientation of the larger capacitors and the exact location and size of the text on the pcb.

After ordering the prototype pcbs in China on Saturday they arrived on the following Wednesday. I even got one more pcb than I’ve ordered. The service is very fast and the price more than acceptable. So based on this single sample I can recommend allpcb.com. Apart from the silkscreen, the pcb looks good. But I’ve put exactly zero effort in it, so it’s OK. The picture shows the Arduino pro mini plugged in, but not yet soldered.

prototype pcb with loosely mounted arduino pro mini

The next step is to solder the power supply parts (visible here on the very right) and the optional filter against ripple. After that the difficult part, soldering the oscillator, will be the next step.

Of course Loxone offers the possibility to connect the miniserver to the internet and also an app for mobile devices to connect to your smart home via internet. The problem is that the connection is not as smart as expected. heise.de had a short and a long story about that.

So the first step is not to connect the system to the internet at all. The second step is to have a separate network for the home automation with very restricted access in both directions. Of course I want to use something like NTP to make sure the time is always correct. But what I do not want is that the system is accessible from the outside.

Another reason to restrict the internet access for the miniserver is that after Loxone provides a software update and the miniserver becomes “aware” of it, it’ll start complaining that the software should be updated. This is acceptable for the people who run the installation, but the normal user should not be bothered with that kind of information.

With the help of Jonas as reviewer I’m one step closer to the solution that was missing in Switch selection. The first version of cancombase is finished.

The 5×10 cm pcb fits behind the switches in a double plug socket. The 4 pairs in the CAT cable will be used in the following way:

  1. Connect switch 1 to the miniserver and the backup system (a post will follow)
  2. Power supply 24V (the selected switches need the 24V and I have decided – since I don’t know better – that a buck is easier than a boost)
  3. + 4. CAN (Since CAN bus does not allow a star topology it’ll be a long bus with a baud rate of around 100kBaud. Of course this has to be checked after installation. Wikipedia indicates that 125 kbit/s allow up to 500 meters of cable. A rough calculation )

A description of the PCB is available here. It’s based on the arduino pro mini. Or an available clone of it.

The gap between the now introduced CAN and the Loxone miniserver will be filled (most probably) with a Raspberry Pi that converts the CAN messages to UDP messages the miniserver is able to read.
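
A rough Python sketch of what such a converter could do for a single frame is shown below. Everything in it is an assumption made for illustration: the text format "can <id> <data>", the function names frame_to_datagram and forward_frame, and the example identifier. What the miniserver actually expects has to be checked against its documentation, and reading frames from the real CAN interface is left out entirely.

```python
import socket


def frame_to_datagram(can_id: int, data: bytes) -> bytes:
    """Encode one CAN frame as a text datagram (hypothetical format)."""
    if not 0 <= can_id <= 0x7FF:
        raise ValueError("expected an 11-bit standard CAN identifier")
    if len(data) > 8:
        raise ValueError("a classic CAN frame carries at most 8 data bytes")
    return f"can {can_id:03X} {data.hex().upper()}".encode("ascii")


def forward_frame(sock: socket.socket, address: tuple, can_id: int, data: bytes) -> None:
    """Send one encoded frame to the UDP address of the receiver."""
    sock.sendto(frame_to_datagram(can_id, data), address)


# Example: a (made-up) switch node 0x123 reports the state byte 0x01.
print(frame_to_datagram(0x123, b"\x01"))
```

With a UDP socket created via socket.socket(socket.AF_INET, socket.SOCK_DGRAM), forward_frame would send one datagram per CAN frame to the address and port configured on the receiving side.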

Apart from reading switch states (maybe with double-click detection) and writing to feedback LEDs the next version of cancombase will also contain a temperature sensor.

Our mirror server has been generating download maps for almost 10 years (since August 2009). This is done by going through all our download log files (HTTP, FTP, RSYNC) and using GeoIP and the Matplotlib Basemap Toolkit to draw maps from where our mirror server is being accessed.

I have taken the output from almost ten years and created the following animations. The first animation shows clients accessing all mirrored content:

As the mirror server is running Fedora it is updated once a year which might result in an updated version of Basemap once a year. The update usually happens in December or January which sometimes can be seen in the animation when the output changes. Updating to Fedora 27 (December 2017) resulted in a Basemap version which started to draw different results and the last update to Fedora 29 (December 2018) can also be seen as switching to Python 3 removed most of the clients from the map (only visible in the last second of the animation). It seems some of the calculations are giving different results in Python 3.

In addition to the map showing the accesses for all mirrored data, there is also an animation for clients accessing files from our Fedora mirror:

The interesting thing about only looking at clients accessing Fedora files is that it can be seen that most accesses are actually from Europe. This seems to indicate that Fedora’s mirroring system partially succeeds in directing clients to nearby mirrors. Looking at the location of clients accessing our EPEL mirror it seems to work even better. This is probably related to the much larger number of existing EPEL mirrors:

Another interesting effect of upgrading once a year can be seen around 6:42 in the EPEL animation. After upgrading to Fedora 25 the generated maps were upside down for a few days until I was able to fix it.

One of CRIU’s use cases is container checkpointing and restoring, which can also be used to migrate containers. Therefore container runtimes are using CRIU to checkpoint all the processes in a container as well as to restore the processes in that container. Many container runtimes are layered, which means that the user facing layer (Podman, Docker, LXD) calls another layer to checkpoint (or restore) the container (runc, LXC) and this layer then calls CRIU.

This leads to the problem that if CRIU introduces a new feature or option, all involved layers need code changes. Or if one of those layers made an assumption about how to use CRIU, the user must live with that assumption, which may be wrong for the user’s use case.

To offer the possibility to change CRIU’s behaviour through all these layers, be it that the container runtime has not implemented a certain CRIU feature or that the user needs a different CRIU behaviour, we started to discuss configuration files in 2016.

Configuration files should be evaluated by CRIU and offer a third way to influence CRIU’s behaviour. Setting options via CLI and RPC are the other two ways.

At the Linux Plumbers Conference in 2016 during the Checkpoint/Restore micro-conference I gave a short introduction talk about how configuration files could look and everyone was nodding their head.

In early 2017 Veronika Kabatova provided patches which were merged in CRIU’s development branch criu-dev. At that point the development stalled a bit and the discussion was only picked up again in early 2018. To have a feature merged into the master branch, which means it will be part of the next release, requires complete documentation (man-pages and wiki) and feature parity for CRIU’s CLI and RPC mode. At this point the feature was documented but not supported in RPC mode.

Adding configuration file support to CRIU’s RPC mode was not a technical challenge, but if any recruiter ever asks me which project was the most difficult, I will talk about this. We were exchanging mails and patches for about half a year and it seems everybody had different expectations how everything should behave. I think at the end they pitied me and just merged my patches…

CRIU 3.11 which was released on 2018-11-06 is the first release which includes support for configuration files and now (finally) I want to write about how it could be used.

I am using the Simple_TCP_pair example from CRIU’s wiki. First start the server:

# ./tcp-howto 10000

Then I am starting the client:

# ./tcp-howto 127.0.0.1 10000
Connecting to 127.0.0.1:10000
PP 1 -> 1
PP 2 -> 2
PP 3 -> 3
PP 4 -> 4

Once client and server are running, let’s try to checkpoint the client:

# rm -f /etc/criu/default.conf
# criu dump -t `pgrep -f 'tcp-howto 127.0.0.1 10000'`
Error (criu/sk-inet.c:188): inet: Connected TCP socket, consider using --tcp-established option.

CRIU tells us that it needs a special option to checkpoint processes with established TCP connections. No problem, but instead of changing the command-line, let’s add it to the configuration file:

# echo tcp-established > /etc/criu/default.conf
# criu dump -t `pgrep -f 'tcp-howto 127.0.0.1 10000'`
Error (criu/tty.c:1861): tty: Found dangling tty with sid 16693 pgid 16711 (pts) on peer fd 0.
Task attached to shell terminal. Consider using --shell-job option. More details on http://criu.org/Simple_loop

Alright, let’s also add shell-job to the configuration file:

# echo shell-job >> /etc/criu/default.conf
# criu dump -t `pgrep -f 'tcp-howto 127.0.0.1 10000'` && echo OK
OK

That worked. Cool. Finally! Most CLI options can be used in the configuration file(s) and more detailed documentation can be found in the CRIU wiki.
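
In the steps above each long command-line option ended up in the configuration file without its leading dashes, one option per line. A small Python sketch of that mapping follows; the helper name options_to_config is made up, and it only covers options without arguments as used in this example. For anything else the CRIU documentation is the reference.

```python
def options_to_config(cli_options: list) -> str:
    """Turn long CLI options into configuration file lines."""
    lines = []
    for option in cli_options:
        if not option.startswith("--"):
            raise ValueError(f"expected a long option, got {option!r}")
        # "--tcp-established" becomes "tcp-established"
        lines.append(option[2:])
    return "\n".join(lines) + "\n"


# Reproduces the content written to /etc/criu/default.conf above.
print(options_to_config(["--tcp-established", "--shell-job"]), end="")
```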

I want to thank Veronika for her initial implementation and everyone else helping, discussing and reviewing emails and patches to get this ready for release.

After using Podman a lot during the last weeks while adding checkpoint/restore support to Podman I was finally ready to use containers in production on our mirror server. We were still running the ownCloud version that came via RPMs in Fedora 27 and it seems like many people have moved on to Nextcloud from tarballs.

One of the main reasons to finally use containers is Podman’s daemonless approach.

The first challenge while moving from ownCloud 9.1.5 to Nextcloud 14 is the actual upgrade. To make sure it works I first made a copy of all the uploaded files and of the database and did a test upgrade yesterday using a CentOS 7 VM. With PHP 7 from Software Collections it was not a real problem. It took some time, but it worked. I used the included upgrade utility to upgrade from ownCloud 9 to Nextcloud 10, to Nextcloud 11, to Nextcloud 12, to Nextcloud 13, to Nextcloud 14. Lots of upgrades. Once I verified that everything was still functional I did it once more, but this time I used the real data and disabled access to our ownCloud instance.

The next step was to start the container. I decided to use the nextcloud:fpm container as I was planning to use the existing web server to proxy the requests. The one thing which makes using containers on our mirror server a bit difficult is that it is not possible to use any iptables NAT rules. At some point there were just too many network connections in the NAT table from all the clients connecting to our mirror server, so that it used to drop network connections. This problem has probably been fixed for a long time, but it used to be a problem and I try to avoid it. That is why my Nextcloud container is using the host network namespace:

podman run --name nextcloud-fpm -d --net host \
  -v /home/containers/nextcloud/html:/var/www/html \
  -v /home/containers/nextcloud/apps:/var/www/html/custom_apps \
  -v /home/containers/nextcloud/config:/var/www/html/config \
  -v /home/containers/nextcloud/data:/var/www/html/data \
  nextcloud:fpm

I was reusing my existing config.php in which the connection to PostgreSQL on 127.0.0.1 was still configured.

Once the container was running I just had to add the proxy rules to the Apache HTTP Server and it should have been ready. Unfortunately this was not as easy as I hoped it to be. All the documentation I found is about using the Nextcloud FPM container with NGINX. I found nothing about Apache’s HTTPD. The following lines required most of the time of the whole upgrade to Nextcloud project:

<FilesMatch .php.*>
  SetHandler proxy:fcgi://127.0.0.1:9000/
  ProxyFCGISetEnvIf "reqenv('REQUEST_URI') =~ m|(/owncloud/)(.*)$|" SCRIPT_FILENAME "/var/www/html/$2"
  ProxyFCGISetEnvIf "reqenv('REQUEST_URI') =~ m|^(.+.php)(.*)$|" PATH_INFO "$2"
</FilesMatch>
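
To see what the two expressions capture, they can be tried on a sample request. The following Python sketch uses Python’s re module instead of Apache’s regex engine, which should behave the same for patterns this simple. The request URI is a made-up example, and the variable names are only for illustration.

```python
import re

# The two patterns are copied from the ProxyFCGISetEnvIf lines.
SCRIPT_RE = re.compile(r"(/owncloud/)(.*)$")
PATH_INFO_RE = re.compile(r"^(.+.php)(.*)$")

uri = "/owncloud/index.php/apps/files"  # hypothetical request URI

script_match = SCRIPT_RE.search(uri)
path_match = PATH_INFO_RE.search(uri)

# "$2" in the Apache configuration refers to the second capture group.
script_filename = "/var/www/html/" + script_match.group(2)
path_info = path_match.group(2)

print(script_filename)
print(path_info)
```

For this sample the first rule strips the /owncloud/ prefix and maps the rest into /var/www/html inside the container, while the second rule passes everything after the .php file name as PATH_INFO.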

I hope these lines are actually correct, but so far all clients connecting to it seem to be happy. To have the Nextcloud container automatically start on system startup I based my systemd podman service file on the one from the Intro to Podman article.

[Unit]
Description=Custom Nextcloud Podman Container
After=network.target

[Service]
Type=simple
TimeoutStartSec=5m
ExecStartPre=-/usr/bin/podman rm nextcloud-fpm
ExecStart=/usr/bin/podman run --name nextcloud-fpm --net host \
  -v /home/containers/nextcloud/html:/var/www/html \
  -v /home/containers/nextcloud/apps:/var/www/html/custom_apps \
  -v /home/containers/nextcloud/config:/var/www/html/config \
  -v /home/containers/nextcloud/data:/var/www/html/data \
  nextcloud:fpm
ExecReload=/usr/bin/podman stop nextcloud-fpm
ExecReload=/usr/bin/podman rm nextcloud-fpm
ExecStop=/usr/bin/podman stop nextcloud-fpm
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target

On October 19th, 2018, I gave a talk about OpenHPC at the CentOS Dojo at CERN.

I really liked the whole event and my talk was also recorded. Thanks to everyone involved in organizing it. The day before FOSDEM 2019 there will be another CentOS Dojo in Brussels. I hope I have the chance to attend it as well.

The most interesting thing during my two days in Geneva was, however, the visit of the Antimatter Factory:

Antimatter Factory

Assuming I actually understood anything we were told about it, it is exactly that: an antimatter factory.