When I started integrating container migration into Podman using CRIU over a year ago, I did
not really think about SELinux. I was able to checkpoint a running container and
restore it later
(https://podman.io/blogs/2018/10/10/checkpoint-restore.html). I never looked
at the process labels of the restored containers, but I really should have.
After my initial implementation of container checkpoint and restore for Podman,
I started to work on live container migration for Podman in October 2018.
I opened the corresponding pull request at the end of January 2019 and immediately
started to get SELinux-related failures from the CI.
Amongst other SELinux denials, the main SELinux-related problem was a blocked
connectto:

    avc:  denied  { connectto } for  pid=23569 comm="top"
    path=002F6372746F6F6C732D70722D3233363139
    scontext=system_u:system_r:container_t:s0:c245,c463
    tcontext=unconfined_u:system_r:container_runtime_t:s0-s0:c0.c1023
    tclass=unix_stream_socket permissive=0

This is actually a really interesting denial, because it gives away details about how
CRIU works. This denial was caused by a container running top (podman
run -d alpine top) which I tried to checkpoint.
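
The path value in the denial is hex-encoded, and its leading 00 byte indicates an abstract Unix socket. As a small illustration (the helper name decode_audit_path is made up for this example), it can be decoded with a few lines of Python:

```python
def decode_audit_path(hex_path: str) -> bytes:
    """Decode a hex-encoded path field from an audit record into raw bytes."""
    return bytes.fromhex(hex_path)


raw = decode_audit_path("002F6372746F6F6C732D70722D3233363139")
# A leading NUL byte marks an abstract Unix socket; the rest is the name.
is_abstract = raw[:1] == b"\x00"
name = raw[1:].decode("ascii") if is_abstract else raw.decode("ascii")
print(is_abstract, name)  # True /crtools-pr-23619
```

The decoded name is the socket that top was trying to reach when the denial was logged.
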
To understand why a denial like this is reported by top it helps to
understand how CRIU works. To be able to access all resources of the
process CRIU tries to checkpoint (or dump), CRIU injects parasite code into the process. The parasite code allows CRIU to act from
within the process's address space. Once the parasite is injected and running
it connects to the main CRIU process and is ready to receive commands.
The parasite's attempt to connect to the main CRIU process is exactly the step
SELinux is blocking. Looking at the denial it seems that a process top
running as system_u:system_r:container_t:s0:c245,c463 is trying to connectto
a socket labeled as unconfined_u:system_r:container_runtime_t:s0-s0:c0.c1023,
which is indeed suspicious: something running in a container tries
to connect to something running outside of the container. Given how CRIU
works, however, the parasite code has to be able to connect to the main CRIU
process, and that requires connectto.
Fortunately SELinux has the necessary interface to solve this:
setsockcreatecon(3). Using
setsockcreatecon(3) it is possible to specify the context of newly created
sockets. So all we have to do is get the context of the process to checkpoint
and tell SELinux to label newly created sockets accordingly
(8eb4309).
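
In C this is a call to setsockcreatecon(3) from libselinux. The same idea can be sketched in Python, assuming the procfs attribute files that libselinux is built on (/proc/<pid>/attr/current and /proc/thread-self/attr/sockcreate). The function names and the proc_root parameter are made up for this illustration, and this is not the code from the CRIU commit:

```python
from pathlib import Path


def read_process_context(pid: int, proc_root: str = "/proc") -> str:
    """Return the SELinux context of a process, e.g. the one to checkpoint."""
    raw = Path(proc_root, str(pid), "attr", "current").read_bytes()
    # The value may be terminated by a NUL byte or a newline.
    return raw.rstrip(b"\x00\n").decode("ascii")


def set_socket_create_context(context: str, proc_root: str = "/proc") -> None:
    """Ask SELinux to label sockets created by this thread with `context`."""
    attr = Path(proc_root, "thread-self", "attr", "sockcreate")
    attr.write_bytes(context.encode("ascii") + b"\x00")


def label_sockets_like(pid: int, proc_root: str = "/proc") -> str:
    """Copy the context of `pid` to this thread's socket creation context."""
    context = read_process_context(pid, proc_root)
    set_socket_create_context(context, proc_root)
    return context
```

On a real system this only works with SELinux enabled and a policy that permits the calling process to set its socket creation context. The proc_root parameter exists only so that the sketch can be tried against a scratch directory.
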
Once understood, that was easy. Unfortunately, this is also where the whole thing
got really complicated.
The CRIU RPM package in Fedora is built without SELinux support, because CRIU's
SELinux support until now was limited and not tested. CRIU's SELinux support
used to be: if the process context does not start with unconfined_, CRIU just
refuses to dump the process and exits. And because CRIU was otherwise unaware of
SELinux, a process restored with CRIU no longer ran with the context it was
started with, but with the context CRIU had during the restore. So if a container was running with
a context like system_u:system_r:container_t:s0:c248,c716 during
checkpointing it was running with the wrong context after restore:
unconfined_u:system_r:container_runtime_t:s0, which is the context of the
container runtime and not of the actual container process.
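
To make the two contexts easier to compare, here is a small Python sketch that splits a context into its fields and mimics the old prefix rule described above. The names SELinuxContext, parse_context and old_criu_would_dump are invented for this example and do not come from CRIU:

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(context: str) -> SELinuxContext:
    """Split a context into user, role, type and MLS/MCS level."""
    # The level may itself contain colons (s0:c248,c716), so split at most 3 times.
    user, role, type_, level = context.split(":", 3)
    return SELinuxContext(user, role, type_, level)


def old_criu_would_dump(context: str) -> bool:
    """Mimic the old rule: only contexts starting with unconfined_ are dumped."""
    return context.startswith("unconfined_")


container = "system_u:system_r:container_t:s0:c248,c716"
runtime = "unconfined_u:system_r:container_runtime_t:s0"
print(parse_context(container).type, old_criu_would_dump(container))  # container_t False
print(parse_context(runtime).type, old_criu_would_dump(runtime))  # container_runtime_t True
```

The restored container should end up with the type and level of the first context, not the second.
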
So first I had to fix CRIU's SELinux handling to be able to use
setsockcreatecon(3). Fortunately, once I understood the problem, it was
pretty easy to fix CRIU's SELinux process labeling. Most of the LSM code in
CRIU was written by Tycho in 2015 with a focus on AppArmor,
which luckily uses the same interfaces as SELinux. So all I had to do was
remove the restrictions on which SELinux contexts CRIU is willing to operate on and
make sure that CRIU stores the information about the process context in its
image files
(796da06).
Once the next CRIU release with these patches included is available I have to
add BuildRequires: libselinux-devel to the RPM to build Fedora's CRIU package
with SELinux support. This, however, means that CRIU users on Fedora might see
SELinux errors they have not seen before. CRIU now needs SELinux policies which
allow CRIU to change the SELinux context of a running process. For the Podman
use case which started all of this there has been the corresponding change in
container-selinux to allow
container_runtime_t to dyntransition to container
domains.
For CRIU use cases outside of containers, additional policies
have been created, which are also used by the new CRIU ZDTM test case
selinux00.
A new boolean exists which allows CRIU to use "setcon to dyntrans to any
process type which is part of domain attribute". So with setsebool -P
unconfined_dyntrans_all 1 it should be possible to use CRIU on Fedora just
like before.
After I included all those patches and policies into Podman's CI, almost all
checkpoint/restore-related tests were successful. The one exception was a test
that checks whether it is possible to checkpoint and restore a container with
established TCP connections. In this test case a container with Redis is
started, a connection to Redis is opened, and the container is checkpointed and
restored. This was still failing in CI, which was interesting as this seemed
unrelated to SELinux.
Trying to reproduce the test case locally, I actually saw the following SELinux
errors during restore:

    audit: SELINUX_ERR op=security_bounded_transition seresult=denied
    oldcontext=unconfined_u:system_r:container_runtime_t:s0
    newcontext=system_u:system_r:container_t:s0:c218,c449

This was unusual as it did not look like something that could be fixed with a
policy.
The reason my test case for checkpointing and restoring containers with
established TCP connections failed was not the fact that it is testing
established TCP connections, but the fact that it is a multithreaded process.
Looking at the SELinux kernel code, I found the following comment in
security/selinux/hooks.c:

    /* Only allow single threaded processes to change context */

This line has been unchanged since 2008, so it seemed unlikely that SELinux
could be changed in such a way that each thread can be labeled separately.
My first attempt to solve this was to change the process label with setcon(3)
before CRIU forks for the first time. This kind of worked, but at the same time
it created lots of SELinux denials (over 50), because during restore CRIU
transforms itself and the forks it creates into the process it wants to restore.
So instead of changing the process label just before forking for the first time,
I switched to setting the process label just before CRIU creates all threads
(e86c2e9).
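
Whether a process is likely to run into this restriction can be estimated by counting its threads. The following Python sketch does that by listing /proc/<pid>/task. The function names and the proc_root parameter are hypothetical, and the kernel's own check is more involved than a simple count, so this is only an approximation:

```python
import os


def thread_count(pid: int, proc_root: str = "/proc") -> int:
    """Count the threads of a process via its /proc/<pid>/task directory."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "task")))


def is_single_threaded(pid: int, proc_root: str = "/proc") -> bool:
    """True if the process currently has exactly one thread."""
    return thread_count(pid, proc_root) == 1
```

Calling thread_count(pid) with the PID of the Redis process from the failing test would have returned more than one, which is why the context has to be set before the threads are created.
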
Setting the context just before creating the threads resulted in only two
SELinux denials. The first is about CRIU accessing the log file during restore,
which is not critical. The other denial happens when CRIU tries to influence
the PID of the threads it wants to create via /proc/sys/kernel/ns_last_pid.
Because CRIU is now running in the SELinux context of the container to be
restored, and to avoid giving the container access to all files labeled
sysctl_kernel_t, Fedora's selinux-policy contains a patch
to label /proc/sys/kernel/ns_last_pid as sysctl_kernel_ns_last_pid_t.
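
For context: the kernel uses the value in ns_last_pid plus one as the next PID in the current PID namespace, so writing the wanted PID minus one right before creating a thread influences which ID it gets. A minimal Python sketch of that idea follows. It is not CRIU's actual code, the function name request_next_pid is hypothetical, and on a real system the write needs sufficient privileges and is racy if anything else creates tasks in the same namespace:

```python
def request_next_pid(wanted_pid: int,
                     path: str = "/proc/sys/kernel/ns_last_pid") -> None:
    """Write wanted_pid - 1 so that the next task created may get wanted_pid."""
    if wanted_pid < 2:
        raise ValueError("wanted_pid must be at least 2")
    with open(path, "w") as f:
        f.write(str(wanted_pid - 1))
```

This write is exactly the access that the sysctl_kernel_ns_last_pid_t label makes possible to allow on its own, without opening up everything else labeled sysctl_kernel_t.
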
So with the latest CRIU and selinux-policy installed and the following
addition to my local SELinux policy
(kernel_rw_kernel_ns_lastpid_sysctl(container_domain)) I can now checkpoint
and restore a Podman container (even multi-threaded) with the correct SELinux
process context after a restore and no further SELinux denials blocking the
checkpointing or restoring of the container. There are still a few SELinux denials,
which are mainly related to not being able to write to the log files. Those
denials, however, do not interfere with checkpointing and restoring.
For some time (two or three years) I was aware that CRIU had never been verified
to work correctly with SELinux, but I always ignored it. I should have just
fixed it a long time ago. Without the CRIU integration into Podman, however, I
would not have been able to test my changes the way I did.
I would like to thank Radostin for his feedback and ideas when I was stuck
and for his overview of the necessary CRIU changes, Dan for his help in
adapting the container-selinux package to CRIU's needs, and Lukas for the
necessary changes to Fedora's selinux-policy package to make CRIU work with
SELinux on Fedora. All these combined efforts made it possible to have the
necessary policies and code changes ready to support container migration
with Podman.