Category Archives: Coding

Influence which PID will be next

To restore a checkpointed process with CRIU, the process ID (PID) has to be the same as it was during checkpointing. Just before fork()-ing the new process, CRIU writes a value one lower than the PID of the process to be restored to /proc/sys/kernel/ns_last_pid.

The same interface (/proc/sys/kernel/ns_last_pid) can also be used from the command-line to influence which PID the kernel will use for the next process.

# cat /proc/sys/kernel/ns_last_pid
1626
# echo -n 9999 > /proc/sys/kernel/ns_last_pid
# cat /proc/sys/kernel/ns_last_pid
10000

Writing ‘9999’ (without a trailing newline) to /proc/sys/kernel/ns_last_pid tells the kernel that the next PID should be ‘10000’. The second cat already reports 10000 because cat itself was the next process to be created and received that PID. This only works if no other process is created between writing to /proc/sys/kernel/ns_last_pid and forking the new process. So it is not possible to guarantee which PID the new process will get, but it can be influenced.

There is also a posting which describes how to do the same in C: How to set PID using ns_last_pid
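
The write-then-fork sequence can also be scripted. The following is a minimal Python sketch, not the code from the linked posting and not CRIU's implementation. It takes an exclusive flock() on ns_last_pid, writes the wanted PID minus one, and forks right away. The function name fork_with_pid and its path parameter are invented for this example; the path is a parameter only so that the function can be tried on a scratch file. Writing the real file needs sufficient privileges (the transcript above uses a root shell). The lock is advisory, so it only keeps out programs that lock the file as well, and any unrelated fork() on the system can still take the PID first.

```python
import fcntl
import os

NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def fork_with_pid(wanted_pid, path=NS_LAST_PID):
    """Ask for wanted_pid, fork once, and return the PID the child really got."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Advisory lock: only helps against other tools that lock the file too.
        fcntl.flock(fd, fcntl.LOCK_EX)
        # The kernel hands out ns_last_pid + 1 next, so write one less,
        # without a trailing newline (like `echo -n`).
        os.write(fd, str(wanted_pid - 1).encode("ascii"))
        child = os.fork()
        if child == 0:
            # Child: a real program would exec() or do its work here.
            os._exit(0)
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
    os.waitpid(child, 0)
    return child


if __name__ == "__main__":
    try:
        print("wanted PID 10000, got PID %d" % fork_with_pid(10000))
    except OSError as err:
        print("could not set ns_last_pid:", err)
```

Comparing the returned PID with the requested one is the only way to find out whether the request was honoured; if another process was created in between, the child simply gets a different PID.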

Checkpoint and almost Restart in Open MPI

Since Fedora 19, checkpoint/restart with CRIU is possible, so I started adding CRIU support to Open MPI. With my commit 30772 it is now possible to checkpoint a process running under Open MPI. The restart functionality is not yet implemented but should be available soon. I have a test case (orte-test) which prints its PID and sleeps one second in a loop. I start it under orterun like this:

/path/to/orterun --mca ft_cr_enabled 1 --mca opal_cr_use_thread 1 --mca oob tcp --mca crs_criu_verbose 30 --np 1 orte-test

The options have the following meaning:

  • --mca ft_cr_enabled 1
    • ft stands for fault tolerance
    • cr stands for checkpoint/restart
    • this option is to enable the checkpoint/restart functionality
  • --mca opal_cr_use_thread 1: use an additional thread to control checkpoint/restart operations
  • --mca oob tcp: use TCP instead of unix domain sockets (the socket code needs some additional changes for C/R to work)
  • --mca crs_criu_verbose 30: print all CRIU debug messages
  • --np 1: spawn one test case

The output of the test case looks like this:


[dcbz:12563] crs:criu: open()
[dcbz:12563] crs:criu: open: priority = 10
[dcbz:12563] crs:criu: open: verbosity = 30
[dcbz:12563] crs:criu: open: log_file = criu.log
[dcbz:12563] crs:criu: open: log_level = 0
[dcbz:12563] crs:criu: open: tcp_established = 1
[dcbz:12563] crs:criu: open: shell_job = 1
[dcbz:12563] crs:criu: open: ext_unix_sk = 1
[dcbz:12563] crs:criu: open: leave_running = 1
[dcbz:12563] crs:criu: component_query()
[dcbz:12563] crs:criu: module_init()
[dcbz:12563] crs:criu: opal_crs_criu_prelaunch
[dcbz:12565] crs:criu: open()
[dcbz:12565] crs:criu: open: priority = 10
[dcbz:12565] crs:criu: open: verbosity = 30
[dcbz:12565] crs:criu: open: log_file = criu.log
[dcbz:12565] crs:criu: open: log_level = 0
[dcbz:12565] crs:criu: open: tcp_established = 1
[dcbz:12565] crs:criu: open: shell_job = 1
[dcbz:12565] crs:criu: open: ext_unix_sk = 1
[dcbz:12565] crs:criu: open: leave_running = 1
[dcbz:12565] crs:criu: component_query()
[dcbz:12565] crs:criu: module_init()
[dcbz:12565] crs:criu: opal_crs_criu_reg_thread
Process 12565
Process 12565
Process 12565
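
The "Process 12565" lines at the end of this output come from the test case. orte-test itself is not shown in this post; the following Python stand-in only mirrors the behaviour described above (print the PID, sleep one second, repeat). It makes no MPI calls, and the function name report_pid and its parameters are invented for this illustration.

```python
import os
import sys
import time


def report_pid(iterations=None, delay=1.0, out=sys.stdout):
    """Print 'Process <pid>' every `delay` seconds; loop forever if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        print("Process %d" % os.getpid(), file=out, flush=True)
        time.sleep(delay)
        count += 1


if __name__ == "__main__":
    # Three rounds are enough to see the pattern; pass iterations=None to loop forever.
    report_pid(iterations=3)
```

Printing the PID on every iteration makes it easy to see that the same process keeps running after the checkpoint has been taken.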

To start the checkpoint operation the Open MPI tool orte-checkpoint is used:

/path/to/orte-checkpoint -V 10 `pidof orterun`

which outputs the following:


[dcbz:12570] orte_checkpoint: Checkpointing...
[dcbz:12570] PID 12563
[dcbz:12570] Connected to Mpirun [[56676,0],0]
[dcbz:12570] orte_checkpoint: notify_hnp: Contact Head Node Process PID 12563
[dcbz:12570] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Requested - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Pending - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Running - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Continuing/Recovered - ompi_global_snapshot_12563.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt

orte-checkpoint connects to the previously started orterun process and requests a checkpoint. After receiving the checkpoint request, orterun outputs the following:


[dcbz:12565] crs:criu: checkpoint(12565, ---)
[dcbz:12565] crs:criu: criu_init_opts() returned 0
[dcbz:12565] crs:criu: opening snapshot directory /home/adrian/ompi_global_snapshot_12563.ckpt/0/opal_snapshot_0.ckpt
[dcbz:12563] 12563: Checkpoint established for process [56676,0].
[dcbz:12563] 12563: Successfully restarted process [56676,0].
Process 12565

At this point the checkpoint has been written to disk and the process continues (printing its PID).

For complete checkpoint/restart support I still have to implement the restart functionality in Open MPI, and I also have to take care of the unix domain sockets (shutting them down for checkpointing).

This requires the latest criu package (criu-1.1-4) which includes headers to build Open MPI against CRIU as well as the CRIU service.

If you have too much memory

We have integrated new nodes into our cluster. All of the new nodes have a local SSD for fast temporary scratch data. To find the best file system options and I/O scheduler, I have written a script which tries a lot of combinations (80, to be precise) of file system options and I/O schedulers. As the nodes have 64 GB of RAM, the first run of the script took 40 hours, because my benchmarks always wrote twice the size of the RAM to avoid any caching effects. To reduce the amount of available memory, I wrote a program called memhog which malloc()s the memory and then also mlock()s it. The usage is really simple:

$ ./memhog
Usage: memhog <size in GB>

I am now locking 56 GB with memhog, which leaves about 8 GB of the 64 GB usable, and I reduced the benchmark file size to 30 GB.

So, if you have too much memory and want to waste it… Just use memhog.c.
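
memhog.c itself is not reproduced here. As a rough illustration of the same idea, the following Python sketch reserves memory and pins it in RAM with mlock(), called through ctypes. It is not the actual memhog: it uses an anonymous mmap instead of malloc(), it assumes that a GB means 1024^3 bytes, and the names gib_to_bytes and hog are invented for this example. Locking this much memory normally needs root or a raised RLIMIT_MEMLOCK.

```python
import ctypes
import ctypes.util
import mmap
import os
import signal
import sys

GIB = 1024 ** 3  # assumption: "GB" on the command line means 2^30 bytes


def gib_to_bytes(text):
    """Convert a size in GB given as a string to bytes; it must be positive."""
    size = int(text)
    if size <= 0:
        raise ValueError("size must be a positive number of GB")
    return size * GIB


def hog(nbytes):
    """Map nbytes of anonymous memory and pin it in RAM with mlock()."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    region = mmap.mmap(-1, nbytes)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(region))
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(nbytes)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return region  # the caller must keep this object, or the memory is unmapped


if __name__ == "__main__":
    if len(sys.argv) == 2:
        held = hog(gib_to_bytes(sys.argv[1]))
        signal.pause()  # hold the locked memory until a signal arrives
    else:
        print("Usage: memhog <size in GB>")
```

mlock() is what makes the difference: memory that is only allocated could still be swapped out or never be backed by real pages, whereas locked pages stay resident and are no longer available to the page cache.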

Kover 6

After having successfully updated libcdio in rawhide to 0.90 and also introduced the split-off libcdio-paranoia in Fedora’s development branch, I rebuilt most of the packages depending on libcdio. Two packages were no longer building, but their maintainers quickly fixed them. The only remaining broken dependent package was kover. As I am still upstream of kover, I had to change the code to use the new CD-Text API of libcdio 0.90.

With these changes I have released kover version 6 which is available at http://lisas.de/kover/kover-6.tar.bz2.

Git

As I wanted to make a new kover release, I thought I could try to move my code to git. The first step was to copy the code from CVS to a local git repository:

  • git-cvsimport -i -v -d :pserver:adrian@cvs:/cvs/kover -C kover.git kover

I was a bit surprised that the newly created directory kover.git was empty except for a .git directory. Without really knowing what I was doing, I typed git-checkout and it listed all available files, but my directory was still empty. So I tried git-checkout with a dot as argument (git-checkout .). This time there was no output, but all my files were now in my directory so that I could start making changes.

Committing, adding and removing files is easy and works just as expected (git-commit, git-add, git-rm). The steps to publish the git repository, however, were not as easy. The following commands were necessary to make it work for me:

  • git clone --bare . git
  • touch git/git-daemon-export-ok
  • cd git/
  • git --bare update-server-info
  • chmod a+x hooks/post-update

or, with newer versions of git, the following mv instead of the chmod, followed by the remaining steps:

  • mv hooks/post-update.sample hooks/post-update
  • cd ..
  • rsync -avP -e "ssh" git/ lisas.de:/var/www/html/kover/git

From this point on it was possible to access the repository through HTTP with git clone http://lisas.de/kover/git. To push my changes to the repository on the server I use git-push lisas.de:/var/www/html/kover/git master. It was necessary to install git-core on the server so that git-push would work without errors. To pull changes from the online repository I use git-pull http://lisas.de/kover/git master.

There is probably no real reason to use git: up until now kover was mainly developed by me, so it will not profit from the distributed features which are the main advantage of git. But I wanted to play with the same toys as all the cool kids ;-).