Checkpoint and almost Restart in Open MPI
Checkpoint/restart with CRIU has been possible since Fedora 19, so I started adding CRIU support to Open MPI. With my commit 30772 it is now possible to checkpoint a process running under Open MPI. The restart functionality is not yet implemented but should be available soon. My test case (orte-test) prints its PID and sleeps for one second in a loop. I start it under orterun like this:
/path/to/orterun --mca ft_cr_enabled 1 --mca opal_cr_use_thread 1 --mca oob tcp --mca crs_criu_verbose 30 --np 1 orte-test
The options have the following meaning:
- --mca ft_cr_enabled 1
  - ft stands for fault tolerance
  - cr stands for checkpoint/restart
  - this option enables the checkpoint/restart functionality
- --mca opal_cr_use_thread 1: use an additional thread to control checkpoint/restart operations
- --mca oob tcp: use TCP instead of Unix domain sockets (the socket code needs some additional changes for C/R to work)
- --mca crs_criu_verbose 30: print all CRIU debug messages
- --np 1: spawn one test case
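The command line above can also be assembled from a table of MCA parameters, which makes it easier to toggle single options between runs. This is only a minimal Python sketch and not part of Open MPI: the helper name build_orterun_cmd and the dictionary layout are made up for illustration, and it uses only the flags already shown above.

```python
def build_orterun_cmd(orterun, program, mca_params, np=1):
    """Return the argument list for an orterun call (hypothetical helper)."""
    cmd = [orterun]
    # Each MCA parameter becomes "--mca <name> <value>".
    for name, value in mca_params.items():
        cmd += ["--mca", name, str(value)]
    cmd += ["--np", str(np), program]
    return cmd


# The MCA parameters from the command line above, in the same order.
MCA_PARAMS = {
    "ft_cr_enabled": 1,         # enable checkpoint/restart
    "opal_cr_use_thread": 1,    # extra thread for C/R operations
    "oob": "tcp",               # TCP instead of Unix domain sockets
    "crs_criu_verbose": 30,     # print all CRIU debug messages
}

cmd = build_orterun_cmd("/path/to/orterun", "orte-test", MCA_PARAMS)
print(" ".join(cmd))
```

The resulting list can be passed directly to subprocess.run(), which avoids shell quoting issues.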
The output of the test case looks like this:
[dcbz:12563] crs:criu: open()
[dcbz:12563] crs:criu: open: priority = 10
[dcbz:12563] crs:criu: open: verbosity = 30
[dcbz:12563] crs:criu: open: log_file = criu.log
[dcbz:12563] crs:criu: open: log_level = 0
[dcbz:12563] crs:criu: open: tcp_established = 1
[dcbz:12563] crs:criu: open: shell_job = 1
[dcbz:12563] crs:criu: open: ext_unix_sk = 1
[dcbz:12563] crs:criu: open: leave_running = 1
[dcbz:12563] crs:criu: component_query()
[dcbz:12563] crs:criu: module_init()
[dcbz:12563] crs:criu: opal_crs_criu_prelaunch
[dcbz:12565] crs:criu: open()
[dcbz:12565] crs:criu: open: priority = 10
[dcbz:12565] crs:criu: open: verbosity = 30
[dcbz:12565] crs:criu: open: log_file = criu.log
[dcbz:12565] crs:criu: open: log_level = 0
[dcbz:12565] crs:criu: open: tcp_established = 1
[dcbz:12565] crs:criu: open: shell_job = 1
[dcbz:12565] crs:criu: open: ext_unix_sk = 1
[dcbz:12565] crs:criu: open: leave_running = 1
[dcbz:12565] crs:criu: component_query()
[dcbz:12565] crs:criu: module_init()
[dcbz:12565] crs:criu: opal_crs_criu_reg_thread
Process 12565
Process 12565
Process 12565
To start the checkpoint operation, I use the Open MPI tool orte-checkpoint:
/path/to/orte-checkpoint -V 10 `pidof orterun`
which outputs the following:
[dcbz:12570] orte_checkpoint: Checkpointing...
[dcbz:12570] PID 12563
[dcbz:12570] Connected to Mpirun [[56676,0],0]
[dcbz:12570] orte_checkpoint: notify_hnp: Contact Head Node Process PID 12563
[dcbz:12570] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Requested - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Pending - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Running - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
[dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.14] Continuing/Recovered - ompi_global_snapshot_12563.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt
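The interesting parts of this output are the sequence of status updates and the final snapshot reference. As a rough sketch, they could be pulled out of the text with two regular expressions. The patterns are derived only from the output shown above, so they are an assumption about the format and may not cover other Open MPI versions; the function name parse_checkpoint_output is made up for this example.

```python
import re

# Matches status lines such as "[dcbz:12570] [ 0.00 / 0.08] Requested - ...".
STATUS_RE = re.compile(
    r"\[\s*\d+\.\d+\s*/\s*\d+\.\d+\]\s+(?P<state>.+?)\s+-\s+\S+"
)
# Matches the final "Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt" line.
SNAPSHOT_RE = re.compile(r"Snapshot Ref\.:\s+(?P<seq>\d+)\s+(?P<name>\S+)")


def parse_checkpoint_output(output):
    """Return (list of states, (sequence number, snapshot name) or None)."""
    states = []
    snapshot = None
    for line in output.splitlines():
        status = STATUS_RE.search(line)
        if status:
            states.append(status["state"])
            continue
        ref = SNAPSHOT_RE.search(line)
        if ref:
            snapshot = (int(ref["seq"]), ref["name"])
    return states, snapshot


# A few lines copied from the orte-checkpoint output above.
SAMPLE = """\
[dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
[dcbz:12570] [ 0.00 / 0.08] Requested - ...
[dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
[dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
[dcbz:12570] [ 0.00 / 0.14] Continuing/Recovered - ompi_global_snapshot_12563.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt
"""

states, snapshot = parse_checkpoint_output(SAMPLE)
print(states)
print(snapshot)
```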
orte-checkpoint connects to the previously started orterun process and requests a checkpoint. After receiving the request, orterun outputs the following:
[dcbz:12565] crs:criu: checkpoint(12565, ---)
[dcbz:12565] crs:criu: criu_init_opts() returned 0
[dcbz:12565] crs:criu: opening snapshot directory /home/adrian/ompi_global_snapshot_12563.ckpt/0/opal_snapshot_0.ckpt
[dcbz:12563] 12563: Checkpoint established for process [56676,0].
[dcbz:12563] 12563: Successfully restarted process [56676,0].
Process 12565
At this point the checkpoint has been written to disk and the process continues (printing its PID).
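The snapshot directory in the log appears to follow a fixed layout: a global snapshot named after the PID of orterun (12563), a sequence number that matches the "Snapshot Ref.: 0" line, and a per-process directory. The sketch below rebuilds that path. The layout is inferred from this single run only, and reading the trailing index of opal_snapshot_0.ckpt as the process rank is an assumption; local_snapshot_path is a hypothetical helper, not an Open MPI function.

```python
from pathlib import PurePosixPath


def local_snapshot_path(base_dir, hnp_pid, seq, rank):
    """Compose the per-process snapshot directory (layout assumed from the log)."""
    global_ref = f"ompi_global_snapshot_{hnp_pid}.ckpt"
    # <base>/<global snapshot>/<sequence number>/<local snapshot>
    return PurePosixPath(base_dir) / global_ref / str(seq) / f"opal_snapshot_{rank}.ckpt"


# Values taken from the run above: orterun PID 12563, snapshot ref 0, one process.
path = local_snapshot_path("/home/adrian", 12563, 0, 0)
print(path)
```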
For complete checkpoint/restart support I still have to implement the restart functionality in Open MPI. I also have to take care of the Unix domain sockets (shutting them down for checkpointing).
This requires the latest criu package (criu-1.1-4), which includes the headers needed to build Open MPI against CRIU as well as the CRIU service.