{"id":627,"date":"2014-02-20T21:21:00","date_gmt":"2014-02-20T19:21:00","guid":{"rendered":"https:\/\/lisas.de\/~adrian\/posts\/2014-Feb-20-checkpoint-and-almost-restart-in-open-mpi.html"},"modified":"2026-03-30T22:41:24","modified_gmt":"2026-03-30T20:41:24","slug":"checkpoint-and-almost-restart-in-open-mpi","status":"publish","type":"post","link":"https:\/\/lisas.de\/luges\/index.php\/2014\/02\/20\/checkpoint-and-almost-restart-in-open-mpi\/","title":{"rendered":"Checkpoint and almost Restart in Open MPI"},"content":{"rendered":"<p>Checkpoint\/restart with <a href=\"http:\/\/criu.org\/\">CRIU<\/a> has been possible since Fedora 19, so I started adding CRIU support to <a href=\"http:\/\/www.open-mpi.org\/\">Open MPI<\/a>. With my commit <a href=\"https:\/\/svn.open-mpi.org\/trac\/ompi\/changeset\/30772\">30772<\/a>, it is now possible to checkpoint a process running under Open MPI. Restart is not implemented yet, but should be available soon. My test case (<em>orte-test<\/em>) prints its PID and sleeps for one second in a loop. I start it under <em>orterun<\/em> like this:<\/p>\n<p><code>\/path\/to\/orterun --mca ft_cr_enabled 1 --mca opal_cr_use_thread 1 --mca oob tcp --mca crs_criu_verbose 30 --np 1 orte-test<\/code><\/p>\n<p>The options have the following meaning:<\/p>\n<ul>\n<li><em>--mca ft_cr_enabled 1<\/em>\n<ul>\n<li>ft stands for fault tolerance<\/li>\n<li>cr stands for checkpoint\/restart<\/li>\n<li>this option enables the checkpoint\/restart functionality<\/li>\n<\/ul>\n<\/li>\n<li><em>--mca opal_cr_use_thread 1<\/em>: use an additional thread to control checkpoint\/restart operations<\/li>\n<li><em>--mca oob tcp<\/em>: use TCP instead of Unix domain sockets (the socket code needs some additional changes for C\/R to work)<\/li>\n<li><em>--mca crs_criu_verbose 30<\/em>: print all CRIU debug messages<\/li>\n<li><em>--np 1<\/em>: spawn one instance of the test case<\/li>\n<\/ul>\n<p>The output of the test case looks like 
this:<\/p>\n<div class=\"highlight\">\n<pre><span><\/span><code><span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">open<\/span><span class=\"p\">()<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">priority<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">10<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">verbosity<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">30<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">log_file<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"n\">criu<\/span><span class=\"p\">.<\/span><span class=\"n\">log<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">log_level<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> 
<\/span><span class=\"mh\">0<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">tcp_established<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">shell_job<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">ext_unix_sk<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">leave_running<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">component_query<\/span><span class=\"p\">()<\/span> <span 
class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">module_init<\/span><span class=\"p\">()<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12563<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">opal_crs_criu_prelaunch<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">open<\/span><span class=\"p\">()<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">priority<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">10<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">verbosity<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">30<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">log_file<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> 
<\/span><span class=\"n\">criu<\/span><span class=\"p\">.<\/span><span class=\"n\">log<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">log_level<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">0<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">tcp_established<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">shell_job<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span class=\"w\"> <\/span><span class=\"n\">ext_unix_sk<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"nl\">open:<\/span><span 
class=\"w\"> <\/span><span class=\"n\">leave_running<\/span><span class=\"w\"> <\/span><span class=\"o\">=<\/span><span class=\"w\"> <\/span><span class=\"mh\">1<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">component_query<\/span><span class=\"p\">()<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">module_init<\/span><span class=\"p\">()<\/span> <span class=\"p\">[<\/span><span class=\"nl\">dcbz:<\/span><span class=\"mh\">12565<\/span><span class=\"p\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">crs:criu:<\/span><span class=\"w\"> <\/span><span class=\"n\">opal_crs_criu_reg_thread<\/span><span class=\"w\"> <\/span><span class=\"n\">Process<\/span><span class=\"w\"> <\/span><span class=\"mh\">12565<\/span><span class=\"w\"> <\/span><span class=\"n\">Process<\/span><span class=\"w\"> <\/span><span class=\"mh\">12565<\/span><span class=\"w\"> <\/span><span class=\"n\">Process<\/span><span class=\"w\"> <\/span><span class=\"mh\">12565<\/span> <\/code><\/pre>\n<\/div>\n<p>To start the checkpoint operation the Open MPI tool <em>orte-checkpoint<\/em> is used:<\/p>\n<p><code>\/path\/to\/orte-checkpoint -V 10 `pidof orterun`<\/code><\/p>\n<p>which outputs the following:<\/p>\n<div class=\"highlight\">\n<pre><span><\/span><code><span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Checkpointing<\/span><span class=\"p\">...<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span 
class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"n\">PID<\/span><span class=\"w\"> <\/span><span class=\"mi\">12563<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"n\">Connected<\/span><span class=\"w\"> <\/span><span class=\"k\">to<\/span><span class=\"w\"> <\/span><span class=\"n\">Mpirun<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\">[56676,0<\/span><span class=\"o\">]<\/span><span class=\"p\">,<\/span><span class=\"mi\">0<\/span><span class=\"err\">]<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">notify_hnp<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Contact<\/span><span class=\"w\"> <\/span><span class=\"n\">Head<\/span><span class=\"w\"> <\/span><span class=\"n\">Node<\/span><span class=\"w\"> <\/span><span class=\"n\">Process<\/span><span class=\"w\"> <\/span><span class=\"n\">PID<\/span><span class=\"w\"> <\/span><span class=\"mi\">12563<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">notify_hnp<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Requested<\/span><span class=\"w\"> <\/span><span class=\"n\">a<\/span><span class=\"w\"> <\/span><span class=\"k\">checkpoint<\/span><span class=\"w\"> <\/span><span class=\"k\">of<\/span><span class=\"w\"> <\/span><span class=\"n\">jobid<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\">INVALID<\/span><span class=\"o\">]<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span 
class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Receive<\/span><span class=\"w\"> <\/span><span class=\"n\">a<\/span><span class=\"w\"> <\/span><span class=\"n\">command<\/span><span class=\"w\"> <\/span><span class=\"n\">message<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Status<\/span><span class=\"w\"> <\/span><span class=\"k\">Update<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\"> 0.00 \/ 0.08<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"n\">Requested<\/span><span class=\"w\"> <\/span><span class=\"o\">-<\/span><span class=\"w\"> <\/span><span class=\"p\">...<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Receive<\/span><span class=\"w\"> <\/span><span class=\"n\">a<\/span><span class=\"w\"> <\/span><span class=\"n\">command<\/span><span class=\"w\"> <\/span><span class=\"n\">message<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span 
class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Status<\/span><span class=\"w\"> <\/span><span class=\"k\">Update<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\"> 0.00 \/ 0.08<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"n\">Pending<\/span><span class=\"w\"> <\/span><span class=\"o\">-<\/span><span class=\"w\"> <\/span><span class=\"p\">...<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Receive<\/span><span class=\"w\"> <\/span><span class=\"n\">a<\/span><span class=\"w\"> <\/span><span class=\"n\">command<\/span><span class=\"w\"> <\/span><span class=\"n\">message<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Status<\/span><span class=\"w\"> <\/span><span class=\"k\">Update<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\"> 0.00 \/ 0.08<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"n\">Running<\/span><span class=\"w\"> <\/span><span class=\"o\">-<\/span><span class=\"w\"> <\/span><span class=\"p\">...<\/span> <span 
class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Receive<\/span><span class=\"w\"> <\/span><span class=\"n\">a<\/span><span class=\"w\"> <\/span><span class=\"n\">command<\/span><span class=\"w\"> <\/span><span class=\"n\">message<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Status<\/span><span class=\"w\"> <\/span><span class=\"k\">Update<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\"> 0.06 \/ 0.14<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"n\">Locally<\/span><span class=\"w\"> <\/span><span class=\"n\">Finished<\/span><span class=\"w\"> <\/span><span class=\"o\">-<\/span><span class=\"w\"> <\/span><span class=\"p\">...<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Receive<\/span><span class=\"w\"> <\/span><span class=\"n\">a<\/span><span class=\"w\"> <\/span><span class=\"n\">command<\/span><span class=\"w\"> <\/span><span class=\"n\">message<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span 
class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Status<\/span><span class=\"w\"> <\/span><span class=\"k\">Update<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\"> 0.00 \/ 0.14<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"k\">Checkpoint<\/span><span class=\"w\"> <\/span><span class=\"n\">Established<\/span><span class=\"w\"> <\/span><span class=\"o\">-<\/span><span class=\"w\"> <\/span><span class=\"n\">ompi_global_snapshot_12563<\/span><span class=\"p\">.<\/span><span class=\"n\">ckpt<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Receive<\/span><span class=\"w\"> <\/span><span class=\"n\">a<\/span><span class=\"w\"> <\/span><span class=\"n\">command<\/span><span class=\"w\"> <\/span><span class=\"n\">message<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"nl\">orte_checkpoint<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"nl\">hnp_receiver<\/span><span class=\"p\">:<\/span><span class=\"w\"> <\/span><span class=\"n\">Status<\/span><span class=\"w\"> <\/span><span class=\"k\">Update<\/span><span class=\"p\">.<\/span> <span class=\"o\">[<\/span><span class=\"n\">dcbz:12570<\/span><span class=\"o\">]<\/span><span class=\"w\"> 
<\/span><span class=\"o\">[<\/span><span class=\"n\"> 0.00 \/ 0.14<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"n\">Continuing<\/span><span class=\"o\">\/<\/span><span class=\"n\">Recovered<\/span><span class=\"w\"> <\/span><span class=\"o\">-<\/span><span class=\"w\"> <\/span><span class=\"n\">ompi_global_snapshot_12563<\/span><span class=\"p\">.<\/span><span class=\"n\">ckpt<\/span><span class=\"w\"> <\/span><span class=\"n\">Snapshot<\/span><span class=\"w\"> <\/span><span class=\"k\">Ref<\/span><span class=\"p\">.<\/span><span class=\"err\">:<\/span><span class=\"w\"> <\/span><span class=\"mi\">0<\/span><span class=\"w\"> <\/span><span class=\"n\">ompi_global_snapshot_12563<\/span><span class=\"p\">.<\/span><span class=\"n\">ckpt<\/span> <\/code><\/pre>\n<\/div>\n<p><em>orte-checkpoint<\/em> tries to connect to the previously started <em>orterun<\/em> process and requests that a checkpoint should be taken. <em>orterun<\/em> outputs the following after receiving the checkpoint request:<\/p>\n<div class=\"highlight\">\n<pre><span><\/span><code>[dcbz:12565] crs:criu: checkpoint(12565, ---) [dcbz:12565] crs:criu: criu_init_opts() returned 0 [dcbz:12565] crs:criu: opening snapshot directory \/home\/adrian\/ompi_global_snapshot_12563.ckpt\/0\/opal_snapshot_0.ckpt [dcbz:12563] 12563: Checkpoint established for process [56676,0]. [dcbz:12563] 12563: Successfully restarted process [56676,0]. 
Process 12565 <\/code><\/pre>\n<\/div>\n<p>At this point, the checkpoint has been written to disk and the process continues running (printing its PID).<\/p>\n<p>For complete checkpoint\/restart support, I still have to implement the restart functionality in Open MPI and take care of the Unix domain sockets (shutting them down for checkpointing).<\/p>\n<p>All of this requires the latest criu package (criu-1.1-4), which includes the headers needed to build Open MPI against CRIU as well as the CRIU service.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Checkpoint\/restart with CRIU has been possible since Fedora 19, so I started adding CRIU support to Open MPI. With my commit 30772, it is now possible to checkpoint a process running under Open MPI. Restart is not implemented yet, but should be available soon. My test case (orte-test) prints its [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-627","post","type-post","status-publish","format-standard","hentry","category-luges"],"_links":{"self":[{"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/posts\/627","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/comments?post=627"}],"version-history":[{"count":8,"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/posts\/627\/revisions"}],"predecessor-version":[{"id":1995,"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/posts\/627\/revisions\/1995"}],"wp:attachment":[{"href"
:"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/media?parent=627"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/categories?post=627"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lisas.de\/luges\/index.php\/wp-json\/wp\/v2\/tags?post=627"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}