Checkpointing long-running jobs

Because some of the resources on the IGWN Grid are opportunistic, and slots can be reclaimed for higher-priority jobs at any time, there is little guarantee that any one job will run for much longer than about an hour without interruption. In practice, many jobs run for several hours, but to avoid losing work, long-running applications should adopt HTCondor's self-checkpointing paradigm:

  • Applications should periodically save progress to their own checkpoint file.
  • Once the save is complete, applications should exit with a dedicated exit code (neither 0 nor 1), which must also be specified in the job submit file.
  • HTCondor traps this specific exit, copies the checkpoint file back to the submit host and immediately resumes the job, with no further user-level interaction.
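
The application side of the steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the state.json file name, the placeholder "work" in the loop, and the exit code 85 are all assumptions chosen for the example. The only real requirement is that the exit code matches the one given in the submit file.

```python
import json
import os
import time

# Assumed values for illustration only.
CHECKPOINT_EXIT_CODE = 85      # must match the value given in the submit file
CHECKPOINT_FILE = "state.json" # hypothetical checkpoint file name
CHECKPOINT_INTERVAL = 3600     # seconds between checkpoints


def load_state(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Write to a temporary file, then rename it over the checkpoint.

    The rename is atomic, so an interruption during the save never
    leaves a half-written checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT_FILE, interval=CHECKPOINT_INTERVAL):
    """Resume from the checkpoint (if any) and return the process exit code."""
    state = load_state(path)
    started = time.monotonic()
    while state["step"] < n_steps:
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        unfinished = state["step"] < n_steps
        if unfinished and time.monotonic() - started >= interval:
            save_state(state, path)
            return CHECKPOINT_EXIT_CODE  # ask HTCondor to checkpoint and resume
    return 0  # normal completion
```

The job's entry point would then end with something like sys.exit(run(n_steps=...)). In the submit file, the HTCondor commands checkpoint_exit_code (set to the same value, 85 in this sketch) and transfer_checkpoint_files (listing state.json) are the counterparts of this code; consult the HTCondor documentation for the exact submit-file requirements.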

Full documentation is available in the HTCondor manual, under self-checkpointing applications.

Note that we do not recommend that applications rely on trapping termination signals themselves: purely periodic self-checkpointing is the most robust way to minimise loss of work.

As described in the documentation, the appropriate checkpoint interval is application-specific and depends on, for example, how quickly the application can resume. In practice, an interval of 1 hour is usually reasonable.