Self-Checkpointing Exercise 1.1: Trying It Out

The goal of this exercise is to practice writing a submit file for self-checkpointing, and to see the process in action.

Calculating Fibonacci numbers … slowly

The sample code for this exercise calculates the Fibonacci number for a given number of iterations. Because this is a trivial computation, the code includes a delay in each iteration of the main loop to simulate a more intensive computation.

To get set up:

Log in to learn.chtc.wisc.edu (login05 is fine too, but if you use it, skip the condor_ssh_to_job step below)
Create and change into a new directory for this exercise

Download the Python script that is the main executable for this exercise:

user@server $ wget https://raw.githubusercontent.com/osg-htc/user-school-2022/main/src/checkpointing/fibonacci.py

If you want to run the script directly, make it executable first:
```
user@server $ chmod 0755 fibonacci.py
```

Take a look at the code, if you like. It is not very elegant, but it gets the job done.

A few notes:

The script takes a single argument, the number of iterations to run. A good choice is 10 iterations, which keeps computing time short while leaving you time to explore.
The script checkpoints every other iteration through the main loop. The exit status code for a checkpoint is 85.
It prints some output to standard out along the way, to let you know what is going on.
The final result is written to a separate file named fibonacci.result. This file does not exist until the very end of the complete run.
It is safe to run from the command line on an access point:
```
user@server $ ./fibonacci.py 10
```
If you run it, what happens? (Be patient: each iteration includes a 30-second delay.) Can you explain its behavior? What happens if you run it again, without changing any files in between? Why?
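To make the notes above concrete, here is a minimal sketch of what an exit-driven self-checkpointing loop can look like. It is not the actual fibonacci.py: the file names and the exit code 85 come from the notes above, but the JSON checkpoint format, the function names, and the `workdir` and `delay_seconds` parameters are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Illustrative sketch of exit-driven self-checkpointing (NOT the real fibonacci.py)."""
import json
import os
import sys
import time

CHECKPOINT_FILE = "fibonacci.checkpoint"  # name mentioned in the exercise
RESULT_FILE = "fibonacci.result"          # name mentioned in the exercise
CHECKPOINT_EXIT_CODE = 85                 # exit status used for a checkpoint
CHECKPOINT_EVERY = 2                      # "every other iteration"


def load_state(path):
    """Return (iteration, a, b); start from scratch if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # JSON layout is an assumption of this sketch
        return state["iteration"], state["a"], state["b"]
    return 0, 0, 1


def run(total_iterations, delay_seconds=30.0, workdir="."):
    """Advance the computation and return the exit code the process should use."""
    checkpoint = os.path.join(workdir, CHECKPOINT_FILE)
    result = os.path.join(workdir, RESULT_FILE)
    iteration, a, b = load_state(checkpoint)

    while iteration < total_iterations:
        time.sleep(delay_seconds)  # stand-in for a more intensive computation
        a, b = b, a + b
        iteration += 1
        print(f"iteration {iteration}: value {a}")

        if iteration < total_iterations and iteration % CHECKPOINT_EVERY == 0:
            # Save progress, then stop with the special checkpoint status.
            with open(checkpoint, "w") as f:
                json.dump({"iteration": iteration, "a": a, "b": b}, f)
            return CHECKPOINT_EXIT_CODE

    # Only a complete run writes the result file.
    with open(result, "w") as f:
        f.write(f"{a}\n")
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(run(int(sys.argv[1])))
```

In this sketch, each invocation picks up from the saved checkpoint, does two more iterations, and exits with status 85 until the final invocation writes the result and exits with 0. Whether the real script behaves the same way in every detail is something to confirm by reading and running it.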

Preparing to run

Now you have an executable and you know how to run it. It is time to prepare it for submission to HTCondor!

Using what you know about the script (above), and using information in the slides from today, try writing a submit file that runs this software and implements exit-driven self-checkpointing. The Python code itself is ready and should not need any changes.

Just use a plain queue statement; one job is enough to experiment on.

Before you submit, read the next section!

Running and monitoring

With the 30-second delay per iteration in the code and the suggested 10 iterations, once the script starts running you have about 5 minutes of runtime in which to see what is going on. So it may help to read through this section and then return here and submit your job.
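The timing estimate above is simple arithmetic, sketched here using the numbers from this exercise. The variable names are illustrative, and the figures ignore any scheduling or file-transfer overhead.

```python
DELAY_SECONDS = 30     # per-iteration delay stated in the exercise
ITERATIONS = 10        # suggested number of iterations
CHECKPOINT_EVERY = 2   # the script checkpoints every other iteration

total_seconds = DELAY_SECONDS * ITERATIONS
checkpoint_interval = DELAY_SECONDS * CHECKPOINT_EVERY

print(f"total compute time: about {total_seconds // 60} minutes")
print(f"time between checkpoints: about {checkpoint_interval} seconds")
```

So a checkpoint should come roughly once a minute while the job is running.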

If your job has problems or finishes before you have the chance to do all the steps below, just remove the extra files (besides the Python script and your submit file) and try again!

Submission and first checkpoint

Submit the job
Look at the contents of the submit directory — what changed?
Start watching the log file: tail -n 100 -f YOUR-LOG-FILENAME.log

Be patient! As HTCondor adds lines to the end of your log file, they will appear automatically, so nothing much will happen until HTCondor starts running your job. When it does, three messages will appear in the log file in quick succession:

Started transferring input files
Finished transferring input files
Job executing on host:

(Of course, each message will contain a lot of other characters!)

Now wait about 1 minute, and you should see two more messages appear:

Started transferring output files
Finished transferring output files

That is the first checkpoint happening!

Viewing the running job (only on `learn`)

Once you see those messages, let’s look at the running job on the execute point (note: these steps will work only on CHTC):

Press Control-C (Ctrl-c or ^C) to exit the tail command
Run condor_q to get the job ID of the running job
Run condor_ssh_to_job JOB_ID, where you replace JOB_ID with your job ID from above
Once you have logged in to the execute point, run some commands (somewhat quickly!):
1. Where am I? hostname and then pwd
2. What is here? Run ls -lF — is that what you expected?
3. Do you see the fibonacci.checkpoint file? What is in it? Is that what you expected?
4. Wait about 1 more minute and look at the checkpoint file again. Did it change?
5. Run logout to leave the execute point and return to the access point

As you may have guessed, condor_ssh_to_job logged you in to the execute point of your job and changed into its execute directory. You were looking at the job as it ran!

Forcing your job to stop running

Now, assuming that your job is still running (check condor_q again), you can force HTCondor to evict your job, that is, to stop it on the execute point before it finishes:

Run condor_q to get the job ID of the running job
Run condor_vacate_job JOB_ID, where you replace JOB_ID with your job ID from above
Monitor the action again by running tail -n 100 -f YOUR-LOG-FILENAME.log

Finishing the job and wrap-up

Be patient again! You evicted your running job, so HTCondor put it back in the queue as idle. If you wait a minute or two, you should see HTCondor start running the job again.

In the log file, look carefully for the two Job executing on host: messages. Does it seem like you ran on the same computer again or on a different one? Both are possible!
Let your job finish running this time. There should be a Job terminated of its own accord message near the end.
Did you get results? Go through all the files and see what they contain. The log and output files are probably the most interesting. But did you get a result file, too?
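If you would rather not scan the log by eye, a small script can pull out the relevant lines. This is a minimal sketch: it assumes only that each relevant log line contains the phrase Job executing on host: followed by host information, as quoted above. The function name is made up for illustration.

```python
import sys

MARKER = "Job executing on host:"  # phrase quoted in the exercise text


def executing_hosts(log_lines):
    """Return whatever text follows the marker on each matching log line."""
    hosts = []
    for line in log_lines:
        if MARKER in line:
            hosts.append(line.split(MARKER, 1)[1].strip())
    return hosts


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 this_script.py YOUR-LOG-FILENAME.log
    with open(sys.argv[1]) as log:
        for n, host in enumerate(executing_hosts(log), start=1):
            print(f"run {n}: {host}")
```

If the two printed entries differ, your job most likely restarted on a different execute point after the eviction.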

Did the output file — that is, whatever file you named in the output line of your submit file — contain everything that you expected it to?

Conclusion

This has been a brief and simple tour of self-checkpointing. If you would like to learn more, read the Self-Checkpointing Applications section of the HTCondor Manual, talk to School staff, or contact [email protected] for further help at any time.