Monitoring job status¶
Where are my jobs running?¶
To see where jobs are running, you can query for the MATCH_EXP_JOB_GLIDEIN_Site job attribute.
Query the queue for where jobs are
To see where IGWN jobs are running, use condor_q to query for the MATCH_EXP_JOB_GLIDEIN_Site attribute (using some standard commands to sum the counts):
$ condor_q -run -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c
126 ComputeCanada-Cedar
1 ISI
3 LSU-SuperMIC
8 ND_CAMLGPU
41 NEMO
58 undefined
152 Unknown
44 USdC
Query the history for where jobs ran
To see where jobs ran in the past, use condor_history instead:
$ condor_history -limit 100 -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c
17 ComputeCanada-Cedar
3 PIC
1 Swan
3 undefined
6 Unknown
70 USdC
Tip: use -limit to stop condor_history from searching through the entire job history, which can be very slow to return.
Job progress & intermediate output¶
When jobs run without a shared filesystem, using HTCondor file transfer, you cannot access files directly as if they were in your /home directory, which can make tracking progress and evaluating success difficult.
Instead, there are a number of HTCondor utilities to monitor the status and progress of jobs running in remote sandboxes.
Stream stdout/stderr¶
By default, jobs running without a shared filesystem wait until termination to write to the usual stdout and stderr files. For jobs which use stdout and stderr to print modest amounts of status information, you can enable streaming to update these files in near real-time.
HTCondor submit file: stream stdout and stderr in real time
executable = ./lalapps_somejob
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
log = example.log
error = example.err
output = example.out
# stream stdout/stderr back to the submit machine as the job runs
stream_output = True
stream_error = True
request_disk = 10GB
request_memory = 6GB
accounting_group = igwn.dev.o5.compsoft.lalapps.nobelwin
queue 1
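With streaming enabled, the files named by output and error are updated on the submit machine as the job runs, so you can follow them with standard tools, for example:
$ tail -f example.out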
Note that streaming requires continuous, robust network connectivity to the jobs. There are rare instances where this may not work well.
condor_tail¶
If stdout is very large or streaming output is otherwise impractical, use condor_tail to view or follow data printed to stdout. It can also print and follow the tail of any other file in the job's sandbox, writing it to your own stdout.
Use condor_tail interactively on the submit machine for your jobs/workflow. See: condor_tail --help.
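For example, a minimal sketch (the job ID 123.0 and the file name progress.txt are placeholders for your own job):
$ condor_tail -follow 123.0            # follow the job's stdout as it grows
$ condor_tail -stderr 123.0            # view the job's stderr instead
$ condor_tail 123.0 progress.txt       # print the tail of another sandbox file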
condor_ssh_to_job¶
In many cases, you can log in to the execute node and explore a job's sandbox using condor_ssh_to_job interactively on the submit machine for your job. See: condor_ssh_to_job --help.
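For example, a minimal sketch (the job ID 123.0 is a placeholder):
$ condor_ssh_to_job 123.0
# you now have a shell in the job's sandbox on the execute node
$ ls -l
$ tail example.out
$ exit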
condor_chirp¶
This provides a mechanism to access files and ClassAds from the executing job itself. That is, instead of accessing output from the submit side using condor_tail or condor_ssh_to_job, condor_chirp allows actions originating on the execute host.
This is designed to be used by a job itself and can, for example, be used to periodically transmit output data from the execution host back to the submit host, or inject status information into the job's ClassAd, which would allow straightforward condor_q queries for status/progress information.
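As a sketch of this pattern using the condor_chirp command-line tool, a wrapper script around the payload could look like the following (the attribute name JobProgress, the payload name, and the file names are hypothetical; the update interval is arbitrary):
Example wrapper script using condor_chirp
#!/bin/bash
# Hypothetical wrapper: run the real payload in the background,
# then periodically report progress while it runs.
./lalapps_somejob &
pid=$!
while kill -0 "$pid" 2>/dev/null; do
    sleep 600
    # inject a (hypothetical) numeric progress attribute into the job ClassAd;
    # visible from the submit machine via: condor_q -af JobProgress
    condor_chirp set_job_attr JobProgress "$(wc -l < progress.txt)"
    # copy an intermediate output file back to the submit side
    condor_chirp put progress.txt progress.txt
done
wait "$pid"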
The easiest way to use this utility is via the Python client, which can be used from a wrapper script or natively in Python applications.
See: condor_chirp.