Monitoring job status¶
Where are my jobs running?¶
To see where jobs are running, you can query for the
MATCH_EXP_JOB_GLIDEIN_Site job attribute.
Query the queue for where jobs are running
To see where IGWN jobs are running, use
condor_q to query for the
MATCH_EXP_JOB_GLIDEIN_Site attribute (using some standard commands to sum the counts):
$ condor_q -run -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c
    126 ComputeCanada-Cedar
      1 ISI
      3 LSU-SuperMIC
      8 ND_CAMLGPU
     41 NEMO
     58 undefined
    152 Unknown
     44 USdC
Query the history for where jobs ran
To see where jobs ran in the past, use
condor_history to query for the same attribute:

$ condor_history -limit 100 -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c
     17 ComputeCanada-Cedar
      3 PIC
      1 Swan
      3 undefined
      6 Unknown
     70 USdC

Use -limit to stop
condor_history from attempting to search through every job ever recorded, which can be very slow to return.
Job progress & intermediate output¶
When jobs run without a shared filesystem, using HTCondor file transfer, you cannot directly access their files as if they were under your
/home, which can make tracking progress and evaluating success difficult.
Instead, there are a number of HTCondor utilities to monitor the status and progress of jobs running in remote sandboxes.
By default, jobs running without a shared filesystem wait until termination to write back the usual
stdout and stderr files. For jobs which use
stdout or stderr to print modest amounts of status information, you can enable streaming to update these files in near real-time.
HTCondor submit file streaming stdout and stderr in real time

executable = ./lalapps_somejob
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
log = example.log
error = example.err
output = example.out
stream_output = True
stream_error = True
request_disk = 10GB
request_memory = 6GB
accounting_group = igwn.dev.o5.compsoft.lalapps.nobelwin
queue 1
Note that streaming requires continuous, robust network connectivity to the jobs. There are rare instances where this may not work well.
If stdout is very large, or streaming output is otherwise impractical, use
condor_tail to view or follow data printed to
stdout. It can also print and follow the tail of any other file in the job's sandbox, writing the contents to your own stdout.
Run condor_tail interactively on the submit machine for your jobs/workflow; see the condor_tail man page for the full set of options.
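For example, a minimal sketch of some common invocations (the job ID 123.0 and the sandbox file name status.txt are placeholders):

$ condor_tail -follow 123.0              # follow the job's stdout in real time
$ condor_tail -stderr -no-stdout 123.0   # tail stderr instead of stdout
$ condor_tail 123.0 status.txt           # tail an arbitrary file in the sandbox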
In many cases, you can log in to the execute node and explore a job's sandbox by running
condor_ssh_to_job interactively on the submit machine.
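For example (again using a placeholder job ID of 123.0, and a hypothetical status.txt written by the job):

$ condor_ssh_to_job 123.0
# ...an interactive shell opens in the job's sandbox on the execute node:
$ ls
$ tail -f status.txt
$ exit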
condor_chirp provides a mechanism to access files and ClassAds from the executing job itself. That is, instead of accessing output from the submit side (as
condor_tail does), condor_chirp allows actions originating on the execute host.
It is designed to be used by the job itself and can, for example, periodically transmit output data from the execution host back to the submit host, or inject status information into the job's ClassAd, allowing straightforward
condor_q queries for status/progress information.
The easiest way to use this utility is via its Python client, which can be called from a wrapper script or used natively in Python applications.
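As an illustration, here is a minimal wrapper-script sketch using the condor_chirp command-line tool instead of the Python client; the payload, the progress.txt file, and the ProgressInfo attribute name are all hypothetical, and depending on your HTCondor version the submit file may also need +WantIOProxy = True for chirp to be available:

#!/bin/bash
# Hypothetical wrapper, run as the HTCondor job executable on the execute host.
./lalapps_somejob &    # payload, assumed to write progress.txt as it runs
pid=$!

while kill -0 "$pid" 2> /dev/null; do
    sleep 300
    if [ -f progress.txt ]; then
        # copy intermediate output back to the submit-side job directory
        condor_chirp put progress.txt progress.txt
        # inject a (quoted) status string into the job ClassAd
        condor_chirp set_job_attr ProgressInfo "\"$(tail -n 1 progress.txt)\""
    fi
done
wait "$pid"

The injected attribute can then be queried from the submit side with, e.g., condor_q 123.0 -af ProgressInfo.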