Monitoring job status
Monitoring the queue¶
In order to track jobs submitted to the Open Science Grid, a special pool needs to be specified for condor_status, condor_q, and friends:
Condor pool for OSG
osg-ligo-1.t2.ucsd.edu
This value is stored in the environment variable $IGWN_POOL
Example query for a workflow submitted to the OSG pool
$ condor_q -pool ${IGWN_POOL} -better-analyze 14607855.0
-- Schedd: ldas-osg.ligo.caltech.edu : <131.215.113.204:9618?...
The Requirements expression for job 14607855.000 is
((IS_GLIDEIN is true) && (HAS_CVMFS_LIGO_CONTAINERS is true)) && (TARGET.Arch == "X86_64") &&
(TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
(TARGET.HasFileTransfer)
Job 14607855.000 defines the following attributes:
RequestDisk = 1
RequestMemory = 4096
The Requirements expression for job 14607855.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[1] 229 HAS_CVMFS_LIGO_CONTAINERS is true
[9] 223 TARGET.Memory >= RequestMemory
[10] 7 [1] && [9]
No successful match recorded.
Last failed match: Fri Jun 14 08:30:28 2019
Reason for last match failure: no match found
14607855.000: Run analysis summary ignoring user priority. Of 233 machines,
201 are rejected by your job's requirements
2 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
30 are able to run your job
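The -pool option works the same way for condor_status. As an illustrative sketch, the following queries summarise the slots currently advertised to the IGWN pool; the constraint reuses the HAS_CVMFS_LIGO_CONTAINERS attribute seen in the requirements expression above:
$ condor_status -pool ${IGWN_POOL} -total
$ condor_status -pool ${IGWN_POOL} -constraint 'HAS_CVMFS_LIGO_CONTAINERS is true' -total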
Where are my jobs running?¶
Query the queue / history for where jobs are / were
To see where IGWN jobs are running (or ran), use condor_q (or condor_history) to query for the MATCH_EXP_JOB_GLIDEIN_Site attribute:
$ condor_q -run -af MATCH_EXP_JOB_GLIDEIN_Site
CNAF
GATech
IN2P3
KISTI
Lancaster
LIGO-CIT
LIGO-WA
MWT2
Nebraska
Omaha
PIC
SuperMIC
UChicago
undefined
Unknown
Wisconsin
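To tally how many running jobs are at each site, the same query can be combined with standard shell tools; this is a minimal sketch, and condor_history can be substituted for condor_q to summarise completed jobs instead:
$ condor_q -run -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c | sort -rn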
Job progress & intermediate output¶
When jobs run without a shared filesystem, using HTCondor file transfer, you cannot directly access files as if they were in your /home, which can make tracking progress and evaluating success difficult.
Instead, there are a number of HTCondor utilities to monitor the status and progress of jobs running in remote sandboxes.
Stream stdout/stderr¶
By default, jobs running without a shared filesystem wait until termination to write to the usual stdout and stderr files. For jobs which use stdout and stderr to print modest amounts of status information, you can enable streaming to update these files in near real-time.
HTCondor submit file with streaming stdout
Stream stdout and stderr in real time
universe = Vanilla
executable = /lalapps_somejob
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
log = example.log
error = example.err
output = example.out
stream_output = True
stream_error = True
queue 1
Note that streaming requires continuous, robust network connectivity to the jobs. There are rare instances where this may not work well.
condor_tail¶
If stdout is very large or streaming output is otherwise impractical, use condor_tail to view or follow data printed to stdout. This can also be used to print and follow the tail of any file in the job's sandbox to your stdout.
Use condor_tail interactively on the submit machine for your jobs / workflow. See: condor_tail --help.
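For example, to follow the stdout or stderr of the job from the earlier query as it runs (a file name can also be given as an extra argument to follow an arbitrary file in the sandbox):
$ condor_tail -follow 14607855.0
$ condor_tail -follow -stderr 14607855.0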
condor_ssh_to_job¶
In many cases, you can log in to the execute node and explore a job sandbox using condor_ssh_to_job interactively on the submit machine for your job. See: condor_ssh_to_job --help.
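For example, to open an interactive shell in the sandbox of the job from the earlier query (assuming it is still running and the execute point permits SSH-to-job):
$ condor_ssh_to_job 14607855.0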
condor_chirp¶
This provides a mechanism to access files and ClassAds from the executing job itself. That is, instead of accessing output from the submit side using condor_tail or condor_ssh_to_job, condor_chirp allows actions originating on the execute host.
This is designed to be used by a job itself and can, for example, be used to periodically transmit output data from the execution host back to the submit host, or inject status information into the job ClassAd, which would allow straightforward condor_q queries for status/progress information.
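As a rough sketch of that pattern (the wrapper script, the AnalysisProgress attribute, and the run_analysis.sh / progress.txt names are purely illustrative, and condor_chirp typically lives in HTCondor's libexec directory rather than on the default PATH):
#!/bin/bash
# Illustrative job wrapper: report progress back to the submit host.
# Locate condor_chirp, which is usually installed under LIBEXEC.
CHIRP="$(condor_config_val LIBEXEC)/condor_chirp"

"$CHIRP" set_job_attr AnalysisProgress '"starting"'

./run_analysis.sh &   # the real payload (illustrative name)
pid=$!

while kill -0 "$pid" 2>/dev/null; do
    sleep 600
    # Copy intermediate output back to the submit side and update status.
    "$CHIRP" put progress.txt progress.txt
    "$CHIRP" set_job_attr AnalysisProgress '"running"'
done

wait "$pid"
"$CHIRP" set_job_attr AnalysisProgress '"finished"'
On the submit side, the injected attribute can then be queried directly, e.g. condor_q -af AnalysisProgress 14607855.0. Depending on the HTCondor version, the job may also need +WantIOProxy = True in its submit description file for chirp to be available; check the condor_chirp documentation for your deployment.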
The easiest way to use this utility is via the Python client, which can be used from a wrapper script or natively in Python applications.
See: condor_chirp.