Monitoring job status

Monitoring the queue

To track jobs submitted to the Open Science Grid, a special pool needs to be specified for condor_status, condor_q, and friends:

Condor pool for OSG

osg-ligo-1.t2.ucsd.edu

This value is stored in the environment variable $IGWN_POOL
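
For example, to get a quick summary of the slots available in this pool (a sketch; condor_status -total prints only the summary totals, and the numbers will vary):

$ condor_status -pool ${IGWN_POOL} -total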

Example query for workflow submitted to the OSG pool

$ condor_q -pool ${IGWN_POOL} -better-analyze 14607855.0


-- Schedd: ldas-osg.ligo.caltech.edu : <131.215.113.204:9618?...
The Requirements expression for job 14607855.000 is

    ((IS_GLIDEIN is true) && (HAS_CVMFS_LIGO_CONTAINERS is true)) && (TARGET.Arch == "X86_64") &&
    (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.HasFileTransfer)

Job 14607855.000 defines the following attributes:

    RequestDisk = 1
    RequestMemory = 4096

The Requirements expression for job 14607855.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[1]         229  HAS_CVMFS_LIGO_CONTAINERS is true
[9]         223  TARGET.Memory >= RequestMemory
[10]          7  [1] && [9]

No successful match recorded.
Last failed match: Fri Jun 14 08:30:28 2019

Reason for last match failure: no match found

14607855.000:  Run analysis summary ignoring user priority.  Of 233 machines,
    201 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     30 are able to run your job

Where are my jobs running?

Query the queue / history for where jobs are / were

To see where IGWN jobs are running (or ran), use condor_q (or condor_history) to query the MATCH_EXP_JOB_GLIDEIN_Site attribute:

$ condor_q -run -af MATCH_EXP_JOB_GLIDEIN_Site
CNAF
GATech
IN2P3
KISTI
Lancaster
LIGO-CIT
LIGO-WA
MWT2
Nebraska
Omaha
PIC
SuperMIC
UChicago
undefined
Unknown
Wisconsin
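
The same attribute is available for completed jobs via condor_history; for example (a sketch, piping through standard shell tools to count jobs per site):

$ condor_history -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c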

Job progress & intermediate output

When jobs run without a shared filesystem, using HTCondor file transfer, you cannot directly access their files as if they were under your /home, which can make tracking progress and evaluating success difficult.

Instead, there are a number of HTCondor utilities to monitor status and progress of jobs running in remote sandboxes.

Stream stdout/stderr

By default, jobs running without a shared filesystem wait until termination before the usual stdout and stderr files are written back. For jobs which use stdout and stderr to print modest amounts of status information, you can enable streaming so that these files are updated in near real time.

HTCondor submit file with streaming stdout

Stream stdout and stderr in real time

universe = Vanilla
executable = /lalapps_somejob
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
log = example.log
error = example.err
output = example.out
stream_output = True
stream_error = True
queue 1

Note that streaming requires continuous, robust network connectivity to the jobs. There are rare instances where this may not work well.

condor_tail

If stdout is very large or streaming output is otherwise impractical, use condor_tail to view or follow data printed to stdout. It can also print and follow the tail of any other file in the job's sandbox, writing it to your terminal.

Use condor_tail interactively on the submit machine for your jobs / workflow. See: condor_tail --help.
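
For example, to follow the stdout of the job from the earlier query (a sketch; the -follow option keeps polling the file, similar to tail -f):

$ condor_tail -follow 14607855.0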

condor_ssh_to_job

In many cases, you can log in to the execute node and explore a job sandbox using condor_ssh_to_job interactively on the submit machine for your job.

See: condor_ssh_to_job --help.
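
For example, to open an interactive shell in the remote sandbox of the job from the earlier query:

$ condor_ssh_to_job 14607855.0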

condor_chirp

This provides a mechanism to access files and ClassAds from the executing job itself. That is, instead of accessing output from the submit side using condor_tail or condor_ssh_to_job, condor_chirp allows actions originating on the execute host.

This is designed to be used by a job itself and can, for example, be used to periodically transmit output data from the execution host back to the submit host, or to inject status information into the job ClassAd, allowing straightforward condor_q queries for status/progress information.
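
As a sketch of what a job-side wrapper script might do (the attribute name ProgressFraction and the file name partial.dat are placeholders; condor_chirp only works from inside a running job, and typically requires +WantIOProxy = True in the submit description):

# on the execute host, from the job's wrapper script
condor_chirp set_job_attr ProgressFraction 0.5
condor_chirp put partial.dat partial.dat

The injected attribute can then be queried from the submit side, e.g. with condor_q -af ProgressFraction.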

The easiest way to use this utility is via the Python client, which can be called from a wrapper script or used natively in Python applications.

See: condor_chirp.