Monitoring job status

Where are my jobs running?

To see where jobs are running, you can query for the MATCH_EXP_JOB_GLIDEIN_Site job attribute.

Query the queue for where jobs are

To see where IGWN jobs are running, use condor_q to print the MATCH_EXP_JOB_GLIDEIN_Site attribute for each running job, piping the output through sort and uniq to count the jobs at each site:

$ condor_q -run -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c
    126 ComputeCanada-Cedar
      1 ISI
      3 LSU-SuperMIC
      8 ND_CAMLGPU
     41 NEMO
     58 undefined
    152 Unknown
     44 USdC
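
To list the individual jobs matched to a particular site, you can constrain on the same attribute. As a minimal sketch, using the NEMO site from the output above (JobStatus == 2 selects running jobs):

$ condor_q -constraint 'JobStatus == 2 && MATCH_EXP_JOB_GLIDEIN_Site == "NEMO"' -af ClusterId ProcId RemoteHost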

Query the history for where jobs ran

To see where jobs ran in the past, use condor_history instead:

$ condor_history -limit 100 -af MATCH_EXP_JOB_GLIDEIN_Site | sort | uniq -c
     17 ComputeCanada-Cedar
      3 PIC
      1 Swan
      3 undefined
      6 Unknown
     70 USdC

Tip: use -limit to stop condor_history from searching through the entire job history, which can be very slow.
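
The same pattern extends to other attributes. For example, to break the counts down by exit status as well (ExitCode is only defined for jobs that exited on their own, so removed or signalled jobs will show it as undefined):

$ condor_history -limit 100 -af MATCH_EXP_JOB_GLIDEIN_Site ExitCode | sort | uniq -c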

Job progress & intermediate output

When jobs run without a shared filesystem and rely on HTCondor file transfer, you cannot browse their files directly as if they were under your /home, which can make it difficult to track progress and evaluate success.

Instead, there are a number of HTCondor utilities to monitor status and progress of jobs running in remote sandboxes.

Stream stdout/stderr

By default, jobs running without a shared filesystem wait until termination to write back the usual stdout and stderr files. For jobs that print modest amounts of status information to stdout and stderr, you can enable streaming to update these files in near real time.

Example HTCondor submit file streaming stdout and stderr in real time:

executable = ./lalapps_somejob

should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

log = example.log
error = example.err
output = example.out
stream_output = True
stream_error = True

request_disk = 10GB
request_memory = 6GB

accounting_group = igwn.dev.o5.compsoft.lalapps.nobelwin

queue 1
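
With streaming enabled, example.out and example.err are updated in place on the submit machine, so you can follow them with standard tools, for example:

$ tail -f example.out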

Note that streaming requires continuous, robust network connectivity between the submit machine and the running jobs, so in rare cases, for example on execute sites with restricted or unreliable networking, it may not work well.

condor_tail

If stdout is very large, or streaming output is otherwise impractical, use condor_tail to view or follow data printed to stdout by a running job. It can also print and follow the tail of any other file in the job's sandbox, writing it to your terminal.

Use condor_tail interactively on the submit machine for your jobs / workflow. See: condor_tail --help.
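
For example, to follow the stdout of a running job, or a named file in its sandbox (the job ID 123456.0 and the file name progress.log are placeholders for your own):

$ condor_tail -f 123456.0
$ condor_tail -f -no-stdout 123456.0 progress.log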

condor_ssh_to_job

In many cases, you can log in to the execute node and explore a job's sandbox interactively using condor_ssh_to_job, run from the submit machine for your job.

See: condor_ssh_to_job --help.
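
For example, to open a shell in the remote sandbox of a running job and inspect it (the job ID and file name are placeholders; the ls and tail commands run inside the remote session):

$ condor_ssh_to_job 123456.0
$ ls -l
$ tail -f progress.log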

condor_chirp

This provides a mechanism to access files and ClassAds from the executing job itself. That is, instead of accessing output from the submit side using condor_tail or condor_ssh_to_job, condor_chirp allows actions originating on the execute host.

This is designed to be used by a job itself and can, for example, be used to periodically transmit output data from the execution host back to the submit host, or inject status information into the job ClassAd description, which would allow straightforward condor_q queries for status/progress information.
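
For example, a wrapper script around the payload could use condor_chirp to record a stage marker in the job ClassAd and copy an intermediate file back to the submit side. This is only a sketch: the attribute name ProgressStage and the file name intermediate.dat are illustrative, it assumes condor_chirp is on the execute node's PATH, and some pools may also require the job to request the I/O proxy (+WantIOProxy = True) in its submit file:

#!/bin/bash
# Record the current stage in the job ClassAd (string values need ClassAd quotes)
condor_chirp set_job_attr ProgressStage '"running"'
./lalapps_somejob
# Copy an intermediate result from the sandbox back to the submit-side job directory
condor_chirp put intermediate.dat intermediate.dat
condor_chirp set_job_attr ProgressStage '"finished"'

The injected attribute can then be queried from the submit machine with, for example, condor_q -af ProgressStage.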

The easiest way to use this utility is via the Python client, which can be called from a wrapper script or used natively in Python applications.

See: condor_chirp.