
Batch computing with HTCondor

This page describes how to use HTCondor to process large workflows on the IGWN Computing Grid.

Overview

HTCondor is a batch job scheduler: you provide it with a description of the discrete work units you want accomplished, and it distributes them fairly across the available computing resources.

For more details on basic concepts, including how to submit jobs, manage jobs, and manage workflows, please refer to the HTCondor Users' Manual.
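For example, a typical cycle involves writing a submit description file, submitting it, and then monitoring the resulting jobs (a minimal sketch; the file name and job ID are illustrative):

$ condor_submit example.sub        # submit the jobs described in example.sub
$ condor_q                         # list your jobs in the queue
$ condor_q -better-analyze 123.0   # explain why a particular job has not started
$ condor_rm 123.0                  # remove a job that is no longer needed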

Using HTCondor on the IGWN computing grid

The IGWN Computing Grid uses a distributed computing model: computing centres are spread across geographical locations, but the user experience should be much the same at all of them. To maximise the availability of hosts on which batch jobs can run, the IGWN Computing Grid centres are all connected to a wider network of computing centres; in Europe these are coordinated by the European Grid Infrastructure, and in the United States by the Open Science Grid.

Workflow requirements

In order for workflows to be maximally relocatable from one resource to another, they should avoid requiring configuration that isn't available as standard, including access to specific shared filesystems or to the world wide web.

Re-architecting workflows so that they do not rely on non-standard configuration will take work, and may be difficult, but the benefits will likely outweigh the costs in the long term.

Accounting information

In order to keep track of how much computing is used for IGWN data processing, all submitted jobs must be tagged with an accounting group identifying the analysis being performed.

All jobs must include an accounting_group argument in their HTCondor submit file referencing a pre-approved tag, e.g.:

universe = vanilla
executable = /usr/bin/od
arguments = -An -N2 -i
input = /dev/random
output = /dev/null
accounting_group = ligo.prod.s6.burst.snews.raven
queue 1

Additionally, if the jobs are submitted from a shared user account, e.g. cbc, then accounting_group_user must be included to indicate which human being is responsible for those jobs, e.g.:

accounting_group = ligo.prod.s6.burst.snews.raven
accounting_group_user = marie.curie

For more details on accounting tags, and how to select one, see https://accounting.ligo.org/user, and for accounting data summaries, see https://accounting.ligo.org.
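Putting the two together, a complete submit file for the example job above, submitted from a shared account, might look like this (the accounting tag and user name are the illustrative values used above):

universe = vanilla
executable = /usr/bin/od
arguments = -An -N2 -i
input = /dev/random
output = /dev/null
# pre-approved tag identifying the analysis being performed
accounting_group = ligo.prod.s6.burst.snews.raven
# the person responsible for jobs run from the shared account
accounting_group_user = marie.curie
queue 1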

Condor file transfers

When working without a shared filesystem, you should use HTCondor's file transfer mechanism; see the file transfer section of the HTCondor Users' Manual for details.
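For example, a job that needs an input file staged to the execute host, and whose output should be returned to the submit host when it finishes, might include submit commands along these lines (a minimal sketch; the file names are illustrative):

# use HTCondor-managed file transfer instead of a shared filesystem
should_transfer_files = YES
# return output files when the job exits
when_to_transfer_output = ON_EXIT
# comma-separated list of inputs to copy to the execute host
transfer_input_files = config.ini, data.txt
# optionally restrict which outputs are copied back
transfer_output_files = results.txt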

Submitting to the Open Science Grid

Access points

Workflows can be submitted to the Open Science Grid only from specific machines. For IGWN members, those machines are:

Hostname Location
ldas-osg.ligo.caltech.edu Caltech (LIGO)

Connecting to a submit host follows the same directions as for other collaboration computing resources; both ssh and gsissh access are supported.

Each submit host is configured to connect to the underlying HTCondor workload manager; any computing task you wish to run on the IGWN pool should be submitted from one of these hosts.
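For example, to log in to the Caltech submit host (albert.einstein is a placeholder; use your own credentials):

$ ssh albert.einstein@ldas-osg.ligo.caltech.edu    # standard SSH access
$ gsissh ldas-osg.ligo.caltech.edu                 # GSI (grid-credential) access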

Condor configuration

Condor submit files must be modified to include the following directives:

+OpenScienceGrid = True
requirements = (IS_GLIDEIN=?=True)

Additional requirements may be given to indicate other resources needed by your job(s). The following table describes each requirement expression and what it means for the target execute host:

Requirement Description
(HAS_CVMFS_LIGO_CONTAINERS=?=True) the ligo-containers.opensciencegrid.org CVMFS repository is available
(HAS_LIGO_FRAMES=?=True) GWF data files are available via CVMFS (paths as returned from datafind.ligo.org:443, see CVMFS data discovery)
(HAS_SINGULARITY=?=True) jobs can run within Singularity containers

Specifying multiple requirements to HTCondor

Multiple requirements should be combined using the logical AND (&&) operator:

requirements = (IS_GLIDEIN=?=True) && (HAS_SINGULARITY=?=True)
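Putting the OSG directives together with the accounting and file-transfer settings described above, a complete submit file might look like the following sketch (the executable, input file, and chosen requirements are illustrative; adjust them for your own workflow):

universe = vanilla
executable = my_analysis.sh
# route the job to the Open Science Grid
+OpenScienceGrid = True
# only match glide-in slots that can run Singularity containers
requirements = (IS_GLIDEIN=?=True) && (HAS_SINGULARITY=?=True)
# no shared filesystem: use HTCondor file transfer
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = config.ini
# accounting information (see above)
accounting_group = ligo.prod.s6.burst.snews.raven
accounting_group_user = marie.curie
output = job.out
error = job.err
log = job.log
queue 1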

Desired sites

You can (but are not required to) specify which site(s) you would like your jobs to run on using the +DESIRED_Sites condor directive in your submit file:

+DESIRED_Sites = "LIGO-CIT"

Declaring multiple +DESIRED_Sites

Multiple desired sites can be declared as a comma-separated list:

+DESIRED_Sites = "LIGO-CIT,GATech"

Please select sites from the following list:

Label Location
BNL Brookhaven National Lab. (USA)
CCIN2P3 IN2P3, Lyon (France)
CNAF INFN (Italy)
GATech Georgia Tech (USA)
KISTI KISTI (Korea)
LIGO-CIT Caltech (USA)
LIGO-LHO LIGO Hanford Observatory (USA)
LIGO-LLO LIGO Livingston Observatory (USA)
NIKHEF Nikhef (Netherlands)
QB2 LSU (USA)
RAL RAL (UK)
SDSC-PRP UCSD (USA)
SU-ITS Syracuse (USA)
SuperMIC LSU (USA)
UChicago Univ. of Chicago (USA)
UCSD UCSD (USA)

Undesired sites

To exclude one or more sites, use the +UNDESIRED_Sites directive in the same way, e.g.:

+UNDESIRED_Sites = "GATech"

Tracking jobs in the OSG condor pool

In order to track jobs submitted to the Open Science Grid, a special pool must be specified to condor_status, condor_q, and friends:

Condor pool for OSG

osg-ligo-1.t2.ucsd.edu
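For example, to check your jobs and summarise the available execute slots via the OSG pool (a minimal sketch using the standard -pool option):

$ condor_q -pool osg-ligo-1.t2.ucsd.edu              # query job status via the OSG pool's collector
$ condor_status -pool osg-ligo-1.t2.ucsd.edu -total  # summarise execute slots advertised in the pool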

Example query for workflow submitted to the OSG pool

$ condor_q -pool osg-ligo-1.t2.ucsd.edu -better-analyze 14607855.0


-- Schedd: ldas-osg.ligo.caltech.edu : <131.215.113.204:9618?...
The Requirements expression for job 14607855.000 is

    ((IS_GLIDEIN is true) && (HAS_CVMFS_LIGO_CONTAINERS is true)) && (TARGET.Arch == "X86_64") &&
    (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.HasFileTransfer)

Job 14607855.000 defines the following attributes:

    RequestDisk = 1
    RequestMemory = 4096

The Requirements expression for job 14607855.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[1]         229  HAS_CVMFS_LIGO_CONTAINERS is true
[9]         223  TARGET.Memory >= RequestMemory
[10]          7  [1] && [9]

No successful match recorded.
Last failed match: Fri Jun 14 08:30:28 2019

Reason for last match failure: no match found

14607855.000:  Run analysis summary ignoring user priority.  Of 233 machines,
    201 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     30 are able to run your job

Pegasus Workflow Management System

Pegasus is a workflow management system that allows you to automate, recover, and debug large-scale workflows. On IGWN Computing Grid resources, Pegasus acts as an interface to HTCondor, and so can be used in place of writing an HTCondor workflow directly, including for submissions through the Open Science Grid.