Batch computing with HTCondor¶
This page describes how to use HTCondor to process large workflows on the LIGO-Virgo Computing Grid (LVCG).
HTCondor is a batch job scheduler. That is, you provide it a description of discrete work units you want accomplished and it distributes them in a fair way to available computing resources.
For more details on basic concepts, including how to submit jobs, manage jobs, and manage workflows, please refer to the HTCondor Users' Manual.
Using HTCondor on the IGWN computing grid¶
The IGWN Computing Grid uses a distributed computing model, with computing centres spread across geographical locations, but the user experience should be much the same at every centre. To maximise the availability of hosts upon which batch jobs can run, the IGWN Computing Grid centres are all connected to a wider network of computing centres; in Europe these are coordinated by the European Grid Infrastructure, and in the United States by the Open Science Grid.
In order for workflows to be maximally re-locatable from one resource to another, they should avoid requiring configuration that isn't available as standard, including access to specific shared filesystems or the world wide web.
Re-architecting workflows to not rely upon non-standard configuration will take work, and may be difficult, but the benefits will likely outweigh the costs in the long term.
In order to keep track of how much computing is used for IGWN data processing, all jobs that are submitted must be tagged with an accounting group, identifying the analysis being performed.
All jobs must include an `accounting_group` argument in their condor submission file referencing a pre-approved tag, e.g.:
```
universe = vanilla
executable = /usr/bin/od
arguments = -An -N2 -i
input = /dev/random
output = /dev/null
accounting_group = ligo.prod.s6.burst.snews.raven
queue 1
```
Additionally, if the jobs are submitted from a shared user account, `accounting_group_user` must be included to indicate which human being is responsible for those jobs, e.g.:
```
accounting_group = ligo.prod.s6.burst.snews.raven
accounting_group_user = marie.curie
```
Condor file transfers¶
When working without a shared file system, you should use HTCondor's file transfer mechanism; for details, see the HTCondor documentation on submitting jobs without a shared file system.
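As a minimal sketch, file transfer is enabled with the submit-file directives below; the input and output file names here are illustrative placeholders, not files assumed by any real workflow:

```
# tell HTCondor to transfer files instead of relying on a shared filesystem
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# illustrative file names -- replace with your job's actual inputs/outputs
transfer_input_files = config.ini, data.h5
transfer_output_files = results.txt
```

Files listed in `transfer_input_files` are copied to the execute host's scratch directory before the job starts, and the declared outputs are copied back to the submit directory when the job exits.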
Submitting to the Open Science Grid¶
Workflows can be submitted to the Open Science Grid only from specific machines. For IGWN members, those machines are:
| Host | Location |
| --- | --- |
| `ldas-osg.ligo.caltech.edu` | Caltech (LIGO) |
The connection follows the same directions given for connecting to generic collaboration resources; both ssh and gsissh access are supported.
Each submit host is configured to connect to the underlying HTCondor workload manager. Any computing task you wish to run on the IGWN pool should be submitted from one of these submit hosts.
Condor submit files must be modified to include the following directives:
```
+OpenScienceGrid = True
requirements = (IS_GLIDE_IN=?=True)
```
Additional `requirements` may be given to indicate to condor other resources that are needed by your job(s). The following table describes each requirement expression, and what it means for the target execute host:
| Requirement | Meaning |
| --- | --- |
| `IS_GLIDE_IN=?=True` | the job will run on an OSG glide-in execute host |
| `HAS_LIGO_FRAMES=?=True` | GWF data files are available via CVMFS (paths as returned from `gw_data_find`) |
| `HAS_SINGULARITY=?=True` | jobs can run within Singularity containers |
Specifying multiple requirements to HTCondor
Multiple requirements should be combined using the `&&` intersection operator:
requirements = (IS_GLIDE_IN=?=True) && (HAS_SINGULARITY=?=True)
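Putting the above pieces together, a complete submit file for an OSG job might look like the following sketch; the executable, file names, and accounting tag are reused from the earlier examples for illustration and should be replaced with your own:

```
universe = vanilla
executable = /usr/bin/od
arguments = -An -N2 -i
input = /dev/random
output = /dev/null
error = job.err
log = job.log

# route the job to OSG glide-in resources with Singularity support
+OpenScienceGrid = True
requirements = (IS_GLIDE_IN=?=True) && (HAS_SINGULARITY=?=True)

# no shared filesystem on the OSG, so transfer files explicitly
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = ligo.prod.s6.burst.snews.raven
accounting_group_user = marie.curie

queue 1
```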
You can (but are not required to) specify which site(s) you would like your jobs to run on using the `+DESIRED_Sites` condor directive in your submit file:
+DESIRED_Sites = "LIGO-CIT"
Multiple desired sites can be declared as a comma-separated list:
+DESIRED_Sites = "LIGO-CIT,GATech"
Please select sites from the following list:
| Site tag | Site |
| --- | --- |
| | Brookhaven National Lab. (USA) |
| | IN2P3, Lyon (France) |
| | INFN (Italy) |
| `GATech` | Georgia Tech (USA) |
| | KISTI (Korea) |
| `LIGO-CIT` | Caltech (USA) |
| | LIGO Hanford Observatory (USA) |
| | LIGO Livingston Observatory (USA) |
| | Nikhef (Netherlands) |
| | LSU (USA) |
| | RAL (UK) |
| | UCSD (USA) |
| | Syracuse (USA) |
| | Univ. of Chicago (USA) |
To exclude a site, use the `+UNDESIRED_Sites` directive in the same way, e.g.:
+UNDESIRED_Sites = "GATech"
Tracking jobs in the OSG condor pool¶
In order to track jobs submitted to the Open Science Grid, a special pool needs to be specified for `condor_q` and friends:
Condor pool for OSG
Example query for workflow submitted to the OSG pool
```
$ condor_q -pool osg-ligo-1.t2.ucsd.edu -better-analyze 14607855.0

-- Schedd: ldas-osg.ligo.caltech.edu : <220.127.116.11:9618?...
The Requirements expression for job 14607855.000 is

    ((IS_GLIDEIN is true) && (HAS_CVMFS_LIGO_CONTAINERS is true)) &&
    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.HasFileTransfer)

Job 14607855.000 defines the following attributes:

    RequestDisk = 1
    RequestMemory = 4096

The Requirements expression for job 14607855.000 reduces to these conditions:

        Slots
Step    Matched  Condition
-----  --------  ---------
            229  HAS_CVMFS_LIGO_CONTAINERS is true
            223  TARGET.Memory >= RequestMemory
              7  &&

No successful match recorded.
Last failed match: Fri Jun 14 08:30:28 2019

Reason for last match failure: no match found

14607855.000:  Run analysis summary ignoring user priority.  Of 233 machines,
    201 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
     30 are able to run your job
```
Pegasus Workflow Management System¶
Pegasus is a workflow management system that allows you to automate, recover, and debug large-scale workflows. On IGWN Computing Grid resources, Pegasus acts as an interface to HTCondor, so it can be used in place of writing an HTCondor workflow directly, including for submissions through the Open Science Grid.