The IGWN Grid¶
The International Gravitational Wave Network Computing Grid (IGWN Grid) is the superset of dedicated, allocated and opportunistic distributed computing resources available worldwide for gravitational wave data analysis.
You may have heard this referred to colloquially as the "Open Science Grid"; while the OSG is deeply involved, "IGWN Grid" is the correct term to encompass the entire infrastructure involved here.
These pages will describe how to use HTCondor to process large workflows on the IGWN Grid.
The IGWN Computing Grid uses a distributed computing model: computing centres are spread across many geographical locations, but the user experience should be much the same at all of them.
The IGWN Grid uses the HTCondor batch job scheduler. You provide it with a description of the discrete work units you want accomplished, and HTCondor distributes them fairly to available computing resources capable of completing those work units.
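For example, a minimal HTCondor job description might look like the following sketch. The executable name, arguments, and file names are illustrative placeholders, not IGWN-specific values:

```condor
# minimal-example.sub -- a sketch of an HTCondor job description
# (executable, arguments, and file names are placeholders)
universe   = vanilla
executable = my_analysis.sh
arguments  = --segment $(Process)

# per-job log files
log    = my_analysis.log
output = my_analysis.$(Process).out
error  = my_analysis.$(Process).err

# resources each work unit needs
request_cpus   = 1
request_memory = 2GB
request_disk   = 1GB

# queue ten independent work units
queue 10
```

Submitted with `condor_submit minimal-example.sub`, this queues ten independent work units that HTCondor will match to suitable resources.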
Basic familiarity with HTCondor is assumed for much of this guide. For more introductory details, including how to submit jobs, manage jobs, and manage workflows, please refer to:
Pegasus Workflow Management System¶
Pegasus is a workflow management system that allows you to automate, recover, and debug large-scale workflows. On IGWN Computing Grid resources, Pegasus acts as an interface to HTCondor, and so can be used in place of writing an HTCondor workflow directly, including for submissions through the Open Science Grid.
This guide deals with using HTCondor directly.
Adapting to life on the grid¶
The IGWN Computing Grid centres are all connected to a wider network of computing centres. In Europe these are coordinated by the European Grid Infrastructure, and in the United States by the Open Science Grid.
IGWN user access to these resources is provisioned through dedicated HTCondor submit hosts run by IGWN partner institutions, while jobs' access to resources is brokered through OSG infrastructure.
Due to the spatially distributed nature of IGWN grid resources, there are a number of important requirements which may differ from what one is used to on a single, dedicated computing cluster. We will discuss these in detail below but briefly:
IGWN Grid Workflow Requirements
The basic recipe for running on the grid includes:
- Accounting information: as on dedicated LSC clusters, jobs on all IGWN resources must identify their purpose through the use of accounting tags.
- Software deployment: any executable code, and all of its dependencies, that your job runs should be available at or portable enough to run at remote sites. Solutions include the IGWN conda environments or Singularity containers, both of which are hosted at all participating IGWN sites through CVMFS. Standalone executables can also be transmitted with the jobs themselves using HTCondor file transfer.
- Remote data I/O: Jobs should (generally) have no run-time requirements for access to a shared network filesystem on the submit host. Jobs on the IGWN Grid run at remote sites and will not generally have standard POSIX access during execution to any shared filesystem present on the host they were submitted from (e.g. the user's /home). Input data must be read from a globally accessible repository (e.g. frames in CVMFS) or sent with the job using HTCondor's file transfer mechanism (e.g. configuration files); output data is transferred back to the submit host or other remote storage on job completion using HTCondor file transfer.
- Submit hosts: access to the global pool of IGWN grid resources is only available on appropriately configured HTCondor submit hosts. To access this pool, workflows must be submitted from those hosts.
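Putting these requirements together, a grid-ready job description might look like the following sketch. The accounting tag, container path, and file names are illustrative placeholders (`+SingularityImage` follows the OSG convention for requesting a container; check your submit host's documentation for the attributes it supports):

```condor
# grid-example.sub -- a sketch combining the requirements above
universe   = vanilla
executable = run_analysis.sh

# Accounting information: identify the job's purpose (placeholder tag and user)
accounting_group      = ligo.dev.o4.example.analysis
accounting_group_user = albert.einstein

# Software deployment: run inside a container distributed via CVMFS
# (the image path below is a placeholder)
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/igwn/example:latest"

# Remote data I/O: no shared filesystem at the execution site, so
# ship inputs with the job and bring outputs back on completion
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = config.ini
transfer_output_files   = results.hdf5

request_cpus   = 1
request_memory = 2GB
request_disk   = 2GB

log    = run_analysis.log
output = run_analysis.out
error  = run_analysis.err

queue
```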
Re-architecting entire workflows to satisfy these requirements can be non-trivial, particularly for workflows with complex inter-job data dependencies, but it provides access to a far larger pool of resources and eliminates the need to manually partition workloads over multiple computing sites.
Importantly, many IGWN submit sites also provide access to a local HTCondor pool with a shared network filesystem. Complex workflows can then be run on a mix of local and global grid resources. So, for example, workflows with large numbers of I/O-intensive pre-/post-processing jobs may run locally, while more computationally expensive analysis jobs are distributed globally.
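One way to express such a mixed workflow is with HTCondor DAGMan, keeping I/O-intensive stages in one submit description and compute-intensive stages in another. This is only a structural sketch: the submit file names are placeholders, and the mechanism for steering each submit description to the local pool or the global grid is site-specific, so consult your submit host's documentation:

```condor
# workflow.dag -- pre/post-processing locally, analysis on the grid
# (preprocess.sub and postprocess.sub target the local pool;
#  analysis.sub targets the global grid)
JOB preprocess  preprocess.sub
JOB analysis    analysis.sub
JOB postprocess postprocess.sub

PARENT preprocess CHILD analysis
PARENT analysis   CHILD postprocess
```

Submitted with `condor_submit_dag workflow.dag`, DAGMan runs each stage only after its parents have completed.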
You're (probably) closer than you think
Incremental modification of workflows to satisfy these requirements is entirely possible and common practice. If porting all software and data I/O in one go seems daunting or inefficient, consider moving the most parallel and computationally expensive jobs onto the grid first. Bear in mind, however, that each additional job you port increases the pool of resources available to your workflow.
These pages run through detailed explanations and examples for each of the IGWN Grid workflow requirements listed above. Use the navigation bars on the left and right to explore this documentation.