Working without a shared filesystem
In this section we discuss data access for jobs on the IGWN Grid:
- How to access centrally-curated data from the read-only CVMFS filesystem, and which data are accessible that way.
- How to send input data out with your jobs.
- How to bring output data back.
Reading frames from CVMFS
Many sites on the IGWN Grid host official strain data frame files in special CVMFS repositories. If your jobs read strain data from frame files at execution time, you can restrict them to these sites by requiring HAS_LIGO_FRAMES =?= True, and then access the frames as usual through their CVMFS path using X509 authentication.
HTCondor submit file to read authenticated frames from CVMFS
Dump the contents of an authenticated frame file in CVMFS at a remote site using software from the LIGO CVMFS oasis:
    universe = Vanilla
    executable = /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py37/bin/lalapps_frread
    transfer_executable = False
    requirements = HAS_LIGO_FRAMES =?= True
    arguments = /cvmfs/oasis.opensciencegrid.org/ligo/frames/O3/V1Online/V-V1Online-12654/V-V1Online-1265400000-2000.gwf V1:Hrec_hoft_16384Hz
    use_x509userproxy = True
    x509userproxy = /path/to/proxy
    log = example.log
    error = example.err
    output = example.out
    queue 1
- x509userproxy = /path/to/proxy is optional: if absent, HTCondor will locate the X509 certificate from the X509_USER_PROXY environment variable (see below).
- For most use-cases, frame files in CVMFS can be located by specifying the datafind.ligo.org:443 server in calls to gwdatafind. See CVMFS data discovery for full details.
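Data discovery tools commonly report frame locations as URLs rather than bare paths. The sketch below assumes those URLs use the file:// scheme (for example file://localhost/cvmfs/...), which is an assumption about the server's response rather than something stated here. It uses only the Python standard library to turn such a URL into a path that could be placed on a job's arguments line. The helper name frame_url_to_path is hypothetical.

```python
from urllib.parse import urlparse


def frame_url_to_path(url):
    """Convert a file:// URL into a local filesystem path.

    Assumes the URL uses the file scheme; anything else is rejected
    rather than silently mangled.
    """
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// URL: {url}")
    # The host part (e.g. "localhost") is dropped; only the path is kept.
    return parsed.path


# Illustrative URL built from the frame path used in the submit file above
url = (
    "file://localhost/cvmfs/oasis.opensciencegrid.org/ligo/frames/O3/"
    "V1Online/V-V1Online-12654/V-V1Online-1265400000-2000.gwf"
)
print(frame_url_to_path(url))
```

Rejecting non-file URLs keeps the job from being handed a location it cannot open through CVMFS.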
Ensure you have a valid X509 proxy certificate
Unless you are using publicly-released GWOSC data, you must ensure that you have a valid X509 proxy certificate when the job is submitted. Take care to ensure that all jobs in e.g. a long-running DAG have, or will have when they start running, access to a valid proxy.
Note that proxy certificates are usually created in /tmp, which may not be persistent, and which is local to the host the proxy is created on. It may be useful to make a copy of your proxy certificate and explicitly specify the path of that copy:

    $ ligo-proxy-init albert.einstein
    Your identity: albert.einstein@LIGO.ORG
    Enter pass phrase for this identity:
    Creating proxy .................................... Done
    Your proxy is valid until: Nov 6 11:28:43 2020 GMT
    $ cp $(grid-proxy-info -path) /path/to/copied/proxy
    $ export X509_USER_PROXY=/path/to/copied/proxy

Then set x509userproxy=/path/to/copied/proxy in your job submit file. You will still need to be careful to update this proxy if it expires, however.
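Before submitting a long-running workflow, it can help to compare the proxy's expiry time with how long the jobs are expected to take. The following is a minimal sketch using only the Python standard library. It assumes the expiry is available as a string in the "valid until" format printed by ligo-proxy-init above; the function names and the 12-hour workflow length are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_valid_until(text):
    """Parse a timestamp such as 'Nov 6 11:28:43 2020 GMT'.

    Month names are parsed with %b, so this assumes an English/C locale.
    """
    parsed = datetime.strptime(text.strip(), "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def proxy_covers(valid_until, needed, now=None):
    """Return True if the proxy stays valid for at least `needed` from `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_valid_until(valid_until) - now >= needed


# Hypothetical check: a 12-hour workflow starting at 11:00 UTC on 5 Nov 2020
start = datetime(2020, 11, 5, 11, 0, 0, tzinfo=timezone.utc)
print(proxy_covers("Nov 6 11:28:43 2020 GMT", timedelta(hours=12), now=start))
```

A check like this only covers the moment of submission; jobs that wait in the queue may still start after the proxy has expired.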
Bringing data with your jobs
Jobs executing on the IGWN grid do not have access to a normal shared filesystem and should instead use HTCondor file transfer.
Jobs on the IGWN Grid will usually be submitted from a host with access to a shared filesystem and your usual /home directories. On the execute side, however, jobs start in a different "sandbox" directory and generally do not have access to your, or anyone else's, /home. Unless it can be found in CVMFS, you must bring all required input data with you, and all output data back, using HTCondor file transfer.
To send files and directories to your job, and to retrieve them from it, use transfer_input_files and transfer_output_files in the job submit file. Take particular note of the implications of the working directories on both the submit side and the execute side:
- Relative paths used in transfer_input_files are relative to the current working directory when the workflow is submitted. This can be overridden using the initialdir command in the submit file.
- Paths on the execute side are relative to the sandbox directory where the job runs. This includes paths in
- When they come back to the submit host, paths in transfer_output_files are again relative to the current working directory when the workflow was submitted.
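The path rules in the list above can be sketched in a few lines of Python. This is an illustration of the rules as described here, not HTCondor's own code. The function names are hypothetical, and the handling of a relative initialdir (resolved against the submission directory) is an assumption.

```python
import posixpath


def submit_side_path(path, submit_cwd, initialdir=None):
    """Resolve a file-transfer entry as seen on the submit host."""
    base = submit_cwd
    if initialdir is not None:
        # posixpath.join discards submit_cwd when initialdir is absolute.
        # Resolving a relative initialdir against submit_cwd is an assumption.
        base = posixpath.join(submit_cwd, initialdir)
    return posixpath.normpath(posixpath.join(base, path))


def execute_side_path(path, sandbox):
    """Resolve a relative path as seen inside the job sandbox."""
    return posixpath.normpath(posixpath.join(sandbox, path))


# Hypothetical directories for illustration
print(submit_side_path("config.ini", "/home/albert.einstein"))
print(submit_side_path("config.ini", "/home/albert.einstein",
                       initialdir="/home/albert.einstein/GW123456"))
print(execute_side_path("results_dir", "/srv/job_123456"))
```

The same relative name therefore refers to three different locations depending on where, and with which initialdir, it is interpreted.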
Depending on the use-case, the easiest configuration is usually to just use an absolute path in
initialdir, and relative paths in
HTCondor submit file with file transfers enabled
Send a job configuration file and an injection XML table out with a job, and bring back an output directory (assumed here to be created by the executable) on completion:
    universe = Vanilla
    executable = /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py37/bin/lalapps_somejob
    transfer_executable = False
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    initialdir = /home/albert.einstein/GW123456
    transfer_input_files = config.ini, GW123456_siminspiral.xml
    transfer_output_files = results_dir
    log = example.log
    error = example.err
    output = example.out
    queue 1
The submit machine prior to job submission:
    /home
    └── albert.einstein
        └── GW123456
            ├── config.ini
            ├── example.sub
            └── GW123456_siminspiral.xml
The job runs on the execute host in a sandbox like:
    /srv
    └── job_123456
        ├── config.ini
        ├── GW123456_siminspiral.xml
        └── results_dir
results_dir is an output directory created by the executable.
And, on completion, the submit machine has:
    /home
    └── albert.einstein
        └── GW123456
            ├── config.ini
            ├── example.err
            ├── example.log
            ├── example.out
            ├── example.sub
            ├── GW123456_siminspiral.xml
            └── results_dir
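Because every input must travel with the job, a quick check before submission can catch missing files early. The sketch below is a hypothetical helper, not part of HTCondor. It takes the value of transfer_input_files as a string plus the initialdir, and lists the entries that do not exist on the submit host. Entries that look like URLs are skipped, on the assumption that they are fetched remotely.

```python
import os


def missing_inputs(transfer_input_files, initialdir):
    """Return the transfer_input_files entries that do not exist locally."""
    missing = []
    for entry in transfer_input_files.split(","):
        entry = entry.strip()
        if not entry or "://" in entry:
            # Skip empty entries and URLs, which are not local paths.
            continue
        # os.path.join leaves absolute entries unchanged.
        if not os.path.exists(os.path.join(initialdir, entry)):
            missing.append(entry)
    return missing


# Values taken from the example submit file above
print(missing_inputs("config.ini, GW123456_siminspiral.xml",
                     "/home/albert.einstein/GW123456"))
```

An empty list means every local input was found; anything else names the files that would cause the transfer to fail.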
In the event you need to remove jobs from the queue, be aware that
condor_rm will permanently wipe the contents of the job sandbox. If jobs use periodic self-checkpointing, you may be able to retrieve checkpoint data on the submit host by locating your jobs' directory in the SPOOL:
    $ condor_config_val SPOOL
    /var/lib/condor/spool
If you require intermediate data access prior to job completion, you may wish to explore using condor_chirp.
Using file transfer by default can be a good idea
Even moderately I/O-intensive jobs may benefit from using HTCondor file transfer regardless of whether they are running on the IGWN Grid. Heavy loads or poor network performance can be problematic for jobs using shared filesystems. Using HTCondor file transfer mitigates I/O problems by using local storage on the execute host and internal checksumming to ensure data integrity during transfers.
Technical note: all machines in the IGWN Grid pool are on a different
FileSystemDomain than the submit machine. All machines on any local pool accessible to the submit machine are on the same
FileSystemDomain as the submit machine.
Dataflow jobs: your workflow may involve jobs which are redundant if their output already exists. A typical use-case might be running a new instance of a DAG using output from a previous instance of the same DAG. To understand how to skip jobs whose outputs already exist, see dataflow jobs.
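The idea of a redundant job can be illustrated with a simple timestamp comparison. The sketch below treats a job as skippable when all of its outputs exist and are newer than all of its inputs. This is an assumed, simplified criterion for illustration only; the dataflow jobs documentation remains the authority on what HTCondor actually checks. The function name is hypothetical.

```python
import os


def outputs_up_to_date(inputs, outputs):
    """Return True if every output exists and is newer than every input.

    Input files are assumed to exist; a missing input raises OSError.
    """
    if not outputs or not all(os.path.exists(p) for p in outputs):
        return False
    # With no inputs at all, any existing output counts as up to date.
    newest_input = max((os.path.getmtime(p) for p in inputs),
                       default=float("-inf"))
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return oldest_output > newest_input
```

For the example job above, this might be called as outputs_up_to_date(["config.ini", "GW123456_siminspiral.xml"], ["results_dir"]) from within the initialdir.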
File transfer with URLs: Note that this discussion has only covered paths to files and directories local to the submit or execute machines. HTCondor also supports file transfer via URLs.