Working without a shared filesystem

In this section we discuss data access for jobs on the IGWN Grid:

  • How to access centrally-curated data from the read-only CVMFS filesystem, and which data are accessible that way
  • How to send input data out with your jobs
  • How to bring output data back

Reading frames from CVMFS

Many sites on the IGWN Grid host official strain data frame files in dedicated CVMFS repositories. If your jobs read strain data from frame files at execution time, you can restrict them to these sites by adding HAS_LIGO_FRAMES =?= True to the job requirements, and then access the frames through their usual CVMFS paths using X509 authentication.

HTCondor submit file to read authenticated frames from CVMFS

Dump the contents of an authenticated frame file in CVMFS at a remote site using software from the LIGO CVMFS oasis:

universe = Vanilla
executable = /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py37/bin/lalapps_frread
transfer_executable = False
requirements = HAS_LIGO_FRAMES =?= True
arguments = /cvmfs/oasis.opensciencegrid.org/ligo/frames/O3/V1Online/V-V1Online-12654/V-V1Online-1265400000-2000.gwf V1:Hrec_hoft_16384Hz
use_x509userproxy = True
x509userproxy = /path/to/proxy
log = example.log
error = example.err
output = example.out
queue 1
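
To run this example, save the submit description above to a file (here called frread.sub, a name chosen purely for illustration), submit it with condor_submit, and read the dumped channel contents from example.out once the job finishes:

$ condor_submit frread.sub
$ condor_q
$ cat example.out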

Notes:

  • The x509userproxy = /path/to/proxy line is optional: if it is absent, HTCondor will locate the X509 proxy certificate from the X509_USER_PROXY environment variable (see below).
  • For most use-cases, frame files in CVMFS can be located by specifying the datafind.ligo.org:443 server in calls to gwdatafind. See CVMFS data discovery for full details.
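
For example, a query along the following lines (a sketch only; check gw_data_find --help for the exact option names in your installation) should return CVMFS paths for the Virgo frame file used above:

$ gw_data_find --server datafind.ligo.org:443 \
      --observatory V --type V1Online \
      --gps-start-time 1265400000 --gps-end-time 1265402000 \
      --url-type file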

Ensure you have a valid X509 proxy certificate

Unless you are using publicly-released GWOSC data, you must ensure that you have a valid X509 proxy certificate when the job is submitted. Take care to ensure that all jobs in e.g. a long-running DAG have, or will have when they start running, access to a valid proxy.

Note that proxy certificates are usually created in /tmp, which may not be persistent, and which is local to the host the proxy is created on. It may be useful to explicitly specify the path of a copy of your proxy certificate in the job submit file using the x509userproxy directive:

$ ligo-proxy-init albert.einstein
Your identity: albert.einstein@LIGO.ORG
Enter pass phrase for this identity:
Creating proxy .................................... Done
Your proxy is valid until: Nov 6 11:28:43 2020 GMT
$ cp $(grid-proxy-info -path) /path/to/copied/proxy
$ export X509_USER_PROXY=/path/to/copied/proxy
Then set x509userproxy = /path/to/copied/proxy in your job submit file. Note that you will still need to renew or replace this copied proxy before it expires.
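
Before submitting (or resubmitting) jobs, it can be worth checking how much lifetime the copied proxy has left; assuming grid-proxy-info is available on the submit host, the following prints the remaining validity in seconds:

$ grid-proxy-info -file /path/to/copied/proxy -timeleft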

Bringing data with your jobs

Jobs executing on the IGWN grid do not have access to a normal shared filesystem and should instead use HTCondor file transfer.

Jobs on the IGWN Grid will usually be submitted from a host with access to a shared filesystem and your usual /home directories. On the execute side, however, jobs start in a different "sandbox" directory and generally do not have access to your, or anyone else's, /home. Unless it can be found in CVMFS, you must bring all required input data with you, and bring all output data back, using HTCondor file transfer.

To send/retrieve files and directories to/from your job, use transfer_input_files/transfer_output_files in the job submit file. Take particular note of the implications of the working directories on both the submit side and on the execute side:

  • Relative paths used in transfer_input_files are relative to the current working directory when the workflow is submitted. This can be overridden using the initialdir command in the submit file.
  • Paths on the execute side are relative to the sandbox directory where the job runs. This includes paths in transfer_output_files.
  • When they come back to the submit host, paths in transfer_output_files are again relative to the current working directory when the workflow was submitted.

Depending on the use-case, the easiest configuration is usually to just use an absolute path in initialdir, and relative paths in transfer_input_files and transfer_output_files.

HTCondor submit file with file transfers enabled

Send a job configuration file and an injection XML table out with a job and, on completion, bring back an output directory (assumed here to be created by the executable):

universe = Vanilla
executable = /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/envs/igwn-py37/bin/lalapps_somejob
transfer_executable = False
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
initialdir = /home/albert.einstein/GW123456
transfer_input_files = config.ini, GW123456_siminspiral.xml
transfer_output_files = results_dir
log = example.log
error = example.err
output = example.out
queue 1

The submit machine prior to job submission:

/home
└── albert.einstein
    └── GW123456
        ├── config.ini
        ├── example.sub
        └── GW123456_siminspiral.xml

The job runs on the execute host in a sandbox like:

/srv
└── job_123456
    ├── config.ini
    ├── GW123456_siminspiral.xml
    └── results_dir
where results_dir is an output directory created by lalapps_somejob.

And, on completion, the submit machine has:

/home
└── albert.einstein
    └── GW123456
        ├── config.ini
        ├── example.err
        ├── example.log
        ├── example.out
        ├── example.sub
        ├── GW123456_siminspiral.xml
        └── results_dir

Use condor_rm carefully

In the event that you need to remove jobs from the queue, be aware that condor_rm will permanently wipe the contents of the job sandbox. If your jobs use periodic self-checkpointing, you may be able to retrieve checkpoint data on the submit host by locating your job's directory in the SPOOL:

$ condor_config_val SPOOL
/var/lib/condor/spool
and searching for the sub-directory named after the ID of your job.
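
The exact layout beneath the spool directory depends on the HTCondor version and configuration, so a simple search by job ID is often the quickest approach. For example, for a (made-up) job cluster 123456:

$ find $(condor_config_val SPOOL) -maxdepth 3 -type d -name '*123456*'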

If you require intermediate data access prior to job completion, you may wish to explore using condor_chirp.
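
As a rough sketch of the latter: condor_chirp runs inside the job on the execute host and can push files from the sandbox back to the submit machine while the job is still running. For example, a job wrapper script might contain something like the following (the file name is purely illustrative, and the destination path is interpreted on the submit side):

condor_chirp put results_dir/partial_output.dat partial_output.dat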

Using file transfer by default can be a good idea

Even moderately I/O-intensive jobs may benefit from using HTCondor file transfer regardless of whether they are running on the IGWN Grid. Heavy loads or poor network performance can be problematic for jobs using shared filesystems. Using HTCondor file transfer mitigates I/O problems by using local storage on the execute host and internal checksumming to ensure data integrity during transfers.
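
A minimal sketch of opting in to this behaviour: setting should_transfer_files = YES (rather than IF_NEEDED) tells HTCondor to use file transfer even when the execute machine shares a filesystem with the submit machine, for example:

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = config.ini
transfer_output_files = results_dir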

Technical note: all machines in the IGWN Grid pool are on a different FileSystemDomain than the submit machine. All machines on any local pool accessible to the submit machine are on the same FileSystemDomain as the submit machine.
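
You can inspect the FileSystemDomain advertised by execute machines, and compare it with that of your submit host, using condor_status:

$ condor_status -autoformat Machine FileSystemDomain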

Advanced usage

Dataflow jobs: your workflow may involve jobs which are redundant if their output already exists. A typical use-case might be running a new instance of a DAG using output from a previous instance of the same DAG. To understand how to skip jobs whose outputs already exist, see dataflow jobs.

File transfer with URLs: Note that this discussion has only covered paths to files and directories local to the submit or execute machines. HTCondor also supports file transfer via URLs.
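
As a brief illustration (the URL below is a placeholder, not a real data location), transfer_input_files can mix local paths and URLs, and HTCondor will fetch each URL on the execute side using the matching file-transfer plugin:

transfer_input_files = config.ini, https://example.org/GW123456_siminspiral.xml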