
Data Management on the IGWN Computing Grid

The IGWN Computing Grid connects geographically-distributed resources that do not share a file system, which requires careful management of data both into and out of each job.

HTCondor file transfer

IGWN workflows should use HTCondor's file transfer mechanism to transfer data to and from jobs, including between stages of a workflow.

The only notable exception to this recommendation is when reading GWF data from the local computing centre data archive (not data from /home).

Accessing IGWN data with HTCondor

Many IGWN workflows require access to centrally-managed or shared IGWN data. IGWN supports this by publishing data into the Open Science Data Federation, which enables access to these data from any machine.

What data are available?

For details of what data are distributed and available, please see IGWN Data and links therein.

There are a few different strategies for accessing data in an HTCondor workflow:

Download data using OSDF

HTCondor supports using osdf:// URLs in the transfer_input_files list given in the submit commands. This allows HTCondor to transfer the data directly to the execute point in an efficient way.

Discovering OSDF URLs

OSDF paths (URLs) for aggregated data or centrally-curated shared datasets can be discovered by querying a GWDataFind server that indexes data from /cvmfs, e.g.:

$ python3 -m gwdatafind -r datafind.igwn.org -o H -t H1_HOFT_C00 -s 1373580000 -e 1373590000 -u osdf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
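The returned URLs end in standard frame file names that encode the observatory, dataset tag, GPS start time, and duration. If a workflow needs the GPS coverage of each file, it can be recovered from the URL alone. A minimal sketch, assuming the O-TAG-GPSSTART-DURATION.gwf naming convention visible in the output above (the helper name is ours, not an IGWN library function):

```python
from os.path import basename

def parse_frame_url(url):
    """Split a frame URL into (observatory, tag, gps_start, duration).

    Assumes the O-TAG-GPSSTART-DURATION.gwf frame naming convention
    shown in the GWDataFind output above.
    """
    stem = basename(url).rsplit(".", 1)[0]  # strip the .gwf extension
    obs, tag, start, duration = stem.rsplit("-", 3)
    return obs, tag, int(start), int(duration)
```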

For information on accessing user-curated data, please see User-curated data (below).

Authorisation tokens

When using OSDF URLs with HTCondor, the job submission must include appropriate credentials to access the data. This amounts to adding the token name as a prefix to the URL scheme as part of the transfer_input_files argument.

See Examples for an example of configuring an HTCondor job for authorised access to OSDF data using tokens.

transfer_input_files

Once the OSDF URLs are known, they should be included in the HTCondor submit commands via the transfer_input_files argument, including the token name prefix as above:

transfer_input_files = igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf

The transfer_input_files mechanism copies the files from the OSDF cache directly into your job's working directory (aka "sandbox"), so applications should use the local paths to those files, e.g.:

executable = /bin/head
arguments = -c4 H-H1_HOFT_C00-1373577216-4096.gwf H-H1_HOFT_C00-1373581312-4096.gwf H-H1_HOFT_C00-1373585408-4096.gwf H-H1_HOFT_C00-1373589504-4096.gwf

use_oauth_services = igwn
igwn_oauth_permissions = read:/ligo

should_transfer_files = yes
transfer_executable = false
transfer_input_files = igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
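Because HTCondor delivers each transferred file into the sandbox under its basename, the local paths in the arguments list can be derived mechanically from the OSDF URLs. A sketch of that derivation, using a shortened version of the URL list from the example above:

```python
from os.path import basename

# OSDF URLs exactly as listed in transfer_input_files (token prefix included);
# HTCondor places each file in the sandbox under its basename.
urls = [
    "igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf",
    "igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf",
]
local_paths = [basename(url) for url in urls]

# build the arguments line for the submit file
print("arguments = -c4 " + " ".join(local_paths))
```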

Include the input file size in request_disk

The space requested for your job in the request_disk command must include the total size of all files that will be transferred into the sandbox.
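One way to arrive at a safe request_disk value is to sum the sizes of the input files before writing the submit file and add a margin for job outputs and scratch space. A minimal sketch; the helper name and the 10% default margin are illustrative assumptions, not IGWN recommendations:

```python
import math
import os

def request_disk_kb(paths, margin=1.1):
    """Total size of the given files in KiB, with a safety margin
    for job outputs and scratch space."""
    total = sum(os.path.getsize(p) for p in paths)
    return math.ceil(total * margin / 1024)

# e.g. write the result into a submit file:
# print(f"request_disk = {request_disk_kb(input_files)}KB")
```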

User-curated data

You may also use the OSDF file transfer plugin to move files to and from any CIT access point. This is particularly appropriate for larger input and output files (1 GB or greater) where HTCondor file transfer may be slower or might overload the submit machine.

Copying files into the user-staging OSDF namespace

For details of how to copy files from a CIT access point into the /igwn/cit/staging OSDF namespace, please see Publishing data into /igwn/cit.

OSDF URLs for user-curated data are not indexed by any GWDataFind service, but should be well documented by the owner of the data.

Paths may be discoverable directly from the filesystem at the LDAS-CIT computing centre under

/osdf/igwn/cit/staging/

e.g.

/osdf/igwn/cit/staging/duncan.macleod/hello.txt

The OSDF URL for such a filesystem path is constructed by removing the /osdf prefix and prepending the URL scheme osdf://:

osdf:///igwn/cit/staging/duncan.macleod/hello.txt
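Since the path-to-URL conversion is purely mechanical, it can be scripted. A sketch (the function name is ours, not part of any IGWN tool):

```python
def osdf_url(path, prefix="/osdf"):
    """Convert an /osdf filesystem path into an osdf:// URL by
    removing the /osdf prefix and prepending the URL scheme."""
    if not path.startswith(prefix + "/"):
        raise ValueError(f"not an OSDF filesystem path: {path}")
    return "osdf://" + path[len(prefix):]
```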

These URLs can then be used to specify an input file for an HTCondor job using the same transfer_input_files syntax as above.

It is also possible for distributed IGWN jobs running at a remote site to send job outputs back to CIT using the transfer_output_remaps HTCondor submit option.

Files placed into /osdf/igwn/cit/staging cannot be modified.

For details see the staging documentation.

Reading and writing to the user-curated staging space requires authorisation.

To read a file from your per-user staging space, your token needs the scope read:/staging.

To write back to your per-user space, you require a token with the write:/staging/marie.curie scope. To obtain this, it is sufficient to request write:/staging; the token that you actually receive will be scoped to your username.

Reading from and writing to the user-curated data namespace

An example submit file using both the read and write features might look like:

executable = /bin/cp
arguments = input.txt output.txt

should_transfer_files = yes
transfer_executable = false
transfer_input_files = igwn+osdf:///igwn/cit/staging/marie.curie/input.txt
transfer_output_files = output.txt
transfer_output_remaps = "output.txt = igwn+osdf:///igwn/cit/staging/marie.curie/output.txt"

use_oauth_services = igwn
igwn_oauth_permissions = read:/staging write:/staging

If your submit file will refer to the same path repeatedly, then you may find it convenient to use HTCondor's submit file variables to make your file transfers more readable. For example:

Using a variable to simplify /igwn/cit/staging paths.

executable = /bin/cp
arguments = input.txt output.txt

OSDF_LOCATION = igwn+osdf:///igwn/cit/staging/marie.curie/workflows/run3
should_transfer_files = yes
transfer_executable = false
transfer_input_files = $(OSDF_LOCATION)/input.txt
transfer_output_files = output.txt
transfer_output_remaps = "output.txt = $(OSDF_LOCATION)/output.txt"

use_oauth_services = igwn
igwn_oauth_permissions = read:/staging write:/staging

Read data from CVMFS

Accessing data via OSDF is recommended over CVMFS

Accessing data using transfer_input_files and osdf:// URLs is recommended instead of reading directly from CVMFS: the former is more reliable and significantly easier to debug when things go wrong.

By construction, all IGWN data in CVMFS must also be available via OSDF, and both methods use the same distributed cache infrastructure, so there should be no notable performance difference between the two (when things are working).

CVMFS is a software and data distribution service that makes remote data available via a POSIX-like file system. See CVMFS (on this guide) for more details.

Workflows that need to read data from CVMFS can configure their job requirements to ensure that the jobs only match with machines that have the necessary CVMFS repositories available.

When using data from /cvmfs, it is not necessary to explicitly use HTCondor's file transfer mechanism.

IGWN private data in CVMFS

For proprietary IGWN Data (e.g. h(t)), which requires Credentials for access, a custom requirement is used to ensure that the target machine has the data, and can handle the credentials appropriately.

If your job needs access to IGWN proprietary data, including paths returned by the GWDataFind server at https://datafind.igwn.org, the requirements command should be

requirements = HAS_CVMFS_IGWN_PRIVATE_DATA =?= True

Private data access requires credentials

Accessing the IGWN private data requires configuring your job to include Credentials.
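Putting the requirement and the credentials together, a minimal private-data job might look like the following sketch; the frame path under /cvmfs is illustrative only, and real paths should come from GWDataFind:

```
executable = /bin/head
arguments = -c4 /cvmfs/ligo.osgstorage.org/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf

requirements = HAS_CVMFS_IGWN_PRIVATE_DATA =?= True

use_oauth_services = igwn
igwn_oauth_permissions = read:/ligo

should_transfer_files = yes
transfer_executable = false
```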

GWOSC public data in CVMFS

For the GWOSC public CVMFS data repository (gwosc.osgstorage.org), the requirements command should be

requirements = HAS_CVMFS_gwosc_osgstorage_org =?= True
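A sketch of a job reading public GWOSC data directly from CVMFS; the file path under /cvmfs/gwosc.osgstorage.org is illustrative only:

```
executable = /bin/head
arguments = -c4 /cvmfs/gwosc.osgstorage.org/gwdata/O2/strain.4k/frame.v1/H1/1186772992/H-H1_GWOSC_O2_4KHZ_R1-1186772992-4096.gwf

requirements = HAS_CVMFS_gwosc_osgstorage_org =?= True

should_transfer_files = yes
transfer_executable = false
```

No credentials are needed for GWOSC public data, so the use_oauth_services commands are omitted.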