Data Management on the IGWN Computing Grid¶
The IGWN Computing Grid connects geographically-distributed resources that do not share a file system, which requires careful management of data moving both into and out of a job.
HTCondor file transfer¶
IGWN workflows should use HTCondor's file transfer mechanism to transfer data to and from jobs, including between stages of a workflow.
The only notable exception to this recommendation is when reading GWF data from the local computing centre data archive (not data from /home).
Accessing IGWN data with HTCondor¶
Many IGWN workflows require access to centrally-managed or shared IGWN data. IGWN supports this by publishing data into the Open Science Data Federation which enables access to these data from any machine.
What data are available?
For details of what data are distributed and available, please see IGWN Data and links therein.
There are a few different strategies for accessing data in an HTCondor workflow:
Download data using OSDF¶
HTCondor supports using osdf:// URLs in the transfer_input_files list given in the submit commands. This allows HTCondor to transfer the data directly to the execute point in an efficient way.
Discovering OSDF URLs¶
OSDF paths (URLs) for aggregated data or centrally-curated shared datasets can be discovered by querying a GWDataFind server that indexes data from /cvmfs, e.g.:
$ python3 -m gwdatafind -r datafind.igwn.org -o H -t H1_HOFT_C00 -s 1373580000 -e 1373590000 -u osdf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
For information on accessing user-curated data, please see User-curated data (below).
Authorisation tokens¶
When using OSDF URLs with HTCondor, the job submission must include appropriate credentials to access the data. This amounts to adding the token name as a prefix to the URL scheme as part of the transfer_input_files argument.
See Examples for an example of configuring an HTCondor job for authorised access to OSDF data using tokens.
transfer_input_files¶
Once the OSDF URLs are known, they should be included in the HTCondor submit commands via the transfer_input_files argument, including the token name prefix as above:
transfer_input_files = igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf, igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf, igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf, igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
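When many frame files are involved, the token-prefixed list can be assembled programmatically from the URLs that GWDataFind prints. The following is a minimal sketch, not part of any IGWN tool: the helper name `make_transfer_input_files` is hypothetical, the `igwn` token name is taken from the examples on this page, and the comma-separated form follows the HTCondor manual's description of transfer_input_files.

```python
def make_transfer_input_files(urls, token="igwn"):
    """Return a comma-separated value for transfer_input_files.

    Hypothetical helper: prefixes each osdf:// URL with '<token>+'.
    """
    prefixed = []
    for url in urls:
        if not url.startswith("osdf://"):
            raise ValueError(f"not an OSDF URL: {url}")
        prefixed.append(f"{token}+{url}")
    return ", ".join(prefixed)


# URLs as printed by the gwdatafind query shown earlier
urls = [
    "osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf",
    "osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf",
]
print("transfer_input_files = " + make_transfer_input_files(urls))
```

The printed line can be pasted into a submit file, or written out by whatever script generates the workflow.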
The transfer_input_files mechanism copies the files from the OSDF cache directly into your job's working directory (aka "sandbox"), so applications should use the local path to those files, e.g.:
executable = /bin/head
arguments = -c4 H-H1_HOFT_C00-1373577216-4096.gwf H-H1_HOFT_C00-1373581312-4096.gwf H-H1_HOFT_C00-1373585408-4096.gwf H-H1_HOFT_C00-1373589504-4096.gwf
use_oauth_services = igwn
igwn_oauth_permissions = read:/ligo
should_transfer_files = yes
transfer_executable = false
transfer_input_files = igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf, igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf, igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf, igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
Include the input file size in request_disk
The space requested for your job in the request_disk command must include the total size of all files that will be transferred into the sandbox.
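One way to arrive at a request_disk value is to add up the known input sizes, the expected output size, and some headroom. The sketch below assumes the file sizes are already known in bytes; the helper name `request_disk_kib`, the example sizes, and the 20% safety margin are all illustrative choices, not IGWN recommendations. HTCondor interprets a bare request_disk number as KiB.

```python
import math


def request_disk_kib(input_sizes_bytes, output_bytes=0, margin=1.2):
    """Estimate request_disk in KiB from file sizes given in bytes.

    Hypothetical helper; 'margin' is an arbitrary safety factor.
    """
    total_bytes = sum(input_sizes_bytes) + output_bytes
    return math.ceil(total_bytes * margin / 1024)


# Illustrative sizes only: four input files of 500 MB and 100 MB of output
sizes = [500_000_000] * 4
print(f"request_disk = {request_disk_kib(sizes, output_bytes=100_000_000)}")
```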
User-curated data¶
You may also use the OSDF file transfer plugin to move files to and from any CIT access point. This is particularly appropriate for larger input and output files (1 GB or greater) where HTCondor file transfer may be slower or might overload the submit machine.
Copying files into the user-staging OSDF namespace
For details of how to copy files from a CIT access point into the /igwn/cit/staging OSDF namespace, please see Publishing data into /igwn/cit.
OSDF URLs for user-curated data are not indexed by any GWDataFind service, but should be well documented by the owner of the data.
Paths may be discoverable directly from the filesystem at the LDAS-CIT computing centre under
/osdf/igwn/cit/staging/
e.g.
/osdf/igwn/cit/staging/duncan.macleod/hello.txt
The OSDF URL for such a filesystem path is constructed by removing the /osdf prefix and prepending the URL scheme osdf://:
osdf:///igwn/cit/staging/duncan.macleod/hello.txt
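The same path-to-URL rule can be written as a small function. This is only a sketch of the string manipulation described above; the function name `osdf_path_to_url` is hypothetical and does not belong to any IGWN package.

```python
def osdf_path_to_url(path):
    """Convert a /osdf/... filesystem path into an osdf:// URL."""
    if not path.startswith("/osdf/"):
        raise ValueError(f"path is not under /osdf: {path}")
    # Drop the leading '/osdf' and prepend the URL scheme
    return "osdf://" + path[len("/osdf"):]


print(osdf_path_to_url("/osdf/igwn/cit/staging/duncan.macleod/hello.txt"))
# prints: osdf:///igwn/cit/staging/duncan.macleod/hello.txt
```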
These URLs can then be used to specify an input file for an HTCondor job using the same transfer_input_files syntax as above.
It is also possible for distributed IGWN jobs running at a remote site to send job outputs back to CIT using the transfer_output_remaps HTCondor submit option.
Files placed into /osdf/igwn/cit/staging cannot be modified.
For details see the staging documentation.
Reading and writing to the user-curated staging space requires authorisation.
To read a file from your per-user staging space, your token needs the scope read:/staging.
To write back to your per-user space, you require a token with a scope specific to your username, e.g. write:/staging/marie.curie. To obtain it, it is sufficient to request write:/staging; the token that you actually receive will be restricted to your username.
Reading from and writing to the user-curated data namespace
An example submit file using both the read and write features might look like:
executable = /bin/cp
arguments = input.txt output.txt
should_transfer_files = yes
transfer_executable = false
transfer_input_files = igwn+osdf:///igwn/cit/staging/marie.curie/input.txt
transfer_output_files = output.txt
transfer_output_remaps = "output.txt = igwn+osdf:///igwn/cit/staging/marie.curie/output.txt"
use_oauth_services = igwn
igwn_oauth_permissions = read:/staging write:/staging
If your submit file will refer to the same path repeatedly, then you may find it convenient to use HTCondor's submit file variables to make your file transfers more readable. For example:
Using a variable to simplify /igwn/cit/staging paths
executable = /bin/cp
arguments = input.txt output.txt
OSDF_LOCATION = igwn+osdf:///igwn/cit/staging/marie.curie/workflows/run3
should_transfer_files = yes
transfer_executable = false
transfer_input_files = $(OSDF_LOCATION)/input.txt
transfer_output_files = output.txt
transfer_output_remaps = "output.txt = $(OSDF_LOCATION)/output.txt"
use_oauth_services = igwn
igwn_oauth_permissions = read:/staging write:/staging
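For jobs that write several output files, the transfer_output_remaps value becomes a semicolon-separated list of `name = destination` pairs, which is tedious to write by hand. The following sketch builds that string; the helper name `make_output_remaps` and the file names are hypothetical, and the staging location simply reuses the example path above.

```python
def make_output_remaps(filenames, osdf_location):
    """Build a quoted transfer_output_remaps value.

    Hypothetical helper: each sandbox file is remapped to the same
    name under 'osdf_location'.
    """
    base = osdf_location.rstrip("/")
    pairs = [f"{name} = {base}/{name}" for name in filenames]
    return '"' + "; ".join(pairs) + '"'


location = "igwn+osdf:///igwn/cit/staging/marie.curie/workflows/run3"
outputs = ["output.txt", "summary.txt"]
print("transfer_output_files = " + ", ".join(outputs))
print("transfer_output_remaps = " + make_output_remaps(outputs, location))
```

Every file named in the remaps should also appear in transfer_output_files, as in the submit files above.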
Read data from CVMFS¶
Accessing data via OSDF is recommended over CVMFS
Accessing data using transfer_input_files and osdf:// URLs is recommended instead of using CVMFS; the former is more reliable and significantly easier to debug when things go wrong.
By construction, all IGWN data in CVMFS must also be available via OSDF, and both methods use the same distributed cache infrastructure, so there should be no notable performance difference between the two (when things are working).
CVMFS is a software and data distribution service that makes remote data available via a POSIX-like file system. See CVMFS (on this guide) for more details.
Workflows that need to read data from CVMFS can configure their job requirements to ensure that the jobs only match with machines that have the necessary CVMFS repositories available.
When using data from /cvmfs, it is not necessary to explicitly use HTCondor's file transfer mechanism.
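Because CVMFS problems can be hard to diagnose from inside a job, it may help to check that the repository is readable before the application starts. The sketch below is an optional job-side sanity check, not an IGWN-provided tool; the function name `cvmfs_repo_available` is hypothetical, and it assumes repositories appear as directories under /cvmfs.

```python
import os


def cvmfs_repo_available(repo, root="/cvmfs"):
    """Return True if the repository directory under 'root' can be listed."""
    path = os.path.join(root, repo)
    try:
        # On hosts that mount CVMFS on demand, listing the directory
        # is what triggers the mount
        os.listdir(path)
    except OSError:
        return False
    return True


print(cvmfs_repo_available("gwosc.osgstorage.org"))
```

A job wrapper could exit early with a clear message when this returns False, rather than failing later with a missing-file error.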
IGWN private data in CVMFS¶
For proprietary IGWN Data (e.g. h(t)), which requires Credentials for access, a custom requirement is used to ensure that the target machine has the data, and can handle the credentials appropriately.
If your job needs access to IGWN proprietary data, including those paths returned by the GWDataFind server at https://datafind.igwn.org, the requirements command should be
requirements = HAS_CVMFS_IGWN_PRIVATE_DATA =?= True
Private data access requires credentials
Accessing the IGWN private data requires configuring your job to include Credentials.
GWOSC public data in CVMFS¶
For the GWOSC public CVMFS data repository (gwosc.osgstorage.org) the requirements command should be
requirements = HAS_CVMFS_gwosc_osgstorage_org =?= True