Data Management on the IGWN Computing Grid¶
The IGWN Computing Grid connects geographically-distributed resources that do not share a file system, so data must be carefully managed both into and out of each job.
HTCondor file transfer¶
IGWN workflows should use HTCondor's file transfer mechanism to transfer data to and from jobs, including between stages of a workflow.
There are a few noted exceptions to this general recommendation:
- when using CVMFS, see below,
- when reading GWF data from the local computing centre data archive (not data from
Accessing IGWN data with HTCondor¶
Many IGWN workflows require access to centrally-managed or shared IGWN data. IGWN supports this by publishing data into the Open Science Data Federation which enables access to these data from any machine.
What data are available?
For details of what data are distributed and available, please see IGWN Data and links therein.
There are a few different strategies for accessing data in an HTCondor workflow:
Download data using OSDF¶
HTCondor supports using osdf:// URLs in the transfer_input_files list given in the submit commands. This allows HTCondor to transfer the data directly to the execute point in an efficient way.
Discovering OSDF URLs¶
```
$ python3 -m gwdatafind -r datafind.igwn.org -o H -t H1_HOFT_C00 -s 1373580000 -e 1373590000 -u osdf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf
osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
```
When using OSDF URLs with HTCondor, the job submission must include appropriate credentials to access the data. This amounts to adding the token name as a prefix to the URL scheme in the transfer_input_files value.
See Examples for an example of configuring an HTCondor job for authorised access to OSDF data using tokens.
Once the OSDF URLs are known, they should be included in the HTCondor submit commands via the transfer_input_files argument, including the token name prefix as above:

```
transfer_input_files = igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-1373577216-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
```
The transfer_input_files mechanism copies the files from the OSDF cache directly into your job's working directory (aka the "sandbox"), so applications should use the local paths to those files, e.g.:
```
executable = /bin/head
arguments = -c4 H-H1_HOFT_C00-1373577216-4096.gwf H-H1_HOFT_C00-1373581312-4096.gwf H-H1_HOFT_C00-1373585408-4096.gwf H-H1_HOFT_C00-1373589504-4096.gwf
use_oauth_services = igwn
igwn_oauth_permissions = read:/ligo
should_transfer_files = yes
transfer_executable = false
transfer_input_files = igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373581312-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373585408-4096.gwf igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373589504-4096.gwf
```
Include the input file size in request_disk
The disk space requested for your job via the request_disk submit command must include the total size of all files that will be transferred into the sandbox.
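As a hedged sketch (the file size here is illustrative, not measured), a job transferring one of the 4096-second frame files above might reserve disk like this:

```
# one input frame file of roughly 2 GB (size illustrative);
# request extra headroom for any outputs written in the sandbox
transfer_input_files = igwn+osdf:///igwn/ligo/frames/O4/hoft_C00/H1/H-H1_HOFT_C00-137/H-H1_HOFT_C00-1373577216-4096.gwf
request_disk = 3GB
```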
You may also use the OSDF file transfer plugin to move files to and from any CIT access point. This is particularly appropriate for larger input and output files (1 GB or greater) where HTCondor file transfer may be slower or might overload the submit machine.
Copying files into the user-staging OSDF namespace
For details of how to copy files from a CIT access point into the /igwn/cit/staging OSDF namespace, please see Publishing data into OSDF.
OSDF URLs for user-curated data are not indexed by any GWDataFind service, but should be well documented by the owner of the data.
Paths may be discoverable directly from the filesystem at the LDAS-CIT computing centre under /osdf/igwn/cit/staging. The OSDF URL for such a filesystem path is constructed by removing the /osdf prefix and prepending the URL scheme osdf://.
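The prefix-stripping rule above can be sketched as a small Python helper; this function is purely illustrative and not part of any IGWN package:

```python
def to_osdf_url(path, token=None):
    """Convert a /osdf filesystem path (as seen at LDAS-CIT) to an OSDF URL.

    If ``token`` is given (e.g. "igwn"), it is prepended to the URL scheme,
    as required when the URL is used in an HTCondor submit file.
    """
    prefix = "/osdf"
    if not path.startswith(prefix + "/"):
        raise ValueError(f"path is not under {prefix}: {path}")
    scheme = f"{token}+osdf" if token else "osdf"
    # strip the /osdf prefix and prepend the scheme
    return f"{scheme}://{path[len(prefix):]}"

print(to_osdf_url("/osdf/igwn/cit/staging/marie.curie/input.txt", token="igwn"))
# igwn+osdf:///igwn/cit/staging/marie.curie/input.txt
```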
These URLs can then be used to specify an input file for an HTCondor job using the same
transfer_input_files syntax as above.
It is also possible for distributed IGWN jobs running at a remote site to send job outputs back to CIT using the
transfer_output_remaps HTCondor submit option.
Files placed into
/osdf/igwn/cit/staging cannot be modified.
For details see the staging documentation.
Reading and writing to the user-curated staging space requires authorisation. To read a file from your per-user staging space, your token needs the read:/staging/marie.curie scope. To write back to your per-user space, you require a token with the write:/staging/marie.curie scope. To obtain the latter, it is sufficient to request write:/staging; the token that you actually receive will be specific to your username.
Reading from and writing to the user-curated data namespace
An example submit file using both the read and write features might look like:
```
executable = /bin/cp
arguments = input.txt output.txt
should_transfer_files = yes
transfer_executable = false
transfer_input_files = igwn+osdf:///igwn/cit/staging/marie.curie/input.txt
transfer_output_files = output.txt
transfer_output_remaps = "output.txt = igwn+osdf:///igwn/cit/staging/marie.curie/output.txt"
use_oauth_services = igwn
igwn_oauth_permissions = read:/staging write:/staging
```
If your submit file will refer to the same path repeatedly, then you may find it convenient to use HTCondor's submit file variables to make your file transfers more readable. For example:
Using a variable to simplify
```
executable = /bin/cp
arguments = input.txt output.txt
OSDF_LOCATION = igwn+osdf:///igwn/cit/staging/marie.curie/workflows/run3
should_transfer_files = yes
transfer_executable = false
transfer_input_files = $(OSDF_LOCATION)/input.txt
transfer_output_files = output.txt
transfer_output_remaps = "output.txt = $(OSDF_LOCATION)/output.txt"
use_oauth_services = igwn
igwn_oauth_permissions = read:/staging write:/staging
```
Read data from CVMFS¶
Workflows that need to read data from CVMFS can configure their job requirements to ensure that the jobs only match with machines that have the necessary CVMFS repositories available.
When using data from
/cvmfs, it is not necessary to explicitly use HTCondor's file transfer mechanism.
IGWN private data in CVMFS¶
For proprietary IGWN Data (e.g. h(t)), which requires Credentials for access, a custom requirement is used to ensure that the target machine has the data, and can handle the credentials appropriately.
If your job needs access to IGWN proprietary data, including those paths returned by the GWDataFind server at https://datafind.igwn.org, the requirements command should be:
```
requirements = HAS_CVMFS_IGWN_PRIVATE_DATA =?= True
```
Private data access requires credentials
Accessing the IGWN private data requires configuring your job to include Credentials.
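Combining the two, a submit-file fragment for a job reading proprietary data from CVMFS might look like the following sketch; the igwn_oauth_permissions scope mirrors the earlier OSDF example and may need adjusting for the data your job actually reads:

```
# sketch: match only machines with private IGWN data in CVMFS,
# and request an IGWN token so the job can read it
requirements = HAS_CVMFS_IGWN_PRIVATE_DATA =?= True
use_oauth_services = igwn
igwn_oauth_permissions = read:/ligo
```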
GWOSC public data in CVMFS¶
For the GWOSC public CVMFS data repository (gwosc.osgstorage.org), the requirements command should be:
```
requirements = HAS_CVMFS_gwosc_osgstorage_org =?= True
```
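Since no file transfer is needed for data under /cvmfs, a minimal job reading a public GWOSC file might look like the following sketch; the frame path in arguments is a placeholder, not a real file:

```
# sketch: read public GWOSC data in place from CVMFS (no input transfer)
executable = /bin/head
arguments = -c4 /cvmfs/gwosc.osgstorage.org/path/to/frame.gwf
requirements = HAS_CVMFS_gwosc_osgstorage_org =?= True
should_transfer_files = yes
transfer_executable = false
queue
```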