HTCondor .sub generation tutorial¶
Your computing task can be uniquely described by its input(s), the executable to run, the parameters of that executable, and the produced output(s).
Following this schema it is trivial to create an HTCondor-specific job descriptor (namely a
.sub file, or submit file), which can be used to launch a specific task on the IGWN pool.
Every example reported in the following sections is available for download on GitHub. Feel free to fork and clone this repository.
Note: it is beyond the scope of this tutorial to exhaustively describe all possible HTCondor job configurations. For more detail, see the HTCondor manual. To take advantage of advanced features, you may wish to check your local HTCondor version with
condor_version and consult the appropriate manual version.
How to describe a job¶
A skeleton of a valid submit file is:
universe = vanilla

# Application
executable =
arguments =

# Resources
request_disk =
request_memory =

# Logging
log = std.log
output = std.out
error = std.err

# i/o
transfer_input_files =
transfer_output_files =

# Transfers
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# Accounting
accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1
A # introduces a comment (ignored by HTCondor). The lines in the "Application" and "Resources" sections must be filled in for all jobs, and the "i/o" lines must be filled in where applicable. Only if you are executing a script with no required input arguments may the
arguments line be omitted. The lines in the "Logging" section can be modified to redirect logging, output and error messages to other files.
By the time you are running high throughput workflows in HTCondor, you will likely have some sense of the disk and memory requirements of your application. If that is not true, you should be able to get a reasonable estimate for memory usage using programs like
htop and monitoring a command line call to your program during an interactive shell session. Similarly, disk usage requirements can be determined by listing directory contents for input and output with e.g.
ls -lh. These values are generally specified as <value> <unit>, where the unit may be Megabytes (MB) or Gigabytes (GB).
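For instance, a rough measurement can be taken during an interactive shell session before writing the submit file; the program and directory names below are only placeholders:

# estimate peak memory of a test run ("Maximum resident set size" in the report)
/usr/bin/time -v ./my_program --my-options

# estimate disk usage by listing the input and expected output files
ls -lh my_inputs/ my_outputs/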
Which disk does request_disk refer to?
request_disk refers to the local disk space that HTCondor manages for each job as it runs on an execute machine (referred to as a job's sandbox in the HTCondor manual), and does not refer to how much storage is used in your
/home directory on a submit machine, which may (or may not) be shared with execute nodes in a particular cluster. Nor does
request_disk refer to any other shared filesystems such as
/scratch/albert.einstein that some clusters have available.
Note that once a job completes and all specified output files have been transferred back to the submit machine, the job's sandbox is automatically deleted by HTCondor to free up local disk space for another job.
Jobs whose submit file does not specify request_disk will fail with the message ERROR: Failed to commit job submission into the queue., followed by a link to this documentation.
The lines in the "Transfers" section determine whether and when to transfer the files specified in "i/o" along with the job, or whether to assume the existence of a shared filesystem. It is generally recommended to use HTCondor file transfer, as above, for reasons of portability and robustness. The example above represents reasonable defaults; for more information, see the HTCondor manual.
The "Accounting" section should be filled using the tags generated from this site and (typically) your own name in the form
albert.einstein. The GitHub repository contains two tools,
set_IGWN_user.sh, which are provided to modify in bulk the accounting information of the submit files. Use them according to the informations reported above (e.g.
./set_IGWN_user.sh albert.einstein). Line 24 is the line which triggers the enqueuing of the task.
1. Input/Output-less job¶
In order to figure out how to complete the "Application" section, let's consider a simple example where one simply wants to run the following bash command on the worker node:
ls -lrt .
which will list the contents of the job's working directory on the HTCondor worker node. No input is needed for this operation, nor is an output file created: the only interesting output ends up in the std.out file, where the output of ls will be printed.
In this case the submit file becomes:
universe = vanilla

executable = /bin/ls
arguments = -lrt .

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1
where we have specified deliberately small values for disk and memory.
As you can see, the executable is /bin/ls, while all additional arguments are put in the arguments field. The executable path should not depend on the current working directory, hence it should be given as an absolute path.
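Assuming the submit file above is saved as, for instance, example1.sub (the name is arbitrary), the job can be submitted and monitored from the submit node with the standard HTCondor commands:

condor_submit example1.sub   # enqueue the job
condor_q                     # check its status in the queue
cat std.out                  # inspect the ls output once the job has completed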
2. Input-only job¶
In this case let's figure out how to describe a job which requires a file as input and which prints the file contents to std.out. Let's assume a file something_to_print.txt is present alongside the .sub file. The command to print the file content would be:
cat ./something_to_print.txt
In this case the submit file becomes:
universe = vanilla

transfer_input_files = ./something_to_print.txt

executable = /bin/cat
arguments = ./something_to_print.txt

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1
The path given to the transfer_input_files field is relative to the directory from which the job is submitted on the submit host. The main executable is /bin/cat, and the argument is the path of the transferred input file relative to the job's working directory on the worker node. Again, the disk and memory requirements for this job are small, but they must be specified. The input files are transferred automatically when the job starts.
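For a quick test, the input file can be created next to the submit file (here assumed to be called example2.sub) before submitting:

echo "This line will be printed by the job" > something_to_print.txt
condor_submit example2.sub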
3. Output-only job¶
A job which takes no input and generates a file or a directory can be represented by the following command:
echo "Hello world!" > ./my_test_output_file.txt
which creates a file filled with the "Hello world!" string.
In this case the submit file becomes:
universe = vanilla

executable = /bin/bash
arguments = "-c 'echo ""Hello world!"" > ./my_test_output_file.txt'"

transfer_output_files = ./my_test_output_file.txt

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein

queue 1
which should be easy to interpret. Please note that on the arguments line the double quotes have to be escaped with HTCondor syntax, doubling them (""), so that they are interpreted as literal " characters.
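Once the job completes, the transferred output can be inspected on the submit node; for example (the submit-file name is again an assumption), condor_wait can be used to block until the job recorded in the log file has finished:

condor_submit example3.sub
condor_wait std.log             # wait for the job tracked in std.log to finish
cat my_test_output_file.txt     # should contain "Hello world!"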
4. Input/Output job¶
For this example let's assume one wants to perform a byte copy of an input file, creating a new file as output. On the command line this could be done with cat:
cat ./my_input.txt >> ./my_output.txt
The submit file becomes:
universe = vanilla

transfer_input_files = ./my_input.txt

executable = /bin/cp
arguments = my_input.txt my_output.txt

transfer_output_files = my_output.txt

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein

queue 1
which should be straightforward to understand given the previous examples. Note that the submit file uses /bin/cp rather than cat: HTCondor runs the executable directly, without a shell, so the >> output redirection is not available and the copy is performed with cp instead.
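If one really wants to keep the shell-redirection form, a possible alternative (a sketch, not the repository version) is to wrap the command in bash using HTCondor's argument-quoting rules:

executable = /bin/bash
arguments = "-c 'cat ./my_input.txt >> ./my_output.txt'"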
5. Script-executing job¶
Let's assume a job runs a bash script named
script.sh, available on the submit node. In this case the custom script should be treated as an input file, so that it is transferred to the worker node by HTCondor automatically.
The submit file becomes:
universe = vanilla

executable = ./script.sh
arguments = test_argument

transfer_output_files = ./surprise.txt

request_disk = X MB
request_memory = Y MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein

queue 1
Remember to check and update the values X and Y for your script.sh by manually executing a test instance in the shell and checking the input/output file sizes and memory footprint, e.g. with ls -lh and htop as described above.
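The actual script.sh shipped in the repository may differ; a minimal sketch consistent with the submit file above (one argument in, surprise.txt out) could look like:

#!/bin/bash
# Placeholder for script.sh: echo the first argument into the
# output file that the submit file expects to transfer back.
echo "Argument received: $1" > surprise.txt

It is good practice to make the script executable on the submit node with chmod +x script.sh.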
HTCondor job IDs¶
To better organize the output files collected on the submit node, it might be useful to append or prepend the job ID to the .log, .out and .err files. To do so it is possible to use, inside the .sub file, a number of self-expanding fields: for example, $(CLUSTER) and $(PROCESS) are expanded to the corresponding cluster and process IDs.
Referring to the
.sub of the last example, it would become:
universe = vanilla

transfer_input_files = ./script.sh

executable = ./script.sh
arguments = <put your arguments here>

transfer_output_files = <register here the outputs you want to retrieve>

request_disk = <your disk usage> MB
request_memory = <your memory usage> MB

log = std-$(CLUSTER)-$(PROCESS).log
output = std-$(CLUSTER)-$(PROCESS).out
error = std-$(CLUSTER)-$(PROCESS).err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1
where the file names in the log, output and error lines will be expanded to include the cluster (job) and process IDs.
In case you only queue a single job with the submit file, the process ID $(PROCESS) will be zero. If you queue multiple instances in the job cluster, each instance will have its own process ID and corresponding log, output and error files.
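For instance, replacing queue 1 with queue 3 would create three processes in the same cluster; assuming HTCondor assigns cluster number 1234 (the actual number will differ), the resulting files would be:

std-1234-0.log  std-1234-0.out  std-1234-0.err
std-1234-1.log  std-1234-1.out  std-1234-1.err
std-1234-2.log  std-1234-2.out  std-1234-2.err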
Using X.509 credentials in a job¶
In case some of the operations you need to run in the HTCondor job require a valid GRID or VOMS proxy, add the following line to your submit file:
use_x509userproxy = true
Always remember to create a valid proxy on the submit node before calling condor_submit, using one of the following commands:
voms-proxy-init -rfc --voms virgo:virgo/virgo [--valid HH:MM]
ligo-proxy-init -p albert.einstein
where HH:MM express the required proxy duration (capped at 144 hours, i.e. 6 days), and the -p option is needed to create an RFC 3820 compliant proxy.
Please note that use_x509userproxy = true looks for the X509_USER_PROXY environment variable. In order to point to a specific proxy other than the default location, one can use the following equivalent approach:
x509userproxy = <path-to-custom-proxy>
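Before submitting, the validity and remaining lifetime of the proxy can be checked with, for example:

voms-proxy-info --all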
How to run on a specific site¶
In order to run on a specific site it is enough to add the following line to your submit file:
+DESIRED_Sites = "SITE-NAME"
where SITE-NAME is one (or a comma-separated list) of the sites listed here.
For example, to run exclusively at CNAF, add:
+DESIRED_Sites = "CNAF"
Avoid executable transfer¶
By default HTCondor transfers the executable from the submit node to the workers. To avoid that behaviour (e.g. when you are sure the executable is already installed on the worker nodes, for instance via CVMFS), a specific flag can be added to the submit file:
transfer_executable = False
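For example, a job whose executable is already available on the worker nodes through CVMFS might contain something like the following (the path is purely illustrative):

executable = /cvmfs/software.igwn.org/path/to/my_tool
transfer_executable = False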
Propagate your env to the worker nodes¶
To propagate specific environment variables from the submit node to the worker nodes, add a line like the following anywhere in the .sub file:
environment = "KEY=VALUE KEY2=VALUE2"