HTCondor .sub generation tutorial

Your computing task can be uniquely described by its input(s), the executable to run, the parameters of that executable, and the produced output(s).

Following this schema, it is straightforward to create an HTCondor-specific job descriptor (namely a .sub file, or submit file), which can be used to launch a specific task on the IGWN pool.

Every example reported in the following sections is available for download on GitHub. Feel free to fork and clone this repository.

Note: it is beyond the scope of this tutorial to exhaustively describe all possible HTCondor job configurations. For more detail, see the HTCondor manual. To take advantage of advanced features, you may wish to check your local HTCondor version with condor_version and consult the appropriate manual version.

How to describe a job

A skeleton of a valid submit file is:

universe = vanilla

# Application
executable =
arguments =

# Resources
request_disk = 
request_memory =

# Logging
log = std.log
output = std.out
error = std.err

# i/o
transfer_input_files =
transfer_output_files =

# Transfers
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# Accounting
accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1

where lines starting with # are comments (ignored by HTCondor). The lines in the "Application" and "Resources" sections must be filled in for all jobs, and the "i/o" lines where applicable. The arguments line may be omitted only if you are executing a script that requires no input arguments. Lines in the "Logging" section can be modified to redirect logging, output and error messages to other files.

By the time you are running high-throughput workflows in HTCondor, you will likely have some sense of the disk and memory requirements of your application. If not, you should be able to get a reasonable estimate of memory usage with programs like top or htop, monitoring a command-line call to your program during an interactive shell session. Similarly, disk usage can be determined by listing the sizes of the input and output files, e.g. with ls -lh. These values are generally specified as <value> <unit>, where the unit may be, for example, Megabytes (MB) or Gigabytes (GB).
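For the disk side, a quick way to bound request_disk is to sum the sizes of everything the job will read and write; du can do this directly (a sketch with hypothetical file names):

```shell
# Total size of the job's inputs plus expected outputs;
# -c prints a grand total, -h uses human-readable units.
du -ch input.dat output.dat
```

Peak memory can similarly be read off a trial run where GNU time is available (`/usr/bin/time -v <command>` reports "Maximum resident set size").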

Which disk does request_disk refer to?

request_disk refers to the local disk space that HTCondor manages for each job as it runs on an execute machine (referred to as a job's sandbox in the HTCondor manual), and does not refer to how much storage is used in your /home directory on a submit machine, which may (or may not) be shared with execute nodes in a particular cluster. Nor does request_disk refer to any other shared filesystems such as /scratch/albert.einstein that some clusters have available.

Note: once a job completes and all specified output files have been transferred back to the submit machine, the job's sandbox is automatically deleted by HTCondor to free up local disk space for another job.

Jobs whose submit file does not specify request_disk will fail with the message ERROR: Failed to commit job submission into the queue., followed by a link to this documentation.

The lines in the "Transfers" section determine whether and when to transfer the files specified in "i/o" with the job, or whether to assume the existence of a shared filesystem. It is generally recommended to use HTCondor file transfer, as above, for reasons of portability and robustness. The example above represents reasonable defaults; for more detail, see the HTCondor manual.

The "Accounting" section should be filled using the tags generated from this site and (typically) your own name in the form albert.einstein. The GitHub repository contains two tools, set_IGWN_group.sh and set_IGWN_user.sh, which are provided to modify the accounting information of the submit files in bulk. Use them according to the information reported above (e.g. ./set_IGWN_user.sh albert.einstein). The queue line is what triggers the enqueuing of the task.
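A bulk update of this kind can be sketched with plain sed (an illustration of the idea, not the repository scripts themselves; the file glob is an assumption):

```shell
# Rewrite the accounting_group_user line in every .sub file
# in the current directory (GNU sed in-place edit).
sed -i 's/^accounting_group_user = .*/accounting_group_user = albert.einstein/' ./*.sub
```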

1. Input/Output -less job

In order to figure out how to complete the "Application" section, let's consider a simple example where one simply wants to run the following bash command on the worker node:

ls -lrt .

which lists the contents of the working directory in your user home on the HTCondor worker node. No input is needed for this operation, nor is an output file created: the only interesting output appears in the std.out file, where the output of ls will be printed.

In this case the submit file becomes:

universe = vanilla

executable = /bin/ls
arguments = -lrt .

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1

where we have deliberately specified small values for disk and memory.

As you can see, the executable is /bin/ls, while all additional arguments go in the arguments field. The executable path should not depend on the working directory, hence it should be absolute.
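To find the absolute path to put in the executable line, command -v (or which) can be queried on a machine with the same software layout as the workers:

```shell
# Print the absolute path of the ls binary on this machine.
command -v ls
```

On many Linux systems this prints /bin/ls or /usr/bin/ls.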

2. Input-only job

In this case, let's figure out how to describe a job which requires an input file and prints the file contents to std.out.

Let's assume a file something_to_print.txt is present alongside the .sub file.

The command to print the file content would be:

cat ./something_to_print.txt

In this case the submit file becomes:

universe = vanilla

transfer_input_files = ./something_to_print.txt
executable = /bin/cat
arguments = ./something_to_print.txt

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1

The path given to the transfer_input_files field is relative to the directory from which the job is submitted on the submit host. The main executable is /bin/cat, and the argument is the path of the transferred input file, relative to the job's working directory on the worker node. Again, the disk and memory requirements for this job are small, but they are required.

The input files are transferred by HTCondor automatically when the job starts.

3. Output-only job

A job which takes no input and generates a file or a directory can be represented by the following command:

touch my_test_output_file.txt

which creates an empty file named my_test_output_file.txt (touch creates the file if it does not exist, or updates its timestamp if it does).

In this case the submit file becomes:

universe = vanilla

executable = /bin/touch
arguments = my_test_output_file.txt
transfer_output_files = ./my_test_output_file.txt

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein

queue 1

which should be straightforward to interpret. Note that when an arguments value does contain literal double quotes, HTCondor's syntax requires them to be escaped as doubled double quotes ("") to be interpreted as the " character.
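If the goal is instead a file that actually contains the "Hello world!" string, the command can be run through bash. In HTCondor's newer arguments syntax, the whole value is wrapped in double quotes, an argument containing spaces is wrapped in single quotes, and a literal " is written as "" (a sketch):

```
executable = /bin/bash
arguments = "-c 'echo ""Hello world!"" > my_test_output_file.txt'"
```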

4. Input/Output job

For this example, let's assume one wants to copy an input file to a new output file. The command to run is:

cp ./my_input.txt ./my_output.txt

The submit file becomes:

universe = vanilla

transfer_input_files = ./my_input.txt
executable = /bin/cp
arguments = my_input.txt my_output.txt
transfer_output_files = my_output.txt

request_disk = 1 MB
request_memory = 1 MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein

queue 1

which should be straightforward to understand given the previous examples.
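The copy can be checked locally before submitting; cmp exits successfully only when the two files are byte-identical (file names as in the example, sample content is an assumption):

```shell
# Create a sample input, copy it, and verify the copy byte-by-byte.
echo "sample data" > my_input.txt
cp ./my_input.txt ./my_output.txt
cmp ./my_input.txt ./my_output.txt && echo "identical"
```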

5. Script-executing job

Let's assume a job runs a bash script named script.sh, available on the submit node. In this case the custom script is specified as the executable, and HTCondor transfers it to the worker node automatically (the executable is transferred by default).

The submit file becomes:

universe = vanilla

executable = ./script.sh
arguments = test_argument
transfer_output_files = ./surprise.txt

request_disk = X MB
request_memory = Y MB

log = std.log
output = std.out
error = std.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = accounting_group_goes_here
accounting_group_user = albert.einstein

queue 1

Remember to check and set the values X and Y for your script.sh by manually running a test instance in the shell and checking the input/output file sizes and the memory footprint with e.g. top or htop.
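For reference, a minimal script.sh consistent with this submit file could look as follows (a sketch: the surprise.txt name comes from the transfer_output_files line, everything else is an assumption):

```shell
#!/bin/bash
# Write the first argument (test_argument in the submit file)
# into surprise.txt, which HTCondor then transfers back.
echo "Argument received: $1" > surprise.txt
```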

HTCondor goodies

HTCondor job IDs

To better organize the output files collected on the submit node, it can be useful to append or prepend the job ID to the .out, .log or .err file names. To do so, a plethora of automatically expanded macros can be used inside the .sub file. For example, $(CLUSTER) and $(PROCESS) are expanded to the corresponding values.

Referring to the .sub of the last example, it would become:

universe = vanilla

transfer_input_files = ./script.sh
executable = ./script.sh
arguments = <put your arguments here>
transfer_output_files = <register here the outputs you want to retrieve>

request_disk = <your disk usage> MB
request_memory = <your memory usage> MB

log = std-$(CLUSTER)-$(PROCESS).log
output = std-$(CLUSTER)-$(PROCESS).out
error = std-$(CLUSTER)-$(PROCESS).err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

accounting_group = something.somethingelse.somethingmore
accounting_group_user = albert.einstein

queue 1

where the file names in the log, output and error lines will be expanded to include the cluster and process IDs.

If you queue only a single job with the submit file, the process ID $(PROCESS) will be zero. If you queue multiple instances in the same job cluster, each instance will have its own process ID and corresponding log, output and error files.
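For example, replacing the last line with queue 5 submits five processes in one cluster; if HTCondor assigns cluster ID 123 (a hypothetical value), the macros expand as sketched here:

```
# With "queue 5" the output files become:
#   std-123-0.log / std-123-0.out / std-123-0.err
#   ...
#   std-123-4.log / std-123-4.out / std-123-4.err
queue 5
```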

Using X.509 credentials in a job

In case some of the operations you need to run in the HTCondor job require a valid GRID or VOMS proxy, add the following line to your submit file:

use_x509userproxy = true

Always remember to create a valid proxy on the submit node before calling condor_submit, via:

voms-proxy-init -rfc --voms virgo:virgo/virgo [--valid HH:MM]

or via:

ligo-proxy-init -p albert.einstein

where HH and MM express the required proxy duration (capped at 144 hours, i.e. 6 days), and the -rfc/-p option is needed to create an RFC 3820 compliant proxy.

Please note that use_x509userproxy = true looks for the proxy via the $X509_USER_PROXY environment variable. In order to point to a specific proxy other than the default location, one can use the following equivalent approach:

x509userproxy = <path-to-custom-proxy>
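Equivalently, the environment variable can be exported before calling condor_submit (the proxy path here is hypothetical):

```shell
# Make HTCondor pick up a proxy from a non-default location;
# use_x509userproxy = true reads X509_USER_PROXY.
export X509_USER_PROXY=/home/albert.einstein/my_proxy.pem
echo "$X509_USER_PROXY"
```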

How to run on a specific site

In order to run on a specific site it is enough to add the following line to your submit file:

+DESIRED_Sites = "SITE-NAME"

where SITE-NAME is one (or a comma-separated list) of the sites listed here.

For example, to run exclusively at CNAF, add:

+DESIRED_Sites = "CNAF"

Avoid executable transfer

By default HTCondor transfers the executable from the submit node to the workers. To avoid that behaviour (e.g. when you are sure the executable is already installed on the worker node, for instance via CVMFS), a specific flag can be added to the submit file:

transfer_executable = False

Propagate your env to the worker nodes

To pass environment variables from the submit node to the worker nodes, add the following line anywhere in the .sub file:

environment = "KEY=VALUE KEY2=VALUE2"
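For example, to hand two hypothetical variables to the job:

```
environment = "DATA_DIR=/data/mydataset OMP_NUM_THREADS=4"
```

Each KEY=VALUE pair becomes part of the job's environment on the worker node. To copy the whole submit-side environment instead, HTCondor also provides getenv = true.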