Configuring and Deploying Condor
- With the deployment of the new LDG-5.x, Condor is no longer installed via Pacman in the /opt/ldg directory.
- The OS's native package formats are used instead [*.rpm packages for RedHat-based OS (Scientific Linux, CentOS, Fedora, etc.) and *.deb packages for Debian]. All files are now installed in the standard Filesystem Hierarchy Standard [FHS] directories.
- Installation packages are downloaded from the Condor repositories (an example install is sketched after this list):
- For RedHat-based OS, a YUM repository exists at http://research.cs.wisc.edu/condor/yum/, which can be configured locally [by following the instructions at that URL] and then used to download the Condor packages.
- For Debian, the APT repository exists at http://research.cs.wisc.edu/condor/debian/.
- There is no longer any need to source "setup.sh" files to set special environment variables; the system environment suffices.
- The following instructions will assume Condor version >= 7.6.x
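For example, once the appropriate repository has been set up as described at the URLs above, the install itself is a single package installation. A minimal sketch, assuming the package is simply named "condor" as in the repositories above:
# RedHat-based OS (Scientific Linux, CentOS, Fedora), after adding the YUM repo:
yum install condor
# Debian, after adding the APT repo to /etc/apt/sources.list:
apt-get update
apt-get install condor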
The default installation and configuration is most likely not what your cluster needs.
As you probably already know, a machine running Condor can play three basic roles:
- as the Central Manager [CM] node that rules the entire (condor) cluster,
- as a condor submission node which is used by people to send jobs to the (condor) cluster,
- and/or as a condor execution node which is the machine that executes the jobs sent by users.
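These roles are not mutually exclusive: a machine that plays more than one role simply lists the daemons for each role in the DAEMON_LIST of its condor_config.local (described below). For example, a hypothetical node that both submits and executes jobs would use something like:
DAEMON_LIST = MASTER, SCHEDD, STARTD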
- Install the rpm/deb condor package on ALL the machines of your cluster. The package will automatically add a 'condor' user/group if it does not already exist. Cluster sites with a specific security policy should add the 'condor' user/group manually before performing the installation.
- Choose the machine that will play the role of the CM, and edit the main configuration file /etc/condor/condor_config:
- Set CONDOR_HOST to the FQDN [Fully Qualified Domain Name] of the machine that will play the role of CM. If this machine has a separate network interface just for access to the cluster nodes, use that interface's FQDN or equivalent IP address. For example, if that machine has the FQDN universe.sverige.kth.edu and the IP address 192.0.2.111, then set CONDOR_HOST=universe.sverige.kth.edu or, equivalently, CONDOR_HOST=192.0.2.111.
- Set RELEASE_DIR = /usr
- Set LOCAL_DIR = /var
- Set LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
- Set CONDOR_ADMIN to an appropriate email address
- Set UID_DOMAIN to the subnet domain for your cluster. For example, at UWM a typical node has FQDN medusa-slave001.medusa.phys.uwm.edu so the subnet domain is medusa.phys.uwm.edu. This means, UID_DOMAIN=medusa.phys.uwm.edu
- Set FILESYSTEM_DOMAIN = $(FULL_HOSTNAME) if your cluster does NOT have a shared filesystem for users, or set it to the subnet domain if it does have a shared filesystem for users.
- Set USE_NFS = True if your cluster has a shared filesystem for users.
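Putting the settings above together, the edited portion of /etc/condor/condor_config might look like the following sketch, where the host name, domain, and admin address are illustrative placeholders to be replaced with your own values:
# Illustrative values only -- substitute your own host, domain, and address
CONDOR_HOST = cm.example-cluster.edu
RELEASE_DIR = /usr
LOCAL_DIR = /var
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
CONDOR_ADMIN = condor-admin@example-cluster.edu
UID_DOMAIN = example-cluster.edu
# With a shared filesystem for users:
FILESYSTEM_DOMAIN = example-cluster.edu
USE_NFS = True
# Without one, instead set FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)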
- Create a tar file containing the configuration file, which you can then deploy onto each node of your cluster:
tar -cf condor.tar /etc/condor/condor_config
- Deploy the tar file onto each node of your cluster, making sure that the deployed file ends up in the correct path [that is, the new condor_config will overwrite the previously installed config file]; a scripted example follows the next step.
- On each node execute chkconfig --add condor (on RedHat-based systems; Debian systems use the update-rc.d mechanism instead).
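If you have ssh access to all the nodes, the deployment and chkconfig steps can be scripted along the following lines (the node names are placeholders; adapt the loop to however you normally address your cluster):
# Placeholder node names -- replace with your cluster's host names
for node in node001 node002 node003; do
    scp condor.tar ${node}:/tmp/
    ssh ${node} "tar -C / -xf /tmp/condor.tar && chkconfig --add condor"
done
The tar file created above stores the path without its leading slash, so extracting it with -C / puts condor_config back in /etc/condor/.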
- Back on the CM machine, edit the file /etc/condor/condor_config.local to add the following:
COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
where the CONDOR_HOST value is taken from the same variable in the /etc/condor/condor_config file.
- If this machine has a separate network interface for the cluster nodes, also add the line
NETWORK_INTERFACE = your IP address
where "your IP address" is the IP address of that separate network interface.
- Choose a machine to be the condor submission node, and edit its /etc/condor/condor_config.local to add the following:
COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST = MASTER, SCHEDD
where the CONDOR_HOST value is taken from the same variable in the /etc/condor/condor_config file.
- Choose the machines that will be the condor execution nodes, and on each one edit /etc/condor/condor_config.local to add the following:
COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST = MASTER, STARTD
where the CONDOR_HOST value is taken from the same variable in the /etc/condor/condor_config file.
- Start Condor first on the CM machine, with:
/etc/init.d/condor start
- Start condor on the condor submission node by doing the same.
- Start condor on the condor execution nodes by doing the same on each node.
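To check that the expected daemons actually came up on a node, you can inspect the process list and the log directory; with the native packages and LOCAL_DIR = /var the logs normally live under /var/log/condor (verify the LOG setting in your configuration if they are not there):
pgrep -l condor         # should list condor_master plus the daemons in DAEMON_LIST
ls /var/log/condor/     # MasterLog plus, per role, CollectorLog, NegotiatorLog, SchedLog, StartLog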
- On the Condor Central Manager machine run
/usr/bin/condor_status
to see the status of your Condor pool. You should see something similar to:
Name               OpSys  Arch   State     Activity LoadAv Mem   ActvtyTime

nemo-slave0001.nem LINUX  X86_64 Claimed   Busy     1.000  3960  0+06:42:47
nemo-slave0003.nem LINUX  X86_64 Claimed   Busy     1.000  3960  0+09:14:30
nemo-slave0004.nem LINUX  X86_64 Claimed   Busy     1.000  3960  0+06:41:30
nemo-slave0005.nem LINUX  X86_64 Claimed   Busy     1.000  3960  0+03:47:21
nemo-slave0006.nem LINUX  X86_64 Claimed   Busy     0.990  3960  0+00:48:46
nemo-slave0007.nem LINUX  X86_64 Claimed   Busy     0.990  3960  0+00:36:10
nemo-slave0008.nem LINUX  X86_64 Claimed   Busy     0.990  3960  0+00:31:38
nemo-slave0009.nem LINUX  X86_64 Claimed   Busy     1.000  3960  0+02:47:22
nemo-slave0010.nem LINUX  X86_64 Claimed   Busy     0.990  3960  0+05:15:34
nemo-slave0011.nem LINUX  X86_64 Claimed   Busy     0.990  3960  0+06:41:03
...
slot1_2@nemo-slave LINUX  X86_64 Claimed   Busy     1.000  25    1+01:02:19
slot1_3@nemo-slave LINUX  X86_64 Claimed   Busy     1.000  25    1+01:00:46
slot1_4@nemo-slave LINUX  X86_64 Claimed   Busy     1.000  25    1+00:59:34
slot1_7@nemo-slave LINUX  X86_64 Claimed   Busy     1.000  25    1+00:56:14
slot1_8@nemo-slave LINUX  X86_64 Claimed   Busy     1.000  25    0+10:49:23
slot1_9@nemo-slave LINUX  X86_64 Claimed   Busy     1.000  25    0+10:47:48
slot2@nemo-slave11 LINUX  X86_64 Unclaimed Idle     0.270  48397 2+03:25:40

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX  5171     0    4931       234       0          6        0

               Total  5171     0    4931       234       0          6        0
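As a further check that jobs actually flow through the pool, you can submit a trivial test job from the submission node. A minimal sketch of a submit file (the file names are arbitrary):
# test.sub -- trivial vanilla-universe test job
universe                = vanilla
executable              = /bin/sleep
arguments               = 60
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
output                  = test.out
error                   = test.err
log                     = test.log
queue
Submit it and watch it run with:
condor_submit test.sub
condor_q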
This completes a basic deployment and configuration of Condor. You are strongly encouraged to read the Condor Manual and learn how best to configure Condor for your particular cluster.
