Disable mounting of home directories on LIGO Lab worker nodes
Effective Date: This will be rolled out gradually, starting 21 April 2026
Service Impacted: Users submitting HTCondor workflows from access points (submit nodes) at LIGO Lab managed compute clusters that should run locally at those clusters and that require access to the home directory filesystem from a running job.
Details
Overview
Until this change is deployed, jobs submitted to the local HTCondor pool from an access point (AP, sometimes also called a submit node or login node) at any of the three LIGO Lab clusters (CIT, LHO, or LLO) have been able to access the home filesystem, mounted in each job under a path such as /home/marie.curie/. Allowing any job on the cluster to read from and write to these shared filesystems from any worker node is incompatible with stable performance of those filesystems, so with this change such access will be gradually removed.
Impact
The removal of home directory mounts on execute points (EPs, also known as worker nodes) will begin at CIT on 21 April 2026. It will not initially be deployed to all worker nodes, but only to a subset: the shared nodes dedicated to low-latency analyses that have indicated they are ready for this transition. Since whether or not the home filesystem is mounted is a binary choice for each EP, a dedicated EP that stops mounting this filesystem for a low-latency analysis will also not mount it when made available for offline analyses. Over time, the fraction of the CIT pool where the home filesystem is mounted will decrease, and jobs that still require it will have a shrinking set of resources on which they may run. The transition timeline for LHO and LLO will be determined at a later date, informed by the experience at CIT.
This change will not remove the mounting of the home filesystem on access points. Note that at CIT, it is already the case that exactly one of the citloginN access points mounts any given user's home directory read-write; on all others it is mounted read-only.
This change will not affect jobs that use HTCondor file transfer to move all of their input data to a working "sandbox" on the EP, and that likewise use HTCondor file transfer to return any necessary checkpoint files, output files, and output directories. Indeed, any workflow that needs access to the widest possible set of resources should be modified to use this mechanism (for large files in offline workflows, using OSDF is also advised).
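As an illustration of the file transfer approach described above, the following is a minimal sketch of a submit description file, not a recommended template. The executable, file names, and resource requests are hypothetical placeholders, and any site-specific submit commands your workflow already uses are omitted.

```condor
# Sketch only: a job that depends on HTCondor file transfer rather than
# on a shared home filesystem. All file names below are hypothetical.
executable              = analyze.sh
arguments               = input.h5

# Copy the listed inputs into the job sandbox on the EP, and copy the
# listed outputs back to the AP when the job exits.
should_transfer_files   = YES
transfer_input_files    = input.h5, config.ini
transfer_output_files   = results
when_to_transfer_output = ON_EXIT

# Standard output, standard error, and the job event log are written on
# the AP, relative to the directory the job was submitted from.
output                  = job.$(ClusterId).$(ProcId).out
error                   = job.$(ClusterId).$(ProcId).err
log                     = job.$(ClusterId).log

# Placeholder resource requests; adjust to the real needs of the job.
request_cpus            = 1
request_memory          = 2GB
request_disk            = 4GB

queue
```

Inside the job, the executable should refer to its inputs and outputs by paths relative to the sandbox (here input.h5, config.ini, and results) rather than by absolute paths under /home.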
If, in the short term, your workflow cannot be readily modified to use HTCondor file transfer, then there are two options:
- (Preferred) Append (TARGET.EPNFS =?= True) to the job submit requirements. If you choose this approach, the job may be submitted from any AP at CIT (though note that most APs now target the distributed IGWN grid by default, and additional syntax is needed to ensure that a job only runs locally at CIT).
- Submit your workflow at CIT using the special access point epnfs.ligo.caltech.edu. All jobs submitted from this access point will automatically have a condor job transform added to them so that they will only target EPs on which the home filesystem is mounted.
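For the preferred option, appending the clause to a submit description file might look like the fragment below. This is a sketch under assumptions: the macro name my_requirements and the architecture constraint are hypothetical stand-ins for whatever requirements the job already has. The additional syntax needed to keep a job local to CIT is not shown, since it is not specified here.

```condor
# Fragment only, not a complete submit file.
# Hypothetical pre-existing constraint; replace it with the job's own
# requirements, or drop it if the job has none.
my_requirements = (TARGET.Arch == "X86_64")

# Append the EPNFS clause so the job only matches EPs that still mount
# the home filesystem.
requirements    = $(my_requirements) && (TARGET.EPNFS =?= True)
```

If the job has no other requirements, the line reduces to requirements = (TARGET.EPNFS =?= True).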
The dedicated AP epnfs.ligo.caltech.edu will only submit to the local CIT pool. Therefore, hybrid workflows in which some jobs are intended to run on the IGWN grid while others must run on the local CIT cluster with access to a shared home filesystem must use the first (preferred) method above to control where the jobs requiring home filesystem access land.
If a job implicitly assumes access to the shared home filesystem but does not assert this need using one of these two methods, then with some probability it will land on an EP that does not have the home filesystem mounted, and will likely encounter errors at startup, while running, or when creating output files. Over time, the fraction of EPs without the home filesystem will increase, and with it the fraction of jobs that fail because they require the home filesystem but do not declare it.
Even for jobs that do declare their need for the home filesystem, the fraction of EPs where they can run will decrease over time. If such a workflow needs more resources, it should be modified to remove the assumption of a home filesystem.