OSPool Site Admin Documentation
Supported Cluster OSes and HTCondor Versions
OS | HTCondor | Notes |
---|---|---|
EL7 (*) | 23.10.* | EL7 is no longer supported, and thus our ability to support such systems may be removed at any time. |
EL8 (*) | 24.* (> 24.0) | |
EL9 (*) | 24.* (> 24.0) | |
Debian 11 (bullseye) | 24.* (> 24.0) | |
Debian 12 (bookworm) | 24.* (> 24.0) | |
Ubuntu 20.04 (focal) | 24.0.* | Ubuntu 20.04 is no longer supported, and thus our ability to support such systems may be removed at any time. |
Ubuntu 22.04 (jammy) | 24.* (> 24.0) | |
Ubuntu 24.04 (noble) | 24.* (> 24.0) | |
(*) Tested variants are RHEL, Alma, and Rocky.
Monitoring and Information
Viewing researcher jobs running within a glidein job
1. Log in to a worker node as the OSPool user (e.g., “osg01”, though the account name may be different at your site).

2. Pick a glidein (and its directory). Note: SCRATCH is the path of the scratch directory you gave us to put glideins in.

   - Option 1: List HTCondor processes, pick one, and note its glidein directory:

     $ ps -u osg01 -f | grep master
     osg01 4122304 [...] SCRATCH/glide_qJDh7z/main/condor/sbin/condor_master [...]

   - Option 2: List glidein directories and pick an active one: pick a “glide_xxxxxx” directory that has a lease file with a timestamp less than about 5 minutes old (one way to check is sketched after these steps).

   Note: Let GLIDEDIR be the path to the chosen glidein directory, e.g., SCRATCH/glide_xxxxxx
3. Make sure HTCondor is recent enough:

   $ GLIDEDIR/main/condor/bin/condor_version
   $CondorVersion: 24.6.1 2025-03-20 BuildID: 794846 PackageID: 24.6.1-1 $
   $CondorPlatform: x86_64_AlmaLinux9 $
   The HTCondor version should be 23.7.0 or later; if not, go back to step 2 and pick a different glidein directory.
4. Pick the PID of an HTCondor “startd” process to query:

   - If you are just exploring, list the running “condor_startd” processes and pick any one; it does not matter whether it is associated with the HTCondor instance identified in steps 2–3 above. (One possible command is sketched after these steps.)
   - If you are looking for the “startd” associated with some other process, start with a process tree (also sketched after these steps): find the process of interest, then work upward in the process tree to the nearest ancestor “condor_startd” process.

   In either case, note the PID (leftmost number) on the “condor_startd” line you picked.
5. Run condor_who on the PID you picked:

   $ GLIDEDIR/main/condor/bin/condor_who -pid PID -ospool
   Batch System : SLURM
   Batch Job    : 346675
   Birthdate    : 2025-04-07 14:47:02
   Temp Dir     : /var/lib/condor/execute/osg01/glide_qJDh7z/tmp
   Startd       : [email protected] has 1 job(s) running:

   PROJECT   USER   AP_HOSTNAME         JOBID      RUNTIME    MEMORY   DISK     CPUs EFCY PID     STARTER
   Inst-Proj jsmith ap20.uc.osg-htc.org 27781234.0 0+00:17:43 512.0 MB 129.0 MB 1    0.00 4124321 4123123
   …
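One possible way to carry out Option 2 of step 2 above, assuming the lease file sits at the top level of each glidein directory (the exact file name may vary, so inspect the listing):

   $ ls -ltd SCRATCH/glide_*       # list glidein directories, most recently modified first
   $ ls -lt SCRATCH/glide_xxxxxx   # look for a lease file modified within the last ~5 minutes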
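Possible commands for step 4 above, assuming the OSPool user is “osg01” (substitute your site’s account name):

   $ ps -u osg01 -f | grep condor_startd   # just exploring: pick any condor_startd line
   $ ps -u osg01 -fH                       # process tree: find your process, then walk up to the nearest condor_startd ancestor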
Definitions of fields in the header (before the PROJECT USER … row):
Field | Definition |
---|---|
Batch System | HTCondor’s name for the type of batch system you are running. |
Batch Job | The identifier for this glidein job in your batch system. |
Birthdate | When HTCondor began running within this glidein; typically, this is a few minutes after the glidein job itself began running. |
Temp Dir | The path to the glidein job directory (remove the trailing “/tmp”). |
Startd | HTCondor’s identifier for its “startd” process within the glidein job. |
Definitions of fields in each row of the researcher job table:
Field | Definition |
---|---|
PROJECT | The OSPool project identifier for this researcher job. |
USER | The OSPool AP’s user identifier for this researcher job. |
AP_HOSTNAME | The OSPool AP’s hostname. |
JOBID | HTCondor’s identifier for this researcher job on its AP. |
RUNTIME | HTCondor’s value for the runtime of the researcher job. |
MEMORY | The amount of memory (RAM), in MB, that HTCondor allocated to this researcher job. |
DISK | The amount of disk, in KB, that HTCondor allocated to this researcher job. |
CPUs | The number of CPU cores that HTCondor allocated to this researcher job. |
EFCY | An HTCondor measure of the efficiency of the job, roughly calculated as CPU time / wallclock time; a value noticeably greater than the CPUs value may mean the researcher job is using more cores than requested. |
PID | The local PID of the researcher job or, more often, of the root of the process tree for the researcher job (e.g., a wrapper script, or even Singularity or Apptainer for a researcher job in a container). |
STARTER | The local PID of the HTCondor “starter” process that owns this researcher job. |
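For example, using the rough formula above, a researcher job that accumulated 30 minutes of CPU time over about 17 minutes of wallclock time would show an EFCY near 1.7; with CPUs = 1, that suggests the job is using more cores than it requested. (These numbers are illustrative, not taken from the sample output.)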
Viewing researcher jobs running within all OSPool glidein jobs
Note: Steps 1–3 are the same as above.
1. Log in to a worker node as the OSPool user (e.g., “osg01”, though the account name may be different at your site).

2. Pick a glidein (and its directory). Note: SCRATCH is the path of the scratch directory you gave us to put glideins in.

   - Option 1: List HTCondor processes, pick one, and note its glidein directory:

     $ ps -u osg01 -f | grep master
     osg01 4122304 [...] SCRATCH/glide_qJDh7z/main/condor/sbin/condor_master [...]

   - Option 2: List glidein directories and pick an active one: pick a “glide_xxxxxx” directory that has a lease file with a timestamp less than about 5 minutes old.

   Note: Let GLIDEDIR be the path to the chosen glidein directory, e.g., SCRATCH/glide_xxxxxx
3. Make sure HTCondor is recent enough:

   $ GLIDEDIR/main/condor/bin/condor_version
   $CondorVersion: 24.6.1 2025-03-20 BuildID: 794846 PackageID: 24.6.1-1 $
   $CondorPlatform: x86_64_AlmaLinux9 $
   The HTCondor version should be 23.7.0 or later; if not, go back to step 2 and pick a different glidein directory.
4. Run condor_who on all discoverable glideins on this host running as the current user:

   $ GLIDEDIR/main/condor/bin/condor_who -allpids -ospool
   Batch System : SLURM
   Batch Job    : 346675
   Birthdate    : 2025-04-07 14:47:02
   Temp Dir     : /var/lib/condor/execute/osg01/glide_qJDh7z/tmp
   Startd       : [email protected] has 1 job(s) running:

   PROJECT   USER   AP_HOSTNAME         JOBID      RUNTIME    MEMORY   DISK     CPUs EFCY PID     STARTER
   Inst-Proj jsmith ap20.uc.osg-htc.org 27781234.0 0+00:17:43 512.0 MB 129.0 MB 1    0.00 4124321 4123123
   …
For each glidein job, there will be one set of heading lines (“Batch System”, etc.) and a table of researcher jobs, one per line; format and definitions are as above.
OSPool Contribution Requirements
Contributing via a Hosted CE
- The cluster and login node are set up for our user account (a quick spot check is sketched after this checklist):
  - The cluster is operational and generally works
  - The user account has a home directory on the login node
  - The user account can read, write, and execute files and directories within its home directory
  - Our home directory has enough available space and inodes (exact amounts TBD, but not a lot)
  - PATh staff know the right partition (and other batch system configuration) to use
  - The batch system is configured to allow the user account to submit jobs to the right partition(s) and for the default job “shape” (e.g., 1 core, 2 GB memory, and a 24-hour maximum run time)
- It is possible to SSH from the CE to the login node (a key-installation sketch follows this checklist):
  - PATh staff know the current hostname of your login node
  - That hostname has a public DNS entry that resolves to the correct IP address
  - PATh staff know the user account name (default: “osg01”)
  - PATh staff know any SSH configuration details to use (e.g., an alternate port or a jump host)
  - The SSH client on one of our IP addresses can connect to your login node (through firewalls, etc.)
  - The provided SSH public key has been installed in the right place and with the right permissions
  - The provided SSH public key is sufficient for authentication by your SSH server
- The worker nodes on which our jobs may run are ready (a quick spot check is sketched after this checklist):
  - Our home directory is shared with each cluster node
  - PATh staff know the correct path to scratch space for jobs (ideally on each worker node, but a shared filesystem may work)
  - Our user account can create subdirectories and run executables in the scratch directory
  - The worker nodes have permissive outbound network connectivity to the Internet (default allow; please note any specific restrictions)
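A minimal spot check for the login-node items above, run as the OSPool user on the login node. The file name is illustrative, and the required space and inode headroom is whatever PATh staff tell you, so the df output is only for inspection:

   $ touch ~/ospool-write-test && rm ~/ospool-write-test   # home directory exists and is writable by the account
   $ df -h ~ ; df -i ~                                      # available space and inodes on the home filesystem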
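One way to install the provided SSH public key, assuming a standard OpenSSH server and that the key file has been copied to /tmp/path-ce.pub (an illustrative path, not something PATh provides):

   $ mkdir -p ~/.ssh && chmod 700 ~/.ssh              # authorized_keys must live in a private ~/.ssh
   $ cat /tmp/path-ce.pub >> ~/.ssh/authorized_keys   # append the provided public key
   $ chmod 600 ~/.ssh/authorized_keys                 # keep permissions strict; sshd may reject the key otherwise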
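A quick worker-node spot check, run as the OSPool user, where SCRATCH is the scratch path reported to PATh; the directory name and the curl target are illustrative only and do not replace reviewing your actual outbound firewall rules:

   $ mkdir -p SCRATCH/ospool-test && cd SCRATCH/ospool-test                  # can we create subdirectories in scratch?
   $ printf '#!/bin/sh\necho ok\n' > run.sh && chmod +x run.sh && ./run.sh   # can we run executables from scratch?
   $ curl -sI https://osg-htc.org | head -n 1                                # basic outbound HTTPS connectivity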