OSPool Site Admin Documentation
Supported Cluster OSes and HTCondor Versions
OS | HTCondor | Notes |
---|---|---|
EL7 (*) | 23.10.* | EL7 is no longer supported, and thus our ability to support such systems may be removed at any time. |
EL8 (*) | 24.* (> 24.0) | |
EL9 (*) | 24.* (> 24.0) | |
Debian 11 (bullseye) | 24.* (> 24.0) | |
Debian 12 (bookworm) | 24.* (> 24.0) | |
Ubuntu 20.04 (focal) | 24.0.* | Ubuntu 20.04 is no longer supported, and thus our ability to support such systems may be removed at any time. |
Ubuntu 22.04 (jammy) | 24.* (> 24.0) | |
Ubuntu 24.04 (noble) | 24.* (> 24.0) | |
(*) Tested variants are RHEL, Alma, and Rocky.
Monitoring and Information
Viewing researcher jobs running within a glidein job
1. Log in to a worker node as the OSPool user (e.g., “osg01”, though the account name may be different at your site).

2. Pick a glidein (and its directory). Note: SCRATCH is the path of the scratch directory you gave us to put glideins in.

   - Option 1: List HTCondor processes, pick one, and note its glidein directory:

     $ ps -u osg01 -f | grep master
     osg01 4122304 [...] SCRATCH/glide_qJDh7z/main/condor/sbin/condor_master [...]

   - Option 2: List glidein directories and pick an active one: pick a “glide_xxxxxx” directory that has a lease file with a timestamp less than about 5 minutes old (one way to check is sketched after these steps).

   Note: Let GLIDEDIR be the path to the chosen glidein directory, e.g., SCRATCH/glide_xxxxxx
3. Make sure HTCondor is recent enough:

   $ GLIDEDIR/main/condor/bin/condor_version
   $CondorVersion: 24.6.1 2025-03-20 BuildID: 794846 PackageID: 24.6.1-1 $
   $CondorPlatform: x86_64_AlmaLinux9 $
   The HTCondor version should be 23.7.0 or later; if not, go back to step 2 and pick a different glidein directory.
4. Pick the PID of an HTCondor “startd” process to query:

   - If you are just exploring, list the running “condor_startd” processes and pick any one; it does not matter whether it is associated with the HTCondor instance identified in steps 2–3 above. (One possible command is sketched after these steps.)
   - If you are looking for the “startd” associated with some other process, start with a process tree (also sketched after these steps): find the process of interest, then work upward in the process tree to the nearest ancestor “condor_startd” process.

   In either case, note the PID (leftmost number) on the “condor_startd” line you picked.
5. Run condor_who on the PID you picked:

   $ GLIDEDIR/main/condor/bin/condor_who -pid PID -ospool
   Batch System : SLURM
   Batch Job    : 346675
   Birthdate    : 2025-04-07 14:47:02
   Temp Dir     : /var/lib/condor/execute/osg01/glide_qJDh7z/tmp
   Startd       : [email protected] has 1 job(s) running:

   PROJECT   USER   AP_HOSTNAME         JOBID      RUNTIME    MEMORY   DISK     CPUs EFCY PID     STARTER
   Inst-Proj jsmith ap20.uc.osg-htc.org 27781234.0 0+00:17:43 512.0 MB 129.0 MB 1    0.00 4124321 4123123
   …
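One possible way to carry out Option 2 of step 2 above, assuming the lease file sits at the top level of each glidein directory (the exact file name may vary, so inspect the listing):

   $ ls -ltd SCRATCH/glide_*       # list glidein directories, most recently modified first
   $ ls -lt SCRATCH/glide_xxxxxx   # look for a lease file modified within the last ~5 minutes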
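Possible commands for step 4 above, assuming the OSPool user is “osg01” (substitute your site’s account name):

   $ ps -u osg01 -f | grep condor_startd   # just exploring: pick any condor_startd line
   $ ps -u osg01 -fH                       # process tree: find your process, then walk up to the nearest condor_startd ancestor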
Definitions of fields in the header (before the PROJECT USER … row):
Field | Definition |
---|---|
Batch System | HTCondor’s name for the type of batch system you are running. |
Batch Job | The identifier for this glidein job in your batch system. |
Birthdate | When HTCondor began running within this glidein; typically, this is a few minutes after the glidein job itself began running. |
Temp Dir | The path to the glidein job directory (remove the trailing “/tmp”). |
Startd | HTCondor’s identifier for its “startd” process within the glidein job. |
Definitions of fields in each row of the researcher job table:
Field | Definition |
---|---|
PROJECT | The OSPool project identifier for this researcher job. |
USER | The OSPool AP’s user identifier for this researcher job. |
AP_HOSTNAME | The OSPool AP’s hostname. |
JOBID | HTCondor’s identifier for this researcher job on its AP. |
RUNTIME | HTCondor’s value for the runtime of the researcher job. |
MEMORY | The amount of memory (RAM), in MB, that HTCondor allocated to this researcher job. |
DISK | The amount of disk, in KB, that HTCondor allocated to this researcher job. |
CPUs | The number of CPU cores that HTCondor allocated to this researcher job. |
EFCY | An HTCondor measure of the efficiency of the job, roughly calculated as CPU time / wallclock time; a value noticeably greater than the CPUs value may mean the researcher job is using more cores than requested. |
PID | The local PID of the researcher job or, more often, of the root of the process tree for the researcher job (e.g., a wrapper script, or even Singularity or Apptainer for a researcher job in a container). |
STARTER | The local PID of the HTCondor “starter” process that owns this researcher job. |
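For example, using the rough formula above, a researcher job that accumulated 30 minutes of CPU time over about 17 minutes of wallclock time would show an EFCY near 1.7; with CPUs = 1, that suggests the job is using more cores than it requested. (These numbers are illustrative, not taken from the sample output.)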
Viewing researcher jobs running within all OSPool glidein jobs
Note: Steps 1–3 are the same as above.
1. Log in to a worker node as the OSPool user (e.g., “osg01”, though the account name may be different at your site).

2. Pick a glidein (and its directory). Note: SCRATCH is the path of the scratch directory you gave us to put glideins in.

   - Option 1: List HTCondor processes, pick one, and note its glidein directory:

     $ ps -u osg01 -f | grep master
     osg01 4122304 [...] SCRATCH/glide_qJDh7z/main/condor/sbin/condor_master [...]

   - Option 2: List glidein directories and pick an active one: pick a “glide_xxxxxx” directory that has a lease file with a timestamp less than about 5 minutes old.

   Note: Let GLIDEDIR be the path to the chosen glidein directory, e.g., SCRATCH/glide_xxxxxx
3. Make sure HTCondor is recent enough:

   $ GLIDEDIR/main/condor/bin/condor_version
   $CondorVersion: 24.6.1 2025-03-20 BuildID: 794846 PackageID: 24.6.1-1 $
   $CondorPlatform: x86_64_AlmaLinux9 $
   The HTCondor version should be 23.7.0 or later; if not, go back to step 2 and pick a different glidein directory.
4. Run condor_who on all discoverable glideins on this host running as the current user:

   $ GLIDEDIR/main/condor/bin/condor_who -allpids -ospool
   Batch System : SLURM
   Batch Job    : 346675
   Birthdate    : 2025-04-07 14:47:02
   Temp Dir     : /var/lib/condor/execute/osg01/glide_qJDh7z/tmp
   Startd       : [email protected] has 1 job(s) running:

   PROJECT   USER   AP_HOSTNAME         JOBID      RUNTIME    MEMORY   DISK     CPUs EFCY PID     STARTER
   Inst-Proj jsmith ap20.uc.osg-htc.org 27781234.0 0+00:17:43 512.0 MB 129.0 MB 1    0.00 4124321 4123123
   …
For each glidein job, there will be one set of heading lines (“Batch System”, etc.) and a table of researcher jobs, one per line; format and definitions are as above.
OSPool Contribution Requirements
Contributing via a Hosted CE
- The cluster and login node are set up for our user account (a quick spot check is sketched after this checklist):
  - The cluster is operational and generally works
  - The user account has a home directory on the login node
  - The user account can read, write, and execute files and directories within its home directory
  - Our home directory has enough available space and inodes (exact amounts TBD, but not a lot)
  - PATh staff know the right partition (and other batch system configuration) to use
  - The batch system is configured to allow the user account to submit jobs to the right partition(s) and for the default job “shape” (e.g., 1 core, 2 GB memory, and a 24-hour maximum run time)
- It is possible to SSH from the CE to the login node (a key-installation sketch follows this checklist):
  - PATh staff know the current hostname of your login node
  - That hostname has a public DNS entry that resolves to the correct IP address
  - PATh staff know the user account name (default: “osg01”)
  - PATh staff know any SSH configuration details to use (e.g., an alternate port or a jump host)
  - The SSH client on one of our IP addresses can connect to your login node (through firewalls, etc.)
  - The provided SSH public key has been installed in the right place and with the right permissions
  - The provided SSH public key is sufficient for authentication by your SSH server
- The worker nodes on which our jobs may run are ready (a quick spot check is sketched after this checklist):
  - Our home directory is shared with each cluster node
  - PATh staff know the correct path to scratch space for jobs (ideally on each worker node, but a shared filesystem may work)
  - Our user account can create subdirectories and run executables in the scratch directory
  - The worker nodes have permissive outbound network connectivity to the Internet (default allow; please note any specific restrictions)
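A minimal spot check for the login-node items above, run as the OSPool user on the login node. The file name is illustrative, and the required space and inode headroom is whatever PATh staff tell you, so the df output is only for inspection:

   $ touch ~/ospool-write-test && rm ~/ospool-write-test   # home directory exists and is writable by the account
   $ df -h ~ ; df -i ~                                      # available space and inodes on the home filesystem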
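One way to install the provided SSH public key, assuming a standard OpenSSH server and that the key file has been copied to /tmp/path-ce.pub (an illustrative path, not something PATh provides):

   $ mkdir -p ~/.ssh && chmod 700 ~/.ssh              # authorized_keys must live in a private ~/.ssh
   $ cat /tmp/path-ce.pub >> ~/.ssh/authorized_keys   # append the provided public key
   $ chmod 600 ~/.ssh/authorized_keys                 # keep permissions strict; sshd may reject the key otherwise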
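A quick worker-node spot check, run as the OSPool user, where SCRATCH is the scratch path reported to PATh; the directory name and the curl target are illustrative only and do not replace reviewing your actual outbound firewall rules:

   $ mkdir -p SCRATCH/ospool-test && cd SCRATCH/ospool-test                  # can we create subdirectories in scratch?
   $ printf '#!/bin/sh\necho ok\n' > run.sh && chmod +x run.sh && ./run.sh   # can we run executables from scratch?
   $ curl -sI https://osg-htc.org | head -n 1                                # basic outbound HTTPS connectivity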