# Testing OSG Software Prereleases on the Madison ITB Site
This document contains basic recipes for testing OSG software prereleases on the Madison ITB site, including HTCondor prerelease builds and full OSG software stack prereleases from Yum.
## Prerequisites
The following items are known prerequisites to using this recipe. If you are not running the Ansible commands from `osghost`, there are almost certainly other prerequisites that are not listed below; even when using `osghost` for Ansible and `itb-submit` for the submissions, some prerequisites may still be missing. Please improve this document by adding other prerequisites as they are identified!
-   A checkout of the `osgitb` directory from our local git instance (not GitHub)
-   Your X.509 DN in the `osgitb/unmanaged/htcondor-ce/grid-mapfile` file and (via Ansible) on `itb-ce1` and `itb-ce2`; see the note after this list for one way to look up your DN
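If you do not remember your DN string, one way to look it up is sketched below. The certificate path is an assumption (adjust to wherever your user certificate lives), and depending on your OpenSSL version the subject may be printed in comma-separated rather than slash-separated form, in which case it needs to be rewritten to match the grid-mapfile format.

```
# Print the subject DN of your user certificate (path is an assumption; adjust as needed).
openssl x509 -in ~/.globus/usercert.pem -noout -subject

# Alternatively, if you already have a VOMS proxy, print the identity DN directly.
voms-proxy-info -identity
```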
## Gathering Information
This section is technically skippable; it is about checking the state of the ITB machines before making changes. The plan is to keep the ITB machines generally up to date independently, so those steps are not listed here. And honestly, the steps below are just some ideas; do whatever makes sense for the given update. The commands can be run as-is from within the `osgitb` directory from git.
-   Check OS versions for all current ITB hosts:

        ansible current -i inventory -f 20 -o -m command -a 'cat /etc/redhat-release'

-   Check the date and time on all hosts (in case NTP stops working):

        ansible current -i inventory -f 20 -o -m command -a 'date'

-   Check software versions for certain hosts (e.g., for the `condor` package on hosts in the `workers` group):

        ansible workers -i inventory -f 20 -o -m command -a 'rpm -q condor'
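Another check along the same lines, offered only as a suggestion: list the enabled Yum repositories on each host, since later steps depend on enabling the right OSG repositories. This sketch reuses the `current` inventory group from above.

```
# List enabled Yum repositories on each host (extra suggestion, not part of the original recipe).
ansible current -i inventory -f 20 -o -m command -a 'yum repolist enabled'
```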
## Installing HTCondor Prerelease
Use this section to install a new version of HTCondor, specifically a prerelease build from the development or upcoming-development repository, on the test hosts.
1.  Obtain the NVR of the HTCondor prerelease build from OSG to test. Do this by talking to Tim T. and checking Koji.

2.  Shut down HTCondor and HTCondor-CE on the prerelease machines:

        ansible 'testing:&ces' -i inventory -bK -f 20 -m service -a 'name=condor-ce state=stopped'
        ansible 'testing:&condor' -i inventory -bK -f 20 -m service -a 'name=condor state=stopped'
3.  Install the new version of HTCondor on the prerelease machines:

        ansible 'testing:&condor' -i inventory -bK -f 10 -m command -a 'yum --enablerepo=osg-development --assumeyes update condor'

    or, if you need to install an NVR that is “earlier” (in the RPM sense) than what is currently installed:

        ansible 'testing:&condor' -i inventory -bK -f 10 -m command -a 'yum --enablerepo=osg-development --assumeyes downgrade condor condor-classads condor-python condor-procd blahp'
4.  Verify correct RPM versions across the site:

        ansible condor -i inventory -f 20 -o -m command -a 'rpm -q condor'

5.  Restart HTCondor and HTCondor-CE on the prerelease machines (see the version check after this list):

        ansible 'testing:&condor' -i inventory -bK -f 20 -m service -a 'name=condor state=started'
        ansible 'testing:&ces' -i inventory -bK -f 20 -m service -a 'name=condor-ce state=started'
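After the restart, it may be worth confirming that the daemons came back up with the expected version. This is only a sketch, using the same inventory groups as above; the `systemctl` checks assume the hosts run systemd.

```
# Report the installed HTCondor version on each prerelease machine.
ansible 'testing:&condor' -i inventory -f 20 -o -m command -a 'condor_version'

# Confirm the services are running again (assumes systemd hosts).
ansible 'testing:&condor' -i inventory -f 20 -o -m command -a 'systemctl is-active condor'
ansible 'testing:&ces' -i inventory -f 20 -o -m command -a 'systemctl is-active condor-ce'
```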
## Installing a Prerelease of the OSG Software Stack
Use this section to install new versions of all OSG software from a prerelease repository in Yum.
1.  Check with the Release Manager to make sure that the prerelease repository has been populated with the desired package versions.

2.  Make sure that software is generally up to date on the hosts; see the Madison ITB Site document for more details. It may be desirable to update only non-OSG software at this stage, in which case one could simply disable the OSG repositories by adding command-line options to the `yum update` commands (see the sketch after this list).
3.  Install the new software on the prerelease hosts:

        ansible testing -i inventory -bK -f 20 -m command -a 'yum --enablerepo=osg-prerelease --assumeyes update'

4.  Read the Yum output carefully, and follow up on any warnings, etc.

5.  If the `osg-configure` package was updated on any host(s), run the `osg-configure` command on those host(s):

        ansible testing -i inventory -bK -f 20 -m command -a 'osg-configure -v' -l [HOST(S)]
        ansible testing -i inventory -bK -f 20 -m command -a 'osg-configure -c' -l [HOST(S)]

6.  Verify the OSG software updates by inspecting the Yum output carefully or by examining specific package versions:

        ansible current -i inventory -f 20 -o -m command -a 'rpm -q osg-wn-client'

    Use an inventory group and package names that best fit the situation.
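For step 2, here is a minimal sketch of updating only the non-OSG software. It assumes that the repository IDs of all OSG repositories on these hosts match the `osg*` pattern; check with `yum repolist` first if unsure.

```
# Update everything except packages coming from OSG repositories (repo-ID pattern is an assumption).
ansible testing -i inventory -bK -f 20 -m command -a 'yum --disablerepo=osg* --assumeyes update'
```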
## Running Tests
For the first two test workflows, use your personal space on `itb-submit`. Copy or check out the `osgitb/htcondor-tests` directory to get the test directories.
### Part Ⅰ: Submitting jobs directly
1.  Change into the `1-direct-jobs` subdirectory.

2.  If there are old result files in the directory, remove them:

        make distclean

3.  Submit the test workflow:

        condor_submit_dag test.dag

4.  Monitor the jobs until they are complete or stuck (see the monitoring sketch after this list).

    In the initial test runs, the entire workflow ran in a few minutes. If the DAG or jobs exit immediately, go on hold, or otherwise fail, then you have some troubleshooting to do! Keep trying steps 2 and 3 until you get a clean run (or one or more HTCondor bug tickets).
5.  Check the final output file:

        cat count-by-hostnames.txt

    You should see a reasonable distribution of jobs by hostname, keeping in mind the different number of cores per machine and the fact that HTCondor can and will reuse claims to process many jobs on a single host. Especially watch out for a case in which no jobs run on the newly updated hosts (at the time of writing: `itb-data[456]`).

6.  (Optional) Clean up, using the `make clean` or `make distclean` commands. Use the `clean` target to remove intermediate result and log files generated by a workflow run but preserve the final output file; use the `distclean` target to remove all workflow-generated files (plus Emacs backup files).
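A few commands that may be handy for the monitoring in step 4. This is only a sketch; the log filename assumes the DAG file is named `test.dag`, as above.

```
# Show the DAG and its jobs together; repeat (or wrap in watch) until the queue drains.
condor_q -dag

# Follow DAGMan's own log for progress and error messages.
tail -f test.dag.dagman.out

# If jobs go on hold, the hold reason is usually the quickest clue.
condor_q -hold -af HoldReason
```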
### Part Ⅱ: Submitting jobs using HTCondor-C
If direct submissions fail, there is probably no point in doing this step.
1.  Change into the `2-htcondor-c-jobs` subdirectory.

2.  If there are old result files in the directory, remove them:

        make distclean

3.  Get a proxy for your X.509 credentials:

        voms-proxy-init

4.  Submit the test workflow:

        condor_submit_dag test.dag

5.  Monitor the jobs until they are complete or stuck (see the sketch after this list).

    In the initial test runs, the entire workflow ran in 10 minutes or less; generally, this test takes longer than the direct-submission test because of the layers of indirection. Also, status updates from the CEs back to the submit host are infrequent. For direct information about the CEs, log in to `itb-ce1` and `itb-ce2` to check status; don't forget to check both `condor_ce_q` and `condor_q` on the CEs, probably in that order.

    If the DAG or jobs exit immediately, go on hold, or otherwise fail, then you have some troubleshooting to do! Keep trying steps 2 and 4 until you get a clean run (or one or more HTCondor bug tickets).
6.  Check the final output file:

        cat count-by-hostnames.txt

    Again, look for a reasonable distribution of jobs by hostname.

7.  (Optional) Clean up, using the `make clean` or `make distclean` commands.
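A rough sketch of what to look at while monitoring in step 5. The CE-side commands are the ones named above; the proxy check is an extra suggestion, since an expired proxy is a common cause of held HTCondor-C jobs.

```
# On the submit host: confirm the VOMS proxy has not expired.
voms-proxy-info -all

# On itb-ce1 and itb-ce2: jobs as seen by the CE, then jobs in the CE's local HTCondor pool.
condor_ce_q
condor_q
```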
### Part Ⅲ: Submitting jobs from a GlideinWMS VO Frontend
For this workflow, use your personal space on `glidein3.chtc.wisc.edu`. Copy or check out the `osgitb/htcondor-tests` directory to get the test directories. Again, if the previous steps fail, do not bother with this one.
1.  Change into the `3-frontend-jobs` subdirectory.

2.  If there are old result files in the directory, remove them:

        make distclean

3.  Submit the test workflow:

        condor_submit_dag test.dag

4.  Monitor the jobs until they are complete or stuck (see the sketch after this list).

    This workflow could take much longer than the first two, maybe an hour or so. Also, unless there are active glideins, it will take 10 minutes or longer for the first glideins to appear and start matching jobs. Thus it is helpful to monitor `condor_q -totals` until all of the jobs are submitted (there should be 2001), then switch to monitoring `condor_status` until glideins start appearing. After the first jobs start running and finishing, it is probably safe to ignore the rest of the run. If the jobs do not appear in the local queue, if glideins do not appear, or if jobs do not start running on the glideins, it is time to start troubleshooting.
5.  Check the final output file:

        cat count-by-hostnames.txt

    The distribution of jobs per execute node may be more skewed than in the first two workflows, due to the way in which pilots ramp up over time and how HTCondor allocates jobs to slots.

6.  (Optional) Clean up, using the `make clean` or `make distclean` commands.
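For the monitoring in step 4, the loop can be as simple as the following sketch; `watch` and the 30-second interval are just one convenient choice.

```
# Until all 2001 jobs are queued:
watch -n 30 condor_q -totals

# Then, until glideins appear in the pool and jobs start matching:
watch -n 30 condor_status
```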