OSG Technology Area Meeting, 7 June 2022

Announcements

  • Mat OOO Jun 13
  • BrianL traveling Jun 13-16, OOO Jun 17
  • Next week's meeting canceled

Triage Duty

  • This week: Mat
  • Next week: TimT
  • 11 (-4) open FreshDesk tickets
  • 1 (+0) open GGUS ticket

Jira (as of Monday)

# of tickets   Δ    State
191            +1   Open
32             +1   Selected for Dev
27             +3   In Progress
14             -3   Dev Complete
28             +3   Ready for Testing
0              +0   Ready for Release

OSG Software Team

  • Doc focus Friday afternoon
  • JLab FE and VO certs expire next Thursday; Ops may need a reminder next week to update the certs
  • BNL will be submitting EIC jobs via Harvester and will pass along a subject DN that we need added to the vo-client
  • Streamline deployments of OSDF caches and origins
    • AI (Mat): Add topology endpoint serving cache/namespace mappings (SOFTWARE-5179)
    • AI (Mat): Evaluate state of SLATE stashcache chart (SOFTWARE-5211)
    • AI (Carl): Add stashcache lookup by resource name (SOFTWARE-4347)
    • AI (Carl): stash-cache/origin: allow toggling VOMS by environment variable (SOFTWARE-5106)
    • AI (Carl): authfile generation fails if origin serves no public data (SOFTWARE-5028)
    • AI (Carl): Authfiles for auth stash origins should include the origin's DN (SOFTWARE-4399)
    • AI (Carl): Validate OSDF data in Topology (SOFTWARE-4167)
  • AI (Carl): backfill container should shut down upon HTCondor exit (SOFTWARE-4608)
  • AI (Carl): add ability to specify custom workflows in opensciencegrid/images (SOFTWARE-5013)
  • AI (Carl): remove duplicate contact entries in the Topology contact pages (SOFTWARE-5214)

Discussion

There are several Docker images that are built from individual GitHub repositories instead of the central "images" repository. They should be moved to the images repo; keep an eye on how well GitHub Actions workflows scale, though.

COManage registration denials seem to be sticky: to admit someone you have previously denied, you have to expunge their CO Person record.

Support Update

  • JLab (BrianL): JLab and their ScotsGrid site are seeing low CPU utilization from Slurm for CLAS12 pilots. They will lift the blacklisting of ScotsGrid to try to gather more info
  • Belle II (BrianL): will be having a meeting this week to discuss token-based pilot submission
  • Texas Tech (BrianL): new backfill container site, working on fixing COManage registrations
  • WeNMR (BrianL): helping DIRAC developers test token-based pilot submission
  • BNL (Mat): Experiencing CA failures with XRootD TPC. Their CA certificates look fine; Mat has asked them to run some manual tests to try to isolate the error. The XRootD developers are Cc'ed on the ticket and can provide debugging assistance.

OSG Release Team

  • Ready for Testing
    • GlideinWMS 3.9.5
    • gratia-probe 2.6.1
      • Log schedd cron errors with newer versions of HTCondor
      • Replace AuthToken* references with routed job attributes
      • Remove certinfo file log messages
      • Fix crash on send failure
    • HTCondor 9.0.13: Bug fix release
      • Schedd and startd cron jobs can now log output upon non-zero exit
      • condor_config_val now produces correct syntax for multi-line values
      • The condor_run tool now reports submit errors and warnings to the terminal
      • Fix issue where Kerberos authentication would fail within DAGMan
      • Fix HTCondor startup failure with certain complex network configurations
    • HTCondor-CE 5.1.5
      • Fix whole node job glidein CPUs and GPUs exprs that caused held jobs
      • Fix bug where default CERequirements were being ignored
      • Pass whole node request from GlideinWMS to the batch system
      • Rename AuthToken attributes in the routed job to better support accounting
      • Prevent GSI environment from pointing the job to the wrong certificates
      • Fix issue where HTCondor-CE would need port 9618 open to start up
    • XCache 3.1.0
      • Fixed library dependency issues for xcache-reporter
      • Add systemd overrides for xrootd-privileged
    • XRootD 5.4.3 RC4
    • htvault-config 1.13
      • removes support for old style secret storage, requires htgettoken >= 1.7
    • htgettoken 1.12
      • avoids crash when verbose output includes UTF-8
    • osg-pki-tools 3.5.2
      • bug fix for osg-incommon-cert-request when using host file
    • osg-release 3.6-5: Add osg-next yum repository
    • osg-token-renewer 0.8.2
      • use oidc-agent's built-in password file option
      • ensure tokens are renewed more frequently than their lifespan
    • rrdtool 1.8.0-1.2.el7: make Python RRDtools available to GlideinWMS
    • stashcp 6.7.5
      • Adds multi-file transfer and improved error messages
      • relax download timeouts for file transfer plugin
      • multiple bug fixes
    • xrootd-multiuser 2.0.4
      • fix crash on EL8

Discussion

Packages sometimes get stuck in the Ready for Testing state because they are considered "critical" but no external people have provided feedback. Mat suggests amending our release policy so that critical packages are released after a month without negative feedback, even if no positive feedback has been received. Tim says that would be better for software that has received VMU testing or has been well tested outside of OSG. We have an IRIS-HEP metric to increase the percentage of packages tested in VMU; Brian will create tickets. We also discussed improving coordination of testing between OSG Operations and Software.
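
For reference, the release rule proposed above could be written as a small decision function. This is only a sketch of the idea as discussed, not an adopted policy or existing tooling: the `Package` fields, the function name, the 30-day reading of "a month", and the `require_other_testing` switch (Tim's variant) are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "a month" is approximated as 30 days.
WAITING_PERIOD = timedelta(days=30)


@dataclass
class Package:
    """Hypothetical record for a critical package in Ready for Testing."""
    name: str
    ready_for_testing_since: date
    positive_feedback: bool = False
    negative_feedback: bool = False
    vmu_tested: bool = False
    tested_outside_osg: bool = False


def critical_package_releasable(pkg: Package, today: date,
                                require_other_testing: bool = False) -> bool:
    """Return True if a critical package could be released under the proposal."""
    if pkg.negative_feedback:
        # Negative feedback always blocks the release.
        return False
    if pkg.positive_feedback:
        # Assumption: positive external feedback is already sufficient today.
        return True
    # Proposed rule: release after the waiting period with no negative feedback.
    waited = today - pkg.ready_for_testing_since >= WAITING_PERIOD
    if require_other_testing:
        # Tim's variant: also require VMU testing or testing outside of OSG.
        return waited and (pkg.vmu_tested or pkg.tested_outside_osg)
    return waited
```

For example, a hypothetical package that entered Ready for Testing on 1 May 2022 with no feedback at all would be releasable on 7 June 2022 under Mat's version of the rule, but not under Tim's variant unless it had been VMU tested or tested outside of OSG.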