[32263] Glidein Jobs Not Picking Up Work

Past Updates

As a follow-up:
the issue has been solved in GlideinWMS, ticket https://cdcvs.fnal.gov/redmine/issues/16151 and https://cdcvs.fnal.gov/redmine/issues/16147
The fix will be in GWMS 3.2.19; an RC will be available in 1-2 weeks, and the release is scheduled in 3 weeks.
Marco

Thank you for informing us that this issue has been resolved. I will close this ticket.
Please submit problems, requests, and questions at: https://ticket.grid.iu.edu/goc/open

OSG Grid Operations Center
goc@...., 317-278-9699
Visit the OSG Operations Page: http://osggoc.blogspot.com/

I think we're set.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Thanks Jeff.

Stephen, may we assist with anything else?

Thank you,
Vince

Hi,

I'm still in communication with the glideinWMS developers to understand how to move forward. However, I implemented a patch to revert to 3.2.14 behavior on one of our factories, so at least we are back to full utilization of whole node pilots at Hyak.

Unless Stephen has anything else to add, I think this can be closed; the larger problem is not a Hyak problem.

Thanks,
Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

Morning Jeff,

Any luck with your investigation?

Thank you,

Sorry, spoke too soon :( GLIDEIN_CPUS = "node" is causing our frontends to stop requesting at Hyak.

The investigation continues. I'll reply when we know more.

Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

In the meantime, I've tested and confirmed we can use GLIDEIN_CPUS = "node" at Hyak.

Marco suggested this previously; it forces the pilot to determine the core count from the hardware, and since Hyak is again configured to give us a full node, I think this is at least resolved for Hyak.

Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

I opened a GWMS ticket to enhance CPU detection; the fix will be in 3.2.19, probably a month from now (available sooner via patch):
https://cdcvs.fnal.gov/redmine/issues/16147

The suggestion is to use the largest of:
PBS_NUM_PPN
the number of occurrences of the host in PBS_NODEFILE
PBS_NP, if PBS_NUM_NODES=1
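
The rule above can be sketched roughly as follows. This is only an illustration, not the actual GlideinWMS patch: the function name `detect_pbs_cores` and its `env` and `hostname` parameters are hypothetical, and it assumes PBS_NODEFILE contains one line per allocated core slot.

```python
import os
import socket


def detect_pbs_cores(env=None, hostname=None):
    """Return the largest core count advertised by the PBS environment (at least 1)."""
    env = os.environ if env is None else env
    hostname = socket.gethostname() if hostname is None else hostname

    def as_int(name):
        # Missing or non-numeric variables count as 0 so they never win the max().
        try:
            return int(env.get(name, ""))
        except ValueError:
            return 0

    candidates = [as_int("PBS_NUM_PPN")]

    # Count how many times this host appears in the node file, if it is readable.
    nodefile = env.get("PBS_NODEFILE")
    if nodefile:
        try:
            with open(nodefile) as fh:
                short = hostname.split(".")[0]
                candidates.append(
                    sum(1 for line in fh if line.strip().split(".")[0] == short)
                )
        except OSError:
            pass

    # PBS_NP spans all nodes, so it is only comparable for single-node jobs.
    if as_int("PBS_NUM_NODES") == 1:
        candidates.append(as_int("PBS_NP"))

    return max(1, max(candidates))
```

With the values reported in this ticket under ALLPROCS (PBS_NP=12, PBS_NUM_NODES=1, PBS_NUM_PPN=1) and no node file, this sketch returns 12.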

Thanks Stephen.

You are correct that pre-glideinWMS 3.2.16 this is how it worked.  We're trying to be smarter though and configure the condor pilot to only use "what the batch system gives us" because we have some cases where we request a subset of cores on a node at a site, and should only use that number rather than all of the cores on the node, since other jobs probably have claimed those.

In your case it's much simpler since you give us a whole node, but we need the code to be smart enough to handle all the use cases.

Sorry for the hassle. We're discussing some options out of band now; perhaps PBS_NODEFILE is the solution.

Thanks,
Jeff

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

ALLPROCS is a feature of Moab (scheduler). How it sets things in Torque (aka PBS, resource manager) is out of our control. Moab is a closed source product and we're running a version they EOL'ed a year ago. The long-term plan is to switch to Slurm. The new cluster we just rolled out Friday uses Slurm.

It was my understanding that the Condor instance itself running on the node automatically detected the number of CPUs (from looking at the hardware) and fetched the right number of tasks from the Condor collector based on that. Maybe this is incorrect or you want to go some other direction. PBS_NP or the file at the location in PBS_NODEFILE is the way to go. PBS_NODEFILE has definitely been in existence for the longest.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Hi Stephen,

I think we found the issue, but we don't really understand it.

From Marco (gwms developer) in a conversation out of band [1] it looks like when you set the ALLPROCS flag, the pbs environment variables in our jobs don't completely advertise what we'd expect.  Do you happen to know what happens under the hood when you set ALLPROCS?  Is that a policy you made up or is that a built in functionality of PBS?  If it's the former, can you possibly change how it sets these env vars from your end so that both PBS_NP and PBS_NUM_PPN give the full core count of the node?  If it's the latter and out of your control, I think Marco's suggestion sounds reasonable, but we'll need to patch the factory code again to get this working.

What do you think?

Thanks,
Jeff

[1]
"Jeff,
It was explained to me that PBS_NP is the number of cores across all nodes used by the job (# of processors) and PBS_NUM_PPN is the cores on this node (# of processors per node), so I’m looking at the second (PBS_NUM_PPN), but here I see:
PBS_NP=12
PBS_NUM_NODES=1
PBS_NUM_PPN=1

Which makes little sense to me (i.e. I can use 12 processors across 1 node and on this node I can use 1).
Maybe I misunderstood the meaning.
I could change it to use the bigger of the 2 if the node count is one. Multi-node jobs are only on HPC facilities (NERSC, …).
Suggestions?

Thanks,
Marco"

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

I've reinstated the ALLPROCS flag.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Hi Stephen,

Based on Marco's findings, I think the glideins are now doing the right thing, but something has changed from your side and we no longer get a whole node by default when we remove xcount from the pilot classad.

I suspect the change you made is at the following comment:
https://ticket.opensciencegrid.org/32263#1485452825

Can you please change it back so that when we use the osg queue without xcount, we get a full node?  I'm putting the entries in downtime in the factory until you make the change.

Thanks!
Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

A correction and an update.
Correction: Nodes report to have 12 cores on 2 cpus (not 8 cores as I said before - was a mistake).

Update: I tried also a submission asking for 4 cores (+xcount = 4) and I received 4 cores:
{code}
PBS_NP=4
PBS_NUM_PPN=4
{code}

And the script (same used in glideins) detected 4 cores.
Cheers,
Marco

Hi Jeff, Stephen,
I did some test submissions using the settings in the factory:
<pre>
grid_resource = condor globus1.hyak.washington.edu globus1.hyak.washington.edu:9619
+queue = osg
</pre>

I do see that the nodes have 8 cores (from /proc/cpu), but the job is assigned only one core. In the environment of my test job (same as glidein) I see:
<pre>
PBS_NP=1
PBS_NUM_PPN=1
</pre>

From what I know, those 2 variables (specifically the last one on the node) should tell me the cores I can use:
* PBS_NUM_PPN	Number of procs per node allocated to the job
* PBS_NP	Number of execution slots (cores) for the job

Did I miss some parameter in the job submission?
Is the job getting only one core?
Did I misunderstand the meaning of those variables?

Please help by suggesting what to fix or how to check the cores available for the job.
Thank you,
Marco
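
One way to check this from inside a test job is to print the PBS variables next to the hardware core count. A minimal sketch, assuming a Python interpreter is available on the worker node; the helper name `core_report` is hypothetical and not part of the glidein scripts.

```python
import os


def core_report(env=None):
    """Collect the hardware core count and the PBS-advertised counts."""
    env = os.environ if env is None else env
    report = {"hardware_cores": os.cpu_count()}
    for name in ("PBS_NP", "PBS_NUM_NODES", "PBS_NUM_PPN"):
        value = env.get(name)
        # Keep None for unset or non-numeric variables so they differ from 0.
        report[name] = int(value) if value is not None and value.isdigit() else None
    return report


if __name__ == "__main__":
    for key, value in core_report().items():
        print(f"{key}={value}")
```

A hardware count larger than PBS_NUM_PPN would match what is described above: the node has more cores than the batch system advertises to the job.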

Hi Stephen,

Sorry for the bad news, but it looks like this still isn't working as before. It's possible the configuration semantics have changed however, so I've contacted the glideinWMS developers to understand what's going on.

In the meantime, I put back in our fixed 8 core pilot kludge.

I'll report back when I have more.

Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

Hi Stephen,

I went ahead and reverted the Hyak entry to do the whole node requesting, with dynamic cpu and memory requesting.

Let's keep this open until we confirm everything is fixed.

Thanks,
Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

If the automatic setting of cores and memory is fixed, we'd prefer to switch back.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Stephen,

Would you like to go back to whole node scheduling with GLIDEIN_CPUS = auto and GLIDEIN_MaxMemMBs = 0, or are you happy with the current settings of 8 cores and MaxMemMBs of 6hrs?

Thanks,
Marian
(gWMS Factory Ops)

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada

Stephen,

It looks like they did revert the GLIDEIN_CPUS setting in glideinWMS 3.2.17 [1]. So we can probably begin testing a GLIDEIN_CPUS = auto entry configuration again on the GOC-ITB factory when you're ready.

Marty Kandes
UCSD Glidein Factory Operations

[1]

http://glideinwms.fnal.gov/doc.prd/history.html

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

Hi Vince,

we just deployed 3.2.17 on our ITB today.

-Marian

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada

Hi Marty,

Has glideinWMS 3.2.17 testing started?  Please advise.

Thank you,

Stephen,

Cool. Thanks. I can confirm we're seeing 8-core glideins submitted to Hyak successfully run many users' jobs across multiple VOs. Let's leave this ticket open until we circle back around to test glideinWMS 3.2.17 in conjunction with you. In the meantime, please let us know if you come across any other issues on your side.

Thanks,

Marty Kandes
UCSD Glidein Factory Operations

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

ppn=8 and flags=ALLPROCS are incompatible. You can only use ppn=1 with ALLPROCS. I've commented out the adding of ALLPROCS and now the jobs are running.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Stephen,

I contacted the glideinWMS developers. Basically, the GLIDEIN_CPUS = auto setting we were using in conjunction with your whole node scheduling at Hyak is simply broken in glideinWMS 3.2.16. They anticipate releasing glideinWMS 3.2.17 sometime next week. It'll then probably take us a few weeks to completely roll it out from testing to production. So we'll be stuck with the current configuration with explicitly setting the number of cores per glidein --- which is currently 8 --- for the time being. Once we have glideinWMS 3.2.17 in testing we can work with you to see if all the fixes expected are working correctly before pushing it into production.

Note, it does not look like we've had any glideins run at Hyak successfully since we moved to the explicit multicore setup. It could simply be they're waiting in the queue much longer than Atlas jobs. Most glideins had been idling for a few days from what I saw. Just to be safe, I refreshed the idle glideins with a new batch. Can you make sure that you see these glideins move from the HTCondor-CE to the local batch queue? And, of course, check if there appear to be any scheduling issues on your side. I'll try to keep an eye on things from our end as well to make sure we at least get some glideins running again while we wait for glideinWMS 3.2.17.

Marty Kandes
UCSD Glidein Factory Operations

P.S. I also decided to explicitly set the requested memory to 20 GB per 8-core glidein instead of keeping it in auto-detection too. This should keep user jobs safely below your 3 GB per core limit.
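
As a quick check of that arithmetic, assuming the 3 GB per core limit applies evenly across the 8 cores:

```python
requested_memory_gb = 20     # explicit memory request per glidein (from this ticket)
cores_per_glidein = 8        # explicit core count per glidein
site_limit_gb_per_core = 3   # Hyak limit quoted in this ticket

memory_per_core_gb = requested_memory_gb / cores_per_glidein
print(memory_per_core_gb)                            # 2.5
print(memory_per_core_gb <= site_limit_gb_per_core)  # True
```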

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

The only jobs we have right now are from atlas. I don't believe they come from any of the usual OSG glidein factories. I've been out sick, so I've not been seeing what's happening day to day.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Good afternoon,

Checking in to see how the test jobs are progressing.  Please let me know if I can assist.

Thank you,
Vince

3GB per core is safe. Max time on our jobs is 4 hours. Maybe they set it in the factory as 3 just to be sure they have some margin for error.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Stephen,

Okay, I've got everything set up to temporarily submit 8-core glideins while we figure this out. Hopefully you'll see a fresh batch starting up soon.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

Also, Stephen, it seems we still have a low walltime set for glideins: 13800 s (about 3 h 50 min). Just double-checking whether this is still the case or whether we should increase it.

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada

Stephen:

How much memory per core is there on these 8-core machines?

Thanks,
Marian

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada

The OSG jobs mostly run on 8 core nodes, so you can set it to that in the meantime.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Stephen,

If this last attempt to restart the automated discovery of CPUS per node does not work, we'll probably have to go to glideinWMS developers for a temporary patch that fixes their default change. In that case, I'd probably recommend that we set an explicit fixed number of CPUS as a temporary measure while we wait for the patch (or some other suggestion). What is the number of CPUs on the largest set of nodes in the cluster? This is probably what we'd want to use.

Marty Kandes
UCSD Glidein Factory Operations

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

Marian,

Can you try GLIDEIN_CPUS = 0 today?

Marty Kandes
UCSD Glidein Factory Operations.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

Yes, we still are using the same osg queue where there's one job per node that's expected to use all the CPUs in the assigned node.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Stephen,

If so, I suspect this is a problem on our end. We upgraded the SDSC and GOC glidein factories to glideinWMS 3.2.16 on December 7th and 13th, respectively. After each upgrade, glideins at Hyak drop off pretty quickly. See here [1] [2].

In this update, glideinWMS changed the default of GLIDEIN_CPUS = 1 to GLIDEIN_CPUS = auto. We already had the 'auto' setting in the Hyak entry configuration prior to this update, which is how we were auto-detecting how many CPUs were available on the whole nodes you scheduled for us in the cluster. However, this new glideinWMS default caused havoc for other sites around the world, which were not expecting it. We had to perform a quick fix and override this default back to GLIDEIN_CPUS = 1 in the global factory configuration. My guess is that, for some reason, now that we globally set GLIDEIN_CPUS = 1, glideinWMS is not respecting the local GLIDEIN_CPUS = auto in Hyak's configuration.

We should have our weekly factory operations conference call tomorrow morning. I'll discuss it with the group, see what they think, and then continue investigating tomorrow. I'll let you know when I know more.

Thanks,

Marty Kandes
UCSD Glidein Factory Operations

[1]

http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=OSG_US_Hyak_osg&frontend=total&infoGroup=running&elements=StatusRunning,ClientGlideRunning,ClientGlideIdle,&rra=4&window_min=1477428834557.2354&window_max=1484668800000&timezone=-8

[2]

http://glidein.grid.iu.edu/factory/monitor/factoryStatus.html?entry=OSG_US_Hyak_osg&frontend=total&infoGroup=running&elements=StatusRunning,ClientGlideRunning,ClientGlideIdle,&rra=4&window_min=1480484616742.1594&window_max=1484668800000&timezone=-8

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

Stephen,

Does the "osg" queue in your cluster still schedule jobs by whole nodes? i.e., you're reserving us whole nodes, but we're only using 1 cpu per node?

Marty Kandes
UCSD Glidein Factory Operations

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Marty Kandes 3049

In reference to ticket 24577, assigning to Glidein ops to review.

I believe request 24577 is the request where we discussed the setup I referenced in my last e-mail.

On Fri, Jan 13, 2017 at 12:38 PM, Open Science Grid FootPrints
<osg@....> wrote:
> [Duplicate message snipped]

01/13/17 11:32:49 (pid:1738) Allocating auto shares for slot type 1:
Cpus: 1.000000, Memory: auto, Swap: auto, Disk: auto
slot type 1: Cpus: 1.000000, Memory: 32214, Swap: 100.00%, Disk: 100.00%

The four hour wallclock is specific to our site. We also had a special
configuration where the condor instance would request a number of
tasks equivalent to the number of CPUs in the host. This has been
lost. Can someone please fix it? Thanks,

Stephen

On Mon, Jan 9, 2017 at 7:32 AM, Open Science Grid FootPrints
<osg@....> wrote:
> [Duplicate message snipped]

I'm guessing, but MaxWallClock is only 240; 1440 is more typical. I don't know if this is the issue, but increasing this might get you more matches.
Also, there is this:
Invalid address: :9619
This appears in your output below and looks like a configuration issue.

Since we had a maintenance window back in mid-December, we haven't been getting any Glidein tasks. Jobs are submitted to our scheduling system and run, but they never or rarely pick up any tasks. I looked at the logs in /tmp and everything seems to be fine there (example below). If you have any suggestions or something just got flipped on your end, let me know. Thanks,

Stephen

01/05/17 10:12:12 (pid:14432) ******************************************************
01/05/17 10:12:12 (pid:14432) ** condor_startd (CONDOR_STARTD) STARTING UP
01/05/17 10:12:12 (pid:14432) ** /tmp/glide_mjQVFZ/main/condor/sbin/condor_startd
01/05/17 10:12:12 (pid:14432) ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
01/05/17 10:12:12 (pid:14432) ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
01/05/17 10:12:12 (pid:14432) ** $CondorVersion: 8.4.6 Apr 20 2016 BuildID: 364106 $
01/05/17 10:12:12 (pid:14432) ** $CondorPlatform: x86_64_RedHat6 $
01/05/17 10:12:12 (pid:14432) ** PID = 14432
01/05/17 10:12:12 (pid:14432) ** Log last touched time unavailable (No such file or directory)
01/05/17 10:12:12 (pid:14432) ******************************************************
01/05/17 10:12:12 (pid:14432) Using config source: /tmp/glide_mjQVFZ/condor_config
01/05/17 10:12:12 (pid:14432) config Macros = 323, Sorted = 323, StringBytes = 20905, TablesBytes = 11668
01/05/17 10:12:12 (pid:14432) CLASSAD_CACHING is ENABLED
01/05/17 10:12:12 (pid:14432) Daemon Log is logging: D_ALWAYS D_ERROR D_JOB
01/05/17 10:12:12 (pid:14432) Daemoncore: Listening at <0.0.0.0:55889> on TCP (ReliSock).
01/05/17 10:12:12 (pid:14432) DaemonCore: command socket at <10.2.10.110:55889?addrs=10.2.10.110-55889&noUDP>
01/05/17 10:12:12 (pid:14432) DaemonCore: private command socket at <10.2.10.110:55889?addrs=10.2.10.110-55889>
01/05/17 10:12:12 (pid:14432) Invalid address: :9619
01/05/17 10:12:13 (pid:14432) CCBListener: registered with CCB server osg-flock.grid.iu.edu:9757 as ccbid 129.79.53.179:9757?addrs=129.79.53.179-9757#1633380
01/05/17 10:12:13 (pid:14432) my_popenv failed
01/05/17 10:12:13 (pid:14432) Failed to run hibernation plugin '/tmp/glide_mjQVFZ/main/condor/libexec/condor_power_state ad'
01/05/17 10:12:13 (pid:14432) VM-gahp server reported an internal error
01/05/17 10:12:13 (pid:14432) VM universe will be tested to check if it is available
01/05/17 10:12:13 (pid:14432) History file rotation is enabled.
01/05/17 10:12:13 (pid:14432)   Maximum history file size is: 20971520 bytes
01/05/17 10:12:13 (pid:14432)   Number of rotated history files is: 2
01/05/17 10:12:13 (pid:14432) Allocating auto shares for slot type 1: Cpus: 1.000000, Memory: auto, Swap: auto, Disk: auto
slot type 1: Cpus: 1.000000, Memory: 24148, Swap: 100.00%, Disk: 100.00%
01/05/17 10:12:13 (pid:14432) New machine resource of type 1 allocated
01/05/17 10:12:13 (pid:14432) Setting up slot pairings
01/05/17 10:12:13 (pid:14432) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
01/05/17 10:12:13 (pid:14432) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
01/05/17 10:12:13 (pid:14432) my_popenv failed
01/05/17 10:12:13 (pid:14432) Adding 'GLIDEIN_PS_1' to the Supplimental ClassAd list
01/05/17 10:12:13 (pid:14432) CronJobList: Adding job 'GLIDEIN_PS_1'
01/05/17 10:12:13 (pid:14432) Adding 'GLIDEIN_PS_2' to the Supplimental ClassAd list
01/05/17 10:12:13 (pid:14432) CronJobList: Adding job 'GLIDEIN_PS_2'
01/05/17 10:12:13 (pid:14432) Adding 'GLIDEIN_PS_3' to the Supplimental ClassAd list
01/05/17 10:12:13 (pid:14432) CronJobList: Adding job 'GLIDEIN_PS_3'
01/05/17 10:12:13 (pid:14432) CronJob: Initializing job 'GLIDEIN_PS_1' (/tmp/glide_mjQVFZ/main/script_wrapper.sh)
01/05/17 10:12:13 (pid:14432) CronJob: Initializing job 'GLIDEIN_PS_2' (/tmp/glide_mjQVFZ/main/script_wrapper.sh)
01/05/17 10:12:13 (pid:14432) CronJob: Initializing job 'GLIDEIN_PS_3' (/tmp/glide_mjQVFZ/main/script_wrapper.sh)
01/05/17 10:12:13 (pid:14432) Adding 'mips' to the Supplimental ClassAd list
01/05/17 10:12:13 (pid:14432) CronJobList: Adding job 'mips'
01/05/17 10:12:13 (pid:14432) Adding 'kflops' to the Supplimental ClassAd list
01/05/17 10:12:13 (pid:14432) CronJobList: Adding job 'kflops'
01/05/17 10:12:13 (pid:14432) CronJob: Initializing job 'mips' (/tmp/glide_mjQVFZ/main/condor/libexec/condor_mips)
01/05/17 10:12:13 (pid:14432) CronJob: Initializing job 'kflops' (/tmp/glide_mjQVFZ/main/condor/libexec/condor_kflops)
01/05/17 10:12:13 (pid:14432) State change: IS_OWNER is false
01/05/17 10:12:13 (pid:14432) Changing state: Owner -> Unclaimed
01/05/17 10:12:13 (pid:14432) State change: RunBenchmarks is TRUE
01/05/17 10:12:13 (pid:14432) Changing activity: Idle -> Benchmarking
01/05/17 10:12:13 (pid:14432) BenchMgr:StartBenchmarks()
01/05/17 10:12:35 (pid:14432) State change: benchmarks completed
01/05/17 10:12:35 (pid:14432) Changing activity: Benchmarking -> Idle

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
