[34412] Work Quantity Reduction

Contact

Full Name

Stephen Fralich

Details

Resource Name

Hyak_CE

Associated VO

OSG

Submitted Via

GOC Ticket/submit

Submitter

Stephen Fralich

Support Center

UW-IT

Ticket Type

Problem/Request

Priority

Normal

Status

Closed

Next Action

ENG Action

Next Action Deadline

2017-08-15

Assignees

OSG Glidein Factory Support / OSG Support Centers

Software Support (Triage) / OSG Software Team

Edgar Fajardo / OSG Software Team

Brian Lin / OSG Software Team

Past Updates

All set. Thanks. You can close it.

Great! Hopefully that solves the mysterious issues, or at least makes them easier to troubleshoot in the future. Was there anything else, or can I close this ticket?

Thanks,
Brian

I did. The GridmanagerLog files in /var/log/condor-ce are getting
populated now and work has been steady. Thanks for the suggestions.

Stephen,

Did you get a chance to tackle this during your maintenance window?

- Brian

One thing I forgot to mention, you'll also want to turn off and disable
the condor service.

- Brian

Ok. I'll plan to do that during our maintenance window next Tuesday.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Stephen,

I think it'd be a good idea to get rid of osg-ce-condor (which should
remove htcondor-ce-condor and osg-htcondor-ce-condor) and replace it
with osg-ce-pbs. If things are working correctly, you can hold off on
making the change and schedule it for whenever you have time and we can
close this ticket.

- Brian

This is the original request where we switched from GRAM to Condor: https://ticket.opensciencegrid.org/26794

We did not pick up the htcondor-ce-condor package, though, until 7/2016, when I updated OSG to version 3.3. Based on the timing in the yum.log, I assume it came in as a dependency with that update. I can't imagine why I'd have installed it otherwise. I stated in an e-mail to our team that the upgrade went smoothly and there wasn't much of a disruption to the flow of work.

It ran fine as far as I can tell until 12/2016 when I reported issues with the glideins not picking up work (https://ticket.opensciencegrid.org/32263). This turned out to be an issue with changes made to glideinWMS. We had another good six months of running until this issue.

It seems to be operating correctly again for the moment. I realized on the morning of 7/27 that gratia was still trying to run. I cleared that up and disabled it. Coincidentally, I think, it's been working fine since then.

Let me know what you think.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

I appreciate your attention, Brian, but I need to focus on some things here for the remainder of this week. I'll dig through my archives and logs next week and try to answer your questions. It's entirely possible I misunderstood something at some point and our weird setup is entirely my fault, but it worked until recently, so no one noticed. We can also try removing the extra software next week if we don't find any evidence it's necessary.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Stephen,

You should only need osg-ce-condor if you're a condor shop. Since you
guys use torque, you actually want to be running 'osg-ce-pbs'. It seems
like right now you're unnecessarily running an extra condor service when
you only need to be running the htcondor-ce service.

Removing the packages you mentioned shouldn't have any ill effects BUT
I'm not entirely sure why you have this funky setup in the first place.
Do you remember who helped you set up your CE originally?

Thanks,
Brian

http://staff.washington.edu/sjf4/condor.tar.bz2

"yum remove htcondor-ce-condor" says it's required by osg-ce-condor and osg-htcondor-ce-condor. We started out using the globus CE stuff, but then were asked to switch to Condor. I assume I followed the instructions provided at that time or on the twiki instance. I don't recall offhand.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Also, any idea why you have htcondor-ce-condor installed? It shouldn't
be required for the torque setup you guys have.

Hrm, looking at your config it looks like your jobs are actually routed
to your local condor submit host. Do you have a condor service running
on that host? Could you attach the contents of /var/log/condor/?

- Brian

Yes, the GridmanagerLog* files do have those dates on the live system:
[root@globus1 condor-ce]# ls -l  /var/log/condor-ce/GridmanagerLog.*
-rw-r--r-- 1 condor 495  9728646 Jan  7  2017 /var/log/condor-ce/GridmanagerLog.osgatlas
-rw-r--r-- 1 condor 495  1770870 Jan  8  2017 /var/log/condor-ce/GridmanagerLog.osgfnalg
-rw-r--r-- 1 condor 495 10486024 Jan  8  2017 /var/log/condor-ce/GridmanagerLog.osgfnalg.old
-rw-r--r-- 1 condor 495  8263314 Apr  7  2016 /var/log/condor-ce/GridmanagerLog.osgglow
-rw-r--r-- 1 condor 495 10485987 Apr  6  2016 /var/log/condor-ce/GridmanagerLog.osgglow.old
-rw-r--r-- 1 condor 495   155219 Oct  6  2015 /var/log/condor-ce/GridmanagerLog.osgmis
-rw-r--r-- 1 condor 495  1951141 Jan  8  2017 /var/log/condor-ce/GridmanagerLog.osgosg
-rw-r--r-- 1 condor 495 10485933 Jan  8  2017 /var/log/condor-ce/GridmanagerLog.osgosg.old
-rw-r--r-- 1 condor 495     7570 Oct  6  2015 /var/log/condor-ce/GridmanagerLog.sjf4

Yes, we still use Torque and Moab. I've attached that output.

We do periodically get waves of work. One such wave started today at about 9:20am Pacific.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
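
The stale timestamps in the listing above are the kind of thing that can be checked mechanically. The following is only a minimal sketch, not part of any OSG or HTCondor-CE tooling: the function name `stale_logs`, the 7-day threshold, and the `GridmanagerLog*` glob are assumptions chosen for illustration. It reports matching files whose last modification time is older than the threshold.

```python
import time
from pathlib import Path


def stale_logs(directory, max_age_days=7, pattern="GridmanagerLog*", now=None):
    """Return (file name, age in days) for matching files older than max_age_days."""
    directory = Path(directory)
    if not directory.is_dir():
        # Nothing to report if the log directory does not exist on this host.
        return []
    now = time.time() if now is None else now
    stale = []
    for path in sorted(directory.glob(pattern)):
        age_days = (now - path.stat().st_mtime) / 86400
        if age_days > max_age_days:
            stale.append((path.name, round(age_days, 1)))
    return stale


# Directory taken from the listing above; prints nothing if no file is stale.
for name, age in stale_logs("/var/log/condor-ce"):
    print(f"{name}: last written {age} days ago")
```

On a CE that is actively handing jobs to the batch system, one would expect this to print nothing for the users who are currently submitting work.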

Some of these logs seem old; for instance, all of the GridmanagerLog* files have their most recent timestamps from January of this year. You're still using a PBS backend, correct? Could you provide the output of osg-system-profiler?

Thanks,
Brian

I cleared out a bunch of condor jobs from June on Friday that were still in the condor queue and that seemed to make it work properly again. Though that seems to have only helped through about Sunday at midnight when it died off.

I archived the directory and uploaded it to: http://staff.washington.edu/sjf4/condor-ce.tar.bz2

FYI: The expanded archive is > 1GB

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Hi Stephen,

Could you make the changes Brian suggested below and then attach the logs to this ticket?

"Could you set ALL_DEBUG = D_FULLDEBUG in /etc/condor-ce/config.d then attach the contents of /var/log/condor-ce?"

Thank you,
Vince

Stephen,

Could you set ALL_DEBUG = D_FULLDEBUG in /etc/condor-ce/config.d then attach the contents of /var/log/condor-ce?

Thanks,
Brian

I will say a recurring cause of issues is that our site differs from most (or all) other OSG sites in two ways:
1) jobs are allocated entire nodes and need to run 8, 12, or 16 tasks depending on the node configuration
2) The max walltime is only 4 hours

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Hi Mr Lin,

I am adding you as the CE expert. I see nothing obvious that might indicate the cause of the trouble.

Edgar
OSG Software Support

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020

http://staff.washington.edu/sjf4/SchedLog.txt.bz2

There you are.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Hmm, from the JobRouter I see that nothing is getting routed.

Could you also upload your:

/var/log/condor-ce/SchedLog

Thanks

Edgar
OSG Software Support

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020

Here is the requested information:

condor_ce_config_val JOB_ROUTER_ENTRIES | sed 's/;/;\n/g'
[ GridResource = "batch pbs";
TargetUniverse = 9;
name = "Setting batch system queues and walltime";
set_default_queue = "osg";
/* Set the max walltime to 4 hr */ set_default_maxWallTime = 250;
]

- Adam

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Adam Hough 4338
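
For readers comparing the route above against the site's stated 4-hour limit, here is a minimal sketch that pulls the simple `name = value;` pairs out of that output and converts the walltime. It assumes `default_maxWallTime` is expressed in minutes, which is my understanding of HTCondor-CE routes and is not stated in the ticket. The helper name `route_attributes` is hypothetical, and the regular expression only handles flat routes like this one, not general ClassAd syntax.

```python
import re

# Route text copied from the condor_ce_config_val output above.
ROUTE = '''[ GridResource = "batch pbs";
TargetUniverse = 9;
name = "Setting batch system queues and walltime";
set_default_queue = "osg";
/* Set the max walltime to 4 hr */ set_default_maxWallTime = 250;
]'''


def route_attributes(route_text):
    """Return {attribute: raw value} for simple `name = value;` pairs."""
    # Drop /* ... */ comments first so they are not mistaken for attributes.
    without_comments = re.sub(r"/\*.*?\*/", "", route_text, flags=re.S)
    pairs = re.findall(r"(\w+)\s*=\s*([^;]+);", without_comments)
    return {name: value.strip().strip('"') for name, value in pairs}


attrs = route_attributes(ROUTE)
minutes = int(attrs["set_default_maxWallTime"])  # assumption: unit is minutes
hours, remainder = divmod(minutes, 60)
print(f"route {attrs['name']!r}: queue={attrs['set_default_queue']}, "
      f"walltime={hours} h {remainder} min")
```

Under that assumption the configured value works out to 4 h 10 min, slightly more than the 4 hr named in the comment. That may well be a deliberate margin; the ticket does not say.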

Hi Stephen,

We need to look at your JobRouter config.

I believe this is most likely a problem with JobRouter limits.

So can you upload your:

/var/log/condor-ce/JobRouterLog

Can you run this in your CE:

condor_ce_config_val JOB_ROUTER_ENTRIES | sed 's/;/;\n/g'

Thanks,

Edgar
OSG Software Support

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020

Hi Stephen,

I'm adding Software Support. I think we need a condor CE expert to help you debug from your end, as there's not much else we can give you from the factory ops side.

Thanks,
Jeff Dost
OSG Glidein Factory Operations

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost

The behavior I described Friday continues. I tried updating to the latest packages in the 3.3 line and restarting as well, but we're still having the same problem.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

If I restart condor-ce, we get work submitted into our scheduler for about 30 minutes before lots of jobs start accumulating in the condor_ce_q with no apparent counterparts in our local scheduler. Restarting condor-ce causes jobs to start being submitted to our local scheduler again. I've been through this cycle 3 times this afternoon.

It seems like BLAHPD takes things from Condor and gives them to the local scheduler. I couldn't find logs associated with this daemon to see if things are going wrong there.

We'll see what it does over the weekend I guess.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

All the held jobs are back with the same hold reason, and nothing is running at the site. Stephen, any clues from the CE logs, please?

Thanks,
Marian

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada

Stephen, I assume then that nothing is limiting things on your CE. Let's see what the idle glideins do at your resource. After removing the held ones, I see new ones getting submitted. I'll check on that later.

-Marian

PS: By the way, to see the settings for each of your CE routes, you can use this command:
$ condor_ce_job_router_info -config

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada

Where would I look for that? It's not something I would have set and there's no one else that would have changed the configuration.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611

Hi,

there are many held glideins with this reason:

CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route in JOB_ROUTER_ENTRIES or route job limit.

I'm removing those from the factory to see how newly submitted ones behave. If Stephen can check in the meantime whether a job limit is set per JOB_ROUTER_ENTRY in their CE setup, that would be great.

Thanks,
Marian
(gWMS Factory Ops)

by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada
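
When many jobs are held, it can help to see how the hold reasons are distributed before removing anything. The sketch below simply counts identical reason strings. The function name `summarize_hold_reasons` is hypothetical, and the sample list is illustrative only: the first reason is the one quoted above, the second is made up, and the counts are not taken from the ticket. On a CE, the input lines could plausibly come from something like `condor_ce_q -held -af HoldReason`, but that invocation is an assumption to verify against the installed HTCondor-CE version.

```python
from collections import Counter


def summarize_hold_reasons(lines):
    """Count identical hold reasons, ignoring blank lines and outer whitespace."""
    reasons = (line.strip() for line in lines)
    return Counter(reason for reason in reasons if reason)


# Illustrative input only: one reason string per held job.
sample = [
    "CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to "
    "non-existent route in JOB_ROUTER_ENTRIES or route job limit.",
    "CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to "
    "non-existent route in JOB_ROUTER_ENTRIES or route job limit.",
    "",
    "Some other hold reason (made up for this example)",
]

# Most frequent reason first.
for reason, count in summarize_hold_reasons(sample).most_common():
    print(f"{count:4d}  {reason}")
```

A single dominant reason, as reported in this update, points at one configuration problem rather than several unrelated failures.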

Glidein Factory,

Could you look into why jobs aren't getting sent to this resource? My apologies for not getting it there sooner; I mistakenly thought it was already routed to the glidein factory support group.

-Kyle

The amount of work we receive started to falter around 6/3, then really fell after an unexpected outage on our end on 6/20. We still receive and complete some work, it seems, based on GRACC: https://gracc.opensciencegrid.org/dashboard/db/payload-jobs-summary?from=now-7d&to=now&orgId=1&var-VOName=All&var-Project=All&var-Facility=Hyak&var-User=All&var-ExitCode=All&var-Probe=All&var-interval=1d

If we were marked as bad, can you unmark us? Thanks.

by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
