All set. Thanks. You can close it.
Great! Hopefully that solves the mysterious issues or at least makes them easier to troubleshoot in the future. Was there anything else, or can I close this ticket?
Thanks,
Brian
I did. The GridmanagerLog files in /var/log/condor-ce are getting
populated now and work has been steady. Thanks for the suggestions.
On Fri, Aug 11, 2017 at 7:46 AM, Open Science Grid FootPrints
<osg@....> wrote:
> [Duplicate message snipped]
Stephen,
Did you get a chance to tackle this during your maintenance window?
- Brian
One thing I forgot to mention, you'll also want to turn off and disable
the condor service.
- Brian
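A minimal sketch of what stopping and disabling the condor service could look like. The helper name print_disable_cmds is hypothetical, and which init tooling applies (systemctl on a systemd host, service/chkconfig on a SysV init host) is an assumption about the CE host. The function only prints the commands so they can be reviewed before the maintenance window.

```shell
# Hypothetical helper: print (do not run) the commands that would stop a
# service and keep it from starting at boot.
print_disable_cmds() {
    svc="$1"
    if command -v systemctl >/dev/null 2>&1; then
        # systemd hosts
        printf 'systemctl stop %s\n' "$svc"
        printf 'systemctl disable %s\n' "$svc"
    else
        # SysV init hosts
        printf 'service %s stop\n' "$svc"
        printf 'chkconfig %s off\n' "$svc"
    fi
}

print_disable_cmds condor
```

The printed commands would be run as root. The condor-ce service is a separate service and is left untouched here.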
Ok. I'll plan to do that during our maintenance window next Tuesday.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
Stephen,
I think it'd be a good idea to get rid of osg-ce-condor (which should
remove htcondor-ce-condor and osg-htcondor-ce-condor) and replace it
with osg-ce-pbs. If things are working correctly, you can hold off on
making the change and schedule it for whenever you have time and we can
close this ticket.
- Brian
This is the original request where we switched from GRAM to Condor: https://ticket.opensciencegrid.org/26794
However, we did not pick up the htcondor-ce-condor package until 7/2016, when I updated OSG to version 3.3. Based on the timing in the yum.log, I assume it came in as a dependency with that update. I can't imagine why I'd install it either. I stated in an e-mail to our team that the upgrade went smoothly and there wasn't much of a disruption to the flow of work.
It ran fine as far as I can tell until 12/2016 when I reported issues with the glideins not picking up work (https://ticket.opensciencegrid.org/32263). This turned out to be an issue with changes made to glideinWMS. We had another good six months of running until this issue.
It seems to be operating correctly for the moment again. I realized gratia was still trying to run on the morning of 7/27. I cleared that up and disabled it. Coincidentally I think, it's been working fine since then.
Let me know what you think.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
I appreciate your attention, Brian, but I need to focus on some things here for the remainder of this week. I'll dig through my archives and logs next week and try to answer your questions. It's entirely possible I misunderstood something at some point and our weird setup is entirely my fault, but it worked until recently, so no one noticed. We can also try removing the extra software next week if we don't find any evidence it's necessary.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
Stephen,
You should only need osg-ce-condor if you're a condor shop. Since you
guys use torque, you actually want to be running 'osg-ce-pbs'. It seems
like right now you're unnecessarily running an extra condor service when
you only need to be running the htcondor-ce service.
Removing the packages you mentioned shouldn't have any ill effects BUT
I'm not entirely sure why you have this funky setup in the first place.
Do you remember who helped you set up your CE originally?
Thanks,
Brian
http://staff.washington.edu/sjf4/condor.tar.bz2
"yum remove htcondor-ce-condor" says it's required by osg-ce-condor and osg-htcondor-ce-condor. We started out using the globus CE stuff, but then were asked to switch to Condor. I assume I followed instructions provided at that time or on the twiki instance. I don't recall off-hand.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
Also, any idea why you have htcondor-ce-condor installed? It shouldn't
be required for the torque setup you guys have.
Hrm, looking at your config it looks like your jobs are actually routed
to your local condor submit host. Do you have a condor service running
on that host? Could you attach the contents of /var/log/condor/?
- Brian
Yes, the GridmanagerLog* files do have those dates on the live system:
[root@globus1 condor-ce]# ls -l /var/log/condor-ce/GridmanagerLog.*
-rw-r--r-- 1 condor 495 9728646 Jan 7 2017 /var/log/condor-ce/GridmanagerLog.osgatlas
-rw-r--r-- 1 condor 495 1770870 Jan 8 2017 /var/log/condor-ce/GridmanagerLog.osgfnalg
-rw-r--r-- 1 condor 495 10486024 Jan 8 2017 /var/log/condor-ce/GridmanagerLog.osgfnalg.old
-rw-r--r-- 1 condor 495 8263314 Apr 7 2016 /var/log/condor-ce/GridmanagerLog.osgglow
-rw-r--r-- 1 condor 495 10485987 Apr 6 2016 /var/log/condor-ce/GridmanagerLog.osgglow.old
-rw-r--r-- 1 condor 495 155219 Oct 6 2015 /var/log/condor-ce/GridmanagerLog.osgmis
-rw-r--r-- 1 condor 495 1951141 Jan 8 2017 /var/log/condor-ce/GridmanagerLog.osgosg
-rw-r--r-- 1 condor 495 10485933 Jan 8 2017 /var/log/condor-ce/GridmanagerLog.osgosg.old
-rw-r--r-- 1 condor 495 7570 Oct 6 2015 /var/log/condor-ce/GridmanagerLog.sjf4
Yes, we still use Torque and Moab. I've attached that output.
We do periodically get waves of work. One such wave started today at about 9:20am Pacific.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
Some of these logs seem old. For instance, all of the GridmanagerLog* files have their most recent timestamps from January of this year. You're still using a PBS backend, correct? Could you provide the output of osg-system-profiler?
Thanks,
Brian
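One way to spot stale logs like these without reading a full directory listing is sketched below. The helper name stale_gridmanager_logs and the 30-day threshold in the example are assumptions chosen for illustration. It relies on find with -maxdepth, which GNU and BSD find both provide.

```shell
# Hypothetical helper: list GridmanagerLog* files directly under a
# directory whose age, in whole 24-hour periods, is greater than N.
stale_gridmanager_logs() {
    log_dir="$1"
    max_age_days="$2"
    find "$log_dir" -maxdepth 1 -type f -name 'GridmanagerLog*' \
        -mtime +"$max_age_days" | sort
}

# Example: stale_gridmanager_logs /var/log/condor-ce 30
```

An empty result would suggest the gridmanager logs are being written regularly. Any file it prints has not been touched within the threshold.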
I cleared out a bunch of condor jobs from June on Friday that were still in the condor queue and that seemed to make it work properly again. Though that seems to have only helped through about Sunday at midnight when it died off.
I archived the directory and uploaded it to: http://staff.washington.edu/sjf4/condor-ce.tar.bz2
FYI: The expanded archive is > 1GB
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
Hi Stephen,
Could you make the changes Brian suggested below and then attach the logs to this ticket?
"Could you set ALL_DEBUG = D_FULLDEBUG in /etc/condor-ce/config.d then attach the contents of /var/log/condor-ce?"
Thank you,
Vince
Stephen,
Could you set ALL_DEBUG = D_FULLDEBUG in /etc/condor-ce/config.d then attach the contents of /var/log/condor-ce?
Thanks,
Brian
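A minimal sketch of the requested change, assuming the setting goes into its own file under the config.d directory. The helper name write_debug_config and the file name 99-debug.conf are hypothetical. Any file name that sorts after the existing files in that directory should have the same effect.

```shell
# Hypothetical helper: write a debug override into a config.d directory.
# The file name 99-debug.conf is an assumption, not something the ticket
# specifies.
write_debug_config() {
    config_dir="$1"
    printf 'ALL_DEBUG = D_FULLDEBUG\n' > "$config_dir/99-debug.conf"
}

# Example (as root): write_debug_config /etc/condor-ce/config.d
# followed by a reconfig or restart of the condor-ce service.
```

Full debug logging can grow the logs quickly, so removing the file once the logs have been collected is probably wise.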
I will say a recurring cause of issues is that our site is different from most (or all) other OSG sites in two ways:
1) jobs are allocated entire nodes and need to run 8, 12, or 16 tasks depending on the node configuration
2) The max walltime is only 4 hours
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
Hi Mr Lin,
I am adding you as the CE expert. I see nothing obvious that might indicate the cause of the trouble.
Edgar
OSG Software Support
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020
Hmm, from the JobRouter I see nothing is getting routed.
Could you also upload your:
/var/log/condor-ce/SchedLog
Thanks
Edgar
OSG Software Support
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020
Here is the requested information:
condor_ce_config_val JOB_ROUTER_ENTRIES | sed 's/;/;\n/g'
[ GridResource = "batch pbs";
TargetUniverse = 9;
name = "Setting batch system queues and walltime";
set_default_queue = "osg";
/* Set the max walltime to 4 hr */ set_default_maxWallTime = 250;
]
- Adam
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Adam Hough 4338
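A quick arithmetic check on the route above, assuming set_default_maxWallTime is expressed in minutes (the ticket does not state the unit). The helper name minutes_to_hm is hypothetical.

```shell
# Hypothetical helper: convert a walltime given in minutes to hours and
# minutes. Treating the route value as minutes is an assumption.
minutes_to_hm() {
    total="$1"
    printf '%dh%02dm\n' $((total / 60)) $((total % 60))
}

minutes_to_hm 250   # prints 4h10m
```

Under that assumption, 250 is slightly more than the 4 hours named in the route comment, and 240 would be exactly 4 hours. Whether the extra margin is intentional is not stated in the ticket.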
Hi Stephen,
We need to look at your JobRouter config.
I believe this is most likely a problem with JobRouter limits.
So can you upload your:
/var/log/condor-ce/JobRouterLog
Can you also run this on your CE:
condor_ce_config_val JOB_ROUTER_ENTRIES | sed 's/;/;\n/g'
Thanks,
Edgar
OSG Software Support
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Edgar Mauricio Fajardo Hernandez 2020
Hi Stephen,
I'm adding Software Support. I think we need a condor CE expert to help you debug from your end, as there's not much else we can give you from the factory ops side.
Thanks,
Jeff Dost
OSG Glidein Factory Operations
by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=jdost/CN=732648/CN=Jeffrey Michael Dost
The behavior I described Friday continues. I tried updating to the latest packages in the 3.3 line and restarting as well, but we're still having the same problem.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
If I restart condor-ce, we get work submitted into our scheduler for about 30 minutes before lots of jobs start accumulating in the condor_ce_q with no apparent counterparts in our local scheduler. Restarting condor-ce causes jobs to start being submitted to our local scheduler again. I've been through this cycle 3 times this afternoon.
It seems like BLAHPD takes things from Condor and gives them to the local scheduler. I couldn't find logs associated with this daemon to see if things are going wrong there.
We'll see what it does over the weekend I guess.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
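To watch the accumulation described above between restarts, a small tally of job states in the CE queue might help. This is a sketch only: the helper name tally_job_status is hypothetical, and the exact condor_ce_q invocation in the example is an assumption that may need adjusting for the installed version.

```shell
# Hypothetical helper: tally HTCondor JobStatus codes read from stdin,
# one code per line. Codes 1-5 follow the standard HTCondor numbering;
# any other code is ignored.
tally_job_status() {
    awk '
        BEGIN {
            name[1] = "Idle"; name[2] = "Running"; name[3] = "Removed"
            name[4] = "Completed"; name[5] = "Held"
        }
        NF { count[$1]++ }
        END {
            for (code = 1; code <= 5; code++)
                if (count[code])
                    printf "%s %d\n", name[code], count[code]
        }
    '
}

# Example: condor_ce_q -af JobStatus | tally_job_status
```

Comparing the Idle count here with the number of jobs the local scheduler reports would show whether jobs are piling up in the CE without being handed off.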
All the held glideins are back with the same hold reason, and nothing is running at the site. Stephen, any clues from the CE logs, please?
Thanks,
Marian
by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada
Stephen, I assume then that nothing is limiting things on your CE. Let's see what the idle glideins do at your resource. After removing the HELD ones, I see new ones getting submitted. I'll check on that later.
-Marian
PS: btw, to see the settings for each of your CE routes, you can use this command:
$ condor_ce_job_router_info -config
by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada
Where would I look for that? It's not something I would have set and there's no one else that would have changed the configuration.
by /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Stephen Fralich 2611
Hi,
there are many held glideins with reason:
CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route in JOB_ROUTER_ENTRIES or route job limit.
I'm removing those from the factory to see how newly submitted ones behave. In the meantime, if Stephen can check whether they are limiting the number of jobs per JOB_ROUTER_ENTRY in their CE setup, that would be great.
Thanks,
Marian
(gWMS Factory Ops)
by /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=zvada/CN=684832/CN=Marian Zvada
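If the same check is wanted on the CE side, hold reasons can be grouped and counted so that a dominant reason like the one quoted above stands out. This is a sketch: the helper name summarize_hold_reasons is hypothetical, and the condor_ce_q invocation in the example is an assumption about how the held jobs would be listed.

```shell
# Hypothetical helper: count identical hold reasons read from stdin, one
# reason per line, most frequent first.
summarize_hold_reasons() {
    sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Example:
#   condor_ce_q -constraint 'JobStatus == 5' -af HoldReason | summarize_hold_reasons
```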
Glidein Factory,
Could you look into why jobs aren't getting sent to this resource? My apologies for not getting it there sooner; I mistakenly thought it was already routed to the glidein factory support group.
-Kyle