GlideinWMS Factory Installation
This document describes how to install a Glidein Workflow Management System (GlideinWMS) Factory instance.
This document assumes expertise with HTCondor and familiarity with the GlideinWMS software. It does not cover anything but the simplest possible install. Please consult the GlideinWMS reference documentation for advanced topics, including non-root, non-RPM-based installation. In this document the terms glidein and pilot (job) will be used interchangeably.
This part covers these primary components of the GlideinWMS system:
- WMS Collector / Schedd: A set of condor_schedd processes that allow the submission of pilots to Grid entries.
- GlideinWMS Factory: The process submitting the pilots when needed.
We strongly recommend using the OSG-provided Factory instead of installing your own. A VO Frontend is sufficient to submit your jobs and to decide scheduling policies, and it spares you the complexity of dealing directly with grid/cloud sites. If you really need your own Factory, be aware that it is a complex component and may require a non-trivial maintenance effort.
Before starting the installation process, consider the following points (consulting the Reference section below as needed):
Host and OS
- A host to install the GlideinWMS Factory (pristine node).
- Currently most of our testing has been done on Scientific Linux 6 and 7.
- Root access
The GlideinWMS Factory has the following requirements:
- CPU: 4-8 cores for a large installation (1 should suffice on a small install)
- RAM: 4-8GB on a large installation (1GB should suffice for small installs)
- Disk: 10GB is more than sufficient for all the binaries, config and log files related to GlideinWMS. If you are a large site that needs to keep significant history and logs, you may want to allocate 100GB+.
The GlideinWMS Factory installation will create the following users unless they are already created.
|condor||none||HTCondor user (installed via dependencies).|
|gfactory||none||This user runs the GlideinWMS VO factory.|
To verify that the user gfactory has gfactory as its primary group, check the output of
root@host # getent passwd gfactory | cut -d: -f4 | xargs getent group
It should be the gfactory group.
|Certificate||User that owns certificate||Path to certificate|
|Host certificate||root||/etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem|
Here are instructions to request a host certificate.
The host certificate/key is used for authorization; however, authorization between the Factory and the GlideinWMS collector is done by file system authentication.
The Factory host must be on the public internet, with at least one port open to the world; all worker nodes will load data from this node through HTTP. Note that worker nodes will also need outbound access in order to reach this HTTP port.
As with all OSG software installations, there are some one-time (per host) steps to prepare in advance:
- Ensure the host has a supported operating system
- Obtain root access to the host
- Prepare the required Yum repositories
- Install CA certificates
Most required software is installed from the Factory RPM installation. HTCondor is the only exception since there are many different ways to install it, using the RPM system or not. You need to have HTCondor installed before installing the GlideinWMS Factory. If yum cannot find a HTCondor RPM, it will install the dummy empty-condor RPM, assuming that you installed HTCondor using a tarball distribution.
- If you don't have HTCondor already installed, you can install the HTCondor RPM from the OSG repository:
root@host # yum install condor.x86_64
If you plan to send jobs using direct batch submission (aka BOSCO), then you also need the condor-bosco package. Install the package, then replace one of its files,
/etc/condor/config.d/60-campus_factory.config, with an empty file, because it interferes with the Factory configuration.
root@host # yum install condor-bosco
root@host # rm /etc/condor/config.d/60-campus_factory.config
root@host # touch /etc/condor/config.d/60-campus_factory.config
Install GWMS Factory
Download and install the Factory RPM
Install the RPM and dependencies (be prepared for a lot of dependencies).
root@host # yum install glideinwms-factory
This will install the current production release verified and tested by OSG with default HTCondor configuration. This command will install the GlideinWMS Factory, HTCondor, the OSG client, and all the required dependencies.
If you wish to install a different version of GlideinWMS, add the "--enablerepo" argument to the command as follows:
yum install --enablerepo=osg-testing glideinwms-factory: The most recent production release, still in testing phase. This will usually match the current tarball version on the GlideinWMS home page. (The osg-release production version may lag behind the tarball release by a few weeks as it is verified and packaged by OSG). Note that this will also take the osg-testing versions of all dependencies as well.
yum install --enablerepo=osg-upcoming glideinwms-factory: The most recent development series release, i.e. a 3.3.x release. This has newer features such as cloud submission support, but is less tested.
Download HTCondor tarballs
You will need to download HTCondor tarballs for each architecture that you want to deploy pilots on.
At this point, GlideinWMS factory does not support pulling HTCondor binaries from your system area.
We suggest putting these binaries in
/var/lib/gwms-factory/condor, but any location accessible to the gfactory user should suffice.
After installing the RPM you need to configure the components of the GlideinWMS Factory:
- Edit Factory configuration options
- Edit HTCondor configuration options
- Create a HTCondor grid map file
- Reconfigure and Start Factory
Configuring the Factory
The configuration file is
/etc/gwms-factory/glideinWMS.xml. The next steps will describe each line that you will need to edit for most cases, but you may want to review the whole file to be sure that it is configured correctly.
In the security section, you will need to provide each Frontend that is allowed to communicate with the Factory:
<security key_length="2048" pub_key="RSA" remove_old_cred_age="30" remove_old_cred_freq="24" reuse_oldkey_onstartup_gracetime="900">
  <frontends>
    <frontend identity="vofrontend_service@FACTORY_COLLECTOR_HOSTNAME" name="vofrontend_sec_name">
      <security_classes>
        <security_class name="frontend_sec_class" username="frontend" />
      </security_classes>
    </frontend>
  </frontends>
</security>
These attributes must be exactly right, or the Frontend will not be trusted. They should match the factory and security sections of the Frontend configuration (see Configuring the GlideinWMS Frontend) in the following way:
This is a snippet from the Frontend configuration (for reference), not the Factory that you are configuring now!
For the factory section:
# from frontend.xml
<factory query_expr='((stringListMember("VO", GLIDEIN_Supported_VOs)))'>
  ....
  <collectors>
    <collector DN="/DC=org/DC=doegrids/OU=Services/CN=FACTORY_COLLECTOR_HOSTNAME"
               comment="Define factory collector globally for simplicity"
               factory_identity="gfactory@FACTORY_COLLECTOR_HOSTNAME"
               my_identity="username@FACTORY_COLLECTOR_HOSTNAME"
               node="FACTORY_COLLECTOR_HOSTNAME"/>
  </collectors>
</factory>
For the security:
# from frontend.xml
<security classad_proxy="/tmp/vo_proxy"
          proxy_DN="DN of vo_proxy"
          proxy_selection_plugin="ProxyAll"
          security_name="The security name, this is used by factory"
          sym_key="aes_256_cbc">
  <credentials>
    <credential absfname="/tmp/pilot_proxy" security_class="frontend" trust_domain="OSG" type="grid_proxy"/>
  </credentials>
</security>
Note that the identity of the Frontend must match the identity to which HTCondor maps the Frontend's DN. In
/etc/condor/certs/condor_mapfile, there must be an entry that maps the DN to vofrontend_service (in this case):
GSI "^\/DC\=org\/DC\=doegrids\/OU\=Services\/CN\=Some\ Name\ 834323$" vofrontend_service
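The relationship between the Factory security section and the Frontend can be sanity-checked with a short script. The following is a minimal sketch, not part of GlideinWMS: the helper names, the sample XML and the host name factory.example.org are assumptions made for illustration. It lists each trusted Frontend and checks that a given security_name and HTCondor-mapped identity line up with one of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical Factory <security> section, modelled on the example in this section.
FACTORY_SECURITY = """
<security key_length="2048" pub_key="RSA">
  <frontends>
    <frontend identity="vofrontend_service@factory.example.org" name="vofrontend_sec_name">
      <security_classes>
        <security_class name="frontend_sec_class" username="frontend"/>
      </security_classes>
    </frontend>
  </frontends>
</security>
"""


def trusted_frontends(security_xml):
    """Return {frontend name: (identity, {security class: username})}."""
    root = ET.fromstring(security_xml)
    result = {}
    for frontend in root.iter("frontend"):
        classes = {sc.get("name"): sc.get("username")
                   for sc in frontend.iter("security_class")}
        result[frontend.get("name")] = (frontend.get("identity"), classes)
    return result


def is_trusted(security_xml, security_name, authenticated_identity):
    """True if the Frontend's security_name is listed and its identity matches."""
    entry = trusted_frontends(security_xml).get(security_name)
    return entry is not None and entry[0] == authenticated_identity


if __name__ == "__main__":
    for name, (identity, classes) in trusted_frontends(FACTORY_SECURITY).items():
        print(name, identity, classes)
```

To check a real installation, the same functions could be fed the security element extracted from /etc/gwms-factory/glideinWMS.xml.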
Entries are grid/cloud endpoints (aka Compute Elements, or gatekeepers) that can accept job requests and run pilots (which will run user jobs). Each entry needs to be configured to communicate to a specific gatekeeper.
An example test entry is provided in the default GlideinWMS configuration file. At the very least, you will need to modify the entry line:
<entry name="ENTRY_NAME"
       enabled="True"
       auth_method="grid_proxy"
       trust_domain="OSG"
       gatekeeper="gatekeeper.domain.tld/jobmanager-type"
       gridtype="gt2"
       rsl="(queue=default)(jobtype=single)"
       schedd_name="SCHEDD_NAME@FACTORY_HOSTNAME"
       verbosity="std"
       work_dir="OSG">
You will need to modify the entry name and gatekeeper. This will determine the gatekeeper that you access. Specific gatekeepers often require specific "rsl" attributes that determine the job queue that you are submitting to, or other attributes. Add them in the rsl attribute.
Also, be sure to distribute your entries across the various HTCondor schedd work managers to balance load. To see the available schedds, use
condor_status -schedd -l | grep Name.
Several schedd options are configured by default for you: schedd_glideins2, schedd_glideins3, schedd_glideins4, schedd_glideins5, as well as the default schedd. This can be modified in the HTCondor configuration. Add any specific options, such as limitations on jobs/pilots or glexec/voms requirements in the entry section below the above line. More details are in the GlideinWMS Factory configuration guide.
!!! warning If there is no match between auth_method and trust_domain of the entry and the type and trust_domain listed in one of the credentials of one of the Frontends using this Factory, then no job can run on that entry.
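The matching rule in this warning can be written down as a small check. This is a simplified sketch under the assumption that matching is plain string equality on the two pairs of attributes named above; the function names are hypothetical, and the real Factory/Frontend logic may apply further rules.

```python
def entry_accepts_credential(entry, credential):
    """Compare the entry's auth_method/trust_domain with the credential's type/trust_domain."""
    return (entry["auth_method"] == credential["type"]
            and entry["trust_domain"] == credential["trust_domain"])


def usable_credentials(entry, credentials):
    """Return the Frontend credentials that could be used on this entry."""
    return [c for c in credentials if entry_accepts_credential(entry, c)]


# Values taken from the entry and credential examples in this document.
ENTRY = {"name": "ENTRY_NAME", "auth_method": "grid_proxy", "trust_domain": "OSG"}
CREDENTIALS = [
    {"absfname": "/tmp/pilot_proxy", "type": "grid_proxy", "trust_domain": "OSG"},
    # Hypothetical second credential with a different trust domain.
    {"absfname": "/tmp/other_proxy", "type": "grid_proxy", "trust_domain": "OtherDomain"},
]

if __name__ == "__main__":
    matches = usable_credentials(ENTRY, CREDENTIALS)
    if not matches:
        print("No credential matches: no job can run on", ENTRY["name"])
    for cred in matches:
        print("usable:", cred["absfname"])
```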
The Factory must advertise the correct Resource Name of each entry for accounting purposes. The Factory must also advertise in the entry all the attributes that allow it to match the query expression used in the Frontends connecting to this Factory (e.g.
<factory query_expr='((stringListMember("VO", GLIDEIN_Supported_VOs)))'> as explained in the VO frontend configuration document).
Keep an eye on this part, as we are dealing with Singularity. You must also correctly advertise whether the site supports gLExec: if it does not, set GLEXEC_BIN to NONE; if gLExec is installed via OSG, set it to OSG; otherwise, set it to the path of gLExec.
For example, this snippet advertises the GLIDEIN_Supported_VOs attribute with the supported VO, so that it can be used with the query above in the VO frontend, and says that the resource does not support gLExec:
<entry name="RESOURCE_NAME" ...
  <config> ...
    <attrs> ...
      <attr name="GLIDEIN_Supported_VOs" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="VO"/>
      <attr name="GLEXEC_BIN" const="True" glidein_publish="False" job_publish="False" parameter="True" publish="True" type="string" value="NONE"/>
      <attr name="GLIDEIN_Resource_Name" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="string" value="SiteNameFromOIM"/>
    </attrs>
Especially if jobs are sent to OSG resources, it is very important to set GLIDEIN_Resource_Name consistently with the Resource Name reported in OIM, because that name will be used for job accounting in Gratia. It should be the name of the Resource in OIM or the name of the Resource Group (especially if there are many gatekeepers submitting to the same cluster).
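A quick way to catch an entry that forgets one of these attributes is to parse the entry and look for them. The sketch below is illustrative only: the helper names and the trimmed sample entry are assumptions, and supports_vo is a rough Python stand-in for the stringListMember query expression (assuming comma or whitespace separated values), not HTCondor's own evaluation.

```python
import re
import xml.etree.ElementTree as ET

REQUIRED_ATTRS = ("GLIDEIN_Supported_VOs", "GLEXEC_BIN", "GLIDEIN_Resource_Name")

# Hypothetical, trimmed entry; real entries carry many more attributes.
ENTRY_XML = """
<entry name="RESOURCE_NAME">
  <attrs>
    <attr name="GLIDEIN_Supported_VOs" type="string" value="VO"/>
    <attr name="GLEXEC_BIN" type="string" value="NONE"/>
    <attr name="GLIDEIN_Resource_Name" type="string" value="SiteNameFromOIM"/>
  </attrs>
</entry>
"""


def entry_attrs(entry_xml):
    """Return {attribute name: value} for every <attr> under the entry."""
    entry = ET.fromstring(entry_xml)
    return {a.get("name"): a.get("value", "") for a in entry.iter("attr")}


def missing_attrs(entry_xml, required=REQUIRED_ATTRS):
    """List required attributes that are absent or have an empty value."""
    attrs = entry_attrs(entry_xml)
    return [name for name in required if not attrs.get(name)]


def supports_vo(entry_xml, vo):
    """Rough stand-in for stringListMember(vo, GLIDEIN_Supported_VOs)."""
    supported = entry_attrs(entry_xml).get("GLIDEIN_Supported_VOs", "")
    return vo in [item for item in re.split(r"[,\s]+", supported) if item]


if __name__ == "__main__":
    print("missing:", missing_attrs(ENTRY_XML))
    print("supports VO:", supports_vo(ENTRY_XML, "VO"))
```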
More information on options can be found here
Each pilot will download HTCondor binaries from the staging area. Often, multiple binaries are needed to support various architectures and platforms. Currently, you will need to provide at least one tarball for GlideinWMS to use. (Using the system binaries is currently not supported).
Download a HTCondor tarball from here. We suggest putting the binaries in
/var/lib/gwms-factory/condor, but any location accessible to the Factory will do.
Once you have downloaded the tarball, configure it in
/etc/gwms-factory/glideinWMS.xml as follows:
<condor_tarball arch="default" base_dir="/var/lib/gwms-factory/condor/condor-8.7.6-x86_64_RedHat6-stripped" os="rhel6" version="default"/>
Remember also to modify the
condor_arch attributes in the entries (the configured Compute Elements) to pick the correct HTCondor binary.
Here are more details on using multiple HTCondor binaries. Note that it is sufficient to set the
base_dir; the reconfigure command will prepare the tarball and add it to the XML config file.
The HTCondor configuration for the Factory is placed in
Get rid of the pre-loaded HTCondor default
root@host # rm /etc/condor/config.d/00personal_condor.config
root@host # touch /etc/condor/config.d/00personal_condor.config
For most installations, the items you need to modify are in
03_gwms_factory_local.config. The lines you will have to edit are:
- Credentials of the machine. You can run using either a proxy or a service certificate. It is recommended to use a host certificate and specify its location in the variables
GSI_DAEMON_CERT and GSI_DAEMON_KEY. The host certificate should be owned by root and have the correct permissions, 600.
- HTCondor ids in the form UID.GID (both are integers)
- HTCondor admin email. Will receive messages when services fail.
#-- HTCondor user: condor
CONDOR_IDS =
#-- Contact (via email) when problems occur
CONDOR_ADMIN =
############################
# GSI Security config
############################
#-- Grid Certificate directory
GSI_DAEMON_TRUSTED_CA_DIR= /etc/grid-security/certificates
#-- Credentials
GSI_DAEMON_CERT = /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY = /etc/grid-security/hostkey.pem
#-- HTCondor mapfile
CERTIFICATE_MAPFILE= /etc/condor/certs/condor_mapfile
###################################
# Whitelist of HTCondor daemon DNs
###################################
#DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
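The two values that are easiest to get wrong here are the UID.GID pair for CONDOR_IDS and the permissions of the key file. The following sketch shows one way to derive and check them; the function names are hypothetical, and the 600 mode is simply the value stated in the list above.

```python
import os
import pwd
import stat


def condor_ids(user="condor", lookup=pwd.getpwnam):
    """Return the UID.GID string expected by CONDOR_IDS for the given user.

    `lookup` is injectable so the function can be tried without a condor account.
    """
    entry = lookup(user)
    return f"{entry.pw_uid}.{entry.pw_gid}"


def mode_is(st_mode, expected=0o600):
    """True if the permission bits of an st_mode value equal `expected`."""
    return stat.S_IMODE(st_mode) == expected


def file_has_mode(path, expected=0o600):
    """True if the file at `path` has exactly the `expected` permission bits."""
    return mode_is(os.stat(path).st_mode, expected)


if __name__ == "__main__":
    # Both calls need a real Factory host: a condor user and a host key must exist.
    print("CONDOR_IDS =", condor_ids())
    print("key mode ok:", file_has_mode("/etc/grid-security/hostkey.pem"))
```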
Using other HTCondor RPMs, e.g. UW Madison HTCondor RPM
The above procedure will work if you are using the OSG HTCondor RPMS. You can verify that you used the OSG HTCondor RPM by using
yum list condor. The version name should include "osg", e.g.
If you are using the UW Madison HTCondor RPMS, be aware of the following changes:
- This HTCondor RPM uses a file
/etc/condor/condor_config.local to add your local machine slot to the user pool.
- If you want to disable this behavior (recommended), you should blank out that file or comment out the line in
/etc/condor/condor_config for LOCAL_CONFIG_FILE. (Make sure that LOCAL_CONFIG_DIR is set to
- Note that the variable LOCAL_DIR is set differently in UW Madison and OSG RPMs. This should not cause any more problems in the GlideinWMS RPMs, but please take note if you use this variable in your job submissions or other customizations.
In general, if you are using a non-OSG RPM, or if you added custom configuration files for HTCondor, please check the order of the configuration files:
root@host # condor_config_val -config
Configuration source:
    /etc/condor/condor_config
Local configuration sources:
    /etc/condor/config.d/00-restart_peaceful.config
    /etc/condor/config.d/00_gwms_factory_general.config
    /etc/condor/config.d/01_gwms_factory_collectors.config
    /etc/condor/config.d/02_gwms_factory_schedds.config
    /etc/condor/config.d/03_gwms_local.config
    /etc/condor/config.d/10-batch_gahp_blahp.config
    /etc/condor/condor_config.local
After configuring HTCondor, be sure to restart HTCondor:
root@host # service condor restart
Create a HTCondor grid mapfile.
The HTCondor grid mapfile
/etc/condor/certs/condor_mapfile is used for authentication between the glidein running on a remote worker node and the local collector. HTCondor uses the mapfile to map certificates to pseudo-users on the local machine. It is important that you map the DNs of each frontend you are talking to.
Below is an example mapfile, by default found in
GSI "^\/DC\=org\/DC\=doegrids\/OU\=People\/CN\=Some\ Name\ 123456$" frontend
GSI (.*) anonymous
FS (.*) \1
Each frontend needs a line that maps to the user specified in the identity argument in the frontend security section of the Factory configuration.
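To see which pseudo-user a given DN would map to, the mapfile can be emulated offline. This is a rough sketch and not HTCondor's implementation: it assumes that rules are tried top to bottom with the first match winning, and it uses Python's re module in place of HTCondor's own regular-expression engine, which may differ on unusual patterns. The function names are hypothetical.

```python
import re

# The example mapfile from this section.
MAPFILE = r'''
GSI "^\/DC\=org\/DC\=doegrids\/OU\=People\/CN\=Some\ Name\ 123456$" frontend
GSI (.*) anonymous
FS (.*) \1
'''

# method, then a quoted or bare pattern, then the canonical user name.
RULE = re.compile(r'^(\S+)\s+(?:"(.*)"|(\S+))\s+(\S+)$')


def parse_mapfile(text):
    """Return a list of (method, pattern, user) tuples, skipping blanks and comments."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match:
            method, quoted, bare, user = match.groups()
            rules.append((method, quoted if quoted is not None else bare, user))
    return rules


def map_identity(method, principal, rules):
    """Return the user for the first rule that matches, or None if nothing matches."""
    for rule_method, pattern, user in rules:
        if rule_method != method:
            continue
        match = re.search(pattern, principal)
        if match:
            return match.expand(user)  # substitutes \1 with the captured group
    return None


if __name__ == "__main__":
    rules = parse_mapfile(MAPFILE)
    dn = "/DC=org/DC=doegrids/OU=People/CN=Some Name 123456"
    print(map_identity("GSI", dn, rules))
    print(map_identity("FS", "gfactory", rules))
```

A Frontend DN that falls through to the anonymous rule is a sign that its dedicated line is missing or that its escaping does not match the DN exactly.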
After changing the configuration of GlideinWMS and making sure that the Factory is running, use the following table to find the appropriate command for your operating system (run as root):
|If your operating system is...||Run the following command...|
|Enterprise Linux 7||systemctl reload gwms-factory|
|Enterprise Linux 6||service gwms-factory reconfig|
Notice that, in the case of Enterprise Linux 7
systemctl reload gwms-factory will work only if:
- gwms-factory service is running
- gwms-factory service was started with systemctl
Otherwise, in either case you will get the following error:
# systemctl reload gwms-factory
Job for gwms-factory.service invalid.
Before you start the Factory service for the first time or after an update of the RPM or after you change GlideinWMS scripts, you should always use the GlideinWMS "upgrade" command. To do so:
Make sure the gwms-factory services are stopped (in EL6 this will be done for you).
Issue the upgrade command:
If you are using Enterprise Linux 7:
root@host # /usr/sbin/gwms-factory upgrade
If you are using Enterprise Linux 6:
root@host # service gwms-factory upgrade
Start the gwms-factory services (see next part).
Service Activation and Deactivation
To start the Factory, you must also start HTCondor and the web server in addition to the Factory itself:
# For RHEL 6, CentOS 6, and SL6
root@host # service condor start
root@host # service httpd start
root@host # service gwms-factory start
# For RHEL 7, CentOS 7, and SL7
root@host # systemctl start condor
root@host # systemctl start httpd
root@host # systemctl start gwms-factory
Once you have successfully started the Factory service, any time you change the
/etc/gwms-factory/glideinWMS.xml file you will need to run a reconfig/reload command. If you also change code, you need the upgrade command mentioned above:
# For RHEL 6, CentOS 6, and SL6
root@host # service gwms-factory reconfig
# The situation is a bit more complicated in RHEL 7, CentOS 7, and SL7 due to systemd restrictions
# For reconfig:
# A. when the Factory is running
# A.1 without any additional options
root@host # /usr/sbin/gwms-factory reconfig
# or
root@host # systemctl reload gwms-factory
# A.2 if you want to give additional options
systemctl stop gwms-factory
/usr/sbin/gwms-factory reconfig "and your options"
systemctl start gwms-factory
# B. when the Factory is NOT running
root@host # /usr/sbin/gwms-factory reconfig ("and your options")
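The cases above form a small decision table, which can be captured in a helper that returns the commands to run. This is only a sketch of the rules as listed here: the function name is hypothetical, it does not execute anything, and it deliberately refuses the EL6-with-options case because this guide does not describe it.

```python
def reconfig_commands(el_major, factory_running, options=""):
    """Return the shell commands (to be run as root) for a Factory reconfig."""
    if el_major == 6:
        if options:
            raise ValueError("reconfig options on EL6 are not covered by this guide")
        return ["service gwms-factory reconfig"]
    if el_major != 7:
        raise ValueError("only Enterprise Linux 6 and 7 are described")

    reconfig = "/usr/sbin/gwms-factory reconfig"
    if options:
        reconfig += " " + options

    if not factory_running:
        # Case B: the Factory is not running.
        return [reconfig]
    if not options:
        # Case A.1: works only if the service was started with systemctl.
        return ["systemctl reload gwms-factory"]
    # Case A.2: stop, reconfigure with options, start again.
    return ["systemctl stop gwms-factory", reconfig, "systemctl start gwms-factory"]


if __name__ == "__main__":
    for command in reconfig_commands(7, factory_running=True, options="YOUR_OPTIONS"):
        print(command)
```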
To enable the services so that they restart after a reboot:
# For RHEL 6, CentOS 6, and SL6
root@host # /sbin/chkconfig fetch-crl-cron on
root@host # /sbin/chkconfig fetch-crl-boot on
root@host # /sbin/chkconfig condor on
root@host # /sbin/chkconfig httpd on
root@host # /sbin/chkconfig gwms-factory on
# For RHEL 7, CentOS 7, and SL7
root@host # systemctl enable fetch-crl-cron
root@host # systemctl enable fetch-crl-boot
root@host # systemctl enable condor
root@host # systemctl enable httpd
root@host # systemctl enable gwms-factory
To stop the Factory:
# For RHEL 6, CentOS 6, and SL6
root@host # service gwms-factory stop
# For RHEL 7, CentOS 7, and SL7
root@host # systemctl stop gwms-factory
You can also stop the other services if you are not using them independently of the Factory.
Validating GlideinWMS Factory
The complete validation of the Factory is the submission of actual jobs.
You can also check that the services are up and running:
root@host # condor_status -any
MyType TargetType Name
glidefactoryclient None [email protected][email protected]
glideclient None [email protected][email protected]
glidefactory None [email protected][email protected]
glidefactoryglobal None [email protected]_ser
glideclientglobal None [email protected]_ser
Scheduler None hostname.fnal.gov
DaemonMaster None hostname.fnal.gov
Negotiator None hostname.fnal.gov
Scheduler None [email protected]
Scheduler None [email protected]
Scheduler None [email protected]
Scheduler None [email protected]
Collector None [email protected]
You should have one "glidefactory" classAd for each entry that you have enabled. If you have already configured the frontends, you will also have one glidefactoryclient and one glideclient classAd for each frontend / entry.
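The counting rule above can be automated by tallying the first column of the condor_status -any output. The sketch below assumes the three-column layout shown in this section; the function names and the ad names in the sample are made up for illustration.

```python
from collections import Counter

# Hypothetical output with two enabled entries; the ad names are illustrative.
SAMPLE_OUTPUT = """\
MyType             TargetType  Name
glidefactory       None        ENTRY_A@instance@factory
glidefactory       None        ENTRY_B@instance@factory
glidefactoryglobal None        instance@factory
Scheduler          None        factory.example.org
Collector          None        collector@factory.example.org
"""


def count_ads(condor_status_output):
    """Count classAds per MyType, skipping the header and any blank lines."""
    counts = Counter()
    for line in condor_status_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] != "MyType":
            counts[fields[0]] += 1
    return counts


def factory_ads_complete(condor_status_output, enabled_entries):
    """True if there is one glidefactory classAd per enabled entry."""
    return count_ads(condor_status_output)["glidefactory"] == enabled_entries


if __name__ == "__main__":
    print(dict(count_ads(SAMPLE_OUTPUT)))
    print("all entries advertised:", factory_ads_complete(SAMPLE_OUTPUT, 2))
```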
You can also check the monitoring web page:
You can also test the local submission of a job to a resource using the test script local_start.sh but you must first install the OSG client tools and generate a proxy. After that you can run the test (replace ENTRY_NAME with the name of one of the entries in
Check Web server configuration for the monitoring
Verify the path and especially the URL for the GlideinWMS files served by your web server:
<stage base_dir="/var/lib/gwms-factory/web-area/stage" use_symlink="True" web_base_url="http://HOSTNAME:PORT/factory/stage"/>
This determines the location of your web server. Make sure that the URL is reachable from outside. Depending on your firewall or that of your organization, you may need to change the port here and in the httpd configuration (by modifying the "Listen" directive in
/etc/httpd/conf/httpd.conf). Note that web servers are an often-attacked piece of infrastructure, so you may want to go through the Apache configuration in
/etc/httpd/conf/httpd.conf and disable unneeded modules.
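A mismatch between the port in web_base_url and the ports Apache listens on is easy to overlook. The following sketch compares the two from plain text; the function names, host name and port number are assumptions for illustration, and only simple Listen lines in a single file are handled (no included files).

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def url_port(web_base_url):
    """Return the port of the URL, falling back to the scheme's default."""
    parts = urlsplit(web_base_url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


def listen_ports(httpd_conf_text):
    """Collect ports from Listen directives, ignoring commented-out lines."""
    ports = set()
    for line in httpd_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].lower() == "listen":
            # Handles "Listen 80", "Listen 192.0.2.1:80" and "Listen [::1]:80".
            ports.add(int(fields[1].rsplit(":", 1)[-1]))
    return ports


def url_is_served(web_base_url, httpd_conf_text):
    """True if Apache listens on the port used in web_base_url."""
    return url_port(web_base_url) in listen_ports(httpd_conf_text)


if __name__ == "__main__":
    conf = "#Listen 12.34.56.78:80\nListen 8319\n"
    print(url_is_served("http://factory.example.org:8319/factory/stage", conf))
```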
Troubleshooting GlideinWMS Factory
|File Description||File Location||Comment|
|Configuration file||/etc/gwms-factory/glideinWMS.xml||Main configuration file|
|Logs||/var/log/gwms-factory/server/factory||Overall server logs|
|/var/log/gwms-factory/server/entry_NAME||Specific entry logs (generally more useful)|
|/var/log/gwms-factory/client||Glidein Pilot logs separated by user and entry|
Increase the log level and change rotation policies
You can increase the log level of the Factory. To add a log file with all the log information, add the following line, listing all the message types, in the
process_log section of
<log_retention>
  <process_logs>
    <process_log extension="all" max_days="7.0" max_mbytes="100.0" min_days="3.0" msg_types="DEBUG,EXCEPTION,INFO,ERROR,ERR"/>
You can also change the rotation policy and choose whether to compress the rotated files, all in the same section of the config file:
- max_mbytes is the maximum size of a log file
- max_days is the number of days after which a log file is rotated
- compression specifies whether rotated files are compressed
- backup_count is the number of rotated log files kept
Further details are in the reference documentation.
Failed authentication errors
If you get messages such as these in the logs, the Factory does not trust the frontend and will not submit glideins.
WARNING: Client fermicloud128-fnal-gov_OSG_gWMSFrontend.main (secid: frontend_name) not in white list. Skipping request
This error means that the frontend name in the security section of the Factory does not match the
security_name in the frontend.
Client fermicloud128-fnal-gov_OSG_gWMSFrontend.main (secid: frontend_name) is not coming from a trusted source; AuthenticatedIdentity [email protected][email protected] Skipping for security reasons.
This error means that the identity in the security section of the Factory does not match what the
/etc/condor/certs/condor_mapfile maps the Frontend to in HTCondor (AuthenticatedIdentity in the classad).
Make sure the attributes are correctly lined up as in the Frontend security configuration section above.
Glideins start but do not connect to User pool / VO Frontend
Check the appropriate job err and out logs in
/var/log/gwms-factory/client to see if any errors were reported.
Often, this will be a pilot unable to access a web server or with an invalid proxy. Also, verify that the
condor_mapfile is correct on the VO Frontend's user pool collector and configuration.
Glideins start but fail before running job with error "Proxy not long lived enough"
If the glideins are running on a resource (entry) but the jobs are not running and the log files in
/var/log/gwms-factory/client/user_frontend/glidein_gfactory_instance/ENTRY_NAME report an error like "Proxy not long lived enough (86096 s left), shortened retire time ...", then probably the HTCondor RLM on the Compute Element is delegating the proxy and shortening its lifespan.
This can be fixed by setting
DELEGATE_JOB_GSI_CREDENTIALS = FALSE as suggested in the CE install document.