Troubleshooting Guide for Yum Repository Scripts
The repo.opensciencegrid.org and repo-itb.opensciencegrid.org hosts contain the OSG Yum software repositories plus related services and tools. In particular, the mash software is used to download RPMs from where they are built (at the University of Wisconsin–Madison), and there are some associated scripts to configure and invoke mash periodically. Use this guide to monitor the mash system for problems and to perform basic troubleshooting when such problems arise.
Monitoring
To monitor the repository hosts for proper mash operation, do the following steps on each host:
- ssh to repo.opensciencegrid.org and cd into /var/log/repo to view logs from mash updates
- Examine the “Last modified” timestamp of all of the update_repo.*.log files
- If the timestamps are all less than 2 hours old, life is good and you can skip the remaining steps below
- Otherwise, examine the “Last modified” timestamp of the update_all_repos.err file
- If the update_all_repos.err timestamp is current, there may be a mash process that is hung; see the Troubleshooting steps below
- If all timestamps are more than 6 hours old, something may be wrong with cron or its mash entries:
- Verify that cron is running and that the cron entries for mash are still present; if not, try to restore things
- Otherwise, create a Freshdesk ticket with a subject like “Repo update logs are too old on ” and with relevant details in the body
- Assign the ticket to the “Software” group
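The timestamp check in the steps above can be sketched as a small shell helper. This is a minimal sketch, not part of the OSG tooling: the function name stale_update_logs is hypothetical, the default directory and the 2-hour (120-minute) threshold come from the steps above, and it assumes a find that supports -maxdepth and -mmin (GNU and BSD find both do).

```shell
#!/bin/sh
# Sketch: list mash update logs that were last modified longer ago than a threshold.
# stale_update_logs is a hypothetical helper name, not an OSG-provided tool.

stale_update_logs() {
    log_dir=${1:-/var/log/repo}   # log directory named in this guide
    max_age_min=${2:-120}         # 2 hours, the "life is good" threshold
    # -mmin +N matches files whose last modification is more than N minutes old
    find "$log_dir" -maxdepth 1 -type f -name 'update_repo.*.log' -mmin +"$max_age_min"
}
```

Running stale_update_logs with no arguments on a repository host prints nothing when every update_repo.*.log file is less than 2 hours old; any file names it prints are the logs that need a closer look. Passing 360 as the second argument applies the 6-hour threshold instead.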
Troubleshooting and Mitigation
Identifying and fixing a hung mash process
If a mash update process hangs, all future invocations from cron of the mash scripts will exit without taking action because of the hung process. Thus, it is important to identify and remove any hung processes so that future updates can proceed. Use the procedure below to remove any hung mash processes; doing so is safe in that it will not adversely affect the Yum repositories being served from the host.
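To illustrate why one hung process blocks every later cron run, here is a minimal sketch of a lock-guarded script. It is an illustration only: update_all_repos.sh may implement its lock differently, and the names LOCK_DIR and run_guarded as well as the lock path are hypothetical.

```shell
#!/bin/sh
# Illustration only: a cron-driven job guarded by a lock directory.
# LOCK_DIR and run_guarded are hypothetical names, not taken from update_all_repos.sh.

LOCK_DIR=${LOCK_DIR:-/tmp/example_update.lock}

run_guarded() {
    # mkdir is atomic, so only one caller can create the lock directory
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "$(date): Can't acquire lock, is another update already running?" >&2
        return 1
    fi
    # Run the real work, then release the lock whether or not the work succeeded
    "$@"
    status=$?
    rmdir "$LOCK_DIR"
    return "$status"
}
```

If the guarded command never returns, the rmdir is never reached, so each later invocation fails at the mkdir step and exits without doing any work. That is the behavior described above, and it is why the hung process has to be removed before updates can resume.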
- In the listing of log files (see above), view the file update_all_repos.err
- In the error log file, look for messages such as:
Wed Jan 20 18:10:02 UTC 2016: Can't acquire lock, is update_all_repos.sh already running?
This message indicates that the most recent update attempt quit early due to the presence of a lock file, most likely from a hung mash process.
- Look for mash processes:
root@host # ps -C mash -o pid,ppid,pgid,start,command
  PID  PPID  PGID  STARTED COMMAND
24551 24549 23455   Jan 15 /usr/bin/python /usr/bin/mash osg-3.1-el5-release -o
24552 24551 23455   Jan 15 /usr/bin/python /usr/bin/mash osg-3.1-el5-release -o
- If there are mash processes that started on a previous date or more than 2 hours ago, it is best to terminate their corresponding process groups (PGID above):
root@host # kill -TERM -23455
Then verify that the old processes are gone using the same ps command as above; only the header line should remain:
root@host # ps -C mash -o pid,ppid,pgid,start,command
  PID  PPID  PGID  STARTED COMMAND
- If any part of this process does not look or work as expected:
- Create a Freshdesk ticket with a subject like “Repo update logs are too old on ” and with relevant details in the body
- Assign the ticket to the “Software” group
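The search for old mash processes in the procedure above can be sketched as a filter over ps output. This is a sketch under stated assumptions: stale_pgids is a hypothetical helper name, the 7200-second default mirrors the 2-hour rule above, and the usage line assumes a procps-ng ps that supports the etimes (elapsed seconds) column, which older hosts may lack.

```shell
#!/bin/sh
# Sketch: print the process group IDs of processes older than a threshold.
# stale_pgids is a hypothetical helper name, not an OSG-provided tool.

# Reads "PGID ELAPSED_SECONDS" pairs on stdin and prints each stale PGID once.
stale_pgids() {
    max_age_sec=${1:-7200}   # 2 hours, matching the rule in this guide
    awk -v max="$max_age_sec" '$2 > max && !seen[$1]++ { print $1 }'
}

# Example use on a repository host (review the list before killing anything):
#   ps -C mash -o pgid= -o etimes= | stale_pgids 7200
# Each PGID printed can then be terminated as shown in the procedure:
#   kill -TERM -23455    # 23455 is the PGID from the example listing
```

Keeping the filter separate from the ps call means the output can be reviewed before any kill command is run, and the filter itself can be tried on sample input without touching live processes.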