===================================================
Xymon configuration, manipulation, tips and tricks:
===================================================

:Title: Xymon configuration, manipulation, tips and tricks:
:Author: Douglas O'Leary
:Description: Xymon tips/tricks, mostly CLI.
:Disclaimer: Standard: Use the information that follows at your own risk.  If you screw up a system, don't blame it on me...

.. contents::

Overview:
=========

The client I'm at uses xymon so I've been busy trying to pick that up to a
supportable degree.  I've had a few problems with it - probably more
accurately, with the way the client's implemented it.  Like most things, it's
evolved differently as different admins take charge of it.  The problems:

* Command line interface, particularly for acknowledging alerts.  The web
  interface is good; but, I'm a UNIX admin.  Pointy clicky things make me
  twitch.

* Separating alerts for the same resource type.  Example: Alerts for
  filesystems under /opt/app should go to application administrators whereas
  OS related filesystems should go to us.  If I can't do anything about an
  application filesystem being full, I don't want to hear about it.

* ntp monitoring: We're using the rpm packaged version.  Very nice tool; easy
  to manage and manipulate, but the built in ntp monitoring sucks.  It's taken
  a bit to get to where that's working reliably.

Immediately following is a list of lessons learned.  Details/discussions of
those lessons learned, where needed, follow the list.

Lessons Learned:
================

* xymond listens on port 1984 - need that for firewall configuration.

* Acknowledging an alert from the command line.  Need to have the alert id,
  what's called the cookie (for web interface).

  * Obtain the alert id: use the xymondlog subcommand of the xymon command.
    The xymondlog subcommand displays detailed information on a specific
    test.  See the xymon man page for details.  You do not have to be root to
    run it.
    ::

      $ xymon localhost 'xymondlog client4.disk'
      client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
      red Sat Oct 4 13:56:11 CDT 2014 - Filesystems NOT ok
      &red /opt/app (100% used) has reached the PANIC level (95%)

      Filesystem            1024-blocks    Used Available Capacity Mounted on
      /dev/mapper/vg00-root     1032088  370652    609008      38% /
      /dev/vda1                  495844   67751    402493      15% /boot
      /dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
      /dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
      /dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
      /dev/mapper/vg00-var      2064208  439152   1520200      23% /var
      /dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app

    To display the alert id, parse the above output:

    ::

      $ xymon localhost 'xymondlog client4.disk' | head -1 | \
        awk -F\| '{print $11}'
      1578903790

  * To acknowledge the alert, execute:

    ::

      xymon localhost "hobbitdack ${alert_id} ${minutes} ${msg}"

* To disable an alert:

  ::

    xymon 127.0.0.1 'disable ${host}.[${test}|*] ${minutes} ${free_text}'

  Can set ${minutes} to -1 to disable it until it comes back good again.

* To segregate alerts by filesystem:

  * In analysis.cfg, ensure appropriate filesystems (and other alerts) are
    grouped:

    ::

      HOST=%client[1-3]
          DISK /opt/app GROUP=mw 90 95
          DISK * GROUP=infra 90 95
      HOST=client4
          DISK /opt/app GROUP=dol 90 95
          DISK * GROUP=infra 90 95

  * In alerts.cfg, use GROUP name as the *host*:

    ::

      GROUP=mw
          MAIL $Middleware
      GROUP=dol
          MAIL $Dkoleary
      GROUP=infra
          MAIL $Unix

* To display the current status of a specific test across the environment:

  ::

    # xymon localhost 'xymondboard test=lntp'
    client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014

  See the xymon man page for details.  Queries can also be filtered on host
  and color.

* To prevent acknowledged alerts from propagating red status and/or remove
  them from the *all non-green view* page:

  * Add ``--nopropack='*'`` to the XYMONGENOPTS option in xymonserver.cfg.
  * Restart xymon.  May not be absolutely mandatory, but I'm impatient.

Discussion:
===========

CLI interface:
--------------

Alert acknowledgement:
~~~~~~~~~~~~~~~~~~~~~~

I've been studying this off and on for about a month and was having problems
finding out how to manage the environment from the command line.  I had my
AH-HA moment today, when it suddenly became much easier.

To acknowledge an alert, you have to know the alert ID, also called its
cookie, which is displayed in the emails that xymon sends.
But, what if you want to acknowledge an alert that wasn't emailed out?

First, find the alert ID.  That's hiding in the xymondlog output:

::

  $ xymon localhost 'xymondlog client4.disk'
  client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
  red Sat Oct 4 13:56:11 CDT 2014 - Filesystems NOT ok
  &red /opt/app (100% used) has reached the PANIC level (95%)

  Filesystem            1024-blocks    Used Available Capacity Mounted on
  /dev/mapper/vg00-root     1032088  370652    609008      38% /
  /dev/vda1                  495844   67751    402493      15% /boot
  /dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
  /dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
  /dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
  /dev/mapper/vg00-var      2064208  439152   1520200      23% /var
  /dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app

That first line, the one separated by '|', is the one you want.  From the
xymon man page, the fields are:

* hostname
* test name
* color
* test flags
* last change: UNIX timestamp when the color changed.
* log time: UNIX timestamp
* validtime: UNIX timestamp when the log entry is no longer valid
* acktime: man page says -1, I've only seen 0 for unack'ed alerts.
* disabletime: same as acktime
* sending IP address
* alert id or cookie (for those counting, it's field 11)
* ackmsg: acknowledgement message
* dismsg: disable message.

So, to obtain the specific alert id:

::

  $ xymon localhost 'xymondlog client4.disk' | head -1 | \
    awk -F\| '{print $11}'
  1578903790

Then, to acknowledge the alert:

::

  $ xymon localhost 'hobbitdack 1578903790 1440 acked for a day'

Other available information:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The xymondboard subcommand seems particularly useful.
To see the status of a particular test across your environment:

::

  # xymon localhost 'xymondboard test=lntp'
  client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
  client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
  client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
  client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
  ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
  syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
  xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014

The fields are close to, but not exactly the same as, those of the xymondlog
subcommand.  Check the xymon man page for details.

The times, as above, are in UNIX time format.  Converting those is
straightforward:

::

  # ctime 1412478507
  Sat Oct 4 22:08:27 2014

ctime's a function I've had in my .kshrc file forever:

::

  ctime()
  {   eval "perl -e '\$time = localtime($1); print \"\$time\\n\"'"
  }

You can use that information to verify the last time a test was logged.  I
used it to verify that a custom test was actually **getting** run.  Came in
quite handy.

Scripts:
~~~~~~~~

ack:
++++

The ack script combines the two steps required to acknowledge an alert.  This
script assumes it's run from some system other than the xymon server.
Pretty straightforward:

::

  #!/bin/ksh
  ##############################################################
  # ack: acknowledges xymon alerts
  # Author: Doug O'Leary
  # Created: 10/04/14
  # Updated:
  ##############################################################

  usage()
  {   msg="$*"
      echo ''
      [[ ${#msg} -gt 0 ]] && echo "${msg}"
      print "Format: ack -h \${host}.\${test} -t \${min} -m \"\${message}\"\n"
      exit 1
  }

  [[ $# -eq 0 ]] && usage "Invalid number of arguments"

  # Name of the xymon server, as reachable via ssh
  Xymon="xymon"
  while getopts "h:t:m:" arg
  do
      case ${arg} in
          h) Test=${OPTARG};;
          t) Time=${OPTARG};;
          m) Msg="${OPTARG}";;
          *) usage "Invalid option";;
      esac
  done

  # Both the host.test and the duration are needed to ack anything
  [[ -z "${Test}" || -z "${Time}" ]] && usage "Both -h and -t are required"

  # Time must be all digits
  Ttime=$(echo "${Time}" | sed 's/[0-9]//g')
  [[ ${#Ttime} -gt 0 ]] && usage "Invalid time: ${Time}"

  # Step 1: the alert id is field 11 of the first line of xymondlog output
  AI=$(ssh ${Xymon} "xymon localhost 'xymondlog ${Test}'" | head -1 | \
      awk -F\| '{print $11}')
  [[ ${#AI} -eq 0 ]] && usage "Invalid host.test: ${Test}"

  # Step 2: acknowledge it
  ssh ${Xymon} "xymon localhost 'hobbitdack ${AI} ${Time} ${Msg}'"

dack:
+++++

The dack script displays the results of a specific test.  Even simpler, but
it does save on typing:

::

  #!/bin/ksh
  ###############################################################
  # dack: displays xymondlog details of a test
  # Author: Doug O'Leary
  # Created: 10/04/14
  ###############################################################

  usage()
  {   msg="$*"
      echo ''
      [[ ${#msg} -gt 0 ]] && echo "${msg}"
      print "Format: dack \${host}.\${test}\n"
      exit 1
  }

  [[ $# -ne 1 ]] && usage "Invalid number of arguments"

  Xymon="xymon"
  Test=${1}
  ssh ${Xymon} "xymon localhost 'xymondlog ${Test}'"

Differentiating alerts:
-----------------------

This one was driving me bug nuts crazy.  On call, getting paged about
filesystems that belong to applications.  Go figure, they get a bit hinky if
I go in and start eliminating files willy nilly.  In short, the UNIX admin
should not add, delete, or otherwise manipulate application data.  It's called
Separation of Duties.  If I can't do anything about a full filesystem, don't
call me.  Couldn't seem to get that one through.  That, or none of the local
experts knew how to do it.
Turns out, it's right there in the documentation.

For clarity, there are two steps to getting this done:

* Update the analysis.cfg and add a GROUP parameter to the specific
  filesystems.  In this example, I want /opt/app filesystems on clients 1-3
  to be in the *mw* group.  Everything else belongs to *infra*:

  ::

    HOST=%client[1-3]
        DISK /opt/app GROUP=mw 90 95
        DISK * GROUP=infra 90 95
    HOST=client4
        DISK /opt/app GROUP=dol 90 95
        DISK * GROUP=infra 90 95

* The alerts.cfg is even easier.  Simply use GROUP name as the *host*,
  sending emails to whatever aliases you define:

  ::

    GROUP=mw
        MAIL $Middleware
    GROUP=dol
        MAIL $Dkoleary
    GROUP=infra
        MAIL $Unix

Monitoring ntp:
---------------

The ntp monitoring in ver 4.3.17 doesn't work well.  It's constantly coming
up with false positives.  As good as everything else is, I'll give 'em this
one.

It was a bit of a pain to figure out the correct way, but once I did, it's
quite easy to set up.  Use the scripts below as is or as a model to design
your own custom scripts.

1. Update the hosts.cfg and include files with your custom test.  In my case
   I disabled ntp (!ntp) and enabled lntp on all my hosts.  Verify with
   xymongrep that the test shows up.  Xymongrep may not report the tests
   accurately for a few minutes.  Either wait, or restart xymon.

   ::

     # xymongrep lntp
     192.168.122.21 client1 # lntp
     192.168.122.22 client2 # lntp
     192.168.122.23 client3 # lntp
     192.168.122.25 client4 # lntp
     192.168.122.20 ldapsvr # lntp
     192.168.122.16 syslog # lntp
     192.168.122.15 xymon # lntp

2. Create the monitoring script.  In short, you want to use xymongrep, loop
   through the hosts, running your test.  Based on the results, update the
   color and message, then call the ``xymon localhost 'status ...'`` command.

   ::

     #!/bin/ksh

     export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

     # xymongrep prints: ip host # test
     xymongrep lntp | while read ip host a tst
     do
         # echo "Checking on ${host}.${tst}"
         Out=$(ntpq -pn ${host} 2>&1)
         # A leading '*' marks the peer ntpd is currently synced to
         echo "${Out}" | grep -q '^\*[1-9]'
         if [ $? -eq 0 ]
         then
             color=green
             msg="Service ntp on ${host} is OK (up)

     Command: ntpq -pn ${host}
     ${Out}
     "
         else
             color=red
             msg="Service ntp on ${host} is NOT OK:

     Command: ntpq -pn ${host}
     ${Out}
     "
         fi
         xymon localhost "status ${host}.${tst} ${color} `date`
     ${msg}
     "
     done

3. Create the xymon tasking by creating a file in /etc/xymon/tasks.d with the
   following contents:

   ::

     # cat /etc/xymon/tasks.d/lntp
     [lntp]
         CMD /usr/local/bin/lntp
         INTERVAL 5m

4. Restart xymon.

Seems like it'd be pretty easy to script out just about anything you'd want
with that information.

Any rate, those are the answers to the three things that have been vexing me
with xymon.  Once I had some time to dedicate, it was fairly straightforward,
but having someone blaze the trail's always easier...
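
One last sketch, tying back to the alert acknowledgement discussion: the '|' separated first line of xymondlog output is easy enough to pick apart with awk that it's worth wrapping in a couple of functions.  What follows is a minimal example, not anything shipped with xymon.  The function names ``xylog_field`` and ``xylog_summary`` are made up for illustration, and the sample line is simply the header from the client4.disk example, so it can be tried without a xymon server handy.

```shell
#!/bin/sh
# Sketch only: helpers for the first line of xymondlog output.
# xylog_field and xylog_summary are hypothetical names, not xymon commands.

# xylog_field N: read xymondlog output on stdin and print field N of the
# first ('|' separated) line.  Field 11 is the alert id / cookie.
xylog_field() {
    awk -F'|' -v n="$1" 'NR == 1 { print $n; exit }'
}

# xylog_summary: print host.test, color and alert id on a single line.
xylog_summary() {
    awk -F'|' 'NR == 1 { printf "%s.%s %s cookie=%s\n", $1, $2, $3, $11; exit }'
}

# Header line copied from the client4.disk example, standing in for
# the output of: xymon localhost 'xymondlog client4.disk'
sample='client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|'

printf '%s\n' "$sample" | xylog_field 11
printf '%s\n' "$sample" | xylog_summary
```

On a real server, the same functions would presumably sit on the end of the pipeline in place of the ``head -1 | awk`` pair, for example ``xymon localhost 'xymondlog client4.disk' | xylog_field 11``.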