Title:

Xymon configuration, manipulation, tips and tricks:

Author:

Douglas O’Leary <dkoleary@olearycomputers.com>

Description:

Xymon tips/tricks, mostly CLI.

Disclaimer:

Standard: Use the information that follows at your own risk. If you screw up a system, don’t blame it on me…

Overview:

The client I’m at uses xymon so I’ve been busy trying to pick that up to a supportable degree.

I’ve had a few problems with it - probably more accurately, with the way the client’s implemented it. Like most things, it’s evolved differently as different admins have taken charge of it.

The problems:

  • Command line interface, particularly for acknowledging alerts. The web interface is good, but I’m a UNIX admin; pointy-clicky things make me twitch.

  • Separating alerts for the same resource type. Example: alerts for filesystems under /opt/app should go to the application administrators, whereas OS-related filesystems should go to us. Since I can’t do anything about an application filesystem being full, I don’t want to hear about it.

  • ntp monitoring: We’re using the rpm-packaged version. Xymon is a very nice tool, easy to manage and manipulate, but the built-in ntp monitoring sucks. It took a bit of work to get that monitoring working reliably.

Immediately following is a list of lessons learned. Details/discussions of those lessons learned, where needed, follow the list.

Lessons Learned:

  • xymond listens on port 1984 - you’ll need that for firewall configuration.

  • Acknowledging an alert from the command line requires the alert id - what the web interface calls the cookie.

    • Obtain the alert id with xymon’s xymondlog subcommand, which displays detailed information on a specific test. See the xymon man page for details. You do not have to be root to run it.

      $ xymon localhost 'xymondlog client4.disk'
      client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
      red Sat Oct  4 13:56:11 CDT 2014 - Filesystems NOT ok
      &red /opt/app (100% used) has reached the PANIC level (95%)
      
      Filesystem            1024-blocks    Used Available Capacity Mounted on
      /dev/mapper/vg00-root     1032088  370652    609008      38% /
      /dev/vda1                  495844   67751    402493      15% /boot
      /dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
      /dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
      /dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
      /dev/mapper/vg00-var      2064208  439152   1520200      23% /var
      /dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app
      

      To display the alert id, parse the above output:

      $ xymon localhost 'xymondlog client4.disk' | head -1 | \
      awk -F\| '{print $11}'
      1578903790
      
    • To acknowledge the alert, execute:

      xymon localhost "hobbitdack ${alert_id} ${minutes} ${msg}"
      
  • To disable an alert:

    xymon 127.0.0.1 'disable ${host}.[${test}|*] ${minutes} ${free_text}'
    

    Can set ${minutes} to -1 to disable it until it comes back good again.

  • To segregate alerts by filesystem:

    • In analysis.cfg, ensure appropriate filesystems (and other alerts) are grouped:

      HOST=%client[1-3]
          DISK /opt/app GROUP=mw 90 95
          DISK * GROUP=infra 90 95
      HOST=client4
          DISK /opt/app GROUP=dol 90 95
          DISK * GROUP=infra 90 95
      
    • In alerts.cfg, use GROUP name as the host:

      GROUP=mw
          MAIL $Middleware
      GROUP=dol
          MAIL $Dkoleary
      GROUP=infra
          MAIL $Unix
      
  • To display the current status of a specific test across the environment:

    # xymon localhost 'xymondboard test=lntp'
    client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    

    See the xymon man page for details. Queries can also be filtered on host and color, e.g. xymon localhost 'xymondboard host=client4 color=red'.

  • To prevent acknowledged alerts from propagating red status and/or to remove them from the all non-green view page:

    • Add --nopropack='*' to the XYMONGENOPTS option in xymonserver.cfg.

    • Restart xymon. May not be absolutely mandatory, but I’m impatient.

Discussion:

CLI interface:

Alert acknowledgement:

I’ve been studying this off and on for about a month and was having problems finding out how to manage the environment from the command line. I had my AH-HA moment today, when it suddenly became much easier.

To acknowledge an alert, you have to know the alert ID or its cookie which is displayed in the emails that xymon sends. But, what if you want to acknowledge an alert that wasn’t emailed out?

First, find the alert ID. That’s hiding in the xymondlog output:

$ xymon localhost 'xymondlog client4.disk'
client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
red Sat Oct  4 13:56:11 CDT 2014 - Filesystems NOT ok
&red /opt/app (100% used) has reached the PANIC level (95%)

Filesystem            1024-blocks    Used Available Capacity Mounted on
/dev/mapper/vg00-root     1032088  370652    609008      38% /
/dev/vda1                  495844   67751    402493      15% /boot
/dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
/dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
/dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
/dev/mapper/vg00-var      2064208  439152   1520200      23% /var
/dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app

That first line, the one separated by ‘|’ characters, is the one you want. From the xymon man page, the fields are:

  • hostname

  • test name

  • color

  • test flags

  • last change: UNIX timestamp when the color changed.

  • log time: UNIX timestamp

  • validtime: UNIX timestamp when the log entry is no longer valid

  • acktime: the man page says -1, but I’ve only seen 0 for unacknowledged alerts.

  • disabletime: same as acktime

  • sending IP address

  • alert id or cookie (for those counting, it’s field 11)

  • ackmsg: acknowledgement message

  • dismsg: disable message.
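As a quick sanity check, the field positions can be verified offline by running awk over the sample status line from above (a sketch; nothing here touches the xymon server):

```shell
# Label the interesting fields of the xymondlog status line shown earlier.
line='client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|'

echo "${line}" | awk -F'|' '{
    print "hostname:    " $1
    print "test name:   " $2
    print "color:       " $3
    print "last change: " $5
    print "cookie:      " $11
}'
```

The cookie printed on the last line (1578903790) is the same value the awk one-liner extracts.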

So, to obtain the specific alert id:

$ xymon localhost 'xymondlog client4.disk' | head -1 | \
awk -F\| '{print $11}'
1578903790

Then, to acknowledge the alert:

$ xymon localhost 'hobbitdack 1578903790 1440 acked for a day'

Other available information:

The xymondboard subcommand seems particularly useful. To see the status of a particular test across your environment:

# xymon localhost 'xymondboard test=lntp'
client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014

The fields are close to but not exactly the same as the xymondlog subcommand. Check the xymon man page for details.

The times, as above, are in UNIX time format. Converting them is straightforward:

# ctime 1412478507
Sat Oct  4 22:08:27 2014

ctime’s a function I’ve had in my .kshrc file forever:

ctime()
{
    eval "perl -e '\$time = localtime($1); print \"\$time\\n\"'"
}
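If you’d rather not shell out to perl, GNU date can do the same conversion with its -d @ syntax (GNU-specific; BSD date spells this -r instead). Pinning TZ and the locale makes the output predictable:

```shell
# Convert a UNIX timestamp to human-readable form with GNU date.
TZ=America/Chicago LC_ALL=C date -d @1412478507
# -> Sat Oct  4 22:08:27 CDT 2014
```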

You can use that information to verify the last time a test was logged. I used it to verify that a custom test was actually getting run. Came in quite handy.
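Putting the pieces together: a sketch that pulls the host, test, and log-time fields out of a xymondboard line and converts the timestamp. The sample line is copied from the output above; GNU date is assumed:

```shell
# Field 6 of a xymondboard line is the UNIX timestamp of the last log entry.
line='client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014'

echo "${line}" | awk -F'|' '{print $1, $2, $6}' | while read host test logtime
do
    echo "${host}.${test} last logged: $(TZ=America/Chicago LC_ALL=C date -d @${logtime})"
done
# -> client1.lntp last logged: Sat Oct  4 22:08:27 CDT 2014
```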

Scripts:

ack:

The ack script combines the two steps required to acknowledge an alert. It assumes it’s run from some system other than the xymon server. Pretty straightforward:

#!/bin/ksh

##############################################################
# ack:      acknowledges xymon alerts
# Author:   Doug O'Leary
# Created:  10/04/14
# Updated:
##############################################################

usage()
{   msg="$*"
    echo ''
    [[ ${#msg} -gt 0 ]] && echo "${msg}"
    print "Format: ack -h \${host}.\${test} -t \${min} -m \"\${message}\"\n"
    exit 1
}

[[ $# -eq 0 ]] && usage "Invalid number of arguments"
Xymon="xymon"

while getopts "h:t:m:" arg
do
    case ${arg} in
        h)  Test=${OPTARG};;
        t)  Time=${OPTARG};;
        m)  Msg="${OPTARG}";;
    esac
done

Ttime=$(echo "${Time}" | sed 's/[0-9]//g')
[[ ${#Ttime} -gt 0 ]] && usage "Invalid time: ${Time}"

AI=$(ssh ${Xymon} "xymon localhost 'xymondlog ${Test}'"  | head -1 | \
    awk -F\| '{print $11}')
[[ ${#AI} -eq 0 ]] && usage "Invalid host.test: ${Test}"

ssh ${Xymon} "xymon localhost 'hobbitdack ${AI} ${Time} ${Msg}'"

dack:

The dack script displays the results of a specific test. Even simpler, but it does save on typing:

#!/bin/ksh

###############################################################
# dack:    displays xymondlog details of a test
# Author:  Doug O'Leary
# Created: 10/04/14
###############################################################


usage()
{   msg="$*"
    echo ''
    [[ ${#msg} -gt 0 ]] && echo "${msg}"
    print "Format: dack \${host}.\${test}\n"
    exit 1
}

[[ $# -ne 1 ]] && usage "Invalid number of arguments"
Xymon="xymon"
Test=${1}

ssh ${Xymon} "xymon localhost 'xymondlog ${Test}'"

Differentiating alerts:

This one was driving me bug-nuts crazy: on call, getting paged about filesystems that belong to applications. Go figure, they get a bit hinky if I go in and start eliminating files willy-nilly. In short, the UNIX admin should not add, delete, or otherwise manipulate application data; it’s called separation of duties. If I can’t do anything about a full filesystem, don’t call me.

Couldn’t seem to get that one through. That, or none of the local experts knew how to do it. Turns out, it’s right there in the documentation.

For clarity, there’re two steps to getting this done:

  • Update the analysis.cfg and add a GROUP parameter to the specific filesystems. In this example, I want /opt/app filesystems on clients 1-3 to be in the mw group. Everything else belongs to infra.

    HOST=%client[1-3]
        DISK /opt/app GROUP=mw 90 95
        DISK * GROUP=infra 90 95

    HOST=client4
        DISK /opt/app GROUP=dol 90 95
        DISK * GROUP=infra 90 95

  • The alerts.cfg is even easier. Simply use GROUP name as the host, sending emails to whatever aliases you define:

    GROUP=mw
        MAIL $Middleware
    GROUP=dol
        MAIL $Dkoleary
    GROUP=infra
        MAIL $Unix
    

Monitoring ntp:

The ntp monitoring in version 4.3.17 doesn’t work well; it constantly comes up with false positives. As good as everything else is, I’ll give ‘em this one. It was a bit of a pain to figure out the correct approach, but once found, it’s quite easy to set up. Use the scripts below as-is or as a model to design your own custom tests.

  1. Update the hosts.cfg and include files with your custom test. In my case, I disabled ntp (!ntp) and enabled lntp on all my hosts. Verify with xymongrep that the test shows up. xymongrep may not report the tests accurately for a few minutes; either wait, or restart xymon.

    # xymongrep lntp
    192.168.122.21 client1 # lntp
    192.168.122.22 client2 # lntp
    192.168.122.23 client3 # lntp
    192.168.122.25 client4 # lntp
    192.168.122.20 ldapsvr # lntp
    192.168.122.16 syslog # lntp
    192.168.122.15 xymon # lntp
    
  2. Create the monitoring script. In short, use xymongrep to loop through the hosts, running your test against each. Based on the results, set the color and message, then call the xymon localhost 'status ...' command.

    #!/bin/ksh
    
    export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
    
    xymongrep lntp | while read ip host a tst
    do
        # echo "Checking on ${host}.${tst}"
        Out=$(ntpq -pn ${host} 2>&1)
        echo "${Out}" | grep -q '^\*[1-9]'
        if [ $? -eq 0 ]
        then
            color=green
            msg="Service ntp on ${host} is OK (up)
    
    Command: ntpq -pn ${host}
    ${Out}
    "
        else
            color=red
            msg="Service ntp on ${host} is NOT OK:
    
    Command: ntpq -pn ${host}
    ${Out}
    "
        fi
        xymon localhost "status ${host}.${tst} ${color} `date`
    
    ${msg}
    "
    done
    
  3. Create the xymon tasking, by creating a file in /etc/xymon/tasks.d with the following contents:

    # cat /etc/xymon/tasks.d/lntp
    [lntp]
        CMD /usr/local/bin/lntp
        INTERVAL 5m
    
  4. Restart xymon
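The green/red decision in the script above rides entirely on grep '^\*[1-9]' matching ntpq’s selected-peer line - the one ntpq tags with a leading ‘*’. That pattern can be sanity-checked offline with made-up peer lines (illustrative values only):

```shell
# A synced peer line (leading '*') should test green; an unsynced one red.
synced='*192.168.122.1   .GPS.            1 u   33   64  377    0.321    0.012    0.004'
unsynced=' 192.168.122.1   .INIT.         16 u    -   64    0    0.000    0.000    0.000'

for out in "${synced}" "${unsynced}"
do
    if echo "${out}" | grep -q '^\*[1-9]'
    then
        echo green
    else
        echo red
    fi
done
# -> green
# -> red
```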

Seems like it’d be pretty easy to script out just about anything you’d want with that information.

At any rate, those are the answers to the three things that had been vexing me with xymon. Once I had some time to dedicate, it was fairly straightforward, but having someone blaze the trail’s always easier…