===================================================
Xymon configuration, manipulation, tips and tricks:
===================================================
:Title:   Xymon configuration, manipulation, tips and tricks:
:Author:  Douglas O'Leary <dkoleary@olearycomputers.com>
:Description:  Xymon tips/tricks, mostly CLI.
:Disclaimer: Standard: Use the information that follows at your own risk.  If you screw up a system, don't blame it on me...

.. contents::

Overview:
=========

The client I'm at uses xymon so I've been busy trying to pick that up to a
supportable degree.  

I've had a few problems with it - probably more accurately, with the way
the client's implemented it.  Like most things, it's evolved differently
as different admins take charge of it.

The problems:

*   Command line interface, particularly for acknowledging alerts.  The web
    interface is good; but, I'm a UNIX admin.  Pointy clicky things make
    me twitch.
*   Separating alerts for the same resource type.  Example: Alerts for
    filesystems under /opt/app should go to application administrators whereas
    OS related filesystems should go to us.  If I can't do anything about an
    application filesystem being full, I don't want to hear about it.
*   ntp monitoring: We're using the rpm packaged version.  Very nice tool;
    easy to manage, manipulate, but the built in ntp monitoring sucks.  It's
    taken a bit to get to where that's working reliably now.

Immediately following is a list of lessons learned.  Details/discussions of 
those lessons learned, where needed, follow the list.

Lessons Learned:
================

*   xymond listens on port 1984 - need that for firewall configuration.
*   Acknowledging an alert from the command line.  Need to have the 
    alert id, what's called the cookie (for web interface).

    *   Obtain the alert id:  use the xymon -> xymondlog command.  The 
        xymondlog subcommand displays detailed information on a specific
        test.  See the xymon man page for details. You do not have to be
        root to run it. ::

            $ xymon localhost 'xymondlog client4.disk'
            client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
            red Sat Oct  4 13:56:11 CDT 2014 - Filesystems NOT ok
            &red /opt/app (100% used) has reached the PANIC level (95%)
            
            Filesystem            1024-blocks    Used Available Capacity Mounted on
            /dev/mapper/vg00-root     1032088  370652    609008      38% /
            /dev/vda1                  495844   67751    402493      15% /boot
            /dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
            /dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
            /dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
            /dev/mapper/vg00-var      2064208  439152   1520200      23% /var
            /dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app

        To display the alert id, parse the above output: ::

            $ xymon localhost 'xymondlog client4.disk' | head -1 | \
            awk -F\| '{print $11}'
            1578903790

    *   To acknowledge the alert, execute: ::

            xymon localhost "hobbitdack ${alert_id} ${minutes} ${msg}"

*   To disable an alert: ::

        xymon 127.0.0.1 'disable ${host}.[${test}|*] ${minutes} ${free_text}'

    Can set ${minutes} to -1 to disable it until it comes back good again.

*   To segregate alerts by filesystem: 

    *   In analysis.cfg, ensure appropriate filesystems (and other alerts)
        are grouped: ::

            HOST=%client[1-3]
                DISK /opt/app GROUP=mw 90 95
                DISK * GROUP=infra 90 95
            HOST=client4
                DISK /opt/app GROUP=dol 90 95
                DISK * GROUP=infra 90 95

    *   In alerts.cfg, use GROUP name as the *host*: ::

            GROUP=mw
                MAIL $Middleware
            GROUP=dol
                MAIL $Dkoleary
            GROUP=infra
                MAIL $Unix

*   To display the current status of a specific test across the environment: ::

        # xymon localhost 'xymondboard test=lntp'
        client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
        client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
        client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
        client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
        ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
        syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
        xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014

    See the xymon man page for details.  Queries can also be filtered on host
    and color.

*   To prevent acknowledged alerts from propagating red status and/or remove
    them from the *all non-green view* page:

    *   Add ``--nopropack='*'`` to the XYMONGENOPTS option in xymonserver.cfg.
    *   Restart xymon.  May not be absolutely mandatory, but I'm impatient.
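
The disable syntax in the list above is easy to fat finger, so here's a
minimal sketch of a wrapper that builds the message and sanity checks the
minutes first.  The ``disable_msg`` name and its argument order are made up
for illustration; they aren't anything xymon ships: ::

    # disable_msg: hypothetical helper, prints a xymon "disable" message.
    # Arguments: host, test (or '*'), minutes (-1 = until the test is
    # green again), then free text.
    disable_msg()
    {
        [ $# -ge 3 ] || { echo "Need host, test and minutes" >&2; return 1; }
        dm_host="$1"; dm_test="$2"; dm_min="$3"
        shift 3
        # Accept -1 or a whole number of minutes; reject anything else
        case "${dm_min}" in
            -1) ;;
            ''|*[!0-9]*) echo "Invalid minutes: ${dm_min}" >&2; return 1 ;;
        esac
        printf 'disable %s.%s %s %s\n' "${dm_host}" "${dm_test}" "${dm_min}" "$*"
    }

Once the output looks right, feed it to the client, for example
``xymon 127.0.0.1 "$(disable_msg client4 disk 60 patching)"``.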

Discussion:
===========

CLI interface:
--------------

Alert acknowledgement:
~~~~~~~~~~~~~~~~~~~~~~

I've been studying this off and on for about a month and was having problems
finding out how to manage the environment from the command line.  I had my
AH-HA moment today, when it suddenly became much easier.  

To acknowledge an alert, you have to know the alert ID or its cookie which 
is displayed in the emails that xymon sends.  But, what if you want to 
acknowledge an alert that wasn't emailed out?  

First, find the alert ID.  That's hiding in the xymondlog output: ::

    $ xymon localhost 'xymondlog client4.disk'
    client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
    red Sat Oct  4 13:56:11 CDT 2014 - Filesystems NOT ok
    &red /opt/app (100% used) has reached the PANIC level (95%)
    
    Filesystem            1024-blocks    Used Available Capacity Mounted on
    /dev/mapper/vg00-root     1032088  370652    609008      38% /
    /dev/vda1                  495844   67751    402493      15% /boot
    /dev/mapper/vg00-opt      1032088   34060    945600       4% /opt
    /dev/mapper/vg00-tmp      2064208   68616   1890736       4% /tmp
    /dev/mapper/vg00-usr      4128448 1684704   2234032      43% /usr
    /dev/mapper/vg00-var      2064208  439152   1520200      23% /var
    /dev/mapper/vg00-app      2064208 2042292         0     100% /opt/app

That first line, the one separated by '|' is the one you want.  From the 
xymon man page, the fields are:

*   hostname
*   test name
*   color
*   test flags
*   last change: UNIX timestamp when the color changed.
*   log time: UNIX timestamp
*   validtime: UNIX timestamp when the log entry is no longer valid
*   acktime: man page says -1, I've only seen 0 for unack'ed alerts.
*   disabletime: same as acktime
*   sending IP address
*   alert id or cookie (for those counting, it's field 11)
*   ackmsg: acknowledgement message
*   dismsg: disable message.
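
Counting fields by hand gets old, so here's a minimal sketch that splits
that first line into named variables, in the order listed above.  The
``parse_xymondlog_header`` name is hypothetical, and it only prints a few of
the fields to keep things short: ::

    # parse_xymondlog_header: hypothetical helper, reads one '|' separated
    # xymondlog header line on stdin and prints selected fields.
    parse_xymondlog_header()
    {
        # A non-whitespace IFS keeps empty fields (like test flags) in place
        IFS='|' read -r host testname color flags lastchange logtime \
            validtime acktime disabletime sender cookie ackmsg dismsg rest
        printf 'host=%s test=%s color=%s cookie=%s\n' \
            "${host}" "${testname}" "${color}" "${cookie}"
    }

Usage would be along the lines of
``xymon localhost 'xymondlog client4.disk' | head -1 | parse_xymondlog_header``.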

So, to obtain the specific alert id: ::

    $ xymon localhost 'xymondlog client4.disk' | head -1 | \
    awk -F\| '{print $11}'
    1578903790

Then, to acknowledge the alert: ::

    $ xymon localhost 'hobbitdack 1578903790 1440 acked for a day'

Other available information:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The xymondboard subcommand seems particularly useful.  To see the status of
a particular test across your environment: ::

    # xymon localhost 'xymondboard test=lntp'
    client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014
    xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct  4 22:08:27 CDT 2014

The fields are close to but not exactly the same as the xymondlog subcommand.  
Check the xymon man page for details.
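
For a quick overview, the board output can be boiled down to a count per
color.  This is only a sketch: ``board_summary`` is a made-up name, and it
assumes the color is always the third field, as in the output above: ::

    # board_summary: hypothetical helper, reads xymondboard output on
    # stdin and prints "color count" pairs, sorted by color.
    board_summary()
    {
        awk -F'|' 'NF >= 3 { n[$3]++ } END { for (c in n) print c, n[c] }' | sort
    }

Something like ``xymon localhost 'xymondboard test=lntp' | board_summary``
should then show at a glance whether anything is not green.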

The times, as above, are in UNIX time format.  Converting those is
straightforward: ::

    # ctime 1412478507
    Sat Oct  4 22:08:27 2014

ctime's a function I've had in my .kshrc file forever: ::

    ctime()
    {
        eval "perl -e '\$time = localtime($1); print \"\$time\\n\"'"
    }

You can use that information to verify the last time a test was logged.  I
used it to verify that a custom test was actually **getting** run.  Came
in quite handy.
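
The same check can be turned into a number.  A minimal sketch, assuming
your ``date`` supports ``+%s`` (GNU and BSD do, but it's not strictly POSIX);
``log_age`` is a hypothetical name: ::

    # log_age: hypothetical helper, prints how many seconds ago a test
    # last logged.  $1 is the log time (field 6); $2 is optional and
    # overrides "now" so the math can be checked without a live server.
    log_age()
    {
        la_logtime="$1"
        la_now="${2:-$(date +%s)}"
        echo $(( la_now - la_logtime ))
    }

If a test that should run every five minutes reports an age well past 300
seconds, it's probably not getting run.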

Scripts:
~~~~~~~~

ack:
++++

The ack script combines the two steps required to acknowledge an alert.  This
script assumes it's run from some system other than the xymon server.  Pretty
straightforward: ::

    #!/bin/ksh
    
    ##############################################################
    # ack:      acknowledges xymon alerts
    # Author:   Doug O'Leary
    # Created:  10/04/14
    # Updated:
    ##############################################################
    
    usage()
    {   msg="$*"
        echo ''
        [[ ${#msg} -gt 0 ]] && echo "${msg}"
        print "Format: ack -h \${host}.\${test} -t \${min} -m \"\${message}\"\n"
        exit 1
    }
    
    [[ $# -eq 0 ]] && usage "Invalid number of arguments"
    Xymon="xymon"
    
    while getopts "h:t:m:" arg
    do
        case ${arg} in
            h)  Test=${OPTARG};;
            t)  Time=${OPTARG};;
            m)  Msg="${OPTARG}";;
        esac
    done
    
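    # Both the host.test and a numeric duration (in minutes) are required
    [[ -z "${Test}" || -z "${Time}" ]] && usage "Missing -h or -t argument"
    Ttime=$(echo "${Time}" | sed 's/[0-9]//g')
    [[ ${#Ttime} -gt 0 ]] && usage "Invalid time: ${Time}"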
    
    AI=$(ssh ${Xymon} "xymon localhost 'xymondlog ${Test}'"  | head -1 | \
        awk -F\| '{print $11}')
    [[ ${#AI} -eq 0 ]] && usage "Invalid host.test: ${Test}"
    
    ssh ${Xymon} "xymon localhost 'hobbitdack ${AI} ${Time} ${Msg}'"

dack:
+++++

The dack script displays the results of a specific test.  Even simpler, but
it does save on typing: ::

    #!/bin/ksh
    
    ###############################################################
    # dack:    displays xymondlog details of a test
    # Author:  Doug O'Leary
    # Created: 10/04/14
    ###############################################################
    
    
    usage()
    {   msg="$*"
        echo ''
        [[ ${#msg} -gt 0 ]] && echo "${msg}"
        print "Format: dack \${host}.\${test}\n"
        exit 1
    }
    
    [[ $# -ne 1 ]] && usage "Invalid number of arguments"
    Xymon="xymon"
    Test=${1}
    
    ssh ${Xymon} "xymon localhost 'xymondlog ${Test}'"

Differentiating alerts:
-----------------------

This one was driving me bug nuts crazy.  On call, getting paged about
filesystems that belong to applications.  Go figure, they get a bit hinkey
if I go in and start eliminating files willy nilly.  In short, the UNIX
admin should not add, delete, or otherwise manipulate application data.
It's called Separation of Duties.  If I can't do anything about a full
filesystem, don't call me.  

Couldn't seem to get that one through.  That, or none of the local experts
knew how to do it.  Turns out, it's right there in the documentation.  

For clarity, there're two steps to getting this done:

*   Update the analysis.cfg and add a GROUP parameter to the specific
    filesystems.  In this example, I want /opt/app filesystems on
    clients 1-3 to be in the *mw* group.  Everything else belongs to
    *infra*: ::

        HOST=%client[1-3]
            DISK /opt/app GROUP=mw 90 95
            DISK * GROUP=infra 90 95
        HOST=client4
            DISK /opt/app GROUP=dol 90 95
            DISK * GROUP=infra 90 95

*   The alerts.cfg is even easier.  Simply use GROUP name as the *host*,
    sending emails to whatever aliases you define: ::

        GROUP=mw
            MAIL $Middleware
        GROUP=dol
            MAIL $Dkoleary
        GROUP=infra
            MAIL $Unix
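
Since the group names have to match across the two files, a typo in either
one quietly sends alerts nowhere.  Here's a rough sketch of a cross-check;
``check_groups`` is a hypothetical name, and it assumes one group per
``GROUP=`` token (no comma separated lists) and that the alerts.cfg rule
starts its line with ``GROUP=``: ::

    # check_groups: hypothetical helper, prints every GROUP name used in
    # analysis.cfg ($1) that has no matching GROUP= rule in alerts.cfg ($2).
    check_groups()
    {
        cg_analysis="$1"; cg_alerts="$2"
        # Drop comment lines, put each token on its own line, keep the
        # names that follow GROUP=
        grep -v '^[[:space:]]*#' "${cg_analysis}" | tr -s '[:space:]' '\n' | \
            sed -n 's/^GROUP=//p' | sort -u | while read -r cg_group
        do
            grep -Eq "^[[:space:]]*GROUP=${cg_group}([[:space:]]|\$)" \
                "${cg_alerts}" || echo "${cg_group}"
        done
    }

No output means every group in analysis.cfg has somewhere to go.  The file
locations depend on your install, so pass in whatever paths yours uses.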

Monitoring ntp:
---------------

The ntp monitoring in ver 4.3.17 doesn't work well.  It's constantly coming up
with false positives.  As good as everything else is, I'll give 'em this one.
It was a bit of a pain to figure out the correct way, but once I did, it's 
quite easy to set up.  Use the scripts below as is or as a model to design your
own custom scripts.

1.  Update the hosts.cfg and include files with your custom test.  In my case
    I disabled ntp (!ntp) and enabled lntp on all my hosts.  Verify with
    xymongrep that the test shows up.  Xymongrep may not report the tests
    accurately for a few minutes.  Either wait, or restart xymon. ::

        # xymongrep lntp
        192.168.122.21 client1 # lntp
        192.168.122.22 client2 # lntp
        192.168.122.23 client3 # lntp
        192.168.122.25 client4 # lntp
        192.168.122.20 ldapsvr # lntp
        192.168.122.16 syslog # lntp
        192.168.122.15 xymon # lntp

2.  Create the monitoring script.  In short, you want to use xymongrep to loop
    through the hosts, running your test on each.  Based on the results, set the
    color and message, then call the ``xymon localhost 'status...`` command. ::

        #!/bin/ksh
        
        export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
        
        xymongrep lntp | while read ip host a tst
        do
            # echo "Checking on ${host}.${tst}"
            Out=$(ntpq -pn ${host} 2>&1)
            echo "${Out}" | grep -q '^\*[1-9]'
            if [ $? -eq 0 ]
            then
                color=green
                msg="Service ntp on ${host} is OK (up)
        
        Command: ntpq -pn ${host}
        ${Out}
        "
            else
                color=red
                msg="Service ntp on ${host} is NOT OK:
        
        Command: ntpq -pn ${host}
        ${Out}
        "
            fi
            xymon localhost "status ${host}.${tst} ${color} `date`
        
        ${msg}
        "
        done

3.  Create the xymon tasking, by creating a file in /etc/xymon/tasks.d with
    the following contents: ::

        # cat /etc/xymon/tasks.d/lntp
        [lntp]
            CMD /usr/local/bin/lntp
            INTERVAL 5m

4.  Restart xymon

Seems like it'd be pretty easy to script out just about anything you'd want
with that information.
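
One way to keep custom tests like that easy to check is to pull the
red/green decision into its own function, so it can be fed canned output
without touching a live host.  A minimal sketch using the same pattern as
the lntp script; ``ntp_color`` is a hypothetical name: ::

    # ntp_color: hypothetical helper, reads "ntpq -pn" output on stdin and
    # prints green if a peer is flagged '*' (the selected sync source),
    # red otherwise.
    ntp_color()
    {
        if grep -q '^\*[1-9]'
        then
            echo green
        else
            echo red
        fi
    }

In the loop, that would be used along the lines of
``color=$(echo "${Out}" | ntp_color)``.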

At any rate, those are the answers to the three things that have been vexing
me with xymon.  Once I had some time to dedicate, it was fairly
straightforward, but having someone blaze the trail's always easier...