=============================================
Xymon config/installation/manipulation notes:
=============================================

Lessons learned:
================

* xymond listens on port 1984 - useful for firewall restrictions.

* Acknowledging an alert from the CLI: ::

    xymon 127.0.0.1 'hobbitdack ${alert_id} ${time_in_minutes} ${alert_msg}'

* To segregate alerts by filesystem:

  * In analysis.cfg, ensure appropriate filesystems (and other alerts) are
    grouped: ::

      HOST=%client[1-3]
          DISK /opt/app GROUP=mw 90 95
          DISK * GROUP=infra 90 95
      HOST=client4
          DISK /opt/app GROUP=dol 90 95
          DISK * GROUP=infra 90 95

  * In alerts.cfg, use the GROUP name as the *host*: ::

      GROUP=mw
          MAIL $Middleware
      GROUP=dol
          MAIL $Dkoleary
      GROUP=infra
          MAIL $Mpiunix

* To disable a test for a period of time: ::

    xymon 127.0.0.1 'disable ${host}.[${test}|*] ${minutes} ${free_text}'

  Can set ${minutes} to -1 to disable it until it comes back good again.

* To ID the alert_id of a test - in fact, to obtain quite a bit of info
  regarding a test - ``xymon localhost 'xymondlog ${host}.${test}'`` displays
  the test status. See the xymon man page, xymondlog section, for details.
  Note: you do NOT have to be root to run it. ::

    $ xymon localhost 'xymondlog client4.disk'
    client4|disk|red||1412444137|1412448972|1412450772|0|0|192.168.122.25|1578903790|||Y|
    red Sat Oct 4 13:56:11 CDT 2014 - Filesystems NOT ok

    &red /opt/app (100% used) has reached the PANIC level (95%)

    Filesystem             1024-blocks      Used  Available Capacity Mounted on
    /dev/mapper/vg00-root      1032088    370652     609008      38% /
    /dev/vda1                   495844     67751     402493      15% /boot
    /dev/mapper/vg00-opt       1032088     34060     945600       4% /opt
    /dev/mapper/vg00-tmp       2064208     68616    1890736       4% /tmp
    /dev/mapper/vg00-usr       4128448   1684704    2234032      43% /usr
    /dev/mapper/vg00-var       2064208    439152    1520200      23% /var
    /dev/mapper/vg00-app       2064208   2042292          0     100% /opt/app

* To display the alert id, parse the above output: ::

    $ xymon localhost 'xymondlog client4.disk' | head -1 | \
        awk -F\| '{print $11}'
    1578903790

* To display the results of a test across the env: ::

    # xymon localhost 'xymondboard test=lntp'
    client1|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    client2|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    client3|lntp|green||1412472773|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    client4|lntp|green||1412468328|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    ldapsvr|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    syslog|lntp|green||1412468348|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014
    xymon|lntp|green||1412470655|1412478507|1412480307|0|0|127.0.0.1||green Sat Oct 4 22:08:27 CDT 2014

  See the xymon man page for details.

Items to learn:
===============

* How to set up different scripts. For instance, for ntp testing.

Notes:
======

08/17/14:

* Got xymon and four clients running. Downloaded rpms for the same version
  we're using at work from http://terabithia.org/rpms/xymon/. Server and
  clients are installed but not yet configured or running.
* Still need to:

  * Edit /etc/xymon-client/xymonclient.cfg, updating XYMONSERVERS (see the
    sketch after this list).
  * Figure out the server configuration.
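That first edit is a one-liner. A minimal sketch of it - the server address
here is made up, so substitute your own: ::

    # /etc/xymon-client/xymonclient.cfg
    # Point the client at the xymon server (hypothetical address);
    # XYMONSERVERS takes a space-separated list if there's more than one.
    XYMONSERVERS="192.168.122.10"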
09/01/14:

* Scratch that and reverse. Got xymon installed on a new vm, called xymon.
* Got xymon-client running on six other clients.
* xymon.conf for http access put in place automatically. That's nice.
* xymond listens on port 1984 - useful for firewall restrictions.
* **Got** my ghost clients. Nice!
* Read through the hosts.cfg man page. Nothing too out of the ordinary.
* One interesting bit, though, was the .default. tag, used for identifying
  default tests on otherwise unidentified hosts. That's how you get the new
  hosts on the ghosts page.
* OK: got my two groups, client and infra, got clients all green, and got one
  host in infra red.
* Next goals:

  * ack alerts
  * rewrite ntp reporting

09/02/14:

* Read through alerts.cfg. I think I found out, at least initially, how to
  configure disk alerts to go to other people. Specific lines: ::

    For some tests - e.g. "procs" or "msgs" - the right group of people to
    alert in case of a failure may be different, depending on which of the
    client rules actually detected a problem. E.g. if you have PROCS rules
    for a host checking both "httpd" and "sshd" processes, then the Web
    admins should handle httpd-failures, whereas "sshd" failures are handled
    by the Unix admins.

    To handle this, all rules can have a "GROUP=groupname" setting. When a
    rule with this setting triggers a yellow or red status, the groupname is
    passed on to the Xymon alerts module, so you can use it in the alert
    rule definitions in alerts.cfg(5) to direct alerts to the correct group
    of people.

  Need to experiment a bit with that one.

09/05/14:

* Files:

  * hosts.cfg: IDs the hosts to monitor and the tests to run on them.
  * analysis.cfg: IDs specific parameters for each host:

    * memphys
    * memswap
    * memact
    * load
    * up
    * disk

  * alerts.cfg: IDs who gets alerted for what.

* Updated analysis.cfg and alerts.cfg to direct emails for specific
  filesystems to specific groups. The trick is as follows:

  * analysis.cfg: ::

      HOST=%client[1-3]
          DISK /opt/app GROUP=mw 90 95
          DISK * GROUP=infra 90 95
      HOST=client4
          DISK /opt/app GROUP=dol 90 95
          DISK * GROUP=infra 90 95
      HOST=%xymon|ldapsvr|syslog
          DISK * GROUP=infra 90 95

  * alerts.cfg: ::

      GROUP=mw
          MAIL $Middleware
      GROUP=dol
          MAIL $Dkoleary
      GROUP=infra
          MAIL $Mpiunix

* Didn't get duplicate alerts, though. When client[14] were already alerting
  due to disk issues, the alert didn't go out for /tmp. That may be expected.
  Will have to check on that w/Justin at some point.

09/06/14:

Remaining goals:

* How to ID the alert number if it's not emailed out.
  Answer: /var/lib/xymon/histlogs/${host}/${test}: Nope; not it.
* How to script an alert on a client. (ntp)
* How to send alerts to scripts (for further redirection to OVO)

Well, didn't find out how to acknowledge a specific alert, but I did find out
how to disable the damned thing for a bit. That, at least, makes it go away
for the duration. I disabled caauth until it comes live again. At work, I
disabled walvdevwapp062's memory until 0800 Monday morning, and I disabled
nap-lvad-075's memory until it goes green again. Damn thing's been yellow for
pushing 20 days now...

Still, remaining goals:

* How to script an alert on a client. (ntp)
* How to send alerts to scripts (for further redirection to OVO) - see the
  sketch after this list.
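On that second goal: alerts.cfg has a SCRIPT recipient that can sit alongside
MAIL, which looks like the hook for an OVO bridge. A hedged sketch - the
script path and recipient name are invented, and the environment variables
are from my reading of the alerts.cfg(5) man page, so verify before relying
on them: ::

    # alerts.cfg: hand the alert to a script instead of mailing it.
    # Xymon runs the script with the alert described in environment
    # variables (e.g. BBHOSTNAME, BBSVCNAME, and BBALPHAMSG for the
    # full alert text).
    HOST=*
        SCRIPT /usr/local/bin/xymon2ovo ovo-feed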
10/04/14:

Been a bit. Vacation, new role at work, and complete and utter task
saturation. Today's work: figure out how to identify the alert_id from an
alert that's not mailed out. To do that, I'm going to kick off an alert, wait
for the alert, then find the fucking alert_id.

OK: forgot the firewall update on xymon. That's sorted now.

Alert ID for client4:disk is 1578903790. Found the fucker! ::

    xymon "xymondlog ${host}.${test}"

example: ::

    # xymon "xymondlog client4.disk"
    2014-10-04 12:39:36 No recipient specified - assuming localhost
    client4|disk|red||1412444137|1412444338|1412446138|0|0|192.168.122.25|1578903790|||Y|
    red Sat Oct 4 12:38:57 CDT 2014 - Filesystems NOT ok

    &red /opt/app (100% used) has reached the PANIC level (95%)

    Filesystem             1024-blocks      Used  Available Capacity Mounted on
    /dev/mapper/vg00-root      1032088    370652     609008      38% /
    /dev/vda1                   495844     67751     402493      15% /boot
    /dev/mapper/vg00-opt       1032088     34060     945600       4% /opt
    /dev/mapper/vg00-tmp       2064208     68616    1890736       4% /tmp
    /dev/mapper/vg00-usr       4128448   1684704    2234032      43% /usr
    /dev/mapper/vg00-var       2064208    438492    1520860      23% /var
    /dev/mapper/vg00-app       2064208   2042292          0     100% /opt/app

Or, more explicitly: ::

    xymon localhost "xymondlog client4.disk" | head -1 | \
        awk -F\| '{print $11}'

Combining that with our ack cli: ::

    xymon localhost 'hobbitdack ${alert_id} ${time_in_minutes} ${alert_msg}'
    xymon localhost 'hobbitdack 1578903790 5 testing cli alert ack'

Then, update xymonserver.cfg to not propagate acknowledged alerts, and your
non-green view becomes much clearer: ::

    XYMONGENOPTS="--nopropack='*'...

OK; some excellent progress today. That was one of the main goals. If ntp's
still fucked up, I can probably live with that. I **really** wanted to be
able to acknowledge those goddamned alerts, though.
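Putting today's pieces together, a quick helper to ack by host and test (the
function and its name are mine; everything inside it is straight from the
notes above): ::

    #!/bin/sh
    # ack_test: look up a test's alert_id via xymondlog (field 11 of the
    # first line of output) and acknowledge it with hobbitdack.
    # usage: ack_test host test minutes message...
    ack_test() {
        host=$1; test=$2; mins=$3; shift 3
        id=$(xymon localhost "xymondlog ${host}.${test}" | head -1 |
            awk -F\| '{print $11}')
        [ -n "$id" ] && xymon localhost "hobbitdack $id $mins $*"
    }

    # e.g., matching today's test:
    ack_test client4 disk 5 testing cli alert ack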