Nagios and OP5 - writing an NRPE check script
Long time no see...
One of my main interests in working with production systems is being able to sleep well at night. An important part of making sure I can is knowing when things go bad, which they will, sooner or later. It is just part of life. Just like any car or other mechanical thing, a computer system will eventually have a hiccup. It is better to find out yourself when and what went wrong than to have a customer call and tell you that something in your shop is broken.
Be proactive, not reactive.
In my world, where a small shop has at least a handful of servers and a large shop has hundreds or perhaps thousands of servers and services, there is no way to know for sure whether something is working or broken. A single server, no matter which brand, make or OS, has more than one service running, and everything that runs can break. So, unless you are willing to constantly log in to each and every system, you need to automate the monitoring of your stuff. Monitoring systems have been around for decades, ranging from very cheap to very expensive.
Today, I checked out OP5 Monitor, which is a commercial yet very attractive extension of Nagios. It has many bells and whistles that are not part of the standard issue, mainly when it comes to reporting and configuration. It still took me a couple of hours to set it up the way I wanted. But man, the configuration is a walk in the park in comparison. After the first hit, there is almost no way back to plain vanilla Nagios.
I have used Nagios quite a lot in the past, but it is ugly (eh, the GUI honestly looks like crap, but it for sure fulfills its purpose) and there is a horde of config files to keep track of.
Well, being an old-school Nagios hacker, I already know the basic concepts. Perhaps the OP5 Monitor configuration is easier for me than for many others, but I will put that aside. Here, I will just give you a quick glance at how easy it is to extend NRPE (Nagios Remote Plugin Executor), so that the monitoring server (Nagios or OP5) can execute remote scripts on a host without having to deal with weird home-grown ssh scripts and keys.
First, I have to give you a short introduction to how Nagios checks a service. It is simple, really simple.
If you want to write your own check script, you need to know what you want to check. A good example is to look for the presence of a file, e.g. /tmp/foo.bar. Let us say that your whole corporation depends on knowing whether this file exists. A simple way to check this is to write a script.
#!/bin/ksh
[ ! -f /tmp/foo.bar ] && echo "The file does not exist"
This will just echo a warning if the file does not exist.
If you would like Nagios to understand this, you need to tell it just a little more: a return code.
- 0 - OK - All is fine, just go on as before
- 1 - WARNING - Warn that something is not really ok
- 2 - CRITICAL - This is bad, call for the fire brigade
- 3 - UNKNOWN - The check itself could not tell what is going on
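To see which state a plugin would report, you can run it by hand and look at its exit code. The following is a minimal sketch, assuming a POSIX-compatible shell; the function names state_name and run_check are made up for illustration and are not part of Nagios or NRPE. The catch-all branch maps any other code to UNKNOWN, which the plugin convention assigns to code 3.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nagios: translate a plugin exit code
# into the state name Nagios would show.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Hypothetical helper: run any command and print its state name
# followed by whatever the command wrote to stdout.
run_check() {
    if output=$("$@"); then
        rc=0
    else
        rc=$?
    fi
    echo "$(state_name "$rc"): $output"
}

# Example: a stand-in "plugin" that prints a message and exits with 2.
run_check sh -c 'echo "The file does not exist"; exit 2'
```

The example line prints "CRITICAL: The file does not exist", because the stand-in command exits with code 2.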
So, to extend this script into a fully fledged Nagios module, you just need to send back the correct return code:
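#!/bin/ksh
# Report whether /tmp/foo.bar exists, using the Nagios return codes.
if [ ! -f /tmp/foo.bar ]
then
    msg="CRITICAL - The file does not exist"
    rc=2
else
    msg="OK - The file is here!"
    rc=0
fi
echo "$msg"
# Use exit, not return: return is meant for functions and fails in some shells.
exit $rc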
It is as simple as that (plus that you have to go through the tedious job of configuring the chkcommand.cfg file and your Nagios services). With this you have a simple Nagios module.
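The same idea can be made a little more general. Here is a minimal sketch, assuming a POSIX-compatible shell and that you would rather pass the file name as an argument than hard-code it. The function name check_file is hypothetical, and returning 3 (UNKNOWN) when no file name is given follows the usual plugin convention rather than anything specific to OP5.

```shell
#!/bin/sh
# check_file PATH: print a one-line status message and return the
# matching Nagios code (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).
check_file() {
    if [ -z "$1" ]; then
        echo "UNKNOWN - no file name given"
        return 3
    fi
    if [ -f "$1" ]; then
        echo "OK - $1 is here!"
        return 0
    fi
    echo "CRITICAL - $1 does not exist"
    return 2
}
```

In a real plugin file you would finish with a call such as check_file "$1" followed by exit $?, so that the function's return code becomes the exit code of the script.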
To make this an NRPE module, which is remotely executed by the Nagios or OP5 server on the server of choice, you just have to put this script somewhere on your monitored server, e.g. in /opt/plugins/check_myfile, and set up the NRPE configuration.
remote host $> sudo chmod 755 /opt/plugins/check_myfile
remote host $> grep check_myfile /etc/nrpe.d/my_config.cfg
command[myfile]=/opt/plugins/check_myfile
remote host $> sudo /etc/init.d/nrpe restart
On the Nagios server, check that your script works (my remote host has the IP address 192.168.2.90):
OP5 $> /opt/plugins/check_nrpe -H 192.168.2.90 -c myfile
CRITICAL - The file does not exist
remote host $> touch /tmp/foo.bar
OP5 $> /opt/plugins/check_nrpe -H 192.168.2.90 -c myfile
OK - The file is here!
That is basically it! Now, go ahead and configure a new NRPE service for a host in your OP5 environment, put the word "myfile" in the "check_command_args" field, and you are done. Two minutes of work, and you save yourself tons of headache.
DEBUG: The script has to send at least something to stdout; it doesn't really matter what. Otherwise you will get an error message from the check_nrpe script on the server side:
remote host $> grep echo
# echo $msg
OP5 $> /opt/plugins/check_nrpe -H 192.168.2.90 -c myfile
NRPE: Unable to read output
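One way to guard against this is to make sure the script never prints an empty message. A minimal sketch, assuming a POSIX-compatible shell; the helper name emit_status and the fallback text are hypothetical and not part of NRPE.

```shell
#!/bin/sh
# emit_status MESSAGE: print MESSAGE, or a fallback text when MESSAGE is
# empty, so that check_nrpe always has something to read from stdout.
emit_status() {
    if [ -n "$1" ]; then
        echo "$1"
    else
        echo "UNKNOWN - plugin produced no output"
    fi
}

# Example: an empty message still produces one line of output.
msg=""
emit_status "$msg"
```

With the empty msg above, the example prints the fallback line instead of nothing.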