Who monitors the monitor?

How do you know that all virtual machines (VMs) in a VMware environment are actually monitored in your monitoring system (read: Nagios, Op5)?

The follow-up question is: is this really important? The answer is yes. Of course there might be virtual machines in your environment that you really don’t care about. But the day will come when you wish you had monitored that one machine in your environment that was not.

There are only two ways to know:

  • (1) Your deployment system or process for VMs also adds each new virtual machine to your monitoring system
  • (2) You make a list of existing virtual machines and compare it to what is monitored

You decide which is easier for you. In most environments (1) just doesn’t happen. So what if you are left with (2)? How do you do this automatically? You are not alone: (2) is common, but it is a tedious job. I call (2) “meta monitoring”: the monitoring of the monitoring. In my environment I have a set of monitoring checks that tell me whether I am doing my job properly. This is one of them.
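As a minimal sketch of what (2) boils down to, assume you already have two plain-text files with one host name per line: one listing all VMs and one listing everything that is monitored. The function name and both files are hypothetical; the rest of this article shows how the real lists are obtained.

```shell
#!/bin/bash
# Sketch: print the names that are in the VM list but not in the monitoring list.
# Both input files are assumptions: plain text, one host name per line.
not_monitored() {
    vm_list=$1
    monitored_list=$2

    # comm requires sorted input, so sort both lists into temporary files
    sorted_vms=$(mktemp)
    sorted_mon=$(mktemp)
    sort -u "$vm_list" > "$sorted_vms"
    sort -u "$monitored_list" > "$sorted_mon"

    # -23 suppresses lines unique to file 2 and lines common to both,
    # leaving only the lines that exist in the VM list alone
    comm -23 "$sorted_vms" "$sorted_mon"

    rm -f "$sorted_vms" "$sorted_mon"
}
```

Called as `not_monitored all_vms.txt monitored.txt`, it prints one unmonitored name per line.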

Most people are aware that they actually have a handful of virtual machines in their environment that they really don’t want to monitor. It might be a temporary VM for a test, or a development system under construction. Whatever the reason, it can be perfectly valid not to monitor a system. The common denominator is usually that you know that you don’t want to monitor it.

The following approach gives you a way of telling what is not monitored in your virtual environment, while still allowing you to have the occasional test system running. What I advocate is an approach which is illegal in business, called “negative confirmation”: silence counts as a yes. Basically, you have to give an explanation, and make an active decision, if you do not want a virtual machine to be monitored. To accomplish this, I usually add a custom attribute called noMonitoring to the virtual machines in vCenter, where one writes a note if monitoring is not desired. If this field is empty, it implies that the system should be monitored.

Sounds simple, no?

Given environment:

  • VMware hypervisor (formerly known as ESXi)
  • VMware Virtual Center
  • A read-only user in vCenter, in my case “op5”
  • OP5, version 6.0.7 or higher
  • VMware vSphere SDK for Perl installed on your OP5 installation

In vCenter, set up a custom field called noMonitoring (Management->User defined Attributes->Add (Global attribute)). I usually also want to keep track of ownership, so I have added two more custom fields, ownerCustomer and ownerTech, so that I know which customer a VM belongs to and who is technically responsible for it.

You can now use this field to note that you don’t want a virtual machine to be monitored. My recommendation is to use the field as follows: if you leave it empty when you create a virtual machine, you intend for the VM to be monitored. If you write anything into it, even a single character, you mean for the virtual machine not to be monitored. The best way to keep track of the whole thing is to write a short description of why you don’t want the system to be monitored, for example “2013-05-20, LUM, demo system”. This way other people will know the reason.
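The rule is small enough to write down as code. The following is only an illustration of the convention; the function name is made up for this sketch and is not part of the scripts further down.

```shell
#!/bin/bash
# Sketch of the noMonitoring convention (hypothetical helper):
#   empty note -> the VM is meant to be monitored
#   any text   -> the VM is exempt, and the text is the reason
monitoring_intent() {
    note=$1
    if [ -z "$note" ]; then
        echo "monitor"
    else
        echo "exempt: $note"
    fi
}
```

For example, `monitoring_intent "2013-05-20, LUM, demo system"` reports the VM as exempt together with the reason, while `monitoring_intent ""` reports that it should be monitored.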

But, then, how do we get this information into OP5?

I have two scripts to do this:

  • getVMsAndCustomAttributes.pl
  • check_metaMonitoring_vmWare

The Perl script connects to a vCenter and reads out all virtual machines and a handful of attributes (noMonitoring is one of them). The fields are separated by a semicolon “;”.

Example:


root@op5-v005fry:/opt/plugins/kmg# ./getVMsAndCustomAttributes.pl --server=192.168.2.30 --username=op5 --password=op5
#vm;onHost;dataStore;noMonitoring;ownerCustomer;ownerTech
kmg-guran-0001;192.168.2.204;NFSProd;;;;
kmg-op5-0001;192.168.2.204;NFSProd,Synology02;2013-05-12, LUM, To be decommissioned;;;
kmg-zenLoadbalancer-0001;192.168.2.204;NFSDev,Synology02;2013-02-05, LUM, To be decommissioned;;;
kmg-web-0001;192.168.2.204;NFSDev,Synology02;;;;
kmg-web-0002;192.168.2.204;NFSDev,Synology02;;;;
kmg-jumphost-0002;192.168.2.204;NFSProd,Synology02;;asdf;;
kmg-sandbox-0003;192.168.2.204;NFSProd,Synology02;;;;
kmg-buildbox-0001;192.168.2.204;NFSDev,Synology02;LUM, To be decommissioned;;;
kmg-plex-0001;192.168.2.204;NFSProd,Synology02;;;;
kmg-winxp-0001;192.168.2.204;NFSDev;2012-01-12, Windows client, no monitoring;;;
kmg-op5-0004;192.168.2.204;NFSDev;2013-04-20, Quarantin, to be decommissioned when v6 works well in prod.;;;
kmg-sandbox-0005;192.168.2.204;NFSDev,Synology02;2012-10-01, LUM, To be decommissioned;;;
jira-v001fry;192.168.2.204;NFSDev,Synology02;;;;
proxy-v001fry;192.168.2.204;NFSProd,Synology02;2013-03-20, LUM, Under construction 4;;;
kmg-pfsense-0001;192.168.2.204;datastore1,Synology02;2012-12-20, Quarantin;;;
op5-v005fry;192.168.2.204;NFSProd;;;;
backup-v001fry;192.168.2.204;NFSProd,Synology02;2013-05-02, LUM, Under construction;Maggan;;
guran-v001fry;192.168.2.204;NFSProd,Synology02;2013-05-10, LUM, New server, Under construction 2;;;
vcenter-v001fry;192.168.2.204;NFSProd;;;;

Field number 4 represents my custom field “noMonitoring”.


In principle, I just have to check field number 4 of the output, and print field number 1 to get a decent list to check against my monitoring system.

root@op5-v005fry:/opt/plugins/kmg# ./getVMsAndCustomAttributes.pl --server=192.168.2.30 --username=op5 --password=op5 | awk -F";" ' $4 == "" {print $1}'
kmg-guran-0001
kmg-web-0001
kmg-web-0002
kmg-jumphost-0002
kmg-sandbox-0003
kmg-plex-0001
jira-v001fry
op5-v005fry
vcenter-v001fry
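
The same field can be read the other way around, to review which VMs are exempt and why. This is a small sketch, not part of the original scripts: the function name is hypothetical, and it expects the semicolon-separated listing on standard input.

```shell
#!/bin/bash
# Sketch: read the listing from getVMsAndCustomAttributes.pl on stdin and print
# "vm: reason" for every VM that has a note in field 4 (noMonitoring).
list_exemptions() {
    # Skip the "#vm;..." header line explicitly: its fourth field is the
    # column title "noMonitoring" and would otherwise count as a note.
    awk -F';' '!/^#/ && $4 != "" { printf "%s: %s\n", $1, $4 }'
}
```

Piping the Perl script's output into `list_exemptions` gives a quick overview of all current excuses, which is handy when cleaning up old “to be decommissioned” entries.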

To check this against my OP5 configuration, I just have to ask my monitoring system if the host is monitored. Had I used an older version of OP5, I would have done this either by using grep on /opt/monitor/etc/hosts.cfg (grep host_name /opt/monitor/etc/hosts.cfg | grep kmg-guran-0001 | wc -l) or by connecting to the merlin database and issuing a clever SQL query (no example).

But now we are on version 6, where Op5 nowadays uses MK Livestatus, which in itself deserves some attention. Long story short: instead of parsing text files or updating a database, MK Livestatus hooks into Nagios to keep track of the configuration and the status of the system. The benefit: less disk I/O. Asking your monitoring installation about more or less anything is now very easy; you communicate with MK Livestatus over a Unix socket. In this case, I will make an extremely simple query: give me the host name of a configured host that has the host name xxyy. For more inspiration, look here: http://mathias-kettner.de/checkmk_livestatus.html.

Example:

root@op5-v005fry:/opt/plugins/kmg# printf "GET hosts\nColumns: host_name host_address\nFilter: host_name = kmg-guran-0001\n" | unixcat /opt/monitor/var/rw/live
kmg-guran-0001;192.168.2.37
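
To reuse this query for any host name, it helps to split it into two steps: building the query text and sending it to the socket. The sketch below assumes unixcat and the socket path from the example above; both function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: build the Livestatus query for one host name. This step needs no
# socket, so it can be inspected on any machine.
build_host_query() {
    printf 'GET hosts\nColumns: host_name host_address\nFilter: host_name = %s\n' "$1"
}

# Send the query to Livestatus. Assumes unixcat is installed and the socket
# lives at the path used in the example above. An unknown host yields no
# output at all, which is what makes the "is it monitored?" test so simple.
query_host() {
    build_host_query "$1" | unixcat /opt/monitor/var/rw/live
}
```

A check can then be written as `[ -n "$(query_host some-host)" ]`: a non-empty answer means the host is configured in the monitoring system.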

We put this together into a check script, check_metaMonitoring_vmWare, which I use to keep track of unmonitored systems.

root@op5-v005fry:/opt/plugins/kmg# ./check_metaMonitoring_vmWare 2>/dev/null
WARN - H: 19 M: 7 !M: 12 ok!M: 8 nok!M: 4
Hosts:  kmg-plex-0001 jira-v001fry op5-v005fry vcenter-v001fry
| hosts=19 monitored=7 notMonitored=12 okNotMonitored=8 nokNotMonitored=4

I have added this as a service check to my installation (just add the command to checkcommands.cfg and add a service check to your vcenter host in your monitoring), and can see the following:

[Screenshot: meta monitoring - service check, http://www.kmggroup.ch/wp-content/uploads/2013/05/meta-monitoring-service-check.png]

In the output you can see the following:

  • H: 19 -> VMs in this installation
  • M: 8 -> Number of monitored VMs
  • !M: 11 -> Number of VMs that are not monitored
  • ok!M: 8 -> Unmonitored VMs that are OK (to not be monitored)
  • nok!M: 3 -> Not OK: this is what we try to catch, VMs that should be monitored.
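
The counters are related: H is the sum of M and !M, and !M is the sum of ok!M and nok!M. As a quick sanity check, the following sketch parses the performance data part of the plugin output and verifies both sums. The function name is hypothetical and the check is not part of the plugin itself.

```shell
#!/bin/bash
# Sketch: verify hosts = monitored + notMonitored and
# notMonitored = okNotMonitored + nokNotMonitored for one perfdata string.
counters_consistent() {
    perfdata=$1
    # Relies on default word splitting to walk the "key=value" pairs
    for pair in $perfdata; do
        case $pair in
            hosts=*)           hosts=${pair#*=} ;;
            monitored=*)       monitored=${pair#*=} ;;
            notMonitored=*)    not_monitored=${pair#*=} ;;
            okNotMonitored=*)  ok_not=${pair#*=} ;;
            nokNotMonitored=*) nok_not=${pair#*=} ;;
        esac
    done
    # The function succeeds only if both identities hold
    [ "$hosts" -eq $((monitored + not_monitored)) ] &&
    [ "$not_monitored" -eq $((ok_not + nok_not)) ]
}
```

With the performance data from the command-line run shown earlier, `counters_consistent "hosts=19 monitored=7 notMonitored=12 okNotMonitored=8 nokNotMonitored=4"` succeeds, since 19 = 7 + 12 and 12 = 8 + 4.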

What can you do to remedy this? You have two possibilities:

  • Add the VMs to your monitoring system
  • Add a comment in the “noMonitoring” field in your vCenter

Simple as that. I guess I have to add a few VMs to my monitoring now.

Here, the sweets:

[1] getVMsAndCustomAttributes.pl

#!/usr/bin/perl
## -----------------------------------------------
# Script: getVMsAndCustomAttributes.pl
# Author: magnus.luebeck@kmggroup.ch
# Date: 2013-05-20
#
# Description: This script will output a semicolon ";" separated
# list of VMs from a vCenter, together with the custom
# attributes:
# - noMonitoring - Empty field = VM should be monitored
# - noMonitoring - Non empty = good excuse for not monitoring
# - ownerCustomer
# - ownerTech
#
# Usage: ./getVMsAndCustomAttributes.pl --server=192.168.2.30 --username=USERNAME --password=PASSWORD
## Script inspired by/to large extent copied from Reuben Stump
## (rstump@vmware.com | http://www.virtuin.com)
## http://www.virtuin.com/2012/11/best-practices-for-faster-vsphere-sdk.html
## http://communities.vmware.com/docs/DOC-10220 /
## http://communities.vmware.com/servlet/JiveServlet/download/10220-4-24610/queryVMCustomField.pl
## and http://communities.vmware.com/message/519501
## -----------------------------------------------

use strict;
use warnings;

use VMware::VIRuntime;

Opts::parse();
Opts::validate();

Util::connect();

# Fetch all VirtualMachines from SDK, limiting the property set
my $vm_views = Vim::find_entity_views(view_type => "VirtualMachine",
properties => ['name', 'runtime.host', 'datastore', 'summary' ]) ||
die "Failed to get VirtualMachines: $!";

# Fetch all HostSystems from SDK, limiting the property set
my $host_views = Vim::find_entity_views(view_type => "HostSystem",
properties => ['name']) ||
die "Failed to get HostSystems: $!";

# Fetch all Datastores from SDK, limiting the property set
my $datastore_views = Vim::find_entity_views(view_type => "Datastore",
properties => ['name']) ||
die "Failed to get Datastores: $!";

# Create hash tables with key = entity.mo_ref.value
my %host_map = map { $_->get_property('mo_ref.value') => $_ } @{ $host_views || [] };
my %ds_map = map { $_->get_property('mo_ref.value') => $_ } @{ $datastore_views || [] };

#--- The correlation between a custom field ID and its name is only found in
#--- the customFields manager
my $sc = Vim::get_service_content();
my $customFieldsMgr = Vim::get_view( mo_ref => $sc->customFieldsManager );

# Create hash table with key = keyName => value
my %keys_map = map { $_->name => $_->key } @{ $customFieldsMgr->field || [] };

# Enumerate VirtualMachines
printf("#vm;onHost;dataStore;noMonitoring;ownerCustomer;ownerTech\n");
foreach my $vm ( @{$vm_views || []} ) {
    # Get HostSystem from the host map
    my $host_ref = $vm->get_property('runtime.host')->{'value'};
    my $host = $host_map{$host_ref};

    # Get array of datastore moref values
    my @ds_refs = map($_->{'value'}, @{$vm->get_property('datastore') || []});

    # Get array of datastore entities from the datastore map by slicing %ds_map
    my @datastores = @ds_map{@ds_refs};

    # Map the custom field values to a hash
    my %cVals = map { $_->key => $_->value } @{$vm->summary->customValue || []} ;

    my $noMonitoring = "";
    my $ownerCustomer = "";
    my $ownerTech = "";

    # Only overwrite the empty defaults if the VM actually has a value set
    $noMonitoring = $cVals{$keys_map{"noMonitoring"}} if (defined($cVals{$keys_map{"noMonitoring"}}));
    $ownerCustomer = $cVals{$keys_map{"ownerCustomer"}} if (defined($cVals{$keys_map{"ownerCustomer"}}));
    $ownerTech = $cVals{$keys_map{"ownerTech"}} if (defined($cVals{$keys_map{"ownerTech"}}));

    # The trailing ";" is kept so the output matches the example listing
    printf("%s;%s;%s;%s;%s;%s;\n",
        $vm->get_property('name'),
        $host->get_property('name'),
        join(',', map($_->get_property('name'), @datastores) ),
        $noMonitoring,
        $ownerCustomer,
        $ownerTech
    );

}

# Disable SSL hostname verification for vCenter self-signed certificate
BEGIN {
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
}

[2] check_metaMonitoring_vmWare

#!/bin/bash

## -----------------------------------------------
# Script: check_metaMonitoring_vmWare
# Author: magnus.luebeck@kmggroup.ch
# Date: 2013-05-20
#
# Description: This script will check if your VMs are monitored
# in your Op5-environment.
#
## -----------------------------------------------

this_dir=$(cd `dirname $0`;pwd)
live_path=$(awk '/broker_module.*live/ { print $NF}' /opt/monitor/etc/nagios.cfg)

thresholdWarning=0
thresholdCritical=10

OLD_IFS=$IFS
IFS='
'

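#--- NOTE: the published listing is damaged between "unixcat" and the first
#--- counter update. The query, the counter initialisation and the loop header
#--- below are a reconstruction from the examples earlier in this article;
#--- they are assumptions, not the original code.
checkHostExist(){
curHost=$1

#--- prints the host name if the host is configured, nothing otherwise
printf "GET hosts\nColumns: host_name\nFilter: host_name = %s\n" "$curHost" | unixcat $live_path
}

numHosts=0
numMonitoredHosts=0
numNotMonitoredHosts=0
numNotMonitoredWithGoodExcuseHosts=0
numNotMonitoredWithoutExcuseHosts=0
hostsToOutput=""

#--- vCenter address and credentials as in the example run; adjust to your environment
for line in $($this_dir/getVMsAndCustomAttributes.pl --server=192.168.2.30 --username=op5 --password=op5 | grep -v '^#')
do
hostName=$(echo "$line" | awk -F";" '{print $1}')
noMonitoring=$(echo "$line" | awk -F";" '{print $4}')
(( numHosts += 1 ))

result=$(checkHostExist "$hostName")

[ -n "$result" ] && { echo "$hostName is monitored" 1>&2 ; (( numMonitoredHosts += 1 )) ; }
[ -z "$result" ] && { echo "$hostName is NOT monitored" 1>&2 ; (( numNotMonitoredHosts += 1 )) ; }

#--- the secret sauce - noMonitoring field is empty -> should be monitored
[[ -n "$noMonitoring" && -z "$result" ]] && { echo " - But does not have to: $noMonitoring" 1>&2 ; (( numNotMonitoredWithGoodExcuseHosts += 1 )) ; }
[[ -z "$noMonitoring" && -z "$result" ]] && { echo " - Should be monitored" 1>&2 ; (( numNotMonitoredWithoutExcuseHosts += 1 )) ; hostsToOutput="$hostsToOutput $hostName" ; }

done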

[ $numNotMonitoredWithoutExcuseHosts -le $thresholdWarning ] && { retVal=0 ; retPrefix=OK ; }
[ $numNotMonitoredWithoutExcuseHosts -gt $thresholdWarning ] && { retVal=1 ; retPrefix=WARN ; }
[ $numNotMonitoredWithoutExcuseHosts -gt $thresholdCritical ] && { retVal=2 ; retPrefix=CRIT ; }

echo "$retPrefix - H: $numHosts M: $numMonitoredHosts !M: $numNotMonitoredHosts ok!M: $numNotMonitoredWithGoodExcuseHosts nok!M: $numNotMonitoredWithoutExcuseHosts"
[ -n "$hostsToOutput" ] && echo "Hosts: $hostsToOutput"
echo "| hosts=$numHosts monitored=$numMonitoredHosts notMonitored=$numNotMonitoredHosts okNotMonitored=$numNotMonitoredWithGoodExcuseHosts nokNotMonitored=$numNotMonitoredWithoutExcuseHosts"

exit $retVal
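
The threshold logic at the end of the check script can be tried out in isolation. The sketch below restates it as a single function with the same comparisons (OK up to and including the warning threshold, WARN above it, CRIT above the critical threshold). The function name is hypothetical.

```shell
#!/bin/bash
# Sketch: map the number of VMs that lack both monitoring and an excuse to a
# Nagios-style return code and prefix, mirroring the comparisons in the script.
state_for() {
    count=$1
    warn=$2
    crit=$3
    if [ "$count" -gt "$crit" ]; then
        echo "2 CRIT"
    elif [ "$count" -gt "$warn" ]; then
        echo "1 WARN"
    else
        echo "0 OK"
    fi
}
```

With the defaults from the script (warning 0, critical 10), a single unexcused VM is already enough for a warning, and the check only turns critical once more than ten of them pile up.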