Sawtooth – The power of a waveform!


Triangles are nice. They are robust, the strongest shape of them all. A triangle will also help you spot anomalies in contextually complex situations. Today we will use this shape to make sure that your backups are running properly, as well as showing you one of the amazing capabilities of the human brain; pattern recognition.

This is just one example of how you can use arithmetic on timestamps to get  more or less anything under control. Here is a good example of something I am trying to achieve today:

The triangle is simple, you know what to expect from it. And that, is the whole point of this blog entry.

Like in any good cooking show, I prepared the dish in beforehand. This is an example of what I can see in my OP5 monitoring. At this point it does not matter what the graph shows. Look at it for a few seconds, then answer the following questions:

  1. When did I have a problem with my backups (it did not run)?
  2. When could my monitoring system _not_ get any information from my backup system?

 

You see? You could answer these two questions. If you by any chance could not come up with the answers, you are either tired, or not really the target group of this blog. Without knowing anything about my system you easily could spot the exceptions in the pattern.

The graph shows the age in seconds of the last successful backup of my file share data. My backup policy is to make a backup (incremental, more about that in a different blog entry) every four hours. But even so, it doesn’t really matter what my backup schedule is. Given your inherited human skill of being able to recognize patterns, the two abnormalities just popped out in your face.

If you didn’t see it, and still find this blog interesting, the answer is: Just before midnight the 31st of whatever month it displays, my monitoring system could not gather this data (empty spot in the graph). And, all by a sudden, just before midnight between the 1st and the second, my backups stopped running (or failed, remember the graph shows the age of the last successful backup).

And now over to the long, interesting explanation on how I got there.

THE SETUP

I am using rsnapshot for my backups. There are several reasons and considerations behind this, but the interesting point is that I really want to know that this is working (disclaimer: this type of monitoring does not guarantee anything), and my implementation outputs logs into /var/log/rsnapshot.log where a successful backup looks like this:

[ccne]

[14/Aug/2012:16:27:54] /usr/bin/rsnapshot hourly: completed successfully

[/ccne]

So, basically, since I am interested in the age of the last successful backup, I can simply filter the logfile for this (grep), use the last line of the output (tail -1), and get the timestamp (awk, tr -d”[]”).

[ccne]

cur_output=$(grep “successfully” /var/log/rsnapshot.log | tail -1 | awk ‘{print $1}’ | tr -d “[]” )

[/ccne]

This, of course, has to be scrubbed a bit, since the output is a timestamp that is not really machine readable. And, it is a timestamp, not an age. I know, my way of doing it is a little bit complicated. But it is the way I learned to do it many years ago, and it is hard to teach an old dog how to sit.

[ccne]

#— the timestamp is in a really weird format; [14/Aug/2012:16:27:54]
curDate=$(echo $cur_output | awk -F”:” ‘{print $1}’)
curTime=$(echo $cur_output | awk -F”:” ‘{print $2″:”$3″:”$4}’)

#— split the date part into day month year
echo $curDate | awk -F”/” ‘{print $1, $2, $3}’ | read curDay curMonth curYear

#— get the age of the last successful backup
#— %s returns the number of seconds since 1.1.1970, epoc
lastBackupTime=$(date -d “$curMonth $curDay $curYear $curTime” “+%s”)
nowTime=$(date “+%s”)
(( lastOkBackupAge=$nowTime – $lastBackupTime ))

[/ccne]

In principle what I do, is to convert the time to unix_timestamp (seconds from epoc, 1st of January, 1970), then subtract this from the current time. This gives me the number of seconds that has passed since the last successful backup until now. I do this through OP5/Nagios every 5 minutes, and the backups are supposed to run every 4 hours. In between backups, the output of my script will show an increasing age of the last successful backup, until just after a new backup, where the age is close to 0 seconds old.

So, the configuration for the whole setup is done on:

  • OP5 server
  • The backup server

On the OP5 server, checkcommands.cfg:

[ccne]

# command ‘kmgBackup’
define command{
command_name kmgBackup
command_line $USER1$/check_nrpe -H kmg-sandbox-0003 -c kmg_backup
}

[/ccne]

On the OP5 server, services.cfg:

[ccne]

# service ‘Rsnapshot backup’
define service{
use default-service
host_name kmg-sandbox-0003
service_description Rsnapshot backup
check_command kmgBackup
}

[/ccne]

On the backup server, /etc/nrpe.d/kmg_commands.cfg (any filename ending with .cfg will do):

[ccne]

malu@kmg-sandbox-0003:/etc/nrpe.d $cat kmg_commands.cfg
#— kmg backup

command[kmg_backup]=/app/prd/op5/bin/checkBackup

[/ccne]

And, at last, the script that checks the backups:

[ccne lines=”-1″]

malu@kmg-sandbox-0003:/app/prd/op5/bin $cat /app/prd/op5/bin/checkBackup
#!/bin/ksh

logFile=/var/log/rsnapshot.log
logHost=kmg-sandbox-0003.localdomain

getLastComplete() {

#— We are looking for lines like this. Only the timestamp of the last one is interesting
#— [14/Aug/2012:16:27:54] /usr/bin/rsnapshot hourly: completed successfully
cur_output=$(grep “successfully” /var/log/rsnapshot.log | tail -1 | awk ‘{print $1}’ | tr -d “[]” )

#— the timestamp is in a really weird format; [14/Aug/2012:16:27:54]
curDate=$(echo $cur_output | awk -F”:” ‘{print $1}’)
curTime=$(echo $cur_output | awk -F”:” ‘{print $2″:”$3″:”$4}’)

#— split the date part into day month year
echo $curDate | awk -F”/” ‘{print $1, $2, $3}’ | read curDay curMonth curYear

#— get the age of the last successful backup
#— %s returns the number of seconds since 1.1.1970, epoc
lastBackupTime=$(date -d “$curMonth $curDay $curYear $curTime” “+%s”)
nowTime=$(date “+%s”)
(( lastOkBackupAge=$nowTime – $lastBackupTime ))

#— echo the age of the last successful backup, in seconds
echo $lastOkBackupAge
}
backupAge=$(getLastComplete)

#— hard coded crit and warn
# – 18000 seconds is 5 hours
# – 86400 seconds is 24 hours
retMessage=”OK”
returnCode=0
[ $backupAge -gt 18000 ] && {
retMessage=”WARN”
returnCode=1
}

[ $backupAge -gt 86400 ] && {
retMessage=”CRIT”
returnCode=2
}

echo “$retMessage – Backup $backupAge seconds old”
echo “| backupAge=$backupAge”

[/cc]

Related Posts