Introduction
One fine day it happened: Nagios failed to alert us about a server going down. One of the Windows servers (what else) rebooted due to an unknown cause (what else). It happened so fast that the outage fell almost entirely between two of the 'ping' checks Nagios sends every five minutes to verify the system is up. This is quite a rare case: only a single Nagios 'ping' check failed. Because the 'ping' check is set to re-test two more times at one-minute intervals to avoid false alerts, Nagios recorded one failure but never sent the necessary notification.
Clearly, polling with 'ping' is not perfect, so a better way to catch these pesky 'secret' Windows reboots is to make the servers send SNMP traps. Now, at least we will know for sure when they come back up. ;-)
Plugin Design
The following examples have been developed and verified under Nagios 3.0.6 running on SuSE Linux Enterprise Server 10, receiving traps from Windows 2003 Server and Windows XP clients. Nagios had been installed into /home/app/nagios. This path is used in all examples below; please adjust it to your [nagioshome].
The 'Sending' part: Generating SNMP traps from Windows
On the Windows server, we need to have the SNMP service installed. It is available in the normal Windows package (Add/Remove Windows Components) under Management and Monitoring Tools. Once installed, we go to "Start->Settings->Control Panel->Administrative Tools->Services->SNMP Service->Properties". I assume SNMP read access is already set up, so for now we are only interested in SNMP traps. First we go to the "Traps" tab. Following good practice, we configure a dedicated trap community (different from public) and add the IP address of the SNMP trap destination server there.
Now we can start sending our first test traps. Stopping and starting the Windows SNMP service will generate some. Let's check which traps were sent and whether they are received on our trap sink server, using tcpdump:
# tcpdump -s 0 -X udp port 162
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:31:11.189693 IP 192.168.203.140.capioverlan > susie112.frank4dd.com.snmptrap: C=SECtrap Trap(31) E:311.1.1.3.1.1 192.168.203.140 coldStart 0
0x0000: 4500 004b 07f1 0000 7f11 eb08 0afd cb8c E..K............
0x0010: 0afd 6722 047b 00a2 0037 f289 302d 0201 ..g".{...7..0-..
0x0020: 0004 074e 424e 7472 6170 a41f 060c 2b06 ...SECtrap....+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0002 0100 4301 0030 00 ......C..0.
10:31:26.227627 IP 192.168.203.140.capioverlan > susie112.frank4dd.com.snmptrap: C=SECtrap Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1532 interfaces.ifTable.ifEntry.ifIndex.1=1
0x0000: 4500 005d 07f2 0000 7f11 eaf5 0afd cb8c E..]............
0x0010: 0afd 6722 047b 00a2 0049 503d 303f 0201 ..g".{...IP=0?..
0x0020: 0004 074e 424e 7472 6170 a431 060c 2b06 ...SECtrap.1..+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0302 0100 4302 05fc 3011 300f 060a ......C...0.0...
0x0050: 2b06 0102 0102 0201 0101 0201 01 +............
10:31:26.229296 IP 192.168.203.140.capioverlan > susie112.frank4dd.com.snmptrap: C=SECtrap Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1538 interfaces.ifTable.ifEntry.ifIndex.2=2
0x0000: 4500 005d 07f3 0000 7f11 eaf4 0afd cb8c E..]............
0x0010: 0afd 6722 047b 00a2 0049 4f36 303f 0201 ..g".{...IO60?..
0x0020: 0004 074e 424e 7472 6170 a431 060c 2b06 ...SECtrap.1..+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0302 0100 4302 0602 3011 300f 060a ......C...0.0...
0x0050: 2b06 0102 0102 0201 0102 0201 02 +............
10:31:26.229692 IP 192.168.203.140.capioverlan > susie112.frank4dd.com.snmptrap: C=SECtrap Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1538 interfaces.ifTable.ifEntry.ifIndex.3=3
0x0000: 4500 005d 07f4 0000 7f11 eaf3 0afd cb8c E..]............
0x0010: 0afd 6722 047b 00a2 0049 4e35 303f 0201 ..g".{...IN50?..
0x0020: 0004 074e 424e 7472 6170 a431 060c 2b06 ...SECtrap.1..+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0302 0100 4302 0602 3011 300f 060a ......C...0.0...
0x0050: 2b06 0102 0102 0201 0103 0201 03 +............
4 packets captured
We can see that 4 traps were sent when the Windows SNMP service was started. The first trap packet is a 'coldStart' notification; the following 3 are 'linkUp' notifications, one for each available network interface (including 127.0.0.1).
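When the capture gets longer than a handful of packets, counting the trap types by eye becomes tedious. The following is only a small sketch, assuming the tcpdump text output was saved to a file and that the trap names appear as separate words exactly as printed above; the function name summarize_traps is made up for this example.

```shell
# summarize_traps: count the generic trap types seen in tcpdump text output.
# Reads tcpdump lines on stdin; the function name is hypothetical.
summarize_traps() {
    awk '
        {
            # tcpdump prints the trap type as its own word on the summary line
            for (i = 1; i <= NF; i++)
                if ($i == "coldStart" || $i == "linkUp")
                    count[$i]++
        }
        END {
            printf "coldStart=%d linkUp=%d\n", count["coldStart"], count["linkUp"]
        }'
}
```

Fed with the capture shown above (for example 'summarize_traps < /tmp/traps.txt', where the file name is just a placeholder), this should report coldStart=1 linkUp=3.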
The 'Receiving' part: Picking up the SNMP traps using the 'snmptrapd' daemon
For our purpose of testing and receiving traps from Windows systems, we add two MIB files to the library in /usr/share/snmp/mibs. The file MSFT.txt describes the Windows OID tree, while TRAP-TEST-MIB.txt will help us to generate a test trap later.
# vi /usr/share/snmp/mibs/MSFT.txt
MSFT-MIB DEFINITIONS ::= BEGIN
IMPORTS
enterprises
FROM RFC1155-SMI;
microsoft OBJECT IDENTIFIER ::= { enterprises 311 }
software OBJECT IDENTIFIER ::= { microsoft 1 }
systems OBJECT IDENTIFIER ::= { software 1 }
os OBJECT IDENTIFIER ::= { systems 3 }
windowsNT OBJECT IDENTIFIER ::= { os 1 }
windows OBJECT IDENTIFIER ::= { os 2 }
workstation OBJECT IDENTIFIER ::= { windowsNT 1 }
server OBJECT IDENTIFIER ::= { windowsNT 2 }
dc OBJECT IDENTIFIER ::= { windowsNT 3 }
END
# vi /usr/share/snmp/mibs/TRAP-TEST-MIB.txt
TRAP-TEST-MIB DEFINITIONS ::= BEGIN
IMPORTS ucdExperimental FROM UCD-SNMP-MIB;
demotraps OBJECT IDENTIFIER ::= { ucdExperimental 990 }
demo-trap TRAP-TYPE
STATUS current
ENTERPRISE demotraps
VARIABLES { sysLocation }
DESCRIPTION "This is just a demo"
::= 17
END
Next, we configure the 'snmptrapd' daemon. Although the daemon comes with the SNMP daemon package and is installed in /usr/sbin, no startup script has been put into /etc/init.d. Fortunately, there is a template in /usr/share/doc/packages/net-snmp.
# cp /usr/share/doc/packages/net-snmp/rc.snmptrapd /etc/init.d/snmptrapd
# vi /etc/init.d/snmptrapd
OPTIONS="-On -p /var/run/snmptrapd.pid -M /usr/share/snmp/mibs -m ALL"
change:
startproc $SNMPTRAPD $OPTIONS -c /etc/snmptrapd.conf -Lf /var/log/net-snmpd.log
to:
startproc $SNMPTRAPD $OPTIONS -c /etc/snmp/snmptrapd.conf -Lf /var/log/net-snmpd.log
Now we create the configuration file for the 'snmptrapd' daemon. We define the trap community for simple access control and we add a trap handler 'default' to handle all traps by a test script we are going to create. Then we enable and start 'snmptrapd' through yast->system->system services (runlevel)-> enable snmptrapd for runlevel 2 3 5.
# vi /etc/snmp/snmptrapd.conf
# --------------------------------------------------------------------------- #
# snmptrapd.conf: #
# configuration file for configuring the ucd-snmp snmptrapd agent. #
# ----------------------------------------------------------------------------#
# first, we define the access control
authCommunity log,execute,net SECtrap
# next , the trap handlers
traphandle default /tmp/snmptraptest.sh
# END of snmptrapd.conf ---------------------------------------------------- #
The 'Testing' part: Learning to send, receive and filter SNMP traps
Let's create a simple test script snmptraptest.sh that writes all received SNMP traps into a log file.
# vi /tmp/snmptraptest.sh
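#!/bin/sh
# Write every trap handed over by snmptrapd into $TESTLOG.
TESTLOG=/tmp/traptest.log
vars=
# snmptrapd passes the host name first, then the transport address,
# followed by one "OID value" pair per line.
read host
read ip
while read oid val; do
    if [ "$vars" = "" ]; then
        vars="$oid = $val"
    else
        vars="$vars, $oid = $val"
    fi
done
# Create the log file if it does not exist yet.
if [ ! -e "$TESTLOG" ]; then
    touch "$TESTLOG"
fi
echo trap: $1 $host $ip $vars >> "$TESTLOG"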
We are ready for our first test from the local system, using the 'snmptrap' command, verifying the traps are received and processed by our test script. Also notice the use of the TRAP-TEST-MIB we generated.
# snmptrap -v 2c -c SECtrap 127.0.0.1 "" TRAP-TEST-MIB::demo-trap SNMPv2-MIB::sysLocation.0 s "here"
# cat /tmp/traptest.log
trap: localhost UDP: [127.0.0.1]:42706 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:4:53:38.72, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
trap: localhost UDP: [127.0.0.1]:42706 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:4:53:38.72, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
Well, we really are receiving traps, but why are we getting them twice? Let's check if our 'snmptraptest.sh' script is called twice. We can change the last line writing the output to include a random string and give it another try.
# vi /tmp/snmptraptest.sh
change:
echo trap: $1 $host $ip $vars >> $TESTLOG
to:
echo `/usr/bin/openssl rand -base64 20` trap: $1 $host $ip $vars >> $TESTLOG
# snmptrap -v 2c -c SECtrap 127.0.0.1 "" TRAP-TEST-MIB::demo-trap SNMPv2-MIB::sysLocation.0 s "here"
# cat /tmp/traptest.log
vRgoIkp7Y/66EyxK6fETsR7lqhY= trap: localhost UDP: [127.0.0.1]:58476 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:20:16:06.35, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
aRsf084ZC/fcJqeOCjFRH/SCNdI= trap: localhost UDP: [127.0.0.1]:58476 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:20:16:06.35, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
Voilà, the random string is different each time, so the script is indeed being called twice! Further investigation shows that 'snmptrapd' is compiled with the default configuration file path already set to '/etc/snmp/snmptrapd.conf'. Setting it explicitly with the '-c' option in '/etc/init.d/snmptrapd' causes the file to be read twice, so the trap handler is registered and executed twice. Feature or bug? No matter, we need to remove the '-c' option from '/etc/init.d/snmptrapd'. Re-test, check, problem solved.
# vi /etc/init.d/snmptrapd
change:
startproc $SNMPTRAPD $OPTIONS -c /etc/snmp/snmptrapd.conf -Lf /var/log/net-snmpd.log
to:
startproc $SNMPTRAPD $OPTIONS -Lf /var/log/net-snmpd.log
Now that we are able to reliably receive SNMP traps, it's time to be selective about them. This is achieved by defining an explicit snmpTrapOID value match in '/etc/snmp/snmptrapd.conf'. Let's say we only care about the Windows 'coldStart' traps: our match is then the trap carrying the oid=value pair 'SNMPv2-MIB::snmpTrapOID.0 = SNMPv2-MIB::coldStart'. We restart the Windows SNMP service once more and verify that the trap data is received. This time we recorded just a single trap in '/tmp/traptest.log'.
# vi /etc/snmp/snmptrapd.conf
change:
traphandle default /tmp/snmptraptest.sh
to:
traphandle SNMPv2-MIB::coldStart /tmp/snmptraptest.sh
# /etc/init.d/snmptrapd restart
# cat /tmp/traptest.log
trap: 192.168.203.140 UDP: [192.168.203.140]:1074 DISMAN-EVENT-MIB::sysUpTimeInstance = 0:0:00:00.00, SNMPv2-MIB::snmpTrapOID.0 = SNMPv2-MIB::coldStart, SNMP-COMMUNITY-MIB::snmpTrapAddress.0 = 192.168.203.140, SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 = "SECtrap", SNMPv2-MIB::snmpTrapEnterprise.0 = MSFT-MIB::workstation
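The same selection could also be made inside a handler script, which is handy when one script should react differently to several trap types. Here is a minimal sketch, assuming the handler input has the layout our test script already relies on (host name, transport address, then "OID value" lines) and that OIDs arrive either as symbolic names or as the numeric form of snmpTrapOID.0; the function name parse_trap is made up for this example.

```shell
# parse_trap: read snmptrapd handler input on stdin and print
# "<host> <value of snmpTrapOID.0>". The function name is hypothetical.
parse_trap() {
    read -r host            # first line: host name of the sender
    read -r addr            # second line: transport address, not used here
    trapoid=unknown
    while read -r oid val; do
        case "$oid" in
            # symbolic name, or the numeric OID if names are not translated
            *snmpTrapOID.0|.1.3.6.1.6.3.1.1.4.1.0) trapoid=$val ;;
        esac
    done
    printf '%s %s\n' "$host" "$trapoid"
}
```

A handler built on this could compare the second field against 'SNMPv2-MIB::coldStart' and ignore everything else.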
The 'Translating' part: Converting the SNMP traps into Nagios format and sending them to Nagios
Nagios can be set to receive and process data sent from external programs. Let's verify the related directives are enabled and set in the Nagios configuration file:
# egrep 'check_external_commands|command_check_interval|command_file' nagios.cfg
check_external_commands=1
#command_check_interval=-1
command_check_interval=5s
command_file=/home/app/nagios/var/rw/nagios.cmd
# grep accept_passive /home/app/nagios/etc/nagios.cfg
accept_passive_service_checks=1
accept_passive_host_checks=1
The Nagios data-receiving part is the named pipe '/home/app/nagios/var/rw/nagios.cmd'. The format of the Nagios event to send to '/home/app/nagios/var/rw/nagios.cmd' is:
[Unix Timestamp] Message Descriptor;host name;service-name;severity-code;text data
Example: [1141163054] PROCESS_SERVICE_CHECK_RESULT;susie112;check_trap_susie112;1;Trap test data
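To avoid typing this format by hand every time, the event string can be assembled by a small helper. This is just a sketch; the function name nagios_event and its argument order are my own choice for illustration. The severity code follows the usual Nagios plugin return codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```shell
# nagios_event: print one passive service check result line.
# $1 = Unix timestamp, $2 = host name, $3 = service name,
# $4 = severity code (0-3), $5 = text data. The function name is hypothetical.
nagios_event() {
    case "$4" in
        0|1|2|3) ;;
        *) echo "nagios_event: invalid severity code '$4'" >&2; return 1 ;;
    esac
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

Called as 'nagios_event 1141163054 susie112 check_trap_susie112 1 "Trap test data"', it prints exactly the example line shown above.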
We can now send a test event to Nagios to see if it is received properly:
# echo "`date +[%s]` PROCESS_SERVICE_CHECK_RESULT;testserver;check_trap_test;1;test" > /home/app/nagios/var/rw/nagios.cmd
# tail /home/app/nagios/var/nagios.log | grep trap
[1224133947] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;testserver;check_trap_test;1;test
[1224133947] Warning: Passive check result was received for service 'check_trap_test' on host 'testserver', but the host could not be found!
It is time to think of a program that translates our SNMP trap into a Nagios event and sends it to Nagios through its command file. We want this trap service for Windows reboots associated with each Nagios host in order to allow for a separate notification to the appropriate host support team. We also want the severity code set to warning, but avoid confirmation by hand. Instead we want the event to be cleared quickly to OK state, and no notification should go out for this auto-confirmation.
The association with a Nagios host requires us to derive the correct host name from the trap IP. The auto-confirmation is made by a second, slightly delayed event submission with severity code '0'. The notification for OK is disabled in the service template. I wrote this program, named it send_trap_data.pl, and put it into my nagios-home/libexec directory. If the DEBUG option is set to 1, the program writes some parameters and the submitted Nagios events into a temp file. Let's enable 'send_trap_data.pl' to start processing incoming SNMP traps for Nagios:
# vi /etc/snmp/snmptrapd.conf
change:
traphandle SNMPv2-MIB::coldStart /tmp/snmptraptest.sh
to:
# traphandle SNMPv2-MIB::coldStart /tmp/snmptraptest.sh
traphandle SNMPv2-MIB::coldStart /home/app/nagios/libexec/send_trap_data.pl
# /etc/init.d/snmptrapd restart
# cat /tmp/test3
trapline >proxyjp02.frank4dd.com
UDP: [192.168.100.184]:12380
DISMAN-EVENT-MIB::sysUpTimeInstance 0:0:00:00.00
SNMPv2-MIB::snmpTrapOID.0 SNMPv2-MIB::coldStart
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.100.184
SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "SECtrap"
SNMPv2-MIB::snmpTrapEnterprise.0 MSFT-MIB::server
<
traphost >proxyjp02.frank4dd.com<
snmpname >SNMPv2-MIB::sysName.0 = STRING: JPNHOMG035<
hostname >winserver03<
eventstr >[1224478743] PROCESS_SERVICE_CHECK_RESULT;winserver03;check_trap_coldstart;1;System *reboot* or SNMP service restarted.<
Wrote eventstr to /home/app/nagios/var/rw/nagios.cmd
eventstr >[1224478743] PROCESS_SERVICE_CHECK_RESULT;winserver03;check_trap_coldstart;0;System *reboot* or SNMP service restarted. auto-OK<
Wrote eventstr to /home/app/nagios/var/rw/nagios.cmd
End of send_trap_data.pl.
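The two 'eventstr' lines in the debug output show the warning followed by the auto-OK. For illustration only, the same two-step submission can be sketched in a few lines of shell. This is not the send_trap_data.pl program: the function name, the delay parameter and its default are assumptions, and the lookup of the Nagios host name from the trap IP is left out, so the host name is simply passed in.

```shell
# submit_reboot_event: write a WARNING result and then an OK result for the
# same service, so the event shows up in Nagios and clears itself again.
# $1 = Nagios host name
# $2 = Nagios command file (any writable file works for a dry run)
# $3 = delay in seconds between both submissions (hypothetical, default 5)
submit_reboot_event() {
    host=$1
    cmdfile=$2
    delay=${3:-5}
    service=check_trap_coldstart
    text='System *reboot* or SNMP service restarted.'
    # first submission: severity 1 (WARNING) triggers the notification
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;1;%s\n' \
        "$(date +%s)" "$host" "$service" "$text" >> "$cmdfile"
    sleep "$delay"
    # second submission: severity 0 (OK) clears the event without manual work
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;0;%s auto-OK\n' \
        "$(date +%s)" "$host" "$service" "$text" >> "$cmdfile"
}
```

For a dry run, point the second argument at a scratch file instead of the Nagios command file and inspect the two lines it produces.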
The 'Processing' part: Displaying and notifying SNMP trap generated events with Nagios
Here, we define a service template and add services to it. Depending on how many different notifications we need to generate, we need to separate the actual services.
# vi /home/app/nagios/etc/nagios.cfg
# passive service check for SNMP traps
cfg_file=/home/app/nagios/etc/objects/trap-services-template.cfg
cfg_file=/home/app/nagios/etc/objects/trap-services.cfg
# vi /home/app/nagios/etc/objects/trap-services-template.cfg
##############################################################################
# Define a servicegroup for SNMP trap service checks
# All SNMP trap service checks will be members of this group
##############################################################################
define servicegroup{
servicegroup_name snmptrap-checks ; The name of the servicegroup
alias SNMP Trap Services ; Long name of the group
}
##############################################################################
# Define the SNMP trap check template service
##############################################################################
define service{
name generic-trap
active_checks_enabled 0 ; traps are only passive checks
passive_checks_enabled 1 ; yes, check passive
parallelize_check 1 ; yes, please
obsess_over_service 0 ; we don't run extra commands
check_freshness 0 ; don't check for freshness
notifications_enabled 1 ; send notifications
event_handler_enabled 1 ; yes, but we have none
flap_detection_enabled 0 ; with auto-OK, we don't
failure_prediction_enabled 1 ; dependency checks
process_perf_data 0 ; don't send this to perfdata
retain_status_information 1 ; yes, once auto-OK'ed, keep it
retain_nonstatus_information 1
is_volatile 1 ; enable for passive checks
check_period 24x7 ; always check for submissions
max_check_attempts 1 ; one trap is enough
normal_check_interval 1
retry_check_interval 1
contact_groups frankonly
notification_options w ; notify for warnings only
notification_interval 120 ; notify every 2 hrs
notification_period 24x7 ; always notify
register 0 ; template, don't register
servicegroups snmptrap-checks
check_command check_none ; we do not run any checks
}
# vi /home/app/nagios/etc/objects/trap-services.cfg
##############################################################################
# Receive SNMP traps for windows boot events via eventhandler scripts
##############################################################################
define service {
use generic-trap
host_name winserver03
name check_trap_coldstart
service_description check_trap_coldstart
}
##############################################################################
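With more than a few Windows servers, writing one service block per host by hand gets repetitive. As a sketch, the blocks could be generated from a list of host names; the function name gen_trap_services is made up for this example, and the host names must of course match hosts already defined in Nagios.

```shell
# gen_trap_services: print one check_trap_coldstart service definition
# per host name given as argument. The function name is hypothetical.
gen_trap_services() {
    for h in "$@"; do
        printf 'define service {\n'
        printf '    use                  generic-trap\n'
        printf '    host_name            %s\n' "$h"
        printf '    service_description  check_trap_coldstart\n'
        printf '}\n'
    done
}
```

The output of, for example, 'gen_trap_services winserver03 winserver04' can be reviewed and then appended to trap-services.cfg.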
Credits, copyrights, original scripts etc.
- The updated script send_trap_data.pl, version 1.2
- The older versions v1.1 and v1.0, just in case.
- SNMP Trap Handling with Nagios, by Francois Meehan
- Nagios and the Nagios community at http://www.nagios.org/
- The Nagios documentation about passive checks
- SUSE Linux and SLES10 are products and trademarks of Novell, Inc. http://www.suse.com/
- Further Nagios documentation is available here http://nagios.fm4dd.com/docs/en/