Performance History Graphs for Nagios with Nagiosgraph

Frank4DD, @2009

Introduction

For us, Nagiosgraph is the most important extension to our monitoring system. Before Nagiosgraph, we relied entirely on Cacti for performance monitoring. Since we implemented Nagiosgraph in 2008, Nagios itself covers most of the performance graphing. We benefit from this integration through quick lookups of historic performance data in a single system: no additional login to a second application, and no manual search and mapping of devices between Nagios and Cacti. We still retain Cacti, but it has lost most of its importance. Today Nagiosgraph, running unchanged since 2008, easily handles 1283 active devices with a total of 3740 individual graphs and is almost entirely 'maintenance-free'.

There are several packages available for Nagios graphing. At the time of implementation in 2008, the Nagiosgraph package (v0.9.1) was not well maintained and contained a serious bug: the show.cgi program completely failed to generate and display the device graph tree if a device name started with a digit. If, say, your network team named a switch '2nd-Cat2960' or a router '3725-east', that triggered it. After fixing this bug in the function setOptionText(element), by prefixing the JavaScript variable name with the string 'host_', Nagiosgraph worked as expected.

Why bother with Nagiosgraph when other packages exist? Nagiosgraph integrates with Nagios while remaining fully independent of it. It is written in Perl, so fixes are easy to make. It doesn't need a database and keeps everything in files. New graphs are generated automatically if a matching map entry exists. We do not need to define new systems or services in Nagiosgraph, which is a huge time saver.

Outdated Graphs

The drawback of automation is that graph management is not tied to the Nagios configuration. If a device is removed from Nagios, Nagiosgraph still retains the old graphs. This is easily fixed with a monthly routine job that spots these 'dead' graphs, which are no longer being updated, and removes them. Here is a quick way to delete outdated Nagiosgraph RRD database files from the command line:

First, we find all RRD files that have not been modified for more than 60 days and delete them, saving their names to a text file for reference.

susie:/srv/app/nagiosgraph/rrd # find /srv/app/nagiosgraph/rrd -name '*.rrd' -mtime +60 -exec ls -l {} ";" -exec rm {} ";" > deleted-rrd-list.txt
susie:/srv/app/nagiosgraph/rrd #
susie:/srv/app/nagiosgraph/rrd # head -5 deleted-rrd-list.txt
-rw-r--r-- 1 nagios nagios 47712 2010-04-30 19:10 /srv/app/nagiosgraph/rrd/Cat3750-P/check%2Dhost%2Dalive___ping.rrd
-rw-r--r-- 1 nagios nagios 71240 2010-04-30 19:08 /srv/app/nagiosgraph/rrd/Cat3750-P/load%2Dcheck___cpu.rrd
-rw-r--r-- 1 nagios nagios 71240 2010-04-30 19:09 /srv/app/nagiosgraph/rrd/Cat3750-P/memory%2Dcheck___memory.rrd
-rw-r----- 1 nagios nagios 47712 2014-02-07 17:30 /srv/app/nagiosgraph/rrd/akasaka1/check%2Dhost%2Dalive___ping.rrd
-rw-r----- 1 nagios nagios 71240 2014-02-07 17:26 /srv/app/nagiosgraph/rrd/akasaka1/memory%2Dcheck___memory.rrd

Second, we search for RRD host directories that are now empty, delete them, and again save their names to a text file for reference.

susie:/srv/app/nagiosgraph/rrd # find /srv/app/nagiosgraph/rrd -type d -empty -print -delete > deleted-folder-list.txt
susie:/srv/app/nagiosgraph/rrd # head -5 deleted-folder-list.txt
/srv/app/nagiosgraph/rrd/komaki1812
/srv/app/nagiosgraph/rrd/nagoya2950-6
/srv/app/nagiosgraph/rrd/kyushu2950-1
/srv/app/nagiosgraph/rrd/mholwm03
/srv/app/nagiosgraph/rrd/kyushu2950-10

In both commands above, leaving out the deleting part (-exec rm {} ";" in the first, -delete in the second) gives a dry run that only lists the matches, so we can make sure no important files would be deleted.
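A dry run of both steps could look like the following minimal sketch. It assumes GNU find, the same RRD directory and the 60-day threshold used above; the variable names RRD_DIR and MAX_AGE_DAYS and the two function names are only illustrative and not part of Nagiosgraph.

```shell
#!/bin/sh
# Dry-run sketch: list what the two cleanup commands would remove, deleting nothing.
# RRD_DIR and MAX_AGE_DAYS are illustrative names; defaults mirror the example above.
RRD_DIR="${RRD_DIR:-/srv/app/nagiosgraph/rrd}"
MAX_AGE_DAYS="${MAX_AGE_DAYS:-60}"

# Step 1 without "-exec rm": print RRD files not modified for more than $2 days.
list_stale_rrds() {
    find "$1" -name '*.rrd' -mtime +"$2" -print
}

# Step 2 without "-delete": print directories that are empty right now.
# Directories that would only become empty after step 1 really deletes
# files are not listed by this dry run.
list_empty_host_dirs() {
    find "$1" -type d -empty -print
}

# Only run against the default location if it actually exists.
if [ -d "$RRD_DIR" ]; then
    list_stale_rrds "$RRD_DIR" "$MAX_AGE_DAYS"
    list_empty_host_dirs "$RRD_DIR"
fi
```

Once the listed names look right, the original commands with -exec rm and -delete can be run unchanged.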

Nagiosgraph Example

Nagiosgraph Live

While Nagiosgraph is great for the large-scale installs I had to handle, PNP4Nagios is another popular graphing package for Nagios. Although not as 'simple' as Nagiosgraph, it has features that allow more fine-grained configuration, produces more beautiful graphs, and exports to PDF. I do not have data on how PNP4Nagios behaves in large installations. I have been running Nagiosgraph with over 5,000 RRDs on a single server.

Most used Nagiosgraph Types

# ls -l nagiosgraph/rrd/ | wc -l --> 1284
# find nagiosgraph/rrd -name '*.rrd' | wc -l --> 3744

#	graph type	graph count
01	check_ping	1281
02	load_check_cisco	750
03	load_check_linux	6
04	load_check_windows	124
05	session_check_netscreen	1
06	memory_check_netscreen	1
07	memory_check_cisco	750
08	memory_check_linux	6
09	memory_check_asa	8
10	memory_check_windows	182
11	disk_check_smb	16
12	disk_check_unix	227
13	disk_check_windows	162
14	local_check_procs	1
15	web_check_access	21
16	web_check_load	12
17	port_check_tcp	40
18	health_check_temp	3
19	nw_check_bandwidth	126
20	app_check_users	7
21	service_check_ntp	1
22	app_check_smtp	2
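Similar counts can be approximated straight from the RRD tree. The sketch below assumes the host/service___database.rrd layout visible in the file listing earlier, with %2D standing for a hyphen in the service name. It groups by Nagios service name, which is not necessarily the same label as the graph type names in the table above. The function name count_rrds_per_service and the RRD_DIR variable are illustrative.

```shell
#!/bin/sh
# Sketch: count RRD files per Nagios service name.
# Assumes files are named <service>___<database>.rrd, one directory per host.
RRD_DIR="${RRD_DIR:-nagiosgraph/rrd}"

count_rrds_per_service() {
    # Strip the directory part, keep the service name before "___",
    # decode "%2D" back to "-", then count identical names.
    find "$1" -name '*.rrd' |
        sed -e 's|.*/||' -e 's|___.*||' -e 's|%2D|-|g' |
        sort | uniq -c | sort -rn
}

# Only run against the default location if it actually exists.
if [ -d "$RRD_DIR" ]; then
    count_rrds_per_service "$RRD_DIR"
fi
```

Note that the plain "ls -l | wc -l" count above also includes the "total" line that ls prints, so the number of host directories is one less than the figure shown.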

Nagiosgraph's RRD-based graph generation is controlled through the configuration file called 'map'. Our map file has 40+ graph types configured. Below is a list of the most used graphs. For each there is an example screenshot, the related map entry, and a comment under the info icon. Hopefully you find it as useful as I do. Also, have a look at the latest version of Nagiosgraph at Sourceforge. I haven't tried it, but there are new features (showgroup.cgi, etc.). Have a nice screenshot or comment, anyone?

Nagiosgraph screenshots and configuration entries

check_ping

The typical host check measures the network packet round-trip time with ping.

load_check_cisco

CPU load for Cisco routers measured with check_snmp_load.pl using the type parameter -T=cisco.

load_check_linux

CPU load for Linux servers measured with check_snmp_load.pl using the type parameter -T=netsl.

load_check_windows

CPU load for Windows servers measured with check_snmp_load.pl using the type parameter -T=stand.

session_check_netscreen

Number of sessions for Juniper Netscreen firewalls measured with check_netscreen_session v1.1 (nagios-plugins 1.4.13).

memory_check_netscreen

Check memory allocation for Juniper Netscreen firewalls measured with check_netscreen_mem v1.0 (nagios-plugins 1.4.13).

memory_check_cisco

Check memory allocation for Cisco routers and switches measured with check_snmp_mem.pl v1.1 using option -I (--cisco).

memory_check_linux

Check memory allocation for Linux servers measured with check_snmp_mem.pl v1.1 using option -N (--netsnmp).

memory_check_asa

Check memory allocation for Cisco ASA security appliances measured with check_snmp_mem.pl v1.1 using option -I (--cisco).

memory_check_windows

Check memory allocation for Windows servers measured with check_snmp_storage using -m(--name) "^Physical Memory$".

disk_check_smb

Check filesystem utilization for Windows SMB file shares measured with check_disk_smb.

disk_check_unix

Check filesystem utilization for UNIX servers (Linux, IBM AIX) measured with check_snmp_storage -m(--name) "^/dev/hd2$".

disk_check_windows

Check filesystem utilization for Windows servers measured with check_snmp_storage -m(--name) "^C:".

local_check_procs

Check a local UNIX system's process count measured with check_procs v2019 (nagios-plugins 1.4.13).

web_check_access

Check web server access and response with check_http v2053 (nagios-plugins 1.4.13).

web_check_load

Check Apache webserver load through /server-status page with check_apachestatus.pl v1.6.

port_check_tcp

Check if a remote TCP port is open, measure ack response with check_tcp v1991 (nagios-plugins 1.4.13).

health_check_temp

Check if a server's temperature sensor reading is below critical values with check_snmp_temperature.pl using --type=hp.

nw_check_bandwidth

Check if a network interface's throughput is reaching critical values with check_bandwidth3 (v3.0.0rc2).

app_check_users

Check if an application is reaching its user limit through a JDBC database user session query with check_app_sessions.class (Java).

service_check_ntp

Check the local system's clock offset to the NTP server with check_ntp_time v2051 (nagios-plugins 1.4.13).

service_check_smtp

Check that the e-mail gateway's SMTP port is available with check_smtp v1991 (nagios-plugins 1.4.13).

session_check_citrix

Check the Citrix server session count with SNMP through the snmp4ctx package (check_snmp_citrix 0.1).

load_check_bluecoat

CPU load for Bluecoat proxies measured with check_snmp_load.pl using the type parameter -T=bc.

power_check_hp_enclosure

Electric power consumption for HP C7000 blade enclosures measured with check_hp_bladechassis.

check_domino_sessions

Check Lotus Notes user sessions through IBM's Domino SNMP service and Notes application MIB.

check_ldap_authentication

Check LDAP server access for user authentication and return the response time.

check_avaya_load

Check the CPU load of Avaya VOIP PBX media servers S8xxx.

check_avaya_trunks

Check the trunk group call usage of Avaya VOIP PBX media servers S8xxx.

check_avaya_peak

Check the hourly peak call usage of Avaya VOIP PBX media servers S8xxx.

check_open_files

Check how many files are open and whether the count is reaching the OS limit.

check_open_unix_fds

Check how many files a particular Unix program has open.

Scripts, credits and links

The Nagiosgraph map configuration file with all graph definitions: download here
The fixed Nagiosgraph package, version 0.9.1: download here
The original version, just in case.
Nagiosgraph Installation Manual
A real example is running on this server; check out Nagiosgraph Live
Nagios and the Nagios community can be found at http://www.nagios.org/
Further Nagios documentation is available here http://nagios.frank4dd.com/docs/en/
The PNP4Nagios graphing package can be found at http://www.pnp4nagios.org/
