Introduction


For us, Nagiosgraph is the most important extension to our monitoring system. Before Nagiosgraph, we relied entirely on Cacti for performance monitoring. Since we implemented Nagiosgraph in 2008, Nagios itself covers most of the performance graphing. We benefit from this integration through quick lookups of historic performance data in a single system: no additional login to a second application, and no manual search and mapping of devices between Nagios and Cacti. We still retain Cacti, but it has lost most of its importance. Today Nagiosgraph, running unchanged since 2008, easily handles its 1283 active devices with a total of 3740 individual graphs and is almost entirely 'maintenance-free'.

There are several packages available for Nagios graphing. At the time of implementation in 2008, the Nagiosgraph package (v0.9.1) was not well maintained and contained a serious bug: the show.cgi program completely failed to generate and display the device graph tree if any device name started with a number. If, say, your network team named a switch '2nd-Cat2960' or a router '3725-east', that triggered it. After fixing this bug in the function setOptionText(element), by prefixing the JavaScript variable name with the string 'host_', Nagiosgraph worked as expected.

Why bother with Nagiosgraph when there are other packages out there? Nagiosgraph integrates with Nagios while remaining fully independent of it. It is written in Perl, so fixes are easy to make. It doesn't need a database and keeps everything in files. New graphs are generated automatically if a map entry exists. We do not need to define new systems or services in Nagiosgraph, which is a huge time saver.

Outdated Graphs

The drawback of this automation is that graph management is not tied to the Nagios configuration. If a device is removed from Nagios, Nagiosgraph still retains its old graphs. This is easily fixed with a monthly routine job that spots these 'dead' graphs, which are no longer being updated, and removes them. Here is a quick way to delete outdated Nagiosgraph RRD database files from the command line:

First, we find all RRD files older than 60 days and delete them, while saving their names into a text file for reference.

susie:/srv/app/nagiosgraph/rrd # find /srv/app/nagiosgraph/rrd -name '*.rrd' -mtime +60 -exec ls -l {} ";" -exec rm {} ";" > deleted-rrd-list.txt
susie:/srv/app/nagiosgraph/rrd #
susie:/srv/app/nagiosgraph/rrd # head -5 deleted-rrd-list.txt
-rw-r--r-- 1 nagios nagios 47712 2010-04-30 19:10 /srv/app/nagiosgraph/rrd/Cat3750-P/check%2Dhost%2Dalive___ping.rrd
-rw-r--r-- 1 nagios nagios 71240 2010-04-30 19:08 /srv/app/nagiosgraph/rrd/Cat3750-P/load%2Dcheck___cpu.rrd
-rw-r--r-- 1 nagios nagios 71240 2010-04-30 19:09 /srv/app/nagiosgraph/rrd/Cat3750-P/memory%2Dcheck___memory.rrd
-rw-r----- 1 nagios nagios 47712 2014-02-07 17:30 /srv/app/nagiosgraph/rrd/akasaka1/check%2Dhost%2Dalive___ping.rrd
-rw-r----- 1 nagios nagios 71240 2014-02-07 17:26 /srv/app/nagiosgraph/rrd/akasaka1/memory%2Dcheck___memory.rrd

Second, we search for RRD host directories that have now become empty, delete them, and again save their names into a text file for reference.

susie:/srv/app/nagiosgraph/rrd # find /srv/app/nagiosgraph/rrd -type d -empty -print -delete > deleted-folder-list.txt
susie:/srv/app/nagiosgraph/rrd # head -5 deleted-folder-list.txt
/srv/app/nagiosgraph/rrd/komaki1812
/srv/app/nagiosgraph/rrd/nagoya2950-6
/srv/app/nagiosgraph/rrd/kyushu2950-1
/srv/app/nagiosgraph/rrd/mholwm03
/srv/app/nagiosgraph/rrd/kyushu2950-10

In all commands above, removing the "rm" or "delete" part gives a dry run that only lists the matches, so we can make sure we don't delete any important files.
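The two cleanup steps and their dry-run variant can be combined into one small shell function. This is only a sketch under assumptions: the function name prune_rrd and its --delete switch are made up for illustration, and it relies on the same GNU find options (-mtime, -empty, -delete) used in the commands above. Without --delete it only lists the stale files.

```shell
#!/bin/sh
# Sketch only: prune_rrd and its --delete switch are hypothetical names,
# not part of Nagiosgraph.
# Usage: prune_rrd RRD_DIR MAX_AGE_DAYS [--delete]
prune_rrd() {
    rrd_dir=$1
    max_age_days=$2
    mode=${3:-}

    if [ "$mode" = "--delete" ]; then
        # Step 1: print each RRD file older than MAX_AGE_DAYS, then remove it.
        find "$rrd_dir" -type f -name '*.rrd' -mtime +"$max_age_days" \
            -print -exec rm -f {} \;
        # Step 2: print and remove host directories that are now empty.
        # -mindepth 1 keeps the top-level RRD directory itself.
        find "$rrd_dir" -mindepth 1 -type d -empty -print -delete
    else
        # Dry run: only list the stale RRD files, delete nothing.
        find "$rrd_dir" -type f -name '*.rrd' -mtime +"$max_age_days" -print
    fi
}
```

Calling prune_rrd /srv/app/nagiosgraph/rrd 60 would list the candidates; adding --delete as a third argument performs the removal. Redirecting the output to a file keeps the same kind of reference list as above.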

2012 Update: PNP4Nagios

While Nagiosgraph is great for the large-scale installs I had to handle, PNP4Nagios is probably the most popular graphing package for Nagios. Although not as 'simple' as Nagiosgraph, it has features that allow for a more fine-grained configuration, produces more beautiful graphs, and exports to PDF. Feel free to judge for yourself in a side-by-side comparison: I am running a simultaneous Nagios performance data feed into both graphing systems.

I do not have data on how PNP4Nagios behaves in large installations. I have been running Nagiosgraph with over 5,000 RRDs on a single server.

# ls -l nagiosgraph/rrd/ | wc -l                --> 1284
# find nagiosgraph/rrd -name '*.rrd' | wc -l    --> 3744

Most used graph types

 #   graph type                  graph count
 01  check_ping                         1281
 02  load_check_cisco                    750
 03  load_check_linux                      6
 04  load_check_windows                  124
 05  session_check_netscreen               1
 06  memory_check_netscreen                1
 07  memory_check_cisco                  750
 08  memory_check_linux                    6
 09  memory_check_asa                      8
 10  memory_check_windows                182
 11  disk_check_smb                       16
 12  disk_check_unix                     227
 13  disk_check_windows                  162
 14  local_check_procs                     1
 15  web_check_access                     21
 16  web_check_load                       12
 17  port_check_tcp                       40
 18  health_check_temp                     3
 19  nw_check_bandwidth                  126
 20  app_check_users                       7
 21  service_check_ntp                     1
 22  app_check_smtp                        2
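A rough per-service breakdown of the RRD files can be pulled straight from the file names. This is a sketch under assumptions: the function name count_rrd_by_service is made up, and it assumes every file is named <service>___<database>.rrd with '-' encoded as %2D, as in the deleted-file listing earlier. It groups by Nagios service name, which is not necessarily the same as the graph type names used in the table.

```shell
#!/bin/sh
# Sketch only: count_rrd_by_service is a hypothetical helper.
# Assumes RRD files are named <service>___<database>.rrd.
# Usage: count_rrd_by_service RRD_DIR
count_rrd_by_service() {
    find "$1" -type f -name '*.rrd' \
        | sed -e 's|.*/||' -e 's|___.*$||' -e 's|%2D|-|g' \
        | sort | uniq -c | sort -rn
    # sed: strip the directory, strip '___<database>.rrd', decode %2D to '-'.
    # uniq -c counts files per service; sort -rn puts the largest count first.
}
```

Run as count_rrd_by_service nagiosgraph/rrd, it prints one line per service with the file count in the first column.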

Nagiosgraph's RRD-based graph generation is controlled through the configuration file called 'map'. Our map file has 40+ graph types configured. Below is a list of the most used graphs. For each one there is an example graph screenshot, the related map entry, and a comment to be found under the info icon. Hopefully you find it as useful as I do. Also, have a look at the latest version of Nagiosgraph at Sourceforge. I haven't tried it myself, but there are new features (showgroup.cgi, etc.). Have a nice screenshot or comment, anyone?

Nagiosgraph screenshots and configuration entries




01 check_ping The typical host check measures the network packet round-trip time with ping.
02 load_check_cisco CPU load for Cisco routers measured with check_snmp_load.pl using the type parameter -T=cisco.
03 load_check_linux CPU load for Linux servers measured with check_snmp_load.pl using the type parameter -T=netsl.
04 load_check_windows CPU load for Windows servers measured with check_snmp_load.pl using the type parameter -T=stand.
05 session_check_netscreen Number of sessions for Juniper Netscreen firewalls measured with check_netscreen_session v1.1 (nagios-plugins 1.4.13).
06 memory_check_netscreen Check memory allocation for Juniper Netscreen firewalls measured with check_netscreen_mem v1.0 (nagios-plugins 1.4.13).
07 memory_check_cisco Check memory allocation for Cisco routers and switches measured with check_snmp_mem.pl v1.1 using option -I (--cisco).
08 memory_check_linux Check memory allocation for Linux servers measured with check_snmp_mem.pl v1.1 using option -N (--netsnmp).
09 memory_check_asa Check memory allocation for Cisco ASA security appliances measured with check_snmp_mem.pl v1.1 using option -I (--cisco).
10 memory_check_windows Check memory allocation for Windows servers measured with check_snmp_storage using -m(--name) "^Physical Memory$".
11 disk_check_smb Check filesystem utilization for Windows SMB file shares measured with check_disk_smb.
12 disk_check_unix Check filesystem utilization for UNIX servers (Linux, IBM AIX) measured with check_snmp_storage -m(--name) "^/dev/hd2$".
13 disk_check_windows Check filesystem utilization for Windows servers measured with check_snmp_storage -m(--name) "^C:".
14 local_check_procs Check a local UNIX system's process count measured with check_procs v2019 (nagios-plugins 1.4.13).
15 web_check_access Check web server access and response with check_http v2053 (nagios-plugins 1.4.13).
16 web_check_load Check Apache webserver load through /server-status page with check_apachestatus.pl v1.6.
17 port_check_tcp Check if a remote TCP port is open, measure ack response with check_tcp v1991 (nagios-plugins 1.4.13).
18 health_check_temp Check if a server's temperature sensor is below critical values with check_snmp_temperature.pl using --type=hp.
19 nw_check_bandwidth Check if a network interface throughput is reaching critical values with check_bandwidth3 (v3.0.0rc2).
20 app_check_users Check if an application is reaching its user limits through a JDBC database user session query with check_app_sessions.class (Java).
21 service_check_ntp Check the local system's clock offset to the NTP server with check_ntp_time v2051 (nagios-plugins 1.4.13).
22 service_check_smtp Check the e-mail gateway's SMTP port is available with check_smtp v1991 (nagios-plugins 1.4.13).
23 session_check_citrix Check the Citrix server session count with SNMP through the snmp4ctx package (check_snmp_citrix 0.1).
24 load_check_bluecoat CPU load for Bluecoat proxies measured with check_snmp_load.pl using the type parameter -T=bc.
25 power_check_hp_enclosure Electric power consumption for HP C7000 blade enclosures measured with check_hp_bladechassis.
26 check_domino_sessions Check Lotus Notes user sessions through IBM's Domino SNMP service and Notes application MIB.
27 check_ldap_authentication Check LDAP server access for user authentication and return the response time.
28 check_avaya_load Check the CPU load of Avaya VOIP PBX media servers S8xxx.
29 check_avaya_trunks Check the trunk group call usage of Avaya VOIP PBX media servers S8xxx.
30 check_avaya_peak Check the hourly peak call usage of Avaya VOIP PBX media servers S8xxx.
31 check_open_files Check how many files are open and whether the count is reaching the OS limit.
32 check_open_unix_fds Check how many files a particular Unix program has open.

Scripts, credits and links


More Information: