Introduction
For us, Nagiosgraph is the most important extension to our monitoring system. Before Nagiosgraph, we relied entirely on Cacti for performance monitoring. Since we implemented Nagiosgraph in 2008, Nagios itself covers most of the performance graphing. We benefit from this integration through quick lookups of historic performance data in a single system: no additional login to a second application, and no manual search and mapping of devices between Nagios and Cacti. We still retain Cacti, but it has lost most of its importance. Today Nagiosgraph, running unchanged since 2008, easily handles 1283 active devices with a total of 3740 individual graphs and is almost entirely 'maintenance-free'.
There are several packages available for Nagios graphing. At the time of implementation in 2008, the Nagiosgraph package (v0.9.1) was not well maintained and contained a serious bug: the show.cgi program completely failed to generate and display the device graph tree if any device name started with a digit, presumably because a JavaScript identifier cannot begin with a digit. If, say, your network team named a switch '2nd-Cat2960' or a router '3725-east', that triggered it. After fixing this bug by prefixing the JavaScript variable name with a 'host_' string in the function setOptionText(element), Nagiosgraph worked as expected.
Why bother with Nagiosgraph when there are other packages available? Nagiosgraph integrates with Nagios while remaining fully independent of it. It is written in Perl, so fixes are easy to make. It doesn't need a database and keeps everything in files. New graphs are generated automatically if a matching map entry exists. We do not need to define new systems or services in Nagiosgraph, which is a huge time saver.
Outdated Graphs
The drawback of this automation is that graph management is not tied to the Nagios configuration. If a device is removed from Nagios, Nagiosgraph still retains its old graphs. This is easily fixed with a monthly routine job that spots these 'dead' graphs, which are no longer updated, and removes them. Here is a quick way to delete outdated Nagiosgraph RRD database files from the command line:
First, we find all RRD files that have not been modified for more than 60 days and delete them, saving their names into a text file for reference.
susie:/srv/app/nagiosgraph/rrd # find /srv/app/nagiosgraph/rrd -name '*.rrd' -mtime +60 -exec ls -l {} ";" -exec rm {} ";" > deleted-rrd-list.txt
susie:/srv/app/nagiosgraph/rrd #
susie:/srv/app/nagiosgraph/rrd # head -5 deleted-rrd-list.txt
-rw-r--r-- 1 nagios nagios 47712 2010-04-30 19:10 /srv/app/nagiosgraph/rrd/Cat3750-P/check%2Dhost%2Dalive___ping.rrd
-rw-r--r-- 1 nagios nagios 71240 2010-04-30 19:08 /srv/app/nagiosgraph/rrd/Cat3750-P/load%2Dcheck___cpu.rrd
-rw-r--r-- 1 nagios nagios 71240 2010-04-30 19:09 /srv/app/nagiosgraph/rrd/Cat3750-P/memory%2Dcheck___memory.rrd
-rw-r----- 1 nagios nagios 47712 2014-02-07 17:30 /srv/app/nagiosgraph/rrd/akasaka1/check%2Dhost%2Dalive___ping.rrd
-rw-r----- 1 nagios nagios 71240 2014-02-07 17:26 /srv/app/nagiosgraph/rrd/akasaka1/memory%2Dcheck___memory.rrd
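The file names in the listing follow a visible pattern: one directory per host, and file names of the form service___database.rrd in which characters such as '-' appear as URL-style escapes (%2D). The bash sketch below splits such a path back into readable parts. It assumes that this layout holds for every file, which is inferred from the listing rather than taken from the Nagiosgraph documentation; the function names urldecode and describe_rrd are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: decode a Nagiosgraph RRD path into host, service and database.
# Assumed layout (inferred from the listing): <rrd dir>/<host>/<service>___<db>.rrd
# with URL-style %XX escapes in the names.

urldecode() {
    # Turn every %XX into \xXX and let printf '%b' expand the escapes.
    local encoded=$1
    printf '%b' "${encoded//%/\\x}"
}

describe_rrd() {
    local path=$1
    local host file service db
    host=$(basename "$(dirname "$path")")   # parent directory is the host
    file=$(basename "$path" .rrd)           # file name without the extension
    service=${file%%___*}                   # part before the triple underscore
    db=${file#*___}                         # part after the triple underscore
    printf '%s\t%s\t%s\n' \
        "$(urldecode "$host")" "$(urldecode "$service")" "$(urldecode "$db")"
}
```

For the first file in the listing, describe_rrd prints the tab-separated fields Cat3750-P, check-host-alive and ping, which makes a list of deleted files easier to match against Nagios host and service names.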
Second, we search for RRD host directories that are now empty, delete them, and again save their names into a text file for reference.
susie:/srv/app/nagiosgraph/rrd # find /srv/app/nagiosgraph/rrd -type d -empty -print -delete > deleted-folder-list.txt
susie:/srv/app/nagiosgraph/rrd # head -5 deleted-folder-list.txt
/srv/app/nagiosgraph/rrd/komaki1812
/srv/app/nagiosgraph/rrd/nagoya2950-6
/srv/app/nagiosgraph/rrd/kyushu2950-1
/srv/app/nagiosgraph/rrd/mholwm03
/srv/app/nagiosgraph/rrd/kyushu2950-10
For both commands above, removing the -exec rm {} ";" or the -delete part turns the command into a dry run, so we can check that no important files would be deleted.
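The dry run can also be wrapped into two small helper functions that only report and never delete. This is a minimal bash sketch, not part of Nagiosgraph: the function names are hypothetical, the directory and the age in days are passed as arguments, and it relies on the -quit option as found in GNU find.

```shell
#!/usr/bin/env bash
# Sketch only: report what the cleanup would remove, without deleting anything.

list_stale_rrds() {
    # $1 = RRD base directory, $2 = age threshold in days.
    # Same test as the delete command above, but it only prints the matches.
    find "$1" -name '*.rrd' -mtime +"$2" -print
}

list_emptied_dirs() {
    # Print host directories that would be empty after the cleanup:
    # those holding no file other than stale RRD files.
    local dir
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        # A directory survives if at least one file is not a stale RRD file.
        if [ -z "$(find "$dir" -type f ! \( -name '*.rrd' -mtime +"$2" \) -print -quit)" ]; then
            printf '%s\n' "${dir%/}"
        fi
    done
}

# Example call (paths as used in this article):
#   list_stale_rrds   /srv/app/nagiosgraph/rrd 60
#   list_emptied_dirs /srv/app/nagiosgraph/rrd 60
```

Reviewing both lists before the monthly job runs gives a chance to spot hosts whose checks merely stopped updating for some other reason than removal from Nagios.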
Nagiosgraph Example
While Nagiosgraph is great for the large-scale installations I had to handle, PNP4Nagios is another popular graphing package for Nagios. Although not as 'simple' as Nagiosgraph, it has features that allow for more fine-grained configuration, produces more attractive graphs, and exports to PDF. I do not have data on how PNP4Nagios behaves in large installations. I have been running Nagiosgraph with over 5,000 RRDs on a single server.
Most used Nagiosgraph Types
# ls -l nagiosgraph/rrd/ | wc -l --> 1284
# find nagiosgraph/rrd -name '*.rrd' | wc -l --> 3744
# | graph type | graph count | most used graph types |
---|---|---|---|
01 | check_ping | 1281 | ![]() |
02 | load_check_cisco | 750 | |
03 | load_check_linux | 6 | |
04 | load_check_windows | 124 | |
05 | session_check_netscreen | 1 | |
06 | memory_check_netscreen | 1 | |
07 | memory_check_cisco | 750 | |
08 | memory_check_linux | 6 | |
09 | memory_check_asa | 8 | |
10 | memory_check_windows | 182 | |
11 | disk_check_smb | 16 | |
12 | disk_check_unix | 227 | |
13 | disk_check_windows | 162 | |
14 | local_check_procs | 1 | |
15 | web_check_access | 21 | |
16 | web_check_load | 12 | |
17 | port_check_tcp | 40 | |
18 | health_check_temp | 3 | |
19 | nw_check_bandwidth | 126 | |
20 | app_check_users | 7 | |
21 | service_check_ntp | 1 | |
22 | app_check_smtp | 2 |
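A per-service tally similar to the table above can be approximated straight from the file system. The bash sketch below counts RRD files per encoded service name; it assumes the service___database.rrd naming seen in the file listings, and the function name is hypothetical. The names it prints are the file-system service names, which need not match the graph type labels used in the table, and a service that writes several database files is counted once per file.

```shell
#!/usr/bin/env bash
# Sketch only: count RRD files per service name, most frequent first.

count_rrds_by_service() {
    # $1 = RRD base directory.
    # Strip the directory part and the "___<db>.rrd" suffix,
    # then tally identical service names.
    find "$1" -name '*.rrd' -print |
        sed -e 's|.*/||' -e 's|___.*\.rrd$||' |
        sort | uniq -c | sort -rn
}

# Example call:
#   count_rrds_by_service nagiosgraph/rrd
```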
Nagiosgraph's RRD-based graph generation is controlled through a configuration file called 'map'. Our map file has 40+ graph types configured. Below is a list of the most used graphs. For each there is an example screenshot, the related map entry, and a comment found under the info icon. Hopefully you find it as useful as I do. Also, have a look at the latest version of Nagiosgraph at SourceForge. I haven't tried it, but there are new features (showgroup.cgi, etc.). Have a nice screenshot or comment, anyone?
Nagiosgraph screenshots and configuration entries
# | graph type | description |
---|---|---|
01 | check_ping | The typical host check measures the network packet round-trip time with ping. |
02 | load_check_cisco | CPU load for Cisco routers measured with check_snmp_load.pl using the type parameter -T=cisco. |
03 | load_check_linux | CPU load for Linux servers measured with check_snmp_load.pl using the type parameter -T=netsl. |
04 | load_check_windows | CPU load for Windows servers measured with check_snmp_load.pl using the type parameter -T=stand. |
05 | session_check_netscreen | Number of sessions for Juniper Netscreen firewalls measured with check_netscreen_session v1.1 (nagios-plugins 1.4.13). |
06 | memory_check_netscreen | Check memory allocation for Juniper Netscreen firewalls measured with check_netscreen_mem v1.0 (nagios-plugins 1.4.13). |
07 | memory_check_cisco | Check memory allocation for Cisco routers and switches measured with check_snmp_mem.pl v1.1 using option -I (--cisco). |
08 | memory_check_linux | Check memory allocation for Linux servers measured with check_snmp_mem.pl v1.1 using option -N (--netsnmp). |
09 | memory_check_asa | Check memory allocation for Cisco ASA security appliances measured with check_snmp_mem.pl v1.1 using option -I (--cisco). |