Icinga2 Graph Data Retention

Out with Nagios, in with icinga2

Note: I’ve updated this post since I originally posted it. See my post about xFilesFactor here which describes why I updated it.

I’ve been on a bit of a journey over the last few months, migrating from nagios to icinga2 for my own systems, some customer stuff and the platforms at my day job. That comes after living through the love-hate relationship that is nagios, on and off, for quite a few years now.

[Image: Icinga2 dashboard. Icinga2 is soo much nicer than nagios once set up.]

On top of the base icinga2 / icingaweb2 install I added the graphite graphing module, which essentially worked out of the box, give or take a couple of typos on my part. You can find all the information you need for that part of the journey on the GitHub page for the module.

The graphite module in the context of icinga2 has three moving parts:

  • Carbon-cache, which receives the information from icinga2 and writes it out to disk as efficiently as possible.
  • Whisper, the database format used to store the metrics. It can be thought of as a more modern and efficient version of the venerable RRD, which all sysadmins since Noah’s time probably have some sort of emotional response to.
  • Graphite, which renders the metrics into pretty PNG format graphs.

The problem, only a day of data

A few days back I discovered that, despite a fairly uneventful installation of the graphite module on my part, the graphs, while very pretty, were only keeping a day’s worth of data.

So off to Google (who is, for the record, not evil) I went. Although I did find the solution, it was scattered across separate documentation sites and forum posts. I also found that some of the information was dated compared to the current versions of things I’m using, so here’s a write-up of my solution in one place in case anyone else is similarly searching for a one-stop solution.

The symptom of the low retention time is best described with a video. Essentially you select anything larger than a one day period in the icingaweb2 interface and the graph does not get any bigger. Insert sad face here.

FYI: I’m writing this based on running Ubuntu 18.04 LTS with icinga2 installed from the PPA. Icinga2 is version 2.10.2-1.bionic, graphite-carbon is 1.0.2-1.

Retention settings in carbon

The retention settings for new graphs created by carbon are found in /etc/carbon/storage-schemas.conf and the default setting is:

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d

That is basically saying keep one minute resolution of data for a day.

Settings in the file are processed from top to bottom until a match for 'pattern' is found, so the first pattern a metric matches is the one used. The patterns are in Python regex format, so '.*' in the default above matches everything.

Just like RRD, whisper can retain multiple ‘archives’ of a metric with differing resolutions and time spans. In the storage-schemas file for carbon cache that is represented as comma separated pairs of values in reducing resolution and increasing time span.

By way of example, 60s:1d,5m:1w,15m:1y would keep one minute resolution for a day, five minute averages for a week, and 15 minute averages for a year.
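To make the notation concrete, here is a minimal Python sketch that expands a retention string into seconds per point and number of points for each archive. The helper names (parse_duration, parse_retentions) are made up for illustration and are not part of carbon or whisper, and the sketch only handles the s/m/h/d/w/y suffixes, with a year assumed to be 365 days.

```python
# Illustrative only: these helpers are hypothetical, not carbon/whisper APIs.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Convert a value such as '60s' or '1d' into seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def parse_retentions(spec):
    """Turn '60s:1d,5m:1w' into [(seconds_per_point, points), ...]."""
    archives = []
    for pair in spec.split(","):
        resolution, span = pair.strip().split(":")
        seconds_per_point = parse_duration(resolution)
        points = parse_duration(span) // seconds_per_point
        archives.append((seconds_per_point, points))
    return archives


for seconds_per_point, points in parse_retentions("60s:1d,5m:1w,15m:1y"):
    print(seconds_per_point, points)
```

For the example string this gives 1440, 2016 and 35040 points for the three archives.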

The new entry in the config file for that would look like:

[default]
pattern = .*
xFilesFactor = 0
retentions = 60s:1d,5m:1w,15m:1y

Note that [default] just has to be unique, the actual text is kinda irrelevant but if you have two [billybob] entries in the file you’ll get an odd-ball error when you restart carbon.

Note the xFilesFactor setting as well. My second blog post about Graphite and Icinga2 graphing describes what’s going on there. Read that post here.

Fixing existing graphs

After you re-start carbon you’ll find that it only affects newly created whisper files, not your existing ones.

You could just delete all of the whisper files and let the system re-create them, or you can fix them up using the whisper-resize tool shipped in the graphite-carbon package. To apply our new rule to a single file you could use the following incantation:

whisper-resize <filename>.wsp --xFilesFactor=0 60s:1d 5m:1w 15m:1y

Note that whisper-resize uses spaces instead of commas like the config file.

All of the whisper files live under /var/lib/graphite/whisper/icinga2/ in a directory structure representing the metric names. In the following example we’re working with the response time for a tcp check on port 80 to a machine called mailhost.vpn in icinga2.

If you’ve got this far and realise you don’t know what I’m on about can I suggest you back out slowly now and get your head around icinga2 before messing with tuning graphite…

In all of the command-line copy’n’paste I’ve put a pwd at the top and cropped the paths as they were waaaaay too wide for a blog post.

root@icinga: # pwd
/var/lib/graphite/whisper/icinga2/mailhost_vpn/services/http:80/tcp/perfdata/time
root@icinga: # ls -l
total 40
-rw-r--r-- 1 _graphite _graphite 17308 Nov 25 11:04 max.wsp
-rw-r--r-- 1 _graphite _graphite 17308 Nov 25 11:04 value.wsp
root@icinga: # whisper-resize value.wsp --xFilesFactor=0 60s:1d 5m:1w 15m:1y
Retrieving all data from the archives
Creating new whisper database: value.wsp.tmp
Created: value.wsp.tmp (462004 bytes)
Migrating data without aggregation...
Renaming old database to: value.wsp.bak
Renaming new database to: value.wsp
root@icinga: # ls -l
total 492
-rw-r--r-- 1 _graphite _graphite 17308 Nov 25 11:09 max.wsp
-rw-r--r-- 1 root root 462004 Nov 25 11:11 value.wsp
-rw-r--r-- 1 _graphite _graphite 17308 Nov 25 11:09 value.wsp.bak
root@icinga: # chown _graphite:_graphite *
root@icinga: #

Note: you need to change the owner of the file back to _graphite:_graphite after you make changes as root, or ‘sudo -u _graphite’ the commands.

You’ll see that the wsp file has grown out to 462k from 17k. That might be an issue for your system if you are tracking a lot of metrics. On my test icinga2 instance I have 3,700 .wsp files so that equates to the wsp files expanding out to take 1.6GB of disk.
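The file sizes can be reproduced with a little arithmetic, which is handy for estimating disk use before you resize anything. The sketch below assumes a whisper layout of a 16 byte file header, a 12 byte record per archive and 12 bytes per stored point, which is consistent with the sizes reported in this post. The function name is hypothetical.

```python
# Assumed whisper layout: 16-byte header, 12 bytes per archive record,
# 12 bytes per stored point. whisper_file_size is a hypothetical helper.
HEADER_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12


def whisper_file_size(points_per_archive):
    """Estimate the size in bytes of a .wsp file from its archive point counts."""
    archive_count = len(points_per_archive)
    return (HEADER_BYTES
            + ARCHIVE_INFO_BYTES * archive_count
            + POINT_BYTES * sum(points_per_archive))


old_size = whisper_file_size([1440])               # 60s:1d
new_size = whisper_file_size([1440, 2016, 35040])  # 60s:1d 5m:1w 15m:1y
print(old_size, new_size)
print(round(3700 * new_size / 1024 ** 3, 2), "GiB for 3,700 files")
```

The two sizes come out at 17308 and 462004 bytes, matching the ls output above, and the total for 3,700 files lands at roughly 1.6GB.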

You can check the before and after using whisper-info:

Before:

root@icinga: # whisper-info value.wsp.bak
maxRetention: 86400
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 17308

Archive 0
retention: 86400
secondsPerPoint: 60
points: 1440
size: 17280
offset: 28

After:

root@icinga: # whisper-info value.wsp
maxRetention: 31536000
xFilesFactor: 0
aggregationMethod: average
fileSize: 462004

Archive 0
retention: 86400
secondsPerPoint: 60
points: 1440
size: 17280
offset: 52

Archive 1
retention: 604800
secondsPerPoint: 300
points: 2016
size: 24192
offset: 17332

Archive 2
retention: 31536000
secondsPerPoint: 900
points: 35040
size: 420480
offset: 41524

The ‘After’ shows the three separate archives with different settings in seconds. Did you know that there were 31,536,000 seconds in a year?
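If you would rather check files from a script than eyeball whisper-info, the same header fields can be read directly. This is a sketch that assumes the on-disk layout is big-endian, with aggregation type, max retention, xFilesFactor and archive count up front, followed by offset, seconds per point and points for each archive. parse_wsp_header is a hypothetical helper, not something shipped in the graphite packages.

```python
import struct

# Assumed header layout (big-endian); verify against your whisper version.
METADATA = struct.Struct("!2LfL")    # aggregation type, maxRetention, xFilesFactor, archive count
ARCHIVE_INFO = struct.Struct("!3L")  # offset, secondsPerPoint, points


def parse_wsp_header(raw):
    """Return the retention summary held at the start of a .wsp file's bytes."""
    _agg_type, max_retention, x_files_factor, count = METADATA.unpack_from(raw, 0)
    archives = []
    for index in range(count):
        position = METADATA.size + index * ARCHIVE_INFO.size
        offset, seconds_per_point, points = ARCHIVE_INFO.unpack_from(raw, position)
        archives.append({
            "offset": offset,
            "secondsPerPoint": seconds_per_point,
            "points": points,
            "retention": seconds_per_point * points,
        })
    return {
        "maxRetention": max_retention,
        "xFilesFactor": x_files_factor,
        "archives": archives,
    }
```

Feed it the contents of a file, for example the result of open("value.wsp", "rb").read(), and compare the archives list with what you expect from your retention rule.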

Obviously, with 3,700 files to modify, it’d take ages to manually change every file if I wanted the same retention all round, so find and a little testing gave me:

find /var/lib/graphite/whisper/icinga2 -name '*.wsp' -exec whisper-resize '{}' --xFilesFactor=0 60s:1d 5m:1w 15m:1y \;
chown -R _graphite:_graphite /var/lib/graphite/whisper/icinga2

Pointing find at a single host folder, such as /var/lib/graphite/whisper/icinga2/mailhost_vpn, changes all of the graphs for just that host, or you can go a level up and do everything in one shot.

If you’re concerned about resources and think a system-wide resize will kill things, this was my experience: the test system has 37 hosts with 785 services and 3,700 wsp files. It’s running on a VM with two cores on a 2.8GHz i5 and 1GB of RAM. The resize took a shade over two minutes, and disk space usage for graphite went from 72MB to 1.6GB once the .bak files were cleaned up.

Being Selective

[Image: Updates available graph. No point in 1 minute increments for this one.]

If you’re running a large system monitoring many thousands of metrics, you probably don’t want, or need, the same resolution for all metrics.

Carbon allows you to apply patterns in the storage-schemas file and the first match for a new graph wins, so to speak.

An example would be using the ‘apt’ check within icinga2 which I run every 12 hours on Debian and Ubuntu systems. There is simply no point in recording data at any higher resolution than say 1 hour for that graph but the nice default we just applied to all of the whisper files has it storing thousands of useless data points.

As well as the disk space concerns, having multiple archives in the whisper files substantially increases processing overheads for carbon as it has to average data out and write it to each of the successively lower resolution archives in the file each time it writes a check.

That might not be an issue for a system with 700 metrics, but what about 70,000? In larger systems it would quickly be justified putting graphite on a separate host for that reason, but I’m running everything on the one small VM.
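To give a feel for that extra work, this is roughly what happens each time fine-grained points are rolled up into a slot in a lower resolution archive. It is a simplified sketch of averaging with an xFilesFactor threshold, not whisper’s actual code, and roll_up is a hypothetical function.

```python
def roll_up(values, x_files_factor):
    """Average one coarse slot's worth of fine points (None marks a missing point).

    Returns None when too few points are known to satisfy x_files_factor.
    """
    known = [value for value in values if value is not None]
    if not known:
        return None
    if len(known) / len(values) < x_files_factor:
        return None
    return sum(known) / len(known)


# Five one-minute slots feeding one five-minute slot, for a check that only
# reported once in that window.
window = [0.12, None, None, None, None]
print(roll_up(window, 0.5))  # None: only 20% of the slots are known
print(roll_up(window, 0))    # 0.12
```

Every extra archive in a file means another pass like this on each write, which is where the overhead comes from.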

Checking the name for the whisper files for the apt plugin we find:

root@icinga:/var/lib/graphite/whisper/icinga2/mailhost_vpn# find . | grep apt
./services/apt
./services/apt/apt
./services/apt/apt/perfdata
./services/apt/apt/perfdata/available_upgrades
./services/apt/apt/perfdata/available_upgrades/value.wsp
./services/apt/apt/perfdata/critical_updates
./services/apt/apt/perfdata/critical_updates/value.wsp
root@icinga:/var/lib/graphite/whisper/icinga2/mailhost_vpn#

Taking that path information we can create a pattern regex match in the storage-schemas.conf file and apply a more sane retention for apt checks.

The metric names you’re matching on are the path in the whisper directory structure with ‘/’ replaced with a ‘.’ and without the .wsp on the end. 

That makes the full name for one of the metrics above:

icinga2.mailhost_vpn.services.apt.apt.perfdata.available_upgrades.value
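The path-to-name rule is simple enough to script, which helps when you want to test a pattern against real files. A minimal sketch, assuming the whisper root used in this post; metric_name is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Whisper root as used in this post; adjust if your install differs.
WHISPER_ROOT = PurePosixPath("/var/lib/graphite/whisper")


def metric_name(wsp_path):
    """Map a .wsp file path under the whisper root to its dotted metric name."""
    relative = PurePosixPath(wsp_path).relative_to(WHISPER_ROOT)
    return ".".join(relative.with_suffix("").parts)


print(metric_name(
    "/var/lib/graphite/whisper/icinga2/mailhost_vpn/services/apt/apt"
    "/perfdata/available_upgrades/value.wsp"
))
```

This prints the same dotted name as shown above.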

All we need to do is match the end of that long mess to pick up everything for the apt plugin on all hosts. Lucky for those of us who are regex challenged, that’s really easy using the ‘$’ end-of-string anchor.

The resulting storage-schemas.conf would look like:

[apt-upgrades]
pattern = available_upgrades.value$
retentions = 1h:1y
[apt-critical]
pattern = critical_updates.value$
retentions = 1h:1y
[default]
pattern = .*
xFilesFactor = 0
retentions = 60s:1d,5m:1w,15m:1y
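The first-match behaviour can be sketched in a few lines of Python if you want to try patterns out before restarting carbon. This illustrates the matching rule rather than carbon’s own code: it assumes patterns are tried in file order with a regex search, and the SCHEMAS list and retention_for function are made up for the example.

```python
import re

# Same order as the storage-schemas.conf example above; first match wins.
SCHEMAS = [
    ("apt-upgrades", r"available_upgrades.value$", "1h:1y"),
    ("apt-critical", r"critical_updates.value$", "1h:1y"),
    ("default", r".*", "60s:1d,5m:1w,15m:1y"),
]


def retention_for(metric):
    """Return (section, retentions) for the first pattern that matches."""
    for section, pattern, retentions in SCHEMAS:
        if re.search(pattern, metric):
            return section, retentions
    return None


print(retention_for("icinga2.mailhost_vpn.services.apt.apt.perfdata.available_upgrades.value"))
print(retention_for("icinga2.mailhost_vpn.services.http:80.tcp.perfdata.time.value"))
```

The apt metric picks up the 1h:1y rule and the tcp check falls through to the default. Note that the unescaped ‘.’ in the patterns matches any character; writing ‘\.’ would be stricter, though it makes no practical difference here.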

To check that this is working, and taking the example of another host (icinga.home) we check the original default-retention whisper file using whisper-info to find the single one minute / one day Archive:

root@icinga:# pwd
/var/lib/graphite/whisper/icinga2/icinga_home/services/apt/apt/perfdata/available_upgrades
root@icinga:# whisper-info value.wsp
maxRetention: 86400
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 17308

Archive 0
retention: 86400
secondsPerPoint: 60
points: 1440
size: 17280
offset: 28

To test the carbon config change:

  • Change the retention settings in carbon per the example above.
  • Restart carbon using ‘service carbon-cache restart’, or systemctl if that’s how you rock.
  • Delete the value.wsp file for your test host.
  • Re-run the check a couple of times using the icingaweb interface to force the re-creation of the wsp file.
  • Check it again with whisper-info.

Whisper-info should now return:

root@icinga:# whisper-info value.wsp
maxRetention: 31536000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 105148

Archive 0
retention: 31536000
secondsPerPoint: 3600
points: 8760
size: 105120
offset: 28

So only 100k of data is needed to support a year’s worth of that metric, and this will reduce the processor overhead of averaging and archiving data, which is probably a larger concern if you’ve read down this far.

You can now apply this retention rule to all of your existing .wsp files using a little bit of find-grep-xargs-foo if you don’t want to delete the current history information. Run this, or a variation of it, in the appropriate folder to pick up one host or all of them.

find . -name '*.wsp' | grep available_upgrades | xargs -I % whisper-resize % 1h:1y
find . -name '*.wsp' | grep critical_updates | xargs -I % whisper-resize % 1h:1y
chown -R _graphite:_graphite /var/lib/graphite/whisper/icinga2

As with all bash-foo I’d suggest replacing the whisper-resize command with ‘echo’ until you’re happy that you’re not going to accidentally DDOS hotmail or cause a sudden loss of gravity in Jamaica taxi cabs. A backup before you take anything I type as advice wouldn’t be a bad idea either.
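The same dry-run idea works if you prefer to script the resize in Python. The sketch below only prints the whisper-resize commands unless dry_run is switched off. resize_matching is a hypothetical helper, and the argument order simply mirrors the command lines used in this post.

```python
import subprocess
from pathlib import Path


def resize_matching(root, keyword, retentions, dry_run=True):
    """Build (and optionally run) whisper-resize for .wsp files whose path contains keyword."""
    commands = []
    for wsp in sorted(Path(root).rglob("*.wsp")):
        if keyword not in str(wsp):
            continue
        # Config-style 'a,b' retentions become space separated arguments.
        command = ["whisper-resize", str(wsp), *retentions.split(",")]
        commands.append(command)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands
```

Calling resize_matching(".", "available_upgrades", "1h:1y") lists what would be run. Pass dry_run=False once the list looks right, and remember the chown afterwards if you ran it as root.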

Once you’re happy with the results you can remove all of the .wsp.bak files as well to free up a bit of space.