When something is wrong and you look at the system, most often you'll see that... well, it works. There's some cpu, disk, ram usage, some number of requests per second on different services, some stuff piling up, something in short supply here and there...
And there's just no way to tell what's wrong without answers to questions like "so, what's the usual load average here?", "is the disk always busy with requests 80% of the time?", "is this many more requests than usual?", etc. Otherwise you might be off on some wild chase just to find out that the load has always been that high, or solve the mystery of some unoptimized code that's been there for ages, without doing anything about the problem in question.
Historical data is the answer, and having used rrdtool with stuff like (customized) cacti and snmpd (with some of my hacks on top) in the past, I was overjoyed when I stumbled upon the graphite project at some point.
Initially I started using it with just collectd (and still am), but there's a need for something to convert metric names to a graphite hierarchy.
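Something like this is what I mean by converting names (a rough illustration, not collectd-carbon's actual mapping) - flattening collectd's "host/plugin-instance/type-instance" identifiers into dotted graphite paths:

    def collectd_to_graphite(identifier, prefix='collectd'):
        # collectd identifiers look like "host/plugin-instance/type-instance"
        host, plugin, type_ = identifier.split('/')
        host = host.replace('.', '_')  # keep the fqdn as a single graphite path component
        return '{}.{}.{}.{}'.format(prefix, host, plugin, type_)

    print(collectd_to_graphite('myhost.example.org/cpu-0/cpu-idle'))
    # collectd.myhost_example_org.cpu-0.cpu-idle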
Some of the problems I've noticed with such a collectd setup:
- Disk I/O metrics are godawful or just don't work - collected read/write metrics, whether per-process or per-device, are either zeroes, have weird values detached from reality (judging by actual problems and what tools like atop and sysstat show), or are just useless.
- Lots of metrics for network and memory (vmem, slab) and from various plugins have naming inconsistent with linux /proc or documentation names.
- Some useful metrics that are in, say, sysstat don't seem to work with collectd - sensor data, nfsv4, some paging and socket counters.
- Some metrics need non-trivial post-processing to be useful - disk utilization (% of time the device is busy) is one good example (see the sketch right after this list).
- Python plugins leak memory on every returned value. Some plugins (ping, for example) make collectd segfault several times a day.
- One of the most useful sources of info is the metrics from per-service cgroup hierarchies created by systemd - there you can compare resource usage of various user-space components, pinpointing exactly what caused the spikes on all the other graphs at some point in time.
- Second most useful info by far comes from logs, and while collectd has a damn powerful tail plugin, I still found it to be too limited or just too complicated to use, while simple log-tailing code does a better job and is actually simpler, thanks to a more powerful language than collectd's configuration. Same problem with the table plugin and /proc.
- There's still a need for a large chunk of post-processing code, plus something to push the values to carbon.
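To illustrate the disk utilization point above - a minimal sketch (field positions per Documentation/iostats.txt) of turning raw /proc/diskstats counters into an iostat-like %util value:

    import time

    def disk_util_percent(dev='sda', interval=5):
        # /proc/diskstats field 13 (1-based) is "time spent doing I/Os" in ms;
        # %util is its delta over wall-clock time, same thing iostat reports
        def io_ticks():
            with open('/proc/diskstats') as src:
                for line in src:
                    cols = line.split()
                    if cols[2] == dev: return int(cols[12])
            raise KeyError(dev)
        ts1, v1 = time.time(), io_ticks()
        time.sleep(interval)
        ts2, v2 = time.time(), io_ticks()
        return 100.0 * (v2 - v1) / ((ts2 - ts1) * 1000)

    print(disk_util_percent())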
So off with collectd for all the custom metrics.
Wrote a simple "while True: collect_and_send() && sleep(till_deadline);" loop in python, along with cgroup data collectors (there are even proper "block io" and "syscall io" per-service values!), a log tailer and a sysstat data processor (mainly for disk and network metrics, which have batshit-crazy values in collectd plugins).
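A stripped-down version of that loop might look like this (cgroup path and metric names here are just examples, the real thing reads a lot more files and handles errors) - collect values, push them to carbon over its plaintext protocol, sleep until the next deadline:

    import socket, time

    interval, carbon_addr = 60, ('localhost', 2003)

    def collect():
        # one per-service counter from the systemd cgroup hierarchy
        # (controller mountpoint / service path are just an example)
        with open('/sys/fs/cgroup/memory/system/nginx.service/memory.usage_in_bytes') as src:
            yield 'cgroups.nginx.memory.usage', int(src.read())

    def send(metrics, ts):
        # carbon's plaintext protocol: "metric.name value timestamp\n" per line
        sock = socket.create_connection(carbon_addr)
        for name, value in metrics:
            sock.sendall('{} {} {}\n'.format(name, value, int(ts)).encode())
        sock.close()

    while True:
        ts = time.time()
        send(collect(), ts)
        # sleep until the next deadline, not just sleep(interval),
        # so collection timestamps don't drift over time
        time.sleep(max(0, ts + interval - time.time()))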
Pity, gmond looked like a nice, solid and resilient thing otherwise.
I still like the idea of piping graphite metrics through AMQP (like rocksteady does), routing them not only to graphite, but also to some proper threshold-monitoring daemon like shinken (basically nagios, but distributed and more powerful), with alerts, escalations, trending and flapping detection, etc. Most of the existing solutions seem to use graphite and whisper directly though, which seems kinda wasteful.
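A hypothetical sketch of that routing idea with pika (exchange name and routing keys are made up): metrics go to a topic exchange, and both a carbon consumer and a threshold-checking consumer can bind to whatever keys they care about.

    import pika, socket, time

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()
    channel.exchange_declare(exchange='metrics', exchange_type='topic', durable=True)

    def publish(name, value, ts=None):
        # same plaintext "name value timestamp" payload, metric name as routing key,
        # so consumers can bind to e.g. "*.loadavg.#" or just "#"
        channel.basic_publish(
            exchange='metrics', routing_key=name,
            body='{} {} {}'.format(name, value, int(ts or time.time())))

    publish('{}.loadavg.1min'.format(socket.gethostname().replace('.', '_')), 0.42)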
Looking forward, I'm actually deciding between replacing collectd completely for the few most basic metrics it collects now, pulling them from sysstat or just /proc directly, or maybe integrating my collectors back into collectd as plugins, extending collectd-carbon as needed and using collectd's threshold monitoring and matches/filters to generate and export events to nagios/shinken... Somehow the first option seems more effort-effective, even in the long run, but then maybe I should just work more with collectd upstream, not hack around it.