When something is wrong and you look at the system, most often you'll see
that... well, it works. There's some cpu, disk, ram usage, some number of
requests per second on different services, some stuff piling up, something in
short supply here and there...
And there's just no way of telling what's wrong without answers to questions
like "so, what's the usual load average here?", "is the disk always loaded with
requests 80% of the time?", "is this much more requests than usual?", etc.
Otherwise you might be off on some wild chase just to find out that the load has
always been that high, or solve the mystery of some unoptimized code that's been
there for ages, without doing anything about the problem in question.
Historical data is the answer, and having used rrdtool with stuff like
(customized) cacti and snmpd (with some of my hacks on top) in the past, I was
overjoyed when I stumbled upon the graphite project at some point.
From then on, I've strived to collect as many metrics as possible, to be able to
look at the history of anything I want (and lots of values can be a reasonable
symptom for the actual problem), without any kind of limitations.
carbon-cache does magic by batching writes, and carbon-aggregator does a great
job at relieving you of having to push aggregate metrics along with granular
ones or sum all these on graphs.
Initially, I started using it with just collectd (and I'm still using it), but
there was a need for something to convert metric names to a graphite-friendly
hierarchy.
After looking over quite a few solutions for a collectd-carbon bridge, I decided
to use bucky, with a few fixes of my own and a quite large translation config.
Bucky can work anywhere, just receiving data from the collectd network plugin;
it understands collectd types and properly translates counter increments to N/s
rates. It also includes a statsd daemon, which is brilliant at handling data
from non-collector daemons and scripts, and a more powerful metricsd
implementation.
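For reference, statsd's wire protocol is trivial enough that any script can push
a counter in a couple of lines - a sketch, with "bucky_host" standing in for
wherever bucky runs, and 8125 being statsd's default udp port:

import socket

# fire-and-forget UDP datagram, format is "name:value|type" ("c" = counter)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b'some_script.events:1|c', ('bucky_host', 8125))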
The downside is that it's only maintained in forks, has bugs in less-used code
(like metricsd), and is quite resource-hungry (but can be easily scaled out).
There's also a kinda-official collectd-carbon plugin now (although I found it
buggy as well, not to mention much less featureful, but hopefully that'll be
addressed in future collectd versions).
Some of the problems I've noticed with such a collectd setup:
- Disk I/O metrics are godawful or just don't work - collected
  read/write metrics, for either processes or devices, are either
  zeroes, have weird values detached from reality (judging by actual
  problems and tools like atop), or are just useless.
- Lots of metrics for network and memory (vmem, slab) and from various
  plugins have naming inconsistent with linux /proc or documentation.
- Some useful metrics that are in, say, sysstat don't seem to work
  with collectd, like sensor data, nfsv4, and some paging and socket
  stats.
- Some metrics need non-trivial post-processing to be useful - disk
  utilization % time is one good example (see the sketch after this
  list).
- Python plugins leak memory on every returned value. Some plugins
(ping, for example) make collectd segfault several times a day.
- Some of the most useful info is the metrics from per-service cgroup
  hierarchies created by systemd - there you can compare resource
  usage of various user-space components, totally pinpointing what
  caused the spikes on all the other graphs at some time.
- The second most useful info by far is produced from logs, and while
  collectd has a damn powerful tail plugin, I still found it to be too
  limited or just too complicated to use, while simple log-tailing
  code does a better job and is actually simpler, due to a more
  powerful language than collectd configuration. Same problem with the
  table plugin and /proc.
- There's still a need for a large chunk of post-processing code and
  for pushing the values to carbon.
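For example, here's roughly what the disk-utilization post-processing mentioned
above boils down to - a sketch, not the actual collector ("sda" and the 60s
interval are arbitrary examples): the tenth stat field in /proc/diskstats is a
cumulative "ms spent doing I/O" counter, so its delta over the polling interval
gives the utilization percentage.

import time

def io_ms(dev):
    # tenth stat column of /proc/diskstats = ms spent doing I/O
    with open('/proc/diskstats') as src:
        for line in src:
            cols = line.split()
            if cols[2] == dev: return int(cols[12])

interval, last = 60, io_ms('sda')
while True:
    time.sleep(interval)
    cur = io_ms('sda')
    # counter delta (ms) over the interval (ms) -> utilization %
    print('disk.sda.util_pct %.1f' % (100.0 * (cur - last) / (interval * 1000)))
    last = cur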
Of course, I wanted to add systemd cgroup metrics, some log values and missing
(and just properly-named) /proc tables data, and initially I wrote a collectd
plugin for that. It worked, leaked memory, occasionally crashed (along with
collectd itself), used some custom data types, and had to have a chunk of
metric-name post-processing code in bucky...
Um, what the hell for, when sending a metric value directly takes just "echo
some.metric.name $val $(printf '%(%s)T' -1) >/dev/tcp/carbon_host/2003"?
So off with collectd for all the custom metrics.
Wrote a simple "while True: collect_and_send() && sleep(till_deadline);" loop in
python, along with the cgroup data collectors (there are even proper "block io"
and "syscall io" per-service values!), a log tailer and a sysstat data processor
(mainly for disk and network metrics, which have batshit-crazy values in
collectd plugins).
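Not the actual script, but the general shape of it - a sketch with a single
made-up cgroup collector (the nginx.service path is just an example, cgroup
hierarchy layout varies), pushing values to carbon's plaintext port:

import socket, time

def collect():
    # e.g. per-service cpu usage from a systemd-created cgroup
    with open('/sys/fs/cgroup/cpuacct/system/'
            'nginx.service/cpuacct.usage') as src:
        return [('services.nginx.cpu_ns', int(src.read()))]

interval = 60
while True:
    ts = int(time.time())
    sock = socket.create_connection(('carbon_host', 2003))
    for name, val in collect():
        sock.sendall('{} {} {}\n'.format(name, val, ts).encode())
    sock.close()
    # sleep till the next deadline, not just for a fixed interval
    time.sleep(interval - time.time() % interval)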
Another interesting data-collection alternative I've explored recently is
ganglia.
Redundant gmond collectors and aggregators, communicating efficiently over
multicast, are nice. It has support for python plugins, and is very easy to
use - pulling data from a gmond node network can be done with one telnet or nc
command, and it's fairly comprehensible xml, not some binary protocol.
nice feature is that it can re-publish values only on some significant changes
(where you define what "significant" is), thus probably eliminating traffic
for 90% of "still 0" updates.
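A sketch of such a pull, assuming gmond's default 8649 tcp port - it just dumps
the metric tree as XML to whoever connects ("gmond_host" is a placeholder):

import socket
from xml.etree import ElementTree

sock, buf = socket.create_connection(('gmond_host', 8649)), b''
while True:
    chunk = sock.recv(2**20)
    if not chunk: break
    buf += chunk
sock.close()

# the dump nests CLUSTER/HOST/METRIC elements, values in attributes
for metric in ElementTree.fromstring(buf).iter('METRIC'):
    print('%s %s' % (metric.get('NAME'), metric.get('VAL')))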
But as I found out while trying to use it as a collectd replacement (forwarding
data to graphite through amqp via custom scripts), there's a fatal flaw - gmond
plugins can't handle a dynamic number of values, so writing a plugin that
collects metrics from systemd services' cgroups, without knowing in advance how
many of these will be started, is just impossible.
Also, it has no concept of timestamps for values - it only has "current" ones,
making plugins like a "sysstat data parser" impossible to implement as well.
collectd, in contrast, has no constraint on how many values a plugin returns and
has timestamps, though with limitations on how far backwards they can go.
Pity, gmond looked like a nice, solid and resilient thing otherwise.
I still like the idea of piping graphite metrics through AMQP (like rocksteady
does), routing them there not only to graphite, but also to some proper
threshold-monitoring daemon like shinken (basically nagios, but distributed and
more powerful), with alerts, escalations, trending and flapping detection, etc,
but most of the existing tools seem to use graphite and whisper directly, which
seems kinda wasteful.
Looking forward, I'm actually deciding between replacing collectd completely for
the few most basic metrics it now collects, pulling them from sysstat or just
/proc directly, or maybe integrating my collectors back into collectd as
plugins, extending collectd-carbon as needed and using collectd threshold
monitoring and matches/filters to generate and export events to
nagios/shinken... somehow the first option seems to be more effort-effective,
even in the long run, but then maybe I should just work more with collectd
upstream, not hack around it.
I've delayed updating the whole libnotify / notification-daemon / notify-python
stack for a while now, because notification-daemon got too GNOME-oriented around
0.7, making it a lot simpler, but sadly dropping lots of good stuff I've used
there.
The default nice-looking theme is gone in favor of black blobs (although colors
are probably subject to gtkrc); it's one-note-at-a-time only, which makes
reading them intolerable; configurability was dropped as well - guess blobs
follow some gnome-panel settings now.
Older notification-daemon versions won't build with newer libnotify.
Same problem with notify-python, which seems to be unnecessary now, since its
functionality is accessible via introspection and PyGObject (the part known as
PyGI before the merge - gi.repositories.Notify).
Looking for more-or-less drop-in replacements, I've found the notipy project,
which looked like what I needed, and the best part is that it's python - no need
to filter notification requests in a proxy anymore, eliminating some associated
complexity.
The project has a bit different goals however, them being simplicity, fewer deps
and concept separation, so I incorporated (more-or-less) notipy as a simple
NotificationDisplay class into notification-proxy, making it into its own thing
(first name that came to mind, not that it matters).
All the rendering now is in python using the PyGObject (gi) / gtk-3.0 toolkit,
which seems to be a good idea, given that I still have no reason to keep Qt in
my system, and gtk-2.0 is obsolete.
Exploring newer Gtk stuff like css styling and auto-generated interfaces was
fun, although the whole mess seems to be much harder than expected. Simple
things like adding a border, margins or some non-solid background to existing
widgets seem to be very complex and totally counter-intuitive, unlike, say,
doing the same (even in totally cross-browser fashion) with html. I also failed
to find a way to just draw what I want on arbitrary widgets - looks like it was
removed (in favor of GtkDrawable) at some point.
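For illustration, the css machinery itself is simple enough via gi - a minimal
sketch of styling windows app-wide (the hard part is finding selectors and
properties that actually work on a given widget):

from gi.repository import Gtk, Gdk

css = Gtk.CssProvider()
css.load_from_data(b'GtkWindow { background-color: black;'
    b' border: 2px solid gray; }')
# attach the provider to the whole screen, overriding theme styling
Gtk.StyleContext.add_provider_for_screen(
    Gdk.Screen.get_default(), css,
    Gtk.STYLE_PROVIDER_PRIORITY_APPLICATION)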
My (uneducated) guess is that gtk authors geared toward a "one way to do one
thing" philosophy, but unlike the Python motto, they had to ditch the "one
*obvious* way" part. But then, maybe it's just me being too lazy to read the
docs.
Looking over the Desktop Notifications Spec in the process, I've noticed that
there are more good ideas that I'm not using, so I guess I might need to revisit
my local notification setup in the near future.
I've heard about how easy it is to control stuff with a parallel port, but
recently I've been asked to write a simple script to send repeated signals to
some hardware via lpt and I really needed some way to test whether signals are
coming through or not.
Googling around a bit, I've found that it's trivial to plug leds right into
the port and did just that to test the script.
Since it's trivial to control these leds, and they provide quite a simple way of
visual notification for otherwise headless servers, I've put together another
script to monitor system resource usage and indicate extra-high load with
various blinking rates.
Probably the coolest thing is that the parallel port on mini-ITX motherboards
comes as a standard "male" pin group, like usb or audio, with "ground" pins
exactly opposite the "data" pins, so it's easy to take a few leds (power, ide,
and they usually come in different colors!) from an old computer case and plug
these directly into the board.
Making leds blink actually involves actively switching data bits on the port in
an infinite loop, so I've forked one subprocess to do that, while another one
checks/processes the data and feeds led-blinking-interval updates to the first
one via a pipe.
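The blinker subprocess can be as simple as this sketch (assuming the pyparallel
module; the parent writes one float per line into the pipe to change the
blinking interval):

import select, sys, parallel

port, interval, bits = parallel.Parallel(), 1.0, 0
while True:
    bits ^= 0xff  # flip all 8 data pins at once
    port.setData(bits)
    # wait for either the next toggle or an interval update on stdin
    ready, _, _ = select.select([sys.stdin], [], [], interval)
    if ready: interval = float(sys.stdin.readline())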
System load data is easy to acquire from "/proc/loadavg" for cpu and as
"%util" percentage from "sar -d" reports.
And the easiest way to glue several subprocesses and a timer together into an
eventloop is twisted, so the script is basically 150 lines of sar output
processing, checks and blinking-rate settings.
Obligatory link to the source. Deps are python-2.7 and twisted.
Guess mail notifications could've been just as useful, but quickly-blinking leds
are more spectacular, and kind of a way to utilize legacy hardware capabilities
that these motherboards still have.
It's been a while since I augmented libnotify / notification-daemon stack to better
suit my (maybe not-so-) humble needs, and it certainly was an improvement, but
there's no limit to perfection and since then I felt the urge to upgrade it
every now and then.
One early fix was to let messages with priority=critical through without delays
and aggregation. I've learned it the hard way when my laptop shut down because
of a drained battery without me registering any notification about it.
Other good candidates for high priority seem to be real-time messages like track
updates and network connectivity loss events, which are either too important to
ignore or just don't make much sense after a delay. Implementation here is
straightforward - just check the urgency level and pass these unhindered to
notification-daemon.
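In the proxy's Notify handler it looks something like this (function names here
are made up, not the actual script; urgency levels 0/1/2 come from the spec's
hints table):

URGENCY_CRITICAL = 2  # per the Desktop Notifications spec

def notify(app_name, replaces_id, app_icon,
        summary, body, actions, hints, timeout):
    if hints.get('urgency') == URGENCY_CRITICAL:
        return display_now(summary, body)  # skip all rate-limiting
    return enqueue(summary, body)  # subject to batching into digests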
Another important feature which seems to be missing in the reference daemon is
the ability to just clean the screen of notifications. Sometimes you just need
to dismiss them to see the content beneath, or you've just read them and don't
want them drawing any more attention.
The only available interface for that seems to be the CloseNotification method,
which can only close a notification using its id, hence it's only useful from
the application that created the note. Kinda makes sense to keep apps from
stepping on each others' toes, but since the ids in question are sequential, it
won't be much of a problem for an app to abuse this mechanism anyway.
The proxy script, sitting in the middle of dbus communication as it is, doesn't
have to guess these ids, as it can just keep track of them.
So, to clean up the occasional notification-mess, I extended the
CloseNotification method to accept 0 as a "special" id, closing all the
currently-displayed notifications at once.
Binding it to a key is just a matter of (a bit inelegant, but powerful)
dbus-send tool invocation:

% dbus-send --type=method_call \
    --dest=org.freedesktop.Notifications \
    /org/freedesktop/Notifications \
    org.freedesktop.Notifications.CloseNotification uint32:0
Expanding on the idea of occasional distraction-free needs, I found the idea of
being able to "plug" the notification system - collecting the notifications into
the same digest behind the scenes (yet passing urgent ones, if this behavior is
enabled) when necessary - quite appealing, so I just added a flag akin to the
"fullscreen" check, forcing notification aggregation regardless of rate when
it's set.
Of course, some means of control over this flag was necessary, so another
extension of the interface was to add a "Set" method to control
notification-proxy options. The method was also useful for occasionally toggling
the special "urgent" messages treatment, so I empowered it to do that as well,
by making it accept a key-value array of parameters to apply.
And since now there is a plug, I also found it handy to have a complementary
"Flush" method to dump the last digested notifications.
The same handy dbus-send tool comes to the rescue again when these need to be
toggled or set via cli:

% dbus-send --type=method_call \
    --dest=org.freedesktop.Notifications \
    /org/freedesktop/Notifications \
    org.freedesktop.Notifications.Set ...
In contrast to cleanup, I occasionally found myself monitoring low-traffic IRC
conversations entirely through notification boxes - no point switching apps if
you can read whole lines right there - but there was a catch, of course: you
have to immediately switch attention from whatever you're doing to a
notification box to be able to read it before it times out and disappears, which
is quite inconvenient.
The easy solution is to just override the "timeout" value in notification boxes
to make them stay as long as you need them to, so one more flag for the "Set"
method to handle, plus a one-liner check, and there it is.
Now it's possible to read them with minimum distraction from the current
activity and dismiss them via the extended CloseNotification method mentioned
above.
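That check is about as simple as it sounds - something like this in the proxy's
Notify handler ("no_timeout" is a made-up flag name; expire_timeout=0 means
"never expire" per the spec):

# keep notes on-screen until explicitly dismissed, if the flag is set
if opts.get('no_timeout'): expire_timeout = 0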
As if the above was not enough, sometimes I found myself willing to read and
react to the stuff from one set of sources, while temporarily ignoring the
traffic from the others - like when you're working at some hack, discussing it
(and the current implications / situation) in parallel over jabber or irc, while
a heated (but interesting nonetheless) discussion starts in another channel.
Shutting down the offending channel in ERC, leaving BNC to monitor the
conversation, or just suppressing notifications with some ERC command would
probably be the right way to handle that, yet it's not always that simple,
especially since every notification-enabled app would then have to implement
some way of doing that, which of course is not the case at all.
The remedy is in customizable filters for notifications, which can be a simple
set of regexes dumped into some specific dot-file, but even as I started to
implement the idea, I could think of several different validation scenarios,
like "match summary against several regexes", "match message body", "match a
simple regex with a list of exceptions", or even some counting and more complex
logic for them.
The idea of inventing yet another perlish (poorly-designed, minimal, ambiguous,
write-only) DSL for filtering rules didn't strike me as an exactly bright one,
so I went looking for some lib implementation of a clearly-defined and
thought-through syntax for such needs, yet found nothing designed purely for
such a filtering task (could be one of the reasons why every tool and daemon
hard-codes its own DSL for that *sigh*).
On that note, I thought of some generic yet easily extensible syntax for such
rules, and came to the realization that a simple SICP-like subset of scheme/lisp
with regex support would be exactly what I need.
Luckily, there are plenty of implementations of such embedded languages in
python, and since I needed a really simple and customizable one, I've decided to
stick with an extended version of the 90-line "lis.py", described by Peter
Norvig here. Out goes unnecessary file-handling, in go regexes and some minor
fixes, and the result is a "make it into whatever you like" language core.
I just added a stat and mtime check on a dotfile, reading and compiling the
matcher-function from it on any change. Contents may look like this:
(define-macro define-matcher (lambda
  (name comp last rev-args)
  `(define ,name (lambda args
    (if (= (length args) 1) ,last
      (let ((atom (car args)) (args (cdr args)))
        (,comp
          (~ ,@(if rev-args '((car args) atom) '(atom (car args))))
          (apply ,name (cons atom (cdr args))))))))))
(define-matcher ~all and #t #f)
(define-matcher all~ and #t #t)
(define-matcher ~any or #f #f)
(define-matcher any~ or #f #t)
(lambda (summary body) (or
  (and
    (~ "^erc: #\S+" summary)
    (~ "^\*\*\* #\S+ (was created on|modes:) " body))
  (all~ summary "^erc: #pulseaudio$" "^mail:")))
Which kinda shows what you can do with it, making your own syntax as you go
along (note that stuff like "and" is also a macro, just defined on a higher
level).
Even with weird macros, I find it much more comprehensible than rsync filters,
apache/lighttpd rewrite magic or pretty much any pseudo-simple magic set of
string-matching rules I've had to work with.
I considered using python itself to the same end, but found that its syntax is
both more verbose and less flexible/extensible for such a goal, plus it allows
doing far too much for a simple filtering script, which can potentially be
evaluated by a process with elevated privileges, hence would need some sort of
sandboxing anyway.
In my case, all this stuff is bound to convenient key shortcuts via the fluxbox
wm keys file:

# Notification-proxy control
Print :Exec dbus-send --type=method_call ...
Shift Print :Exec dbus-send --type=method_call ...
Pause :Exec dbus-send --type=method_call ...
Shift Pause :Exec dbus-send --type=method_call ...
Pretty sure there's more room for improvement in this aspect, so I'd have to
extend the system once again, which is fun all by itself.
The resulting (and maybe further extended) script is here, now linked against a
bit-revised lis.py scheme implementation.
Everyone who uses an OSS desktop these days has probably seen libnotify magic in
action - small popup windows that appear at some corner of the screen,
announcing events from other apps.
libnotify itself, however, is just a convenience lib for dispatching these
notifications over dbus, so the latter can pass them to an app listening on this
interface, or even start it beforehand.
The standard app for rendering such messages is notification-daemon, which is
developed alongside libnotify, but there are drop-in replacements as well. In
the dbus rpc mechanism, call signatures are clearly defined and visible, so it's
pretty easy to implement a replacement for the aforementioned daemons; plus
vanilla notification-daemon has introspection calls and dbus itself can be
easily monitored (via the dbus-monitor utility), which makes such an
implementation even more straightforward.
Now, polling every window for updates manually is quite inefficient - new mail,
xmpp messages, IRC chat lines, system events etc sometimes arrive every few
seconds, and going over all the windows (and by that I mean workspaces where
they're stacked) just to check them is a huge waste of time, especially when
some (or even most, in the case of IRC) of these are not really important.
Either response time or focus (and, in the extreme case, sanity) has to be
sacrificed in such an approach. Luckily, there's another way to monitor this
stuff - small pop-up notifications allow you to see what's happening right away,
w/o much attention-switching or work required from an end-user.
But that's the theory.
In practice, I've found that enabling notifications in IRC or jabber is pretty
much pointless, since you'll be swarmed by these as soon as any real activity
starts there. And w/o them it's the stupid wasteful polling practice mentioned
above.
Notification-daemon has no tricks to remedy the situation, but since the whole
thing is so abstract and transparent, I've had no problem making my own fix.
The solution I came up with is to batch the notification messages into digests
as soon as there are too many of them, displaying such digest pop-ups at some
time interval, so I can keep a grip on what's going on just by glancing at these
as they arrive, switching my activities if something there is important enough.
Having played with schedulers and network shaping/policing before, not much
imagination was required to devise a way to control the message flow rate.
I chose the token-bucket algorithm at first, but since a prolonged flood of
I-don't-care-about activity has gradually decreasing value, I didn't want to
receive digests of it every N seconds, so I batched it with a gradual digest
interval increase, making sure digests won't get too fat over these intervals.
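For reference, the general shape of a token bucket - a bare-bones sketch, not
the tuned implementation described below, which also adjusts flow on sustained
over/underflow:

import time

class TokenBucket(object):
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst  # tokens/s, bucket size
        self.tokens, self.ts = burst, time.time()
    def consume(self):
        now = time.time()
        # top up with tokens accumulated since the last check
        self.tokens = min(self.burst,
            self.tokens + (now - self.ts) * self.rate)
        self.ts = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True  # under the rate: display immediately
        return False  # over the rate: queue into a digest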
Well, the result exceeded my expectations, and now I can use libnotify freely,
even to indicate that some rsync just finished in a terminal on another
workspace. Wonder why such stuff isn't built into existing notification daemons.
Then, there was another, even more annoying issue: notifications during
fullscreen apps! WTF!?
Wonder if everyone just got used to this ugly flickering in fullscreen mplayer,
or huge lags in GL games like SpringRTS, or I'm just re-inventing the wheel
here, since it's done in gnome or kde (knotify, huh?), but since I'm not gonna
use either one, I just added a fullscreen-app check before displaying output,
queueing notifications to a digest if that is the case.
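The check itself can be done via EWMH window properties - a sketch using
python-xlib (the actual script may do it differently):

from Xlib import display, Xatom

def fullscreen_active():
    dpy = display.Display()
    root = dpy.screen().root
    # id of the currently focused window
    win_id = root.get_full_property(
        dpy.intern_atom('_NET_ACTIVE_WINDOW'), Xatom.WINDOW).value[0]
    win = dpy.create_resource_object('window', win_id)
    state = win.get_full_property(
        dpy.intern_atom('_NET_WM_STATE'), Xatom.ATOM)
    fs = dpy.intern_atom('_NET_WM_STATE_FULLSCREEN')
    return bool(state) and fs in state.value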
Ok, a few words about implementation.
The token bucket itself is based on an activestate recipe, with some heavy
improvements to adjust flow on constant under/over-flow, plus a bit more
pythonic style and features - take a look here. The leaky bucket is implemented
by this class.
The main dbus magic, however, lies outside the script, since dbus calls cannot
be intercepted, and the scheduler can't get 'em with notification-daemon already
listening on this interface.
The solution is easy, of course - the scheduler can replace the real daemon and
proxy mangled calls to it as necessary. It takes this sed line for
notification-daemon as well, since the interface name is hard-coded there.
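The gist of the replacement trick, sketched with dbus-python: claim the
well-known bus name, handle Notify yourself, and forward to the real (renamed)
daemon when appropriate. The "Scheduled" name below is just an example stand-in
for whatever that sed line patches in:

import dbus, dbus.service
from dbus.mainloop.glib import DBusGMainLoop
from gi.repository import GLib

DBusGMainLoop(set_as_default=True)
bus = dbus.SessionBus()

class Scheduler(dbus.service.Object):
    @dbus.service.method('org.freedesktop.Notifications',
            in_signature='susssasa{sv}i', out_signature='u')
    def Notify(self, app, nid, icon, summary, body, acts, hints, timeout):
        # rate-limiting / digest logic goes here, then forward
        daemon = bus.get_object('org.freedesktop.NotificationsScheduled',
            '/org/freedesktop/Notifications')  # the renamed real daemon
        return daemon.Notify(app, nid, icon, summary, body, acts, hints,
            timeout, dbus_interface='org.freedesktop.NotificationsScheduled')

Scheduler(dbus.service.BusName('org.freedesktop.Notifications', bus),
    '/org/freedesktop/Notifications')
GLib.MainLoop().run()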
It needs the fgc module, but it's just a hundred lines of meaningful code.
One more step to making linux desktop more comfortable. Oh, joy ;)