Apr 19, 2011
Dec 29, 2010
Problem is, "watching stuff online" is even worse than just "watching stuff" - either you pay attention or you miss it. So I set up recording as a sort of fs-based cache: at the very least I can watch the recorded streams right as they're being written, with the ability to pause or rewind should I need to.
So, mplayer needs to be wrapped in a while loop; then you also need to give the dump files unique names to keep mplayer from overwriting them, and actually run several such loops for several streams at once (different halls, different talks, same time). Then, probably because of the strain on the streaming servers, mplayer tends to reconnect several times in a row, producing lots of small dumps which aren't really good for anything; you'd also like some feedback on what's happening... and so on.
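That naive wrapper amounts to something like the sketch below - a single hypothetical stream URL, timestamped dump names, and mplayer's -dumpstream/-dumpfile options; not the actual script, just an illustration of the approach and its problems:

import subprocess, time

stream_url = 'http://example.com/live/hall-a'  # hypothetical stream URL

while True:
    # Unique name per (re)connect, so mplayer won't overwrite earlier dumps
    dump = time.strftime('dump.%Y-%m-%d.%H%M%S')
    subprocess.call([ 'mplayer', '-really-quiet',
        '-dumpstream', '-dumpfile', dump, stream_url ])
    # mplayer exits on any disconnect - wait a bit, then reconnect
    time.sleep(5)

Multiply that by several streams and every reconnect leaves yet another small dump behind, with no feedback on which of them are worth keeping.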
Well, I just needed a better tool. It looks like there aren't many simple non-gui dumpers for video+audio streams, and not many python libs to connect to http video streams either - the existing one being vlc bindings, which probably isn't any better than mplayer, given that all I need is to dump a stream to a file, without any reconnection, rotation or multi-stream handling mechanism.
And now, back to the ongoing talks of day 3.
Sep 12, 2010
Thanks to feedjack, I'm able to keep in sync with 120 feeds (many of them, like slashdot or reddit, being aggregates themselves), as of today. That's quite a lot of stuff I couldn't even imagine handling a year ago, and a good aggregator definitely helps, keeping all the info just one click away.
And every workstation-based (desktop) aggregator I've seen is a fail:
- RSSOwl. Really nice interface and very powerful. That said, it eats more RAM than firefox! It hogs CPU until the whole system stutters, and eats more of it than every other app I use combined (yes, including firefox). Just keeping it in the background costs 20-30% of a dualcore CPU, and changing "show new" to "show all" kills the system ;)
- Liferea. Horribly slow, the interface hangs on any action (like fetching a feed "in the background"), hogs CPU just like RSSOwl, and isn't quite as feature-packed.
- Claws-mail's RSSyl. Quite nice resource-wise and very responsive, unlike the dedicated software (beats me why). Pity it's also very limited interface-wise and can't reliably keep track of many feeds by itself, constantly losing a few if closed improperly (most likely a claws-mail fault, since it affects stuff like nntp as well).
- Emacs' gnus and newsticker. Good for a feed or two, an epic fail in every way with more than a dozen of them.
- Various terminal-based readers. Simply intolerable.
Maybe someday I'll get around to using something like Google Reader, but it's still one hell of a mess, and it's no worse than similar web-based services out there. So much for the cloud services. *sigh*
Jun 05, 2010
If you're downloading stuff off the 'net via bt like me, then TPB is probably quite a familiar place to you.
So the easiest generic solution (which will, of course, work for rtorrent) I've found is to just process the .torrent files before feeding them to the leecher app. Since I download these exclusively via firefox, where I use FlashGot (not the standard "open with" interface, since I also use it to download large files on a remote machine without firefox, and afaik that's just not how "open with" works) to drop them into a torrent bin via a script, it's really just a matter of updating that link-receiving script.
So, to the script itself...
from urlparse import urlparse
from socket import gethostbyname, gaierror

# URL checker
def trak_check(trak):
    if not trak: return False
    try: ip = gethostbyname(urlparse(trak).netloc.split(':', 1)[0])
    except gaierror: return True # perhaps it will resolve later on
    else: return ip not in ('127.0.0.1', '0.0.0.0')

# Actual processing (bdecode/bencode are defined elsewhere in the script)
torrent = bdecode(torrent)
for tier in list(torrent['announce-list']):
    for trak in list(tier):
        if not trak_check(trak):
            tier.remove(trak)
            # print >>sys.stderr, 'Dropped:', trak
    if not tier: torrent['announce-list'].remove(tier)
# print >>sys.stderr, 'Result:', torrent['announce-list']
if not trak_check(torrent['announce']):
    torrent['announce'] = torrent['announce-list']
torrent = bencode(torrent)
I wanted to make it as light as possible, so there's none of the logging or optparse/argparse stuff I tend to cram everywhere, and the extra heavy imports like urllib/urllib2 are conditional as well. The only dependency is python (>=2.6) itself.
Basic use-case is one of these:
% brecode.py < /path/to/file.torrent > /path/to/processed.torrent

% brecode.py http://tpb.org/torrents/myfile.torrent > /path/to/processed.torrent

% brecode.py http://tpb.org/torrents/myfile.torrent \
    -r http://tpb.org/ -c "...some cookies..." -d /path/to/torrents-bin/
All the extra headers like cookies and referer are optional, and so is the destination path (a dir; the basename is generated from the URL). My use-case in FlashGot is this: "[URL] -r [REFERER] -c [COOKIE] -d /mnt/p2p/bt/torrents"
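The URL-handling part boils down to something like this - a simplified sketch, not the actual script (function and variable names here are mine), with the optional headers mapping to the -r/-c options above and the destination basename derived from the URL for -d:

import os, urllib2
from urlparse import urlparse

def fetch(url, referer=None, cookie=None):
    req = urllib2.Request(url)
    if referer: req.add_header('Referer', referer)
    if cookie: req.add_header('Cookie', cookie)
    return urllib2.urlopen(req).read()

url = 'http://tpb.org/torrents/myfile.torrent'
dst_dir = '/mnt/p2p/bt/torrents'  # whatever -d points at
# destination basename is taken from the URL path
dst = os.path.join(dst_dir, os.path.basename(urlparse(url).path))
open(dst, 'wb').write(fetch(url, referer='http://tpb.org/'))

The fetched (or stdin-supplied) data then goes through the announce-list processing shown earlier before being written out.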
May 08, 2010
Naturally, subscribing to something like an unfiltered flow of new music releases isn't an option, but no music site other than last.fm out there knows the acts in my collection well enough to track updates for them, and last.fm doesn't seem to have the functionality I need - just a list of new studio releases from the artists I've listened to beyond some reasonable threshold - or I just haven't been able to find it.
The first option seemed easier, but not as entertaining as the second, plus I have virtually no knowledge to pick a site best suited for that (and I'd hate to pick the first thing from google), and I'd probably have to post-process whatever that site feeds me anyway. So I've decided to stick with the second way.
Presenting an object-db as web content is a typical task for a web framework, so I chose django as a basic template for the task.
The next task is to compare two lists to drop the stuff I already have (local albums list) from the fetched list.
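With both lists normalized to (artist, album) pairs that's basically a set difference - a minimal sketch with made-up sample data, since the normalization is the only non-trivial part:

# Hypothetical (artist, album) pairs; real data comes from the local collection scan
local_albums = [('Some Artist', 'Some Album')]
fetched_releases = [('Some Artist', 'Some Album'), ('Some Artist', 'New Album')]

def norm(artist, album):
    # crude normalization; real-world matching needs a bit more than this
    return artist.strip().lower(), album.strip().lower()

local = set(norm(*pair) for pair in local_albums)
new_releases = [rel for rel in fetched_releases if norm(*rel) not in local]
# -> [('Some Artist', 'New Album')]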
After that comes db storage, and there's not much to implement there beyond a simple ORM model definition with a few unique keys - django takes care of the rest.
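A hedged sketch of what such a model might look like - field names and types here are mine, not the actual code, with uniqueness enforced across the fields that identify a release:

from django.db import models

class Release(models.Model):
    artist = models.CharField(max_length=128)
    title = models.CharField(max_length=256)
    date = models.DateField(null=True, blank=True)
    source_id = models.CharField(max_length=64)  # id in whatever db it came from
    seen = models.BooleanField(default=False)

    class Meta:
        unique_together = ('artist', 'title', 'source_id')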
The last part is the data representation.
One more thing I've thought of is caching - no need to strain free databases with redundant queries, so the only non-cached data from them is the release lists, which should be updated from time to time; the rest can be kept in a single "seen" set of ids, so it's immediately obvious whether a release was processed and queried before and is of no further interest.
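The "seen" set itself can live in any persistent store; a minimal sketch with shelve (the cache path, ids and processing step are placeholders, not the actual code):

import shelve

# Hypothetical fetched data - real ids come from the release db queries
fetched_releases = [{'id': 101, 'artist': 'some artist', 'album': 'new album'}]

cache = shelve.open('/tmp/release-cache')  # hypothetical cache path
seen = cache.get('seen', set())

for rel in fetched_releases:
    if rel['id'] in seen:
        continue  # queried and processed before, no further interest
    # ...fetch details, store via the ORM, etc...
    seen.add(rel['id'])

cache['seen'] = seen
cache.close()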
Apr 17, 2010
Git is a great help in these tasks, but what I feel is lacking there is, first, a common timeline (spanning both into the past and the future) for all these data series, and second, documentation.
A solution to the first one I've yet to find.
Documentation for anything beyond local implementation details and their history is virtually non-existent, and most times it takes a lot of effort and time to retrace the original line of thought, reasoning and purpose behind the stuff I've done (and why I've done it like that) in the past, often with considerable gaps and eventual re-invention of wheels and pitfalls I've already been through, due to faulty biological memory.
So, today I've decided to scour the available project and task management software to find something that ties vcs repositories and their logs to future tickets and some sort of expanded notes, where needed.
It just looks like a completely wrong approach for my task, yet I thought I could probably tolerate it if there were no better options - and then I stumbled upon Fossil VCS.
- Tickets and wiki, of course. Can be edited locally, synced, distributed, have local settings and appearance, based on tcl-ish domain-specific language.
- Distributed nature, yet a rational position of the authors on the centralization and synchronization topic.
- All-in-one-static-binary approach! Installing hundreds of git binaries on every freebsd-to-debian-based system was a pain, plus I've ended up with a 1.4-1.7 version span, and some features (like "add -p") depend on a whole lot of stuff there, like perl and a damn lot of its modules. Unix-way is cool, but this is really more portable and distributed-way-friendly.
- Repository in a single package, and not just a binary blob, but a freely-browsable sqlite db. It certainly is a hell of a lot more convenient than a path with over nine thousand sha1-named blobs, even if the actual artifact storage here is built basically the same way. And the performance should actually be better than the fs: with just index-selects, BTree-based sqlite is as fast as a filesystem, while keeping different indexes on the fs means sym-/hardlinking, which is a pain and never done right there.
- As simple as possible internal blob format.
- Actual symbolics and terminology. Git is a faceless tool; Fossil has some sort of style, and that's nice ;)
Yet there are some things I don't like about it:
- HTTP-only sync. In what kind of twisted world can that be better than ssh+pam or direct access? Can be fixed with a wrapper, I guess, but really, wtf...
- SQLite container around generic artifact storage. Artifacts are pure data with a single sha1 key each, which is simple, solid and easy to work with anytime, but wrapped into an sqlite db they suddenly depend on this db format, libs, command-line tool or language bindings, etc. All the other tables can be rebuilt from these blobs alone, so they should be as accessible as possible, but I guess that would violate the whole single-file design concept and require a lot of separate management code - a pity.
But that's nothing more than a few hours' tour of the docs and basic hello-world tests; I guess it'll all look different after I've used it for a while, which I intend to do right now. In the worst case it's just a distributed issue tracker + wiki with a cli interface and great versioning support in a one-file package (including the webserver), which is more than I can say about trac, anyway.
Apr 10, 2010
CREATE TABLE state_hoard.list_http_availability (
    id serial NOT NULL,
    target character varying(128) NOT NULL,
    "domain" character varying(128) NOT NULL,
    check_type state_hoard.check_type NOT NULL,
    "timestamp" numeric,
    source character varying(16),
    CONSTRAINT state_ha__id PRIMARY KEY (id),
    CONSTRAINT state_ha__domain_ip_check_type UNIQUE (target, domain, check_type)
);
It should probably be extended with other checks later on, so there's a check_type field with an enum like this:
CREATE TYPE state_hoard.check_type AS ENUM ('http', 'https');
Target (IP) and domain (hostname) are separate fields here, since dns data is not to be trusted, but the http request should carry the right Host field to be processed correctly.
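In poller terms that means connecting to the resolved IP while still sending the original hostname - a minimal sketch with httplib, where the IP/domain pair is a placeholder and the actual poller code is more involved:

import httplib

ip, domain = '203.0.113.10', 'example.com'  # hypothetical target/domain pair

conn = httplib.HTTPConnection(ip, timeout=10)
conn.request('GET', '/', headers={'Host': domain})
resp = conn.getresponse()
print resp.status, resp.reason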
CREATE OR REPLACE FUNCTION state_hoard.list_ha_replace()
    RETURNS trigger AS
$BODY$
DECLARE
    updated integer;
BEGIN
    -- Implicit timestamping
    NEW.timestamp := COALESCE(
        NEW.timestamp,
        EXTRACT('epoch' FROM CURRENT_TIMESTAMP) );

    UPDATE state_hoard.list_http_availability
        SET timestamp = NEW.timestamp, source = NEW.source
        WHERE domain = NEW.domain
            AND target = NEW.target
            AND check_type = NEW.check_type;

    -- Check if the row still need to be inserted
    GET DIAGNOSTICS updated = ROW_COUNT;
    IF updated = 0 THEN RETURN NEW;
    ELSE RETURN NULL;
    END IF;
END;
$BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100;

CREATE TRIGGER list_ha__replace
    BEFORE INSERT ON state_hoard.list_http_availability
    FOR EACH ROW
    EXECUTE PROCEDURE state_hoard.list_ha_replace();
From there I had two ideas on how to use this data and store immediate results, from the poller perspective:
- To replicate the whole table into some sort of "check-list", filling fields there as the data arrives.
- To create persistent linked tables with the polled data, which just gets replaced (on a unique-domain basis) with each new poll.
While the former looks appealing since it allows keeping state in the DB, not the poller, the latter provides persistent availability/delay tables, and that's one of the things I need.
CREATE TABLE state_hoard.state_http_availability (
    check_id integer NOT NULL,
    host character varying(32) NOT NULL,
    code integer,
    "timestamp" numeric,
    CONSTRAINT state_ha__check_host PRIMARY KEY (check_id, host),
    CONSTRAINT state_http_availability_check_id_fkey FOREIGN KEY (check_id)
        REFERENCES state_hoard.list_http_availability (id) MATCH SIMPLE
        ON UPDATE RESTRICT ON DELETE CASCADE
);

CREATE TABLE state_hoard.state_http_delay (
    check_id integer NOT NULL,
    host character varying(32) NOT NULL,
    delay numeric,
    "timestamp" numeric,
    CONSTRAINT state_http_delay_check_id_fkey FOREIGN KEY (check_id)
        REFERENCES state_hoard.list_http_availability (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE CASCADE
);
CREATE OR REPLACE FUNCTION state_hoard.state_ha_replace()
    RETURNS trigger AS
$BODY$
BEGIN
    -- Drop old record, if any
    DELETE FROM state_hoard.state_http_availability
        WHERE check_id = NEW.check_id AND host = NEW.host;

    -- Implicit timestamp setting, if it's omitted
    NEW.timestamp := COALESCE(NEW.timestamp, EXTRACT('epoch' FROM CURRENT_TIMESTAMP));

    RETURN NEW;
END;
$BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100;

CREATE TRIGGER state_ha__replace
    BEFORE INSERT ON state_hoard.state_http_availability
    FOR EACH ROW
    EXECUTE PROCEDURE state_hoard.state_ha_replace();
CREATE OR REPLACE FUNCTION state_hoard.state_hd_insert()
    RETURNS trigger AS
$BODY$
BEGIN
    -- Implicit timestamp setting, if it's omitted
    NEW.timestamp := COALESCE(
        NEW.timestamp,
        EXTRACT('epoch' FROM CURRENT_TIMESTAMP) );
    RETURN NEW;
END;
$BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100;

CREATE TRIGGER state_hd__insert
    BEFORE INSERT ON state_hoard.state_http_delay
    FOR EACH ROW
    EXECUTE PROCEDURE state_hoard.state_hd_insert();
After that comes the logging part, and what gets logged here is the http response codes.
CREATE TABLE state_hoard.log_http_availability (
    "domain" character varying(128) NOT NULL,
    code integer,
    "timestamp" numeric NOT NULL,
    CONSTRAINT log_ha__domain_timestamp PRIMARY KEY (domain, "timestamp")
);
The interval for these averages can be acquired via simple rounding, and it's convenient to have a single function for that, plus the step itself in retrievable form. With the 600-second step below, for example, any timestamp in the [1200, 1800) range gets mapped to 1800. "Immutable" here means that the results will be cached for each set of parameters.
CREATE OR REPLACE FUNCTION state_hoard.log_ha_step()
    RETURNS integer AS
'SELECT 600;'
    LANGUAGE 'sql' IMMUTABLE
    COST 100;

CREATE OR REPLACE FUNCTION state_hoard.log_ha_discrete_time(numeric)
    RETURNS numeric AS
'SELECT (div($1, state_hoard.log_ha_step()::numeric) + 1) * state_hoard.log_ha_step();'
    LANGUAGE 'sql' IMMUTABLE
    COST 100;
CREATE OR REPLACE FUNCTION state_hoard.log_ha_coerce()
    RETURNS trigger AS
$BODY$
DECLARE
    updated integer;
BEGIN
    -- Implicit timestamp setting, if it's omitted
    NEW.timestamp := state_hoard.log_ha_discrete_time(
        COALESCE( NEW.timestamp, EXTRACT('epoch' FROM CURRENT_TIMESTAMP) )::numeric );

    IF NEW.code = 200 THEN
        -- Successful probe overrides (probably random) errors
        UPDATE state_hoard.log_http_availability
            SET code = NEW.code
            WHERE domain = NEW.domain AND timestamp = NEW.timestamp;
        GET DIAGNOSTICS updated = ROW_COUNT;
    ELSE
        -- Errors don't override anything
        SELECT COUNT(*)
            FROM state_hoard.log_http_availability
            WHERE domain = NEW.domain AND timestamp = NEW.timestamp
            INTO updated;
    END IF;

    -- True for first value in a new interval
    IF updated = 0 THEN RETURN NEW;
    ELSE RETURN NULL;
    END IF;
END;
$BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100;

CREATE TRIGGER log_ha__coerce
    BEFORE INSERT ON state_hoard.log_http_availability
    FOR EACH ROW
    EXECUTE PROCEDURE state_hoard.log_ha_coerce();
The only thing left at this point is to actually tie this intermediate log-table to the state-table, and after-insert/update hooks are a good place for that.
CREATE OR REPLACE FUNCTION state_hoard.state_ha_log()
    RETURNS trigger AS
$BODY$
DECLARE
    domain_var character varying (128);
    code_var integer;
    -- Timestamp of the log entry, explicit to get the older one, checking for random errors
    ts numeric := state_hoard.log_ha_discrete_time(EXTRACT('epoch' FROM CURRENT_TIMESTAMP));
BEGIN
    SELECT domain FROM state_hoard.list_http_availability
        WHERE id = NEW.check_id INTO domain_var;
    SELECT code FROM state_hoard.log_http_availability
        WHERE domain = domain_var AND timestamp = ts INTO code_var;

    -- This actually replaces older entry, see log_ha_coerce hook
    INSERT INTO state_hoard.log_http_availability (domain, code, timestamp)
        VALUES (domain_var, NEW.code, ts);

    -- Random errors' trapping
    IF code_var != NEW.code AND (NEW.code > 400 OR code_var > 400) THEN
        code_var = CASE WHEN NEW.code > 400 THEN NEW.code ELSE code_var END;
        INSERT INTO state_hoard.log_http_random_errors (domain, code)
            VALUES (domain_var, code_var);
    END IF;

    RETURN NULL;
END;
$BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100;

CREATE TRIGGER state_ha__log_insert
    AFTER INSERT ON state_hoard.state_http_availability
    FOR EACH ROW
    EXECUTE PROCEDURE state_hoard.state_ha_log();

CREATE TRIGGER state_ha__log_update
    AFTER UPDATE ON state_hoard.state_http_availability
    FOR EACH ROW
    EXECUTE PROCEDURE state_hoard.state_ha_log();
From here the log will already get populated, but in a few days it'll reach millions of entries and counting, so it has to be aggregated, and the most efficient method for this sort of data seems to be keeping just the change-points for return codes, since those are quite rare.
"Random errors" are trapped here as well and stored to the separate table. They aren't frequent, so no other action is taken there.
The log-diff table is just that - code changes. The "code_prev" field is there for convenience, since I need to see whether there were any changes over a given period, and with it the rows give the complete picture.
CREATE TABLE state_hoard.log_http_availability_diff (
    "domain" character varying(128) NOT NULL,
    code integer,
    code_prev integer,
    "timestamp" numeric NOT NULL,
    CONSTRAINT log_had__domain_timestamp PRIMARY KEY (domain, "timestamp")
);
Updates to this table happen on a cron basis and are generated right inside the db - thanks to plpgsql for that.
LOCK TABLE log_http_availability_diff IN EXCLUSIVE MODE;
LOCK TABLE log_http_availability IN EXCLUSIVE MODE;

INSERT INTO log_http_availability_diff
    SELECT * FROM log_ha_diff_for_period(NULL, NULL)
        AS data(domain character varying, code int, code_prev int, timestamp numeric);

TRUNCATE TABLE log_http_availability;
And the diff-generation code:
CREATE OR REPLACE FUNCTION state_hoard.log_ha_diff_for_period(ts_min numeric, ts_max numeric)
    RETURNS SETOF record AS
$BODY$
DECLARE
    rec state_hoard.log_http_availability%rowtype;
    rec_next state_hoard.log_http_availability%rowtype;
    rec_diff state_hoard.log_http_availability_diff%rowtype;
BEGIN
    FOR rec_next IN
        EXECUTE 'SELECT domain, code, timestamp
            FROM state_hoard.log_http_availability'
            || CASE WHEN NOT (ts_min IS NULL AND ts_max IS NULL)
                THEN ' WHERE timestamp BETWEEN '||ts_min||' AND '||ts_max
                ELSE '' END
            || ' ORDER BY domain, timestamp'
    LOOP
        IF NOT rec_diff.domain IS NULL AND rec_diff.domain != rec_next.domain THEN
            -- Last record for this domain - skip unknown vals and code change check
            rec_diff.domain = NULL;
        END IF;

        IF NOT rec_diff.domain IS NULL THEN
            -- Time-skip (unknown values) addition
            rec_diff.timestamp = state_hoard.log_ha_discrete_time(rec.timestamp + 1);
            IF rec_diff.timestamp < rec_next.timestamp THEN
                -- Map unknown interval
                rec_diff.code = NULL;
                rec_diff.code_prev = rec.code;
                RETURN NEXT rec_diff;
            END IF;

            -- rec.code here should be affected by unknown-vals as well
            IF rec_diff.code != rec_next.code THEN
                rec_diff.code_prev = rec_diff.code;
                rec_diff.code = rec_next.code;
                rec_diff.timestamp = rec_next.timestamp;
                RETURN NEXT rec_diff;
            END IF;
        ELSE
            -- First record for new domain or whole loop (not returned)
            -- RETURN NEXT rec_next;
            rec_diff.domain = rec_next.domain;
        END IF;

        rec.code = rec_next.code;
        rec.timestamp = rec_next.timestamp;
    END LOOP;
END;
$BODY$
    LANGUAGE 'plpgsql' VOLATILE
    COST 100
    ROWS 1000;
Feb 28, 2010
Since I've put some two-day effort into creating a net-snmp snmpd extension, and had some free time to report a bug to the source of this inspiration, I thought I might as well save someone the trouble of re-inventing the wheel and publish it somewhere, since snmpd extensions definitely look like a dark area from the python perspective.
I've used sf.net as a project admin before, publishing some crappy php code for the hyscar project with pretty much the same reasons in mind, and I didn't like the experience much - cvs for code storage and a weird interface are among the reasons I can remember - but I'll gladly take all this criticism back: the current interface has by far exceeded all my expectations (well, perhaps they were too low in the first place?).
Putting up a full-fledged project page took me (a complete n00b at that) about half an hour, everything being as simple and obvious as it is, with the native-to-me git vcs, and even trac among the (numerous) features. A damn pleasant xp, making you wanna upload something else just for the sake of it ;)