LeoNerd's programming thoughts: 2011

2011/12/13

Perl - more Socket

Some of you may have noticed that I'm now maintaining Socket dual-life on CPAN. Recently uploaded is version 1.96, which is an extraction of what was in bleadperl, updated to support building out-of-core on versions of perl back to at least 5.10.0, and some nicely rewritten documentation.

My plans for 1.97 are to neaten up the build system a bit more. Currently it's a rather hastily-written set of support code to handle the dual-life nature of it, so it can build on CPAN outside of core. Once that is in place, it will be much easier to add support for new features.

At this point I'll be starting to take more ideas from around CPAN. What constants or structure-handling functions need adding? What socket options are being used in practice? Are there new protocol families it doesn't yet support, e.g. PF_RFCOMM (Bluetooth)? If anyone has any feature requests, now would be an excellent time to get them to me. :)

2011/11/24

No longer thinking about Perl 5.8

This week I got around to rebuilding all my local Perl module packages for the Perl 5.14.1 now in debian testing. Because that replaced the 5.12.4 that was previously running, I've rebuilt that into my homedir with perlbrew.

Added to the 5.10.1 I previously had, that makes one system perl and two custom-installed perls. I try to remember to test everything on all three before releasing it to CPAN.

At this point, I can't really be bothered to look after 5.8 as well, so effectively I'm no longer testing anything on 5.8. I'll still keep an eye on smoke-test results and see if they're passing or failing, and if I see a failure that looks easy to fix I might have a go at it. But no promises now.

If I manage to break anything on 5.8 in a module of mine and anyone actually cares about it, feel free to raise me a bug on rt.cpan.org. But apart from that, I hope I don't have to think too much about 5.8 any more.

2011/11/15

LPW2011 Talk Slides

If anyone wants to see the slides of the talks I co-presented with Tom this year at LPW, they're at

Though, at least the latter talk was mostly a series of code demos, so the slides alone won't make much sense.

2011/11/03

Perl - Tiny lightweight structures module

$ cat struct.pm 
package struct;

use strict;
use warnings;

sub import
{
   shift;
   my ( $name, $fields ) = @_;
   my $caller = caller;

   my %subs;
   foreach ( 0 .. $#$fields ) {
      my $idx = $_;
      $subs{$fields->[$idx]} = sub :lvalue { shift->[$idx] };
   }

   my $pkg = "struct::impl::$name";

   no strict 'refs';
   *{$pkg."::$_"} = $subs{$_} for keys %subs;
   *{$caller."::$name"} = sub { bless [ @_ ], $pkg };
}

1;

$ perl
use struct Point => [qw( x y )];
my $p = Point(10,20);
printf "Point is at (%d,%d)\n", $p->x, $p->y;
$p->x = 30;
printf "Point is now at (%d,%d)\n", $p->x, $p->y;
$p->z = 40;

__END__
Point is at (10,20)
Point is now at (30,20)
Can't locate object method "z" via package "struct::impl::Point" at - line 6.

It's specifically and intentionally not an object class. You cannot subclass it. You cannot provide methods. You cannot apply roles or mixins or metaclasses or traits or antlers or whatever else is in fashion this week.

On the other hand, it is tiny, single-file, creates cheap lightweight array-backed structures, uses nothing outside of core. And I defy anyone to even measure its startup overhead with a repeatable benchmark.

It's intended simply to be a slightly nicer way to store internal data structures, where otherwise you might be tempted to abuse a hash, complete with the risk of typoing key names.

Would anyone use this, if it were available?

2011/10/26

Perl - Dual-life Socket

I've been working on how to dual-life the Socket module, so it can go on CPAN and give older Perl versions the benefits of recent additions; namely getaddrinfo(3)/getnameinfo(3) wrappings and other IPv6 support.
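As a rough illustration of what those wrappings look like from Perl, here is a minimal sketch using the getaddrinfo and getnameinfo functions exported by Socket. It assumes a Socket recent enough to provide them, and it uses a numeric address with AI_NUMERICHOST purely so that it runs without DNS or network access.

```perl
use strict;
use warnings;
use Socket qw(
   getaddrinfo getnameinfo
   SOCK_STREAM AI_NUMERICHOST NI_NUMERICHOST NI_NUMERICSERV
);

# Resolve a numeric address so the example needs no DNS or network access.
my ( $err, @results ) = getaddrinfo( "127.0.0.1", "80",
   { socktype => SOCK_STREAM, flags => AI_NUMERICHOST } );
die "getaddrinfo failed - $err\n" if $err;

# Each result is a hash with family, socktype, protocol and a packed addr.
my $first = $results[0];

# Turn the packed address back into printable host and service strings.
my ( $nerr, $host, $service ) =
   getnameinfo( $first->{addr}, NI_NUMERICHOST | NI_NUMERICSERV );
die "getnameinfo failed - $nerr\n" if $nerr;

print "Resolved to $host port $service\n";
```

Because the caller only ever handles the packed addr and the family/socktype/protocol triple, the same code shape works unchanged for IPv6 addresses.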

I have what's currently marked as an unauthorized release now on CPAN, though I'm hoping to get co-maint on it to release officially. In the meantime, it'd be useful to get some smoketest reports on what works and what doesn't. I've tested on Linux at perl 5.10.1, 5.12.4 and 5.14.1. Other OSes, especially MSWin32, would be most appreciated.

http://search.cpan.org/~pevans/Socket-1.94_03/

2011/09/30

libvterm/pangoterm and Tickit

Lately I've written a bit about my terminal emulator library, libvterm, and briefly mentioned Tickit, my terminal UI module for Perl. I'll write about each in more detail soon, but I thought since I hit an important milestone recently, I'd write a little something more about libvterm and pangoterm.

libvterm is a purely abstract C99 library that implements the bulk of the logic of being a terminal emulator. Bytes from the PTY master are fed into it by the containing program, and it maintains the abstract state of the terminal; the position of the cursor, the state of the pen, what characters are where with what attributes, and so on. It calls callback functions registered by the containing program, to inform it of damaged screen regions that need repainting. Two of the main selling points of the library are:

It is purely abstract C99, doesn't rely on POSIX or any particular rendering/UI system

During normal operation of just feeding it bytes and processing events, it does not use the malloc system.

These properties make it ideal for a number of situations, ranging from desktop applications, to small embedded systems or operating system kernels.

pangoterm is a GTK/Pango-driven embedding of this library, in a simple single-.c-file application, mostly for me to develop and test it. It is currently maintained in the libvterm source tree.

For a while now this combination has been complete enough to drive vim sufficiently well to edit its own source code - pangoterm and libvterm are now self-hosting. A couple of weeks ago I finally managed to fix the last of a number of small issues making it not quite perfect. Last week I also managed to get pangoterm to completely correctly render a Tickit-based program; the final missing piece being some of the mouse tracking modes.

I now have a bit of extra configuration in my .vimrc to take advantage of a few of pangoterm's abilities, such as support for italics.

As well as italics, it also supports strikethrough and alternative fonts, although so far I've only managed to find one alternative font that actually looks at all decent alongside DejaVu Sans Mono. These are all shown off quite well by Tickit's demo-pen.pl example script here.

And finally here, a demo of the xterm-like 256 colour handling.

These screenshots briefly show Tickit working nicely with pangoterm. Sometime soon I shall get around to writing about Tickit in more detail, and also explaining some of my further plans for the whole Tickit+libtermkey vs libvterm combination.

2011/09/28

Perl - Sentinel - version 0.02

I have just uploaded the first (non-development) release of Sentinel; version 0.02.

It provides a little helper function that creates lvalues, suitable for lvalue typed accessors that want to run real code on assignment (such as for type checking or coercion, or update triggers), rather than just update a scalar.

use Sentinel;
 
sub foo :lvalue
{
   my $self = shift;
   sentinel get => sub { return $self->get_foo },
            set => sub { $self->set_foo( $_[0] ) };
}

It makes the following two lines equivalent

my $obj = Some::Object->new;

$obj->set_foo( 100 );
$obj->foo = 100;
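To give a feel for what such an lvalue does without any XS, here is a pure-Perl sketch that gets a similar effect from a tied scalar. It is not how Sentinel itself is implemented; the TiedProxy package and the integer check in set_foo are made up for illustration, and it assumes a reasonably recent perl.

```perl
use strict;
use warnings;

# A tied scalar whose reads and writes are forwarded to two callbacks.
package TiedProxy;
sub TIESCALAR { my ( $class, %args ) = @_; return bless {%args}, $class }
sub FETCH     { return $_[0]{get}->() }
sub STORE     { return $_[0]{set}->( $_[1] ) }

# A hypothetical class with ordinary get/set methods plus an lvalue accessor.
package Some::Object;
sub new     { return bless { foo => 0 }, shift }
sub get_foo { return $_[0]{foo} }
sub set_foo
{
   my ( $self, $value ) = @_;
   die "foo must be an integer\n" unless $value =~ m/^-?\d+$/;
   $self->{foo} = $value;
}

sub foo :lvalue
{
   my $self = shift;
   # Assigning to the returned scalar triggers STORE, and so set_foo
   tie my $proxy, "TiedProxy",
      get => sub { $self->get_foo },
      set => sub { $self->set_foo( $_[0] ) };
   $proxy;
}

package main;

my $obj = Some::Object->new;
$obj->foo = 100;                          # runs set_foo( 100 )
print "foo is ", $obj->foo, "\n";

my $ok = eval { $obj->foo = "abc"; 1 };   # set_foo rejects this
my $error = $@;
print "assignment rejected: $error" unless $ok;
```

Creating a tie on every call is comparatively heavyweight, which is one reason to prefer doing the job in XS.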

This is rare among my modules, more or less the first thing I've written mostly as a result of ranting about it on #perl, rather than because I actually wanted it. My argument kept being that if anyone does want an lvalue accessor, it's trivially easy to write one given some function sentinel as in this example, and that actually implementing the sentinel function itself isn't hard; it's a small piece of obvious XS magic.

So here it is.

It has to cheat somewhat on versions of Perl before 5.14, because lvalue context there is not as powerful as it has since become. Long story short - there may be unit-test failures on some older versions of Perl; but I can't tell yet because the CPAN Testers web frontend is still down, so I can't see the smokers. If anyone sees any version-related failures, please do let me know.

Edit 2011/11/29: By the power of crowdsourced smoke testing, it appears this does work stably across many Perl versions - thanks all who've informed me.

2011/08/30

Perl - Term::TermKey - version 0.09

Last week saw a new version of Term::TermKey (0.09), and also the underlying libtermkey (0.9). This contains a fairly small new feature, giving control of the way that EINTR is handled.

Version 0.8 added graceful handling of EINTR to restart IO operations rather than fail with an error. This had unfortunate knock-on effects for the Perl-level wrapping of it, because of the deferred nature of Perl's safe signals. On a signal (such as the not-so-unlikely SIGWINCH) a flag would be set, but the termkey_waitkey(3) operation would be restarted, not returning back to Perl's control until a keypress event was actually received. This upset programs that wish to respond to SIGWINCH and redraw the terminal to the new size.

Version 0.9 of libtermkey therefore now has a new flag, TERMKEY_FLAG_EINTR, whose presence makes the blocking IO operations (termkey_waitkey(3) and termkey_advisereadable(3)) return a new result code, TERMKEY_RES_ERROR, after which the caller can inspect the value of errno to observe an EINTR.

Again because of Perl's safe signal handling, the Perl wrapping of libtermkey has to always enable this flag, so it can invoke the $SIG{WINCH} signal handler, for example. The Term::TermKey module therefore now always sets TERMKEY_FLAG_EINTR on the underlying TermKey instance, and emulates the presence or absence of this flag at the Perl level, by optionally restarting its IO operation, or itself returning TERMKEY_RES_ERROR.
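The restart-or-return emulation can be pictured with a small sketch. This is not the code in Term::TermKey; wait_with_eintr_policy and the fake read operation are hypothetical stand-ins, with undef playing the part of TERMKEY_RES_ERROR.

```perl
use strict;
use warnings;
use Errno qw( EINTR );

# Hypothetical helper: call $op until it yields a defined result. On EINTR,
# either restart the operation or hand control back, depending on the flag.
sub wait_with_eintr_policy
{
   my ( $op, $return_on_eintr ) = @_;
   while(1) {
      my $result = $op->();
      return $result if defined $result;
      return undef if $! != EINTR;        # a real error: always report it
      return undef if $return_on_eintr;   # flag set: caller sees the EINTR
      # flag clear: go round again and restart the operation
   }
}

# A stand-in for a blocking read that is interrupted twice before succeeding.
my $interruptions = 2;
my $fake_read = sub {
   if( $interruptions-- > 0 ) { $! = EINTR; return undef }
   return "keypress";
};

# With the flag set, the first interruption is reported to the caller
my $first = wait_with_eintr_policy( $fake_read, 1 );
my $saw_eintr = ( !defined $first ) && ( $! == EINTR );

# With the flag clear, the remaining interruption is silently retried
my $second = wait_with_eintr_policy( $fake_read, 0 );

print "first call was interrupted\n" if $saw_eintr;
print "second call returned $second\n";
```

In the real module the point of coming back to Perl between attempts is that deferred signal handlers such as $SIG{WINCH} get a chance to run before the operation is restarted.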

An unfortunate bug here in the emulation and hiding of this flag from the Perl level means that Term::TermKey 0.09 fails to correctly read the TermKey flags back out of the underlying object. In particular it fails to be able to check on the presence of TERMKEY_FLAG_UTF8 that libtermkey itself may have enabled, after detecting a UTF-8 locale. This causes Tickit's unit tests to break with the familiar "Wide character in syswrite at ..." error.

This bug has now been fixed in the source code repository, and will be present in the next version, 0.10. Tickit 0.10 also has an independent fix for the same bug, by using Perl's ${^UTF8LOCALE} instead of reading the TermKey flags back out again.

Also upcoming in libtermkey 0.10, will be some Solaris portability fixes, and a new canonicalisation flag, which turns a TERMKEY_SYM_DEL key into TERMKEY_SYM_BACKSPACE, for those terminals that send DEL on Backspace.

2011/08/27

IO::Async and AnyEvent

I've recently been working on a new IO::Async loop implementation, IO::Async::Loop::AnyEvent. Like the Glib and POE loops, this one uses another event system as its underlying implementation; AnyEvent.

What makes this one a little different and noteworthy, is that AnyEvent claims not to be an event system as such, but rather a compatibility layer on top of event systems. I have so far resisted writing this particular loop implementation on the grounds that, since AnyEvent just applies a cross-compatibility layer on top of some other existing event system anyway, IO::Async might as well use that underlying event system directly. In practice, however, this doesn't always work out; code already exists. Sometimes that existing code uses AnyEvent directly, making it harder to drop a small section of IO::Async-based code inside it.

In that situation, an individual module, function, etc. can construct itself an IO::Async::Loop::AnyEvent, to allow that component to use IO::Async-based functionality, while still interacting correctly with the rest of the AnyEvent-based program.

It is not primarily intended that this module be used as the basis of an entire program; mostly because a neater solution for mixed IO::Async+AnyEvent exists in the form of AnyEvent::Impl::IOAsync.

There are still some minor issues to deal with currently; most notably that because each of IO::Async and AnyEvent could use the other as an event source, there are some circularity problems when AnyEvent picks AnyEvent::Impl::IOAsync to use, when IO::Async::Loop::AnyEvent loads it.

Deep recursion on subroutine "AnyEvent::Impl::IOAsync::io" at /home/leo/src/perl/IO-Async-Loop-AnyEvent/blib/lib/IO/Async/Loop/AnyEvent.pm line 103.

Hmmmmm.... A little work there still needed :)

2011/07/31

Perl - Term::Terminfo - version 0.06

I've just uploaded a new version of Term::Terminfo. In brief; this is a small wrapper around the terminfo database, and can be used to enquire about properties of a given terminal. For annoying historic reasons, each of these properties is known by at least two names; its short "capname", a two or three letter code often found in the actual file on disk, and a longer "varname", the name of a variable in the curses C library, which stores its value for the current terminal.

This latest version, 0.06, adds a whole duplicate set of methods - varname accessors. Prior to 0.06, the properties were only accessible using their short capnames.

It's probably best to use the longer varnames anyway. They are more self-documenting, and a little more future-proof against the admittedly-remote possibility of new variables being added in the future.

I'm also planning to support unibilium in a later version. This is a standalone terminfo-parsing library, which is useful for reading terminfo data without linking against the full curses library. This is especially useful when trying to build a replacement for curses...

2011/07/15

XS beats Pure Perl

Someone reported some test failures trying to install Tickit, which seemed to be related to shortcomings in Text::CharWidth. The latter seems to have very poor unit test coverage on itself, so the failures didn't appear during its installation, only when Tickit::Utils was tested against it. On initial inspection I wondered if Text::CharWidth simply wasn't using wcswidth(3) correctly, and whether I should get around to my plan of rewriting bits of Tickit::Utils in XS instead for performance, as well as work around this bug.

This turned out to be quite a good idea. Implementing cols2chars() and chars2cols() in XS instead of Perl makes them at least 8 times faster. I tested it on four strings; two ASCII and two Unicode; a long and a short of each:

Function    String   PP calls/sec   XS calls/sec   Ratio (XS/PP)
chars2cols  along           48685         406504         834.97%
chars2cols  ashort          72674         704225         969.02%
chars2cols  ulong           37341         387596        1037.99%
chars2cols  ushort          52966         714285        1348.57%
cols2chars  along           16350         403255        2466.39%
cols2chars  ashort          58685         649350        1106.50%
cols2chars  ulong           13561         362318        2671.76%
cols2chars  ushort          50556         632911        1251.90%

In fact, in some cases it turns out to be more than 26 times faster.

I haven't looked into too much detail on why, but I suspect a large amount of the reason is to do with the way the XS functions primarily walk along the internal UTF-8 representation of the strings, counting bytes, characters, and columns as they go, and returning the appropriate count(s) when required. The pureperl implementation doesn't have direct access to the byte offsets, so only has character numbers to work to. The frequent character-to-byte or byte-to-character conversions at all the boundaries between the functions result in multiple UTF-8 byte skip counting steps along the string each time a function is entered or left, generally slowing it down.

As to the original test failure, it turned out to be an entirely unrelated lack of locale support in the platform's libc. The XS implementations fail there in the same way. But having implemented the above improvements, I decided to leave them in anyway.

XS faster than Pure Perl; who'd have thought it?

2011/06/20

Perl - IO::Async - version 0.41

Wow; seems I haven't been writing these posts as much as I should be. I last wrote about version 0.34, and just now I've uploaded version 0.41 to CPAN. Quite a bit changed between then and now; here's a rundown of the most important bits:

Added IO::Async::FileStream. This behaves like a read-only IO::Async::Stream, but reads its data from a regular file on the filesystem. Like tail -f, or File::Tail, it watches the file for changes in size, and follows appended data.

Added IO::Async::Function. This is a true IO::Async::Notifier subclass to represent an asynchronous block of code; which was previously handled by DetachedCode. Being a true Notifier subclass gives it many advantages, and should provide a better base to build other things on. IO::Async::Resolver is now a subclass of Function, not DetachedCode.

Added IO::Async::Process. This is another IO::Async::Notifier subclass, this time to represent an external process.

Support encoding parameter in IO::Async::Stream. This allows the Stream to handle text in some encoding, rather than simply raw bytes.

Support first_interval parameter to IO::Async::Timer::Periodic. If supplied, this will be the first wait time when the timer is started. In particular, if it is zero the timer's first invocation will happen immediately.

Allow Loop->listen to be extended via extensions, similar to ->connect. Such extensions as IO::Async::SSL are already set up to make use of this.

Distribution now uses Test::Fatal rather than Test::Exception. The former is smaller and simpler, whereas the latter relies on clever tricks to hack on caller(), which sometimes breaks some setups.

Added convenient method in IO::Async::Loop for handling socket addresses; extract_addrinfo. This recognises strings for common networking constants; 'inet', 'inet6' and 'unix' as socket families, 'stream', 'dgram' and 'raw' as socket types. These convenient forms are also recognised by the ->connect and ->listen methods.

Now prefers to use the IPv6 support functions found in Perl 5.14's Socket module; only falling back to using Socket::GetAddrInfo in earlier perls.

2011/06/10

Sleep Sorting with IO::Async and CPS

There's a bit of a silly meme going around lately; an implementation of a sorting algorithm that works by many parallel sleeps. So I thought I'd have a go at it from an IO::Async perspective.

use CPS qw( kpareach );
use IO::Async::Loop;

my $loop = IO::Async::Loop->new;

kpareach( \@ARGV,
   sub {
      my ( $val, $k ) = @_;
      $loop->enqueue_timer(
         delay => $val,
         code  => sub { print "$val\n"; goto $k },
      );
   },
   sub { $loop->loop_stop },
);

$loop->loop_forever;

This produces such output as:

$ perl sleepsort.pl 3 8 4 2 6 5 7 1
1
2
3
4
5
6
7
8

It's a little more verbose than I'd like it though. I have been pondering creating a CPS::Async module, which would provide some more simple CPS-like versions of the usual blocking primitives we're used to, like sleep and sysread. Using this new hypothetical module would lead to something looking like:

use CPS qw( kpareach );
use CPS::Async qw( ksleep );

kpareach( \@ARGV,
   sub {
      my ( $val, $k ) = @_;
      ksleep( $val, sub {
         print "$val\n";
         goto $k;
      } );
   },
   sub { CPS::Async::stop },
);

CPS::Async::run;

This would be a small wrapper around IO::Async, providing CPS-style functions that turn into method calls on some implied IO::Async::Loop object; similar to for example, the AnyEvent::Impl::IOAsync, which exposes an AnyEvent API using IO::Async.

I have plenty of other projects to keep me amused currently, but I'll sit it on the back burner in case something interesting happens that way...

2011/05/24

libtermkey - read keypresses from terminals

I have just released version 0.8 of libtermkey. It contains a small set of bugfixes on the previous version (0.7), relating to handling the signal-raising keys (Ctrl-C, Ctrl-\, Ctrl-Z), gracefully handles EINTR from read(2) calls, and actually gets around to implementing the CSI u modified Unicode encoding scheme I documented a long time ago.

libtermkey is a library for reading keypress events (and in fact mouse events too) from a terminal, in a terminal-based application. I won't list all its points and features, but as a brief overview:

It presents events to the application in a structure, containing basic key information and a bitfield of modifiers in effect, rather than a single flat enumeration integer, as curses tries to do. In this way, it can easily represent the various modifier-combinations on cursor keys, like arrows or PageUp/PageDown, and can easily represent any Unicode character vs. any special key.

It supplies a pair of functions (termkey_strfkey and termkey_strpkey) for converting between these structural notations and plain-text human-readable strings, such as "Ctrl-PageUp". These assist with easy reporting of keypresses in applications, or for config file parsing.

Perl bindings exist, primarily in the form of Term::TermKey.

2011/05/19

Wearing Two Hats

A while ago, I wrote libtermkey, a C library for reading keypress events from a terminal. I wrote a Perl binding for it, Term::TermKey. On a separate note, I also maintain IO::Async, an event framework for Perl. Naturally, the combination of these two led me to write an IO::Async module to handle libtermkey, Term::TermKey::Async. These all work together quite well.

However, these two things are separate considerations. There's nothing specific to Perl in libtermkey, nor anything specific to IO::Async in Term::TermKey. After some thought, and discussions on #perl, I decided that I had to treat these as two separate concerns, two different problem domains, and that neither should be allowed to influence the other. In short, I had to wear two hats.

With my libtermkey hat on, I went to join #poe and #anyevent on irc.perl.org, and talked my way around designing two new modules, which are now happily sitting on CPAN as well: POE::Wheel::TermKey and AnyEvent::TermKey.

With so many CPAN modules effectively being glue between two others (such as in the case of Term::TermKey::Async being the glue between Term::TermKey and IO::Async), it's often the case that at least one if not both sides of the module are written by the same author. It is important to recognise these cases, and to consider whether the community as a whole would be better served by taking a look around to see if other modules and other use cases need attention as well.

When wearing hats, it is important to remember which hat you are wearing, and only try to wear one hat at once.

2011/04/26

Deterministic Random Testing

I've just written an implementation of a weighted random shuffle. Because the function is random, it won't return the same results every time. This makes it hard to unit-test; I can't just run it on some given input and assert it returns some given output. Indeed, by its nature, any ordering of the results is potentially valid. One way to solve this is to run it a large number of times, counting the frequency of various output results, and asserting that roughly the right proportion of results are returned. Of course, exactly how close "roughly right" is, is hard to determine.

This isn't a very satisfactory testing method, however. Get the tolerance too tight, and spurious random differences cause tests to fail unnecessarily. Too loose, and you might miss a subtle logic bug that skews the probabilities of some case.

The weighted random shuffle algorithm calls int rand $limit a number of times, each time passing in a small integer. This effectively uses the RNG like a dice roll, randomly selecting some integer 0 .. $limit - 1. Because each call takes a small integer, and because the algorithm is entirely deterministic for any given set of random results, this suggests a better testing method. If we could instead enumerate all possible returns of random numbers, each one exactly once, we can generate all possible results from the shuffle algorithm in their ideal proportions. Because this will be an exact deterministic count, free from randomness, we can assert exact values for the results.

So this is what I did. I've created a module, for now simply called Unrandom, which exports one function, unrandomly. It is used like the following:

use feature 'say';
use Unrandom 'unrandomly';

unrandomly {
  my $da = 1 + int rand 6;
  my $db = 1 + int rand 6;
  say "Roll 2d6: $da + $db = " . ($da+$db);
};

The unrandomly function has the effect of replacing the rand function with one under its control while it runs the block of code. It runs the block of code a number of times, enumerating the entire tree of possible return values, in a given deterministic order:

Roll 2d6: 1 + 1 = 2
Roll 2d6: 1 + 2 = 3
Roll 2d6: 1 + 3 = 4
Roll 2d6: 1 + 4 = 5
Roll 2d6: 1 + 5 = 6
Roll 2d6: 1 + 6 = 7
Roll 2d6: 2 + 1 = 3
Roll 2d6: 2 + 2 = 4
...
Roll 2d6: 6 + 5 = 11
Roll 2d6: 6 + 6 = 12

Because each possible combination is returned exactly once, we can unit test that this dice-rolling algorithm does in fact give us the right distribution of results; there will be exactly one 2, two 3s, etc... We don't have to run it a few million times, and check that we got "roughly" the right number; we can be exact.
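The enumeration itself can be sketched in a few lines. This is not the Unrandom module: rather than replacing rand, the hypothetical enumerate_rolls below hands the block a $roll function to call in place of int rand $limit, and steps through every combination like an odometer.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code once for every possible sequence of results
# from the $roll function it is given. $roll->( $limit ) stands in for
# "int rand $limit" and yields each of 0 .. $limit-1 in turn.
sub enumerate_rolls
{
   my ( $code ) = @_;
   my @choices;   # one [ current value, limit ] pair per roll position

   while(1) {
      my $pos = 0;
      my $roll = sub {
         my ( $limit ) = @_;
         $choices[$pos] //= [ 0, $limit ];
         return $choices[$pos++][0];
      };
      $code->( $roll );

      # Advance like an odometer: bump the last roll, carrying leftwards
      # whenever a position reaches its limit.
      while( @choices ) {
         last if ++$choices[-1][0] < $choices[-1][1];
         pop @choices;
      }
      return unless @choices;
   }
}

# Count how often each 2d6 total occurs across all combinations
my %count_of_total;
enumerate_rolls( sub {
   my ( $roll ) = @_;
   my $da = 1 + $roll->( 6 );
   my $db = 1 + $roll->( 6 );
   $count_of_total{ $da + $db }++;
} );

print "$_: $count_of_total{$_}\n" for sort { $a <=> $b } keys %count_of_total;
```

A unit test can then assert the exact counts in %count_of_total rather than checking a sampled distribution against some tolerance.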

While I'm using this code in a unit-test, there's nothing directly test-related in the code. It could be useful anywhere that statistical modelling is used, or other problems involving random integer generation.

I'm now just looking for a good name to call it, so I can extract it from the unit tests and give it a life of its own on CPAN.

Suggestions anyone?

2011/04/09

Extended colour support, terminals, and ECMA-48

tl;dr summary: Terminal authors - please accept CSI 38:5:123 m as well as anything else, to set extended colours. NOTE THE COLON.

Many terminals these days are starting to support extended colour modes; specifically things like 256 colours. They're all doing it wrong.

CSI 38;5;123 m

No no no no no. That is NOT what you think. It does not select colour 123 from palette 5. CSI arguments are separated by semicolons. This sets three unrelated SGRs; 38, 5, and 123. 38 sets a foreground colour to .. er.. nothing in particular. 5 is blinking mode, 123 has no defined meaning to my knowledge. All of the following are exactly equivalent to it:

CSI 5;38;123 m
CSI 38 m CSI 5 m CSI 123 m

ECMA-48 already defines a perfectly good way for parameters to take sub-parameters. It's the colon. The correct way to encode this concept is

CSI 38:5:123 m

Why does this matter? It matters for parsers not to have to understand what is going on. Consider

CSI 3;38:9:5:4:3:2:1;11;1 m

This is equivalent to

CSI 3 m CSI 38:9:5:4:3:2:1 m CSI 11 m CSI 1 m

I.e. I have no idea what palette 9 is, but I can definitely parse out the contents of this SGR from the others, knowing exactly where it ends. I don't have to "just know" that palette 9 happens to take 5 parameters. The CSI encodes this.

My own terminal emulator library, libvterm, already understands these colon-based arguments. It parses them correctly. This is important for other palettes that don't take just one value; for example, the RGB, RGBA, CMY, and CMYK palettes. It's vital to be able to parse out these sub-arguments from the single SGR 38 or 48.

Standards exist for a reason, people. Please use them.
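To show how little a parser needs to know, here is a minimal Perl sketch that splits a CSI argument string first on semicolons and then on colons. The parse_csi_args function is made up for illustration (it is not libvterm's parser), and it ignores details such as empty parameters.

```perl
use strict;
use warnings;

# Split the parameter string of a CSI sequence into arguments (on ';'),
# and each argument into its sub-parameters (on ':').
sub parse_csi_args
{
   my ( $argstr ) = @_;
   return map { [ split m/:/, $_ ] } split m/;/, $argstr;
}

my @args = parse_csi_args( "3;38:9:5:4:3:2:1;11;1" );

foreach my $arg ( @args ) {
   my ( $sgr, @sub ) = @$arg;
   print "SGR $sgr", ( @sub ? " with sub-parameters @sub" : "" ), "\n";
}

# Emitting the colon form: foreground colour 123 from palette 5
my $set_fg = "\e[38:5:123m";
```

The parser never needs a table of how many values each palette takes; everything belonging to SGR 38 arrives together in one array.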

Edit 2012/12/20: actually, I have recently learned that xterm now supports this in its correct form as well, since xterm patch 282.

2011/03/29

When failure isn't failure

Lately I have been looking at two different problems, with a common theme.

My first problem concerns Parser::MGC, and its ability to read input lazily as needed, rather than needing to slurp an entire file all at once. This ability is provided by the from_reader method, which takes a CODE reference to a reader function.

As the documentation points out, this is only supported for reading input that's broken across skippable whitespace. This is because it's implemented by calling the reader function to look for more input if the current input buffer is completely exhausted. It cannot work in general, for splitting the input stream arbitrarily, because Perl's regular expression engine does not give sufficient feedback. It is not possible to ask, after a match attempt, whether the engine reached the end of the stream. For example, when looking for a match for m/food/, an input of "fool" definitely fails, whereas an input of "foo" is not yet a failure, because it might be that reading more input from the stream can complete the match. If the regular expression engine gave such feedback, then the reader function could be invoked again to provide more input that may help to resolve the parse.
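For a plain literal string, rather than a general regular expression, the three-way answer is easy to compute by hand. The sketch below is only an illustration of the kind of feedback meant here; try_literal is a hypothetical helper, not part of Parser::MGC.

```perl
use strict;
use warnings;

# Classify an attempt to match the literal $want at the start of $input.
# Returns "match", "fail", or "incomplete" when more input could still help.
sub try_literal
{
   my ( $want, $input ) = @_;

   # The whole literal is present at the start of the input
   return "match" if index( $input, $want ) == 0;

   # The input ran out, but everything seen so far agrees with the literal
   return "incomplete"
      if length( $input ) < length( $want ) and index( $want, $input ) == 0;

   return "fail";
}

my %outcome = map { $_ => try_literal( "food", $_ ) } qw( food fool foo );
print "$_ => $outcome{$_}\n" for qw( food fool foo );
```

Only the "incomplete" answer would justify invoking the reader function again for more input.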

My second problem concerns how to handle UTF-8 encoded data in nonblocking reads. An IO::Async::Stream object wraps a bytestream, such as a TCP socket or pipe. If the underlying stream contains UTF-8 encoded Unicode text, then the Unicode characters need to be decoded from these bytes, by using the Encode module.

The trouble here is that Encode does not provide a way to do this sanely. It is quite likely that a multibyte UTF-8 sequence gets split across multiple read calls. To cope with such a case, Encode has a mode where it will stop on the first error it encounters (called FB_QUIET), returning the prefix it has decoded so far, and deleting the bytes so consumed from the input. The intention here is that another call supplies more bytes, and it continues from there. Problem is, it returns on any failure, whether that's running out of input bytes or encountering an invalid byte. Without the ability to distinguish these two different conditions, it is impossible to handle nonblocking or stream-based UTF-8 decoding while still having sensible error handling.
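A small sketch makes the ambiguity concrete. It feeds decode with FB_QUIET two inputs, one merely truncated mid-character and one containing a byte that can never be valid; decode_some is a made-up wrapper for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode FB_QUIET );

# Decode as much as possible, returning the text and whatever bytes remain.
sub decode_some
{
   my ( $bytes ) = @_;
   my $text = decode( "UTF-8", $bytes, FB_QUIET );   # consumes from $bytes
   return ( $text, $bytes );
}

# "\xC3" is the first byte of a two-byte sequence whose second byte has
# not arrived yet; "\xFF" can never appear in valid UTF-8.
my ( $text_a, $rest_a ) = decode_some( "caf\xC3" );
my ( $text_b, $rest_b ) = decode_some( "caf\xFF" );

printf "truncated: decoded '%s', %d byte(s) left\n", $text_a, length $rest_a;
printf "invalid:   decoded '%s', %d byte(s) left\n", $text_b, length $rest_b;
```

Both calls should report the same decoded prefix and the same number of leftover bytes, so the return values alone give the caller nothing to tell "wait for more input" apart from "this stream is corrupt".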

The common theme of these two problems is that neither considers the nature of a failure, treating various reasons the same. Both cases have two kinds of failure: one a failure because something has been received that is not correct; the other a failure because something that would be correct has simply not yet been received.

Sometimes, failure is not really failure at all. Sometimes it is simply deferred success that is yet to happen.

2011/03/04

Carp from somewhere else

Carp provides two main functions, carp and croak, as replacements for core Perl's warn and die. The functions from Carp report the file and line number slightly differently from core Perl; walking up the callstack looking for a different package name. The idea being these are used to report errors in values passed in to the function, typically bad arguments from the caller.

These functions use the dynamic callstack (as provided by caller()) at the time the message is warned (or thrown) to create their context. One scenario where this does not lead to sensible results is when the called function is internally implemented using a closure via some other package again.

Having thought about this one a bit, it seems what's required is to split collecting the callstack information, from creating the message. Collect the information in one function call, and pass it into the other.

This would be useful in CPS for example. Because the closures used in CPS::kseq or kpar aren't necessarily invoked during the dynamic scope of the function that lexically contains the code, a call to croak may not be able to infer the calling context. Even if they are, the presence of stack frames in the CPS package would confuse croak's scanning of the callstack. Instead, it would be better to capture the calling context using whence, and pass it into whoops if required for generating a message.

For example, this from IO::Async::Connector:

sub connect
{
   my ( %params ) = @_;
   ...

   my $where = whence;

   kpar(
      sub {
         my ( $k ) = @_;
         if( exists $params{host} and exists $params{service} ) { 
            my $on_resolve_error = $params{on_resolve_error} or whoops $where, "Expected 'on_resolve_error' callback";
            ...
}

These functions would be a fairly simple replacement of carp and croak; capture the callsite information at entry to a function, and pass it to the message warning function.
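A minimal sketch of the pair might look like the following. It is deliberately simpler than Carp: whence only looks one frame up with caller(1) instead of scanning for a different package, and make_task with its on_error parameter is a hypothetical example function.

```perl
use strict;
use warnings;

# Capture where the function that called whence() was itself called from.
sub whence
{
   my ( undef, $file, $line ) = caller(1);
   return { file => $file, line => $line };
}

# Die with a message that reports a previously captured call site.
sub whoops
{
   my ( $where, $message ) = @_;
   die "$message at $where->{file} line $where->{line}.\n";
}

# A function that checks its arguments later on, from inside a closure.
sub make_task
{
   my ( %params ) = @_;
   my $where = whence;   # captured now, while the caller is still on the stack

   return sub {
      $params{on_error} or whoops $where, "Expected 'on_error' callback";
   };
}

my ( $task, $call_line ) = ( make_task(), __LINE__ );

# By the time the closure runs, make_task() has long since returned
eval { $task->() };
my $caught = $@;
print "Caught: $caught";
```

The message points at the line that called make_task(), even though the check only happens later when the closure is invoked.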

It does occur to me though, the code will be small and self-contained, and not specific to CPS. I think it ought to live in its own module somewhere - any ideas on a name?

2011/01/22

IPv6 in Perl

A lot of people are talking about IPv6 lately. And perhaps with good reason - it's been years in coming, but finally we're really starting to run out of IPv4 addresses. Tools like the IPv4 Address Report may put a certain amount of panic on things with their realtime Flash-based countdown widget, but the problem is real and does need sorting.

The latest development release of Perl; perl-5.13.9, now has full support for IPv6, and the new address handling functions specified in RFC 2553. The new functions all live in Socket.

As well as the low-level AF_INET6 constant, and the pack_sockaddr_in6 and unpack_sockaddr_in6 structure functions (already in place in 5.13.8 in fact), there is now the full set of getaddrinfo, getnameinfo and associated constants. These allow fully protocol-agnostic handling of connections and addresses. There is now enough in core to support IO::Socket::IP. This is a fully protocol-agnostic replacement of IO::Socket::INET.

use IO::Socket::IP;

my $sock = IO::Socket::IP->new(
   PeerHost    => "www.google.com",
   PeerService => "www",
) or die "Cannot construct socket - $@";

printf "Now connected to %s:%s\n", $sock->peerhost_service;

...

What could be simpler?

Perhaps now is the time to make a case for putting IO::Socket::IP itself in the core distribution, such that when perl-5.14.0 comes out, it will be properly ready for the next 30 years of the Internet?

IO::Socket::IP's API is designed to be a drop-in replacement for the IPv4-only IO::Socket::INET. Any program currently using INET ought to work exactly the same, by simply substituting IP instead, and now will also work on IPv6.

If you maintain a distribution that uses IO::Socket::INET, please try out IO::Socket::IP instead. I'd be very keen to hear from anyone who finds it doesn't JustWork in their situation.

See also my earlier post on the subject; Perl - IO::Socket::IP