2011/03/29

When failure isn't failure

Lately I have been looking at two different problems, with a common theme.

My first problem concerns Parser::MGC, and its ability to read input lazily as needed, rather than needing to slurp an entire file all at once. This ability is provided by the from_reader method, which takes a CODE reference to a reader function.

As the documentation points out, this is only supported for reading input that's broken across skippable whitespace. This is because it's implemented by calling the reader function to look for more input if the current input buffer is completely exhausted. It cannot work in general, for splitting the input stream arbitrarily, because Perl's regular expression engine does not give sufficient feedback. It is not possible to ask, after a match attempt, whether the engine reached the end of the stream, For example, when looking for a match for m/food/, an input of "fool" definitely fails, whereas an input of "foo" is not yet a failure, because it might be that reading more input from the stream can complete the match. If the regular expression engine gave such feedback, then the reader function could be invoked again to provide more input that may help to resolve the parse.

My second problem concerns how to handle UTF-8 encoded data in nonblocking reads. An IO::Async::Stream object wraps a bytestream, such as a TCP socket or pipe. If the underlying stream contains UTF-8 encoded Unicode text, then the Unicode characters need to be decoded from these bytes, by using the Encode module.

The trouble here is that Encode does not provide a way to do this sanely. It is quite likely that a multibyte UTF-8 sequence gets split across multiple read calls. To cope with such a case, Encode has a mode where it will stop on the first error it encounters (called FB_QUIET), returning the prefix it has decoded so far, and deleting the bytes so consumed from the input. The intention here is that another call supplies more bytes, and it continues from there. Problem is, it returns on any failure, whether that's running out of input bytes or encountering an invalid byte. Without the ability to distinguish these two different conditions, it is impossible to handle nonblocking or stream-based UTF-8 decoding while still having sensible error handling.

The common theme of these two problems is that neither considers the nature of a failure, treating various reasons the same. Both cases have two kinds of failure: one a failure because something has been received that is not correct; the other a failure because something that would be correct has simply not yet been received.

Sometimes, failure is not really failure at all. Sometimes it is simply deferred success that is yet to happen.

2011/03/04

Carp from somewhere else

Carp provides two main functions, carp and croak, as replacements for core Perl's warn and die. The functions from Carp report the file and line number slightly differently from core Perl; walking up the callstack looking for a different package name. The idea being these are used to report errors in values passed in to the function, typically bad arguments from the caller.

These functions use the dynamic callstack (as provided by caller()) at the time the message is warned (or thrown) to create their context. One scenario where this does not lead to sensible results is when the called function is internally implemented using a closure via some other package again.

Having thought about this one a bit, it seems what's required is to split collecting the callstack information, from creating the message. Collect the information in one function call, and pass it into the other.

This would be useful in CPS for example. Because the closures used in CPS::kseq or kpar aren't necessarily invoked during the dynamic scope of the function that lexically contains the code, a call to croak may not be able to infer the calling context. Even if they are, the presence of stack frames in the CPS package would confuse croak's scanning of the callstack. Instead, it would be better to capture the calling context using whence, and pass it into whoops if required for generating a message.

For example, this from IO::Async::Connector:
sub connect
{
my ( %params ) = @_;
...

my $where = whence;

kpar(
sub {
my ( $k ) = @_;
if( exists $params{host} and exists $params{service} ) {
my $on_resolve_error = $params{on_resolve_error} or whoops $where, "Expected 'on_resolve_error' callback";
...
}
These functions would be a fairly simple replacement of carp and croak; capture the callsite information at entry to a function, and pass it to the message warning function.

It does occur to me though, the code will be small and self-contained, and not specific to CPS. I think it ought to live in its own module somewhere - any ideas on a name?