Lately I have been looking at two different problems, with a common theme.
My first problem concerns Parser::MGC, and its ability to read input lazily as needed, rather than needing to slurp an entire file all at once. This ability is provided by the from_reader method, which takes a CODE reference to a reader function.
As the documentation points out, this is only supported for reading input that's broken across skippable whitespace. This is because it's implemented by calling the reader function to look for more input if the current input buffer is completely exhausted. It cannot work in general, for splitting the input stream arbitrarily, because Perl's regular expression engine does not give sufficient feedback. It is not possible to ask, after a match attempt, whether the engine reached the end of the stream. For example, when looking for a match for m/food/, an input of "fool" definitely fails, whereas an input of "foo" is not yet a failure, because it might be that reading more input from the stream can complete the match. If the regular expression engine gave such feedback, then the reader function could be invoked again to provide more input that may help to resolve the parse.
My second problem concerns how to handle UTF-8 encoded data in nonblocking reads. An IO::Async::Stream object wraps a bytestream, such as a TCP socket or pipe. If the underlying stream contains UTF-8 encoded Unicode text, then the Unicode characters need to be decoded from these bytes, by using the Encode module.
The trouble here is that Encode does not provide a way to do this sanely. It is quite likely that a multibyte UTF-8 sequence gets split across multiple read calls. To cope with such a case, Encode has a mode where it will stop on the first error it encounters (called FB_QUIET), returning the prefix it has decoded so far, and deleting the bytes so consumed from the input. The intention here is that another call supplies more bytes, and it continues from there. Problem is, it returns on any failure, whether that's running out of input bytes or encountering an invalid byte. Without the ability to distinguish these two different conditions, it is impossible to handle nonblocking or stream-based UTF-8 decoding while still having sensible error handling.
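A minimal sketch of that ambiguity, using only core Encode. The helper name decode_prefix and the two sample byte strings are made up for illustration. It assumes the documented FB_QUIET behaviour of returning the decoded prefix and leaving the unprocessed bytes in the caller's variable.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode as much as possible, returning the decoded
# prefix and whatever bytes Encode left unconsumed.
sub decode_prefix {
    my ($bytes) = @_;    # work on a copy; FB_QUIET modifies its argument in place
    my $chars = decode( "UTF-8", $bytes, Encode::FB_QUIET() );
    return ( $chars, $bytes );
}

# "caf" followed by only the first byte of a two-byte sequence
# (U+00E9 is encoded as C3 A9, and the A9 has not arrived yet)
my ( $chars_a, $rest_a ) = decode_prefix("caf\xC3");

# "caf" followed by a byte that can never appear in valid UTF-8
my ( $chars_b, $rest_b ) = decode_prefix("caf\xFF");

printf "truncated: decoded '%s', %d byte(s) left over\n", $chars_a, length($rest_a);
printf "invalid:   decoded '%s', %d byte(s) left over\n", $chars_b, length($rest_b);
```

In both calls the caller gets back the same decoded prefix and one leftover byte, with nothing in the return value to say whether waiting for more input could ever help.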
The common theme of these two problems is that neither considers the nature of a failure; each treats every reason for failing the same way. Both cases have two kinds of failure: one a failure because something has been received that is not correct; the other a failure because something that would be correct has simply not yet been received.
Sometimes, failure is not really failure at all. Sometimes it is simply deferred success that is yet to happen.
I don't think "fool" is definitely a failure, for the same reason "foo" is not a failure, namely appending more input would allow a match. If the next input contained "pfoody", then both subject strings would be matched.
However, if your regex were changed to m/^food/, your point would hold. One way to solve that would be to use the regex m/^(food|foo|fo|f|)/ and then compare the match length vs the subject string length. I don't know how to create such a regex given any other regex, in general, but I don't think it has to do with "reaching the end of the stream" per se, but something more subtle.
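The prefix-alternation idea in this comment can be sketched for the fixed-literal case as follows. The helper name classify and its three result labels are hypothetical, and, as the comment says, this does not generalise to arbitrary regexes.

```perl
use strict;
use warnings;

# Hypothetical helper, only meaningful for a fixed literal such as "food":
# classify the input by how much of the literal it manages to match.
sub classify {
    my ( $input, $literal ) = @_;

    # Build the alternation of every prefix, longest first:
    # food|foo|fo|f|(empty)
    my $prefixes = join "|",
        map { quotemeta( substr( $literal, 0, $_ ) ) }
        reverse( 0 .. length($literal) );

    # The empty alternative guarantees that this always matches
    my ($matched) = $input =~ m/^($prefixes)/;

    # The whole literal was found
    return "match" if length($matched) == length($literal);

    # Only a prefix was found, but it used up all the input so far
    return "incomplete" if length($matched) == length($input);

    # A prefix was found and then something else followed it
    return "fail";
}

print "'$_' => ", classify( $_, "food" ), "\n" for "food", "foo", "fool", "";
```

Here "foo" and the empty string are reported as incomplete, while "fool" is a genuine failure, which is exactly the distinction the plain match result does not carry.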
Oh, oops. That's a mistake in the original post. Parser::MGC uses \G-anchored regexps to walk the input string, trying to make it parse. It uses m//gc regexps, hence the name. My original post should have put m/^food/, or else made the \G behaviour more explicit.
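For readers unfamiliar with that style, here is a minimal sketch of a \G-anchored m//gc match in plain Perl. The helper expect_food is hypothetical and is not taken from Parser::MGC's source; it only illustrates the anchoring behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: try to consume the literal "food" at the start of
# $input, returning whether it matched and where the position ended up.
sub expect_food {
    my ($input) = @_;

    pos($input) = 0;    # \G anchors the match at this position

    # /g makes the match start from pos(); /c keeps pos() on failure
    my $ok = $input =~ m/\Gfood/gc;

    return ( $ok ? 1 : 0, pos($input) );
}

for my $input ( "food", "foo", "fool", "xfood" ) {
    my ( $ok, $pos ) = expect_food($input);
    printf "%-6s ok=%d pos=%d\n", "'$input'", $ok, $pos;
}
```

Because of the \G anchor, "xfood" fails just as "fool" does, and "foo" fails in exactly the same way as both: false result, position unchanged.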