Perl Parser Plugins - part 1

Today's subject is the parser plugins that were added to perl in 5.14. The parser plugin API itself is relatively small and modest, but the power it grants the user is enormous. To use this power however, you need to know and be familiar with a lot of the internals of the perl core.

This post will be a little different to most of my usual posts. Rather than announcing and demonstrating some code that's already written and available on CPAN or wherever, I shall today be writing about some code I'm currently in the progress of writing. I'm hoping this series of posts will contain useful information about the underlying subject of perl parser plugins and other internal details.

As a parser plugin would be written in C as part of an XS module, the reader is presumed to be at least somewhat familiar with C syntax, and to have at least a basic understanding of what an XS module is and how to go about creating one.

The PL_keyword_plugin Hook

The initial point of entry into the parser plugin system is a function pointer variable that's part of the interpreter core, called PL_keyword_plugin. This variable points to a function that the perl parser will invoke whenever it finds something that looks like it could be a keyword. It's typed as

int (*PL_keyword_plugin)(pTHX_
    char *keyword_ptr, STRLEN keyword_len,
    OP **op_ptr);

How it behaves is that whenever the parser finds a sequence of identifier-like characters that could be a keyword, it first asks the keyword plugin whether it wishes to handle it. The sequence of characters itself is passed in via the pointer and length pair. Because this is a pointer directly into the parser's working-space buffer, the keyword itself will not be NUL-terminated, and so the length argument is required too. The integer return value works with the op_ptr double-pointer value to let the plugin return its result back to the interpreter in the form of an optree. For today's post we won't concern ourselves with this optree, and instead write a plugin that doesn't actually do anything.

Side note:If you don't recognise what that pTHX_ macro is doing at the start of the argument list, don't worry too much about it. It exists to pass around the perl interpreter state if the perl is built with threads enabled. Just think of it as a curious quirk of writing C functions in the perl interpreter; the detail of its operation does not concern us too closely here.

To use this plugin API, we need to define our own custom handling function, and set this variable to its address so that the perl parser will invoke it. This following example implements a trivially tiny plugin; one that declines to process any keyword, but as a side-effect prints its name to the standard error stream, so we can see it in action.

static int MY_keyword_plugin(pTHX_ char *kw, STRLEN kwlen,
    OP **op_ptr)
  fprintf(stderr, "Keyword [%.*s]\n", kwlen, kw);

MODULE = tmp  PACKAGE = tmp

  PL_keyword_plugin = &MY_keyword_plugin;

To test this out, I've compiled this little blob of XS code into a tiny module called tmp. The plugin's output can clearly be seen:

$ perl -Mtmp -e 'sub main { print "Hello, world\n" }  main()'
Keyword [sub]
Keyword [print]
Keyword [main]
Hello, world

Here we can see that the keyword was first informed about the sub. Because we declined to handle it, this was taken by the core perl parser which consumes the main identifier and opening brace in its usual manner. Next up is another candidate for being a keyword, print, which we also decline. The entire string literal does not look like a keyword, so the plugin wasn't invoked here. It is finally informed again of the main that appears at the top-level of the program, which it also declines.

From this initial example, we can see already how the parser plugin API is a lot more robust than earlier mechanisms, such as source filters. We haven't had to take any special precautions against false matches of keywords inside string literals, comments, or anything else like that. The perl parser core itself already invokes the plugin only in places where it expects there to be a real keyword, so all of that is taken care of already.

Hook Chains

The plugin we have created above has one critical flaw in it - it is very rude. It takes over control of the PL_keyword_plugin pointer, overwriting whatever else was there. If the containing program had already loaded a module that wanted to provide custom keyword parsing, our module has just destroyed its ability to do that.

The way we solve this in practice is to wrap the pre-existing parser hook with our own function; saving the old pointer value in a local variable within the plugin code, so if we don't want to handle the keyword we simply invoke the next function down the chain. The perl core helpfully initialises this variable to point to a function that simply declines any input, so the first module loaded doesn't have to take special precautions to check if it was set. It can safely chain to the next one in any circumstance.

We can now update our code to do this:

static int (*next_keyword_plugin)(pTHX_ char *, STRLEN, OP **);

static int MY_keyword_plugin(pTHX_ char *kw, STRLEN kwlen,
    OP **op_ptr)
  fprintf(stderr, "Keyword [%.*s]\n", kwlen, kw);

  return (*next_keyword_plugin)(aTHX_ kw, kwlen, op_ptr);

MODULE = tmp  PACKAGE = tmp

  next_keyword_plugin = PL_keyword_plugin;
  PL_keyword_plugin = &MY_keyword_plugin;

The plugin now behaves exactly as it did before, but this time if there had been any more plugins loaded before this one, they will be given a chance to run as well. Hopefully the next one that's loaded does this same logic, so that ours continues to work.

Well, I say "work".

Aside from spying on the keywords in the program, this parser plugin doesn't do anything useful yet. I shall start to explore what things the plugin can actually do in part 2.


Devel::MAT interactive debugging and filehandles

For a while now, I've been using Devel::MAT for all sorts of debugging tasks. Originally I wrote it in order to track down a single memory-leak case in a large program, but due to the very generic way it just captures as much of the perl process state as it can for later analysis, it turns out to be useful for a variety of other problems as well. Every time I try to debug a new sort of problem that it doesn't quite fit for, I end up adding more features and making a new release, so it's growing in ability quite nicely now.

A recent bug I was trying to fix involved leaking filehandles - a longrunning process ends up running out of spare filehandles and receives EMFILE trying to make a socket(2) call. So, newly added to Devel::MAT version 0.23 is tracking of file numbers associated with IO handles.

To start this session of bug-hunting, first I had to provoke the leak into happening, and capture a .pmat file from it. Provoking the leak proved to be quite easy by just artificially lowering the maximum file descriptor count using ulimit, and capturing the file was achieved using the usual Devel::MAT::Dumper options:

$ ulimit -n 64
$ perl -MDevel::MAT::Dumper=-eager_open,-dump_at_END,-dump_at_DIE ...

Sure enough, it didn't take long running my program before it died because of EMFILE, at which point a .pmat file appeared. Now we can take a look into it and find the problem. So I started by loading this into my new pmat interactive debugging shell, which wraps a number of commandline-type tools, allowing interactive application of them to the loaded file, without having to reload the file each and every time. This speeds up such debugging sessions a lot.

$ pmat run-tests.pl.pmat 
Perl memory dumpfile from perl 5.22.1
Heap contains 156829 objects

The first useful command to run here is io list to obtain a list of every IO SV that has at least one file number associated with it. This will correspond to the filehandles the kernel is complaining that we have too many of. These will be a good place to start to see if something is leaking that shouldn't.

pmat> io list
Addr                           ifileno  ofileno
IO() at 0xa2f378               -1       -1
IO() at 0x8cf540               0        -1
IO() at 0x8cf570               1        1
IO() at 0x8cf5d0               2        2
IO() at 0x16063f8              3        3
IO() at 0x3bf8e68              59       59
IO() at 0x3b7e548              60       60
IO() at 0x3bccda8              61       61
IO() at 0x3b57958              62       62
IO() at 0x3b4b950              63       63

Well sure enough, we do seem to have our full allocation of 64 filehandles, and luckily for us they seem to be perl-visible ones (lucky because Devel::MAT can't peek into the inner workings of lower-level C libraries, which could have been the reason here). The first few handles will be the usual STDIN, STDOUT and friends, and some socket handles that the process itself is supposed to be hanging onto, so our leak likely relates to the higher-numbered ones. Lets pick a random candidate near the end, and see if we can investigate further what it's doing still hanging around.

The identify command takes the address of an SV, and tries to walk back up the references chain, looking for what refers to it, so we can find out what's still holding it.

pmat> identify 0x3bf8e68
IO() at 0x3bf8e68 is:
└─the io of GLOB(%*I) at 0x282f118, which is:
  ├─the referrant of REF() at 0x3b7ef98, which is:
  │ └─value {write_handle} of HASH(29) at 0x3b577c0, which is (*A):
  │   └─the referrant of REF() at 0x3476cc0, which is:
  │     └─element [0] of ARRAY(1) at 0x3bbc120, which is:
  │       └─the referrant of REF() at 0x1f81eb8, which is:
  │         └─the lexical $conns of PAD(23) at 0x3bcc658, which is:
  │           └─pad at depth 1 of CODE(PP,C) at 0x3b4c2b0, which is:
  │             └─the referrant of REF() at 0x3bbb358, which is:
  │               └─the lexical $on_closed of PAD(14) at 0x3bcc6a0, which is:
  │                 └─pad at depth 1 of CODE(PP,C) at 0x1cfa850, which is:
  │                   └─the referrant of REF() at 0x3b63698, which is:
  │                     └─value {on_closed} of HASH(29) at 0x3b577c0, which is:
  │                       └─already found as *A
  └─the referrant of REF() at 0x3bf8550, which is:
    └─value {read_handle} of HASH(29) at 0x3b577c0, which is:
      └─already found as *A

This looks suspect. The SV identified with the *A marker appears later on in its own referrant chain. That sounds like a cyclic reference. Lets look at that one in more detail.

HASH(29) at 0x3b577c0 is:
└─the referrant of REF() at 0x3476cc0, which is:
  └─element [0] of ARRAY(1) at 0x3bbc120, which is:
    └─the referrant of REF() at 0x1f81eb8, which is:
      └─the lexical $conns of PAD(23) at 0x3bcc658, which is:
        └─pad at depth 1 of CODE(PP,C) at 0x3b4c2b0, which is:
          └─the referrant of REF() at 0x3bbb358, which is:
            └─the lexical $on_closed of PAD(14) at 0x3bcc6a0, which is:
              └─pad at depth 1 of CODE(PP,C) at 0x1cfa850, which is:
                └─the referrant of REF() at 0x3b63698, which is:
                  └─value {on_closed} of HASH(29) at 0x3b577c0, which is:
                    └─already found circularly

Indeed, this is the classic identify shape of a leaked cyclic reference; a single linear chain of references that loops back onto itself. Just because we have a cycle though doesn't necessarily mean that's bad. Creating a reference cycle in Perl is only bad if nothing ever cleans it up again. Cyclic references can be perfectly safe, provided something cleans them up again at some point. To determine what's wrong with this one and why it didn't get cleaned up requires some inspection of the code that created it. As this particular cycle involves two lexical scalars in two closures we can use the show command lets us see more detail about these two CODE SVs.

pmat> show 0x3b4c2b0
CODE(PP,C) at 0x3b4c2b0 with refcount 1
  stash=STASH(66) at 0x12fe9c0
  glob=GLOB(*) at 0x1b4eaf0
  location=/home/leo/lib/perl5/Net/Async/HTTP.pm line 422
  no scope
  no padlist
  no padnames
  pad[0]=PAD(23) at 0x3bcc658
Referenced globs:
  GLOB(@I) at 0x8b1ea0, GLOB(&*) at 0x1b40bf0

pmat> show 0x1cfa850
CODE(PP,C) at 0x1cfa850 with refcount 1
  stash=STASH(65) at 0x1a3cac0
  glob=GLOB(*) at 0x1ab35f0
  location=/home/leo/lib/perl5/Net/Async/HTTP/Connection.pm line 62
  no scope
  no padlist
  no padnames
  pad[0]=PAD(14) at 0x3bcc6a0

We can now take a peek at the first of these closures by looking in the source code at the indicated line number; which appears to be the on_closed event handler of the Net::Async::HTTP instance:

      on_closed => sub {
         my $conn = shift;
         my $http = $conn->parent;
         @$conns = grep { $_ != $conn } @$conns;
         if( my $next = first { !$_->connecting } @$ready_queue ) {
            # Requeue another connection attempt as there's still more to do
            $http->get_connection( %args, ready => $next );

Sure enough, there's the $conns lexical that the identification trace suggested was the culprit. Indeed, this code should in fact have removed the connection from that array when it runs, thus breaking the cycle and allowing all these instances to be reclaimed, avoiding a leak. That the leak happened suggests this cleanup code did not in fact execute.

At this point, we can apply some analysis of the code changes leading up to this leak being introduced in the first place. On this occasion, the change was itself a workaround for another bug involving connection pooling and reuse in the HTTP client, by instead using a single-shot HTTP client for each individual request, which is added to the IO::Async::Loop instance and then removed when it is done.

Looking again at the above cleanup code which we expect didn't run, and with knowledge of the code change and the basics of how IO::Async::Notifiers work, we surmise that the code doesn't run because the file handle is never actually explicitly closed. In normal operation with longlived HTTP clients, this isn't a problem as either the client times out and closes the connection itself, or the server does and we're informed of an EOF to close it. These two cases give the on_closed code its opportunity to run, thus avoiding a resource leak in the normal case.

But in our one-shot HTTP client that then gets removed, this never happens. The connection remains open at the point that the client gets removed from the loop, wherein it now has no opportunity to be informed that the server closed it, so the cleanup code never runs. This one now looks like quite an easy fix; just ->close all the remaining connections at the time the client is removed from the loop, by making use of the _remove_from_loop method:

sub _remove_from_loop
   my $self = shift;

   foreach my $conn ( map { @$_ } values %{ $self->{connections} } ) {

   $self->SUPER::_remove_from_loop( @_ );

In fact, this is the very bugfix that was added to make release Net::Async::HTTP version 0.41.

So there we have it - a real-world example of an actual bug that was found and fixed with the help of Devel::MAT.


Latest libvterm / pangoterm work

Since it's been a while since I wrote anything, and I've lately been doing some work on both pangoterm and the underlying libvterm I thought I'd write a little post summarising the recent updates.


Two new features now supported in pangoterm are bracketed paste mode, and configurable basic and high-brightness palettes.

Bracketed paste mode is a feature that applications can enable, which causes text pasted into the terminal via the clipboard to be surrounded by special markers indicating this fact. Applications like text editors can make use of these markers to interpret the keystrokes that arrive as literal text, overriding whatever other meaning they might wish to apply during that time.

There's also a few bugfixes; interpretation of numeric keypad events and IME handling (written by FireFly on Freenode), and a fix for cells with the maximally-supported set of combining chars applied (such as so-called Zalgotext).


On the libvterm front, there is now a second set of callbacks at the State and Screen layers, to allow embedding applications to at least inspect CSI, DCS or OSC strings that libvterm itself doesn't understand, thus allowing some method of extension by applications. I'm not terribly happy with the interface shape of these yet - it will likely change in future - but for now it serves its purpose to satisfy Neovim.

There's a fix for an awkward bug I've been trying to isolate for a while, which involved the Screen layer in its most aggressive optimisation setting, where there is a pending scrollrect and pending damage, and another scrollrect arrives. This illusive bug was the cause of occasional rendering failures during scroll after a quick line edit in command shells, and similar situations.


libtickit terminal size handling

I've recently been working on the Window layer in the C library libtickit (a migration to C from my Tickit perl distribution). It's very-almost ready now - I think we have all the code and tests ported from Perl (with the single exception of the extended drag-and-drop events whose names start "drag_..."). I haven't yet done the docs though.

I've also been adding/fixing a few other bits in the core code. Two new term-related functions make actually writing programs a little neater:

void tickit_term_observe_sigwinch(TickitTerm *tt, bool observe);
TickitTerm *tickit_term_open_stdio(void);

This latter convenience function makes all the examples a few lines shorter, and while it's at it, enables the SIGWINCH behaviour. That ensures that the TickitTerm instance actually sends EV_RESIZE events when the terminal is resized. Previously C programs couldn't actually detect that one; it was done in the Perl layer using a handler on $SIG{WINCH}.


Meta-Meta Documentation

When any project of mine gets over a certain size and people start actually using it for things, inevitably people ask me for more documentation. Often there is a problem here in that I'm never really quite sure what to write about. I find it very hard to provide tutorial-style documentation "up-front" explaining the hows and the whys (as compared to reference-style docs, which just lists all the pieces that exist).

Let me give a random example: Consider Parser::MGC, a base for writing grammar parsers. Someone might want to use this for parsing some sort of expression language, and they'd find a lack of explaining how to do that. It's not that I don't know how to explain that, it's simply that it hasn't occurred to me to write that documentation as a discrete subject. If they'd ask me on IRC I could explain at great length, because they'd given me that initial impulse to give them an answer. But at some point someone will say "Well why isn't that written in documentation?" or "You should write a tutorial", and the same cycle of conversation will usually end with something like

But I don't know what documentation to write for you.

I have sometimes asked people to explain to me what documentation they want, and usually people don't like doing that because they feel since they don't already know the answers, they can't provide the questions. This article is my attempt to break out of that cycle.

Here now then is my rather direct plea to anyone who would ask me for more documentation. Here is your documentation on how to document what documentation I should write - a meta-meta-document:

If you explain to me what problems you want to solve, or what situations you want documentation for, that's a good start. It would be better if you can provide me some introduction text around your problem, your background, and then I can write something better for you specifically, that still stands nicely on its own.

But, here's the killer revelation:

If you write some documentation wherein you write the English waffly-wordy bits that explain the background of the problem and introduce the reader to the background context of the code, and then write a block of code that just says
banana banana banana
then I will find that about a hundred times easier to write. All I have to do is fill in that blob of perl code, and hey presto; there's your documentation.

I know this is a lot to ask of people, and I don't imagine it will happen very often, but that would be by far the fastest quickest way to get more documentation out of me.


Turning a Hard Problem Into an Easy One

Recently I continued my stewardship (for want of a better word) of List::Util by finally adding the last function in a useful set of the pairwise key/value list management functions, called unpairs.

I had been intending to add to the set of pairfirst, pairgrep and pairmap, a sorting function called pairsort. To manage that I would first have to get around a slightly awkward interface requirement. Perl's regular grep and map pass the single items in via $_, whereas pairgrep and pairmap operate not on single values of their input list, but odd/even sets of two elements each, consisting of a notional "key" and "value". They have to pass the key and value items in via $a and $b. Since regular sort already uses $a and $b, there'd be no easily convenient place to pass the second key and value in to pairsort - perl doesn't provide a $c and $d for this purpose.

It would be easier, in hindsight, if the pair* versions of these functions had continued using the $_ variable to pass both key and value by using the Pair objects in the list provided by the pairs function. If this were the case, then pairsort could be implemented using just $a and $b, as the comparison function would have access to all four items using $a->key, $a->value, $b->key and $b->value.

I started by thinking about how a pairsort function would do this, taking the values in the list and splitting them into key/value pairs, packing them into Pair objects, and... Wait a moment - this is exactly the task that the pairs function already does. Rather than writing a pairsort function as a native XS function from scratch, it would be far easier to reuse pairs and regular sort. All I'd need is a function to take the sorted output, that list of Pair objects, and turn it back into the even-sized list of keys and values. An inverse of the pairs function - call it unpairs.

Rather than a complex XS function to implement sorting a list of keys and values, all I'd need is the simpler pairs and unpairs functions, wrapped around a regular sort. This could easily be done in Perl code:

sub pairsort(&@)
   my $code = shift;
   unpairs sort &$code pairs @_;

In fact, by writing them out in a little table we can see that the two functions pairs and unpairs are in fact quite sufficient to let us implement any of the higher-order pairwise functions, using only the regular single-element functions perl already provides:

pairfirst { COND } LIST  ===   unpairs first { COND } pairs LIST
pairgrep { COND } LIST   ===   unpairs grep { COND }  pairs LIST
pairmap { FUNC } LIST    ===   unpairs map { FUNC }   pairs LIST
pairsort { CMP } LIST    ===   unpairs sort { CMP }   pairs LIST

You might at this point start to notice some similarities with the Schwartzian Transform; a method of sorting a list by the values returned by some transformation function on each item. To remind ourselves what this looks like, here's an example that sorts a list of strings by the number of vowels it has:

my @sorted_strings = map { $_->[0] }
                     sort { $a->[1] <=> $b->[1] }
                     map { [ $_, length( $_ =~ s/[^aeiou]+//gr ) ]

This starts to look rather similar to sorting an even-sized kvlist by, say, a simple string order on its keys:

my @sorted_kvlist = unpairs
                    sort { $a->key cmp $b->key }

What both of these have in common is their method of attacking a problem that is in some way inconvenient to process by changing the form of the data into a shape that is easier to solve. This is done in a reversible way so that at the end we can undo the initial transformation and return back to a list in the form we started with, but now in the right order.

In the Schwartzian's case, we take the original list of values and turn them into two-element array references containing the transformed value we'd like to sort by and the original value. In the key/value pair case, we take the original list in sets of two elements and bundle these keys and values into Pair objects. Each transformation can be easily undone, but before we undo it, we first pass the list through our required sort function, to obtain the order that we'd like.

What we have achieved then, is a way to create new sorting behaviours by just providing a pair of functions that let us use a regular sort. In the Schwartzian's case it simply gives us a more efficient way to achieve this sort by reducing the number of times we need to transform items into their sorting values, but in the pairsort case it lets us perform what previously we could not; as perl's native sort does not understand how to take items in key/value pairs from its input list. In both cases, we can write in a more generalised form, as:

my @output = UNTRANSFORM
             sort { COMPARE($a, $b) }

This works because the two functions of TRANSFORM and UNTRANSFORM are inverses of each other. We transform the input list from one type into a list of a different type, perform the sorting operation on that type instead, then undo that transformation and return back to our original type now in a new order.

There is of course nothing unique to strings or pairs, or indeed even sorting, in this pattern. It can apply to many other kinds of list-processing function, or even problems on other kinds of data. Whenever you have a problem that at first glance seems hard, it pays to consider whether the problem can be turned into one that is inherently easier by changing the values from one type into another, and then back again.


Printing function calls in Perl

Lately my favourite way to interpolate function calls into printed strings is to use the ${\ ... } notation for embedding arbitrary expressions into strings:

print "f($number) = ${\ fibonacci($number) }\n";

Other techniques include separate arguments:

print "f($number) = ", fibonacci($number), "\n";

or a helper variable:

my $result = fibonacci($number);
print "f($number) = $result\n";

or even printf:

printf "f(%d) = %d\n", $number, fibonacci($number);

Of all these techniques I tend to prefer either of the first two, because they lead to putting the expressions "in-line" with the rest of the text string, whereas in the latter two they sit elsewhere, making it harder to see at a glance what gets printed where. Especially with printf's positional arguments, it can be easy to be "off-by-one" with a large number of arguments, and put everything in the wrong place.

In addition, the first technique is not limited to print calls, as it works just on simple string interpolation, so it works just as well to construct strings in general.

(This post was originally written to answer a question on StackOverflow.)