LeoNerd's programming thoughts

2010/05/27

Weasels in the Code

I've recently written the following utility method on an object:

sub _capture_weakself
{
   my $self = shift;
   my ( $code ) = @_;   # actually bare method names work too

   weaken $self;

   return sub { $self->$code( @_ ) };
}

It's quite a useful little method for creating a new CODE ref around either a given code ref, or a method on the object. This CODE ref captures a reference to the object, passing it as the first argument. This makes it useful for passing around elsewhere, perhaps as an event callback (as it happens, this method lives in IO::Async::Notifier). Because the object ref is stored weakly in this closure, it means the returned closure can safely be stored in the object itself without creating a cycle.

It's sufficiently useful that I feel sure this technique must have a name, but so far I'm failing to find one. I manage to keep reading the name here as "capture weasel". This gives me an idea - perhaps it could be said this closure has been weaseled; short for wea(k)sel(f)ed.

my $weasled_code = $notifier->_capture_weakself( sub {
     my $self = shift;
     my ( $x, $y ) = @_;
     ....
} );

$weaseled_code->( 123, 456 );

What does anyone think here? Does this technique have a name already? If not; does this seem suitable? I find it unlikely this name already exists somewhere else in CompSci (FOLDOC doesn't have a use in this sense), so it would be a good unique name...

2010/05/18

PF_PACKET, Linux Socket Filters, and IPv6

For diagnosing network-related problems, it's often useful to be able to capture packets transmitted or received by a machine. Linux implements a socket family, PF_PACKET, to this end. Sockets in this family receive raw datagrams containing packets received or transmitted on network interfaces. A network capture program creates such a socket, then sits in a loop receiving datagrams. Each datagram contains the bytes of the packet. The AF_PACKET address format gives information about, among other things, packet direction and interface number.

In situations where a machine is passing lots of traffic, such as a busy internet router, but the problem being diagnosed concerns only a narrow selection of this traffic, capturing every single packet would be very CPU-intensive. The application could inspect each packet and discard all of those not of interest to it. But since each packet requires a recvfrom(2) system call, this gets quite expensive in context switches, and wasted buffer space. A far more efficient scheme is to have the kernel filter the packets, or at least throw away most of the uninteresting ones.

In Linux, this is done using the SO_ATTACH_FILTER socket option. This option attaches a filter program, which executes on a simple virtual machine within the kernel. This machine is given access to inspect the bytes of the packet, and can return a value indicating whether the packet should be kept, or discarded. This machine is based on BSD's BPF mechanism, with some extensions.

This virtual machine is a register machine. It has a single accumulator (A), a single index register (X), and the various usual sorts of arithmetic and logical operations one would expect. To inspect the actual packet contents it has instructions to load data from the captured packet, in 8, 16, or 32bit unsigned quantities, into the accumulator. The program gives its answer by returning a number to the kernel, which should be the number of bytes of the packet to capture, or 0 to discard it entirely.

For example, lets consider the following example, which selects any TCP packets:

LD BYTE[9]
JEQ 6, 0, 2

LD len
RET A

RET 0

The LD BYTE[offs] instruction loads a byte from the packet (at offset 9, being where IPv4 headers keep their protocol number) into the A register. JEQ is a conditional jump instruction which compares the A register to the immediate constant (6, the protocol number of TCP). If they are equal it jumps to the first label; if not, the second. Both jump labels are unsigned integers, being the number of instructions forwards to skip. In the true case (i.e. the protocol is TCP), the length of the packet is loaded into the accumulator and returned from the filter. In the false case, the number 0 is returned, to discard the packet.

Of course, this filter isn't much use yet, because we don't know for sure it's even an IPv4 packet we've caught. The original BSD BPF doesn't define any metadata access scheme, so Linux has invented a way for programs to access this. It reserves the topmost 4KiB of packet buffer, to provide some virtual "registers" containing metadata. Since the load instruction offsets are 32bit integers, it is unlikely any real packet would ever be anywhere near this size, so in practice this works well.

In this extra data area, called the Ancillary Data, the first offset stores the ethertype of the packet (this field address has a symbolic name, SKF_AD_PROTOCOL). In IPv4's case, this will be 0x0800. We can extend our filter to look at this protocol number (not to be confused with IPv4's protocol field), and check it.

LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 4

LD BYTE[9]
JEQ 6, 0, 2

LD len
RET A

RET 0

We can further extend our filter, looking only for a certain TCP port number (such as 80, for http). Presuming an IPv4 header with no extra options, it is 20 bytes long, and therefore, the TCP header will start at offset 20:

LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 8

LD BYTE[9]
JEQ 6, 0, 6

LD HALF[20]
JEQ 22, 2, 0
LD HALF[22]
JEQ 22, 0, 2

LD len
RET A

RET 0

Of course, this filter isn't much use in the real world, because we made the rather large assumption that an IPv4 header would be 20 bytes long. This is where the index register becomes useful. We can load the index register, X, with the size of the IPv4 header, and access the TCP header relative to this. BPF provides a very special instruction, LDMSHX, for exactly this purpose. Given the address of a byte, it loads that byte, masking off the bottom 4 bits, and shifts up by 2 bits. This calculates the size in bytes of an IPv4 header, given the first byte.

LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 9

LD BYTE[9]
JEQ 6, 0, 7

LDMSHX BYTE[0]
LD HALF[X+0]
JEQ 22, 2, 0
LD HALF[X+2]
JEQ 22, 0, 2

LD len
RET A

RET 0

Finally, this filter is only of any use on a SOCKET_DGRAM socket; a socket where the kernel will throw away the link-level header (such as the Ethernet or PPP framing), and present only the network-level header. A packet capturing program would very likely be interested in those link-level bytes too, so would be using a SOCKET_RAW socket instead. In this case, we don't directly know the offset of the IPv4 header, but once again, Linux comes to our rescue. It provides a "virtual view" over the part of the packet from the start of the network header. It provides a virtual offset, NET, where load instructions read relative to the start of the network header, rather than from the start of the packet buffer as a whole.

On a SOCKET_RAW socket, this filter would probably be appropriate to select TCP port 80 over IPv4:

LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 9

LD BYTE[NET+9]
JEQ 6, 0, 7

LDMSHX BYTE[NET+0]
LD HALF[NET+X+0]
JEQ 22, 2, 0
LD HALF[NET+X+2]
JEQ 22, 0, 2

LD len
RET A

RET 0

This is internally implemented by carving up a further extra data area, 1MiB from the top of the packet buffer (defined by a constant SKF_NET_OFF). This offset gives a virtual view of the bytes in the packet, starting at the network protocol header.

Of course, I have been looking IPv4 quite specifically here. With the slowly-growing popularity of IPv6, it's inevitable that packet capture programs might want to capture IPv6 packets too.

IPv6 follows a different style from IPv4 in terms of its header options. In IPv4, all the IP-level options are stored in the header, one after another. The "header length" field in the header gives the total size of the header, with all these options. In IPv6, the header contains fewer fields (because things like fragmentation are now options). At the end of this header is the "next header" number, which is either a protocol number such as TCP or UDP, or gives an IPv6 extension header number (such as fragmentation control, or IPsec's authenticating header). Each option links on to the next with its own "next header" field.

This presents us something of a problem when it comes to packet capture filters. Recall how, for the JEQ instruction, the branch labels in both cases are unsigned integers? This is how BPF guarantees termination of a program in finite time - every jump has to be forwards. There can be no loops. Without loops, the program is guaranteed to terminate.

But now how do we parse these IPv6 headers? We can't write a while() loop in the program to walk down the headers, until we find a TCP one. Furtheremore, IPv6 doesn't define a standard header layout for all headers. Each header type puts its "next header" field in possibly a different place. Some headers are fixed-length, some carry a "header length" field of their own. It's a total mess, from the point of view of packet filtering.

What I would propose, is to create two new metadata constants:

A new Ancillary Data area field, SKF_AD_TRANSPROTO to store the transport level protocol.

A new data area offset, SKF_TRANS_OFF, to give a virtual view of the transport header.

This will allow us to very easily write a packet filter to capture TCP port 80, say, agnostic of IPv4 or IPV6. The filter program would look like:

LD WORD[AD_PROTOCOL]
JEQ 0x0800, 9, 0
JEQ 0x86dd, 0, 8

LD WORD[AD_TRANSPROTO]
JEQ 6, 0, 6

LD HALF[TRANS+0]
JEQ 22, 2, 0
LD HALF[TRANS+2]
JEQ 22, 0, 2

LD len
RET A

RET 0

We've now created a filter which can detect the transport protocol of TCP, and inspect the transport header, without having to directly calculate its offset from hard-coded knowledge of the network protocol.

I would like to propose that Linux adopts these two constants, and finds a way to implement them. I have some thoughts on implementation but I will defer these to a later post; as this one has gone on quite long enough already. :)

2010/04/30

Perl - ExtUtils::H2PM

I've spent a lot of time lately writing modules that wrap Linux kernel features in some way or another. They all boil down to basically the same thing - export a bunch of constants, and structure packing/unpacking functions. Various bits of extra fancy interface code are sometimes nice, but most of the time these can be written in Pure Perl once the base bits are done.

It's always annoyed me that one has to write an XS module just to obtain these. It's a lot of extra work and bother, for something that ought to be so simple. So instead, I started thinking about it, how can I make this much simpler from Perl?

What I came up with is ExtUtils::H2PM; a module to make it trivially-easy to write this sort of boring module to wrap some constants and structures from OS's header files.

As a brief example, consider the following

use ExtUtils::H2PM;

module "Fcntl";

include "fcntl.h";

constant "O_RDONLY";

write_output $ARGV[0];

This is it. This is all the code you, as a module author, have to write. You store that in some file, lets call it Fcntl.pm.PL. You let the build system go about converting that into Fcntl.pm; which on my system now looks like this:

package Fcntl;
# This module was generated automatically by ExtUtils::H2PM from -e

push @EXPORT_OK, 'O_RDONLY';
use constant O_RDONLY => 0;

1;

This is a plain standard Perl module that can be installed and used, to provide that constant.

We can also create pack/unpack-style functions to wrap structure types too. Consider

use ExtUtils::H2PM;

module "Time";

include "time.h";

structure "struct timespec",
  members => [
    tv_sec  => member_numeric,
    tv_nsec => member_numeric,
  ];

write_output $ARGV[0];

For this we obtain:

package Time;
# This module was generated automatically by ExtUtils::H2PM from -e

use Carp;
push @EXPORT_OK, 'pack_timespec', 'unpack_timespec';

sub pack_timespec
{
   @_ == 2 or croak "usage: pack_timespec(tv_sec, tv_nsec)";
   my @v = @_;
   pack "q q ", @v;
}

sub unpack_timespec
{
   length $_[0] == 16 or croak "unpack_timespec: expected 16 bytes";
   my @v = unpack "q q ", $_[0];
   @v;
}

1;

This was done entirely automatically too.

In a real use-case, you'd be using this to wrap things like socket option constants/structures, and so on; cases where existing Perl functions wrap existing syscalls, you simply want to provide new option values or structures. I've now written several modules mostly or entirely relying on ExtUtils::H2PM to build them, freeing me of having to write all that XS code to implement them.

Linux::SocketFilter

Socket::Netlink

Socket::Netlink::Route

Socket::Netlink::Taskstats

2010/04/26

Order matters even when it doesn't

Revision control diffs are most readable when they aren't noisy. Operations that disturb the order of many lines in the file create noise which makes it hard to read the interesting change. YAML specifies mappings (hashes, to us perl-types), that are unordered associations of keys to values. Even though YAML doesn't put an ordering on those, sometimes we'd like to pretend that it does, so as to preserve the order when we load a file, edit the data, then dump it back to the file.

At work we store a YAML document in Subversion, which describes a lot of details about IPsec tunnels. In an ideal world this would be the initial source of the information. The world, as you may have observed, is not yet ideal, so this file is in fact back-scraped from information in the actual config files, to keep it up to date. Naturally that's done in Perl.

The YAML file stores a big mapping, each entry itself being a record-like mapping, containing details in named keys. This causes great trouble for our load/edit/dump script, because YAML doesn't specify an ordering in mapping keys. They'll be dumped in "no particular order".
This wouldn't normally be a problem, except that because it's stored in Subversion, a commit changing one line of actual detail might suddenly produce hundreds of lines of false diff, because of reordered keys.

To solve this, I had to apply Much Evil Hackery. The YAML Perl module, it turns out, has a data structure tied to a hash, which remembers the order of keys. By subclassing YAML::Loader and replacing its method to read a mapping into a hash ref, we can force it to use this structure instead. This alteration is transparent to the perl code inbetween, it just sees a normal hash. However, YAML::Dumper sees the ordering and preserves it when it writes out.

The upshot: Load/edit/dump of trees of mappings in YAML preserves ordering, allowing cleaner commits into revision control.

This has been suggested as a wishlist bug against YAML; see also https://rt.cpan.org/Ticket/Display.html?id=56741

2010/04/06

Why I won't tell you what to do

Often in #perl we get people asking a question, whether explicitly or implied, that requires us to pick a solution for them. I try hard not to do this. The closest I'll get is to listen to their description of the problem, and suggest some things which I think might help. Hardly ever do I suggest just one thing.

I do this because it's hard for us to know the entire surrounding context of a problem. If anyone else is like me, then there'll be days, weeks, maybe even months of history behind it; various attempts they've tried already, other bits and pieces of code, system, whatever, that they haven't been able to explain in the 5 minutes they tried to give us the problem. You can't explain a 1-month problem in 5-minutes, nor can I give a solution to it in similar timeframe.

What I can do is name a few things that I'd include in a shortlist of things to think about in more detail, were I to find myself with a similar problem. They can then go away and think about these things. Maybe he already was aware of them, and we've just given some more confidence that those might be correct. Or maybe he wasn't, so we've given him something new to read about. Either way, we've helped guide the decision process, without outright saying "thou shalt do this" - because, without having that month of context around it; for all we know it could be completely wrong. But it's a good start.

2010/03/31

Perl - IO::Async

IO::Async is:

a Perl module distribution
a generic eventing framework
suitable for all kinds of IO-bound tasks
a way to achieve IO concurrency
portable to Linux, Solaris, BSDs, other POSIXes
mostly-working on cygwin
able to use OS-specific optimisations where appropriate
able to use Glib for running GTK2-based GUI applications

IO::Async serves as a base layer for all kinds of IO-heavy programs.

Its design is centred around passing code references as continuations; handlers to invoke when a certain event happens, or after a certain operation has completed. By relying on lexical variable capture in these code references, these continuations can easily store state in normal variables.

It's a different approach than that taken by POE, the other eventing system for Perl, but I find this approach fits better in my head. If you dislike storing state in a hash to pass it around disjoint named functions loosely connected by event names, you might just want to give it a try...