2010/05/18

PF_PACKET, Linux Socket Filters, and IPv6

For diagnosing network-related problems, it's often useful to be able to capture packets transmitted or received by a machine. Linux implements a socket family, PF_PACKET, to this end. Sockets in this family receive raw datagrams containing packets received or transmitted on network interfaces. A network capture program creates such a socket, then sits in a loop receiving datagrams. Each datagram contains the bytes of the packet. The AF_PACKET address format gives information about, among other things, packet direction and interface number.

In situations where a machine is passing lots of traffic, such as a busy internet router, but the problem being diagnosed concerns only a narrow selection of this traffic, capturing every single packet would be very CPU-intensive. The application could inspect each packet and discard all of those not of interest to it. But since each packet requires a recvfrom(2) system call, this gets quite expensive in context switches, and wasted buffer space. A far more efficient scheme is to have the kernel filter the packets, or at least throw away most of the uninteresting ones.

In Linux, this is done using the SO_ATTACH_FILTER socket option. This option attaches a filter program, which executes on a simple virtual machine within the kernel. This machine is given access to inspect the bytes of the packet, and can return a value indicating whether the packet should be kept, or discarded. This machine is based on BSD's BPF mechanism, with some extensions.

This virtual machine is a register machine. It has a single accumulator (A), a single index register (X), and the various usual sorts of arithmetic and logical operations one would expect. To inspect the actual packet contents it has instructions to load data from the captured packet, in 8, 16, or 32bit unsigned quantities, into the accumulator. The program gives its answer by returning a number to the kernel, which should be the number of bytes of the packet to capture, or 0 to discard it entirely.

For example, lets consider the following example, which selects any TCP packets:
LD BYTE[9]
JEQ 6, 0, 2

LD len
RET A

RET 0
The LD BYTE[offs] instruction loads a byte from the packet (at offset 9, being where IPv4 headers keep their protocol number) into the A register. JEQ is a conditional jump instruction which compares the A register to the immediate constant (6, the protocol number of TCP). If they are equal it jumps to the first label; if not, the second. Both jump labels are unsigned integers, being the number of instructions forwards to skip. In the true case (i.e. the protocol is TCP), the length of the packet is loaded into the accumulator and returned from the filter. In the false case, the number 0 is returned, to discard the packet.

Of course, this filter isn't much use yet, because we don't know for sure it's even an IPv4 packet we've caught. The original BSD BPF doesn't define any metadata access scheme, so Linux has invented a way for programs to access this. It reserves the topmost 4KiB of packet buffer, to provide some virtual "registers" containing metadata. Since the load instruction offsets are 32bit integers, it is unlikely any real packet would ever be anywhere near this size, so in practice this works well.

In this extra data area, called the Ancillary Data, the first offset stores the ethertype of the packet (this field address has a symbolic name, SKF_AD_PROTOCOL). In IPv4's case, this will be 0x0800. We can extend our filter to look at this protocol number (not to be confused with IPv4's protocol field), and check it.
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 4

LD BYTE[9]
JEQ 6, 0, 2

LD len
RET A

RET 0
We can further extend our filter, looking only for a certain TCP port number (such as 80, for http). Presuming an IPv4 header with no extra options, it is 20 bytes long, and therefore, the TCP header will start at offset 20:
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 8

LD BYTE[9]
JEQ 6, 0, 6

LD HALF[20]
JEQ 22, 2, 0
LD HALF[22]
JEQ 22, 0, 2

LD len
RET A

RET 0
Of course, this filter isn't much use in the real world, because we made the rather large assumption that an IPv4 header would be 20 bytes long. This is where the index register becomes useful. We can load the index register, X, with the size of the IPv4 header, and access the TCP header relative to this. BPF provides a very special instruction, LDMSHX, for exactly this purpose. Given the address of a byte, it loads that byte, masking off the bottom 4 bits, and shifts up by 2 bits. This calculates the size in bytes of an IPv4 header, given the first byte.
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 9

LD BYTE[9]
JEQ 6, 0, 7

LDMSHX BYTE[0]
LD HALF[X+0]
JEQ 22, 2, 0
LD HALF[X+2]
JEQ 22, 0, 2

LD len
RET A

RET 0
Finally, this filter is only of any use on a SOCKET_DGRAM socket; a socket where the kernel will throw away the link-level header (such as the Ethernet or PPP framing), and present only the network-level header. A packet capturing program would very likely be interested in those link-level bytes too, so would be using a SOCKET_RAW socket instead. In this case, we don't directly know the offset of the IPv4 header, but once again, Linux comes to our rescue. It provides a "virtual view" over the part of the packet from the start of the network header. It provides a virtual offset, NET, where load instructions read relative to the start of the network header, rather than from the start of the packet buffer as a whole.

On a SOCKET_RAW socket, this filter would probably be appropriate to select TCP port 80 over IPv4:
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 9

LD BYTE[NET+9]
JEQ 6, 0, 7

LDMSHX BYTE[NET+0]
LD HALF[NET+X+0]
JEQ 22, 2, 0
LD HALF[NET+X+2]
JEQ 22, 0, 2

LD len
RET A

RET 0
This is internally implemented by carving up a further extra data area, 1MiB from the top of the packet buffer (defined by a constant SKF_NET_OFF). This offset gives a virtual view of the bytes in the packet, starting at the network protocol header.

Of course, I have been looking IPv4 quite specifically here. With the slowly-growing popularity of IPv6, it's inevitable that packet capture programs might want to capture IPv6 packets too.

IPv6 follows a different style from IPv4 in terms of its header options. In IPv4, all the IP-level options are stored in the header, one after another. The "header length" field in the header gives the total size of the header, with all these options. In IPv6, the header contains fewer fields (because things like fragmentation are now options). At the end of this header is the "next header" number, which is either a protocol number such as TCP or UDP, or gives an IPv6 extension header number (such as fragmentation control, or IPsec's authenticating header). Each option links on to the next with its own "next header" field.

This presents us something of a problem when it comes to packet capture filters. Recall how, for the JEQ instruction, the branch labels in both cases are unsigned integers? This is how BPF guarantees termination of a program in finite time - every jump has to be forwards. There can be no loops. Without loops, the program is guaranteed to terminate.

But now how do we parse these IPv6 headers? We can't write a while() loop in the program to walk down the headers, until we find a TCP one. Furtheremore, IPv6 doesn't define a standard header layout for all headers. Each header type puts its "next header" field in possibly a different place. Some headers are fixed-length, some carry a "header length" field of their own. It's a total mess, from the point of view of packet filtering.

What I would propose, is to create two new metadata constants:

  • A new Ancillary Data area field, SKF_AD_TRANSPROTO to store the transport level protocol.

  • A new data area offset, SKF_TRANS_OFF, to give a virtual view of the transport header.

This will allow us to very easily write a packet filter to capture TCP port 80, say, agnostic of IPv4 or IPV6. The filter program would look like:
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 9, 0
JEQ 0x86dd, 0, 8

LD WORD[AD_TRANSPROTO]
JEQ 6, 0, 6

LD HALF[TRANS+0]
JEQ 22, 2, 0
LD HALF[TRANS+2]
JEQ 22, 0, 2

LD len
RET A

RET 0
We've now created a filter which can detect the transport protocol of TCP, and inspect the transport header, without having to directly calculate its offset from hard-coded knowledge of the network protocol.

I would like to propose that Linux adopts these two constants, and finds a way to implement them. I have some thoughts on implementation but I will defer these to a later post; as this one has gone on quite long enough already. :)

9 comments:

  1. How can I try out the sample filters that you provided?

    ReplyDelete
  2. @Anonymous:

    Ah, oops. I possibly should have made that part a little clearer. These filters are written in the syntax as understood by my Linux::SocketFilter Perl module. I've been developing it for a while, hence why I've been writing my thoughts on the subject, and also why its tangential relevance to ironman.

    You can find the module at

    http://search.cpan.org/dist/Linux-SocketFilter/

    though it goes best with the module to provide PF_PACKET sockets in Perl

    http://search.cpan.org/dist/Socket-Packet/

    ReplyDelete
  3. hi leonerd:
    can you provide a short C example(or bpf instruction) to filter out SKF_AD_PROTOCOL = ipv4,I had google it these days but found nothing.I didn't know the correct instruction

    ReplyDelete
  4. Surely something involving

    LD WORD[AD_PROTOCOL]
    JEQ ETHERTYPE_IPV4, true, false

    would do it?

    Hint: IPv4's ethertype is 0x0800

    ReplyDelete
  5. hi leonerd:
    thanks
    can I ask you one more question?
    how to write this bpf inst?

    step0

    unsigned long interm;

    interm = udp[32:4]&udp[36:4]&0x01010101 //src & dst & 0x01010101


    step1:

    result=inetrm[0]&interm[1]&interm[2]&interm[3]

    the main problem is that I don't know how to save the result to a intermediate variable,then get one byte from a intermediate variable(use st inst?) and do 'and' operation with each of its bytes.

    ReplyDelete
  6. hi leonerd:

    I am not sure if I expressed clearly

    I know there is 'ldb' inst to load one byte from packet, but how to do with intermediate variable?

    (031) ldb [x + 48]
    (032) st M[11] //this is intermediate variable
    (033) ldb [x + 52]
    (034) tax
    (035) ld M[11]
    (036) and x

    can I use:
    ldb M[11:1] or something like that?

    I want to do :
    M11[0]&m11[1]&M11[2] & M11[3]

    ReplyDelete
  7. I would say don't do it quite like that, but instead accumulate each byte individually. You are in fact just ANDing 8 bytes together.

    LD BYTE[32]
    LDX A
    LD BYTE[33]
    AND X
    LDX A
    LD BYTE[34]
    AND X
    LDX A
    /* repeat for 35..38 */
    LD BYTE[39]
    AND X
    AND 0x01

    /* answer in now in A */

    ReplyDelete
  8. regarding "Surely something involving
    LD WORD[AD_PROTOCOL] (->replace AD_PROTOCOL with SKF_AD_QUEUE)
    JEQ ETHERTYPE_IPV4, true, false"
    how to write this in C language?

    like the below?

    struct sock_filter BPF_code[]= {
    { 0x28,0,0,SKF_AD_OFF+SKF_AD_QUEUE},
    { 0x15, 0, 1, 0x00000001 },
    { 0x6, 0, 0, 0x0000ffff },
    { 0x6, 0, 0, 0x00000000 }
    };

    thanks!

    ReplyDelete
  9. Possibly. I don't usually work on them at that low a level - that's what I wrote the assembler for :)

    Ofcourse, you could likely use the Linux::SocketFilter::Assembler perl module to assemble an array of instructions for a C program. Strikes me as relatively simple in fact.

    ReplyDelete