2016/06/08

Devel::MAT interactive debugging and filehandles

For a while now, I've been using Devel::MAT for all sorts of debugging tasks. Originally I wrote it in order to track down a single memory-leak case in a large program, but due to the very generic way it just captures as much of the perl process state as it can for later analysis, it turns out to be useful for a variety of other problems as well. Every time I try to debug a new sort of problem that it doesn't quite fit for, I end up adding more features and making a new release, so it's growing in ability quite nicely now.

A recent bug I was trying to fix involved leaking filehandles - a longrunning process ends up running out of spare filehandles and receives EMFILE trying to make a socket(2) call. So, newly added to Devel::MAT version 0.23 is tracking of file numbers associated with IO handles.

To start this session of bug-hunting, first I had to provoke the leak into happening, and capture a .pmat file from it. Provoking the leak proved to be quite easy by just artificially lowering the maximum file descriptor count using ulimit, and capturing the file was achieved using the usual Devel::MAT::Dumper options:

$ ulimit -n 64
$ perl -MDevel::MAT::Dumper=-eager_open,-dump_at_END,-dump_at_DIE ...

Sure enough, it didn't take long running my program before it died because of EMFILE, at which point a .pmat file appeared. Now we can take a look into it and find the problem. So I started by loading this into my new pmat interactive debugging shell, which wraps a number of commandline-type tools, allowing interactive application of them to the loaded file, without having to reload the file each and every time. This speeds up such debugging sessions a lot.

$ pmat run-tests.pl.pmat 
Perl memory dumpfile from perl 5.22.1
Heap contains 156829 objects
pmat> 

The first useful command to run here is io list to obtain a list of every IO SV that has at least one file number associated with it. This will correspond to the filehandles the kernel is complaining that we have too many of. These will be a good place to start to see if something is leaking that shouldn't.

pmat> io list
Addr                           ifileno  ofileno
IO() at 0xa2f378               -1       -1
IO() at 0x8cf540               0        -1
IO() at 0x8cf570               1        1
IO() at 0x8cf5d0               2        2
IO() at 0x16063f8              3        3
...
IO() at 0x3bf8e68              59       59
IO() at 0x3b7e548              60       60
IO() at 0x3bccda8              61       61
IO() at 0x3b57958              62       62
IO() at 0x3b4b950              63       63

Well sure enough, we do seem to have our full allocation of 64 filehandles, and luckily for us they seem to be perl-visible ones (lucky because Devel::MAT can't peek into the inner workings of lower-level C libraries, which could have been the reason here). The first few handles will be the usual STDIN, STDOUT and friends, and some socket handles that the process itself is supposed to be hanging onto, so our leak likely relates to the higher-numbered ones. Lets pick a random candidate near the end, and see if we can investigate further what it's doing still hanging around.

The identify command takes the address of an SV, and tries to walk back up the references chain, looking for what refers to it, so we can find out what's still holding it.

pmat> identify 0x3bf8e68
IO() at 0x3bf8e68 is:
└─the io of GLOB(%*I) at 0x282f118, which is:
  ├─the referrant of REF() at 0x3b7ef98, which is:
  │ └─value {write_handle} of HASH(29) at 0x3b577c0, which is (*A):
  │   └─the referrant of REF() at 0x3476cc0, which is:
  │     └─element [0] of ARRAY(1) at 0x3bbc120, which is:
  │       └─the referrant of REF() at 0x1f81eb8, which is:
  │         └─the lexical $conns of PAD(23) at 0x3bcc658, which is:
  │           └─pad at depth 1 of CODE(PP,C) at 0x3b4c2b0, which is:
  │             └─the referrant of REF() at 0x3bbb358, which is:
  │               └─the lexical $on_closed of PAD(14) at 0x3bcc6a0, which is:
  │                 └─pad at depth 1 of CODE(PP,C) at 0x1cfa850, which is:
  │                   └─the referrant of REF() at 0x3b63698, which is:
  │                     └─value {on_closed} of HASH(29) at 0x3b577c0, which is:
  │                       └─already found as *A
  └─the referrant of REF() at 0x3bf8550, which is:
    └─value {read_handle} of HASH(29) at 0x3b577c0, which is:
      └─already found as *A

This looks suspect. The SV identified with the *A marker appears later on in its own referrant chain. That sounds like a cyclic reference. Lets look at that one in more detail.

HASH(29) at 0x3b577c0 is:
└─the referrant of REF() at 0x3476cc0, which is:
  └─element [0] of ARRAY(1) at 0x3bbc120, which is:
    └─the referrant of REF() at 0x1f81eb8, which is:
      └─the lexical $conns of PAD(23) at 0x3bcc658, which is:
        └─pad at depth 1 of CODE(PP,C) at 0x3b4c2b0, which is:
          └─the referrant of REF() at 0x3bbb358, which is:
            └─the lexical $on_closed of PAD(14) at 0x3bcc6a0, which is:
              └─pad at depth 1 of CODE(PP,C) at 0x1cfa850, which is:
                └─the referrant of REF() at 0x3b63698, which is:
                  └─value {on_closed} of HASH(29) at 0x3b577c0, which is:
                    └─already found circularly

Indeed, this is the classic identify shape of a leaked cyclic reference; a single linear chain of references that loops back onto itself. Just because we have a cycle though doesn't necessarily mean that's bad. Creating a reference cycle in Perl is only bad if nothing ever cleans it up again. Cyclic references can be perfectly safe, provided something cleans them up again at some point. To determine what's wrong with this one and why it didn't get cleaned up requires some inspection of the code that created it. As this particular cycle involves two lexical scalars in two closures we can use the show command lets us see more detail about these two CODE SVs.

pmat> show 0x3b4c2b0
CODE(PP,C) at 0x3b4c2b0 with refcount 1
  name=&Net::Async::HTTP::__ANON__
  stash=STASH(66) at 0x12fe9c0
  glob=GLOB(*) at 0x1b4eaf0
  location=/home/leo/lib/perl5/Net/Async/HTTP.pm line 422
  no scope
  no padlist
  no padnames
  pad[0]=PAD(23) at 0x3bcc658
Referenced globs:
  GLOB(@I) at 0x8b1ea0, GLOB(&*) at 0x1b40bf0

pmat> show 0x1cfa850
CODE(PP,C) at 0x1cfa850 with refcount 1
  name=&Net::Async::HTTP::Connection::__ANON__
  stash=STASH(65) at 0x1a3cac0
  glob=GLOB(*) at 0x1ab35f0
  location=/home/leo/lib/perl5/Net/Async/HTTP/Connection.pm line 62
  no scope
  no padlist
  no padnames
  pad[0]=PAD(14) at 0x3bcc6a0

We can now take a peek at the first of these closures by looking in the source code at the indicated line number; which appears to be the on_closed event handler of the Net::Async::HTTP instance:

      on_closed => sub {
         my $conn = shift;
         my $http = $conn->parent;
 
         $conn->remove_from_parent;
         @$conns = grep { $_ != $conn } @$conns;
 
         if( my $next = first { !$_->connecting } @$ready_queue ) {
            # Requeue another connection attempt as there's still more to do
            $http->get_connection( %args, ready => $next );
         }
      },

Sure enough, there's the $conns lexical that the identification trace suggested was the culprit. Indeed, this code should in fact have removed the connection from that array when it runs, thus breaking the cycle and allowing all these instances to be reclaimed, avoiding a leak. That the leak happened suggests this cleanup code did not in fact execute.

At this point, we can apply some analysis of the code changes leading up to this leak being introduced in the first place. On this occasion, the change was itself a workaround for another bug involving connection pooling and reuse in the HTTP client, by instead using a single-shot HTTP client for each individual request, which is added to the IO::Async::Loop instance and then removed when it is done.

Looking again at the above cleanup code which we expect didn't run, and with knowledge of the code change and the basics of how IO::Async::Notifiers work, we surmise that the code doesn't run because the file handle is never actually explicitly closed. In normal operation with longlived HTTP clients, this isn't a problem as either the client times out and closes the connection itself, or the server does and we're informed of an EOF to close it. These two cases give the on_closed code its opportunity to run, thus avoiding a resource leak in the normal case.

But in our one-shot HTTP client that then gets removed, this never happens. The connection remains open at the point that the client gets removed from the loop, wherein it now has no opportunity to be informed that the server closed it, so the cleanup code never runs. This one now looks like quite an easy fix; just ->close all the remaining connections at the time the client is removed from the loop, by making use of the _remove_from_loop method:

sub _remove_from_loop
{
   my $self = shift;

   foreach my $conn ( map { @$_ } values %{ $self->{connections} } ) {
      $conn->close;
   }

   $self->SUPER::_remove_from_loop( @_ );
}

In fact, this is the very bugfix that was added to make release Net::Async::HTTP version 0.41.

So there we have it - a real-world example of an actual bug that was found and fixed with the help of Devel::MAT.