LeoNerd's programming thoughts: Devel::MAT

Showing posts with label Devel::MAT. Show all posts

2018/09/27

Devel::MAT investigation into C - part 3

In the previous part we investigated a memory leak in a Perl program, and found the line of C code in a module responsible for creating the SVs that are leaking. This of course is only half of the problem - the other half is the fact that these SVs aren't getting reclaimed again when they're no longer needed, so to answer that question we'll have to investigate a little more. We'll attempt to go looking around the code at where this SV pointer is stored, and look for other parts of code that would try to reclaim it later on, and see if we can see why they don't.

Taking a look again at the source surrounding the line that is creating these leaking SVs, we see an interesting call to a function called cb_data_advanced_put() which appears to be taking ownership of it.

cb_data_advanced_put(ctx, "tlsext_status_cb!!func", newSVsv(callback));
cb_data_advanced_put(ctx, "tlsext_status_cb!!data", newSVsv(data));
SSL_CTX_set_tlsext_status_cb(ctx, tlsext_status_cb_invoke);

This cb_data_advanced_put() is defined by the XS code itself, storing the data under the named key in a hash associated with the SSL context stored in ctx. The code implementing that looks like this:

int cb_data_advanced_put(void *ptr, const char* data_name, SV* data)
{
    HV * L2HV;
    SV ** svtmp;
    int len;
    char key_name[500];
    dMY_CXT;

    len = my_snprintf(key_name, sizeof(key_name), "ptr_%p", ptr);
    if (len == sizeof(key_name)) return 0; /* error  - key_name too short*/

    /* get or create level-2 hash */
    svtmp = hv_fetch(MY_CXT.global_cb_data, key_name, strlen(key_name), 0);
    if (svtmp == NULL) {
        L2HV = newHV();
        hv_store(MY_CXT.global_cb_data, key_name, strlen(key_name), newRV_noinc((SV*)L2HV), 0);
    }
    else {
        if (!SvOK(*svtmp) || !SvROK(*svtmp)) return 0;
        L2HV = (HV*)MUTABLE_PTR(SvRV(*svtmp));
    }

    /* first delete already stored value */
    hv_delete(L2HV, data_name, strlen(data_name), G_DISCARD);
    if (data!=NULL)
        if (SvOK(data))
            hv_store(L2HV, data_name, strlen(data_name), data, 0);

    return 1;
}

This function takes ownership of the SV pointer given in data by storing it in a hash - an HV. That HV is a second-level hash, itself stored keyed by the given context pointer ptr into a hash maintained by the toplevel context in MY_CXT.global_cb_data. To understand the intended code path to release this, we'll have to find where that gets reclaimed. Inspection of the code that manages this top-level hash, and surrounding functions in the file, soon finds another function which appears to be doing that:

int cb_data_advanced_drop(void *ptr)
{
    int len;
    char key_name[500];
    dMY_CXT;

    len = my_snprintf(key_name, sizeof(key_name), "ptr_%p", ptr);
    if (len == sizeof(key_name)) return 0; /* error  - key_name too short*/

    hv_delete(MY_CXT.global_cb_data, key_name, strlen(key_name), G_DISCARD);
    return 1;
}

The important line here is the call to hv_delete(), which is deleting a key from the shared HV stored in MY_CXT.global_cb_data. When that HV is deleted, Perl should recurse into it and reclaim all of the values stored by it, including the data we put there earlier. But yet it appears not to be - or else we wouldn't observe our memory leak in the first place - so we must dive a little deeper to understand why.

Lets start by dumping the contents of the hash that hv_delete() is discarding, to see if it contains the keys we expect - remember, from the first block of code we found, we're expecting an entry under the key named tlsext_status_cb!!data. We'll do that by adjusting the code and adding a call to the handy debugging function sv_dump() so we can see what's inside it:

SV *tmp = hv_delete(MY_CXT.global_cb_data, key_name, strlen(key_name), 0);
sv_dump(tmp);

When we run this we get a nice detailed snapshot of what was inside the HV at the time it was cleaned up. The full output is quite long, but here is a cut-down version of the most relevant detail:

SV = IV(0x55c769db3358) at 0x55c769db3368
  FLAGS = (TEMP,ROK)
  RV = 0x55c769db2f78
    SV = PVHV(0x55c769dd9280) at 0x55c769db2f78
      KEYS = 1
        Elt "tlsext_status_cb!!func" HASH = 0x567bfc0c

We have a reference to a hash, containing just one key (because KEYS = 1), being the tlsext_status_cb!!func part, but it seems we're missing the related !!data key, which is the leaking one. But this seems unusual - the code path we encountered earlier would always unconditionally insert them both. Either the data key went missing sometime, or never got inserted in the first place.

Perhaps it would help to add some further logging to the cb_data_advanced_put function, having it log what the HV looks like after it's added the key, so we can see if it was never added or goes missing later on. We'll add this to the end of the function:

fprintf(stderr, "After cb_data_advanced_put key=<%s>:\n", data_name);
sv_dump(sv_2mortal(newRV_inc(L2HV)));

(The added complication of calling sv_2mortal on newRV_inc is simply to get around an odd quirk of sv_dump(), that it will dump the contents of a hash given by indirect RV reference, but not by HV pointer directly. We have to make and dump an RV referencing it, just so we can see those keys.)

Running this, we find some more logs from its behaviour:

After cb_data_advanced_put key=:
SV = IV(0x55932b203ec0) at 0x55932b203ed0
  FLAGS = (TEMP,ROK)
  RV = 0x55932b1db110
    SV = PVHV(0x55932b20c4d0) at 0x55932b1db110
      KEYS = 2
        Elt "ssleay_verify_callback!!func" HASH = 0x4101f92c
        Elt "tlsext_status_cb!!func" HASH = 0x462de35f

After cb_data_advanced_put key=:
SV = IV(0x55932b204130) at 0x55932b204140
  FLAGS = (TEMP,ROK)
  RV = 0x55932b1db110
    SV = PVHV(0x55932b20c4d0) at 0x55932b1db110
      KEYS = 2
        Elt "ssleay_verify_callback!!func" HASH = 0x4101f92c
        Elt "tlsext_status_cb!!func" HASH = 0x462de35f

Well this is odd. After adding the func key, it was indeed there (along with another relating to the verify callback, which we can ignore for our purposes), leaving it with KEYS = 2. But after attempting to put the data data in, it still only has those two and no further key was added. It looks like the function didn't attempt to add the key to the HV.

To understand why, a further look at a small section of the code in cb_data_advanced_put() might help here:

if (data!=NULL)
    if (SvOK(data))
        hv_store(L2HV, data_name, strlen(data_name), data, 0);

The hv_store() function is only called if data points to a real SV, but contains some defined value (i.e. SvOK() is true). But recalling the leaking SV candidates we found in part 1 - they were all UNDEF. It seems perhaps we've found the reason. If the data for the tlsext_status_cb is an undefined scalar value, a new SV is allocated for it (by calling newSVsv()) which is never stored in the HV, and thus is never reclaimed when the HV itself is released. Instead, the SV just sits around in memory, unreferenced by anything else, with nothing to clean it up afterwards.

If this is the case, then we can easily fix it by just throwing away the SV immediately if it is undefined. We can change the code to:

if (data!=NULL) {
    if (SvOK(data))
        hv_store(L2HV, data_name, strlen(data_name), data, 0);
    else
        SvREFCNT_dec(data);
}

Applying this to the code, rebuilding, and testing again shows this does seem to fix this bug, as now we no longer get a constantly-increasing SV count:

SVs 49656 (+49656)
SVs 56515 (+6859)
SVs 56509 (-6)
SVs 56509 (+0)
SVs 56509 (+0)
SVs 56508 (-1)
SVs 56508 (+0)

All that remains now is to send it upstream in the form of a bug report - which I have done in RT#127131.

2018/09/24

Devel::MAT investigation into C - part 2

In the previous part we discovered and started investigating a memory leak bug in a program, which got as far as needing a custom-built perl binary to look further into where the leak comes from. We built that with perlbrew and some specific compiler flags.

This new memory-logging feature is controlled by setting an environment variable (PERL_MEM_LOG) to specify the file descriptor to write the log to, and some flags to say what kinds of events will cause logs to be written. This log won't be immediately helpful to us on its own, but when combined with the analysis from Devel::MAT will let us see the history of each individual SV, and specifically the ones that pmat-leakreport tells us are our leak candidates.

To see the logging in action, first lets check that it is working on a test run of perl -V. We'll ask it to output information about SV allocations and deletions to file descriptor 3 (so as not to disturb the process's normal output on the standard output or error streams), and send that to a file:

$ PERL_MEM_LOG=3s debugperl5.26.2 -V 3>perlmem.log
Summary of my perl5 (revision 5 version 26 subversion 2) configuration:
   
...

$ wc -l perlmem.log
8170 perlmem.log

$ head -5 perlmem.log 
new_SV: sv.c:9377:Perl_newSVpvn: 564a933f1f78
new_SV: sv.c:9678:Perl_newSV_type: 564a933f1f90
new_SV: sv.c:9678:Perl_newSV_type: 564a933f1fa8
new_SV: sv.c:9678:Perl_newSV_type: 564a933f1fc0
new_SV: sv.c:9377:Perl_newSVpvn: 564a933f1fd8

$ tail -5 perlmem.log 
del_SV: sv.c:7075:Perl_sv_free2: 564a93407e90
del_SV: sv.c:7075:Perl_sv_free2: 564a93407d88
del_SV: sv.c:7075:Perl_sv_free2: 564a93407de8
del_SV: sv.c:6805:Perl_sv_clear: 564a934077e8
del_SV: sv.c:7075:Perl_sv_free2: 564a933f2038

Here we've got a nice long file, with one line per memory allocation. We can now run our leaky process again, this time with the memory log added as well. This time, when pmat-leakreport gives us some candidates we'll be able to see in more detail where they came from, at the C level:

$ PERL_MEM_LOG=3s debugperl5.26.2 ... 3>perlmem.log
...

$ pmat-leakreport -1 perl-{4,5,6,7,8}.pmat
LEAK[1] UNDEF() at 0x564e10091f80
LEAK[1] UNDEF() at 0x564e107ae740
LEAK[1] UNDEF() at 0x564e10794008

$ grep 564e10091f80 perlmem.log
...
new_SV: sv.c:9248:Perl_sv_newmortal: 564e10091f80
del_SV: sv.c:7075:Perl_sv_free2: 564e10091f80
new_SV: sv.c:9678:Perl_newSV_type: 564e10091f80
del_SV: sv.c:6805:Perl_sv_clear: 564e10091f80
new_SV: sv.c:9753:Perl_newSVsv: 564e10091f80

Well that's partly useful - we can see that the final memory operation at that address was indeed an allocation, with some previous history there which we can ignore. The program did allocate an SV at that address, that wasn't further touched until the end of the program. Problem is, this report itself isn't all that helpful - all we learn from it is that the function that allocated this SV was Perl_newSVsv. That's not saying much - about half of the SVs ever allocated by a program will be through that function. If we want to dig deeper even than this, we'll have to find a way to get the C-level stack trace at the time that allocation was made. To do that we have no alternative but to go editing perl core source code and rebuilding it once more.

We'll abuse the build directory left behind by perlbrew, as it provides a convenient place where the source is already unpacked and configured ready to build. Lets go find it, and edit the source code to add a little extra tracing.

We start by configuring the source tree to include the C backtrace feature, and then apply a patch to util.c file, to have it call the backtrace function (the source of the patch is attached at the end of this blog post):

$ rm config.sh
$ ./Configure -des -Dusedevel -Dprefix=$HOME/myperl -DDEBUGGING -Accflags=-DPERL_MEM_LOG -Dusecbacktrace
$ patch -p1 < print-c-backtrace-on-memlog.patch
$ make

Now when we run this and pass the new c flag as part of PERL_MEM_LOG, we'll get 3 layers of C stack trace along with each line in the memory log. This will tell us what C code called Perl_newSVsv to create our leaking scalars:

$ PERL_MEM_LOG=3sc debugperl5.26.2 ... 3>perlmem.log
...

$ pmat-leakreport -1 perl-{4,5,6,7,8}.pmat
LEAK[1] UNDEF() at 0x55f72bd4b4f0
LEAK[1] UNDEF() at 0x55f72c015088
LEAK[1] UNDEF() at 0x55f72bd57238

$ grep -A3 55f72bd4b4f0 perlmem.log
...
new_SV: sv.c:9753:Perl_newSVsv: 55f72bd4b4f0
  frame[0]: 0x55f72a04d4c3 Perl_newSVsv at /home/leo/perl5/perlbrew/build/perl-5.26.2/perl-5.26.2/perl +0x16d4c3
  frame[1]: 0x7f4644e5af73 - at /home/leo/perl5/perlbrew/perls/perl-5.26.2/lib/site_perl/5.26.2/x86_64-linux/auto/Net/SSLeay/SSLeay.so +0x78f73
  frame[2]: 0x55f72a019082 Perl_pp_entersub at /home/leo/perl5/perlbrew/build/perl-5.26.2/perl-5.26.2/perl +0x139082

Now we are definitely getting somewhere. We know the call to Perl_newSVsv was made from Net::SSLeay, and we have the path to the object file, and the offset address within it. The helpful addr2line utility can convert that into something more meaningful:

$ addr2line -fe /home/leo/perl5/perlbrew/perls/perl-5.26.2/lib/site_perl/5.26.2/x86_64-linux/auto/Net/SSLeay/SSLeay.so 0x78f73
XS_Net__SSLeay_CTX_set_tlsext_status_cb
/home/leo/.cpan/build/Net-SSLeay-1.85-1/SSLeay.xs:6453

The leaked scalars were created within the XS function behind the Net::SSLeay::CTX::set_tlsext_status_cb function, and we even get a source file and line number. We can go and inspect the source code:

  6450              SSL_CTX_set_tlsext_status_cb(ctx, NULL);
  6451          } else if (SvROK(callback) && (SvTYPE(SvRV(callback)) == SVt_PVCV)) {
  6452              cb_data_advanced_put(ctx, "tlsext_status_cb!!func", newSVsv(callback));
  6453              cb_data_advanced_put(ctx, "tlsext_status_cb!!data", newSVsv(data));
  6454              SSL_CTX_set_tlsext_status_cb(ctx, tlsext_status_cb_invoke);
  6455          } else {
  6456              croak("argument must be code reference");

Sure enough, there is indeed our call to newSVsv().

So now we have a definite location in the code where we can see we are creating new SVs that don't get cleaned up again. This is a start, but is only the first step of fixing the bug. We also need to find out why it isn't being released again afterwards - for that we'll continue in part 3...

The patch to util.c to create this logging is:

2018/09/20

Devel::MAT investigation into C - part 1

I have previously written about using Devel::MAT to find the cause of memory leaks in Perl programs. While it is often able to find the reasons for memory growth in a great number of cases, there are times when it cannot alone find the whole story, and we must combine other tools to finish our investigation. Here we'll look at one particular case that - like so many others - begins with the observation that a long-running Perl process slowly consumes ever-more memory from the host operating system, the longer it runs. We'd like to find out why, and fix it.

The process under investigation this time is a long-running script that regularly polls a number of sites over HTTP (well, HTTPS in fact) to collect up and aggregate some data from them. The rate of memory leak is observed to correlate with how often it performs this polling, so it suggests that the leak is likely caused by something that is done on every iteration of the program losing the same amount each time. The regular iteration of the program provides us a convenient point to attach some memory debugging to begin our investigation.

A slow increase of memory could be caused by a Perl-visible accumulation of SVs, or could be something only visible at the C layer. Before we start investigating with the heavy guns of Devel::MAT, we might do well to try to classify what's going on with something more lightweight. In this case, Devel::Gladiator provides us a simple way to count the total number of SVs on the Perl heap, to see if it's increasing. We keep a count of the previous total, so we can print the difference per iteration as well:

use Devel::Gladiator qw( arena_ref_counts );
use List::Util qw( sum );

my $lasttotal = 0;

...

    # per iteration:
    my $total = sum values %{ arena_ref_counts };
    printf "SVs %d (+%d)\n", $total, $lasttotal - $total;
    $lasttotal = $total;

We can now run this and get an initial idea of the growth of the Perl heap:

SVs 51940 (+51940)
SVs 60171 (+8231)
SVs 59655 (-516)
SVs 59652 (-3)
SVs 59655 (+3)
SVs 59657 (+2)
SVs 59660 (+3)
SVs 59663 (+3)
SVs 59665 (+2)
SVs 59668 (+3)

After a few iterations of initial startup and settling down, this program is clearly leaking actual Perl-visible SVs at around 2 or 3 per iteration, so it makes sense to get Devel::MAT involved. We can easily insert that into the iteration loop:

use Devel::MAT::Dumper;

my $iter = 0;
...

    # per iteration:
    $iter++;
    Devel::MAT::Dumper::dump( "perl-$iter.pmat" );

Now when we run this we get a sequence of numbered .pmat files, one per iteration, that we can throw into pmat-leakreport to look for leaked scalars. Because the first few files will contain a lot of noise from program startup, we'll skip those and take from, in this case, the 4th onwards:

$ pmat-leakreport -1 perl-{4,5,6,7,8}.pmat
LEAK[1] UNDEF() at 0x55efee189120
LEAK[1] UNDEF() at 0x55efee1ae1f8
LEAK[1] UNDEF() at 0x55efeda8c5e8
LEAK[1] UNDEF() at 0x55efee1acd30

This gives us four exciting candidates we can take more of a look at using the pmat shell:

$ pmat perl-8.pmat 
Perl memory dumpfile from perl 5.26.2
Heap contains 55890 objects
pmat> identify 0x55efee189120
UNDEF() at 0x55efee189120 is:
└─not found

pmat> identify 0x55efee1ae1f8
UNDEF() at 0x55efee1ae1f8 is:
└─not found

pmat> identify 0x55efeda8c5e8
UNDEF() at 0x55efeda8c5e8 is:
└─not found

pmat> identify 0x55efee1acd30
UNDEF() at 0x55efee1acd30 is:
└─not found

Well that's not very helpful. We have four scalars that contain no defined values, which aren't referenced by anywhere that Devel::MAT can see. In order to continue investigating this, we need to dig deeper with more tools. Perl itself provides an optional ability to write a more detailed log of memory allocation operations, controlled by a variable called PERL_MEM_LOG. According to perlhacktips(1):

PERL_MEM_LOG

If compiled with -DPERL_MEM_LOG (-Accflags=-DPERL_MEM_LOG), both memory and SV allocations go through logging functions, which is handy for breakpoint setting.

Now, we can't just set that environment variable on any normal perl binary, because the logging functions are sufficiently expensive at runtime that they are usually disabled on normal builds. We have to request them specifically with the use of some compile-time flags. To achieve this, it's easiest to use perlbrew to compile a custom binary for our purposes:

$ perlbrew install perl-5.26.2 --debug -Accflags=-DPERL_MEM_LOG

That'll take a while to build, but when it's ready we'll continue our investigation in part 2...

For another memory-leak investigation involving pmat-leakreport, see also the Binary.com tech blog post Differential Memory Leak Analysis with Devel::MAT.

2018/07/17

Devel::MAT::Dumper independent distribution

I've just released a new version of Devel-MAT, now with the heap dumping part split out into its own distribution, Devel-MAT-Dumper.

There's two reasons for this. Firstly, by being split into its own distribution, the heap-dumper part is smaller and lighter to install on places like production servers as it needs no non-core dependencies. Since the heapdump files still need analysis somewhere, you'll still have to install the full Devel-MAT distribution on a machine - such as a workstation or laptop - but you no longer need the analysis tools on every target machine.

Secondly, since the two parts are now in independent distributions, they can have different Perl version requirements. I want to keep the ability to analyse Perl programs running on versions from 5.10 onwards (and maybe sometime see if I can push that back to 5.8), but now the distributions are split I have begun to make use of 5.14 features in some of the analysis tools. This means that you can still analyse programs running on 5.10, but you will need an installation of 5.14 or later to run the tooling.

Looking a little further afield, I plan to continue writing more tooling, perhaps published as independent distributions in places too. It's still the case that almost every time I run into some situation that needs memory analysis like this, I end up writing more tools to help, so the flow of additional tooling will continue for a while yet.

I'm also keen to hear success stories from others (or even failure ones - what couldn't you work out and what might I do to help?) to help guide the creation of more tooling. I'm available to consult with people to help you with your own problems also - send me a note to

leonerd [at] leonerd [dot] org [dot] uk

and we can discuss your requirements.

2017/10/20

More Perl memory leaks with Devel::MAT - another use-case story

I've written a second post about Devel::MAT on the company blog, following up my first one.

Again, rather than copy the entire thing I'll just link to it here instead:

Tracing Perl memory leaks with Devel::MAT, part 2

As a little preview, I'll add that this one has screenshots:

2017/10/13

Tracing Perl memory leaks with Devel::MAT - a use-case story

I've lately been working a lot on the analysis tools around Devel::MAT, the toolset used for analysing memory leaks, unbounded growth, and other memory-related issues affecting Perl programs.

I wrote a blog post about it on the company blog. Rather than copy the entire thing I'll just link to it here instead:

Tracing Perl memory leaks with Devel::MAT

2016/06/08

Devel::MAT interactive debugging and filehandles

For a while now, I've been using Devel::MAT for all sorts of debugging tasks. Originally I wrote it in order to track down a single memory-leak case in a large program, but due to the very generic way it just captures as much of the perl process state as it can for later analysis, it turns out to be useful for a variety of other problems as well. Every time I try to debug a new sort of problem that it doesn't quite fit for, I end up adding more features and making a new release, so it's growing in ability quite nicely now.

A recent bug I was trying to fix involved leaking filehandles - a longrunning process ends up running out of spare filehandles and receives EMFILE trying to make a socket(2) call. So, newly added to Devel::MAT version 0.23 is tracking of file numbers associated with IO handles.

To start this session of bug-hunting, first I had to provoke the leak into happening, and capture a .pmat file from it. Provoking the leak proved to be quite easy by just artificially lowering the maximum file descriptor count using ulimit, and capturing the file was achieved using the usual Devel::MAT::Dumper options:

$ ulimit -n 64
$ perl -MDevel::MAT::Dumper=-eager_open,-dump_at_END,-dump_at_DIE ...

Sure enough, it didn't take long running my program before it died because of EMFILE, at which point a .pmat file appeared. Now we can take a look into it and find the problem. So I started by loading this into my new pmat interactive debugging shell, which wraps a number of commandline-type tools, allowing interactive application of them to the loaded file, without having to reload the file each and every time. This speeds up such debugging sessions a lot.

$ pmat run-tests.pl.pmat 
Perl memory dumpfile from perl 5.22.1
Heap contains 156829 objects
pmat>

The first useful command to run here is io list to obtain a list of every IO SV that has at least one file number associated with it. This will correspond to the filehandles the kernel is complaining that we have too many of. These will be a good place to start to see if something is leaking that shouldn't.

pmat> io list
Addr                           ifileno  ofileno
IO() at 0xa2f378               -1       -1
IO() at 0x8cf540               0        -1
IO() at 0x8cf570               1        1
IO() at 0x8cf5d0               2        2
IO() at 0x16063f8              3        3
...
IO() at 0x3bf8e68              59       59
IO() at 0x3b7e548              60       60
IO() at 0x3bccda8              61       61
IO() at 0x3b57958              62       62
IO() at 0x3b4b950              63       63

Well sure enough, we do seem to have our full allocation of 64 filehandles, and luckily for us they seem to be perl-visible ones (lucky because Devel::MAT can't peek into the inner workings of lower-level C libraries, which could have been the reason here). The first few handles will be the usual STDIN, STDOUT and friends, and some socket handles that the process itself is supposed to be hanging onto, so our leak likely relates to the higher-numbered ones. Lets pick a random candidate near the end, and see if we can investigate further what it's doing still hanging around.

The identify command takes the address of an SV, and tries to walk back up the references chain, looking for what refers to it, so we can find out what's still holding it.

pmat> identify 0x3bf8e68
IO() at 0x3bf8e68 is:
└─the io of GLOB(%*I) at 0x282f118, which is:
  ├─the referrant of REF() at 0x3b7ef98, which is:
  │ └─value {write_handle} of HASH(29) at 0x3b577c0, which is (*A):
  │   └─the referrant of REF() at 0x3476cc0, which is:
  │     └─element [0] of ARRAY(1) at 0x3bbc120, which is:
  │       └─the referrant of REF() at 0x1f81eb8, which is:
  │         └─the lexical $conns of PAD(23) at 0x3bcc658, which is:
  │           └─pad at depth 1 of CODE(PP,C) at 0x3b4c2b0, which is:
  │             └─the referrant of REF() at 0x3bbb358, which is:
  │               └─the lexical $on_closed of PAD(14) at 0x3bcc6a0, which is:
  │                 └─pad at depth 1 of CODE(PP,C) at 0x1cfa850, which is:
  │                   └─the referrant of REF() at 0x3b63698, which is:
  │                     └─value {on_closed} of HASH(29) at 0x3b577c0, which is:
  │                       └─already found as *A
  └─the referrant of REF() at 0x3bf8550, which is:
    └─value {read_handle} of HASH(29) at 0x3b577c0, which is:
      └─already found as *A

This looks suspect. The SV identified with the *A marker appears later on in its own referrant chain. That sounds like a cyclic reference. Lets look at that one in more detail.

HASH(29) at 0x3b577c0 is:
└─the referrant of REF() at 0x3476cc0, which is:
  └─element [0] of ARRAY(1) at 0x3bbc120, which is:
    └─the referrant of REF() at 0x1f81eb8, which is:
      └─the lexical $conns of PAD(23) at 0x3bcc658, which is:
        └─pad at depth 1 of CODE(PP,C) at 0x3b4c2b0, which is:
          └─the referrant of REF() at 0x3bbb358, which is:
            └─the lexical $on_closed of PAD(14) at 0x3bcc6a0, which is:
              └─pad at depth 1 of CODE(PP,C) at 0x1cfa850, which is:
                └─the referrant of REF() at 0x3b63698, which is:
                  └─value {on_closed} of HASH(29) at 0x3b577c0, which is:
                    └─already found circularly

Indeed, this is the classic identify shape of a leaked cyclic reference; a single linear chain of references that loops back onto itself. Just because we have a cycle though doesn't necessarily mean that's bad. Creating a reference cycle in Perl is only bad if nothing ever cleans it up again. Cyclic references can be perfectly safe, provided something cleans them up again at some point. To determine what's wrong with this one and why it didn't get cleaned up requires some inspection of the code that created it. As this particular cycle involves two lexical scalars in two closures we can use the show command lets us see more detail about these two CODE SVs.

pmat> show 0x3b4c2b0
CODE(PP,C) at 0x3b4c2b0 with refcount 1
  name=&Net::Async::HTTP::__ANON__
  stash=STASH(66) at 0x12fe9c0
  glob=GLOB(*) at 0x1b4eaf0
  location=/home/leo/lib/perl5/Net/Async/HTTP.pm line 422
  no scope
  no padlist
  no padnames
  pad[0]=PAD(23) at 0x3bcc658
Referenced globs:
  GLOB(@I) at 0x8b1ea0, GLOB(&*) at 0x1b40bf0

pmat> show 0x1cfa850
CODE(PP,C) at 0x1cfa850 with refcount 1
  name=&Net::Async::HTTP::Connection::__ANON__
  stash=STASH(65) at 0x1a3cac0
  glob=GLOB(*) at 0x1ab35f0
  location=/home/leo/lib/perl5/Net/Async/HTTP/Connection.pm line 62
  no scope
  no padlist
  no padnames
  pad[0]=PAD(14) at 0x3bcc6a0

We can now take a peek at the first of these closures by looking in the source code at the indicated line number; which appears to be the on_closed event handler of the Net::Async::HTTP instance:

      on_closed => sub {
         my $conn = shift;
         my $http = $conn->parent;
 
         $conn->remove_from_parent;
         @$conns = grep { $_ != $conn } @$conns;
 
         if( my $next = first { !$_->connecting } @$ready_queue ) {
            # Requeue another connection attempt as there's still more to do
            $http->get_connection( %args, ready => $next );
         }
      },

Sure enough, there's the $conns lexical that the identification trace suggested was the culprit. Indeed, this code should in fact have removed the connection from that array when it runs, thus breaking the cycle and allowing all these instances to be reclaimed, avoiding a leak. That the leak happened suggests this cleanup code did not in fact execute.

At this point, we can apply some analysis of the code changes leading up to this leak being introduced in the first place. On this occasion, the change was itself a workaround for another bug involving connection pooling and reuse in the HTTP client, by instead using a single-shot HTTP client for each individual request, which is added to the IO::Async::Loop instance and then removed when it is done.

Looking again at the above cleanup code which we expect didn't run, and with knowledge of the code change and the basics of how IO::Async::Notifiers work, we surmise that the code doesn't run because the file handle is never actually explicitly closed. In normal operation with longlived HTTP clients, this isn't a problem as either the client times out and closes the connection itself, or the server does and we're informed of an EOF to close it. These two cases give the on_closed code its opportunity to run, thus avoiding a resource leak in the normal case.

But in our one-shot HTTP client that then gets removed, this never happens. The connection remains open at the point that the client gets removed from the loop, wherein it now has no opportunity to be informed that the server closed it, so the cleanup code never runs. This one now looks like quite an easy fix; just ->close all the remaining connections at the time the client is removed from the loop, by making use of the _remove_from_loop method:

sub _remove_from_loop
{
   my $self = shift;

   foreach my $conn ( map { @$_ } values %{ $self->{connections} } ) {
      $conn->close;
   }

   $self->SUPER::_remove_from_loop( @_ );
}

In fact, this is the very bugfix that was added to make release Net::Async::HTTP version 0.41.

So there we have it - a real-world example of an actual bug that was found and fixed with the help of Devel::MAT.

LeoNerd's programming thoughts