2022/06/25

A troubling thought - smartmatch reïmagined

Preface: This is less a concrete idea, and more a rambling set of thoughts that lead me to a somewhat awkward place. I'm writing it out here in the hope that others can lend suggestions and ideas, and see if we can arrive at a better place.

I've been thinking about comparison operators lately - somewhat in the context of my new Syntax::Operator::Elem / Syntax::Operator::In module, somewhat in the context of smartmatch and the planned deprecations thereof, and partly in the context of my new match/case syntax.

Smartmatch Deprecations

For years now, smartmatch has been an annoying thorny design, and recently we've started making moves to get rid of it. In my mind at least, this is because it has a large and complex behaviour that is often unpredictable in advance. There are two distinct reasons for this:

  1. It tries very hard to (recursively) distribute itself across container values on either side; saying that $x ~~ @y is true if any { $x ~~ $_ } @y for example; sometimes in ways that are surprising (e.g. how do you compare an array with a hash?)
  2. It acts unpredictably with mixed strings or numbers; because those concepts are very fluid in perl and aren't well-defined

match/case and New Infix Operators

I've lately been writing some new ideas for new infix operators that Perl might want; partly because they're useful on their own but also because they're useful combined with the match/case syntax provided by Syntax::Keyword::Match. Between them all, these are intended as a replacement for the given/when syntax and its troublesome smartmatch. For example, to select an option based on a string comparison you can

match($x : eq) {
  case("abc") { say "It was the string abc" }
  case("def") { say "It was the string def" }
  case($y)    { say "It was whatever string the variable $y gives" }
}

This is much more predictable than given/when and smartmatch, because the programmer declared right upfront that the eq operator is being used here; there's no smartmatch involved.

Initially this feels like a great improvement on given/when and ~~, but it has lots of tricky cornercases to it. For example, the given/when approach can easily handle undef, whereas match/case using only the eq operator cannot distinguish undef from "". For this reason, I invented a new infix operator, called equ (provided by Syntax::Operator::Equ), which can:

say "Equal" if $x equ $y;  # true if they're both undef, or both
                           #   defined and the same string

match($x : equ) {
  # these two cases are now distinct
  case(undef) { say "It was undefined" }
  case("")    { say "It was the empty string" }

  default     { say "It was something else" }
}

Plus of course it also defines a new === operator which performs the numerical equivalent, able to distinguish undef from zero.

Syntax::Operator::Elem

Another operator I felt was required was one that can test if a given string (or number) is present in a list. For that, I wrote Syntax::Operator::Elem:

say "Present" if $x elem @ys;  # stringy

say "Present" if $x ∈ @ys;     # numerical

(Yes, that really is an operator spelled with a non-ASCII Unicode character. No I will not apologise :P)

These operators too have the "oops, undef" problem about them - which lead me briefly to consider adding two more that consider undef/"" or undef/zero to be distinct. Maybe I'd call them elemu and ... er.. well, Unicode doesn't have a variant of the ∈ operator that can suggest undefness. It was about at that point that I stopped, and wondered if really we're going about this whole thing the right way at all.

Smartmatch Reïmagined

I begin to think that if we go right back to the beginning, we might find that a huge chunk of this is unnecessary, if only we can find a better model.

During the 5.35 development series and now released in 5.36, Perl core has two improvements to what some might call its "type system":

  • Real booleans - true and false are now first-class values distinct from 1 and zero/emptystring.
  • Tracking of whether defined, nonboolean, nonreferential values began as strings or numbers; even if they have since evolved to effectively be both.

It is now possible to classify any given scalar value into exactly one of the following five categories:

undef
boolean
initially string
initially number
reference

I start to wonder whether, therefore, we have enough basis to create a better version of what the smartmatch operator tried (but ultimately failed) to be. For sake of argument, since I've already used one Unicode symbol I'm going to use another for this new one: The triple-bar identity symbol, ≡.

Lets consider a few properties this ought to have. First off, it should be well-behaved as an equality operator; it should be reflexive, symmetric and transitive. That is, given any values $x, $y and $z, all three of the following must always hold:

$x ≡ $x  is true                     # reflexive
$x ≡ $y  is the same as  $y ≡ $x     # symmetric
if $x ≡ $y and $y ≡ $z then $x ≡ $z  # transitive

Additionally, I don't think it ought to have any sort of distributive properties like $x ~~ @arr has. That sort of distribution should be handled at a higher level. (For example, the proposed caselist syntax of match/case.)

Because it only operates on pairs of scalars, this is already a much simpler kind of operator to think about. Because of the fact we can classify perl scalar values into these neat five categories, we can already write down five simple rules for when both sides are given the same category of scalar:

UNDEF undef ≡ undef is true
BOOL $x ≡ $y is true if $x and $y are both true, or both false
STR $x ≡ $y is true if $x eq $y
NUM $x ≡ $y is true if $x == $y
REF $x ≡ $y is true if refaddr($x) == refaddr($y)

I'd also like to suggest a rule that given any pair of scalars of different categories, the result is always false. This means in particular, that undef is never ≡ to any defined value (but never warns), that no boolean is ever ≡ to any non-boolean, and no reference is ever ≡ to any non-reference. I don't think anyone would argue with that.

Already this operator feels useful, because of the way it neatly handles undef as distinct from any number or string, we now don't need the equ or === operators.

The one problem I have with this whole model is what do we do with STR ≡ NUM; how do we handle code like the following:

my $x = "10";
say "Equivalent" if $x ≡ 10;

By my first suggestion, this would always be false. While it's predictable and simple, I don't think it's very useful. It would mean that whenever you want to e.g. perform a numerical case comparison on a value taken from @ARGV, you always have to "numify" it by doing some ugly code like:

match(0 + $ARGV[0] : ≡) {
  case(1) { ... }
}

This does not feel very perlish.

So maybe we can find a more useful handling of STR vs NUM. I can already think of several bad ideas:

  • Pick the category on the righthand side
    Superficially this feels beneficial to the match/case syntax, but it soon falls down in a lot of other scenarios. Plus it is blatantly not symmetric, which we already decided any good equality test operator ought to be.
  • The operator throws an exception
    This doesn't feel like the right way to go. Having things like UNDEF, BOOL and REF already neatly just yield false, means that you can safely mix strings/numbers and undef in match/case labels, for example, and all is handled nicely. To have NUM-vs-UNDEF yield false but NUM-vs-STR throw an exception feels like a bad model. Plus it would not be transitive.

About the only sensible model I can think of in this mixed case, is to say that

NUM ≡ STR  is true if both `eq` and `==` would say true

It's reflexive and symmetric. It feels useful. It does (what most people would argue is) the right thing for "10" ≡ 10.

Still, something at the back of my mind feels wrong about this design for an operator. Some situation in which is will be Obviously Terrible, and thus bring the whole tower crashing down. Perhaps it isn't truely transitive - there might be some set of values for which it fails. Offhand I can't think of one, but maybe someone can find an example?

It's a shame, because if we did happen to find an operator like this, then I think combined with match/case syntax it could go a long way towards providing a far better replacement for smartmatch + given/when and additionally solve a lot of other problems in Perl all in one go.

I'm sorry I don't have a more concrete and specific message to say there, other than that I've given (and will continue to give) this a lot of thought, and that I invite comment and ideas from others on how we might further it towards something that can really work in Perl.

Thanks all for listening.

2022/01/26

Perl in 2022 - A Yearly Update

At the end of 2020, I wrote a series of articles on the subject of recent CPAN modules that provide useful syntax, or recent core features added to perl. The series ended with a bonus post looking forward to imagine what new additions might one day appear. I followed this up with a video-based talk at FOSDEM, titled "Perl in 2025", with yet more ideas considering how a Perl might look in a few more years' time.

Over the past twelve months, I have made progress on several of these ideas. Four of them have already become CPAN modules and thus are available for writing in Perl in 2022:

  • match/case - Now available as Syntax::Keyword::Match.
    match($n : ==) {
       case(1) { say "It's one" }
       case(2) { say "It's two" }
       case(3) { say "It's three" }
    }
  • any, all - Now available as syntax-level keywords from List::Keywords.
    if( any { $_->size > 100 } @boxes ) {
       say "There are some large boxes here";
    }
  • multi sub - An early experiment in Syntax::Keyword::MultiSub.
    multi sub max()          { return undef; }
    multi sub max($x)        { return $x; }
    multi sub max($x, @more) { my $y = max(@more);
                               return $x > $y ? $x : $y; }
  • equ, === - Available from Syntax::Operator::Equ, though at present is only usable via Syntax::Keyword::Match or a specially-patched version of perl.
    if($x equ $y) {
       say "Both are undef, or defined and equal strings";
    }
     
    if($i === $j) {
       say "Both are undef, or defined and equal numbers";
    }

Of the rest:

  • in - I have the beginnings of some code but it's not yet on CPAN as it again requires a patched version of perl for pluggable infix operators.
  • let and is - not started yet.

In addition, not mentioned in the original article, the latest development version of perl has gained:

  • defer blocks.
    {
        say "This happens first";
        defer { say "This happens last"; }
     
        say "And this happens inbetween";
    }
  • finally as part of try/catch.
    try {
        say "This happens first";
    }
    catch ($e) {
        say "Oops, it failed";
    }
    finally {
        say "This happens last in either case";
    }
  • The builtin:: namespace, providing many new utility functions that ought to have been considered part of the core language - copying utilities from places like Scalar::Util and POSIX, as well as providing some new ones.
    say "The refaddr of my object is ", builtin::refaddr($obj);
    
    use builtin 'ceil';
    say "The next integer above the value is ", ceil($value);
  • Real boolean values. These will be useful in many places, such as data serialisation and cross-language conversion modules.
    use builtin qw(true false isbool);
    
    sub serialise($v) {
      return $v ? 'true' : 'false' if isbool $v;
      return qq("$v");
    }
    
    say join ",", map { serialise($_) }
        0, 1, false, true, 'true';

Overall I'm happy with progress so far. A lot of things have been started, laying much of the groundwork for more work that can follow. Behind the scenes all of these syntax modules are now using the XS::Parse::Keyword module to do the bulk of their parsing. This is great for getting something powerful written quickly, and has good properties in terms of interoperability between modules - for example, the way the new infix operators already work with the match/case syntax.

Core perl is on-track for a summer release as usual; hopefully that will provide the new defer and finally syntax, builtin functions and boolean values. I hope to have as much success in 2022 as I did in 2021 at writing more of these things, and with any luck I'll be able to write another article like this next year explaining what new progress has been achieved towards the Perl in 2025 goal.

2021/07/30

Perl UV binding hits version 2.000

Over the past few months I've been working on finishing off the libuv Perl binding module, UV. Yesterday I finally got it finished enough to feel like calling it version 2.000. Now's a good time to take a look at it.

libuv itself is a cross-platform event handling library, which focuses on providing nicely portable abstractions for things like TCP sockets, timers, and sub-process management between UNIX, Windows and other platforms. Traditionally things like event-based socket handling have always been difficult to write in a portable way between Windows and other places due to the very different ways things work on Windows as opposed to anywhere else. libuv provides a large number of helpful wrappers to write event-based code in a portable way, freeing the developer from having to care about these things.

A number of languages have nice bindings for libuv, but until recently there wasn't a good one for Perl. My latest project for The Perl Foundation aimed to fix this. The latest release of UV version 2.000 indicates that this is now done.

It's unlikely that most programs would choose to operate directly with UV itself, but rather via some higher-level event system. There are UV adapter modules for IO::Async (IO::Async::Loop::UV), Mojo (Mojo::Reactor::UV), and Future::IO (Future::IO::Impl::UV) at least.

The UV module certainly wraps much of what libuv has to offer, but there are still some parts missing. libuv can watch filesystems for changes of files, and provides asynchronous filesystem access access functions - both of these are currently missing from the Perl binding. Threadpools are an entire concept that doesn't map very well to the Perl language, so they are absent too. Finally, libuv lists an entire category of "miscellaneous functions", most of which are already available independently in Perl, so there seems little point to wrapping those provided by libuv.

Finally, we should take note of one thing that doesn't work - the UV::TCP->open and UV::UDP->open functions when running on Windows. The upshot here is that you cannot create TCP or UDP sockets in your application independently of libuv and then hand them over to be handled by the library; this is not permitted. This is because on Windows, there are fundamentally two different kinds of sockets that require two different sets of API to access them - ones using WSA_FLAG_OVERLAPPED, and ones not. libuv needs that flag in order to perform event-based IO on sockets, and so it won't work with sockets created without it - which is the usual kind that most other modules, and perl itself, will create. This means that on Windows, the only sockets you can use with the UV module are ones created by UV itself - such as by asking it to connect out to servers, or listen and accept incoming connections. Fortunately, this is sufficient for the vast majority of applications.

I would like to finish up by saying thanks to The Perl Foundation for funding me to complete this project.

2021/02/26

Writing a Perl Core Feature - part 11: Core modules

Index | < Prev

Our new feature is now implemented, tested, and documented. There's just one last thing we need to do - update the bundled modules that come with core. Specifically, because we've added some new syntax, we need to update B::Deparse to be able to deparse it.

When the isa operator was added, the deparse module needed to be informed about the new OP_ISA opcode, in this small addition: (github.com/Perl/perl5).

--- a/lib/B/Deparse.pm
+++ b/lib/B/Deparse.pm
@@ -52,7 +52,7 @@ use B qw(class main_root main_start main_cv svref_2object opnumber perlstring
         MDEREF_SHIFT
     );
 
-$VERSION = '1.51';
+$VERSION = '1.52';
 use strict;
 our $AUTOLOAD;
 use warnings ();
@@ -3060,6 +3060,8 @@ sub pp_sge { binop(@_, "ge", 15) }
 sub pp_sle { binop(@_, "le", 15) }
 sub pp_scmp { maybe_targmy(@_, \&binop, "cmp", 14) }
 
+sub pp_isa { binop(@_, "isa", 15) }
+
 sub pp_sassign { binop(@_, "=", 7, SWAP_CHILDREN) }
 sub pp_aassign { binop(@_, "=", 7, SWAP_CHILDREN | LIST_CONTEXT) }

As you can see it's quite a small addition here; we just need to add a new method to the main B::Deparse package named after the new opcode. This new method calls down to the common binop function which is shared by the various binary operators, and recurses down parts of the optree, returning a combined result using the "isa" string in between the two parts.

A more complex addition was made with the try syntax, as can be seen at (github.com/Perl/perl5); abbreviated here:

+sub pp_leavetrycatch {
+    my $self = shift;
+    my ($op) = @_;
...
+    my $trycode = scopeop(0, $self, $tryblock);
+    my $catchvar = $self->padname($catch->targ);
+    my $catchcode = scopeop(0, $self, $catchblock);
+
+    return "try {\n\t$trycode\n\b}\n" .
+           "catch($catchvar) {\n\t$catchcode\n\b}\cK";
+}

As before, this adds a new method named after the new opcode (in the case of the try/catch syntax this is named OP_LEAVETRYCATCH). The body of this method too just recurses down to parts of the sub-tree it was passed; in this case being two scope ops for the bodies of the blocks, plus a lexical variable name for the catch variable. The method then again returns a new string combining the various parts together along with the required braces, linefeeds, and indentation hints.

We can tell we need to add this for our new banana feature, as currently this does not deparse properly:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -Mexperimental=banana -MO=Deparse -ce 'print ban "Hello, world" ana;'
unexpected OP_BANANA at lib/B/Deparse.pm line 1664.
BEGIN {${^WARNING_BITS} = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x50\x51\x45\x00"}
use feature 'banana';
print XXX;
-e syntax OK

We'll fix this by adding a new pp_banana in an appropriate place, perhaps just after the ones for lc/uc/fc. Don't forget to bump the $VERSION number too:

leo@shy:~/src/bleadperl/perl [git]
$ nvim lib/B/Deparse.pm 

leo@shy:~/src/bleadperl/perl [git]
$ git diff 
diff --git a/lib/B/Deparse.pm b/lib/B/Deparse.pm
index 67147f12dd..f6039a435d 100644
--- a/lib/B/Deparse.pm
+++ b/lib/B/Deparse.pm
@@ -52,7 +52,7 @@ use B qw(class main_root main_start main_cv svref_2object opnumber perlstring
         MDEREF_SHIFT
     );
 
-$VERSION = '1.56';
+$VERSION = '1.57';
 use strict;
 our $AUTOLOAD;
 use warnings ();
@@ -2824,6 +2824,13 @@ sub pp_lc { dq_unop(@_, "lc") }
 sub pp_quotemeta { maybe_targmy(@_, \&dq_unop, "quotemeta") }
 sub pp_fc { dq_unop(@_, "fc") }
 
+sub pp_banana {
+    my $self = shift;
+    my ($op, $cx) = @_;
+    my $kid = $op->first;
+    return "ban " . $self->deparse($kid, 1) . " ana";
+}
+
 sub loopex {
     my $self = shift;
     my ($op, $cx, $name) = @_;

This new function recurses down to deparse for the subtree, and returns a new string wrapped in the appropriate syntax for it. That should be all we need:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -Mexperimental=banana -MO=Deparse -ce 'print ban "Hello, world" ana;'
BEGIN {${^WARNING_BITS} = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x50\x51\x45\x00"}
use feature 'banana';
print ban 'Hello, world' ana;
-e syntax OK

Of course, this being a perl module we should remember to update its unit tests.

leo@shy:~/src/bleadperl/perl [git]
$ git diff lib/B/Deparse.t
diff --git a/lib/B/Deparse.t b/lib/B/Deparse.t
index 24eb445041..0fe6940cb3 100644
--- a/lib/B/Deparse.t
+++ b/lib/B/Deparse.t
@@ -3171,3 +3171,10 @@ try {
 catch($var) {
     SECOND();
 }
+####
+# banana
+# CONTEXT use feature 'banana'; no warnings 'experimental::banana';
+ban 'literal' ana;
+ban $a ana;
+ban $a . $b ana;
+ban "stringify $a" ana;

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness lib/B/Deparse.t 
../lib/B/Deparse.t .. ok     
All tests successful.
Files=1, Tests=321,  9 wallclock secs ( 0.14 usr  0.00 sys +  8.99 cusr  0.38 csys =  9.51 CPU)
Result: PASS

Because in part 10 we added documentation for a new function in pod/perlfunc.pod there's another test that needs updating:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness ext/Pod-Functions/t/Functions.t 
../ext/Pod-Functions/t/Functions.t .. 1/? 
#   Failed test 'run as plain program'
#   at t/Functions.t line 55.
#          got: '
...
Result: FAIL

We can fix that by adding the new function to the expected list in the test file itself:

leo@shy:~/src/bleadperl/perl [git]
$ nvim ext/Pod-Functions/t/Functions.t

leo@shy:~/src/bleadperl/perl [git]
$ git diff ext/Pod-Functions/t/Functions.t
diff --git a/ext/Pod-Functions/t/Functions.t b/ext/Pod-Functions/t/Functions.t
index 2beccc1ac6..4d5b03e978 100644
--- a/ext/Pod-Functions/t/Functions.t
+++ b/ext/Pod-Functions/t/Functions.t
@@ -76,7 +76,7 @@ Functions.t - Test Pod::Functions
 __DATA__
 
 Functions for SCALARs or strings:
-     chomp, chop, chr, crypt, fc, hex, index, lc, lcfirst,
+     ban, chomp, chop, chr, crypt, fc, hex, index, lc, lcfirst,
      length, oct, ord, pack, q/STRING/, qq/STRING/, reverse,
      rindex, sprintf, substr, tr///, uc, ucfirst, y///
 
leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness ext/Pod-Functions/t/Functions.t 
../ext/Pod-Functions/t/Functions.t .. ok     
All tests successful.
Files=1, Tests=234,  1 wallclock secs ( 0.04 usr  0.01 sys +  0.23 cusr  0.00 csys =  0.28 CPU)
Result: PASS

At this point, we're done. We've now completed all the steps to add a new feature to the perl interpreter. As well as all the steps required to actually implement it in the core binary itself, we've updated the tests, documentation, and support modules to match.

Along the way we've seen examples from real commits into the perl tree while we made our own. Any particular design of new feature will of course have its own variations and differences - there's still many parts of the interpreter we haven't touched on in this series. It would be difficult to try to cover all the possible ideas of things that could be added or changed, but hopefully having completed this series you'll at least have a good overview of the main pieces that are likely to be involved, and have some starting-off points to explore further to see whatever additional details might be required for whatever situation you encounter.

Index | < Prev

2021/02/24

Writing a Perl Core Feature - part 10: Documentation

Index | < Prev | Next >

Now that have our new feature nicely implemented and tested, we're nearly finished. We just have a few more loose ends to tidy up. The first of these is to take a look at some documentation.

We've already done one small documentation addition to perldiag.pod when we added the new warning message, but the bulk of documentation to explain a new feature would likely be found in one of the main documents - perlsyn.pod, perlop.pod, perlfunc.pod or similar. Exactly which of these is best would depend on the nature of the specific feature.

The isa feature, being a new infix operator, was documented in perlop.pod: (github.com/Perl/perl5).

...
+=head2 Class Instance Operator
+X<isa operator>
+
+Binary C<isa> evaluates to true when left argument is an object instance of
+the class (or a subclass derived from that class) given by the right argument.
+If the left argument is not defined, not a blessed object instance, or does
+not derive from the class given by the right argument, the operator evaluates
+as false. The right argument may give the class either as a barename or a
+scalar expression that yields a string class name:
+
+    if( $obj isa Some::Class ) { ... }
+
+    if( $obj isa "Different::Class" ) { ... }
+    if( $obj isa $name_of_class ) { ... }
+
+This is an experimental feature and is available from Perl 5.31.6 when enabled
+by C<use feature 'isa'>. It emits a warning in the C<experimental::isa>
+category.

Lets now write a little bit of documentation for our new banana feature. Since it is a named function-like operator (though with odd syntax involving a second trailing named keyword), perhaps we'll write it in perlfunc.pod. We'll style it similarly to the case-changing functions lc and uc to get some suggested wording.

leo@shy:~/src/bleadperl/perl [git]
$ nvim pod/perlfunc.pod 

leo@shy (1 job):~/src/bleadperl/perl [git]
$ git diff | xml_escape 
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index b655a08ecc..319e9aab96 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -114,6 +114,7 @@ X<scalar> X<string> X<character>
 
 =for Pod::Functions =String
 
+L<C<ban>|/ban EXPR ana>,
 L<C<chomp>|/chomp VARIABLE>, L<C<chop>|/chop VARIABLE>,
 L<C<chr>|/chr NUMBER>, L<C<crypt>|/crypt PLAINTEXT,SALT>,
 L<C<fc>|/fc EXPR>, L<C<hex>|/hex EXPR>,
@@ -136,6 +137,10 @@ prefixed with C<CORE::>.  The
 L<C<"fc"> feature|feature/The 'fc' feature> is enabled automatically
 with a C<use v5.16> (or higher) declaration in the current scope.
 
+L<C<ban>|/ban EXPR ana> is available only if the
+L<C<"banana"> feature|feature/The 'banana' feature.> is enabled or if it is
+prefixed with C<CORE::>.
+
 =item Regular expressions and pattern matching
 X<regular expression> X<regex> X<regexp>
 
@@ -773,6 +778,15 @@ your L<atan2(3)> manpage for more information.
 
 Portability issues: L<perlport/atan2>.
 
+=item ban EXPR ana
+X<ban>
+
+=for Pod::Functions return ROT13 transformed version of a string
+
+Applies the "ROT13" transform to upper- and lower-case letters in the given
+expression string, returning the newly-formed string. Non-letter characters
+are left unchanged.
+
 =item bind SOCKET,NAME
 X<bind>

While this will do as a short example here, any real feature would likely have a lot more words to say than just this.

When editing POD files it's good to get into the habit of running the porting tests (or at least the POD checking ones) before committing, to check the formatting is valid:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness t/porting/pod*.t
porting/podcheck.t ... ok         
porting/pod_rules.t .. ok   
All tests successful.
Files=2, Tests=1472, 34 wallclock secs ( 0.20 usr  0.00 sys + 33.79 cusr  0.15 csys = 34.14 CPU)
Result: PASS

While I was writing this documentation it occurred to me to write about how the function handles Unicode characters vs byte strings, so I was thinking more about how it actually does. It turns out the implementation doesn't work properly for that, as we can demonstrate with a new test:

--- a/t/op/banana.t
+++ b/t/op/banana.t
@@ -11,7 +11,7 @@ use strict;
 use feature 'banana';
 no warnings 'experimental::banana';
 
-plan 7;
+plan 8;
 
 is(ban "ABCD" ana, "NOPQ", 'Uppercase ROT13');
 is(ban "abcd" ana, "nopq", 'Lowercase ROT13');
@@ -23,3 +23,8 @@ my $str = "efgh";
 is(ban $str ana, "rstu", 'Lexical variable');
 is(ban $str . "IJK" ana, "rstuVWX", 'Concat expression');
 is("(" . ban "LMNO" ana . ")", "(YZAB)", 'Outer concat');
+
+{
+    use utf8;
+    is(ban "café" ana, "pnsé", 'Unicode string');
+}

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness t/op/banana.t 
op/banana.t .. 1/8 # Failed test 8 - Unicode string at op/banana.t line 29
#      got "pnsé"
# expected "pns�"
op/banana.t .. Failed 1/8 subtests 

This comes down to a bug in the pp_banana opcode function, which used the internal byte buffer of the incoming SV (SvPV) without inspecting the corresponding SvUTF8 flag. Such a pattern is always indicative of a Unicode support bug. We can easily fix this:

leo@shy:~/src/bleadperl/perl [git]
$ git diff pp.c
diff --git a/pp.c b/pp.c
index 9725806b84..3dbe21fadd 100644
--- a/pp.c
+++ b/pp.c
@@ -7211,6 +7211,8 @@ PP(pp_banana)
     s = SvPV(arg, len);
 
     mPUSHs(newSVpvn_rot13(s, len));
+    if(SvUTF8(arg))
+        SvUTF8_on(TOPs);
     RETURN;
 }
 

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness t/op/banana.t 
op/banana.t .. ok   
All tests successful.
Files=1, Tests=8,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.02 cusr  0.00 csys =  0.04 CPU)
Result: PASS

Writing good documentation is an integral part of the process of developing a new feature. Firstly it helps to explain the feature to users so they know how to use it. But often you find that the process of writing the words helps you think about different aspects of that feature that you may not have considered before. With that new frame of mind you sometimes discover missing parts to it, or uncover bugs or cornercases that need fixing. Make sure to spend time working on the documentation for any new feature - it is said that you never truely understand something until you try teach it to someone else.

Index | < Prev | Next >

2021/02/22

Writing a Perl Core Feature - part 9: Tests

Index | < Prev | Next >

By the end of part 8 we finally managed to see an actual implementation of our new feature. We tested a couple of things on the commandline directly to see that it seems to be doing the right thing. For a real core feature though it would be better to have it tested in a more automated, repeatable fashion. This is what the core unit tests are for.

The core perl source distribution contains a t/ directory with unit test files, very similar to the structure used by regular CPAN modules. The process for running these is a little different; as we already saw back in part 3 they need to be invoked by t/harness. The files themselves are somewhat more limited in what other modules they can use, so the full suite of Test:: modules are unavailable. But still they are expected to emit the regular TAP output we've come to expect from Perl unit tests, and tend to be structured quite similarly inside.

For example, the isa feature added an entire new file for its unit tests. As they all relate to the new syntax and semantics around a new opcode, they go in a file under the t/op directory. I won't paste the entire t/op/isa.t file, but consider this small section: (github.com/Perl/perl5):

#!./perl

BEGIN {
    chdir 't' if -d 't';
    require './test.pl';
    set_up_inc('../lib');
    require Config;
}

use strict;
use feature 'isa';
no warnings 'experimental::isa';

...

my $baseobj = bless {}, "BaseClass";

# Bareword package name
ok($baseobj isa BaseClass, '$baseobj isa BaseClass');
ok(not($baseobj isa Another::Class), '$baseobj is not Another::Class');

While it doesn't use Test::More, it does still have access to some similar testing functions such as the ok test. The initial lines of boilerplate in the BEGIN block set up the testing functions from the test.pl script, so we can use them in the actual tests.

Lets now have a go at writing some tests for our new banana feature. As it works like a text transformation function we can imagine a few different test strings to throw at it.

leo@shy:~/src/bleadperl/perl [git]
$ nvim t/op/banana.t

leo@shy:~/src/bleadperl/perl [git]
$ cat t/op/banana.t
#!./perl

BEGIN {
    chdir 't' if -d 't';
    require './test.pl';
    set_up_inc('../lib');
    require Config;
}

use strict;
use feature 'banana';
no warnings 'experimental::banana';

plan 7;

is(ban "ABCD" ana, "NOPQ", 'Uppercase ROT13');
is(ban "abcd" ana, "nopq", 'Lowercase ROT13');
is(ban "1234" ana, "1234", 'Numbers unaffected');

is(ban "a! b! c!" ana, "n! o! p!", 'Whitespace and symbols intermingled');

my $str = "efgh";
is(ban $str ana, "rstu", 'Lexical variable');

is(ban $str . "IJK" ana, "rstuVWX", 'Concat expression');
is("(" . ban "LMNO" ana . ")", "(YZAB)", 'Outer concat');

$ ./perl t/harness t/op/banana.t
op/banana.t .. ok   
All tests successful.
Files=1, Tests=4,  1 wallclock secs ( 0.02 usr  0.00 sys +  0.03 cusr  0.00 csys =  0.05 CPU)
Result: PASS

Here we have used the is() testing function to test that various strings that we got the ban ... ana operator to generate are what we expected them to be. We've tested both uppercase and lowercase letters, and that non-letter characters such as numbers, symbols and spaces remain unaffected. In addition we've added some syntax tests as well, to check variables as well as literal string constants, and to demonstrate that the parser works correctly on the precedence of the operator mixed with string concatenation. All appears to be working fine.

Before we commit this one there is one last thing we have to do. Having added a new file to the distribution, one of the porting tests will now be unhappy:

leo@shy:~/src/bleadperl/perl [git]
$ git add t/op/banana.t 

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
porting/manifest.t ........ 9848/? # Failed test 10502 - git ls-files
gives the same number of files as MANIFEST lists at porting/manifest.t line 101
#      got "6304"
# expected "6303"
# Failed test 10504 - Nothing added to the repo that isn't in MANIFEST
at porting/manifest.t line 113
#      got "1"
# expected "0"
# Failed test 10505 - Nothing added to the repo that isn't in MANIFEST
at porting/manifest.t line 114
#      got "not in MANIFEST: t/op/banana.t"
# expected "not in MANIFEST: "
porting/manifest.t ........ Failed 3/10507 subtests 

To fix this one we need to manually add an entry in the MANIFEST file; unlike as is common practice for CPAN modules, this file is not automatically generated.

leo@shy:~/src/bleadperl/perl [git]
$ nvim MANIFEST

leo@shy:~/src/bleadperl/perl [git]
$ git diff MANIFEST
diff --git a/MANIFEST b/MANIFEST
index 71d3b453da..03ecdda3d2 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -5779,6 +5779,7 @@ t/op/attrproto.t          See if the prototype attribute works
 t/op/attrs.t                   See if attributes on declarations work
 t/op/auto.t                    See if autoincrement et all work
 t/op/avhv.t                    See if pseudo-hashes work
+t/op/banana.t                  See if the ban ... ana syntax works
 t/op/bless.t                   See if bless works
 t/op/blocks.t                  See if BEGIN and friends work
 t/op/bop.t                     See if bitops work

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
Result: PASS

Of course, in this test file we've added only 7 tests. It is likely that any actual real feature would have a lot more testing around it, to deal with a wider variety of situations and corner-cases. It's often that the really interesting cases only come to light after trying to use it for real and finding odd situations that don't quite work as expected; so after adding a new feature expect to spend a while expanding the test file to cover more things. It's especially useful to add new tests of new situations you find yourself using the feature in, even if they currently work just fine. The presence of such tests helps ensure the feature remains working in that manner.

Index | < Prev | Next >

2021/02/19

Writing a Perl Core Feature - part 8: Interpreter internals

Index | < Prev | Next >

At this point we are most of the way to adding a new feature to the Perl interpreter. In part 4 we created an opcode function to represent the new behaviour, part 5 and part 6 added compiler support to recognise the syntax used to represent it, and in part 7 we made a helper function to provide the required behaviour. It's now time to tie them all together.

When we looked at opcodes and optrees back in part 4, I mentioned that each node of the optree performs a little part of the execution of a function, with child nodes usually obtaining some piece of data somewhere that gets passed up to parent nodes to operate on. I skipped over exactly how that all works, so for this part lets look at that in more detail.

The data model used by the perl interpreter for runtime execution of code is based around being a stack machine. Most opcodes that operate in some way on regular perl data values do so by interacting with the data stack (often simply called "the stack"; though this is sometimes ambiguous as there are in fact several stacks within the perl interpreter). As the interpreter walks along an optree invoking the function associated with each opcode, these various functions either push values onto the stack, or pop values already there back off it again, in order to use them.

For example, in part 4 we saw how the line of code my $x = 5; might get represented by an optree of three nodes - an OP_SASSIGN with two child nodes OP_CONST and OP_PADSV.

When this statement is executed the optree nodes are visited in postfix order, with the two child BASEOPs running first in order to push some values to the stack, followed by the assignment BINOP afterwards, which takes those values back off the stack and performs the appropriate assignment.

Lets now take a closer look at the code inside one of the actual functions which implements this. For example, pp_const, the function for OP_CONST consists of three short lines:

PP(pp_const)
{
    dSP;
    XPUSHs(cSVOP_sv);
    RETURN;
}

Of these three lines, all four symbols are in fact macros:

  1. dSP declares some local variables for tracking state, used by later macros
  2. cSVOP_sv fetches the actual SV pointer out of the SVOP itself. This will be the one holding the constant's value
  3. XPUSHs extends the (data) stack if necessary, then pushes it there
  4. RETURN resynchronises the interpreter state from the local variables, and arranges for the opcode function to return the next opcode, for the toplevel instruction loop

The pp_padsv function is somewhat more complex, but the essential parts of it are quite similar; the following example is heavily paraphrased:

PP(pp_padsv)
{
    SV ** const padentry = &(PAD_SVl(op->op_targ));
    XPUSHs(*padentry);
    RETURN;
}

This time, rather than the cSVOP_sv which takes the SV out of the op itself, we use PAD_SVl which looks up the SV in the currently-active pad, by using the target index which is stored in the op.

When the isa feature was added, its main pp_isa opcode function was actually quite small: (github.com/Perl/perl5).

--- a/pp.c
+++ b/pp.c
@@ -7143,6 +7143,18 @@ PP(pp_argcheck)
     return NORMAL;
 }
 
+PP(pp_isa)
+{
+    dSP;
+    SV *left, *right;
+
+    right = POPs;
+    left  = TOPs;
+
+    SETs(boolSV(sv_isa_sv(left, right)));
+    RETURN;
+}
+

Since OP_ISA is a BINOP it is expecting to find two arguments on the stack; traditionally these are called left and right. This opcode function simply takes those two values and calls the sv_isa_sv() function, which returns a boolean truth value. The boolSV helper function returns an SV pointer to represent this boolean value, which is then used as the result of the opcode itself.

As a small performance optimsation, this function decides to only POP one argument, before changing the top-of-stack value to its result using SETs. This is equivalent to POPing two of them and PUSHing its result, except that it doesn't have to alter the stack pointer as many times.

For more of a look at how the stack works, you could also take a look at another post from my series on Parser Plugins: Part 3a - The Stack.

Lets now take a look at implementing our banana feature for real. Recall in part 4 we added the pp_banana function with some placeholder content that just died with a panic message if invoked. We'll now replace that with a real implementation:

leo@shy:~/src/bleadperl/perl [git]
$ nvim pp.c 

leo@shy:~/src/bleadperl/perl [git]
$ git diff pp.c
diff --git a/pp.c b/pp.c
index 93141454e1..bced3d23ea 100644
--- a/pp.c
+++ b/pp.c
@@ -7203,7 +7203,15 @@ PP(pp_cmpchain_dup)
 
 PP(pp_banana)
 {
-    DIE(aTHX_ "panic: we have no bananas");
+    dSP;
+    const char *s;
+    STRLEN len;
+    SV *arg = POPs;
+
+    s = SvPV(arg, len);
+
+    PUSHs(newSVpvn_rot13(s, len));
+    RETURN;
 }
 
 /*

Now lets rebuild perl and try it out:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
...

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'use experimental "banana"; say ban "Hello, world!" ana;'
Uryyb, jbeyq!

Well it certainly looks plausible - we've got back a different string of the same length, with different letters but in the same capitalisation and identical non-letter characters. Lets compare with something like tr to see if it's correct:

leo@shy:~/src/bleadperl/perl [git]
$ echo "Uryyb, jbeyq!" | tr "A-Za-z" "N-ZA-Mn-za-m"
Hello, world!

Seems good. But it turns out we've still missed something. This function has a memory leak. We can demonstrate it by writing a small example that calls ban ... ana a large number of times (say, a thousand), and printing the total count of SVs on the heap before and after. There's a handy function in perl's unit test suited called XS::APItest::sv_count we can use here:

leo@shy (1 job):~/src/bleadperl/perl [git]
$ ./perl -Ilib -I. -MXS::APItest=sv_count -E \
  'use experimental "banana";
   say sv_count();
   ban "Hello, world!" ana for 1..1000;
   say sv_count();'
5321
6321

Oh dear. The SV count is a thousand higher afterwards than before, suggesting we leaked an SV on every call.

It turns out this is because of an optimisation that the interpreter uses, where SV pointers on Perl data stack don't actually contribute to reference counting. When values get POP'ed from the stack we don't have to decrement their refcount; when values get pushed we don't increment it. This saves an amount of runtime performance to not have to be adjusting those counts all the time. The consequence here is that we have to be a bit more careful when returning newly-constructed values. We must mark the value as mortal, which means we are saying that its reference count is somehow artificially high (because of that pointer on the stack), and perl should decrement the reference count at some point soon, when it next discards temporary values.

Because this sort of thing is done a lot, there is a handy macro called mPUSHs, which mortalizes an SV when it pushes it to the data stack. We can call that instead:

$ git diff pp.c
...
+    mPUSHs(newSVpvn_rot13(s, len));
+    RETURN;
 }
 
 /*

Now when we try our leak test we find the same SV count before and after, meaning no leak has occurred:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -I. -MXS::APItest=sv_count -E ...
5321
5321

We may be onto a winner here.

Index | < Prev | Next >