2021/02/26

Writing a Perl Core Feature - part 11: Core modules

Index | < Prev

Our new feature is now implemented, tested, and documented. There's just one last thing we need to do - update the bundled modules that come with core. Specifically, because we've added some new syntax, we need to update B::Deparse to be able to deparse it.

When the isa operator was added, the deparse module needed to be informed about the new OP_ISA opcode, in this small addition: (github.com/Perl/perl5).

--- a/lib/B/Deparse.pm
+++ b/lib/B/Deparse.pm
@@ -52,7 +52,7 @@ use B qw(class main_root main_start main_cv svref_2object opnumber perlstring
         MDEREF_SHIFT
     );
 
-$VERSION = '1.51';
+$VERSION = '1.52';
 use strict;
 our $AUTOLOAD;
 use warnings ();
@@ -3060,6 +3060,8 @@ sub pp_sge { binop(@_, "ge", 15) }
 sub pp_sle { binop(@_, "le", 15) }
 sub pp_scmp { maybe_targmy(@_, \&binop, "cmp", 14) }
 
+sub pp_isa { binop(@_, "isa", 15) }
+
 sub pp_sassign { binop(@_, "=", 7, SWAP_CHILDREN) }
 sub pp_aassign { binop(@_, "=", 7, SWAP_CHILDREN | LIST_CONTEXT) }

As you can see it's quite a small addition here; we just need to add a new method to the main B::Deparse package named after the new opcode. This new method calls down to the common binop function which is shared by the various binary operators, and recurses down parts of the optree, returning a combined result using the "isa" string in between the two parts.

A more complex addition was made with the try syntax, as can be seen at (github.com/Perl/perl5); abbreviated here:

+sub pp_leavetrycatch {
+    my $self = shift;
+    my ($op) = @_;
...
+    my $trycode = scopeop(0, $self, $tryblock);
+    my $catchvar = $self->padname($catch->targ);
+    my $catchcode = scopeop(0, $self, $catchblock);
+
+    return "try {\n\t$trycode\n\b}\n" .
+           "catch($catchvar) {\n\t$catchcode\n\b}\cK";
+}

As before, this adds a new method named after the new opcode (in the case of the try/catch syntax, OP_LEAVETRYCATCH). The body of this method likewise recurses down into parts of the sub-tree it was passed - in this case two scope ops for the bodies of the blocks, plus a lexical variable name for the catch variable. The method again returns a new string combining the various parts with the required braces, linefeeds, and indentation hints.

We can tell we need to add this for our new banana feature, as currently this does not deparse properly:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -Mexperimental=banana -MO=Deparse -ce 'print ban "Hello, world" ana;'
unexpected OP_BANANA at lib/B/Deparse.pm line 1664.
BEGIN {${^WARNING_BITS} = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x50\x51\x45\x00"}
use feature 'banana';
print XXX;
-e syntax OK

We'll fix this by adding a new pp_banana in an appropriate place, perhaps just after the ones for lc/uc/fc. Don't forget to bump the $VERSION number too:

leo@shy:~/src/bleadperl/perl [git]
$ nvim lib/B/Deparse.pm 

leo@shy:~/src/bleadperl/perl [git]
$ git diff 
diff --git a/lib/B/Deparse.pm b/lib/B/Deparse.pm
index 67147f12dd..f6039a435d 100644
--- a/lib/B/Deparse.pm
+++ b/lib/B/Deparse.pm
@@ -52,7 +52,7 @@ use B qw(class main_root main_start main_cv svref_2object opnumber perlstring
         MDEREF_SHIFT
     );
 
-$VERSION = '1.56';
+$VERSION = '1.57';
 use strict;
 our $AUTOLOAD;
 use warnings ();
@@ -2824,6 +2824,13 @@ sub pp_lc { dq_unop(@_, "lc") }
 sub pp_quotemeta { maybe_targmy(@_, \&dq_unop, "quotemeta") }
 sub pp_fc { dq_unop(@_, "fc") }
 
+sub pp_banana {
+    my $self = shift;
+    my ($op, $cx) = @_;
+    my $kid = $op->first;
+    return "ban " . $self->deparse($kid, 1) . " ana";
+}
+
 sub loopex {
     my $self = shift;
     my ($op, $cx, $name) = @_;

This new method recurses down to deparse the subtree, and returns a new string wrapping that result in the appropriate syntax. That should be all we need:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -Mexperimental=banana -MO=Deparse -ce 'print ban "Hello, world" ana;'
BEGIN {${^WARNING_BITS} = "\x10\x01\x00\x00\x00\x50\x04\x00\x00\x00\x00\x00\x00\x55\x51\x55\x50\x51\x45\x00"}
use feature 'banana';
print ban 'Hello, world' ana;
-e syntax OK

Of course, this being a perl module we should remember to update its unit tests.

leo@shy:~/src/bleadperl/perl [git]
$ git diff lib/B/Deparse.t
diff --git a/lib/B/Deparse.t b/lib/B/Deparse.t
index 24eb445041..0fe6940cb3 100644
--- a/lib/B/Deparse.t
+++ b/lib/B/Deparse.t
@@ -3171,3 +3171,10 @@ try {
 catch($var) {
     SECOND();
 }
+####
+# banana
+# CONTEXT use feature 'banana'; no warnings 'experimental::banana';
+ban 'literal' ana;
+ban $a ana;
+ban $a . $b ana;
+ban "stringify $a" ana;

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness lib/B/Deparse.t 
../lib/B/Deparse.t .. ok     
All tests successful.
Files=1, Tests=321,  9 wallclock secs ( 0.14 usr  0.00 sys +  8.99 cusr  0.38 csys =  9.51 CPU)
Result: PASS

Because in part 10 we added documentation for a new function to pod/perlfunc.pod, there's another test that needs updating:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness ext/Pod-Functions/t/Functions.t 
../ext/Pod-Functions/t/Functions.t .. 1/? 
#   Failed test 'run as plain program'
#   at t/Functions.t line 55.
#          got: '
...
Result: FAIL

We can fix that by adding the new function to the expected list in the test file itself:

leo@shy:~/src/bleadperl/perl [git]
$ nvim ext/Pod-Functions/t/Functions.t

leo@shy:~/src/bleadperl/perl [git]
$ git diff ext/Pod-Functions/t/Functions.t
diff --git a/ext/Pod-Functions/t/Functions.t b/ext/Pod-Functions/t/Functions.t
index 2beccc1ac6..4d5b03e978 100644
--- a/ext/Pod-Functions/t/Functions.t
+++ b/ext/Pod-Functions/t/Functions.t
@@ -76,7 +76,7 @@ Functions.t - Test Pod::Functions
 __DATA__
 
 Functions for SCALARs or strings:
-     chomp, chop, chr, crypt, fc, hex, index, lc, lcfirst,
+     ban, chomp, chop, chr, crypt, fc, hex, index, lc, lcfirst,
      length, oct, ord, pack, q/STRING/, qq/STRING/, reverse,
      rindex, sprintf, substr, tr///, uc, ucfirst, y///
 
leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness ext/Pod-Functions/t/Functions.t 
../ext/Pod-Functions/t/Functions.t .. ok     
All tests successful.
Files=1, Tests=234,  1 wallclock secs ( 0.04 usr  0.01 sys +  0.23 cusr  0.00 csys =  0.28 CPU)
Result: PASS

At this point, we're done. We've now completed all the steps to add a new feature to the perl interpreter. As well as all the steps required to actually implement it in the core binary itself, we've updated the tests, documentation, and support modules to match.

Along the way we've seen examples from real commits into the perl tree while we made our own. Any particular new feature will of course have its own variations and differences - there are still many parts of the interpreter we haven't touched on in this series. It would be difficult to cover every possible idea for things that could be added or changed, but hopefully, having completed this series, you'll at least have a good overview of the main pieces that are likely to be involved, and some starting-off points for exploring whatever additional details your particular situation requires.

Index | < Prev

2021/02/24

Writing a Perl Core Feature - part 10: Documentation

Index | < Prev | Next >

Now that we have our new feature nicely implemented and tested, we're nearly finished. We just have a few more loose ends to tidy up. The first of these is to take a look at some documentation.

We've already done one small documentation addition to perldiag.pod when we added the new warning message, but the bulk of documentation to explain a new feature would likely be found in one of the main documents - perlsyn.pod, perlop.pod, perlfunc.pod or similar. Exactly which of these is best would depend on the nature of the specific feature.

The isa feature, being a new infix operator, was documented in perlop.pod: (github.com/Perl/perl5).

...
+=head2 Class Instance Operator
+X<isa operator>
+
+Binary C<isa> evaluates to true when left argument is an object instance of
+the class (or a subclass derived from that class) given by the right argument.
+If the left argument is not defined, not a blessed object instance, or does
+not derive from the class given by the right argument, the operator evaluates
+as false. The right argument may give the class either as a barename or a
+scalar expression that yields a string class name:
+
+    if( $obj isa Some::Class ) { ... }
+
+    if( $obj isa "Different::Class" ) { ... }
+    if( $obj isa $name_of_class ) { ... }
+
+This is an experimental feature and is available from Perl 5.31.6 when enabled
+by C<use feature 'isa'>. It emits a warning in the C<experimental::isa>
+category.

Let's now write a little bit of documentation for our new banana feature. Since it is a named function-like operator (though with odd syntax involving a second, trailing keyword), perhaps we'll write it in perlfunc.pod. We'll style it similarly to the case-changing functions lc and uc to get some suggested wording.

leo@shy:~/src/bleadperl/perl [git]
$ nvim pod/perlfunc.pod 

leo@shy (1 job):~/src/bleadperl/perl [git]
$ git diff | xml_escape 
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index b655a08ecc..319e9aab96 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -114,6 +114,7 @@ X<scalar> X<string> X<character>
 
 =for Pod::Functions =String
 
+L<C<ban>|/ban EXPR ana>,
 L<C<chomp>|/chomp VARIABLE>, L<C<chop>|/chop VARIABLE>,
 L<C<chr>|/chr NUMBER>, L<C<crypt>|/crypt PLAINTEXT,SALT>,
 L<C<fc>|/fc EXPR>, L<C<hex>|/hex EXPR>,
@@ -136,6 +137,10 @@ prefixed with C<CORE::>.  The
 L<C<"fc"> feature|feature/The 'fc' feature> is enabled automatically
 with a C<use v5.16> (or higher) declaration in the current scope.
 
+L<C<ban>|/ban EXPR ana> is available only if the
+L<C<"banana"> feature|feature/The 'banana' feature.> is enabled or if it is
+prefixed with C<CORE::>.
+
 =item Regular expressions and pattern matching
 X<regular expression> X<regex> X<regexp>
 
@@ -773,6 +778,15 @@ your L<atan2(3)> manpage for more information.
 
 Portability issues: L<perlport/atan2>.
 
+=item ban EXPR ana
+X<ban>
+
+=for Pod::Functions return ROT13 transformed version of a string
+
+Applies the "ROT13" transform to upper- and lower-case letters in the given
+expression string, returning the newly-formed string. Non-letter characters
+are left unchanged.
+
 =item bind SOCKET,NAME
 X<bind>

While this will do as a short example here, any real feature would likely have a lot more words to say than just this.

When editing POD files it's good to get into the habit of running the porting tests (or at least the POD checking ones) before committing, to check the formatting is valid:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness t/porting/pod*.t
porting/podcheck.t ... ok         
porting/pod_rules.t .. ok   
All tests successful.
Files=2, Tests=1472, 34 wallclock secs ( 0.20 usr  0.00 sys + 33.79 cusr  0.15 csys = 34.14 CPU)
Result: PASS

While I was writing this documentation it occurred to me to write about how the function handles Unicode characters vs byte strings, which got me thinking harder about how it actually behaves. It turns out the implementation doesn't handle this properly, as we can demonstrate with a new test:

--- a/t/op/banana.t
+++ b/t/op/banana.t
@@ -11,7 +11,7 @@ use strict;
 use feature 'banana';
 no warnings 'experimental::banana';
 
-plan 7;
+plan 8;
 
 is(ban "ABCD" ana, "NOPQ", 'Uppercase ROT13');
 is(ban "abcd" ana, "nopq", 'Lowercase ROT13');
@@ -23,3 +23,8 @@ my $str = "efgh";
 is(ban $str ana, "rstu", 'Lexical variable');
 is(ban $str . "IJK" ana, "rstuVWX", 'Concat expression');
 is("(" . ban "LMNO" ana . ")", "(YZAB)", 'Outer concat');
+
+{
+    use utf8;
+    is(ban "café" ana, "pnsé", 'Unicode string');
+}

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness t/op/banana.t 
op/banana.t .. 1/8 # Failed test 8 - Unicode string at op/banana.t line 29
#      got "pnsé"
# expected "pns�"
op/banana.t .. Failed 1/8 subtests 

This comes down to a bug in the pp_banana opcode function, which used the internal byte buffer of the incoming SV (SvPV) without inspecting the corresponding SvUTF8 flag. Such a pattern is almost always indicative of a Unicode support bug. We can easily fix this:

leo@shy:~/src/bleadperl/perl [git]
$ git diff pp.c
diff --git a/pp.c b/pp.c
index 9725806b84..3dbe21fadd 100644
--- a/pp.c
+++ b/pp.c
@@ -7211,6 +7211,8 @@ PP(pp_banana)
     s = SvPV(arg, len);
 
     mPUSHs(newSVpvn_rot13(s, len));
+    if(SvUTF8(arg))
+        SvUTF8_on(TOPs);
     RETURN;
 }
 

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness t/op/banana.t 
op/banana.t .. ok   
All tests successful.
Files=1, Tests=8,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.02 cusr  0.00 csys =  0.04 CPU)
Result: PASS

Writing good documentation is an integral part of the process of developing a new feature. Firstly it helps to explain the feature to users so they know how to use it. But often you find that the process of writing the words helps you think about different aspects of that feature that you may not have considered before. With that new frame of mind you sometimes discover missing parts to it, or uncover bugs or corner-cases that need fixing. Make sure to spend time working on the documentation for any new feature - it is said that you never truly understand something until you try to teach it to someone else.

Index | < Prev | Next >

2021/02/22

Writing a Perl Core Feature - part 9: Tests

Index | < Prev | Next >

By the end of part 8 we finally managed to see an actual implementation of our new feature. We tested a couple of things on the commandline directly to see that it seems to be doing the right thing. For a real core feature though it would be better to have it tested in a more automated, repeatable fashion. This is what the core unit tests are for.

The core perl source distribution contains a t/ directory with unit test files, very similar to the structure used by regular CPAN modules. The process for running these is a little different; as we already saw back in part 3, they need to be invoked via t/harness. The files themselves are somewhat more limited in what other modules they can use, so the full suite of Test:: modules is unavailable. But they are still expected to emit the regular TAP output we've come to expect from Perl unit tests, and tend to be structured quite similarly inside.

For example, the isa feature added an entire new file for its unit tests. As they all relate to the new syntax and semantics around a new opcode, they go in a file under the t/op directory. I won't paste the entire t/op/isa.t file, but consider this small section (github.com/Perl/perl5):

#!./perl

BEGIN {
    chdir 't' if -d 't';
    require './test.pl';
    set_up_inc('../lib');
    require Config;
}

use strict;
use feature 'isa';
no warnings 'experimental::isa';

...

my $baseobj = bless {}, "BaseClass";

# Bareword package name
ok($baseobj isa BaseClass, '$baseobj isa BaseClass');
ok(not($baseobj isa Another::Class), '$baseobj is not Another::Class');

While it doesn't use Test::More, it does still have access to some similar testing functions such as the ok test. The initial lines of boilerplate in the BEGIN block set up the testing functions from the test.pl script, so we can use them in the actual tests.

Let's now have a go at writing some tests for our new banana feature. As it works like a text transformation function, we can imagine a few different test strings to throw at it.

leo@shy:~/src/bleadperl/perl [git]
$ nvim t/op/banana.t

leo@shy:~/src/bleadperl/perl [git]
$ cat t/op/banana.t
#!./perl

BEGIN {
    chdir 't' if -d 't';
    require './test.pl';
    set_up_inc('../lib');
    require Config;
}

use strict;
use feature 'banana';
no warnings 'experimental::banana';

plan 7;

is(ban "ABCD" ana, "NOPQ", 'Uppercase ROT13');
is(ban "abcd" ana, "nopq", 'Lowercase ROT13');
is(ban "1234" ana, "1234", 'Numbers unaffected');

is(ban "a! b! c!" ana, "n! o! p!", 'Whitespace and symbols intermingled');

my $str = "efgh";
is(ban $str ana, "rstu", 'Lexical variable');

is(ban $str . "IJK" ana, "rstuVWX", 'Concat expression');
is("(" . ban "LMNO" ana . ")", "(YZAB)", 'Outer concat');

$ ./perl t/harness t/op/banana.t
op/banana.t .. ok   
All tests successful.
Files=1, Tests=7,  1 wallclock secs ( 0.02 usr  0.00 sys +  0.03 cusr  0.00 csys =  0.05 CPU)
Result: PASS

Here we have used the is() testing function to check that the various strings we got the ban ... ana operator to generate are what we expected them to be. We've tested both uppercase and lowercase letters, and that non-letter characters such as numbers, symbols and spaces remain unaffected. In addition we've added some syntax tests, to check variables as well as literal string constants, and to demonstrate that the parser handles the precedence of the operator correctly when mixed with string concatenation. All appears to be working fine.

Before we commit this one there is one last thing we have to do. Having added a new file to the distribution, one of the porting tests will now be unhappy:

leo@shy:~/src/bleadperl/perl [git]
$ git add t/op/banana.t 

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
porting/manifest.t ........ 9848/? # Failed test 10502 - git ls-files
gives the same number of files as MANIFEST lists at porting/manifest.t line 101
#      got "6304"
# expected "6303"
# Failed test 10504 - Nothing added to the repo that isn't in MANIFEST
at porting/manifest.t line 113
#      got "1"
# expected "0"
# Failed test 10505 - Nothing added to the repo that isn't in MANIFEST
at porting/manifest.t line 114
#      got "not in MANIFEST: t/op/banana.t"
# expected "not in MANIFEST: "
porting/manifest.t ........ Failed 3/10507 subtests 

To fix this one we need to manually add an entry to the MANIFEST file; unlike common practice for CPAN modules, this file is not automatically generated.

leo@shy:~/src/bleadperl/perl [git]
$ nvim MANIFEST

leo@shy:~/src/bleadperl/perl [git]
$ git diff MANIFEST
diff --git a/MANIFEST b/MANIFEST
index 71d3b453da..03ecdda3d2 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -5779,6 +5779,7 @@ t/op/attrproto.t          See if the prototype attribute works
 t/op/attrs.t                   See if attributes on declarations work
 t/op/auto.t                    See if autoincrement et all work
 t/op/avhv.t                    See if pseudo-hashes work
+t/op/banana.t                  See if the ban ... ana syntax works
 t/op/bless.t                   See if bless works
 t/op/blocks.t                  See if BEGIN and friends work
 t/op/bop.t                     See if bitops work

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
Result: PASS

Of course, in this test file we've added only 7 tests. Any real feature would likely have a lot more testing around it, to deal with a wider variety of situations and corner-cases. Often the really interesting cases only come to light after trying to use the feature for real and finding odd situations that don't quite work as expected; so after adding a new feature expect to spend a while expanding the test file to cover more things. It's especially useful to add new tests for new situations you find yourself using the feature in, even if they currently work just fine. The presence of such tests helps ensure the feature keeps working in that manner.

Index | < Prev | Next >

2021/02/19

Writing a Perl Core Feature - part 8: Interpreter internals

Index | < Prev | Next >

At this point we are most of the way to adding a new feature to the Perl interpreter. In part 4 we created an opcode function to represent the new behaviour, in parts 5 and 6 we added compiler support to recognise the syntax for it, and in part 7 we made a helper function to provide the required behaviour. It's now time to tie them all together.

When we looked at opcodes and optrees back in part 4, I mentioned that each node of the optree performs a little part of the execution of a function, with child nodes usually obtaining some piece of data somewhere that gets passed up to parent nodes to operate on. I skipped over exactly how that all works, so for this part lets look at that in more detail.

The data model used by the perl interpreter for runtime execution of code is based around being a stack machine. Most opcodes that operate in some way on regular perl data values do so by interacting with the data stack (often simply called "the stack"; though this is sometimes ambiguous as there are in fact several stacks within the perl interpreter). As the interpreter walks along an optree invoking the function associated with each opcode, these various functions either push values onto the stack, or pop values already there back off it again, in order to use them.

For example, in part 4 we saw how the line of code my $x = 5; might get represented by an optree of three nodes - an OP_SASSIGN with two child nodes OP_CONST and OP_PADSV.

When this statement is executed the optree nodes are visited in postfix order, with the two child BASEOPs running first in order to push some values to the stack, followed by the assignment BINOP afterwards, which takes those values back off the stack and performs the appropriate assignment.

Let's now take a closer look at the code inside one of the actual functions which implements this. For example pp_const, the function for OP_CONST, consists of three short lines:

PP(pp_const)
{
    dSP;
    XPUSHs(cSVOP_sv);
    RETURN;
}

Although this is only three lines, all four symbols used here are in fact macros:

  1. dSP declares some local variables for tracking state, used by later macros
  2. cSVOP_sv fetches the actual SV pointer out of the SVOP itself. This will be the one holding the constant's value
  3. XPUSHs extends the (data) stack if necessary, then pushes the SV onto it
  4. RETURN resynchronises the interpreter state from the local variables, and arranges for the opcode function to return the next opcode, for the toplevel instruction loop
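
Very roughly speaking (the real definitions in pp.h are more involved and handle more cases than this), those macros expand to something like the following sketch:

/* Approximate expansions only - see pp.h for the real definitions */
SV **sp = PL_stack_sp;      /* dSP: take a local copy of the stack pointer   */

EXTEND(sp, 1);              /* XPUSHs(sv): make sure there is room, ...      */
*++sp = sv;                 /*   ... then push the SV onto the stack         */

PL_stack_sp = sp;           /* RETURN: write the local pointer back, ...     */
return PL_op->op_next;      /*   ... and hand the next op to the runloop     */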

The pp_padsv function is somewhat more complex, but the essential parts of it are quite similar; the following example is heavily paraphrased:

PP(pp_padsv)
{
    SV ** const padentry = &(PAD_SVl(op->op_targ));
    XPUSHs(*padentry);
    RETURN;
}

This time, rather than the cSVOP_sv which takes the SV out of the op itself, we use PAD_SVl which looks up the SV in the currently-active pad, by using the target index which is stored in the op.

When the isa feature was added, its main pp_isa opcode function was actually quite small: (github.com/Perl/perl5).

--- a/pp.c
+++ b/pp.c
@@ -7143,6 +7143,18 @@ PP(pp_argcheck)
     return NORMAL;
 }
 
+PP(pp_isa)
+{
+    dSP;
+    SV *left, *right;
+
+    right = POPs;
+    left  = TOPs;
+
+    SETs(boolSV(sv_isa_sv(left, right)));
+    RETURN;
+}
+

Since OP_ISA is a BINOP it is expecting to find two arguments on the stack; traditionally these are called left and right. This opcode function simply takes those two values and calls the sv_isa_sv() function, which returns a boolean truth value. The boolSV helper function returns an SV pointer to represent this boolean value, which is then used as the result of the opcode itself.

As a small performance optimisation, this function decides to POP only one argument, before changing the top-of-stack value to its result using SETs. This is equivalent to POPing both of them and PUSHing its result, except that it doesn't have to alter the stack pointer as many times.
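
In other words, pp_isa behaves as if its body had been written in this more literal form (a sketch for comparison only), just with one fewer adjustment of the stack pointer:

    right = POPs;
    left  = POPs;

    PUSHs(boolSV(sv_isa_sv(left, right)));
    RETURN;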

For more of a look at how the stack works, you could also take a look at another post from my series on Parser Plugins: Part 3a - The Stack.

Let's now take a look at implementing our banana feature for real. Recall that in part 4 we added the pp_banana function with some placeholder content that just died with a panic message if invoked. We'll now replace that with a real implementation:

leo@shy:~/src/bleadperl/perl [git]
$ nvim pp.c 

leo@shy:~/src/bleadperl/perl [git]
$ git diff pp.c
diff --git a/pp.c b/pp.c
index 93141454e1..bced3d23ea 100644
--- a/pp.c
+++ b/pp.c
@@ -7203,7 +7203,15 @@ PP(pp_cmpchain_dup)
 
 PP(pp_banana)
 {
-    DIE(aTHX_ "panic: we have no bananas");
+    dSP;
+    const char *s;
+    STRLEN len;
+    SV *arg = POPs;
+
+    s = SvPV(arg, len);
+
+    PUSHs(newSVpvn_rot13(s, len));
+    RETURN;
 }
 
 /*

Now let's rebuild perl and try it out:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
...

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'use experimental "banana"; say ban "Hello, world!" ana;'
Uryyb, jbeyq!

Well, it certainly looks plausible - we've got back a different string of the same length, with different letters but the same capitalisation and identical non-letter characters. Let's compare with something like tr to see if it's correct:

leo@shy:~/src/bleadperl/perl [git]
$ echo "Uryyb, jbeyq!" | tr "A-Za-z" "N-ZA-Mn-za-m"
Hello, world!

Seems good. But it turns out we've still missed something. This function has a memory leak. We can demonstrate it by writing a small example that calls ban ... ana a large number of times (say, a thousand), and printing the total count of SVs on the heap before and after. There's a handy function in perl's unit test suite, XS::APItest::sv_count, we can use here:

leo@shy (1 job):~/src/bleadperl/perl [git]
$ ./perl -Ilib -I. -MXS::APItest=sv_count -E \
  'use experimental "banana";
   say sv_count();
   ban "Hello, world!" ana for 1..1000;
   say sv_count();'
5321
6321

Oh dear. The SV count is a thousand higher afterwards than before, suggesting we leaked an SV on every call.

It turns out this is because of an optimisation that the interpreter uses, where SV pointers on the Perl data stack don't actually contribute to reference counting. When values get POPed from the stack we don't have to decrement their refcount; when values get pushed we don't increment it. This saves a certain amount of runtime cost by not having to adjust those counts all the time. The consequence here is that we have to be a bit more careful when returning newly-constructed values. We must mark the value as mortal, which means we are saying that its reference count is artificially high (because of that pointer on the stack), and perl should decrement the reference count at some point soon, when it next discards temporary values.

Because this sort of thing is done a lot, there is a handy macro called mPUSHs, which mortalizes an SV when it pushes it to the data stack. We can call that instead:

$ git diff pp.c
...
+    mPUSHs(newSVpvn_rot13(s, len));
+    RETURN;
 }
 
 /*
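
For reference, mPUSHs is just a shorthand for mortalizing the SV and then pushing it, so the line in the diff above could equally have been written out by hand as:

    /* equivalent to mPUSHs(newSVpvn_rot13(s, len)) */
    PUSHs(sv_2mortal(newSVpvn_rot13(s, len)));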

Now when we try our leak test we find the same SV count before and after, meaning no leak has occurred:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -I. -MXS::APItest=sv_count -E ...
5321
5321

We may be onto a winner here.

Index | < Prev | Next >

2021/02/17

Writing a Perl Core Feature - part 7: Support functions

Index | < Prev | Next >

So far in this series we've seen several modifications and small additions, to add the required bits and pieces for our new feature to various parts of the perl interpreter. Often when adding anything but the very smallest and simplest of features or changes, it becomes necessary not just to modify existing things, but to add some new support functions as well.

For example, adding the isa feature required adding a new function to actually implement the bulk of the operation, which is then called from the pp_isa opcode function. This helper function was added into universal.c in this commit: (github.com/Perl/perl5).

--- a/universal.c
+++ b/universal.c
@@ -187,6 +187,74 @@ Perl_sv_derived_from_pvn(pTHX_ SV *sv, const char *const name, const STRLEN len,
     return sv_derived_from_svpvn(sv, NULL, name, len, flags);
 }
 
+/*
+=for apidoc sv_isa_sv
+
+Returns a boolean indicating whether the SV is an object reference and is
+derived from the specified class, respecting any C<isa()> method overloading
+it may have. Returns false if C<sv> is not a reference to an object, or is
+not derived from the specified class.
...
+
+=cut
+
+*/
+
+bool
+Perl_sv_isa_sv(pTHX_ SV *sv, SV *namesv)
+{
+    GV *isagv;
+
+    PERL_ARGS_ASSERT_SV_ISA_SV;
+
+    if(!SvROK(sv) || !SvOBJECT(SvRV(sv)))
+        return FALSE;
+
...
+    return sv_derived_from_sv(sv, namesv, 0);
+}
+
 /*
 =for apidoc sv_does_sv

Like all good helper functions, this one is named beginning with a Perl_ prefix and takes as its first parameter the pTHX_ macro. To make the function properly visible to other code within the interpreter, an entry needed adding to the embed.fnc file which lists all of the functions. (github.com/Perl/perl5).

--- a/embed.fnc
+++ b/embed.fnc
@@ -1777,6 +1777,7 @@ ApdR      |bool   |sv_derived_from_sv|NN SV* sv|NN SV *namesv|U32 flags
 ApdR   |bool   |sv_derived_from_pv|NN SV* sv|NN const char *const name|U32 flags
 ApdR   |bool   |sv_derived_from_pvn|NN SV* sv|NN const char *const name \
                                     |const STRLEN len|U32 flags
+ApdRx  |bool   |sv_isa_sv      |NN SV* sv|NN SV* namesv
 ApdR   |bool   |sv_does        |NN SV* sv|NN const char *const name
 ApdR   |bool   |sv_does_sv     |NN SV* sv|NN SV* namesv|U32 flags
 ApdR   |bool   |sv_does_pv     |NN SV* sv|NN const char *const name|U32 flags

This file stores pipe-separated columns, containing:

  • A set of flags - in this case marking an API function (A), having the Perl_ prefix (p), with documentation (d), whose return value must not be ignored (R) and is currently experimental (x)
  • The return type
  • The name
  • Argument types in all remaining columns; where NN prefixes an argument which must not be passed as NULL

For our new banana feature let's now think of some semantics. Perhaps, given the example code we saw yesterday, it should return a new string built from its argument. For the arbitrary reason of having something interesting yet unlikely in practice, let's make it return a ROT13-transformed version.

Let's now add a helper function to do this - something to construct a new string SV containing the ROT13 transformation of the given input. We'll begin by picking a name for this new function, adding a definition line to the embed.fnc list, and running the regen/embed.pl regeneration script:

leo@shy:~/src/bleadperl/perl [git]
$ nvim embed.fnc 

leo@shy:~/src/bleadperl/perl [git]
$ git diff embed.fnc
diff --git a/embed.fnc b/embed.fnc
index eb7b47601a..74946566e7 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1488,6 +1488,7 @@ ApdR      |SV*    |newSVuv        |const UV u
 ApdR   |SV*    |newSVnv        |const NV n
 ApdR   |SV*    |newSVpv        |NULLOK const char *const s|const STRLEN len
 ApdR   |SV*    |newSVpvn       |NULLOK const char *const buffer|const STRLEN len
+ApdR   |SV*    |newSVpvn_rot13 |NN const char *const s|const STRLEN len
 ApdR   |SV*    |newSVpvn_flags |NULLOK const char *const s|const STRLEN len|const U32 flags
 ApdR   |SV*    |newSVhek       |NULLOK const HEK *const hek
 ApdR   |SV*    |newSVpvn_share |NULLOK const char* s|I32 len|U32 hash

leo@shy:~/src/bleadperl/perl [git]
$ perl regen/embed.pl 
Changed: proto.h embed.h

Take a look now at the changes it's made.

  • A new macro in embed.h which calls the full Perl_-prefixed function name from its shorter alias. The macro makes sure to pass in the aTHX_ parameter, meaning we don't have to remember that all the time
  • A prototype and an arguments assertion macro for the function in proto.h
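
Roughly speaking, those generated additions look something like this (paraphrased rather than copied exactly - the precise layout of the generated files may differ):

/* embed.h - the short alias, which supplies the thread-context argument for us */
#define newSVpvn_rot13(a,b)     Perl_newSVpvn_rot13(aTHX_ a,b)

/* proto.h - the prototype, plus the argument-assertion macro */
PERL_CALLCONV SV*       Perl_newSVpvn_rot13(pTHX_ const char *const s, const STRLEN len)
                        __attribute__warn_unused_result__;
#define PERL_ARGS_ASSERT_NEWSVPVN_ROT13 \
        assert(s)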

To actually implement this function we should pick a file to put it in. Since it's creating a new SV, the file sv.c seems reasonable. For neatness we'll put it right next to the other newSVpv* functions, in the same order as the list in embed.fnc:

leo@shy:~/src/bleadperl/perl [git]
$ nvim sv.c

leo@shy:~/src/bleadperl/perl [git]
$ git diff sv.c
diff --git a/sv.c b/sv.c
index e54d0a078f..156e64e879 100644
--- a/sv.c
+++ b/sv.c
@@ -9397,6 +9397,43 @@ Perl_newSVpvn(pTHX_ const char *const buffer, const STRLEN len)
     return sv;
 }
 
+/*
+=for apidoc newSVpvn_rot13
+
+Creates a new SV and copies a string into it by transforming letters by the
+ROT13 algorithm, and copying other bytes literally. The string may contain
+C<NUL> characters and other binary data. The reference count for the new SV
+is set to 1.
+
+=cut
+*/
+
+SV *
+Perl_newSVpvn_rot13(pTHX_ const char *const s, const STRLEN len)
+{
+    char *dp;
+    const char *sp = s, *send = s + len;
+    SV *sv = newSV(len);
+
+    dp = SvPVX(sv);
+    while(sp < send) {
+        char c = *sp;
+        if(isLOWER(c))
+            *dp = 'a' + (c - 'a' + 13) % 26;
+        else if(isUPPER(c))
+            *dp = 'A' + (c - 'A' + 13) % 26;
+        else
+            *dp = c;
+
+        sp++; dp++;
+    }
+
+    *dp = '\0';
+    SvPOK_on(sv);
+    SvCUR_set(sv, len);
+    return sv;
+}
+
 /*
 =for apidoc newSVhek

I don't want to spend a large amount of time or space in this post explaining the whole function, but as a brief summary,

  1. newSV() creates a new SV with a string buffer big enough to store the content (it internally adds 1 more to accommodate the terminating NUL)
  2. The pointers sp and dp are initialised to point into the source and destination string buffers
  3. Characters are copied one at a time, applying the ROT13 algorithm to lower- or upper-case letters and passing anything else through unchanged
  4. The terminating NUL is appended
  5. The current string size and stringiness flag are set on the new SV, which is then returned

If we run the porting tests again now, we'll find one gets upset:

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
porting/args_assert.t ..... 1/? # Failed test 2 - PERL_ARGS_ASSERT_NEWSVPVN_ROT13 is 
declared but not used at porting/args_assert.t line 64

This test is unhappy because it didn't find any code that actually called the argument-asserting macro which the regeneration script added to proto.h. This is the macro that asserts on the types of arguments to the function. We can fix that by remembering to use it in the function's definition:

leo@shy:~/src/bleadperl/perl [git]
$ nvim sv.c

leo@shy:~/src/bleadperl/perl [git]
$ git diff sv.c
diff --git a/sv.c b/sv.c
index e54d0a078f..d63c8a7bbb 100644
--- a/sv.c
+++ b/sv.c
...
+SV *
+Perl_newSVpvn_rot13(pTHX_ const char *const s, const STRLEN len)
+{
+    char *dp;
+    const char *sp = s, *send = s + len;
+    SV *sv;
+
+    PERL_ARGS_ASSERT_NEWSVPVN_ROT13;
+
+    sv = newSV(len);
+
+    dp = SvPVX(sv);
...

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
Result: PASS

As core functions go this one is actually pretty terrible. It presumes ASCII (and doesn't work properly on EBCDIC platforms), and requires careful handling in the caller to set the UTF8 flag when required. But overall it's at least good enough to demonstrate our feature. In the next part we'll hook this function up with the opcode implementation and finally see our new feature in action.

Index | < Prev | Next >

2021/02/15

Writing a Perl Core Feature - part 6: Parser

Index | < Prev | Next >

In the previous part I introduced the concepts of the lexer and the parser, and the way they combine to form the part of the compiler which actually translates the incoming program source code into the in-memory optree where it can be executed. We took a look at some lexer changes, and the way that the isa operator was able to work with those alone, without needing a corresponding change in the parser; but we also noted that most non-trivial syntax additions will require concurrent changes to both the lexer and the parser to cope with them.

In particular, although it is the lexer that creates and emits tokens into the parser, it is the parser which maintains the list of what token types it expects. It is there that new token types have to be added.

The isa operator did not need to make any changes in the parser, so for today's article we'll look instead at the recently-added try/catch syntax, which did. That was first added in this commit, though subsequent modifications have been made to it. Go take a look now - perhaps you will find parts of it recognisable, similar to the changes we've already seen for isa and have made for the new banana feature we've been building up.

Similar to the situation with features, warnings, and opcodes, the parser is maintained primarily by changes to one source file which is then run through a regeneration script to update several other files that are generated from it. The source of truth in this case is the file perly.y, and the regeneration script for it is regen_perly.pl (neither of which live in the regen directory for reasons lost to the mists of time).

The part of the try/catch commit which updated the parser source file, perly.y, had two pieces to it: (github.com/Perl/perl5).

--- a/perly.y
+++ b/perly.y
@@ -69,6 +69,7 @@
 %token <ival> FORMAT SUB SIGSUB ANONSUB ANON_SIGSUB PACKAGE USE
 %token <ival> WHILE UNTIL IF UNLESS ELSE ELSIF CONTINUE FOR
 %token <ival> GIVEN WHEN DEFAULT
+%token <ival> TRY CATCH
 %token <ival> LOOPEX DOTDOT YADAYADA
 %token <ival> FUNC0 FUNC1 FUNC UNIOP LSTOP
 %token <ival> MULOP ADDOP
@@ -459,6 +460,31 @@ barestmt:  PLUGSTMT
                                  newFOROP(0, NULL, $mexpr, $mblock, $cont));
                          parser->copline = (line_t)$FOR;
                        }
+       |       TRY mblock[try] CATCH PERLY_PAREN_OPEN 
+                       { parser->in_my = 1; }
+               remember scalar 
+                       { parser->in_my = 0; intro_my(); }
+               PERLY_PAREN_CLOSE mblock[catch]
+                       {
...
+                       }
        |       block cont
                        {
                          /* a block is a loop that happens once */

Of these two parts, the first is the bit that defines two new token types. These are types we can use in the lexer - recall from the previous part we saw the lexer emit these tokens as PREBLOCK(TRY) and PREBLOCK(CATCH).

The second part of this change gives the actual parsing rules which the parser uses to recognise the new syntax. This appears in the form of a new alternative in the set of possible rules that the parser may use to create a barestmt (each alternative is separated by | characters). The rules on how to recognise this one are made from a mix of basic tokens (those in capitals) and other grammar rules (those in lower case). The four basic tokens here are the keyword try, an open and close parenthesis pair (represented by tokens named PERLY_PAREN_OPEN and PERLY_PAREN_CLOSE), and the keyword catch.

In effect we can imagine the rule expressed instead using literal strings:

barestmt =
    ...
    | "try" mblock "catch" "(" scalar ")" mblock

The other grammar rules that are referred to by this one define the basic shape of a block of code (the one called mblock), and a single scalar variable (the one called scalar). The other parts that I omitted in this simplified version (remember and the two action blocks relating to parser->in_my) are involved with ensuring that the catch variable part of the syntax is recognised as creating a new variable. It pretends that there had been a my keyword just before the variable name, so the name introduces a new variable.

Don't worry too much about the contents of the main action block for this try/catch syntax rule. That's all specific to how to build up the optree for this particular syntax, and in any case was changed in a later commit to move most of it out to a helper function. We'll come back in a moment to see what we can put there for our new syntax.

Let's now begin adding the tokenizing and parsing rules for our new banana feature. Recall from part 5 that we decided we'd add two new token types to represent the two basic keywords. We can do that by adding them to the collection of tokens at the top of the perly.y file and running the regeneration script:

leo@shy:~/src/bleadperl/perl [git]
$ nvim perly.y 

leo@shy:~/src/bleadperl/perl [git]
$ git diff perly.y
diff --git a/perly.y b/perly.y
index 184fb0c158..7bbb64f202 100644
--- a/perly.y
+++ b/perly.y
@@ -77,6 +77,7 @@
 %token <ival> LOCAL MY REQUIRE
 %token <ival> COLONATTR FORMLBRACK FORMRBRACK
 %token <ival> SUBLEXSTART SUBLEXEND
+%token <ival> BAN ANA
 
 %type <ival> grammar remember mremember
 %type <ival>  startsub startanonsub startformsub

leo@shy:~/src/bleadperl/perl [git]
$ perl regen_perly.pl 
Changed: perly.act perly.tab perly.h

At this point, if you want, you could take a look at the change the script introduced in perly.h - it just adds the two new token types to the main enum yytokentype, where the tokenizer and the parser can use them. Don't worry about the other two files (perly.act and perly.tab) - they are long tables of automatically generated output, mostly numbers which help the parser to maintain its internal state. The change there won't be particularly meaningful to look at.

As these new token types now exist in perly.h, we can update toke.c to recognise the new keywords and emit them:

leo@shy:~/src/bleadperl/perl [git]
$ nvim toke.c 

leo@shy:~/src/bleadperl/perl [git]
$ git diff toke.c
diff --git a/toke.c b/toke.c
index 628a79fb43..9f86e110ce 100644
--- a/toke.c
+++ b/toke.c
@@ -7686,6 +7686,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_accept:
         LOP(OP_ACCEPT,XTERM);
 
+    case KEY_ana:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__BANANA), "banana is experimental");
+        TOKEN(ANA);
+
     case KEY_and:
         if (!PL_lex_allbrackets && PL_lex_fakeeof >= LEX_FAKEEOF_LOWLOGIC)
             return REPORT(0);
@@ -7694,6 +7699,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_atan2:
         LOP(OP_ATAN2,XTERM);
 
+    case KEY_ban:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__BANANA), "banana is experimental");
+        TOKEN(BAN);
+
     case KEY_bind:
         LOP(OP_BIND,XTERM);

Now we can rebuild perl and test some examples:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'use feature "banana"; say ban "a string here" ana;'
banana is experimental at -e line 1.
banana is experimental at -e line 1.
syntax error at -e line 1, near "say ban"
Execution of -e aborted due to compilation errors.

We get our expected warnings about the experimental syntax, and then a syntax error. This is because, while the lexer recognises our keywords, we haven't yet written a grammar rule to tell the parser what to do with them. But we can at least tell that the lexer recognised the keywords, because if we test without enabling the feature we get a totally different error:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'say ban "a string here" ana;'
Bareword found where operator expected at -e line 1, near ""a string here" ana"
        (Missing operator before ana?)
syntax error at -e line 1, near ""a string here" ana"
Execution of -e aborted due to compilation errors.

Let's now add a grammar rule to let the parser recognise this syntax:

leo@shy:~/src/bleadperl/perl [git]
$ nvim perly.y 

leo@shy:~/src/bleadperl/perl [git]
$ git diff perly.y
...
                    SUBLEXSTART listexpr optrepl SUBLEXEND
                        { $$ = pmruntime($PMFUNC, $listexpr, $optrepl, 1, $<ival>2); }
+       |       BAN expr ANA
+                       { $$ = newUNOP(OP_BANANA, 0, $expr); }
        |       BAREWORD
        |       listop
...

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl

With this new definition our new syntax:

  • is recognised as a basic term expression, meaning it can stand in the same parts of syntax as other expressions such as constants or variables
  • requires an expr expression between the ban and ana keywords, meaning it will accept any sort of complex expression such as a string concatenation operator or function call

After the grammar rule which tells the parser how to recognise the new syntax, we've added a block of code telling it how to implement it. This is translated into some real C code that forms part of the parser, so we can invoke any bits of perl interpreter internals from here. When it gets translated a few special variables are replaced in the code - these are the ones prefixed with $ symbols. The $$ variable is where the parser is expecting to find the output of this particular grammar rule; it's where we put the optree we construct to represent it. For arguments into that we can use the other variable, named after the sub-rule used to parse it - $expr. That will contain the output of parsing that part of the syntax - again an optree.

In this action block it is now a simple matter of generating an optree for the OP_BANANA opcode we added in part 4. Recall that was an op of type UNOP, so we use the newUNOP() function to do this, taking as its child subtree the expression between the two keywords which we got in $expr. We just put that result into the $$ result variable, and we're done.
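
For reference, newUNOP is one of the standard optree-construction functions (it is documented in perlapi); its signature is along these lines:

    /* type:  the new op's type, here OP_BANANA
     * flags: op_flags to set on it, here 0
     * first: the op subtree to adopt as the single child */
    OP *newUNOP(I32 type, I32 flags, OP *first);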

Now we can try using it:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'use feature "banana"; say ban "a string here" ana;'
banana is experimental at -e line 1.
banana is experimental at -e line 1.
panic: we have no bananas at -e line 1.

Hurrah! We get the panic message we added as a placeholder when we created the Perl_pp_banana function back in part 4. The pieces are now starting to come together - in the next part we'll start implementing the actual behaviour behind this syntax.

Let's not forget to add the new "experimental" warning to pod/perldiag.pod in order to keep the porting test happy:

leo@shy:~/src/bleadperl/perl [git]
$ nvim pod/perldiag.pod 

$ git diff pod/perldiag.pod
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 98d159dc21..66b0a4aa40 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -519,6 +519,11 @@ wasn't a symbol table entry.
 (P) An internal request asked to add a scalar entry to something that
 wasn't a symbol table entry.
 
+=item banana is experimental
+
+(S experimental::banana) This warning is emitted if you use the banana
+syntax (C<ban> ... C<ana>). This syntax is currently experimental.
+
 =item Bareword found in conditional
 

For now there's one last thing we can look at. Even though we don't have an implementation behind the syntax, we can at least compile it into an optree. We can inspect the generated optree by using the -MO=Concise compiler backend:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -MO=Concise -E 'use feature "banana"; say ban "a string here" ana;'
banana is experimental at -e line 1.
banana is experimental at -e line 1.
7  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter v ->2
2     <;> nextstate(main 3 -e:1) v:%,us,{,fea=15 ->3
6     <@> say vK ->7
3        <0> pushmark s ->4
5        <1> banana sK/1 ->6
4           <$> const(PV "a string here") s ->5
-e syntax OK

I won't go into the full details here - for that you can read the documentation at B::Concise. For now I'll just remark that we can see the banana op here, as an UNOP (the 1 flag before it), sitting in the optree as a child node of say, with the string constant as its own child op. When working on code that generates optrees, the B::Concise module is a handy debugging tool you can use to inspect the generated optree and ensure it has the shape you expected.

Index | < Prev | Next >

2021/02/12

Writing a Perl Core Feature - part 5: Lexer

Index | < Prev | Next >

Now that we have a controllable feature flag that conditionally recognises our new keywords, and a new opcode that we can use to implement some behaviour for them, we can begin to tie them together. The previous post mentioned that the Perl interpreter converts the source code of a program into an optree, stored in memory. This is done by a collection of code loosely described as the compiler. Exactly what the compiler will do with these new keywords depends on its two main parts - the lexer, and the parser.

If you're unfamiliar with these general concepts of compiler technology, allow me a brief explanation. A lexer takes the source code, in the form of a stream of characters, and begins analysing it by grouping those characters up into the basic elements of the syntax, called tokens (sometimes called lexemes). This sequence of tokens is then passed into the parser, whose job is to build up the syntax tree representing the program from those analysed tokens. (The lexer is sometimes also called a tokenizer; the two words are interchangeable).

Tokens may be as small as a single character (for example a + or - operator), or could be an entire string or numerical constant. It is the job of the lexer to skip over things like comments and ignorable whitespace. Typically in compilers, tokens are represented by some sort of type system, where each kind of token has a specific type, often with associated values. For example, any numerical constant in the source code would be represented by a token of a "NUMBER" type, whose associated value is the specific number. In this manner the parser can consider the types of tokens it has received (for example it may have recently received a number, a + operator, and another number), and emit some form of syntax tree to represent the numerical addition of those two numbers.

For example, for a simple expression language, the source first gets tokenized into a stream of tokens. Any sequence of digits becomes a NUMBER token with its associated numerical value, and operators become their own token types representing the symbol itself:
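
For instance (using the made-up expression 1 + 2 * 3 purely as an illustration), the lexer would produce a token stream along the lines of:

    NUMBER(1)  PLUS  NUMBER(2)  STAR  NUMBER(3)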

It then gets parsed by recursively applying an ordered list of rules (to implement operator precedence) to form some sort of syntax tree. We're looking ultimately for an expr (short for "expression"). At high priority, a sequence of expr-STAR-expr can be considered as an expr (by combining the two numbers by a MULTIPLY operation). At lesser priority, a sequence expr-PLUS-expr can be considered as such (by using ADD). Finally, a NUMBER token can stand alone as an expr.
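
Continuing the same made-up example, those rules would combine the token stream into a tree shaped something like this, with the higher-priority MULTIPLY grouping more tightly than the ADD:

    expr (ADD)
     +-- expr (NUMBER 1)
     +-- expr (MULTIPLY)
          +-- expr (NUMBER 2)
          +-- expr (NUMBER 3)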

Specifically in Perl's case, the lexer is rather more complex than those of most typical languages. It has a number of features which may surprise you if you are familiar with the overall concept of token-based parsing. Whereas some much simpler languages can be tokenized with a statically-defined set of rules, Perl's lexer is much more stateful and dynamically controlled. The recent history of tokens it has already seen can change its interpretation of things to come. The parser can influence what the lexer will expect to see next. Additionally, existing code that has already been seen and parsed will also affect its decisions.

To give a few examples here, consider the way that braces are used both to delimit blocks of code and to construct anonymous hash references. The lexer resolves which case is which by examining what "expect" state it is in - whether it should be expecting an expression term, or a statement. Consider also the way that the names of existing functions already in scope (and what prototypes, if any, they may have) influence the way that calls to those functions are parsed. This is, in part, performed by the lexer.

my $hashref = { one => 1, two => 2 };
# These braces are a hashref constructor

if($cond) { say "Cond is true"; }
# These braces are a code block

sub listy_func { ... }
sub unary_func($) { ... }

say listy_func 1, 2, 3;
# parsed as  say(listy_func(1, 2, 3));

say unary_func 4, 5, 6;
# parsed as  say(unary_func(4), 5, 6);

Due to its central role in parsing the source code of a program, it is important that the lexer knows about every keyword and combination of symbols used in the syntax. Not every new feature or keyword needs to involve the parser, so for now we'll leave that for the next post in this series and concentrate on the lexer.

The lexer is contained in the file toke.c. When the isa feature was added the change here was rather small: (github.com/Perl/perl5).

--- a/toke.c
+++ b/toke.c
@@ -7800,6 +7800,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_ioctl:
         LOP(OP_IOCTL,XTERM);
 
+    case KEY_isa:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__ISA), "isa is experimental");
+        Rop(OP_ISA);
+
     case KEY_join:
         LOP(OP_JOIN,XTERM);
 

Here we have extended the main function that recognises barewords vs keywords, yyl_word_or_keyword. This function is based, in part, on the function in keywords.c that we saw modified back in part 3. (Remember, that added the new keywords, to be conditionally recognised depending on whether our feature is enabled.) If the word is recognised as the isa keyword (meaning the feature has been enabled), then the lexer emits it as a token in the category of "relational operator", called Rop. We additionally report the value of the opcode that implements it - the opcode OP_ISA which we saw added in part 4. Since the feature is experimental, this is also the point at which we emit the "is experimental" warning, using the warning category we saw added in part 2.

Because of this neat convenience, the change adding the isa operator didn't need to touch the parser at all. In order for us to have something interesting to talk about when we move on to the parser, let's imagine a slightly weirder grammar shape for our new banana feature. We have two keywords to play with, so let's now imagine that they are used as a pair, surrounding some other expression, as in the syntax:

use feature 'banana';

my $something = ban "Some other stuff goes here" ana;

Because of this rather weird structure, we won't be able to make use of any of the convenience token types, so we'll instead just emit these as plain TOKENs and let the parser deal with it. This will necessitate some changes to the parser as well, to add some new token values for it to recognise, so we'll do that in the next part too.

Before we leave the topic of the lexer, let's just take a look at another recent Perl core change - the one that first introduced the try/catch syntax, via the try named feature: (github.com/Perl/perl5).

...
@@ -7704,6 +7706,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_break:
         FUN0(OP_BREAK);
 
+    case KEY_catch:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__TRY), "try/catch is experimental");
+        PREBLOCK(CATCH);
+
     case KEY_chop:
         UNI(OP_CHOP);
 
@@ -8435,6 +8442,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_truncate:
         LOP(OP_TRUNCATE,XTERM);
 
+    case KEY_try:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__TRY), "try/catch is experimental");
+        PREBLOCK(TRY);
+
     case KEY_uc:
         UNI(OP_UC);
 

This was a very similar change - again just two new case labels to handle the two newly-added keywords. Each one emits a token of the PREBLOCK type. This is a hint to the parser that following the keyword it should expect to find a block of code surrounded by braces ({ ... }). In general when adding new syntax, there will likely be some existing token types that can be used for it, because it will likely follow a similar shape to things already there.

Each of these changes adds a new warning - a call to Perl_ck_warner_d. There's a porting test file that checks to see that every one of these has been mentioned somewhere in pod/perldiag.pod. In order to keep that test happy, each commit had to add a new section there too; for example for isa: (github.com/Perl/perl5).

--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3262,6 +3262,12 @@ an anonymous subroutine, or a reference to a subroutine.
 (W overload) You tried to overload a constant type the overload package is
 unaware of.
 
+=item isa is experimental
+
+(S experimental::isa) This warning is emitted if you use the (C<isa>)
+operator. This operator is currently experimental and its behaviour may
+change in future releases of Perl.
+
 =item -i used with no filenames on the command line, reading from STDIN
 
 (S inplace) The C<-i> option was passed on the command line, indicating

In the next part, we'll take a look at the other half of the compiler, the parser. It is there that we'll make our next modifications to add the banana feature.

Index | < Prev | Next >

2021/02/10

Writing a Perl Core Feature - part 4: Opcodes

Index | < Prev | Next >

Optrees and Ops

Before we get into this next part, I want to first explain some details about how the Perl interpreter works. In summary, the source code of a Perl program is translated into a more compiled form when the interpreter starts up and reads the files. This form is stored in memory and is used to implement the behaviour of the functions that make up the program. It is called an Optree.

Or rather more accurately, every individual function in the program is represented by an Optree. This is a tree-shaped data structure, whose individual nodes each represent one basic kind of operation or step in the execution of that function. This could be considered similar to a sort of assembly language representation, except that rather than being stored as a flat list of instructions, the tree-shaped structure of the individual nodes (called "ops") helps determine the behaviour of the program when run.

For example, there are many kinds of ops that have no child nodes; these are typically used to represent constants in the program, or to fetch items from well-defined locations elsewhere in the interpreter - such as lexical or package variables. Most other kinds of op take one or more subtrees as child nodes, forming the tree structure; they operate on the data those child nodes previously fetched - adding numbers together, say, or assigning values into variable locations. To execute the optree the interpreter visits each node in postfix order, recursively gathering results from the child nodes to pass upwards to their parents.

Each individual type of op determines what sort of tree-shaped structure it will have, and ops are grouped together into classes. The most basic class of op (variously called either just "op", or sometimes a "baseop") is one that has no child nodes. An op class with a single child op is called an "unop" (for "unary operator"), one with two children is called a "binop" (for "binary operator"), and one with a variable number of children is a "listop". Within these broad categories there are also sub-divisions: for example, a basic op which carries a Perl value with it is an "svop".

Specific types of op are identified by names, given by the constants defined in opnames.h. For example, a basic op carrying a constant value is an OP_CONST, and one representing a lexical variable is an OP_PADSV (so named because variables - SVs - are stored in a data structure called a scratchpad, or pad for short). A binop which performs a scalar assignment between its two child ops is OP_SASSIGN. Thus, for example, the following Perl statement could be represented by the optree given below it:

my $x = 5;
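
As a rough sketch of that tree (op names as they appear in opnames.h; the nextstate op that marks the start of the statement is omitted):

sassign              - OP_SASSIGN, the binop performing the assignment
    const [5]        - OP_CONST, supplying the constant value
    padsv [$x]       - OP_PADSV, the lexical variable introduced by "my"

When executed, the two child ops are visited first to produce their values, and then sassign stores the constant into the variable.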

Of course, in such a brief overview I have omitted many details and made many simplifications of the actual subject. This should be sufficient to stand as an introduction to the next step of adding a new core Perl feature, but for more information on the subject you could take a look at another blog post of mine, where I talked about optrees from the perspective of writing syntax keyword modules - Perl Parser Plugins 3 - Optrees.

One final point to note is that in some ways you can think of an optree as being similar to an abstract syntax tree (an AST). This isn't always a great analogy, because some parts of the optree don't bear a very close resemblance to the syntax of the source code that produced it. While there are certain similarities, it is important to remember it is not quite the same thing. For example, there is no opcode to represent the if syntax; the same opcode is used as for the and infix short-circuit operator. It is best to think of the optree as representing the abstract algorithm - the sequence of required operations - that was described by the source code compiled into it.
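
A quick way to convince yourself of this is to note that the following two statements compile to essentially the same optree, rooted in that same and-style op - which is also why B::Deparse cannot always reproduce the exact spelling you originally wrote:

my $x = 1;

# These two statements build essentially the same optree:
print "hi\n" if $x;
$x and print "hi\n";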

Opcodes in Perl Core

As with adding features, warnings, and keywords, the first step to adding a new opcode to the Perl core begins with editing a file under regen/. The file in this case is regen/opcodes, and is not a perl script but a plain-text file listing the various kinds of op, along with a number of properties for each. The file begins with a block of comments which explains more of the details.

The choice of how to represent a new Perl feature in terms of the optree that the syntax will generate depends greatly on exactly what the behaviour of the feature should be. Especially when creating a new feature as core syntax (rather than just adding some functions in a module) the syntax and semantic shape often don't easily relate to a simple function-like structure. There aren't any hard-and-fast rules here; the best bet is usually to look around the existing ops and syntax definitions for similar ideas to be inspired by.

For example, when I added the isa operator I observed that it should behave as an infix comparison-style operator, similar to perhaps the eq or == ones. In the regen/opcodes file these are defined by the two lines:

eq		numeric eq (==)		ck_cmp		Iifs2	S S<
seq		string eq		ck_null		ifs2	S S

The meanings of these five tab-separated columns are as follows:

  1. The source-level name of the op (this is used, capitalised, to form the constants OP_EQ and OP_SEQ).
  2. A human-readable string description for the op (used in printed warnings).
  3. The name of the op-checker function (more on this later).
  4. Some flags describing the operator itself; notable ones being s - produces a scalar result, and 2 - it is a binop.
  5. More flags describing the operands; in this case two scalars. It turns out that in practice nothing cares about this column, so it is omitted from later additions.

The definition for the isa operator was added in a similar style: (github.com/Perl/perl5).

--- a/regen/opcodes
+++ b/regen/opcodes
@@ -572,3 +572,5 @@ lvref               lvalue ref assignment   ck_null         d%
 lvrefslice     lvalue ref assignment   ck_null         d@
 lvavref                lvalue array reference  ck_null         d%
 anonconst      anonymous constant      ck_null         ds1
+
+isa            derived class test      ck_isa          s2

Let's now consider what we need for our new banana feature. Although we added two new keywords in the previous part, those only provide the source-level spelling of the feature. Perhaps the semantics we want can be represented by a single opcode (remembering what we said above - that the optree is more a representation of the underlying semantics of the program, and not merely the surface-level syntax of how it is written).

For the sake of argument, let us now imagine that whatever new syntax our banana feature requires, its operation (via that one opcode) will behave somewhat like a string transform function (perhaps similar to uc or lc). As with so many things relating to adding a new feature, keyword, opcode, and so on, it is often best to look for something else similar to copy and adjust as appropriate. We'll add a single new opcode to the list by making a copy of one of those and editing it:

leo@shy:~/src/bleadperl/perl [git]
$ nvim regen/opcodes

leo@shy:~/src/bleadperl/perl [git]
$ git diff
diff --git a/regen/opcodes b/regen/opcodes
index 2a2da77c5c..27114c9659 100644
--- a/regen/opcodes
+++ b/regen/opcodes
@@ -579,3 +579,5 @@ cmpchain_and        comparison chaining     ck_null         |
 cmpchain_dup   comparand shuffling     ck_null         1
 
 catch          catch {} block          ck_null         |
+
+banana         banana operation        ck_null         s1

leo@shy:~/src/bleadperl/perl [git]
$ perl regen/opcode.pl 
Changed: opcode.h opnames.h pp_proto.h lib/B/Op_private.pm

The regeneration script has edited quite a few files this time. Take a look at those now. The notable parts are:

  • A new value named OP_BANANA has been added to the list in opnames.h.
  • A new entry has been added to each of several arrays defined in opcode.h. These contain the name and description strings, function pointers, and various bitflags. Of specific note is the new entry in PL_ppaddr[] which points to a new function named Perl_pp_banana.
  • A new function prototype for Perl_pp_banana in pp_proto.h.

If we were to try building perl now we'd find it doesn't even compile, because the opcode tables refer to this new Perl_pp_banana function, which we haven't yet written:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
...
/usr/bin/ld: globals.o:(.data.rel+0xc88): undefined reference to `Perl_pp_banana'
collect2: error: ld returned 1 exit status

We'll have to provide an actual function for this. There are in fact a number of files which could potentially contain it. pp_ctl.c contains the control-flow ops (such as entersub and return), pp_sys.c contains the various ops that interact with the OS (such as open and socket), pp_sort.c and pp_pack.c each contain just those specific ops (for various reasons), and the rest of the "normal" ops are scattered between pp.c and pp_hot.c - the latter containing a few of the smaller, more frequently invoked ops.

For adding a new feature like this, it's almost certain that we want to be adding it to pp.c. For now, so that we can at least compile perl again and continue our work, let's just add a little stub function that will panic if it is actually run.

leo@shy:~/src/bleadperl/perl [git]
$ nvim pp.c 

leo@shy:~/src/bleadperl/perl [git]
$ git diff pp.c
diff --git a/pp.c b/pp.c
index d0e639fa32..bc54a06aa3 100644
--- a/pp.c
+++ b/pp.c
@@ -7207,6 +7207,11 @@ PP(pp_cmpchain_dup)
     RETURN;
 }
 
+PP(pp_banana)
+{
+    DIE(aTHX_ "panic: we have no bananas");
+}
+
 /*
  * ex: set ts=8 sts=4 sw=4 et:
  */

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
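
With the stub in place the build links again, and as a quick sanity check we can already ask the B module - using the freshly-built ./perl with -Ilib - whether it knows about the new op. A small sketch; the exact number printed will vary:

use B qw(opnumber);

# opnumber() maps an op name to its index in the op tables we just regenerated
print opnumber("banana"), "\n";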

Before we conclude this already-long part, there's something we have to tidy up to keep the unit tests happy. There are a few tests which care about the total list of opcodes, and since we've added one more they will now need adjusting.

porting/utils.t ........... 58/? # Failed test 59 - utils/cpan compiles at porting/utils.t line 85
#      got "Untagged opnames: banana\nutils/cpan syntax OK\n"
# expected "utils/cpan syntax OK\n"
# when executing perl with '-c utils/cpan'
porting/utils.t ........... Failed 1/82 subtests 

It's not obvious from the error output, but this is actually complaining that the Opcode module (in ext/Opcode/Opcode.pm) has not placed this opcode into any category. We can fix that by editing the module file and, again, doing something similar to whatever uc and lc do. As it's a shipped .pm file, don't forget to update the $VERSION declaration too:

leo@shy:~/src/bleadperl/perl [git]
$ nvim ext/Opcode/Opcode.pm 

leo@shy:~/src/bleadperl/perl [git]
$ git diff ext/Opcode/Opcode.pm
diff --git a/ext/Opcode/Opcode.pm b/ext/Opcode/Opcode.pm
index f1b2247b07..eaabc43757 100644
--- a/ext/Opcode/Opcode.pm
+++ b/ext/Opcode/Opcode.pm
@@ -6,7 +6,7 @@ use strict;
 
 our($VERSION, @ISA, @EXPORT_OK);
 
-$VERSION = "1.50";
+$VERSION = "1.51";
 
 use Carp;
 use Exporter ();
@@ -336,7 +336,7 @@ invert_opset function.
     substr vec stringify study pos length index rindex ord chr
 
     ucfirst lcfirst uc lc fc quotemeta trans transr chop schop
-    chomp schomp
+    chomp schomp banana
 
     match split qr
 

At this point, the tests should all run cleanly again. We're now getting perilously close to actually being able to implement something. Maybe we'll get around to that in the next part.

Index | < Prev | Next >

2021/02/08

Writing a Perl Core Feature - part 3: Keywords

Index | < Prev | Next >

Some Perl features use a syntax entirely made of punctuation symbols; for example Perl 5.10's defined-or operator (//), or Perl 5.24's postfix dereference (->$*, etc.). Other features are based around new keywords spelled like regular identifiers, such as 5.10's state or 5.32's isa. It is rare for newly-added syntax to be expressible purely in terms of existing operator symbols, so most new features come in the form of new keywords.

As with adding the named feature itself and its associated warning, the first step to adding a keyword begins with editing a regeneration file. The file required this time is called regen/keywords.pl.

For example when the isa feature was added, it required a new keyword of the same name: (github.com/Perl/perl5).

--- a/regen/keywords.pl
+++ b/regen/keywords.pl
@@ -46,6 +46,7 @@ my %feature_kw = (
     evalbytes => 'evalbytes',
     __SUB__   => '__SUB__',
     fc        => 'fc',
+    isa       => 'isa',
 );
 
 my %pos = map { ($_ => 1) } @{$by_strength{'+'}};
@@ -217,6 +218,7 @@ __END__
 -index
 -int
 -ioctl
+-isa
 -join
 -keys
 -kill

There are two parts to this change. The later part adds our new keyword to the main list of all the known keywords in the DATA section at the end of the script. If it weren't for the first part of this change, the new keyword would be recognised unconditionally in all code - almost certainly not what we want, as that would cause compatibility issues in existing code. Since we have a lexical named feature for exactly this purpose, we make use of it here by listing the new keyword along with its associated feature in the %feature_kw hash, so that the keyword is only recognised conditionally, based on that feature being enabled.
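
The effect of this conditional recognition is easy to see from ordinary Perl code. Here is a small sketch using the isa keyword (perl 5.32 or later; Some::Class is just an illustrative name):

my $obj = bless {}, 'Some::Class';

# Without the feature, "isa" is only the familiar UNIVERSAL method:
print "method says yes\n" if $obj->isa('Some::Class');

{
    use feature 'isa';
    no warnings 'experimental::isa';
    # With the feature enabled in this lexical scope, isa is also an infix operator:
    print "operator says yes\n" if $obj isa Some::Class;
}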

For our new banana feature we need to decide whether we're going to add some keywords, and if so what they will be called. Let's add two, called ban and ana, to make a more interesting example. As before we'll start by editing the regeneration script and running it to have it rebuild some files.

leo@shy:~/src/bleadperl/perl [git]
$ nvim regen/keywords.pl 

leo@shy:~/src/bleadperl/perl [git]
$ git diff
diff --git a/regen/keywords.pl b/regen/keywords.pl
index b9ae8cf0f2..adbec89c71 100755
--- a/regen/keywords.pl
+++ b/regen/keywords.pl
@@ -47,6 +47,8 @@ my %feature_kw = (
     __SUB__   => '__SUB__',
     fc        => 'fc',
     isa       => 'isa',
+    ban       => 'banana',
+    ana       => 'banana',
 );
 
 my %pos = map { ($_ => 1) } @{$by_strength{'+'}};
@@ -125,8 +127,10 @@ __END__
 -abs
 -accept
 -alarm
+-ana
 -and
 -atan2
+-ban
 -bind
 -binmode
 -bless

leo@shy:~/src/bleadperl/perl [git]
$ perl regen/keywords.pl 
Changed: keywords.c keywords.h

We still have a few more files to edit before we're done adding the keywords, but before continuing you should take a look at these regenerated files to see what changes have been made. Notice that this time there are no changes to any Perl files, only C files. This is why we didn't need to update any $VERSION values.

The keywords.h file just contains a long list of macros named KEY_... which give numbers to each keyword. Don't worry that most of the numbers have now changed - regen/keywords.pl likes to keep them in alphabetical order, and since we added new ones near the beginning it has had to move the rest downwards. This won't be a problem because the numbers are only internal within the perl lexer and parser, so there's no API compatibility to worry about here.

The keywords.c file contains just one function, whose job is to recognise any of the keywords by name. It returns values of these KEY_... macros. Take a look at the added code, and notice that its recognition of each of our additions is conditional on the FEATURE_BANANA_IS_ENABLED macro we saw added when we added the named feature.

We're not quite done yet though. If we were to run the full test suite now, we'd already find a few tests that fail:

op/coreamp.t .. 1/? # Failed test 591 - ana either has been tested or is not ampable at op/coreamp.t line 1178
# Failed test 593 - ban either has been tested or is not ampable at op/coreamp.t line 1178
op/coreamp.t .. Failed 2/778 subtests 
...
op/coresubs.t .. 1/? perl: op.c:14795: Perl_ck_entersub_args_core: Assertion `!"UNREACHABLE"' failed.
op/coresubs.t .. All 52 subtests passed
...
../lib/B/Deparse-core.t .. 3690/3904 # keyword 'ana' seen in ../regen/keywords.pl, but not tested here!!
# keyword 'ban' seen in ../regen/keywords.pl, but not tested here!!

#   Failed test 'sanity checks'
#   at ../lib/B/Deparse-core.t line 430.
# Looks like you failed 1 test of 3904.
../lib/B/Deparse-core.t .. Dubious, test returned 1 (wstat 256, 0x100)

The two tests in t/op check variations on a theme of the &CORE::... syntax, by which core operators can be reïfied into regular code references to functions that behave like the operator. This is often appropriate for operators which act like regular functions - the mathematical sin and cos operators, for example - but isn't what we want for keywords that act more structurally, like basic syntax; the sketch below illustrates the difference.
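
Here is a minimal sketch of that distinction, assuming a perl of at least 5.16 (where the &CORE:: subroutines first appeared):

use feature 'say';

# Function-like ops can be reified into ordinary code references...
my $len = \&CORE::length;
say $len->("banana");    # prints 6

# ...but structural keywords (if, while, and our new ban/ana pair) have no
# such subroutine form, which is why they belong on these tests' skip lists.

We should tell these tests to skip the new keywords by adding them to each file's skip list: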

leo@shy:~/src/bleadperl/perl [git]
$ nvim t/op/coreamp.t t/op/coresubs.t 

leo@shy:~/src/bleadperl/perl [git]
$ git diff t/
diff --git a/t/op/coreamp.t b/t/op/coreamp.t
index b57609bef0..bd60ca83b9 100644
--- a/t/op/coreamp.t
+++ b/t/op/coreamp.t
@@ -1162,7 +1162,7 @@ like $@, qr'^Undefined format "STDOUT" called',
   my %nottest_words = map { $_ => 1 } qw(
     AUTOLOAD BEGIN CHECK CORE DESTROY END INIT UNITCHECK
     __DATA__ __END__
-    and cmp default do dump else elsif eq eval for foreach format ge given goto
+    ana and ban cmp default do dump else elsif eq eval for foreach format ge given goto
     grep gt if isa last le local lt m map my ne next no or our package print
     printf q qq qr qw qx redo require return s say sort state sub tr unless
     until use when while x xor y
diff --git a/t/op/coresubs.t b/t/op/coresubs.t
index 1fa11c02f0..85c08a4756 100644
--- a/t/op/coresubs.t
+++ b/t/op/coresubs.t
@@ -15,7 +15,8 @@ BEGIN {
 use B;
 
 my %unsupported = map +($_=>1), qw (
- __DATA__ __END__ AUTOLOAD BEGIN UNITCHECK CORE DESTROY END INIT CHECK and
+ __DATA__ __END__ AUTOLOAD BEGIN UNITCHECK CORE DESTROY END INIT CHECK
+  ana and ban
   cmp default do dump else elsif eq eval for foreach
   format ge given goto grep gt if isa last le local lt m map my ne next
   no  or  our  package  print  printf  q  qq  qr  qw  qx  redo  require
   

Now let's run those two tests in particular. We can do this by using our newly-built perl binary to run the t/harness script, passing in the paths (relative to the t/ directory) of the specific tests we wish to run:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness op/coreamp.t op/coresubs.t
op/coreamp.t ... ok     
op/coresubs.t .. 1/? # Failed test 51 - no CORE::ana at op/coresubs.t line 53
# Failed test 58 - no CORE::ban at op/coresubs.t line 53
op/coresubs.t .. Failed 2/1099 subtests 

Test Summary Report
-------------------
op/coresubs.t (Wstat: 0 Tests: 1099 Failed: 2)
  Failed tests:  51, 58
Files=2, Tests=1875,  1 wallclock secs ( 0.35 usr  0.02 sys +  0.67 cusr  0.03 csys =  1.07 CPU)
Result: FAIL

Well that's one solved, but the other is still upset. This time it is complaining that it expected not to find a &CORE::ana at all, but instead one was there. In order to fix that we will have to edit the list of exceptions in gv.c.

leo@shy:~/src/bleadperl/perl [git]
$ nvim gv.c

leo@shy:~/src/bleadperl/perl [git]
$ git diff gv.c
diff --git a/gv.c b/gv.c
index 92bada56b1..10271159dc 100644
--- a/gv.c
+++ b/gv.c
@@ -543,8 +543,9 @@ S_maybe_add_coresub(pTHX_ HV * const stash, GV *gv,
     switch (code < 0 ? -code : code) {
      /* no support for \&CORE::infix;
         no support for funcs that do not parse like funcs */
-    case KEY___DATA__: case KEY___END__: case KEY_and: case KEY_AUTOLOAD:
-    case KEY_BEGIN   : case KEY_CHECK  : case KEY_cmp:
+    case KEY___DATA__: case KEY___END__: case KEY_ana   : case KEY_and    :
+    case KEY_AUTOLOAD: case KEY_ban    : case KEY_BEGIN : case KEY_CHECK  :
+    case KEY_cmp     :
     case KEY_default : case KEY_DESTROY:
     case KEY_do      : case KEY_dump   : case KEY_else  : case KEY_elsif  :
     case KEY_END     : case KEY_eq     : case KEY_eval  :

Now we rebuild perl (because we have edited a C file) and rerun the tests:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
...

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness op/coreamp.t op/coresubs.t 
op/coreamp.t ... ok     
op/coresubs.t .. ok      
All tests successful.
Files=2, Tests=1875,  1 wallclock secs ( 0.43 usr  0.02 sys +  0.76 cusr  0.02 csys =  1.23 CPU)
Result: PASS

The test under ../lib/B/Deparse-core.t checks the behaviour of the B::Deparse module against the core keywords. (The path is relative to the t/ directory, which is why it begins with .., and shows that tests within bundled core modules are counted as part of the full test suite.)

When the isa feature was added, this test file was updated with some deparsing tests for the isa operator as a regular infix binary syntax. We'll come back later and add some unit tests for our new ban and ana keywords, but for now, as with the coreamp and coresubs tests, it is best just to add these to the skip list in that test file as well.

leo@shy:~/src/bleadperl/perl [git]
$ nvim lib/B/Deparse-core.t 

leo@shy:~/src/bleadperl/perl [git]
$ git diff lib/B/Deparse-core.t
diff --git a/lib/B/Deparse-core.t b/lib/B/Deparse-core.t
index cdbd27ce5e..edf86f809d 100644
--- a/lib/B/Deparse-core.t
+++ b/lib/B/Deparse-core.t
@@ -362,6 +362,8 @@ my %not_tested = map { $_ => 1} qw(
     END
     INIT
     UNITCHECK
+    ana
+    ban
     default
     else
     elsif

leo@shy:~/src/bleadperl/perl [git]
$ ./perl t/harness ../lib/B/Deparse-core.t
../lib/B/Deparse-core.t .. ok         
All tests successful.
Files=1, Tests=3904, 17 wallclock secs ( 1.17 usr  0.06 sys + 16.86 cusr  0.06 csys = 18.15 CPU)
Result: PASS

At this point we now have a named feature with its associated warning, and some conditionally-recognised keywords. In the next parts we will get the compiler to recognise these when parsing Perl code.

Index | < Prev | Next >

2021/02/05

Writing a Perl Core Feature - part 2: warnings.pm

Index | < Prev | Next >

Ever since Perl version 5.18, newly added features are initially declared as experimental. This gives time for them to be more widely tested and used in practice, so that the design can be further refined and changed if necessary. In order to achieve this for a new feature our next step will be to add a warning to warnings.pm.

Similar to the named feature in feature.pm this file also isn't edited directly, but instead is maintained by a regeneration script; this one called regen/warnings.pl.

For example, the isa feature added a new warning here: (github.com/Perl/perl5).

--- a/regen/warnings.pl
+++ b/regen/warnings.pl
@@ -16,7 +16,7 @@
 #
 # This script is normally invoked from regen.pl.
 
-$VERSION = '1.45';
+$VERSION = '1.46';
 
 BEGIN {
     require './regen/regen_lib.pl';
@@ -117,6 +117,8 @@ my $tree = {
                                     [ 5.029, DEFAULT_ON ],
                                 'experimental::vlb' =>
                                     [ 5.029, DEFAULT_ON ],
+                                'experimental::isa' =>
+                                    [ 5.031, DEFAULT_ON ],
                         }],
 
         'missing'       => [ 5.021, DEFAULT_OFF],

This change simply adds another entry into the list of defined warnings. It has a name, a Perl version from which it appears, and is declared to be on by default (as all "experimental" warnings should be). We also have to bump the version number because that is the value inserted into the generated warnings.pm file.

For adding a new warning to go along with our banana feature, we follow a similar process to the one we used for the named feature. We edit the regeneration file to make a change similar to the one shown above, then run the script to have it regenerate the required files.
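
The new entry for banana sits in the same experimental block as the isa one shown above, plus the usual $VERSION bump at the top of the script. A sketch of the edit (the 5.033 version number here is an assumption - use whichever development series you are actually on):

                                'experimental::banana' =>
                                    [ 5.033, DEFAULT_ON ],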

leo@shy:~/src/bleadperl/perl [git]
$ nvim regen/warnings.pl 

leo@shy:~/src/bleadperl/perl [git]
$ perl regen/warnings.pl 
Changed: warnings.h lib/warnings.pm

As before, we can see that it has generated the new lib/warnings.pm Perl pragma file, and also a header file for compiling the interpreter itself. Take a look at these files now to get a feel for what's there.

In particular, the items of note are:

  • The generated warnings.pm file includes changes to the documented list of known warning categories.
  • A new WARN_EXPERIMENTAL__BANANA macro has been created in the warnings.h file. We shall be seeing this used soon.
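
Once lib/warnings.pm has been regenerated, we can also check that the warnings pragma itself recognises the new category - an unknown category name is a compile-time error, so this small sketch (run against the new lib with -Ilib) only compiles if the entry really was added:

use warnings 'experimental::banana';
no warnings 'experimental::banana';

print "category known\n";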

Now that we have both the named feature and the experimental warning we can check that the experimental pragma module can enable it:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
...

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -ce 'use experimental "banana";'
-e syntax OK

We're now one step closer to being able to actually start implementing this feature.

Index | < Prev | Next >

2021/02/03

Writing a Perl Core Feature - part 1: feature.pm

Index | < Prev | Next >

The first step towards adding a new feature to Perl is introducing the new name into feature.pm, so that it may be requested by

use feature 'banana';

To accomplish this we don't actually edit feature.pm directly, because that file is automatically generated from other sources. The primary file we need to work on lives in the regen/ directory, and is called regen/feature.pl.

For example, when adding the isa feature this was the change made there: (github.com/Perl/perl5).

--- a/regen/feature.pl
+++ b/regen/feature.pl
@@ -35,6 +35,7 @@ my %feature = (
     unicode_strings => 'unicode',
     fc              => 'fc',
     signatures      => 'signatures',
+    isa             => 'isa',
 );
 
 # NOTE: If a feature is ever enabled in a non-contiguous range of Perl
@@ -752,6 +753,14 @@ Reference to a Variable> for examples.
 
 This feature is available from Perl 5.26 onwards.
 
+=head2 The 'isa' feature
+
+This allows the use of the C<isa> infix operator, which tests whether the
+scalar given by the left operand is an object of the class given by the
+right operand. See L<perlop/Class Instance Operator> for more details.
+
+This feature is available from Perl 5.32 onwards.
+
 =head1 FEATURE BUNDLES
 
 It's possible to load multiple features together, using

We can see two distinct parts in here. The first, a single line addition to the %feature hash, is the part which actually introduces the new name. The second part adds some documentation for it, which will appear in the generated feature.pm file.

To add our new banana feature then, this is where we must start editing. For now don't worry too much about the documentation part - we'll come back to that later. Just add a single line into the %feature hash.

leo@shy:~/src/bleadperl/perl [git]
$ nvim regen/feature.pl
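
The addition itself is just one more line in the %feature hash, mapping the feature's user-visible name to the internal name used for the hint. A sketch of the end result:

my %feature = (
    # ... existing entries ...
    isa             => 'isa',
    banana          => 'banana',
);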

Once we've made our required changes in here, we run the script to get it to regenerate its files. Note that we need a perl to run this, but it doesn't have to be the one we are trying to build (indeed - that would be problematic, would it not? ;) ). Any reasonably up-to-date system Perl install will be fine.

leo@shy:~/src/bleadperl/perl [git]
$ perl regen/feature.pl 
Changed: lib/feature.pm feature.h

Here we can see that it has regenerated two files. The first of these is the lib/feature.pm file that the perl VM will use at runtime to implement the actual use feature pragma with. The second file is feature.h which is used during compiling the interpreter itself and contains the various feature-test macros. If you want, take a look now at the changes it has made.

Specifically, notice that:

  • A new FEATURE_BANANA_BIT macro has been created, and a value assigned to it. These features are kept in numerical order, so also notice that the subsequent features have been renumbered. This is fine - the bit fields are only used internally and there are no API guarantees of numerical stability between major versions of Perl.
  • A new FEATURE_BANANA_IS_ENABLED macro has been created, which other code may use to test if the feature is currently in effect during compile-time. Keep note of this - we will be seeing it again later on.
  • The other change in the file is in the S_magic_sethint_feature() function, which adds code to recognise the string name of the new feature; this is ultimately used by the use feature ... line itself to recognise the names of the requested features.

At this point already, we can test that the newly-created feature is at least recognised by the feature.pm file itself:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
...

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -ce 'use feature "banana";'
-e syntax OK

It actually turns out that the particular commit that added isa was somewhat atypical. It didn't actually need to change the $VERSION of the generated file, because another change earlier in the history had already done so. This is unlikely to be the case most of the time.

Now would be a good time to introduce the porting tests. This is a subset of the full test suite, which checks various details to do with whether the source code is being maintained properly. We can run these directly:

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
porting/cmp_version.t ..... 1/4 # not ok 3 - lib/feature.pm version 1.62
porting/cmp_version.t ..... Failed 1/4 subtests 
...
Test Summary Report
-------------------
porting/cmp_version.t   (Wstat: 0 Tests: 4 Failed: 1)
  Failed test:  3
Files=32, Tests=44043, 188 wallclock secs ( 7.88 usr  0.16 sys + 186.14 cusr  3.98 csys = 198.16 CPU)
Result: FAIL

Here indeed we see that for our banana feature we have forgotten to bump the version number. No matter, we can do that now and test again:

leo@shy:~/src/bleadperl/perl [git]
$ nvim regen/feature.pl 

leo@shy:~/src/bleadperl/perl [git]
$ perl regen/feature.pl
Changed: lib/feature.pm

leo@shy:~/src/bleadperl/perl [git]
$ git diff
...
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -5,7 +5,7 @@
 
 package feature;
 
-our $VERSION = '1.62';
+our $VERSION = '1.63';
 
 our %feature = (
     fc                   => 'feature_fc',
...

leo@shy:~/src/bleadperl/perl [git]
$ make test_porting
...
All tests successful.
Files=32, Tests=44044, 175 wallclock secs ( 7.32 usr  0.12 sys + 174.11 cusr  3.58 csys = 185.13 CPU)
Result: PASS

While working on core features it's a good idea to run at least the porting tests regularly. The full test suite takes quite a while to run, and most of it won't be affected by the particular parts of a new feature you are working on (especially as new features should be lexically guarded, and thus have limited impact on the vast majority of the existing test suite, which won't be expecting them), but the porting tests are designed to be small and lightweight enough to run often, keeping an eye on the most likely things to check.

Index | < Prev | Next >