2021/02/10

Writing a Perl Core Feature - part 4: Opcodes

Index | < Prev | Next >

Optrees and Ops

Before we get into this next part, I want to first explain some details about how the Perl interpreter works. In summary, the source code of a Perl program is translated into a more compiled form when the interpreter starts up and reads the files. This form is stored in memory and is used to implement the behaviour of the functions that make up the program. It is called an Optree.

Or rather more accurately, every individual function in the program is represented by an Optree. This is a tree-shaped data structure, whose individual nodes each represent one basic kind of operation or step in the execution of that function. This could be considered similar to a sort of assembly language representation, except that rather than being stored as a flat list of instructions, the tree-shaped structure of the individual nodes (called "ops") helps determine the behaviour of the program when run.

For example, while there are many kinds of ops that have no child nodes, these are typically used to represent constants in the program, or fetch items from well-defined locations elsewhere in the interpreter - such as lexical or package variables. Most other kinds of op take one or more subtrees as child nodes and form the tree structure, where they will operate on the data those child nodes previously fetched - such as adding numbers together, or assigning values into variable locations. To execute the optree the interpreter visits each node in postfix order; recursively gathering results from child nodes of the tree to pass upwards to their parents.

Each individual type of op determines what sort of tree-shaped structure it will have, and are grouped together by classes. The most basic class of op (variously called either just "op", or sometimes a "baseop") is one that has no child nodes. An op class with a single child op is called an "unop" (for "unary operator"), one with two children is called a "binop" (for "binary operator"), and one with a variable number of children is a "listop". Within these broad categories there are also sub-divisions: for example a basic op which carries a Perl value with it is an "svop".

Specific types of op are identified by names, given by the constants defined in opnames.h. For example, a basic op carrying a constant value is an OP_CONST, and one representing a lexical variable is an OP_PADSV (so named because variables - SVs - are stored in a data structure called a scratchpad, or pad for short). A binop which performs a scalar assignment between its two child ops is OP_SASSIGN. Thus, for example, the following Perl statement could be represented by the optree given below it:

my $x = 5;

Of course, in such a brief overview as this I have omitted many details, as well as made many simplifications of the actual subject. This should be sufficient to stand as an introduction into the next step of adding a new core Perl feature, but for more information on the subject you could take a look at another blog post of mine, where I talked about optrees from the perspective of writing syntax keyword modules - Perl Parser Plugins 3 - Optrees.

One final point to note is that in some ways you can think of an optree as being similar to an abstract syntax tree (an AST). This isn't always a great analogy, because some parts of the optree don't bear a very close resemblence to the syntax of the source code that produced it. While there are certain similarities, it is important to remember it is not quite the same. For example, there is no opcode to represent the if syntax; the same opcode is used as for the and infix shortcircuit operator. It is best to think of the optree as representing the abstract algorithm - the sequence of required operations - that were described by the source code that compiled into it.

Opcodes in Perl Core

As with adding features, warnings, and keywords, the first step to adding a new opcode to the Perl core begins with editing a file under regen/. The file in this case is regen/opcodes, and is not a perl script, but a plain-text file listing the various kinds of op, along with various properties about them. The file begins with a block of comments which explains more of the details.

The choice of how to represent a new Perl feature in terms of the optree that the syntax will generate depends greatly on exactly what the behaviour of the feature should be. Especially when creating a new feature as core syntax (rather than just adding some functions in a module) the syntax and semantic shape often don't easily relate to a simple function-like structure. There aren't any hard-and-fast rules here; the best bet is usually to look around the existing ops and syntax definitions for similar ideas to be inspired by.

For example, when I added the isa operator I observed that it should behave as an infix comparison-style operator, similar to perhaps the eq or == ones. In the regen/opcodes file these are defined by the two lines:

eq		numeric eq (==)		ck_cmp		Iifs2	S S<
seq		string eq		ck_null		ifs2	S S

The meanings of these five tab-separated columns are as follows:

  1. The source-level name of the op (this is used, capitalised, to form the constants OP_EQ and OP_SEQ).
  2. A human-readable string description for the op (used in printed warnings).
  3. The name of the op-checker function (more on this later).
  4. Some flags describing the operator itself; notable ones being s - produces a scalar result, and 2 - it is a binop.
  5. More flags describing the operands; in this case two scalars. It turns out in practice nothing cares about that column so on later additions it is omitted.

The definition for the isa operator was added in a similar style: (github.com/Perl/perl5).

--- a/regen/opcodes
+++ b/regen/opcodes
@@ -572,3 +572,5 @@ lvref               lvalue ref assignment   ck_null         d%
 lvrefslice     lvalue ref assignment   ck_null         d@
 lvavref                lvalue array reference  ck_null         d%
 anonconst      anonymous constant      ck_null         ds1
+
+isa            derived class test      ck_isa          s2

Lets now consider what we need for our new banana feature. Although we've added two new keywords in the previous part, that is just for the source code way to spell this feature. Perhaps the semantics we want can be represented by a single opcode (remembering what we said above - that the optree is more a representation of the underlying semantics of the program, and not merely the surface level syntax of how it is written).

For sake of argument, let us now imagine that whatever new syntax our new banana feature requires, its operation (via that one opcode) will behave somewhat like a string transform function (perhaps similar to uc or lc). As with so many things relating to adding a new feature/keyword/opcode/etc... it is often best to look for something else similar to copy and adjust as appropriate. We'll add a single new opcode to the list by making a copy of one of those and editing it:

leo@shy:~/src/bleadperl/perl [git]
$ nvim regen/opcodes

leo@shy:~/src/bleadperl/perl [git]
$ git diff
diff --git a/regen/opcodes b/regen/opcodes
index 2a2da77c5c..27114c9659 100644
--- a/regen/opcodes
+++ b/regen/opcodes
@@ -579,3 +579,5 @@ cmpchain_and        comparison chaining     ck_null         |
 cmpchain_dup   comparand shuffling     ck_null         1
 
 catch          catch {} block          ck_null         |
+
+banana         banana operation        ck_null         s1

leo@shy:~/src/bleadperl/perl [git]
$ perl regen/opcode.pl 
Changed: opcode.h opnames.h pp_proto.h lib/B/Op_private.pm

The regeneration script has edited quite a few files this time. Take a look at those now. The notable parts are:

  • A new value named OP_BANANA has been added to the list in opnames.h.
  • A new entry has been added to each of several arrays defined in opcode.h. These contain the name and description strings, function pointers, and various bitflags. Of specific note is the new entry in PL_ppaddr[] which points to a new function named Perl_pp_banana.
  • A new function prototype for Perl_pp_banana in pp_proto.h.

If we were to try building perl now we'd find it won't currently even compile, because the opcode tables are looking for this new Perl_pp_banana function but we haven't even written it yet:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl
...
/usr/bin/ld: globals.o:(.data.rel+0xc88): undefined reference to `Perl_pp_banana'
collect2: error: ld returned 1 exit status

We'll have to provide an actual function for this. There are in fact a number of files which potentially could contain this function. pp_ctl.c contains the control-flow ops (such as entersub and return), pp_sys.c contains the various ops that interact with the OS (such as open and socket), pp_sort.c and pp_pack.c each contain just those specific ops (for various reasons), and the rest of the "normal" ops are scattered between pp.c and pp_hot.c - the latter containing a few of the smaller more-frequently invoked ops.

For adding a new feature like this, it's almost certain that we want to be adding it to pp.c. For now so that we can at least compile perl again and continue our work lets just add a little stub function that will panic if actually run.

leo@shy:~/src/bleadperl/perl [git]
$ nvim pp.c 

leo@shy:~/src/bleadperl/perl [git]
$ git diff pp.c
diff --git a/pp.c b/pp.c
index d0e639fa32..bc54a06aa3 100644
--- a/pp.c
+++ b/pp.c
@@ -7207,6 +7207,11 @@ PP(pp_cmpchain_dup)
     RETURN;
 }
 
+PP(pp_banana)
+{
+    DIE(aTHX_ "panic: we have no bananas");
+}
+
 /*
  * ex: set ts=8 sts=4 sw=4 et:
  */

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl

Before we conclude this already-long part, there's something we have to tidy up to keep the unit tests happy. There are a few tests which care about the total list of opcodes, and since we've added one more they will now need adjusting.

porting/utils.t ........... 58/? # Failed test 59 - utils/cpan compiles at porting/utils.t line 85
#      got "Untagged opnames: banana\nutils/cpan syntax OK\n"
# expected "utils/cpan syntax OK\n"
# when executing perl with '-c utils/cpan'
porting/utils.t ........... Failed 1/82 subtests 

It's non-obvious from the error result, but this is actually complaining that the module Opcode::Opcode has not categorised this opcode into a category. We can fix that by editing the module file and again doing similar to whatever uc and lc do. Again as it's a shipped .pm file don't forget to update the $VERSION declaration:

leo@shy:~/src/bleadperl/perl [git]
$ nvim ext/Opcode/Opcode.pm 

leo@shy:~/src/bleadperl/perl [git]
$ git diff ext/Opcode/Opcode.pm
diff --git a/ext/Opcode/Opcode.pm b/ext/Opcode/Opcode.pm
index f1b2247b07..eaabc43757 100644
--- a/ext/Opcode/Opcode.pm
+++ b/ext/Opcode/Opcode.pm
@@ -6,7 +6,7 @@ use strict;
 
 our($VERSION, @ISA, @EXPORT_OK);
 
-$VERSION = "1.50";
+$VERSION = "1.51";
 
 use Carp;
 use Exporter ();
@@ -336,7 +336,7 @@ invert_opset function.
     substr vec stringify study pos length index rindex ord chr
 
     ucfirst lcfirst uc lc fc quotemeta trans transr chop schop
-    chomp schomp
+    chomp schomp banana
 
     match split qr
 

At this point, the tests should all run cleanly again. We're now getting perilously close to actually being able to implement something. Maybe we'll get around to that in the next part.

Index | < Prev | Next >

1 comment:

  1. Doing a great job with this. I feel like I should soon be able to perpetrate horrors upon the code!

    ReplyDelete