2021/02/15

Writing a Perl Core Feature - part 6: Parser

Index | < Prev | Next >

In the previous part I introduced the concepts of the lexer and the parser, and the way they combine together to form part of the compiler which actually translates the incoming program source code into the in-memory optree where it can be executed. We took a look at some parser changes, and the way that the isa operator was able to work with that alone without needing a corresponding change in the parser, but also noted that most non-trivial syntax additions will require concurrent changes to both the parser and the lexer to cope with it.

In particular, although it is the lexer that creates and emits tokens into the parser, it is the parser which maintains the list of what token types it expects. It is there where new token types have to be added.

The isa operator did not need to make any changes in the parser, so for today's article we'll look instead at the recently-added try/catch syntax, which did. That was first added in this commit, though subsequent modifications have been made to it. Go take a look now - perhaps you will find parts of it recognisable, similar to the changes we've already seen with isa and made for our new banana feature we have been building up.

Similar to the situation with features, warnings, and opcodes, the parser is maintained primarily by changes to one source file which is then run through a regeneration script to update several other files that are generated from it. The source of truth in this case is the file perly.y, and the regeneration script for it is regen_perly.pl (neither of which live in the regen directory for reasons lost to the mists of time).

The part of the try/catch commit which updated the parser generation file had two parts to it: (github.com/Perl/perl5).

--- a/perly.y
+++ b/perly.y
@@ -69,6 +69,7 @@
 %token <ival> FORMAT SUB SIGSUB ANONSUB ANON_SIGSUB PACKAGE USE
 %token <ival> WHILE UNTIL IF UNLESS ELSE ELSIF CONTINUE FOR
 %token <ival> GIVEN WHEN DEFAULT
+%token <ival> TRY CATCH
 %token <ival> LOOPEX DOTDOT YADAYADA
 %token <ival> FUNC0 FUNC1 FUNC UNIOP LSTOP
 %token <ival> MULOP ADDOP
@@ -459,6 +460,31 @@ barestmt:  PLUGSTMT
                                  newFOROP(0, NULL, $mexpr, $mblock, $cont));
                          parser->copline = (line_t)$FOR;
                        }
+       |       TRY mblock[try] CATCH PERLY_PAREN_OPEN 
+                       { parser->in_my = 1; }
+               remember scalar 
+                       { parser->in_my = 0; intro_my(); }
+               PERLY_PAREN_CLOSE mblock[catch]
+                       {
...
+                       }
        |       block cont
                        {
                          /* a block is a loop that happens once */

Of these two parts, the first is the bit that defines two new token types. These are types we can use in the lexer - recall from the previous part we saw the lexer emit these tokens as PREBLOCK(TRY) and PREBLOCK(CATCH).

The second part of this change gives the actual parsing rules which the parser uses to recognise the new syntax. This appears in the form of a new alternative to the set of possible rules that the parser may use to create a barestmt (each alternative is separated by | characters). The rules on how to recognise this one are made from a mix of basic tokens (those in capitals) and other grammar rules (those in lower case). The four basic tokens here are the keyword try, an open and close parenthesis pair (named represented by tokens called PERLY_PAREN_OPEN and PERLY_PAREN_CLOSE) and the keyword catch.

In effect we can imagine if the rule were expressed instead using literal strings:

barestmt =
    ...
    | "try" mblock "catch" "(" scalar ")" mblock

The other grammar rules that are referred to by this one define the basic shape of a block of code (the one called mblock), and a single scalar variable (the one called scalar). The other parts that I omitted in this simplified version (remember and the two action blocks relating to parser->in_my) are involved with ensuring that the catch variable part of the syntax is recognised as creating a new variable. It pretends that there had been a my keyword just before the variable name, so the name introduces a new variable.

Don't worry too much about the contents of the main action block for this try/catch syntax rule. That's all specific to how to build up the optree for this particular syntax, and in any case was changed in a later commit to move most of it out to a helper function. We'll come back in a moment to see what we can put there for our new syntax.

Lets now begin adding the tokenizing and parsing rules for our new banana feature. Recall from part 5 we decided we'd add two new token types to represent the two basic keywords. We can do that by adding them to the collection of tokens at the top of the perly.y file and running the regeneration script:

leo@shy:~/src/bleadperl/perl [git]
$ nvim perly.y 

leo@shy:~/src/bleadperl/perl [git]
$ git diff perly.y
diff --git a/perly.y b/perly.y
index 184fb0c158..7bbb64f202 100644
--- a/perly.y
+++ b/perly.y
@@ -77,6 +77,7 @@
 %token <ival> LOCAL MY REQUIRE
 %token <ival> COLONATTR FORMLBRACK FORMRBRACK
 %token <ival> SUBLEXSTART SUBLEXEND
+%token <ival> BAN ANA
 
 %type <ival> grammar remember mremember
 %type <ival>  startsub startanonsub startformsub

leo@shy:~/src/bleadperl/perl [git]
$ perl regen_perly.pl 
Changed: perly.act perly.tab perly.h

At this point if you want you could take a look at the change the script introduced in perly.h - it just adds the two new token types to the main enum yytokentype, where the tokizer and the parser can use them. Don't worry about the other two files (perly.act and perly.tab) - they are long tables of automatically generated output; mostly numbers which help the parser to maintain its internal state. The change there won't be particularly meaningful to look at.

As these new token types now exist in perly.h we can use them to update toke.c to recognise them:

leo@shy:~/src/bleadperl/perl [git]
$ nvim toke.c 

leo@shy:~/src/bleadperl/perl [git]
$ git diff toke.c
diff --git a/toke.c b/toke.c
index 628a79fb43..9f86e110ce 100644
--- a/toke.c
+++ b/toke.c
@@ -7686,6 +7686,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_accept:
         LOP(OP_ACCEPT,XTERM);
 
+    case KEY_ana:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__BANANA), "banana is experimental");
+        TOKEN(ANA);
+
     case KEY_and:
         if (!PL_lex_allbrackets && PL_lex_fakeeof >= LEX_FAKEEOF_LOWLOGIC)
             return REPORT(0);
@@ -7694,6 +7699,11 @@ yyl_word_or_keyword(pTHX_ char *s, STRLEN len, I32 key, I32 orig_keyword, struct
     case KEY_atan2:
         LOP(OP_ATAN2,XTERM);
 
+    case KEY_ban:
+        Perl_ck_warner_d(aTHX_
+            packWARN(WARN_EXPERIMENTAL__BANANA), "banana is experimental");
+        TOKEN(BAN);
+
     case KEY_bind:
         LOP(OP_BIND,XTERM);

Now we can rebuild perl and test some examples:

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'use feature "banana"; say ban "a string here" ana;'
banana is experimental at -e line 1.
banana is experimental at -e line 1.
syntax error at -e line 1, near "say ban"
Execution of -e aborted due to compilation errors.

We get our expected warnings about the experimental syntax, and then a syntax error. This is because, while the lexer recognises our keywords, we haven't yet written a parser rule to tell the parser what to do with it. But we can at least tell the lexer recognised the keywords, because if we test without enabling the feature we get a totally different error:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'say ban "a string here" ana;'
Bareword found where operator expected at -e line 1, near ""a string here" ana"
        (Missing operator before ana?)
syntax error at -e line 1, near ""a string here" ana"
Execution of -e aborted due to compilation errors.

Lets now add a grammar rule to let the parser recognise this syntax:

leo@shy:~/src/bleadperl/perl [git]
$ nvim perly.y 

leo@shy:~/src/bleadperl/perl [git]
$ git diff perly.y
...
                    SUBLEXSTART listexpr optrepl SUBLEXEND
                        { $$ = pmruntime($PMFUNC, $listexpr, $optrepl, 1, $<ival>2); }
+       |       BAN expr ANA
+                       { $$ = newUNOP(OP_BANANA, 0, $expr); }
        |       BAREWORD
        |       listop
...

leo@shy:~/src/bleadperl/perl [git]
$ make -j4 perl

With this new definition our new syntax:

  • is recognised as a basic term expression, meaning it can stand in the same parts of syntax as other expressions such as constants or variables
  • requires an expr expression between the ban and ana keywords, meaning it will accept any sort of complex expression such as a string concatenation operator or function call

After the grammar rule which tells the parser how to recognise the new syntax, we've added a block of code telling it how to implement it. This is translated into some real C code that forms part of the parser, so we can invoke any bits of perl interpreter internals from here. When it gets translated a few special variables are replaced in the code - these are the ones prefixed with $ symbols. The $$ variable is where the parser is expecting to find the output of this particular grammar rule; it's where we put the optree we construct to represent it. For arguments into that we can use the other variable, named after the sub-rule used to parse it - $expr. That will contain the output of parsing that part of the syntax - again an optree.

In this action block it is now a simple matter of generating an optree for the OP_BANANA opcode we added in part 4. Recall that was an op of type UNOP, so we use the newUNOP() function to do this, taking as its child subtree the expression between the two keywords which we got in $expr. We just put that result into the $$ result variable, and we're done.

Now we can try using it:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -E 'use feature "banana"; say ban "a string here" ana;'
banana is experimental at -e line 1.
banana is experimental at -e line 1.
panic: we have no bananas at -e line 1.

Hurrah! We get the panic message we added as a placeholder when we created the Perl_pp_banana function back in part 4. The pieces are now starting to come together - in the next part we'll start implementing the actual behaviour behind this syntax.

Lets not forget to add the new "experimental" warnings to pod/perldiag.pod in order to keep the porting test happy:

leo@shy:~/src/bleadperl/perl [git]
$ nvim pod/perldiag.pod 

$ git diff pod/perldiag.pod
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 98d159dc21..66b0a4aa40 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -519,6 +519,11 @@ wasn't a symbol table entry.
 (P) An internal request asked to add a scalar entry to something that
 wasn't a symbol table entry.
 
+=item banana is experimental
+
+(S experimental::banana) This warning is emitted if you use the banana
+syntax (C<ban> ... C<ana>). This syntax is currently experimental.
+
 =item Bareword found in conditional
 

For now there's one last thing we can look at. Even though we don't have an implementation behind the syntax, we can at least compile it into an optree. We can inspect the generated optree by using the -MO=Concise compiler backend:

leo@shy:~/src/bleadperl/perl [git]
$ ./perl -Ilib -MO=Concise -E 'use feature "banana"; say ban "a string here" ana;'
banana is experimental at -e line 1.
banana is experimental at -e line 1.
7  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter v ->2
2     <;> nextstate(main 3 -e:1) v:%,us,{,fea=15 ->3
6     <@> say vK ->7
3        <0> pushmark s ->4
5        <1> banana sK/1 ->6
4           <$> const(PV "a string here") s ->5
-e syntax OK

I won't go into the full details here - for that you can read the documentation at B::Concise. For now I'll just remark that we can see the banana op here, as an UNOP (the 1 flag before it), sitting in the optree as a child node of say, with the string constant as its own child op. When working with optree parsing, the B::Concise module is a handy debugging tool you can use to inspect the generated optree and ensure it has the shape you expected.

Index | < Prev | Next >

No comments:

Post a Comment