Parsing text that requires lookahead using nom

tl;dr: I'm struggling to find documentation or examples of text parsers that require lookahead using nom.
Long version
I'm using nom to parse 6502 assembly. I'm struggling with creating a parser that can parse the various addressing modes. Any given opcode will have the following format:
XXX AM
Where XXX is a three-character mnemonic and AM is the operand. The operand can take many forms and is referred to as the "addressing mode." I've defined an enum for the operands, an enum for the addressing modes, and an OpCode tuple struct containing these values, which is ultimately the result returned when parsing.
The addressing mode can be omitted completely, in which case it is Implied, or it can have a literal value of A, which is the Accumulator addressing mode.
Many of the addressing modes refer to memory locations, and it's these addressing modes I'm struggling to parse. In particular, if an addressing mode specifies a single byte in the form of $00, it is a ZeroPage addressing mode, whereas an operand specifying two bytes in the form of $0000 is an Absolute addressing mode. To complicate the matter, there are indexed variants of these addressing modes in the form of $00,X, $00,Y, $0000,X, etc.
Are there any good examples of existing text parsers that would illustrate the correct way to parse values that all start similarly ($00...) but are differentiated by how they end? The nom documentation is not very comprehensive, and the best example I've found is the INI parser, which isn't doing anything as complex as what I'm trying to accomplish. I've also looked at the syn source code, but it uses a lot of custom macros and is a pretty complex beast, making it hard to learn from.

One way of doing this is with the alt!() macro.
The idea is to have a parser which tries each alternative in sequence. So if you already have parsers for each of the addressing modes separately, you can combine them into a parser for any of them:
// The sub-parsers all return Operand too.
named!(parse_operand<&str, Operand>,
    alt!(parse_absolute_indexed |
         parse_absolute |
         parse_zeropage_indexed |
         parse_zeropage |
         parse_implied));
Some notes:
The order may be important; I've put parse_absolute after parse_absolute_indexed since parse_absolute would otherwise match the initial part of an indexed operand and return too early.
A variant would be to include end-of-line matching (including comments, if applicable) in each sub-parser; then it couldn't match early.
If you're parsing to the end of the input without a byte/character which terminates the pattern (such as a newline) then you may need to use alt_complete!() instead of alt!(). The reason for this is that if you try matching ADD $00, the parser which might match ADD $0000 has to assume that it might still match if more input arrives, and alt!() won't then skip to the next case. Using alt_complete!(), or alternatively wrapping the inner matchers in complete!(), is saying that an incomplete match is a non-match.
If the parsers were very complicated it might mean doing extra work (trying each parse in sequence) compared to a parser generated by e.g. the venerable yacc, but I don't think it's an issue in this case.
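
To make this concrete, here's a rough sketch of how the $00 / $0000 distinction could look using nom's newer function-combinator API (nom 5+ style rather than the named!/alt!() macros above). The Operand variants are hypothetical names based on the question's description, and the indexed forms are left out:

use nom::{
    branch::alt,
    bytes::complete::{tag, take_while_m_n},
    combinator::{map, map_res},
    sequence::preceded,
    IResult,
};

// Hypothetical operand variants, following the question's description.
#[derive(Debug, PartialEq)]
enum Operand {
    ZeroPage(u8),
    Absolute(u16),
}

// Exactly two hex digits, e.g. "00".
fn hex_u8(input: &str) -> IResult<&str, u8> {
    map_res(
        take_while_m_n(2, 2, |c: char| c.is_ascii_hexdigit()),
        |s: &str| u8::from_str_radix(s, 16),
    )(input)
}

// Exactly four hex digits, e.g. "0000".
fn hex_u16(input: &str) -> IResult<&str, u16> {
    map_res(
        take_while_m_n(4, 4, |c: char| c.is_ascii_hexdigit()),
        |s: &str| u16::from_str_radix(s, 16),
    )(input)
}

// Try the longer (absolute) form first, so "$0000" isn't claimed early
// by the zero-page parser.
fn parse_operand(input: &str) -> IResult<&str, Operand> {
    preceded(
        tag("$"),
        alt((
            map(hex_u16, Operand::Absolute),
            map(hex_u8, Operand::ZeroPage),
        )),
    )(input)
}

Because the complete variants of the byte combinators are used here, an input that is too short simply fails to match, so the alt_complete!()/complete!() concern from the macro API doesn't arise.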

Related

Writing a lexer for a context-sensitive markup language that has recursive structures such as nested lists

I'm working on a reStructuredText transpiler in Rust, and am in need of some advice concerning how lexing should be structured in languages that have recursive structures. For example lists within lists are possible in rST:
* This is a list item
  * This is a sub list item
* And here we are at the preceding indentation level again.
The default docutils.parsers.rst took the approach of scanning the input one line at a time:
The reStructuredText parser is implemented as a state machine, examining its
input one line at a time.
The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state). It tries to match the current line to the regex based on the current state and runs match_method while transitioning to the next_state if a match succeeds, doing this until it runs out of lines to scan.
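
To make the shape of such a table concrete, a hypothetical Rust rendering (assuming the regex crate; none of these names come from docutils or your code) might look like:

use regex::Regex;

#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Body,
    BulletList,
}

struct Rule {
    pattern: Regex,
    on_match: fn(&str) -> Vec<String>, // produce tokens for the matched line
    next_state: State,
}

// Try the current state's rules, in order, against one line; if nothing
// matches, emit nothing and stay in the same state.
fn step(rules_for_state: &[Rule], current: State, line: &str) -> (Vec<String>, State) {
    for rule in rules_for_state {
        if rule.pattern.is_match(line) {
            return ((rule.on_match)(line), rule.next_state);
        }
    }
    (Vec::new(), current)
}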
My question then is, is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars iterator of the source and eat away at the source while trying to match against structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now run into the realization that handling recursive body level structures like nested lists is going to be a pain in the butt. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states for matching against indentations before new lines and such.
Would it be better to simply have an iterator over the lines of the source and match on a per-line basis, and if a line such as
* this is an indented list item
is encountered in State::Body, simply transition to a state such as State::BulletList and start lexing lines based on the rules specified there? The above line could be lexed for example as a sequence
TokenType::Indent, TokenType::Bullet, TokenType::BodyText
Any thoughts on this?
I don't know much about rST. But you say it has "recursive" structures. If that's the case, you can't fully lex it as a recursive structure using just state machines or regexes or even lexer generators.
But this is the wrong way to think about it. The lexer's job is to identify the atoms of the language. A parser's job is to recognize structure, especially if it is recursive (yes, parsers often build trees recording the recursive structures they found).
So build the lexer ignoring context if you can, and use a parser to pick up the recursive structures if you need them. You can read more about the distinction in my SO answer about Parsers Vs. Lexers https://stackoverflow.com/a/2852716/120163
If you insist on doing all of this in the lexer, you'll need to augment it with a pushdown stack to track the recursive structures. Then what you are building is a sloppy parser disguised as a lexer. (You will probably still want a real parser to process the output of this "lexer".)
A pushdown stack is actually useful if the language has different atoms in different contexts, especially if the contexts nest; in this case what you want is a mode stack that you change as the lexer encounters tokens that indicate a switch from one mode to another. A really useful extension of this idea is to have mode changes select what amounts to different lexers, each of which produces lexemes unique to that mode.
As an example, you might do this to lex a language that contains embedded SQL. We build parsers for JavaScript; our lexer uses a pushdown stack to process the content of regexp literals and to track nesting of { ... }, [ ... ] and ( ... ). (This has arguably a downside: it rejects versions of jQuery.js that contain malformed regexes [yes, they exist]. JavaScript doesn't care if you define a bad regex literal and never use it, but that seems pretty pointless.)
A special case of the stack occurs if you only have to track single "(" ... ")" pairs or the equivalent. In this case you can use a counter to record how many "pushes" or "pops" you might have done on a real stack. If you have two or more pairs of tokens like this, counters don't work.
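
A small Rust sketch of that counter-versus-stack distinction (illustrative only, not tied to any particular lexer):

// One kind of bracket pair: a counter is enough.
fn balanced_parens(s: &str) -> bool {
    let mut depth: i32 = 0;
    for c in s.chars() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth < 0 {
                    return false;
                }
            }
            _ => {}
        }
    }
    depth == 0
}

// Several kinds of pairs that must nest correctly: a real stack is needed.
fn balanced_brackets(s: &str) -> bool {
    let mut stack = Vec::new();
    for c in s.chars() {
        match c {
            '(' | '[' | '{' => stack.push(c),
            ')' => if stack.pop() != Some('(') { return false; },
            ']' => if stack.pop() != Some('[') { return false; },
            '}' => if stack.pop() != Some('{') { return false; },
            _ => {}
        }
    }
    stack.is_empty()
}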

Is Pug context free?

I was thinking of writing a Pug parser. Aside from the indentation, which is well known to be context-sensitive (and which can be trivially hacked with a lexer feedback loop to make the language almost context-free, the approach adopted by Python), what else makes it not context-free?
XML tags are definitely not context-free, in that each start tag needs to match an end tag, but Pug has no such restriction, which makes me wonder whether we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except in the case of a finite list of special-case tagnames, then you could theoretically do the parse with a CFG. (However, the various error recovery rules in HTML-5 are far from context free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)

Extracting from a .bib file with Raku (previously known as Perl 6)

I have this .bib file for reference management while writing my thesis in LaTeX:
@article{garg2017patch,
title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
journal={Journal of Cosmetic Dermatology},
year={2017},
publisher={Wiley Online Library}
}
@article{hauso2008neuroendocrine,
title={Neuroendocrine tumor epidemiology},
author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
journal={Cancer},
volume={113},
number={10},
pages={2655--2664},
year={2008},
publisher={Wiley Online Library}
}
@article{siperstein1997laparoscopic,
title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
journal={Surgery},
volume={122},
number={6},
pages={1147--1155},
year={1997},
publisher={Elsevier}
}
If anyone wants to know what a .bib file is, you can find it detailed here.
I'd like to parse this with Perl 6 to extract the key along with the title like this:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Can you please help me to do this, maybe in two ways:
Using basic Perl 6
Using a Perl 6 Grammar
TL;DR
A complete and detailed answer that does exactly as @Suman asks.
An introductory general answer to "I want to parse X. Can anyone help?"
A one-liner in a shell
I'll start with terse code that's perfect for some scenarios[1], and which someone might write if they're familiar with shell and Raku basics and in a hurry:
> raku -e 'for slurp() ~~ m:g / "@article\{" (<-[,]>+) \, \s+
"title=\{" (<-[}]>+) \} / -> $/ { put "$0: $1\n" }' < derm.bib
This produces precisely the output you specified:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Same single statement, but in a script
Skipping shell escapes and adding:
Whitespace.
Comments.
► use tio.run to run the code below
for slurp() # "slurp" (read all of) stdin and then
~~ m :global # match it "globally" (all matches) against
/ '@article{' (<-[,]>+) ',' \s+ # a "nextgen regex" that uses (`(...)`) to
'title={' (<-[}]>+) '}' / # capture the article id and title and then
-> $/ { put "$0: $1\n" } # for each article, print "article id: title".
Don't worry if the above still seems like pure gobbledygook. Later sections explain the above while also introducing code that's more general, clean, and readable.[2]
Four statements instead of one
my \input = slurp;
my \pattern = rule { '@article{' ( <-[,]>+ ) ','
                     'title={'   ( <-[}]>+ ) }
my \articles = input .match: pattern, :global;
for articles -> $/ { put "$0: $1\n" }
my declares a lexical variable. Raku supports sigils at the start of variable names. But it also allows devs to "slash them out" as I have done.
my \pattern ...
my \pattern = rule { '@article{' ( <-[,]>+ ) ','
                     'title={'   ( <-[}]>+ ) }
I've switched the pattern syntax from / ... / in the original one-liner to rule { ... }. I did this to:
Eliminate the risk of pathological backtracking
Classic regexes risk pathological backtracking. That's fine if you can just kill a program that's gone wild, but click the link to read how bad it can get! 🤪 We don't need backtracking to match the .bib format.
Communicate that the pattern is a rule
If you write a good deal of pattern matching code, you'll frequently want to use rule { ... }. A rule eliminates any risk of the classic regex problem just described (pathological backtracking), and has another superpower. I'll cover both aspects below, after first introducing the adverbs corresponding to those superpowers.
Raku regexes/rules can be (often are) used with "adverbs". These are convenient shortcuts that modify how patterns are applied.
I've already used an adverb in the earlier versions of this code. The "global" adverb (specified using :global or its shorthand alias :g) directs the matching engine to consume all of the input, generating a list of as many matches as it contains, instead of returning just the first match.
While there are shorthand aliases for adverbs, some are used so repeatedly that it's a lot tidier to bundle them up into distinct rule declarators. That's why I've used rule. It bundles up two adverbs appropriate for matching many data formats like .bib files:
:ratchet (alias :r)
:sigspace (alias :s)
Ratcheting (:r / :ratchet) tells the compiler that when an "atom" (a sub-pattern in a rule that is treated as one unit) has matched, there can be no going back on that. If an atom further on in the pattern in the same rule fails, then the whole rule immediately fails.
This eliminates any risk of the "pathological backtracking" discussed earlier.
Significant space handling (:s / :sigspace) tells the compiler that literal spacing following an atom in the pattern indicates that a "token boundary" pattern, aka ws, should be appended to the atom.
Thus this adverb deals with tokenizing. Did you spot that I'd dropped the \s+ from the pattern compared to the original one in the one-liner? That's because :sigspace, which use of rule implies, takes care of that automatically:
say 'x#y x # y' ~~ m:g:s /x\#y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g /x \# y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g:s /x \# y/; # (「x#y」 「x # y」) <-- two matches
You might wonder why I've reverted to using / ... / to show these two examples. Turns out that while you can use rule { ... } with the .match method (described in the next section), you can't use rule with m. No problem; I just used :s instead to get the desired effect. (I didn't bother to use :r for ratcheting because it makes no difference for this pattern/input.)
To round out this dive into the difference between classic regexes (which can also be written regex { ... }) and rule rules, let me mention the other main option: token. The token declarator implies the :ratchet adverb, but not the :sigspace one. So it also eliminates the pathological backtracking risk of a regex (or / ... /) but, just like a regex, and unlike a rule, a token ignores whitespace used by a dev in writing out the rule's pattern.
my \articles = input .match: pattern, :global
This line uses the method form (.match) of the m routine used in the one-liner solution.
The result of a match when :global is used is a list of Match objects rather than just one. In this case we'll get three, corresponding to the three articles in the input file.
for articles -> $/ { put "$0: $1\n" }
This for statement successively binds a Match object corresponding to each of the three articles in your sample file to the symbol $/ inside the code block ({ ... }).
Per Raku doc on $/, "$/ is the match variable, so it usually contains objects of type Match.". It also provides some other conveniences; we take advantage of one of these conveniences related to numbered captures:
The pattern that was matched earlier contained two pairs of parentheses;
The overall Match object ($/) provides access to these two Positional captures via Positional subscripting (postfix []), so within the for's block, $/[0] and $/[1] provide access to the two Positional captures for each article;
Raku aliases $0 to $/[0] (and so on) for convenience, so most devs use the shorter syntax.
Interlude
This would be a good time to take a break. Maybe just a cuppa, or return here another day.
The last part of this answer builds up and thoroughly explains a grammar-based approach. Reading it may provide further insight into the solutions above and will show how to extend Raku's parsing to more complex scenarios.
But first...
A "boring" practical approach
I want to parse this with Raku. Can anyone help?
Raku may make writing parsers less tedious than with other tools. But less tedious is still tedious. And Raku parsing is currently slow.
In most cases, the practical answer when you want to parse well known formats and/or really big files is to find and use an existing parser. This might mean not using Raku at all, or using an existing Raku module, or using an existing non-Raku parser in Raku.
A suggested starting point is to search for the file format on modules.raku.org or raku.land. Look for a publicly shared parsing module already specifically packaged for Raku for the given file format. Then do some simple testing to see if you have a good solution.
At the time of writing there are no matches for 'bib'.
Even if you don't know C, there's almost certainly a 'bib' parsing C library already available that you can use. And it's likely to be the fastest solution. It's typically surprisingly easy to use an external library in your own Raku code, even if it's written in another programming language.
Using C libs is done using a feature called NativeCall. The doc I just linked may well be too much or too little, but please feel free to visit the freenode IRC channel #raku and ask for help. (Or post an SO question.) We're friendly folk. :)
If a C lib isn't right for a particular use case, then you can probably still use packages written in some other language such as Perl, Python, Ruby, Lua, etc. via their respective Inline::* language adapters.
The steps are:
Install a package (that's written in Perl, Python or whatever);
Make sure it runs on your system using a compiler of the language it's written for;
Install the appropriate Inline language adapter that lets Raku run packages in that other language;
Use the "foreign" package as if it were a Raku package containing exported Raku functions, classes, objects, values, etc.
(At least, that's the theory. Again, if you need help, please pop on the IRC channel or post an SO question.)
The Perl adapter is the most mature so I'll use that as an example. Let's say you use Perl's Text::BibTeX packages and now wish to use Raku with that package. First, set it up as it's supposed to be per its README. Then, in Raku, write something like:
use Text::BibTeX::BibFormat:from<Perl5>;
...
@blocks = $entry.format;
Explanation of these two lines:
The first line is how you tell Raku that you wish to load a Perl module.
(It won't work unless Inline::Perl5 is already installed and working. But it should be if you're using a popular Raku bundle. And if not, you should at least have the module installer zef so you can run zef install Inline::Perl5.)
The last line is just a mechanical Raku translation of the @blocks = $entry->format; line from the SYNOPSIS of the Perl package Text::BibTeX::BibFormat.
A Raku grammar / parser
OK. Enough "boring" practical advice. Let's now try to have some fun creating a grammar-based Raku parser good enough for the example from your question.
► use glot.io to run the code below
unit grammar bib;
rule TOP { <article>* }
rule article { '@article{' $<id>=<-[,]>+ ','
               <kv-pairs>
               '}'
             }
rule kv-pairs { <kv-pair>* % ',' }
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
With this grammar in place, we can now write something like:
die "Use CommaIDE?" unless bib .parsefile: 'derm.bib';
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }
to generate exactly the same output as the previous solutions.
When a match or parse fails, by default Raku just returns Nil, which is, well, rather terse feedback.
There are several nice debugging options to figure out what's going on with a regex or grammar, but the best option by far is to use CommaIDE's Grammar-Live-View.
If you haven't already installed and used Comma, you're missing one of the best parts of using Raku. The features built in to the free version of Comma ("Community Edition") include outstanding grammar development / tracing / debugging tools.
Explanation of the 'bib' grammar
unit grammar bib;
The unit declarator is used at the start of a source file to tell Raku that the rest of the file declares a named package of code of a particular type.
The grammar keyword specifies a grammar. A grammar is like a class, but contains named "rules" -- not just named methods, but also named regexes, tokens, and rules. A grammar also inherits a bunch of general purpose rules from a base grammar.
rule TOP {
Unless you specify otherwise, parsing methods (.parse and .parsefile) that are called on a grammar start by calling the grammar's rule named TOP (declared with a rule, token, regex, or method declarator).
As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't risk pathological backtracking.)
But I've used a rule. Like token patterns, rules also avoid the pathological backtracking risk. But, in addition rules interpret some whitespace in the pattern to be significant, in a natural manner. (See this SO answer for precise details.)
rules are typically appropriate towards the top of the parse tree. (Tokens are typically appropriate towards the leaves.)
rule TOP { <article>* }
The space at the end of the rule (between the * and pattern closing }) means the grammar will match any amount of whitespace at the end of the input.
<article> invokes another named rule in this grammar.
Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.
rule article { '@article{' $<id>=<-[,]>+ ','
               <kv-pairs>
               '}'
             }
If you compare this article pattern with the ones I wrote for the earlier Raku rules based solutions, you'll see various changes:
Rule in original one-liner                    Rule in this grammar
------------------------------------------    -----------------------------------
Kept pattern as simple as possible.           Introduced <kv-pairs> and closing }
No attempt to echo layout of your input.      Visually echoes your input.
<[...]> is the Raku syntax for a character class, like [...] in traditional regex syntax. It's more powerful, but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in the [^,] syntax of ye olde regex. So <-[,]>+ attempts a match of one or more characters, none of which are ,.
$<id>=<-[,]>+ tells Raku to attempt to match the quantified "atom" on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key <id> within the current match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.
rule kv-pairs { <kv-pair>* % ',' }
This pattern illustrates one of several convenient Raku regex features. It declares you want to match zero or more kv-pairs separated by commas.
(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be clearer.
I could have used the ~ approach in the /.../ regex in the one-liner, and vice-versa. But I wanted this grammar solution to continue the progression toward illustrating "better practice" idioms.
Constructing / deconstructing the parse tree
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }
$<article>, $<id> etc. refer to named match objects that are stored somewhere in the "parse tree". But how did they get there? And exactly where is "there"?
Returning to the top of the grammar:
rule TOP {
If a .parse is successful, a single 'TOP' level match object is returned. (After a parse is complete the variable $/ is also bound to that top match object.) During parsing a tree will have been formed by hanging other match objects off this top match object, and then others hung off those, and so on.
Addition of match objects to a parse tree is done by adding either a single generated match object, or a list of them, to either a Positional (numbered) or Associative (named) capture of a "parent" match object. This process is explained below.
rule TOP { <article>* }
<article> invokes a match of the rule named article. An invocation of the rule <article> has two effects:
Raku tries to match the rule.
If it matches, Raku captures that match by generating a corresponding match object and adding it to the parse tree under the key <article> of the parent match object. (In this case the parent is the top match object.)
If the successfully matched pattern had been specified as just <article>, rather than as <article>*, then only one match would have been attempted, and only one value, a single match object, would have been generated and added under the key <article>.
But the pattern was <article>*, not merely <article>. So Raku attempts to match the article rule as many times as it can. If it matches at all, then a list of one or more match objects is stored as the value of the <article> key. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)
$<article> is short for $/<article>. It refers to the value stored under the <article> key of the current match object (which is stored in $/). In this case that value is a list of 3 match objects corresponding to the 3 articles in the input.
rule article { '@article{' $<id>=<-[,]>+ ','
Just as the top match object has several match objects hung off of it (the three captures of article matches that are stored under the top match object's <article> key), so too do each of those three article match objects have their own "child" match objects hanging off of them.
To see how that works, let's consider just the first of the three article match objects, the one corresponding to the text that starts "@article{garg2017patch,...". The article rule matches this article. As it's doing that matching, the $<id>=<-[,]>+ part tells Raku to store the match object corresponding to the id part of the article ("garg2017patch") under that article match object's <id> key.
Hopefully this is enough (quite possibly way too much!) and I can at last exhaustively (exhaustingly?) explain the last line of code, which, once again, was:
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }
At the level of the for, the variable $/ refers to the top of the parse tree generated by the parse that just completed. Thus $<article>, which is shorthand for $/<article>, refers to the list of three article match objects.
The for then iterates over that list, binding $/ within the lexical scope of the -> $/ { ... } block to each of those 3 article match objects in turn.
The $<id> bit is shorthand for $/<id>, which inside the block refers to the <id> key within the article match object that $/ has been bound to. In other words, $<id> inside the block is equivalent to $<article><id> outside the block.
The $<kv-pairs><kv-pair>[0]<value> follows the same scheme, albeit with more levels and a positional child (the [0]) in the midst of all the key (named/ associative) children.
(Note that there was no need for the article pattern to include a $<kv-pairs>=<kv-pairs> because Raku just presumes a pattern of the form <foo> should store its results under the key <foo>. If you wish to disable that, write a pattern with a non-alpha character as the first symbol. For example, use <.foo> if you want to have exactly the same matching effect as <foo> but just not store the matched input in the parse tree.)
Phew!
When the automatically generated parse tree isn't what you want
As if all the above were not enough, I need to mention one more thing.
The parse tree strongly reflects the tree structure of the grammar's rules calling one another from the top rule down to leaf rules. But the resulting structure is sometimes inconvenient.
Often one still wants a tree, but a simpler one, or perhaps some non-tree data structure.
The primary mechanism for generating exactly what you want from a parse, when the automatic results aren't suitable, is make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)
In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree, such as an AST.
Footnotes
[1] Basic Raku is good for exploratory programming, spikes, one-offs, PoCs and other scenarios where the emphasis is on quickly producing working code that can be refactored later if need be.
[2] Raku's regexes/rules scale up to arbitrary parsing, as introduced in the latter half of this answer. This contrasts with past generations of regex which could not.[3]
[3] That said, ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ remains a great and relevant read. Not because Raku rules can't parse (X)HTML. In principle they can. But for a task as monumental as correctly handling full arbitrary in-the-wild XHTML I would strongly recommend you use an existing parser written expressly for that purpose. And this applies generally for existing formats; it's best not to reinvent the wheel. But the good news with Raku rules is that if you need to write a full parser, not just a bunch of regexes, you can do so, and it need not involve going insane!

Predictive editor for Rascal grammar

I'm trying to write a predictive editor for a grammar written in Rascal. The heart of this would be a function taking as input a list of symbols and returning as output a list of symbol types, such that an instance of any of those types would be a syntactically legal continuation of the input symbols under the grammar. So if the input list was [4,+] the output might be [integer]. Is there a clever way to do this in Rascal? I can think of imperative programming ways of doing it, but I suspect they don't take proper advantage of Rascal's power.
That's a pretty big question. Here's a lead towards an answer, but the full answer would be implementing it for you completely :-)
Reify the original grammar for the language you are interested in as a value using the # operator, so that you have a concise representation of the grammar which can be queried easily. The representation is defined in the modules Type, ParseTree (which extends Type) and Grammar.
Construct the same representation for the input query. This could be done in many ways. A kick-ass, language-parametric way would be to extend Rascal's parser algorithm to return partial trees for partial input, but I believe this would be too much hassle for now. An easier solution would entail writing a grammar for a set of partial inputs, i.e. the language grammar with shorter rules at specific points. The grammar will be ambiguous but that is not a problem in this case.
Use tags to tag the "short" rules so that you can find them easily later: syntax E = @short E "+";
Parse with the extended and now ambiguous grammar;
The resulting parse trees will contain the same representation as in ParseTree that you used to reify the original grammar, except in that one the rules are longer, as in prod(E, [E,+,E],...)
Then select the trees which serve you best for the goal of completion (the ones which use the @short tag), and extract their productions "prod", which look like this: prod(E,[E,+],...). For example, using the / operator: [candidate : /candidate:prod(_,_,/"short") := trees], and you could use a cursor position to find candidates which are close by instead of all short trees in there.
Use list matching to find prefixes in the original grammar, like if (/match:prod(_,[*prefix, predicted, *postfix],_) := grammar) ...; prefix is your query as extracted from the @short rules, predicted is your answer and postfix is whatever would come after.
Yield the predicted symbol back as a type for the user to read: "<type(predicted, ())>" (this will pretty print it nicely even if it's some complex regexp type, and does the quoting right, etc.)

Parsing special cases

If I understand correctly, parsing turns a sequence of symbols into a tree. My question is, is it possible to use some standard procedure (LR, LL, PEG, ..?) to parse the following two examples or is it necessary to write a specialized parser by hand?
Python source code, i.e. the whitespace-indented blocks
I think I read somewhere that the parser keeps track of the number of leading spaces, and pretends to replace them with curly brackets to delimitate the blocks. Is it fundamentally required because the standard parsing techniques are not powerful enough or is it for performance reasons?
PNG image format, where a block starts with a header and block size, after which there is the content of the block
The content could contain bytes which resemble some header so it is necessary to "know" that the next x bytes are not to be "parsed", i.e. they should be skipped. How to express this, say, with PEG? In other words, the "closing bracket" is represented by the length of the content.
Neither of the examples in the question are context-free, so strictly speaking they cannot be parsed with context-free grammars. But in practical terms, they are both pretty easy to parse.
The python algorithm is well-described in the Python reference manual (although you need to read that in context.) What's described there is a pre-processing step in which whitespace at the beginning of a line is systematically replaced with INDENT and DEDENT tokens.
To clarify: It's not really a preprocessing step, and it's important to observe that it happens after implicit and explicit line joining. (There are previous sections in the reference manual which describe these procedures.) In particular, lines are implicitly joined inside parentheses, braces and brackets, so the process is intertwined with parsing.
In practical terms, both the line-joining and indentation algorithms can be accomplished programmatically; typically, these would be done inside a custom scanner (tokenizer) which maintains both a stack of parentheses and indent levels. The token stream can then be parsed with normal context-free algorithms, but the tokenizer -- although it might use regular expressions -- needs context-sensitive logic (counting spaces, for example). [Note 1]
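
A rough sketch of that kind of pre-pass (not CPython's tokenizer; tabs, blank lines, line joining, and error handling are all ignored, and all names are illustrative):

#[derive(Debug, PartialEq)]
enum Tok {
    Indent,
    Dedent,
    Line(String),
}

fn indent_tokens(source: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    let mut levels = vec![0usize]; // stack of currently open indentation widths
    for line in source.lines() {
        let width = line.len() - line.trim_start_matches(' ').len();
        if width > *levels.last().unwrap() {
            levels.push(width);
            toks.push(Tok::Indent);
        } else {
            while width < *levels.last().unwrap() {
                levels.pop();
                toks.push(Tok::Dedent);
            }
        }
        toks.push(Tok::Line(line.trim_start().to_string()));
    }
    while levels.len() > 1 {
        levels.pop();
        toks.push(Tok::Dedent);
    }
    toks
}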
Similarly, formats which contain explicit sizes (such as most serialization formats, including PNG files, Google protobufs, and HTTP chunked encoding) are not context-free, but are obviously easy to tokenize since the tokenizer simply has to read the length and then read that many bytes.
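
As a sketch of that kind of length-prefixed reading, using the PNG chunk layout (4-byte big-endian length, 4-byte chunk type, the data, then a 4-byte CRC); the names here are illustrative, not from any PNG library:

struct Chunk<'a> {
    kind: [u8; 4],
    data: &'a [u8],
}

// Read one chunk, returning it plus the remaining input; the tokenizer just
// reads the length and then consumes exactly that many bytes of content.
fn read_chunk(input: &[u8]) -> Option<(Chunk<'_>, &[u8])> {
    if input.len() < 8 {
        return None;
    }
    let len = u32::from_be_bytes([input[0], input[1], input[2], input[3]]) as usize;
    let kind = [input[4], input[5], input[6], input[7]];
    let rest = &input[8..];
    if rest.len() < len + 4 {
        return None; // need the data plus the 4-byte CRC
    }
    Some((
        Chunk { kind, data: &rest[..len] },
        &rest[len + 4..], // continue after the data and the CRC
    ))
}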
There are a variety of context-sensitive formalisms, and these definitely have their uses, but in practical parsing the most common strategy is to use a Turing-equivalent formalism (such as any programming language, possibly augmented with a scanner-generator like flex) for the tokenizer and a context-free formalism for the parser. [Note 2]
Notes:
It may not be immediately obvious that Python indenting is not context-free, since context-free grammars can accept some categories of agreement. For example, {ωω⁻¹ | ω ∈ Σ*} (the language of all even-length palindromes) is context-free, as is {aⁿbⁿ}.
However, these examples can't be extended, because the only count-agreement possible in a context-free language is bracketing. So while palindromes are context-free (you can implement the check with a single stack), the apparently very similar {ωω | ω ∈ Σ*} is not, and neither is {aⁿbⁿcⁿ}.
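For concreteness (this small grammar is added here and is not part of the original answer), the single-stack bracketing that generates {aⁿbⁿ} is just:

S → a S b
S → ε

No analogous context-free grammar exists for {aⁿbⁿcⁿ}, since that would require keeping two independent counts in agreement at once.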
One such formalism is back-references in "regular" expressions, which might be available in some PEG implementation. Back-references allow the expression of a variety of context-sensitive languages, but do not allow the expression of all context-free languages. Unfortunately, regular expressions with back-references really suck in practice, because the problem of determining whether a string matches a regex with back-references is NP-complete. You might find this question on a sister SE site interesting. (And you might want to reformulate your question in a way that could be asked on that site, http://cs.stackexchange.com.)
As a practical matter, almost all parser construction requires some clever hacks around the edges to overcome the limitations of the parsing machinery.
Pure context free parsers can't do Python; all the parser technologies you have listed are weaker than pure-context free, so they can't do it either. A hack in the lexer to keep track of indentation, and generate INDENT/DEDENT tokens, turns the indenting problem into explicit "parentheses", which are easily handled by context-free parsers.
Most binary files can't be processed either, as they usually contain, somewhere, a list of length N, where N is provided before the list body is encountered (this is kind of the example you gave). Again, you can get around this, with a more complicated hack; something must keep a stack of nested list lengths, and the parser has to signal when it moves from one list element to the next. The top-most length counter gets decremented, and the parser gets back a signal "reduce" or "shift". Other more complex linked structures are generally pretty hard to parse this way. Getting the parser to cooperate this way isn't always easy.

Resources