Recursive parse in Red


I want to use parse to skip over Forth-style ifs in the input. Forth-style means each if starts with if and ends with then. All input can be assumed correct, so handling of mismatches is not required.
The problem is that each part of an if can recursively contain any number of other ifs.
Here is my best solution with test cases:
Red []
skip-nested-ifs: [skip to ['if | 'then] skip-nested-ifs-helper]
skip-nested-ifs-helper: ['then | skip-nested-ifs skip-nested-ifs-helper ]
rules: skip-nested-ifs
test-cases: [
    [if a then]
    [if a else b then]
    [if a if b then then]
    [if a if b then 5 then]
    [if a if b then 5 if c then then]
    [if a else if b then then]
    [if a else if b then 5 then]
    [if a else if b then if c then then]
    [if a if b if c then if d then then then]
]
forall test-cases [
    prin [mold test-cases/1 ""]
    print either parse test-cases/1 rules ["OK"] ["FAIL"]
]
The output is:
[if a then] OK
[if a else b then] OK
[if a if b then then] OK
[if a if b then 5 then] FAIL
[if a if b then 5 if c then then] FAIL
[if a else if b then then] OK
[if a else if b then 5 then] FAIL
[if a else if b then if c then then] OK
[if a if b if c then if d then then then] OK
So three of them fail because they contain something (5 in this case) between one then and another.
The fix is probably very simple and obvious, but I don't see it right now. Could you help me fix the rule above, if possible, or show a different one which passes all the tests?

I am not sure if your rule is fixable or not, as it heavily relies on recursion, but fails to provide iteration support which is needed for test #5. I was not able to fix it, as skip is used to consume both terminal and non-terminal tokens (including if), so it makes it hard for me to follow.
I came up with a different solution. It is longer, but passes all your tests (using Red):
rules: [
    'if skip
    opt ['else [some rules | skip]]
    opt some rules
    'then
    opt [some rules | ahead 'then | skip]
]
Notes:
I tried to make the grammar rules as explicit as possible.
Notice the usage of some to iteratively consume the sub-expressions.
The ahead 'then guarding rule is there to prevent skip from consuming an extra then which would be part of a parent expression (in case of a recursive invocation).
It uses skip to pass over the terminal value following then or else, though it is not clear from your description whether there can be more than one value there. Anyway, it is easy to extend for matching more complex patterns if needed.
If you want to use such a rule for skipping input, you can invoke it like this:
skip-ifs: [to 'if rules]
Hope this helps.
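For comparison, the same idea (recurse into a nested if, iterate over everything else until the closing then) can be sketched outside of Red. The following Python is only an illustration, not a translation of the parse rules above: the function name skip_if and the token-list representation are assumptions made here, and the input is assumed to be well-formed, as in the question.

```python
def skip_if(tokens, start=0):
    """Return the index just past the `then` closing the `if` at tokens[start].

    Assumes well-formed input: every `if` has a matching `then`.
    """
    if tokens[start] != "if":
        raise ValueError("expected 'if' at position %d" % start)
    i = start + 1
    # Iterate over the body; recursion handles each nested if as one unit.
    while tokens[i] != "then":
        if tokens[i] == "if":
            i = skip_if(tokens, i)  # jump past the whole nested if ... then
        else:
            i += 1                  # any other word, including `else`
    return i + 1                    # step over the closing `then`


CASES = [
    "if a then",
    "if a else b then",
    "if a if b then then",
    "if a if b then 5 then",
    "if a if b then 5 if c then then",
    "if a else if b then then",
    "if a else if b then 5 then",
    "if a else if b then if c then then",
    "if a if b if c then if d then then then",
]

for case in CASES:
    tokens = case.split()
    print(case, "OK" if skip_if(tokens) == len(tokens) else "FAIL")
```

The loop is what supplies the iteration over sibling values and sibling ifs; the recursive call only ever consumes one complete if ... then.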

Related

Why do we need FOLLOW set in LL(1) grammar parser?

In the generated parsing function, we use an algorithm which peeks at the token list and chooses a rule (alternative) based on the FIRST set of the current non-terminal. If that set contains epsilon (the rule is nullable), the FOLLOW set is checked as well.
Consider following grammar [not LL(1)]:
B : A term
A : N1 | N2
N1 :
N2 :
During calculation of the FOLLOW set, the terminal term will be propagated from A to both N1 and N2, so the FOLLOW set won't help us decide.
On the other hand, if there is exactly one nullable alternative, we know for sure how to continue execution, even when the current token doesn't match anything from the FIRST set (by choosing the epsilon production).
If the above statements are true, the FOLLOW set is redundant. Is it needed only for error handling?

Yes, it is not necessary.
I was asked precisely this question at a colloquium, and my answer, which was accepted, was that the FOLLOW set is used
to check that the grammar is LL(1), and
to fail immediately when an error occurs, instead of dragging the ill-formed token to some later production, where the generated failure message may be unclear,
and for nothing else.

While you can certainly find grammars for which FOLLOW is unnecessary (i.e., it doesn't play a role in the calculation of the parsing table), in general it is necessary.
For example, consider the grammar
S : A | C
A : B a
B : b | epsilon
C : D c
D : d | epsilon
You need to know that
Follow(B) = {a}
Follow(D) = {c}
to calculate
First(A) = {b, a}
First(C) = {d, c}
in order to make the correct choice at S.
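The sets quoted above can be reproduced with the usual fixed-point computation. The sketch below is a minimal Python version for this one grammar; the dictionary layout, the EPS marker, the "$" end marker and the function names are choices made here for illustration, not part of any particular parser generator.

```python
EPS = "epsilon"

# The example grammar: each nonterminal maps to a list of alternatives.
GRAMMAR = {
    "S": [["A"], ["C"]],
    "A": [["B", "a"]],
    "B": [["b"], [EPS]],
    "C": [["D", "c"]],
    "D": [["d"], [EPS]],
}


def first_of_seq(seq, first):
    """FIRST of a symbol sequence; includes EPS only if the whole sequence is nullable."""
    out = set()
    for sym in seq:
        if sym == EPS:
            continue
        f = first[sym] if sym in GRAMMAR else {sym}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                new = first_of_seq(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(first, start="S"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add("$")  # end-of-input marker
    changed = True
    while changed:
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue
                    tail = first_of_seq(alt[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:  # nothing (or only nullable symbols) follows sym
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
for nt in GRAMMAR:
    print(nt, "FIRST =", sorted(FIRST[nt]), "FOLLOW =", sorted(FOLLOW[nt]))
```

In a table-driven parser built from these sets, the entry for B : epsilon is placed under the terminals in FOLLOW(B), and likewise for D.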

Why is this grammar LL(1) even though all the FIRST sets are the same?

Consider the following CFG:
S := AbC
A := aA | epsilon
C := Ac
Here, FIRST(A) = FIRST(B) = FIRST(C) = {a, ε}, so all the FIRST sets are the same. However, this grammar is supposedly LL(1). How is that possible? Wouldn't that mean that there would be a bunch of FIRST/FIRST conflicts everywhere?

There's nothing fundamentally wrong about having multiple nonterminals that have the same FIRST sets. Things only become problematic if you have multiple nonterminals with overlapping FIRST or FOLLOW sets in a context where you have to choose between a number of production options.
As an example, consider this simple grammar:
A → bB | cC
B → b | c
C → b | c
Notice that all three of A, B, and C have the same FIRST set, namely {b, c}. But this grammar is also LL(1). While you can formally convince yourself of this by writing out the actual LL(1) parsing table, you can think of this in another way as well. Imagine you're reading the nonterminal A, and you see the character b. Which production should you pick: A → bB, or A → cC? Well, there's no reason to pick A → cC, because that would put c at the front of your string. So don't pick that one. Instead, pick A → bB. Similarly, suppose you're reading an A and you see the character c. Which production should you pick? You'd never want to pick A → bB, since that would put b at the front of your string. Instead, you'd pick A → cC.
Notice that in this discussion, we never stopped to think about what FIRST(B) or FIRST(C) was. It simply didn't come up because we never needed to know what characters could be produced there.
Now, let's look at your example. If you're trying to expand an S, there's only one possible production to apply, which is S → AbC. So there's no possible conflict here; when you see S, you always apply that rule. Similarly, if you're trying to expand a C, there's no choice of what to do. You have to expand C → Ac.
So now let's think about the nonterminal A, where there really is a choice of what to do next. If you see the character a, then we have to decide - do we expand out A → aA, or do we expand out A → ε? In answering that question, we have to think about the FOLLOW set of A, since the production A → ε would only make sense to pick if we saw a terminal symbol where we basically just want to get A out of the way. Here, FOLLOW(A) = {b, c}, with the b coming from the production S → AbC and the c coming from the production C → Ac. So we'd only pick A → ε if we see b or c, not if we see a. That means that
on reading a, we pick A → aA, and
on reading b or c, we pick A → ε.
Notice that in this discussion we never needed to think about what FIRST(B) or FIRST(C) was. In fact, we never even needed to look at what FIRST(A) was either! So that's why there isn't necessarily a conflict. Were we to encounter a scenario where we needed to compare FIRST(A) against FIRST(B) or something like that, then yes, we'd definitely have a conflict. But that never came up, so no conflict exists.
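The decisions described above can be written down as a small hand-coded predictive parser. This is a sketch in Python under a few assumptions: the input is a plain string of the terminals a, b and c, "$" stands for end of input, and the names parse, parse_S, parse_A and parse_C are made up for this example.

```python
def parse(text):
    """Return True if text is derivable from S, using one symbol of lookahead."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else "$"

    def expect(ch):
        nonlocal pos
        if peek() != ch:
            raise SyntaxError("expected %r at position %d" % (ch, pos))
        pos += 1

    def parse_A():
        if peek() == "a":            # lookahead a: choose A -> aA
            expect("a")
            parse_A()
        elif peek() in ("b", "c"):   # lookahead in FOLLOW(A): choose A -> epsilon
            return
        else:
            raise SyntaxError("unexpected %r at position %d" % (peek(), pos))

    def parse_C():                   # only one production: C -> Ac
        parse_A()
        expect("c")

    def parse_S():                   # only one production: S -> AbC
        parse_A()
        expect("b")
        parse_C()

    try:
        parse_S()
    except SyntaxError:
        return False
    return pos == len(text)          # all input must be consumed


for s in ["bc", "abc", "aabac", "ab", "ba", ""]:
    print(repr(s), parse(s))
```

The only place where the lookahead is inspected to choose between productions is parse_A, which matches the observation that S and C never face a choice.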

Pattern match in Erlang

I am trying to learn some Erlang and got stuck on several Erlang pattern matching problems.
Given the module here:
-module(p1).
-export([f2/1]).
f2([A1, A2 | A1]) -> {A2, A1};
f2([A, true | B]) -> {A, B};
f2([A1, A2 | _]) -> {A1,A2};
f2([_|B]) -> [B];
f2([A]) -> {A};
f2(_) -> nothing_matched.
When I execute p1:f2([x]), I receive an empty list, []. I thought it would match the 5th clause. Can a literal also be an atom?
When I execute p1:f2([[a],[b], a]), the result is {[b], [a]}, which means it matches the first clause. However, I think [a] and a are not the same thing: one is a list and the other is a literal?
Also, when I execute p1:f2([2, 7 div 3 > 2 | [5,3]]), it evaluates to {2, false}. Why does 7 div 3 > 2 come out false? In other languages such as C or Java, I know 7 div 3 == 2, which makes this statement false. But is it the same in Erlang? I just tried it in the shell and it gives me 2.3333333.., which is larger than 2, so it would make this statement true. Can someone give an explanation?

It is because [x] is equal to [x|[]], so it matches f2([_|B]) -> [B];. As you can see, B = [] in your case.
I think you didn't write what you intended. In the expression [A|B], A is the first element of the list, while B is the rest of the list (so it is a list). That means [1,2,1] will not match [A1, A2 | A1], but [[1],2,1] or [[a,b],1,a,b] will.

First, 7 div 3 is 2, and 2 is not greater than 2; it's equal.
Secondly, [x, y] = [x | [y]], because the right (or rest) part is always a list. That's why [[a],[b], a] matches the first clause.
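To make the clause order easier to follow, here is a rough emulation of f2 in Python. It is not Erlang and only approximates the semantics: atoms are modeled as strings, true and false as Python booleans, tuples as tuples, and 7 div 3 as 7 // 3. The slicing tests stand in for the head/tail patterns.

```python
def f2(xs):
    """Emulate the clauses of p1:f2/1 in order, first match wins."""
    if isinstance(xs, list):
        # f2([A1, A2 | A1]): the tail after two elements must equal the first element
        if len(xs) >= 2 and xs[2:] == xs[0]:
            return (xs[1], xs[0])
        # f2([A, true | B])
        if len(xs) >= 2 and xs[1] is True:
            return (xs[0], xs[2:])
        # f2([A1, A2 | _])
        if len(xs) >= 2:
            return (xs[0], xs[1])
        # f2([_ | B]): any non-empty list, B is the (possibly empty) tail
        if len(xs) >= 1:
            return [xs[1:]]
        # f2([A]) can never be reached: the previous clause already matches it
    return "nothing_matched"


print(f2(["x"]))                   # falls through to the [_ | B] clause
print(f2([["a"], ["b"], "a"]))     # tail ['a'] equals first element ['a']
print(f2([2, 7 // 3 > 2, 5, 3]))   # 7 // 3 is 2, and 2 > 2 is False
print(f2([1, 2, 1]))               # tail [1] is not equal to 1
```

Because the tail of a list is itself a list, the first clause can only match when the first element is a list too.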

Eliminating Left Recursion

So I have some grammar that doesn't work for a top-down parser due to it having left recursion:
L::= A a | B b
A::= L b | a a
B::= b B b | b a
So in order to fix this, I have to remove the left recursion. To do this, I do a little substitute-like-thing:
L::= A a | B b
A::= A a b | B b b | a a (I plugged in the possible values of "L")
A then turns to (I believe):
A::= a a A' | B b b
A'::= a b A' | ε
I'm fairly certain that I'm correct up to there (wouldn't be surprised if I'm not, though). Where I'm struggling is now removing the left recursion out of "B b b". I've tried going about this so many ways, and I don't think any of them work. Here's one that seems most logical, but ugly as well (thus saying it's probably wrong). Starting by manipulating B::= b B b | b a
B::= b B b | b a
B::= b B' (they both start with b, so maybe I can pull it out?)
B'::= B b | a
B'::= b B' b | a (substituted B's value in)
B'::= b B" | a
B"::= b B" |a B" | ε
So I guess to show what the finalized B's would be:
B::= b B'
B'::= b B" | a
B"::= b B" | a B" | ε
This seems way too ugly to be correct. Especially since I'd have to plug that into the new "A" terminals that I created.
Can someone help me out? No idea if I'm going about this the right way. I'm supposed to be able to create an LL(1) parse table afterward (should be able to do that part on my own).
Thanks.

In a parser that tries to expand nonterminals from the left, if some nonterminal can expand to a string with itself on the left, the parser may expand that nonterminal forever without actually parsing anything. This is left recursion. On the other hand, it is perfectly fine if a nonterminal expands to a string with some different nonterminal on the left, as long as no chain of expansions produces a string with the original nonterminal on the left. Similarly, it is fine if nonterminals aren't all the way to the right in the expansion, as long as they're not all the way to the left.
tl;dr There's nothing wrong with B b b or b B b. You've removed all the left recursion. You don't need to keep going.
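One way to check this mechanically is to follow only the leftmost symbol of each alternative and see whether a nonterminal can reach itself. The Python sketch below does that for the original grammar from the question; the function names are invented here, and it assumes there are no empty alternatives and no nullable nonterminals (with nullable symbols, the symbols after them would have to be followed as well).

```python
# Original grammar from the question; lowercase symbols are terminals.
GRAMMAR = {
    "L": [["A", "a"], ["B", "b"]],
    "A": [["L", "b"], ["a", "a"]],
    "B": [["b", "B", "b"], ["b", "a"]],
}


def leftmost_reachable(grammar, start):
    """Nonterminals that can appear leftmost after one or more expansions of start."""
    seen = set()
    stack = [start]
    while stack:
        nt = stack.pop()
        for alt in grammar[nt]:
            head = alt[0]  # assumes every alternative is non-empty
            if head in grammar and head not in seen:
                seen.add(head)
                stack.append(head)
    return seen


def is_left_recursive(grammar, nt):
    return nt in leftmost_reachable(grammar, nt)


for nt in GRAMMAR:
    print(nt, "left-recursive" if is_left_recursive(GRAMMAR, nt) else "ok")
```

For this grammar, L and A reach each other through their leftmost symbols, while both alternatives of B start with the terminal b, so B is reported as fine even though B appears in the middle of b B b.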

How to remove left-recursion in the following grammar?

Unfortunately, it is not possible for ANTLR to support direct left recursion when the rule has parameters passed. The only viable option is to remove the left recursion. Is there a way to remove the left recursion in the following grammar?
a[int x]
    : b a[$x] c
    | a[$x - 1]
      (
          c a[$x - 1]
        | b c
      )
    ;
The problem is in the second alternative involving left recursion. Any kind of help would be much appreciated.

Without the parameters and easier formatting, it would look like this:
a
    : b a c
    | a (c a | b c)
    ;
When a's left-recursive alternative is matched n times, it just means that (c a | b c) is matched n times, preceded by the terminating b a c (the first alternative). That means this rule will always start with b a c, followed by zero or more occurrences of (c a | b c):
a
    : b a c (c a | b c)*
    ;
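The rewrite above is an instance of the general pattern A : A α | β becoming A : β α*. As a hedged illustration, the following Python sketch applies the textbook form of that transformation (with a helper nonterminal instead of the * operator) to the parameter-free rule. The function name, the list-of-lists representation and the a_tail helper name are all assumptions made for this example; it is not something ANTLR provides.

```python
def eliminate_immediate_left_recursion(name, alternatives):
    """Rewrite  A : A x | y  as  A : y A_tail  and  A_tail : x A_tail | (empty)."""
    # Alternatives that start with the rule itself, with that leading symbol dropped.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # All remaining alternatives.
    base = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}
    tail = name + "_tail"  # hypothetical helper rule name
    return {
        name: [alt + [tail] for alt in base],
        tail: [alt + [tail] for alt in recursive] + [[]],  # [] is the empty alternative
    }


# a : b a c | a (c a | b c), with the parenthesized group expanded by hand.
RULE = [["b", "a", "c"], ["a", "c", "a"], ["a", "b", "c"]]

for lhs, alts in eliminate_immediate_left_recursion("a", RULE).items():
    shown = [" ".join(alt) if alt else "(empty)" for alt in alts]
    print(lhs, ":", " | ".join(shown))
```

The resulting a_tail rule matches zero or more repetitions of c a or b c, which corresponds to the (c a | b c)* suffix in the rewritten rule. How the [$x - 1] arguments should be threaded through such a rewrite is a separate question that this sketch does not address.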
