In the process of writing a translator of one music language to another (ABC to Alda) as an excuse to learn Raku DSL-ability, I noticed that there doesn't seem to be a way to terminate a .parse! Here is my shortened demo code:
#!/home/hsmyers/rakudo741/bin/perl6
use v6d;
# use Grammar::Debugger;
use Grammar::Tracer;
my $test-n01 = q:to/EOS/;
a b c d e f g
A B C D E F G
EOS
grammar test {
token TOP { <score>+ }
token score {
<.ws>?
[
| <uc>
| <lc>
]+
<.ws>?
}
token uc { <[A..G]> }
token lc { <[a..g]> }
}
test.parse($test-n01).say;
And it is the last part of the Grammar::Tracer display that demonstrates my problem.
| score
| | uc
| | * MATCH "G"
| * MATCH "G\n"
| score
| * FAIL
* MATCH "a b c d e f g\nA B C D E F G\n"
「a b c d e f g
A B C D E F G
」
On the second to last line, the word FAIL tells me that the .parse run has no way of quitting. I wonder if this is correct? The .say displays everything as it should be, so I'm not clear on how real the FAIL is? The question remains, "How do I correctly write a grammar that parses multiple lines without error?"
When you use the grammar debugger, it lets you see exactly how the engine is parsing the string: fails are normal and expected. Consider, for example, matching a+b* with the string aab. You should get two matches for 'a', followed by a fail (because b is not a), but then it will retry with b and successfully match.
This might be more easily seen if you do an alternation with || (which enforces order). If you have
token TOP { I have a <fruit> }
token fruit { apple || orange || kiwi }
and you parse the sentence "I have a kiwi", you'll see it first match "I have a", followed by two fails with "apple" and "orange", and finally a match with "kiwi".
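The same trace can be imitated outside Raku. The following is a minimal Python sketch, not Raku's actual engine: the function name match_fruit and the trace list are made up for illustration. It tries each alternative in order and records the failed attempts that come before the successful one.

```python
def match_fruit(text, pos, trace):
    """Try the alternatives in order, like `apple || orange || kiwi`."""
    for word in ("apple", "orange", "kiwi"):
        if text.startswith(word, pos):
            trace.append(f"MATCH {word}")
            return pos + len(word)
        # A failed alternative is not an error; we just move on to the next one.
        trace.append(f"FAIL {word}")
    return None  # every alternative failed


sentence = "I have a kiwi"
trace = []
end = match_fruit(sentence, len("I have a "), trace)
print(trace)
```

The two FAIL entries are simply the record of alternatives that were tried first; the overall match still succeeds.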
Now let's look at your case:
TOP # Trying to match top (need >1 match of score)
| score # Trying to match score (need >1 match of lc/uc)
| | lc # Trying to match lc
| | * MATCH "a" # lc had a successful match! ("a")
| * MATCH "a " # and as a result so did score! ("a ")
| score # Trying to match score again (because <score>+)
| | lc # Trying to match lc
| | * MATCH "b" # lc had a successful match! ("b")
| * MATCH "b " # and as a result so did score! ("b ")
…………… # …so forth and so on until…
| score # Trying to match score again (because <score>+)
| | uc # Trying to match uc
| | * MATCH "G" # uc had a successful match! ("G")
| * MATCH "G\n" # and as a result, so did score! ("G\n")
| score # Trying to match *score* again (because <score>+)
| * FAIL # failed to match score, because no lc/uc.
|
| # <-------------- At this point, the question is, did TOP match?
| # Remember, TOP is <score>+, so we match TOP if there
| # was at least one <score> token that matched, there was so...
|
* MATCH "a b c d e f g\nA B C D E F G\n" # this is the TOP match
The fail here is normal: at some point we will run out of <score> tokens, so a fail is inevitable. When that happens, the grammar engine can move on to whatever comes after the <score>+ in your grammar. Since there's nothing, that fail actually results in a match of the entire string (because TOP matches with implicit /^…$/).
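To see why the last FAIL is what ends the repetition, here is a rough Python analogue of `<score>+`. It is only a sketch under assumptions: the regular expression SCORE approximates the token score from the question, and parse_top is a hypothetical name. The loop can only stop by having one attempt fail.

```python
import re

# Rough stand-in for: token score { <.ws>? [ <uc> | <lc> ]+ <.ws>? }
SCORE = re.compile(r"\s*[A-Ga-g]+\s*")


def parse_top(text):
    """Emulate `<score>+`: keep matching until one attempt fails."""
    pos, matched, attempts = 0, [], 0
    while True:
        attempts += 1
        m = SCORE.match(text, pos)
        if m is None:
            break  # the final FAIL: this is how the loop learns it is done
        matched.append(m.group())
        pos = m.end()
    # TOP succeeds if there was at least one score and the whole input was used.
    return bool(matched) and pos == len(text), matched, attempts


ok, matched, attempts = parse_top("a b c d e f g\nA B C D E F G\n")
print(ok, len(matched), attempts)
```

There is always exactly one more attempt than there are successful matches, and that extra attempt is the FAIL shown in the tracer output.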
Also, you might consider rewriting your grammar with a rule which inserts <.ws>* automatically (unless it's important for it to be a single space only):
grammar test {
rule TOP { <score>+ }
token score {
[
| <uc>
| <lc>
]+
}
token uc { <[A..G]> }
token lc { <[a..g]> }
}
Further, IME, you might also want to add a proto token for uc/lc, because when you have [ <foo> | <bar> ], one of them will always be undefined, which can make processing them in an actions class a bit annoying. You could try:
grammar test {
rule TOP { <score> + }
token score { <letter> + }
proto token letter { * }
token letter:uc { <[A..G]> }
token letter:lc { <[a..g]> }
}
$<letter> will always be defined this way.
I must be missing something fundamental about recursively defined nonterminals, but all I want to do is recognize something like a regular expression: a series of numbers followed by a series of letters.
from nltk import CFG
import nltk
grammar = CFG.fromstring("""
S -> N L
N -> N | '1' | '2' | '3'
L -> L | 'A' | 'B' | 'C'
""")
from nltk.parse import BottomUpChartParser
parser = nltk.ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
The above code returns an empty parse, and nothing is printed.
The grammar would need to be something like:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N L
N -> N N | '1' | '2' | '3'
L -> L L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
break
Using N -> N N as an example: the first N could be "eaten up" and transformed into a 1 when parsing the sentence, leaving the next N to go on and produce another N -> N N.
But this will result in a lot of possible parses. For something more efficient, you probably want something like this:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N L
N -> '1' N | '2' N | '3' N | '1' | '2' | '3'
L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
break
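To put a number on "a lot of possible parses": a rule of the form N -> N N | terminal admits one tree per binary bracketing of the tokens, and the number of binary trees with n leaves is the Catalan number C(n-1). The sketch below uses only the standard library (it does not call nltk), and the helper names are made up for illustration.

```python
from math import comb


def catalan(n):
    """n-th Catalan number: the number of binary trees with n + 1 leaves."""
    return comb(2 * n, n) // (n + 1)


def ambiguous_parse_count(num_digits, num_letters):
    """Trees admitted by S -> N L with N -> N N | digit and L -> L L | letter."""
    return catalan(num_digits - 1) * catalan(num_letters - 1)


sentence = "1 2 1 3 A C B C".split()
num_digits = sum(tok in {"1", "2", "3"} for tok in sentence)
num_letters = len(sentence) - num_digits
print(ambiguous_parse_count(num_digits, num_letters))
```

For the eight-token sentence this gives 5 * 5 = 25 trees, whereas the right-recursive grammar admits exactly one, and the gap widens quickly as the input grows.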
Regular Version. The language from the question: "one or more numbers followed by one or more letters" or (1,2,3)+(A,B,C)+ is a regular language, so we can represent it with a regular grammar:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N
N -> '1' N | '2' N | '3' N | '1' L | '2' L | '3' L
L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
break
Try all three out and see what the parses look like on different inputs!
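Because the language is regular, a plain regular expression can serve as a quick cross-check of what the grammars should accept. This is only a sketch with Python's re module; it assumes every token is a single character, and the function name accepts is made up for illustration.

```python
import re

# One or more of 1/2/3 followed by one or more of A/B/C.
PATTERN = re.compile(r"[123]+[ABC]+")


def accepts(sentence):
    """Check a space-separated sentence of single-character tokens."""
    tokens = sentence.split()
    return PATTERN.fullmatch("".join(tokens)) is not None


for s in ["1 2 1 3 A C B C", "1 A", "A 1", "1 2 3", "A B"]:
    print(s, accepts(s))
```

Inputs that are digits only, letters only, or letters before digits should be rejected by all three grammars as well.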
I tried to use this code to scramble the characters into different characters and return a new list with those new characters. However, I keep getting errors saying "a list but here has type char" on line 3 and "a list list but given a char list" on line 13. I'm not sure how to fix this. Thanks in advance for the help.
let _scram x =
match x with
| [] -> [] // line 3
| 's' -> 'v'
| 'a' -> 's'
| 'e' -> 'o'
| '_' -> '_'
let rec scramble L P =
match L with
| [] -> P
| hd::t1 -> scramble t1 (P @ (_scram hd))
let L =
let p = ['h'; 'e'; 'l'; 'l'; 'o'] //line 13
scramble p []
That's because you are passing the result of _scram as the second operand of the (@) operator, which concatenates two lists, so the compiler infers that the whole expression has to be a list.
A quick fix is to enclose it in a list: (P @ [_scram hd]). This way _scram hd is inferred to be an element (in this case a char).
Then you will discover your next error: the catch-all wildcard is in quotes, and even if it weren't, you can't use it to bind a value to be used later.
So you can change it to | c -> c.
Then your code will be like this:
let _scram x =
match x with
| 's' -> 'v'
| 'a' -> 's'
| 'e' -> 'o'
| c -> c
let rec scramble L P =
match L with
| [] -> P
| hd::t1 -> scramble t1 (P @ [_scram hd])
let L =
let p = ['h'; 'e'; 'l'; 'l'; 'o']
scramble p []
F# code is defined sequentially. The first error indicates there is some problem with the code up to that point, that is, in the definition of _scram. The line | [] -> [] implies that _scram takes lists to lists. The next line | 's' -> 'v' implies that _scram takes chars to chars. The two are incompatible, and that explains the error.
How can I remove the ambiguity in the following grammar?
E -> E * F | F + E | F
F -> F - F | id
First, we need to find the ambiguity.
Consider the rules for E without F; change F to f and consider it a terminal symbol. Then the grammar
E -> E * f
E -> f + E
E -> f
is ambiguous. Consider f + f * f:
        E                   E
        |                   |
   +----+----+         +----+----+
   |    |    |         |    |    |
   E    *    f         f    +    E
   |                             |
 +-+-+                         +-+-+
 | | |                         | | |
 f + E                         E * f
     |                         |
     f                         f
We can resolve this ambiguity by forcing * or + to take precedence. Typically, * takes precedence in the order of operations, but this is totally arbitrary.
E -> f + E | A
A -> A * f | f
Now, the string f + f * f has just one parsing:
  E
  |
+-+-+
| | |
f + E
    |
    A
    |
  +-+-+
  | | |
  A * f
  |
  f
Now, consider our original grammar which uses F instead of f:
E -> F + E | A
A -> A * F | F
F -> F - F | id
Is this ambiguous? It is. Consider the string id - id - id.
        E                     E
        |                     |
        A                     A
        |                     |
        F                     F
        |                     |
   +----+----+           +----+----+
   |    |    |           |    |    |
   F    -    F           F    -    F
   |         |           |         |
 +-+-+       id          id      +-+-+
 | | |                           | | |
 F - F                           F - F
 |   |                           |   |
 id  id                          id  id
The ambiguity here is that - can be left-associative or right-associative. We can choose the same convention as for +:
E -> F + E | A
A -> A * F | F
F -> id - F | id
Now, we have only one parsing:
    E
    |
    A
    |
    F
    |
+---+---+
|   |   |
id  -   F
        |
    +---+---+
    |   |   |
    id  -   F
            |
            id
Now, is this grammar ambiguous? It is not. Let s be any string in the language.
s will have #(+) +s in it, and we always need to use production E -> F + E exactly #(+) times and then production E -> A once.
s will have #(*) *s in it, and we always need to use production A -> A * F exactly #(*) times and then production A -> F once.
s will have #(-) -s in it, and we always need to use production F -> id - F exactly #(-) times and the production F -> id once.
That s has exactly #(+) +s, #(*) *s and #(-) -s can be taken for granted (the numbers can be zero if not present in s). That E -> A, A -> F and F -> id have to be used exactly once can be shown as follows:
If E -> A is never used, any string derived will still have E, a nonterminal, in it, and so will not be a string in the language (nothing is generated without taking E -> A at least once). Also, every string that can be generated before using E -> A has at most one E in it (you start with one E, and the only other production keeps one E) so it is never possible to use E -> A more than once. So E -> A is used exactly once for all derived strings. The demonstration works the same way for A -> F and F -> id.
That E -> F + E, A -> A * F and F -> id - F are used exactly #(+), #(*) and #(-) times, respectively, is apparent from the fact that these are the only productions that introduce their respective symbols and each introduces one instance.
If you consider the sub-grammars of our resulting grammar, we can prove they are unambiguous as follows:
F -> id - F | id
This is an unambiguous grammar for (id - )*id. The only derivation of (id - )^k id is to use F -> id - F k times and then use F -> id exactly once.
A -> A * F | F
We have already seen that F is unambiguous for the language it recognizes. By the same argument, this is an unambiguous grammar for the language F( * F)*. The derivation of F( * F)^k will require the use of A -> A * F exactly k times and then the use of A -> F. Because the language generated from F is unambiguous and because the language for A unambiguously separates instances of F using *, a symbol not generated by F, the grammar
A -> A * F | F
F -> id - F | id
is also unambiguous. To complete the argument, apply the same logic to the grammar generating (F + )*A from the start symbol E.
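The claims about specific strings can be checked mechanically by counting parse trees. The following Python sketch is a small memoized counter written for this purpose; it is not taken from any library, the names count_parses, ORIGINAL and FIXED are made up, and it assumes a grammar with no epsilon productions and no cycles of unit rules (both grammars here qualify).

```python
from functools import lru_cache


def count_parses(tokens, rules, start):
    """Count the parse trees of `tokens` derivable from `start`.

    `rules` maps each nonterminal to a list of right-hand sides (tuples).
    Assumes no epsilon productions and no cycles of unit rules.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        if symbol not in rules:  # terminal: must cover exactly one equal token
            return int(j - i == 1 and tokens[i] == symbol)
        return sum(count_sequence(rhs, i, j) for rhs in rules[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(rhs, i, j):
        head, rest = rhs[0], rhs[1:]
        if not rest:
            return count_symbol(head, i, j)
        # head covers tokens[i:k]; each remaining symbol needs at least one token
        return sum(
            count_symbol(head, i, k) * count_sequence(rest, k, j)
            for k in range(i + 1, j - len(rest) + 1)
        )

    return count_symbol(start, 0, len(tokens))


ORIGINAL = {
    "E": [("E", "*", "F"), ("F", "+", "E"), ("F",)],
    "F": [("F", "-", "F"), ("id",)],
}
FIXED = {
    "E": [("F", "+", "E"), ("A",)],
    "A": [("A", "*", "F"), ("F",)],
    "F": [("id", "-", "F"), ("id",)],
}

for text in ["id + id * id", "id - id - id"]:
    toks = text.split()
    print(text, count_parses(toks, ORIGINAL, "E"), count_parses(toks, FIXED, "E"))
```

A count above 1 for any string shows a grammar is ambiguous. A count of 1 on a few samples is only a sanity check and does not replace the argument above.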
Removing an ambiguity means choosing one of the possible interpretations. This grammar is about as simple as a grammar for mathematical expressions can be.
To give multiplication a higher priority than addition and subtraction (the last two have the same priority, but are traditionally computed from left to right), you can write the following (in ABNF-like syntax):
expression = addition
addition = multiplication *(("+" / "-") multiplication)
multiplication = identifier *("*" identifier)
identifier = 'a'-'z'
The idea is as follows:
first create your lowest grammar rule: the identifier
continue with the highest priority operation, in your case multiplication: *
create a rule that has this on its right hand side: X *(P X), where X is the previous rule you have created, and P is your operation sign.
if you have more than one operation with the same priority they must be in a group: (P1 / P2 / ...)
continue to do the last two operations until there are no more operations to add.
add your main rule that uses the latest one.
Then for input like: a+b+c*d+e you get this tree:
More advanced tools will get you a tree that has more than two nodes. That means that all multiplications in one addition will be in a list that you can iterate from any direction.
This grammar is easy to extend. To add parentheses, you can do this:
expression = addition
addition = multiplication *(("+" / "-") multiplication)
multiplication = primary *("*" primary)
primary = identifier / "(" expression ")"
identifier = 'a'-'z'
Then for input (a+b)*c you will get this tree:
If you want to add division, you can modify the multiplication rule like this:
multiplication = primary *(("*" / "/") primary)
These are all detailed trees; there are also trees with fewer details, often called abstract syntax trees.
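One way to see the layered rules at work is a small hand-written recursive-descent parser, with one function per rule. This is a sketch in Python rather than the output of any ABNF tool; it assumes single-letter identifiers, and it returns nested tuples of the form (operator, left, right) instead of a drawn tree.

```python
def parse(text):
    """Parse per the layered rules: addition > multiplication > primary."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():  # primary = identifier / "(" expression ")"
        if peek() == "(":
            take()
            node = addition()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return node
        tok = peek()
        if tok is None or not ("a" <= tok <= "z"):
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return take()

    def multiplication():  # multiplication = primary *(("*" / "/") primary)
        node = primary()
        while peek() in ("*", "/"):
            op = take()
            node = (op, node, primary())
        return node

    def addition():  # addition = multiplication *(("+" / "-") multiplication)
        node = multiplication()
        while peek() in ("+", "-"):
            op = take()
            node = (op, node, multiplication())
        return node

    tree = addition()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree


print(parse("a+b+c*d+e"))
print(parse("(a+b)*c"))
```

Each loop folds its operands to the left, so operators of equal priority group from left to right, while the order in which the functions call each other encodes the priority levels.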
I am an ANTLR4 newbie and have problems with a relatively simple grammar. The grammar is given at the end. (This is a fragment from a grammar for parsing descriptions of biological sequence variants.)
I am trying to parse the string "p.A3L" in the following unit test.
@Test
public void testProteinSubtitutionWithoutRef() {
ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
HGVSLexer l = new HGVSLexer(inputStream);
HGVSParser p = new HGVSParser(new CommonTokenStream(l));
p.setTrace(true);
p.addErrorListener(new BaseErrorListener() {
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
int charPositionInLine, String msg, RecognitionException e) {
throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
}
});
p.hgvs();
}
The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume that this is related to lexing, i.e. splitting "A3L" into the three tokens A, 3, and L, such that the parser can then generate the corresponding syntax subtree containing the three terminals from it.
What is going wrong here and where can I learn how to fix this?
The grammar
grammar HGVS;
hgvs: protein_var
;
// Basic lexemes
AA: AA1
| AA3
| 'X';
AA1: 'A'
| 'R'
| 'N'
| 'D'
| 'C'
| 'Q'
| 'E'
| 'G'
| 'H'
| 'I'
| 'L'
| 'K'
| 'M'
| 'F'
| 'P'
| 'S'
| 'T'
| 'W'
| 'Y'
| 'V';
AA3: 'Ala'
| 'Arg'
| 'Asn'
| 'Asp'
| 'Cys'
| 'Gln'
| 'Glu'
| 'Gly'
| 'His'
| 'Ile'
| 'Leu'
| 'Lys'
| 'Met'
| 'Phe'
| 'Pro'
| 'Ser'
| 'Thr'
| 'Trp'
| 'Tyr'
| 'Val';
NUMBER: [0-9]+;
NAME: [a-zA-Z0-9_]+;
// Top-level Rule
/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
;
There are two problems:
Define the rule for protein_var ahead of the lexer rules (it should work as is, too, but it is not easy to read because the other parser rule comes first).
Remove the rule for NAME. A3L is not lexed as AA NUMBER AA (as you probably expected) but as a single NAME token, because ANTLR always prefers the longest matching lexer rule.
The resulting grammar should look like:
grammar HGVS;
hgvs
: protein_var
;
protein_var
: 'p.' AA NUMBER AA
;
AA: ...;
AA3: ...;
AA1: ...;
NUMBER: [0-9]+;
If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAMEs and AA do not have in common or by using lexer modes).
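The longest-match behaviour can be reproduced with a toy lexer. The sketch below is plain Python using the re module, not ANTLR itself; the rule table, the P_DOT name for the literal 'p.', and the tokenize function are assumptions made for illustration. At each position it picks the rule with the longest match and breaks ties in favour of the rule listed first.

```python
import re

AA = (
    r"Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
    r"|[ARNDCQEGHILKMFPSTWYVX]"
)
RULES_WITH_NAME = [
    ("P_DOT", r"p\."),  # stands in for the literal 'p.' in the parser rule
    ("AA", AA),
    ("NUMBER", r"[0-9]+"),
    ("NAME", r"[a-zA-Z0-9_]+"),
]
RULES_WITHOUT_NAME = RULES_WITH_NAME[:-1]


def tokenize(text, rules):
    """Maximal munch: the longest match wins, the earliest rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


print(tokenize("p.A3L", RULES_WITH_NAME))
print(tokenize("p.A3L", RULES_WITHOUT_NAME))
```

With NAME present, A3L comes out as one NAME token, which matches the reported error; without it, the lexer produces AA, NUMBER, AA as the parser rule expects.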
I have a question for the Follow sets of the following rules:
L -> CL'
L' -> epsilon
| ; L
C -> id:=G
|if GC
|begin L end
I have computed that the Follow(L) is in the Follow(L'). Also Follow(L') is in the Follow(L) so they both will contain: {end, $}. However, as L' is Nullable will the Follow(L) contain also the Follow(C)?
I have computed that the Follow(C) = First(L') and also Follow(C) subset Follow(L) = { ; $ end}.
In the answer the Follow(L) and Follow(L') contain only {end, $}, but shouldn't it contain ; as well from the Follow(C) as L' can be null?
Thanks
However, as L' is Nullable will the Follow(L) contain also the Follow(C)?
The opposite. Follow(C) will contain Follow(L). Think of the following sentence:
...Lx...
where x is some terminal and thus is in Follow(L). This could be expanded to:
...CL'x...
and further to:
...Cx...
So what follows L, can also follow C. The opposite is not necessarily true.
To calculate follows, think of a graph where the nodes are pairs (NT, n), meaning the follows of non-terminal NT that are n tokens long (in LL(1), n is either 1 or 0). The graph for yours would look like this:
_______
|/_ \
(L, 1)----->(L', 1) _(C, 1)
| \__________|____________/| |
| | |
| | |
| _______ | |
V |/_ \ V V
(L, 0)----->(L', 0) _(C, 0)
\_______________________/|
Where (X, n)--->(Y, m) means that the follows of length n of X depend on the follows of length m of Y (of course, m <= n). That is, to calculate (X, n), first you should calculate (Y, m), and then you should look at every rule that contains X on the right-hand side and Y on the left-hand side, e.g.:
Y -> ... X REST
take what REST expands to with length n - m for every m in [0, n) and then concat every result with every follow from the (Y, m) set. You can calculate what REST expands to while calculating the firsts of REST, simply by holding a flag saying whether REST completely expands to that first, or partially. Furthermore, add firsts of REST with length n as follows of X too. For example:
S -> A a b c
A -> B C d
C -> epsilon | e | f g h i
Then to find follows of B with length 3 (which are e d a, d a b and f g h), we look at the rule:
A -> B C d
and we take the sentence C d, and look at what it can produce:
"C d" with length 0 (complete):
"C d" with length 1 (complete):
d
"C d" with length 2 (complete):
e d
"C d" with length 3 (complete or not):
f g h
Now we take these and merge with follow(A, m):
follow(A, 0):
epsilon
follow(A, 1):
a
follow(A, 2):
a b
follow(A, 3):
a b c
"C d" with length 0 (complete) concat follow(A, 3):
"C d" with length 1 (complete) concat follow(A, 2):
d a b
"C d" with length 2 (complete) concat follow(A, 1):
e d a
"C d" with length 3 (complete or not) concat follow(A, 0) (Note: follow(X, 0) is always epsilon):
f g h
Which is the set we were looking for. So in short, the algorithm becomes:
Create the graph of follow dependencies
Find the connected components and create a DAG out of it.
Traverse the DAG from the end (from the nodes that don't have any dependency) and calculate the follows with the algorithm above, having calculated firsts beforehand.
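The worked example for the follows of B with length 3 boils down to concatenating token strings and truncating them to k tokens. The following Python sketch reproduces just that step; the helper concat_k and the hard-coded sets are assumptions for illustration, with the expansions of C and the follow of A written out by hand rather than computed from the grammar.

```python
def concat_k(left, right, k):
    """Concatenate every pair of token strings, keeping only the first k tokens."""
    return {(l + r)[:k] for l in left for r in right}


K = 3
c_strings = {(), ("e",), ("f", "g", "h", "i")}  # C -> epsilon | e | f g h i
d_strings = {("d",)}                            # the terminal d in A -> B C d
follow_a = {("a", "b", "c")}                    # what follows A in S -> A a b c

rest = concat_k(c_strings, d_strings, K)        # what "C d" can start with
follow_b = concat_k(rest, follow_a, K)          # padded with the follows of A
print(sorted(follow_b))
```

Strings that already reach length k, such as f g h, are unaffected by the follows of A, which corresponds to the "concat follow(A, 0)" case in the example.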
It's worth noting that the above algorithm is for any LL(k). For LL(1), the situation is much simpler.
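For the LL(1) case, the usual fixed-point computation of FIRST and FOLLOW is enough to answer the original question. The Python sketch below applies it to the grammar from the question. It rests on assumptions: G has no productions in the question, so it is treated here as an opaque terminal, "id:=G" is split into the symbols id, := and G, and all function names are made up for illustration.

```python
EPS, END = "epsilon", "$"

GRAMMAR = {
    "L": [["C", "L'"]],
    "L'": [[], [";", "L"]],
    "C": [["id", ":=", "G"], ["if", "G", "C"], ["begin", "L", "end"]],
}


def first_of_sequence(symbols, first):
    """FIRST of a symbol sequence; contains EPS if the whole sequence is nullable."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}  # terminal: itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for rhs in productions:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # only nonterminals have FOLLOW sets
                    rest_first = first_of_sequence(rhs[i + 1:], first)
                    new = rest_first - {EPS}
                    if EPS in rest_first:  # the rest can vanish: inherit FOLLOW(lhs)
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "L")
for nt in GRAMMAR:
    print(nt, sorted(FOLLOW[nt]))
```

The semicolon ends up in Follow(C) only. Information flows from Follow(L) into Follow(C) because L' is nullable, and never in the other direction.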