The following pattern (from this page) matches only strings with balanced parentheses:
b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }
What does 1- in 1 - lpeg.S"()" mean?
function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end
What does the +1 in patt / repl + 1 mean?
I also still don't quite understand the prioritized choice operator / from this paper.
Any help will be appreciated!
The 1 in 1 - lpeg.S"()" matches any single character. The - is pattern difference, so the whole expression matches any one character that is not in the set "()", i.e. anything except a parenthesis.
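To make that reading concrete, here is a rough Python model of the balanced-parentheses pattern. It is not LPeg, just a sketch of the same logic, and the name match_balanced is made up for illustration. The first branch of the loop plays the role of 1 - lpeg.S"()", and the recursive call plays the role of lpeg.V(1).

```python
def match_balanced(s, i=0):
    """Return the index just past a balanced group starting at i, or None."""
    if i >= len(s) or s[i] != "(":            # "("
        return None
    i += 1
    while True:                                # ( ... )^0
        if i < len(s) and s[i] not in "()":    # 1 - lpeg.S"()"
            i += 1
            continue
        nested = match_balanced(s, i)          # lpeg.V(1): a nested group
        if nested is None:
            break                              # neither alternative matched
        i = nested
    if i < len(s) and s[i] == ")":             # ")"
        return i + 1
    return None


print(match_balanced("(a(b)c)"))   # 7
print(match_balanced("(a(b)c"))    # None
```

Note that Python indices are 0-based here, whereas lpeg.match returns a 1-based position after the match.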
The + 1 is the same idea. + is ordered choice, so patt / repl + 1 first tries patt and, if repl is a string, replaces what patt matched with repl; if patt fails at the current position, it falls back to 1 and consumes a single character unchanged.
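As a minimal sketch of what the (patt / repl + 1)^0 loop does, here is the same idea in plain Python. This assumes patt is a literal, non-empty string (LPeg accepts arbitrary patterns), and peg_gsub is a hypothetical name, not part of any library.

```python
def peg_gsub(s, patt, repl):
    """Replace every occurrence of the literal string patt in s with repl."""
    if not patt:
        raise ValueError("patt must be non-empty")
    out = []
    i = 0
    while i < len(s):                  # ( ... )^0
        if s.startswith(patt, i):      # first alternative: patt / repl
            out.append(repl)
            i += len(patt)
        else:                          # second alternative: + 1
            out.append(s[i])           # keep one character unchanged
            i += 1
    return "".join(out)


print(peg_gsub("hello world", "o", "0"))   # hell0 w0rld
```

Without the + 1 fallback, the loop would stop at the first position where patt does not match, so nothing after that point would be processed.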
This was spun off from the comments on this question.
As I understand it, in a PEG grammar it's possible to implement a non-greedy search by writing S <- E2 / E1 S (that is, S matches E2 if possible, and otherwise matches E1 followed by S again).
However, I don't want to capture E2 in the final pattern - I want to capture up to E2. When trying to implement this in LPEG I've run into several issues, including 'Empty loop in rule' errors when building this into a grammar.
How would we implement the following search in a LPEG grammar: [tag] foo [/tag] where we want to capture the contents of the tag in a capture table ('foo' in the example), but we want to terminate before the ending tag? As I understand from the comments on the other question, this should be possible, but I can't find an example in LPEG.
Here's a snippet from the test grammar
local tag_start = P"[tag]"
local tag_end = P"[/tag]"
G = P{'Pandoc',
...
NotTag = #tag_end + P"1" * V"NotTag"^0;
...
tag = tag_start * Ct(V"NotTag"^0) * tag_end;
}
It's me again. I think you need a better understanding of LPeg captures. A table capture (lpeg.Ct) gathers the captures made inside it into a table. Since the NotTag rule contains no simple captures (lpeg.C), the final capture is an empty table {}.
Once more, I recommend you start from lpeg.re because it's more intuitive.
local re = require('lpeg.re')
local inspect = require('inspect')
local g = re.compile[=[--lpeg
tag <- tag_start {| {NotTag} |} tag_end
NotTag <- &tag_end / . NotTag
tag_start <- '[tag]'
tag_end <- '[/tag]'
]=]
print(inspect(g:match('[tag] foo [/tag]')))
-- output: { " foo " }
Additionally, note that S <- E2 / E1 S and S <- E2 / E1 S* are not equivalent.
However, if I were doing the same task, I wouldn't use a non-greedy match at all, as non-greedy matches are generally slower than greedy ones.
tag <- tag_start {| {( !tag_end . (!'[' .)* )*} |} tag_end
Combining a not-predicate with greedy matching is enough.
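For anyone who finds it easier to read as a loop: the "stop as soon as the end tag is next, otherwise consume one character" logic can be sketched in plain Python like this. It is only a model of the grammar above, not LPeg; capture_tag is a made-up name, and it returns the captured string directly rather than a capture table.

```python
def capture_tag(s, tag_start="[tag]", tag_end="[/tag]"):
    """Return the text between tag_start and the first tag_end, or None."""
    if not s.startswith(tag_start):           # tag_start
        return None
    begin = i = len(tag_start)
    while not s.startswith(tag_end, i):       # end tag is not next yet
        if i >= len(s):                       # '.' fails at end of input
            return None
        i += 1                                # '.' consumes one character
    return s[begin:i]                         # tag_end itself is not captured


print(repr(capture_tag("[tag] foo [/tag]")))   # ' foo '
```

The lookahead only checks for the end tag without consuming it, which is why the capture stops just before [/tag].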
I have this lexeme:
a().length()
And this PEGjs grammar:
start = func_call
func_call = header "(" ")"
header = "a" suffix*
suffix = "(" ")" ("." "length")
In the grammar, I'm parsing a function call. This currently parses, as you can try online in the PEGjs playground.
Input parsed successfully.
However, if I add an asterisk to the end of the suffix production, like so:
suffix = "(" ")" ("." "length")*
Then the input fails to parse:
Line 1, column 13: Expected "(" or "." but end of input found.
I don't understand why. From the documentation:
expression * Match zero or more repetitions of the expression and return their match results in an array. The matching is greedy
This should be a greedy match of "." "length" which should match one time. But instead it fails to match at all. Is this related to the nested use of * in header?
* matches zero or more repetitions of its operand, as you say.
So when you write
suffix = "(" ")" ("." "length")*
You are saying that a suffix is () followed by zero or more repetitions of .length. Thus it could be (), which has zero repetitions.
Thus, suffix* (in header) can match ().length() as two repetitions of suffix, first ().length, and then (). That would be the greedy match, which is PEG's modus operandi.
But after that, there is no () left to be matched by func_call, and a PEG repetition never backtracks to give up part of what it has already consumed. Hence the parse error.
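The behaviour can be reproduced with a few toy parser combinators in Python. This is a minimal sketch of PEG semantics under my own naming (lit, seq, star and func_call are hypothetical helpers, not PEG.js API); each parser takes a string and an index and returns the new index or None.

```python
def lit(text):
    """Match a literal string."""
    return lambda s, i: i + len(text) if s.startswith(text, i) else None


def seq(*parts):
    """Match all parts one after another."""
    def parse(s, i):
        for part in parts:
            i = part(s, i)
            if i is None:
                return None
        return i
    return parse


def star(p):
    """Greedy zero-or-more; never gives back what it consumed."""
    def parse(s, i):
        while True:
            j = p(s, i)
            if j is None or j == i:
                return i
            i = j
    return parse


def func_call(suffix):
    header = seq(lit("a"), star(suffix))          # header = "a" suffix*
    return seq(header, lit("("), lit(")"))        # func_call = header "(" ")"


dot_length = seq(lit("."), lit("length"))
suffix_once = seq(lit("("), lit(")"), dot_length)         # ("." "length")
suffix_star = seq(lit("("), lit(")"), star(dot_length))   # ("." "length")*

text = "a().length()"
print(func_call(suffix_once)(text, 0) == len(text))   # True
print(func_call(suffix_star)(text, 0) == len(text))   # False
```

With suffix_star, the second iteration of suffix* swallows the final (), so the trailing "(" ")" in func_call has nothing left to match.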
I'm trying to parse a comma separated list. To simplify, I'm just using digits. These expressions would be valid:
(1, 4, 3)
()
(4)
I can think of two ways to do this and I'm wondering why exactly the failed example does not work. I believe it is a correct BNF, but I can't get it to work as PEG. Can anyone explain why exactly? I'm trying to get a better understanding of the PEG parsing logic.
I'm testing using the online browser parser generator here:
https://pegjs.org/online
This does not work:
list = '(' some_digits? ')'
some_digits = digit / ', ' some_digits
digit = [0-9]
(Actually, it parses okay and accepts () or (1), but it doesn't recognize (1, 2).)
But this does work:
list = '(' some_digits? ')'
some_digits = digit another_digit*
another_digit = ', ' digit
digit = [0-9]
Why is that? (Grammar novice here)
Cool question and after digging around in their docs for a second I found that the / character means:
Try to match the first expression, if it does not succeed, try the
second one, etc. Return the match result of the first successfully
matched expression. If no expression matches, consider the match
failed.
So this led me to the solution:
list = '(' some_digits? ')'
some_digits = digit ', ' some_digits / digit
digit = [0-9]
The reason this works:
input: (1, 4)
eats '('
check are there some digits?
  check some_digits - first condition:
    eats '1'
    eats ', '
    check some_digits - first condition:
      eats '4'
      fails to eat ', '
    check some_digits - second condition:
      eats '4'
      succeeds
    succeeds
eats ')'
succeeds
If you reverse the order of the some_digits alternatives, the first number it comes across gets eaten by digit and no recursion occurs. The parser then expects ')' but finds ', ' instead, so it reports an error.
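Here is a small Python sketch of ordered choice that shows the difference between the two orderings. It only models the PEG semantics; the helper names (lit, seq, choice, opt and so on) are my own and are not PEG.js API. Each parser takes a string and an index and returns the new index or None.

```python
def lit(text):
    return lambda s, i: i + len(text) if s.startswith(text, i) else None


def digit(s, i):
    return i + 1 if i < len(s) and s[i] in "0123456789" else None


def seq(*parts):
    def parse(s, i):
        for part in parts:
            i = part(s, i)
            if i is None:
                return None
        return i
    return parse


def choice(*alts):
    """Ordered choice: the first alternative that succeeds wins."""
    def parse(s, i):
        for alt in alts:
            j = alt(s, i)
            if j is not None:
                return j
        return None
    return parse


def opt(p):
    return choice(p, lambda s, i: i)


def recursive_first(s, i):
    # some_digits = digit ', ' some_digits / digit
    return choice(seq(digit, lit(", "), recursive_first), digit)(s, i)


def digit_first(s, i):
    # some_digits = digit / ', ' some_digits
    return choice(digit, seq(lit(", "), digit_first))(s, i)


def accepts(some_digits, text):
    # list = '(' some_digits? ')'
    return seq(lit("("), opt(some_digits), lit(")"))(text, 0) == len(text)


print(accepts(recursive_first, "(1, 4, 3)"))   # True
print(accepts(digit_first, "(1, 4, 3)"))       # False
```

In digit_first, the digit alternative succeeds immediately, so the choice never looks at its second alternative and the list rule is left facing ', ' where it wants ')'.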
In one line:
some_digits = '(' digit (', ' digit)* ')'
It depends on what you want with the values and on the PEG implementation, but extracting them might be easier this way.
In a normal PEG (parsing expression grammar) this is a valid grammar:
values <- number (comma values)*
number <- [0-9]+
comma <- ','
However, if I try to write this using LPeg the recursive nature of that rule fails:
local lpeg = require'lpeg'
local comma = lpeg.P(',')
local number = lpeg.R('09')^1
local values = number * (comma * values)^-1
--> bad argument #2 to '?' (lpeg-pattern expected, got nil)
Although in this simple example I could rewrite the rule to not use recursion, I have some existing grammars that I'd prefer not to rewrite.
How can I write a self-referencing rule in LPeg?
Use a grammar.
With the use of Lua variables, it is possible to define patterns incrementally, with each new pattern using previously defined ones. However, this technique does not allow the definition of recursive patterns. For recursive patterns, we need real grammars.
LPeg represents grammars with tables, where each entry is a rule.
The call lpeg.V(v) creates a pattern that represents the nonterminal (or variable) with index v in a grammar. Because the grammar still does not exist when this function is evaluated, the result is an open reference to the respective rule.
A table is fixed when it is converted to a pattern (either by calling lpeg.P or by using it wherein a pattern is expected). Then every open reference created by lpeg.V(v) is corrected to refer to the rule indexed by v in the table.
When a table is fixed, the result is a pattern that matches its initial rule. The entry with index 1 in the table defines its initial rule. If that entry is a string, it is assumed to be the name of the initial rule. Otherwise, LPeg assumes that the entry 1 itself is the initial rule.
As an example, the following grammar matches strings of a's and b's that have the same number of a's and b's:
equalcount = lpeg.P{
  "S";   -- initial rule name
  S = "a" * lpeg.V"B" + "b" * lpeg.V"A" + "",
  A = "a" * lpeg.V"S" + "b" * lpeg.V"A" * lpeg.V"A",
  B = "b" * lpeg.V"S" + "a" * lpeg.V"B" * lpeg.V"B",
} * -1
It is equivalent to the following grammar in standard PEG notation:
S <- 'a' B / 'b' A / ''
A <- 'a' S / 'b' A A
B <- 'b' S / 'a' B B
I know this is a late answer, but here is an idea for how to make a rule reference itself:
local lpeg = require'lpeg'
local comma = lpeg.P(',')
local number = lpeg.R('09')^1
local values = lpeg.P{ lpeg.C(number) * (comma * lpeg.V(1))^-1 }
local t = { values:match('1,10,20,301') }
Basically, a minimal grammar is passed to lpeg.P (a grammar is just a glorified table), and it references its first rule by number instead of by name, i.e. lpeg.V(1).
The sample adds a simple lpeg.C capture on the number terminal and collects all the results in the local table t for further use. (Notice that no lpeg.Ct is used: the captures come back as multiple return values, and the table constructor gathers them.)
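If the self-reference is hard to picture, the same rule can be written as an ordinary recursive function. This is a rough Python model of lpeg.C(number) * (comma * lpeg.V(1))^-1, not LPeg itself; parse_values is a made-up name, and it returns the captures as a list together with the end index.

```python
def parse_values(s, i=0):
    """Return (captures, end_index) for 'number (comma values)?', or None."""
    j = i
    while j < len(s) and s[j] in "0123456789":    # number: one or more digits
        j += 1
    if j == i:
        return None                               # no digits: the rule fails
    captures = [s[i:j]]                           # lpeg.C(number)
    if j < len(s) and s[j] == ",":                # comma
        rest = parse_values(s, j + 1)             # lpeg.V(1): the rule itself
        if rest is not None:
            more, end = rest
            return captures + more, end
    return captures, j                            # the optional part was skipped


print(parse_values("1,10,20,301"))   # (['1', '10', '20', '301'], 11)
```

A function can call itself by name, which is exactly what a plain Lua local cannot do while it is still being defined; lpeg.V provides that late binding inside a grammar table.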
In this code:
let f(a,b,c) = a * b + c - (d())
let g(a,b,c) = a * b + c -(d())
f is (int*int*int) -> int, and g is (int*int*(int -> int)) -> int.
Removing the brackets around d() in g causes the "Successive arguments should be separated by spaces or tupled" error.
What's going on?
@bytebuster is quite correct in his comment, but to put it into layman's terms ;-] in f, the - in - (d()) is parsed as the binary subtraction operator, while in g, the - in -(d()) is parsed as the unary negation operator, so c is applied to the negated value as if it were a function. You're simply fighting operator precedence here.