Is it possible to describe block comments using EBNF? - parsing

Say, I have the following EBNF:
document = content , { content } ;
content = hello world | answer | space ;
hello world = "hello" , space , "world" ;
answer = "42" ;
space = " " ;
This lets me parse something like:
hello world 42
Now I want to extend this grammar with a block comment. How can I do this properly?
If I start simple:
document = content , { content } ;
content = hello world | answer | space | comment;
hello world = "hello" , space , "world" ;
answer = "42" ;
space = " " ;
comment = "/*" , ?any character? , "*/" ;
I cannot parse:
Hello /* I'm the taxman! */ World 42
If I extend the grammar further with the special case from above, it gets ugly, but parses.
document = content , { content } ;
content = hello world | answer | space | comment;
hello world = "hello" , { comment } , space , { comment } , "world" ;
answer = "42" ;
space = " " ;
comment = "/*" , ?any character? , "*/" ;
But I still cannot parse something like:
Hel/*p! I need somebody. Help! Not just anybody... */lo World 42
How would I do this with an EBNF grammar? Or is it not even possible at all?

Assuming you consider "hello" a token, you would not want anything to break it up. Should you need to allow that anyway, it becomes necessary to explode the rule:
hello_world = "h", {comment}, "e", {comment}, "l", {comment}, "l", {comment}, "o" ,
{ comment }, space, { comment },
"w", {comment}, "o", {comment}, "r", {comment}, "l", {comment}, "d" ;
Considering the broader question, it is commonplace not to describe a language's comments as part of the formal grammar, but to mention them in a side note instead. However, it can generally be done by treating a comment as equivalent to whitespace:
space = " " | comment ;
You may also want to consider adding a rule to describe consecutive whitespace:
spaces = { space }- ;
Cleaning up your final grammar, but treating "hello" and "world" as tokens (i.e. not allowing them to be broken apart), could result in something like this:
document = { content }- ;
content = hello world | answer | space ;
hello world = "hello" , spaces , "world" ;
answer = "42" ;
spaces = { space }- ;
space = " " | comment ;
comment = "/*" , { ? any character ? - "*/" } , "*/" ;
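If you want to try the cleaned-up grammar out mechanically, here is a rough translation into Lua's LPeg re syntax (my own sketch, not part of the original answer; it assumes the lpeg library is available). As above, comments are folded into the space rule, so they are accepted anywhere plain whitespace is:
local re = require("re")  -- LPeg's re module

local g = re.compile([[
    document    <- content+ !.
    content     <- hello_world / answer / space
    hello_world <- "hello" spaces "world"
    answer      <- "42"
    spaces      <- space+
    space       <- " " / comment
    comment     <- "/*" (!"*/" .)* "*/"
]])

-- prints 37, i.e. one past the end of the 36-character subject: the whole string matched
print(g:match("hello /* I'm the taxman! */ world 42"))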

How would I do this with an EBNF grammar? Or is it not even possible at all?
Some languages remove comments in a preprocessor, while others replace each comment with a space. Removing the comments seems the easiest solution to this problem. However, a simple preprocessor like this would also strip comment-like text from inside string literals, which normally should not happen.
document = preprocess, process;
preprocess = {(? any character ? - comment, ? append char to text ?)},
? text for input to process ?;
comment = "/*", {? any character ? - "*/"}, "*/", ? discard ?;
process = {content}-;
content = hello world | answer | spaces;
hello world = ("H" | "h"), "ello", spaces, ("W" | "w") , "orld";
answer = "42";
spaces = {" "}-;
The preprocessor, given,
Hello /* I'm the taxman! */ World 42
produces
Hello  World 42
Notice the two spaces.
And, for
Hel/*p! I need somebody. Help! Not just anybody... */lo World 42
produces
Hello World 42
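If you would rather do this outside the grammar entirely, here is a minimal preprocessor sketch in Lua (my addition, not part of the answer's EBNF). As noted above, it also strips comment-like text inside string literals, which a real preprocessor would have to avoid:
-- remove /* ... */ comments before handing the text to the real parser
local function strip_comments(s)
    return (s:gsub("/%*.-%*/", ""))  -- '.-' is the shortest match between the delimiters
end

print(strip_comments("Hello /* I'm the taxman! */ World 42"))
--> Hello  World 42   (the two spaces mentioned above)
print(strip_comments("Hel/*p! I need somebody. Help! Not just anybody... */lo World 42"))
--> Hello World 42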

Related

regex for matching a string into words but leaving multiple spaces

Here's what I expect: I have a string of numbers that need to be changed into letters (a kind of cipher), where a single space separates the numbers and a triple space represents a space in the output. For example, the string "394 29 44 44   141 6" should be decrypted into "Hell No".
function string.decrypt(self)
    local output = ""
    for i in self:gmatch("%S+") do
        for j, k in pairs(CODE) do
            output = output .. (i == j and k or "")
        end
    end
    return output
end
Even though it decrypts the numbers correctly, it doesn't work with the spaces. So the string I used above decrypts into "HellNo" instead of the expected "Hell No". How can I fix this?
You can use
CODE = {["394"] = "H", ["29"] = "e", ["44"] = "l", ["141"] = "N", ["6"] = "o"}

function replace(match)
    local ret = nil
    for i, v in pairs(CODE) do
        if i == match then
            ret = v
        end
    end
    return ret
end

function decrypt(s)
    return s:gsub("(%d+)%s?", replace):gsub("  ", " ")
end

print(decrypt("394 29 44 44   141 6"))
Output will contain Hell No. See the Lua demo online.
Here, (%d+)%s? in s:gsub("(%d+)%s?", replace) matches and captures one or more digits and also matches an optional whitespace character (with %s?). The captured value is passed to the replace function, where it is mapped to the character value in CODE. Then, all double spaces are replaced with a single space with gsub("  ", " ").
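As a side note, gsub also accepts a table as its replacement argument, using the first capture as the key, so the loop inside replace is not strictly necessary. A shorter sketch with the same CODE table:
function decrypt(s)
    -- unknown numbers are kept as-is, because a nil table lookup leaves the match unchanged
    return (s:gsub("(%d+)%s?", CODE):gsub("  ", " "))
end

print(decrypt("394 29 44 44   141 6")) --> Hell No
The extra pair of parentheses drops gsub's second return value (the substitution count), so only the decrypted string is returned.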

How to parse only comments using pegjs grammar?

I've written a pegjs grammar that is supposed to parse any kind of js/c-style comment. However, it's not quite working: it only matches when the input is the comment itself, and it fails on everything else. How should I alter this grammar so that it parses only the comments out of any kind of input?
Grammar:
Start
= Comment
Character
= .
Comment
= MultiLineComment
/ SingleLineComment
LineTerminator
= [\n\r\u2028\u2029]
MultiLineComment
= "/*" (!"*/" Character)* "*/"
MultiLineCommentNoLineTerminator
= "/*" (!("*/" / LineTerminator) Character)* "*/"
SingleLineComment
= "//" (!LineTerminator Character)*
Input:
/**
 * Trending Content
 * Returns visible videos that have the largest view percentage increase over
 * the time period.
 */
Other text here
Error
Line 5, column 4: Expected end of input but "\n" found.
You need to refactor to specifically capture the line content before you consider the comment (single- or multi-line), as in:
lines = result:line* {
    return result
  }
line = WS* line:$( !'//' CHAR )* single_comment ( EOL / EOF ) { // single-comment line
    return line.replace(/^\s+|\s+$/g,'')
  }
  / WS* line:$( !'/*' CHAR )* multi_comment ( EOL / EOF ) { // multi-comment line
    return line.replace(/^\s+|\s+$/g,'')
  }
  / WS* line:$CHAR+ ( EOL / EOF ) { // non-blank line
    return line.replace(/^\s+|\s+$/g,'')
  }
  / WS* EOL { // blank line
    return ''
  }
single_comment = WS* '//' CHAR* WS*
multi_comment = WS* '/*' ( !'*/' ( CHAR / EOL ) )* '*/' WS*
CHAR = [^\n]
WS = [ \t]
EOF = !.
EOL = '\n'
which, when run against:
no comment here
single line comment // single-comment HERE
test of multi line comment /*
multi-comment HERE
*/
last line
returns:
[
  "no comment here",
  "",
  "single line comment",
  "",
  "test of multi line comment",
  "",
  "last line"
]
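If PEG.js is not a hard requirement, the same idea -- collect the comments and skip everything else -- is also compact in Lua with LPeg (a sketch of mine, assuming the lpeg library; this is not PEG.js syntax):
local lpeg = require("lpeg")
local P, C, Ct = lpeg.P, lpeg.C, lpeg.Ct

local multi   = C(P"/*" * (1 - P"*/")^0 * P"*/")  -- capture /* ... */
local single  = C(P"//" * (1 - P"\n")^0)          -- capture // ... up to the end of the line
local comment = multi + single

-- walk the whole input: either capture a comment or skip one character
local scanner = Ct((comment + 1)^0)

local input = [[
/**
 * Trending Content
 */
Other text here // trailing note
]]
for _, c in ipairs(scanner:match(input)) do print(c) end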

combining these two lua formatting lines into one

Is there any way to combine these last two formatting lines of code into one?
str = "1, 2, 3, 4, 5, "
str = str:gsub("%p", {[","] = " >" }) -- replaces ',' with '>'
str = string.sub(str, 1, #str - 2) --removes last whitespace + comma
Thanks in advance :)
str = "1, 2, 3, 4, 5, "
str = str:sub(1, #str-2):gsub("%p", {[","] = " >" })
This will do what you want it to do.
Egor's is a bit more elegant, though:
str = str:gsub(',',' > '):sub(1,-3)
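For reference, a quick check of the combined one-liner (assuming the same input string):
local str = "1, 2, 3, 4, 5, "
print((str:sub(1, #str - 2):gsub("%p", { [","] = " >" })))  --> 1 > 2 > 3 > 4 > 5
The extra pair of parentheses drops gsub's second return value (the substitution count), so print shows only the resulting string.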

Parsing a TeX-like language with lpeg

I am struggling to get my head around LPEG. I have managed to produce one grammar which does what I want, but I have been beating my head against this one and not getting far. The idea is to parse a document which is a simplified form of TeX. I want to split a document into:
Environments, which are \begin{cmd} and \end{cmd} pairs.
Commands which can either take an argument like so: \foo{bar} or can be bare: \foo.
Both environments and commands can have parameters like so: \command[color=green,background=blue]{content}.
Other stuff.
I also would like to keep track of line number information for error handling purposes. Here's what I have so far:
lpeg = require("lpeg")
lpeg.locale(lpeg)
-- Assume a lot of "X = lpeg.X" here.
-- Line number handling from http://lua-users.org/lists/lua-l/2011-05/msg00607.html
-- with additional print statements to check they are working.
local newline = P"\r"^-1 * "\n" / function (a) print("New"); end
local incrementline = Cg( Cb"linenum" )/ function ( a ) print("NL"); return a + 1 end , "linenum"
local setup = Cg ( Cc ( 1) , "linenum" )
nl = newline * incrementline
space = nl + lpeg.space
-- Taken from "Name-value lists" in http://www.inf.puc-rio.br/~roberto/lpeg/
local identifier = (R("AZ") + R("az") + P("_") + R("09"))^1
local sep = lpeg.S(",;") * space^0
local value = (1-lpeg.S(",;]"))^1
local pair = lpeg.Cg(C(identifier) * space ^0 * "=" * space ^0 * C(value)) * sep^-1
local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)
local parameters = (P("[") * list * P("]")) ^-1
-- And the rest is mine
anything = C( (space^1 + (1-lpeg.S("\\{}")) )^1) * Cb("linenum") / function (a,b) return { text = a, line = b } end
begin_environment = P("\\begin") * Ct(parameters) * P("{") * Cg(identifier, "environment") * Cb("environment") * P("}") / function (a,b) return { params = a[1], environment = b } end
end_environment = P("\\end{") * Cg(identifier) * P("}")
texlike = lpeg.P{
"document";
document = setup * V("stuff") * -1,
stuff = Cg(V"environment" + anything + V"bracketed_stuff" + V"command_with" + V"command_without")^0,
bracketed_stuff = P"{" * V"stuff" * P"}" / function (a) return a end,
command_with =((P("\\") * Cg(identifier) * Ct(parameters) * Ct(V"bracketed_stuff"))-P("\\end{")) / function (i,p,n) return { command = i, parameters = p, nodes = n } end,
command_without = (( P("\\") * Cg(identifier) * Ct(parameters) )-P("\\end{")) / function (i,p) return { command = i, parameters = p } end,
environment = Cg(begin_environment * Ct(V("stuff")) * end_environment) / function (b,stuff, e) return { b = b, stuff = stuff, e = e} end
}
It almost works!
> texlike:match("\\foo[one=two]thing\\bar")
{
  command = "foo",
  parameters = {
    {
      one = "two",
    },
  },
}
{
  line = 1,
  text = "thing",
}
{
  command = "bar",
  parameters = {
  },
}
But! First, I can't get the line number handling part to work at all. The function within incrementline is never fired.
I also can't quite work out how nested capture information is passed to handling functions (which is why I have scattered Cg, C and Ct semirandomly over the grammar). This means that only one item is returned from within a command_with:
> texlike:match("\\foo{text \\command moretext}")
{
  command = "foo",
  nodes = {
    {
      line = 1,
      text = "text ",
    },
  },
  parameters = {
  },
}
I would also love to be able to check that the environment start and end match up, but when I tried to do so, my back references from "begin" were not in scope by the time I got to "end". I don't know where to go from here.
Late answer but hopefully it'll offer some insight if you're still looking for a solution or wondering what the problem was.
There are a couple of issues with your grammar, some of which can be tricky to spot.
Your line increment here looks incorrect:
local incrementline = Cg( Cb"linenum" ) /
function ( a ) print("NL"); return a + 1 end,
"linenum"
It looks like you meant to create a named capture group and not an anonymous group. The back capture linenum is essentially being used like a variable. The problem is that, because this is inside an anonymous capture, linenum will not update properly -- function(a) will always receive 1 when called. You need to move the closing ) to the end so that "linenum" is included:
local incrementline = Cg( Cb"linenum" /
function ( a ) print("NL"); return a + 1 end,
"linenum")
Relevant LPeg documentation for Cg capture.
The second problem is with your anything non-terminal rule:
anything = C( (space^1 + (1-lpeg.S("\\{}")) )^1) * Cb("linenum") ...
There are several things to be careful about here. First, a named Cg capture (from the incrementline rule, once it's fixed) doesn't produce anything unless it's inside a table capture or you back-reference it. The second major thing is that it has an ad hoc scope, like a variable. More precisely, its scope ends once you close it inside an outer capture -- which is exactly what you're doing here:
C( (space^1 + (...) )^1)
Which means that by the time you reference its back capture with * Cb("linenum"), it's already too late -- the linenum you really want has already closed its scope.
I've always found LPeg's re syntax a bit easier to grok, so I've rewritten the grammar with that instead:
local re = require("re")  -- the rewritten grammar below uses LPeg's re module

local grammar_cb =
{
    fold = pairfold,
    resetlinenum = resetlinenum,
    incrementlinenum = incrementlinenum,
    getlinenum = getlinenum,
    error = error
}
local texlike_grammar = re.compile(
[[
document <- '' -> resetlinenum {| docpiece* |} !.
docpiece <- {| envcmd |} / {| cmd |} / multiline
beginslash <- cmdslash 'begin'
endslash <- cmdslash 'end'
envcmd <- beginslash paramblock? {:beginenv: envblock :} (!endslash docpiece)*
endslash openbrace {:endenv: =beginenv :} closebrace / &beginslash {} -> error .
envblock <- openbrace key closebrace
cmd <- cmdslash {:command: identifier :} (paramblock? cmdblock)?
cmdblock <- openbrace {:nodes: {| docpiece* |} :} closebrace
paramblock <- opensq ( {:parameters: {| parampairs |} -> fold :} / whitesp) closesq
parampairs <- parampair (sep parampair)*
parampair <- key assign value
key <- whitesp { identifier }
value <- whitesp { [^],;%s]+ }
multiline <- (nl? text)+
text <- {| {:text: (!cmd !closebrace !%nl [_%w%p%s])+ :} {:line: '' -> getlinenum :} |}
identifier <- [_%w]+
cmdslash <- whitesp '\'
assign <- whitesp '='
sep <- whitesp ','
openbrace <- whitesp '{'
closebrace <- whitesp '}'
opensq <- whitesp '['
closesq <- whitesp ']'
nl <- {%nl+} -> incrementlinenum
whitesp <- (nl / %s)*
]], grammar_cb)
The callback functions are straight-forwardly defined as:
local function pairfold(...)
    local t, kv = {}, ...
    if #kv % 2 == 1 then return ... end
    for i = #kv, 2, -2 do
        t[ kv[i - 1] ] = kv[i]
    end
    return t
end

local incrementlinenum, getlinenum, resetlinenum do
    local line = 1
    function incrementlinenum(nl)
        assert(not nl:match "%S")
        line = line + #nl
    end
    function getlinenum() return line end
    function resetlinenum() line = 1 end
end
Testing the grammar with a non-trivial tex-like str with multiple lines:
local test1 = [[\foo{text \bar[color = red, background = black]{
moretext \baz{
even
more text} }
this time skipping multiple
lines even, such wow!}]]
Produces the following AST in Lua-table format:
{
  command = "foo",
  nodes = {
    {
      text = "text",
      line = 1
    },
    {
      parameters = {
        color = "red",
        background = "black"
      },
      command = "bar",
      nodes = {
        {
          text = " moretext",
          line = 2
        },
        {
          command = "baz",
          nodes = {
            {
              text = "even ",
              line = 3
            },
            {
              text = "more text",
              line = 4
            }
          }
        }
      }
    },
    {
      text = "this time skipping multiple",
      line = 7
    },
    {
      text = "lines even, such wow!",
      line = 9
    }
  }
}
And a second test for begin/end environments:
local test2 = [[\begin[p1
=apple,
p2=blue]{scope} scope foobar
\end{scope} global foobar]]
Which seems to give approximately what you're looking for:
{
  {
    {
      text = " scope foobar",
      line = 3
    },
    parameters = {
      p1 = "apple",
      p2 = "blue"
    },
    beginenv = "scope",
    endenv = "scope"
  },
  {
    text = " global foobar",
    line = 4
  }
}
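In case you want to reproduce dumps like the ones above: print cannot show nested tables, so some pretty-printer is needed. A minimal recursive dumper (my addition, purely for inspecting the result; key order is not guaranteed):
local function dump(node, indent)
    indent = indent or ""
    if type(node) ~= "table" then
        print(indent .. tostring(node))
        return
    end
    for k, v in pairs(node) do
        if type(v) == "table" then
            print(indent .. tostring(k) .. ":")
            dump(v, indent .. "  ")
        else
            print(indent .. tostring(k) .. " = " .. tostring(v))
        end
    end
end

dump(texlike_grammar:match([[\foo{hello \bar world}]]))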

Case-insensitive Lua pattern-matching

I'm writing a grep utility in Lua for our mobile devices running Windows CE 6/7, but I've run into some issues implementing case-insensitive match patterns. The obvious solution of converting everything to uppercase (or lower) does not work so simply due to the character classes.
The only other thing I can think of is converting the literals in the pattern itself to uppercase.
Here's what I have so far:
function toUpperPattern(instr)
    -- Check first character
    if string.find(instr, "^%l") then
        instr = string.upper(string.sub(instr, 1, 1)) .. string.sub(instr, 2)
    end
    -- Check the rest of the pattern
    while 1 do
        local a, b, str = string.find(instr, "[^%%](%l+)")
        if not a then break end
        if str then
            instr = string.sub(instr, 1, a) .. string.upper(string.sub(instr, a+1, b)) .. string.sub(instr, b + 1)
        end
    end
    return instr
end
I hate to admit how long it took to get even that far, and I can already see that there are going to be problems with things like escaped percent signs ('%%').
I figured this must be a fairly common issue, but I can't seem to find much on the topic.
Are there any easier (or at least complete) ways to do this? I'm starting to go crazy here...
Hoping you Lua gurus out there can enlighten me!
Try something like this:
function case_insensitive_pattern(pattern)
    -- find an optional '%' (group 1) followed by any character (group 2)
    local p = pattern:gsub("(%%?)(.)", function(percent, letter)
        if percent ~= "" or not letter:match("%a") then
            -- if the '%' matched, or `letter` is not a letter, return "as is"
            return percent .. letter
        else
            -- else, return a case-insensitive character class of the matched letter
            return string.format("[%s%s]", letter:lower(), letter:upper())
        end
    end)
    return p
end
print(case_insensitive_pattern("xyz = %d+ or %% end"))
which prints:
[xX][yY][zZ] = %d+ [oO][rR] %% [eE][nN][dD]
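A quick usage check (assuming the case_insensitive_pattern function above); the generated pattern can be fed straight to string.match or string.find:
local subject = "XYZ = 42 OR % END"
print(subject:match(case_insensitive_pattern("xyz = %d+ or %% end")))
--> XYZ = 42 OR % END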
Lua 5.1, LPeg v0.12
do
    local re = require("re")  -- LPeg's re module

    local p = re.compile([[
        pattern  <- ( {b} / {escaped} / brackets / other)+
        b        <- "%b" . .
        escaped  <- "%" .
        brackets <- { "[" ([^]%]+ / escaped)* "]" }
        other    <- [^[%]+ -> cases
    ]], {
        cases = function(str) return (str:gsub('%a', function(a) return '['..a:lower()..a:upper()..']' end)) end
    })
    local pb = re.compile([[
        pattern  <- ( {b} / {escaped} / brackets / other)+
        b        <- "%b" . .
        escaped  <- "%" .
        brackets <- {: {"["} ({escaped} / bcases)* {"]"} :}
        bcases   <- [^]%]+ -> bcases
        other    <- [^[%]+ -> cases
    ]], {
        cases = function(str) return (str:gsub('%a', function(a) return '['..a:lower()..a:upper()..']' end)) end,
        bcases = function(str) return (str:gsub('%a', function(a) return a:lower()..a:upper() end)) end
    })
    function iPattern(pattern, brackets)
        ('sanity check'):find(pattern) -- raises early if the pattern itself is malformed
        return table.concat({re.match(pattern, brackets and pb or p)})
    end
end
local test = '[ab%c%]d%%]+ o%%r %bnm'
print(iPattern(test)) -- [ab%c%]d%%]+ [oO]%%[rR] %bnm
print(iPattern(test,true)) -- [aAbB%c%]dD%%]+ [oO]%%[rR] %bnm
print(('qwe [%D]% O%r n---m asd'):match(iPattern(test, true))) -- %D]% O%r n---m
Pure Lua version:
It is necessary to analyze all the characters in the string to convert it into a correct pattern because Lua patterns do not have alternations like in regexps (abc|something).
function iPattern(pattern, brackets)
    ('sanity check'):find(pattern)
    local tmp = {}
    local i = 1
    while i <= #pattern do -- a 'for' loop doesn't let you change the counter
        local char = pattern:sub(i,i) -- current char
        if char == '%' then
            tmp[#tmp+1] = char -- add to tmp table
            i = i + 1 -- next char position
            char = pattern:sub(i,i)
            tmp[#tmp+1] = char
            if char == 'b' then -- '%bxy' - add next 2 chars
                tmp[#tmp+1] = pattern:sub(i+1,i+2)
                i = i + 2
            end
        elseif char == '[' then -- brackets
            tmp[#tmp+1] = char
            i = i + 1
            while i <= #pattern do
                char = pattern:sub(i,i)
                if char == '%' then -- no '%bxy' inside brackets
                    tmp[#tmp+1] = char
                    tmp[#tmp+1] = pattern:sub(i+1,i+1)
                    i = i + 1
                elseif char:match("%a") then -- letter
                    tmp[#tmp+1] = not brackets and char or char:lower()..char:upper()
                else -- something else
                    tmp[#tmp+1] = char
                end
                if char == ']' then break end -- close bracket
                i = i + 1
            end
        elseif char:match("%a") then -- letter
            tmp[#tmp+1] = '['..char:lower()..char:upper()..']'
        else
            tmp[#tmp+1] = char -- something else
        end
        i = i + 1
    end
    return table.concat(tmp)
end
local test = '[ab%c%]d%%]+ o%%r %bnm'
print(iPattern(test)) -- [ab%c%]d%%]+ [oO]%%[rR] %bnm
print(iPattern(test,true)) -- [aAbB%c%]dD%%]+ [oO]%%[rR] %bnm
print(('qwe [%D]% O%r n---m asd'):match(iPattern(test, true))) -- %D]% O%r n---m
