Using LPEG (Lua Parser Expression Grammars) like boost::spirit - parsing

So I am playing with lpeg to replace a boost spirit grammar, I must say boost::spirit is far more elegant and natural than lpeg. However it is a bitch to work with due to the constraints of current C++ compiler technology and the issues of TMP in C++. The type mechanism is in this case your enemy rather than your friend. Lpeg on the other hand while ugly and basic results in more productivity.
Anyway, I am digressing, part of my lpeg grammar looks like as follows:
function get_namespace_parser()
local P, R, S, C, V =
lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.V
namespace_parser =
lpeg.P{
"NAMESPACE";
NAMESPACE = V("WS") * P("namespace") * V("SPACE_WS") * V("NAMESPACE_IDENTIFIER")
* V("WS") * V("NAMESPACE_BODY") * V("WS"),
NAMESPACE_IDENTIFIER = V("IDENTIFIER") / print_string ,
NAMESPACE_BODY = "{" * V("WS") *
V("ENTRIES")^0 * V("WS") * "}",
WS = S(" \t\n")^0,
SPACE_WS = P(" ") * V("WS")
}
return namespace_parser
end
This grammar (although incomplete) matches the following namespace foo {}. I'd like to achieve the following semantics (which are common use-cases when using boost spirit).
Create a local variable for the namespace rule.
Add a namespace data structure to this local variable when namespace IDENTIFIER { has been matched.
Pass the newly created namespace data structure to the NAMESPACE_BODY for further construction of the AST... so on and so forth.
I am sure this use-case is achievable. No examples show it. I don't know the language or the library enough to figure out how to do it. Can someone show the syntax for it.
edit : After a few days of trying to dance with lpeg, and getting my feet troden on, I have decided to go back to spirit :D it is clear that lpeg is meant to be weaved with lua functions and that such weaving is very free-form (whereas spirit has clear very well documented semantics). I simply do not have the right mental model of lua yet.

Though "Create a local variable for the namespace rule" sounds disturbingly like "context-sensitive grammar", which is not really for LPEG, I will assume that you want to build an abstract syntax tree.
In Lua, an AST can be represented as a nested table (with named and indexed fields) or a closure, doing whatever task that tree is meant to do.
Both can be produced by a combination of nested LPEG captures.
I will limit this answer to AST as a Lua table.
Most useful, in this case, LPEG captures will be:
lpeg.C( pattern ) -- simple capture,
lpeg.Ct( pattern ) -- table capture,
lpeg.Cg( pattern, name ) -- named group capture.
The following example based on your code will produce a simple syntax tree as a Lua table:
local lpeg = require'lpeg'
local P, V = lpeg.P, lpeg.V
local C, Ct, Cg = lpeg.C, lpeg.Ct, lpeg.Cg
local locale = lpeg.locale()
local blank = locale.space ^ 0
local space = P' ' * blank
local id = P'_' ^ 0 * locale.alpha * (locale.alnum + '_') ^ 0
local NS = P{ 'ns',
-- The upper level table with two fields: 'id' and 'entries':
ns = Ct( blank * 'namespace' * space * Cg( V'ns_id', 'id' )
* blank * Cg( V'ns_body', 'entries' ) * blank ),
ns_id = id,
ns_body = P'{' * blank
-- The field 'entries' is, in turn, an indexed table:
* Ct( (C( V'ns_entry' )
* (blank * P',' * blank * C( V'ns_entry') ) ^ 0) ^ -1 )
* blank * P'}',
ns_entry = id
}
lpeg.match( NS, 'namespace foo {}' ) will give:
table#1 {
["entries"] = table#2 {
},
["id"] = "foo",
}
lpeg.match( NS, 'namespace foo {AA}' ) will give:
table#1 {
["entries"] = table#2 {
"AA"
},
["id"] = "foo",
}
lpeg.match( NS, 'namespace foo {AA, _BB}' ) will give:
table#1 {
["entries"] = table#2 {
"AA",
"_BB"
},
["id"] = "foo",
}
lpeg.match( NS, 'namespace foo {AA, _BB, CC1}' ) will give:
table#1 {
["entries"] = table#2 {
"AA",
"_BB",
"CC1"
},
["id"] = "foo",
}

Related

Lua - too many captures. How fix it?

Have problems with this one. If try convert cirilic words or wors have to many symbols and have error
function to_string(t)
local o = {};
for _, v in ipairs(t) do
local b = v < 0 and (0xff + v + 1) or v;
table.insert(o, string.char(b));
end
return table.concat(o);
end
function to_bytes(s)
local c = { s:match( (s:gsub(".", "(.)")) ) };
local o = {};
for _, v in pairs(c) do
table.insert(o, v:byte());
end
return o;
end
local t = to_bytes("If this have to many words или русские");
local out = "\\"
local chars = #t;
for i = 1, chars do
out = out..tostring(t[i]);
if i < chars then
out = out.."\\"
end
end
out = out..""
I think the error is self-explanatory: you have too many captures in your pattern (those groups that are wrapped into parentheses). The default value is 32. You have a couple of options: (1) recompile your Lua version to use a large number (you'll have to modify LUA_MAXCAPTURES value), but keep in mind that this limit is there for a reason and (2) change your pattern to avoid this many captures (possibly splitting into smaller fragments/patterns). You may also consider using more powerful parsers, like LPEG.
You don't need regex to convert string to array of bytes
function to_bytes(s)
return {s:byte(1, -1)}
end

Lua unusual variable name (question mark variable)

I have stumbled upon this line of code and I am not sure what the [ ? ] part represents (my guess is it's a sort of a wildcard but I searched it for a while and couldn't find anything):
['?'] = function() return is_canadian and "eh" or "" end
I understand that RHS is a functional ternary operator. I am curious about the LHS and what it actually is.
Edit: reference (2nd example):
http://lua-users.org/wiki/SwitchStatement
Actually, it is quite simple.
local t = {
a = "aah",
b = "bee",
c = "see",
It maps each letter to a sound pronunciation. Here, a need to be pronounced aah and b need to be pronounced bee and so on. Some letters have a different pronunciation if in american english or canadian english. So not every letter can be mapped to a single sound.
z = function() return is_canadian and "zed" or "zee" end,
['?'] = function() return is_canadian and "eh" or "" end
In the mapping, the letter z and the letter ? have a different prononciation in american english or canadian english. When the program will try to get the prononciation of '?', it will calls a function to check whether the user want to use canadian english or another english and the function will returns either zed or zee.
Finally, the 2 following notations have the same meaning:
local t1 = {
a = "aah",
b = "bee",
["?"] = "bee"
}
local t2 = {
["a"] = "aah",
["b"] = "bee",
["?"] = "bee"
}
If you look closely at the code linked in the question, you'll see that this line is part of a table constructor (the part inside {}). It is not a full statement on its own. As mentioned in the comments, it would be a syntax error outside of a table constructor. ['?'] is simply a string key.
The other posts alreay explained what that code does, so let me explain why it needs to be written that way.
['?'] = function() return is_canadian and "eh" or "" end is embedded in {}
It is part of a table constructor and assigns a function value to the string key '?'
local tbl = {a = 1} is syntactic sugar for local tbl = {['a'] = 1} or
local tbl = {}
tbl['a'] = 1
String keys that allow that convenient syntax must follow Lua's lexical conventions and hence may only contain letters, digits and underscore. They must not start with a digit.
So local a = {? = 1} is not possible. It will cause a syntax error unexpected symbol near '?' Therefor you have to explicitly provide a string value in square brackets as in local a = {['?'] = 1}
they gave each table element its own line
local a = {
1,
2,
3
}
This greatly improves readability for long table elements or very long tables and allows you maintain a maximum line length.
You'll agree that
local tbl = {
z = function() return is_canadian and "zed" or "zee" end,
['?'] = function() return is_canadian and "eh" or "" end
}
looks a lot cleaner than
local tbl = {z = function() return is_canadian and "zed" or "zee" end,['?'] = function() return is_canadian and "eh" or "" end}

PLY yacc parser : how can I create a recursive rule for parsing matrices?

I'm trying to parse matrices written this way : [[1,2];[3,2];[3,4]]
So, the matrix syntax is of the form: [[A0,0, A0,1, ...]; [A1,0, A1,1, ...];...]
The semicolon is used to separate the rows of a matrix, so it is not present in theassignment of a matrix that has only one row. On the other hand, the comma is usedto separate the columns of a matrix, which on the other hand will not be present in the assignment of a matrix that has only one column.
Now my question is : how can I parse a matrix with a variable size ? I guess with recursive rule in ply yacc, but how? Every attempt lead to a infinite recursion.
This is the error I get :
WARNING: Symbol 'primary' is unreachable
WARNING: Symbol 'expr' is unreachable
ERROR: Infinite recursion detected for symbol 'primary'
ERROR: Infinite recursion detected for symbol 'expr'
When I thry this kind of code:
def p_test(t):
'''primary : constant
| LPAREN expr RPAREN'''
print('yo')
t[0] = 1
def p_expression_matrice(t):
'''expr : primary
| primary '+' primary'''
print('hey')
t[0] = 1
(this i just a first attempt to understand how to write recurion in yacc, not even an answer to my real problem)
This is my lexer :
from global_variables import tokens
import ply.lex as lex
t_PLUS = r'\+'
t_MINUS = r'\-'
t_TIMES = r'\*'
t_DIVIDE = r'\/'
t_MODULO = r'\%'
t_EQUALS = r'\='
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_LBRACK = r'\['
t_RBRACK = r'\]'
t_SEMICOLON = r'\;'
t_COMMA = r'\,'
t_POWER = r'\^'
t_QUESTION = r'\?'
t_NAME = r'[a-zA-Z]{2,}|[a-hj-zA-HJ-Z]' # all words (only letters) except the word 'i' alone
t_COMMAND = r'![\x00-\x7F]*' # all unicode characters after '!'
def t_NUMBER(t):
r'\d+(\.\d+)?'
try:
t.value = int(t.value)
except:
t.value = float(t.value)
return t
def t_IMAGINE(t):
r'i'
t.value = 1j
return t
t_ignore = " \t"
def t_error(t):
print("Illegal character '%s'" % t.value[0])
t.lexer.skip(1)
lexer = lex.lex()
After multiple tries, I got this :
def p_test3(t):
'''expression : expression SEMICOLON expression'''
t[0] = []
t[0].append(t[1])
t[0].append(t[3])
def p_test2(t):
'''expression : LBRACK expression RBRACK'''
t[0] = t[2]
def p_test(t):
'''expression : expression COMMA NUMBER'''
t[0] = []
try:
for i in t[1]:
t[0].append(i)
except:
t[0].append(t[1])
t[0].append(t[3])
Which alllows me to parse this : [[1,2,3,4,5];[5,4,3,2,1]] to get this : [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]] stored in a array of array.
Next step, I will store many of these in a dict and I'ill keep going, thanks for the support #rici :)

LPeg Increment for Each Match

I'm making a serialization library for Lua, and I'm using LPeg to parse the string. I've got K/V pairs working (with the key explicitly named), but now I'm going to add auto-indexing.
It'll work like so:
#"value"
#"value2"
Will evaluate to
{
[1] = "value"
[2] = "value2"
}
I've already got the value matching working (strings, tables, numbers, and Booleans all work perfectly), so I don't need help with that; what I'm looking for is the indexing. For each match of #[value pattern], it should capture the number of #[value pattern]'s found - in other words, I can match a sequence of values ("#"value1" #"value2") but I don't know how to assign them indexes according to the number of matches. If that's not clear enough, just comment and I'll attempt to explain it better.
Here's something of what my current pattern looks like (using compressed notation):
local process = {} -- Process a captured value
process.number = tonumber
process.string = function(s) return s:sub(2, -2) end -- Strip of opening and closing tags
process.boolean = function(s) if s == "true" then return true else return false end
number = [decimal number, scientific notation] / process.number
string = [double or single quoted string, supports escaped quotation characters] / process.string
boolean = P("true") + "false" / process.boolean
table = [balanced brackets] / [parse the table]
type = number + string + boolean + table
at_notation = (P("#") * whitespace * type) / [creates a table that includes the key and value]
As you can see in the last line of code, I've got a function that does this:
k,v matched in the pattern
-- turns into --
{k, v}
-- which is then added into an "entry table" (I loop through it and add it into the return table)
Based on what you've described so far, you should be able to accomplish this using a simple capture and table capture.
Here's a simplified example I knocked up to illustrate:
lpeg = require 'lpeg'
l = lpeg.locale(lpeg)
whitesp = l.space ^ 0
bool_val = (l.P "true" + "false") / function (s) return s == "true" end
num_val = l.digit ^ 1 / tonumber
string_val = '"' * l.C(l.alnum ^ 1) * '"'
val = bool_val + num_val + string_val
at_notation = l.Ct( (l.P "#" * whitesp * val * whitesp) ^ 0 )
local testdata = [[
#"value1"
#42
# "value2"
#true
]]
local res = l.match(at_notation, testdata)
The match returns a table containing the contents:
{
[1] = "value1",
[2] = 42,
[3] = "value2",
[4] = true
}

How Lua tables work

I am starting to learn Lua from Programming in Lua (2nd edition)
I didn't understand the following in the book. Its very vaguely explained.
a.) w={x=0,y=0,label="console"}
b.) x={math.sin(0),math.sin(1),math.sin(2)}
c.) w[1]="another field"
d.) x.f=w
e.) print (w["x"])
f.) print (w[1])
g.) print x.f[1]
When I do print(w[1]) after a.), why doesn't it print x=0
What does c.) do?
What is the difference between e.) and print (w.x)?
What is the role of b.) and g.)?
You have to realize that this:
t = {3, 4, "eggplant"}
is the same as this:
t = {}
t[1] = 3
t[2] = 4
t[3] = "eggplant"
And that this:
t = {x = 0, y = 2}
is the same as this:
t = {}
t["x"] = 0
t["y"] = 2
Or this:
t = {}
t.x = 0
t.y = 2
In Lua, tables are not just lists, they are associative arrays.
When you print w[1], then what really matters is line c.) In fact, w[1] is not defined at all until line c.).
There is no difference between e.) and print (w.x).
b.) creates a new table named x which is separate from w.
d.) places a reference to w inside of x. (NOTE: It does not actually make a copy of w, just a reference. If you've ever worked with pointers, it's similar.)
g.) Can be broken up in two parts. First we get x.f which is just another way to refer to w because of line d.). Then we look up the first element of that table, which is "another field" because of line c.)
There's another way of creating keys in in-line table declarations.
x = {["1st key has spaces!"] = 1}
The advantage here is that you can have keys with spaces and any extended ASCII character.
In fact, a key can be literally anything, even an instanced object.
function Example()
--example function
end
x = {[Example] = "A function."}
Any variable or value or data can go into the square brackets to work as a key. The same goes with the value.
Practically, this can replace features like the in keyword in python, as you can index the table by values to check if they are there.
Getting a value at an undefined part of the table will not cause an error. It will just give you nil. The same goes for using undefined variables.
local w = {
--[1] = "another field"; -- will be set this value
--["1"] = nil; -- not save to this place, different with some other language
x = 0;
y = 0;
label = "console";
}
local x = {
math.sin(0);
math.sin(1);
math.sin(2);
}
w[1] = "another field" --
x.f = w
print (w["x"])
-- because x.f = w
-- x.f and w point one talbe address
-- so value of (x.f)[1] and w[1] and x.f[1] is equal
print (w[1])
print ((x.f)[1])
print (x.f[1])
-- print (x.f)[1] this not follows lua syntax
-- only a function's has one param and type of is a string
-- you can use print "xxxx"
-- so you print x.f[1] will occuur error
-- in table you can use any lua internal type 's value to be a key
-- just like
local t_key = {v=123}
local f_key = function () print("f123") end
local t = {}
t[t_key] = 1
t[f_key] = 2
-- then t' key actualy like use t_key/f_key 's handle
-- when you user t[{}] = 123,
-- value 123 related to this no name table {} 's handle

Resources