Using LPeg to only capture on word boundaries

I've been working on a text editor that uses LPeg to implement syntax highlighting support. Getting things up and running was pretty simple, but I've only done the minimum required.
I've defined a bunch of patterns like this:
-- Keywords
local keyword = C(
P"auto" +
P"break" +
P"case" +
P"char" +
P"int"
-- more ..
) / function( ... ) add_syntax( RED, ... ) end
This correctly handles input, but unfortunately matches too much. For example int matches in the middle of printf, which is expected because I'm using "P" for a literal match.
Obviously to perform "proper" highlighting I need to match on word-boundaries, such that "int" matches "int", but not "printf", "vsprintf", etc.
I tried to use this to limit the match to only occurring after "<[{ \n", but this didn't do what I want:
-- space, newline, comma, brackets followed by the keyword
S(" \n(<{,")^1 * P"auto" +
Is there a simple, obvious, solution I'm missing here to match only keywords/tokens that are surrounded by whitespace or other characters that you'd expect in C-code? I do need the captured token so I can highlight it, but otherwise I'm not married to any particular approach.
e.g. These should match:
int foo;
void(int argc,std::list<int,int> ) { .. };
But this should not:
fprintf(stderr, "blah. patterns are hard\n");

The LPeg construction -pattern (or more specifically -idchar in the following example) does a good job of making sure that the current match is not followed by pattern (i.e. idchar). Luckily this also succeeds when nothing is left at the end of the input, so we don't need special handling for that. For making sure that a match is not preceded by a pattern, LPeg provides lpeg.B(pattern). Unfortunately, this requires a pattern that matches a fixed-length string, so lpeg.B( 1 - idchar ) always needs one preceding character and cannot succeed at the very beginning of the input. To fix that, the following code first tries to match without lpeg.B() at the beginning of the input, and then falls back to a pattern that checks both the preceding and the following character for the rest of the string:
local L = require( "lpeg" )
local function decorate( word )
-- highlighting in UNIX terminals
return "\27[32;1m"..word.."\27[0m"
end
-- matches characters that may be part of an identifier
local idchar = L.R( "az", "AZ", "09" ) + L.P"_"
-- list of keywords to be highlighted
local keywords = L.C( L.P"in" +
L.P"for" )
local function highlight( s )
local p = L.P{
(L.V"nosuffix" + "") * (L.V"exactmatch" + 1)^0,
nosuffix = (keywords / decorate) * -idchar,
exactmatch = L.B( 1 - idchar ) * L.V"nosuffix",
}
return L.match( L.Cs( p ), s )
end
-- tests:
print( highlight"" )
print( highlight"hello world" )
print( highlight"in 0in int for xfor for_ |for| in" )

I think you should negate the matching pattern similar to how it's done in the example from the documentation:
If we want to look for a pattern only at word boundaries, we can use the following transformer:
local t = lpeg.locale()
function atwordboundary (p)
return lpeg.P{
[1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
}
end
This SO answer also discusses a somewhat similar solution, so it may be of interest.
There is also another editor component that uses LPeg for parsing with the purpose of syntax highlighting, so you may want to look at how they handle this (or use their lexers if it works for your design).
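For comparison, here is a rough sketch of the same word-boundary idea in plain Lua patterns rather than LPeg, using the frontier pattern %f (available in Lua 5.2 and later). This is a different technique from the LPeg answers above; the keyword table and the bracket-style decorate function are hypothetical stand-ins for whatever the editor really uses.

```lua
-- Hypothetical keyword set and decorator, for illustration only.
local keywords = { ["in"] = true, ["for"] = true }

local function decorate(word)
  return "[" .. word .. "]"
end

local function highlight(s)
  -- %f[%w_] succeeds only at a position where an identifier starts;
  -- [%a_][%w_]* then consumes the whole identifier greedily, so a
  -- keyword can never match as a prefix or suffix of a longer name.
  local result = s:gsub("%f[%w_]([%a_][%w_]*)", function(word)
    -- returning nil tells gsub to keep the original text
    return keywords[word] and decorate(word) or nil
  end)
  return result
end

print(highlight("in 0in int for xfor for_ |for| in"))
```

Only whole identifiers are looked up in the table, so int, xfor and for_ are left alone while |for| is still highlighted.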

Related

Match string until a character or end

I'm trying to match and return the string between two characters (or, if there's no closing character, from the opening character until the end of the string).
local id = "+#a-#s,#n";
local addOperator = string.match(id, "^[+](.+)(?[-])"); -- should return "#a"
if (addOperator) then
-- ...
end
local removeOperator = string.match(id, "^[-](.+)(?[+])"); -- should return "#s,#n"
if (removeOperator) then
-- ...
end
-- Or without the excluding operator "-"
local id = "+#a";
local addOperator = string.match(id, "^[+](.+)(?[-])"); -- should return "#a", with my pattern it returns a nil.
if (addOperator) then
-- ...
end
? should come after the character you want to match zero or one of.
You also cannot use .+ followed by an optional character (char?) and expect the ? to restrict what .+ matches: .+ is greedy and will consume that character too.
I suggest using a set that excludes -. Additionally, you use [+] but should be using %+; % is how you escape a special character in a pattern. Using [+] to escape is not functionally wrong, it just comes off as odd or non-idiomatic in Lua.
local id = "+#a-#s,#n"
print(string.match(id, "^%+([^-]+)"))
print(string.match(id, "%-(.+)"))
id = "+#a"
print(string.match(id, "^%+([^-]+)"))
This is a good resource for understanding Lua patterns: Understanding Lua Patterns
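The point about .+ can be checked directly. The sketch below is only an illustration using the id string from the question; the variable names are my own.

```lua
local id = "+#a-#s,#n"

-- Greedy ".+" swallows the "-" as well; the optional "%-?" then matches nothing.
local greedy = string.match(id, "^%+(.+)%-?")
print(greedy)        -- the whole remainder, "#a-#s,#n"

-- A negated set stops at the first "-" (or at the end of the string).
local add = string.match(id, "^%+([^-]+)")
local remove = string.match(id, "%-(.+)")
print(add, remove)   -- "#a" and "#s,#n"

-- With no "-" present, the same pattern simply runs to the end.
local add_only = string.match("+#a", "^%+([^-]+)")
print(add_only)      -- "#a"
```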

How to capture a string between signs in lua?

how can I extract a few words separated by symbols in a string so that nothing is extracted if the symbols change?
for example I wrote this code:
function split(str)
result = {};
for match in string.gmatch(str, "[^%<%|:%,%FS:%>,%s]+" ) do
table.insert(result, match);
end
return result
end
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
my_status={}
status=split(str)
for key, value in pairs(status) do
table.insert(my_status,value)
end
print(my_status[1]) --
print(my_status[2]) --
print(my_status[3]) --
print(my_status[4]) --
print(my_status[5]) --
print(my_status[6]) --
print(my_status[7]) --
output :
busy
MPos
-750.222
900.853
1450.808
2
10
This code works fine, but if the characters and text in the str string change, the extraction is still done, which is not what I want.
If the string change to
str = "Hello stack overFlow"
Output:
Hello
stack
over
low
nil
nil
nil
In other words, I only want to extract if the string is in this format: "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
In Lua patterns, you can use captures, which are perfect for things like this. I use something like the following:
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
local status, mpos1, mpos2, mpos3, fs1, fs2 = string.match(str, "%<(%w+)%|MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)%|FS:(%d+),(%d+)%>")
print(status, mpos1, mpos2, mpos3, fs1, fs2)
I use string.match, not string.gmatch here, because we don't have an arbitrary number of entries (if that were the case, you would need a different approach). Let's break down the pattern: all captures are surrounded by parentheses () and get returned, so there are as many return values as captures. The individual captures are:
the status flag (or whatever that is): busy is a simple word, so we can use the %w character class (alphanumeric characters, maybe %a, only letters would also do). Then apply the + operator (you already know that one). The + is within the capture
the three numbers for the MPos entry each get (%--%d+%.%d+), which looks weird at first. I put % in front of any non-alphanumeric character, since that turns magic characters (such as +) into normal ones. - is a magic character, so %- is required here to match a literal -; Lua allows % in front of any non-alphanumeric character, which I make use of. The minus sign is optional, so the capture starts with %--, which is zero or more repetitions (the - operator, matching as few as possible) of a literal - (%-); %-? would be stricter, allowing at most one minus sign. Then I just match two integers separated by a dot (%d is a digit, %. matches a literal dot). We do this three times, separated by commas (which I don't escape, since a comma is not a magic character).
the last entry (FS) works practically the same as the MPos entry
all entries are separated by |, which I simply match with %|
So putting it together:
start of string: %<
status field: (%w+)
separator: %|
MPos (three numbers): MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)
separator: %|
FS entry (two integers): FS:(%d+),(%d+)
end of string: %>
With this approach you have the data in local variables with sensible names, which you can then put into a table (for example).
If the match fails (for instance, when you use "Hello stack overFlow"), nil is returned, which can simply be checked for (you could check any of the local variables, but it is common to check the first one).
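Putting the last two paragraphs together, a minimal sketch might wrap the match in a function that returns a table or nil. The name parse_status and the table layout are hypothetical. The sketch also anchors the pattern with ^ and $, uses %-? for the optional minus sign, and leaves <, | and > unescaped because they are not magic in Lua patterns; those are my adjustments, not part of the pattern above.

```lua
-- parse_status is a hypothetical helper name used for illustration.
local function parse_status(str)
  local status, x, y, z, fs1, fs2 = string.match(str,
    "^<(%w+)|MPos:(%-?%d+%.%d+),(%-?%d+%.%d+),(%-?%d+%.%d+)|FS:(%d+),(%d+)>$")
  if not status then
    return nil  -- the string is not in the expected format
  end
  return {
    status = status,
    mpos = { tonumber(x), tonumber(y), tonumber(z) },
    fs = { tonumber(fs1), tonumber(fs2) },
  }
end

local r = parse_status("<busy|MPos:-750.222,900.853,1450.808|FS:2,10>")
print(r.status, r.mpos[1], r.fs[2])
print(parse_status("Hello stack overFlow"))  -- nil
```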

lua match repeating pattern

I need to encapsulate in some way pattern in lua pattern matching to find whole sequence of this pattern in string. What do I mean by that.
For example we have string like that:
"word1,word2,word3,,word4,word5,word6, word7,"
I need to match the first sequence of words followed by a comma (word1,word2,word3,)
In Python I would use the pattern "(\w+,)+", but a similar pattern in Lua (like (%w+,)+) just returns nil, because parentheses in Lua patterns mean a completely different thing.
I hope now you see my problem.
Is there a way to do repeating patterns in lua?
Your example wasn't too clear in terms of what should happen to word4,word5,word6 and word7,.
This would give you every sequence of comma-separated words without whitespace or empty positions.
local text = "word1,word2,word3,,word4,word5,word6, word7,"
-- replace any comma followed by whitespace or further commas
-- by a comma and a single space
text = text:gsub(",[%s,]+", ", ")
-- then match any sequence of >=1 non-whitespace characters
for sequence in text:gmatch("%S+,") do
print(sequence)
end
Prints
word1,word2,word3,
word4,word5,word6,
word7,
You could do this easily using LPeg if that's available to you:
local lpeg = require "lpeg"
local str = "word1,word2,word3,,word4,word5,word6, word7,"
local word = (lpeg.R"az"+lpeg.R"AZ"+lpeg.R"09") ^ 1
local sequence = lpeg.C((word * ",") ^1)
print(sequence:match(str))
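If LPeg is not available, the repetition in (%w+,)+ can be emulated with a small loop, since Lua patterns cannot apply a quantifier to a group. The following is a sketch under that assumption; leading_sequence is a hypothetical name. It relies on string.find anchoring a pattern that starts with ^ at the init position.

```lua
-- leading_sequence is a hypothetical helper name used for illustration.
local function leading_sequence(s)
  local pos = 1
  while true do
    -- "^" anchors the match at pos, so each step must start
    -- exactly where the previous "word," ended
    local _, finish = s:find("^%w+,", pos)
    if not finish then break end
    pos = finish + 1
  end
  if pos == 1 then
    return nil  -- the string does not start with "word,"
  end
  return s:sub(1, pos - 1)
end

local str = "word1,word2,word3,,word4,word5,word6, word7,"
print(leading_sequence(str))
```

The loop stops at the doubled comma because %w+ needs at least one alphanumeric character there.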

Pattern not matching *(%(*.%))

I'm trying to learn how patterns (implemented in string.gmatch, etc.) do work in Lua 5.3, from the reference manual.
(Thanks @greatwolf for correcting my interpretation about the pattern item using *.)
What I'm trying to do is to match '(%(.*%))*' (substrings enclosed by ( and ); for example, '(grouped (etc))'), so that it logs
(grouped (etc))
(etc)
or
grouped (etc)
etc
But it does nothing 😐 (online compiler).
local test = '(grouped (etc))'
for sub in test:gmatch '(%(.*%))*' do
print(sub)
end
Another possibility -- using recursion:
function show(s)
for s in s:gmatch '%b()' do
print(s)
show(s:sub(2,-2))
end
end
show '(grouped (etc))'
I don't think you can do this with gmatch but using %b() along with the while loop may work:
local pos, _, sub = 0
while true do
pos, _, sub = ('(grouped (etc))'):find('(%b())', pos+1)
if not sub then break end
print(sub)
end
This prints your expected results for me.
local test = '(grouped (etc))'
print( test:match '.+%((.-)%)' )
Here:
.+%( consumes as many characters as possible up to and including the last opening bracket; %( is just the escaped bracket.
(.-)%) then captures the shortest substring up to the first closing bracket %).
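As far as I can tell, the original pattern prints nothing because Lua patterns cannot apply a quantifier to a capture: the * after the closing parenthesis is treated as a literal asterisk. The sketch below illustrates that and then gathers the nested groups into a table with %b(); collect_groups is a hypothetical helper name.

```lua
-- The trailing "*" must match a literal asterisk in the subject.
print(('(grouped (etc))'):match('(%(.*%))*'))  -- nil
print(('(a)*'):match('(%(.*%))*'))             -- (a)

-- collect_groups is a hypothetical helper: it gathers every balanced
-- (...) group, outermost first, by recursing into each group's contents.
local function collect_groups(s, out)
  out = out or {}
  for group in s:gmatch('%b()') do
    out[#out + 1] = group
    collect_groups(group:sub(2, -2), out)
  end
  return out
end

local groups = collect_groups('(grouped (etc))')
print(table.concat(groups, '\n'))
```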

Why does this return the same index?

I want to run two different lua string find on the same string " (55)"
Pattern 1 "[^%w_](%d+)", should match any number
Pattern 2 "[%(|%)|%%|%+|%=|%-|%{%|%}|%,|%:|%*|%^]", should match any of these ( ) % + = - { } , : * ^ characters.
Both of these patterns return 2, why? Also, if I run a string match, they return ( and 55 respectively (as expected).
It seems you are using the patterns with string.find that finds the first occurrence of the pattern in the string passed. If an instance of the pattern is found a pair of values representing the start and end of the string is returned. If the pattern cannot be found nil is returned.
Both patterns find a match at Position 2: [^%w_](%d+) finds ( because it is matched with [^%w_] (a char other than letter, digit or _), and [%(|%)|%%|%+|%=|%-|%{%|%}|%,|%:|%*|%^] matches the ( because it is part of the character set.
However, the first pattern can be re-written using a frontier pattern, %f[%w_]%d+, that will match 1+ digits if not preceded with letters, digits or underscore, and the second pattern does not require such heavy escaping, [()%%+={},:*^-] is enough (only % needs escaping here, as the - is placed at the end of the character set and is thus treated as a literal hyphen).
See this Lua demo:
a = " (55)"
for word in string.gmatch(a, "%f[%w_]%d+") do print(word) end
-- 55
for word in string.gmatch(a, "[()%%+={},:*^-]+") do print(word) end
-- (, )
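To make the index question concrete, the following sketch prints what string.find returns for each pattern on the same input; the variable names are my own.

```lua
local s = " (55)"

-- Original pattern 1: the match starts at the "(" (index 2),
-- even though the capture is only the digits.
local i1, j1, digits = s:find("[^%w_](%d+)")
print(i1, j1, digits)

-- Pattern 2 (simplified set): the first listed character is also the "(".
local i2, j2 = s:find("[()%%+={},:*^-]")
print(i2, j2)

-- Frontier version: the match itself now starts at the first digit.
local i3, j3 = s:find("%f[%w_]%d+")
print(i3, j3)
```

The start index reported by string.find refers to the whole match, not to the capture, which is why the first pattern reports 2 rather than 3.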

Resources