Match string until a character or end - lua

I'm trying to match and return the string between two characters (or, if there's no closing character, from the opening character to the end of the string).
local id = "+#a-#s,#n";
local addOperator = string.match(id, "^[+](.+)(?[-])"); -- should return "#a"
if (addOperator) then
-- ...
end
local removeOperator = string.match(id, "^[-](.+)(?[+])"); -- should return "#s,#n"
if (removeOperator) then
-- ...
end
-- Or without the excluding operator "-"
local id = "+#a";
local addOperator = string.match(id, "^[+](.+)(?[-])"); -- should return "#a", with my pattern it returns a nil.
if (addOperator) then
-- ...
end

? must come directly after the single character (or character class) that you want to match zero or one times. As written, (?[-]) captures a literal ? followed by a -.
You also cannot follow a greedy .+ with an optional item and expect that item to limit what .+ consumes; .+ simply runs to the end of the string.
I suggest using a set that excludes -. Additionally, you use [+] where %+ is the usual choice: % is how you escape a magic character in a pattern. Escaping with [+] is not functionally wrong; it just reads as non-idiomatic in Lua.
local id = "+#a-#s,#n"
print(string.match(id, "^%+([^-]+)"))
print(string.match(id, "%-(.+)"))
id = "+#a"
print(string.match(id, "^%+([^-]+)"))
This is a good resource for understanding Lua patterns: Understanding Lua Patterns
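As a minimal sketch of how the two patterns above could be combined, the following wraps them in one helper. The function name split_operators is hypothetical (it is not from the question), and the sketch assumes the "+" part always comes first and the "-" part, if any, runs to the end of the string.

```lua
-- Hypothetical helper: returns the "+" part and the "-" part of an id.
-- Either result is nil when that part is absent.
local function split_operators(id)
  -- after a leading "+", capture everything up to (not including) the first "-"
  local add = id:match("^%+([^-]+)")
  -- after the first "-", capture everything up to the end of the string
  local remove = id:match("%-(.+)$")
  return add, remove
end

print(split_operators("+#a-#s,#n"))
print(split_operators("+#a"))
```

Because [^-]+ stops either at a - or at the end of the string, the same pattern covers both the "closing character present" and "no closing character" cases.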

Related

Pattern not matching *(%(*.%))

I'm trying to learn how patterns (implemented in string.gmatch, etc.) do work in Lua 5.3, from the reference manual.
(Thanks #greatwolf for correcting my interpretation about the pattern item using *.)
What I'm trying to do is to match '(%(.*%))*' (substrings enclosed by ( and ); for example, '(grouped (etc))'), so that it logs
(grouped (etc))
(etc)
or
grouped (etc)
etc
But it does nothing 😐 (online compiler).
local test = '(grouped (etc))'
for sub in test:gmatch '(%(.*%))*' do
print(sub)
end
Another possibility -- using recursion:
function show(s)
  for s in s:gmatch '%b()' do
    print(s)
    show(s:sub(2,-2))
  end
end
show '(grouped (etc))'
I don't think you can do this with gmatch but using %b() along with the while loop may work:
local pos, _, sub = 0
while true do
  pos, _, sub = ('(grouped (etc))'):find('(%b())', pos+1)
  if not sub then break end
  print(sub)
end
This prints your expected results for me.
local test = '(grouped (etc))'
print( test:match '.+%((.-)%)' )
Here:
.+%( matches as many characters as possible up to and including the last opening bracket; %( is just an escaped bracket.
(.-)%) then captures the shortest substring up to the first closing bracket %), so this prints only the innermost group, etc.
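One likely reason the original loop prints nothing: in Lua patterns a quantifier applies only to a single character class, so the * after the closing ) of the capture is treated as a literal asterisk. The sketch below illustrates that and collects the nested groups with %b(); the helper name groups is hypothetical.

```lua
local test = "(grouped (etc))"

-- The trailing '*' is a literal asterisk here, so the subject must contain one.
local no_match = test:match("(%(.*%))*")
local literal_star = ("(x)*"):match("(%(.*%))*")

-- Hypothetical helper: collect every balanced (...) group, outermost first.
local function groups(s, out)
  out = out or {}
  for g in s:gmatch("%b()") do
    out[#out + 1] = g
    -- strip the outer brackets and look for groups nested inside
    groups(g:sub(2, -2), out)
  end
  return out
end

for _, g in ipairs(groups(test)) do
  print(g)
end
```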

Why does this return the same index?

I want to run two different Lua string.find calls on the same string " (55)"
Pattern 1 "[^%w_](%d+)", should match any number
Pattern 2 "[%(|%)|%%|%+|%=|%-|%{%|%}|%,|%:|%*|%^]", should match any of these ( ) % + = - { } , : * ^ characters.
Both of these patterns return 2, why? Also, if I run string.match instead, they return 55 and ( respectively (as expected).
It seems you are using the patterns with string.find, which finds the first occurrence of the pattern in the string passed. If a match is found, the start and end indices of the match are returned (followed by any captures). If the pattern cannot be found, nil is returned.
Both patterns find a match at position 2: [^%w_](%d+) starts at ( because ( is matched by [^%w_] (a char other than a letter, digit or _), and [%(|%)|%%|%+|%=|%-|%{%|%}|%,|%:|%*|%^] matches the ( because it is part of the character set.
However, the first pattern can be rewritten using a frontier pattern, %f[%w_]%d+, that will match 1+ digits if not preceded by letters, digits or an underscore, and the second pattern does not require such heavy escaping: [()%%+={},:*^-] is enough (only % needs escaping here, as the - is placed at the end of the character set and is thus treated as a literal hyphen; the | characters in the original set are literal pipes, not alternation).
See this Lua demo:
a = " (55)"
for word in string.gmatch(a, "%f[%w_]%d+") do print(word) end
-- 55
for word in string.gmatch(a, "[()%%+={},:*^-]+") do print(word) end
-- (, )
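To make the index question concrete, here is a small sketch that captures what string.find returns for each pattern on the same subject. The variable names are only illustrative.

```lua
local s = " (55)"

-- The match starts at the non-word char "(", so the start index is 2,
-- even though the capture itself ("55") begins at index 3.
local start1, stop1, digits = s:find("[^%w_](%d+)")

-- The simplified set also matches the "(" at index 2.
local start2, stop2 = s:find("[()%%+={},:*^-]")

-- With the frontier pattern the match is the digits themselves.
local start3, stop3 = s:find("%f[%w_]%d+")

print(start1, stop1, digits)
print(start2, stop2)
print(start3, stop3)
```

string.find reports where the whole match begins, not where a capture begins, which is why the first pattern reports the same start index as the second.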

Using lpeg to only capture on word boundaries

I've been working on a text editor that uses LPEG to implement syntax highlighting support. Getting things up and running was pretty simple, but I've only done the minimum required.
I've defined a bunch of patterns like this:
-- Keywords
local keyword = C(
  P"auto" +
  P"break" +
  P"case" +
  P"char" +
  P"int"
  -- more ..
) / function(...) add_syntax( RED, ... ) end
This correctly handles input, but unfortunately matches too much. For example int matches in the middle of printf, which is expected because I'm using "P" for a literal match.
Obviously to perform "proper" highlighting I need to match on word-boundaries, such that "int" matches "int", but not "printf", "vsprintf", etc.
I tried to use this to limit the match to only occurring after "<[{ \n", but this didn't do what I want:
-- space, newline, comma, brackets followed by the keyword
S(" \n(<{,")^1 * P"auto" +
Is there a simple, obvious, solution I'm missing here to match only keywords/tokens that are surrounded by whitespace or other characters that you'd expect in C-code? I do need the captured token so I can highlight it, but otherwise I'm not married to any particular approach.
e.g. These should match:
int foo;
void(int argc,std::list<int,int> ) { .. };
But this should not:
fprintf(stderr, "blah. patterns are hard\n");
The LPeg construction -pattern (or more specifically -idchar in the following example) does a good job of making sure that the current match is not followed by pattern (i.e. idchar). Luckily this also works for empty strings at the end of the input, so we don't need special handling for that. For making sure that a match is not preceded by a pattern, LPeg provides lpeg.B(pattern). Unfortunately, this requires a pattern that matches a fixed length string, and so won't work at the beginning of the input. To fix that the following code separately tries to match without lpeg.B() at the beginning of the input before falling back to a pattern that checks suffixes and prefixes for the rest of the string:
local L = require( "lpeg" )

local function decorate( word )
  -- highlighting in UNIX terminals
  return "\27[32;1m"..word.."\27[0m"
end

-- matches characters that may be part of an identifier
local idchar = L.R( "az", "AZ", "09" ) + L.P"_"
-- list of keywords to be highlighted
local keywords = L.C( L.P"in" +
                      L.P"for" )

local function highlight( s )
  local p = L.P{
    (L.V"nosuffix" + "") * (L.V"exactmatch" + 1)^0,
    nosuffix = (keywords / decorate) * -idchar,
    exactmatch = L.B( 1 - idchar ) * L.V"nosuffix",
  }
  return L.match( L.Cs( p ), s )
end

-- tests:
print( highlight"" )
print( highlight"hello world" )
print( highlight"in 0in int for xfor for_ |for| in" )
I think you should negate the matching pattern similar to how it's done in the example from the documentation:
If we want to look for a pattern only at word boundaries, we can use the following transformer:
local t = lpeg.locale()

function atwordboundary (p)
  return lpeg.P{
    [1] = p + t.alpha^0 * (1 - t.alpha)^1 * lpeg.V(1)
  }
end
This SO answer also discusses a somewhat similar solution, so it may be of interest.
There is also another editor component that uses LPeg for parsing with the purpose of syntax highlighting, so you may want to look at how they handle this (or use their lexers if it works for your design).

Check if variable matches pattern, LUA

So if I have the variable name. How can I check if it matches a pattern?
For instance, I would like to check if the variable name equals a pattern like text_text. So words at the beginning then underscore then words. With no numbers.
I really have no idea where to even start with this.
local pattern = something
if name == pattern then
UPDATE** I have tried the following, still nothing is working.
local pattern = "a%%sa%"
if string.match (name, pattern) then
return 1
else
return 0
end
Also tried this way
local pattern = "a%_a%"
if string.match (name, pattern) then
return 1
else
return 0
end
Can I please get some help
A pattern that matches any combination of lowercase letters joined by an underscore would be:
"%l+_%l+"
%l matches any lowercase letter,
and the + after it means one or more repetitions of the item it follows.
So %l+ means "at least one lowercase letter".
_ is simply an underscore.
So the pattern "%l+_%l+" means "at least one lowercase letter, followed by one underscore, followed by at least one lowercase letter".
Please refer to https://www.lua.org/pil/20.2.html and/or https://www.lua.org/manual/5.3/manual.html#6.4.1 for all the building blocks you can construct patterns from.
You can use those patterns in the functions provided by the string standard library: string.find, string.gsub and so on.
If you want to use this kind of matching more extensively, check out LPeg, for example.
Try this:
local function match(name)
  -- true only when the whole string is letters, an underscore, then letters
  return name:match('^%a+_%a+$') ~= nil
end
print(match("hi I'm webrom")) -- false
print(match('14545_15456'))   -- false
print(match('hello_world'))   -- true
print(match('HELLO_WORLD'))   -- true
The pattern %a matches any letter.
In Lua string patterns,
anything between parentheses is captured and returned.
A % followed by a letter is a character class; for example, %a matches ONE letter.
Adding + after %a makes it match as many letters as possible (at least one) before the next character or set of characters in the pattern.
For example:
s = "Hello Julia my love ! Here's a hug!"
lovername,gift = string.match(s,"(%a+) my love ! Here's a (%a+)")
print(lovername.." was given a "..gift)
In your example, you can do this:
var = "hello_world"
if (string.match(var, "%a+_%a+")) then -- _ is not a magic character, so it matches literally
  print("Valid variable name !")
else
  print("Invalid variable name")
end
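One caveat about unanchored patterns such as "%a+_%a+": string.match succeeds if any part of the string fits, so a name containing digits can still pass. A minimal sketch of a whole-string check follows, assuming the requirement is exactly "letters, one underscore, letters"; the function name is_text_text is hypothetical.

```lua
-- Hypothetical helper: true only if the entire name is letters_letters.
local function is_text_text(name)
  -- ^ anchors at the start of the string, $ at the end
  return name:match("^%a+_%a+$") ~= nil
end

-- The unanchored pattern matches just a substring of this name.
local partial = ("abc_def123"):match("%a+_%a+")

print(is_text_text("hello_world"))
print(is_text_text("abc_def123"))
print(partial)
```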

Pattern ^u.meta(\.|$) not working as expected

I have this pattern:
^u.meta(\.|$)
EXPECTED BEHAVIOUR
^u.meta(\.|$) will match all the roles like:
u.meta
u.meta.admin
u.meta.admin.system
u.meta.*
Whereas it should not match something like the following:
u.meta_admin
u.meta_admin_system
I have tested this pattern with https://regex101.com/ online regexp tester.
PROBLEM:
I have to implement this pattern in a Lua script,
but I am getting invalid escape sequence near '\.':
-- lua script
> return string.match("u.meta.admin", '^u.meta(\.|$)')
stdin:1: invalid escape sequence near '\.'
And I tried adding double \\ as well as removing '\' escape char in that regexp but got nil in return:
-- lua script
> return string.match("u.meta.admin", '^u.meta(\\.|$)')
nil
> return string.match("u.meta.admin", '^u.meta(.|$)')
nil
See Lua regex docs:
The character % works as an escape for those magic characters.
Also, the (...|...) alternation is not supported in Lua. Instead, I guess, you need a word boundary here, like %f[set] frontier pattern:
%f[set], a frontier pattern; such item matches an empty string at any position such that the next character belongs to set and the previous character does not belong to set. The set set is interpreted as previously described. The beginning and the end of the subject are handled as if they were the character \0.
So, you can use
return string.match("u.meta.admin", '^u%.meta%f[%A]')
To only match at the end or before a .:
return string.match("u.meta", '^u%.meta%f[\0.]')
To match only if meta is not followed by a letter or an underscore, use the negated set [^%a_] as the frontier set (written %f[^%a_], without an extra pair of brackets):
return string.match("u.meta_admin", '^u%.meta%f[^%a_]')
See IDEONE demo to check the difference between the two expressions.
print(string.match("u.meta", '^u%.meta%f[\0.]')) -- u.meta
print(string.match("u.meta.admin", '^u%.meta%f[\0.]')) -- u.meta
print(string.match("u.meta-admin", '^u%.meta%f[\0.]')) -- nil
print(string.match("u.meta", '^u%.meta%f[%A]')) -- u.meta
print(string.match("u.meta.admin", '^u%.meta%f[%A]')) -- u.meta
print(string.match("u.meta-admin", '^u%.meta%f[%A]')) -- u.meta
-- To exclude a match if `u.meta` is followed by `_`:
print(string.match("u.meta_admin", '^u%.meta%f[^%a_]')) -- nil
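NOTE: In Lua 5.1 a pattern cannot contain an embedded \0, so to match the end of the string there, use %z instead (as #moteus noted in his comment) (see this reference). From Lua 5.2 on, %z is deprecated and \0 can be written in the pattern directly.
%z the character with representation 0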
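Since Lua patterns have no alternation, another way to express the intent of (\.|$) is to try two anchored patterns in turn and combine them with or. The sketch below assumes the rule is "u.meta followed by either a dot or the end of the string"; the function name is_meta_role is hypothetical.

```lua
-- Hypothetical helper emulating the regex ^u\.meta(\.|$) with two patterns.
local function is_meta_role(role)
  -- first alternative: "u.meta" followed by a literal dot
  -- second alternative: "u.meta" is the whole string
  return role:find("^u%.meta%.") ~= nil or role:find("^u%.meta$") ~= nil
end

for _, role in ipairs({ "u.meta", "u.meta.admin", "u.meta.*", "u.meta_admin" }) do
  print(role, is_meta_role(role))
end
```

This avoids \0 and %z entirely, so it does not depend on which Lua version handles embedded zeros in patterns.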
