a = "stackoverflow.com/questions/ask"
print(string.match(a,"(.*/)")) -- stackoverflow.com/questions/
print(string.match(a,"(.*/).*")) -- stackoverflow.com/questions/
I can't understand the second result. In my option it should be "stackoverflow.com/questions/ask" as "(.*/)" matches "stackoverflow.com/questions/" and ".*" matches "ask". Can someone tell me WHY the second result is "stackoverflow.com/questions/" ? Does x = string.match(a,"(.*/).*") and x = string.match(a,"(.*/)") are same?
the () means you have used Captures.so maybe you can use it like this:
print(string.match(a,"((.*/).*)"))
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture (and therefore has number 1); the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
Related
how can I extract a few words separated by symbols in a string so that nothing is extracted if the symbols change?
for example I wrote this code:
function split(str)
result = {};
for match in string.gmatch(str, "[^%<%|:%,%FS:%>,%s]+" ) do
table.insert(result, match);
end
return result
end
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
my_status={}
status=split(str)
for key, value in pairs(status) do
table.insert(my_status,value)
end
print(my_status[1]) --
print(my_status[2]) --
print(my_status[3]) --
print(my_status[4]) --
print(my_status[5]) --
print(my_status[6]) --
print(my_status[7]) --
output :
busy
MPos
-750.222
900.853
1450.808
2
10
This code works fine, but if the characters and text in the str string change, the extraction is still done, which I do not want to be.
If the string change to
str = "Hello stack overFlow"
Output:
Hello
stack
over
low
nil
nil
nil
In other words, I only want to extract if the string is in this format: "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
In lua patterns, you can use captures, which are perfect for things like this. I use something like the following:
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
local status, mpos1, mpos2, mpos3, fs1, fs2 = string.match(str, "%<(%w+)%|MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)%|FS:(%d+),(%d+)%>")
print(status, mpos1, mpos2, mpos3, fs1, fs2)
I use string.match, not string.gmatch here, because we don't have an arbitrary number of entries (if that is the case, you have to have a different approach). Let's break down the pattern: All captures are surrounded by parantheses () and get returned, so there are as many return values as captures. The individual captures are:
the status flag (or whatever that is): busy is a simple word, so we can use the %w character class (alphanumeric characters, maybe %a, only letters would also do). Then apply the + operator (you already know that one). The + is within the capture
the three numbers for the MPos entry each get (%--%d+%.%d+), which looks weird at first. I use % in front of any non-alphanumeric character, since it turns all magic characters (such as + into normal ones). - is a magic character, so it is required here to match a literal -, but lua allows to put that in front of any non-alphanumerical character, which I do. So the minus is optional, so the capture starts with %-- which is one or zero repetitions (- operator) of a literal - (%-). Then I just match two integers separated by a dot (%d is a digit, %. matches a literal dot). We do this three times, separated by a comma (which I don't escape since I'm sure it is not a magical character).
the last entry (FS) works practically the same as the MPos entry
all entries are separated by |, which I simply match with %|
So putting it together:
start of string: %<
status field: (%w+)
separator: %|
MPos (three numbers): MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)
separator: %|
FS entry (two integers): FS:(%d+),(%d+)
end of string: %>
With this approach you have the data in local variables with sensible names, which you can then put into a table (for example).
If the match failes (for instance, when you use "Hello stack overFlow"), nil` is returned, which can simply be checked for (you could check any of the local variables, but it is common to check the first one.
I have been trying to find all possible strings in between 2 strings
This is my input: "print/// to be able to put any amount of strings here endprint///"
The goal is to print every string in between print/// and endprint///
You can use Lua's string patterns to achieve that.
local text = "print/// to be able to put any amount of strings here endprint///"
print(text:match("print///(.*)endprint///"))
The pattern "print///(.*)endprint///" captures any character that is between "print///" and "endprint///"
Lua string patterns here
In this kind of problem, you don't use the greedy quantifiers * or +, instead, you use the lazy quantifier -. This is because * matches until the last occurrence of the sub-pattern after it, while - matches until the first occurence of the sub-pattern after it. So, you should use this pattern:
print///(.-)endprint///
And to match it in Lua, you do this:
local text = "print/// to be able to put any amount of strings here endprint///"
local match = text:match("print///(.-)endprint///")
-- `match` should now be the text in-between.
print(match) -- "to be able to put any amount of strings here "
I'm trying to match any strings that come in that follow the format Word 100.00% ~(45.56, 34.76) in LUA. As such, I'm looking to do a regex close (in theory) to this:
%D%s[%d%.%d]%%(%d.%d, %d.%d)
But I'm having no luck so far. LUA's patterns are weird.
What am I missing?
Your pattern is close you neglected to allow for multiple instances of a digit you can do this by using a + at like %d+.
You also did not use [,( and . correctly in the pattern.
[s in a pattern will create a set of chars that you are trying to match such as [abc] means you are looking to match any as bs or c at that position.
( are used to define a capture so the specific values you want returned rather then the whole string in the event of a match, in order to use it as a char you for the match you need to escape it with a %.
. will match any character rather then specifically a . you will need to add a % to escape if you want to match a . specifically.
local str = "Word 100.00% ~(45.56, 34.76)"
local pattern = "%w+%s%d+%.%d+%%%s~%(%d+%.%d+, %d+%.%d+%)"
print(string.match(str, pattern))
Here you will see the input string print if it matches the pattern otherwise you will see nil.
Suggested resource: Understanding Lua Patterns
I am writing a parse for a script language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have problem recognizing Integers and Floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since a "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an Integer ("123") ?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " (again, which has no effect because a set contains zero or one instances of each member), |, and the range "-", which is a range whose endpoints are the same character, and consequently only includes that character, ", which is already in the set. In short, that was equivalent to ["+|]. It will match one of those three characters. It requires one of those three characters, in fact.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [0-9]
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to just a it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. eg. "foo"? matches either the three characters f, o, o (in order) or matches the empty string. You can use that to make the sign optional.
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
I hope that gives you some ideas about how to fix things.
I've got a question on failure function description from "Compilers: Principles, Techniques, and Tools" aka DragonBook
Firstly, the quote:
In order to process text strings rapidly and search those strings for a keyword,
it is useful to define, for keyword b1b2...bn, and position s in that keyword , a failure function, f (s) ...
The objective is that b1b2.. - bf(s) is the longest proper prefix of
b1...bs, that is also a suffix of b1...bs. The reason f (s) is important is that
if we are trying to match a text string for blb2..bn, and we have matched the
first s positions, but we then fail (i.e., the next position of the text string does
not hold bs+l), then f (s) is the longest prefix of b1..bn that could possibly
match the text string up to the point we are at. Of course, the next character of
the text string must be bf(s)+1 or else we still have problems and must consider
a yet shorter prefix, which will be bf(f(s)).
So, the questions:
1. If we've matched s positions with the text, why f (s) is the longest prefix of b1..bn that matches the string? I think s - is the longest prefix.
2. Next character of the text string must be bf(s)+1, why? We have a mismatch at this position, does it matter at all what the char is?
f(s) is the longest prefix at that position that might match the entire keyword. The idea is not to try to match the keyword with the text from the start, but to find a position where the keyword appears.
Consider a search for the word 'aaaba' in the text 'aaaabaa'. The match fails after the three first a's, but it's not necessary to retry from the second 'a', since we know that if the next letter is a 'b' (which it is), we may have a match there.