This works.
"<name> <substring>"[/.*<([^>]*)/,1]
=> "substring"
But I want to extract the substring between [ and ].
input:
string = "123 [asd]"
output:
asd
Can anyone help me?
You can do:
"123 [asd]"[/\[(.*?)\]/, 1]
will return
"asd"
You can test it here:
https://rextester.com/YGZEA91495
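If the string may hold more than one bracketed group, the same non-greedy pattern can be handed to String#scan. This is only a sketch; the helper name bracketed is made up for illustration and is not part of the answer above.

```ruby
# Hypothetical helper: return every substring enclosed in [ ... ].
def bracketed(str)
  # scan yields one array per match (one element per capture group),
  # so flatten turns [["asd"], ["qwe"]] into ["asd", "qwe"].
  str.scan(/\[(.*?)\]/).flatten
end

p bracketed("123 [asd]")            #=> ["asd"]
p bracketed("123 [asd] 456 [qwe]")  #=> ["asd", "qwe"]
p bracketed("no brackets")          #=> []
```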
Here are a few more ways to extract the desired string.
str = "123 [asd] 456"
#1
r = /
(?<=\[) # match '[' in a positive lookbehind
[^\]]* # match 0+ characters other than ']'
(?=\]) # match ']' in a positive lookahead
/x # free-spacing regex definition mode
str[r]
#=> "asd"
#2
r = /
\[ # match '['
[^\]]* # match 0+ characters other than ']'
\] # match ']'
/x # free-spacing regex definition mode
str[r][1..-2]
#=> "asd"
#3
r = /
.*\[ # match 0+ characters other than a newline, then '['
| # or
\].* # match ']' then 0+ characters other than a newline
/x # free-spacing regex definition mode
str.gsub(r, '')
#=> "asd"
#4
n = str.index('[')
#=> 4
m = str.index(']', n+1)
#=> 8
str[n+1..m-1]
#=> "asd"
See String#index.
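Approach #4 raises an error when a bracket is missing, because String#index then returns nil. A guarded version might look like the sketch below; the method name between_brackets is an assumption made for this example.

```ruby
# Hypothetical wrapper around approach #4 that tolerates missing brackets.
def between_brackets(str)
  n = str.index('[')
  return nil if n.nil?          # no opening bracket
  m = str.index(']', n + 1)
  return nil if m.nil?          # no closing bracket after the opening one
  str[(n + 1)...m]              # exclusive range stops just before ']'
end

p between_brackets("123 [asd] 456")  #=> "asd"
p between_brackets("123 asd")        #=> nil
```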
I'm new to the "LPeg" and "re" modules of Lua, currently I want to write a pattern based on following rules:
Match the string that starts with "gv_$/gv$/v$/v_$/x$/xv$/dba_/all_/cdb_", and the prefix "SYS.%s*" or "PUBLIC.%s*" is optional
The string should not follow an alphanumeric character, i.e., the pattern should not match "XSYS.DBA_OBJECTS" because it follows "X"
The pattern is case-insensitive
For example, below strings should match the pattern:
,sys.dba_objects, --should return "sys.dba_objects"
SyS.Dba_OBJECTS
cdb_objects
dba_hist_snapshot) --should return "dba_hist_snapshot"
My current pattern is below; it only matches a non-alphanumeric character followed by the string in upper case:
p=re.compile[[
pattern <- %W {owner* name}
owner <- 'SYS.'/ 'PUBLIC.'
name <- {prefix %a%a (%w/"_"/"$"/"#")+}
prefix <- "GV_$"/"GV$"/"V_$"/"V$"/"DBA_"/"ALL_"/"CDB_"
]]
print(p:match(",SYS.DBA_OBJECTS"))
My questions are:
How can I achieve case-insensitive matching? There are some topics about a solution, but I'm too new to understand them
How can I return exactly the matched string, without the leading %W? Something like "(?=...)" in Java
Highly appreciated if you can provide the pattern or related function.
You can try to tweak this grammar
local re = require're'
local p = re.compile[[
pattern <- ((s? { <name> }) / s / .)* !.
name <- (<owner> s? '.' s?)? <prefix> <ident>
owner <- (S Y S) / (P U B L I C)
prefix <- (G V '_'? '$') / (V '_'? '$') / (D B A '_') / (C D B '_')
ident <- [_$#%w]+
s <- (<comment> / %s)+
comment <- '--' (!%nl .)*
A <- [aA]
B <- [bB]
C <- [cC]
D <- [dD]
G <- [gG]
I <- [iI]
L <- [lL]
P <- [pP]
S <- [sS]
U <- [uU]
V <- [vV]
Y <- [yY]
]]
local m = { p:match[[
,sys.dba_objects, --should return "sys.dba_objects"
SyS.Dba_OBJECTS
cdb_objects
dba_hist_snapshot) --should return "dba_hist_snapshot"
]] }
print(unpack(m))
. . . prints match table m:
sys.dba_objects SyS.Dba_OBJECTS cdb_objects dba_hist_snapshot
Note that case-insensitivity is quite hard to achieve outside of the lexer, so each letter has to get a separate rule -- you'll need more of these eventually.
This grammar is taking care of the comments in your sample and skips them along with whitespace so matches after "should return" are not present in output.
You can fiddle with prefix and ident rules to specify additional prefixes and allowed characters in object names.
Note: !. means end-of-file. !%nl means "not end-of-line". ! p and & p construct non-consuming patterns, i.e. the current input pointer is not advanced on a match (the input is only tested).
Note 2: print-ing with unpack is a gross hack.
Note 3: Here is a traceable LPeg re that can be used to debug grammars. Pass true as the third parameter of re.compile to get an execution trace with the test/match/skip action for each rule and position visited.
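The non-consuming behaviour described in the first note (and the "(?=...)" idea from the question) can be sketched outside LPeg. The Ruby example below is only an analogy: StringScanner#check tests the input without moving the pointer, while scan consumes it. The helper names are invented for illustration.

```ruby
require 'strscan'

# Hypothetical helper: test a pattern first (pointer stays put),
# then consume it (pointer advances).
def peek_then_consume(text, pattern)
  scanner = StringScanner.new(text)
  peeked = scanner.check(pattern)   # like & p: input is only tested
  pos_after_check = scanner.pos
  consumed = scanner.scan(pattern)  # ordinary match: input is consumed
  [peeked, pos_after_check, consumed, scanner.pos]
end

# Hypothetical helper: (?=\)) requires a ")" to follow but leaves it
# out of the returned match.
def name_before_paren(text)
  text[/\w+(?=\))/]
end

p peek_then_consume("dba_objects)", /dba_/)  #=> ["dba_", 0, "dba_", 4]
p name_before_paren("dba_hist_snapshot)")    #=> "dba_hist_snapshot"
```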
I finally got a solution, though not a very graceful one: add an extra parameter case_insensitive to the re.compile, re.find, re.match and re.gsub functions. When the parameter is true, case_insensitive_pattern is invoked to rewrite the pattern:
...
local fmt="[%s%s]"
local function case_insensitive_pattern(quote,pattern)
-- find an optional '%' (group 1) followed by any character (group 2)
local stack={}
local is_letter=nil
local p = pattern:gsub("(%%?)(.)",
function(percent, letter)
if percent ~= "" or not letter:match("%a") then
-- if the '%' matched, or `letter` is not a letter, return "as is"
if is_letter==false then
stack[#stack]=stack[#stack]..percent .. letter
else
stack[#stack+1]=percent .. letter
is_letter=false
end
else
if is_letter==false then
stack[#stack]=quote..stack[#stack]..quote
is_letter=true
end
-- else, return a case-insensitive character class of the matched letter
stack[#stack+1]=fmt:format(letter:lower(), letter:upper())
end
return ""
end)
if is_letter==false then
stack[#stack]=quote..stack[#stack]..quote
end
if #stack<2 then return stack[1] or (quote..pattern..quote) end
return '('..table.concat(stack,' ')..')'
end
local function compile (p, defs, case_insensitive)
if mm.type(p) == "pattern" then return p end -- already compiled
if case_insensitive==true then
p=p:gsub([[(['"'])([^\n]-)(%1)]],case_insensitive_pattern):gsub("%(%s*%((.-)%)%s*%)","(%1)")
end
local cp = pattern:match(p, 1, defs)
if not cp then error("incorrect pattern", 3) end
return cp
end
...
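The core idea of the rewrite above, turning each letter of a quoted literal into a two-character class, can be shown in isolation. The Ruby sketch below is a simplification and not a port of the Lua code: it quotes every non-letter separately instead of grouping runs, and it assumes the literal contains no single quote. The function name is made up for the example.

```ruby
# Hypothetical helper: rewrite a literal such as "DBA_" into an
# LPeg re fragment that matches it in any letter case.
def case_insensitive_literal(literal)
  literal.chars.map { |ch|
    if ch =~ /[A-Za-z]/
      "[#{ch.downcase}#{ch.upcase}]"  # letter -> class with both cases
    else
      "'#{ch}'"                       # anything else stays a quoted literal
    end
  }.join(' ')
end

puts case_insensitive_literal("DBA_")  # prints: [dD] [bB] [aA] '_'
puts case_insensitive_literal("gv_$")  # prints: [gG] [vV] '_' '$'
```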
I'm using Instaparse to parse expressions like:
$(foo bar baz $(frob))
into something like:
[:expr "foo" "bar" "baz" [:expr "frob"]]
I've almost got it, but having trouble with ambiguity. Here's a simplified version of my grammar that repros, attempting to rely on negative lookahead.
(def simple
(insta/parser
"expr = <dollar> <lparen> word (<space> word)* <rparen>
<word> = !(dollar lparen) #'.+' !(rparen)
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
(simple "$(foo bar)")
which errors:
Parse error at line 1, column 11:
$(foo bar)
^
Expected one of:
")"
#"\s+"
Here I've said a word can be any char, in order to support expressions like:
$(foo () `bar` b-a-z)
etc. Note a word can contain () but it cannot contain $(). Not sure how to express this in the grammar. Seems the problem is <word> is too greedy, consuming the last ) instead of letting expr have it.
Update removed whitespace from word:
(def simple2
(insta/parser
"expr = <dollar> <lparen> word (<space> word)* <rparen>
<word> = !(dollar lparen) #'[^ ]+' !(rparen)
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
(simple2 "$(foo bar)")
; Parse error at line 1, column 11:
; $(foo bar)
; ^
; Expected one of:
; ")"
; #"\s+"
(simple2 "$(foo () bar)")
; Parse error at line 1, column 14:
; $(foo () bar)
; ^
; Expected one of:
; ")"
; #"\s+"
Update 2 more test cases
(simple2 "$(foo bar ())")
(simple2 "$((foo bar baz))")
Update 3 full working parser
For anyone curious, the full working parser, which was outside the scope of this question is:
(def parse
"expr - the top-level expression made up of cmds and sub-exprs. When multiple
cmds are present, it implies they should be successively piped.
cmd - a single command consisting of words.
sub-expr - a backticked or $(..)-style sub-expression to be evaluated inline.
parened - a grouping of words wrapped in parenthesis, explicitly tokenized to
allow parenthesis in cmds and disambiguate between sub-expression
syntax."
(insta/parser
"expr = cmd (<space> <pipe> <space> cmd)*
cmd = words
<sub-expr> = <backtick> expr <backtick> | nestable-sub-expr
<nestable-sub-expr> = <dollar> <lparen> expr <rparen>
words = word (<space>* word)*
<word> = sub-expr | parened | word-chars
<word-chars> = #'[^ `$()|]+'
parened = lparen words rparen
<space> = #'[ ]+'
<pipe> = #'[|]'
<dollar> = <'$'>
<lparen> = '('
<rparen> = ')'
<backtick> = <'`'>"))
Example usage:
(parse "foo bar (qux) $(clj (map (partial * $(js 45 * 2)) (range 10))) `frob`")
Parses to:
[:expr [:cmd [:words "foo" "bar" [:parened "(" [:words "qux"] ")"] [:expr [:cmd [:words "clj" [:parened "(" [:words "map" [:parened "(" [:words "partial" "*" [:expr [:cmd [:words "js" "45" "*" "2"]]]] ")"] [:parened "(" [:words "range" "10"] ")"]] ")"]]]] [:expr [:cmd [:words "frob"]]]]]]
This is a parser for a chatbot I wrote, yetibot. It replaces the previous mess of regex-based, by-hand parsing.
I don't really know Instaparse, so I just read enough documentation to give me a false sense of security. I also didn't test, and I don't really know what your requirements are.
In particular, I don't know:
1) Whether $() can nest (your grammar makes that impossible, I think, but it seems odd to me)
2) Whether () can contain whitespace without being parsed as more than one word
3) Whether () can contain $()
You'll need to be clear on things like this in order to write the grammar (or, as it happens, to ask for advice).
Update: Revised the grammar based on comments. I removed the productions for $ ( and ) because they seemed unnecessary, and this way the angle-brackets feel easier to deal with.
The following is based on answering the above questions "yes, no, yes" and some random assumptions about regex format. (I'm not totally clear on how angle-brackets work, but I don't think it will be easy to make parentheses output the way you want; I settled for just outputting them as single elements. If I figure out something, I'll edit it.)
<sequence> = element (<space> element)*
<element> = expr | paren_sequence | word
expr = <'$'> <'('> sequence <')'>
<word> = !('$'? '(') #'([^ $()]|\$[^(])+'
<paren_sequence> = '(' sequence ')'
<space> = #'\\s+'
Hope that helps a bit.
Well there are two changes you have to make in order to get both of your examples to work.
1) Add Negative Lookbehind
First, you will need a negative lookbehind in the regex for <word>. That way it will drop all the occurrences of ) as the last character:
<word> = !(dollar lparen) #'[^ ]+(?<!\\))'
So this will fix your first test case:
(simple2 "$(foo bar)")
=> [:expr "foo" "bar"]
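To see what the lookbehind does on its own, the regex can be tried outside the grammar. The sketch below uses Ruby, whose lookbehind syntax matches Java's for this pattern; the constant and method names are invented for the example. It also hints at why a word like () becomes a problem in the next step.

```ruby
# The <word> regex from the grammar, without the extra string escaping.
WORD = /[^ ]+(?<!\))/

# Hypothetical helper: return the first match of WORD in text.
def leading_word(text)
  text[WORD]
end

p leading_word("bar)")  #=> "bar"  (the trailing ")" is given back)
p leading_word("foo")   #=> "foo"
p leading_word("()")    #=> "("    (backtracks until the match no longer ends in ")")
```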
2) Add grammar for the last word
Now if you run your second test case it will fail:
(simple2 "$(foo () bar)")
=> Parse error at line 1, column 8:
$(foo () bar)
^
Expected one of:
")" (followed by end-of-string)
#"\s+"
This fails because we have told our grammar to drop the last ) in all instances of <word>. We now have to tell our grammar how to differentiate between the last instance of <word> and other instances. We'll do this by adding a specific <lastword> grammar, and make all other instances of <word> optional. The full grammar would look like this:
(def simple2
(insta/parser
"expr = <dollar> <lparen> word* lastword <rparen>
<word> = !(dollar lparen) #'[^ ]+' <space>+
<lastword> = !(dollar lparen) #'[^ ]+(?<!\\))'
<space> = #'\\s+'
<dollar> = <'$'>
<lparen> = <'('>
<rparen> = <')'>"))
And your two test cases should work fine:
(simple2 "$(foo bar)")
=> [:expr "foo" "bar"]
(simple2 "$(foo () bar)")
=> [:expr "foo" "()" "bar"]
Hope this helps.
I am working on a parser in Haskell using Parsec. The issue lies in reading in the string "| ". When I attempt to read in the following,
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "| "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
I get the following parse error.
Lampas >> {|a b| "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
Another failing case is ^, which produces the following.
Lampas >> {|a b^ "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
However, it works as expected when the string "| " is replaced with "} ".
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "} "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
The following is the REPL behavior with the above modification.
Lampas >> {|a b} "a" }
(lambda ("a" "b") ...)
So the question is (a) does the pipe have special behavior in Haskell strings, perhaps only in <|> chains, and (b) how can this behavior be avoided?
The character | may be in a set of reserved characters. Test with other characters, like ^, and I assume it will fail just as well. The only way around this would probably be to change the set of reserved characters, or the structure of your interpreter.
In the code below I can correctly parse white spaces after each of the tokens using Parsec:
whitespace = skipMany (space <?> "")
number :: Parser Integer
number = result <?> "number"
where
result = do {
ds <- many1 digit;
whitespace;
return (read ds)
}
table = result
where
result = [
[Infix (genParser '*' (*)) AssocLeft,
Infix (genParser '/' div) AssocLeft],
[Infix (genParser '+' (+)) AssocLeft,
Infix (genParser '-' (-)) AssocLeft]]
genParser s f = char s >> whitespace >> return f
factor = parenExpr <|> number <?> "parens or number"
where
parenExpr = do {
char '(';
x <- expr;
char ')';
whitespace;
return x
}
expr :: Parser Integer
expr = buildExpressionParser table factor <?> "expression"
However, I get a parse error when trying to parse whitespace only before and after the operators:
whitespace = skipMany (space <?> "")
number :: Parser Integer
number = result <?> "number"
where
result = do {
ds <- many1 digit;
return (read ds)
}
table = result
where
result = [
[Infix (genParser '*' (*)) AssocLeft,
Infix (genParser '/' div) AssocLeft],
[Infix (genParser '+' (+)) AssocLeft,
Infix (genParser '-' (-)) AssocLeft]]
genParser s f = whitespace >> char s >> whitespace >> return f
factor = parenExpr <|> number <?> "parens or number"
where
parenExpr = do {
char '(';
x <- expr;
char ')';
return x
}
expr :: Parser Integer
expr = buildExpressionParser table factor <?> "expression"
The parse error is:
$ ./parsec_example < <(echo "2 * 2 * 3")
"(stdin)" (line 2, column 1):
unexpected end of input
expecting "*"
Why does this happen? Is there some other way to parse white space around just the operators?
When I test your code, 2 * 2 * 3 parses correctly, but 2 + 2 does not. Parsing fails because the parser for * consumes some input and backtracking isn't enabled at that position, so other parsers cannot be tried.
An expression parser created by buildExpressionParser tries to parse each operator in turn until one succeeds. When parsing 2 + 2, the following occurs:
The first 2 is matched by number. The rest of the input is + 2 (note the space at the beginning).
The parser genParser '*' (*) is applied to the input. It consumes the space, but does not match the + character.
The other infix operator parsers automatically fail because some input was consumed by genParser '*' (*).
You can fix this by wrapping the critical part of the parser in try. This saves the input until after char s succeeds. If char s fails, then buildExpressionParser can backtrack and try another infix operator.
genParser s f = try (whitespace >> char s) >> whitespace >> return f
The drawback of this parser is that, because it backtracks to before the leading whitespace before an infix operator, it repeatedly scans whitespace. It is usually better to parse whitespace after a successful match, like the OP's first parser example.
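The save-and-restore behaviour that try provides can be imitated with a hand-rolled scanner. The Ruby sketch below is only an analogy to the Parsec fix, not Parsec itself; try_operator is a made-up name.

```ruby
require 'strscan'

# Hypothetical helper: match optional whitespace followed by the operator
# `op`. On failure the scan position is restored, so that another operator
# can still be tried from the same spot (the role `try` plays in Parsec).
def try_operator(scanner, op)
  start = scanner.pos
  scanner.skip(/\s*/)                        # leading whitespace
  if scanner.scan(Regexp.new(Regexp.escape(op)))
    scanner.skip(/\s*/)                      # trailing whitespace
    true
  else
    scanner.pos = start                      # undo the consumed whitespace
    false
  end
end

scanner = StringScanner.new(" + 2")
p try_operator(scanner, '*')  #=> false
p scanner.pos                 #=> 0
p try_operator(scanner, '+')  #=> true
p scanner.rest                #=> "2"
```

Without the line that resets scanner.pos, the failed attempt at '*' would leave the pointer after the space, which corresponds to the consumed input that stops the other infix parsers in the original code.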
Part of a Lua application of mine is a search bar, and I'm trying to make it understand boolean expressions. I'm using LPeg, but the current grammar gives a strange result:
> re, yajl = require're', require'yajl'
> querypattern = re.compile[=[
QUERY <- ( EXPR / TERM )? S? !. -> {}
EXPR <- S? TERM ( (S OPERATOR)? S TERM )+ -> {}
TERM <- KEYWORD / ( "(" S? EXPR S? ")" ) -> {}
KEYWORD <- ( WORD {":"} )? ( WORD / STRING )
WORD <- {[A-Za-z][A-Za-z0-9]*}
OPERATOR <- {("AND" / "XOR" / "NOR" / "OR")}
STRING <- ('"' {[^"]*} '"' / "'" {[^']*} "'") -> {}
S <- %s+
]=]
> = yajl.to_string(lpeg.match(querypattern, "bar foo"))
"bar"
> = yajl.to_string(lpeg.match(querypattern, "name:bar AND foo"))
"name"
> = yajl.to_string(lpeg.match(querypattern, "name:'bar' AND foo"))
"name"
> = yajl.to_string(lpeg.match(querypattern, "bar AND (name:foo OR place:here)"))
"bar"
It only parses the first token, and I cannot figure out why it does this. As far as I know, a partial match is impossible because of the !. at the end of the starting non-terminal. How can I fix this?
The match is getting the entire string, but the captures are wrong. Note that '->' has a higher precedence than concatenation, so you probably need parentheses around things like this:
EXPR <- S? ( TERM ( (S OPERATOR)? S TERM )+ ) -> {}