This works.
"<name> <substring>"[/.*<([^>]*)/,1]
=> "substring"
But I want to extract substring within [ and ].
input:
string = "123 [asd]"
output:
asd
Anyone can help me?
You can do:
"123 [asd]"[/\[(.*?)\]/, 1]
will return
"asd"
You can test it here:
https://rextester.com/YGZEA91495
Here are a few more ways to extract the desired string.
str = "123 [asd] 456"
#1
r = /
(?<=\[) # match '[' in a positive lookbehind
[^\]]* # match 1+ characters other than ']'
(?=\]) # match ']' in a positive lookahead
/x # free-spacing regex definition mode
str[r]
#=> "asd"
#2
r = /
\[ # match '['
[^\]]* # match 1+ characters other than ']'
\] # match ']'
/x # free-spacing regex definition mode
str[r][1..-2]
#=> "asd"
#3
r = /
.*\[ # match 0+ characters other than a newline, then '['
| # or
\].* # match ']' then 0+ characters other than a newline
/x # free-spacing regex definition mode
str.gsub(r, '')
#=> "asd"
#4
n = str.index('[')
#=> 4
m = str.index(']', n+1)
#=> 8
str[n+1..m-1]
#=> "asd"
See String#index.
I need to iterate over some pairs of strings in a program that I am writing. Instead of putting the string pairs in a big table-of-tables, I am putting them all in a single string, because I think the end result is easier to read:
function two_column_data(data)
return data:gmatch('%s*([^%s]+)%s+([^%s]+)%s*\n')
end
for a, b in two_column_data [[
Hello world
Olá hugomg
]] do
print( a .. ", " .. b .. "!")
end
The output is what you would expect:
Hello, world!
Olá, hugomg!
However, as the name indicates, the two_column_data function only works if there are two exactly columns of data. How can I make it so it works on any number of columns?
for x in any_column_data [[
qwe
asd
]] do
print(x)
end
for x,y,z in any_column_data [[
qwe rty uio
asd dfg hjk
]] do
print(x,y,z)
end
I'm OK with using lpeg for this task if its necessary.
function any_column_data(data)
local f = data:gmatch'%S[^\r\n]+'
return
function()
local line = f()
if line then
local row, ctr = line:gsub('%s*(%S+)','%1 ')
return row:match(('(.-) '):rep(ctr))
end
end
end
Here is an lpeg re version
function re_column_data(subj)
local t, i = re.compile([[
record <- {| ({| [ %t]* field ([ %t]+ field)* |} (%nl / !.))* |}
field <- escaped / nonescaped
nonescaped <- { [^ %t"%nl]+ }
escaped <- '"' {~ ([^"] / '""' -> '"')* ~} '"']], { t = '\t' }):match(subj)
return function()
local ret
i, ret = next(t, i)
if i then
return unpack(ret)
end
end
end
It basicly is a redo of the CSV sample and supports quoted fields for some nice use-cases: values with spaces, empty values (""), multi-line values, etc.
for a, b, c in re_column_data([[
Hello world "test
test"
Olá "hug omg"
""]].."\tempty a") do
print( a .. ", " .. b .. "! " .. (c or ''))
end
local function any_column_data( str )
local pos = 0
return function()
local _, to, line = str:find("([^\n]+)\n", pos)
if line then
pos = to
local words = {}
line:gsub("[^%s]+", function( word )
table.insert(words, word)
end)
return table.unpack(words)
end
end
end
Outer loop returns lines, and inner loop returns words in line.
s = [[
qwe rty uio
asd dfg hjk
]]
for s in s:gmatch('(.-)\n') do
for s in s:gmatch('%w+') do
io.write(s,' ')
end
io.write('\n')
end
I am trying to parse an XML dump of Wikipedia to find certain links on each page using the Haskell Parsec library. Links are denoted by double brackets: texttext[[link]]texttext. To simplify the scenario as much as possible, let's say I am looking for the first link not enclosed in double curly braces (which can be nested): {{ {{ [[Wrong Link]] }} [[Wrong Link]] }} [[Right Link]]. I have written a parser to discard links which are enclosed in non-nested double braces:
import Text.Parsec
getLink :: String -> Either ParseError String
getLink = parse linkParser "Links"
linkParser = do
beforeLink
link <- many $ noneOf "]"
string "]]"
return link
beforeLink = manyTill (many notLink) (try $ string "[[")
notLink = try doubleCurlyBrac <|> (many1 normalText)
normalText = noneOf "[{"
<|> notFollowedByItself '['
<|> notFollowedByItself '{'
notFollowedByItself c = try ( do x <- char c
notFollowedBy $ char c
return x)
doubleCurlyBrac = between (string "{{") (string "}}") (many $ noneOf "}")
getLinkTest = fmap getLink testList
where testList = [" [[rightLink]] " --Correct link is found
, " {{ [[Wrong_Link]] }} [[rightLink]]" --Correct link is found
, " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]" ] --Wrong link is found
I have tried making the doubleCurlyBrac parser recursive to also discard links in nested curly braces, without success:
doubleCurlyBrac = between (string "{{") (string "}}") betweenBraces
where betweenBraces = doubleCurlyBrac <|> (many $ try $ noneOf "}")
This parser stops consuming input after the first }}, rather than the final one, in a nested example. Is there an elegant way to write a recursive parser to (in this case) correctly ignore links in nested double curly braces? Also, can it be done without using try? I have found that since try does not consume input, it often causes the parser to hang on unexpected, ill-formed input.
Here's a more direct version that doesn't use a custom lexer. It does use try though, and I don't see how to avoid it here. The problem is that it seems we need a non-committing look ahead to distinguish double brackets from single brackets; try is for non-committing look ahead.
The high level approach is that same as in
my first answer. I've been careful
to make the three node parsers commute -- making the code more robust
to change -- by using both try and notFollowedBy:
{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (string)
import qualified Text.Parsec
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)
import Debug.Trace
----------------------------------------------------------------------
-- Token parsers.
llink , rlink , lbrace , rbrace :: Parsec String u String
[llink , rlink , lbrace , rbrace] = reserved
reserved = map (try . Text.Parsec.string) ["[[" , "]]" , "{{" , "}}"]
----------------------------------------------------------------------
-- Node parsers.
-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show
nodes :: Parsec String u [Node]
nodes = many node
node :: Parsec String u Node
node = link <|> braces <|> string
link , braces , string :: Parsec String u Node
link = L <$> between llink rlink nodes
braces = B <$> between lbrace rbrace nodes
string = S <$> many1 (notFollowedBy (choice reserved) >> anyChar)
----------------------------------------------------------------------
parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>"
----------------------------------------------------------------------
-- Tests.
getLink :: [Node] -> Maybe Node
getLink = find isLink where
isLink (L _) = True
isLink _ = False
parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes
testList = [ " [[rightLink]] "
, " {{ [[Wrong_Link]] }} [[rightLink]]"
, " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]"
, " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
-- Pathalogical example from comments.
, "{{ab}cd}}"
-- A more pathalogical example.
, "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf"
-- No top level link.
, "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
-- Too many '{{'.
, "{{ {{ {{ [[ asdf ]] }} }}"
-- Too many '}}'.
, "{{ {{ [[ asdf ]] }} }} }}"
-- Too many '[['.
, "[[ {{ [[{{[[asdf]]}}]]}}"
]
main =
forM_ testList $ \ t -> do
putStrLn $ "Test: ^" ++ t ++ "$"
let parses = ( , ) <$> parseNodes t <*> parseLink t
printParses (n , l) = do
putStrLn $ "Nodes: " ++ show n
putStrLn $ "Link: " ++ show l
printError = putStrLn . show
either printError printParses parses
putStrLn ""
The output is the same in the non-error cases:
Test: ^ [[rightLink]] $
Nodes: [S " ",L [S "rightLink"],S " "]
Link: Just (L [S "rightLink"])
Test: ^ {{ [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ {{ {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])
Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing
Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])
Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing
but the parse error messages are not as informative in the cases of
unmatched openings:
Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"
Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 26):
unexpected "}}"
Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"
and I couldn't figure out how to fix them.
My solution does not use try, but is relatively complicated: I used
your question as an excuse to learn how to create a lexer in
Parsec without using
makeTokenParser :D I avoid try because the only look ahead happens in the lexer (tokenize), where the various bracket pairs are identified.
The high level idea is that we treat {{, }}, [[, and ]] as
special tokens and parse the input into an AST. You didn't specify
the grammar precisely, so I chose a simple one that generates your
examples:
node ::= '{{' node* '}}'
| '[[' node* ']]'
| string
string ::= <non-empty string without '{{', '}}', '[[', or ']]'>
I parse an input string into a list of nodes. The first top-level
link ([[) node, if any, is the link you're looking for.
The approach I've taken should be relatively robust to grammar
changes. E.g., if you want to allow only strings inside links, then
change '[[' node* ']]' to '[[' string ']]'. (In the code
link = L <$> between llink rlink nodes
becomes
link = L <$> between llink rlink string
).
The code is fairly long, but mostly straightforward. Most of it
concerns creating the token stream (lexing) and parsing the individual
tokens. After this, the actual Node parsing is very simple.
Here it is:
{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (char , string)
import Text.Parsec.Pos (updatePosString , updatePosChar)
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)
----------------------------------------------------------------------
-- Lexing.
-- Character or punctuation.
data Token = C Char | P String deriving Eq
instance Show Token where
show (C c) = [c]
show (P s) = s
tokenize :: String -> [Token]
tokenize [] = []
tokenize [c] = [C c]
tokenize (c1:c2:cs) = case [c1,c2] of
"[[" -> ts
"]]" -> ts
"{{" -> ts
"}}" -> ts
_ -> C c1 : tokenize (c2:cs)
where
ts = P [c1,c2] : tokenize cs
----------------------------------------------------------------------
-- Token parsers.
-- We update the 'sourcePos' while parsing the tokens. Alternatively,
-- we could have annotated the tokens with positions above in
-- 'tokenize', and then here we would use 'token' instead of
-- 'tokenPrim'.
llink , rlink , lbrace , rbrace :: Parsec [Token] u Token
[llink , rlink , lbrace , rbrace] =
map (t . P) ["[[" , "]]" , "{{" , "}}"]
where
t x = tokenPrim show update match where
match y = if x == y then Just x else Nothing
update pos (P s) _ = updatePosString pos s
char :: Parsec [Token] u Char
char = tokenPrim show update match where
match (C c) = Just c
match (P _) = Nothing
update pos (C c) _ = updatePosChar pos c
----------------------------------------------------------------------
-- Node parsers.
-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show
nodes :: Parsec [Token] u [Node]
nodes = many node
node :: Parsec [Token] u Node
node = link <|> braces <|> string
link , braces , string :: Parsec [Token] u Node
link = L <$> between llink (rlink <?> "]]") nodes
braces = B <$> between lbrace (rbrace <?> "}}") nodes
string = S <$> many1 char
----------------------------------------------------------------------
parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>" . tokenize
----------------------------------------------------------------------
-- Tests.
getLink :: [Node] -> Maybe Node
getLink = find isLink where
isLink (L _) = True
isLink _ = False
parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes
testList = [ " [[rightLink]] "
, " {{ [[Wrong_Link]] }} [[rightLink]]"
, " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]"
, " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
-- Pathalogical example from comments.
, "{{ab}cd}}"
-- A more pathalogical example.
, "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf"
-- No top level link.
, "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
-- Too many '{{'.
, "{{ {{ {{ [[ asdf ]] }} }}"
-- Too many '}}'.
, "{{ {{ [[ asdf ]] }} }} }}"
-- Too many '[['.
, "[[ {{ [[{{[[asdf]]}}]]}}"
]
main =
forM_ testList $ \ t -> do
putStrLn $ "Test: ^" ++ t ++ "$"
let parses = ( , ) <$> parseNodes t <*> parseLink t
printParses (n , l) = do
putStrLn $ "Nodes: " ++ show n
putStrLn $ "Link: " ++ show l
printError = putStrLn . show
either printError printParses parses
putStrLn ""
The output from main is:
Test: ^ [[rightLink]] $
Nodes: [S " ",L [S "rightLink"],S " "]
Link: Just (L [S "rightLink"])
Test: ^ {{ [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ {{ {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])
Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing
Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])
Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing
Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting }}
Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 24):
unexpected }}
expecting end of input
Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting ]]
I am working on a parser in Haskell using Parsec. The issue lies in reading in the string "| ". When I attempt to read in the following,
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "| "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
I get a parse error, the following.
Lampas >> {|a b| "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
Another failing case is ^ which bears the following.
Lampas >> {|a b^ "a" }
Parse error at "lisp" (line 1, column 12):
unexpected "}"
expecting letter, "\"", digit, "'", "(", "[", "{|" or "."
However, it works as expected when the string "| " is replaced with "} ".
parseExpr = parseAtom
-- | ...
<|> do string "{|"
args <- try parseList <|> parseDottedList
string "} "
body <- try parseExpr
string " }"
return $ List [Atom "lambda", args, body]
The following is the REPL behavior with the above modification.
Lampas >> {|a b} "a" }
(lambda ("a" "b") ...)
So the question is (a) does pipe have a special behavior in Haskell strings, perhaps only in <|> chains?, and (b) how is this behavior averted?.
The character | may be in a set of reserved characters. Test with other characters, like ^, and I assume it will fail just as well. The only way around this would probably be to change the set of reserved characters, or the structure of your interpreter.
I am using the log parser utility to trace the parsing.
Scala code:
import scala.util.parsing.combinator.JavaTokenParsers
object Arith extends JavaTokenParsers with App {
def expr: Parser[Any] = log(term~rep("+"~term | "-"~term))("expr")
def term: Parser[Any] = factor~rep("*"~factor | "/"~factor)
def factor: Parser[Any] = floatingPointNumber | "("~expr~")"
println(parseAll(expr, "2 * (3 + 7)"))
}
Output:
trying expr at scala.util.parsing.input.CharSequenceReader#13a317a
trying expr at scala.util.parsing.input.CharSequenceReader#14c1103
expr --> [1.11] parsed: ((3~List())~List((+~(7~List()))))
expr --> [1.12] parsed: ((2~List((*~(((~((3~List())~List((+~(7~List())))))~)))))~List())
[1.12] parsed: ((2~List((*~(((~((3~List())~List((+~(7~List())))))~)))))~List())
The input is printed as scala.util.parsing.input.CharSequenceReader#13a317a. Is there a way to print string representation of the input like "2 * (3 + 7)"?
Overriding log solved my problem.
Scala Snippet:
import scala.util.parsing.combinator.JavaTokenParsers
object Arith extends JavaTokenParsers with App {
override def log[T](p: => Parser[T])(name: String): Parser[T] = Parser { in =>
def prt(x: Input) = x.source.toString.drop(x.offset)
println("trying " + name + " at " + prt(in))
val r = p(in)
println(name + " --> " + r + " next: " + prt(r.next))
r
}
def expr: Parser[Any] = log(term ~ rep("+" ~ term | "-" ~ term))("expr")
def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
def factor: Parser[Any] = floatingPointNumber | "(" ~ expr ~ ")"
println(parseAll(expr, "(3+7)*2"))
}
Output:
trying expr at (3+7)*2
trying expr at 3+7)*2
expr --> [1.5] parsed: ((3~List())~List((+~(7~List())))) next: )*2
expr --> [1.8] parsed: (((((~((3~List())~List((+~(7~List())))))~))~List((*~2)))~List()) next:
[1.8] parsed: (((((~((3~List())~List((+~(7~List())))))~))~List((*~2)))~List())