Match a self-defined token in PARSE

I am working on a string-transforming problem. The requirement is like this:
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
==>
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', to_date('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss'), '', '105210', null)}
Note: the '2014-10-09 11:40:44' is transformed to to_date('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss').
My code looks like this:
date: use [digit][
digit: charset "0123456789"
[4 digit "-" 2 digit "-" 2 digit space 2 digit ":" 2 digit ":" 2 digit]
]
parse line [ to date to end]
but I got this error:
** Script error: PARSE - invalid rule or usage of rule: digit
** Where: parse do either either either -apply-
** Near: parse line [to date to end]
I have made some tests:
probe parse "SSS 2016-01-01 00:00:00" [thru 3 "S" space date to end] ;true
probe parse "SSS 2016-01-01 00:00:00" [ to date to end] ; the error above
The position of the date value varies across my data set, so how can I locate it, match it, and make the corresponding change?

I did as below:
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
d: [2 digit]
parse/all line [some [p1: {'} 4 digit "-" d "-" d " " d ":" d ":" d {'} p2: (insert p2 ")" insert p1 "to_date(" ) | skip]]
>> {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC??', 'B', 'admin', to_date('2014-10-09 11:40:44'), '', '105210', null)}

TO and THRU have historically not allowed arbitrary rules as their parameters. See #2129:
"The syntax of TO and THRU is currently restricted by design, for really significant performance reasons..."
This was relaxed in Red. So for example, the following will work there:
parse "aabbaabbaabbccc" [
thru [
some "a" (prin "a") some "b" (prin "b") some "c" (prin "c")
]
]
However, it outputs:
abababababc
This shows that it really doesn't have a better answer than just "naively" applying the parse rule iteratively at each step. Looping the PARSE engine is not as efficient as atomically running a TO/THRU for which faster methods of seeking exist (basic string searches, for instance). And the repeated execution of code in parentheses may not line up with what was actually intended.
Still...it seems better to allow it. Then let users worry about when their code is slow and performance tune it if it matters. So odds are that the Ren-C branch of Rebol will align with Red in this respect, and allow arbitrary rules.
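The "naive" behaviour described above can be imitated outside Rebol. The following is a minimal Python sketch, not how Red or Ren-C actually implement THRU: `some` and `naive_thru` are hypothetical helper names, and the parenthesized actions are modelled by appending a label to a list. It retries the whole rule at every start position, so the actions of failed attempts still run:

```python
def some(ch):
    """Return a matcher for one or more repetitions of ch (like SOME)."""
    def match(text, pos):
        end = pos
        while end < len(text) and text[end] == ch:
            end += 1
        return end if end > pos else None
    return match


def naive_thru(text, steps):
    """Try the (matcher, label) sequence at every start position in turn."""
    printed = []
    for start in range(len(text) + 1):
        pos = start
        for matcher, label in steps:
            pos = matcher(text, pos)
            if pos is None:
                break  # this attempt failed; move on to the next start
            printed.append(label)  # side effect runs even if a later step fails
        else:
            return pos, "".join(printed)
    return None, "".join(printed)


steps = [(some("a"), "a"), (some("b"), "b"), (some("c"), "c")]
end, trace = naive_thru("aabbaabbaabbccc", steps)
print(trace)  # abababababc
```

The trace contains one "ab" for each failed attempt before the final successful "abc", which is why side effects inside such a rule may not line up with what was intended.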

I solved it in an indirect way:
date: use [digit][
digit: charset "0123456789"
[4 digit "-" 2 digit "-" 2 digit space 2 digit ":" 2 digit ":" 2 digit]
]
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
parse line [
thru "(" vals: (
blk: parse/all vals ","
foreach val blk [
if parse val [" '" date "'"][
;probe val
replace line val rejoin [ { to_date(} at val 2 {, 'yyyy-mm-dd hh24:mi:ss')}]
]
]
)
to end
(probe line)
]
The output:
{INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', to_date('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss'), '', '105210', null)}

Here is a true Rebol2 solution:
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC??', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
date: use [digit space][
space: " "
digit: charset "0123456789"
[4 digit "-" 2 digit "-" 2 digit space 2 digit ":" 2 digit ":" 2 digit]
]
>> parse/all line [ some [ [da: "'" date (insert da "to_date (" ) 11 skip de: (insert de " 'yyyy-mm-dd hh24:mi:ss'), ") ] | skip ] ]
== true
>> probe line
{INSERT INTO `pub_company` VALUES ('1', '0', 'ABC??', 'B', 'admin', to_date ('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss'), '', '105210', null)}
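For comparison, the same transformation can be sketched outside Rebol with a regular-expression substitution. This is only an illustration in Python, not part of any answer above: `DATE_LITERAL` and `wrap_dates` are hypothetical names, and the sketch assumes every single-quoted value of the form 'yyyy-mm-dd hh:mm:ss' should be wrapped, wherever it appears in the line.

```python
import re

# A quoted timestamp such as '2014-10-09 11:40:44' (quotes included).
DATE_LITERAL = re.compile(r"'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'")


def wrap_dates(sql):
    """Wrap each quoted timestamp in an Oracle-style to_date(...) call."""
    return DATE_LITERAL.sub(
        lambda m: "to_date(%s, 'yyyy-mm-dd hh24:mi:ss')" % m.group(0),
        sql,
    )


line = ("INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', "
        "'admin', '2014-10-09 11:40:44', '', '105210', null)")
print(wrap_dates(line))
```

Because the pattern is searched for at every position, the date does not need to sit in a fixed column of the VALUES list.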

Related

Rails - extract substring within [ and ] from string

This works.
"<name> <substring>"[/.*<([^>]*)/,1]
=> "substring"
But I want to extract substring within [ and ].
input:
string = "123 [asd]"
output:
asd
Can anyone help me?
You can do:
"123 [asd]"[/\[(.*?)\]/, 1]
will return
"asd"
You can test it here:
https://rextester.com/YGZEA91495
Here are a few more ways to extract the desired string.
str = "123 [asd] 456"
#1
r = /
(?<=\[) # match '[' in a positive lookbehind
[^\]]* # match 0+ characters other than ']'
(?=\]) # match ']' in a positive lookahead
/x # free-spacing regex definition mode
str[r]
#=> "asd"
#2
r = /
\[ # match '['
[^\]]* # match 0+ characters other than ']'
\] # match ']'
/x # free-spacing regex definition mode
str[r][1..-2]
#=> "asd"
#3
r = /
.*\[ # match 0+ characters other than a newline, then '['
| # or
\].* # match ']' then 0+ characters other than a newline
/x # free-spacing regex definition mode
str.gsub(r, '')
#=> "asd"
#4
n = str.index('[')
#=> 4
m = str.index(']', n+1)
#=> 8
str[n+1..m-1]
#=> "asd"
See String#index.

Parsing expressions ! chain of operators

Hi, I know how to parse expressions (including brackets).
But expression parsing normally assumes the form "operand operator operand".
For example:
5 + 12
( 5 * 6 ) + 11
( 3 + 4 ) + ( 5 * 2)
As you can see, each operator always has exactly two operands.
What I'm looking for is a mechanism (grammar) that can greedily parse a chain of the same operator as a single item.
For example, say I have the following expression:
5 + 4 + 2 + 7 * 6 * 2
=> sum(5 + 4 + 2)
+
=> mult(7 * 6 * 2)
I want the parser to gobble the sum as one single "action", and the same for the multiplication.
Here is an example of a NON-working grammar, but maybe it gives you an idea of what I want to do (Python, LEPL module):
def build_grammar2(self):
    spaces = Token('[ \t]+')[:]
    plus = Token('\+')
    left_bracket = Token('\(')
    right_bracket = Token('\)')
    mult = Token('\*')
    bit_var = Token('[a-zA-Z0-9_!\?]+')
    # with Separator(~spaces):
    expr, group2 = Delayed(), Delayed()
    mul_node = bit_var & (~mult & bit_var)[1:] > Node
    add_node = bit_var & (~plus & bit_var)[1:] > Node
    node = mul_node | add_node
    parens = ~left_bracket & expr & ~right_bracket
    group1 = parens | node
    add = group1 & ~plus & group2 > Node
    group2 += group1 | add
    mul = group2 & ~mult & expr > Node
    expr += group2 | mul
    self.grammar = expr
This is pretty much what you get with pyparsing:
import pyparsing as pp
add_op = pp.oneOf("+ -")
mul_op = pp.oneOf("* /")
operand = pp.pyparsing_common.number | pp.pyparsing_common.identifier
arith = pp.infixNotation(operand,
[
("-", 1, pp.opAssoc.RIGHT),
(mul_op, 2, pp.opAssoc.LEFT),
(add_op, 2, pp.opAssoc.LEFT),
])
print(arith.parseString("1+2-3+X*-7*6+Y*(3+2)").asList())
prints
[[1, '+', 2, '-', 3, '+', ['X', '*', ['-', 7], '*', 6], '+', ['Y', '*', [3, '+', 2]]]]
If you just parse numbers, you can make the parser also do parse-time eval by adding parse actions to each level of precedence (pp.pyparsing_common.number auto-converts numeric strings to int or float):
operand = pp.pyparsing_common.number
op_fn = {
'*': lambda a,b: a*b,
'/': lambda a,b: a/b,
'+': lambda a,b: a+b,
'-': lambda a,b: a-b,
}.get
def binop(t):
    t_iter = iter(t[0])
    ret = next(t_iter)
    for op, val in zip(t_iter, t_iter):
        ret = op_fn(op)(ret, val)
    return ret
arith = pp.infixNotation(operand,
[
("-", 1, pp.opAssoc.RIGHT, lambda t: -t[0][1]),
(mul_op, 2, pp.opAssoc.LEFT, binop),
(add_op, 2, pp.opAssoc.LEFT, binop),
])
print(arith.parseString("1+2-3+8*-7*6+4*(3+2)"))
Prints:
[-316]
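If the goal is the n-ary grouping from the question (one node per chain of the same operator) rather than evaluation, a small hand-written recursive-descent parser is one option. The following is a sketch in plain Python, not pyparsing or LEPL. The node names `sum` and `mult` follow the question; every function name is hypothetical, and only `+`, `*`, parentheses, and word-like operands are handled.

```python
import re

TOKEN = re.compile(r"\s*(\w+|[+*()])")


def tokenize(text):
    """Split the input into operands, operators and parentheses."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(text):
    """Return a nested tuple tree with one node per operator chain."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def atom():
        tok = take()
        if tok == "(":
            node = chain_sum()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if tok in ("+", "*", ")"):
            raise ValueError("unexpected token %r" % tok)
        return int(tok) if tok.isdigit() else tok

    def chain(op, name, operand):
        # Greedily collect every operand joined by the same operator.
        items = [operand()]
        while peek() == op:
            take()
            items.append(operand())
        return items[0] if len(items) == 1 else (name,) + tuple(items)

    def chain_mult():
        return chain("*", "mult", atom)

    def chain_sum():
        return chain("+", "sum", chain_mult)

    tree = chain_sum()
    if pos != len(tokens):
        raise ValueError("unexpected token %r" % tokens[pos])
    return tree


print(parse("5 + 4 + 2 + 7 * 6 * 2"))
# ('sum', 5, 4, 2, ('mult', 7, 6, 2))
```

Each precedence level is a loop rather than a binary rule, so a run of the same operator becomes a single node. A parenthesized sub-expression stays a separate node and is not merged into the enclosing chain.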

ANTLR: Parser rule sensitive to whitespace

I have the following input data:
Valid string: "123A"
Invalid string: "123 A"
Valid string: "111A <= 5 AND 222A"
Invalid string: "111 A <= 5 AND 222A"
Below you can see the grammar I'm using (antlr 3.4).
my_id: INT ('A'|'a') -> INT;
fragment DIGIT: '0' .. '9';
INT : DIGIT+ ;
WS : (' '|'\t'|'\n'|'\r')+ {$channel=HIDDEN;} ;
The problem is that my_id matches both 123 A and 123A. How can I throw a parsing error when detecting 123 A?
Any help is greatly appreciated.

EBNF to JavaCC: difference between [] and {}

I have the following 2 production rules in EBNF:
<CharLiteral> ::= ' " ' [ <Printable> ] ' " '
and
<StringLiteral> ::= ' " ' { <Printable> } ' " '
What is the difference between the two? Does [] imply 1 or more repetitions and {} imply 0 or more repetitions?
In EBNF, [X] means 0 or 1 X and {X} means 0 or more X.
In JavaCC, [X] means 0 or 1 X for grammar productions; in regular expression productions, you should use (X)? instead. To express 0 or more X in JavaCC use (X)*.
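The difference can be illustrated with regular-expression quantifiers, where `?` plays the role of EBNF `[X]` and `*` plays the role of `{X}`. This is a minimal Python sketch, not JavaCC syntax. It assumes `<Printable>` can be approximated as any character other than a double quote (the question does not define it) and that the delimiters are plain double quotes; all names are illustrative.

```python
import re

PRINTABLE = r'[^"]'  # assumption: stand-in for <Printable>

# [ <Printable> ]  ->  zero or one occurrence
CHAR_LITERAL = re.compile('"' + PRINTABLE + '?"')
# { <Printable> }  ->  zero or more occurrences
STRING_LITERAL = re.compile('"' + PRINTABLE + '*"')


def is_char_literal(text):
    return CHAR_LITERAL.fullmatch(text) is not None


def is_string_literal(text):
    return STRING_LITERAL.fullmatch(text) is not None


for sample in ['""', '"a"', '"ab"']:
    print(sample, is_char_literal(sample), is_string_literal(sample))
```

Under these rules the empty literal and a one-character literal satisfy both productions, while only the `{ }` form accepts two or more characters.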

Regular expression to avoid a set of characters

I am using Ruby on Rails 3.1.0 and I would like to validate a class attribute so that a string containing any of these characters is never stored in the database: (blank space), <, >, ", #, %, {, }, |, \, ^, ~, [, ] and `.
What is the regex?
Assuming it should also be non-empty:
^[^\] ><"#%{}|\\^~\[`]+$
Since someone is downvoting this, here is some test code:
ary = [' ', '<', '>', '"', '#', '%', '{', '}', '|', '\\', '^', '~', '[', ']', '`', 'a']
ary.each do |i|
puts i =~ /^[^\] ><"#%{}|\\^~\[`]+$/
end
Output:
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
nil
0
bad_chars = %w(< > " # % { } | \ ^ ~ [ ] ')
re = Regexp.union(bad_chars)
p %q(hoh'oho) =~ re #=> 3
Regexp.union takes care of escaping.
a = "foobar"
b = "foo ` bar"
re = /[ \^<>"#%\{\}\|\\~\[\]\`]/
a =~ re # => nil
b =~ re # => 3
The inverse expression is:
/\A[^ \^<>"#%\{\}\|\\~\[\]\`]+\Z/

Resources