Best way to parse this in Rebol

How do I extract the transaction receipt datetime from the following HTML with as little noise as possible in my parse rule? (The output I'm looking for is: "Transaction Receipt: 04/28/2011 17:03:09")
<FONT COLOR=DARKBLUE>Transaction Receipt </FONT></TH></TR><TR></TR><TR></TR><TR><TD COLSPAN=4 ALIGN=CENTER><FONT SIZE=-1 COLOR=DARKBLUE>04/28/2011 17:03:09</FONT>
The following works, but it doesn't feel right. A datetime is guaranteed to follow the words Transaction Receipt somewhere (although I wouldn't use a greedy match if I were doing this with grep).
parse d [
thru {<FONT COLOR=DARKBLUE>Transaction Receipt </FONT></TH></TR><TR></TR><TR></TR><TR><TD COLSPAN=4 ALIGN=CENTER><FONT SIZE=-1 COLOR=DARKBLUE>}
copy t to "</FONT>"
]

This is shorter...
parse d [thru <FONT SIZE=-1 COLOR=DARKBLUE> copy t to </FONT>]
but isn't specifically looking for the datetime pair. And unfortunately REBOL considers the date used an invalid one...
>> 04/28/2011
** Syntax Error: Invalid date -- 04/28/2011
** Near: (line 1) 04/28/2011
so you can't search for it specifically. If the date was 28/04/2011 (and there was a space after the time, though why it's needed for load I'm not sure), the following would work...
parse load d [to date! copy t to </FONT>]
Hmmm. Try this...
t: ""
parse d [
    some [
        to "<" thru ">" mark: copy text to "<" (if text [append t text]) :mark
    ]
]
That returns: "Transaction Receipt 04/28/2011 17:03:09"
It works by skipping all the tags, appending any text that's left to t.
Hope that helps!
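For comparison, the same tag-skipping idea can be sketched outside Rebol. The following is a rough Python sketch, not a Rebol solution: the helper names strip_tags and receipt_line are made up for illustration, and it assumes the timestamp is always written as month/day/year followed by a 24-hour time.

```python
import re
from datetime import datetime

# Sample input copied from the question.
html = (
    "<FONT COLOR=DARKBLUE>Transaction Receipt </FONT></TH></TR><TR></TR><TR></TR>"
    "<TR><TD COLSPAN=4 ALIGN=CENTER><FONT SIZE=-1 COLOR=DARKBLUE>"
    "04/28/2011 17:03:09</FONT>"
)

def strip_tags(markup):
    """Drop everything between '<' and '>' and keep the text in between."""
    return re.sub(r"<[^>]*>", "", markup)

def receipt_line(markup):
    """Return 'Transaction Receipt: <datetime>' or None (hypothetical helper)."""
    text = strip_tags(markup)
    found = re.search(
        r"Transaction Receipt\s*(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})", text
    )
    if found is None:
        return None
    # Assumption: month/day/year order; strptime raises ValueError on impossible dates.
    datetime.strptime(found.group(1), "%m/%d/%Y %H:%M:%S")
    return "Transaction Receipt: " + found.group(1)

print(receipt_line(html))
```

Unlike the plain tag-stripping rule, this only succeeds when a well-formed datetime actually follows the label, which is closer to what the question asks for.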

Timely as per usual: if the format is consistent, you can always try to explicitly match dates:
rule: use [dg tag date value][
    tag: use [chars][
        chars: charset [#"a" - #"z" #"A" - #"Z" #"0" - #"9" " =-"]
        ["<" opt "/" some chars ">"]
    ]
    date: use [dg mo dy yr tm][
        dg: charset "0123456789"
        [
            copy mo [2 dg "/"] copy dy [2 dg "/"] copy yr 4 dg
            " " copy tm [2 dg ":" 2 dg ":" 2 dg]
            (value: load rejoin [dy mo yr "/" tm])
        ]
    ]
    [
        some [
            "Transaction Receipt" (probe "Transaction Receipt")
            | date (probe value)
            ; everything else
            | some " " | tag ; | skip ; will parse the whole doc...
        ]
    ]
]


Print double quotes in Forth

The word ." prints a string. More precisely it compiles the (.") and the string up to the next " in the currently compiled word.
But how can I print
That's the "question".
with Forth?
In a Forth-2012 System (e.g. Gforth) you can use string literals with escaping via the word s\" as:
: foo ( -- ) s\" That's the \"question\"." type ;
In a Forth-94 system (majority of standard systems) you can use arbitrary parsing and the word sliteral as:
: foo ( -- ) [ char | parse That's the "question".| ] sliteral type ;
A string can also be extracted up to the end of the line (without a printable delimiter); a multi-line string can be extracted too.
Specific helpers for particular cases can be easily defined.
For example, see the word s$ for string literals that are delimited by any arbitrary printable character, e.g.:
s$ `"test" 'passed'` type
Old school:
34 emit
Output:
"
Using gforth:
: d 34 emit ;
cr ." That's the " d ." question" d ." ." cr
Output:
That's the "question".

How do you specify the 8th namespaces in an array?

8th uses namespaces instead of vocabularies. Each namespace has its own integer representation.
ok> ns:a . cr ns:n . cr
4
2
So, 2 is for the number namespace, and 4 is for arrays.
I want to construct an array holding the namespaces which I can then place at the TOS (top of stack).
However, if I just write this
ok> [ ns:a , ns:n ]
Exception: invalid JSON array: at line 1 char 3 in ....: cr (G:;;; +000004c2)
Exception: can't find: :a: at line 1 char 6 in (null): cr (G:??? +00000029)
Exception: can't find: ,: at line 1 char 8 in (null): cr (G:??? +00000029)
Exception: can't find: ]: at line 1 char 15 in (null): n (G:??? +00000029)
I'm the developer of 8th. The solution with ' ns:a is not really what you want, since that puts the word in the array instead of the value that word would return.
You can accomplish what you're looking for by using the backtick:
[ ` ns:a ` ]
The backtick feeds the text up to the next backtick to eval and puts the value (whatever it is) in the JSON you're creating (it's not limited to JSON, it's a general construct).
You can store the function address instead in the array
[ ' ns:n , ' ns:a ]
and access the values by grabbing an array item and executing it
0 a:# w:exec . cr
2
ok>
You can also use anonymous functions
[ ( ns:a ) , ( ns:m ) ]

Does anyone have an efficient R3 function that mimics the behaviour of find/any in R2?

Rebol2 has an /ANY refinement on the FIND function that can do wildcard searches:
>> find/any "here is a string" "s?r"
== "string"
I use this extensively in tight loops that need to perform well. But the refinement was removed in Rebol3.
What's the most efficient way of doing this in Rebol3? (I'm guessing a parse solution of some sort.)
Here's a stab at handling the "*" case:
like: funct [
series [series!]
search [series!]
][
rule: copy []
remove-each s b: parse/all search "*" [empty? s]
foreach s b [
append rule reduce ['to s]
]
append rule [to end]
all [
parse series rule
find series first b
]
]
used as follows:
>> like "abcde" "b*d"
== "bcde"
I had edited your question for "clarity" and changed it to say 'was removed'. That made it sound like it was a deliberate decision. Yet it actually turns out it may just not have been implemented.
BUT if anyone asks me, I don't think it should be in the box...and not just because it's a lousy use of the word "ANY". Here's why:
You're looking for patterns in strings...so if you're constrained to using a string to specify that pattern you get into "meta" problems. Let's say I want to extract the word *Rebol* or ?Red?, now there has to be escaping and things get ugly all over again. Back to RegEx. :-/
So what you might actually want isn't a STRING! pattern like s?r but a BLOCK! pattern like ["s" ? "r"]. This would permit constructs like ["?" ? "?"] or [{?} ? {?}]. That's better than rehashing the string hackery that every other language uses.
And that's what PARSE does, albeit in a slightly-less-declarative way. It also uses words instead of symbols, as Rebol likes to do. [{?} skip {?}] is a match rule where skip is an instruction that moves the parse position past any single element of the parse series between the question marks. It could also do so if it were parsing a block as input, and would match [{?} 12-Dec-2012 {?}].
I don't know entirely what the behavior of /ANY would-or-should be with something like "ab??cd e?*f"... if it provided alternate pattern logic or what. I'm assuming the Rebol2 implementation is brief? So likely it only matches one pattern.
To set a baseline, here's a possibly-lame PARSE solution for the s?r intent:
>> parse "here is a string" [
    some [              ; match rule repeatedly
        to "s"          ; advance to *before* "s"
        pos:            ; save position as potential match
        skip            ; now skip the "s"
        [               ; [sub-rule]
            skip        ; ignore any single character (the "?")
            "r"         ; match the "r", and if we do...
            return pos  ; return the position we saved
        |               ; | (otherwise)
            none        ; no-op, keep trying to match
        ]
    ]
    fail                ; have PARSE return NONE
]
== "string"
If you wanted it to be s*r you would change the skip "r" return pos into a to "r" return pos.
On an efficiency note, I'll mention that characters are indeed matched against characters faster than strings are. So writing to #"s" and #"r" rather than to "s" and "r" makes a measurable difference in speed when parsing strings in general. Beyond that, I'm sure others can do better.
The rule is certainly longer than "s?r". But it's not that long when comments are taken out:
[some [to #"s" pos: skip [skip #"r" return pos | none]] fail]
(Note: It does leak pos: as written. Is there a USE in PARSE, implemented or planned?)
Yet a nice thing about it is that it offers hook points at all the moments of decision, and without the escaping defects a naive string solution has. (I'm tempted to give my usual "Bad LEGO alligator vs. Good LEGO alligator" speech.)
But if you don't want to code in PARSE directly, it seems the real answer would be some kind of "Glob Expression"-to-PARSE compiler. It might be the best interpretation of glob Rebol would have, because you could do a one-off:
>> parse "here is a string" glob "s?r"
== "string"
Or if you are going to be doing the match often, cache the compiled expression. Also, let's imagine our block form uses words for literacy:
s?r-rule: glob ["s" one "r"]
pos-1: parse "here is a string" s?r-rule
pos-2: parse "reuse compiled RegEx string" s?r-rule
It might be interesting to see such a compiler for regex as well. These also might accept not only string input but also block input, so that both "s.r" and ["s" . "r"] were legal...and if you used the block form you wouldn't need escaping and could write ["." . "."] to match ".A."
Fairly interesting things would be possible. Given that in RegEx:
(abc|def)=\g{1}
matches abc=abc or def=def
but not abc=def or def=abc
Rebol could be modified to take either the string form or compile into a PARSE rule with a form like:
regex [("abc" | "def") "=" (1)]
Then you get a dialect variation that doesn't need escaping. Designing and writing such compilers is left as an exercise for the reader. :-)
I've broken this into two functions: one that creates a rule to match the given search value, and the other to perform the search. Separating the two allows you to reuse the same generated parse block where one search value is applied over multiple iterations:
expand-wildcards: use [literal][
literal: complement charset "*?"
func [
{Creates a PARSE rule matching VALUE expanding * (any characters) and ? (any one character)}
value [any-string!] "Value to expand"
/local part
][
collect [
parse value [
; empty search string FAIL
end (keep [return (none)])
|
; only wildcard return HEAD
some #"*" end (keep [to end])
|
; everything else...
some [
; single char matches
#"?" (keep 'skip)
|
; textual match
copy part some literal (keep part)
|
; indicates the use of THRU for the next string
some #"*"
; but first we're going to match single chars
any [#"?" (keep 'skip)]
; it's optional in case there's a "*?*" sequence
; in which case, we're going to ignore the first "*"
opt [
copy part some literal (
keep 'thru keep part
)
]
]
]
]
]
]
like: func [
{Finds a value in a series and returns the series at the start of it.}
series [any-string!] "Series to search"
value [any-string! block!] "Value to find"
/local skips result
][
; shortens the search a little where the search starts with a regular char
skips: switch/default first value [
#[none] #"*" #"?" ['skip]
][
reduce ['skip 'to first value]
]
any [
block? value
value: expand-wildcards value
]
parse series [
some [
; we have our match
result: value
; and return it
return (result)
|
; step through the string until we get a match
skips
]
; at the end of the string, no matches
fail
]
]
Splitting the function also gives you a base to optimize the two different concerns: finding the start and matching the value.
I went with PARSE because, even though * and ? are seemingly simple rules, there is nothing quite as expressive and quick as PARSE for implementing such a search.
It might yet be worth, as per @HostileFork, considering a dialect instead of strings with wildcards, indeed to the point where Regex is replaced by a compile-to-parse dialect, but that is perhaps beyond the scope of the question.
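The split between compiling a pattern once and searching with it many times is not specific to PARSE. As a rough illustration only, here is a Python sketch (not Rebol) that borrows the names expand_wildcards and like from the answer above but compiles to a regular expression rather than a PARSE rule. It assumes * means any run of characters and ? means exactly one, and it treats an empty pattern as a failed match, as the Rebol version does.

```python
import re

def expand_wildcards(pattern):
    """Compile a wildcard string into a reusable regex object."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*?")          # any run of characters, shortest first
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal text, escaped
    return re.compile("".join(parts), re.DOTALL)

def like(series, rule):
    """Return series from the start of the first match, or None."""
    if isinstance(rule, str):
        if not rule:                     # empty search string never matches
            return None
        rule = expand_wildcards(rule)
    found = rule.search(series)
    return series[found.start():] if found else None

# Compile once, reuse across many searches.
s_r = expand_wildcards("s?r")
print(like("here is a string", s_r))
print(like("abcde", "b*d"))
```

Note that this sketch has exactly the escaping limitation discussed above: there is no way to search for a literal * or ? without extending the pattern syntax.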

How to express branch in Rebol PARSE dialect?

I have a mysql schema like below:
data: {
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) DEFAULT '' COMMENT 'the name',
`content` text COMMENT 'something',
}
Now I want to extract some info from it: the field name, type and comment, if any. See below:
["id" "int" "" "name" "varchar" "the name" "content" "text" "something" ]
My code is:
parse data [
any [
thru {`} copy field to {`} {`}
thru some space copy field-type to [ {(} | space]
(comm: "")
opt [ thru {COMMENT} thru some space thru {'} copy comm to {'}]
(repend temp field repend temp field-type either comm [ repend temp comm ][ repend temp ""])
]
]
but I get something like this:
["id" "int" "the name" "content" "text" "something"]
I know the opt [...] line is not right.
I want to express this: if the COMMENT keyword is found first, extract the comment info; if a line feed is found first, continue with the next iteration of the loop. But I don't know how to express it. Can anyone help?
I much favour (where possible) building up a set of grammar rules with positive terms to match target input—I find it's more literate, precise, flexible and easier to debug. In your snippet above, we can identify five core components:
space: use [space][
space: charset "^-^/ "
[some space]
]
word: use [letter][
letter: charset [#"a" - #"z" #"A" - #"Z" "_"]
[some letter]
]
id: use [letter][
letter: complement charset "`"
[some letter]
]
number: use [digit][
digit: charset "0123456789"
[some digit]
]
string: use [char][
char: complement charset "'"
[any [some char | "''"]]
]
With terms defined, writing a rule that describes the grammar of the input is relatively trivial:
result: collect [
    parsed?: parse/all data [ ; parse/all for Rebol 2 compatibility
        opt space
        some [
            (field: type: none comment: copy "")
            "`" copy field id "`"
            space
            copy type word opt ["(" number ")"]
            any [
                space [
                    "COMMENT" space "'" copy comment string "'"
                    | word | "'" string "'" | number
                ]
            ]
            opt space "," (keep reduce [field type comment])
            opt space
        ]
    ]
]
As an added bonus, we can validate the input.
if parsed? [new-line/all/skip result true 3]
One wee application of new-line to smarten things up a little should yield:
== [
"id" "int" ""
"name" "varchar" "the name"
"content" "text" "something"
]
I think this is closer to what you are after.
data: {
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) DEFAULT '' COMMENT 'the name',
`content` text COMMENT 'something',
}
temp: []
parse data [
any [
thru {`} copy field to {`} {`}
some space copy field-type to [ {(} | space]
(comm: copy "")
opt [ thru {COMMENT} some space thru {'} copy comm to {'}]
(repend temp field repend temp field-type either comm [ repend temp comm ][ repend temp ""])
]
]
probe temp
To break down the differences:
1. Set up a word with an empty block for temp.
2. Changed thru some space to just some space, as this will move forward through the series in the same way. Note that the following returns false:
parse " " [ thru some space ]
3. Changed comm: "" to comm: copy "" to make sure you get a new string each time you extract the comment (this does not seem to affect the output, but is good practice).
4. Changed {COMMENT} thru some space to {COMMENT} some space, as per point 2.
5. Added a probe on the end for debugging.
As a note, you can use ?? (almost) anywhere in a parse rule to help with debugging which will show you your current position.
parse/all for string parsing
data: {
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) DEFAULT '' COMMENT 'the name',
`content` text COMMENT 'something',
}
nodata: charset { ()'}
dat: complement nodata
collect [
parse/all data [
some [
thru {`} copy field to {`} (keep field) skip
some " " copy type some dat ( keep type comm: copy "" )
copy rest thru "," (
parse/all rest [
some [
["," (keep comm) ]
| ["COMMENT" some nodata copy comm to "'" ]
| skip
]
]
)
]
]
]
== ["id" "int" "" "name" "varchar" "the name" "content" "text" "something"]
another (better) solution with pure parse
collect [
probe parse/all data [
some [
thru {`} copy field to {`} (keep field) skip
some " " copy type some dat ( keep type comm: "" further: [])
some [
"," (keep comm further: [ to end skip])
| ["COMMENT" some nodata copy comm to "'" ]
| skip further
]
]
]
]
I figured out an alternative: read the data as a block! of lines rather than as a single string!, then parse each line separately.
data: read/lines %data.txt
probe data
temp: copy []
foreach d data [
parse d [
thru {`} copy field to {`} {`}
thru some space copy field-type to [ {(} | space]
(comm: "")
opt [ thru {COMMENT} thru some space thru {'} copy comm to {'}]
(repend temp field repend temp field-type either comm [ repend temp comm ][ repend temp ""])
]
]
probe temp
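The line-at-a-time approach makes the branch easy to state: within one line, either a COMMENT clause is present or it is not. As a rough comparison in Python (not Rebol), the sketch below does the same thing with two regular expressions. The names FIELD, COMMENT and extract are hypothetical, and it assumes one column definition per line, as in the sample schema.

```python
import re

# Sample schema copied from the question.
data = """
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) DEFAULT '' COMMENT 'the name',
`content` text COMMENT 'something',
"""

FIELD = re.compile(r"`([^`]+)`\s+([A-Za-z_]+)")      # `name` then the type word
COMMENT = re.compile(r"COMMENT\s+'((?:[^']|'')*)'")  # optional COMMENT '...'

def extract(schema):
    """Return a flat list: field, type, comment ('' when there is none)."""
    out = []
    for line in schema.splitlines():
        head = FIELD.search(line)
        if head is None:
            continue                      # blank or unrelated line
        note = COMMENT.search(line)       # the branch: present or absent
        out.extend([head.group(1), head.group(2), note.group(1) if note else ""])
    return out

print(extract(data))
```

Because each search is confined to a single line, a missing COMMENT on one column can never swallow the comment of the next one, which is the failure shown in the question.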

How do I use collect keep in parse, to get embedded blocks?

Looking at the html example here: http://www.red-lang.org/2013/11/041-introducing-parse.html
I would like to parse the following:
"val1-12*more text-something"
Where:
"-" marks values which should be in the same block, and
"*" should start a new block.
So, I want this:
[ ["val1" "12"] ["more text" "something"] ]
and at the moment I get this:
red>> data: "val1-12*more text-something"
== "val1-12*more text-something"
red>> c: charset reduce ['not #"-" #"*"]
== make bitset! [not #{000000000024}]
red>> parse data [collect [any [keep any c [#"-" | #"*" | end ]]]]
== ["val1" "12" "more text" "something"]
(I actually tried some other permutations, which didn't get me any farther.)
So, what's missing?
You can make it work by nesting COLLECT. For example:
keep-pair: [
keep some c
#"-"
keep some c
]
parse data [
collect [
some [
collect [keep-pair]
#"*"
collect [keep-pair]
]
]
]
Using your example input this outputs the result you wanted:
[["val1" "12"] ["more text" "something"]]
However, I have a funny feeling you may want the parse rule to be more flexible than the example input suggests?
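To make the intended shape of a more flexible result concrete, here is a small Python sketch (not Red) of the target structure for any number of groups and any number of values per group. The function name nest is made up for illustration, and it assumes "*" and "-" never appear inside a value.

```python
def nest(data):
    """Split on '*' into groups, then split each group on '-' into values."""
    return [group.split("-") for group in data.split("*")]

print(nest("val1-12*more text-something"))
print(nest("a-b-c*d*e-f"))
```

A Red rule with the same flexibility would presumably need the inner COLLECT to repeat for each "*"-separated group rather than appearing exactly twice.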
