How can I extract URLs from a huge one-line text file?

I have a text file that I want to extract links from.
The problem is that the text file is only one line with a lot of links!
Or, when I open it in Notepad, it shows up as a lot of lines, but not organized.
Sample text:
[{"participants": ["minanageh379", "xcsadc"], "conversation":
[{"sender": "minanageh379", "created_at":
"2019-04-12T12:51:56.560361+00:00", "media":
"https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2"},
{"sender": "minanageh379", "created_at":
"2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender":
"minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00",
"text": "sdsa"}, {"sender": "xcsadc", "created_at":
"2019-04-12T12:50:57.283147+00:00", "text": "👩‍❤️‍💋‍👩"}, {"sender":
"xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text":
"czx"}, {"sender": "xcsadc", "created_at":
"2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender":
"xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media":
"https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"},
{"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00",
"text": "hi hi hi"}]}]
Expected result:
https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2
https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2
Update: here is another sample entry:
{"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"},

Try with this:
First of all, we are going to remove all characters that are not part of a valid URL, a quote, or whitespace. That will remove emojis, which seem to cause trouble with Boost regexes in Notepad++ under some circumstances.
Our first replacement will be:
Search: [^a-zA-Z0-9_\-.~:\/?#\[\]#!$&'()*+,;=%"\s]
Replace by: (leave empty)
Replace all
(That step may not be needed in future versions of Notepad++.)
After the cleanup, we do the following replacement:
Search: (?i)(?:(?:(?!https?:).(?!https?:))*?"sender"\s*+:\s*+"([^"]*)"|\G)(?:.(?!"sender"\s*+:\s*+))*?(https?:.*?(?=[^a-zA-Z0-9_\-.~:\/?#\[\]#!$&'()*+,;=%]|https?:))|.*
Replacement: (?{1}\n\n\1\t\2:(?{2}\t\2))
Replace all
This should work even with "text" attributes that have several URLs inside. The URLs will be separated by tabs.
So after applying the previous procedure to this data:
[{"participants": ["minanageh379", "xcsadc"], "conversation": [{"sender": "minanageh379", "created_at": "2019-04-12T12:51:56.560361+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2 http://foo.barhttps://bar.foo"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:51.923138+00:00", "text": "sd"}, {"sender": "minanageh379", "created_at": "2019-04-12T12:51:41.689524+00:00", "text": "sdsa"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:50:57.283147+00:00", "text": "👩‍❤️‍💋‍👩"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:34.352752+00:00", "text": "dsad"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:30.889023+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "hi hi hi"}, {"sender": "no_media_no_text", "created_at": "2019-04-12T12:38:54.823472+00:00"}, {"sender": "url_inside_text", "created_at": "2019-04-12T12:38:54.823472+00:00", "text": "Hi! {check} this url: \"http://foo.bar\" another url: https://new.url.com/ yet another one: https://google.com/"}, {"sender": "ncccy", "created_at": "2019-01-28T17:09:29.216184+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2"}, {"sender": "ny", "created_at": "2017-10-22T20:49:50.042588+00:00", "media": "https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2"}, {"sender": "xcsadc", "created_at": "2019-04-12T12:39:35.248517+00:00", "text": "czx"}]}]
We get:
minanageh379 https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2 http://foo.bar https://bar.foo
xcsadc https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2
url_inside_text http://foo.bar https://new.url.com/ https://google.com/
ncccy https://scontent-lax3-1.cdninstagram.com/vp/57c43d748xcasc1abf58c890c5a6df042/5D199AE8/t51.2885-15/e15/p480x480/49913269_2181952555454636_8892094125900591548_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg1NjgsdasdAwNjgxNTk1OTY0OTIwMTA1NTMzNDQ%3D.2
ny https://scontent-lax3-1.cdninstagram.com/vp/19d94ea45c2102a0f7c97838ef546b93/5D14B3C3/t51.2885-15/e15/22708873_149637425772501_5029503881546039296_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjc4MzA3MDIyMTI3NDE3Njc3NTQxNTM1NTI2MjQyMjIyMDg%3D.2
You may get duplicate URLs if they appear more than once in the original input (in the same attribute or in different ones).
Once processed, you can remove duplicates with this regex:
Search: (?i)\t(https?:\S++)(?=[^\n]+\1)
Replace by: (nothing)
Replace All
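If you'd rather do this outside Notepad++, here is a rough Python sketch of the same idea (not part of the original answer; data.txt is a placeholder file name):

import re

# Rough Python equivalent of the two replacements above; data.txt is a
# placeholder file name. It prints each sender followed by the URLs in
# that message, tab-separated, skipping URLs already seen.
url_char = r"[\w\-.~:/?#\[\]!$&'()*+,;=%]"
text = open("data.txt", encoding="utf-8").read()

seen = set()
for sender, body in re.findall(r'"sender"\s*:\s*"([^"]*)"(.*?)(?="sender"|$)', text):
    urls = []
    # the lookahead splits URLs that are glued together, as the editor regex does
    for url in re.findall(r"https?:(?:(?!https?:)" + url_char + r")*", body):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    if urls:
        print(sender + "\t" + "\t".join(urls))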

Ctrl+H
Find what: (?:^|\G).*?"media": "(https://[^"]+)(?:(?!https:).)*
Replace with: $1\n
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?:^|\G) # beginning of line OR restart from last match position
.*? # 0 or more any character but newline, not greedy
"media": " # literally
( # start group 1
https://[^"]+ # https:// followed by 1 or more non-double-quote characters, the URL
) # end group 1
(?:(?!https:).)* # Tempered greedy token, make sure we haven't "https" after
Replacement:
$1 # content of group 1, the URL
Result for given example:
https://scontent-lax3-1.cdninstagram.com/vp/edddf95178aca7bf75930ab8698ee45b/5D45203B/t51.2885-15/fr/e15/s1080x1080/55823673_114448266206459_7321604432125975069_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwNDMxNzU3OTI1MTE1NTAxNjQ1NTk5MDkwOTMzNzY%3D.2
https://scontent-lax3-1.cdninstagram.com/vp/e985406d6eac06bb11c2d6052c1821a2/5D508106/t51.2885-15/e15/s640x640/56218099_577906226037731_8663356006073884002_n.jpg?_nc_ht=scontent-lax3-1.cdninstagram.com&ig_cache_key=Mjg2ODYwMjk0MjA1ODQxNzYzNjM1OTI1ODMwMjYzMTExNjg%3D.2
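If you need the same extraction in a script instead of the editor, a minimal Python sketch of this answer's approach (data.txt is a placeholder file name):

import re

# Minimal sketch of the same "media"-keyed extraction; data.txt is a
# placeholder. In a script we can rely on the closing double quote
# instead of the tempered greedy token used in the editor regex.
text = open("data.txt", encoding="utf-8").read()
for url in re.findall(r'"media": "(https://[^"]+)"', text):
    print(url)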

While the other answers do exactly what you need, one thing to note is that the string you gave is valid JSON. You can verify that with any online JSON validator.
If you're dealing with this string in a program, you may want to consider using a JSON parser for your language, such as Python's built-in json module.
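For instance, a minimal sketch (data.txt is a placeholder file name; the structure assumed is the one in the question's sample):

import json

# Since the input is valid JSON, a parser is more robust than any regex.
# data.txt is a placeholder; the sample is a list of conversations, and
# each message optionally carries a "media" key holding a URL.
with open("data.txt", encoding="utf-8") as f:
    conversations = json.load(f)

for conv in conversations:
    for message in conv["conversation"]:
        if "media" in message:
            print(message["media"])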

To extract just the links from the text file, do a regular expression Replace All using the following:
Find what:
.*?(https?:[^"]+)(?(?!.*?https?:).*)
Replace with:
$1\n\n
Demo 1
Note that you need to have Wrap around checked in case the insertion point is not at the start of the text.
Explanation:
.*?(https?:[^"]+)(?(?!.*?https?:).*)
|_||____________||_________________|
| ____| |
| | ________________|
| | |
| | [3] If there are no more following links, grab and discard the rest of the text
| [2] Store the link in $1 (starting with http and ending just before the first following")
[1] Grab and discard everything up 'til the first link (i.e. starting with http: or https:)
When using Replace All, searching and replacing continues automatically until the regex fails to match, with each new attempt starting where the previous match ended: in this case, just before the double quote at the end of the current link if there are more links, or at the end of the text otherwise.
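The Replace All trick is only needed because you are working inside an editor; a script can simply collect the matches instead of rewriting the buffer. A rough Python translation (not part of the original answer; data.txt is a placeholder file name):

import re

# Collect every link instead of rewriting the buffer; this mirrors the
# $1\n\n replacement above. Like the editor regex, each link runs from
# "http" up to the next double quote.
text = open("data.txt", encoding="utf-8").read()
print("\n\n".join(re.findall(r'https?:[^"]+', text)))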
To also extract the sender, use the following:
Find what:
.*?\{(?:([^"]*)"){4}[^{}]*?(https?:[^"]+)(?(?!.*?https?:).*)
Replace with:
$1 $2\n\n
Demo 2
An alternative regex to do the same, but which is probably a little clearer is:
.*?"sender": "([^"]*)[^}]*?(https?:[^"]+)(?(?!.*?https?:).*)
Demo 3
Explanation:
.*?"sender": "([^"]*)[^}]*?(https?:[^"]+)(?(?!.*?https?:).*)
|_||_________||_____||____||____________||_________________|
| ___| ______| ___| _______| _____________|
| | __| ______| ___| _______|
| | | _| _____| ___|
| | | | _| _____|
| | | | | |
| | | | | [6] If there are no more following links, grab and discard the rest of the text
| | | | [5] Store the link in $2 (starting with http and ending just before the first following")
| | | [4] Grab and discard everything within the current set of braces up 'til the link
| | [3] Store the sender name in $1
| [2] Grab and discard "sender": " (i.e. up to the opening quote of the sender name)
[1] Grab and discard everything up 'til the first "sender" key which has an associated link
Step [1] works by initially starting at the beginning of the text and grabbing everything up to the first sender key, then grabbing the key via [2], grabbing the sender name in [3], and grabbing everything up to the associated link if it exists in [4]. If there is no associated link, [5] fails, and the regex backtracks to step [1] which continues grabbing everything after the first sender key up to the second sender key. This cycle is repeated until a sender key is found which has an associated link.
At this point, step [5] succeeds and then step [6] either grabs the rest of the text or nothing.
Finally, all the grabbed text is replaced by $1 $2\n\n, i.e. the sender name followed by a space, the link and then two newline characters.
This completes the first "replace". Since Replace All was selected, the whole process starts again, but with the text pointer either at the double quote at the end of the previously found link, or at the end of the text instead of at the start.
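Outside the editor, the sender-plus-link variant also has a rough (if less thorough) Python counterpart; note it only grabs the first link of each message, and, like the editor regex, it assumes no stray braces between the sender and the link (data.txt is a placeholder file name):

import re

# Rough counterpart of the sender-plus-link regex above: grabs the first
# link that follows each "sender" key within the same object, if any.
text = open("data.txt", encoding="utf-8").read()
for sender, link in re.findall(r'"sender": "([^"]*)"[^}]*?(https?:[^"]+)', text):
    print(sender, link)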

Yet another alternative would be to parse the JSON data.
You can do this with JavaScript.
The following snippet should work for parsing your data. It should even work with several URLs inside the same text message:
yourJSON[0].conversation
    .filter(x => x.media !== undefined || x.text !== undefined && /https?:/i.test(x.text))
    .map(x => {
        const tmp = x.text + ' ' + x.media;
        const urls = tmp.match(/https?:[\w\-.~:\/?#\[\]#!$&'()*+,;=%]*/g);
        return x.sender + ":\n" + urls.join("\n");
    })
    .join("\n\n");
You can paste that JavaScript (replacing yourJSON with your data) into any browser that has a JavaScript console, such as Firefox or Chrome. In Firefox you can launch the console with Control + Shift + K, and in Chrome with Control + Shift + I, then click 'Console'.
As an alternative, you may use this jsfiddle instead.
Edit the JavaScript pane to use your data and then press the "Run" button.

Related

Which field can I use to capture a string in Twilio Autopilot. Twilio.ALPHANUMERIC does not work

I want to collect a portion of the message as a string that can contain a combination of letters (upper/lower case), numbers, dashes, spaces, and slashes, and I cannot define all the possible values in a Custom Field because they are dynamic. This string can be:
1. AFCON 2020
2. England
3. FIFA WC 2016
etc.
I have tried using TWILIO.ALPHANUMERIC, but it cuts out other parts or returns only the first letter. How can I solve this?
I guess you already found the list of Twilio's built-in field types in the documentation. If none of the built-ins work for you, you have two options:
A) Don't provide a type. If you don't provide a type, free-form input will be collected. See the documentation.
B) Use a custom field type. If you have only a selected number of options then this should be your choice.
For B) you could specify this as:
"fieldTypes": [
    {
        "uniqueName": "Custom.SELECTIONS",
        "values": [
            {
                "language": "en-US",
                "value": "AFCON 2020",
                "synonymOf": null
            },
            {
                "language": "en-US",
                "value": "England",
                "synonymOf": null
            },
            {
                "language": "en-US",
                "value": "FIFA WC 2016",
                "synonymOf": null
            },
            ...
        ]
    }
]
You would then use this as "type": "Custom.SELECTIONS".
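For reference, a Collect action could then ask a question against that type; a minimal sketch (the task and field names here are made up, not from the original answer):

{
    "actions": [
        {
            "collect": {
                "name": "collect_competition",
                "questions": [
                    {
                        "question": "Which competition do you mean?",
                        "name": "competition",
                        "type": "Custom.SELECTIONS"
                    }
                ]
            }
        }
    ]
}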

dredd fails to run with errors like 'Data does not match any schemas from anyOf' and more

Trying to run dredd against Swagger documentation.
Dredd fails with the following errors:
- error: API description parser error in /albums.json:266 (from line 266 column 10 to column 21): Data does not match any schemas from 'anyOf'
- error: API description parser error in /albums.json:266 (from line 266 column 10 to column 21): No enum match for: s
- error: API description parser error in /albums.json:266 (from line 266 column 10 to column 21): Expected type array but found type string
The errors refer to this part of the JSON:
265 "photos": { "$ref": "#/definitions/PhotoEntity" },
266 "created_at": {
267 "type": "s",
268 "format": "g",
269 "description": "Дата создания"
270 }
Full JSON is available as a gist.
Swagger UI works with this JSON perfectly, and manual testing passes, as expected.
Replace
"type": "s",
with
"type": "string",
There are also other errors in your API definition; use https://editor.swagger.io to check for syntax errors.
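For example, the offending property could become something like the following (note that "format": "g" is not a standard Swagger format either; date-time is the usual choice for timestamps, and the description is translated here):

"created_at": {
    "type": "string",
    "format": "date-time",
    "description": "Creation date"
}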

Does anyone have an efficient R3 function that mimics the behaviour of find/any in R2?

Rebol2 has an /ANY refinement on the FIND function that can do wildcard searches:
>> find/any "here is a string" "s?r"
== "string"
I use this extensively in tight loops that need to perform well. But the refinement was removed in Rebol3.
What's the most efficient way of doing this in Rebol3? (I'm guessing a parse solution of some sort.)
Here's a stab at handling the "*" case:
like: funct [
series [series!]
search [series!]
][
rule: copy []
remove-each s b: parse/all search "*" [empty? s]
foreach s b [
append rule reduce ['to s]
]
append rule [to end]
all [
parse series rule
find series first b
]
]
used as follows:
>> like "abcde" "b*d"
== "bcde"
I had edited your question for "clarity" and changed it to say 'was removed'. That made it sound like it was a deliberate decision. Yet it actually turns out it may just not have been implemented.
BUT if anyone asks me, I don't think it should be in the box...and not just because it's a lousy use of the word "ALL". Here's why:
You're looking for patterns in strings...so if you're constrained to using a string to specify that pattern you get into "meta" problems. Let's say I want to extract the word *Rebol* or ?Red?; now there has to be escaping and things get ugly all over again. Back to RegEx. :-/
So what you might actually want isn't a STRING! pattern like s?r but a BLOCK! pattern like ["s" ? "r"]. This would permit constructs like ["?" ? "?"] or [{?} ? {?}]. That's better than rehashing the string hackery that every other language uses.
And that's what PARSE does, albeit in a slightly-less-declarative way. It also uses words instead of symbols, as Rebol likes to do. [{?} skip {?}] is a match rule where skip is an instruction that moves the parse position past any single element of the parse series between the question marks. It could also do so if it were parsing a block as input, and would match [{?} 12-Dec-2012 {?}].
I don't know entirely what the behavior of /ALL would-or-should be with something like "ab??cd e?*f"... if it provided alternate pattern logic or what. I'm assuming the Rebol2 implementation is brief? So likely it only matches one pattern.
To set a baseline, here's a possibly-lame PARSE solution for the s?r intent:
>> parse "here is a string" [
some [ ; match rule repeatedly
to "s" ; advance to *before* "s"
pos: ; save position as potential match
skip ; now skip the "s"
[ ; [sub-rule]
skip ; ignore any single character (the "?")
"r" ; match the "r", and if we do...
return pos ; return the position we saved
| ; | (otherwise)
none ; no-op, keep trying to match
]
]
fail ; have PARSE return NONE
]
== "string"
If you wanted it to be s*r you would change the skip "r" return pos into a to "r" return pos.
On an efficiency note, I'll mention that it is indeed the case that characters are matched against characters faster than strings. So to #"s" and #"r" make a measurable difference in the speed when parsing strings in general. Beyond that, I'm sure others can do better.
The rule is certainly longer than "s?r". But it's not that long when comments are taken out:
[some [to #"s" pos: skip [skip #"r" return pos | none]] fail]
(Note: It does leak pos: as written. Is there a USE in PARSE, implemented or planned?)
Yet a nice thing about it is that it offers hook points at all the moments of decision, and without the escaping defects a naive string solution has. (I'm tempted to give my usual "Bad LEGO alligator vs. Good LEGO alligator" speech.)
But if you don't want to code in PARSE directly, it seems the real answer would be some kind of "Glob Expression"-to-PARSE compiler. It might be the best interpretation of glob Rebol would have, because you could do a one-off:
>> parse "here is a string" glob "s?r"
== "string"
Or if you are going to be doing the match often, cache the compiled expression. Also, let's imagine our block form uses words for literacy:
s?r-rule: glob ["s" one "r"]
pos-1: parse "here is a string" s?r-rule
pos-2: parse "reuse compiled RegEx string" s?r-rule
It might be interesting to see such a compiler for regex as well. These also might accept not only string input but also block input, so that both "s.r" and ["s" . "r"] were legal...and if you used the block form you wouldn't need escaping and could write ["." . "."] to match ".A."
Fairly interesting things would be possible. Given that in RegEx:
(abc|def)=\g{1}
matches abc=abc or def=def
but not abc=def or def=abc
Rebol could be modified to take either the string form or compile into a PARSE rule with a form like:
regex [("abc" | "def") "=" (1)]
Then you get a dialect variation that doesn't need escaping. Designing and writing such compilers is left as an exercise for the reader. :-)
I've broken this into two functions: one that creates a rule to match the given search value, and the other to perform the search. Separating the two allows you to reuse the same generated parse block where one search value is applied over multiple iterations:
expand-wildcards: use [literal][
literal: complement charset "*?"
func [
{Creates a PARSE rule matching VALUE expanding * (any characters) and ? (any one character)}
value [any-string!] "Value to expand"
/local part
][
collect [
parse value [
; empty search string FAIL
end (keep [return (none)])
|
; only wildcard return HEAD
some #"*" end (keep [to end])
|
; everything else...
some [
; single char matches
#"?" (keep 'skip)
|
; textual match
copy part some literal (keep part)
|
; indicates the use of THRU for the next string
some #"*"
; but first we're going to match single chars
any [#"?" (keep 'skip)]
; it's optional in case there's a "*?*" sequence
; in which case, we're going to ignore the first "*"
opt [
copy part some literal (
keep 'thru keep part
)
]
]
]
]
]
]
like: func [
{Finds a value in a series and returns the series at the start of it.}
series [any-string!] "Series to search"
value [any-string! block!] "Value to find"
/local skips result
][
; shortens the search a little where the search starts with a regular char
skips: switch/default first value [
#[none] #"*" #"?" ['skip]
][
reduce ['skip 'to first value]
]
any [
block? value
value: expand-wildcards value
]
parse series [
some [
; we have our match
result: value
; and return it
return (result)
|
; step through the string until we get a match
skips
]
; at the end of the string, no matches
fail
]
]
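Used together, something like this hypothetical session (untested, assuming the definitions above):
>> like "here is a string" "s?r"
== "string"
>> rule: expand-wildcards "s?r"    ; compile the pattern once...
== ["s" skip "r"]
>> like "here is a string" rule    ; ...then reuse it across many searches
== "string"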
Splitting the function also gives you a base to optimize the two different concerns: finding the start and matching the value.
I went with PARSE as, even though *? are seemingly simple rules, there is nothing quite as expressive and quick as PARSE for effectively implementing such a search.
It might yet be worth, as per HostileFork's answer, considering a dialect instead of strings with wildcards, indeed to the point where regex is replaced by a compile-to-PARSE dialect, but that is perhaps beyond the scope of the question.

How to express branch in Rebol PARSE dialect?

I have a MySQL schema like the one below:
data: {
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) DEFAULT '' COMMENT 'the name',
`content` text COMMENT 'something',
}
Now I want to extract some info from it: the field name, type, and comment if any. See below:
["id" "int" "" "name" "varchar" "the name" "content" "text" "something" ]
My code is:
parse data [
any [
thru {`} copy field to {`} {`}
thru some space copy field-type to [ {(} | space]
(comm: "")
opt [ thru {COMMENT} thru some space thru {'} copy comm to {'}]
(repend temp field repend temp field-type either comm [ repend temp comm ][ repend temp ""])
]
]
but I get something like this:
["id" "int" "the name" "content" "text" "something"]
I know the line opt .. is not right.
I want express if found COMMENT key word first, then extract the comment info; if found lf first, then continue the next loop. But I don't know how to express it. Any one can help?
I much favour (where possible) building up a set of grammar rules with positive terms to match target input—I find it's more literate, precise, flexible and easier to debug. In your snippet above, we can identify five core components:
space: use [space][
space: charset "^-^/ "
[some space]
]
word: use [letter][
letter: charset [#"a" - #"z" #"A" - #"Z" "_"]
[some letter]
]
id: use [letter][
letter: complement charset "`"
[some letter]
]
number: use [digit][
digit: charset "0123456789"
[some digit]
]
string: use [char][
char: complement charset "'"
[any [some char | "''"]]
]
With terms defined, writing a rule that describes the grammar of the input is relatively trivial:
result: collect [
parsed?: parse/all data [ ; parse/all for Rebol 2 compatibility
opt space
some [
(field: type: none comment: copy "")
"`" copy field id "`"
space
copy type word opt ["(" number ")"]
any [
space [
"COMMENT" space "'" copy comment string "'"
| word | "'" string "'" | number
]
]
opt space "," (keep reduce [field type comment])
opt space
]
]
]
As an added bonus, we can validate the input.
if parsed? [new-line/all/skip result true 3]
One wee application of new-line to smarten things up a little should yield:
== [
"id" "int" ""
"name" "varchar" "the name"
"content" "text" "something"
]
I think this is closer to what you are after.
data: {
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) DEFAULT '' COMMENT 'the name',
`content` text COMMENT 'something',
}
temp: []
parse data [
any [
thru {`} copy field to {`} {`}
some space copy field-type to [ {(} | space]
(comm: copy "")
opt [ thru {COMMENT} some space thru {'} copy comm to {'}]
(repend temp field repend temp field-type either comm [ repend temp comm ][ repend temp ""])
]
]
probe temp
To break down the differences:
1. Set up a word with an empty block for temp.
2. Changed thru some space to just some space, as this will move forward through the series in the same way. Note that the following is false:
parse " " [ thru some space ]
3. Changed comm: "" to comm: copy "" to make sure you get a new string each time you extract the comment (does not seem to affect the output, but is good practice).
4. Changed {COMMENT} thru some space to {COMMENT} some space, as per point 2.
5. Just added a probe on the end for debugging.
As a note, you can use ?? (almost) anywhere in a parse rule to help with debugging; it will show you your current position.
parse/all for string parsing
data: {
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) DEFAULT '' COMMENT 'the name',
`content` text COMMENT 'something',
}
nodata: charset { ()'}
dat: complement nodata
collect [
parse/all data [
some [
thru {`} copy field to {`} (keep field) skip
some " " copy type some dat ( keep type comm: copy "" )
copy rest thru "," (
parse/all rest [
some [
["," (keep comm) ]
| ["COMMENT" some nodata copy comm to "'" ]
| skip
]
]
)
]
]
]
== ["id" "int" "" "name" "varchar" "the name" "content" "text" "something"]
another (better) solution with pure parse
collect [
probe parse/all data [
some [
thru {`} copy field to {`} (keep field) skip
some " " copy type some dat ( keep type comm: "" further: [])
some [
"," (keep comm further: [ to end skip])
| ["COMMENT" some nodata copy comm to "'" ]
| skip further
]
]
]
]
I figured out an alternative way, reading the data in as a block! rather than a string!:
data: read/lines %data.txt
probe data
temp: copy []
foreach d data [
parse d [
thru {`} copy field to {`} {`}
thru some space copy field-type to [ {(} | space]
(comm: "")
opt [ thru {COMMENT} thru some space thru {'} copy comm to {'}]
(repend temp field repend temp field-type either comm [ repend temp comm ][ repend temp ""])
]
]
probe temp

How do I use collect keep in parse, to get embedded blocks?

Looking at the html example here: http://www.red-lang.org/2013/11/041-introducing-parse.html
I would like to parse the following:
"val1-12*more text-something"
Where:
"-" marks values which should be in the same block, and
"*" should start a new block.
So, I want this:
[ ["val1" "12"] ["more text" "something"] ]
and at the moment I get this:
red>> data: "val1-12*more text-something"
== "val1-12*more text-something"
red>> c: charset reduce ['not #"-" #"*"]
== make bitset! [not #{000000000024}]
red>> parse data [collect [any [keep any c [#"-" | #"*" | end ]]]]
== ["val1" "12" "more text" "something"]
(I actually tried some other permutations, which didn't get me any farther.)
So, what's missing?
You can make it work by nesting COLLECT. For example:
keep-pair: [
keep some c
#"-"
keep some c
]
parse data [
collect [
some [
collect [keep-pair]
#"*"
collect [keep-pair]
]
]
]
Using your example input this outputs the result you wanted:
[["val1" "12"] ["more text" "something"]]
However, I have a funny feeling you may want the parse rule to be more flexible than this example input suggests?
