Double-capture a token - parsing

I have a SQl language where it allows strings that are double- or single-quoted, i.e., 'hello' and "hello". Additionally, it may have a JSON object, and within a JSON object, only a double-quoted string is valid, so we can have something like this in SQL:
SELECT "one", 'two', JSON "hello", JSON {"x": 2}
In other words, I'd like to have two string-constructions:
String_Literal_SQL: " ... " | ' ... ';
String_Literal_JSON: " ... ";
But, these would be 'fighting' to pick up the double-quoted string. Is the answer to this to push the rule from the Lexer to the Parser, and have different tokens for the strings, i.e., doing:
string_literal_SQL: Double_Quoted_String | Single_Quoted_String;
string_literal_JSON: Double_Quoted_String;
For whatever reason the above feels a bit hack-ish, but then again I don't think there's a way to consume a single token two ways.

Related

How do I get a list of strings that are between 2 strings in Lua?

I'm trying to write a HTML parser in Lua. A major roadblock I hit almost immediately was I have no idea how to get a list of strings between 2 strings. This is important for parsing HTML, where a tag is defined by being within 2 characters (or strings of length 1), namely '<' and '>'. I am aware of this answer, but it only gets the first occurence, not all instances of a string between the 2 given strings.
What I mean by "list of strings between 2 strings" is something like this:
someFunc("<a b c> <c b a> a </c b a> </a b c>", "<", ">")
Returns:
{"a b c", "c b a", "/c b a", "/a b c"}
This does not have to parse newlines nor text in between tags, as both of those can be handled using extra logic. However, I would prefer it if it did parse newlines, so I can run the code once for the whole string returned by the first GET request.
Note: This is a experiment project to see if this is possible in the very limited Lua environment provided by the CC: Tweaked mod for Minecraft. Documentation here: https://tweaked.cc/
You can do simply:
local list = {}
local html = [[bla]]
for innerTag in html:gmatch("<(.-)>") do
list[#list] = innerTag
end
But be aware that it is weak as doesn't validade wrong things, as someone can put an < or > inside a string etc.

Match a word or whitespaces in Lua

(Sorry for my broken English)
What I'm trying to do is matching a word (with or without numbers and special characters) or whitespace characters (whitespaces, tabs, optional new lines) in a string in Lua.
For example:
local my_string = "foo bar"
my_string:match(regex) --> should return 'foo', ' ', 'bar'
my_string = " 123!#." -- note: three whitespaces before '123!#.'
my_string:match(regex) --> should return ' ', ' ', ' ', '123!#.'
Where regex is the Lua regular expression pattern I'm asking for.
Of course I've done some research on Google, but I couldn't find anything useful. What I've got so far is [%s%S]+ and [%s+%S+] but it doesn't seem to work.
Any solution using the standart library, e.g. string.find, string.gmatch etc. is OK.
Match returns either captures or the whole match, your patterns do not define those. [%s%S]+ matches "(space or not space) multiple times more than once", basically - everything. [%s+%S+] is plain wrong, the character class [ ] is a set of single character members, it does not treat sequences of characters in any other way ("[cat]" matches "c" or "a"), nor it cares about +. The [%s+%S+] is probably "(a space or plus or not space or plus) single character"
The first example 'foo', ' ', 'bar' could be solved by:
regex="(%S+)(%s)(%S+)"
If you want a variable number of captures you are going to need the gmatch iterator:
local capt={}
for q,w,e in my_string:gmatch("(%s*)(%S+)(%s*)") do
if q and #q>0 then
table.insert(capt,q)
end
table.insert(capt,w)
if e and #e>0 then
table.insert(capt,e)
end
end
This will not however detect the leading spaces or discern between a single space and several, you'll need to add those checks to the match result processing.
Lua standard patterns are simplistic, if you are going to need more intricate matching, you might want to have a look at lua lpeg library.

Does anyone have an efficient R3 function that mimics the behaviour of find/any in R2?

Rebol2 has an /ANY refinement on the FIND function that can do wildcard searches:
>> find/any "here is a string" "s?r"
== "string"
I use this extensively in tight loops that need to perform well. But the refinement was removed in Rebol3.
What's the most efficient way of doing this in Rebol3? (I'm guessing a parse solution of some sort.)
Here's a stab at handling the "*" case:
like: funct [
series [series!]
search [series!]
][
rule: copy []
remove-each s b: parse/all search "*" [empty? s]
foreach s b [
append rule reduce ['to s]
]
append rule [to end]
all [
parse series rule
find series first b
]
]
used as follows:
>> like "abcde" "b*d"
== "bcde"
I had edited your question for "clarity" and changed it to say 'was removed'. That made it sound like it was a deliberate decision. Yet it actually turns out it may just not have been implemented.
BUT if anyone asks me, I don't think it should be in the box...and not just because it's a lousy use of the word "ALL". Here's why:
You're looking for patterns in strings...so if you're constrained to using a string to specify that pattern you get into "meta" problems. Let's say I want to extract the word *Rebol* or ?Red?, now there has to be escaping and things get ugly all over again. Back to RegEx. :-/
So what you might actually want isn't a STRING! pattern like s?r but a BLOCK! pattern like ["s" ? "r"]. This would permit constructs like ["?" ? "?"] or [{?} ? {?}]. That's better than rehashing the string hackery that every other language uses.
And that's what PARSE does, albeit in a slightly-less-declarative way. It also uses words instead of symbols, as Rebol likes to do. [{?} skip {?}] is a match rule where skip is an instruction that moves the parse position past any single element of the parse series between the question marks. It could also do so if it were parsing a block as input, and would match [{?} 12-Dec-2012 {?}].
I don't know entirely what the behavior of /ALL would-or-should be with something like "ab??cd e?*f"... if it provided alternate pattern logic or what. I'm assuming the Rebol2 implementation is brief? So likely it only matches one pattern.
To set a baseline, here's a possibly-lame PARSE solution for the s?r intent:
>> parse "here is a string" [
some [ ; match rule repeatedly
to "s" ; advance to *before* "s"
pos: ; save position as potential match
skip ; now skip the "s"
[ ; [sub-rule]
skip ; ignore any single character (the "?")
"r" ; match the "r", and if we do...
return pos ; return the position we saved
| ; | (otherwise)
none ; no-op, keep trying to match
]
]
fail ; have PARSE return NONE
]
== "string"
If you wanted it to be s*r you would change the skip "r" return pos into a to "r" return pos.
On an efficiency note, I'll mention that it is indeed the case that characters are matched against characters faster than strings. So to #"s" and #"r" to end make a measurable difference in the speed when parsing strings in general. Beyond that, I'm sure others can do better.
The rule is certainly longer than "s?r". But it's not that long when comments are taken out:
[some [to #"s" pos: skip [skip #"r" return pos | none]] fail]
(Note: It does leak pos: as written. Is there a USE in PARSE, implemented or planned?)
Yet a nice thing about it is that it offers hook points at all the moments of decision, and without the escaping defects a naive string solution has. (I'm tempted to give my usual "Bad LEGO alligator vs. Good LEGO alligator" speech.)
But if you don't want to code in PARSE directly, it seems the real answer would be some kind of "Glob Expression"-to-PARSE compiler. It might be the best interpretation of glob Rebol would have, because you could do a one-off:
>> parse "here is a string" glob "s?r"
== "string"
Or if you are going to be doing the match often, cache the compiled expression. Also, let's imagine our block form uses words for literacy:
s?r-rule: glob ["s" one "r"]
pos-1: parse "here is a string" s?r-rule
pos-2: parse "reuse compiled RegEx string" s?r-rule
It might be interesting to see such a compiler for regex as well. These also might accept not only string input but also block input, so that both "s.r" and ["s" . "r"] were legal...and if you used the block form you wouldn't need escaping and could write ["." . "."] to match ".A."
Fairly interesting things would be possible. Given that in RegEx:
(abc|def)=\g{1}
matches abc=abc or def=def
but not abc=def or def=abc
Rebol could be modified to take either the string form or compile into a PARSE rule with a form like:
regex [("abc" | "def") "=" (1)]
Then you get a dialect variation that doesn't need escaping. Designing and writing such compilers is left as an exercise for the reader. :-)
I've broken this into two functions: one that creates a rule to match the given search value, and the other to perform the search. Separating the two allows you to reuse the same generated parse block where one search value is applied over multiple iterations:
expand-wildcards: use [literal][
literal: complement charset "*?"
func [
{Creates a PARSE rule matching VALUE expanding * (any characters) and ? (any one character)}
value [any-string!] "Value to expand"
/local part
][
collect [
parse value [
; empty search string FAIL
end (keep [return (none)])
|
; only wildcard return HEAD
some #"*" end (keep [to end])
|
; everything else...
some [
; single char matches
#"?" (keep 'skip)
|
; textual match
copy part some literal (keep part)
|
; indicates the use of THRU for the next string
some #"*"
; but first we're going to match single chars
any [#"?" (keep 'skip)]
; it's optional in case there's a "*?*" sequence
; in which case, we're going to ignore the first "*"
opt [
copy part some literal (
keep 'thru keep part
)
]
]
]
]
]
]
like: func [
{Finds a value in a series and returns the series at the start of it.}
series [any-string!] "Series to search"
value [any-string! block!] "Value to find"
/local skips result
][
; shortens the search a little where the search starts with a regular char
skips: switch/default first value [
#[none] #"*" #"?" ['skip]
][
reduce ['skip 'to first value]
]
any [
block? value
value: expand-wildcards value
]
parse series [
some [
; we have our match
result: value
; and return it
return (result)
|
; step through the string until we get a match
skips
]
; at the end of the string, no matches
fail
]
]
Splitting the function also gives you a base to optimize the two different concerns: finding the start and matching the value.
I went with PARSE as even though *? are seemingly simple rules, there is nothing quite as expressive and quick as PARSE to effectively implementing such a search.
It might yet as per #HostileFork to consider a dialect instead of strings with wildcards—indeed to the point where Regex is replaced by a compile-to-parse dialect, but is perhaps beyond the scope of the question.

How do we generate all parameter combination to form a URL

Suppose I have a url
xyz.com/param1=abc&param2=123&param3=!##
Each parameter can have many values like:
param1 = abc, xyz, qwe
param2 = 123, 456, 789
param3 = !##, $%^, &*(
All the parameters and values will be read from an excel file. There can be n number of parameters and each may have any number of values.
I want to generate all the combinations which can be formed using all values of each parameters.
Output will be like:
xyz.com/param1=abc&param2=123&param3=!##
xyz.com/param1=xyz&param2=456&param3=!##
xyz.com/param1=qwe&param2=123&param3=!##
xyz.com/param1=qwe&param2=456&param3=!##
xyz.com/param1=xyz&param2=789&param3=$%^
...
..
and so on
besides the previous comment in your post, you need construct some nested loops.
I will assume you have a bash shell available (since you hadn't specified your wanted language).
for I in 'abc' 'xyz' 'qwe'
do
for J in '123' '456' '789'
do
for K in '!##' '$%^' '&*('
do
echo "xyz.com/param1=${I}&param2=${J}&param3=${K}"
done
done
done
Note that:
that '&*(' will bring you problems, since & is the character that you use to delimit each parameter.
double quotes " around echo, will make a double quote character to make this program to fail miserably
the same apply to ', \ and some others

Best way to count words in a string in Ruby?

Is there anything better than string.scan(/(\w|-)+/).size (the - is so, e.g., "one-way street" counts as 2 words instead of 3)?
string.split.size
Edited to explain multiple spaces
From the Ruby String Documentation page
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array
of these substrings.
If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern contains groups, the respective
matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is
the default), str is split on whitespace as if ' ' were specified.
If the limit parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
" now's the time".split #=> ["now's", "the", "time"]
While that is the current version of ruby as of this edit, I learned on 1.7 (IIRC), where that also worked. I just tested it on 1.8.3.
I know this is an old question, but this might be useful to someone else looking for something more sophisticated than string.split. I wrote the words_counted gem to solve this particular problem, since defining words is pretty tricky.
The gem lets you define your own custom criteria, or use the out of the box regexp, which is pretty handy for most use cases. You can pre-filter words with a variety of options, including a string, lambda, array, or another regexp.
counter = WordsCounted::Counter.new("Hello, Renée! 123")
counter.word_count #=> 2
counter.words #=> ["Hello", "Renée"]
# filter the word "hello"
counter = WordsCounted::Counter.new("Hello, Renée!", reject: "Hello")
counter.word_count #=> 1
counter.words #=> ["Renée"]
# Count numbers only
counter = WordsCounted::Counter.new("Hello, Renée! 123", rexexp: /[0-9]/)
counter.word_count #=> 1
counter.words #=> ["123"]
The gem provides a bunch more useful methods.
If the 'word' in this case can be described as an alphanumeric sequence which can include '-' then the following solution may be appropriate (assuming that everything that doesn't match the 'word' pattern is a separator):
>> 'one-way street'.split(/[^-a-zA-Z]/).size
=> 2
>> 'one-way street'.split(/[^-a-zA-Z]/).each { |m| puts m }
one-way
street
=> ["one-way", "street"]
However, there are some other symbols that can be included in the regex - for example, ' to support the words like "it's".
This is pretty simplistic but does the job if you are typing words with spaces in between. It ends up counting numbers as well but I'm sure you could edit the code to not count numbers.
puts "enter a sentence to find its word length: "
word = gets
word = word.chomp
splits = word.split(" ")
target = splits.length.to_s
puts "your sentence is " + target + " words long"
The best way to do is to use split method.
split divides a string into sub-strings based on a delimiter, returning an array of the sub-strings.
split takes two parameters, namely; pattern and limit.
pattern is the delimiter over which the string is to be split into an array.
limit specifies the number of elements in the resulting array.
For more details, refer to Ruby Documentation: Ruby String documentation
str = "This is a string"
str.split(' ').size
#output: 4
The above code splits the string wherever it finds a space and hence it give the number of words in the string which is indirectly the size of the array.
The above solution is wrong, consider the following:
"one-way street"
You will get
["one-way","", "street"]
Use
'one-way street'.gsub(/[^-a-zA-Z]/, ' ').split.size
This splits words only on ASCII whitespace chars:
p " some word\nother\tword|word".strip.split(/\s+/).size #=> 4

Resources