split large regular expression in different lines

I have this regular expression:
INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|polo|earrings?|plush|pacifier|tie$|panties|boxers?|slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|battstation|tea|pocket ref|pajamas?|boyshorts?|mimopowertube|coat|bathrobe)\b/i
It works as written, but I would like to split it across several lines, like this:
INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|
cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|
polo|earrings?|plush|pacifier|tie$|panties|boxers?|
slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|
battstation|tea|pocket ref|pajamas?|boyshorts?|
mimopowertube|coat|bathrobe)\b/i
With the second version, however, the alternatives cufflink, polo, slippers?, battstation and mimopowertube no longer match, because of the whitespace that precedes each of them at the start of a continuation line. For example:
(this space before the word)cufflink
I would be very grateful for any help.

You could build the pattern from an array of alternatives, something like this:
INVALID_NAMES = [
"bib$",
"costumes$",
"httpanties?",
"necklace"
]
INVALID_NAMES_REGEX = /\b(#{INVALID_NAMES.join '|'})\b/i
p INVALID_NAMES_REGEX

Construct Your Regex with the Space-Insensitive Flag
You can use the extended flag (/x), which makes the pattern ignore literal whitespace and allows comments inside your regular expression. Once the flag is enabled, any whitespace you actually want to match must be written explicitly, for example as \s (which matches any whitespace character) or as an escaped space, since unescaped spaces in the pattern are ignored.
Consider the following example:
INVALID_NAMES =
/\b(bib$ |
costumes$ |
httpanties? |
necklace |
cuff\slink |
cufflink |
scarf |
pendant |
apron |
buckle |
beanie |
hat |
ring |
blanket |
polo |
earrings? |
plush |
pacifier |
tie$ |
panties |
boxers? |
slippers? |
pants? |
leggings |
ibattz |
dress |
bodysuits? |
charm |
battstation |
tea |
pocket\sref |
pajamas? |
boyshorts? |
mimopowertube |
coat |
bathrobe
)\b/ix
Note that you can format it in many other ways, but having one expression per line makes it easier to sort and edit your sub-expressions. If you want it to have multiple alternatives per line, you could certainly do that.
Making Sure It Works
You can see that the expression above works as intended with the following examples:
'cufflink'.match INVALID_NAMES
#=> #<MatchData "cufflink" 1:"cufflink">
'cuff link'.match INVALID_NAMES
#=> #<MatchData "cuff link" 1:"cuff link">
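As a further illustration of the point about explicit whitespace, here is a minimal sketch using a shortened, hypothetical pattern (SHORT_NAMES and UNESCAPED are illustrative names, not from the question). It assumes Ruby 2.4 or later for match?, and shows that an unescaped space is dropped under /x while an escaped space is kept:

```ruby
# Hypothetical, shortened pattern for illustration only.
SHORT_NAMES = /
  \b(
      cuff\ link   # escaped space: matches one literal space
    | cufflink
    | scarf
  )\b
/ix

# Under the x flag an unescaped space is ignored, so this behaves like cufflink.
UNESCAPED = /cuff link/x

puts SHORT_NAMES.match?('a CUFF LINK set')  # true: the escaped space is matched
puts UNESCAPED.match?('cuff link')          # false: the space in the pattern was dropped
puts UNESCAPED.match?('cufflink')           # true
```

An escaped space matches only a space, whereas \s would also accept tabs and newlines, so the choice depends on how strict the match should be.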

When you add a newline in the middle of a regex literal, the newline becomes a part of the regular expression. Look at this example:
"ab" =~ /ab/ # => 0
"ab" =~ /a
b/ # => nil
"a\nb" =~ /a
b/ # => 0
You can suppress the newline by appending a backslash at the end of the line:
"ab" =~ /a\
b/ # => 0
Applied to your regex (leading spaces also removed):
INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|\
cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|\
polo|earrings?|plush|pacifier|tie$|panties|boxers?|\
slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|\
battstation|tea|pocket ref|pajamas?|boyshorts?|\
mimopowertube|coat|bathrobe)\b/i
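A quick way to convince yourself that the continuation works is to check that the resulting pattern contains no newline. This is a minimal sketch with a shortened, hypothetical pattern (CONTINUED is an illustrative name); note that the continuation line must start in column one, since any indentation would become part of the pattern:

```ruby
# Shortened, hypothetical pattern split with a trailing backslash.
# The second line is deliberately not indented.
CONTINUED = /\b(bib$|cuff link|\
cufflink|scarf)\b/i

puts CONTINUED.match?('silver cufflink')  # the alternative after the line break still matches
puts CONTINUED.source.include?("\n")      # false when the newline has been suppressed
```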

A long alternation like this can be inefficient, because the regexp engine may have to try many of the alternatives at each position in the string.
I'd recommend you investigate what Perl's Regexp::Assemble can do to help your Ruby code. See these related questions:
"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"

You might do it like this:
INVALID_NAMES = ['necklace', 'cuff link', 'cufflink', 'scarf', 'tie?', 'bib$']
r = Regexp.union(INVALID_NAMES.map { |n| /\b#{n}\b/i })
str = 'cat \n cufflink bib cuff link. tie Scarf\n cow necklace? \n ti. bib'
str.scan(r)
#=> ["cufflink", "cuff link", "tie", "Scarf", "necklace", "ti", "bib"]

Related

Check if cell contains all comma separated values

I have a sheet with the following data:
| Text | Value | Value | Value |
|:----------------------------------------|:---------|:---------|:---------|
|Jax and Jax friend Kung Lao fight Raiden | jax | kung lao | raiden |
|Jax and Jax friend Kung Lao fight Raiden | kitana | kung lao | raiden |
And the following formulas:
=SUMPRODUCT( -- ISNUMBER(SEARCH(B1:D1;A1)))=COUNTA(B1:D1)
=SUMPRODUCT( -- ISNUMBER(SEARCH(B2:D2;A2)))=COUNTA(B2:D2)
Which returns:
TRUE
FALSE
This works as expected: I get TRUE if all values are found in the text cell, and FALSE if one or more are missing.
Now I want to modify it so that, instead of searching for values held in multiple cells, it searches for the values held in a single comma-separated cell. Like this:
| Text | Values | Formula |
|:----------------------------------------|:-----------------------|:--------|
|Jax and Jax friend Kung Lao fight Raiden | jax,kung lao,raiden | TRUE |
|Jax and Jax friend Kung Lao fight Raiden | kitana,kung lao,raiden | FALSE |
I've tried with =SUMPRODUCT( -- ISNUMBER(SEARCH({B2};A2)))=COUNTA({B2}) but it doesn't work.

Try splitting the comma delimited list and using the split array like a range of cells.
=SUMPRODUCT(--isnumber(search(split(B2, ",", true, true), A2)))=counta(split(B2, ",", true, true))
'full word search
=SUMPRODUCT(--isnumber(search(text(split(B2, ",", true, true), " # "), text(A2, " # "))))=counta(split(B2, ",", true, true))
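The logic of the first formula can be sketched outside the spreadsheet as well. The following Ruby snippet is only an illustration of the idea (split the list, then require every item to be found); contains_all? is a hypothetical helper, and the case-insensitive comparison is meant to mirror how SEARCH behaves:

```ruby
# Hypothetical helper mirroring the first formula: split the list on commas,
# drop empty items, then require every item to occur in the text.
def contains_all?(text, csv)
  haystack = text.downcase
  items = csv.split(',').reject(&:empty?)
  items.all? { |item| haystack.include?(item.downcase) }
end

text = 'Jax and Jax friend Kung Lao fight Raiden'
puts contains_all?(text, 'jax,kung lao,raiden')     # true
puts contains_all?(text, 'kitana,kung lao,raiden')  # false
```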

Please see also:
=AND(ARRAYFORMULA(REGEXMATCH(A1,"(?i)"&SPLIT(B1,",")&"( |$)")))
(?i) - case insensitive
( |$) - space or end of text, to match whole words only.
ArrayFormula variation for cell C1:
=ArrayFormula(TRANSPOSE(NOT(REGEXMATCH(QUERY(TRANSPOSE(filter(REGEXMATCH(A:A,"(?i)"&SPLIT(B:B,",")&"( |$)"),A:A<>"")),,10^99),"FALSE"))))

Why does the Grako parsing process fail if my grammar contains an expression that consists of many or-concatenated subexpressions?

I am using Grako. In my EBNF grammar, I have an expression that consists of a lot of subexpressions that are concatenated using the OR-operator, like so:
expression = subexpressionA | subexpressionB | ... | subexpressionZ;
The parsing process always fails if the input string contains one of the latter subexpressions, say subexpressionZ. When I rewrite the grammar like this
expression = subexpressionZ | subexpressionB | ... | subexpressionA;
the parsing process finishes successfully if the input string contains subexpressionZ but will now fail if it contains subexpressionA.
Has anyone ever had a similar problem? Is that a bug in Grako (I am using 3.6.3.) or am I doing something wrong?
Thanks a lot for any ideas!

I solved my problem - a long time ago :) - by splitting the expression up into a number of sub-expressions, like so:
expression1 = subexpressionA | subexpressionB | subexpressionC;
expression2 = subexpressionD | subexpressionE | ... | subexpressionZ;
expression = expression1 | expression2;
For some reason, this works...

PEG parsing match at least one preserving order

Given the PEG rule:
rule = element1:'abc' element2:'def' element3:'ghi' ;
How do I rewrite this such that it matches at least one of the elements but possibly all while enforcing their order?
I.e. I would like to match all of the following lines:
abc def ghi
abc def
abc ghi
def ghi
abc
def
ghi
but not an empty string or misordered expressions, e.g. def abc.
Of course with three elements, I could spell out the combinations in separate rules, but as the number of elements increases, this becomes error prone.
Is there a way to specify this in a concise manner?

You can use optionals:
rule = [element1:'abc'] [element2:'def'] [element3:'ghi'] ;
You would use a semantic action for rule to check that at least one token was matched:
def rule(self, ast):
    if not (ast.element1 or ast.element2 or ast.element3):
        raise FailedSemantics('Expecting at least one token')
    return ast
Another option is to use several choices:
rule
    =
      element1:'abc' [element2:'def'] [element3:'ghi']
    | [element1:'abc'] element2:'def' [element3:'ghi']
    | [element1:'abc'] [element2:'def'] element3:'ghi'
    ;
Caching will make the latter as efficient as the former.
Then, you can add cut elements for additional efficiency and more meaningful error messages:
rule
    =
      element1:'abc' ~ [element2:'def' ~] [element3:'ghi' ~]
    | [element1:'abc' ~] element2:'def' ~ [element3:'ghi' ~]
    | [element1:'abc' ~] [element2:'def' ~] element3:'ghi' ~
    ;
or:
rule = [element1:'abc' ~] [element2:'def' ~] [element3:'ghi' ~] ;

The answer is: one precondition on the disjunct, and then a sequence of optionals.
rule = &(e1 / e2 / e3) e1? e2? e3?
This is standard PEG, with & meaning 'must be present but not consumed' and ? meaning 'optional'. Most PEG parsers have these features if not with these symbols.
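The same shape (one lookahead, then a run of optionals) can be tried out with an ordinary regular expression. The sketch below is a Ruby analogue rather than a PEG grammar; ELEMENTS, LOOKAHEAD, OPTIONALS and ORDERED_SUBSET are hypothetical names, and it assumes the elements are plain literals optionally followed by whitespace:

```ruby
# Hypothetical element list; the order of the array is the required order.
ELEMENTS = %w[abc def ghi].freeze

# The lookahead plays the role of &(e1 / e2 / e3): one element must start here.
LOOKAHEAD = "(?=#{Regexp.union(ELEMENTS).source})"
# Each element is optional and may be followed by whitespace, like e1? e2? e3?.
OPTIONALS = ELEMENTS.map { |e| "(?:#{Regexp.escape(e)}\\s*)?" }.join

ORDERED_SUBSET = /\A#{LOOKAHEAD}#{OPTIONALS}\z/

puts ORDERED_SUBSET.match?('abc ghi')  # true
puts ORDERED_SUBSET.match?('def abc')  # false: wrong order
puts ORDERED_SUBSET.match?('')         # false: nothing matched
```

Adding another element only means extending the array, which is the concise form the question asks for. Note that this sketch also accepts elements written without separating whitespace; a real grammar would normally handle whitespace itself.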

How to pass spaces in table (specflow scenario)?

How can I pass leading or trailing spaces in a table cell?
Background:
Given the following books
|Author |(here several spaces)Title(here several spaces)|

I would do this:
Given the following books
| Author | Title |
| "J. K. Rowling" | "Harry P " |
| " Isaac Asimov " | "Robots and Empire" |
Your bindings can then be made to strip the quotes, if present, while retaining the spaces inside them.
I think this is much preferable to the idea of adding spaces afterward, because that isn't very human readable - quotations will make the spaces visible to the human (stakeholder / coder) reading them.
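The quote-stripping step might look roughly like the sketch below. It is written in Ruby only to keep the examples on this page in one language; a real SpecFlow binding would be C#, and unquote_cell is a hypothetical helper, not part of any SpecFlow API:

```ruby
# Hypothetical helper: remove one pair of surrounding double quotes from a
# table cell, keeping any spaces that were inside the quotes.
def unquote_cell(cell)
  trimmed = cell.strip  # drop the padding around the cell, outside the quotes
  if trimmed.length >= 2 && trimmed.start_with?('"') && trimmed.end_with?('"')
    trimmed[1..-2]
  else
    trimmed
  end
end

p unquote_cell(' "Harry P   " ')       # => "Harry P   "
p unquote_cell('"   Isaac Asimov  "')  # => "   Isaac Asimov  "
p unquote_cell(' Title ')              # => "Title"
```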

You can work around it by adding an extra step. Something like:
Given the following books
|Author | Title |
And append <5> spaces to book title
Edit:
A complete feature can look something like:
Scenario: Adding books with spaces in the title
Given the following book
| price | title |
And <5> spaces appended to a title
When book is saved
Then the title should be equals to <title without spaces>

I just faced the same situation. My solution was to add the spaces in the step itself, as follows:
Scenario: Adding books with spaces in the title
Given the following book ' <title> '
When book is saved
Then the title should be equals to '<title>'
| price | title |
| 50.00 | Working hard |

How to recognize what a string "represents"?

I am using Rails 3.1.1 and I would like to recognize (maybe using a regex) if a string "contains"/"is"/"represents" one of the following:
an email address
a Web site URL
a number
I am trying to implement a method that, given a string, returns:
email if the string is something like my_nick@email_provider.org
website if the string is something like www.web_address.org
number if the string is something like 123
null otherwise
Is it possible? How can I do that?

Here's some code for you:
def whatami(input)
  return :email if input =~ EmailRegex
  return :website if input =~ WebsiteRegex
  return :number if input =~ NumberRegex
  nil
end
You can look up individual regexes for each case above -- or perhaps you can find non-regex solutions for some cases.
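To make the idea concrete, here is a minimal sketch with deliberately simple patterns filled in. EMAIL_REGEX, NUMBER_REGEX, WEBSITE_REGEX and classify are hypothetical names, and the patterns are assumptions chosen for illustration; they will reject some valid inputs and accept some invalid ones:

```ruby
# Deliberately simple, hypothetical patterns; they are assumptions for this
# sketch and do not cover every valid address, URL, or number format.
EMAIL_REGEX   = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/
NUMBER_REGEX  = /\A[-+]?\d+(?:\.\d+)?\z/
WEBSITE_REGEX = %r{\A(?:https?://)?[\w-]+(?:\.[\w-]+)+(?:/\S*)?\z}i

def classify(input)
  return :email   if input =~ EMAIL_REGEX
  # Numbers are tested before websites because "1.5" would also fit the URL pattern.
  return :number  if input =~ NUMBER_REGEX
  return :website if input =~ WEBSITE_REGEX
  nil
end

p classify('my_nick@email_provider.org')  # => :email
p classify('www.web_address.org')         # => :website
p classify('123')                         # => :number
p classify('hello world')                 # => nil
```

The order of the checks matters whenever the patterns overlap, which is why the most specific test comes first.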

It's definitely possible. But wait: here's a Perl regex for email only. Are you sure you want to continue down this path? :-)
/(?(DEFINE)
(?<address> (?&mailbox) | (?&group))
(?<mailbox> (?&name_addr) | (?&addr_spec))
(?<name_addr> (?&display_name)? (?&angle_addr))
(?<angle_addr> (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
(?<group> (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
(?&CFWS)?)
(?<display_name> (?&phrase))
(?<mailbox_list> (?&mailbox) (?: , (?&mailbox))*)
(?<addr_spec> (?&local_part) \@ (?&domain))
(?<local_part> (?&dot_atom) | (?&quoted_string))
(?<domain> (?&dot_atom) | (?&domain_literal))
(?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
\] (?&CFWS)?)
(?<dcontent> (?&dtext) | (?&quoted_pair))
(?<dtext> (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
(?<atext> (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
(?<atom> (?&CFWS)? (?&atext)+ (?&CFWS)?)
(?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
(?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)
(?<text> [\x01-\x09\x0b\x0c\x0e-\x7f])
(?<quoted_pair> \\ (?&text))
(?<qtext> (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
(?<qcontent> (?&qtext) | (?&quoted_pair))
(?<quoted_string> (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
(?&FWS)? (?&DQUOTE) (?&CFWS)?)
(?<word> (?&atom) | (?&quoted_string))
(?<phrase> (?&word)+)
# Folding white space
(?<FWS> (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
(?<ctext> (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
(?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment))
(?<comment> \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
(?<CFWS> (?: (?&FWS)? (?&comment))*
(?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
# No whitespace control
(?<NO_WS_CTL> [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
(?<ALPHA> [A-Za-z])
(?<DIGIT> [0-9])
(?<CRLF> \x0d \x0a)
(?<DQUOTE> ")
(?<WSP> [\x20\x09])
)
(?&address)/x
Copied from here.
