Regular expression where a pattern is repeated (Ruby on Rails)

I have the following regular expression:
/^[a-zA-Z]+\s{0,1}$/
I use this regular expression to validate a string like "hello ".
But what if the same format is repeated again and again? For example:
"hello How are you "
I don't want to write it out like this:
/^[a-zA-Z]+\s{0,1}[a-zA-Z]+\s{0,1}[a-zA-Z]+\s{0,1}[a-zA-Z]+\s{0,1}$/
It's too long.
Help me!

# Single quotes keep \s as a regex escape; in a double-quoted Ruby string "\s" is a literal space.
pattern = '[a-zA-Z]+\s{0,1}'
expression = /^#{pattern}#{pattern}#{pattern}#{pattern}$/
However, a better approach is to write the regular expression so that the pattern itself is allowed to repeat, using a group and a quantifier.
For instance
/^([a-zA-Z]+\s{0,1}){4}$/
Moreover, you can probably reduce the complexity of the expression by using a POSIX character class and the ? quantifier.
/^[a-zA-Z]+\s{0,1}$/
is roughly equivalent to (in Ruby, [[:alpha:]] also matches non-ASCII letters)
/^[[:alpha:]]+\s?$/
therefore
/^([[:alpha:]]+\s?){4}$/
To match N or more repetitions (from N to unlimited)
/^([[:alpha:]]+\s?){N,}$/
or use + to match one or more.
/^([[:alpha:]]+\s?)+$/
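One caveat is worth illustrating: because the space is optional, {4} counts repetitions of the group rather than words, so a single four-letter word also passes. The following is a minimal Ruby sketch. The constant names LOOSE and STRICT are illustrative, and STRICT is a variant that does not appear in the answer above; it assumes you want a whitespace character between words.

```ruby
# LOOSE is the {4} form with the [[:alpha:]] class; STRICT is a hypothetical
# variant that demands one whitespace character between consecutive words.
LOOSE  = /^([[:alpha:]]+\s?){4}$/
STRICT = /^[[:alpha:]]+(?:\s[[:alpha:]]+){3}\s?$/

puts LOOSE.match?("hello How are you ")   # true
puts LOOSE.match?("abcd")                 # true: a, b, c, d each count as one repetition
puts LOOSE.match?("abc")                  # false: only three letters for four repetitions
puts STRICT.match?("hello How are you ")  # true
puts STRICT.match?("abcd")                # false: no separators, so it is one word
```

If the goal is only "one or more words", the + form has no such surprise, since one repetition is already enough.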

If what you're after is simply a bunch of letters separated by 0 or 1 space, your pattern can be drastically simplified:
/([a-z]+\s?)+/i
So, working from the inside out:
[a-z] matches characters in the range a-z
+ is a quantifier matching "1 or more" times, so [a-z]+ matches "1 or more letters"
\s? - ? is a quantifier meaning "0 or 1", the same as {0,1}, so "0 or 1 space"
([a-z]+\s?) groups that sub-expression and...
+ is a quantifier matching "1 or more" times.
/i makes the entire thing case-insensitive, so no need for [A-Za-z]. Just [a-z].
Of course, you'll want to anchor the entire thing:
/^([a-z]+\s?)+$/i
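A short Ruby check of the anchored pattern may help here. One thing to keep in mind is that in Ruby ^ and $ anchor to lines, not to the whole string, so for validating an entire value \A and \z are usually the safer choice. The constant names below are illustrative, and the \A...\z variant is an addition, not part of the answer above.

```ruby
LINE_ANCHORED   = /^([a-z]+\s?)+$/i    # pattern from the answer
STRING_ANCHORED = /\A([a-z]+\s?)+\z/i  # assumed stricter variant

puts LINE_ANCHORED.match?("hello How are you ")    # true
puts LINE_ANCHORED.match?("hello 42")              # false
puts LINE_ANCHORED.match?("hello\n<script>")       # true: the first line alone matches
puts STRING_ANCHORED.match?("hello How are you ")  # true
puts STRING_ANCHORED.match?("hello\n<script>")     # false: the whole string must match
```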

@SimoneCarletti recommended using /^([[:alpha:]]+\s?)+$/, which uses a capturing group ([[:alpha:]]+\s?). On a long string this isn't as efficient as a non-capturing group:
(?:[[:alpha:]]+\s?)
The difference happens deep down: a capturing group has to remember where each match was found, which consumes space and time. A non-capturing group only groups the sub-expression, which is faster.
require 'fruity'
text = 'Lorem ipsum dolor sit amet consectetur adipisicing elit Amet platonem fastidii fieri historiae populo mutans fortasse misisti quoddam recta contentus odia bona confidere magis negant caecilii theophrastus necessariam lucilius acuti nobis viris puerilis deorsum aliquid Atilii industriae sitne ipsi improborum levis mel affectus scientiam disciplinam disciplinam repellat Odioque suam graeca intereant potiora Iracundiae docui triarium triari neque assentiar maiorem ornateque futuros fruentem orestem forensibus teneam sciscat postremo animus fortibus videntur e video probant eas delectet molestia docere dictum Unde existimo tota labefactant Forensibus deterret autem putat remissius tollatur credo allicit duo accuratius magnus finxerat effecerit facillime Pertineant concederetur placet habendus'
compare do
  regex1 { text[/^([[:alpha:]]+\s?)+$/] }
  regex2 { text[/^(?:[[:alpha:]]+\s?)+$/] }
end
# >> Running each test 128 times. Test will take about 1 second.
# >> regex2 is faster than regex1 by 19.999999999999996% ± 10.0%
Also, note that the "POSIX bracket expression" for the "alpha" character class is written [[:alpha:]], with a colon on each side of the name.

If you'd like to repeat the pattern you're matching, you can wrap it in parentheses (which groups it together), and then use a repetition meta-character to set how many repeats you'd like.
In this case, if you want to match a particular pattern one or more times, you can use the following:
/^([a-zA-Z]+\s{0,1})+$/
Here, we're using the + repetition meta-character, which means "this must match one or more times."
As an aside, the {0,1} you're using to match the whitespace 0 or 1 times can be replaced by a ?, which also means "match this 0 or 1 times."
So, this could turn into:
/^([a-zA-Z]+\s?)+$/
You can also do a case-insensitive match by adding the ignore case option (i) at the end of your regex, like so:
/^([a-z]+\s?)+$/i
Hope this helps.

Related

Generate x characters worth of Lorem Ipsum text (sentences and paragraphs)?

I have a User field called 'bio' that can be up to 800 characters (as well as some other free text fields of varying lengths). Having it populate with dummy text would help assess the visuals/design of the front end.
How can I generate 800 characters worth of Lorem Ipsum text to place into that field? By 'Lorem Ipsum text' I mean sentences and paragraphs (not just 800 characters worth of sentences in one giant paragraph).
"a"*800 is not varied enough to resemble human paragraphs.
Note: this is for the seeds.rb file, and I am already using faker gem, in case that's useful.
I'm sure there are better ways, but this uses the Faker gem and looks quite natural:
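require 'faker'

# Builds exactly n characters of Lorem Ipsum text, split into paragraphs.
def make_natural_text(n)
  paras = ""
  until paras.length > n
    # One visible paragraph is 2 to 7 Faker paragraphs joined with spaces.
    para = Faker::Lorem.paragraphs(number: rand(2..7)).join(" ") + "\n\n"
    paras += para
  end
  paras[0, n] # trim to the requested length
end

natural_text = make_natural_text(800)
puts natural_text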

Write a Lex rule to parse Integer and Float

I am writing a parser for a scripting language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have problems recognizing integers and floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an integer ("123")?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " (again, which has no effect because a set contains each member at most once), |, and the range "-", whose two endpoints are both " and which therefore adds only ", already in the set. In short, it is equivalent to ["+|]. It matches exactly one of those three characters, and it requires one of them.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], written [[:digit:]] when it stands alone, which would have been unambiguous, and you would not have needed to define it.) The {DIGIT} is followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [1-9]
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 . | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to add it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. eg. "foo"? matches either the three characters f, o, o (in order) or matches the empty string. You can use that to make the sign optional.
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
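The two matching criteria (longest match first, then earliest rule on a tie) can be imitated outside (f)lex. The following Ruby sketch is only an illustration of that behaviour, not a lex specification: the RULES table, the tokenize helper, and the particular number patterns (optional sign, no leading zeros) are all assumptions made for the example.

```ruby
require 'strscan'

# Rules in priority order: on a tie in match length, the earlier rule wins.
RULES = [
  [:FLOAT,   /[+-]?(?:0|[1-9]\d*)\.\d+/],
  [:INTEGER, /[+-]?(?:0|[1-9]\d*)/],
  [:STRING,  /[a-zA-Z0-9_]+/]
].freeze

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/) # ignore whitespace between tokens

    best = nil
    RULES.each do |name, pattern|
      len = scanner.match?(pattern) # match length at the current position, or nil
      # Strictly longer wins, so an earlier rule keeps a tie.
      best = [name, len] if len && (best.nil? || len > best[1])
    end
    raise ArgumentError, "unexpected character at offset #{scanner.pos}" unless best

    tokens << [best[0], scanner.peek(best[1])]
    scanner.pos += best[1]
  end
  tokens
end

p tokenize("abc123 123 -7 0 3.14")
```

Here "abc123" stays a single STRING because only the string rule can start at a, while "123" is matched equally well by INTEGER and STRING and goes to INTEGER because that rule is listed first.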
I hope that gives you some ideas about how to fix things.

How do I parse this with peg grammar?

I'm trying to make a parser using pegjs. I need to parse something like:
blah blah START Lorem ipsum
dolor sit amet, consectetur
adipiscing elit END foo bar
etc.
I have trouble writing the rule to catch the text from "START" to "END".
Use negative lookahead predicates:
phrase
  = (!"START" .)* "START" result:(!"END" .)* "END" .* {
      // remove empty element added by predicate matching
      for (var i = 0; i < result.length; ++i) {
        result[i] = result[i][1];
      }
      return result.join("");
    }
You need to use a negative predicate for END as well as START because repetition in pegjs is greedy.
Alternatively, the action could be written as
{return result.join("").split(',').join("");}
However, this relies on how join treats nested arrays (it joins each sub-array with commas and then concatenates the results), and because it strips every comma it also removes literal commas from the captured text.
[UPDATE] A shorter way to deal with the empty elements is
phrase
  = (!"START" .)* "START" result:(t:(!"END" .) { return t[1]; })* "END" .* {
      return result.join("");
    }
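For comparison only, and outside pegjs: the same "everything between START and END" extraction can be sketched in Ruby with a lazy quantifier, which plays the role of the (!"END" .)* loop by stopping at the first END. The helper name between_markers is hypothetical.

```ruby
# Returns the text between the first START and the first END after it,
# or nil when the markers are not found.
def between_markers(text)
  # .*? is lazy, so it stops at the first END; /m lets . match newlines.
  text[/START(.*?)END/m, 1]
end

sample = <<~TEXT
  blah blah START Lorem ipsum
  dolor sit amet, consectetur
  adipiscing elit END foo bar
TEXT

puts between_markers(sample)
```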

Strategies for finding dates or date/times in a text document?

Problem: Given an unstructured text document find any date or date/time substrings.
My current thoughts are to search for known formats with a bunch of regexes, which feels grossly kludgy, expensive and prone to errors :-)
This is the sort of doc I'm talking about:
Bacon ipsum dolor sit amet sirloin reprehenderit spare ribs aute. Ullamco consequat shank swine chuck, laboris do pastrami January 10th 1980 est venison shankle short 1-20-1980 loin bresaola corned beef. Beef ribs 28/2/2001 tri-tip est cupidatat shank, excepteur qui non pastrami.
I suspect I'm not the first person to address this problem, and I'm hoping that the resultant code is buried in some open source project I don't know about…
Thoughts?
This is a bit of an ad-hoc heuristic - but maybe tokenize first?
You could recognize the following tokens:
"junk" (the default, anything not like a date part)
dddd (4 digits - usually a year)
dd (2 digits - day month or year)
d (1 digit - day or month)
dd_st
dd_th (and variations on number of digits)
dd_rd
dd_nd
monthname
etc etc
Each token can have several interpretations (eg d is month or day) and a date is any sequence of 3 tokens where you can select one of each from year, month, day (in any order you wish to allow).
The idea here is to accept many more syntaxes than you would get with regex, if that was your intention ...
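The tokenizing heuristic above could be sketched roughly as follows in Ruby. Everything here is an assumption made for illustration: the token regex, the role names (:year, :month, :day), the helper names, and the decision to look only at windows of three adjacent tokens. It does no range checking, so something like 99/99/1980 would still be reported.

```ruby
MONTHS = %w[january february march april may june july august
            september october november december].freeze
TOKEN = /\d+(?:st|nd|rd|th)?|[A-Za-z]+/i

# Maps a token to the date parts it could stand for ([] means "junk").
def classify(token)
  case token.downcase
  when /\A\d{4}\z/ then %i[year]
  when /\A\d{2}\z/ then %i[day month year]
  when /\A\d\z/ then %i[day month]
  when /\A\d{1,2}(?:st|nd|rd|th)\z/ then %i[day]
  when *MONTHS then %i[month]
  else []
  end
end

# True when the three tokens can be assigned year, month and day in some order.
def date_like?(kinds)
  %i[year month day].permutation.any? do |roles|
    roles.zip(kinds).all? { |role, allowed| allowed.include?(role) }
  end
end

def find_dates(text)
  matches = []
  pos = 0
  while (found = TOKEN.match(text, pos))
    matches << found
    pos = found.end(0)
  end
  matches.each_cons(3)
         .select { |window| date_like?(window.map { |tok| classify(tok[0]) }) }
         .map { |window| text[window.first.begin(0)...window.last.end(0)] }
end

p find_dates("do pastrami January 10th 1980 est venison short 1-20-1980 loin. Beef ribs 28/2/2001 tri-tip")
```

Separators such as - and / are simply skipped by the token regex, which is what lets one rule cover "January 10th 1980", "1-20-1980" and "28/2/2001".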

Looking for ideas on how to match a pattern, Possible or not?

I'm looking for assistance creating a pattern match to ingest emails. The end goal is to receive an incoming message and extract just the reply message, not all the trailing junk (previous threads, signature, datestamp header, etc.).
Here are the sample formats:
Format 1:
The Message is here, etc etc can span a random # of lines
On Nov 17, 2010, at 4:18 PM, Person Name wrote:
lots of junk down here which we don't want
Format 2:
The Message is here, etc etc can span a random # of lines
On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad#sitename.com> wrote:
lots of junk down here which we don't want
Format 3:
The Message is here, etc etc can span a random # of lines
On Fri, Nov 19, 2010 at 1:57 AM, <customerserviceonline#pge.com> wrote:
lots of junk down here which we don't want
For all the examples above, I'd like to create a pattern match that finds the first instance of the delimiter line (the line that starts with "On" and ends with "wrote:"), and then returns only what's above that line. I don't want the delimiter line itself.
I can't match on the date stamp, but I can match on everything after the comma as that's in my control.
So the idea is to look for either of these two static items:
, Site <yadaaaa+adad#sitename.com> wrote:
, Person Name wrote:
And then take everything above that position. What do you think. Is this possible?
I would suggest a different approach: why don't you read the message line by line and stop when you reach the line you use as the delimiter?
Well this would be a regexp solution :
/(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+#[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/
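You provided only a few examples, so this might not be perfect, but it should do the job quite well.
Then, you have to get the first captured group with $1, or with [0] if you are using match (the group wraps the whole pattern, so the full match [0] and the first group [1] are the same here) :)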
regex = /(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+#[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/
if str =~ regex
  puts "S1 : #{$1}"
end
if res = str.match(regex)
  puts "S2 : #{res[0]}"
end
Btw, you can use the option /i on the regex.
This is not a good use for regex if you're trying to do it all in one pattern. It's possible to do, but I suspect the universe will cool before you work all the bugs out.
To understand the scope of what you are trying to do, read Wikipedia's article on "Posting Style". There are a lot of different ways replies are embedded into an email message, partly controlled by the MUA (mail user agent) and partly by the person doing the reply. There isn't a set method of doing the attribution, and no rule saying that the reply is in one block on the page, or that it is at the top of the page. This means that any code you write will have to be very sophisticated in order to have a chance of working consistently.
Have you looked at Mail? It's already written, it's well tested, it's got all sorts of cool bells and whistles, and it's already written. (I said it again because reinventing wheels that work well can be really painful.)
Parsing plain text email is one task. Then there is MIME-encoded email, with different content types. Then there is "HTML" email that doesn't have MIME blocks, but instead some moron just figured everyone liked HTML formatting and blinking text. Then there's various weirdly broken types of message bodies with four reply quoting types and the full content of all the previous messages appended one below the next, and the signatures of the horribly frustrated wanna-be writers who include the whole text of my favorite book "Girl to Grab", AKA Vol. 5 of Encyclopedia Britannica. Mail can help break out all the garbage for you, giving you a good shot at the content you need.
To grab a range of text in a body, look at Ruby's .. (AKA "flip-flop") operator. Used in a condition, it evaluates to true from the point where the first test succeeds until the second test succeeds, and to false otherwise. See "When would a Ruby flip-flop be useful?"
Typically you'd build it like:
if ((string =~ /pattern1/) .. (string =~ /pattern2/))
  ...
end
As processing occurs, if the first test matches something then subsequent loops will fall into the if block. When the ending test is found the block will be turned off for subsequent loops. In this case you'd want to use either a string literal, or a small regex to locate your starting and ending lines. If you have a chance of seeing the starting pattern in later text then you'll have to figure out how to trap that.
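To make the flip-flop behaviour concrete, here is a small self-contained sketch. The BEGIN and END marker lines and the lines_between helper are hypothetical; real messages would need their own start and end patterns.

```ruby
# Collects every line from the first BEGIN marker through the next END marker.
def lines_between(lines)
  selected = []
  lines.each do |line|
    # The flip-flop turns on at BEGIN and turns off after END has been seen.
    selected << line if (line =~ /\ABEGIN\z/)..(line =~ /\AEND\z/)
  end
  selected
end

p lines_between(%w[intro BEGIN keep END outro])
```

Both marker lines are included in the result, so drop the first and last element if only the content between them is wanted.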
For instance, here's a way to grab some content that appears to meet your stated requirements if someone does a top-reply:
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Person Name wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
  li.chomp!
  body << li
  break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
puts '=' * 40
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad#sitename.com> wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
  li.chomp!
  body << li
  break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
And here is the output:
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
# >> ========================================
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
The pattern could be simpler, but if it was it would increase the chance of returning false-positives.
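The loop shown earlier in this answer could also be folded into one helper. This is a sketch under two assumptions that are not in the answer itself: the comma before "at" is made optional (,?) so that the question's third format, which has no comma there, is also recognised, and take_while is used so that a message with no attribution line keeps all of its lines. The names ATTRIBUTION and reply_body, and the example address, are illustrative.

```ruby
# Same pattern as in the loop, except for the optional comma before "at".
ATTRIBUTION = /^On (\S+ )*\w+ \d+, \d+,? at [\d:]+ \w+, .+ wrote:/i

# Returns the lines that come before the first attribution line.
def reply_body(msg)
  msg.lines.map(&:chomp).take_while { |line| line !~ ATTRIBUTION }
end

msg = "The Message is here\nOn Fri, Nov 19, 2010 at 1:57 AM, <someone@example.com> wrote:\njunk\n"
puts reply_body(msg)
```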
