How do I parse this with peg grammar?

I'm trying to make a parser using pegjs. I need to parse something like:
blah blah START Lorem ipsum
dolor sit amet, consectetur
adipiscing elit END foo bar
etc.
I have trouble writing the rule to catch the text from "START" to "END".

Use negative lookahead predicates:
phrase
  = (!"START" .)* "START" result:(!"END" .)* "END" .* {
      // remove empty element added by predicate matching
      for (var i = 0; i < result.length; ++i) {
        result[i] = result[i][1];
      }
      return result.join("");
    }
You need to use a negative predicate for END as well as START because repetition in pegjs is greedy.
Alternatively, the action could be written as
{return result.join("").split(',').join("");}
although this relies on a less obvious behavior of join when dealing with nested arrays (namely that it joins each sub-array with commas and then concatenates them), and it also strips any literal commas from the matched text.
[UPDATE] A shorter way to deal with the empty elements is
phrase
=(!"START" .)* "START" result:(t:(!"END" .){return t[1];})* "END" .* {
return result.join("");
}
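For comparison, the same "consume any character that does not start END" idea can be sketched outside PEG.js as a Ruby regex. This is only an illustration, not part of the grammar above: the sample text is taken from the question, and the variable names are made up for the example.

```ruby
# Sample input from the question.
text = "blah blah START Lorem ipsum\ndolor sit amet, consectetur\nadipiscing elit END foo bar"

# (?:(?!END).)* is the regex counterpart of the PEG rule (!"END" .)*:
# take one character at a time, as long as "END" does not begin here.
# The /m flag lets . match newlines as well.
match = text.match(/START((?:(?!END).)*)END/m)
between = match ? match[1] : nil

puts between.inspect
```

As in the grammar, the surrounding spaces are part of the captured text; call strip on the result if they are not wanted.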

Related

How to Search for a few words with a character that changes its position in the cell?

I'm trying to figure out how to search to replace text containing a word, e.g: "This Is My Name!" that also may contain an extra character, in my case, the character "/".
So for example, I'd like to be able to use the search and replace functionality to match this sentence:
This Is My Name! - blah blah / abc 123 ipsum
As well as this sentence:
ipsum lorem $999 - 3 / This Is My Name! $55
Or this:
ipsum lorem $999 - 3 / This Is My Name! $55 / Ipsum Lorem - (34)
I'm assuming some form of regex?
Thank you.
Solution
Based on examples you have provided:
=ArrayFormula(REGEXREPLACE(A1:A3,"^This Is My Name!|/ This Is My Name!","SOMETHINGNEW"))
Some explanation:
The regex looks for:
^This Is My Name! (the ^ before your string means the text must start with your string)
OR (represented by |)
/ This Is My Name! (your text preceded by the extra character)
ArrayFormula is added so that the formula is applied down the whole A1:A3 range.
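The alternation can also be checked outside of Sheets. Here is a rough Ruby sketch of the same pattern, assuming Ruby's regex engine treats it the same way as REGEXREPLACE does for these inputs; the rows array simply stands in for A1:A3.

```ruby
# The three sample sentences from the question, standing in for A1:A3.
rows = [
  "This Is My Name! - blah blah / abc 123 ipsum",
  "ipsum lorem $999 - 3 / This Is My Name! $55",
  "ipsum lorem $999 - 3 / This Is My Name! $55 / Ipsum Lorem - (34)"
]

# Either the phrase at the start of the text, or the phrase preceded by "/ ".
pattern = %r{^This Is My Name!|/ This Is My Name!}

replaced = rows.map { |row| row.gsub(pattern, "SOMETHINGNEW") }
puts replaced
```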
Helpful?

Ruby regex to convert uppercased words and keep titleized ones

Given the string "Lorem IPSUM dolor Sit amet", the capital letters in "Lorem" and "Sit" should be kept, while fully uppercased words like "IPSUM" should be converted to "Ipsum".
How can I produce "Lorem Ipsum dolor Sit amet" from the given string using gsub?
NOT working example: s.gsub(/[[:upper:]]/){$&.downcase}
You may use capitalize with /\b[[:upper:]]{2,}\b/ regex:
s.gsub(/\b[[:upper:]]{2,}\b/){$&.capitalize}
# => Lorem Ipsum dolor Sit amet
See the online Ruby demo.
Note that the \b[[:upper:]]{2,}\b pattern will match whole words (as \b are word boundaries) that only consist of 2 or more uppercase letters (there seems no need to match words like I that are already OK).
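To make the difference between the two patterns concrete, here is a small sketch that runs both on the string from the question. The variable names naive and fixed are made up for the example, and block parameters are used instead of $& purely for readability.

```ruby
s = "Lorem IPSUM dolor Sit amet"

# Lowercases every capital letter, including the ones that should stay.
naive = s.gsub(/[[:upper:]]/) { |ch| ch.downcase }

# Touches only whole words made of two or more uppercase letters.
fixed = s.gsub(/\b[[:upper:]]{2,}\b/) { |word| word.capitalize }

puts naive
puts fixed
```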

Regular expression where pattern is repeated Ruby On rails

I have the following regular expression
/^[a-zA-z]+\s{0,1}$/
I use this regular expression to validate a string like "hello ".
But what if the same format is repeated again and again? For example:
"hello How are you "
I don't want to write it like this:
/^[a-zA-z]+\s{0,1}[a-zA-z]\s{0,1}[a-zA-z]\s{0,1}[a-zA-z]\s{0,1}$/
It's too long
Help me!
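pattern = '[a-zA-Z]+\s{0,1}'
expression = /^#{pattern}#{pattern}#{pattern}#{pattern}$/
Note the single quotes (in a double-quoted Ruby string, \s is an escape for a plain space, so it would no longer match other whitespace) and the A-Z range (A-z also matches the characters [ \ ] ^ _ and `).
However, a better approach is to write the regular expression so that the pattern itself may occur more than once.
For instance
/^([a-zA-Z]+\s{0,1}){4}$/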
Moreover, you can reduce the complexity of the expression by using a POSIX bracket expression and the ? quantifier.
/^[a-zA-z]+\s{0,1}$/
can be written as
/^[[:alpha:]]+\s?$/
therefore
/^([[:alpha:]]+\s?){4}$/
To match an unlimited number of words (from N to unlimited)
/^([[:alpha:]]+\s?){N,}$/
or use + to match one or more.
/^([[:alpha:]]+\s?)+$/
If what you're after is simply a bunch of letters separated by 0 or 1 space, your pattern can be drastically simplified:
/([a-z]+\s?)+/i
So, working in-to-out,
[a-z] matches characters in the range a-z
+ is a quantifier matching "1 or more" times, so [a-z]+ matches "1 or more letters"
\s? - ? is a quantifier meaning "0 or 1", the same as {0,1}, so "0 or 1 space"
([a-z]+\s?) groups that sub-expression and...
+ is a quantifier matching "1 or more" times.
/i makes the entire thing case-insensitive, so no need for [A-Za-z]. Just [a-z].
Of course, you'll want to anchor the entire thing:
/^([a-z]+\s?)+$/i
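A quick way to see the anchored pattern at work is to run it against a few strings. The sketch below is only an illustration: the helper name valid_words? is made up, and it swaps ^ and $ for \A and \z on the assumption that the whole string, not just one line of it, should be validated (in Ruby, ^ and $ also match at line breaks).

```ruby
# The pattern as written above, anchored per line.
LINE_ANCHORED = /^([a-z]+\s?)+$/i

# Same idea, anchored to the whole string and without a capturing group.
STRING_ANCHORED = /\A(?:[a-z]+\s?)+\z/i

# Hypothetical helper: true when str is letters separated by at most one
# whitespace character, with an optional trailing one.
def valid_words?(str)
  STRING_ANCHORED.match?(str)
end

["hello ", "hello How are you ", "hello  world", "hello\n1234"].each do |sample|
  puts "#{sample.inspect}: #{valid_words?(sample)}"
end
```

The last sample is the reason for the swap: the line-anchored version accepts it because the first line on its own is valid.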
@SimoneCarletti recommended using /^([[:alpha:]]+\s?)+$/, which uses a capturing group, ([[:alpha:]]+\s?). On a long string this isn't as efficient as a non-capturing group:
(?:[[:alpha:]]+\s?)
The difference happens deep down, where the first has to remember where each match was found, consuming space and time. Non-capturing just remembers that they were found which is faster.
require 'fruity'
text = 'Lorem ipsum dolor sit amet consectetur adipisicing elit Amet platonem fastidii fieri historiae populo mutans fortasse misisti quoddam recta contentus odia bona confidere magis negant caecilii theophrastus necessariam lucilius acuti nobis viris puerilis deorsum aliquid Atilii industriae sitne ipsi improborum levis mel affectus scientiam disciplinam disciplinam repellat Odioque suam graeca intereant potiora Iracundiae docui triarium triari neque assentiar maiorem ornateque futuros fruentem orestem forensibus teneam sciscat postremo animus fortibus videntur e video probant eas delectet molestia docere dictum Unde existimo tota labefactant Forensibus deterret autem putat remissius tollatur credo allicit duo accuratius magnus finxerat effecerit facillime Pertineant concederetur placet habendus'
compare do
regex1 { text[/^([[:alpha:]]+\s?)+$/] }
regex2 { text[/^(?:[[:alpha:]]+\s?)+$/] }
end
# >> Running each test 128 times. Test will take about 1 second.
# >> regex2 is faster than regex1 by 19.999999999999996% ± 10.0%
Also, the "POSIX bracket expressions" for the "alpha" character-class should be [[:alpha:]].
If you'd like to repeat the pattern you're matching, you can wrap it in parentheses (which groups it together), and then use a repetition meta-character to set how many repeats you'd like.
In this case, if you're looking to match when a particular pattern is found one or more times, you can use the following:
/^([a-zA-Z]+\s{0,1})+$/
Here, we're using the + repetition meta-character, which means "this must match one or more times."
As an aside, the {0,1} you're using to match the whitespace 0 or 1 times can be replaced by a ?, which also means "match this 0 or 1 times."
So, this could turn into:
/^([a-zA-Z]+\s?)+$/
You can also do a case-insensitive match by adding the ignore case option (i) at the end of your regex, like so:
/^([a-z]+\s?)+$/i
Hope this helps.

Ruby gem for text comparison

I am looking for a gem that can compare two strings (in this case paragraphs of text) and be able to gauge the likelihood that they are similar in content (with perhaps only a few words rearranged, changed). I believe that SO uses something similar when users submit questions.
I'd probably use something like Diff::LCS:
>> require "diff/lcs"
>> seq1 = "lorem ipsum dolor sit amet consequtor".split(" ")
>> seq2 = "lorem ipsum dolor amet sit consequtor".split(" ")
>> Diff::LCS.diff(seq1, seq2).length
=> 2
It uses the longest common subsequence algorithm (the method for using LCS to get a diff is described on the wiki page).
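To show how a longest common subsequence can be turned into a similarity score, here is a hand-rolled sketch in plain Ruby. It does not use the Diff::LCS API at all: lcs_length and similarity are hypothetical helpers written for this example, and the 2 * LCS / (total words) ratio is just one reasonable choice of score.

```ruby
# Length of the longest common subsequence of two arrays,
# computed row by row with dynamic programming.
def lcs_length(a, b)
  prev = Array.new(b.length + 1, 0)
  a.each do |x|
    curr = [0]
    b.each_with_index do |y, j|
      curr << (x == y ? prev[j] + 1 : [prev[j + 1], curr[j]].max)
    end
    prev = curr
  end
  prev.last
end

# Word-level similarity between 0.0 (nothing shared) and 1.0 (same word sequence).
def similarity(text1, text2)
  w1 = text1.split
  w2 = text2.split
  return 1.0 if w1.empty? && w2.empty?
  2.0 * lcs_length(w1, w2) / (w1.length + w2.length)
end

score = similarity("lorem ipsum dolor sit amet consequtor",
                   "lorem ipsum dolor amet sit consequtor")
puts score.round(3)
```

For paragraphs, a threshold on this score (chosen by experiment) could flag likely near-duplicates.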

Looking for ideas on how to match a pattern, Possible or not?

I'm looking for assistance creating a pattern match to ingest emails. The end goal is to receive an incoming message and extract just the reply message, not all the trailing junk (previous threads, signature, datestamp header, etc.).
Here are the sample formats:
Format 1:
The Message is here, etc etc can span a random # of lines
On Nov 17, 2010, at 4:18 PM, Person Name wrote:
lots of junk down here which we don't want
Format 2:
The Message is here, etc etc can span a random # of lines
On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad#sitename.com> wrote:
lots of junk down here which we don't want
Format 3:
The Message is here, etc etc can span a random # of lines
On Fri, Nov 19, 2010 at 1:57 AM, <customerserviceonline#pge.com> wrote:
lots of junk down here which we don't want
For all of the examples above, I'd like to create a pattern match that finds the first instance of the delimiter line (the "On ... wrote:" line) and then returns only what's above that line. I don't want the delimiter line itself.
I can't match on the date stamp, but I can match on everything after the comma, as that's in my control.
So the idea is to look for either of these two static items:
, Site <yadaaaa+adad#sitename.com> wrote:
, Person Name wrote:
And then take everything above that position. What do you think. Is this possible?
I would suggest a different approach: why not read the message line by line and stop when you reach the line that serves as your delimiter?
That said, here is a regexp solution:
/(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+#[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/
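You provided just one example, so this might not be perfect, but it should do the job quite well.
Then you can get the first captured group with $1, or with [1] if you are using match ([0] is the whole match, which is the same text here because the group wraps the entire pattern) :)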
regex = /(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+#[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/
if str =~ regex
puts "S1 : #{$1}"
end
if res = str.match(regex)
puts "S2 : #{res[0]}"
end
Btw, you can use the option /i on the regex.
This is not a good use for regex if you're trying to do it all in one pattern. It's possible to do, but I suspect the universe will cool before you work all the bugs out.
To understand the scope of what you are trying to do, read Wikipedia's article on "Posting Style". There are a lot of different ways replies are embedded into an email message, partly controlled by the MUA (mail user agent) and partly by the person doing the reply. There isn't a set method of doing the attribution, and no rule saying that the reply is in one block on the page, or that it is at the top of the page. This means that any code you write will have to be very sophisticated in order to have a chance of working consistently.
Have you looked at Mail? It's already written, it's well tested, it's got all sorts of cool bells and whistles, and it's already written. (I said it again because reinventing wheels that work well can be really painful.)
Parsing plain text email is one task. Then there is MIME-encoded email, with different content types. Then there is "HTML" email that doesn't have MIME blocks, but instead some moron just figured everyone liked HTML formatting and blinking text. Then there's various weirdly broken types of message bodies with four reply quoting types and the full content of all the previous messages appended one below the next, and the signatures of the horribly frustrated wanna-be writers who include the whole text of my favorite book "Girl to Grab", AKA Vol. 5 of Encyclopedia Britannica. Mail can help break out all the garbage for you, giving you a good shot at the content you need.
To grab a range of text in a body, look at Ruby's .. (AKA "flip-flop") operator. Used in a conditional, it evaluates to true from the point where the first test succeeds until the second test succeeds, and to false otherwise. See "When would a Ruby flip-flop be useful?"
Typically you'd build it like:
if ((string =~ /pattern1/) .. (string =~ /pattern2/))
...
end
As processing occurs, if the first test matches something then subsequent loops will fall into the if block. When the ending test is found the block will be turned off for subsequent loops. In this case you'd want to use either a string literal, or a small regex to locate your starting and ending lines. If you have a chance of seeing the starting pattern in later text then you'll have to figure out how to trap that.
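As a small illustration of the flip-flop inside a loop, here is a sketch that collects the lines between two markers. The START and END markers and the sample lines are made up for the example.

```ruby
lines = ["intro", "START here", "middle", "END here", "outro"]

selected = []
lines.each do |line|
  # Turns on at the first line matching START, stays on through the
  # line matching END (inclusive), then turns off again.
  selected << line if (line =~ /START/)..(line =~ /END/)
end

puts selected
```

Both marker lines are included in the result, so drop the first and last elements if only the text in between is wanted.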
For instance, here's a way to grab some content that appears to meet your stated requirements if someone does a top-reply:
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Person Name wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
li.chomp!
body << li
break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
puts '=' * 40
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad#sitename.com> wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
li.chomp!
body << li
break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
And here is the output:
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
# >> ========================================
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
The pattern could be simpler, but if it was it would increase the chance of returning false-positives.
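One way to avoid repeating the loop for each message is to wrap it in a small helper. The following is only a sketch: the name reply_body and the sample address are made up, and the comma after the year has been made optional (,?) so that the "On Fri, Nov 19, 2010 at ..." format from the question is also caught, which loosens the pattern slightly.

```ruby
# Attribution line such as "On Nov 17, 2010, at 4:18 PM, Person Name wrote:".
ATTRIBUTION = /^On (\S+ )*\w+ \d+, \d+,? at [\d:]+ \w+, .+ wrote:/i

# Hypothetical helper: returns the lines above the first attribution line.
def reply_body(msg)
  msg.lines.map(&:chomp).take_while { |li| li !~ ATTRIBUTION }
end

msg = <<~'EOT'
  The Message is here, etc etc can span a random # of lines
  Lorem ipsum dolor sit amet
  On Fri, Nov 19, 2010 at 1:57 AM, <someone@example.com> wrote:
  lots of junk down here which we don't want
EOT

puts reply_body(msg)
```

If no attribution line is found, the whole message comes back, which may or may not be what you want for bottom-posted replies.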

Resources