Ruby gem for text comparison - ruby-on-rails

I am looking for a gem that can compare two strings (in this case, paragraphs of text) and gauge how likely they are to be similar in content (with perhaps only a few words rearranged or changed). I believe that SO uses something similar when users submit questions.

I'd probably use something like Diff::LCS:
>> require "diff/lcs"
>> seq1 = "lorem ipsum dolor sit amet consequtor".split(" ")
>> seq2 = "lorem ipsum dolor amet sit consequtor".split(" ")
>> Diff::LCS.diff(seq1, seq2).length
=> 2
It uses the longest common subsequence algorithm (the method for using LCS to get a diff is described on the wiki page).
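If a single similarity score is more convenient than a list of diff hunks, the LCS length can be normalised by the combined word count. The sketch below is plain Ruby with no gem dependency; `lcs_length` and `similarity` are hypothetical helper names, not part of Diff::LCS, and the 0.0 to 1.0 scoring formula is an assumption chosen for illustration.

```ruby
# Length of the longest common subsequence of two arrays (classic DP, O(n*m)).
def lcs_length(a, b)
  prev = Array.new(b.length + 1, 0)
  a.each do |x|
    curr = [0]
    b.each_with_index do |y, j|
      # Extend the diagonal on a match, otherwise carry the best neighbour.
      curr << (x == y ? prev[j] + 1 : [prev[j + 1], curr[j]].max)
    end
    prev = curr
  end
  prev.last
end

# Hypothetical score: 1.0 for identical word sequences, 0.0 for no shared words.
def similarity(text1, text2)
  w1 = text1.split
  w2 = text2.split
  return 1.0 if w1.empty? && w2.empty?
  2.0 * lcs_length(w1, w2) / (w1.length + w2.length)
end

seq1 = "lorem ipsum dolor sit amet consequtor"
seq2 = "lorem ipsum dolor amet sit consequtor"
puts similarity(seq1, seq2).round(2)
```

For the two sample strings, five of the six words are in the common subsequence, so the score is about 0.83. Where to put the "similar enough" threshold is a judgment call that depends on the texts being compared.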

Related

Generate x characters worth of Lorem Ipsum text (sentences and paragraphs)?

I have a User field called 'bio' that can be up to 800 characters (as well as some other free text fields of varying lengths). Having it populate with dummy text would help assess the visuals/design of the front end.
How can I generate 800 characters worth of Lorem Ipsum text to place into that field? By 'Lorem Ipsum text' I mean sentences and paragraphs (not just 800 characters worth of sentences in one giant paragraph).
"a"*800 is not varied enough to resemble human paragraphs.
Note: this is for the seeds.rb file, and I am already using faker gem, in case that's useful.
I'm sure there are better ways, but this uses the Faker gem and looks quite natural:
require "faker"

def make_natural_text(n)
  paras = ""
  # Keep appending blocks of 2 to 7 Faker paragraphs, each followed by a blank line
  until paras.length > n
    para = Faker::Lorem.paragraphs(number: rand(2..7)).join + "\n\n"
    paras += para
  end
  # Trim to exactly n characters
  paras[0..(n-1)]
end
natural_text = make_natural_text(800)
puts natural_text

Ruby regex to convert uppercased words and keep titleized ones

Given the string "Lorem IPSUM dolor Sit amet", the capital letters in "Lorem" and "Sit" should be kept, while fully uppercased words like "IPSUM" should be converted to "Ipsum".
How can I produce "Lorem Ipsum dolor Sit amet" from the given string using gsub?
NOT working example: s.gsub(/[[:upper:]]/){$&.downcase}
You may use capitalize with /\b[[:upper:]]{2,}\b/ regex:
s.gsub(/\b[[:upper:]]{2,}\b/){$&.capitalize}
# => Lorem Ipsum dolor Sit amet
See the online Ruby demo.
Note that the \b[[:upper:]]{2,}\b pattern will match whole words (as \b are word boundaries) that consist only of 2 or more uppercase letters (there seems to be no need to match words like I that are already OK).
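A short runnable check of the pattern above, using the string from the question plus one extra input (an assumption added for illustration) that contains a single-letter word:

```ruby
s = "Lorem IPSUM dolor Sit amet"

# Only whole words made of two or more uppercase letters are rewritten.
result = s.gsub(/\b[[:upper:]]{2,}\b/) { |word| word.capitalize }
puts result

# A one-letter uppercase word such as "I" is left alone by the {2,} quantifier.
other = "I LIKE Ruby".gsub(/\b[[:upper:]]{2,}\b/) { |word| word.capitalize }
puts other
```

Using the block parameter instead of `$&` is only a stylistic choice; both refer to the matched text.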

How do I parse this with peg grammar?

I'm trying to make a parser using pegjs. I need to parse something like:
blah blah START Lorem ipsum
dolor sit amet, consectetur
adipiscing elit END foo bar
etc.
I have trouble writing the rule to catch the text from "START" to "END".
Use negative lookahead predicates:
phrase
  = (!"START" .)* "START" result:(!"END" .)* "END" .* {
      // remove empty element added by predicate matching
      for (var i = 0; i < result.length; ++i) {
        result[i] = result[i][1];
      }
      return result.join("");
    }
You need to use a negative predicate for END as well as START because repetition in pegjs is greedy.
Alternatively, the action could be written as
{return result.join("").split(',').join("");}
However, this relies on how join handles nested arrays (each sub-array is converted to a comma-separated string before the pieces are concatenated), and it also removes any literal commas from the captured text, such as the one after "amet" in the sample input.
[UPDATE] A shorter way to deal with the empty elements is
phrase
  = (!"START" .)* "START" result:(t:(!"END" .) { return t[1]; })* "END" .* {
      return result.join("");
    }

Looking for ideas on how to match a pattern, Possible or not?

I'm looking for assistance creating a pattern match to ingest emails. The end goal is to receive an incoming message and extract just the reply message, not all the trailing junk (previous threads, signature, datestamp header, etc.).
Here are the sample formats:
Format 1:
The Message is here, etc etc can span a random # of lines
On Nov 17, 2010, at 4:18 PM, Person Name wrote:
lots of junk down here which we don't want
Format 2:
The Message is here, etc etc can span a random # of lines
On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad#sitename.com> wrote:
lots of junk down here which we don't want
Format 3:
The Message is here, etc etc can span a random # of lines
On Fri, Nov 19, 2010 at 1:57 AM, <customerserviceonline#pge.com> wrote:
lots of junk down here which we don't want
For the examples above, I'd like to create a pattern match that finds the first instance of the second line (the "On ... wrote:" line) and then returns only what's above that line. I don't want the delimiter line itself.
I can't match on the date stamp, but I can match on everything after the comma as that's in my control.
So the idea is to look for either of these two static items:
, Site <yadaaaa+adad#sitename.com> wrote:
, Person Name wrote:
And then take everything above that position. What do you think? Is this possible?
I would suggest a different approach: why not read the text line by line and stop when you reach the line that serves as your delimiter?
Well, this would be a regexp solution:
/(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+#[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/
You provided just one example, so this might not be perfect, but it should do the job quite well.
Then you have to get the first captured group with $1, or with [0] if you are using match (the group wraps the whole pattern, so the full match at [0] is the same text) :)
regex = /(On (?:(?:Sun|Mon|Tues|Wed|Thurs|Fri|Sat), |)(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}(?:|,) at \d{1,2}:\d{1,2} (?:AM|PM), (?:(?:Site |)<[\w.%+-]+#[\w.-]+\.[A-Za-z]{2,4}>|Person \w+) wrote:)/
if str =~ regex
  puts "S1 : #{$1}"
end
if res = str.match(regex)
  puts "S2 : #{res[0]}"
end
Btw, you can use the option /i on the regex.
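Since the goal in the question is the text above the delimiter rather than the delimiter itself, MatchData#pre_match can be used once the line is found. The sketch below uses a deliberately simplified stand-in pattern (`delimiter`, an assumption for brevity, not the full regex above) and a made-up sample message:

```ruby
# Simplified stand-in for the attribution pattern (assumption, not the full regex).
delimiter = /^On .+ wrote:$/

str = "The message is here\n" \
      "and spans two lines\n" \
      "On Nov 17, 2010, at 4:18 PM, Person Name wrote:\n" \
      "lots of junk down here\n"

reply = nil
if (res = str.match(delimiter))
  # pre_match is everything in str before the start of the match.
  reply = res.pre_match
end
puts reply
```

If no delimiter line is present, `reply` stays nil here, so the caller has to decide whether to fall back to the whole message.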
This is not a good use for regex if you're trying to do it all in one pattern. It's possible to do, but I suspect the universe will cool before you work all the bugs out.
To understand the scope of what you are trying to do, read Wikipedia's article on "Posting Style". There are a lot of different ways replies are embedded into an email message, partly controlled by the MUA (mail user agent) and partly by the person doing the reply. There isn't a set method of doing the attribution, and no rule saying that the reply is in one block on the page, or that it is at the top of the page. This means that any code you write will have to be very sophisticated in order to have a chance of working consistently.
Have you looked at Mail? It's already written, it's well tested, it's got all sorts of cool bells and whistles, and it's already written. (I said it again because reinventing wheels that work well can be really painful.)
Parsing plain text email is one task. Then there is MIME-encoded email, with different content types. Then there is "HTML" email that doesn't have MIME blocks, but instead some moron just figured everyone liked HTML formatting and blinking text. Then there's various weirdly broken types of message bodies with four reply quoting types and the full content of all the previous messages appended one below the next, and the signatures of the horribly frustrated wanna-be writers who include the whole text of my favorite book "Girl to Grab", AKA Vol. 5 of Encyclopedia Britannica. Mail can help break out all the garbage for you, giving you a good shot at the content you need.
To grab a range of text in a body, look at Ruby's .. (AKA "flip-flop") operator. It's designed to return a Boolean true/false when two different tests occur. See "When would a Ruby flip-flop be useful?"
Typically you'd build it like:
if ((string =~ /pattern1/) .. (string =~ /pattern2/))
...
end
As processing occurs, if the first test matches something then subsequent loops will fall into the if block. When the ending test is found the block will be turned off for subsequent loops. In this case you'd want to use either a string literal, or a small regex to locate your starting and ending lines. If you have a chance of seeing the starting pattern in later text then you'll have to figure out how to trap that.
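As a small illustration of the on/off behaviour just described, the following sketch collects the lines between two marker strings. The method name `lines_between_markers` and the "START"/"END" markers are hypothetical; note also that Ruby 2.6 printed a deprecation warning for flip-flops, which was withdrawn again in 2.7.

```ruby
# Collects every line from the first "START" line through the next "END" line.
def lines_between_markers(lines)
  selected = []
  lines.each do |line|
    # The flip-flop turns on when the left test is true and stays on until
    # the right test is true; the line that ends the range is still included.
    selected << line if (line == "START")..(line == "END")
  end
  selected
end

sample = ["junk", "START", "keep 1", "keep 2", "END", "more junk"]
p lines_between_markers(sample)
```

Both marker lines are part of the result, so they need to be dropped afterwards if only the enclosed lines are wanted.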
For instance, here's a way to grab some content that appears to meet your stated requirements if someone does a top-reply:
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Person Name wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
  li.chomp!
  body << li
  break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
puts '=' * 40
msg = <<EOT
The Message is here, etc etc can span a random # of lines
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
On Nov 17, 2010, at 4:18 PM, Site <yadaaaa+adad#sitename.com> wrote:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
EOT
body = []
msg.lines.each do |li|
  li.chomp!
  body << li
  break if (li =~ /^On (\S+ )*\w+ \d+, \d+, at [\d:]+ \w+, .+ wrote:/i)
end
puts body[0 .. -2]
And here is the output:
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
# >> ========================================
# >> The Message is here, etc etc can span a random # of lines
# >> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# >>
The pattern could be simpler, but if it was it would increase the chance of returning false-positives.
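The pattern as written expects a comma before "at", so it would not match Format 3 from the question ("... 2010 at 1:57 AM ..."). A possible variation, sketched here under that observation, makes the comma optional and wraps the loop in a helper; `ATTRIBUTION` and `reply_body` are hypothetical names introduced for this example.

```ruby
# Same pattern as above, except that the comma before "at" is optional.
ATTRIBUTION = /^On (\S+ )*\w+ \d+, \d+,? at [\d:]+ \w+, .+ wrote:/i

# Returns the lines that come before the first attribution line.
# If no attribution line is found, all lines are returned.
def reply_body(msg)
  msg.lines.map(&:chomp).take_while { |line| line !~ ATTRIBUTION }
end

msg = <<~EOT
  The Message is here, etc etc can span a random # of lines
  Lorem ipsum dolor sit amet
  On Fri, Nov 19, 2010 at 1:57 AM, <customerserviceonline#pge.com> wrote:
  lots of junk down here which we don't want
EOT

puts reply_body(msg)
```

Using take_while avoids collecting the delimiter line and slicing it off afterwards, at the cost of the same false-positive risk as any other line-based pattern.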

Latex listings package ignores last blank line in listing

I use the LaTeX listings package with \lstinputlisting to display text from an external file. The file contains a data format description with a blank line at the end. The package ignores the blank line. How can I show the blank line in a listing?
What it displays:
1 lorem ipsum...
2 more lorem ipsum
3 lorem lorem ipsum
What I want:
1 lorem ipsum
2 more lorem ipsum
3 lorem lorem ipsum
4
See the documentation, section 4.4
showlines=(true|false) or showlines (default = false)
If true, the package prints empty lines at the end of listings. Otherwise these lines are dropped (but they count for line numbering).
Try adding this before your listing:
\lstset{
showlines=true
}
You can escape to LaTeX from within listings by assigning an escape character like so:
\lstset{numbers=left, stepnumber=1, frame=none,basicstyle = \ttfamily}
\begin{lstlisting}[escapechar=\%]
codeline1
codeline2
%
\end{lstlisting}
Comes out as:
1 codeline1
2 codeline2
3
I know it's not \lstinputlisting but hopefully it'll help you anyway.
