Using Hash functions to remove duplicate content/text - ruby-on-rails

I have a website with a lot of content and I am working on removing duplicates. For this I need to compare two strings and check their match percentage. I am using the ruby simhash gem: https://github.com/bookmate/simhash
The gem takes a string and returns an integer hash. I am not sure how to compare the two hashes.
x = 'King Gillette'.simhash(:split_by => //)
y = 'King Camp Gillette'.simhash(:split_by => //)
x # => 13716569836
y # => 13809628900
Can I take the difference and then percentage? Does that indicate the difference between the strings?

If I understand correctly and you just want the difference between the two strings (the words one has that the other lacks), you can split them into words and subtract the arrays:
>> a1 = 'King Gillette'.split(" ")
=> ["King", "Gillette"]
>> a2 = 'King Camp Gillette'.split(" ")
=> ["King", "Camp", "Gillette"]
>> a2 - a1
=> ["Camp"]
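On the original question of comparing the two integer hashes: simhash fingerprints are normally compared bit by bit (Hamming distance) rather than by subtracting them, because two fingerprints that differ in a single high bit are numerically far apart. The sketch below uses only core Ruby and does not rely on any helper from the simhash gem. The method names and the 64-bit fingerprint width are assumptions for illustration; use the width your hash actually produces.

```ruby
# Number of bit positions in which two integer fingerprints differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count('1')
end

# Rough similarity in 0.0..1.0. The default of 64 bits is an assumption,
# not something taken from the gem.
def fingerprint_similarity(a, b, bits: 64)
  1.0 - hamming_distance(a, b).to_f / bits
end

x = 13716569836 # values quoted in the question
y = 13809628900
puts hamming_distance(x, y)
puts fingerprint_similarity(x, y)
```

A small Hamming distance suggests near-duplicate text. Where to put the cutoff is a tuning decision for your own content.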

Related

How can I replace words in a string with elements in an array in ruby?

I'm trying to replace words (sequence of characters, more generally) in a string with corresponding values in an array. An example is:
"The dimension of the square is {{width}} and {{length}}" with array [10,20] should give
"The dimension of the square is 10 and 20"
I have tried to use gsub as
substituteValues.each do |sub|
value.gsub(/\{\{(.*?)\}\}/, sub)
end
But I could not get it to work. I have also thought of using a hash instead of an array, as follows:
{"{{width}}"=>10, "{{length}}"=>20}. I feel this might work better, but again, I'm not sure how to code it (I'm new to Ruby). Any help is appreciated.
You can use
h = {"{{width}}"=>10, "{{length}}"=>20}
s = "The dimension of the square is {{width}} and {{length}}"
puts s.gsub(/\{\{(?:width|length)\}\}/, h)
# => The dimension of the square is 10 and 20
See the Ruby demo. Details:
\{\{(?:width|length)\}\} - a regex that matches
  \{\{ - a {{ substring
  (?:width|length) - a non-capturing group that matches the word width or length
  \}\} - a }} substring
gsub - replaces all occurrences in the string
h - used as the second argument, replaces each match that equals a hash key with the corresponding hash value.
You can also use a slightly simpler hash whose keys omit the {{ and }}, and add a capturing group to the regex to match width or length. Then you need a block:
h = {"width"=>10, "length"=>20}
s = "The dimension of the square is {{width}} and {{length}}"
puts s.gsub(/\{\{(width|length)\}\}/) { h[Regexp.last_match[1]] }
See this Ruby demo. So, here, (width|length) is used instead of (?:width|length) and only Group 1 is used as the key in h[Regexp.last_match[1]] inside the block.
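If the values really do arrive as a plain array, as in the question, one possible approach is to hand them out in order, one per placeholder. This is only a sketch: the method name fill_placeholders is hypothetical, and it assumes the array has at least as many elements as there are placeholders (otherwise Enumerator#next raises StopIteration).

```ruby
# Replace each {{...}} placeholder, in order, with the next array element.
def fill_placeholders(template, values)
  remaining = values.each # external enumerator over the values
  template.gsub(/\{\{.*?\}\}/) { remaining.next.to_s }
end

puts fill_placeholders("The dimension of the square is {{width}} and {{length}}", [10, 20])
# => The dimension of the square is 10 and 20
```

The hash-based versions above are more robust, since they do not depend on the order in which placeholders appear.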

Changing text based on the final letter of user name using regular expression

I want to change the ending of a user's name depending on how it is used (in the language the system will operate in, name endings change with the grammatical case).
So I need to define all the name endings and a replacement for each.
It was suggested (in "Changing text based on the final letter of user name") that I use gsub with a regular expression to search and replace in the string:
"name surname".gsub(/e\b/, 'ai')
This replaces a word-final e with ai, so "name surname" becomes "namai surnamai".
How can this be extended to several endings, such as "e = ai, us = mi, i = as", on the same record?
thanks
You can use String#gsub with a block. The docs say:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
So you can use a regex with concatenation of all substrings to be replaced and then replace it in the block, e.g. using a hash that maps matches to replacements.
Full example:
replacements = {'e'=>'ai', 'us'=>'mi', 'i' => 'as'}
['surname', 'surnamus', 'surnami'].map do |s|
  s.gsub(/(e|us|i)$/) { |p| replacements[p] }
end
#=> ["surnamai", "surnammi", "surnamas"]
@Sundeep makes an important observation in a comment on the question. If, for example, the substitutions were given by the following hash:
g = {'e'=>'ai', 's'=>'es', 'us'=>'mi', 'i' => 'as'}
#=> {"e"=>"ai", "s"=>"es", "us"=>"mi", "i"=>"as"}
'surnamus' would be converted (incorrectly) to 'surnamues' merely because 's'=>'es' precedes 'us'=>'mi' in g. That situation may not exist at present, but it may be prudent to allow for it in future, particularly because it is so simple to do so:
h = g.sort_by { |k,_| -k.size }.to_h
#=> {"us"=>"mi", "e"=>"ai", "s"=>"es", "i"=>"as"}
arr = ['surname', 'surnamus', 'surnami', 'surnamo']
The substitutions can be done using the form of String#sub that takes a hash as its second argument.
r = /#{Regexp.union(h.keys)}\z/
#=> /(?-mix:us|e|s|i)\z/
arr.map { |s| s.sub(r,h) }
#=> ["surnamai", "surnammi", "surnamas", "surnamo"]
See also Regexp::union.
Incidentally, though key-insertion order has been guaranteed for hashes since Ruby v1.9, there is a continuing debate as to whether that property should be made use of in Ruby code, mainly because there was no concept of key order when hashes were first used in computer programs. This answer provides a good example of the benefit of exploiting key order.
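The examples above only rewrite the end of the whole string. To change the ending of every word in a record such as "name surname", as the question asks, one option is to anchor on a word boundary instead. This is a minimal sketch: the constant and method names are hypothetical, the endings are the three from the question, and it assumes names written in ASCII letters.

```ruby
ENDINGS = { 'e' => 'ai', 'us' => 'mi', 'i' => 'as' }.freeze

# Longest endings first, following the ordering advice above.
ENDING_PATTERN = /(?:#{Regexp.union(ENDINGS.keys.sort_by { |k| -k.size })})\b/

# Replace the ending of each word using the hash form of gsub.
def change_endings(full_name)
  full_name.gsub(ENDING_PATTERN, ENDINGS)
end

puts change_endings('name surname')
# => namai surnamai
```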

Ruby .scan method returns empty using regex

So given a string like this "\"turkey AND ham\" NOT \"roast beef\"" I need to get an array with the inner strings like so: ["turkey AND ham", "roast beef"] and eliminate OR's, AND's and NOT's that may or may not be there.
With the help of Rubular I came up with this regex /\\["']([^"']*)\\["']/
which returns the following 2 groups:
Match 1
1. turkey AND ham
Match 2
1. roast beef
However, when I use it with .scan I keep getting an empty array.
I looked at a couple of other SO posts, and a few other sources, but cannot figure out where I am going wrong.
Here is the result from my rails console:
=> q = "\"turkey and ham\" OR \"roast beef\""
=> q.scan(/\\["']([^"']*)\\["']/)
=> []
Expectation:
["turkey AND ham", "roast beef"]
I shall also mention I suck at regex.
When the regex used with scan contains a capture group (@davidhu2000's approach), one can generally use lookarounds [1] instead. It's just a matter of personal preference. To allow for double-quoted strings that contain either single- or (escaped) double-quoted strings, you could use the following regex.
r = /
    (?<=")   # match a double quote in a positive lookbehind
    [^"]+    # match one or more characters that are not double quotes
    (?=")    # match a double quote in a positive lookahead
    |        # or
    (?<=')   # match a single quote in a positive lookbehind
    [^']+    # match one or more characters that are not single quotes
    (?=')    # match a single quote in a positive lookahead
    /x       # free-spacing regex definition mode
"\"turkey AND ham\" NOT 'roast beef'".scan(r)
#=> ["turkey AND ham", "roast beef"]
As '"turkey AND ham" NOT "roast beef"' #=> "\"turkey AND ham\" NOT \"roast beef\"" (i.e., how the single-quoted string is saved), we need not be concerned about that being an additional case to deal with.
[1] For any in the audience who still consider regular expressions to be black magic, there are four kinds of lookarounds (positive and negative lookbehinds and lookaheads), as elaborated in the doc for Regexp. They are sometimes called "zero-width" matches because they are not part of the matched text.
Your regex is trying to match a literal \, which won't match anything in the string: the \ in the string literal only escapes the double quote and is not part of the string itself.
So remove the \\ from your regex:
res = q.scan(/["']([^"']*)["']/)
This returns a 2D array:
res # => [["turkey and ham"], ["roast beef"]]
Each inner array holds all the capture groups from one match, so if you have two capture groups in your regex, you will see two items in each inner array.
If you want a flat array, you can call flatten on the result.
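Putting the corrected regex and flatten together, a short sketch of the whole extraction might look like this. The method name quoted_phrases is hypothetical, and the pattern assumes the phrases do not themselves contain quote characters.

```ruby
# Return the text found between pairs of quote characters.
def quoted_phrases(query)
  query.scan(/["']([^"']*)["']/).flatten
end

p quoted_phrases("\"turkey AND ham\" NOT \"roast beef\"")
# => ["turkey AND ham", "roast beef"]
```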

Rails: Split text including dollar and euro

I'm using Rails and Nokogiri and I'm trying to parse some website.
This is where I'm stuck:
doc.css('#example > li:nth-child(1)').each do |node|
  money = node.xpath('//*ul/li/div/span').text
end
It returns something like:
$100,000£230,000$40,000$9,000€600$800,000
I want to split those items, save them to the database and finally hand them to the view.
So, in the view, I want it to appear like:
(1)$100,000
(2)£230,000
(3)$40,000
(4)$9,000
(5)€600
(6)$800,000
I tried to split those items by this code below.
money = node.xpath('//*ul/li/div/span').text.split(/[$€£]/)
but the result looks like this:
["", "100,000", "230,000", "40,000", "9,000", "600", "800,000"]
And I don't know which item is in dollars, euros, or pounds.
Is there any good way to solve this problem?
You're almost there; just use a positive lookahead :)
irb(main):005:0> "$100,000£230,000$40,000$9,000€600$800,000".split(/(?=[$£€])/)
=> ["$100,000", "£230,000", "$40,000", "$9,000", "€600", "$800,000"]
It needs a regular expression. This works:
"$100,000£230,000$40,000$9,000$600$800,000".scan(/([^\d][0-9,]+)/)
=> [["$100,000"],
["£230,000"],
["$40,000"],
["$9,000"],
["$600"],
["$800,000"]]
The regex contains these parts:
[^\d]: A character class matching a single non-digit. This will match the currency symbol.
[0-9,]+: Another character class, this time repeated (the +). It matches the numeric part (0-9) plus the thousands separator.
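Since the goal is to save the items to the database, it may help to capture the symbol and the number separately. The following is only a sketch: the method name, the hash layout, and the symbol-to-code mapping are assumptions for illustration, not part of Nokogiri or Rails.

```ruby
# Hypothetical mapping from currency symbol to code; adjust to your schema.
CURRENCY_CODES = { '$' => 'USD', '£' => 'GBP', '€' => 'EUR' }.freeze

# Split a run of prices into { currency:, amount: } hashes.
def parse_prices(text)
  text.scan(/([$£€])([\d,]+)/).map do |symbol, digits|
    { currency: CURRENCY_CODES[symbol], amount: digits.delete(',').to_i }
  end
end

prices = parse_prices('$100,000£230,000$40,000$9,000€600$800,000')
prices.each_with_index do |price, i|
  puts "(#{i + 1}) #{price[:currency]} #{price[:amount]}"
end
```

Storing the amount as an integer and the currency as a separate field keeps sorting and formatting decisions in the view.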

Best way to count words in a string in Ruby?

Is there anything better than string.scan(/(\w|-)+/).size (the - is so, e.g., "one-way street" counts as 2 words instead of 3)?
string.split.size
Edited to explain multiple spaces
From the Ruby String Documentation page
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array
of these substrings.
If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern contains groups, the respective
matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is
the default), str is split on whitespace as if ' ' were specified.
If the limit parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
" now's the time".split #=> ["now's", "the", "time"]
While that is the current version of ruby as of this edit, I learned on 1.7 (IIRC), where that also worked. I just tested it on 1.8.3.
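The behaviours described in the quoted documentation can be checked directly. A small sketch using only core String#split; the helper name simple_word_count is just for illustration:

```ruby
# Count whitespace-separated words; leading and repeated whitespace are ignored.
def simple_word_count(str)
  str.split.size
end

puts simple_word_count(" now's  the time") # => 3
p "a b c".split(" ", 2)                    # => ["a", "b c"] (positive limit caps the fields)
p "a,b,,".split(",")                       # => ["a", "b"] (trailing empty fields dropped)
p "a,b,,".split(",", -1)                   # => ["a", "b", "", ""] (negative limit keeps them)
```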
I know this is an old question, but this might be useful to someone else looking for something more sophisticated than string.split. I wrote the words_counted gem to solve this particular problem, since defining words is pretty tricky.
The gem lets you define your own custom criteria, or use the out of the box regexp, which is pretty handy for most use cases. You can pre-filter words with a variety of options, including a string, lambda, array, or another regexp.
counter = WordsCounted::Counter.new("Hello, Renée! 123")
counter.word_count #=> 2
counter.words #=> ["Hello", "Renée"]
# filter the word "hello"
counter = WordsCounted::Counter.new("Hello, Renée!", reject: "Hello")
counter.word_count #=> 1
counter.words #=> ["Renée"]
# Count numbers only
counter = WordsCounted::Counter.new("Hello, Renée! 123", regexp: /[0-9]/)
counter.word_count #=> 1
counter.words #=> ["123"]
The gem provides a bunch more useful methods.
If the 'word' in this case can be described as an alphanumeric sequence which can include '-' then the following solution may be appropriate (assuming that everything that doesn't match the 'word' pattern is a separator):
>> 'one-way street'.split(/[^-a-zA-Z]/).size
=> 2
>> 'one-way street'.split(/[^-a-zA-Z]/).each { |m| puts m }
one-way
street
=> ["one-way", "street"]
However, there are some other symbols that can be included in the regex - for example, ' to support the words like "it's".
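One way to fold in both hyphens and apostrophes is to describe a word directly and count matches with scan, instead of splitting on separators. This is a sketch under one assumed definition of "word" (letters, optionally joined by single hyphens or apostrophes); the names WORD_PATTERN and count_words are hypothetical.

```ruby
# A "word" here is a run of letters, optionally joined by single
# hyphens or apostrophes (e.g. "one-way", "it's").
WORD_PATTERN = /[[:alpha:]]+(?:['-][[:alpha:]]+)*/

def count_words(text)
  text.scan(WORD_PATTERN).size
end

puts count_words("one-way street")         # => 2
puts count_words("it's a one-way  street") # => 4
```

Because scan counts matches, runs of spaces or punctuation between words do not produce empty entries.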
This is pretty simplistic but does the job if you are typing words with spaces in between. It ends up counting numbers as well but I'm sure you could edit the code to not count numbers.
puts "enter a sentence to find its word length: "
word = gets
word = word.chomp
splits = word.split(" ")
target = splits.length.to_s
puts "your sentence is " + target + " words long"
The best way to do this is to use the split method.
split divides a string into substrings based on a delimiter, returning an array of the substrings.
split takes two parameters: pattern and limit.
pattern is the delimiter on which the string is split into an array.
limit specifies the maximum number of elements in the resulting array.
For more details, refer to the Ruby String documentation.
str = "This is a string"
str.split(' ').size
#output: 4
The above code splits the string wherever it finds whitespace, so the size of the resulting array is the number of words in the string.
The split(/[^-a-zA-Z]/) solution above goes wrong when words are separated by more than one character. Consider the following (note the two spaces):
"one-way  street"
You will get
["one-way", "", "street"]
Use
'one-way street'.gsub(/[^-a-zA-Z]/, ' ').split.size
This splits words only on ASCII whitespace chars:
p " some word\nother\tword|word".strip.split(/\s+/).size #=> 4
