Fuzzy string search: finding one string within any substring of another

I want to find a string inside a bigger string, within some Levenshtein distance. I have written code that computes the distance between two strings, but I am looking for an efficient way to find a substring at a fixed Levenshtein distance.
module Levenshtein
  def self.distance(a, b)
    a, b = a.downcase, b.downcase
    costs = Array(0..b.length) # i == 0
    (1..a.length).each do |i|
      costs[0], nw = i, i - 1 # j == 0; nw is lev(i-1, j)
      (1..b.length).each do |j|
        costs[j], nw = [costs[j] + 1, costs[j-1] + 1, a[i-1] == b[j-1] ? nw : nw + 1].min, costs[j]
      end
    end
    costs[b.length]
  end
  def self.test
    %w{kitten sitting saturday sunday rosettacode raisethysword}.each_slice(2) do |a, b|
      puts "distance(#{a}, #{b}) = #{distance(a, b)}"
    end
  end
end

Take a look at the TRE library, which does exactly this (in C), and quite efficiently. Now look carefully at the matching function, which is basically 500 lines of unreadable (but necessary) code.
Unless you intend to read the difficult papers on the subject (search for "approximate string matching") and have a few free months to study it, you'd be much better off writing a small wrapper around the library itself than rolling your own version. Your Ruby version would be inefficient anyway compared with what can be achieved in C.
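If wrapping a C library is not an option, the distance function from the question can be adapted with one small change, often attributed to Sellers: let a match start anywhere in the larger string for free by resetting the top cell of every column to 0, then keep the smallest value seen in the bottom cell. What follows is only a minimal sketch under that assumption, not a substitute for TRE. The name fuzzy_substring_distance is made up for illustration, and the running time is proportional to pattern length times text length.

```ruby
# Smallest Levenshtein distance between `pattern` and any substring of `text`.
# Hypothetical helper for illustration; case-insensitive like the question's code.
def fuzzy_substring_distance(pattern, text)
  pattern = pattern.downcase
  text = text.downcase
  # costs[i] = distance between pattern[0, i] and the best substring of the
  # text ending at the current position. Before any text is read, costs[i] = i.
  costs = Array(0..pattern.length)
  best = costs[pattern.length]
  text.each_char do |ch|
    diag = costs[0] # value up and to the left in the full table
    costs[0] = 0    # a match may begin at any position at no cost
    (1..pattern.length).each do |i|
      prev = costs[i]
      substitution = diag + (pattern[i - 1] == ch ? 0 : 1)
      costs[i] = [prev + 1, costs[i - 1] + 1, substitution].min
      diag = prev
    end
    best = [best, costs[pattern.length]].min
  end
  best
end

puts fuzzy_substring_distance("kitten", "the sitten sat") # 1
```

To answer "is there a substring within distance k?", compare the returned value with k.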

Related

Detecting overlapping ranges in Ruby

I have an array of ranges:
[[39600..82800], [39600..70200],[70200..80480]]
I need to determine whether any of them overlap. What is an easy way to do it in Ruby?
In the above case the output should be 'Overlapping'.
This is a very interesting puzzle, especially if you care about performance.
If there are just two ranges, it's a fairly simple algorithm, which is also covered by the ActiveSupport overlaps? extension.
def ranges_overlap?(r1, r2)
r1.cover?(r2.first) || r2.cover?(r1.first)
end
If you want to compare multiple ranges, it's a fairly interesting algorithm exercise.
You could loop over all the ranges and compare each range with every other one, but that algorithm has quadratic cost, O(n^2) in the number of ranges.
A more efficient solution is to sort the ranges and run a binary search, or to use a data structure (such as a tree) that makes it possible to compute the overlaps.
This problem is also explained in the Interval tree page. Computing an overlap essentially consists of finding the intersection of the trees.
Is this not a way to do it?
def any_overlapping_ranges(array_of_ranges)
array_of_ranges.sort_by(&:first).each_cons(2).any?{|x,y|x.last>y.first}
end
p any_overlapping_ranges([50..100, 1..51,200..220]) #=> true
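The same sort-first idea can also report which ranges collide. The sketch below assumes inclusive ranges with comparable endpoints and treats a shared endpoint (such as 70200 in the question) as an overlap. The name overlapping_pairs is hypothetical.

```ruby
# Returns every pair of overlapping ranges, after sorting by start.
def overlapping_pairs(ranges)
  sorted = ranges.sort_by(&:first)
  pairs = []
  sorted.each_with_index do |a, i|
    sorted[(i + 1)..-1].each do |b|
      # Ranges are sorted by start, so once one starts past the end of `a`,
      # every later range does too.
      break if b.first > a.last
      pairs << [a, b]
    end
  end
  pairs
end

p overlapping_pairs([50..100, 1..51, 200..220]) # [[1..51, 50..100]]
```

The worst case is still quadratic when everything overlaps everything, since that many pairs have to be returned, but disjoint inputs are handled in a single pass after the sort.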
Consider this:
class Range
include Comparable
def <=>(other)
self.begin <=> other.begin
end
def self.overlap?(*ranges)
edges = ranges.sort.flat_map { |range| [range.begin, range.end] }
edges != edges.sort.uniq
end
end
Range.overlap?(2..12, 6..36, 42..96) # => true
Notes:
This could take in any number of ranges.
Have a look at the gist with some tests to play with the code.
The code creates a flat array with the start and end of each range.
This array will retain the order if they don't overlap. (It's easier to visualize with some examples than to explain textually, so try it.)
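To make the last note concrete, here is a small sketch that builds the flattened edge array without reopening Range. The helper name edges_of is an assumption for illustration only.

```ruby
# Flattens ranges, sorted by start, into [begin, end, begin, end, ...].
def edges_of(ranges)
  ranges.sort_by(&:begin).flat_map { |r| [r.begin, r.end] }
end

disjoint    = edges_of([42..96, 2..12, 20..36])
overlapping = edges_of([2..12, 6..36, 42..96])

p disjoint                                  # [2, 12, 20, 36, 42, 96]
p overlapping                               # [2, 12, 6, 36, 42, 96]
p disjoint == disjoint.sort.uniq            # true: already in order
p overlapping == overlapping.sort.uniq      # false: 12 comes before 6
```

When no ranges overlap, each range ends before the next one begins, so the flattened list is already sorted and free of duplicates. An overlap puts an end value ahead of a smaller or equal begin value, which sorting (or uniq) then changes.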
For the sake of simplicity and readability, I'd suggest this approach:
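def overlaps?(ranges)
  ranges.each_with_index do |range, index|
    # Compare each range only with the ranges that come after it.
    ranges[(index + 1)..-1].each do |next_range|
      # Compare endpoints instead of `range.to_a & next_range.to_a`: an Array
      # is truthy even when empty, so that test reported every pair.
      if range.first <= next_range.last && next_range.first <= range.last
        puts "#{range} overlaps with #{next_range}"
      end
    end
  end
end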
r = [(39600..82800), (39600..70200),(70200..80480)]
overlaps?(r)
and the output:
ruby ranges.rb
39600..82800 overlaps with 39600..70200
39600..82800 overlaps with 70200..80480
39600..70200 overlaps with 70200..80480

How to count the number of decimal places in a Float?

I am using Ruby 1.8.7 and Rails 2.3.5.
If I have a float like 12.525, how can I get the number of digits past the decimal point? In this case I expect to get a '3' back.
Something like that, I guess:
n = 12.525
n.to_s.split('.').last.size
You should be very careful with what you want. Floating point numbers are excellent for scientific purposes and mostly work for daily use, but they fall apart pretty badly when you want to know something like "how many digits past the decimal place" -- if only because they have about 16 digits total, not all of which will contain accurate data for your computation. (Or, some libraries might actually throw away accurate data towards the end of the number when formatting a number for output, on the grounds that "rounded numbers are more friendly". Which, while often true, means it can be a bit dangerous to rely upon formatted output.)
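A short sketch of how the string-splitting approach behaves once a float is not exactly what it looks like. The helper name naive_decimals is made up for this illustration.

```ruby
# Counts digits after the decimal point by inspecting the printed form.
def naive_decimals(x)
  x.to_s.split('.').last.size
end

puts naive_decimals(12.525)     # 3
puts naive_decimals(0.1 + 0.2)  # 17, because the sum prints as 0.30000000000000004
puts 1e-7.to_s                  # 1.0e-07, so splitting on "." no longer counts decimals
```

The count reflects whatever Float#to_s chooses to print, not the number of digits the value was "meant" to have.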
If you can replace the standard floating point numbers with the BigDecimal class to provide arbitrary-precision floating point numbers, then you can inspect the "raw" number:
> require 'bigdecimal'
=> true
> def digits_after_decimal_point(f)
> sign, digits, base, exponent = f.split
> return digits.length - exponent
> end
> l = %w{1.0, 1.1, 1000000000.1, 1.0000000001}
=> ["1.0,", "1.1,", "1000000000.1,", "1.0000000001"]
> list = l.map { |n| BigDecimal(n) }
=> [#<BigDecimal:7f7a56aa8f70,'0.1E1',9(18)>, #<BigDecimal:7f7a56aa8ef8,'0.11E1',18(18)>, #<BigDecimal:7f7a56aa8ea8,'0.1000000000 1E10',27(27)>, #<BigDecimal:7f7a56aa8e58,'0.1000000000 1E1',27(27)>]
> list.map { |i| digits_after_decimal_point(i) }
=> [0, 1, 1, 10]
Of course, if moving to BigDecimal makes your application too slow or is patently too powerful for what you need, this might overly complicate your code for no real benefit. You'll have to decide what is most important for your application.
Here is a very simple approach. Keep track of how many times you have to multiply the number by 10 before it equals its equivalent integer:
def decimals(a)
num = 0
while(a != a.to_i)
num += 1
a *= 10
end
num
end
decimals(1.234) # -> 3
decimals(10/3.0) # -> 16
Like this:
theFloat.to_s.split(".")[1].length
It is not very pretty, but you can insert it as a method for Float:
class Float
def decimalPlaces
self.to_s.split(".")[1].length
end
end
Can you subtract the floor and then just count how many characters left?
(12.525 - (12.525.floor)).to_s.length - 2
=> 3
edit: nope, this doesn't work for a bunch of reasons: negatives and 0.99999 issues
Olexandr's answer doesn't work for integers. You can try the following:
def decimals(num)
if num
arr = num.to_s.split('.')
case arr.size
when 1
0
when 2
arr.last.size
else
nil
end
else
nil
end
end
You can use this approach
def digits_after_decimal_point(n)
splitted = n.to_s.split(".")
if splitted.count > 1
return 0 if splitted[1].to_f == 0
return splitted[1].length
else
return 0
end
end
# Examples
digits_after_decimal_point("1") #=> 0
digits_after_decimal_point("1.0") #=> 0
digits_after_decimal_point("1.01") #=> 2
digits_after_decimal_point("1.00000") #=> 0
digits_after_decimal_point("1.000001") #=> 6
digits_after_decimal_point(nil) #=> 0

What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

So, I started with this: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Ruby
Which works great for really small strings. But, my strings can be upwards of 10,000 characters long -- and since the Levenshtein Distance is recursive, this causes a stack too deep error in my Ruby on Rails app.
So, is there another, maybe less stack intensive method of finding the similarity between two large strings?
Alternatively, I'd need a way to make the stack have much larger size. (I don't think this is the right way to solve the problem, though)
Consider a non-recursive version to avoid the excessive call stack overhead. Seth Schroeder has an iterative implementation in Ruby which uses a table of intermediate results instead (in the code below, a Hash keyed by [row, column] pairs); it appears to be related to the dynamic programming approach for Levenshtein distance (as outlined in the pseudocode for the Wikipedia article). Seth's Ruby code is reproduced below:
def levenshtein(s1, s2, damerau = false)
  # d maps [row, col] to the distance between s1[0, row] and s2[0, col].
  d = {}
  (0..s1.size).each do |row|
    d[[row, 0]] = row
  end
  (0..s2.size).each do |col|
    d[[0, col]] = col
  end
  (1..s1.size).each do |i|
    (1..s2.size).each do |j|
      cost = 0
      if (s1[i-1] != s2[j-1])
        cost = 1
      end
      d[[i, j]] = [d[[i - 1, j]] + 1,
                   d[[i, j - 1]] + 1,
                   d[[i - 1, j - 1]] + cost
                  ].min
      # The listing read a class variable (@@damerau) here; it is taken as a
      # parameter in this copy (an assumption) so the method runs on its own.
      next unless damerau
      if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
        d[[i, j]] = [d[[i,j]],
                     d[[i-2, j-2]] + cost
                    ].min
      end
    end
  end
  return d[[s1.size, s2.size]]
end
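One caveat for the sizes mentioned in the question: the table above holds (s1.size + 1) * (s2.size + 1) entries, which is roughly 100 million for two 10,000-character strings. Plain Levenshtein distance (without the Damerau transposition step) only ever looks at the previous row, so a sketch that keeps two rows is enough. The name levenshtein_two_rows is hypothetical.

```ruby
# Iterative Levenshtein distance that keeps only two rows of the table.
def levenshtein_two_rows(s1, s2)
  prev = (0..s2.size).to_a # distances from the empty prefix of s1
  s1.each_char.with_index(1) do |c1, i|
    curr = [i]             # distance from s1[0, i] to the empty string
    s2.each_char.with_index(1) do |c2, j|
      cost = c1 == c2 ? 0 : 1
      curr << [prev[j] + 1,            # deletion
               curr[j - 1] + 1,        # insertion
               prev[j - 1] + cost].min # substitution or match
    end
    prev = curr
  end
  prev.last
end

puts levenshtein_two_rows("kitten", "sitting") # 3
```

The running time is still proportional to the product of the two lengths, so very long strings remain slow in pure Ruby; only the memory use drops to one row per string.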

Walking over strings to guess a name from an email based on dictionary of names?

Let's say I have a dictionary of names (a huge CSV file). I want to guess a name from an email that has no obvious parsable points (., -, _). I want to do something like this:
dict = ["sam", "joe", "john", "parker", "jane", "smith", "doe"]
word = "johnsmith"
x = 0
y = word.length-1
name_array = []
for i in x..y
match_me = word[x..i]
dict.each do |name|
if match_me == name
name_array << name
end
end
end
name_array
# => ["john"]
Not bad, but I want "John Smith" or ["john", "smith"]
In other words, I recursively loop through the word (i.e., unparsed email string, "johndoe@gmail.com") until I find a match within the dictionary. I know: this is incredibly inefficient. If there's a much easier way of doing this, I'm all ears!
If there's no better way of doing it, then show me how to fix the example above, for it suffers from two major flaws: (1) how do I set the length of the loop (see problem of finding "i" below), and (2) how do I increment "x" in the example above so that I can cycle through all possible character combinations given an arbitrary string?
Problem of finding the length of the loop, "i":
for an arbitrary word, how can we derive "i" given the pattern below?
for a (i = 1)
a
for ab (i = 3)
a
ab
b
for abc (i = 6)
a
ab
abc
b
bc
c
for abcd (i = 10)
a
ab
abc
abcd
b
bc
bcd
c
cd
d
for abcde (i = 15)
a
ab
abc
abcd
abcde
b
bc
bcd
bcde
c
cd
cde
d
de
e
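The counts in the pattern above (1, 3, 6, 10, 15) are the triangular numbers: a word of length n has n(n + 1)/2 non-empty contiguous substrings, because each start position s (counting from 0) allows n - s end positions. A small sketch that enumerates them in the same order as the listing, using a hypothetical helper named all_substrings:

```ruby
# Every non-empty contiguous substring of `word`, grouped by start position.
def all_substrings(word)
  (0...word.length).flat_map do |start|
    (start...word.length).map { |stop| word[start..stop] }
  end
end

p all_substrings("abc")        # ["a", "ab", "abc", "b", "bc", "c"]
n = "abcde".length
p all_substrings("abcde").size # 15
p n * (n + 1) / 2              # 15
```

Two nested loops over start and end indices therefore answer both parts of the question: the outer loop increments "x", and the total number of iterations is n(n + 1)/2.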
r = /^(#{Regexp.union(dict)})(#{Regexp.union(dict)})$/
word.match(r)
=> #<MatchData "johnsmith" 1:"john" 2:"smith">
The regex might take some time to build, but it's blazing fast.
I dare suggest a brute force solution that is not very elegant but still useful in case
you have a large number of items (building a regexp can be a pain)
the string to analyse is not limited to two components
you want to get all splittings of a string
you want only complete analyses of a string, that span from ^ to $.
Because of my poor English, I could not figure out a long personal name that can be split in more than one way, so let's analyse a phrase:
word = "godisnowhere"
The dictionary:
@dict = [ "god", "is", "now", "here", "nowhere", "no", "where" ]
@lengths = @dict.collect {|w| w.length }.uniq.sort
The array @lengths adds a slight optimization to the algorithm: we use it to prune subwords whose lengths don't exist in the dictionary without actually performing a dictionary lookup. The array is sorted, which is another optimization.
The main part of the solution is a recursive function that finds the initial subword in a given word and restarts for the tail subword.
def find_head_substring(word)
  # boundary condition:
  # remaining subword is shorter than the shortest word in @dict
  return [] if word.length < @lengths[0]
  splittings = []
  @lengths.each do |len|
    break if len > word.length
    head = word[0,len]
    if @dict.include?(head)
      tail = word[len..-1]
      if tail.length == 0
        splittings << head
      else
        tails = find_head_substring(tail)
        unless tails.empty?
          tails.collect!{|tail| "#{head} #{tail}" }
          splittings.concat tails
        end
      end
    end
  end
  return splittings
end
Now see how it works
find_head_substring(word)
=>["god is no where", "god is now here", "god is nowhere"]
I have not tested it extensively, so I apologize in advance :)
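For a dictionary loaded from a huge CSV file, two changes to the recursion above may help: looking words up in a Set rather than scanning an Array, and remembering the splittings already computed for each tail. The following is a sketch under those assumptions. The name segmentations and the memo parameter are invented here, and results are returned as arrays of words rather than joined strings.

```ruby
require 'set'

# Every way to split `word` into dictionary entries, as arrays of words.
# `dict` is expected to be a Set; `memo` caches the results per remaining tail.
def segmentations(word, dict, memo = {})
  return [[]] if word.empty? # exactly one way to split nothing
  return memo[word] if memo.key?(word)

  results = []
  (1..word.length).each do |len|
    head = word[0, len]
    next unless dict.include?(head)
    segmentations(word[len..-1], dict, memo).each do |rest|
      results << ([head] + rest)
    end
  end
  memo[word] = results
end

dict = Set.new(%w[god is now here nowhere no where])
segmentations("godisnowhere", dict).each { |words| puts words.join(" ") }
```

A word with no complete splitting yields an empty array, just as find_head_substring does.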
If you just want the hits of matches in your dictionary:
dict.select{ |r| word[/#{r}/] }
=> ["john", "smith"]
You run a risk of too many confusing subhits, so you might want to sort your dictionary so longer names are first:
dict.sort_by{ |w| -w.size }.select{ |r| word[/#{r}/] }
=> ["smith", "john"]
You will still encounter situations where a longer name has a shorter substring following it and get multiple hits so you'll need to figure out a way to weed those out. You could have an array of first names, and another of last names, and take the first returned result of scanning for each, but given the diversity of first and last names, that doesn't guarantee 100% accuracy, and will still gather some bad results.
This sort of problem has no real good solution without further hints to the code about the person's name. Perhaps scanning the body of the message for salutation or valediction sections will help.
I'm not sure what you're doing with i, but isn't it as simple as:
dict.each do |first|
dict.each do |last|
puts first,last if first+last == word
end
end
This one bags all occurrences, not necessarily exactly two:
pattern = Regexp.union(dict)
matches = []
while match = word.match(pattern)
matches << match.to_s # Or just leave off to_s to keep the match itself
word = match.post_match
end
matches

Lua base converter

I need a base converter function for Lua. I need to convert from base 10 to base 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ... 36. How can I do this?
In the string to number direction, the function tonumber() takes an optional second argument that specifies the base to use, which may range from 2 to 36 with the obvious meaning for digits in bases greater than 10.
In the number to string direction, this can be done slightly more efficiently than Nikolaus's answer by something like this:
local floor,insert = math.floor, table.insert
function basen(n,b)
n = floor(n)
if not b or b == 10 then return tostring(n) end
local digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
local t = {}
local sign = ""
if n < 0 then
sign = "-"
n = -n
end
repeat
local d = (n % b) + 1
n = floor(n / b)
insert(t, 1, digits:sub(d,d))
until n == 0
return sign .. table.concat(t,"")
end
This creates fewer garbage strings to collect by using table.concat() instead of repeated calls to the string concatenation operator (..). Although it makes little practical difference for strings this small, this idiom should be learned because otherwise building a buffer in a loop with the concatenation operator will actually tend to O(n^2) performance while table.concat() has been designed to do substantially better.
There is an unanswered question as to whether it is more efficient to push the digits on a stack in the table t with calls to table.insert(t,1,digit), or to append them to the end with t[#t+1]=digit, followed by a call to string.reverse() to put the digits in the right order. I'll leave the benchmarking to the student. Note that although the code I pasted here does run and appears to get correct answers, there may be other opportunities to tune it further.
For example, the common case of base 10 is culled off and handled with the built in tostring() function. But similar culls can be done for bases 8 and 16 which have conversion specifiers for string.format() ("%o" and "%x", respectively).
Also, neither Nikolaus's solution nor mine handle non-integers particularly well. I emphasize that here by forcing the value n to an integer with math.floor() at the beginning.
Correctly converting a general floating point value to any base (even base 10) is fraught with subtleties, which I leave as an exercise to the reader.
You can use a loop to convert an integer into a string in the required base. For bases below 10, use the following code; if you need a larger base, add a line that maps the result of x % base to a character (using an array, for example).
x = 1234
r = ""
base = 8
while x > 0 do
r = "" .. (x % base ) .. r
x = math.floor(x / base)
end
print( r );
