Fuzzy `==`, or "almost_equal_to" function for strings? - ruby-on-rails

I want to search for duplicates in my database, but it could be things like
"The smallest thing, and nothing more"
"The Smallest Things, And Nothing More"
"The smallest thing, and nothing more."
"The smallest thing, and nothing"
Is there an easy way to design a fuzzy == function that gives a weight of matching, instead of a binary true/false result?

Ruby ships with a library called did_you_mean it is used to make suggestions for code correction when you make a mistake like "abc".downcsae will ask you "Did you mean downcase?"
This library includes a module called DidYouMean::Levenshtein which has a method called distance. This distance is the number of transformations required for 2 strings to be equal
Example:
s = "The smallest thing, and nothing more"
x = "The Smallest Things, And Nothing More"
DidYouMean::Levenshtein.distance(s,x)
#=> 6
DidYouMean::Levenshtein.distance(s.downcase,x.downcase)
#=> 1
This may be useful in your case although you would need to determine the threshold.
Implementation is also available via the Gem::Text module which you could include in a class if needed e.g.
class MyClass
extend Gem::Text
def self.fuzzy_equal?(x:, y:, threshold:3)
levenshtein_distance(x,y) <= threshold
end
end
MyClass.fuzzy_equal?(x: s,y: x)
#=> false
MyClass.fuzzy_equal?(x: s.downcase,y: x.downcase)
#=> true
MyClass.fuzzy_equal?(x: s,y: x, threshold: 10)
#=> true

Related

Shorter way to convert an empty string to an int, and then clamp

I'm looking for a way to simplify the code for the following logic:
Take a value that is either a nil or an empty string
Convert that value to an integer
Set zero values to the maximum value (empty string/nil are converted to 0 when cast as an int)
.clamp the value between a minimum and a maximum
Here's the long form that works:
minimum = 1
maximum = 10_000
value = value.to_i
value = maximum if value.zero?
value = value.clamp(minimum, maximum)
So for example, if value is "", I should get 10,000. If value is "15", I should get 15. If value is "45000", I should get 10000.
Is there a way to shorten this logic, assuming that minimum and maximum are defined and that the default value is the maximum?
The biggest problem I've had in shortening it is that null-coalescing doesn't work on the zero, since Ruby considers zero a truthy value. Otherwise, it could be a one-liner.
you could still do a one-liner with your current logic
minimum, maximum = 1, 10_000
value = ( value.to_i.zero? ? maximum: value.to_i ).clamp(minimum, maximum)
but not sure if your issue is that if you enter '0' you want 1 and not 10_000 if so then try this
minimum, maximum = 1, 10_000
value = (value.to_i if Float(value) rescue maximum).clamp(minimum, maximum)
Consider Fixing the Input Object or Method
If you're messing with String objects when you expect an Integer, you're probably dealing with user input. If that's the case, the problem should really be solved through input validation and/or looping over an input prompt elsewhere in your program rather than trying to perform input transformations inline.
Duck-typing is great, but I suspect you have a broken contract between methods or objects. As a general rule, it's better to fix the source of the mismatch unless you're deliberately wrapping some piece of code that shouldn't be modified. There are a number of possible refactorings and patterns if that's the case.
One such solution is to use a collaborator object or method for information hiding. This enables you to perform your input transformations without complicating your inline logic, and allowing you to access the transformed value as a simple method call such as user_input.value.
Turning a Value into a Collaborator Object
If you are just trying to tighten up your current method you can aim for shorter code, but I'd personally recommend aiming for maintainability instead. Pragmatically, that means sending your value to the constructor of a specialized object, and then asking that object for a result. As a bonus, this allows you to use a default variable assignment to handle nil. Consider the following:
class MaximizeUnsetInputValue
MIN = 1
MAX = 10_000
def initialize value=MAX
#value = value
set_empty_to_max
end
def set_empty_to_max
#value = MAX if #value.to_i.zero?
end
def value
#value.clamp MIN, MAX
end
end
You can easily validate that this handles your various use cases while hiding the implementation details inside the collaborator object's methods. For example:
inputs_and_expected_outputs = [
[0, 10000],
[1, 1],
[10, 10],
[10001, 10000],
[nil, 10000],
['', 10000]
]
inputs_and_expected_outputs.map do |input, expected|
MaximizeUnsetInputValue.new(input).value == expected
end
#=> [true, true, true, true, true, true]
There are certainly other approaches, but this is the one I'd recommend based on your posted code. It isn't shorter, but I think it's readable, maintainable, adaptable, and reusable. Your mileage may vary.

"For all" using Apache Jenas rule engine

I'm currently working on some small examples about Apache Jena. What I want to show is universal quantification.
Let's say I have balls that each have a different color. These balls are stored within boxes. I now want to determine whether these boxes only contain balls that have the same color of if they are mixed.
So basically something along these lines:
SAME_COLOR = ∃x∀y:{y in Box a → color of y = x}
I know that this is probably not possible with Jena, and can be converted to the following:
SAME_COLOR = ∃x¬∃y:{y in Box a → color of y != x}
With "not exists" Jena's "NoValue" can be used, however, this does (at least for me) not work and I don't know how to translate above logical representations in Jena. Any thoughts on this?
See the code below, which is the only way I could think of:
(?box, ex:isA, ex:Box)
(?ball, ex:isIn, ?box)
(?ball, ex:hasColor, ?color)
(?ball2, ex:isIn, ?box)
(?ball2, ex:hasColor, ?color2)
NotEqual(?color, ?color2)
->
(?box, ex:hasSomeColors, "No").
(?box, ex:isA, ex:Box)
NoValue(?box, ex:hasSomeColors)
->
(?box, ex:hasSomeColors, "Yes").
A box with mixed content now has both values "Yes" and "No".
I've ran into the same sort of problem, which is more simplified.
The question is how to get a collection of objects or count no. of objects in rule engine.
Given that res:subj ont:has res:obj_xxx(several objects), how to get this value in rule engine?
But I just found a Primitive called Remove(), which may inspire me a bit.

Detecting overlapping ranges in Ruby

I have array of ranges :
[[39600..82800], [39600..70200],[70200..80480]]
I need to determine if there is overlapping or not.What is an easy way to do it in ruby?
In the above case the output should be 'Overlapping'.
This is a very interesting puzzle, especially if you care about performances.
If the ranges are just two, it's a fairly simple algorithm, which is also covered in ActiveSupport overlaps? extension.
def ranges_overlap?(r1, r2)
r1.cover?(r2.first) || r2.cover?(r1.first)
end
If you want to compare multiple ranges, it's a fairly interesting algorithm exercise.
You could loop over all the ranges, but you will need to compare each range with all the other possibilities, but this is an algorithm with exponential cost.
A more efficient solution is to order the ranges and execute a binary search, or to use data structures (such as trees) to make possible to compute the overlapping.
This problem is also explained in the Interval tree page. Computing an overlap essentially consists of finding the intersection of the trees.
Is this not a way to do it?
def any_overlapping_ranges(array_of_ranges)
array_of_ranges.sort_by(&:first).each_cons(2).any?{|x,y|x.last>y.first}
end
p any_overlapping_ranges([50..100, 1..51,200..220]) #=> True
Consider this:
class Range
include Comparable
def <=>(other)
self.begin <=> other.begin
end
def self.overlap?(*ranges)
edges = ranges.sort.flat_map { |range| [range.begin, range.end] }
edges != edges.sort.uniq
end
end
Range.overlap?(2..12, 6..36, 42..96) # => true
Notes:
This could take in any number of ranges.
Have a look at the gist with some tests to play with the code.
The code creates a flat array with the start and end of each range.
This array will retain the order if they don't overlap. (Its easier to visualize with some examples than textually explaining why, try it).
For sake of simplicity and readability I'll suggest this approach:
def overlaps?(ranges)
ranges.each_with_index do |range, index|
(index..ranges.size).each do |i|
nextRange = ranges[i] unless index == i
if nextRange and range.to_a & nextRange.to_a
puts "#{range} overlaps with #{nextRange}"
end
end
end
end
r = [(39600..82800), (39600..70200),(70200..80480)]
overlaps?(r)
and the output:
ruby ranges.rb
39600..82800 overlaps with 39600..70200
39600..82800 overlaps with 70200..80480
39600..70200 overlaps with 70200..80480

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so i know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible reponse. However, if you munging your dataset to a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of reponses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you trying to do respectively what your reasons are for coding like this. Thus my advice is more general – so just feel to clarify and I will try to give a more concrete response.
1) I say that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might geht into trouble with checkboxes for example. Let's say someone checks 3 out of 5 possible answers, the next only checks 1 (i.e. "don't know") . Now it will be much harder to create a spreadsheet (data.frame) type of results view as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey(i.e longitudinal study asking the same participants over and over again) . That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv . RMySQL can connect directly to the database and access its tables and more important its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides I TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
now you have a data.frame with 0s, so what you need to do is to get 1s in an appropriate position (this is a bit cumbersome), a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assing 1 according to indexes in z variable
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
for (i in 1:responses) {
e[i,] <- ifelse(c %in% b[[i]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of fourty-zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, becuase SHA1 is cryptographically strong, it would also be computationally unfeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: Your code will permanently be broken if it does.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0x98BADCFE & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us w[0] = 0x9fb498b3, etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at last 32 other input values.
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People doesn't seem realize just how improbable it is (if all humanity focused all of it's current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.

Resources