Ruby compare two strings similarity percentage - ruby-on-rails

I'd like to compare two strings in Ruby and find their similarity.
I've had a look at the Levenshtein gem, but it seems it was last updated in 2008 and I can't find documentation on how to use it. Some blogs suggest it's broken.
I tried the text gem with Levenshtein, but it gives an integer (smaller is better).
Obviously, if the two strings are of variable length, I run into problems with the Levenshtein algorithm (say, comparing two names where one has a middle name and one doesn't).
What would you suggest I do to get a percentage comparison?
Edit: I'm looking for something similar to PHP's similar_text.

I think your question could do with some clarifications, but here's something quick and dirty (calculating as percentage of the longer string as per your clarification above):
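def string_difference_percent(a, b)
  longer = [a.size, b.size].max
  return 0.0 if longer.zero? # two empty strings are identical
  # Count positions where both strings have the same character.
  same = a.each_char.zip(b.each_char).count { |x, y| x == y }
  # Fraction of the longer string that differs (0.0 = identical, 1.0 = nothing matches).
  (longer - same) / longer.to_f
end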
I'm still not sure how much sense this percent difference you are looking for makes, but this should get you started at least.
It's a bit like Levenshtein distance, in that it compares the strings character by character. So if two names differ only by the middle name, they'll actually be very different.
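One other way to get a percentage, different from the positional comparison above, is to normalise a Levenshtein distance by the length of the longer string. The sketch below is plain Ruby with no gems; the method names levenshtein and similarity_percent are made up for illustration, and dividing by the longer string's length is an assumption, not the only sensible normalisation.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping only two rows.
def levenshtein(a, b)
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution (or match)
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: 100.0 means identical, 0.0 means nothing in common.
def similarity_percent(a, b)
  longest = [a.size, b.size].max
  return 100.0 if longest.zero?
  (1 - levenshtein(a, b) / longest.to_f) * 100
end
```

Because insertions count as single edits, a missing middle name lowers the score in proportion to its length instead of making every later character look different.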

There is now a ruby gem for similar_text. https://rubygems.org/gems/similar_text
It provides a method named similar that compares two strings and returns a number representing the percent similarity between them.

I can recommend the fuzzy-string-match gem.
You can use it like this (taken from the docs):
require "fuzzystringmatch"
jarow = FuzzyStringMatch::JaroWinkler.create(:native)
p jarow.getDistance("jones", "johnson")
It will return a score of about 0.832, which indicates how well the two strings match.

Related

Character Replacements

I have a Unicode string UniStr.
I also have a MAP of { UnicodeChar : otherMappedStrs }.
I need the 'otherMappedStrs' version of UniStr.
E.g.: UniStr = 'ABC', MAP = { 'A':'233','B':'#$','C':'9ij' }, Result = '233#$9ij'
I have come up with the formula below, which works:
=ArrayFormula(JOIN("",VLOOKUP(REGEXEXTRACT(A1,REPT("(.)",LEN(A1))),MapRange,2,FALSE)))
The MAP, being a whole character set (40 chars), is quite large.
I need to use this function in multiple spreadsheets. How can I embed the MAP in the formula for portability?
Is there a better way to iterate over a string in a formula than the REGEXEXTRACT method? This method has limitations for long strings.
I also tested the formula below. The problem is that it gives 2 results (one per element of the array passed to SUBSTITUTE); with three substitutions, it gives three results. Can this be resolved?
=ArrayFormula(SUBSTITUTE(A1,{"s","i"},{"#","#"}))
EDIT:
@Tom's first solution appears best for my case: (1) REGEX has an upper limit on search criteria, which your solution avoids; (2) it feels fast (I did not do empirical testing); (3) I believe it is a better way to iterate over string characters (you answered my Q2, thanks).
I digress here. I wish Google would introduce named formulas or formula aliases, hypothetically like the one below. I have sent feedback along those lines many times. Nothing :(
MyFormula($str) == ArrayFormula(join(,vlookup(mid($str,row(indirect("1:"&len($str))),1), { "A","233";"B","#$";"C","9ij" },2,false)))
Not sure how long you want your strings to be, but the more traditional
=ArrayFormula(join(,vlookup(mid(A1,row(indirect("1:"&len(A1))),1), { "A","233";"B","#$";"C","9ij" },2,false)))
seems a bit more robust for long strings.
For a more radical idea, supposing the maximum length of your otherMappedStrings is 3 characters, then you could try:
=ArrayFormula(join(,trim(mid("233 #$9ij",find(mid(A1,row(indirect("1:"&len(A1))),1), "ABC")*3-2,3))))
where I have put a space in before #$ to pad it out to 3 characters.
Incidentally the original VLOOKUP is not case sensitive. If you want this behaviour, use SEARCH instead of FIND.
You seem to have several different Qs, but considering only portability, perhaps something like the following would help:
=join(,switch(arrayformula(regexextract(A1&"",rept("(.)",len(A1)))),"A",233,"B","#$","C","9ij"))
extended with 37 more pairs.

How to get a % difference of two NSStrings

I'm thinking this may be impossible to do reasonably, but I figured I would take a shot at it. So let's say I have two NSStrings. One is @"Singin' In The Rain" and the other is @"Singing In The Rain". These strings are very similar, but have a small difference. I'm trying to find a way where I could write something like the following:
NSString *stringOne = @"Singin' In The Rain";
NSString *stringTwo = @"Singing In The Rain";
float dif = [stringOne differenceFrom:stringTwo];
//dif = .9634 or something like that
One project that I did find similar to this was taken from the previous similar question on Stack Overflow: Check if two NSStrings are similar. However, this simply returns a BOOL which isn't as accurate as I need it to be. I also tried looking into the compare: documentation for NSString but it all looked too basic. Another similar thing I found is at https://gist.github.com/iloveitaly/1515464. However, this gives varying results, even saying two of the same string are different occasionally. Any advice would be much appreciated.
The question is a little vague, but I would assume that the most satisfactory results will come from using NSLinguisticTagger. If you parse each for tags with the NSLinguisticTagSchemeLexicalClass scheme then your string will be broken down into verbs, nouns, adjectives, etc. In your example, even if you weren't spotting that singin' and singing are the same, you'd spot the other three words are the same and that the thing at the end is a noun, so they're both about doing something in the same thing.
It'd probably be wise to use something like a BK-Tree to compare individual words where you suspect there may be a match (a noun obviously doesn't match an adverb but two nouns may match even if spellings differ).
Another off the wall suggestion:
The source, and hence the algorithm, for diff and similar programs is easily available. These compare input on a line-by-line basis and detect insertions, deletions and changes.
When comparing text strings for "closeness" then the insertion, deletion or changing of words seems as good a measure as any.
So:
Break each string into "words" (white space separated should be sufficient).
Compare the two lists using the diff algorithm, treating each "word" as a "line". Use a re-sync length of 1 (the number of "lines" that need to be the same to treat the two inputs as back in sync).
Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.
For the two example strings this would give 1:4 changes or 75% similar.
If you want greater granularity for each change, split the two differing words into characters and repeat the algorithm. This gives you the fraction by which the word is similar (as opposed to counting the whole word as changed).
For the two example strings this would give 3 6/7 words out of 4, or 96% similar.
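The word-level and character-level figures above can be reproduced with a rough sketch. It is written in Ruby rather than Objective-C, and it substitutes a longest-common-subsequence count for a full diff implementation (an assumption; diff tools are commonly built on the same idea, but this is not the diff source itself). The names lcs_length and closeness are invented for illustration.

```ruby
# Length of the longest common subsequence of two arrays.
def lcs_length(a, b)
  prev = Array.new(b.size + 1, 0)
  a.each do |x|
    curr = [0]
    b.each_with_index do |y, j|
      curr << (x == y ? prev[j] + 1 : [prev[j + 1], curr[j]].max)
    end
    prev = curr
  end
  prev.last
end

# Fraction of items left unchanged, relative to the longer list.
def closeness(a, b)
  longest = [a.size, b.size].max
  return 1.0 if longest.zero?
  lcs_length(a, b) / longest.to_f
end

one = "Singin' In The Rain"
two = "Singing In The Rain"

word_level = closeness(one.split, two.split)           # 3 of 4 words unchanged
char_level = closeness("Singin'".chars, "Singing".chars) # 6 of 7 characters unchanged
refined    = (3 + char_level) / 4                        # 3 6/7 words out of 4
```

Here word_level is 0.75 and refined is roughly 0.96, matching the two figures worked out by hand.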
I'd recommend dynamic time warping for such comparisons:
http://en.wikipedia.org/wiki/Dynamic_time_warping
This will, however, return the distance between two strings (so you'll get 0 for identical strings), but this is the best starting point I can think of.

Is it possible to solve for an input value of a hash in ruby if all other variables and output are known? (In ruby)

This question is a little obscure. I'm trying to find out if it's possible to "solve" for a value fed into a hash function in Ruby. It looks like this:
I have:
@hash = Digest::SHA512.hexdigest(value1 + value2 + value3)
value2 and value3 are known, and the value of @hash is known. value1 is unknown. In this situation, is it possible to solve for value1 in Ruby, or would this require a ton of computing power/time?
The only way to do this is brute force:
1. Guess a possible value for value1.
2. Compute the hash.
3. Check if it matches the target hash. If not, go to 1.
This is only feasible if value1 is easy enough to guess. GPUs are faster at this than CPUs, so you'd probably use a bunch of ATI GPUs to attack this.
Not having a cheap way to compute an input matching a given output is an essential property of a secure hash function, which is called first pre-image resistance. For SHA-512 we know no way faster than brute-force to do this.
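The guess, hash, compare loop described above might look like the following minimal Ruby sketch. The method name find_value1 and the candidates argument are hypothetical; the sketch assumes all three values are strings and that you can enumerate a realistic list of guesses, which is exactly the part that becomes infeasible for an unpredictable value1.

```ruby
require "digest"

# Returns the first candidate whose combined hash equals target_hex, or nil.
def find_value1(target_hex, value2, value3, candidates)
  candidates.find do |guess|
    Digest::SHA512.hexdigest(guess + value2 + value3) == target_hex
  end
end

# Usage with made-up values: the "unknown" is one of a small word list.
target = Digest::SHA512.hexdigest("secret" + "known2" + "known3")
found  = find_value1(target, "known2", "known3", %w[password letmein secret])
```

The run time grows linearly with the number of candidates, so this only helps when the space of plausible values is small.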
If v2 and v3 are integers, you could theoretically attempt to brute-force it by running through numbers until the hashes match, then subtracting v2 and v3. If your set of possible numbers is all real numbers, though, this would be extremely hard, and you'd be better off running it on multiple machines, each covering a different subsection of the range. That's your best bet, and that's assuming the values are integers.

java auto correct pattern-matcher - which item is the most similar in a given set?

I was wondering how to implement the following problem: Say I have a 'set' of Strings and I wish to know which one is the most related to a given value.
Example:
String value= "ABBCCE";
Set contains: {"JJKKLL", "ABBCC", "AAPPFFEE", "AABBCCDD", "ABBCEE", "AABBCCEE"}
By 'most related' I assume there could be many options (a valid one could be either of the last 2), but at least we can ignore some items (JJKKLL).
What should the approach be to solve this kind of problem (such that, at minimum, a result like AABBCCEE would be acceptable)?
Any java code would be appreciated :-)
You could try using the Levenshtein Distance between your "target" string (e.g. "ABBCCE") and each element in your set. Pick a maximum threshold above which you will consider items to be unrelated (in your example here, a threshold of one or two perhaps), and reject everything in the set that has a Levenshtein Distance greater than that from the target string.
An example implementation of the Levenshtein Distance computation in Java can be found here.
You may be interested in the Levenshtein distance metric, which measures similarities between two strings, including insertions and removals.
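The threshold idea can be sketched as follows. This is in Ruby rather than the Java the question asks for, and the names edit_distance and related_items, as well as the default threshold of 2, are assumptions made for illustration.

```ruby
# Levenshtein distance via dynamic programming, keeping two rows.
def edit_distance(a, b)
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Returns [item, distance] pairs within the threshold, closest first.
def related_items(target, items, threshold = 2)
  items.map { |item| [item, edit_distance(target, item)] }
       .select { |_, distance| distance <= threshold }
       .sort_by { |_, distance| distance }
end

set = %w[JJKKLL ABBCC AAPPFFEE AABBCCDD ABBCEE AABBCCEE]
matches = related_items("ABBCCE", set)
```

With the example set, ABBCC and ABBCEE are each one edit away and AABBCCEE is two edits away, so those three survive a threshold of 2 while the rest are rejected.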

Best practice for determining the probability that 2 strings match

I need to write code to determine if 2 strings match when one of the strings may contain a small deviation from the second string e.g. "South Africa" v "South-Africa" or "England" v "Enlgand". At the moment, I am considering the following approach
Determine the percentage of characters in string 1 that match those in string 2
Determine the true probability of the match by combining the result of 1 with a comparison of the length of the 2 strings e.g. although all the characters in "SA" are found in "South Africa" it is not a very likely match since "SA" could be found in a range of other country names as well.
I would appreciate hearing what current best practice is for performing such string matching.
You can look at Levenshtein distance. This is the distance between two strings. Identical strings have a distance of 0. Strings such as kitten and sitten have a distance of 1, and so on. Distance is measured as the minimal number of simple operations that transform one string into the other.
More information and the algorithm in pseudo-code are given in the link.
I also remember that this topic was covered in Game Programming Gems 6, article 1.6, "Closest-String Matching Algorithm".
To make fuzzy string matching ideal, it's important to know about the context of the strings. When it's just about small typos, Levenshtein can be good enough. When it's about misheard sounds, you can use a phonetic algorithm like Soundex or Metaphone.
Most times, you need a combination of the following algorithms, and some more specific manually written stuff.
Needleman-Wunsch
Soundex
Metaphone
Levenshtein distance
Bitmap
Hamming distance
There is no best fuzzy string matching algorithm. It's all about the context it's used in, so you need to tell us where you want to use the string matching.
Don't reinvent the wheel. Wikipedia has the Levenshtein algorithm which has metrics for what you want to do.
http://en.wikipedia.org/wiki/Levenshtein_distance
There's also Soundex, but that might be too simplistic for your requirements.
Soundex proved to work nicely for me.
With a small tweak or two to the implementation, Soundex matching can check whether two strings in different languages sound the same.
Objective-C Soundex implementation:
http://www.cocoadev.com/index.pl?NSStringSoundex
I've found an Objective-C implementation of the Levenshtein Distance Algorithm here. It works great for me.
