How do Ruby min and max compare strings?

I'm playing around with iterators. What is Ruby comparing using max and min?
If I have this array:
word_array = ["hi", "bob", "how's", "it", "going"]
and I run:
puts word_array.max
puts word_array.min
I expect to get "going" or "how's" for max, since they're both five characters long, or "going" on the theory that it's the last item in the array. I expect min to return "hi" or "it", since they're tied for the shortest string. Instead, I get back:
puts word_array.max -> it
puts word_array.min -> bob
What is Ruby measuring to make this judgement? It selected the shortest string for max and a middle length string for min.

Actually, you are (kind of) asking the wrong question. max and min are defined on Enumerable. They don't know anything about Strings.
They use the <=> combined comparison operator (aka the "spaceship"). Note: pretty much any method that compares two objects will do so using the <=> operator. That's a general rule in Ruby that you can rely on when using objects other people have written, and that you should adhere to when writing your own: if your objects are going to be compared to one another, implement the <=> operator.
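For example, here is a minimal sketch of a class whose instances work with max and min simply because it defines <=> (the class and names are just for illustration):
class Word
  include Comparable # illustrative: Comparable builds <, ==, >, etc. on top of <=>
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # Compare words by length, just to show that <=> alone defines the ordering.
  def <=>(other)
    text.length <=> other.text.length
  end
end

words = [Word.new("hi"), Word.new("going"), Word.new("bob")]
puts words.max.text # => "going", because max only consults <=>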
So, the question you should be asking is, how does String#<=> compare Strings? Unfortunately, the answer is not quite clear from the documentation. However, it does mention that the shorter string is considered to be less than the longer string if the two strings are equal up to that point. So, it is clear that length is used only as a tie-breaker, not as the primary criterion (as you are assuming in your question).
And really, lexicographic ordering is just the most natural thing to do. If I gave you a list of words and asked you to order them without giving you any criteria, would you order them by length or alphabetically? (Note, however, that this means that '20' is less than '3'.)
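To see it concretely, here is a quick irb-style sketch with the strings from the question:
"bob" <=> "going"   # => -1, "b" comes before "g"
"how's" <=> "it"    # => -1, "h" comes before "i"
"20" <=> "3"        # => -1, the character "2" comes before "3"
["hi", "bob", "how's", "it", "going"].sort
# => ["bob", "going", "hi", "how's", "it"], so min is "bob" and max is "it"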

I believe it's doing a lexicographic sort, exactly like dictionary order. The maximum would be the word that comes last in the dictionary.

http://ruby-doc.org/core-2.1.0/Enumerable.html#method-i-max_by
ar = ['one','two','three','four','five']
ar.max_by(&:length) # => "three"
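The companion min_by works the same way: ar.min_by(&:length) # => "one"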

Related

Character Replacements

I have a Unicode string UniStr.
I also have a MAP of { UniCodeChar : otherMappedStrs }
I need the 'otherMappedStrs' version of UniStr.
Eg: UniStr = 'ABC', MAP = { 'A':'233','B':'#$','C':'9ij' }, Result = '233#$9ij'
I have come up with the formula below, which works:
=ArrayFormula(JOIN("",VLOOKUP(REGEXEXTRACT(A1,REPT("(.)",LEN(A1))),MapRange,2,FALSE)))
The MAP being a whole character set (40 chars) is quite large.
I need to use this function in multiple spreadsheets. How can I subsume the MAP into the formula for portability?
Is there a better way to iterate over a string than the REGEXEXTRACT method in the formula? That method has limitations for long strings.
I also tested the formula below. The problem here is that it gives 2 results (or as many results as the size of the array within the SUBSTITUTE replacement): if 3 substitutions are made, it gives three results. Can this be resolved?
=ArrayFormula(SUBSTITUTE(A1,{"s","i"},{"#","#"}))
EDIT:
@Tom's first solution appears best for my case: (1) REGEX has an upper limit on search criteria, which does not hinder your solution; (2) it feels fast (I did not do empirical testing); (3) this is a better way to iterate over string characters, I believe (you answered my Q2, thanks).
I digress here. I wish Google would introduce Named Formulas or Formula Aliases, hypothetically as below. I have sent feedback along those lines many times. Nothing :(
MyFormula($str) == ArrayFormula(join(,vlookup(mid($str,row(indirect("1:"&len($str))),1), { "A","233";"B","#$";"C","9ij" },2,false)))
Not sure how long you want your strings to be, but the more traditional
=ArrayFormula(join(,vlookup(mid(A1,row(indirect("1:"&len(A1))),1), { "A","233";"B","#$";"C","9ij" },2,false)))
seems a bit more robust for long strings.
For a more radical idea, supposing the maximum length of your otherMappedStrings is 3 characters, then you could try:
=ArrayFormula(join(,trim(mid("233 #$9ij",find(mid(A1,row(indirect("1:"&len(A1))),1), "ABC")*3-2,3))))
where I have put a space in before #$ to pad it out to 3 characters.
Incidentally, the original VLOOKUP is not case sensitive, but FIND is. If you want case-insensitive behaviour, use SEARCH instead of FIND.
You seem to have several different Qs, but considering only portability, perhaps something like the following would help:
=join(,switch(arrayformula(regexextract(A1&"",rept("(.)",len(A1)))),"A",233,"B","#$","C","9ij"))
extended with 37 more pairs.

What are buckets in terms of hash functions?

Looking at the book Mining of Massive Datasets, section 1.3.2 has an overview of hash functions. Without a computer science background, this is quite new to me; Ruby was my first language, where a hash seems to be equivalent to Dictionary<object, object>, and I had never considered how this kind of data structure is put together.
The book mentions hash functions, as a means of implementing these dictionary data structures. This paragraph:
First, a hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of hash functions that they "randomize" hash-keys.
What exactly are buckets in terms of a hash function? It sounds like buckets are array-like structures, and that the hash function is some kind of algorithm / array-like-structure search that produces the same bucket number every time? What is inside this metaphorical bucket?
I've always read that JavaScript objects / Ruby hashes / etc. don't guarantee order. In practice I've found that keys' order doesn't change (actually, I think using an older version of Mozilla's Rhino interpreter the JS object order DID change, but I can't be sure...).
Does that mean that hashes (Ruby) / objects (JS) ARE NOT resolved by these hash functions?
Does the word hashing take on different meanings depending on the level at which you are working with computers? i.e. it would seem that a Ruby hash is not the same as a C++ hash...
When you hash a value, any useful hash function generally has a smaller range than the domain. This means that out of a large list of input values (for example all possible combinations of letters) it will output any of a smaller list of values (a number capped at a certain length). This means that more than one input value can map to the same output value.
When this is the case, the output values are referred to as buckets.
Consider the function f(x) = x mod 2
This generates the following outputs:
1 => 1
2 => 0
3 => 1
4 => 0
In this case there are two buckets (1 and 0), with a bunch of input values that fall into each.
A good hash function will fill all of these 'buckets' equally, which enables faster searching. If you take the mod of any number, you get the bucket to look into, and thus have to search through fewer results than if you searched the whole input set, since each bucket has fewer results in it than the full set of inputs. In the ideal situation, the hash is fast to calculate and there is only one result in each bucket; lookups then take only as long as applying the hash function.
This is a simplified example of course, but hopefully you get the idea.
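In Ruby you can sketch the same idea in a couple of lines (illustrative only):
numbers = [1, 2, 3, 4, 5, 6]
buckets = numbers.group_by { |x| x % 2 } # => {1=>[1, 3, 5], 0=>[2, 4, 6]}

# To test membership, hash first, then scan only that bucket:
buckets[5 % 2].include?(5) # => true, after checking 3 values instead of 6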
The concept of a hash function is always the same. It's a function that calculates some number to represent an object. The properties of this number should be:
it's relatively cheap to compute
it's as different as possible for all objects.
Let's give a really artificial example to show what I mean by this and why/how hashes are usually used.
Take all natural numbers. Now let's assume it's expensive to check if 2 numbers are equal.
Let's also define a relatively cheap hash function as follows:
hash = number % 10
The idea is simple, just take the last digit of the number as the hash. In the explanation you got, this means we put all numbers ending in 1 into an imaginary 1-bucket, all numbers ending in 2 in the 2-bucket etc...
Those buckets don't really exist as a data structure. They just make it easy to reason about the hash function.
Now that we have this cheap hash function, we can use it to reduce the cost of other things. For example, we want to create a new data structure to enable cheap searching of numbers. Let's call this data structure a hashmap.
Here we actually put all the numbers with hash=1 together in a list/set/..., we put the numbers with hash=5 into their own list/set ... etc.
And if we then want to look up some number, we first calculate its hash value. Then we check the list/set corresponding to this hash, and compare only "similar" numbers to find the exact number we want. This means we only had to do a cheap hash calculation and then check 1/10th of the numbers with the expensive equality check.
Note here that we use the hash function to define a new datastructure. The hash itself isn't a datastructure.
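A tiny Ruby sketch of that hashmap idea, using the last-digit hash from above (names and structure are mine, purely for illustration):
class ToyHashmap
  def initialize
    @buckets = Array.new(10) { [] } # one bucket per possible last digit
  end

  def add(number)
    @buckets[number % 10] << number
  end

  def include?(number)
    # Cheap hash first, then the "expensive" equality check on ~1/10th of the data.
    @buckets[number % 10].include?(number)
  end
end

map = ToyHashmap.new
[31, 41, 59, 26].each { |n| map.add(n) }
map.include?(41) # => true, found by searching only the 1-bucket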
Consider a phone book.
Imagine that you wanted to look for Donald Duck in a phone book.
It would be very inefficient to look at every page, and every entry on that page. So rather than doing that, we do the following:
We create an index
We create a way to obtain an index key from a name
For a phone book, the index goes from A-Z, and the function used to get the index key is just taking the first letter of the surname.
In this case, the hashing function takes Donald Duck and gives you D.
Then you take D and go to the index where all the people whose surnames start with D are.
That would be a very oversimplified way to put it.
Let me explain in simple terms. Buckets come into the picture when handling collisions using the chaining technique (open hashing, aka closed addressing).
Here, each array entry corresponds to a bucket, and each nonempty array entry holds a pointer to the head of a linked list. (Each bucket is implemented as a linked list.)
The hash function is used by the hash table to calculate an index into this array of buckets, from which the desired value can be found.
That is, to check whether an element is in the hash table, the key is first hashed to find the correct bucket to look into. Then the corresponding linked list is traversed to locate the desired element.
Similarly, for any addition or deletion, hashing is used to find the appropriate bucket. The bucket is then checked for the presence or absence of the required element, which is added or removed by traversing the corresponding linked list.
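Here is a rough Ruby sketch of that chaining scheme (arrays stand in for the linked lists; purely illustrative, not production code):
class ChainedHashTable
  def initialize(bucket_count = 8)
    @buckets = Array.new(bucket_count) { [] } # each bucket chains colliding entries
  end

  def []=(key, value)
    bucket = @buckets[key.hash % @buckets.size] # hash to find the right bucket
    pair = bucket.find { |k, _| k == key }      # traverse the chain
    pair ? pair[1] = value : bucket << [key, value]
  end

  def [](key)
    bucket = @buckets[key.hash % @buckets.size]
    pair = bucket.find { |k, _| k == key }
    pair && pair[1]
  end
end

table = ChainedHashTable.new
table["surname"] = "Duck"
table["surname"] # => "Duck"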

How to get a % difference of two NSStrings

I'm thinking this may be impossible to do reasonably, but I figured I would take a shot at it. So let's say I have two NSStrings. One is @"Singin' In The Rain" and the other is @"Singing In The Rain". These strings are very similar, but have a small difference. I'm trying to find a way where I could write something like the following:
NSString *stringOne = @"Singin' In The Rain";
NSString *stringTwo = @"Singing In The Rain";
float dif = [stringOne differenceFrom:stringTwo];
//dif = .9634 or something like that
One project that I did find similar to this was taken from a previous similar question on Stack Overflow: Check if two NSStrings are similar. However, this simply returns a BOOL, which isn't as accurate as I need it to be. I also tried looking at the compare: documentation for NSString, but it all looked too basic. Another similar thing I found is at https://gist.github.com/iloveitaly/1515464. However, this gives varying results, occasionally even saying two identical strings are different. Any advice would be much appreciated.
The question is a little vague, but I would assume that the most satisfactory results will come from using NSLinguisticTagger. If you parse each for tags with the NSLinguisticTagSchemeLexicalClass scheme then your string will be broken down into verbs, nouns, adjectives, etc. In your example, even if you weren't spotting that singin' and singing are the same, you'd spot the other three words are the same and that the thing at the end is a noun, so they're both about doing something in the same thing.
It'd probably be wise to use something like a BK-Tree to compare individual words where you suspect there may be a match (a noun obviously doesn't match an adverb but two nouns may match even if spellings differ).
Another off the wall suggestion:
The source, and hence the algorithm, for diff and similar programs is easily available. These compare input on a line-by-line basis and detect insertions, deletions and changes.
When comparing text strings for "closeness" then the insertion, deletion or changing of words seems as good a measure as any.
So:
Break each string into "words" (white space separated should be sufficient).
Compare the two lists using the diff algorithm, treating each "word" as a "line", use a re-sync length of 1 (the number of "lines" that need to be the same to treat the two inputs as back in sync)
Calculate the "closeness" as the number of insertions/deletions/changes compared to the total word count.
For the two example strings this would give 1:4 changes or 75% similar.
If you want greater granularity for each change, split the two words into characters and repeat the algorithm, giving you the fraction by which each word is similar (as opposed to counting the whole word as changed).
For the two example strings this would give 3 6/7 words out of 4, or 96% similar.
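The idea is language-agnostic; here is a rough sketch in Ruby (mine, using word-level Levenshtein distance rather than the literal diff re-sync algorithm) that reproduces the 75% figure above:
def edit_distance(a, b)
  # Classic dynamic-programming edit distance over any two arrays, here arrays of words.
  dist = Array.new(a.size + 1) { |i| [i] + [0] * b.size }
  (0..b.size).each { |j| dist[0][j] = j }
  a.each_index do |i|
    b.each_index do |j|
      cost = a[i] == b[j] ? 0 : 1
      dist[i + 1][j + 1] = [dist[i][j + 1] + 1,     # deletion
                            dist[i + 1][j] + 1,     # insertion
                            dist[i][j] + cost].min  # change
    end
  end
  dist[a.size][b.size]
end

def word_similarity(s1, s2)
  words1, words2 = s1.split, s2.split
  changes = edit_distance(words1, words2)
  1.0 - changes.to_f / [words1.size, words2.size].max
end

word_similarity("Singin' In The Rain", "Singing In The Rain") # => 0.75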
I'd recommend dynamic time warping for such comparisons:
http://en.wikipedia.org/wiki/Dynamic_time_warping
This will however return the distance between the two strings (so you'll get 0 for identical strings), but it's the best starting point I can think of.

Ruby compare two strings similarity percentage

I'd like to compare two strings in Ruby and find their similarity.
I've had a look at the Levenshtein gem, but it seems it was last updated in 2008 and I can't find documentation on how to use it, with some blogs suggesting it's broken.
I tried the text gem with Levenshtein, but it gives an integer (smaller is better).
Obviously, if the two strings are of variable length I run into problems with the Levenshtein algorithm (say, comparing two names where one has a middle name and one doesn't).
What would you suggest I do to get a percentage comparison?
Edit: I'm looking for something similar to PHP's similar_text.
I think your question could do with some clarifications, but here's something quick and dirty (calculating as percentage of the longer string as per your clarification above):
def string_difference_percent(a, b)
  longer = [a.size, b.size].max
  # Count positions where the two strings share the same character.
  same = a.each_char.zip(b.each_char).count { |c1, c2| c1 == c2 }
  # Divide by the longer length, per the "percentage of the longer string" note above.
  (longer - same) / longer.to_f
end
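For example, with the two movie titles from the previous question (both 19 characters, differing only at the g/apostrophe position):
string_difference_percent("Singing In The Rain", "Singin' In The Rain")
# => 0.0526... (1 differing position out of 19), i.e. about 5% different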
I'm still not sure how much sense this percent difference you are looking for makes, but this should get you started at least.
It's a bit like Levenshtein distance, in that it compares the strings character by character. So if two names differ only by a middle name, they'll actually come out as very different.
There is now a Ruby gem for this, similar_text: https://rubygems.org/gems/similar_text
It provides a similar method that compares two strings and returns a number representing the percent similarity between them.
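Usage is roughly as follows (a sketch going by the gem's README, which documents a similar method on String):
require 'similar_text'

puts "Singing In The Rain".similar("Singin' In The Rain")
# prints the percent similarity between the two strings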
I can recommend the fuzzy-string-match gem.
You can use it like this (taken from the docs):
require "fuzzystringmatch"
jarow = FuzzyStringMatch::JaroWinkler.create(:native)
p jarow.getDistance("jones", "johnson")
It will return a score of ~0.832, which tells you how well those strings match.

The ordinal parsing problem

Rails has a nice function, ordinalize, which converts an integer to a friendly string representation. Namely 1 becomes 1st, 2 becomes 2nd, and so on. My question is how might one implement the inverse feature?
To be more general I'd like to handle both of the following cases:
>> s = "First"
>> s.integerize
=> 1
>> s = "1st"
>> s.integerize
=> 1
I am looking for a smart way to do this, as opposed to a giant lookup table or just hacking off the last two characters. Any ideas would be appreciated.
to_i does essentially half of that:
"72nd".to_i
=> 72
It doesn't check validity, but if you need to fail on bad input like "72x", you can just re-ordinalize and compare to the original input string.
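For example, a quick sketch of that round-trip check using ActiveSupport's ordinalize (the method name here is mine):
require 'active_support/core_ext/integer/inflections'

def valid_ordinal?(input)
  input.to_i.ordinalize == input # re-ordinalize and compare to the original
end

valid_ordinal?("72nd") # => true
valid_ordinal?("72x")  # => false
valid_ordinal?("72th") # => false, 72 re-ordinalizes to "72nd"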
For parsing ordinal words, Wikipedia seems impressively helpful.
The first case is relatively hard - I'd say the smart way to do it is find someone who's already done it and use their code. If you can't find someone, the next smartest thing would probably be restating (or renegotiating) the problem so that it's not needed. Beyond that, I think you're into parser-writing...
The second case is as trivial as the to_i already offered. You could also use a regex, I suppose:
"1000000th".scan(/\d+/).first.to_i #=> 1000000
