What are buckets in terms of hash functions? - hash-function

Looking at the book Mining of Massive Datasets, section 1.3.2 has an overview of Hash Functions. Without a computer science background, this is quite new to me; Ruby was my first language, where a hash seems to be equivalent to Dictionary<object, object>. And I had never considered how this kind of datastructure is put together.
The book mentions hash functions, as a means of implementing these dictionary data structures. This paragraph:
First, a hash function h takes a hash-key value as an argument and produces
a bucket number as a result. The bucket number is an integer, normally in the
range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any
type. There is an intuitive property of hash functions that they “randomize”
hash-keys
What exactly are buckets in terms of a hash function? it sounds like buckets are array-like structures, and that the hash function is some kind of algorithm / array-like-structure search that produces the same bucket number every time? What is inside this metaphorical bucket?
I've always read that javascript objects/ruby hashes/ etc don't guarantee order. In practice I've found that keys' order doesn't change (actually, I think using an older version of Mozilla's Rhino interpreter that the JS object order DID change, but I can't be sure...).
Does that mean that hashes (Ruby) / objects (JS) ARE NOT resolved by these hash functions?
Does the word hashing take on different meanings depending on the level at which you are working with computers? i.e. it would seem that a Ruby hash is not the same as a C++ hash...

When you hash a value, any useful hash function generally has a smaller range than the domain. This means that out of a large list of input values (for example all possible combinations of letters) it will output any of a smaller list of values (a number capped at a certain length). This means that more than one input value can map to the same output value.
When this is the case, the output values are refered to as buckets.
Consider the function f(x) = x mod 2
This generates the following outputs;
1 => 1
2 => 0
3 => 1
4 => 0
In this case there are two buckets (1 and 0), with a bunch of input values that fall into each.
A good hash function will fill all of these 'buckets' equally, and so enable faster searching etc. If you take the mod of any number, you get the bucket to look into, and thus have to search through less results than if you just searched initially, since each bucket has less results in it than the whole set of inputs. In the ideal situation, the hash is fast to calculate and there is only one result in each bucket, this enables lookups to take only as long as applying the hash function takes.
This is a simplified example of course but hopefully you get the idea?

The concept of a hash function is always the same. It's a function that calculates some number to represent an object. The properties of this number should be:
it's relatively cheap to compute
it's as different as possible for all objects.
Let's give a really artificial example to show what I mean with this and why/how hashes are usually used.
Take all natural numbers. Now let's assume it's expensive to check if 2 numbers are equal.
Let's also define a relatively cheap hash function as follows:
hash = number % 10
The idea is simple, just take the last digit of the number as the hash. In the explanation you got, this means we put all numbers ending in 1 into an imaginary 1-bucket, all numbers ending in 2 in the 2-bucket etc...
Those buckets don't really exists as data structure. They just make it easy to reason about the hash function.
Now that we have this cheap hash function we can use it to reduce the cost of other things. For example, we want to create a new datastructure to enable cheap searching of numbers. Let's call this datastructure a hashmap.
Here we actually put all the numbers with hash=1 together in a list/set/..., we put the numbers with hash=5 into their own list/set ... etc.
And if we then want to lookup some number, we first calculate it's hash value. Then we check the list/set corresponding to this hash, and then compare only "similar" numbers to find our exact number we want. This means we only had to do a cheap hash calculation and then have to check 1/10th of the numbers with the expensive equality check.
Note here that we use the hash function to define a new datastructure. The hash itself isn't a datastructure.

Consider a phone book.
Imagine that you wanted to look for Donald Duck in a phone book.
It would be very inefficient to have to look every page, and every entry on that page. So rather than doing that, we do the following thing:
We create an index
We create a way to obtain an index key from a name
For a phone book, the index goes from A-Z, and the function used to get the index key, is just getting first letter from the Surname.
In this case, the hashing function takes Donald Duck and gives you D.
Then you take D and go to the index where all the people with Surnames starting with D are.
That would be a very oversimplified way to put it.

Let me explain in simple terms. Buckets come into picture while handling collisions using chaining technique ( Open hashing or Closed addressing)
Here, each array entry shall correspond to a bucket and each array entry (if nonempty) will be having a pointer to the head of the linked list. (The bucket is implemented as a linked list).
The hash function shall be used by hash table to calculate an index into an array of buckets, from which the desired value can be found.
That is, while checking whether an element is in the hash table, the key is first hashed to find the correct bucket to look into. Then, the corresponding linked list is traversed to locate the desired element.
Similarly while any element addition or deletion, hashing is used to find the appropriate bucket. Then, the bucket is checked for presence/absence of required element, and accordingly it is added/removed from the bucket by traversing corresponding linked list.

Related

Finding duplicate cases, string-variable, SPSS

Being a novel on SPSS I am struggling with finding duplicate cases based on a string-variable in a dataset containing approx 33,000 cases.
I have a variable named "nr" that is supposed to be unique id for every case. However, it turns out that some cases might have two different values in "nr" entered,the only difference being the last character. Resulting in a case being shown as two separate rows.
The structure of the var "nr" is a as follows: XX-XXXXXXX-X or X-XXXXXXX-X i.e 2-7-1 characters or 1-7-1 characters.
I would like to sort out all cases that have a "nr" equal to another case except for the last character.
To illustrate, with a succesfull syntax I would hopefully be able to sort cases like these out from the whole dataset:
20-4026988-2
20-4026988-3
5-4026992-5
5-4026992-8
20-4027281-2
20-4027281-3
Anyone have an idea on how to make a syntax for this? Would be so grateful for any input!
I suggest to create a new variable without that last character, and then look for the doubles:
* first creating some sample data to play with.
data list list/ID (a15).
begin data.
20-4026988-2
12-2345678-7
20-4026988-3
5-4026992-5
5-4026992-8
12-1234567-1
20-4027281-2
6-1234567-1
20-4027281-3
end data.
* now creating the new variable and counting the occurrences of each shortened ID.
string ShortID (a15).
compute ShortID=char.substr(ID,1,char.rindex(ID,"-")).
* also possible: compute ShortID=char.substr(ID,1,char.length(rtrim(ID))-1).
aggregate out=* mode=add /break=ShortID/occurrences=n.
* at this point you can filter based on the number or `occurrences` or sort them.
sort cases by occurrences (d) ShortID.
After removing the last character, you can use Data > Identify Duplicate Cases to find the dups. It as a number of useful options for this.

How do I efficiently search through an ordered list?

I have a function that predicts a words being typed and returns the possibilities in an array. Unfortunately those aren’t sorted by frequency used. So I have a list of 10K ordered words listed by most frequent to less frequent. What would be an efficient way to compare the words in the array and the ordered list to return the most frequent one? (i.e the one it encounters first?)
I was tipped off by a friend to use a binary search tree but I really don't see how that helps me. From what I understood from the following website, only numerical values can be used.. Am I wrong in thinking so? Is there a better way of doing the aforementioned task?
Thanks in advance
You could create a dictionary with words as keys and frequencies as values. Then iterate over your result array, use the dictionary to obtain the frequency value for each item, and predict the item with the highest frequency.
I wouldn't use a vanilla binary search tree here. It would be possible - as Taylor Kirkpatrick says, you could just create a tree with words as keys and frequencies and use that to find the frequency for each result word, in much the same way as the dictionary solution.
The problem is that you cannot guarantee that a simple binary tree will be balanced. From the sound of it your data would probably be OK, since your words are in frequency order. The worst case would be if the words were in alphabetic order - then your binary tree would end up being identical to a linked list - it would never branch, since every node would attach to the right of the previous one. So the computational complexity of a search would be the same as iterating over the array of words - O(n) instead of O(log2N) (which is the best case for binary trees).
Of course, you could guard against this by randomising the list of words before doing the insert. But to my mind it's just easier to use a dictionary. I don't know what the actual implementation of Swift dictionaries is (and we won't until they open source it in a couple of months), but you can take it as read that it will out perform a vanilla BT for value retrieval.
I don't know what the background to this problem is - if you are learning CS it might be worth implementing the BST just for intellectual growth - in this case, with only 10,000 items you might find the performance differences are ultimately quite small. But if you are a working programmer trying to solve a problem, go with the dictionary approach.
You put all your words into a dictionary or a set. That's it. Dictionary if you have data associated with the words, set if you have no data and just want to know if the word is in the list or not.
You might want to use a Trie.
Put your word list into it. For every character entered, you traverse the Trie as deeply as you can and then show all paths to leaf nodes as possible completions.
Since the world like you have is likely static, you can precompute the Trie and load from disk/network/whatever at startup if performance is a concern.
You can use a binary search tree with anything as the actual value. To actually make use of the tree, use the frequency of the words as the numerical value. This is actually a pretty good solution to your problem. Each node of the tree will contain this word and a numeric value that represents the frequency of the word.
Here are a few links to help you out with making it.
Hope that helps.

Is there a clear algorithm for sorting/ordering the loops in an X12 file?

Even though loops are kind of a logical concept in X12 (not directly physically represented in the text), every transaction set defines a set of loops that it can contain, including identifiers for the loops and an ordering for them. My question is, what is the rule for sorting loops, generically? Is there a concise set of rules that can be expressed in some code that should be able to take a collection of loops (with known identifiers such as 1000A, 2300BB, etc) and properly sort them?
The context of my question is that I'm working on a general-purpose library that applications will use to construct a model of an X12 document/transaction-set (and write out the text such a model represents). It has objects to represent Elements, Segments, and Loops. Ordering of Segments in a particular Loop is easy, they're dictated by the Implementation Guides. But I'm trying to get Loop ordering (within a Transaction Set) to work generically; that's what I'm asking about
It seems that the general rule is that Loops are ordered based on their identifiers using the numeric portion as the primary sort key, with the alpha portion as the secondary sort key. Of course hierarchical loops contained in others will be placed before and loops following the parent in that sort order (eg: 1000A, 2000A, 2010A, 2010B, 2100, 2300 - where 2010A and 2010B are children of 2000A).
I understand that the spec and Implementation Guides contain all of this info; I'm looking for the all-encompassing rule about loop ordering (not Segment ordering). Is there any concise way to express the rule algorithmically? Is there even a hard-and-fast rule at all?
As I mentioned in my comments, the standard has a loop value. Take a look at my screenshot of the Liaison Dictionary Viewer. The CLM segment has a LOOP value of 100. The segments underneath are children of the CLM segment (extended tag). Any "order" can be defined arbitrarily by the partner, or can be in any (undefined) order provided the data is qualified. But that loop can occur 100 times max and can have repeating segments inside the loop value.
The implementation guide will give you the correct order your partner wants them in. It seems like you're writing your own syntax validation engine though.

How to create a string outside of Erlang that represents a DICT Term?

I want to construct a string in Java that represents a DICT term and that will be passed to an Erlang process for being reflected back as an erlang term ( string-to-term ).
I can achieve this easily for ORDDICT 's, since they are structured as a simple sorted key / value pair in a list of tuples such as : [ {field1 , "value1"} , {field2 , "value2} ]
But, for DICTS, they are compiled into a specific term that I want to find how to reverse-engineer it. I am aware this structure can change over new releases, but the benefits for performance and ease of integration to Java would overcome this. Unfortunately Erlang's JInterface is based on simple data structures. An efficient DICT type would be of great use.
A simple dict gets defined as follows:
D1 = dict:store("field1","AAA",dict:new()).
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],[],[],[]}}}
As it can be seen above, there are some coordinates which I do not understand what they mean ( the numbers 1,16,16,8,80,48 and a set of empty lists, which likely represent something as well.
Adding two other rows (key-value pairs) causes the data to look like:
D3 = dict:store("field3","CCC",D2).
{dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],
[["field3",67,67,67]],
[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],
[["field2",66,66,66]],
[],[]}}}
From the above I can notice that:
the first number (3) reppresets the number of items in the DICT.
the second number (16) shows the number of list slots in the first tuple of lists
the third number (16) shows the number of list slots in the second typle of lists, of which the values ended up being placed on ( in the middle ).
the fourth number (8) appears to be the number of slots in the second row of tuples from where the values are placed ( a sort of index-pointer )
the remaining numbers (80 and 48)... no idea...
adding a key "field0" gets placed not in the end but just after "field1"'s data. This indicates the indexing approach.
So the question, is there a way (algorithm) to reliably directly create a DICT string from outside of Erlang ?
The comprehensive specification how dict is implemented can be found simply in the dict.erl sourcecode.
But I'm not sure replicating dict.erl's implementation in Java is worthwhile. This would only make sense if you want a fast dict like data structure that you need to pass often between Java and Erlang code. It might make more sense to use a Key-Value store both from Erlang and Java without passing it directly around. Depending on your application this could be e.g. riak or maybe even connect your different language worlds with RabbitMQ. Both examples are implemented in Erlang and are easily accessible from both worlds.

If you know the length of a string and apply a SHA1 hash to it, can you unhash it?

Just wondering if knowing the original string length means that you can better unlash a SHA1 encryption.
No, not in the general case: a hash function is not an encryption function and it is not designed to be reversible.
It is usually impossible to recover the original hash for certain. This is because the domain size of a hash function is larger than the range of the function. For SHA-1 the domain is unbounded but the range is 160bits.
That means that, by the Pigeonhole principle, multiple values in the domain map to the same value in the range. When such two values map to the same hash, it is called a hash collision.
However, for a specific limited set of inputs (where the domain of the inputs is much smaller than the range of the hash function), then if a hash collision is found, such as through an brute force search, it may be "acceptable" to assume that the input causing the hash was the original value. The above process is effectively a preimage attack. Note that this approach very quickly becomes infeasible, as demonstrated at the bottom. (There are likely some nice math formulas that can define "acceptable" in terms of chance of collision for a given domain size, but I am not this savvy.)
The only way to know that this was the only input that mapped to the hash, however, would be to perform an exhaustive search over all the values in the range -- such as all strings with the given length -- and ensure that it was the only such input that resulted in the given hash value.
Do note, however, that in no case is the hash process "reversed". Even without the Pigeon hole principle in effect, SHA-1 and other cryptographic hash functions are especially designed to be infeasible to reverse -- that is, they are "one way" hash functions. There are some advanced techniques which can be used to reduce the range of various hashes; these are best left to Ph.D's or people who specialize in cryptography analysis :-)
Happy coding.
For fun, try creating a brute-force preimage attack on a string of 3 characters. Assuming only English letters (A-Z, a-z) and numbers (0-9) are allowed, there are "only" 623 (238,328) combinations in this case. Then try on a string of 4 characters (624 = 14,776,336 combinations) ... 5 characters (625 = 916,132,832 combinations) ... 6 characters (626 = 56,800,235,584 combinations) ...
Note how much larger the domain is for each additional character: this approach quickly becomes impractical (or "infeasible") and the hash function wins :-)
One way password crackers speed up preimage attacks is to use rainbow tables (which may only cover a small set of all values in the domain they are designed to attack), which is why passwords that use hashing (SHA-1 or otherwise) should always have a large random salt as well.
Hash functions are one-way function. For a given size there are many strings that may have produced that hash.
Now, if you know that the input size is fixed an small enough, let's say 10 bytes, and you know that each byte can have only certain values (for example ASCII's A-Za-z0-9), then you can use that information to precompute all the possible hashes and find which plain text produces the hash you have. This technique is the basis for Rainbow tables.
If this was possible , SHA1 would not be that secure now. Is it ? So no you cannot unless you have considerable computing power [2^80 operations]. In which case you don't need to know the length either.
One of the basic property of a good Cryptographic hash function of which SHA1 happens to be one is
it is infeasible to generate a message that has a given hash
Theoretically, let's say the string was also known to be solely of ASCII characters, and it's of size n.
There are 95 characters in ASCII not including controls. We'll assume controls weren't used.
There are 95ⁿ possible such strings.
There are 1.461501×10⁴⁸ possible SHA-1 values (give or take) and a just n=25, there are 2.7739×10⁴⁹ possible ASCII-only strings without controls in them, which would mean guaranteed collisions (some such strings have the same SHA-1).
So, we only need to get to n=25 when this becomes impossible even with infinite resources and time.
And remember, up until now I've been making it deliberately easy with my ASCII-only rule. Real-world modern text doesn't follow that.
Of course, only a subset of such strings would be anything likely to be real (if one says "hello my name is Jon" and the other says "fsdfw09r12esaf" then it was probably the first). Stil though, up until now I was assuming infinite time and computing power. If we want to work it out sometime before the universe ends, we can't assume that.
Of course, the nature of the attack is also important. In some cases I want to find the original text, while in others I'll be happy with gibberish with the same hash (if I can input it into a system expecting a password).
Really though, the answer is no.
I posted this as an answer to another question, but I think it is applicable here:
SHA1 is a hashing algorithm. Hashing is one-way, which means that you can't recover the input from the output.
This picture demonstrates what hashing is, somewhat:
As you can see, both John Smith and Sandra Dee are mapped to 02. This means that you can't recover which name was hashed given only 02.
Hashing is used basically due to this principle:
If hash(A) == hash(B), then there's a really good chance that A == B. Hashing maps large data sets (like a whole database) to a tiny output, like a 10-character string. If you move the database and the hash of both the input and the output are the same, then you can be pretty sure that the database is intact. It's much faster than comparing both databases byte-by-byte.
That can be seen in the image. The long names are mapped to 2-digit numbers.
To adapt to your question, if you use bruteforce search, for a string of a given length (say length l) you will have to hash through (dictionary size)^l different hashes.
If the dictionary consists of only alphanumeric case-sensitive characters, then you have (10 + 26 + 26)^l = 62^l hashes to hash. I'm not sure how many FLOPS are required to produce one hash (as it is dependent on the hash's length). Let's be super-unrealistic and say it takes 10 FLOP to perform one hash.
For a 12-character password, that's 62^12 ~ 10^21. That's 10,000 seconds of computations on the fastest supercomputer to date.
Multiply that by a few thousand and you'll see that it is unfeasible if I increase my dictionary size a little bit or make my password longer.

Resources