How unique is a CRC16 value?

I'm developing an open-source .NET licensing engine.
This engine uses a hardware ID (the hard disk serial number) as the lock, and CRC16s that value to get a shorter identifier.
An example value is MAXTOR ST3100, 476300BE, and the CRC16 result is 3FF0.
My concern is: how often do two different values get the same CRC16 value, and should I use CRC32 instead?

Probability of collision between 2 items = 1/0x10000 ≈ 0.0015%.
But if you have more than 2 items, see the Birthday Problem -- it gets a lot more likely:
You just need 300 items to get a 50% probability of collision.
(1 - 0/2^16)(1 - 1/2^16)(1 - 2/2^16)(1 - 3/2^16) ... (1 - N/2^16) = 50%, which gives N ≈ 300.
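A quick sketch that multiplies out the same product numerically (Python used here for illustration):

def items_for_half_collision(bits=16):
    # Smallest number of items for which the chance of at least one
    # collision in a 2**bits space exceeds 50%.
    space = 2 ** bits
    p_no_collision = 1.0
    n = 1
    while p_no_collision >= 0.5:
        p_no_collision *= 1 - n / space
        n += 1
    return n

print(items_for_half_collision())  # ~302, i.e. "about 300 items"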

As CRC16 is a 16-bit value, I'd say that the chance is around 1 in 65536.

No hashing method generates unique values; collisions are guaranteed at some point. The safest bet, given your requirements, is simply to use the hard disk serial number as-is.
Hackers will crack it easily, though.

Related

Generate custom length hash values of a String in Swift

Is it possible to somehow "hash" a given String of length n to a hash value of an arbitrary length m?
I want to achieve something like the following:
let s1 = "<UNIQUE_USER_IDENTIFIER_1>"
let s2 = "<UNIQUE_USER_IDENTIFIER_2>"
let x1 = s1.hashValue(length: 4)
let x2 = s2.hashValue(length: 4)
I want to assign each given user a (e.g. four-digit) number based on their unique UID. Is that possible?
First, I want to be clear that you mean "hash" and don't mean "(lossless) compress." You should expect some collisions where x1 and x2 are the same value for different s1 and s2. If you really mean a mapping so that there are no collisions, then we have to know a lot more about the problem. It is impossible to achieve that in the general case (see the Pigeonhole principle). But it can be achieved in some special cases where there is sufficient redundancy in the input. Or it can be done by maintaining a table (i.e. a database or the like). The rest of this answer is about hashing.
If your UID is a UUID created on iOS (or any v4 UUID), then its bits are already quite high quality, and the last four digits should be fine without doing any hashing at all. There are a couple of bytes in the middle that you should avoid, but the whole end section is random and so an ideal hash.
If your UUID is not random, you can try using the default hashes and pulling the required number of bits out of them, but non-cryptographic hashes don't always have good independence between their bits, so this may collide more than you like.
In that case, use a cryptographic hash larger than the size you need and truncate it (or take the least-significant bits; either set is fine). This is commonly done in cryptography. For example, SHA-512/256 is a commonly used hash that computes a 512-bit hash and extracts 256 bits from it. Cryptographic hashes require high independence of all their bits, so any subset of bits will also be collision resistant.
BTW, if you mean "4 decimal digits," then you should expect a collision by the time you have about 100 values. If you mean 16 bits (4 hex digits), expect one at around 300 values. These are your best-case scenarios and mean your hash is working well. See Birthday Attack for a table of expectations and some helpful approximations.
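As an illustration of the truncation approach, a minimal sketch (Python's hashlib used here; CryptoKit's SHA256 in Swift produces the same digest bytes):

import hashlib

def short_hash(uid: str, bits: int = 16) -> int:
    # Cryptographic hash, then truncate: any subset of SHA-256's bits is
    # still uniformly distributed, so the low 16 bits make a fine short hash.
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

print(short_hash("<UNIQUE_USER_IDENTIFIER_1>"))  # stable across runs and platforms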
Based on only the information you provided:
extension String {
    // Note: String.hashValue is seeded randomly per process since Swift 4.2,
    // so this result is only stable within a single run.
    func hashValue(length: Int) -> Int? {
        // Keep the leading `length` decimal digits of the absolute hash.
        return Int(String(abs(hashValue)).prefix(length))
    }
}
Usage:
"foo".hashValue(length: 4) // 5192
This will give you a consistent positive integer result based on the string input. Obviously it is not very useful for uuid purposes but useful for other use-cases nonetheless.

How to test hash function collision rate?

Given billions of cookies (UUID-like strings), what is the best way to test the collision rate of, say, a 32-bit hash function like murmur3 on this sample?
First of all, it is hard to generate billions of unique strings, as it is impossible to keep them all in memory and there is no 100% precise random string generator.
The only way I can think of is:
1. Generate them, using approximate data structures like a Bloom filter or cuckoo filter to discard possible duplicates. Then we would have, say, exactly 5B unique UUIDs stored in a file.
2. Iterate through them, hash them, and repeat step 1 with the hash codes while counting how many collisions there are.
Is there a better way of doing this? This still has the drawback of a certain false-positive rate while testing the hash codes in step 2. The hash codes would have to be written to a file too, and manually checked in case of a possible false-positive hit.
The murmur_32 collision rate is extremely high at these magnitudes: just 100M unique UUIDs gives a collision rate of precisely 1.145577% (measured with a Scala snippet).
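That figure matches the standard birthday approximation. A sketch that estimates the expected collision rate directly, with no data generation at all (Python used for illustration; the formula is the point):

import math

def expected_collision_rate(n: float, bits: int) -> float:
    # Expected fraction of items that collide with an earlier item:
    # (n - d * (1 - exp(-n/d))) / n, where d is the size of the hash space.
    d = 2.0 ** bits
    expected_distinct = d * (1.0 - math.exp(-n / d))
    return (n - expected_distinct) / n

print(expected_collision_rate(100e6, 32))  # ~0.0115, i.e. about 1.15%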
Choose a word at random from the English dictionary, submit it to Google, then use the URLs that come back as "random" data to test your hash function on.

How to keep track of the seed

In Lua it's common knowledge that you can use math.randomseed, but it's also obvious that math.random changes the seed as well (calling it twice does not return the same result). What does it set it to, and how can I keep track of it? If that's impossible, please explain why.
This is not a Lua question, but a general question about how an RNG algorithm works.
First, Lua doesn't have its own RNG; it just gives you a (slightly mangled) value from the RNG of the underlying C library. Most RNG implementations do not reveal their inner state, but sometimes you can calculate it yourself.
For example, when you use Lua on Windows, you'll be using the LCG-based RNG from the MS C library. The numbers you get are a slice of the seed, not the full value. There are two ways you can deal with that:
If you know how many times you have called random, you can just take the initial seed value, feed it to your copy of the same algorithm with the same constants that are hardcoded in the MS library, and get the exact value of the seed.
If you don't, but you can be sure that nobody interferes between your two calls to random, you can take two generated numbers and reverse the LCG algorithm by shifting the bits back into place. This will leave you with several missing bits (plus one more bit thanks to Lua's mangling) that you simply need to brute-force: iterate over all the missing bits until your copy of the algorithm produces exactly the same two "random" numbers you recorded before. That will be the current seed stored inside the library's RNG as well. A well-programmed solution in Lua can brute-force this in about 0.2-0.5 s on a somewhat dated PC; I have done it in the past. Here's an example on Crypto.SE discussing this task in more detail: Predicting values from a Linear Congruential Generator. A sketch of this brute-force follows below.
The first approach can be used with any other RNG algorithm that doesn't use any real entropy; the second with most RNGs that don't mask so many bits of the slice that brute-forcing becomes unreasonable.
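To illustrate the second approach, here is a rough Python sketch using the well-known MSVC LCG constants (214013/2531011, with rand() exposing bits 16-30 of the state); the extra Lua-level mangling is not reproduced here:

MUL, INC, MASK = 214013, 2531011, 0xFFFFFFFF  # MSVC rand() constants

def next_state(s):
    return (s * MUL + INC) & MASK

def output(s):
    return (s >> 16) & 0x7FFF  # rand() exposes only 15 bits of the state

def recover_state(outs):
    # outs: three or more consecutive rand() outputs. The first output fixes
    # 15 bits of the state; the top bit and the low 16 bits (2**17 candidates)
    # are brute-forced, and the later outputs verify each candidate.
    for hi in range(2):
        for lo in range(1 << 16):
            t = (hi << 31) | (outs[0] << 16) | lo
            ok = True
            for o in outs[1:]:
                t = next_state(t)
                if output(t) != o:
                    ok = False
                    break
            if ok:
                return t  # internal state after the last observed call
    return None

# Demo: observe three outputs, recover the hidden state.
s = 20240101
observed = []
for _ in range(3):
    s = next_state(s)
    observed.append(output(s))
print(recover_state(observed) == s)  # True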
The real answer, though, is: you don't need to keep track of the seed at all. What you want is probably something else.
If you set a seed, all numbers math.random() generates are determined by it; the numbers are always pseudo-random, since the system generates a seed by itself if you don't set one.
math.randomseed(4)
print(math.random())
print(math.random())
math.randomseed(4)
print(math.random())
Outputs
0.50827539156303
0.75454387490399
0.50827539156303
So if you reset the seed to the same value, you can predict all the values that are going to come up, up to the maximum number of consecutive values you have already generated using that seed.
What the seed does not do is keep the output of math.random() the same; for that, you would have to keep resetting it to the same value.
An analogy as an example
Imagine the random number is an integer between 0 and 9 (instead of a double between 0 and 1).
math.random() could traverse pi's decimals from an arbitrary starting position (default could be system time).
What you do when you use math.randomseed() is (not literally; this is an analogy, as mentioned) set the starting position of where in pi you are going to retrieve your numbers.
If you now reset the seed to the same starting position, the numbers are going to be the same as the last time you reset the starting position.
You will know the numbers up to your last call; after that you can't be certain anymore.

What should I do to maintain the performance of a mobile app that uses a database?

I'm building an app that uses a database.
I have a words table, and every time the user types something, the app records the word and updates the database.
The frequency field is auto-incremented each time the user enters a matching word.
The trouble is that the user types day after day, and I'm afraid search performance will degrade over time, and also that the Int field will someday reach its limit (Int max).
So I limit the database to fewer than about 50,000 records.
I delete less-used records after a certain time.
But I don't know how to deal with the frequency Int field of each word.
How can I track the frequency of usage of each word without increasing the field forever?
I recommend that you use a logarithmic scale for the frequency values. That's what is often done in situations like this. See Wikipedia to learn about logarithmic scales.
For example, if you have a word MAN that has a frequency of 15, the value you store in the database would be log(15) ~= 1.17609125906.
If you then find 4 new occurrences of MAN, you want the stored value to represent a count of 19. You cannot simply add to the log value directly, because log(x)+log(y) = log(x*y). (See the Logarithm Rules section of this article for more information on log rules.)
Instead, assuming you use a base-10 logarithm, you would use this formula:
SET frequency = log(10^frequency+4)
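A minimal sketch of that update rule (Python here; the SQL version is the one-liner above):

import math

def bump_log_frequency(stored_log10: float, new_occurrences: int) -> float:
    # Convert back to a linear count, add, and re-take the log:
    # log10(10**stored + k). Stored values stay small even for huge counts.
    return math.log10(10 ** stored_log10 + new_occurrences)

freq = math.log10(15)               # word seen 15 times -> ~1.176
freq = bump_log_frequency(freq, 4)  # 4 more occurrences -> log10(19) ~= 1.279
print(round(10 ** freq))            # ~19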
Depending on the length of your words, the few bytes for the frequency don't matter. With an unsigned four-byte integer you can count up to more than four billion, which is far more words than a user could type in their whole lifespan.
So you may want to go for two or three bytes instead, but the savings would be negligible.
Anyway, there are the following approaches for preventing overflow:
You can detect it, undo the operation, scale everything down by some factor of two, and then redo it.
You can periodically check all your numbers and do the scaling when approaching the limit.
You can do a probabilistic update like below.
Probabilistic update
Instead of simply incrementing the frequency every time by one, you do it only with a probability which gets lower and lower as the counter grows. For example, you can do the increment with a probability of 1.0 / (oldValue + 1) or 2 ** -oldValue. The latter leads to a logarithmic growth, but, unlike the idea in the other answer, it works.
There are obviously some disadvantages due to the randomness and precision loss, but when all you care about is the relative frequency, it should be good enough.
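A minimal sketch of the 2 ** -oldValue variant (this is essentially a Morris approximate counter):

import random

def probabilistic_increment(counter: int) -> int:
    # Increment with probability 2**-counter, so the stored value grows
    # roughly like log2 of the true count.
    if random.random() < 2.0 ** -counter:
        return counter + 1
    return counter

def estimate(counter: int) -> float:
    # Unbiased estimate of the true count from the stored value.
    return 2.0 ** counter - 1

c = 0
for _ in range(100_000):
    c = probabilistic_increment(c)
print(c, estimate(c))  # stored value ~17, estimate in the 100k ballpark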

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot store that amount of data in memory, and my requirements are very tight: the data has to live on an embedded system, so I am very limited in space. I would like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. By integers I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing much precision. This needs to run in hardware, so the computational overhead cannot be very high.
In my algorithm, I have to access one value of the table, do some operations with it, and afterwards update the value. In the end, I should have one function that takes an index and returns a value, and another function to write a value back into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know any other method?
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common to many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets (sketched below). The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
It would help to know what your key is to look up the numbers.
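A rough sketch of the mean-plus-offsets idea, assuming the offsets happen to fit in signed 16 bits:

import array

def pack_offsets(values):
    # Store one 32-bit base (the mean) plus a 16-bit signed offset per value.
    base = sum(values) // len(values)
    offsets = array.array("h")  # signed 16-bit; assumes |value - base| < 32768
    for v in values:
        offsets.append(v - base)
    return base, offsets

def lookup(base, offsets, i):
    return base + offsets[i]

base, offs = pack_offsets([100_000, 100_250, 99_900, 100_100])
print(lookup(base, offs, 1))  # 100250, at half the storage of raw 32-bit ints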
I need more detail on the problem. If you cannot store the real values of the integers but only an approximation, that means you are going to throw away some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and XOR them together, resulting in a single 8-bit value, reducing your storage by a factor of 4 but also discarding information from the original data. Typically you could go further and use only a few of those 8 bits, say the lower 4, to reduce the value even more.
I think the real question is: either you need the data or you don't. If you need the data, you have to compress it or find more memory to store it. If you don't, use a hash of some sort to reduce the number of bits until you reach the amount of memory you have for storage.
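A sketch of that byte-folding hash:

def xor_fold_8(n: int) -> int:
    # XOR the four bytes of a 32-bit value together -> one 8-bit hash.
    return (n ^ (n >> 8) ^ (n >> 16) ^ (n >> 24)) & 0xFF

def xor_fold_4(n: int) -> int:
    # Go further: keep only the lower 4 bits of the folded byte.
    return xor_fold_8(n) & 0x0F

print(hex(xor_fold_8(0x12345678)))  # 0x8 -> 0x12 ^ 0x34 ^ 0x56 ^ 0x78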
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4 MB of memory for a big table of all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
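A minimal fixed-size sketch of such a bit array (the linked answer has an unlimited-size version):

# A 1 MB bit array covering the numbers 0..7,999,999: bit i says whether
# the number i is present in the set.
class BitArray:
    def __init__(self, size_bits):
        self.bits = bytearray((size_bits + 7) // 8)

    def set(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

ba = BitArray(8_000_000)
ba.set(123_456)
print(ba.test(123_456), ba.test(123_457))  # True False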
If you are merely testing for the presence of a number, a Bloom filter might be what you are looking for. Honestly, though, your question is fairly vague and confusing. It would help to explain what Q values are and what you do with them once you find them in the table.
If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers, in your case, in half.
Assume the integer n can serve as its own hash, because the set is homogeneous. Assume you have 0x10000 (64K) buckets. Each bucket index is iBucket = n & 0xFFFF. Each item in a bucket then only needs to store the remaining 16 bits, since the other 16 bits are implied by the bucket index. The other thing you have to do to keep the data small is to store the count of items in the bucket and use an array to hold them; a linked list would be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to the array plus a count; on a 32-bit system this is 64 bits max. If the number of ints were small enough, we might be able to do some fancy things and use 32 bits per bucket. 64K buckets * 8 bytes = 512 KB, and 2 million 16-bit shorts = 4 MB. So this gets you a method to look up the ints and about 40% compression (roughly 4.5 MB versus 8 MB of raw 32-bit values).
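A rough sketch of the bucket trick (Python for readability; on the embedded side the buckets would be plain C arrays of 16-bit values):

# The low 16 bits of n select the bucket, so each bucket only stores the
# high 16 bits of its members.
NUM_BUCKETS = 0x10000

buckets = [[] for _ in range(NUM_BUCKETS)]  # each entry: list of high halves

def insert(n):
    buckets[n & 0xFFFF].append(n >> 16)

def contains(n):
    return (n >> 16) in buckets[n & 0xFFFF]

insert(0xDEADBEEF)
print(contains(0xDEADBEEF), contains(0xDEADBEEE))  # True False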
