Generate custom-length hash values of a String in Swift - ios

Is it possible to somehow "hash" a given String with length n to a hash value of an arbitrary length m?
I want to achieve something like follows:
let s1 = "<UNIQUE_USER_IDENTIFIER_1>"
let s2 = "<UNIQUE_USER_IDENTIFIER_2>"
let x1 = s1.hashValue(length: 4)
let x2 = s2.hashValue(length: 4)
I want to assign each user a (e.g. four-digit) number based on their unique UID. Is that possible?

First, I want to be clear that you mean "hash" and not "(lossless) compress." You should expect some collisions, where x1 and x2 come out the same for different s1 and s2. If you really mean a mapping with no collisions, then we have to know a lot more about the problem. It is impossible to achieve that in the general case (see the pigeonhole principle), but it can be achieved in some special cases where there is sufficient redundancy in the input, or it can be done by maintaining a table (i.e. a database or the like). The rest of this answer is about hashing.
If your UID is a UUID created on iOS (or any v4 UUID), then its bits are already quite high quality, and the last four digits should be fine without doing any hashing at all. There are a couple of bytes in the middle (the version and variant fields) that you should avoid, but the whole end section is random and so an ideal hash.
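For illustration, a minimal sketch in Python (the same idea ports directly to Swift's UUID type); the function name is invented for the demo, and it deliberately uses only the trailing six bytes, which in a v4 UUID are fully random:

import uuid

def code_from_uuid(u, digits=4):
    # The last six bytes of a v4 UUID are random (the version and
    # variant fields sit earlier), so reducing them modulo 10**digits
    # gives a usable short code - collisions are still possible.
    return int.from_bytes(u.bytes[-6:], "big") % (10 ** digits)

print(code_from_uuid(uuid.uuid4()))  # e.g. 4821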
If your UUID is not random, you can try using the default hashes and pulling the required number of bits out of them, but non-cryptographic hashes don't always have good independence between their bits, so this may collide more than you like. Note also that Swift's hashValue has been randomly seeded per process launch since Swift 4.2, so it is not stable across runs.
In that case, use a cryptographic hash larger than the size you need and truncate it (or take the least-significant bits; either end is fine). This is commonly done in cryptography. For example, SHA-512/256 is a commonly used hash that computes a 512-bit digest and extracts 256 bits from it. Cryptographic hashes require high independence of all their bits, so any subset of bits will also be collision resistant.
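As a minimal sketch of that truncation in Python (Swift's CryptoKit SHA256 works the same way); reducing the digest modulo 10**digits rather than slicing bytes is just a convenience for decimal output:

import hashlib

def short_hash(uid, digits=4):
    # Any fixed subset of a cryptographic digest stays collision
    # resistant; here the whole digest is reduced to `digits` decimal digits.
    h = hashlib.sha256(uid.encode("utf-8")).digest()
    return int.from_bytes(h, "big") % (10 ** digits)

print(short_hash("<UNIQUE_USER_IDENTIFIER_1>"))  # stable across runs and platforms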
BTW, if you mean "4 decimal digits," then you should expect a collision about 1 time in 100. If you mean 16 bits (4 hex digits), you should expect a collision about 1 time in 300. These are your best-case scenarios and mean your hash is working well. See Birthday Attack for a table of expectations and some helpful approximations.

Based on only the information you provided:
extension String {
    func hashValue(length: Int) -> Int? {
        // hashValue is randomly seeded per process launch in Swift 4.2+,
        // so the result is only stable within a single run.
        // `magnitude` avoids the overflow trap of abs(Int.min).
        return Int(String(self.hashValue.magnitude).prefix(length))
    }
}
Usage:
"foo".hashValue(length: 4) // 5192
This will give you a positive integer based on the string input, but note that since Swift 4.2 hashValue is randomly seeded, so the result is only consistent within a single process launch. That makes it unsuitable for persistent UID purposes, though still useful for in-memory use-cases.

Related

Gensim doc2vec produces more vectors than given documents when I pass unique integer ids as tags

I'm trying to build document vectors with gensim's Doc2Vec, following the example.
I passed TaggedDocuments containing 9 docs and 9 tags.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0,1,2,3,4,5,6,7,100]
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
and it produces 101 vectors, as shown in my screenshot ("gensim doc2vec produced 101 vectors").
What I want to know is:
How can I be sure that the tag I passed is attached to the right vector?
How did the vectors with the tags which I didn't pass (8~99 in my case) come out? Were they computed as a blank?
If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.
This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.
Any such doc-vectors allocated but never subject to any training will keep their random initialization, the same as the others - but never be adjusted by training.
If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you use contiguous int IDs from 0 to your max ID, with none unused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory.
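For instance, a quick sketch using unique string tags, so that exactly one vector per document is allocated (the "doc_" tag names are invented for the demo; model.dv is the gensim 4.x accessor):

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One unique string tag per document: the model builds a {tag -> slot}
# mapping and allocates exactly len(documents) doc-vectors.
documents = [TaggedDocument(doc, ["doc_%d" % i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

print(len(model.dv))      # 9 - one vector per tag
print(model.dv["doc_0"])  # the vector trained for the first document

(min_count=1 is kept here only because this toy corpus is tiny; see the note below.)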
(Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)

How to filter a random string generator from certain results

Is it possible to add a filter to a random string generator so that it cannot produce certain strings? I am using this to create unique codes for my users, and I need to make sure that a code is not assigned more than once.
This is how I am generating random alphanumeric strings:
func randomString(length: Int) -> String {
    let letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return String((0..<length).map { _ in letters.randomElement()! })
}
As long as your strings are long enough, your current scheme is already everything you need: you can make collisions of random values arbitrarily unlikely. This is the basis of UUIDv4. There are just so many possible values (>10^36) that no one will ever pick a duplicate. If you can use UUIDs directly, I recommend that; it's well supported.
If you need to use letters and numbers, then you can do the same thing, though.
UUID gets that by having 122 bits of randomness. You have 62 symbols, so to get ~122 bits of randomness you need ~20 characters. If that's OK, you're done: generate 20-character random codes and, I promise, they will never collide. For each character you shorten them by, the likelihood of a collision goes up. This is called a Birthday Attack. Back-of-the-envelope, you expect (50% likely) your first collision after about 1.25*sqrt(H) values, where H is the number of values you can encode.
So for 10 characters, you expect your first collision around 1.25*sqrt(62^10), or around 1 billion codes. If your total number of codes is a few orders of magnitude smaller than that (a few million or fewer), then 10 characters will be fine. Other scales can be calculated similarly.
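A back-of-the-envelope calculator for that estimate, sketched in Python (the 1.25 factor is the sqrt(pi/2) birthday-bound constant from the answer):

import math

def expected_first_collision(symbols, length):
    # ~50% chance of a first collision after sqrt(pi/2 * H) ~= 1.25*sqrt(H)
    # random draws, where H = symbols**length is the size of the code space.
    return 1.25 * math.sqrt(symbols ** length)

for n in (8, 10, 12, 20):
    print(n, "%.2e" % expected_first_collision(62, n))  # 10 -> ~1.15e+09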
But if you can, just use UUID.

ECIES: what is the correct ECDH input for the KDF? Security effect?

In order to understand ECIES completely and use my favorite library, I implemented some parts of ECIES myself. Doing this and comparing the results led to one point which is not really clear to me: what exactly is the input of the KDF?
The result of ECDH is a point, but what do you use for the KDF? Is it just the X coordinate, or is it X plus Y (perhaps with a prepended 0x04)? You can find both concepts in the wild, and for the sake of interoperability it would be really interesting to know which way is the correct one (if there is a correct way at all - I know that ECIES is more a concept and has several degrees of freedom).
Explanation (please correct me if I'm wrong on a specific point). Where I talk about byte lengths, this refers to ECIES with 256-bit EC keys.
So, first, the big picture: here's the ECIES process, and I'm talking about the step 2 -> 3:
The recipient's public key is a point V, the sender's ephemeral private key is a scalar u, and the key agreement function KA is ECDH, which is basically the multiplication u * V. As a result you get a shared secret which is also a point - let's call it the "shared key".
Then you take the sender's public key, concatenate it with the shared key, and use this as the input of the key derivation function KDF.
But if you want to feed this point into the KDF, you have two ways of doing it:
1) you can use just the shared key's X coordinate. Then you have a bytestring of 32 bytes.
2) you can use the shared key's X and Y and prepend 0x04, as you do with public keys. Then you have a bytestring of 1 + 32 + 32 bytes.
3) just to be complete: you could also encode the point in compressed form (a 0x02/0x03 prefix plus X).
The length of the bytestring does not really matter, because after the KDF (which usually involves hashing) you always get a fixed-length value, e.g. 32 bytes (if you use SHA-256).
But of course the result of the KDF differs depending on which method you choose. So the question is: what's the correct way?
eciespy uses method 2: https://github.com/ecies/py/blob/master/ecies/utils.py#L143
python cryptography gives just X back from its ECDH: https://cryptography.io/en/latest/hazmat/primitives/asymmetric/ec/#cryptography.hazmat.primitives.asymmetric.ec.ECDH . It has no ECIES support.
if I understand Crypto++'s documentation correctly, it also gives just X back: https://cryptopp.com/wiki/Elliptic_Curve_Diffie-Hellman
same with Java's Bouncy Castle, if I read this correctly - the result is an integer: https://github.com/bcgit/bc-java/blob/master/core/src/main/java/org/bouncycastle/crypto/agreement/DHBasicAgreement.java#L79
but you can also find online calculators that use both X and Y: http://www-cs-students.stanford.edu/~tjw/jsbn/ecdh.html
So I tried to get more information from the documentation:
there's the ISO proposal for ECIES. It doesn't describe this in detail (or I was not able to find it), but I would interpret it as using the full point, X and Y: https://www.shoup.net/papers/iso-2_1.pdf
there is this widely linked paper which refers to using just X, on page 27: http://www.secg.org/sec1-v2.pdf
So the result is: I'm confused. Can anybody point me in the right direction, or is this just a degree of freedom (and a source of lots of fun when it comes to compatibility)?
To answer my own question: yes, this is a degree of freedom. The X-coordinate-only way is called compact representation, and it's defined in RFC 6090. So both ways are valid.
They are also equally secure, because you can calculate Y from X as described in appendix C of RFC 6090.
The default way is to use the compact representation. The two ways are not compatible with each other, so if you stumble across compatibility issues between libraries, this might be an interesting place to look.
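To make the compact-representation path concrete, a small sketch using the Python cryptography package mentioned above (the curve choice and info label are arbitrary demo values):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

recipient = ec.generate_private_key(ec.SECP256R1())
ephemeral = ec.generate_private_key(ec.SECP256R1())

# exchange() returns only the X coordinate of u * V: 32 bytes on a
# 256-bit curve - the compact representation discussed above.
shared_x = ephemeral.exchange(ec.ECDH(), recipient.public_key())

# Feed it into a KDF to get a fixed-length symmetric key.
key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
           info=b"ecies-demo").derive(shared_x)

A library following the 0x04 || X || Y convention would instead feed the 65-byte uncompressed point into the KDF, and the derived keys would not match.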

How Unique is a CRC16 Value?

I'm developing an OpenSource .NET Licensing Engine.
This engine uses a hardware ID (the hard disk serial number) as the lock, and CRC16s this value to get a shorter identifier.
An example value is MAXTOR ST3100, 476300BE, and the CRC16 result is 3FF0.
My concern is how often 2 different values get the same CRC16 value, or should I use CRC32 instead?
Probability of collision between 2 items = 1 ⁄ 0x10000 = 0.00152%...
But if you have more than 2 items, see the Birthday Problem -- it gets a lot more likely:
You just need 300 items to get a 50% probability of collision.
(1 - 0/2^16) * (1 - 1/2^16) * (1 - 2/2^16) * ... * (1 - N/2^16) = 50%, which gives N ≈ 300.
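A quick Python check of that figure (a brute-force product; not part of the original answer):

def collision_prob(items, space=2 ** 16):
    # Probability that `items` random 16-bit values contain a duplicate.
    p_all_distinct = 1.0
    for k in range(items):
        p_all_distinct *= 1 - k / space
    return 1 - p_all_distinct

print(collision_prob(300))  # ~0.50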
As CRC16 is a 16-bit value, I'd say that the chance is around 1 in 65536.
No hashing method generates unique values; collisions are guaranteed at some point. The closest bet given your requirements is simply to use the hard disk serial number as-is.
Hackers will crack it easily though.

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot keep that much data in memory, and the requirements are very tight: the data has to live on an embedded system, so I am very limited in space. I would like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. When I say integers, I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing much precision. This needs to run in hardware, so the computation overhead cannot be very high.
In my algorithm I have to access one value of the table, do some operations with it, and afterwards update the value. In the end what I need is a function to which I pass an index and get a value back, and another function to write a value into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know any other methods?
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common to many of them. For example, if they're tightly clustered, you can take the mean, store it, and store only the offsets; the offsets will need fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then the offset to the next number.
It would help to know what your key is for looking up the numbers.
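A minimal sketch of the clustered case in Python (the function name is invented for the demo):

def pack_offsets(values):
    # Store one shared base plus small per-value offsets; tightly
    # clustered values need far fewer bits per offset than per value.
    base = min(values)
    offsets = [v - base for v in values]
    bits_per_offset = max(offsets).bit_length()
    return base, bits_per_offset, offsets

print(pack_offsets([1000003, 1000017, 1000042]))  # (1000003, 6, [0, 14, 39])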
I need more detail on the problem. If you cannot store the real values of the integers but instead an approximation, that means you are going to reduce (throw away) some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and XOR them together, producing a single 8-bit value, reducing your storage by a factor of 4 but also destroying the real value of the original data. Typically you could go further and use only a few of those 8 bits, say the lower 4, and reduce the value further.
I think the real question is: either you need the data or you don't. If you need the data, you need to compress it or find more memory to store it. If you don't, then use a hash of some sort to reduce the number of bits until you reach the amount of memory you have for storage.
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4MB of memory for a big table of all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
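A minimal bit-array sketch in Python (the class name is invented for the demo):

class BitArray:
    # One bit per possible value: membership for 32M values fits in ~4 MB.
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)
    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)
    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

s = BitArray(32_000_000)
s.add(1234567)
print(1234567 in s, 7 in s)  # True False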
If you are merely looking for the presence of the number in question, a Bloom filter might be what you are looking for. Honestly, though, your question is fairly vague and confusing. It would help to explain what the Q values are and what you do with them once you find them in the table.
If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers in half.
Assume the integer n can serve as its own hash, because its set is homogeneous. Assume you have 0x10000 (64K) buckets. Each bucket index is iBucket = n & 0xFFFF. Each item in a bucket then only needs to store the remaining 16 bits, since the low 16 bits are the bucket index. The other thing you have to do to keep the data small is to store the count of items in the bucket, and use an array to hold the bucket's items; a linked list would be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to the array plus a count. On a 32-bit system, this is 64 bits max. If the number of ints were small enough, we might be able to do some fancy things and use 32 bits per bucket. 64K * 8 bytes = 512KB, and 2 million shorts = 4MB. So this gives you a method to look up the ints with about 40% compression.
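A sketch of that bucketing trick in Python (the class name is invented; a real embedded version would use fixed arrays of 16-bit shorts rather than Python lists):

class SplitIntSet:
    # 2^16 buckets keyed by the low 16 bits of each value; buckets
    # store only the high 16 bits, halving per-item storage.
    def __init__(self):
        self.buckets = {}
    def add(self, n):
        self.buckets.setdefault(n & 0xFFFF, []).append(n >> 16)
    def __contains__(self, n):
        return (n >> 16) in self.buckets.get(n & 0xFFFF, ())

s = SplitIntSet()
s.add(0xDEADBEEF)
print(0xDEADBEEF in s, 0xBEEF in s)  # True False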
