Content-addressable storage systems use the hash of the stored data as its identifier and address. Collisions are incredibly rare, but if the system is used heavily for a long time, one could eventually happen. What happens if two pieces of data produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?
To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?
It is assumed that collisions do not happen, which is a perfectly reasonable assumption given a strong hash function and casual, non-malicious user input. SHA-1, which is what Camlistore currently uses, is also resistant to malicious attempts to produce collisions.
In case a hash function becomes weak with time and needs to be retired, Camlistore supports migrating to a new hash function for new blobrefs, while keeping old blobrefs accessible.
If a collision did happen, as far as I understand, the first stored blobref with that hash would win.
source: https://groups.google.com/forum/#!topic/camlistore/wUOnH61rkCE
In an ideal collision-resistant system, when a new file / object is ingested:

- A hash of the incoming item is computed.
- If the incoming hash does not already exist in the store:
  - the item's data is saved and associated with the hash as its identifier.
- If the incoming hash does match an existing hash in the store:
  - The existing data is retrieved.
  - A bit-by-bit comparison of the existing data against the new data is performed.
  - If the two copies are found to be identical, the new entry is linked to the existing hash.
  - If the two copies are not identical, the new data is either
    - rejected, or
    - appended or prefixed* with additional data (e.g. a timestamp or user ID) and re-hashed; this entire process is then repeated.
So no, it's not inevitable that information is lost in a content-addressable storage system.
* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged somehow (e.g. linked to a zero-byte payload) to indicate that there were multiple stored objects that originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
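A minimal sketch of the ingestion flow described above, in Python, assuming SHA-256 as the hash and a plain in-memory dict standing in for the blob store (the ingest function and the salt parameter are illustrative, not part of Camlistore or any particular system):

import hashlib

def ingest(store: dict, data: bytes, salt: bytes = b"") -> str:
    # The address of the item is the hash of (optional disambiguating salt + data).
    digest = hashlib.sha256(salt + data).hexdigest()
    existing = store.get(digest)
    if existing is None:
        # No entry under this hash yet: save the data.
        store[digest] = data
        return digest
    if existing == data:
        # Bit-for-bit identical to what is already stored: reuse the entry.
        return digest
    # Genuine collision: same hash, different bytes. Re-hash with extra
    # disambiguating input (a timestamp or user ID would also do) and retry.
    return ingest(store, data, salt + b"#")

A real system would also need to record the salt somewhere so the re-hashed item can still be found later, which is essentially the 'disambiguation' idea in the footnote above.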
While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.
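For a sense of scale (back-of-the-envelope numbers, not from the answer above): with a b-bit hash, the birthday bound puts the chance of any random collision among n stored items at roughly n^2 / 2^(b+1).

# Rough birthday-bound estimate of a random collision for n items
# and a b-bit hash. Illustrative arithmetic only.
def collision_probability(n: int, bits: int = 160) -> float:
    return n * n / 2 ** (bits + 1)

print(collision_probability(10**12))  # ~3.4e-25 even with a trillion 160-bit hashes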
Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)
The risk profile of such a system (e.g. a single git repository) is far different from that of an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparently random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.
See also, e.g.:
https://stackoverflow.com/a/2437377/5711986
Composite key, e.g. hash + userId
I am going through David Patterson and John Hennessy's computer architecture book. In Chapter 2, it mentions that we may need to make two separate requests, to read the tag and the data, in two cycles if we store tags in DRAM. My question is: why do we need to request the tag at all? Isn't the tag just the higher bits of the address?
Wow - I read Patterson and Hennessy in grad school, a long, long time ago ;) Thanx for the trip down Memory Lane ;)
Here's what's going on:
https://www.webopedia.com/TERM/T/tag_RAM.html
The area in an L2 cache that identifies which data from main memory is
currently stored in each cache line. The actual data is stored in a
different part of the cache, called the data store. The values stored
in the tag RAM determine whether a cache lookup results in a hit or a
miss.
In other words, there are two different "things" (the tag, and the data) in two different "places" (the cache line, and the data store). If it's a "hit", you only need to do one lookup (to the cache line).
So why have a "tag" at all? Because different regions of memory may be mapped into the same cache block, the tag is needed to differentiate which one is currently stored there.
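To make that concrete, here is a small Python sketch that splits an address into tag / index / offset for a hypothetical direct-mapped cache; the 64-byte line size and 256 lines are assumed values, not numbers from the book:

LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_LINES = 256  # number of cache lines (assumed)
OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 6 bits select the byte within a line
INDEX_BITS = NUM_LINES.bit_length() - 1    # 8 bits select which line to check

def split_address(addr: int):
    offset = addr & (LINE_SIZE - 1)                   # byte within the line
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)   # which line to look in
    tag = addr >> (OFFSET_BITS + INDEX_BITS)          # what the tag RAM stores
    return tag, index, offset

# Two different addresses can land in the same cache line; only the stored
# tag tells you which region of memory is actually cached there.
print(split_address(0x12340))  # (4, 141, 0)
print(split_address(0x92340))  # (36, 141, 0)  -> same line, different tag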
When using a Store by only appending data to the right, with constantly increasing key values of type Long, would it be best to query the Store's size using Store.count(..) before calling Store.putRight(..) every time and use that value as the next key? I was wondering if the store method could become quite expensive.
Store.count() is quite cheap, as it requires only the tree root to be loaded, and its record in the database is highly likely loaded in the Log cache. Store.putRight() compared to Store.put() is cheaper under any workload as it results in less random access.
In our app we have a table called support_files which stores documents that have been uploaded, mostly PDFs.
I'd like to get a unique list of these files; often the same file is uploaded more than once. I thought that a way to do this would be to add a column to the database called "checksum", and then, for each file, calculate the checksum somehow and store it in the column. (This is obviously the slow part.)
Once this is done then I can easily filter out duplicates from my table by examining the checksum column.
Can anyone recommend a method to generate this checksum/hash/whatever? Ideally I'd like to generate a hash/checksum that's large enough to guarantee uniqueness, but small enough to fit into a string field in my database.
My server's running on Ubuntu server, and the total number of files I need to checksum is currently around 12,000. For the sake of argument assume it won't grow over 100,000.
A bit of Googling reveals sha1sum, but this may be more suited to telling whether a file has been accidentally changed than whether two files are different?
Take a look at Digest::SHA256; it can interface directly with files and it works great.
From the referenced documentation:
p Digest::SHA256.file("X11R6.8.2-src.tar.bz2").hexdigest
# => "f02e3c85572dc9ad7cb77c2a638e3be24cc1b5bea9fdbb0b0299c9668475c534"
I just started studying DHT implementation and theory and I am stuck on one part: how is a node ID generated when a node starts up and connects to the network? I read that the ID is a random hash from some range of hashes, but is that hash unique? And is the hash generated based on the data which this node stores? Please help me with this.
Self-generation of the node ID using a good hash function over a large space of values is a common technique in DHT/P2P systems. Since a good hash function gives a close-to-uniform random distribution, the probability of a collision is very small. Statistically, the ID will (almost always) be unique.
That hash is independent of the data stored on the node.
import random
import hashlib

def newID():
    # Pick 20 random bytes, then SHA-1 them to produce a
    # uniformly distributed 160-bit node ID.
    s = bytes(random.randint(0, 255) for _ in range(20))
    m = hashlib.sha1()
    m.update(s)
    return m.digest()
As said in the previous answers, the ID of a node is generated by hashing its IP address (generally speaking, such is the case in a DHT like Chord) or other uniquely identifying information.

And since it uses consistent hashing, when a node joins or leaves an n-node network only 1/n of the keys need to be remapped on average, so it lends itself to highly dynamic network topologies, such as peer-to-peer.
Technically, the generated hash doesn't convey any information about the data that is stored on this node. Rather, the hash for a certain key (or entry in a data store, if used for such a purpose) originates from hashing the keyword (or the filename, or the file contents).

As a direct consequence of consistent hashing, the abstract concept of distance between keys emerges: (as stated here) a node owns all the keys to which its own identifying key (ID) is closest according to the distance metric.
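As a small illustration of that ownership rule (a sketch only, assuming SHA-1 IDs and Kademlia-style XOR distance; Chord would use clockwise distance on a ring instead):

import hashlib

def sha1_id(data: bytes) -> int:
    # Both node IDs and key IDs live in the same 160-bit space.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def owner(keyword: bytes, nodes: dict) -> str:
    # The node whose ID is closest to the key's ID (by XOR distance) owns the key.
    k = sha1_id(keyword)
    return min(nodes, key=lambda name: nodes[name] ^ k)

# Node IDs hashed from something node-specific (IP:port, random bytes, ...).
nodes = {name: sha1_id(name.encode()) for name in ("node-a", "node-b", "node-c")}
print(owner(b"some-file.txt", nodes))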
I would like to store millions of data lines that looks like this:
key, value
key is an integer in the range of (0 to 5,000,000); all keys are unique;
value is an unsigned int16 value (0 to 65535)
the goal is to store the data while taking the LEAST AMOUNT OF DISK SPACE, and yet be able to query the data. Can you think of any algorithms / smart schemes for data storage that would be helpful?
just in case it matters, I use Linux.
One option, if the key values are not important data but rather just index data, would be to utilize a flat file of bits (with a descriptive header). Every 16 bits is a value, and the nth value would then be (n - 1) * 16 bits from the end of the header.

Additionally, if the key value does matter, a fixed flat file of about 10 MB would allow the entire key space to be stored without storing the actual keys. The 16 bits at the (n - 1) * 16 offset would be that key's value.

That would probably be the least space-intensive method of storage, as only the data that is literally required gets stored. (Though, if you are only interested in, say, 100k values and one has a key of 5 million, you do end up with a lot of wasted space, which wouldn't be there with an actual key/value addressing system. So this methodology only achieves minimum disk storage for sets of tightly grouped values or for very many values, over about the 2 million mark.)
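A rough Python sketch of that flat-file layout, assuming a small made-up header and 16-bit little-endian values stored at a fixed offset per key (the file format and names are illustrative only):

import struct

HEADER = b"FLAT16\x00\x00"   # made-up 8-byte descriptive header
VALUE_SIZE = 2               # each value is an unsigned 16-bit integer
MAX_KEY = 5_000_000

def create(path: str):
    # Pre-allocate the whole key space: header + one 2-byte slot per possible key
    # (about 10 MB for keys 0..5,000,000).
    with open(path, "wb") as f:
        f.write(HEADER)
        f.write(b"\x00" * VALUE_SIZE * (MAX_KEY + 1))

def put(path: str, key: int, value: int):
    with open(path, "r+b") as f:
        f.seek(len(HEADER) + key * VALUE_SIZE)
        f.write(struct.pack("<H", value))

def get(path: str, key: int) -> int:
    with open(path, "rb") as f:
        f.seek(len(HEADER) + key * VALUE_SIZE)
        return struct.unpack("<H", f.read(VALUE_SIZE))[0]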
How do you plan to use the stored data? With random or sequential access? For sequential access you can use any archiving algorithm, e.g. LZMA. Random access doesn't leave you a lot of room for improvement.

Can you see any patterns in this data? E.g. if the difference between adjacent keys/values is often small, you can store only packed differences (a small sketch of this follows below). And there are a million other possible approaches.

[EDIT] You can also check techniques used for data compression in network communication.

[EDIT1] And you can check this Google Code Integer Array Compression project.
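For the packed-differences idea above, a minimal sketch (the sorted-delta layout and the varint encoding are one possible choice for illustration, not a specific library's format): sort by key and store each key as a variable-length delta from the previous key, followed by its 16-bit value.

import struct

def encode_varint(n: int) -> bytes:
    # Standard 7-bits-per-byte varint: small deltas take a single byte.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def pack(pairs):
    # pairs: iterable of (key, value). After sorting, dense key ranges
    # shrink to one delta byte plus two value bytes per entry.
    out = bytearray()
    prev = 0
    for key, value in sorted(pairs):
        out += encode_varint(key - prev)
        out += struct.pack("<H", value)
        prev = key
    return bytes(out)

print(len(pack([(10, 500), (11, 501), (12, 65535)])))  # 3 * (1 + 2) = 9 bytes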
This depends on the operations and the data. I would first recommend "just using a database" (a simple key-value store such as BDB/EhCache [read: key-value store], for instance :-)
Mimisbrunnr also has a good answer if all the keys are used.
If the keys are near constant / read-only and only a relatively small percentage of the keys are used, consider the use of a (disk-based) Heap data structure (very similar to an Array-based Heap; Heaps need not be Array-based). Robert Sedgewick had a good book from the late 80's with a very lean implementation, but I forget the name. Compared to a flat index, a Heap will be more beneficial when a smaller proportion of the keys is used, and at full load it will have worse storage requirements.

(If abstracted, the method used could be switched, and/or a hybrid Heap with indexed/sequenced leaf-node values could be used [along with Huffman encoding or whatnot], but that is just adding far more complications. Keep it simple ... hence the first suggestion of an existing key/value store ;-)
Happy coding.
Have you considered using a database designed for mobile devices such as SQL Server Compact, or another similar database? These will have a small footprint on the disk, while still providing the full search power you need.
Another example of a compact database engine is KeyDB for Linux:
http://3d2f.com/programs/11-989-keydb-download.shtml