Can checksum comparison indicate how different 2 files are? - checksum

Basically like the title:
Can checksum comparison of 2 audio files indicate how different they are, as well as whether they are simply different?
I need to choose one of several checksum types (MD5, CRC32, SHA1) to test whether a scanned file is in fact the same as a previously scanned one. However, it occurred to me that comparing 2 checksums might also tell me how similar the two files are, which would be very useful too.

No. Checksums like MD5, SHA-1, and so on are pure hashes: even a change in a single bit typically results in a completely different checksum. There is simply no correlation between how the input changed and how the checksum changed.
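For illustration, a minimal sketch using Python's standard hashlib (the sample bytes are arbitrary): flipping a single bit yields an unrelated digest, so a checksum can only answer "same or different", never "how different".

import hashlib

original = b"The quick brown fox jumps over the lazy dog"
flipped = bytes([original[0] ^ 0x01]) + original[1:]   # flip one bit in the first byte

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(flipped).hexdigest())
# The two digests share no usable similarity, despite the inputs differing by one bit.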

Related

Renaming a Telegraf measurement with a wildcard

I'm collecting measurements with a Telegraf client. Unfortunately, the measurement name isn't static. Rather, it encodes a timestamp (terrible design choice, but out of my hands) as part of its name.
For example, the following 3 lines represent 3 instances of the same measurement, but have different names:
info.quorum.2902864.agree: 6
info.quorum.2902865.agree: 6
info.quorum.2902866.agree: 5
...
Is there a way to transform these measurement names into one static name? In other words, I'd like to transform the entries above into:
info.quorum.hello.agree: 6
info.quorum.hello.agree: 6
info.quorum.hello.agree: 5
I saw the rename processor (https://github.com/influxdata/telegraf/tree/master/plugins/processors/rename), but that doesn't support wildcards.
I also saw the regex processor (https://github.com/influxdata/telegraf/tree/master/plugins/processors/regex), but that doesn't support measurement names.
Any ideas on how to get this done?
EDIT: Some background: the measurements are collected with the HTTP input, using a GJSON path like a.b.*.c
EDIT2: Here's what I'm trying to parse. The problem is with the key '2931747', which grows on each subsequent reading:
"quorum" : {
"2931747" : {
"agree" : 8,
"disagree" : 0,
}
So they put the actual value there as the key... downright, eh, unsmart, let's put it this way.
And I won't blame the writers of the JSON parser for not providing a handle for that situation.
So, the answer is: in its current form, with the HTTP plugin and the available parsers & processors, there's no way to shape it into any proper form (unless you can drop that damned number-key completely - then it's trivial).
I would suggest pushing the data providers to stop this stupidity.
If that is not an option - you need to write your own processor for that, alas.
It could either be fully standalone (poll the HTTP endpoint, parse the payload, form a batch of line-protocol records, send them to Influx), or it could stop at producing Influx line protocol as its output and be executed with the Exec input plugin, as in the sketch below.
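For illustration, a minimal sketch of such a collector (Python, standard library only; the endpoint URL, the output measurement name "info.quorum.hello" and the integer-field assumption are placeholders, not a drop-in solution):

#!/usr/bin/env python3
"""Poll the HTTP endpoint, collapse the ever-changing numeric key and print
Influx line protocol on stdout, so it can be run via Telegraf's Exec input."""
import json
import urllib.request

URL = "http://localhost:8080/metrics"   # placeholder endpoint

def main():
    with urllib.request.urlopen(URL, timeout=5) as resp:
        payload = json.load(resp)

    for key, fields in payload.get("quorum", {}).items():
        # 'key' is the unwanted, ever-growing number; drop it (or keep it as a tag).
        if not isinstance(fields, dict):
            continue
        # 'i' suffix assumes integer field values, as in the sample JSON.
        field_set = ",".join(f"{name}={value}i" for name, value in fields.items())
        print(f"info.quorum.hello {field_set}")

if __name__ == "__main__":
    main()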

Inconsistency of pickle.load() and pickle.dump()

I'm building a neural network and need to test it on a modified CIFAR-10. I used keras.datasets.cifar10.load_data() to retrieve the dataset and then parsed it into a dict using pickle.load(datafile, encoding='bytes'). After some modifications I wrote the images back in a Keras-like format using pickle.dump().
I noticed that the resulting file after pickle.dump() is 53 bytes bigger than the source file. Even if I don't make any modifications and call dump() right after load(), the resulting file has 53 extra bytes. The structure of the resulting file doesn't seem to be violated, because I can restore images, labels and filenames from it and they are correct. But when I train and test the neural network (even the simplest NN from the example!) I get a very bad score (~0.5).
Please help me figure out how the load-dump cycle affects the NN's result if the structure in general doesn't change.
How can I load and dump so that the structure and size of the files stay unchanged? How can I avoid the inconsistency of the load-dump operation?
P.S. It looks like dump() writes some header to the file and doesn't write it if the header already exists (I tried applying load-dump twice, but the size changed only on the first pass). But how can I avoid writing this header?
Please help me figure out how the load-dump cycle affects the NN's result if the structure in general doesn't change.
If the structure does not change, it has no effect on the network. You can simply dump with human-readable output and compare the files. First, read about pickle protocols here.
How can I load and dump so that the structure and size of the files stay unchanged? How can I avoid the inconsistency of the load-dump operation?
There is no real inconsistency; it's probably just using another protocol or attaching headers. You should not have to deal with these internals. If you want to make your data human-readable or put it under version control, you can use JSON or another human-readable format (e.g. simplejson).
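As a minimal sketch of that point (file names are placeholders, and it assumes the batch is a dict of NumPy arrays / lists as in the CIFAR-10 batches): pin the protocol explicitly and verify the round trip is lossless before suspecting pickle for the bad score.

import pickle
import numpy as np

with open("data_batch_original", "rb") as f:        # placeholder path
    data = pickle.load(f, encoding="bytes")

# Pinning the protocol; a different protocol (or its framing header) accounts
# for a few extra bytes without changing the pickled content.
with open("data_batch_copy", "wb") as f:
    pickle.dump(data, f, protocol=2)

with open("data_batch_copy", "rb") as f:
    restored = pickle.load(f, encoding="bytes")

# If these checks pass, the dump/load cycle is not what hurts the score.
assert data.keys() == restored.keys()
assert all(np.array_equal(data[k], restored[k]) for k in data)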

ELKI: Normalization undo for result

I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running using -dbc.filter, but I would like to see the original data records, not the normalized ones, in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but has effectively been removed (for now), for several reasons:
Only a few normalizations ever supported this; most would fail.
There is no longer a well-defined "end" with the visualization: some users will want to visualize the normalized data, others not.
It requires carrying the normalization information along, which makes the data structures more complex (although the hierarchical approach we have now would allow this again).
Due to the numerical imprecision of floating-point math, you would frequently not get back exactly the same values you put in.
Keeping the original data in memory may be too expensive for some use cases, so we would need another parameter, "keep non-normalized data"; furthermore, you would need to choose which version (normalized or non-normalized) to use for analysis and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command-line interface. (This is easy to do with Java, too...)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest way is this: add a (non-numerical) label column, and you can then identify the original objects in your original data by this label.
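For example, a minimal pre-processing sketch that prepends such a label column (assuming the data lives in a comma-separated file and that ELKI's parser is configured for that separator; the file names are placeholders):

import csv

# Prepend a unique, non-numerical label to every row so the (normalized)
# ELKI output can be matched back to the original records by that label.
with open("data.csv", newline="") as src, \
     open("data_labeled.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for i, row in enumerate(csv.reader(src)):
        writer.writerow([f"obj_{i}", *row])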

How do content addressable storage systems deal with possible hash collisions?

Content addressable storage systems use the hash of the stored data as the identifier and the address. Collisions are incredibly rare, but if the system is used a lot for a long time, it might happen. What happens if there are two pieces of data that produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?
To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?
It is assumed that collisions do not happen, which is a perfectly reasonable assumption given a strong hash function and casual, non-malicious user input. SHA-1, which is what Camlistore currently uses, is also resistant to malicious attempts to produce a collision.
In case a hash function becomes weak with time and needs to be retired, Camlistore supports a migration to a new hash function for new blobrefs, while keeping old blob refs accessible.
If a collision did happen, as far as I understand, the first stored blobref with that hash would win.
source: https://groups.google.com/forum/#!topic/camlistore/wUOnH61rkCE
In an ideal collision-resistant system, when a new file / object is ingested:
1. A hash is computed of the incoming item.
2. If the incoming hash does not already exist in the store, the item data is saved and associated with the hash as its identifier.
3. If the incoming hash matches an existing hash in the store:
   - The existing data is retrieved.
   - A bit-by-bit comparison of the existing data is performed against the new data.
   - If the two copies are found to be identical, the new entry is linked to the existing hash.
   - If the copies are not identical, the new data is either rejected, or appended or prefixed* with additional data (e.g. a timestamp or user id) and re-hashed; this entire process is then repeated.
So no, it's not inevitable that information is lost in a content-addressable storage system.
* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged somehow (e.g. linked to a zero-byte payload) to indicate that multiple stored objects originally resolved to that hash (similar in concept to a 'disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.
Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)
The risk profile of such a system (e.g. a single git repository) is far different from that of an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparently random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.
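As a toy sketch of that ingest path (not Camlistore's actual implementation; SHA-256 and the in-memory dict are arbitrary choices for illustration):

import hashlib

class ContentStore:
    """Minimal content-addressable store with a bit-by-bit check on hash match."""

    def __init__(self):
        self._blobs = {}                       # hex digest -> bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        existing = self._blobs.get(digest)
        if existing is None:
            self._blobs[digest] = data         # new content: store it under its hash
        elif existing != data:                 # same hash, different bytes: a real collision
            # Reject here; alternatively, re-hash with extra data (timestamp, user id).
            raise ValueError(f"hash collision on {digest}")
        return digest                          # identical content simply links to the existing entry

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]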
See also, e.g.:
https://stackoverflow.com/a/2437377/5711986
Composite key, e.g. hash + userId

Store a checksum or similar for a file, to tell easily if it's the same as another file

In our app we have a table called support_files which stores documents that have been uploaded, mostly PDFs.
I'd like to get a unique list of these files; often the same file is uploaded more than once. I thought a way to do this would be to add a column to the database called "checksum", and then, for each file, calculate the checksum somehow and store it in the column. (This is obviously the slow part.)
Once this is done then I can easily filter out duplicates from my table by examining the checksum column.
Can anyone recommend a method to generate this checksum/hash/whatever? Ideally I'd like to generate a hash/checksum that's large enough to guarantee uniqueness, but small enough to fit into a string field in my database.
My server's running on Ubuntu server, and the total number of files I need to checksum is currently around 12,000. For the sake of argument assume it won't grow over 100,000.
A bit of Googling reveals sha1sum, but this may be more suited to telling if a file has been accidentally changed rather than if two files are different?
Take a look at Digest::SHA256; it can interface directly with files and it works great.
From the referenced documentation:
p Digest::SHA256.file("X11R6.8.2-src.tar.bz2").hexdigest
# => "f02e3c85572dc9ad7cb77c2a638e3be24cc1b5bea9fdbb0b0299c9668475c534"
