I need to compute a hash of a text that is in two pieces, held by two different parties, who should not have access to each others' texts.
Since SHA-1 is incremental I have heard that this should be possible but cannot find any answers with Google of any libraries that implement this.
I'd like the first party to SHA-1 hash its part of the text, and then feed the hash (digest) to the 2nd party, and they will continue with hashing, and compute the total digest of the combined texts. If this is possible, does anyone know of any library that takes a text and a previous hash as input arguments? Preferably in javascript or python but I'm actually happy to accept any language.
Update: I am looking at the source code of the forge (javascript) implementation of SHA-1, and see this code in the update method:
// initialize hash value for this chunk
a = s.h0;
b = s.h1;
c = s.h2;
d = s.h3;
e = s.h4;
And at the end of that method:
s.h0 = (s.h0 + a) | 0;
s.h1 = (s.h1 + b) | 0;
s.h2 = (s.h2 + c) | 0;
s.h3 = (s.h3 + d) | 0;
s.h4 = (s.h4 + e) | 0;
So it seems that by carrying these values + any remaining data in the input buffer over, one should be able to re-create the state of the SHA-1 object at party number 2. So a bit of the buffer will leak over but I think that should be fine. Will investigate further to see if this works.
Related
I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in matlab, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated matlab-specific/built-in functions.
My implementation initially seems OK, but when checking my work against matlab's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset already fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk=2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
kk = kk+1; %update the data counter
%perform online interpolation
if cc<length(qp)-1
if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
end
if qpi>x_data(kk), %if the current query point is already larger than the current sample, update the sample
kk = kk+1;
else %otherwise, update the query point to ensure its in between the samples for the next iteration
cc = cc + 1;
qpi = qp(cc);
%It is possible that if the change in x_data is greater than the resolution of the query
%points, an update like the above wont work. In this case, we must lag the data
if qpi<x_data(kk),
kk=kk-1;
end
end
end
end
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
figure;
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
set(leg1,'Location','Best');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
%linearly interpolate the function value y(x) over the query point qp
yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
end
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which are the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1) (you can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists, I just find starting at kk=1 more intuitive).
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test to ensure that x_data(kk+1) exists, nor that x_data(1)<pq(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk=1; % indexes through the data
for cc=1:numel(qp)
% advance kk to where we can interpolate
% (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
% but needs to be adjusted if this is not ensured prior to the loop)
while x_data(kk+1) < qp(cc)
kk = kk + 1;
end
% perform online interpolation
y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.
I am trying to learn how to use the Sparse Coding algorithm with the mlpack library. When I call Encode() on my instance of mlpack::sparse_coding:SparseCoding, I get the error
[WARN] There are 63 inactive atoms. They will be reinitialized randomly.
error: solve(): solution not found
Is it simply that the algorithm cannot learn a latent representation of the data. Or perhaps it is my usage? The relevant section follows
EDIT: One line was modified to fix an unrelated error, but the original error remains.
double* Application::GetSparseCodes(arma::mat* trainingExample, int atomCount)
{
double* latentRep = new double[atomCount];
mlpack::sparse_coding::SparseCoding<mlpack::sparse_coding::DataDependentRandomInitializer> sc(*trainingExample, Utils::ATOM_COUNT, 1.0);
sc.Encode(Utils::MAX_ITERATIONS);
arma::mat& latentRepMat = sc.Codes();
for (int i = 0; i < atomCount; i++)
latentRep[i] = latentRepMat.at(i, 0);
return latentRep;
}
Some relevant parameters
const static int IMAGE_WIDTH = 20;
const static int IMAGE_HEIGHT = 20;
const static int PIXEL_COUNT = IMAGE_WIDTH * IMAGE_HEIGHT;
const static int ATOM_COUNT = 64;
const static int MAX_ITERATIONS = 100000;
This could be one of a handful of issues but given the description it's a little difficult to tell which of these it is (or if it is something else entirely). However, these three ideas should provide a good place to start:
Matrices in mlpack are column-major. That means each observation should represent a column. If you use mlpack::data::Load() to load, e.g., a CSV file (which are generally one row per observation), it will automatically transpose the dataset. SparseCoding will act oddly if you pass it transposed data. See also http://www.mlpack.org/doxygen.php?doc=matrices.html.
If there are 63 inactive atoms, then only one atom is actually active (given that ATOM_COUNT is 64). This means that the algorithm has found that the best way to represent the dictionary (at a given step) uses only one atom. This could happen if the matrix you are passing consists of all zeros.
mlpack will provide verbose output, which may also be helpful for debugging. Usually this is used by using mlpack's CLI class to parse command-line input, but you can enable verbose output with mlpack::Log::Info.ignoreInput = false. You may obtain a lot of output that way, but it will give a better look at what is going on...
The mlpack project has its own mailing list where you may be likely to get a quicker or more comprehensive response, by the way.
I am generating a unique and random alphanumeric string segment to represent certain links that will be generated by the users. For doing that I was approaching with "uuid" number to ensure it's uniqueness and randomness, but, as per my requirements the string shouldn't be more than 5 characters long. So I dropped that idea.
Then I decided to generate such a string using random function of ruby and current time stamp.
The code for my random string goes like this:-
temp=DateTime.now
temp=temp + rand(DateTime.now.to_i)
temp= hash.abs.to_s(36)
What I did is that I stored the current DateTime in a temp variable and then I generated a random number passing the current datetime as parameter. Then in the second line actually added current datetime and random number together to make a unique and random string.
Soon I found,while I was testing my application in two different machines and send the request at the same time, it generated the same string(Though it's rare) once after more than 100 trials.
Now I'm thinking that I should add one more parameter like mac address or client ip address before passing to_s(36) on temp variable. But can't figure out how to do it and even then whether it will be unique or nor...
Thanks....
SecureRandom in ruby uses process id (if available) and current time. You can use the urlsafe_base64(n= 16) class method to generate the sequence you need. According to your requirements I think this is your best bet.
Edit: After a bit of testing, I still think that this approach will generate non-unique keys. The way I solved this problem for barcode generation was:
barcode= barcode_sql_id_hash("#{sql_id}#{keyword}")
Here, your keyword can be time + pid.
If you are certain that you will never need more than a given M amount of unique values, and you don't need more than rudimentary protection against guessing the next generated id, you can use a Linear Congruentual Generator to generate your identificators. All you have to do is remember the last id generated, and use that to generate a new one using the following formula:
newid = (A * oldid + B) mod M
If 2³² distinct id values are enough to suit your needs, try:
def generate_id
if #lcg
#lcg = (1664525 * #lcg + 1013904223) % (2**32)
else
#lcg = rand(2**32) # Random seed
end
end
Now just pick a suitable set of characters to represent the id in as little as 6 character. Uppercase and lowercase letters should do the trick, since (26+26)^6 > 2^32:
ENCODE_CHARS = [*?a..?z, *?A..?Z]
def encode(n)
6.times.map { |i|
n, mod = n.divmod(ENCODE_CHARS.size)
ENCODE_CHARS[mod]
}.join
end
Example:
> 10.times { n = generate_id ; puts "%10d = %s" % [n, encode(n)] }
2574974483 = dyhjOg
3636751446 = QxyuDj
368621501 = bBGvYa
1689949688 = yuTgxe
1457610999 = NqzsRd
3936504298 = MPpusk
133820481 = PQLpsa
2956135596 = yvXpOh
3269402651 = VFUhFi
724653758 = knLfVb
Due to the nature of the LCG, the generated id will not repeat until all 2³² values have been used exactly once each.
There is no way you can generate a unique UUID with only five chars, with chars and numbers you have a basic space of around 56 chars, so there is a max of 56^5 combinations , aprox 551 million (Around 2^29).
If with this scheme you were about to generate 10.000 UUIDs (A very low number of UUIDs) you would have a probability of 1/5.000 of generating a collision.
When using crypto, the standard definition of a big enough space to avert collisions is around 2^80.
To put this into perspective, your algorithm would be better off if it generated just a random integer (a 32 bit uint is 2^32, 8 times the size you are proposing) which is clearly a bad idea.
Background:
I find myself harnessing F# Records a lot. Currently I am working on a project for packet dissection & replay of a proprietary binary protocol (a protocol that is very strangely designed ...).
We define the skeleton record for the packet.
type bytes = byte array
type packetSkeleton = {
part1 : bytes
part2 : bytes
... }
Now, it is easy to use this to 'dissect' our packet, (really just giving names to the byte fields).
let dissect (raw : bytes) =
let slice a b = raw.[a..b]
{ part1 = slice 0 4
part2 = slice 4 5
... }
This works perfectly even for longish packets, we can even use some neat recursive functions if there is a predicable pattern to the slicing.
So I dissect the packet, pull out the fields that I need and create a packet based off the packetSkeleton using the fields I took from the dissection, which by now is starting to look a bit awkward:
let createAuthStub a b c d e f g h ... =
{ part1 = a; part2 = b
part3 = d; ...
}
Then, after creating the populated stub, I need to deserialise it to a form that can be put on the wire:
(* packetSkeleton -> byte array *)
let deserialise (packet : packetSkeleton) =
[| packet.part1; packet.part2; ... |]
let xab = dissect buf
let authStub = createAuthStub xab.part1 1 2 xab.part9 ...
deserialise authStub |> send
So it ends up that I have 3 areas, the record type, the creation of the record for a given packet, and the deserialised byte array. Something tells me that this is a poor design choice on my part in terms of code clarity, and I can already feel it starting to shoot me in the foot even at this early stage.
Questions:
a) Am I using the correct datatype for such a project? Is my approach correct?
b) Should I just give up on trying to make this code feel clean?
As I am kinda coding this by touch and go, I would appreciate some insights!
P.S I realise that this problem is quite suited for C, but F# is more fun (additionally verification of the dissector later on sounds appealing)!
If a packet could be rather large packetSkeleton might grow unwieldy. Another option is to work with the raw bytes and define a module that reads/writes each part.
module Packet
let Length = 42
let GetPart1 src = src.[0..3]
let SetPart1 src dst = Array.blit src 0 dst 0 4
let GetPart2 src = src.[4..5]
let SetPart2 src dst = Array.blit src 0 dst 4 2
...
open Packet
let createAuthStub bytes b c =
let resp = Array.zeroCreate Packet.Length
SetPart1 (GetPart1 bytes)
SetPart2 b resp
SetPart3 c resp
SetPart4 (GetPart9 bytes)
resp
This removes the need for de/serialization functions (and probably helps performance a bit).
EDIT
Creating a wrapper type is another option
type Packet(bytes: byte[]) =
new() = Packet(Array.zeroCreate Packet.Length)
static member Length = 42
member x.Part1
with get() = bytes.[0..3]
and set value = Array.blit value 0 bytes 0 4
...
which might reduce code a bit:
let createAuthStub (req: Packet) b c =
let resp = Packet()
resp.Part1 <- req.Part1
resp.Part2 <- b
resp.Part3 <- c
resp.Part4 <- req.Part9
resp
I think your approach is essentially sound - but of course, it is difficult to tell without knowing more details.
I think one key idea that shows in your code and that is key to functional architecture is the separation between types (used to model the problem domain) and the processing functionality that creates the values of the domain model, processes it and formats them.
In your case:
The types bytes and packetSkeleton model the problem domain
The function createAuthStub processes your domain (and I agree with Daniel that it might be more readable if it took the whole packetSkeleton as an argument)
The function deserialize turns your domain back to bytes
I think this way of structuring code is quite good, because it separates different concerns of the program. I even wrote an article that tries to describe this as a more general programming approach.
Is there any input that SHA-1 will compute to a hex value of fourty-zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, becuase SHA1 is cryptographically strong, it would also be computationally unfeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: Your code will permanently be broken if it does.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0x98BADCFE & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us w[0] = 0x9fb498b3, etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at last 32 other input values.
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People doesn't seem realize just how improbable it is (if all humanity focused all of it's current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.