I'm trying to parse a very long file by using a fixed-size sliding window over it. For each such window I'd like to either insert it to a HashMap as the key with custom struct as value, or modify existing value for the window. My main problem is memory usage, since it should scale into very large quantities (up to several billions of distinct windows) and I want to reuse existing keys.
I would like to append windows (or more specifically bytes) to a vector and use the index as a key in the HashMap, but use the window under index for hash computation and key comparison. Because windows are overlapping, I will append only the part of the window which is new (if I have an input AAAB and size 3 I would have 2 windows: AAA and AAB, but would only store 4 bytes - AAAB; windows would have indices 0 and 1 respectively), which is the reason behind not keying the HM with window itself.
Here's the simplified pseudo-code, in which I omitted the minimal-input problem:
let file = // content of the file on which I can call windows()
let vec = Rc::new(RefCell::new(Vec::new())); // RefCell allows me to store Rc in the hashmap while mutating the underlying vector
let mut hm: HashMap<KeyStruct, ValueStruct> = HashMap::new();
for i in file.windows(FIXED_SIZE) {
    let idx = vec.borrow().len(); // borrow the Vec through the RefCell before calling len()
    vec.borrow_mut().push(i);
    if hm.contains_key(&KeyStruct::new(idx)) { // contains_key takes a reference
        // get the key associated with i
        // modify the value associated with i
        // do something with the key
        vec.borrow_mut().pop(); // window is already in the vector
    } else {
        hm.insert(KeyStruct::new(idx), ValueStruct::new(...));
    }
}
I have come up with 2 different approaches: either modifying the existing HashMap implementation so that it works as intended, or using a custom struct as key to the HashMap. Since I would only use one vector in order to store windows, I could store a Rc to it in the HashMap and then use that for lookups.
I could also create a struct which would hold both a Rc and index, using it as a key to the HashMap. The latter solution works with a vanilla HashMap, but stores a lot of redundant Rcs to the same vector. I also thought about storing a static pointer to Rc and then get Rc in unsafe blocks, but I would have to guarantee that the position of the Rc on the stack never changes and I'm not sure if I can guarantee that.
I tried to implement the first approach (custom HashMap), but it turns out that Buckets use a lot of features which are gated, and I can't compile the project using the stable compiler.
What's even worse is that I would like to get the key that is already in the HashMap on a successful lookup (because different indices can store the same window, for which the hash/cmp would be identical) and use it inside the value structure. I couldn't find a way to do this using the provided API for HashMap - the closest I get is by using entry(), which can contain an OccupiedEntry, but it doesn't have any way to retrieve the key, and there's no way to get it by unsafe memory lookups, because documentation on repr() says that the order in structs is not guaranteed in the default representation. I can store the key (or only the index) in the value struct, but that adds yet another size_of::<usize>() bytes per entry, only to store the index/key in a reachable manner, which is kept with that entry either way.
My questions are:
Is it possible to compile/reuse parts of std::collections which are not pub, such that I could modify few methods of HashMap and compile the whole project?
Is there any way of getting the key after successful lookup in the HashMap? (I even found out that libs team decided against implementing method over Entry which would allow me to get the key...)
Can you see any alternative to solutions that I mentioned?
EDIT
To clarify the problem let's consider a simple example - input ABABCBACBC and window size of 2. We should give index as a key to the HashMap, and it should get the window-size number of bytes as window starting from that index: with vector [A, A, C], index 1 and window-size 2 HashMap should try to find a hash/key for AC.
We get windows like this:
AB -> BA -> AB -> BC -> CB -> BA -> AC -> CB -> BC
First pair is AB, we append it into the empty vector and give it an index of 0.
vec = [A, B]
hm = [(0, val)]
The next pair is BA:
start with vec = [A, B]
using algorithm not shown here, I know that I have a common part between last inserted window (AB) and current window (BA), namely B
append part of the window to the existing vector, so we have vec = [A, B, A]
perform a lookup using index 1 as the index of window
it has not been found so the new key, val is inserted to HashMap
vec = [A, B, A]
hm = [(0, val0), (1, val1)]
Next up is window AB:
once again we have a common part - A
append: vec = [A, B, A, B]
lookup using index 2
it is successful, so I should delete the newly inserted part of window and get the index of the window inside vector - in this case 0
modify value, do something with the key etc...
vec = [A, B, A]
hm = [(0, val0_modified), (1, val1)]
After looping over this input I should end up with:
vec = [A, B, A, B, C, B, A, C]
and indices for pairs could be represented as: [(AB, 0), (BA, 1), (BC, 3), (CB, 4), (AC, 6)]
I do not want to modify keys. I also don't want to modify the vector with the exception of pushing/popping the window during lookup/insertion.
Sidenote: even though I still have redundant information in this particular example after putting everything into vector, it won't be the case while working with the original data.
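The walkthrough above can be sketched in Python to check the expected vector and indices. This is only an illustration of the bookkeeping, not a memory-efficient solution: `WindowKey` and `index_windows` are hypothetical names, each key holds a reference to the shared buffer (the same redundancy as the "Rc plus index" struct), and the overlap search is a naive stand-in for the algorithm not shown in the question.

```python
class WindowKey:
    """Key that stores only an index; hashing and equality read the shared buffer."""
    __slots__ = ("buf", "idx", "size")

    def __init__(self, buf, idx, size):
        self.buf, self.idx, self.size = buf, idx, size

    def window(self):
        # Copy of the bytes this key currently refers to.
        return bytes(self.buf[self.idx:self.idx + self.size])

    def __hash__(self):
        return hash(self.window())

    def __eq__(self, other):
        return isinstance(other, WindowKey) and self.window() == other.window()


def index_windows(data, size):
    """Return (buffer, counts) where counts maps WindowKey -> number of occurrences."""
    buf = bytearray()
    counts = {}
    for start in range(len(data) - size + 1):
        window = data[start:start + size]
        # Longest suffix of buf (at most size - 1 bytes) that is a prefix of window.
        overlap = 0
        for k in range(min(size - 1, len(buf)), 0, -1):
            if buf[-k:] == window[:k]:
                overlap = k
                break
        added = size - overlap
        buf.extend(window[overlap:])          # append only the new part
        probe = WindowKey(buf, len(buf) - size, size)
        if probe in counts:
            counts[probe] += 1                # updates the entry under the stored key
            del buf[-added:]                  # window already known: undo the append
        else:
            counts[probe] = 1
    return buf, counts


buf, counts = index_windows(b"ABABCBACBC", 2)
print(bytes(buf))
print(sorted((key.idx, key.window()) for key in counts))
```

Iterating over `counts` yields the stored keys, so the index of the first occurrence of each window is recoverable without duplicating it in the value.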
I'm trying to iterate over 2 parameters to get two splines for each pair. The code:
y_arr:[0.2487,0.40323333333333,0.55776666666667,0.7123]$
str_h_arr:[-0.8,-1.0,-1.2,-1.4]$
z_points:[0,0.1225,0.245,0.3675,0.49,0.6125,0.735,0.8575,0.98,1.1025,1.225,1.3475,1.47,
1.5925,1.715,1.8375,1.96,2.0825,2.205,2.26625,2.3275,2.3765,2.401,2.4255,2.43775,
2.4451,2.448775,2.45]$
length(a)$
length(b)$
load(interpol)$
for y_k:1 thru length(a) do (
for h_k:1 thru length(b) do (
y:y_arr[y_k],
str_h:str_h_arr[h_k],
bot_startpoints: [[-2.45,0],[0,y],[2.45,0]],
top_startpoints: [[-2.45,str_h_min],[0,y+str_h],[2.45,str_h_min]],
spline: cspline(bot_startpoints),
bot(x):=''spline,
print(bot(0))
)
);
//Part with top spline is skipped.
For all iterations output is now the same: 0.7123
What I want to get is two splines like in picture
Members of y_arr are y values in x=0, str_h_arr: height between splines in x=0.
So bot(0) should give me all values from y_arr.
If I don't use a loop and just give this block fixed values of y_k and h_k, it works properly.
Can anybody point me to where I'm (or Maxima is) wrong with using loop with cspline?
The problem is that quote-quote (two single quotes, '') is applied only once, when the expression is read from input; it is not applied every time the expression is evaluated in the loop.
Looks like you need only to evaluate the spline at x = 0 and nothing else. So I'll suggest ev(spline, x=0) to evaluate it. You can also construct a lambda expression and evaluate that.
Here is the program after I've revised it as described above. Also, it is simpler and clearer to write for y in y_arr do (...) rather than making use of an explicit index for y_arr.
y_arr:[0.2487,0.40323333333333,0.55776666666667,0.7123]$
str_h_arr:[-0.8,-1.0,-1.2,-1.4]$
z_points:[0,0.1225,0.245,0.3675,0.49,0.6125,0.735,0.8575,0.98,1.1025,1.225,1.3475,1.47,
1.5925,1.715,1.8375,1.96,2.0825,2.205,2.26625,2.3275,2.3765,2.401,2.4255,2.43775,
2.4451,2.448775,2.45]$
load(interpol)$
for y in y_arr do (
for str_h in str_h_arr do (
bot_startpoints: [[-2.45,0],[0,y],[2.45,0]],
top_startpoints: [[-2.45,str_h_min],[0,y+str_h],[2.45,str_h_min]],
spline: cspline(bot_startpoints),
print (ev (spline, x=0))));
This is the output I get:
0.2487
0.2487
0.2487
0.2487
0.40323333333333
0.40323333333333
0.40323333333333
0.40323333333333
0.55776666666667
0.55776666666667
0.55776666666667
0.55776666666667
0.7123
0.7123
0.7123
0.7123
I need to redraw Pascal's triangle and explain the Fibonacci sequence embedded in it. I need to look at the first 12 rows of the triangle (which ends at the number 144 in the Fibonacci sequence). I understand this part, as I am just explaining how the numbers along each diagonal sum to a Fibonacci number.
But I need to use the fact that the rth number in the nth row of the triangle is
C(n, r) = n! / (r! (n-r)!)
This last part is what's confusing me. How can I use C(n, r) to explain the Fibonacci sequence in the triangle?
Please help. Thanks.
Consider the following problem :
In how many ways can you go up a ladder of n steps if you can take either a single step at a time or 2 steps at a time?
Solution 1 : Let's construct a recurrence relation for this problem. It's pretty clear that the recurrence would be something like this : a(n) = a(n-1) + a(n-2); where a(1)=1 and a(2)=2
Thus, the answer for n is the (n+1)th Fibonacci term (counting from F(1) = F(2) = 1).
Solution 2 : Each unique way of climbing up the ladder corresponds to a unique sequence of 1's and 2's which adds up to n. The number of such sequences thus would be our answer. Let's start counting such sequences :
Number of sequences without a 2 = $ {n \choose 0 } $.
Number of sequences with one 2 = $ {n-1 \choose 1 } $.
.
.
.
and so on.
In case of even n, the last term would be $ {n/2 \choose n/2 } $.
And for odd n, it would be $ {(n+1)/2 \choose (n-1)/2 } $.
As you can see, these are the diagonal terms in Pascal's triangle.
Since these two solutions count the same thing, they must be equal. Thus we get the relation between the Fibonacci numbers and the diagonals of Pascal's triangle.
For further details, see
http://ms.appliedprobability.org/data/files/Articles%2033/33-1-5.pdf
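The equality of the two counts can be checked numerically. The sketch below sums the diagonal terms C(n-k, k) from Solution 2 and compares them with the (n+1)th Fibonacci number from Solution 1; the function names `diagonal_sum` and `fib` are chosen here for illustration, and `fib` assumes the convention F(1) = F(2) = 1.

```python
from math import comb


def diagonal_sum(n):
    """Count sequences of 1's and 2's summing to n: choose positions for k twos."""
    return sum(comb(n - k, k) for k in range(n // 2 + 1))


def fib(m):
    """m-th Fibonacci number with F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(m):
        a, b = b, a + b
    return a


for n in range(1, 12):
    print(n, diagonal_sum(n), fib(n + 1))
```

For n = 11 the diagonal terms are 1 + 10 + 36 + 56 + 35 + 6 = 144, which is the 12th Fibonacci number mentioned in the question.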
I have a set S. It contains N subsets (which in turn contain some sub-subsets of various lengths):
1. [[a,b],[c,d],[*]]
2. [[c],[d],[e,f],[*]]
3. [[d,e],[f],[f,*]]
N. ...
I also have a list L of 'unique' elements that are contained in the set S:
a, b, c, d, e, f, *
I need to find all possible combinations between each sub-subset from each subset so, that each resulting combination has exactly one element from the list L, but any number of occurrences of the element [*] (it is a wildcard element).
So, the result of the needed function working with the above mentioned set S should be (not 100% accurate):
- [a,b],[c],[d,e],[f];
- [a,b],[c],[*],[d,e],[f];
- [a,b],[c],[d,e],[f],[*];
- [a,b],[c],[d,e],[f,*],[*];
So, basically I need an algorithm that does the following:
take a sub-subset from the subset 1,
add one more sub-subset from the subset 2 maintaining the list of 'unique' elements acquired so far (the check on the 'unique' list is skipped if the sub-subset contains the * element);
Repeat 2 until N is reached.
In other words, I need to generate all possible 'chains' (it is pairs, if N == 2, and triples if N==3), but each 'chain' should contain exactly one element from the list L except the wildcard element * that can occur many times in each generated chain.
I know how to do this with N == 2 (it is a simple pair generation), but I do not know how to enhance the algorithm to work with arbitrary values for N.
Maybe Stirling numbers of the second kind could help here, but I do not know how to apply them to get the desired result.
Note: The type of data structure to be used here is not important for me.
Note: This question has grown out from my previous similar question.
These are some pointers (not complete code) that can probably take you in the right direction:
I don't think you will need any advanced data structures here (make use of Erlang list comprehensions). You should also explore the Erlang sets and lists modules. Since you are dealing with sets and lists of subsets, they seem like an ideal fit.
Here is how list comprehensions make this easy: [{X,Y} || X <- [[c],[d],[e,f]], Y <- [[a,b],[c,d]]]. Here I am simply generating a list of {X,Y} 2-tuples, but for your use case you will have to put real logic here (including your star case).
Further note that with list comprehensions, you can use output of one generator as input of a later generator e.g. [{X,Y} || X1 <- [[c],[d],[e,f]], X <- X1, Y1 <- [[a,b],[c,d]], Y <- Y1].
Also for removing duplicates from a list of things L = ["a", "b", "a"]., you can anytime simply do sets:to_list(sets:from_list(L)).
With above tools you can easily generate all possible chains and also enforce your logic as these chains get generated.
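As a rough illustration of generating chains for arbitrary N while enforcing the uniqueness rule as they are built, here is a minimal recursive sketch in Python rather than Erlang. It rests on assumptions the question leaves open: exactly one sub-subset is taken from each subset, a sub-subset containing `*` skips the uniqueness check entirely, and its other elements are not recorded as used. The name `chains` is hypothetical.

```python
WILDCARD = "*"


def chains(subsets):
    """Yield every chain with one group per subset and no repeated non-wildcard element."""

    def extend(level, chain, seen):
        if level == len(subsets):
            yield chain
            return
        for group in subsets[level]:
            if WILDCARD in group:
                # Assumption: wildcard groups bypass the uniqueness check.
                yield from extend(level + 1, chain + [group], seen)
            elif seen.isdisjoint(group):
                yield from extend(level + 1, chain + [group], seen | set(group))

    yield from extend(0, [], frozenset())


S = [
    [["a", "b"], ["c", "d"], ["*"]],
    [["c"], ["d"], ["e", "f"], ["*"]],
    [["d", "e"], ["f"], ["f", "*"]],
]

for chain in chains(S):
    print(chain)
```

Pruning inside the recursion avoids building the full Cartesian product first, which matters once N grows.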
Is there any input for which SHA-1 will compute a hex value of forty zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
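The one-in-2^160 figure can be reproduced directly. This small sketch treats SHA-1 output as uniformly distributed over all 160-bit values, which is the assumption behind the estimate rather than a proven property; the variable names are illustrative.

```python
# Chance that a single uniformly random 160-bit digest is all zeros.
p_zero = 2.0 ** -160          # exactly representable as a float
percent = p_zero * 100

print(f"probability: {p_zero:.6e}")
print(f"as a percentage: {percent:.6e} %")
```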
Also, because SHA-1 is cryptographically strong, it would also be computationally infeasible (at least with current computer technology; all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel, be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: if an input ever does hash to the sentinel, your code will be broken from then on.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
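A minimal sketch of the empty-string sentinel idea, using Python's standard `hashlib`. The function name `digest_or_fail` and the choice to raise `ValueError` and `RuntimeError` are assumptions made for illustration; the explicit check is used instead of `assert` so that it is not stripped in optimized runs.

```python
import hashlib

# SHA-1 of the empty string, reserved here as the sentinel value.
SENTINEL = "da39a3ee5e6b4b0d3255bfef95601890afd80709"


def digest_or_fail(data: bytes) -> str:
    """Hash non-empty input and refuse anything that collides with the sentinel."""
    if not data:
        raise ValueError("empty input is reserved for the sentinel")
    digest = hashlib.sha1(data).hexdigest()
    if digest == SENTINEL:
        # Should be unreachable in practice, but the check must stay in production.
        raise RuntimeError("input hashed to the sentinel value")
    return digest


print(hashlib.sha1(b"").hexdigest() == SENTINEL)
print(digest_or_fail(b"abc"))
```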
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0xEFCDAB89 & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives (a leftrotate 5) + f + e + k = 0x9fb498b3 and therefore w[0] = 0 - 0x9fb498b3 = 0x604b674d (mod 2^32), etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at least 32 other input values.
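The arithmetic for the first round can be checked with 32-bit masking. This sketch uses the standard SHA-1 initial values and first-round constant quoted above; `rotl` is a helper defined here. It computes f from b = h1, c = h2, d = h3, then the partial sum (a leftrotate 5) + f + e + k, and finally the w[0] that would make temp zero.

```python
MASK = 0xFFFFFFFF  # all arithmetic is modulo 2^32


def rotl(x, n):
    """Circular left rotation of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK


h0, h1, h2, h3, h4 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0
a, b, c, d, e = h0, h1, h2, h3, h4
k = 0x5A827999

f = ((b & c) | (~b & d)) & MASK
partial = (rotl(a, 5) + f + e + k) & MASK
w0 = (-partial) & MASK  # value of w[0] that makes temp overflow to zero

print(f"f       = {f:#010x}")
print(f"partial = {partial:#010x}")
print(f"w[0]    = {w0:#010x}")
```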
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
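To make the condition concrete, the sketch below computes the working values a through e that would be needed for the final addition to produce all zeros, assuming a single-block message so that h0 through h4 still hold the standard SHA-1 initial values. It says nothing about whether any input actually produces those working values; `finalize` is an illustrative helper.

```python
MASK = 0xFFFFFFFF  # additions are modulo 2^32

# Standard SHA-1 initial chaining values h0..h4.
H = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def finalize(h, working):
    """The end-of-block step: add the working variables into the chaining values."""
    return tuple((x + y) & MASK for x, y in zip(h, working))


# Working values a..e that would cancel the chaining values exactly.
needed = tuple((-x) & MASK for x in H)

print([f"{v:#010x}" for v in needed])
print(finalize(H, needed))
```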
To answer your question: nobody knows whether there exists an input that produces an all-zero output, but cryptographers expect that there is, based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People don't seem to realize just how improbable it is (if all humanity focused all of its current resources on finding a zero hash by brute force, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.