CryptoPP::OID CURVE = CryptoPP::ASN1::secp256r1();
CryptoPP::AutoSeededRandomPool prng;
std::vector<kpStruct> KPVecRSU;

for (size_t i = 0; i < 3; ++i)   // one key pair per unit
{
    kpStruct keyP;
    CryptoPP::ECDH<CryptoPP::ECP>::Domain dhA(CURVE);
    CryptoPP::SecByteBlock privA(dhA.PrivateKeyLength()), pubA(dhA.PublicKeyLength());
    dhA.GenerateKeyPair(prng, privA, pubA);
    CryptoPP::SecByteBlock sharedA(dhA.AgreedValueLength());   // allocated, but Agree() is never called
    keyP.sharedECDH = sharedA;
    KPVecRSU.push_back(keyP);
}
I want to create a shared secret between 3 units, but this code gives me different ones! Any idea, please?
ECDH shared secret doesn't match in loop, with Crypto++
Each run of the protocol produces a different shared secret because both the client and server are contributing random values during the key agreement. The inherent randomness provides forward secrecy, meaning bad guys cannot recover plaintext at a later point in time because the random values were temporary or ephemeral (forgotten after the protocol execution).
In the Crypto++ implementation, the library does not even make a distinction between client and server because there's so much symmetry in the protocol. Protocols with too much symmetry can suffer the Chess Grand-Master attack, where one protocol execution is used to solve another protocol execution (think of it like a man-in-the-middle, where the bad guy is a proxy for both grand-masters). Often, you tweak a parameter on one side or the other to break the symmetry (client uses 14-byte random, server uses 18-byte random).
Other key agreement schemes we are adding do need to make the distinction between client and server, like Hashed MQV (HMQV) and Fully Hashed MQV (FHMQV). Client and Server are called Initiator and Responder in HMQV and FHMQV.
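As a side note on the mechanics: the two secrets only come out equal once each party completes the agreement with the other party's public key - in Crypto++ that is the Agree() call, e.g. dhA.Agree(sharedA, privA, pubB) where pubB would be the other unit's public key, and the snippet in the question never makes that call. Below is a minimal two-party sketch of that flow; it uses the JDK's KeyAgreement API instead of Crypto++ purely to keep the example short and self-contained, so treat the class and variable names as illustrative.
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.spec.ECGenParameterSpec;
import java.util.Arrays;
import javax.crypto.KeyAgreement;

public class EcdhPairSketch {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("EC");
        kpg.initialize(new ECGenParameterSpec("secp256r1"));
        KeyPair alice = kpg.generateKeyPair();
        KeyPair bob = kpg.generateKeyPair();

        // Alice mixes her private key with Bob's public key...
        KeyAgreement kaA = KeyAgreement.getInstance("ECDH");
        kaA.init(alice.getPrivate());
        kaA.doPhase(bob.getPublic(), true);
        byte[] secretA = kaA.generateSecret();

        // ...and Bob mixes his private key with Alice's public key.
        KeyAgreement kaB = KeyAgreement.getInstance("ECDH");
        kaB.init(bob.getPrivate());
        kaB.doPhase(alice.getPublic(), true);
        byte[] secretB = kaB.generateSecret();

        System.out.println(Arrays.equals(secretA, secretB));   // prints true
    }
}
Generate fresh key pairs on every run and you keep the forward secrecy described above; reuse long-term key pairs and the secret becomes stable, but that property is lost.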
I want to create a shared secret between 3 units, but this code gives me different ones.
This is a different problem. This is known as Group Diffie-Hellman or Multi-party Diffie-Hellman. It has applications in, for example, chat and broadcast of protected content, where users are part of a group or join a group. The trickier part of the problem is how to revoke access to a group when a user leaves the group or is no longer authorized.
Crypto++ does not provide any group DH schemes, as far as I know. You may be able to modify existing sources to do it.
For Group Diffie-Hellman, you need to search Google Scholar for the papers. Pay particular attention to the security attributes of the scheme, like how to join and leave a group (grant and revoke access).
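To make the three-party case concrete, here is a toy sketch of the simplest ring-style group scheme (each unit exponentiates what it receives and passes it on, so everyone ends up with g^(abc)). It is plain finite-field Diffie-Hellman written in Java with demo-sized parameters - it is not a Crypto++ feature, and it deliberately ignores the join/leave (grant/revoke) issues mentioned above.
import java.math.BigInteger;
import java.security.SecureRandom;

public class GroupDhSketch {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        // Demo-sized parameters; a real deployment would use a standardized group.
        BigInteger p = BigInteger.probablePrime(512, rnd);
        BigInteger g = BigInteger.valueOf(2);

        // Each of the three units picks a private exponent.
        BigInteger a = new BigInteger(256, rnd);
        BigInteger b = new BigInteger(256, rnd);
        BigInteger c = new BigInteger(256, rnd);

        // Round 1: every unit publishes g^x and passes it around the ring.
        BigInteger ga = g.modPow(a, p);
        BigInteger gb = g.modPow(b, p);
        BigInteger gc = g.modPow(c, p);

        // Round 2: each unit exponentiates the value it received and forwards it.
        BigInteger gab = ga.modPow(b, p);   // B raises A's value
        BigInteger gbc = gb.modPow(c, p);   // C raises B's value
        BigInteger gca = gc.modPow(a, p);   // A raises C's value

        // Round 3: one final exponentiation gives every unit the same g^(abc).
        BigInteger keyA = gbc.modPow(a, p);
        BigInteger keyB = gca.modPow(b, p);
        BigInteger keyC = gab.modPow(c, p);

        System.out.println(keyA.equals(keyB) && keyB.equals(keyC));   // prints true
    }
}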
collected_output=tff.federated_collect(client_outputs).
Please refer to this question for detailed code.
My question is about the difference between the parts marked in red in the photo. In terms of the FL algorithm, I think client_outputs is an individual client's output and collected_output is a SequenceType because each client_outputs is combined. Is this correct? If my guess is correct, is member a set of individual client members with client_outputs?
The terminology may be a bit tricky. client_outputs isn't quite an "individual client's output"; it still represents all client outputs, but they aren't individually addressable. Importantly, TFF distinguishes that the data lives ("is placed") at the clients; it has not been communicated. collected_outputs is in some sense a stream of all individual client outputs, though the placement has changed (the values were communicated) to the server via tff.federated_collect.
In a bit more detail:
In the type specification above .member is an attribute on the tff.FederatedType. The TFF Guide Federated Core > Type System is a good resource for more details about the different TFF types.
{int32}#CLIENTS represents a federated value that consists of a set of potentially distinct integers, one per client device. Note that we are talking about a single federated value as encompassing multiple items of data that appear in multiple locations across the network. One way to think about it is as a kind of tensor with a "network" dimension, although this analogy is not perfect because TFF does not permit random access to member constituents of a federated value.
In the screenshot, client_outputs is also "placed" #CLIENTS (from the .placement attribute) and follows similar semantics: it has multiple values (one per client) but individual values are not addressable (i.e. the value does not behave like a Python list).
In contrast, the collected_output is placed #SERVER. Then this bullet:
<weights=float32[10,5],bias=float32[5]>#SERVER represents a named tuple of weight and bias tensors at the server. Since we've dropped the curly braces, this indicates the all_equal bit is set, i.e., there's only a single tuple (regardless of how many server replicas there might be in a cluster hosting this value).
Notice the "single tuple" phrase, after tff.federated_collect there is a single sequence of values placed at the server. This sequence can be iterated over like a stream.
In How to create custom Combine.PerKey in beam sdk 2.0, I asked and got a correct answer on how to create a custom Combine.PerKey in the new beam sdk 2.0. However, I now need to create a custom combinePerKey such that within my custom CombinePerKey logic, I need to be able to access the contents of the key. This was easily possible in dataflow 1.x, but in the new beam sdk 2.0, I'm unsure how to do so. Any little code snippet/example would be extremely useful.
EDIT #1 (per Ben Chambers's request)
The real use case is hard to explain, but I'm going to try:
We have a 3d space composed of millions of little hills. We try to determine the apex of these millions of hills as follows: we create billions of "rectangular probes" for the whole 3d space, and then we ask each of these billions of probes to "move" in a greedy way to the apex. Once it hits the apex, it stops. The probe then returns the apex and itself. The apex is the KEY for which we'll do a custom combine by key.
Now, the custom combine function is going to finally return a final object (called a feature) which is derived from all the probes that reach the same apex (i.e. the same key). When generating this "feature" object, we need to know information about the final apex/key (i.e. the top of the hill). Hence, I need this key info.
One way to solve this is using a GroupByKey, but that was slow (at least in df 1.x); we got it to be fast (in df 1.x) using a custom combine fn. So, we'd like the key. That said, GroupByKey works in Beam SDK 2.0.
Alternatively, we could stick the "apex" information into the "probe" objects themselves, but this means that each of our billions of probe objects now needs to be tripled in size just to hold this apex information (and this apex information repeats itself, since there are only, say, 1 million apexes but 1 billion probes), so this intuitively feels highly inefficient.
Rather than relying on the CombineFn to compute the entire result, could you instead have the CombineFn compute some partial result based only on information about the probes? Then your Combine.perKey(...) returns a PCollection<KV<Apex, InfoAboutProbes>> and you can use a ParDo to combine the information about the apex with the summary information about the probes. This allows you to use the CombineFn for efficiently combining information about many probes, while using a ParDo to access the key.
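A rough sketch of that split, with placeholder types (Apex, Probe, ProbeSummary, Feature), a hypothetical SummarizeProbesFn standing in for your custom CombineFn, and a hypothetical buildFeature helper - only the Combine.perKey plus ParDo wiring is the point:
PCollection<KV<Apex, Probe>> keyedProbes = ...;

// Combine efficiently summarizes the probes for each apex; no key access is needed here.
PCollection<KV<Apex, ProbeSummary>> summaries =
    keyedProbes.apply(Combine.perKey(new SummarizeProbesFn()));

// A ParDo then joins the key (the apex) with the combined summary to build the feature.
PCollection<Feature> features = summaries.apply(ParDo.of(
    new DoFn<KV<Apex, ProbeSummary>, Feature>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Apex apex = c.element().getKey();            // the key is available here
        ProbeSummary summary = c.element().getValue();
        c.output(buildFeature(apex, summary));       // hypothetical helper
      }
    }));
Because there is only one small ProbeSummary per key (rather than a billion probes), the extra ParDo adds very little cost compared to carrying the apex inside every probe.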
I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive has IDs (a user-id, session-id and sequence-id) and we want to remove data with matching IDs within N-hour time windows. A straightforward approach is to use an external key store (BigTable or something similar) where we look up keys and write if required, but our QPS is extremely large, making maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a time window so that all data for a user within a time window falls within the same group, and then, within each group, use a separate key-store service where we look up duplicates by the key. So, I have a few questions about this approach:
[1] If I run a groupBy transform, is there any guarantee that each group will be processed on the same worker? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If that is feasible, my next question is whether we can run such other services on each of the worker machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write an ID to.
The idea seems outside what Dataflow is supposed to do, but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) You're welcome to run other services on the Dataflow VMs since they are general purpose, but note that you will have to contend with the resource requirements of the other applications on the host, potentially causing out-of-memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your use case.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
    Window.<T>into(Sessions.withGapDuration(Duration.standardHours(N))));
Each new element will keep extending the length of the window, so it won't close until the gap closes. You would also apply a speculative element-count trigger of 1 together with an AfterWatermark trigger on a downstream GroupByKey. The trigger will fire as soon as it can: once it has seen at least one element, and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out any element which isn't an early firing, based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
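For the last box in the diagram, here is a rough sketch (placeholder names, written in the same pre-Beam SDK style as the other snippets in this thread) of a DoFn that keeps only the speculative EARLY pane for each session and drops the rest:
// 'grouped' is the output of the GroupByKey above: PCollection<KV<String, Iterable<T>>>.
PCollection<T> deduped = grouped.apply(ParDo.of(
    new DoFn<KV<String, Iterable<T>>, T>() {
      @Override
      public void processElement(ProcessContext c) {
        // The EARLY pane fires as soon as the first element of the session arrives;
        // later firings for the same session key carry the duplicates, so drop them.
        if (c.pane().getTiming() == PaneInfo.Timing.EARLY) {
          c.output(c.element().getValue().iterator().next());
        }
      }
    }));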
It is sort of unclear from your description; can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have an incoming data stream, each item of which can be uniquely identified by its fields (a user-id, session-id and sequence-id). We also know that duplicates occur pretty far apart and, for now, we care about those within a 6 hour window. And regarding the volume of data, we have at least 100K events every second, which span a million different users - so within this 6 hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<Key, T>> windowed_pc = pc.apply(
    Window.<KV<Key, T>>into(Sessions.withGapDuration(Duration.standardHours(6))));
where Key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per key. This would mean that Dataflow would have too many open sessions at any given time waiting for them to close, and I was worried whether it would scale or I would run into any bottlenecks.
[2] Once I perform sessioning as above, I have already removed the duplicates based on the firings, since I will only care about the first firing in each session, which already discards duplicates. I no longer need the RemoveDuplicates transform, which I found was a combination of the (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] is going to be a problem, the alternative is to reduce the key for sessioning to just user-id and session-id, ignoring the sequence-id, and then run a RemoveDuplicates on top of each resulting window by sequence-id. This might reduce the number of open sessions, but it would still leave a lot of open sessions (#users * #sessions per user), which can easily run into the millions. FWIW, I don't think we can session only by user-id, since then the session might never close, as different sessions for the same user could keep coming in, and determining the session gap in this scenario also becomes infeasible.
Hope my problem is a little clearer this time. Please let me know if any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale and, as long as I provide sufficient workers and disks, the pipeline scales well, although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the on-time panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one, or could I do something better?
events.apply(
    WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
      @Override
      public java.lang.String apply(EventInfo input) {
        return input.getUniqueKey();
      }
    })
)
.apply(
    Window.named("sessioner").<KV<String, EventInfo>>into(
        Sessions.withGapDuration(mSessionGap)
    )
    .triggering(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(1))
    )
    .withAllowedLateness(Duration.ZERO)
    .accumulatingFiredPanes()
);
The implementation guides (and most web resources I can find) describe the GS06 and ST02 Control Numbers as being unique only within the Interchange they are contained in. So when we build our GS and ST segments we just start the control numbers at 1 and increment as we add more Functional Groups and/or Transaction Sets. The ISA13 control numbers we generate are always unique.
The dilemma is when we receive a 999 acknowledgment; it does not include any reference to the ISA control number that it's responding to. So we have no way to find the correct originating Functional Group in our records.
This seems like a problem that anyone receiving functional acknowledgements would face, but clearly lots of systems and companies handle it, so what is the typical practice to reconcile 997s or 999s? I think we must be missing something in our reading of the guides.
GS06 and ST02 only have to be unique within the interchange, but if you use an ID that's truly unique for each one (not just within the message), then you can skip right to the proper transaction set or functional group, not just the right message.
I typically have GS start at 1 and increment the same way that you do, but the ST02 I keep unique (to the extent allowed by the 9 character limit).
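For reference, the acknowledgment does echo those numbers back to you: AK1's second element echoes the GS06 of the functional group being acknowledged, and AK2's second element echoes the ST02 of each transaction set, which is exactly why unique values let you jump straight to the right records. An illustrative 999 fragment (all control numbers made up):
AK1*HC*123456789~
AK2*837*000200001~
IK5*A~
AK9*A*1*1*1~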
GS06 is supposed to be globally unique, not only within the interchange. This is from X12-6
In order to provide sufficient discrimination for the acknowledgment
process to operate reliably and to ensure that audit trails are
unambiguous, the combination of Functional ID Code (GS01), Application
Sender's ID (GS02), Application Receiver's ID (GS03), and Functional
Group Control Numbers (GS06, GE02) shall by themselves be unique
within a reasonably extended time frame whose boundaries shall be
defined by trading partner agreement. Because at some point it may be
necessary to reuse a sequence of control numbers, the Functional Group
Date and Time may serve as an additional discriminant only to
differentiate functional group identity over the longest possible time
frame.
In one of my current projects, I'm making use of a single-user authentication system. I say "single-user" as I've no plans on making this work for multiple users on the same Windows account (simply because it's not something I'm looking to do).
When the user starts the application, they're presented with an authentication screen. This authentication screen uses an image (i.e. click 3 specific points in the image), a username (a standard editbox), and an image choice (a dropdown menu allowing them to select which image they wish to use). The image choice, username, and the points clicked on the image must all match what the user specified when setting up the password.
All 3 results are combined into a string, which is then encoded with the Soap.EncdDecd.EncodeString method. This is then hashed using SHA-512. Finally, it's encrypted using DES. This value is then compared with the value that was created when they set up their password. If it matches, they're granted access. If not, access is denied. I plan to use the SHA-512 value at other points in the application (such as a "master password" for authorising themselves with various different modules within the main application).
In one example, the initial string is 29 characters in length, the SOAP-encoded string is around 40 characters, the SHA-512 string is 128 characters, and the DES value is 344 characters. Since I'm not working with massive strings, it's actually really quick. SOAP was used as very basic obfuscation and not as a security measure.
My concern is that the first parts (plain string and SOAP) could be the weak points. The basic string won't give them something they can just type and be granted access, but it would give them the "Image click co-ordinates", along with the username and image choice, which would potentially allow them access to the application. The SOAP string can be easily decoded.
What would be the best way to strengthen up this first part of the authentication to try and avoid the values being ripped straight from memory? Should I even be concerned about a potential exploiter or attacker reading the values in this way?
As an additional question directly related to this same topic;
What would be the best way to store the password hash that the user creates during initial setup?
I'm currently running with a TIniFile.SectionExists method as I've not yet got around to coming up with something more elegant. This is one area where my knowledge is lacking. I need to store the password "hash" across sessions (so using a memory stream isn't an option), but I need to make sure security is good enough that it can't be outright cracked by any script kiddie.
It's really more about whether I should be concerned, and whether the encoding, hashing, and encryption I've done is actually enough. The picture password system I developed is already a great basis for stopping the traditional "I know what your text-based password is so now I'm in your system" attack, but I'm concerned about the more technical attacks that read from memory.
Using SHA-512, it is NOT feasible (at least not with 20 years of computing power and the earth's electrical energy) to retrieve the initial content from the hash value.
I even think that using DES is not mandatory, and it adds complexity. Of course, you can use such a slow process to make brute-force or dictionary-based attacks harder (since it will make each try slower). A more common approach is not to use DES, but to call SHA-512 several times (e.g. 1000 times). In this case, speed can be your enemy: a quick process will be easier to attack.
What you may do is to add a so-called "salt" to the initial values. See this Wikipedia article.
The "salt" can be fixed in the code, or stored within the password.
That is:
Hash := SHA512(Salt+Coordinates+UserName+Password);
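A compact sketch of that salted-and-iterated construction, written here in Java with the JDK's MessageDigest just to have something runnable next to the Delphi pseudo-code (the names below are illustrative):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class PasswordHashSketch {
    static byte[] hash(String salt, String coordinates, String userName, String password)
            throws Exception {
        MessageDigest sha512 = MessageDigest.getInstance("SHA-512");
        byte[] digest = sha512.digest(
            (salt + coordinates + userName + password).getBytes(StandardCharsets.UTF_8));
        // Re-hash many times to make brute-force and dictionary attacks slower.
        for (int i = 0; i < 1000; i++) {
            digest = sha512.digest(digest);
        }
        return digest;
    }
}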
Some final advice:
Never store the plain initial text in DB or file;
Force use of strong passwords (not "hellodave", which is easy to break thanks to a dictionary);
The main security weakness is between the chair and the keyboard;
If you are paranoid, explicitly overwrite (i.e. one char at a time) the plain initial text in memory before releasing it (it may still be somewhere in the RAM);
First learn a little bit about well-known techniques: you would be better off using a "challenge" with a "nonce" to avoid any "replay" or "man in the middle" attacks;
It is safe to store the password hash in DB or even an INI file, if you take care of having a strong authentication scheme (e.g. with challenge-response), and secure the server access.
For instance, here is how to "clean" your memory (but it may be much more complex than this):
Content := Salt+Coordinates+UserName+Password;
// overwrite the sensitive inputs once they have been concatenated
for i := 1 to length(Coordinates) do
  Coordinates[i] := ' ';
for i := 1 to length(UserName) do
  UserName[i] := ' ';
for i := 1 to length(Password) do
  Password[i] := ' ';
Hash := SHA512(Content);
// wipe the concatenated plain text as well
for i := 1 to length(Content) do
  Content[i] := ' ';
// key stretching: re-hash the digest many times to slow down brute force
for i := 1 to 1000 do
  Hash := SHA512(Hash);
When it comes to security, do not try to reinvent the wheel: it is a difficult matter, and you are better off relying on mathematically proven functions (like SHA-512) and experienced techniques (like a salt, a challenge...).
For some sample of authentication scheme, take a look at how we implemented RESTful authentication for our Client-Server framework. It is certainly not perfect, but it tried to implement some best practices.