How can we run two functions in different enclaves in parallel? (Intel SGX)

I'm a beginner with Intel SGX. I was wondering whether SGX supports running two functions in different enclaves in parallel. E.g., Function A is in enclave En_A and Function B is in enclave En_B. Is it possible for an application to call Functions A and B in parallel?
Thanks in advance!

Yes, it's possible.
The SGX design supports having multiple enclaves on a system at the
same time, which is a necessity in multi-process environments. This is
achieved by having the EPC split into 4 KB pages that can be assigned
to different enclaves. The EPC uses the same page size as the
architecture’s address translation feature.
(source)
Looking at the Intel SGX SDK docs (page 92), you can see that the sgx_create_enclave function distinguishes enclave instances by returning a unique enclave_id:
sgx_status_t sgx_create_enclave(
    const char *file_name,
    const int debug,
    sgx_launch_token_t *launch_token,
    int *launch_token_updated,
    sgx_enclave_id_t *enclave_id, // here
    sgx_misc_attribute_t *misc_attr
);
These enclave ids are used by the application to make ECALLs through the untrusted proxy functions:
// demo.edl
enclave {
    trusted {
        public void get_secret([out] secret_t* secret);
    };
}

// generated function signature
sgx_status_t get_secret(sgx_enclave_id_t eid, secret_t* secret);
You can find a complete explanation on page 27
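The calling pattern on the untrusted side can be sketched in Python, purely as an illustration of the concurrency structure. The `ecall_function_a`/`ecall_function_b` names are hypothetical stand-ins for the C proxy functions that sgx_edger8r generates; in a real host application these would be native calls taking the enclave id returned by sgx_create_enclave:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the untrusted proxies generated by
# sgx_edger8r; a real host app would make native C calls here.
def ecall_function_a(eid):
    return f"result from enclave {eid}"

def ecall_function_b(eid):
    return f"result from enclave {eid}"

# Ids as they would be returned by two sgx_create_enclave() calls.
eid_a, eid_b = 1, 2

# The host application can invoke ECALLs into both enclaves concurrently,
# e.g. from two threads; each call is dispatched to its own enclave.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_a = pool.submit(ecall_function_a, eid_a)
    fut_b = pool.submit(ecall_function_b, eid_b)
    result_a, result_b = fut_a.result(), fut_b.result()
```

The key point is that each proxy call carries its own enclave id, so nothing in the API prevents the two calls from overlapping in time.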


Is there a way for TFF clients to have internal states?

The code in the TFF tutorials and in the research projects I see generally only keep track of server states. I’d like there to be internal client states (for instance, additional client internal neural networks which are completely decentralized and don’t update in a federated manner) that would influence the federated client computations.
However, in the client computations I have seen, they are only functions of the server states and the data. Is it possible to accomplish the above?
Yup, this is easy to express in TFF, and it will execute just fine in the default execution stacks.
As you've noticed, the TFF repository generally has examples of cross-device federated learning (Kairouz et al. 2019). Generally we talk about the state having tff.SERVER placement, and the function signature for one "round" of federated learning has the structure (for details about TFF's type shorthand, see the Federated data section of the tutorials):
(<State@SERVER, {Dataset}@CLIENTS> -> State@SERVER)
We can represent stateful clients by simply extending the signature:
(<State@SERVER, {State}@CLIENTS, {Dataset}@CLIENTS> -> <State@SERVER, {State}@CLIENTS>)
Implementing a version of Federated Averaging (McMahan et al. 2016) that includes a client state object might look something like:
@tff.tf_computation(
    model_type,
    client_state_type,  # additional state parameter
    client_data_type)
def client_training_fn(model, state, dataset):
  model_update, new_state = ...  # do some local training
  return model_update, new_state  # return a tuple including updated state
@tff.federated_computation(
    tff.FederatedType(server_state_type, tff.SERVER),
    tff.FederatedType(client_state_type, tff.CLIENTS),  # new parameter for state
    tff.FederatedType(client_data_type, tff.CLIENTS))
def run_fed_avg(server_state, client_states, client_datasets):
  client_initial_models = tff.federated_broadcast(server_state.model)
  client_updates, new_client_states = tff.federated_map(client_training_fn,
      # Pass the client states as an argument.
      (client_initial_models, client_states, client_datasets))
  average_update = tff.federated_mean(client_updates)
  new_server_state = tff.federated_map(server_update_fn,
      (server_state, average_update))
  # Make sure to return the client states so they can be used in later rounds.
  return new_server_state, new_client_states
The invocation of run_fed_avg would require passing a Python list of tensors/structures for each client participating in a round, and the result of the method invocation will be the server state and a list of client states.
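The round structure above can be checked in plain Python, without TFF, by simulating the placements with ordinary lists. Everything here is a made-up stand-in: the "model" is a single float, the "client state" is just a participation counter, and "training" computes the mean residual:

```python
def client_training_fn(model, state, dataset):
    # Stand-in for local training: the update is the mean residual
    # between the client's data and the current model; the client
    # state counts how many rounds this client has participated in.
    model_update = sum(x - model for x in dataset) / len(dataset)
    new_state = state + 1
    return model_update, new_state

def run_fed_avg(server_state, client_states, client_datasets):
    # "federated_map": apply the client computation per client,
    # threading each client's own state through the call.
    results = [client_training_fn(server_state, s, d)
               for s, d in zip(client_states, client_datasets)]
    updates = [u for u, _ in results]
    new_client_states = [s for _, s in results]
    # "federated_mean" + server update.
    new_server_state = server_state + sum(updates) / len(updates)
    # Return the client states so they can be used in later rounds.
    return new_server_state, new_client_states

server, clients = 0.0, [0, 0]
server, clients = run_fed_avg(server, clients, [[1.0, 3.0], [5.0, 7.0]])
```

This mirrors the extended signature: the caller passes one state per participating client and gets the updated list back alongside the server state.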

How to prove that certain data is calculated(or generated) inside Enclave(Intel SGX)?

How to prove that certain data is calculated(or generated) inside Enclave(Intel SGX)?
I tried to generate an asymmetric key pair inside the enclave (so the private key stays invisible to the outside), and then expose the public key together with evidence (I guess a quote, or something remote-attestation related).
I understand how remote attestation works, but I cannot figure out how to apply remote attestation to verifying enclave-generated data.
Is this scenario possible with Intel SGX?
You can prove the origin of the public key by placing it in the report_data field of a Quote generated during remote attestation.
The report_data field of the Quote (sgx_quote_t.report_body.report_data) can be used to attest arbitrary data:
The 64 byte data buffer is free form data and you can supply any information in that buffer that you would like to have identified as being in the possession and protection envelope of the enclave when the report/quote was generated. You can thus use this buffer to convey whatever information you would like to a verifying party. (Source)
The report_data field can be found by tracking the following structures:
sgx_key_exchange.h
typedef struct _ra_msg3_t {
    sgx_mac_t mac;
    sgx_ec256_public_t g_a;
    sgx_ps_sec_prop_desc_t ps_sec_prop;
    uint8_t quote[]; // <- Here!
} sgx_ra_msg3_t;
sgx_quote.h
typedef struct _quote_t
{
    uint16_t version;
    uint16_t sign_type;
    sgx_epid_group_id_t epid_group_id;
    sgx_isv_svn_t qe_svn;
    sgx_isv_svn_t pce_svn;
    uint32_t xeid;
    sgx_basename_t basename;
    sgx_report_body_t report_body; // <- Here!
    uint32_t signature_len;
    uint8_t signature[];
} sgx_quote_t;
The Quote is part of Msg3 (client-to-server) of the remote attestation protocol. You can review the details of Msg3 creation in this official Code Sample and in the intel/sgx-ra-sample RA example.
In the latter, you can find out how the report is generated using sgx_create_report:
sgx_status_t get_report(sgx_report_t *report, sgx_target_info_t *target_info)
{
#ifdef SGX_HW_SIM
    return sgx_create_report(NULL, NULL, report);
#else
    return sgx_create_report(target_info, NULL, report);
#endif
}
In both cases, the second argument, sgx_report_data_t *report_data, is NULL and can be replaced by a pointer to arbitrary input. This is where you want to put your public key or any other data.
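A common convention (not mandated by the SDK, but widely used since report_data is only 64 bytes) is to put a SHA-256 hash of the public key in the first 32 bytes and zero-pad the rest; the verifier then recomputes the hash over the exposed key and compares. A minimal sketch of that binding and check, with a made-up key value:

```python
import hashlib

REPORT_DATA_SIZE = 64  # sizeof(sgx_report_data_t)

def make_report_data(public_key: bytes) -> bytes:
    # Bind the enclave-generated key to the quote: SHA-256 of the
    # key in the first 32 bytes, zero padding up to 64 bytes.
    digest = hashlib.sha256(public_key).digest()
    return digest + b"\x00" * (REPORT_DATA_SIZE - len(digest))

def verify_report_data(report_data: bytes, public_key: bytes) -> bool:
    # The verifying party recomputes the hash over the public key it
    # received and compares it with the attested report_data field.
    return report_data == make_report_data(public_key)

pubkey = b"\x04" + b"\xab" * 64  # made-up uncompressed EC point
rd = make_report_data(pubkey)
```

Because report_data is covered by the Quote's signature, a successful remote attestation then proves the enclave identified by the Quote was in possession of exactly this public key when the report was generated.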

Can I visualize a Multibody pose without explicitly calculating every body's full transform?

In the examples/quadrotor/ example, a custom QuadrotorPlant is specified and its output is passed into QuadrotorGeometry where the QuadrotorPlant state is packaged into FramePoseVector for the SceneGraph to visualize.
The relevant code segment in QuadrotorGeometry that does this:
...
builder->Connect(
    quadrotor_geometry->get_output_port(0),
    scene_graph->get_source_pose_port(quadrotor_geometry->source_id_));
...

void QuadrotorGeometry::OutputGeometryPose(
    const systems::Context<double>& context,
    geometry::FramePoseVector<double>* poses) const {
  DRAKE_DEMAND(frame_id_.is_valid());
  const auto& state = get_input_port(0).Eval(context);
  math::RigidTransformd pose(
      math::RollPitchYawd(state.segment<3>(3)),
      state.head<3>());
  *poses = {{frame_id_, pose.GetAsIsometry3()}};
}
In my case, I have a floating based multibody system (think a quadrotor with a pendulum attached) of which I've created a custom plant (LeafSystem). The minimal coordinates for such a system would be 4 (quaternion) + 3 (x,y,z) + 1 (joint angle) = 7. If I were to follow the QuadrotorGeometry example, I believe I would need to specify the full RigidTransformd for the quadrotor and the full RigidTransformd of the pendulum.
Question
Is it possible to set up the visualization / specify the pose such that I only need to specify the 7 (pose of quadrotor + joint angle) state minimal coordinates and have the internal MultibodyPlant handle the computation of each individual body's (quadrotor and pendulum) full RigidTransform which can then be passed to the SceneGraph for visualization?
I believe this was possible with the "attic-ed" (which I take to mean "to be deprecated") RigidBodyTree, which was accomplished in examples/compass_gait
lcm::DrakeLcm lcm;
auto publisher = builder.AddSystem<systems::DrakeVisualizer>(*tree, &lcm);
publisher->set_name("publisher");
builder.Connect(compass_gait->get_floating_base_state_output_port(),
                publisher->get_input_port(0));
Where get_floating_base_state_output_port() was outputting the CompassGait state with only 7 states (3 rpy + 3 xyz + 1 hip angle).
What is the MultibodyPlant, SceneGraph equivalent of this?
Update (using MultibodyPositionToGeometryPose from Russ's deleted answer)
I created the following function, which attempts to create a MultibodyPlant from the given model_file and connect the given plant's pose_output_port through MultibodyPositionToGeometryPose.
The pose_output_port I'm using is the 4 (quaternion) + 3 (xyz) + 1 (joint angle) minimal state.
void add_plant_visuals(
    systems::DiagramBuilder<double>* builder,
    geometry::SceneGraph<double>* scene_graph,
    const std::string model_file,
    const systems::OutputPort<double>& pose_output_port)
{
  multibody::MultibodyPlant<double> mbp;
  multibody::Parser parser(&mbp, scene_graph);
  auto model_id = parser.AddModelFromFile(model_file);
  mbp.Finalize();
  auto source_id = *mbp.get_source_id();
  auto multibody_position_to_geometry_pose =
      builder->AddSystem<systems::rendering::MultibodyPositionToGeometryPose<double>>(mbp);
  builder->Connect(pose_output_port,
                   multibody_position_to_geometry_pose->get_input_port());
  builder->Connect(
      multibody_position_to_geometry_pose->get_output_port(),
      scene_graph->get_source_pose_port(source_id));
  geometry::ConnectDrakeVisualizer(builder, *scene_graph);
}
The above fails with the following exception
abort: Failure at multibody/plant/multibody_plant.cc:2015 in get_geometry_poses_output_port(): condition 'geometry_source_is_registered()' failed.
So, there's a lot in here. I have a suspicion there's a simple answer, but we may have to converge on it.
First, my assumptions:
You've got an "internal" MultibodyPlant (MBP). Presumably, you also have a context for it, allowing you to perform meaningful state-dependent calculations.
Furthermore, I presume the MBP was responsible for registering the geometry (probably happened when you parsed it).
Your LeafSystem will directly connect to the SceneGraph to provide poses.
Given your state, you routinely set the state in the MBP's context to do that evaluation.
Option 1 (Edited):
In your custom LeafSystem, create the FramePoseVector output port and the calc callback for it. Inside that callback, first set the state in the locally owned Context for the internal MBP that your LeafSystem owns, then simply invoke Eval() on the pose output port of that MBP, passing in the pointer to the FramePoseVector that your LeafSystem's callback was provided with.
Essentially (in a very coarse way):
MySystem::MySystem() {
  this->DeclareAbstractOutputPort("geometry_pose",
                                  &MySystem::OutputGeometryPose);
}

void MySystem::OutputGeometryPose(
    const Context& context, FramePoseVector* poses) const {
  mbp_context_.get_mutable_continuous_state()
      .SetFromVector(my_state_vector);
  mbp_.get_geometry_poses_output_port().Eval(mbp_context_, poses);
}
Option 2:
Rather than implementing a LeafSystem that has an internal plant, you could have a Diagram that contains an MBP and exports the MBP's FramePoseVector output directly through the diagram to connect.
This answer addresses, specifically, your edit where you are attempting to use the MultibodyPositionToGeometryPose approach. It doesn't address the larger design issues.
Your problem is that the MultibodyPositionToGeometryPose system takes a reference to an MBP and keeps that reference. That means the MBP must be alive and well for at least as long as the MultibodyPositionToGeometryPose is. However, in your code snippet, your MBP is local to the add_plant_visuals() function, so it is destroyed as soon as the function returns.
You'll need to create something that is persisted and owned by someone else.
(This is tightly related to my option 2 - now edited for improved clarity.)

Implementing Jena Dataset provider over MongoDB

I have started to implement a set of classes that provide a direct interface to MongoDB for persistence, similar in spirit to the now-unmaintained SDB persistor implementation for RDBMSs.
I am using the time-honored technique of creating the necessary concrete classes from the interfaces and doing a println in each method, thereby allowing me to trace the execution. I have gotten all the way to where the engine is calling out to my cursor setup:
public ExtendedIterator<Triple> find(Node s, Node p, Node o) {
    System.out.println("+++ MongoGraph:extenditer:find(" + s + p + o + ")");
    // TBD Need to turn s,p,o into a match expression! Easy!
    MongoCursor cur = this.coll.find().iterator();
    ExtendedIterator<Triple> curs = new JenaMongoCursorIterator(cur);
    return curs;
}
Sadly, when I later call this:
while (rs.hasNext()) {
    QuerySolution soln = rs.nextSolution();
    System.out.println(soln);
}
It turns out rs.hasNext() is always false, even though material is present in the MongoCursor (I can debug-print it in the find() method). Also, the trace print in the next() function of my concrete iterator JenaMongoCursorIterator (which extends NiceIterator, which I believe is OK) is never hit. In short, the basic setup seems good, but the engine never cranks the iterator returned by find().
Trying to use SDB as a guide is completely overwhelming for someone not intimately familiar with the software architecture. It's fully factored and filled with interfaces and factories, and although that is excellent, it is difficult to navigate.
Has anyone tried to create their own persistor implementation and if so, what are the basic steps to getting a "hello world" running? Hello World in this case is ANY implementation, non-optimized, that can call next() on something to produce a Triple.
TL;DR: It is now working.
I was coding too eagerly, and JenaMongoCursorIterator contained a method hasNexl, which of course did not override hasNext (with a t) in the default implementation of NiceIterator, which returns false.
This is the sort of problem that Eclipse, visual debugging, and tracing make a lot easier to resolve than plain jdb. jdb is fine if you know the software architecture pretty well, but if you don't, having the relevant source files open and being able to mouse over variables provides a tremendous boost in the amount of context you can build up to home in on the problem.
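The failure mode generalizes beyond Jena: whenever a base class supplies a default hasNext() that returns false, a misspelled override compiles (or loads) cleanly and the iterator silently looks empty. A minimal Python analogue of the bug, with snake_case names standing in for the Java methods:

```python
class NiceIterator:
    # Like Jena's NiceIterator: the default is an empty iterator.
    def has_next(self):
        return False

    def next(self):
        raise StopIteration

class BuggyCursorIterator(NiceIterator):
    def __init__(self, items):
        self._items = iter(items)
        self._peek = next(self._items, None)

    def has_nexl(self):  # typo! never called; the base's False wins
        return self._peek is not None

class FixedCursorIterator(BuggyCursorIterator):
    def has_next(self):  # correct name actually overrides the default
        return self._peek is not None

buggy = BuggyCursorIterator([1, 2, 3])
fixed = FixedCursorIterator([1, 2, 3])
```

In Java, annotating the method with @Override would have turned this into a compile-time error instead of a silent empty result set, which is exactly why that annotation is worth the keystrokes.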

SideInputs kill dataflow performance

I'm using dataflow to generate a large amount of data.
I've tested two versions of my pipeline: one with a side input (of varying sizes) and the other without.
When I run the pipeline without the side input, my job will finish in about 7 minutes. When I run my job with the side input, my job will never finish.
Here's what my DoFn looks like:
public class MyDoFn extends DoFn<String, String> {
    final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pCollectionView;
    final List<CSVRecord> stuff;
    private Aggregator<Integer, Integer> dofnCounter =
        createAggregator("DoFn Counter", new Sum.SumIntegerFn());

    public MyDoFn(PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv, List<CSVRecord> m) {
        this.pCollectionView = pcv;
        this.stuff = m;
    }

    @Override
    public void processElement(ProcessContext processContext) throws Exception {
        Map<String, Iterable<TreeMap<Long, Float>>> pdata = processContext.sideInput(pCollectionView);
        processContext.output(AnotherClass.generateData(stuff, pdata));
        dofnCounter.addValue(1);
    }
}
And here's what my pipeline looks like:
final Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

PCollection<KV<String, TreeMap<Long, Float>>> data;
data = p.apply(TextIO.Read.from("gs://where_the_files_are/*").named("Reading Data"))
    .apply(ParDo.named("Parsing data").of(new DoFn<String, KV<String, TreeMap<Long, Float>>>() {
        @Override
        public void processElement(ProcessContext processContext) throws Exception {
            // Parse some data
            processContext.output(KV.of(key, value));
        }
    }));

final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv =
    data.apply(GroupByKey.<String, TreeMap<Long, Float>>create())
        .apply(View.<String, Iterable<TreeMap<Long, Float>>>asMap());

DoFn<String, String> dofn = new MyDoFn(pcv, localList);
p.apply(TextIO.Read.from("gs://some_text.txt").named("Sizing"))
    .apply(ParDo.named("Generating the Data").withSideInputs(pcv).of(dofn))
    .apply(TextIO.Write.named("Write_out").to(outputFile));
p.run();
We've spent about two days trying various methods of getting this to work. We've narrowed it down to the inclusion of the side input. If the processContext is modified to not use the side input, it will still be very slow as long as it's included. If we don't call .withSideInput() it's very fast again.
Just to clarify, we've tested this on side input sizes from 20 MB to 1.5 GB.
Very grateful for any insight.
Edit
Including a few job ID's:
2016-01-20_14_31_12-1354600113427960103
2016-01-21_08_04_33-1642110636871153093 (latest)
Please try out Dataflow SDK 1.5.0+; it should address the known performance problems behind your issue.
Side inputs in the Dataflow SDK 1.5.0+ use a new distributed format when running batch pipelines. Note that streaming pipelines and pipelines using older versions of the Dataflow SDK are still subject to re-reading the side input if the view can not be cached entirely in memory.
With the new format, we use an index to provide a block based lookup and caching strategy. Thus when looking into a list by index or looking into a map by key, only the block that contains said index or key will be loaded. Having a cache size which is greater than the working set size will aid in performance as frequently accessed indices/keys will not require re-reading the block they are contained in.
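The block-based lookup and caching strategy can be sketched as an LRU cache keyed by block id, so a lookup loads only the block that contains the key and hot blocks stay resident (purely illustrative; the SDK's actual index format and cache are internal):

```python
from functools import lru_cache

BLOCK_SIZE = 2  # entries per block; tiny for illustration

# Stand-in for the side input persisted as sorted key/value blocks.
STORED = {i: [(k, k * 10) for k in range(i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE)]
          for i in range(3)}
loads = []  # records which blocks were (re)read from storage

@lru_cache(maxsize=2)  # analogous to the worker cache: cold blocks get evicted
def load_block(block_id):
    loads.append(block_id)  # a cache miss means re-reading this block
    return dict(STORED[block_id])

def lookup(key):
    # Only the block containing the key is loaded (and then cached),
    # rather than materializing the entire side input per element.
    return load_block(key // BLOCK_SIZE)[key]

values = [lookup(0), lookup(1), lookup(0)]  # block 0 is read only once
```

When the working set of blocks fits in the cache, repeated lookups never touch storage again, which is why a cache larger than the working set helps so much.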
Side inputs in the Dataflow SDK can, indeed, introduce slowness if not used carefully. Most often, this happens when each worker has to re-read the entire side input per main input element.
You seem to be using a PCollectionView created via asMap. In this case, the entire side input PCollection must fit into memory of each worker. When needed, Dataflow SDK will copy this data on each worker to create such a map.
That said, the map on each worker may be created just once or multiple times, depending on several factors. If its size is small enough (usually less than 100 MB), it is likely that the map is read only once per worker and reused across elements and across bundles. However, if its size cannot fit into our cache (or something else evicts it from the cache), the entire map may be re-read again and again on each worker. This is, most often, the root-cause of the slowness.
The cache size is controllable via PipelineOptions as follows, but due to several important bugfixes, this should be used in version 1.3.0 and later only.
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For the time being, the fix is to change the structure of the pipeline to avoid excessive re-reading. I cannot offer you specific advice there, as you haven't shared enough information about your use case. (Please post a separate question if needed.)
We are actively working on a related feature we refer to as distributed side inputs. This will allow a lookup into the side input PCollection without constructing the entire map on the worker. It should significantly help performance in this and related cases. We expect to release this very shortly.
I didn't see anything particularly suspicious about the two jobs you have quoted. They've been cancelled relatively quickly.
I'm manually setting the cache size when creating the pipeline in the following manner:
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For a side input of ~25 MB, this speeds up execution considerably (job id 2016-01-25_08_42_52-657610385797048159) vs. creating the pipeline in the manner below (job id 2016-01-25_07_56_35-14864561652521586982):
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, when the side input is increased to ~400 MB, no increase in cache size improves performance. Theoretically, is all the memory indicated by the GCE machine type available for use by the worker? What would invalidate or evict something from the worker cache, forcing the re-read?
