I want to study TensorFlow for a long time , so I want to read the source code of it , but at the beginning. For example :
I can’t find the deeper level of the function.
Where does the function Shape compute?
The code fragment that you have shown is an automatically generated piece of code that adds a "Shape" operation to the graph. The string "Shape" in the arguments to _op_def_lib.apply_op() determines the operation-type of the node. The standard operation types are registered in C++ source code, in the tensorflow/core/ops/ directory of the TensorFlow source code. In particular, the "Shape" operation is registered in tensorflow/core/ops/array_ops.cc. These registrations are used to define the types of the inputs to, attrs of, and outputs from each operation, and the Python wrappers are generated from these registrations.
The first time you run a subgraph containing that node (i.e. in a call to tf.Session.run()), TensorFlow will look up the appropriate kernel that implements the operation on a particular device. (For example, there are often separate kernels for CPU and GPU implementations of operations.) The standard kernel implementations are registered in C++ source code, in the tensorflow/core/kernels/ directory of the TensorFlow source code. In particular, the "Shape" kernels are registered in tensorflow/core/kernels/shape_ops.cc. The kernel registration names a class the implements the kernel, which must be a subclass of tensorflow::OpKernel, and in this case is the tensorflow::ShapeOp class. The constructor is called when the subgraph runs for the first time, and the Compute() method is called each time the operation runs.
Related
I believe Apple has disabled being able to write and execute memory at the same time on the ARM64 architecture, see:
See mmap() RWX page on MacOS (ARM64 architecture)?
This makes it difficult to port implementations like jonesforth, which keeps generated code and the code to generate it (like the built-in assembler in jonesforth.f) in the same segment.
I thought I could do something like map the user space from start to HERE as 'r-x', and from here to the end as 'rw-'. Then I'd have to constantly remap memory as I compile new words, and I couldn't go and fix up previous words (I believe SCODE would make use of it).
Do you have any advice on how to handle such limitations ?
I guess I should look into other forth implementations that are running on M1 Macs.
A Forth implementation can have a problem with write-protected segments of code only when it generates machine code that should be executable at once. There is no such a problem if it uses threaded code. So it's supposed bellow that the Forth system have to generate machine code.
Data space and code space
Obviously, you have to separate code space from data space. Data space (at least mutable regions of, including regions for variables and data fields), as well as internal mutable memory regions and probably headers, should be mapped to 'rw-' segments. Code space should be mapped to 'r-x' segments.
The word here ( -- addr ) returns the address of the first cell available for reservation, which is writable for a program, and it should be always in an 'rw-' segment. You can have an internal word code::here ( -- addr ) that returns address in code space, if you need.
A decision for execution tokens is a compromise between speed and simplicity of implementation (an 'r-x' segment vs 'rw-'). The simplest case is that an execution token is represented by an address in an 'rw-' segment, and then execute does an additional dereferencing to get the corresponding address of code.
Code generation
In the given conditions we should generate machine code into an 'rw-' segment, but before this code is executed, this segment should be made 'r-x'.
Probably, the simplest solution is to allocate a memory block for every new definition, resize (minimize) the block on completion and make it 'r-x'. Possible disadvantages — losses due to page size (e.g. 4 KiB), and maybe memory fragmentation.
Changing protection of the main code segment starting from code::here also implies losses due to page size granularity.
Another variant is to break creating of a definition into two stages:
generate intermediate representation (IR) in a separate 'rw-' segment during compilation of a definition;
when the definition is completed, generate machine code in the main code segment from IR, and discard IR code.
Actually, it could be machine code on the first stage too, and then it's just relocated into another place on the second stage.
Before write to the main code segment you change it (or its part) to 'rw-', and after that revert it to 'r-x'.
A subroutine that translates IR code should be resided in another 'r-x' segment that you don't change.
Forth is agnostic to the format of generated code, and in a straightforward system only a handful of definitions "know" what format is generated. So only these definitions should be changed to generate IR code. If you relocate machine code, you probably don't need to change even these definitions.
If I call R code from Java within GraalVM (using GraalVM's polyglot function), does the R code and the Java code run on the same Java thread (ie there's no switching between OS or Java threads etc?) Also, is it the same "memory/heap" space? That is, in the example code below (which I took from https://www.baeldung.com/java-r-integration)
public double mean(int[] values) {
Context polyglot = Context.newBuilder().allowAllAccess(true).build();
String meanScriptContent = RUtils.getMeanScriptContent();
polyglot.eval("R", meanScriptContent);
Value rBindings = polyglot.getBindings("R");
Value rInput = rBindings.getMember("c").execute(values);
return rBindings.getMember("customMean").execute(rInput).asDouble();
}
does the call rBindings.getMember("c").execute(values) cause the values object (an array of ints) to be copied? Or is GraalVM smart enough to consider it a pointer to the same memory space? If it's a copy, is the copying time the same (or similar, ie within say 20%) time as if I were to a normal java clone() operation? Finally, does calling a polyglot function (in this case customMean implemented in R) have the same overhead as calling a native Java function? Bonus question: can the GraalVM JIT compiler even compile accross the layers, eg say I had this:
final long sum = IntStream.range(0,10000)
.stream()
.map(x -> x+4)
.map(x -> <<<FastR version of the following inverse operation: x-4 >>>)
.sum();
would the GraalVM compiler be as smart as say a normal Java JIT compiler and realize that the whole above statement can be simply written without the two map operations (Since they cancel each other out)?
FYI: I'm considering using GraalVM to run both my Java code and my R code, once the issue I identified here is resolved (Why is FASTR (ie GraalVM version of R) 10x *slower* compared to normal R despite Oracle's claim of 40x *faster*?) and one of the motivitations is that I hope to eliminate the 50% of time that calling R (using RServe()) from Java is spent on network IO (because Java communicates with RServer over TCP/IP and RServe and Java are on different threads and memory spaces etc etc.)
does the R code and the Java code run on the same Java thread. Also, is it the same "memory/heap" space?
Yes and yes. You can even use GraalVM VisualVM to inspect the heap: it provides standard Java view where you can see instances of FastR internal representations like RIntVector mingled with the rest of the other Java objects, or R view where you can see integer vectors, lists, environments, ...
does the call rBindings.getMember("c").execute(values) cause the values object (an array of ints) to be copied?
In general yes: most objects are passed to R as-is. Inside R you have two choices:
Explicitly convert them to some concrete type, i.e., as.integer(arg), which does not make a copy, but tells R explicitly how you want that value to be treated as "native" R type including R's value semantics.
Leave it up to the default rules, which will be applied once your objects is passed to some R builtin, e.g., int[] is treated as integer vector (but note that treating it as a list would be also reasonable in some cases). Again no copies here. And the object itself keeps its reference semantics.
However, sometimes FastR needs to make a copy:
some builtin functions cannot handle foreign objects yet
R language often implicitly copies vectors, because of its value semantics, arguments coercion, etc.
when a vector is passed to native R extension, we need to move its data to off heap memory
I would say that if you happen to have a very large vector, say GBs of data, you need to be very careful about it even in regular R. Note: FastR vectors are by default backed by Java arrays, so their size limitations apply to FastR vectors too.
Finally, does calling a polyglot function (in this case customMean implemented in R) have the same overhead as calling a native Java function?
Mostly yes, except that the function cannot be pulled and inlined into the surrounding Java code(+). The call itself is as fast as regular Java call. For the example you give: it cannot be optimized as you suggest, because the R function cannot be inlined(+). However, I would be very skeptical that any compiler can optimize this as you suggest even if both functions where pure Java code. That being said, yes: some things that compiler can optimize, like eliminating some useless computations that it can analyze well, is not going to work because of the impossibility to inline code across the Java <-> R boundary(+).
(+) Unless you'd run the Java code with Espresso (Java on Truffle), but then you would not be using Context API but Espresso's interop support.
I am working with a diagram which includes a MultiBodyPlant with a Propeller connected to it. The Propeller actually realizes numerous physical propellers which are distributed among the bodies of the MultiBodyPlant.
I am able to simulate the dynamics of the combined system by setting the prop forces with FixValue, so I'm on the right track.
What I'd like to be able to do is, given a configuration for the system (i.e. the MultiBodyPlant context) and a chosen propeller command, compute the generalized forces acting on the system. My sense is that this is not immediately available since the simulation is actually using the RNEA, and so does not aggregate the forces all together in that way. For what I'm doing (and even just as a sanity check), I would like to compute the forces directly, not just their effect on the evolution of the state.
Is there an existing method to compute this built into Drake, or should I compute it manually using the spatial jacobian of each propeller frame and the applied SpatialForce of the corresponding propeller? (Something along the lines of this question: How to get the matrix that maps external forces to generalized forces?)
Many thanks for your help.
I understand better now. It's a very reasonable request! You have two systems at play: the Propeller and MultibodyPlant. Unfortunately, the quantity you want is all of Propeller and just a piece of MultibodyPlant. We don't offer direct access to the B(q) matrix in that case.
What you can do is call with either AutoDiffXd or symbolic::Expression, call CalcImplicitTimeDerivativesResidual on the Diagram to get the entire dynamics in implicit form (to avoid taking M(q) inverse). You could call it twice -- once with the Propeller inputs set up via FixValue as AutoDiffXd and/or symbolic::Variable and again with them as zero, then subtract the difference.
Note: CalcImplicitTimeDerivativesResidual is relatively new; I haven't pushed the python bindings for it yet (but it's been on my list). Do you need it from python?
I think that perhaps you are looking for the MultibodyPlant reaction_forces output port?
I have trialled TFF tutorial (MNIST) on my single machine and now I am trying to perform a multi-machine process using MNIST data.
Clearly, I cannot use create_tf_dataset_for_client so I have used GRPC to learn how to pass data from one machine to another.
My scenario is that Server will dispatch the initial model (with zeroes) to all the participating clients where the model will run on local data. Each client will dispatch the new weights to the server that will perform federated_mean.
I was thinking of using tff.learning.build_federated_averaging_process where I could hopefully customise the next function (2nd argument) but I failed... I am not even sure if we use this approach to send the model and get the weights back from remote clients.
Then I thought I could use tff.federated_mean under #tff.federated_computation decorator. However, since weights are arrays and I have a list of them (as I have a number of clients), I am unable to understand how do I create a tff.FederatedType that points to that a list of lists. Any help from someone who has modelled federation on distributed dataset will be handy to understand.
Regards,
Dev.
TFF computations are designed to be platform/runtime agnostic; a single computation can be executed by several different backends.
TFF's type system can be helpful here in reasoning about how data is expected to flow in you computation. See the custom federated algorithms part 1 tutorial for an intro to how TFF thinks about types.
The result of build_federated_averaging_process expects an argument of datasets which are placed at clients; for a dataset of element type T, in TFF's usual notation this would be denoted {T*}#C. This signature particular is agnostic with respect to how the datasets arrive at the clients, or indeed how the clients themselves are represented.
Materializing the data and representing the clients is really the job of the runtime. TFF provides a few so-called native options here.
For example, in the local Python runtime clients are represented by threads on your local machine. Datasets are simply eager tf.data.Dataset objects, and the threads pull data from the datasets during training.
In the remote Python runtime, clients are represented by (threads on) remote workers, so that a single remote worker could be running more than one client. In this case, as you note, data must be materialized on the remote worker in order to train.
There are several options for accomplishing this.
One, TFF will actually handle serialization and deserialization of eager datasets across this RPC connection for you, so you could use the identical pattern of specifying data as in the local runtime, and it should "just work". This pattern actually got significantly better in March of 2021, via the use of tf.raw_ops.DatasetToGraphV2.
Perhaps better mapping to the concepts of federated computation, however, is the use of some library functions to simply instantiate the datasets on the workers.
Suppose you have an iterative process ip, which accepts a state and data argument, where data is of type {T*}#C. Suppose further we have a TFF computation get_dataset_for_client_id, which accepts a string and returns a dataset of appropriate type (IE, its TFF type signature is tf.str -> T*).
Then we can compose these two computations into another:
#tff.federated_computation(STATE_TYPE, tff.FederatedType(tf.string, tff.CLIENTS))
def new_next(state, client_ids):
datasets_on_clients = tff.federated_map(get_dataset_for_client_id, client_ids)
return ip.next(state, datasets_on_clients)
new_next now requires the controller to only specify the ids of clients on which to train, and delegates responsibility for pointing to a data store to whoever is representing the clients.
This pattern I think is likely what you want; TFF provides some helper s like the dataset_computation attribute on tff.simulation.ClientData and tff.simulation.compose_dataset_computation_with_iterative_process, which will more or less perform the wiring we did above for you.
let's do this step by step. Please let us know if the explanation below answers your question.
Let's start with an example of TF (non-federated, just local) code that takes a dataset and does something with it, say add numbers:
#tff.tf_computation(tff.SequenceType(tf.int32))
def process_data(ds):
return ds.reduce(np.int32(0), lambda x, y: x + y)
This code takes a dataset of integer numbers at input, and returns a single integer with the sum at output.
You can confirm this by lookin at the type signature, like this:
str(process_data.type_signature)
You should see this:
(int32* -> int32)
So, process_data takes a set of integers, and returns an integer.
Now, using TFF's federated operators we can create a federated computation that does this on multiple clients, like this:
#tff.federated_computation(tff.FederatedType(tff.SequenceType(tf.int32), tff.CLIENTS))
def process_data_on_clients(federated_ds):
return tff.federated_map(process_data, federated_ds)
If you look at the type signature of this new computation (just like above), you will see this:
({int32*}#CLIENTS -> {int32}#CLIENTS)
It means process_data_on_clients takes a federated set of integers (one set per client), and returns a federated integer (one integer with the sum on each client).
What happens in the above is that, the TF logic in process_data will be executed once on each client. This is how the federated_map operator works.
Now, process_data_on_clients is a little bit like the the iterative process you are working with. It wants you to provide a federated dataset as an argument.
Let's see how we can make one by following the same pattern as above.
Here's some TF code that creates a single dataset with integers, say you supply an integer n and want to create a dataset with numbers from 1 up to n, i..e, {1, 2, ..., n}:
#tff.tf_computation(tf.int32)
def make_data(n):
return tf.data.Dataset.range(tf.cast(n, tf.int64)).map(lambda x: tf.cast(x + 1, tf.int32))
This is obviously a silly example, you could do something more along the lines of what you need (e.g., read data from a file specified by a name, etc.).
And here's what its type signature looks like:
(int32 -> int32*)
You can see the similarity to process_data.
And, just like with processing data, here's now we can make data on all clients by using the federated_map operator:
#tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_data_on_clients(federated_n):
return tff.federated_map(make_data, federated_n)
This is the type signature:
({int32}#CLIENTS -> {int32*}#CLIENTS)
Great, so make_data_on_clients takes a federated integer (that tells us how many data items to produce on each client), and returns a federated dataset, just like what process_data_on_clients wants.
You can check that the two work together as intended:
federated_n = [2, 3, 4]
federated_ds = make_data_on_clients(federated_n)
result = process_data_on_clients(federated_ds)
result
You should get the sums 1+2, 1+2+3, and 1+2+3+4 on the 3 clients involved in this computation (note there were 3 numbers in the federated integer above, so there are 3 clients here):
[<tf.Tensor: shape=(), dtype=int32, numpy=3>,
<tf.Tensor: shape=(), dtype=int32, numpy=6>,
<tf.Tensor: shape=(), dtype=int32, numpy=10>]
Note that all TF code you have seen so far, including both dataset creation and dataset reduce, were being executed on the clients (using federated_map).
Now, you can put the two together:
#tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_and_process_data_on_clients(federated_n):
federated_ds = make_data_on_clients(federated_n)
return process_data_on_clients(federated_ds)
And now, you can invoke the make and process data combo in one shot:
make_and_process_data_on_clients(federated_n)
Again, all TF code here is executing on clients, just like in the above.
So where does this leave you?
Going back to Keith's explanation, the iterative process you got from TFF wants a federated dataset at input, just like process_data_on_clients.
The function get_dataset_for_client_id in Keith's example is like our make_data in that it is assumed to contain TensorFlow code that you want to run on each client to physically construct a dataset on that client.
In out silly example, dataset construction logic used range, but it can be anything. For example, you could load data on each client from the same local file my_data, or using a custom TF op, or by whatever other means. Just like in our example, you can pass parameters to that function to give you more centralized control (similarly to whatever did above with the federated integer).
The code snipper new_next in Keith's example is just like our make_and_process_data_on_clients, in that it combines two federated computations: one that makes federated data on clients (supplied by you, just as discussed here), and one that processes that data (from tff.learning, the iterative process).
Does this help?
If still unclear, I would recommend to try the examples I included above on your distributed setup, since you already have one. You could inject some TF print ops to that code to confirm that the TF code you wrote is executing on the client machines in your system.
Once you get that part, it's simple tweak to replace the silly data set construction logic in make_data with one that loads a dataset on each client from whatever local data source you are using.
EDITS:
Re: how to print, any TensorFlow code that appears in the body of a #tff.tf_computation is executed in eager mode, and you can use standard TensorFlow mechanisms such as tf.print to print from within TensorFlow.
tensorflow.org/api_docs/python/tf/print
On how to configure a multi-machine system with multiple worker nodes, see the Kubernetes tutorial. Note that the machine that drives the process connects to worker nodes, not the other way round.
https://www.tensorflow.org/federated/tutorials/high_performance_simulation_with_kubernetes
What is the recommended way to do expensive one-off initialization in a Beam Python DoFn? The Java SDK has DoFn.Setup, but there doesn't appear to be an equivalent in Beam Python.
Is the best way currently to attach objects to threading.local() in the DoFn initializer?
Dataflow Python is not particularly transparent about the optimal method for initializing expensive objects. There are a few mechanisms by which objects can be instantiated infrequently (it is currently not ideal to perform exactly once initialization). Below are outlined some of the experiments I have run and conclusions I have come to. Hopefully someone from the Beam community can help correct me wherever I have strayed.
__init__
Although the __init__ method can be used to initialize an expensive object exactly once, this initialization does not happen on the Worker machines. The object will need to be serialized in order to be sent off to the Worker which, for large objects, as well as Tensorflow models, can be quite unwieldy or not work at all. Furthermore, since this object will be serialized and sent over a wire, it is not secure to perform initializations here, as payloads can be intercepted. The recommendation is against using this method.
start_bundle()
Dataflow processes data in discrete groups that it calls bundles. These are fairly well defined in batch processes, but in streaming they are dependent on the throughput. There are no mechanisms for configuring how Dataflow creates its bundles, and in fact the size of a bundle is entirely dictated by Dataflow. The start_bundle() method will be called on the Worker and can be used to initialize state, however experiments find that in a streaming context, this method is called more frequently than desired, and expensive re-initializations would happen quite often.
Lazy initialization
This methodology was suggested by the Beam docs and is somewhat surprisingly the most performant. Lazy initialization means that you create some stateful parameter that you initialize to None, then execute code such as the following:
if self.expensive_object is None:
self.expensive_object = self.__expensive_initialization()
You can execute this code directly in your process() method. You can also put together some helper functions easily enough that rely on global state so that you can have functions such as (an example of what this might look like is at the bottom of this post):
self.expensive_object = get_or_initialize_global(‘expensive_object’, self.__expensive_initialization)
Experiments
The following experiments were run on a job that was configured using both start_bundle and the lazy initialization method described above, with appropriate logging to indicate invocation. Various throughput was published to the appropriate queue and the results were recorded accordingly.
At a rate of 1 msg/sec over 100s:
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 100
LAZY INITIALIZATION 25
TOTAL MESSAGES 100
At a rate of 10 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 942
LAZY INITIALIZATION 3
TOTAL MESSAGES 1000
At a rate of 100 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 2447
LAZY INITIALIZATION 30
TOTAL MESSAGES 10000
At a rate of 1000 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 2293
LAZY INITIALIZATION 36
TOTAL MESSAGES 100000
Takeaways
Although start_bundle works well for high throughput, lazy initialization is nonetheless the most performant by a wide margin regardless of throughput. It is the recommended way of performing expensive initializations on Python Beam. This result is perhaps not too surprising given this quote from the official docs:
Setup - called once per DoFn instance before anything else; this has not been implemented in the Python SDK so the user can work around just with lazy initialization
The fact that is is called a "work around" is not particularly encouraging though, and maybe we can expect something more robust in the near future.
Code Samples
Courtesy of Andreas Jansson:
def get_or_initialize_global(object_key, initialize_expensive_object):
if object_key in globals():
expensive_object = globals()[object_key]
else:
expensive_object = initialize_expensive_object()
globals()[object_key] = expensive_object
Setup and teardown have now been added to the Python SDK and are the recommended way to do expensive one-off initialization in a Beam Python DoFn.
This sounds like it could be it https://beam.apache.org/releases/pydoc/2.8.0/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.start_bundle