torch.no_grad() and detach() combined - machine-learning

I encountered many code fragments like the following for choosing an action, that include a mix of torch.no_grad and detach (where actor is some actor, SomeDistribution your preferred distribution), and I'm wondering whether they make sense:
def f():
with torch.no_grad():
x = actor(observation)
dist = SomeDistribution(x)
sample = dist.sample()
return sample.detach()
Is the use of detach in the return statement not unnecessary, as x has its requires_grad already set to False, so all computations using x should already be detached from the graph? Or do the computations after the torch.no_grad wrapper somehow end up on the graph again, so we need to detach them once again in the end (in which case it seems to me that no_grad would be unnecessary)?
Also, if I'm right, I suppose instead of omitting detach one could also omit torch.no_grad, and end up with the same functionality, but worse performance, so torch.no_grad is to be preferred?

While it may be redundant, it depends on the internals of actor and SomeDistribution. In general, there are three cases I can think of where detach would be necessary in this code. Since you've already observed that x has requires_grad set to False then cases 2 and 3 don't apply to your specific case.
If SomeDistribution has internal parameters (leaf tensors with requires_grad=True) then dist.sample() may result in a computation graph connecting sample to those parameters. Without detaching, that computation graph, including those parameters, would be unnecessarily kept in memory after returning.
The default behavior within a torch.no_grad context is to return the result of tensor operations having requires_grad set to False. However, if actor(observation) for some reason explicitly sets requires_grad of its return value to True before returning, then a computation graph may be created that connects x to sample. Without detaching, that computation graph, including x, would be unnecessarily kept in memory after returning.
This one seems even more unlikely, but if actor(observation) actually just returns a reference to observation, and observation.requires_grad is True, then a computation graph all the way from observation to sample may be constructed during dist.sample().
As for the suggestion of removing the no_grad context in leu of detach, this may result in the construction of a computation graph connecting observation (if it requires gradients) and/or the parameters of the distribution (if it has any) to x. The graph would be discarded after detach, but it does take time and memory to create the computation graph, so there may be a performance penalty.
In conclusion, it's safer to do both no_grad and detach, though the necessity of either depends on the details of the distribution and actor.

Related

Do eager pipes create multiple lists in F#?

When using a list an pipes does it create mutliple intermediate lists? If so is this not very bad for garbage collection?
let mylist = [...]
let filterByPipes ls =
somefilter ls //is a list created here
|> otherFiler // and then again here
|> filtersForDays // and finally returned here
Yes, F# uses applicative evaluation order, which means that every value has to be fully evaluated before it can be passed as parameter to a function, which means that yes, at every step a new list is created.
As to whether it's bad for garbage collection... "First measure then optimize" - the first rule of optimization.
In most circumstances, the amount of data is so miniscule, it doesn't really make any difference, definitely not enough to make up for the massive gain in code readability.
But if your measurements do determine a bottleneck in this particular place, and it turns out to be significant for the end goal, - then yes, you'll have to manually fuse the processing chain. Or perhaps employ a more efficient data structure. It all would depend on what the optimization goal is.

What is the sorting complexity time in the streaming API for the sorted() and thenComparing () methods (sorting by multiple fields (conditions))?

I have an information that the method sorted() in Stream API maybe to use merging sort (mergesort).
Then time complexity for the kind of sort :
Big Θ (n (log n) ) - best
Big Ω (n (log n) ) - average
Big O (n (log n) ) - worst
space complecity – O (n) - worst
And what is the time complexity if we use sorting by multiple fields of a custom object, using then.comparing() to build a chain of comparisons ?
How would you calculate the time complexity in such a case ?
While the actual algorithm used in Stream.sorted is intentionally unspecified, there are obvious reasons not to implement another sorting algorithm, but use the existing implementation of Arrays.sort.
The current implementation uses TimSort, a variation of merge sort that can exploit ranges of pre-sorted elements within the input, with a best case of being entirely linear, when the input is already sorted, which also applies to the possible case that the input is sorted backwards. In these cases, no additional memory is needed. The average case is somewhere between that best case and the unchanged worst case of O(n log n).
As explained in this answer, general statements about the algorithms used in Arrays.sort are tricky, because all of them are hybrid sort algorithms and constantly improved.
Normally, a comparison function does not depend on the size of the input (the array or collection to sort), which doesn’t change when using Comparator.comparing(…).thenComparing(…), as more expensive comparison functions only add a constant factor that doesn’t affect the overall time complexity, as long as the comparator still doesn’t depend on the input size.

Guarantee Print Order After Parallelism

I have X amount of cores doing unique work in parallel, however, their output needs to be printed in order.
Object {
Data data
int order
}
I've tried putting the objects in a min heap after they're done with their parallel work, however, even that is too much of a bottleneck.
Is there any way I could have work done in parallel and guarantee the print order? Is there a known term for my problem? Have others encountered it before?
Is there any way I could have work done in parallel and guarantee the print order?
Needless to say, we design parallelized routines with focus on an efficiency, but not constraining the order of the calculations. The printing of the results at the end, when everything is done, should dictate the ordering. In fact, parallel routines often do calculations in such a way that they’re conspicuously not in order (e.g., striding on each thread) to minimize thread and synchronization overhead.
The only question is how you structure the results to allow efficient storage and efficient, ordered retrieval. I often just use a mutable buffer or a pre-populated array. It’s very efficient in terms of both storage and retrieval. Or you can use a dictionary, too. It depends upon the nature of your Data. But I’d avoid the order property pattern in your result Object.
Just make sure you’re using optimized build if using standard Swift collections, as this can have a material impact on performance.
Q : Is there a known term for my problem?
Yes, there is. A con·​tra·​dic·​tion:
Definition of contradiction…2a : a proposition, statement, or phrase that asserts or implies both the truth and falsity of something// … both parts of a contradiction cannot possibly be true …— Thomas Hobbes
2b : a statement or phrase whose parts contradict each other// a round square is a contradiction in terms
3a : logical incongruity
3b : a situation in which inherent factors, actions, or propositions are inconsistent or contrary to one anothersource: Merriam-Webster
Computer science, having borrowed the terms { PARALLEL | SERIAL | CONCURRENT } from the theory of systems, respects the distinctive ( and never overlapping ) properties of each such class of operations, where:
[PARALLEL] orchestration of units-of-work implies, that any and every work-unit: a) starts and b) gets executed and c) gets finished at the same time, i.e. all get into/out-of [PARALLEL]-section at once and being elaborated at the very same time, not otherwise.
[SERIAL] orchestration of units-of-work implies, that all work-units be processed in a one, static, known, particular order, starting work-unit(s) in such an order, just a (known)-next one after previous one has finished its work - i.e. one-after-another, not otherwise.
[CONCURRENT] orchestration of units-of-work permits to start more than one unit-of-work, if resources and system conditions permit (scheduler priorities obeyed), resulting in unknown order of execution and unknown time of completion, as both the former and the latter depend on unknown externalities (system conditions and (non)-availability of resources, that are/will be needed for a particular work-unit elaboration)
Whereas there is an a-priori known, inherently embedded sense of an ORDER in [SERIAL]-type of processing ( as it was already pre-wired into the units-of-work processing-orchestration-code ), it has no such meaning in either [CONCURRENT], where opportunistic scheduling makes a wished-to-have order an undeterministically random result from the system states, skewed by the coincidence of all other externalities, and the same wished-to-have order is principally singular value in true [PARALLEL] by definition, as all start/execute/finish at-the-same-time - so all units-of-work being executed in [PARALLEL] fashion have no other chance, but be both 1st and last at the same time.
Q : Is there any way I could have work done in parallel and guarantee the print order?
No, unless you intentionally or unknowingly violate the [PARALLEL] orchestration rules and re-enter a re-[SERIAL]-iser logic into the work-units, so as to imperatively enforce any such wished-to-have ordering, that is not known, the less natural for the originally [PARALLEL] work-units' orchestration ( as is a common practice in python - using a GIL-monopolist indoctrinated stepping - as an example of such step )
Q : Have others encountered it before?
Yes. Since 2011, each and every semester this or similar questions reappear here, on Stack Overflow at growing amounts every year.

Computing in-place with dask

Short version
I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the original backup numpy arrays, making the whole thing an in-place operation?
If you're thinking "you're using dask wrong" then see the long version below for why I feel the need to do this.
Long version
I'm using dask for an application where the original data is sourced from in-memory numpy arrays that contain data collected from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with the original data, which means that there isn't enough to make an in-memory copy. That makes it semantically a bit like an out-of-core problem, in that any derived value can only be realised in memory in chunks rather than all at once.
Dask is well-suited to this, except for one wrinkle. I'm simplifying a lot, but on most of the data (call it X), we need to apply an element-wise operation f, compute some summary statistics s(f(X)), and use that to compute another result over the data, say t(s(f(X)), f(X)). While all the functions are dask-friendly (can be done on a per-chunk basis), trying to simply run this dask graph would cause f(X) to all be held in memory at once because the chunks are all needed for the second pass. An alternative is to explicitly compute s before asking for t (as suggested by https://github.com/dask/dask/issues/874), and thus pay to compute f(X) twice, but it's a somewhat expensive operation so I'd like to avoid that.
However, once f has been applied, the original data are no longer needed. So I'd like to run da.store(f(X)) and have it store the results in the original backing numpy arrays. Technically I think I know how to set that up, and as long as I can be sure that each piece of data is fully consumed before it is overwritten then there are no race conditions, but I'm worried that I may be breaking an API contract by changing back data underneath dask and that it might go wrong in some way. Is there any way to guarantee that it is safe?
One way I can immediately see this going wrong is if several of the input arrays have the same contents and hence get given the same name in dask, causing them to the unified in the graph. I'm using name=False in da.from_array though, so that shouldn't be an issue.

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. Tensorflow even provides a placeholder_with_default that does not need feeding — but again, the intent of such a node is clear. For all purposes, a placeholder_with_default does the same thing as a constant — you can indeed feed the constant to change its value, but is the intent clear, would that not be confusing? I doubt so.
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
Simple example is to define two placeholders a,b and then an operation on them like below .
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a,b are not initialized and contains no data Because they were defined as placeholders.
Other approach to do same is to define variables tf.Variable and in this case you have to provide an initial value when you declare it.
like :
tf.global_variables_initializer()
or
tf.initialize_all_variables()
And this solution has two drawbacks
Performance wise that you need to do one extra step with calling
initializer however these variables are updatable .
in some cases you do not know the initial values for these variables
so you have to define it as a placeholder
Conclusion :
use tf.Variable for trainable variables such as weights (W) and biases (B) for your model or when Initial values are required in
general.
tf.placeholder allows you to create operations and build computation graph, without needing the data. In TensorFlow
terminology, we then feed data into the graph through these
placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of Tensorflow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows Tensorflow do all sorts of tricks and optimizations, like distributed, platform independent calculations, graph interoperability, GPU computations etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. And to indicate where inside the graph this data is supposed to go you use placeholders. This way, as Ahmed said, placeholder is a sort of a promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's
# define graph to do matrix muliplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now lets supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
# generate some data for our graph
data_x = np.random.randint(0, 10, size=[5, 5])
data_y = np.random.randint(0, 10, size=[5, 5])
# do the work
result = session.run(model, feed_dict={x: data_x, y: data_y}
There are other ways of supplying data into the graph, but arguably, placeholders and feed_dict is the most comprehensible way and it provides most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data

Resources