According to the manual, http://www.erlang.org/erldoc?q=erlang:now:
If you do not need the return value to be unique and monotonically increasing, use os:timestamp/0 instead to avoid some overhead.
So os:timestamp/0 should be faster than erlang:now/0.
But when I tested both on my PC with timer:tc/3, the time spent on 10000000 calls, in microseconds, was:
erlang:now 951000
os:timestamp 1365000
Why is erlang:now/0 faster than os:timestamp/0?
My OS: Windows 7 x64, Erlang version: R16B01.
------------------edit-----------------
I wrote another test that runs in parallel (100 threads), and os:timestamp/0 performed better there. Here is the data:
----- single thread ------
erlang:now 95000
os:timestamp 147000
----- multi thread ------
erlang:now 333000
os:timestamp 91000
So I think the "overhead" only shows up in parallel code.
I've always thought that the 'some overhead' comment was darkly amusing. The way erlang:now/0 achieves its trick of providing guaranteed unique, monotonically increasing values is to take out a per-VM global lock. In a serial test you won't notice anything, but when you've got a lot of parallel code running, you may.
The function os:timestamp/0 doesn't take out a lock and may return the same value in two processes.
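To make the trade-off concrete, here is a minimal sketch in Python rather than Erlang. It is not how the Erlang VM implements either function: `unique_now_us` and `plain_timestamp_us` are hypothetical names, and the single lock only stands in for the per-VM lock described above.

```python
import threading
import time

_lock = threading.Lock()   # stand-in for the single global lock
_last = 0                  # last value handed out, in microseconds


def unique_now_us():
    """Strictly increasing microsecond timestamp; all callers share one lock."""
    global _last
    with _lock:
        t = time.time_ns() // 1000
        if t <= _last:
            # Called faster than the clock ticks: force the value forward.
            t = _last + 1
        _last = t
        return t


def plain_timestamp_us():
    """Plain wall-clock read: no lock, so two callers may get the same value."""
    return time.time_ns() // 1000
```

Every caller of `unique_now_us` queues on the same lock, which is invisible with one caller and becomes the bottleneck with many. `plain_timestamp_us` has no shared state, but it can hand the same value to two callers.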
This was recently discussed on the erlang-questions mailing list ("erlang:now() vs os:timestamp()" on 3rd April 2013), where two interesting results emerged:
erlang:now seems to be faster than os:timestamp in interpreted code (as opposed to compiled code, where os:timestamp is faster).
If you benchmark them, you should measure the time taken using os:timestamp instead of erlang:now, since erlang:now forces the clock to advance.
Apart from the excellent answer by troutwine: the reason erlang:now() is faster in a serial test is probably that it avoids the kernel. If you call it faster than time progresses, it does not need to hit the kernel as often.
But note that your test is deceiving until you add more than a single core. Then os:timestamp(), as troutwine writes, will outperform erlang:now().
Also note you are on a weak platform, namely Windows. This usually affects performance in non-trivial ways.
I have a DirectCompute application that runs computations on images (such as computing the average pixel value, applying a filter, and much more). For some computations, I simply treat the image as an array of integers and dispatch a compute shader like this:
FImmediateContext.Dispatch(PixelCount, 1, 1);
The result is exactly the expected value, so the computation is correct. Nevertheless, at run time, I see the following message in the debug log:
D3D11 ERROR: ID3D11DeviceContext::Dispatch: There can be at most 65535 Thread Groups in each dimension of a Dispatch call. One of the following is too high: ThreadGroupCountX (3762013), ThreadGroupCountY (1), ThreadGroupCountZ (1) [ EXECUTION ERROR #2097390: DEVICE_DISPATCH_THREADGROUPCOUNT_OVERFLOW]
This error is shown only in the debug log; everything else is correct, including the computation result. This makes me think that the GPU somehow manages the very large thread group count, probably by breaking it into smaller groups that are executed sequentially.
My question is: should I care about this error, or is it OK to keep it and let the GPU do the work for me?
Thx.
If you only care about it working on your particular piece of hardware and driver, then it's fine. If you care about it working on all Direct3D Feature Level 11.0 cards, then it's not fine as there's no guarantee it will work on any other driver or device.
See Microsoft Docs for details on the limits for DirectCompute.
If you care about robust behavior, it's important to test DirectCompute applications across a selection of cards & drivers. The same is true of basically any use of DirectX 12. Much of the correctness behavior is left up to the application code.
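One common way to stay under the limit is to launch more than one thread per group, and to fold any remaining overflow into the Y dimension. A rough sketch of the arithmetic in Python (the helper name `dispatch_dims` and the `threads_per_group` parameter are made up for illustration; the 65535 limit is the one quoted in the debug message):

```python
MAX_GROUPS_PER_DIM = 65535  # per-dimension limit quoted in the D3D11 debug message


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def dispatch_dims(item_count, threads_per_group=1):
    """Return (x, y, z) thread group counts that cover item_count items.

    Hypothetical helper: it only does the arithmetic, it does not call Direct3D.
    """
    groups = ceil_div(item_count, threads_per_group)
    if groups <= MAX_GROUPS_PER_DIM:
        return (groups, 1, 1)
    # Too many groups for one dimension: spread them over X and Y.
    y = ceil_div(groups, MAX_GROUPS_PER_DIM)
    if y > MAX_GROUPS_PER_DIM:
        raise ValueError("too many items for a 2D dispatch")
    x = ceil_div(groups, y)
    return (x, y, 1)
```

The shader then has to rebuild a linear index from the group and thread IDs and skip indices at or beyond the pixel count, since the grid can be slightly larger than needed. The group size passed here must match the one declared in the shader.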
I'm trying to read through the PPO1 code in OpenAI's Baselines implementation of RL algorithms (https://github.com/openai/baselines) to gain a better understanding of how PPO works, how one might go about implementing it, etc.
I'm confused as to the difference between the "optim_batchsize" and the "timesteps_per_actorbatch" arguments that are fed into the "learn()" function. What are these hyper-parameters?
In addition, I see that in the "run_atari.py" file, the "make_atari" and "wrap_deepmind" functions are used to wrap the environment. The "make_atari" function uses "EpisodicLifeEnv", which ends the episode once a life is lost. On average, I see that the episode length at the beginning of training is about 7 - 8 timesteps, but the batch size is 256, so I don't see how any updates can occur. Thanks in advance for your help.
I've been going through it on my own as well....their code is a nightmare!
optim_batchsize is the batch size used for optimizing the policy; timesteps_per_actorbatch is the number of time steps the agent runs before optimizing.
On the episodic thing, I am not sure. I can see two ways it could happen: either it waits until all 256 entries are filled before actually updating, or it fills the rest of the batch with dummy data that does nothing, effectively updating only on the 7 or 8 steps that the episode lasted.
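Here is a toy sketch of the first possibility, written as an assumption and not checked against the Baselines source: the collector keeps stepping across episode boundaries until it has a fixed number of timesteps, and the optimizer then cuts that segment into minibatches. The names `collect_segment` and `minibatches` and the fixed episode length are hypothetical.

```python
import random


def collect_segment(horizon, episode_len=8):
    """Collect exactly `horizon` timesteps from a toy environment.

    The toy episode ends every `episode_len` steps; the collector just resets
    and keeps going, so short episodes still fill the whole segment.
    """
    steps = []
    t_in_episode = 0
    for t in range(horizon):
        t_in_episode += 1
        done = t_in_episode == episode_len
        steps.append({"t": t, "done": done})
        if done:
            t_in_episode = 0  # "reset" the toy environment
    return steps


def minibatches(steps, batch_size, rng):
    """Shuffle the segment and yield minibatches of at most batch_size steps."""
    idx = list(range(len(steps)))
    rng.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [steps[i] for i in idx[start:start + batch_size]]
```

In this reading, the segment length plays the role of timesteps_per_actorbatch and the minibatch size plays the role of optim_batchsize, so an episode length of 7 or 8 steps does not prevent updates.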
Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_gloabl_variable_x ++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and is nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can enter the critical section safely. A subtle problem with this approach arises if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens, both threads will end up in the C.S. at the same time. This is fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions in SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
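The hang can be reproduced with a deliberately simplified model in Python. This is a toy, not a description of real warp scheduling: it assumes both flags are already set, that `turn` holds whatever value survived the serialized writes, and that the hardware runs one side of the divergent branch to completion before the other. The function name `simulate_warp` and the spin limit are made up.

```python
def simulate_warp(turn, run_first, max_spins=1000):
    """Toy model of two threads of one warp running Peterson's algorithm.

    turn      -- value left in `turn` after both serialized writes (0 or 1)
    run_first -- thread whose side of the divergent branch is executed first
    Returns "ok" if both threads get through, "hang" if the first one spins
    until max_spins (a stand-in for spinning forever).
    """
    flag = [1, 1]  # both threads have already requested the critical section
    for tid in (run_first, 1 - run_first):
        other = 1 - tid
        spins = 0
        # Same condition as the while() loop in the question.
        while flag[other] == 1 and turn == other:
            spins += 1
            if spins >= max_spins:
                return "hang"  # the other thread never runs, so nothing changes
        flag[tid] = 0  # critical section done, release the flag
    return "ok"
```

With `turn == 0`, `simulate_warp(0, run_first=0)` completes, while `simulate_warp(0, run_first=1)` spins: thread 1 waits for a flag that thread 0 can only clear once it is allowed to run.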
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general, CUDA GPUs, like most processors as you mention, do not implement sequential consistency.
Original Answer
The variables turn and flag need to be volatile; otherwise the load of flag will not be repeated, and the condition turn == 1-threadIdx.x will not be re-evaluated but will instead be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.
I've been doing some computationally intensive work in F#. Functions like Array.Parallel.map which use the .Net Task Parallel Library have sped up my code exponentially for a really quite minimal effort.
However, due to memory concerns, I remade a section of my code so that it can be lazily evaluated inside a sequence expression (this means I have to store and pass less information). When it came time to evaluate I used:
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> = seq { ...yield one thing at a time... }
// extract results from calculations for summary data
PSeq.iter someFuncToExtractResults calculations
Instead of:
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] = ...do all the things...
// extract results from calculations for summary data
Array.Parallel.map someFuncToExtractResults calculations
When using any of the Array.Parallel functions I can clearly see all the cores on my computer kick into gear (~100% CPU usage). However the extra memory required means the program never finished.
With the PSeq.iter version when I run the program, there's only about 8% CPU usage (and minimal RAM usage).
So: Is there some reason why the PSeq version runs so much slower? Is it because of the lazy evaluation? Is there some magic "be parallel" stuff I am missing?
Thanks,
Other resources, source code implementations of both (they seem to use different Parallel libraries in .NET):
https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs
https://github.com/fsharp/powerpack/blob/master/src/FSharp.PowerPack.Parallel.Seq/pseq.fs
EDIT: Added more detail to code examples and details
Code:
Seq
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> =
    seq {
        for index in 0 .. data.Length - 1 do
            yield calculationFunc data.[index]
    }
// extract results from calculations for summary data (different module)
PSeq.iter someFuncToExtractResults calculations
Array
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] =
    Array.Parallel.map calculationFunc data
// extract results from calculations for summary data (different module)
Array.Parallel.map someFuncToExtractResults calculations
Details:
The version that stores the intermediate array runs quickly (as far as it gets before crashing): it runs for under 10 minutes but uses ~70GB of RAM before it crashes (64GB physical, the rest paged)
The seq version takes over 34 minutes and uses a fraction of the RAM (only around 30GB)
There are about a billion values I'm calculating. Hence a billion doubles (at 64 bits each) = 8 GB, or 7.4505806 GiB. There are more complex forms of data... and a few unnecessary copies I'm cleaning up, hence the current massive RAM usage.
Yes the architecture isn't great, the lazy evaluation is the first part of me attempting to optimize the program and/or batch up the data into smaller chunks
With a smaller dataset, both chunks of code output the same results.
@pad, I tried what you suggested: the PSeq.iter seemed to work properly (all cores active) when fed the Calculation[], but there is still the matter of RAM (it eventually crashed)
Both the summary part of the code and the calculation part are CPU intensive (mainly because of large data sets)
With the Seq version I just aim to parallelize once
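As a quick check of the billion-doubles figure in the details above, the arithmetic can be written out as follows (the variable names are only illustrative):

```python
values = 1_000_000_000        # about a billion results
bytes_per_double = 8          # a 64-bit double

total_bytes = values * bytes_per_double
gb = total_bytes / 10**9      # decimal gigabytes
gib = total_bytes / 2**30     # binary gibibytes

print(f"{gb:.1f} GB, {gib:.7f} GiB")  # prints "8.0 GB, 7.4505806 GiB"
```

So the raw doubles account for roughly 7.45 GiB; the rest of the observed usage has to come from the more complex data and the extra copies.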
Based on your updated information, I'm shortening my answer to just the relevant part. You just need this instead of what you currently have:
let result = data |> PSeq.map (calculationFunc >> someFuncToExtractResults)
And this will work the same whether you use PSeq.map or Array.Parallel.map.
However, your real problem is not going to be solved. This problem can be stated as: when the desired degree of parallel work is reached in order to get to 100% CPU usage, there is not enough memory to support the processes.
Can you see how this will not be solved? You can either process things sequentially (less CPU efficient, but memory efficient) or you can process things in parallel (more CPU efficient, but runs out of memory).
The options then are:
Change the degree of parallelism to be used by these functions to something that won't blow your memory:
let result =
    data
    |> PSeq.withDegreeOfParallelism 2
    |> PSeq.map (calculationFunc >> someFuncToExtractResults)
Change the underlying logic for calculationFunc >> someFuncToExtractResults so that it is a single function that is more efficient and streams data through to results. Without knowing more detail, it's not simple to see how this could be done. But internally, certainly some lazy loading may be possible.
Array.Parallel.map uses Parallel.For under the hood, while PSeq is a thin wrapper around PLINQ. But the reason they behave differently here is that there is not enough work available for PSeq.iter when seq<Calculation> is produced sequentially and yields new results too slowly.
I do not see the point of using an intermediate seq or array. Assuming data is the input array, moving all calculations into one place is the way to go:
// Should use PSeq.map to match with Array.Parallel.map
PSeq.map (calculationFunc >> someFuncToExtractResults) data
and
Array.Parallel.map (calculationFunc >> someFuncToExtractResults) data
You avoid consuming too much memory and keep the intensive computation in one place, which leads to better efficiency in parallel execution.
I had a problem similar to yours and solved it by adding the following to the solution's App.config file:
<runtime>
<gcServer enabled="true" />
<gcConcurrent enabled="true"/>
</runtime>
A calculation that was taking 5'49'' and showing roughly 22% CPU utilization on Process Lasso took 1'36'' showing roughly 80% CPU utilization.
Another factor that may influence the speed of parallelized code is whether hyperthreading (Intel) or SMT (AMD) is enabled in the BIOS. I have seen cases where disabling it leads to faster execution.
I see that Erlang Efficiency User's Guide Section 5.3 recommends leaving the non-flat list as it is when being used as an iolist because the penalty of non-flattening is smaller than flattening. Is there any quantitative example of the speed difference?
When a deep list contains n elements, then performing lists:flatten on it will require Θ(n) time, and worse, Θ(n) memory allocations. How slow that is on your machine is a function of many variables; measure and ye shall know.