make two dispatch() run concurrently in directx12 - memory-barriers

the problem:
there is only one [Buffer]
only compute shaders are used
there is [DispatchA] and [DispatchB]
[DispatchA] reads and writes to [Buffer]
[DispatchB] reads and writes to [Buffer]
[DispatchA] and [DispatchB] reads and writes do not collide
[DispatchA] will run 1 time
[DispatchB] will run 2 times, the second time only after the first time finished
make [DispatchA] and [DispatchB] run at the same time on the GPU
this says that a resource can't be written by two command queues
so DispatchA and DispatchB need to be on the same command list
one UAVbarrier would be needed between the two DispatchB
but UAVbarrier will make the second DispatchB wait for DispatchA to finish
so is it possible to make DispatchA and DispatchB run completely asynchronous?

I asked on the official directx discord
looks like directx12 can't do but CUDA can
so I am now using CUDA as compute and directx12 renders to the screen

Related

Windowing appears to work when running on the DirectRunner, but not when running on Cloud Dataflow

I'm trying to break fusion with a GroupByKey. This creates one huge window and since my job is big I'd rather start emitting.
With the direct runner using something like what I found here it seems to work. However, when run on Cloud Dataflow it seems to batch the GBK together and not emit output until the source nodes have "succeeded".
I'm doing a bounded/batch job. I'm extracting the contents of archive files and then writing them to gcs.
Everything works except it takes longer than I expected and cpu utilization is low. I suspect that this is due to fusion -- my hypothesis is that the extraction is fused to the write operation and so there's a pattern of extraction / higher CPU followed by less CPU because we're doing network calls and back again.
The code looks like:
.apply("Window",
Window.<MyType>into(new GlobalWindows())
.triggering(
Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5))))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
.apply("Add key", MapElements...)
.apply(GroupByKey.create())
Locally I verify using debug logs so that I can see work is being done after the GBK. The timestamp between the first extraction finishing and the first post-GBK op usually reflects the 5s duration (or other values I change it to (1,5,10,20,30)).
On GCP I verify by looking at the pipeline structure and I can see that everything after the GBK is "not started" and the output collection of the GBK is empty ("-") while the input collection has millions of elements.
Edit:
this is on beam v2.10.0.
the extraction is being done by a SplittableDoFn (not sure this is relevant)
Looks like the answer you referred to was for a streaming pipeline (unbounded input). For batch pipeline processing a bounded input, GroupByKey will not emit till all data for a given key has been processed. Please see here for more details.

How to calculate CPU time in elixir when multiple actors/processes are involved?

Let's say I have a function which does some work by spawning multiple processes. I want to compare CPU time vs real time taken by this function.
def test do
prev_real = System.monotonic_time(:millisecond)
# Code to complete some task
# Spawn different processes & give each process some task
# Receive result
# Finish task
current_real = System.monotonic_time(:millisecond)
diff_real = current_real - prev_real
IO.puts "Real time " <> to_string(diff_real)
IO.puts "CPU time ?????"
end
How to calculate CPU time required by the given function? I am interested in calculating CPU time/Real time ratio.
If you are just trying to profile your code rather than implement your own profiling framework I would recommend using already existing tools like:
fprof which will give you information about time spent in functions (real and own)
percept which will provide you information about which processes in your system ware working at any given time and on what
xprof which is design to help you find which calls to your function will cause it to take more time (trigger inefficient branch of code).
They take advantage of both erlang:trace to figure out which function is being executed and for how long and erlang:system_profile with runnable_procs to determine which processes are currently running. You might start a function, hit a receive or be preemptive rescheduled and wait without doing any actual work. Combining those two might be complicated, and I would recommend using already existing tools before trying glue together your own.
You could also look into tools like erlgrind and eflame if you are looking for more visual representations of your calls.

Can I profile Lua scripts running in Redis?

I have a cluster app that uses a distributed Redis back-end, with dynamically generated Lua scripts dispatched to the redis instances. The Lua component scripts can get fairly complex and have a significant runtime, and I'd like to be able to profile them to find the hot spots.
SLOWLOG is useful for telling me that my scripts are slow, and exactly how slow they are, but that's not my problem. I know how slow they are, I'd like to figure out which parts of them are slow.
The redis EVAL docs are clear that redis does not export any timekeeping functions to lua, which makes it seem like this might be a lost cause.
So, short a custom fork of Redis, is there any way to tell which parts of my Lua script are slower than others?
EDIT
I took Doug's suggestion and used debug.sethook - here's the hook routine I inserted at the top of my script:
redis.call('del', 'line_sample_count')
local function profile()
local line = debug.getinfo(2)['currentline']
redis.call('zincrby', 'line_sample_count', 1, line)
end
debug.sethook(profile, '', 100)
Then, to see the hottest 10 lines of my script:
ZREVRANGE line_sample_count 0 9 WITHSCORES
If your scripts are processing bound (not I/O bound), then you may be able to use the debug.sethook function with a count hook:
The count hook: is called after the interpreter executes every
count instructions. (This event only happens while Lua is executing a
Lua function.)
You'll have to build a profiler based on the counts you receive in your callback.
The PepperfishProfiler would be a good place to start. It uses os.clock which you don't have, but you could just use hook counts for a very crude approximation.
This is also covered in PiL 23.3 – Profiles
In standard Lua C, you can't. It's not a built-in function - it only returns seconds. So, there are two options available: You either write your own Lua extension DLL to return the time in msec, or:
You can do a basic benchmark using a millisecond-resolution time. You can access the current millisecond time with LuaSocket. Though this adds a dependency to your project, it's an effective way to do trivial benchmarking.
require "socket"
t = socket.gettime();

Parallel depth-first search in Erlang is slower than its sequential counterpart

I am trying to implement a modified parallel depth-first search algorithm in Erlang (let's call it *dfs_mod*).
All I want to get is all the 'dead-end paths' which are basically the paths that are returned when *dfs_mod* visits a vertex without neighbours or a vertex with neighbours which were already visited. I save each path to ets_table1 if my custom function fun1(Path) returns true and to ets_table2 if fun1(Path) returns false(I need to filter the resulting 'dead-end' paths with some customer filter).
I have implemented a sequential version of this algorithm and for some strange reason it performs better than the parallel one.
The idea behind the parallel implementation is simple:
visit a Vertex from [Vertex|Other_vertices] = Unvisited_neighbours,
add this Vertex to the current path;
send {self(), wait} to the 'collector' process;
run *dfs_mod* for Unvisited_neighbours of the current Vertex in a new process;
continue running *dfs_mod* with the rest of the provided vertices (Other_vertices);
when there are no more vertices to visit - send {self(), done} to the collector process and terminate;
So, basically each time I visit a vertex with unvisited neighbours I spawn a new depth-first search process and then continue with the other vertices.
Right after spawning a first *dfs_mod* process I start to collect all {Pid, wait} and {Pid, done} messages (wait message is to keep the collector waiting for all the done messages). In N milliseconds after waiting the collector function returns ok.
For some reason, this parallel implementation runs from 8 to 160 seconds while the sequential version runs just 4 seconds (the testing was done on a fully-connected digraph with 5 vertices on a machine with Intel i5 processor).
Here are my thoughts on such a poor performance:
I pass the digraph Graph to each new process which runs *dfs_mod*. Maybe doing digraph:out_neighbours(Graph) against one digraph from many processes causes this slowness?
I accumulate the current path in a list and pass it to each new spawned *dfs_mod* process, maybe passing so many lists is the problem?
I use an ETS table to save a path each time I visit a new vertex and add it to the path. The ETS properties are ([bag, public,{write_concurrency, true}), but maybe I am doing something wrong?
each time I visit a new vertex and add it to the path, I check a path with a custom function fun1() (it basically checks if the path has vertices labeled with letter "n" occurring before vertices with "m" and returns true/false depending on the result). Maybe this fun1() slows things down?
I have tried to run *dfs_mod* without collecting done and wait messages, but htop shows a lot of Erlang activity for quite a long time after *dfs_mod* returns ok in the shell, so I do not think that the active message passing slows things down.
How can I make my parallel dfs_mod run faster than its sequential counterpart?
Edit: when I run the parallel *dfs_mod*, pman shows no processes at all, although htop shows that all 4 CPU threads are busy.
There is no quick way to know without the code, but here's a quick list of why this might fail:
You might be confusing parallelism and concurrency. Erlang's model is shared-nothing and aims for concurrency first (running distinct units of code independently). Parallelism is only an optimization of this (running some of the units of code at the same time). Usually, parallelism will take form at a higher level, say you want to run your sorting function on 50 different structures -- you then decide to run 50 of the sequential sort functions.
You might have synchronization problems or sequential bottlenecks, effectively changing your parallel solution into a sequential one.
The overhead of copying data, context switching and whatnot dwarfs the gains you have in terms of parallelism. This former is especially true of large data sets that you break into sub data sets, then join back into a large one. The latter is especially true of highly sequential code, as seen is the process ring benchmarks.
If I wanted to optimize this, I would try to reduce message passing and data copying to a minimum.
If I were the one working on this, I would keep the sequential version. It does what it says it should do, and when part of a larger system, as soon as you have more processes than core, parallelism will come from the many calls to the sort function rather than branches of the sort function. In the long run, if part of a server or service, using the sequential version N times should have no more negative impact than a parallel one that ends up creating many, many more processes to do the same task, and risk overloading the system more.

Lua task scheduling

I've been writing some scripts for a game, the scripts are written in Lua. One of the requirements the game has is that the Update method in your lua script (which is called every frame) may take no longer than about 2-3 milliseconds to run, if it does the game just hangs.
I solved this problem with coroutines, all I have to do is call Multitasking.RunTask(SomeFunction) and then the task runs as a coroutine, I then have to scatter Multitasking.Yield() throughout my code, which checks how long the task has been running for, and if it's over 2 ms it pauses the task and resumes it next frame. This is ok, except that I have to scatter Multitasking.Yield() everywhere throughout my code, and it's a real mess.
Ideally, my code would automatically yield when it's been running too long. So, Is it possible to take a Lua function as an argument, and then execute it line by line (maybe interpreting Lua inside Lua, which I know is possible, but I doubt it's possible if all you have is a function pointer)? In this way I could automatically check the runtime and yield if necessary between every single line.
EDIT:: To be clear, I'm modding a game, that means I only have access to Lua. No C++ tricks allowed.
check lua_sethook in the Debug Interface.
I haven't actually tried this solution myself yet, so I don't know for sure how well it will work.
debug.sethook(coroutine.yield,"",10000);
I picked the number arbitrarily; it will have to be tweaked until it's roughly the time limit you need. Keep in mind that time spent in C functions etc will not increase the instruction count value, so a loop will reach this limit far faster than calls to long-running C functions. It may be viable to set a far lower value and instead provide a function that sees how much os.clock() or similar has increased.

Resources