Simulation done/finished event in jNeuroML?

Is there some event mechanism that could let me know if the simulation is finished?
I have a LEMS file that is simulated in NEURON with:
jnml LEMS_file.xml -neuron -run
The GUI stays open after the simulation is done.

At least with LEMS -> NEURON simulations, the -nogui flag will run the simulation to the end and then close NEURON:
jnml LEMS_file.xml -neuron -run -nogui
Any commands after this one will be executed after the simulation has finished.
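For example, in a shell script the second command below only starts once NEURON has exited (analyse_results.py stands in for whatever post-processing you run; it is not part of jNeuroML):
jnml LEMS_file.xml -neuron -run -nogui && python analyse_results.py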

Related

make two dispatch() run concurrently in directx12

The problem:
there is only one [Buffer]
only compute shaders are used
there are [DispatchA] and [DispatchB]
[DispatchA] reads and writes to [Buffer]
[DispatchB] reads and writes to [Buffer]
the reads and writes of [DispatchA] and [DispatchB] do not collide
[DispatchA] will run 1 time
[DispatchB] will run 2 times, the second run starting only after the first has finished
The goal: make [DispatchA] and [DispatchB] run at the same time on the GPU.
This says that a resource can't be written by two command queues,
so DispatchA and DispatchB need to be on the same command list.
One UAV barrier would be needed between the two DispatchB calls,
but the UAV barrier will also make the second DispatchB wait for DispatchA to finish.
So is it possible to make DispatchA and DispatchB run completely asynchronously?
I asked on the official DirectX Discord;
it looks like DirectX 12 can't do this, but CUDA can,
so I am now using CUDA for compute while DirectX 12 renders to the screen.
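For reference, this is a minimal sketch of what CUDA makes easy here: two independent kernels issued on different streams may overlap on the GPU, while the two DispatchB launches stay ordered within their own stream. Kernel names, launch sizes and the buffer layout are illustrative, not taken from the original project.

#include <cuda_runtime.h>

__global__ void dispatchA(float *buf) { /* placeholder: would read/write its own region of buf */ }
__global__ void dispatchB(float *buf) { /* placeholder: would read/write a disjoint region of buf */ }

int main() {
    float *buf;
    cudaMalloc((void **)&buf, 1024 * sizeof(float));

    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    dispatchA<<<32, 64, 0, sA>>>(buf);   // runs once, on its own stream
    dispatchB<<<32, 64, 0, sB>>>(buf);   // first run, may overlap with dispatchA
    dispatchB<<<32, 64, 0, sB>>>(buf);   // second run, ordered after the first within stream sB

    cudaDeviceSynchronize();             // wait for all three launches to finish
    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    cudaFree(buf);
    return 0;
}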

Getting Linux perf samples even if my program is in sleep state?

With a program that has no sleep call, perf collects callgraph samples well.
#include <stdio.h>

int main(void)
{
    while (1)
    {
        printf("...");
    }
}
For example, more than 1,000 samples in a second.
I recorded the profile with:
sudo perf record -p <process_id> -g
and inspected it with perf report -g.
However, when I do the same with a program that calls sleep, perf does not collect callgraph samples well: only a few samples per second.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    while (1)
    {
        sleep(1);
        printf("...");
    }
}
I want to collect callgraph samples even while my program is in a sleep state, i.e. during device time. In Windows with VSPerf, the callgraph for sleep state is also collected well.
Collecting the callgraph for sleep state is needed to find performance bottlenecks not only in CPU time but also in device time (e.g. accessing a database).
I guess there may be a perf option for collecting samples even while my program is in a sleep state, because many other programmers besides me probably want this.
How can I get the perf samples even if my program is in a sleep state?
After posting this question, we found that perf record -c 1 captures about 10 samples per second. Without -c 1, perf captured 0.3 samples per second. 10 samples per second is much better for now, but it is still far fewer than 1,000 samples per second.
Is there any better way?
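For reference, -c 1 sets the sample period to one event; the recording command was presumably along these lines (shown here only as an illustration):
sudo perf record -c 1 -g -p <process_id>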
CPU samples taken while your process is in the sleep state are mostly useless, but you could emulate this behavior by using an event that records the beginning and end of the sleep syscall (capturing the stacks), and then add the "sleep stacks" yourself in post-processing by duplicating the entry stack a number of times consistent with the duration of each sleep.
After all, the stack isn't going to change while the process sleeps.
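A sketch of the recording half of that idea, assuming your kernel exposes the syscall tracepoints (sleep(3) normally ends up in nanosleep); the stack-duplication step is left to a custom post-processing script:
sudo perf record -p <process_id> -g -e syscalls:sys_enter_nanosleep -e syscalls:sys_exit_nanosleep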
When you specify a profiling target, perf will only account for events that were generated by that target. Quite naturally, a sleeping target doesn't generate many performance events.
If you would like to see other processes (like a database?) in your callgraph reports, try system-wide sampling:
-a, --all-cpus
System-wide collection from all CPUs (default if no target is specified).
(from perf man page)
In addition, if you plan to spend a lot of time actually looking at the reports, there is a tool I cannot recommend highly enough: FlameGraphs. This visualization may save you a great deal of effort.
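A typical pipeline, as a sketch (the two Perl scripts come from the FlameGraph repository, https://github.com/brendangregg/FlameGraph, assumed here to be in the current directory):
sudo perf record -a -g -- sleep 30
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > out.svg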

Interrupt during network I/O == crash?

It seems that when an I/O pin interrupt occurs while network I/O is being performed, the system resets, even if the interrupt function only declares a local variable and assigns it (essentially a do-nothing routine). So I'm fairly certain it isn't due to spending too much time in the interrupt function. (My actual working interrupt functions are pretty spartan, strictly increment and assign, without even any conditional logic.)
Is this a known constraint? My workaround is to disconnect the interrupt while using the network, but of course this introduces the potential for data loss.
-- Rising-edge callback: record the trigger time and re-arm for the falling edge
function fnCbUp(level)
    lastTrig = rtctime.get()
    gpio.trig(pin, "down", fnCbDown)
end

-- Falling-edge callback: bump a counter held in RTC memory, then re-arm for the rising edge
function fnCbDown(level)
    local spin = rtcmem.read32(20)
    spin = spin + 1
    rtcmem.write32(20, spin)
    lastTrig = rtctime.get()
    gpio.trig(pin, "up", fnCbUp)
end

gpio.trig(pin, "down", fnCbDown)
gpio.mode(pin, gpio.INT, gpio.FLOAT)
branch: master
build built on: 2016-03-15 10:39
powered by Lua 5.1.4 on SDK 1.4.0
modules: adc,bit,file,gpio,i2c,net,node,pwm,rtcfifo,rtcmem,rtctime,sntp,tmr,uart,wifi
Not sure if this should be an answer or a comment; it may be a bit long for a comment though.
So, the question is "Is this a known constraint?" and the short but unsatisfactory answer is "no". Can't leave it at that, though...
Is the code excerpt enough for you to conclude the reset must occur due to something within those few lines? I doubt it.
What you seem to be doing is a simple "global" increment on each GPIO 'down', with some debounce logic. However, I don't see any actual debouncing; what am I missing? You get the time into the global lastTrig but you don't do anything with it. Just for debouncing you won't need rtctime IMO, but I doubt it's got anything to do with the problem.
I have a gist of a tmr.delay-based debounce as well as one with tmr.now that is more like a throttle. You could use the first like so:
GPIO14 = 5
spin = 0

function down()
    spin = spin + 1
    tmr.delay(50)                    -- time delay for switch debounce
    gpio.trig(GPIO14, "up", up)      -- change trigger to the rising edge
end

function up()
    tmr.delay(50)
    gpio.trig(GPIO14, "down", down)  -- trigger on the falling edge
end

gpio.mode(GPIO14, gpio.INT)          -- gpio.FLOAT by default
gpio.trig(GPIO14, "down", down)
I also suggest running this against the dev branch, because you said it may be related to network I/O during interrupts.
I have nearly the same problem.
Running ESP8266WebServer and using a GPIO14 interrupt, with input pulses arriving too fast, the system stops recording the interrupts.
Please see here for more details:
http://www.esp8266.com/viewtopic.php?f=28&t=9702
I'm using the Arduino IDE 1.6.9, but the problem seems to be the same.
I used an ESP8266-07 as generator & counter (without the webserver) to generate the pulses, wired to my ESP8266-Watersystem.
The generator works very well at well over 240 pulses/sec, generating and counting on the same ESP.
But the ESP-Watersystem stops recording interrupts at more than about 50 pulses/second:
/*************************************************/
/* ISR Water pulse counter */
/*************************************************/
/**
* Invoked by interrupt14 once per rotation of the hall-effect sensor. Interrupt
* handlers should be kept as small as possible so they return quickly.
*/
void ICACHE_RAM_ATTR pulseCounter()
{
    // Increment the pulse counter
    cli();
    G_pulseCount++;
    Serial.println ( "!" );
    sei();
}
The serial output is here only to show what is happening.
It shows the correct pulse count until the webserver interacts with the network.
Then it seems the interrupt is blocked (no serial output from that point on).
When I stress the system by refreshing the website several times in a short period, the interrupt counting starts again for a short while, but then stops again shortly after.
The problem lies somewhere in the interplay between interrupt handling and the web services.
I hope this helps to track down the issue.
I'm interested in any solutions.
Who can help?
Thanks from Mickbaer,
Berlin, Germany
Email: michael.lorenz#web.de
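For comparison, a common pattern on the ESP8266 Arduino core is to keep the ISR down to a bare volatile increment and do all printing from loop(). This is only a minimal sketch (pin number and names are illustrative), not a confirmed fix for the resets and blocking reported above:

volatile unsigned long G_pulseCount = 0;

void ICACHE_RAM_ATTR pulseCounter() {
    G_pulseCount++;                       // nothing else here: no Serial, no cli()/sei()
}

void setup() {
    Serial.begin(115200);
    pinMode(14, INPUT);
    attachInterrupt(digitalPinToInterrupt(14), pulseCounter, FALLING);
}

void loop() {
    static unsigned long last = 0;
    noInterrupts();
    unsigned long count = G_pulseCount;   // copy the shared counter with interrupts off
    interrupts();
    if (count != last) {                  // report changes from the main loop instead of the ISR
        Serial.println(count);
        last = count;
    }
    delay(10);
}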

What are the things that I should save to a file/db with Reinforcement Learning?

I'm trying to get into machine learning, and decided to try things out for myself. I wrote a small tic-tac-toe game. So far, the computer plays against itself using random moves.
Now, I want to apply reinforcement learning by writing an agent that will explore or exploit based on the knowledge it has on the current state of the board.
The part I don't understand is this:
What does the agent use to train itself for the current state? Let's say an RNG bot ('o' player) does this:
[..][..][..]
[..][x][o]
[..][..][..]
Now the agent has to decide what the best move should be. A well-trained one would pick the 1st, 3rd, 7th or 9th cell. Does it look up a similar state in the DB that led it to a win? Because if so, I think I would need to save every single move into the DB up to its eventual end state (win/lose/draw), and that would be quite a lot of data for a single play?
If I'm thinking about this the wrong way, I would like to know how to do it correctly.
Learning
1) Observe a current board state s;
2) Make the next move based on the distribution of all available V(s') of next moves. Strictly, the choice is often based on a Boltzmann distribution over V(s'), but it can be simplified to the maximum-value move (greedy) or, with some probability epsilon, a random move, as you are doing;
3) Record s' in a sequence;
4) If the game finishes, update the values of the visited states in the sequence and start over again; otherwise, go to 1).
Game Playing
1) Observe a current board state s;
2) Make a next move based on the distribution of all available V(s') of next moves;
3) When the game is over, start over again; otherwise, go to 1).
Regarding your question: yes, the look-up table used in the Game Playing phase is built up during the Learning phase. Every move, the state is chosen from among all the V(s), of which there are at most 3^9 = 19683. Here is a sample code written in Python that runs 10000 games in training.
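As a rough illustration of the table and update described above (this is not the linked sample; the board encoding, reward and the alpha/epsilon constants are assumptions):

import random

V = {}                       # board state (e.g. a 9-character string) -> estimated value
ALPHA, EPSILON = 0.1, 0.1    # learning rate and exploration rate (assumed values)

def value(state):
    return V.setdefault(state, 0.5)           # unseen states start at a neutral value

def choose(candidate_states):
    # Epsilon-greedy: usually pick the highest-valued next state, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(candidate_states)
    return max(candidate_states, key=value)

def update(visited, final_reward):
    # After a finished game, back the result up through the sequence of visited states.
    V[visited[-1]] = final_reward
    for s, s_next in zip(reversed(visited[:-1]), reversed(visited[1:])):
        V[s] = value(s) + ALPHA * (value(s_next) - value(s))

Only the dictionary V needs to be persisted to a file or DB between runs; the per-game sequence of visited states can be discarded once the update has been applied.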

Parallel depth-first search in Erlang is slower than its sequential counterpart

I am trying to implement a modified parallel depth-first search algorithm in Erlang (let's call it *dfs_mod*).
All I want to get is all the 'dead-end paths', which are basically the paths returned when *dfs_mod* visits a vertex without neighbours or a vertex whose neighbours have all been visited already. I save each path to ets_table1 if my custom function fun1(Path) returns true, and to ets_table2 if fun1(Path) returns false (I need to filter the resulting 'dead-end' paths with a custom filter).
I have implemented a sequential version of this algorithm and for some strange reason it performs better than the parallel one.
The idea behind the parallel implementation is simple:
visit a Vertex from [Vertex|Other_vertices] = Unvisited_neighbours,
add this Vertex to the current path;
send {self(), wait} to the 'collector' process;
run *dfs_mod* for Unvisited_neighbours of the current Vertex in a new process;
continue running *dfs_mod* with the rest of the provided vertices (Other_vertices);
when there are no more vertices to visit - send {self(), done} to the collector process and terminate;
So, basically each time I visit a vertex with unvisited neighbours I spawn a new depth-first search process and then continue with the other vertices.
Right after spawning the first *dfs_mod* process I start to collect all {Pid, wait} and {Pid, done} messages (the wait message is there to keep the collector waiting for all the done messages). After N milliseconds of waiting, the collector function returns ok.
For some reason, this parallel implementation runs from 8 to 160 seconds while the sequential version runs just 4 seconds (the testing was done on a fully-connected digraph with 5 vertices on a machine with Intel i5 processor).
Here are my thoughts on such a poor performance:
I pass the digraph Graph to each new process which runs *dfs_mod*. Maybe doing digraph:out_neighbours(Graph) against one digraph from many processes causes this slowness?
I accumulate the current path in a list and pass it to each new spawned *dfs_mod* process, maybe passing so many lists is the problem?
I use an ETS table to save a path each time I visit a new vertex and add it to the path. The ETS properties are [bag, public, {write_concurrency, true}], but maybe I am doing something wrong?
each time I visit a new vertex and add it to the path, I check a path with a custom function fun1() (it basically checks if the path has vertices labeled with letter "n" occurring before vertices with "m" and returns true/false depending on the result). Maybe this fun1() slows things down?
I have tried to run *dfs_mod* without collecting done and wait messages, but htop shows a lot of Erlang activity for quite a long time after *dfs_mod* returns ok in the shell, so I do not think that the active message passing slows things down.
How can I make my parallel dfs_mod run faster than its sequential counterpart?
Edit: when I run the parallel *dfs_mod*, pman shows no processes at all, although htop shows that all 4 CPU threads are busy.
There is no quick way to know without the code, but here's a quick list of why this might fail:
You might be confusing parallelism and concurrency. Erlang's model is shared-nothing and aims for concurrency first (running distinct units of code independently). Parallelism is only an optimization of this (running some of the units of code at the same time). Usually, parallelism will take form at a higher level, say you want to run your sorting function on 50 different structures -- you then decide to run 50 of the sequential sort functions.
You might have synchronization problems or sequential bottlenecks, effectively changing your parallel solution into a sequential one.
The overhead of copying data, context switching and whatnot dwarfs the gains you get from parallelism. The former is especially true of large data sets that you break into sub-data sets and then join back into a large one. The latter is especially true of highly sequential code, as seen in process ring benchmarks.
If I wanted to optimize this, I would try to reduce message passing and data copying to a minimum.
If I were the one working on this, I would keep the sequential version. It does what it says it should do, and when it is part of a larger system, as soon as you have more processes than cores, parallelism will come from the many independent calls to the search function rather than from branches inside it. In the long run, if it is part of a server or service, using the sequential version N times should have no more negative impact than a parallel one that ends up creating many, many more processes to do the same task and risks overloading the system.
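To make the "run many sequential instances in parallel" point concrete, here is a minimal, generic sketch (module and function names are illustrative and not taken from the question's code):

-module(pmap_sketch).
-export([pmap/2]).

%% Spawn one process per input, each running the *sequential* function F,
%% then collect the results in the original order.
pmap(F, Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, F(Item)} end),
                Ref
            end || Item <- Items],
    [receive {Ref, Result} -> Result end || Ref <- Refs].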
