Using 'perf trace' to trace TSX aborts in a specific thread - perf

I am trying to use the 'perf trace' command to trace TSX aborts in a specific thread, but perf rejects my arguments. These are all the variants I thought might be right and tried:
perf trace --pid 24265 --event tx-abort
perf trace --pid 24265 --event {tx-abort}
perf trace --pid 24265 --event {'tx-abort'}
perf trace --pid {24265} --event tx-abort
perf trace --pid {24265} --event {tx-abort}
perf trace --pid {24265} --event {'tx-abort'}
perf trace --pid {'24265'} --event tx-abort
perf trace --pid {'24265'} --event {tx-abort}
perf trace --pid {'24265'} --event {'tx-abort'}
Every one of them fails with the error 'Problems parsing the target to trace,check your options'.
Is there any way to let perf trace run as expected?

The issue is not the argument syntax; your first line should be just fine. First, check whether tx-abort is listed by perf list, to see if it is supported on your system at all. The error may also occur because the specified pid does not exist.
The TSX events are PMU events. Unlike syscalls or tracepoints, the individual event is not instrumented in software. Instead, a hardware counter within the Performance Monitoring Unit counts these events and triggers an interrupt after a certain number of them. Taking a sample on every single event is typically not feasible for PMU events. I suspect that is why it doesn't work with perf trace, which was originally intended for syscalls, even though the documentation is a bit vague as to what types of events are supported.
Note that I can reproduce that it doesn't work, but I get an "Invalid argument" error instead. That PMU events are unsupported by perf trace is a bit of speculation on my part.
Intel has extensive documentation on analyzing TSX with perf, which gives examples and explanations of how to use the tx events with perf record.
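One of the checks suggested in this answer, whether the target pid exists at all, can be done before invoking perf. The following is a minimal sketch, assuming a POSIX system; the helper name pid_exists is made up for illustration and is not part of perf or any library.

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only).

    Hypothetical helper for illustration; not part of perf.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        # Signal 0 delivers nothing; the kernel only performs the
        # existence and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


if __name__ == "__main__":
    # The current process always exists, so this prints True.
    print(pid_exists(os.getpid()))
```

If this returns False for the pid passed to --pid, the target itself is the problem rather than the event name.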

Related

Can I get line number or stack trace from gtkmm and glibmm runtime diagnostic messages?

I'm writing a toy program using gtkmm and glibmm, which are C++ bindings for GTK and Glib. I'm getting some runtime diagnostic messages. They look like this:
(process:2933): GLib-GObject-CRITICAL **: 16:19:16.920: g_object_set_qdata_full: assertion 'quark > 0' failed
The message prefix already includes the pid and, most of the time, the filename. However, I'd also like to see the line numbers of the corresponding source. Is that possible?
Please don't treat this as an X-Y question or fixate on this particular message, since I periodically run into many such messages and eventually manage to fix them.
As described in Running GLib Applications, setting the environment variable G_DEBUG to include "fatal-warnings" or "fatal-criticals" makes the program raise SIGTRAP whenever such a diagnostic is emitted. Catching this signal in a debugger lets you inspect the stack, etc.
Example usage:
# to dump stack upon first critical diagnostic and quit
G_DEBUG=fatal-criticals gdb -batch -ex "run" -ex "i stack" ./a.out
# or upon the first and second critical or warning diagnostic, whichever comes first
G_DEBUG=fatal-warnings gdb -batch -ex "run" -ex "i stack" -ex "cont" -ex "i stack" ./a.out
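The two gdb invocations above follow a pattern: one "run", then a stack dump, then a "cont" plus another stack dump for each further diagnostic. A rough sketch of that pattern for an arbitrary number of stops is shown below. The function names are hypothetical, and "bt" is used as gdb's alias for a backtrace in place of "i stack".

```python
import os
import subprocess
from typing import List


def gdb_backtrace_command(binary: str, stops: int = 1) -> List[str]:
    """Build a gdb batch command that dumps the stack at the first `stops` traps.

    Hypothetical helper mirroring the shell examples; `stops` must be >= 1.
    """
    cmd = ["gdb", "-batch", "-ex", "run", "-ex", "bt"]
    for _ in range(stops - 1):
        # Continue to the next fatal diagnostic and dump the stack again.
        cmd += ["-ex", "cont", "-ex", "bt"]
    return cmd + [binary]


def run_with_fatal_diagnostics(binary: str, stops: int = 1,
                               level: str = "fatal-criticals"):
    """Run `binary` under gdb with G_DEBUG set to `level`."""
    env = dict(os.environ, G_DEBUG=level)
    return subprocess.run(gdb_backtrace_command(binary, stops), env=env)
```

For example, run_with_fatal_diagnostics("./a.out", stops=2, level="fatal-warnings") corresponds to the second shell command.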

dask jobqueue worker failure at startup 'Resource temporarily unavailable'

I'm running dask over slurm via jobqueue and I have been getting 3 errors pretty consistently...
Basically my question is what could be causing these failures? At first glance the problem is that too many workers are writing to disk at once, or my workers are forking into many other processes, but it's pretty difficult to track that. I can ssh into the node but I'm not seeing an abnormal number of processes, and each node has a 500gb ssd, so I shouldn't be writing excessively.
Everything below this is just information about my configurations and such
My setup is as follows:
cluster = SLURMCluster(cores=1, memory=f"{args.gbmem}GB", queue='fast_q', name=args.name,
                       env_extra=["source ~/.zshrc"])
cluster.adapt(minimum=1, maximum=200)
client = await Client(cluster, processes=False, asynchronous=True)
I suppose I'm not even sure whether processes=False should be set.
I run this starter script via sbatch with 4 GB of memory, 2 cores (-c) (even though I expect to need only 1), and 1 task (-n). This launches all of my jobs via the SLURMCluster config above. I dumped my Slurm submission scripts to files and they look reasonable.
Each job is not complex: it is a subprocess.call() invocation of a compiled executable that takes 1 core and 2-4 GB of memory. I require the client call and further calls to be asynchronous because I have a lot of conditional computations. So each worker, when loaded, should consist of 1 Python process, 1 running executable, and 1 shell.
Imposed by the scheduler we have
>> ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 512
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) 64
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 1031203
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
Each node has 64 cores, so I don't really think I'm hitting any limits.
I'm using a jobqueue.yaml file that looks like this:
slurm:
  name: dask-worker
  cores: 1 # Total number of cores per job
  memory: 2 # Total amount of memory per job
  processes: 1 # Number of Python processes per job
  local-directory: /scratch # Location of fast local storage like /scratch or $TMPDIR
  queue: fast_q
  walltime: '24:00:00'
  log-directory: /home/dbun/slurm_logs
I would appreciate any advice at all! Full log is below.
FORK BLOCKING IO ERROR
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.131.82:13687'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/dbun/.local/share/pyenv/versions/3.7.0/lib/python3.7/multiprocessing/forkserver.py", line 250, in main
    pid = os.fork()
BlockingIOError: [Errno 11] Resource temporarily unavailable
distributed.dask_worker - INFO - End worker
Aborted!
CANT START NEW THREAD ERROR
https://pastebin.com/ibYUNcqD
BLOCKING IO ERROR
https://pastebin.com/FGfxqZEk
EDIT:
Another piece of the puzzle:
It looks like dask_worker is running multiple multiprocessing.forkserver calls. Does that sound reasonable?
https://pastebin.com/r2pTQUS4
This problem was caused by having ulimit -u too low.
As it turns out, each worker has a few processes associated with it, and the Python ones have multiple threads. In the end, each worker accounts for approximately 14 threads that count toward ulimit -u. Mine was set to 512, and on a 64-core system I was likely hitting ~896 (14 * 64). With a limit of 512, the maximum number of threads per worker I could have afforded would have been 8 (512 / 64).
Solution:
In .zshrc (or .bashrc) I added the line
ulimit -u unlimited
Haven't had any problems since.
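The arithmetic behind this answer can be checked with a short script. This is only a sketch: the figure of 14 threads per worker is the answer's own estimate, the function names are made up for illustration, and resource.RLIMIT_NPROC is assumed to be available (it is on Linux).

```python
import resource
from typing import Optional


def tasks_per_node(threads_per_worker: int, workers_per_node: int) -> int:
    """Estimated number of threads/processes counted against ulimit -u."""
    return threads_per_worker * workers_per_node


def max_threads_per_worker(limit: int, workers_per_node: int) -> int:
    """Largest per-worker thread count that still fits under `limit`."""
    return limit // workers_per_node


def current_nproc_limit() -> Optional[int]:
    """Soft limit behind `ulimit -u`, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__":
    needed = tasks_per_node(14, 64)  # estimate from the answer
    limit = current_nproc_limit()
    print("estimated tasks per node:", needed)
    print("current soft limit:", "unlimited" if limit is None else limit)
```

With the numbers from the question, tasks_per_node(14, 64) exceeds the limit of 512, and max_threads_per_worker(512, 64) gives the per-worker ceiling mentioned above.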

GNU Parallel: suppress warning when input is read from terminal

When input is read from terminal, GNU Parallel always displays a warning:
parallel: Warning: Input is read from the terminal. Only experts do this on purpose. Press CTRL-D to exit.
But sometimes I do want to read from terminal (e.g., when I'm copy & pasting stuff from elsewhere entry by entry). Is it possible to turn off this warning? I couldn't find such an option in man parallel or man parallel_tutorial.
Note that I don't want a cheap solution like 2>/dev/null, since warning messages from other programs will be turned off, too. For instance, consider the following simple script:
#!/bin/bash
function print12 () {
    echo "printing $1 to stdout"
    echo "printing $1 to stderr" >/dev/stderr
}
export -f print12
SHELL=/bin/bash parallel -k print12 2>/dev/null
Messages printed to stderr will all be suppressed.
Just realized that I can do a cat or some read </dev/tty to achieve my desired effect. But let's just focus on the original question.
It cannot be turned off. But see it as praise: since you are doing it on purpose, you are an expert (at least in the eyes of GNU Parallel).
As it is just a warning, you are free to paste your arguments and have them run: The warning does not stop GNU Parallel from reading your input.
If you really do not like the warning:
cat | parallel ...
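The cat workaround works because, with a pipe in front of it, the program's standard input is no longer a terminal. The sketch below illustrates that general mechanism in Python; it is not GNU Parallel's actual code, and the function names are invented for the example.

```python
import os


def reads_from_terminal(fd: int) -> bool:
    """Return True if the file descriptor is attached to a terminal."""
    return os.isatty(fd)


def pipe_looks_like_terminal() -> bool:
    """Check the read end of a fresh pipe, as seen by a program after `cat |`."""
    read_end, write_end = os.pipe()
    try:
        return reads_from_terminal(read_end)
    finally:
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    # A pipe is never a terminal, so a terminal-input warning based on
    # this kind of check would not fire.
    print(pipe_looks_like_terminal())
```

A program that warns only when its input is a terminal would therefore stay silent when fed through cat, while its other messages on stderr remain visible.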

EUnit does not output the exception stack trace

I'm writing a test with EUnit, but no exception details are output in the console.
exp_test() ->
?assertEqual(0, 1/0).
Running module:exp_test() in the Erlang shell outputs the following:
** exception error: bad argument in an arithmetic expression
in function exp_test:'-exp_test/0-fun-0-'/1 (src/test/eunit/xxx_test.erl, line 8)
But EUnit outputs the following:
> eunit:test(xxx).
> xxx_test: exp_test...*failed*
::badarith
EUnit does not output any exception trace info.
I tried the verbose option in EUnit, but it had no effect.
I want the EUnit test result to include some exception details.
Thanks~
The problem seems to be that the version of eunit shipped with R15 does not understand the new stack trace format in R15. This has been fixed in the development version of eunit: github.com/richcarl/eunit
For example:
Eshell V5.10 (abort with ^G)
1> eunit:test(fun() -> (fun() -> exit(foo), ok end)() end).
erl_eval: expr...*failed*
in function erl_eval:do_apply/6 (erl_eval.erl, line 576)
in call from erl_eval:exprs/5 (erl_eval.erl, line 118)
**exit:foo
I hope this will make it into the next release of OTP R15.
This is a known problem in eunit as released in R15B and R15B01. This has been fixed in release R15B02. If you're stuck with an earlier version, you can download and apply a patch:
A workaround for releases before R15B02
You can fix the problem in your local installation by recompiling the affected module:
Download and unpack the Erlang/OTP sources, if you don't have them already.
wget http://www.erlang.org/download/otp_src_R15B01.tar.gz
tar xzf otp_src_R15B01.tar.gz
cd otp_src_R15B01
Download and apply the patch.
wget -O eunit-stacktrace.patch https://github.com/erlang/otp/commit/73b94a990bb91fd263dace4ccbaef6ff727a9637.patch
patch -p1 < eunit-stacktrace.patch
Recompile eunit_lib.erl.
cd lib/eunit
erlc -o ebin -I include src/eunit_lib.erl
Copy the new eunit_lib.beam over the old one (usually somewhere below /usr/local).
ls /usr/local/lib/erlang/lib/eunit-2.2.2/ebin/
# check that eunit_lib.beam is actually there
sudo cp ebin/eunit_lib.beam /usr/local/lib/erlang/lib/eunit-2.2.2/ebin/
Eunit is quite old and, while it is officially maintained by the OTP team at Ericsson, it is usually uncared for. Eunit currently has the bad habit of eating up stack traces, and hasn't been updated for R15's line numbers in exceptions.
I wouldn't argue that "that's how it's supposed to work". No sane test tool should hide exception details and line numbers from you.
A trick I like to use is ?debugVal(catch expr), where expr is either a begin ... end block or a call to the failing function. For example, ?debugVal(catch begin 1 = 2 end) will output a stack trace in your tests.

Boot script terminates and gives error (no error logger present)

I made an Erlang application that should be started when the operating system boots. The boot script is stored in /etc/init.d and looks like this:
#!/bin/sh
cd $ROOT/lib/di
INET_ADDR=$(ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}')
NODE_NAME=$(echo di#$INET_ADDR)
erl -pa $PWD/ebin -pa $PWD/deps/*/ebin -name $NODE_NAME -boot di $1 -setcookie agfeo
The script tries to determine the IP address of the machine in order to give the node a unique name. When the machine boots, the script gets executed automatically. On the terminal I get the following output:
(no error logger present) error: "Error in process <0.1.0> with exit value:
{badarg,[{erlang,list_to_atom,[[<<2 bytes>>,<<5 bytes>>,46,98,111,111,116]]},
{init,get_boot,2},{init,do_boot,3}]}"
{"init terminating in do_boot",{badarg,[{erlang,list_to_atom,[[<<2 bytes>>,<<5 bytes>>,46,98,111,111,116]]},
{init,get_boot,2},{init,do_boot,3}]}}
init terminating in do_boot ()
This is what the shell prints when the script is run automatically.
When I call the script manually, my application starts normally, without any problems.
Could anybody please explain what the error message above means?
If we look at the stack trace, the last function executed is init:get_boot/2 and the last call is erlang:list_to_atom([<<2 bytes>>, <<5 bytes>>, ".boot"]). In init:get_boot/2 there are three lines with list_to_atom, so the error should be one of the following:
'cannot get bootfile';
'bootfile format error';
I believe the error is 'cannot get bootfile'.
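The integers at the end of the list in the error message are character codes, and decoding them confirms the ".boot" suffix mentioned in this answer. The snippet below is a small sketch of that decoding. The remark about the binaries is only an observation on their lengths: "di" from the -boot argument is 2 bytes long, and the 5-byte binary would be consistent with whatever $1 expanded to at boot time (for example a word such as "start"), which is a guess and not something the error output shows.

```python
def decode_codepoints(codes):
    """Turn a list of integer character codes into a string."""
    return "".join(chr(c) for c in codes)


# Tail of the list passed to erlang:list_to_atom in the error message.
suffix = decode_codepoints([46, 98, 111, 111, 116])

# Length of the -boot argument from the script, for comparison with
# the <<2 bytes>> binary in the trace.
boot_name_length = len("di".encode("utf-8"))

if __name__ == "__main__":
    print(suffix)
    print(boot_name_length)
```

Printing the arguments the script receives at boot (for instance by logging "$@" to a file) would show whether this guess holds.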

Resources