I wrote a simple program to test page faults and tlb miss with perf.
The code is as follow. It writes 1 GB data sequentially and is expected
to trigger 1GB/4KB=256K tlb misses and page faults.
#include<stdio.h>
#include <stdlib.h>
#define STEP 64
#define LENGTH (1024*1024*1024)
int main(){
char* a = malloc(LENGTH);
int i;
for(i=0; i<LENGTH; i+=STEP){
a[i] = 'a';
}
return 0;
}
However, the result is as follow and far smaller than expected. Is perf so imprecise? I would be very appreciated if anyone can run the code on his machine.
$ perf stat -e dTLB-load-misses,page-faults ./a.out
Performance counter stats for './a.out':
12299 dTLB-load-misses
1070 page-faults
0.427970453 seconds time elapsed
Environment: Ubuntu 14.04.5 LTS , kernel 4.4.0; gcc 4.8.4 glibc 2.19. No compile flags.
The CPU is Intel(R) Xeon(R) CPU E5-2640 v2 # 2.00GHz.
The kernel prefetches pages on a fault, at least after it has evidence of a pattern. Can't find a definitive reference on the algorithm, but perhaps https://github.com/torvalds/linux/blob/master/mm/readahead.c is a starting point to seeing what is going on. I'd look for other performance counters that capture the behavior of this mechanism.
I have an ESP-12F module that I flashed with the current NodeMCU dev-branch firmware. The module is powered by a >2A power supply. I use 4 GPIO's to control the driver of a little stepper motor (this is the combo).
I wrote a little Lua script (partially based on the arduino version described here) in ESPlorer to control the motor, and the program does work, the motor turns accordingly, but it reboots the module when I call the function turn with too many steps. The limit is at around 180 steps, sometimes a little bit higher, sometimes a little bit below that number.
I'm really new to programming this kind of modules and I'm also just learning Lua, can anybody imagine what happens here and how I can avoid the reboots? BTW: I also tried supplying external 5 Volts to the driver board, but it did not change anything.
This is my script:
gpio.mode(5, gpio.OUTPUT)
gpio.mode(6, gpio.OUTPUT)
gpio.mode(7, gpio.OUTPUT)
gpio.mode(0, gpio.OUTPUT)
sg = function (n,v) gpio.write(n, (v == 0 and gpio.LOW or gpio.HIGH)) end
stepRight = function ()
sg(5,0);sg(6,0);sg(7,0);sg(0,1);
sg(5,0);sg(6,0);sg(7,1);sg(0,1);
sg(5,0);sg(6,0);sg(7,1);sg(0,0);
sg(5,0);sg(6,1);sg(7,1);sg(0,0);
sg(5,0);sg(6,1);sg(7,0);sg(0,0);
sg(5,1);sg(6,1);sg(7,0);sg(0,0);
sg(5,1);sg(6,0);sg(7,0);sg(0,0);
sg(5,1);sg(6,0);sg(7,0);sg(0,1);
sg(5,0);sg(6,0);sg(7,0);sg(0,0);
end
turn = function (dir, steps)
if dir == 'right' then
for i=0,steps,1 do
stepRight()
end
end
end
Here are some details about the module and the firmware:
NodeMCU custom build by frightanic.com
branch: dev
commit: c54bc05ba61fe55f0dccc1a1506791ba41f1d31b
SSL: true
modules: adc,cjson,crypto,dht,file,gpio,hmc5883l,http,i2c,l3g4200d,mqtt,net,node,ow,pwm,spi,tmr,tsl2561,uart,wifi
build built on: 2016-11-21 19:02
powered by Lua 5.1.4 on SDK 1.5.4.1(39cb9a32)
This is what it looks like when I call the turn function with a too high value:
turn('right',200)
ets Jan 8 2013,rst cause:2, boot mode:(3,7)
load 0x40100000, len 26144, room 16
tail 0
chksum 0x95
load 0x3ffe8000, len 2288, room 8
tail 8
chksum 0xa8
load 0x3ffe88f0, len 8, room 0
tail 8
chksum 0x66
csum 0x66
����o�r��n|�llll`��r�l�l��
NodeMCU custom build by frightanic.com
branch: dev
commit: c54bc05ba61fe55f0dccc1a1506791ba41f1d31b
SSL: true
modules: adc,cjson,crypto,dht,file,gpio,hmc5883l,http,i2c,l3g4200d,mqtt,net,node,ow,pwm,spi,tmr,tsl2561,uart,wifi
build built on: 2016-11-21 19:02
powered by Lua 5.1.4 on SDK 1.5.4.1(39cb9a32)
lua: cannot open init.lua
>
Update: I found a solution that works, but I can't explain why. Maybe someone can shed some light on this?
I thought that I had to approach the problem by finding out when and how the reboot occurs, so I added a little timer delay to the for loop:
for i=0,steps,1 do
stepRight()
tmr.delay(10)
end
This does not affect the speed of the motor in any noticable way, but now I can easily crank up the numbers as high as I want ;) I can use turn('right',200000) and the reboot is completely gone, it did not reoccur even once, even if I set the delay to only 1 µs. That's great - but I'd love to know why that helps?
You are calling the sg()7,200 times in a single turn function. You have to break your processing up to avoid time-outs. This is just the way that the ESP8266 SDK requires.
Read my FAQ in the documentation for a more detailed discussion.
I want to measure flinks performance with performance counters (perf). My code:
var text = env.readTextFile("<filename>")
var counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1)
counts.writeAsText("<filename_result>", WriteMode.OVERWRITE)
env.execute()
I know the PID of the jobmanager. Also I can see the TID of the Thread (CHAIN DataSource), that runs the execute()-command, during execution. But for each execution the TID changes, so it wont work with the TID. Is there a way to figure out the PID of the jobmanagers child process, that runs the execute()-command? And are there different child processes for every transformation (e.g. flatMap) of the rdd? If so, is it possible to find out their distinct PIDs?
The individual operators are not executed in distinct processes. The JobManager and the TaskManagers are started as Java processes. The TaskManager then runs a set of parallel tasks (corresponding to the operators). Each parallel task is executed in its own thread. When you start Flink, then the system will create files /tmp/your-name-taskmanager.pid and /tmp/your-name-jobmanager.pid which contain the PID of the processes.
this is my first pthread program, and I have no idea why the printf statement get printed twice in child thread:
int x = 1;
void *func(void *p)
{
x = x + 1;
printf("tid %ld: x is %d\n", pthread_self(), x);
return NULL;
}
int main(void)
{
pthread_t tid;
pthread_create(&tid, NULL, func, NULL);
printf("main thread: %ld\n", pthread_self());
func(NULL);
}
Observed output on my platform (Linux 3.2.0-32-generic #51-Ubuntu SMP x86_64 GNU/Linux):
1.
main thread: 140144423188224
tid 140144423188224: x is 2
2.
main thread: 140144423188224
tid 140144423188224: x is 3
3.
main thread: 139716926285568
tid 139716926285568: x is 2
tid 139716918028032: x is 3
tid 139716918028032: x is 3
4.
main thread: 139923881056000
tid 139923881056000: x is 3
tid 139923872798464tid 139923872798464: x is 2
for 3, two output lines from the child thread
for 4, the same as 3, and even the outputs are interleaved.
Threading generally occurs by time-division multiplexing. It is generally in-efficient for the processor to switch evenly between two threads, as this requires more effort and higher context switching. Typically what you'll find is a thread will execute several times before switching (as is the case with examples 3 and 4. The child thread executes more than once before it is finally terminated (because the main thread exited).
Example 2: I don't know why x is increased by the child thread while there is no output.
Consider this. Main thread executes. it calls the pthread and a new thread is created.The new child thread increments x. Before the child thread is able to complete the printf statement the main thread kicks in. All of a sudden it also increments x. The main thread is however also able to run the printf statement. Suddenly x is now equal to 3.
The main thread now terminates (also causing the child 3 to exit).
This is likely what happened in your case for example 2.
Examples 3 clearly shows that the variable x has been corrupted due to inefficient locking and stack data corruption!!
For more info on what a thread is.
Link 1 - Additional info about threading
Link 2 - Additional info about threading
Also what you'll find is that because you are using the global variable of x, access to this variable is shared amongst the threads. This is bad.. VERY VERY bad as threads accessing the same variable create race conditions and data corruption due to multiple read writes occurring on the same register for the variable x.
It is for this reason that mutexes are used which essentially create a lock whilst variables are being updated to prevent multiple threads attempting to modify the same variable at the same time.
Mutex locks will ensure that x is updated sequentially and not sporadically as in your case.
See this link for more about Pthreads in General and Mutex locking examples.
Pthreads and Mutex variables
Cheers,
Peter
Hmm. your example uses the same "resources" from different threads. One resource is the variable x, the other one is the stdout-file. So you should use mutexes as shown down here. Also a pthread_join at the end waits for the other thread to finish its job. (Usually a good idea would also be to check the return-codes of all these pthread... calls)
#include <pthread.h>
#include <stdio.h>
int x = 1;
pthread_mutex_t mutex;
void *func(void *p)
{
pthread_mutex_lock (&mutex);
x = x + 1;
printf("tid %ld: x is %d\n", pthread_self(), x);
pthread_mutex_unlock (&mutex);
return NULL;
}
int main(void)
{
pthread_mutex_init(&mutex, 0);
pthread_t tid;
pthread_create(&tid, NULL, func, NULL);
pthread_mutex_lock (&mutex);
printf("main thread: %ld\n", pthread_self());
pthread_mutex_unlock (&mutex);
func(NULL);
pthread_join (tid, 0);
}
It looks like the real answer is Michael Burr's comment which references this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=14697
In summary, glibc does not handle the stdio buffers correctly during program exit.
I am interested in bench-marking different parts of my program for speed. I having tried using info(statistics) and erlang:now()
I need to know down to the microsecond what the average speed is. I don't know why I am having trouble with a script I wrote.
It should be able to start anywhere and end anywhere. I ran into a problem when I tried starting it on a process that may be running up to four times in parallel.
Is there anyone who already has a solution to this issue?
EDIT:
Willing to give a bounty if someone can provide a script to do it. It needs to spawn though multiple process'. I cannot accept a function like timer.. at least in the implementations I have seen. IT only traverses one process and even then some major editing is necessary for a full test of a full program. Hope I made it clear enough.
Here's how to use eprof, likely the easiest solution for you:
First you need to start it, like most applications out there:
23> eprof:start().
{ok,<0.95.0>}
Eprof supports two profiling mode. You can call it and ask to profile a certain function, but we can't use that because other processes will mess everything up. We need to manually start it profiling and tell it when to stop (this is why you won't have an easy script, by the way).
24> eprof:start_profiling([self()]).
profiling
This tells eprof to profile everything that will be run and spawned from the shell. New processes will be included here. I will run some arbitrary multiprocessing function I have, which spawns about 4 processes communicating with each other for a few seconds:
25> trade_calls:main_ab().
Spawned Carl: <0.99.0>
Spawned Jim: <0.101.0>
<0.100.0>
Jim: asking user <0.99.0> for a trade
Carl: <0.101.0> asked for a trade negotiation
Carl: accepting negotiation
Jim: starting negotiation
... <snip> ...
We can now tell eprof to stop profiling once the function is done running.
26> eprof:stop_profiling().
profiling_stopped
And we want the logs. Eprof will print them to screen by default. You can ask it to also log to a file with eprof:log(File). Then you can tell it to analyze the results. We tell it to collapse the run time from all processes into a single table with the option total (see the manual for more options):
27> eprof:analyze(total).
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- --- ---- [----------]
io:o_request/3 46 0.00 0 [ 0.00]
io:columns/0 2 0.00 0 [ 0.00]
io:columns/1 2 0.00 0 [ 0.00]
io:format/1 4 0.00 0 [ 0.00]
io:format/2 46 0.00 0 [ 0.00]
io:request/2 48 0.00 0 [ 0.00]
...
erlang:atom_to_list/1 5 0.00 0 [ 0.00]
io:format/3 46 16.67 1000 [ 21.74]
erl_eval:bindings/1 4 16.67 1000 [ 250.00]
dict:store_bkt_val/3 400 16.67 1000 [ 2.50]
dict:store/3 114 50.00 3000 [ 26.32]
And you can see that most of the time (50%) is spent in dict:store/3. 16.67% is taken in outputting the result, another 16.67% is taken by erl_eval (this is why you get by running short functions in the shell -- parsing them becomes longer than running them).
You can then start going from there. That's the basics of profiling run times with Erlang. Handle with care, eprof can be quite a load on a production system or for functions that run for too long. Especially on a production system.
You can use eprof or fprof.
The normal way to do this is with timer:tc. Here is a good explanation.
I can recommend you this tool: https://github.com/virtan/eep
You will get something like this https://raw.github.com/virtan/eep/master/doc/sshot1.png as a result.
Step by step instruction for profiling all processes on running system:
On target system:
1> eep:start_file_tracing("file_name"), timer:sleep(20000), eep:stop_tracing().
$ scp -C $PWD/file_name.trace desktop:
On desktop:
1> eep:convert_tracing("file_name").
$ kcachegrind callgrind.out.file_name