I am trying to find out how to speed up a certain part of my code. I've got three float variables: var1, var2, var3.
in sequential mode ...
double start1, end1, t1;
start1= (double)cvGetTickCount();
var1= tester->predict(videocapture, params1, image);
var2= tester->predict(videocapture, params2, image);
var3= tester->predict(videocapture, params3, image);
end1= (double)cvGetTickCount();
t1= (end1-start1)/((double)cvGetTickFrequency()*1000.);
printf( "Time1 = %g ms\n", t1 );
it seems to be slightly faster than the version using parallel threads ...
double start2, end2, t2;
start2 = (double)cvGetTickCount();
omp_set_dynamic(0);      // explicitly disable dynamic teams
omp_set_num_threads(3);  // use 3 threads for all consecutive parallel regions
#pragma omp parallel num_threads(3)
{
    #pragma omp sections //nowait
    {
        #pragma omp section
        {
            #pragma omp critical
            {
                var1 = tester->predict(videocapture, params1, image);
            }
        }
        #pragma omp section
        {
            #pragma omp critical
            {
                var2 = tester->predict(videocapture, params2, image);
            }
        }
        #pragma omp section
        {
            #pragma omp critical
            {
                var3 = tester->predict(videocapture, params3, image);
            }
        }
    }
}
end2= (double)cvGetTickCount();
t2= (end2-start2)/((double)cvGetTickFrequency()*1000.);
printf( "Time2 = %g ms\n", t2 );
Can someone please help me speed up the computation of these three variables, and tell me what I am doing wrong?
Look at the specification of the critical pragma here.
There are three crucial points here:
1) If you don't give the critical pragma a name, it's mapped to some unspecified name, and the same name is used for all unnamed critical sections.
2) Only one thread at a time can be in any critical section with a given name.
3) When a thread encounters a critical section, it waits there until no other threads are active in it before moving into it.
Thus your code tells all your threads to wait while they all take turns executing one item of your code at a time. This is equivalent to doing the operations serially, except that it requires the overhead of generating the threads in the first place (and whatever overhead occurs due to the waiting, which might be busy-waiting).
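If predict is safe to call concurrently with distinct parameter sets (nothing in the snippet guarantees that, so treat it as an assumption), the fix is simply to drop the critical pragmas so the three sections can actually overlap. A minimal sketch:

#pragma omp parallel sections num_threads(3)
{
    #pragma omp section
    var1 = tester->predict(videocapture, params1, image);
    #pragma omp section
    var2 = tester->predict(videocapture, params2, image);
    #pragma omp section
    var3 = tester->predict(videocapture, params3, image);
}

Even then, three calls may be too little work to amortize the cost of spawning the thread team, so time both versions before committing to either.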
I am working on a project in which I have to store the data of an ADC stream on a µSD card. However, even with a 16-bit buffer, I lose data from the ADC stream. The ADC is used with DMA, and I use FatFs (without DMA) and the SDMMC1 peripheral to fill a .bin file with the data.
Do you have an idea how to avoid this loss?
Here is my project: https://github.com/mathieuchene/STM32H743ZI
I use a NUCLEO-H743ZI2 board, CubeIDE, and CubeMX in their latest versions.
EDIT 1
I tried to implement Colin's solution; it's better, but I get strange artifacts in the middle of my acquisition. Moreover, when I increase the maximal count value or try to debug, the HardFault_Handler fires. I modified the main.c file by creating two blocks (uint16_t blockX[BUFFERLENGTH/2]) and two flags for when adcBuffer is half filled or completely filled.
I also changed the while(1) part in the main function like this:
if (flagHlfCplt){
    //flagCplt=0;
    res = f_write(&SDFile, block1, strlen((char*)block1), (void *)&byteswritten);
    memcpy(block2, adcBuffer, BUFFERLENGTH/2);
    flagHlfCplt = 0;
    count++;
}
if (flagCplt){
    //flagHlfCplt=0;
    res = f_write(&SDFile, block2, strlen((char*)block2), (void *)&byteswritten);
    memcpy(block1, adcBuffer[(BUFFERLENGTH/2)-1], BUFFERLENGTH/2);
    flagCplt = 0;
    count++;
}
if (count == 10){
    f_close(&SDFile);
    HAL_ADC_Stop_DMA(&hadc1);
    while(1){
        HAL_GPIO_TogglePin(LD1_GPIO_Port, LD1_Pin);
        HAL_Delay(1000);
    }
}
}
EDIT 2
I modified my program. I set block1 and block2 to the length BUFFERLENGTH and added a pointer (*idx) to switch which buffer is being filled. I no longer get the HardFault_Handler, but I still lose some data from the ADC stream.
Here are the modifications I made:
// my pointer and buffers
uint16_t block1[BUFFERLENGTH], block2[BUFFERLENGTH], *idx;

// init of pointer and adc start
idx = block1;
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)idx, BUFFERLENGTH);

// while(1) part
while (1)
{
    if (flagCplt){
        if (flagToChangeBuffer) {
            idx = block1;
            res = f_write(&SDFile, block2, strlen((char*)block2), (void *)&byteswritten);
            flagCplt = 0;
            flagToChangeBuffer = 0;
            count++;
        }
        else {
            idx = block2;
            res = f_write(&SDFile, block1, strlen((char*)block1), (void *)&byteswritten);
            flagCplt = 0;
            flagToChangeBuffer = 1;
            count++;
        }
    }
    if (count == 150){
        f_close(&SDFile);
        HAL_ADC_Stop_DMA(&hadc1);
        while(1){
            HAL_GPIO_TogglePin(LD1_GPIO_Port, LD1_Pin);
            HAL_Delay(1000);
        }
    }
}
Does someone know how to solve this data loss?
Best Regards
Mathieu
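One detail worth noting in both versions above: f_write is given strlen((char*)blockX) as the byte count, but ADC samples are raw binary and can contain zero bytes, so strlen will stop at the first zero sample (or run past the buffer if there is none). A minimal sketch of the usual half/complete-callback double-buffer pattern with explicit byte counts, reusing names from the code above (the two flag names are assumptions):

// Assumes the ADC DMA runs in circular mode over adcBuffer, and that
// SDFile, byteswritten and res are declared as in the code above.
volatile uint8_t halfReady = 0, fullReady = 0;   // hypothetical flags

void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
{
    halfReady = 1;   // first half of adcBuffer is stable; DMA now fills second half
}

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
    fullReady = 1;   // second half is stable; DMA wrapped back to first half
}

// in the while(1) loop: always write a fixed number of bytes
if (halfReady) {
    halfReady = 0;
    res = f_write(&SDFile, (const void *)&adcBuffer[0],
                  (BUFFERLENGTH / 2) * sizeof(uint16_t), &byteswritten);
}
if (fullReady) {
    fullReady = 0;
    res = f_write(&SDFile, (const void *)&adcBuffer[BUFFERLENGTH / 2],
                  (BUFFERLENGTH / 2) * sizeof(uint16_t), &byteswritten);
}

If the card still cannot keep up, sizing each half so the f_write length is a multiple of the 512-byte sector size, and using a faster card, are the usual next steps.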
I have a problem with distributing tasks in OpenMP.
I have the following code:
#include <stdio.h>
#include <unistd.h>

int cnttotal = 0;
int cnt1 = 0, cnt2 = 0;

int main()
{
    int i;
    #pragma omp parallel
    #pragma omp single nowait
    for (i = 0; i < 60; i++) {
        if (cnttotal < 1) {
            cnttotal++;
            #pragma omp task
            {
                #pragma omp atomic
                cnt1++;
                usleep(10);
                cnttotal--;
            }
        } else {
            #pragma omp task
            {
                #pragma omp atomic
                cnt2++;
                sleep(1);
            }
        }
    }
    printf("cnt1 = %d; cnt2 = %d\n", cnt1, cnt2);
    return 0;
}
Whatever I do, cnt1 = 1 and cnt2 = 59. I think the problem is in the OpenMP scheduler.
Or is there something I don't catch?
My feeling is that you are confusing the instantiation of a task with its actual execution. The #pragma omp task construct refers to the instantiation of a task, and that is extremely fast. It is a separate matter that an idle thread of the OpenMP runtime then picks a ready task from the list and executes it.
Going into the problem you posted: in this code a running thread (say T1) enters the first iteration (i = 0), so it takes the first if branch, sets cnttotal to 1, and instantiates the first task (cnt1). After that instantiation, T1 keeps instantiating the remaining tasks while an idle thread (say T2) executes task cnt1, which takes approximately 10 µs and sets cnttotal back to 0.
So, in brief: the thread instantiating the tasks gets through all the remaining loop iterations in less time than the 10 µs that task cnt1 needs, so the else branch is taken for every later iteration.
For instance, on my Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz, if I change the code so that the loop runs until i = 500 and the first task sleeps 1 µs (usleep(1)), I get:
cnt1 = 2; cnt2 = 498
which shows that instantiation of tasks is extremely fast.
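For anyone who wants to reproduce that experiment, here is a minimal sketch of the modified loop; only the bound and the sleep differ from the code above, the counters and pragmas are unchanged:

// inside the same "#pragma omp single nowait" region as above
for (i = 0; i < 500; i++) {           // was i < 60
    if (cnttotal < 1) {
        cnttotal++;
        #pragma omp task
        {
            #pragma omp atomic
            cnt1++;
            usleep(1);                // was usleep(10)
            cnttotal--;
        }
    } else {
        #pragma omp task
        {
            #pragma omp atomic
            cnt2++;
            sleep(1);
        }
    }
}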
I read all I could find about memory management in the Tcl API, but haven't been able to solve my problem so far. I wrote a Tcl extension to access an existing application. It works, except for a serious issue: a memory leak.
I tried to reproduce the problem with minimal code, which you can find at the end of the post. The extension defines a new command, recordings, in namespace vtcl. The recordings command creates a list of 10000 elements, each element being a new command. Each command has data attached to it, which is the name of a recording. The name subcommand of each command returns the name of the recording.
I run the following Tcl code with tclsh to reproduce the problem:
load libvtcl.so
for {set ii 0} {$ii < 1000} {incr ii} {
    set recs [vtcl::recordings]
    foreach r $recs {rename $r ""}
}
The line foreach r $recs {rename $r ""} deletes all the commands at each iteration, which frees the memory of the piece of data attached to each command (I can see that in gdb). I can also see in gdb that the reference count of the recs variable goes to 0 at each iteration, so the contents of the list are freed. Nonetheless, I see the memory of the tclsh process going up at each iteration.
I have no more idea what else I could try. Help will be greatly appreciated.
#include <stdio.h>
#include <string.h>
#include <tcl.h>

static void DecrementRefCount(ClientData cd);
static int ListRecordingsCmd(ClientData cd, Tcl_Interp *interp, int objc,
                             Tcl_Obj *CONST objv[]);
static int RecordingCmd(ClientData cd, Tcl_Interp *interp, int objc,
                        Tcl_Obj *CONST objv[]);

static void
DecrementRefCount(ClientData cd)
{
    Tcl_Obj *obj = (Tcl_Obj *) cd;
    Tcl_DecrRefCount(obj);
    return;
}

static int
ListRecordingsCmd(ClientData cd, Tcl_Interp *interp, int objc,
                  Tcl_Obj *CONST objv[])
{
    char name_buf[20];
    Tcl_Obj *rec_list = Tcl_NewListObj(0, NULL);

    for (int ii = 0; ii < 10000; ii++)
    {
        static int obj_id = 0;
        Tcl_Obj *cmd;
        Tcl_Obj *rec_name;

        cmd = Tcl_NewStringObj("rec", -1);
        Tcl_AppendObjToObj(cmd, Tcl_NewIntObj(obj_id++));

        rec_name = Tcl_NewStringObj("DM", -1);
        snprintf(name_buf, sizeof(name_buf), "%04d", ii);
        Tcl_AppendStringsToObj(rec_name, name_buf, (char *) NULL);
        Tcl_IncrRefCount(rec_name);

        Tcl_CreateObjCommand(interp, Tcl_GetString(cmd), RecordingCmd,
                             (ClientData) rec_name, DecrementRefCount);
        Tcl_ListObjAppendElement(interp, rec_list, cmd);
    }

    Tcl_SetObjResult(interp, rec_list);
    return TCL_OK;
}

static int
RecordingCmd(ClientData cd, Tcl_Interp *interp, int objc, Tcl_Obj *CONST objv[])
{
    Tcl_Obj *rec_name = (Tcl_Obj *)cd;
    char *subcmd;

    subcmd = Tcl_GetString(objv[1]);
    if (strcmp(subcmd, "name") == 0)
    {
        Tcl_SetObjResult(interp, rec_name);
    }
    else
    {
        Tcl_Obj *result = Tcl_NewStringObj("", 0);
        Tcl_AppendStringsToObj(result,
                               "bad command \"",
                               Tcl_GetString(objv[1]),
                               "\"",
                               (char *) NULL);
        Tcl_SetObjResult(interp, result);
        return TCL_ERROR;
    }
    return TCL_OK;
}

int
Vtcl_Init(Tcl_Interp *interp)
{
#ifdef USE_TCL_STUBS
    if (Tcl_InitStubs(interp, "8.5", 0) == NULL) {
        return TCL_ERROR;
    }
#endif
    if (Tcl_PkgProvide(interp, "vtcl", "0.0.1") != TCL_OK)
        return TCL_ERROR;

    Tcl_CreateNamespace(interp, "vtcl", (ClientData) NULL,
                        (Tcl_NamespaceDeleteProc *) NULL);
    Tcl_CreateObjCommand(interp, "::vtcl::recordings", ListRecordingsCmd,
                         (ClientData) NULL, (Tcl_CmdDeleteProc *) NULL);
    return TCL_OK;
}
The management of the Tcl_Obj * reference counts looks absolutely correct, but I do wonder whether you're freeing all the other resources associated with a particular instance in your real code. It might also be something else entirely; your code is not the only thing in Tcl that allocates memory! Furthermore, the default memory allocator in Tcl does not actually return memory to the OS, but instead holds onto it until the process ends. Figuring out what is wrong can be tricky.
You can try doing a build of Tcl with the --enable-symbols=mem option passed to configure. That makes Tcl build in an extra command, memory, which allows more extensive checking of memory management behaviour (it also does things like ensure that memory is never written to after it is freed). It's not enabled by default because it has a substantial performance hit, but it could well help you track down what's going on. (The memory info subcommand is where to get started.)
You could also try adding -DPURIFY to the CFLAGS when building; it completely disables the Tcl memory allocator (so memory checking tools like Purify (commercial) and Electric Fence (OSS) can get accurate information, instead of getting very confused by Tcl's high-performance thread-aware allocator) and may allow you to figure out what is going on.
I found where the leak is. In function ListRecordingsCmd, I replaced line
Tcl_AppendObjToObj (cmd, Tcl_NewIntObj (obj_id++));
with
Tcl_Obj *obj = Tcl_NewIntObj (obj_id++);
Tcl_AppendObjToObj (cmd, obj);
Tcl_DecrRefCount(obj);
The memory allocated to store the object id was not released: Tcl_NewIntObj returns an object with a reference count of zero, and Tcl_AppendObjToObj only reads from its second argument without taking ownership, so the temporary object was never freed. With the explicit Tcl_DecrRefCount, the memory used by the tclsh process is now stable.
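The same idiom applies anywhere a freshly created Tcl_Obj is passed to a function that does not take ownership of it. A minimal sketch of the pattern (the canonical form pairs an increment with the decrement, though decrementing a zero-count object directly, as in the fix above, also frees it):

Tcl_Obj *cmd = Tcl_NewStringObj("rec", -1);   /* target object, as in the code above */
Tcl_Obj *tmp = Tcl_NewIntObj(42);             /* fresh object, refcount 0 */
Tcl_IncrRefCount(tmp);                        /* take ownership (refcount 1) */
Tcl_AppendObjToObj(cmd, tmp);                 /* consumer only reads tmp */
Tcl_DecrRefCount(tmp);                        /* release ownership; tmp is freed here */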
I have an array of size 3000 that contains 0s and 1s. I want to find the first position, starting from index 0, that has a 1 stored in it. The array is computed on the device (GPU), so I transfer it to the host each time and then compute the index sequentially on the host. In my program I want to do this computation repeatedly, 4000 or more times, and I want to reduce the time this process takes. Is there any other way to do this?
int main()
{
    for(int i = 0; i < 4000; i++)
    {
        cudaMemcpy(A, dev_A, sizeof(int)*3000, cudaMemcpyDeviceToHost);
        int k;
        for(k = 0; k < 3000; k++)
        {
            if(A[k] == 1)
            {
                break;
            }
        }
        printf("got k is %d", k);
    }
}
The complete code is like this:
#include"cuda.h"
#include
#define SIZE 2688
#define BLOCKS 14
#define THREADS 192
__global__ void kernel(int *A,int *d_pos)
{
int thread_id=threadIdx.x+blockIdx.x*blockDim.x;
while(thread_id<SIZE)
{
if(A[thread_id]==INT_MIN)
{
*d_pos=thread_id;
return;
}
thread_id+=1;
}
}
__global__ void kernel1(int *A,int *d_pos)
{
int thread_id=threadIdx.x+blockIdx.x*blockDim.x;
if(A[thread_id]==INT_MIN)
{
atomicMin(d_pos,thread_id);
}
}
int main()
{
int pos=INT_MAX,i;
int *d_pos;
int A[SIZE];
int *d_A;
for(i=0;i<SIZE;i++)
{
A[i]=78;
}
A[SIZE-1]=INT_MIN;
cudaMalloc((void**)&d_pos,sizeof(int));
cudaMemcpy(d_pos,&pos,sizeof(int),cudaMemcpyHostToDevice);
cudaMalloc((void**)&d_A,sizeof(int)*SIZE);
cudaMemcpy(d_A,A,sizeof(int)*SIZE,cudaMemcpyHostToDevice);
cudaEvent_t start_cp1,stop_cp1;
cudaEventCreate(&stop_cp1);
cudaEventCreate(&start_cp1);
cudaEventRecord(start_cp1,0);
kernel1<<<BLOCKS,THREADS>>>(d_A,d_pos);
cudaEventRecord(stop_cp1,0);
cudaEventSynchronize(stop_cp1);
float elapsedTime_cp1;
cudaEventElapsedTime(&elapsedTime_cp1,start_cp1,stop_cp1);
cudaEventDestroy(start_cp1);
cudaEventDestroy(stop_cp1);
printf("\nTime taken by kernel is %f\n",elapsedTime_cp1);
cudaDeviceSynchronize();
cudaEvent_t start_cp,stop_cp;
cudaEventCreate(&stop_cp);
cudaEventCreate(&start_cp);
cudaEventRecord(start_cp,0);
cudaMemcpy(A,d_A,sizeof(int)*SIZE,cudaMemcpyDeviceToHost);
cudaEventRecord(stop_cp,0);
cudaEventSynchronize(stop_cp);
float elapsedTime_cp;
cudaEventElapsedTime(&elapsedTime_cp,start_cp,stop_cp);
cudaEventDestroy(start_cp);
cudaEventDestroy(stop_cp);
printf("\ntime taken by copy of an array is %f\n",elapsedTime_cp);
cudaEvent_t start_cp2,stop_cp2;
cudaEventCreate(&stop_cp2);
cudaEventCreate(&start_cp2);
cudaEventRecord(start_cp2,0);
cudaMemcpy(&pos,d_pos,sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord(stop_cp2,0);
cudaEventSynchronize(stop_cp2);
float elapsedTime_cp2;
cudaEventElapsedTime(&elapsedTime_cp2,start_cp2,stop_cp2);
cudaEventDestroy(start_cp2);
cudaEventDestroy(stop_cp2);
printf("\ntime taken by copy of a variable is %f\n",elapsedTime_cp2);
cudaMemcpy(&pos,d_pos,sizeof(int),cudaMemcpyDeviceToHost);
printf("\nminimum index is %d\n",pos);
return 0;
}
How can I decrease the total time taken by this code? Is there any other option for better performance?
If you are running your kernel 4000 times on the GPU, it may help to use asynchronous execution via different streams. cudaMemcpyAsync is non-blocking for the host, so it may be quicker in the case where you are executing your kernel M times.
A quick introduction to streams and asynchronous execution:
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
Streams and concurrency:
http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
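As a rough sketch of that idea (not the poster's exact code; the two-stream split is an assumption, and in a real program each stream would work on its own buffers so the iterations are actually independent), the kernel launches and copies can be issued on separate streams with pinned host memory so they can overlap:

// Minimal sketch: issue kernel + copy on alternating streams.
// SIZE, BLOCKS, THREADS, kernel1, d_A and d_pos are as in the code above.
int *h_A;                                          // pinned host buffer
cudaHostAlloc((void**)&h_A, sizeof(int)*SIZE, cudaHostAllocDefault);

cudaStream_t streams[2];
for (int s = 0; s < 2; s++)
    cudaStreamCreate(&streams[s]);

for (int i = 0; i < 4000; i++)
{
    cudaStream_t st = streams[i % 2];
    kernel1<<<BLOCKS, THREADS, 0, st>>>(d_A, d_pos);
    // non-blocking for the host: the next iteration can be issued
    // while this copy is still in flight on the other stream
    cudaMemcpyAsync(h_A, d_A, sizeof(int)*SIZE, cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize();                           // wait for all streams

for (int s = 0; s < 2; s++)
    cudaStreamDestroy(streams[s]);
cudaFreeHost(h_A);

Note also that since kernel1 already writes the answer into d_pos, copying back that single int instead of the whole array moves far less data per iteration.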
Hope this can help...
I have some code below that has a slight bug that I don't know how to fix. Essentially, my high-priority ISR runs twice after the flag is set. It runs exactly twice, consistently. The subroutine should run only once, because the flag is set when the input on RB changes, and the routine runs twice after a single change to RB's input. The testing was conducted in MPLAB v8.6 using the workbook feature.
#include <p18f4550.h>
#include <stdio.h>

void init(void)
{
    RCONbits.IPEN = 1;      // enable interrupt priority levels
    INTCONbits.GIE = 1;     // enable global interrupts
    INTCONbits.PEIE = 1;    // enable peripheral interrupts
    INTCONbits.RBIF = 0;    // clear the RB port change flag
    INTCONbits.RBIE = 1;    // enable RB port change interrupts
    INTCON2bits.RBPU = 1;   // /RBPU is active-low: 1 disables the port B pull-ups
    INTCON2bits.RBIP = 1;   // RB interrupt is high priority
    PORTB = 0x00;
    TRISBbits.RB7 = 1;      // RB7 as input so changes on it raise the interrupt
}

#pragma code
#pragma interrupt high_isr
void high_isr(void)
{
    if(INTCONbits.RBIF == 1)
    {
        INTCONbits.RBIF = 0;
        //stuff
    }
}

#pragma code
#pragma code high_isr_entry = 0x08
void high_isr_entry(void)
{
    _asm goto high_isr _endasm
}

void main(void)
{
    init();
    while(1);
}
The RB7 interrupt flag is set based on a compare of the last latched value and the current state of the pin. In the datasheet it says "The pins are compared with the old value latched on the last read of PORTB. The 'mismatch' outputs of RB7:RB4 are ORed together to generate the RB Port Change Interrupt with Flag bit."
To clear the mismatch condition, the datasheet goes on to say: "Any read or write of PORTB (except with the MOVFF (ANY), PORTB instruction). This will end the mismatch condition."
You then wait one Tcy (execute a nop instruction) and then clear the flag. This is documented on page 116 of the datasheet.
So the simplest fix in this scenario is to declare a dummy variable in the interrupt routine and read RB7 into it, like this:
#pragma interrupt high_isr
void high_isr(void)
{
    unsigned short dummy;

    if(INTCONbits.RBIF == 1)
    {
        dummy = PORTBbits.RB7;  // read PORTB to end the mismatch condition
        Nop();                  // wait one Tcy before clearing the flag
        INTCONbits.RBIF = 0;
        // rest of your routine here
        // ...
    }
}