In C/C++, coroutines are usually implemented with a stack-switching hack, so the stack size is limited and doesn't grow automatically.
Does D's Fiber have these limitations, or does its stack grow automatically?
I tried with a 4K initial fiber stack size, and the D Fiber crashed when the first run overflowed 4K. However, once it had yielded, it could keep over 8K of stack array variables in subroutines, so it seems to grow the stack at each yield. So it's not simply safe; the programmer still needs to care about stack size.
In addition, D crashed on any printf regardless of stack size… I don't know why…
Here's my test code.
import std.stdio;
import std.concurrency;
import core.thread;
import std.container;
import std.conv;

void main()
{
    Fiber[] fs = new Fiber[10];
    foreach (int i; 0..cast(int)fs.length)
    {
        fs[i] = new F1(i);
    }
    foreach (ref Fiber f; fs)
    {
        f.call();
    }
    foreach (ref Fiber f; fs)
    {
        f.call();
    }
    foreach (ref Fiber f; fs)
    {
        auto s = f.state;
        writeln(s);
    }
}

class F1 : Fiber
{
    this(int idx)
    {
        super(&run, 4096);
        _idx = idx;
    }

private:
    int _idx;

    void run()
    {
        byte[3700] test1;
        //byte[1024] test2;
        writeln(_idx);
        //t1();
        this.yield();
        t1();
        //byte[1024] test3;
        //byte[1024] test4;
        writeln(_idx);
    }

    void t1()
    {
        byte[4096] test;
        //printf("T1!");
        t2();
    }

    void t2()
    {
        byte[2048] test;
        //printf("T2!");
        //t3();
    }

    void t3()
    {
        byte[2048] test;
        printf("T3!");
    }
}
Current result.
0
1
2
3
4
5
6
7
8
9
0
./build.bash: line 11: 26460 Segmentation fault: 11 ./tester
Related
I am trying to create a binding between a C library and an OCaml program. I have encountered a problem when interfacing with the GC.
I made a small program to duplicate my problem.
The objective is to pass a custom_block, allocated in the C program and containing a pointer to a C structure, to the main program in OCaml.
Then, I am trying to use it (just printing a value in the example) before cleaning (I force a call to the GC).
In the main program below, in OCaml, I can comment out either the line "my_print_block" or the line "Gc.compact ()" and everything works fine. The address of the pointer is correct, I can print the value, and the destructor is called to free the C-allocated pointer.
But when the two are activated, I get a segmentation fault.
Main.ml
type ptr

external create_block: String.t -> ptr = "create_block"
external print_block: ptr -> unit = "print_block"

let my_print_block x : unit =
  print_block x;
  ()

let main () =
  let z = create_block "2.4" in
  let _ = my_print_block z in
  let () = Gc.compact () in
  ()

let _ = main ()
Interface.c
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/alloc.h>
#include <caml/custom.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct foo
{
    float x;
};

void local_destroy(value v)
{
    struct foo* p = *((struct foo**)Data_custom_val(v));
    printf( "freeing p now (%p)\n with *p=%f \n", p, p->x );
    fflush(stdout);
    free(p);
}

static struct custom_operations ops = {
    "ufpa custom_operations",
    local_destroy,
    custom_compare_default,     //default function, should not be used
    custom_hash_default,        //default function, should not be used
    custom_serialize_default,   //default function, should not be used
    custom_deserialize_default, //default function, should not be used
    custom_compare_ext_default  //default function, should not be used
};

void print_block(value type_str)
{
    CAMLparam1(type_str);
    struct foo* p = *((struct foo**)Data_custom_val(type_str));
    printf("value float = %f\n", p->x);
}

CAMLprim value create_block(value type_str)
{
    CAMLparam1(type_str);
    //retrieving str and creating a float value
    char* fval = String_val(type_str);
    float val = atof(fval);
    //creating and allocating a custom_block
    CAMLlocal1(res);
    res = alloc_custom(&ops, sizeof(struct foo*), 10, 100);
    //creating and allocating a struct pointer
    struct foo* ptr = malloc(sizeof(struct foo));
    printf("allocating : %p\n", ptr);
    ptr->x = val;
    //copying the pointer itself in the custom block
    memcpy(Data_custom_val(res), &ptr, sizeof(struct foo*));
    CAMLreturn(res);
}
Makefile
main.native: interface.c main.ml
	rm -rf _build
	rm -f main.native main.byte
	ocamlbuild -cflags -g interface.o
	ocamlbuild -lflag -custom -cflags -g -lflags -g main.byte -lflags interface.o
	#ocamlbuild -cflags -g -lflags -g main.native -lflags interface.o
With ocamldebug, the program seems to crash on my_print_block but I wasn't able to extract more sense from the trace.
With gdb, the error is located in the GC:
#0 0x000000000040433d in caml_oldify_one ()
#1 0x0000000000406060 in caml_oldify_local_roots ()
#2 0x000000000040470f in caml_empty_minor_heap ()
#3 0x00000000004141ca in caml_gc_compaction ()
#4 0x000000000041bfd0 in caml_interprete ()
#5 0x000000000041df48 in caml_main ()
#6 0x000000000040234c in main ()
I have seen several examples and I have read the documentation about C bindings at https://caml.inria.fr/pub/docs/manual-ocaml/intfc.html but I couldn't figure out what I am doing wrong. I am using OCaml version 4.04.0+flambda.
Thank you for your assistance
Your print_block function uses CAMLparam1(), so it should return with CAMLreturn0. I'm not sure this is your problem; it's just something I noticed. But it might be the problem.
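A minimal sketch of what that change looks like, reusing the struct foo and headers from your Interface.c; the only new line is the CAMLreturn0 at the end (I can't promise this alone fixes the crash):
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/custom.h>
#include <stdio.h>

struct foo
{
    float x;
};

/* Same stub as in the question, but the CAMLparam1 frame is now
 * popped with CAMLreturn0 before the function returns. */
void print_block(value type_str)
{
    CAMLparam1(type_str);
    struct foo* p = *((struct foo**)Data_custom_val(type_str));
    printf("value float = %f\n", p->x);
    CAMLreturn0;
}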
I have an array of size 3000 containing 0s and 1s. I want to find the first array position, starting from index 0, that has a 1 stored in it. The array is computed on the device, so I transfer it to the host and then compute the index sequentially on the host. In my program I want to do this computation 4000 or more times, and I want to reduce the time this process takes. Is there another way to do this? The array is actually computed on the GPU, so I have to transfer it each time.
int main()
{
    for (int i = 0; i < 4000; i++)
    {
        cudaMemcpy(A, dev_A, sizeof(int)*3000, cudaMemcpyDeviceToHost);
        int k;
        for (k = 0; k < 3000; k++)
        {
            if (A[k] == 1)
            {
                break;
            }
        }
        printf("got k is %d", k);
    }
}
The complete code is like this:
#include"cuda.h"
#include
#define SIZE 2688
#define BLOCKS 14
#define THREADS 192
__global__ void kernel(int *A,int *d_pos)
{
int thread_id=threadIdx.x+blockIdx.x*blockDim.x;
while(thread_id<SIZE)
{
if(A[thread_id]==INT_MIN)
{
*d_pos=thread_id;
return;
}
thread_id+=1;
}
}
__global__ void kernel1(int *A,int *d_pos)
{
int thread_id=threadIdx.x+blockIdx.x*blockDim.x;
if(A[thread_id]==INT_MIN)
{
atomicMin(d_pos,thread_id);
}
}
int main()
{
int pos=INT_MAX,i;
int *d_pos;
int A[SIZE];
int *d_A;
for(i=0;i<SIZE;i++)
{
A[i]=78;
}
A[SIZE-1]=INT_MIN;
cudaMalloc((void**)&d_pos,sizeof(int));
cudaMemcpy(d_pos,&pos,sizeof(int),cudaMemcpyHostToDevice);
cudaMalloc((void**)&d_A,sizeof(int)*SIZE);
cudaMemcpy(d_A,A,sizeof(int)*SIZE,cudaMemcpyHostToDevice);
cudaEvent_t start_cp1,stop_cp1;
cudaEventCreate(&stop_cp1);
cudaEventCreate(&start_cp1);
cudaEventRecord(start_cp1,0);
kernel1<<<BLOCKS,THREADS>>>(d_A,d_pos);
cudaEventRecord(stop_cp1,0);
cudaEventSynchronize(stop_cp1);
float elapsedTime_cp1;
cudaEventElapsedTime(&elapsedTime_cp1,start_cp1,stop_cp1);
cudaEventDestroy(start_cp1);
cudaEventDestroy(stop_cp1);
printf("\nTime taken by kernel is %f\n",elapsedTime_cp1);
cudaDeviceSynchronize();
cudaEvent_t start_cp,stop_cp;
cudaEventCreate(&stop_cp);
cudaEventCreate(&start_cp);
cudaEventRecord(start_cp,0);
cudaMemcpy(A,d_A,sizeof(int)*SIZE,cudaMemcpyDeviceToHost);
cudaEventRecord(stop_cp,0);
cudaEventSynchronize(stop_cp);
float elapsedTime_cp;
cudaEventElapsedTime(&elapsedTime_cp,start_cp,stop_cp);
cudaEventDestroy(start_cp);
cudaEventDestroy(stop_cp);
printf("\ntime taken by copy of an array is %f\n",elapsedTime_cp);
cudaEvent_t start_cp2,stop_cp2;
cudaEventCreate(&stop_cp2);
cudaEventCreate(&start_cp2);
cudaEventRecord(start_cp2,0);
cudaMemcpy(&pos,d_pos,sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord(stop_cp2,0);
cudaEventSynchronize(stop_cp2);
float elapsedTime_cp2;
cudaEventElapsedTime(&elapsedTime_cp2,start_cp2,stop_cp2);
cudaEventDestroy(start_cp2);
cudaEventDestroy(stop_cp2);
printf("\ntime taken by copy of a variable is %f\n",elapsedTime_cp2);
cudaMemcpy(&pos,d_pos,sizeof(int),cudaMemcpyDeviceToHost);
printf("\nminimum index is %d\n",pos);
return 0;
}
How can I decrease the total time taken by this code, or is there any other option for better performance?
If you are running your kernel 4000 times on the GPU, you might need to use asynchronous execution of the kernel via different streams. It might be quicker to use cudaMemcpyAsync, which is a non-blocking function for the host (in the case where you are executing your kernel M times).
A quick introduction to stream and asynchronous execution:
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
Streams and concurrency:
http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Hope this can help...
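A rough host-side sketch of that pattern, written against the CUDA runtime's C API; it assumes a pinned host buffer allocated with cudaMallocHost (required for copies to really be asynchronous), and the kernel launch on the same stream is only indicated in a comment:
#include <cuda_runtime.h>
#include <stdio.h>

#define SIZE 2688

int main(void)
{
    int *h_A;        /* pinned host buffer */
    int *d_A;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void**)&h_A, sizeof(int) * SIZE);  /* pinned memory, needed for real async copies */
    cudaMalloc((void**)&d_A, sizeof(int) * SIZE);

    for (int iter = 0; iter < 4000; iter++)
    {
        /* kernel1<<<BLOCKS, THREADS, 0, stream>>>(d_A, d_pos); would be queued here */

        /* non-blocking for the host: the copy is only queued on the stream */
        cudaMemcpyAsync(h_A, d_A, sizeof(int) * SIZE, cudaMemcpyDeviceToHost, stream);

        /* ...independent host work can overlap with the copy here... */

        cudaStreamSynchronize(stream);  /* block only when the result is actually needed */
    }

    cudaFreeHost(h_A);
    cudaFree(d_A);
    cudaStreamDestroy(stream);
    return 0;
}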
The Dart code below is extremely slow compared to Java's implementation.
//test.dart
import 'dart:io';

void main() {
  for (int i = 0; i < 1 << 25; i++) {
    stdout.write(i); // or directly print(i);
  }
  stdout.close();
}
Java version:
//Test.java
import java.io.*;

public class Test {
    public static void main(String[] args) throws Exception {
        try {
            PrintWriter out = new PrintWriter(System.out);
            for (int i = 0; i < 1 << 25; i++) {
                out.print(i);
            }
            out.close();
        } catch (Exception e) {}
    }
}
$ time java Test > /dev/null
real 0m6.421s
user 0m0.046s
sys 0m0.031s
$ time dart Test.dart > /dev/null
real 0m51.978s
user 0m0.015s
sys 0m0.078s
Is stdout/print() unbuffered by default in Dart? Is there something like Java's PrintWriter? Thanks. (Update: after warming up the VM, stdout is about 2x slower than Java.)
real 0m15.497s
user 0m0.046s
sys 0m0.047s
===============================================================================
Update Sep 30, 2013
I have implemented custom buffers for both the Dart and the Java code to make a further comparison; now the result is as follows:
//test.dart
final int SIZE = 8192;
final int NUM = 1 << 25;

void main() {
  List<int> content = new List(SIZE);
  content.fillRange(0, SIZE, 0);
  for (int i = 0; i < NUM; i++) {
    if (i % SIZE == 0 && i > 0)
      print(content);
    content[i % SIZE] = i;
  }
  if (NUM % SIZE == 0)
    print(content);
  else
    print(content.sublist(0, NUM % SIZE));
}
Java version:
//Test.java
import java.util.Arrays;

public class Test {
    public static final int SIZE = 8192;
    public static final int NUM = 1 << 25;

    public static void main(String[] args) throws Exception {
        try {
            int[] buf = new int[SIZE];
            for (int i = 0; i < NUM; i++) {
                if (i % SIZE == 0 && i > 0)
                    System.out.print(Arrays.toString(buf));
                buf[i % SIZE] = i;
            }
            if (NUM % SIZE == 0)
                System.out.print(Arrays.toString(buf));
            else {
                int[] newbuf = new int[NUM % SIZE];
                newbuf = Arrays.copyOfRange(buf, 0, (NUM % SIZE));
                System.out.print(Arrays.toString(newbuf));
            }
        } catch (Exception e) {}
    }
}
$ time java Test > /dev/null
real 0m7.397s
user 0m0.031s
sys 0m0.031s
$ time dart test.dart > /dev/null
real 0m22.406s
user 0m0.015s
sys 0m0.062s
As you can see, Dart is still 3x slower than Java.
Maybe your code does not get optimized by the VM.
Only "frequently" used functions are compiled and executed as native code.
Usually for such a benchmark, you have to put the tested code into a function and perform a warm-up. For example:
//test.dart
import 'dart:io';

void f(nb_shift) {
  for (int i = 0; i < 1 << nb_shift; i++) {
    stdout.write(i); // or directly print(i);
  }
}

void main() {
  // warm up:
  f(3);
  // the test
  f(25);
  stdout.close();
}
Nicolas
The malloc at Line A will consume more memory than the one at Line B.
Why? Is it related to pthreads?
int main()
{
    char *buf = (char*)malloc(1024*1024*1024); //Line A
    memset(buf, 0, sizeof(1024*1024*1024));
    pthread_t m_sockThreadHandle[8];
    for (int i = 0; i < 8; i++)
    {
        if (pthread_create(&m_sockThreadHandle[i], NULL, thread_run, NULL) != 0)
        {
            perror("pthread_create");
        }
    }
    sleep(10);
    char *buf = (char*)malloc(1024*1024*1024); //Line B
    memset(buf, 0, sizeof(1024*1024*1024));
    for (int i = 0; i < 8; i++)
    {
        pthread_join(m_sockThreadHandle[i], NULL);
    }
}
Possibly because this isn't doing what you thought it was:
memset(buf,0,sizeof(1024*1024*1024));
sizeof(1024*1024*1024) is 4 on my compiler. I think you meant:
memset(buf,0, 1024*1024*1024);
From the code you posted, buf is unused, so it's not clear what you're trying to do, or why. But this at least is wrong.
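A tiny standalone illustration of why: sizeof applied to an integer expression yields the size of its type (int here), not the value of the expression.
#include <stdio.h>

int main(void)
{
    /* sizeof looks at the type of the expression, not its value */
    printf("%zu\n", sizeof(1024*1024*1024));  /* typically prints 4, i.e. sizeof(int) */
    printf("%d\n", 1024*1024*1024);           /* 1073741824 -- the byte count that was intended */
    return 0;
}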
I have a short program that generates a linked list by adding nodes to it, then frees the memory allocated by the linked list.
Valgrind does not report any memory leak errors, but the process continues to hold the allocated memory.
I was only able to fix the error after I changed the amount of memory allocated from sizeof(structure_name) to the fixed number 512 (see the commented-out code).
Is this a bug or normal operation?
Here is the code:
#include <execinfo.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct llist_node {
    int ibody;
    struct llist_node * next;
    struct llist_node * previous;
    struct llist * list;
} llist_node;

typedef struct llist {
    struct llist_node * head;
    struct llist_node * tail;
    int id;
    int count;
} llist;

llist_node * new_lnode (void) {
    llist_node * nnode = (llist_node *) malloc ( 512 );
    // llist_node * nnode = (llist_node *) malloc ( sizeof(llist_node) );
    nnode->next = NULL;
    nnode->previous = NULL;
    nnode->list = NULL;
    return nnode;
}

llist * new_llist (void) {
    llist * nlist = (llist *) malloc ( 512 );
    // llist * nlist = (llist *) malloc ( sizeof(llist) );
    nlist->head = NULL;
    nlist->tail = NULL;
    nlist->count = 0;
    return nlist;
}

void add_int_tail ( int ibody, llist * list ) {
    llist_node * nnode = new_lnode();
    nnode->ibody = ibody;
    list->count++;
    nnode->next = NULL;
    if ( list->head == NULL ) {
        list->head = nnode;
        list->tail = nnode;
    }
    else {
        nnode->previous = list->tail;
        list->tail->next = nnode;
        list->tail = nnode;
    }
}

void destroy_list_nodes ( llist_node * nodes ) {
    llist_node * llnp = NULL;
    llist_node * llnpnext = NULL;
    llist_node * llnp2 = NULL;
    if ( nodes == NULL )
        return;
    for ( llnp = nodes; llnp != NULL; llnp = llnpnext ) {
        llnpnext = llnp->next;
        free (llnp);
    }
    return;
}

void destroy_list ( llist * list ) {
    destroy_list_nodes ( list->head );
    free (list);
}

int main () {
    int i = 0;
    int j = 0;
    llist * list = new_llist ();
    for ( i = 0; i < 100; i++ ) {
        for ( j = 0; j < 100; j++ ) {
            add_int_tail ( i+j, list );
        }
    }
    printf("enter to continue and free memory...");
    getchar();
    destroy_list ( list );
    printf("memory freed. enter to exit...");
    getchar();
    printf( "\n");
    return 0;
}
If by "the process continues to hold the allocated memory" you mean that ps doesn't report a decrease in the process's memory usage, that's perfectly normal. Returning memory to your process's heap doesn't necessarily make the process return it to the operating system, for all sorts of reasons. If you create and destroy your list over and over again, in a big loop, and the memory usage of your process doesn't grow without limit, then you probably haven't got a real memory leak.
[EDITED to add: See also Will malloc implementations return free-ed memory back to the system? ]
[EDITED again to add: Incidentally, the most likely reason why allocating 512-byte blocks makes the problem go away is that your malloc implementation treats larger blocks specially in some way that makes it easier for it to notice when there are whole pages that are no longer being used -- which is necessary if it's going to return any memory to the OS.]
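For what it's worth, glibc has a malloc_trim() extension that asks the allocator to hand free pages back to the OS, which is one way to see the RSS actually drop after freeing. A small sketch, glibc-specific and purely for observing the effect with ps/top, not something a program should normally need:
#include <malloc.h>   /* malloc_trim is a glibc extension */
#include <stdlib.h>
#include <stdio.h>

#define N 100000

int main(void)
{
    static void *blocks[N];

    for (int i = 0; i < N; i++)
        blocks[i] = malloc(64);
    for (int i = 0; i < N; i++)
        free(blocks[i]);

    /* At this point the freed memory usually still belongs to the process heap. */
    malloc_trim(0);   /* ask glibc to return free pages at the top of the heap to the OS */

    printf("check the process's RSS with ps/top now, then press enter...");
    getchar();
    return 0;
}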
I discovered the answer to my question here:
http://linuxupc.upc.es/~pep/OLD/man/malloc.html
The memory acquired by expanding the heap can be returned to the kernel if the conditions configured by __noshrink are satisfied. Only then will ps notice that the memory has been freed.
It is sometimes important to configure this, particularly when the memory usage is small but the heap size is bigger than the available main memory; otherwise the program thrashes even though the memory it actually needs is less than the available main memory.