memset() behaving undesirably - memset

I am using memset function in C and having a problem. Here is my problem:
char* tail;
tail = //some memory address
int pbytes = 5;
When I call memset like:
**memset(tail+pbytes, 0 , 8); // It gives no error**
When I call memset like:
**memset(tail+pbytes, 0 , 9); // It goes into infinite loop**
When I call memset like:
**memset(tail+pbytes, 0 , 10); // last parameter (10 or above). It gives Segmentation fault**
What can be the reason of this? The program runs and gives output as desired but it gives segmentation fault in the end. I am using Linux 64 virtual machine.
Any help would be appreciated.
OK. Let me clarify more with what i am doing. I am making 128 bytes (0-127 in array) data. I write 0(NULL) from byte 112 to 119 (it goes well) but when I try to write 0 on 120th byte and run the program, it goes into infinite loop. If I write 1,2,4,6 at 120th byte, program runs well. If I write other numbers at 120th byte, program gives segmentation fault. Basically there is something wrong with bytes from 120 to 127.

This is nothing wrong with memset. It's something wrong with how you defined your pointer variable tail.
If you simply wrote
char tail[128];
memset(tail+5, 0, 9);
of course it would work fine. Your problem is that you're not doing anything that simple and correct; you're doing something obscure and incorrect, such as
char tail[1];
memset(tail+5, 0, 9);
or
void foo(int x) {
char *tail = &x;
memset(tail+5, 0, 9);
}
To paraphrase Charles Babbage: When you put wrong code into the machine, wrong answers come out.
The segfault is probably because you're trying to write to a virtual address that has not yet been allocated. The infinite loop might be because you're overwriting some part of the memset's return address, so that it returns to the wrong place (such as into the middle of an infinite loop) instead of returning to the place it was called from.

Related

Wrong semaphor in case of opencl usage

Solution:
Finally I could solve or at least to find a good workaround for my problem.
This kind of semaphore doesn't work in case of NVIDIA.
I think this comment is right.
So I decided to use atomic_add() which is mandatory part of the OpenCL 1.1.
I have a resultBuffer array and resultBufferSize global variable and the last one is set to zero.
When I have results (my result is always!! x numbers) than I simple call
position = atomic_add(resultBufferSize, x);
and I can be sure no one writes between position and position + x into the buffer.
Don't forget the global variable must be volatile.
When the threads run into endless loops the resource is not available and therefore the -5 error code during the buffer reading.
Update:
When I read back:
oclErr |= clEnqueueReadBuffer(cqCommandQueue, cm_inputNodesArraySizes, CL_TRUE, 0, lastMapCounter*sizeof(cl_uint), (void*)&inputNodesArraySizes, 0, NULL, NULL);
The value of the lastMapCounter changes. It's strange because in the ocl code I do nothing and I take care of sizes: what I wrote into the buffer creation and what I copy I read the same back. And a hidden bufferoverflow can cause many stange things indeed.
End of update
I did the following code and there is a bug in it. I want a semaphore to change the resultBufferSize global variable (now I just want to try it how it works) and get back a big number (it is supposed that each worker write something). But I get always 3 or sometimes errors. There is no logic how the compiler works.
__kernel void findCircles(__global uint *inputNodesArray, __global
uint*inputNodesArraySizes, uint lastMapCounter,
__global uint *resultBuffer,
__global uint *resultBufferSize, volatile __global uint *sem)
{
for(;atom_xchg(sem, 1) > 0;)
(*resultBufferSize) = (*resultBufferSize) + 3;
atom_xchg(sem, 0);
}
I got -48 during the kernel execution and sometimes it's OK and I got -5 when I want to read back the buffer (the size buffer).
Do you have any idea where I can find the bug?
NVIDIA opencl 1.1 which is used.
Of course on the host I configure everything well:
uint32 resultBufferSize = 0;
uint32 sem;
cl_mem cmresultBufferSize = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,
sizeof(uint32), NULL, &ciErrNum);
cl_mem cmsem = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, sizeof(uint32), NULL,
&ciErrNum);
ciErrNum = clSetKernelArg(ckKernel, 4, sizeof(cl_mem), (void*)&cmresultBufferSize);
ciErrNum = clSetKernelArg(ckKernel, 5, sizeof(cl_mem), (void*)&cmsem);
ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL,
&szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmresultBufferSize, CL_TRUE, 0,
sizeof(uint32), (void*)&resultBufferSize, 0, NULL, NULL);
(in case of this code the kernel is OK and the last reading is return -5)
I know you have come to a conclusion on this, but I want to point out two things:
1) The semaphore is non-portable because it isn't SIMD safe, as pointed out in the linked thread.
2) The memory model is not strong enough to give a meaning to the code. The update of the result buffer could move out of the critical section - nothing in the model says otherwise. At the very least you'd need fences, but the language around fences in the 1.x specs is also fairly weak. You'd need an OpenCL 2.0 implementation to be confident that this aspect is safe.

Buffer Overflow Not Overflowing Return Address

Below is the C code
#include <stdio.h>
void read_input()
{
char input[512];
int c = 0;
while (read(0, input + c++,1) == 1);
}
int main ()
{
read_input();
printf("Done !\n");
return 0;
}
In the above code, there should be a buffer overflow of the array 'input'. The file we give it will have over 600 characters in it, all 2's ( ex. 2222222...) (btw, ascii of 2 is 32). However, when executing the code with the file, no segmentation fault is thrown, meaning program counter register was unchanged. Below is the screenshot of the memory of input array in gdb, highlighted is the address of the ebp (program counter) register, and its clear that it was skipped when writing:
LINK
The writing of the characters continues after the program counter, which is maybe why segmentation fault is not shown. Please explain why this is happening, and how to cause the program counter to overflow.
This is tricky! Both input[] and c are in stack, with c following the 512 bytes of input[]. Before you read the 513th byte, c=0x00000201 (513). But since input[] is over you are reading 0x32 (50) onto c that after reading is c=0x00000232 (562): in fact this is little endian and the least significative byte comes first in memory (if this was a big endian architecture it was c=0x32000201 - and it was going to segfault mostly for sure).
So you are actually jumping 562 - 513 = 49 bytes ahead. Than there is the ++ and they are 50. In fact you have exactly 50 bytes not overwritten with 0x32 (again... 0x3232ab64 is little endian. If you display memory as bytes instead of dwords you will see 0x64 0xab 0x32 0x32).
So you are writing in not assigned stack area. It doesn't segfault because it's in the process legal space (up to the imposed limit), and is not overwriting any vital information.
Nice example of how things can go horribly wrong without exploding! Is this a real life example or an assignment?
Ah yes... for the second question, try declaring c before input[], or c as static... in order not to overwrite it.

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels , check if some value is the one expected (memcpy to the host), if it is I stop, if it isn't I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this. But that printf never gets printed, that syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, i just removed it for now. This "added" variable isn't the smartest thing to do but it's faster than the alternative, which is checking if any variable within an array is some value on the host and deciding to keep iterating or not.
The getElement, simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right on the beginning, and it doesn't get printed either and after the printf it sets a variable on the device memory, and this memory is copied to the CPU after the kernel finished, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances, when I launch 8 blocks and 512 threads, it runs OK. 32 blocks, 512 threads, OK. But 8 blocks and 1024 threads, and this happens, the kernel doesn't work, neither 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. Still doesnt work, nothing gets printed, even if the printf is right after the three initial lines and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with 512 maximum or check on how higher I can put this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't due to the problem mentioned above, so the output never shows up. You could have figured out this by testing the return codes all CUDA function calls for errors.

The art of exploitation - exploit_notesearch.c

i've got a question regarding the exploit_notesearch program.
This program is only used to create a command string we finally call with the system() function to exploit the notesearch program that contains a buffer overflow vulnerability.
The commandstr looks like this:
./notesearch Nop-block|shellcode|repeated ret(will jump in nop block).
Now the actual question:
The ret-adress is calculated in the exploit_notesearch program by the line:
ret = (unsigned int) &i-offset;
So why can we use the address of the i-variable that is quite at the bottom of the main-stackframe of the exploit_notesearch program to calculate the ret address that will be saved in an overflowing buffer in the notesearch program itself ,so in an completely different stackframe, and has to contain an address in the nop block(which is in the same buffer).
that will be saved in an overflowing buffer in the notesearch program itself ,so in an completely different stackframe
As long as the system uses virtual memory, another process will be created by system() for the vulnerable program, and assuming that there is no stack randomization,
both processes will have almost identical values of esp (as well as offset) when their main() functions will start, given that the exploit was compiled on the attacked machine (i.e. with vulnerable notesearch).
The address of variable i was chosen just to give an idea about where the frame base is. We could use this instead:
unsigned long sp(void) // This is just a little function
{ __asm__("movl %esp, %eax");} // used to return the stack pointer
int main(){
esp = sp();
ret = esp - offset;
//the rest part of main()
}
Because the variable i will be located on relatively constant distance from esp, we can use &i instead of esp, it doesn't matter much.
It would be much more difficult to get an approximate value for ret if the system did not use virtual memory.
the stack is allocated in a way as first in last out approach. The location of i variable is somewhere on the top and lets assume that it is 0x200, and the return address is located in a lower address 0x180 so in order to determine the where about to put the return address and yet to leave some space for the shellcode, the attacker must get the difference, which is: 0x200 - 0x180 = 0x80 (128), so he will break that down as follows, ++, the return address is 4 bytes so, we have only 48 bytes we left before reaching the segmentation. that is how it is calculated and the location i give approximate reference point.

How do I transfer an integer to __constant__ device memory?

I have a weird problem, so I thought I would ask and see if someone more experienced than me could see a solution.
I am writing a program with CUDA C/C++, and I have some constant integers that specify various things, like coordinates of the bounds of the calculation, etc.. Currently I just have those things in global device memory. They are accessed by every thread in every kernel call, and so I figured that if they are in global memory, then they never are being cached or broadcast (right?). And so these little integers are taking up a lot (relatively) of overhead, and have a lot of 'read redundancy.'
So I declare in a header:
__constant__ int* number;
I include that header, and, when I do memory stuff, I do:
cutilSafeCall( cudaMemcpyToSymbol(number, &(some_host_int), sizeof(int) );
I pass number into all my kernel's then:
__global__ void magical_kernel(int* number, ...){
//and I access 'number' like this
int data_thingy = big_array[ *number ];
}
My code crashes. With number in global memory, it is just fine. I have determined that it crashes sometime upon accessing number within the kernel. This means that either I am accessing or allocating it wrong. If it holds the wrong value, it will also cause a crash, because it is used to index into arrays.
To conclude, I will ask a few questions. First, what am I doing wrong? As a bonus: is there a better way than constant memory to accomplish this task - I don't know the value of number at compile time, so a simple #define won't work. Will constant memory even speed the code up at all, or has it been cached and broadcasted all along? Could I somehow put the data in shared memory for each threadblock and have it remain in shared memory through multiple kernel calls?
There are several problems here:
You have declared number as a pointer, but never assigned it a value which is valid address in GPU memory
You have a variable scope onflict: the argument variable int * number defined in magic_kernel is not the same variable as the __constant__ int * variable defined as compilation unit scope.
The first argument of the cudaMemcpyToSymbol call is almost certainly incorrect.
If you don't understand why either of the first two point are true, you have some revision to do on pointers and scope in C++.
Based on your response to a now deleted answer, I suspect what you are actually trying to do is this:
__constant__ int number;
__global__ void magical_kernel(...){
int data_thingy = big_array[ number ];
}
cudaMemcpyToSymbol("number", &(some_host_int), sizeof(int));
i.e. number is intended to be an integer in constant memory, not a pointer, and not a kernel argument.
EDIT: here is an exmaple which shows this in action:
#include <cstdio>
__constant__ int number;
__global__ void magical_kernel(int * out)
{
out[threadIdx.x] = number;
}
int main()
{
const int value = 314159;
const size_t sz = size_t(32) * sizeof(int);
cudaMemcpyToSymbol("number", &value, sizeof(int));
int * _out, * out;
out = (int *)malloc(sz);
cudaMalloc((void **)&_out, sz);
magical_kernel<<<1,32>>>(_out);
cudaMemcpy(out, _out, sz, cudaMemcpyDeviceToHost);
for(int i=0; i<32; i++)
fprintf(stdout, "%d %d\n", i, out[i]);
return 0;
}
You should be able to run this yourself and confirm it works as advertised.

Resources