I have a question regarding vertex buffers. How does one read the vertices from a vertex buffer in D3D11? I want to get a particular vertex's position for calculations; if this approach is wrong, how would one do it? The following code (obviously) does not work.
VERTEX* vert;
D3D11_MAPPED_SUBRESOURCE ms;
devcon->Map(pVBufferSphere, NULL, D3D11_MAP_READ, NULL, &ms);
vert = (VERTEX*) ms.pData;
devcon->Unmap(pVBufferSphere, NULL);
Thanks.
Where your code is wrong:
You ask the GPU to give you an address into its memory (Map()),
store this address (operator=()),
and then say: "Thanks, I don't need it anymore" (Unmap()).
After Unmap(), you can't really say where your pointer points any more. It may point to memory that has since been reallocated for something else, or to memory on your girlfriend's laptop (just kidding =) ).
You must copy the data (all of it or just part of it), not the pointer, between Map() and Unmap(): use memcpy, a for loop, anything. Put it into an array, a std::vector, a BST, whatever you like.
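For example, a minimal sketch of that copy, assuming pVBufferSphere is mappable for reading (see the staging notes below) and that vertexCount is known to you:
D3D11_MAPPED_SUBRESOURCE ms;
HRESULT hr = devcon->Map(pVBufferSphere, 0, D3D11_MAP_READ, 0, &ms);
if (SUCCEEDED(hr))
{
    std::vector<VERTEX> vertices(vertexCount);                         // CPU-side copy
    memcpy(vertices.data(), ms.pData, vertexCount * sizeof(VERTEX));   // copy while the mapping is valid
    devcon->Unmap(pVBufferSphere, 0);
    // 'vertices' stays valid after Unmap(); read positions from it freely.
}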
Typical mistakes newcomers make here:
Not checking the HRESULT returned by ID3D11DeviceContext::Map. If Map() fails, it can return whatever pointer it likes, and dereferencing that pointer leads to undefined behavior. So check the return value of every DirectX call.
Not checking the D3D11 debug output. It can state clearly what is wrong and what to do, in plain good English (clearly better than my English =) ), so you can often fix the bug almost instantly.
You can only read from an ID3D11Buffer if it was created with the D3D11_CPU_ACCESS_READ CPU access flag, which means you must also set the D3D11_USAGE_STAGING usage flag.
How we usually read from a buffer:
We don't use staging buffers for rendering or calculations: they're slow.
Instead we copy from the main buffer (non-staging and not CPU-readable) to a staging one (ID3D11DeviceContext::CopyResource() or ID3D11DeviceContext::CopySubresourceRegion()), and then copy the data to system memory (memcpy()). A sketch follows after this list.
We don't do this much in release builds; it hurts performance.
There are two main real-life uses of staging buffers: debugging (checking whether a buffer contains wrong data and fixing a bug in the algorithm) and reading back final non-pixel data (for example, when you compute scientific data in a compute shader).
In most cases you can avoid staging buffers entirely by designing your code well. Think of it as if CPU<->GPU were connected one way only: CPU->GPU.
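Putting it together, a hedged sketch of the staging-copy step (assuming a dev/devcon pair and the source buffer pVBufferSphere from the question; error handling trimmed):
D3D11_BUFFER_DESC desc;
pVBufferSphere->GetDesc(&desc);              // start from the source buffer's description
desc.Usage          = D3D11_USAGE_STAGING;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
desc.BindFlags      = 0;                     // staging resources cannot be bound to the pipeline
desc.MiscFlags      = 0;

ID3D11Buffer* pStaging = nullptr;
if (SUCCEEDED(dev->CreateBuffer(&desc, nullptr, &pStaging)))
{
    devcon->CopyResource(pStaging, pVBufferSphere);   // GPU-side copy into the staging buffer
    // ...then Map(pStaging, 0, D3D11_MAP_READ, ...), memcpy, Unmap(pStaging, 0) as in the snippet above...
    pStaging->Release();
}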
The following code only gets the address of the mapped resource; you didn't read anything before Unmap().
vert = (VERTEX*) ms.pData;
If you want to read data from the mapped resource, first allocate enough memory, then use memcpy to copy the data. I don't know your VERTEX structure, so I'll assume vert is a void*; you can cast it yourself:
vert = new BYTE[ms.DepthPitch];
memcpy(vert, ms.pData, ms.DepthPitch);
Drop's answer was helpful. I figured out that the reason I wasn't able to read the buffer was that I didn't have the CPU access flag set to D3D11_CPU_ACCESS_READ before. Here is the buffer description I use now:
D3D11_BUFFER_DESC bufferDesc;
ZeroMemory(&bufferDesc, sizeof(bufferDesc));
bufferDesc.ByteWidth = iNumElements * sizeof(T);
bufferDesc.Usage = D3D11_USAGE_DEFAULT;
bufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ | D3D11_CPU_ACCESS_WRITE;
bufferDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE ;
bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bufferDesc.StructureByteStride = sizeof(T);
And then to read the data I did:
const ID3D11Device& device = *DXUTGetD3D11Device();
ID3D11DeviceContext& deviceContext = *DXUTGetD3D11DeviceContext();
D3D11_MAPPED_SUBRESOURCE ms;
HRESULT hr = deviceContext.Map(g_pParticles, 0, D3D11_MAP_READ, 0, &ms);
Particle* p = (Particle*)malloc(sizeof(Particle) * g_iNumParticles);
ZeroMemory(p, sizeof(Particle) * g_iNumParticles);
memcpy(p, ms.pData, sizeof(Particle) * g_iNumParticles);
deviceContext.Unmap(g_pParticles, 0);
// ... inspect p in the debugger ...
free(p);
I agree it's a performance hit; I only wanted to do this to be able to debug the values!
Thanks anyway! =)
Solution:
I finally managed to solve my problem, or at least find a good workaround.
This kind of semaphore doesn't work on NVIDIA.
I think this comment is right.
So I decided to use atomic_add(), which is a mandatory part of OpenCL 1.1.
I have a resultBuffer array and a resultBufferSize global variable; the latter is initialized to zero.
When I have results (my result is always exactly x numbers), I simply call
position = atomic_add(resultBufferSize, x);
and I can be sure that nobody else writes into the buffer between position and position + x.
Don't forget that the global variable must be volatile.
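A minimal kernel-side sketch of that reservation pattern (the payload and x here are made up for illustration; the buffer names follow the ones above):
__kernel void collectResults(__global uint *resultBuffer,
                             volatile __global uint *resultBufferSize)
{
    uint x = 3;                          // number of results this work item produces (example)
    uint myResults[3] = {7, 8, 9};       // example payload

    uint position = atomic_add(resultBufferSize, x);   // reserve x slots; unique per work item

    for (uint i = 0; i < x; ++i)
        resultBuffer[position + i] = myResults[i];     // no other work item writes these slots
}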
When the threads ran into endless loops, the resource was not available, hence the -5 error code when reading the buffer.
Update:
When I read back:
oclErr |= clEnqueueReadBuffer(cqCommandQueue, cm_inputNodesArraySizes, CL_TRUE, 0, lastMapCounter*sizeof(cl_uint), (void*)&inputNodesArraySizes, 0, NULL, NULL);
The value of lastMapCounter changes. It's strange, because in the OpenCL code I don't touch it, and I am careful with the sizes: what I specified at buffer creation, what I copy, and what I read back are the same. A hidden buffer overflow can indeed cause many strange things.
End of update
I wrote the following code and there is a bug in it. I want a semaphore so I can change the resultBufferSize global variable (for now I just want to see how it works) and get back a big number (each worker is supposed to write something). But I always get 3, or sometimes errors. I can't see any logic in how the compiler handles it.
__kernel void findCircles(__global uint *inputNodesArray,
                          __global uint *inputNodesArraySizes,
                          uint lastMapCounter,
                          __global uint *resultBuffer,
                          __global uint *resultBufferSize,
                          volatile __global uint *sem)
{
    for (; atom_xchg(sem, 1) > 0;)
        (*resultBufferSize) = (*resultBufferSize) + 3;
    atom_xchg(sem, 0);
}
I get -48 during kernel execution, and sometimes it's OK but then I get -5 when I want to read back the buffer (the size buffer).
Do you have any idea where I can find the bug?
I am using NVIDIA OpenCL 1.1.
Of course, on the host I configure everything properly:
uint32 resultBufferSize = 0;
uint32 sem;
cl_mem cmresultBufferSize = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,
sizeof(uint32), NULL, &ciErrNum);
cl_mem cmsem = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, sizeof(uint32), NULL,
&ciErrNum);
ciErrNum = clSetKernelArg(ckKernel, 4, sizeof(cl_mem), (void*)&cmresultBufferSize);
ciErrNum = clSetKernelArg(ckKernel, 5, sizeof(cl_mem), (void*)&cmsem);
ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL,
&szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmresultBufferSize, CL_TRUE, 0,
sizeof(uint32), (void*)&resultBufferSize, 0, NULL, NULL);
(with this code the kernel runs OK and the final read returns -5)
I know you have come to a conclusion on this, but I want to point out two things:
1) The semaphore is non-portable because it isn't SIMD safe, as pointed out in the linked thread.
2) The memory model is not strong enough to give a meaning to the code. The update of the result buffer could move out of the critical section - nothing in the model says otherwise. At the very least you'd need fences, but the language around fences in the 1.x specs is also fairly weak. You'd need an OpenCL 2.0 implementation to be confident that this aspect is safe.
I am trying to optimise an iOS app and just wanted some advice on an issue I am having.
I have a bool** which can hold up to 1024 * 1024 elements. Each element defaults to false but they can also be changed to true at random.
I wanted to know if there is a streamlined way to check whether any of the elements holds a true value, since in the worst case over a million iterations would be needed to do the check with two loops.
I could be completely wrong, but I had thought about casting the memory to an int, believing that, since a false value equals 0, if all the elements were false the result of the cast would also be 0. This is not the case, however.
What I may have to go with is keeping a tally of the number of true values as they are toggled, but this could turn messy very quickly.
I hope I have made myself clear enough without code but, if you need to see any code, just ask.
--Edit--
So I decided to go with mvp's answer. Then when I need to check if a certain bit is set:
uint32_t mask32_t[] = {
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x100, 0x200, 0x400, 0x800,
0x1000, 0x2000, 0x4000, 0x8000,
0x10000, 0x20000, 0x40000, 0x80000,
0x100000, 0x200000, 0x400000, 0x800000,
0x1000000, 0x2000000, 0x4000000, 0x8000000,
0x10000000, 0x20000000, 0x40000000, 0x80000000
};
bool bitIsSet(uint32_t word, int n) {
return ( word & mask32_t[ n ] ) != 0x00;
}
bool isSetAtPoint( uint32_t** arr, int x, int y ) {
return bitIsSet( arr[ (int)floor(x / 32.0) ] [ y ], x % 32 );
}
Probably the best optimization you can make is to convert your 1024x1024 array of booleans into an array of bits. This approach has some benefits and drawbacks:
+ A bit array consumes 8 times less memory: 1024*1024/8 bytes = 128 KB.
+ You can test many bits at once by checking whole 32-bit integers. In other words, finding the first non-zero bit can be up to 32 times as fast.
- You need custom routines to read and write bits in this array. However, this is a rather simple task - just a bit of bit twiddling :) (see the sketch below).
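For example, a sketch of the "is any bit set" check over such a bit array (the layout here, 1024 rows of 32 uint32_t words each, is an assumption):
#include <cstdint>

// rows[r][w] packs 32 columns into each uint32_t: 1024 rows x 32 words = 128 KB.
bool anyBitSet(const uint32_t rows[1024][32])
{
    for (int r = 0; r < 1024; ++r)
        for (int w = 0; w < 32; ++w)
            if (rows[r][w] != 0)       // tests 32 flags with a single comparison
                return true;
    return false;
}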
Basically, the answer is no. You need to write a loop to check.
You cannot cast a composite value to an int. You don't report a compiler error, so I suspect you actually tried casting the pointer to your array to an int, which is legal; however, the result will only be 0 if the pointer is NULL (that is, not pointing to anything).
You can streamline the checking code by using a bitvector instead of an array of bools. (You'd also save quite a bit of memory.) However, that involves a lot more code, and it's fairly messy. If you were using C++, you would have access to std::bitset, which is a lot easier than writing your own but has the disadvantage of needing a compile-time size; a rough sketch follows below.
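A rough std::bitset sketch, if C++ is an option (heap-allocated, since the set is 1024*1024 bits = 128 KB; names are illustrative):
#include <bitset>
#include <memory>

constexpr std::size_t kSide = 1024;            // grid is kSide x kSide
using Grid = std::bitset<kSide * kSide>;

auto grid = std::make_unique<Grid>();          // every bit starts out false

void setCell(std::size_t x, std::size_t y)   { grid->set(y * kSide + x); }
void clearCell(std::size_t x, std::size_t y) { grid->reset(y * kSide + x); }
bool anyCellSet()                            { return grid->any(); }   // true if any bit is set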
Using an integer array has been suggested here already and is probably the best option, but here's another approach.
If the number of true nodes is very low compared to false nodes, you could turn the problem upside down: store the coordinates of the true nodes instead.
I'm thinking of a hash set or BST where you use the position of the true node as the key. Then checking for the existence of a true value is trivial: check whether your data structure has any coordinates stored in it (see the sketch below).
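A hedged sketch of that idea with a hash set keyed on packed coordinates (names are illustrative):
#include <cstdint>
#include <unordered_set>

std::unordered_set<uint32_t> trueCells;   // holds the packed (x, y) of every true node

inline uint32_t packXY(uint32_t x, uint32_t y) { return (y << 10) | x; }  // 1024 = 2^10

void setTrue(uint32_t x, uint32_t y)  { trueCells.insert(packXY(x, y)); }
void setFalse(uint32_t x, uint32_t y) { trueCells.erase(packXY(x, y)); }
bool anyTrue()                        { return !trueCells.empty(); }      // O(1) check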
I've got a question regarding the exploit_notesearch program.
This program is only used to build a command string that is finally passed to the system() function to exploit the notesearch program, which contains a buffer overflow vulnerability.
The command string looks like this:
./notesearch NOP-block|shellcode|repeated ret (which will jump into the NOP block)
Now the actual question:
The ret address is calculated in the exploit_notesearch program by the line:
ret = (unsigned int) &i-offset;
So why can we use the address of the variable i, which sits quite near the bottom of the main stack frame of the exploit_notesearch program, to calculate the ret address that will be written into the overflowed buffer of the notesearch program itself (so in a completely different stack frame) and that has to contain an address inside the NOP block (which is in the same buffer)?
that will be saved in an overflowing buffer in the notesearch program itself, so in a completely different stack frame
As long as the system uses virtual memory, system() will create another process for the vulnerable program, and assuming there is no stack randomization,
both processes will have almost identical values of esp (and of the offset) when their main() functions start, given that the exploit was compiled on the attacked machine (i.e. the one with the vulnerable notesearch).
The address of the variable i was chosen just to give an idea of where the frame base is. We could use this instead:
unsigned long sp(void)              // a little helper function that
{ __asm__("movl %esp, %eax"); }     // returns the stack pointer in %eax

int main() {
    unsigned long esp = sp();
    unsigned int ret = esp - offset;   // offset as in the original exploit
    // ... the rest of main() ...
}
Because the variable i is located at a relatively constant distance from esp, we can use &i instead of esp; it doesn't matter much.
It would be much more difficult to get an approximate value for ret if the system did not use virtual memory.
The stack is allocated in a first-in, last-out fashion. The variable i is located somewhere near the top; let's assume its address is 0x200, and the saved return address sits at a lower address, 0x180. To determine where to put the new return address while still leaving room for the shellcode, the attacker takes the difference: 0x200 - 0x180 = 0x80 (128 bytes). The return address itself takes 4 bytes, which in this example leaves only 48 bytes before reaching the segmentation fault. That is how the offset is calculated; the location of i just gives an approximate reference point.
I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
const int length,
const int height,
and a bunch of other parameters) {
//declare some local arrays to be shared by all 100 work item in this group
__local float LP [length];
__local float LT [height];
__local int bitErrors = 0;
__local bool failed = false;
//here come my actual computations which utilize the space in LP and LT
}
This, however, refuses to compile, since the parameters length and height are not known at compile time. But it is not at all clear to me how to do this correctly. Should I use pointers with malloc? How do I handle this so that the memory is allocated only once for the entire workgroup and not once per work item?
All I need is 2 arrays of floats, 1 int and 1 boolean that are shared by the entire workgroup (all 100 work items). But I can't find any method that does this correctly...
It's relatively simple: you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height* sizeof(cl_float), NULL);
Local memory is always shared by the workgroup (as opposed to private memory), so I think the bool and int should be fine, but if not, you can always pass those as arguments too.
Not really related to your problem (and not necessarily relevant, since I don't know what hardware you plan to run this on), but GPUs in particular don't like work sizes that aren't a multiple of a certain power of two (I think it was 32 for NVIDIA, 64 for AMD), meaning this will probably create workgroups of 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance to use workgroups of size 128 directly (and change the global work size appropriately).
As a side note: I never understood why everyone uses the underscore variants for kernel, local and global; they seem much uglier to me.
You could also declare your arrays like this:
__local float LP[LENGTH];
and pass LENGTH as a define when compiling your kernel:
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of an array.
The reason your code cannot compile is that OpenCL does not support local memory initialization. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not possible in CUDA ("Is there a way of setting default value for shared memory array?").
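A common workaround (a sketch, not the asker's exact kernel) is to declare the local scalars uninitialized and let one work item set them before a barrier:
__kernel void myKernel(const int length, const int height /*, other parameters */)
{
    __local int  bitErrors;
    __local bool failed;

    if (get_local_id(0) == 0 && get_local_id(1) == 0) {  // first work item of the 2D group
        bitErrors = 0;
        failed    = false;
    }
    barrier(CLK_LOCAL_MEM_FENCE);    // make the initialization visible to the whole group

    // ... actual computations using bitErrors and failed ...
}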
PS: The answer from Grizzly is good enough; it would have been better to post this as a comment, but I am restricted by the reputation policy. Sorry.
I've written a complicated Lua script which uses the Lua sockets library. It reads a list of files from disk, sorts them by date and sends them to an HTTP process. The number of files on disk is around 65K. The memory usage in Task Manager doesn't exceed 200 MB.
After quite a while the script returns:
lua: not enough memory
I print out the current GC count at various points and it never goes above 110 MB:
local freeMem = collectgarbage('count');
print("GC Count : " .. freeMem/1024 .. " MB");
This is on a 32 bit windows machine.
What's the best way to diagnose this?
All memory goes through the single lua_Alloc function. This takes the form of:
typedef void* (*lua_Alloc) (void* ud, void* ptr, size_t osize, size_t nsize);
All allocations, reallocations and frees go through this. The documentation for this can be found at this web page. You can easily write your own to track all memory operations. For example,
void* MyAlloc (void* ud, void* ptr, size_t osize, size_t nsize)
{
    (void)ud;  // Not used
    if (nsize == 0)
    {
        free(ptr);
        TrackSubtract(osize);
        return NULL;
    }
    else
    {
        void* p = realloc(ptr, nsize);
        if (p)                     // only adjust the counters if the reallocation succeeded
        {
            TrackSubtract(osize);
            TrackAdd(nsize);
        }
        return p;
    }
}
You can write the TrackAdd() and TrackSubtract() functions to whatever you want: output to a log; adjust a counter and so on.
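For instance, a minimal counter-based version (a sketch; the peak tracking is just one possibility):
static size_t g_currentBytes = 0;   // bytes currently allocated through Lua
static size_t g_peakBytes    = 0;   // high-water mark

static void TrackAdd(size_t n)
{
    g_currentBytes += n;
    if (g_currentBytes > g_peakBytes)
        g_peakBytes = g_currentBytes;
}

static void TrackSubtract(size_t n)
{
    g_currentBytes -= n;
}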
To use your new function you pass a pointer to it when you create the Lua state:
lua_State* L = lua_newstate(&MyAlloc,0);
The documentation to lua_newstate is found here.
Good luck.
Use perfmon to monitor your process and add counters for private bytes and virtual bytes.
When your script ends with 'not enough memory', check the value of each counter. If you see sudden peaks in memory usage, try adding more points at which you print the memory usage.