How to use pinned memory / mapped memory in OpenCL

In order to reduce the transfer time from host to device for my application, I want to use pinned memory. NVIDIA's best practices guide proposes mapping buffers and writing the data using the following code:
cDataIn = (unsigned char*)clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);
for(unsigned int i = 0; i < memSize; i++)
    cDataIn[i] = (unsigned char)(i & 0xff);
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0,
                     szBuffBytes, cDataIn, 0, NULL, NULL);
Intel's optimization guide recommends using calls to clEnqueueMapBuffer and clEnqueueUnmapMemObject instead of calls to clEnqueueReadBuffer or clEnqueueWriteBuffer.
What is the right way to use pinned memory/mapped memory? Is it necessary to write the data using enqueueWriteBuffer or is enqueueMapBuffer sufficient?
Also, what is the difference between CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR?

This is an interesting topic that very few people explain in detail.
I will try to define exactly how it works.
Pinned memory is memory that exists on the host as well as on the device, so a DMA transfer is possible between the two copies, which increases copy performance.
That is why the buffer needs CL_MEM_ALLOC_HOST_PTR in its creation flags.
CL_MEM_USE_HOST_PTR, on the other hand, takes an existing host pointer for buffer creation. The spec does not say whether such memory can be pinned. Generally speaking, you should NOT expect memory created this way to be pinned, since the host pointer has not been reserved by the OpenCL API and it is not clear where it resides in memory.
Regarding the Map vs. Read/Write question: both are OK, and they will give the same performance.
The difference between the two techniques is this:
For Map/Unmap: You need to map before writing/reading and unmap afterwards. That way you ensure the consistency of the data. These are API calls that are enqueued asynchronously and take time to complete. The good thing is that you don't need to hold on to anything other than the buffer object.
For Map+Read/Write: When you create the buffer, you map it once and save the pointer value. When you destroy the buffer, you first unmap it and then release it. You need to hold both the buffer and the mapped pointer for the whole lifetime. The good thing is that you can now just call clEnqueueReadBuffer/clEnqueueWriteBuffer with that mapped pointer. The API will wait for the pinned data to be consistent and then consider the operation done. It is easier to use, since it is like doing a map+unmap in one shot.
The Read/Write mode is easier to use, especially for repetitive reads, but it is not as versatile as the manual map option, since you CAN'T write to a read-only map or read from a write-only map. In general use, though, the variables that are read are never written, and vice versa.
My understanding is that Intel's recommendation means "Use Map, not plain Read/Write", rather than "When you use Map, don't use Read/Write over mapped pointers".
Did you try this NVIDIA recommendation on Intel hardware? I think it should work, but I don't know whether the operation would be as optimal as it is on AMD or NVIDIA hardware.


DirectX 12 Updating the Descriptor Heap

I'm currently writing my own graphics framework for DirectX12 (I've already written several DirectX 11 frameworks for personal game engines), and I'm currently trying to copy the methods used in the recent Hitman game for resource binding.
I'm confused about the best way to handle per-object resource binding for the SRV/CBV/UAV heap. I've watched several GDC presentations, and they all seem to gloss over this.
Only 1 SRV/CBV/UAV heap can be bound at a time, and switching the currently-bound heap in the middle of a command list can be bad for performance on some hardware by forcing a flush. Because of this, what is the best way to handle updating the heap with new descriptors? To me, it seems like each command list would:
Get a hold of a SRV/CBV/UAV heap for itself.
For each object in a subset of objects, create descriptors on the heap pointing to per-object data that was placed into a separate upload heap.
Afterwards, another command list takes this filled descriptor heap and binds it, then issues draw calls mixed with SetGraphicsRootDescriptorTable in order to move through the current descriptor heap.
This being said, several sources online (including another SO post) suggest using one large SRV/CBV/UAV heap and copying into it using CPU-visible heaps. I'm assuming they're not attempting to use the asynchronous CopyDescriptors, but rather CopyBufferRegion. I tried using CopyBufferRegion to update data per-object, but to me this seems under-performant with so many transitions between D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER and D3D12_RESOURCE_STATE_COPY_DEST. Am I misunderstanding something? Any clarity would be appreciated.
CopyDescriptors is not asynchronous; it is a CPU operation that takes effect immediately on the CPU. For volatile descriptors, it can happen at any time before the command list is executed (even after the command that uses the descriptor has been recorded). For static descriptors (root signature 1.1), the descriptor has to be ready at the point of use.
The usual approach is to have one large descriptor heap, keep a portion of it for static descriptors, and use the rest as a ring buffer: you allocate descriptor table offsets on demand, copy the needed descriptors there, and use them for any draw or compute operation.
CopyBufferRegion has no role here. Remember that mapping buffers is also an immediate operation, so you likewise ring-buffer a big chunk of memory for your per-object constant buffers and cycle through it. The only caveat is that you must not overwrite memory or descriptors while they may still be in use, so you have to use a fence to prevent that.
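To make the ring-buffer idea concrete, here is a minimal sketch of the bookkeeping only, in plain C++ with no D3D12 calls. The class name DescriptorRing and its methods are made up for illustration. It assumes that fence values never decrease between allocations and that the caller copies the descriptors into the returned slots itself.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical ring allocator for the dynamic part of one big descriptor heap.
// It only hands out slot indices; copying descriptors into those slots and
// signalling the fence are left to the caller.
class DescriptorRing {
public:
    // Slots [firstSlot, firstSlot + slotCount) form the ring; the slots below
    // firstSlot are assumed to hold the static descriptors.
    DescriptorRing(std::uint32_t firstSlot, std::uint32_t slotCount)
        : first_(firstSlot), capacity_(slotCount), head_(0), used_(0) {}

    // Reserves `count` contiguous slots for work submitted under `fenceValue`.
    // Returns false if the GPU may still be using the slots that would be
    // overwritten (the caller then has to wait on the fence and retire).
    bool allocate(std::uint32_t count, std::uint64_t fenceValue,
                  std::uint32_t& outFirstSlot) {
        if (count == 0 || count > capacity_) return false;
        // A descriptor table must be contiguous, so skip the tail of the
        // ring when the request would straddle the wrap-around point.
        const std::uint32_t pad =
            (head_ + count > capacity_) ? capacity_ - head_ : 0;
        if (used_ + pad + count > capacity_) return false;
        const std::uint32_t start = pad ? 0 : head_;
        head_ = (start + count) % capacity_;
        used_ += pad + count;
        pending_.push_back(Range{fenceValue, pad + count});
        outFirstSlot = first_ + start;
        return true;
    }

    // Frees every range whose fence value the GPU has already reached.
    void retire(std::uint64_t completedFenceValue) {
        while (!pending_.empty() &&
               pending_.front().fenceValue <= completedFenceValue) {
            used_ -= pending_.front().size;
            pending_.pop_front();
        }
    }

private:
    struct Range { std::uint64_t fenceValue; std::uint32_t size; };
    std::uint32_t first_, capacity_, head_, used_;
    std::deque<Range> pending_;
};
```

In a real renderer you would presumably call retire() with the value returned by ID3D12Fence::GetCompletedValue once per frame, and turn the returned slot index into a descriptor handle by multiplying it by the heap's descriptor increment size. The same scheme, with byte offsets instead of slot indices, would cover the per-object constant buffer memory.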

In Lua, how to determine the size of an object?

Is there a way, in Lua, to determine the (in memory) size of an object?
I found an article on Gamepedia about Lua object memory sizes, but it is not general and precise.
I would give the same explanation as @NicolBolas, but different answers to the questions.
Is there a way, in Lua, to determine the (in memory) size of an object?
Yes, but you may need to use an external module for that. See my earlier answer, and specifically the lua-getsize module.
Is there a way, in Lua, to determine if the table to be stored is greater than the MP size?
If you know the size of the table with X elements, you can probably extrapolate to a table with Y elements of approximately the same content, but you won't be able to limit the allocations to a particular size unless you use your own allocator that has that logic.
Is there a way, in Lua, to determine if the table to be stored is greater than the MP size?
Is there a way, in Lua, to determine the (in memory) size of an object?
Lua is not responsible for things like capping memory and so forth. That ought to be handled from the C code that creates and manages the Lua state. So if you have a 16MB limit, then that needs to be built into the lua_State when you call lua_newstate. You pass it an allocation function that needs to keep track of all such allocations. It would also allocate storage from the memory pool, not from the heap.
Of course, the allocator can't tell exactly why an allocation is happening. So there's no way to limit just this one specific table to 16MB, if you intend for the Lua state to also do other things.
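As a rough illustration of such a tracking allocator, here is a sketch with the same parameter list as lua_Alloc. The names MemBudget and capped_alloc are made up for this example, and it uses std::realloc/std::free instead of a real memory pool to keep it short; a pool-backed version would replace those two calls.

```cpp
#include <cstddef>
#include <cstdlib>

// Bookkeeping handed to the allocator through its `ud` (user data) argument.
struct MemBudget {
    std::size_t used;   // bytes currently handed out to the Lua state
    std::size_t limit;  // hard cap, e.g. 16 * 1024 * 1024
};

// Same parameters as lua_Alloc: nsize == 0 means "free", anything else means
// "allocate or resize the block to nsize bytes".
void* capped_alloc(void* ud, void* ptr, std::size_t osize, std::size_t nsize) {
    MemBudget* budget = static_cast<MemBudget*>(ud);
    // For a fresh allocation (ptr == NULL) newer Lua versions pass a type tag
    // in osize rather than a size, so only trust osize for existing blocks.
    const std::size_t oldSize = ptr ? osize : 0;

    if (nsize == 0) {
        std::free(ptr);
        budget->used -= oldSize;
        return nullptr;
    }
    // Refuse growth that would exceed the cap; Lua sees NULL as
    // "out of memory". Shrinking is never refused.
    if (nsize > oldSize && budget->used - oldSize + nsize > budget->limit) {
        return nullptr;
    }
    void* block = std::realloc(ptr, nsize);
    if (block) {
        budget->used = budget->used - oldSize + nsize;
    }
    return block;
}
```

With Lua 5.1 through 5.4 you would pass it as lua_newstate(capped_alloc, &budget), where budget is a MemBudget that outlives the state.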
If you have such specific memory needs for just this one table, you probably need to allocate and store it in C/C++, and then use the Lua interface to expose it to Lua to read/manipulate.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointer itself by finding and loading the element from either the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ is principally a pay-for-what-you-need ecosystem.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the index of the buffer front and end positions atomically. The container could, at leisure, update an internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering.) Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches the back, this is a sweet deal. I think this idea was popularized by the Disruptor library (of LMAX fame). You get mechanical sympathy from:
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
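The scheme described above can be sketched in a few lines. This is a deliberately minimal single-producer/single-consumer ring, not Boost's implementation: the class name SpscRing is made up, the 64-byte cache line is an assumption, one slot is kept empty to tell "full" from "empty", and the indices are published with release stores and read with acquire loads.

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer sketch. Holds at most N - 1 elements.
// T must be default-constructible and copy-assignable.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "need at least two slots");

public:
    SpscRing() : write_(0), read_(0) {}

    // Call from the producer thread only.
    bool push(const T& value) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t next = (w + 1) % N;
        if (next == read_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        data_[w] = value;  // plain, non-atomic copy into the slot
        // Publish the new index only after the copy has completed.
        write_.store(next, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only.
    bool pop(T& out) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = data_[r];
        // Hand the slot back to the producer only after the copy out.
        read_.store((r + 1) % N, std::memory_order_release);
        return true;
    }

private:
    T data_[N];
    // Keep the two indices on separate cache lines (64 bytes assumed).
    alignas(64) std::atomic<std::size_t> write_;
    alignas(64) std::atomic<std::size_t> read_;
};
```

For example, SpscRing<int, 1024> gives a queue of up to 1023 ints where push and pop never block; each simply returns false when the ring is full or empty.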
How Does Boost's spsc_queue Actually Do This?
Yes, spsc_queue stores the raw element values in a contiguous aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue with a statically supplied maximum capacity):
typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;
storage_type storage_;
T * data()
{
    return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the Boost devs have taken care to apply enough padding to avoid false sharing of the cache line holding the reading/writing indices (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there is only the "internal" index on either the read or the write side. This is possible because there's only one writing thread and only one reading thread, which means that there can only be more space at the end of a write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::uninitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, this is about as good as a ringbuffer implementation gets.
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-often-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve good memory access patterns there too.
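A sketch of what such an element type could look like. All names and fields here (OrderMsg, OrderDetails, and so on) are invented for illustration, and the 64-byte cache line is an assumption about the target CPU.

```cpp
#include <cstdint>

// Hypothetical "cold" attachment: large, rarely read, stored elsewhere
// (ideally in memory handed out by a pool allocator).
struct OrderDetails;

// Hypothetical queue element: the hot fields travel by value in the ring,
// the cold data stays behind a pointer that is only followed when needed.
struct alignas(64) OrderMsg {
    std::uint64_t id;
    std::uint64_t timestamp_ns;
    double price;
    std::uint32_t quantity;
    std::uint8_t side;
    const OrderDetails* details;  // cold path only
};

// One element per cache line, so the consumer touches exactly one line per pop.
static_assert(sizeof(OrderMsg) == 64, "OrderMsg should fill one cache line");
```

Such a struct would then be the element type of the queue, e.g. boost::lockfree::spsc_queue<OrderMsg>, instead of a pointer to a heap-allocated message.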
Let your profiler guide you.
[1] I suppose that for SPSC, acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend that everyone else follow my example :)

Best practice for dealing with package allocation in Go

I'm writing a package which makes heavy use of buffers internally for temporary storage. I have a single global (but not exported) byte slice which I start with 1024 elements and grow by doubling as needed.
However, it's very possible that a user of my package would use it in such a way that caused a large buffer to be allocated, but then stop using the package, thus wasting a large amount of allocated heap space, and I would have no way of knowing whether to free the buffer (or, since this is Go, let it be GC'd).
I've thought of three possible solutions, none of which is ideal. My question is: are any of these solutions, or maybe ones I haven't thought of, standard practice in situations like this? Is there any standard practice? Any other ideas?
Screw it.
Oh well. It's too hard to deal with this, and leaving allocated memory lying around isn't so bad.
The problem with this approach is obvious: it doesn't solve the problem.
Exported "I'm done" or "Shrink internal memory usage" function.
Export a function which the user can call (and calling it intelligently is obviously up to them) which will free the internal storage used by the package.
The problem with this approach is twofold. First, it makes for a more complex, less clean interface to the user. Second, it may not be possible or practical for the user to know when calling such a function is wise, so it may be useless anyway.
Run a goroutine which frees the buffer after a certain period of the package going unused, or which shrinks the buffer (perhaps halving the length) whenever its size hasn't been increased in a while.
The problem with this approach is primarily that it puts unnecessary strain on the scheduler. Obviously a single goroutine isn't so bad, but if this were accepted practice, it wouldn't scale well if every package you imported were doing this under the hood. Also, if you have a time-sensitive application, you may not want code running when you're not aware of it (that is, you may assume that the package isn't doing any work when its functions are not being called - a reasonable assumption, I'd say).
So... any ideas?
NOTE: You can see the existing project here (the relevant code is only a few tens of lines).
A common approach to this is letting the client pass an existing []byte (or whatever) as an argument to some call/function/method. For example:
// The returned slice may be a sub-slice of dst if dst was large enough
// to hold the entire encoded block. Otherwise, a newly allocated slice
// will be returned. It is valid to pass a nil dst.
func Foo(dst []byte, whatever Bar) (ret []byte, err error)
Another approach is to get a new []byte from, for example, a cache or a pool (if you prefer the latter name for that concept) and to rely on clients to return used buffers to such a "recycle bin".
BTW: You're doing it right by thinking about this. Where it's possible to reasonably reuse []byte buffers, there's a potential for lowering the GC load and thus making your program perform better. Sometimes the difference can be critical.
You could reslice your buffer at the end of every operation.
buffer = buffer[:0]
Then your function extendAndSliceBuffer would have the original backing array most likely available if it needs to grow. If not, you would suffer a new allocation, which you might get anyway when you do extendAndSliceBuffer.
Overall, I think a cleaner solution is to do as @jnml said and let the users pass their own buffer if they care about performance. If they don't care about performance, then you should not use a global var; simply allocate the buffer as you need it and let it go when it goes out of scope.
I have a single global (but not exported) byte slice which I start with 1024 elements and grow by doubling as needed.
And there's your problem. You shouldn't have a global like this in your package.
Generally the best approach is to have an exported struct with attached functions. The buffer should reside in this struct unexported. That way the user can instantiate it and let the garbage collector clean it up when they let go of it.
You also want to avoid requiring globals like this as it can hamper unit tests. A unit test should be able to instantiate the exported struct, as the user can, and do it each time for every test.
Also, depending on what kind of buffer you need, bytes.Buffer may be useful, as it already provides io.Reader and io.Writer functions. bytes.Buffer also automatically grows and shrinks its buffer. In buffer.go you'll see various calls to b.Truncate(0) that do the shrinking, with the comment "reset to recover space".
It's generally really really bad form to write Go code that is not thread-safe. If two different goroutines call functions that modify the buffer at the same time, who knows what state the buffer will be in when they finish? Just let the user provide a scratch-space buffer if they decide that the allocation performance is a bottleneck.

Is it possible to use cudaMemcpy with src and dest as different types?

I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs and then cast the value to int. This will give the code 32-bit memory transactions per thread, allowing for memory coalescing, and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32-bit word. On non-Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.
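The indexing involved is simple enough to show on the host. The following is plain C++ that only emulates the idea: Short2 stands in for CUDA's built-in short2, and load_as_int and pack_pairs are made-up helper names. In real code the packing step would just be a cudaMemcpy of count * sizeof(short) bytes into a short2 device allocation, and load_as_int would be the body of a device function.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for CUDA's short2 vector type: two packed 16-bit values.
struct Short2 { short x, y; };

// What each thread would do: fetch the 32-bit pair holding element i,
// pick the half it needs, and widen it to int.
inline int load_as_int(const Short2* packed, std::size_t i) {
    const Short2 pair = packed[i / 2];
    return static_cast<int>((i % 2 == 0) ? pair.x : pair.y);
}

// Host-side emulation of the transfer: the short array is copied
// byte-for-byte, so no conversion happens during the copy itself.
inline std::vector<Short2> pack_pairs(const short* src, std::size_t count) {
    std::vector<Short2> out((count + 1) / 2, Short2{0, 0});
    std::memcpy(out.data(), src, count * sizeof(short));
    return out;
}
```

Thread i and thread i + 1 (for even i) read the same 32-bit word, which is where the coalescing and cache benefits mentioned above come from.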
