I am confused about __local memory in OpenCL here.
I read some spec saying that the data flow has to be from Host to
__Global, and then __Local.
But I also see some kernel function like this:
__kernel void foo(__local float * a)
I was wondering how data is transferred directly into __local memory in this case?
Thanks.
It is not possible to fill a local buffer from the host side. Therefore you have to follow the flow host -> __global -> __local.
A local buffer can either be created on the host side and passed as a kernel parameter, or declared on the GPU side inside the kernel.
Creating the local buffer on the host side has the advantage that its size can be decided just before the kernel is run, which can be important if the local buffer size needs to be different each time the kernel is run.
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The result of this is that the only way on many architectures for filling local memory from the host would be for kernel code to be inserted by the compiler that would copy data in from global memory. Given that as the basis, it isn't any worse in terms of performance for the programmer to do it manually, and gives more control over exactly what happens. You do not end up in a situation where the compiler always generates copy code and ends up copying more than was really necessary because the API didn't make it clear what memory was copy-in and what was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to only put the result of a first stage into local, removing extra copy operations, or to transform the data on the way in to local, allowing padding or data transposition to remove bank conflicts and so on.
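For illustration, here is a minimal OpenCL C sketch of that manual copy-in pattern; the kernel name, the one-element-per-work-item copy, and the scaling transform are assumptions made for the example, not anything from the question:

__kernel void scale_tile(__global const float *in,
                         __global float *out,
                         __local float *tile)      // sized by the host via clSetKernelArg
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid] * 2.0f;        // transform the data on the way into local
    barrier(CLK_LOCAL_MEM_FENCE);      // make the tile visible to the whole work-group

    out[gid] = tile[lid];              // every work-item can now read any slot of the tile
}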
As @doqtor said, the size of a local-memory kernel parameter can be specified on the host side with a clSetKernelArg call.
Alternatively, a __local array with a compile-time constant size can be declared directly inside the kernel, in which case no local-memory kernel parameter is needed at all.
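To make both variants concrete, here is a hedged host-side/kernel-side sketch (error checking omitted, names invented for the example); passing NULL as the argument value to clSetKernelArg is what reserves local memory for a __local kernel parameter:

/* Variant 1: local buffer sized from the host, for a kernel like
 *   __kernel void foo(__local float *a)                          */
size_t local_bytes = work_group_size * sizeof(float);
clSetKernelArg(kernel, 0, local_bytes, NULL);   /* NULL value => allocate __local memory */

/* Variant 2: fixed-size local buffer declared inside the kernel,
 * so no local-memory parameter is needed:
 *   __kernel void bar(__global float *data)
 *   {
 *       __local float a[256];   // size fixed when the kernel is compiled
 *       ...
 *   }
 */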
As per my knowledge, atomicAdd can be used on shared memory and global memory. I need to atomically add floating point numbers from threads of different blocks; hence, I need to use a global temporary to hold the sum.
Is there a way to allocate temporary globals from inside a kernel?
Currently, I allocate a temporary global and pass a pointer to my kernel. This doesn't appear to be very user-friendly.
TL;DR: require a temporary variable for atomic addition across different blocks without the need to explicitly allocate a global and pass a pointer to it to the kernel
You can use malloc() inside kernel code. However, it's rarely a good idea to do so. It's usually much better to pre-allocate scratch space before the kernel is launched, pass it as an argument, and let each thread, or group of threads, have some formula for determining the location they will use for their common atomics within that scratch area.
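As a hedged sketch of that pre-allocated scratch approach (the kernel name, the block-to-slot formula, and NUM_SLOTS are made up for illustration):

#define NUM_SLOTS 64   // illustrative number of scratch accumulators

// Every thread atomically adds its contribution into the slot its block maps to.
__global__ void accumulate(const float *input, float *scratch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int slot = blockIdx.x % NUM_SLOTS;    // simple block -> slot formula
        atomicAdd(&scratch[slot], input[i]);  // float atomicAdd on global memory
    }
}

// Host side: allocate and zero the scratch once, before launching:
//   float *scratch;
//   cudaMalloc(&scratch, NUM_SLOTS * sizeof(float));
//   cudaMemset(scratch, 0, NUM_SLOTS * sizeof(float));
//   accumulate<<<blocks, threads>>>(d_input, scratch, n);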
Now, you've written this isn't very "user-friendly"; I guess you mean developer-friendly. Well, it can be made more friendly! For example, my CUDA Modern C++ API wrappers library offers an equivalent of std::unique_ptr - but for device memory:
#include <cuda/api_wrappers.hpp>
//... etc. etc. ...
{
    auto scratch = cuda::memory::device::make_unique<float[]>(1024, my_cuda_device_id);
    my_kernel<<<blah,blah,blah>>>(output, input, scratch.get());
} // the device memory is released here!
(this is for synchronous launches of course.)
Something else you can do to be more developer-friendly is to use some kind of proxy function to get the location in that scratch memory relevant to a specific thread / warp / group of threads / whatever unit shares the same address for its atomics. That should at least hide away some of the repetitive, annoying address arithmetic your kernel might be using.
There's also the option of using global __device__ variables (like @RobertCrovella mentioned), but I wouldn't encourage that: the size would have to be fixed at compile time, you wouldn't be able to use it from two kernels at once without it being painful, etc.
I'm trying to learn OpenCL but I'm having a hard time deciding which address spaces to use, as I only find assembled resources declaring what these address spaces are, but not why they exist or when to use them. The resources are also rather scattered, so with this question I hope to gather all this information: what the address spaces are, why they exist, when to use which one, and what the advantages and disadvantages are regarding memory and performance.
As I understand it (which is probably too simplified), the GPU has two physical types of memory: global memory, far from the actual processors, so slow but pretty big and available to all workers, and local memory, close to the actual processors, so fast but small and not accessible from other workers.
Intuitively, the local qualifier makes sure a variable is placed on local memory and the global qualifier makes sure a variable is placed on global memory, though I'm not sure this is exactly what happens. This leaves the private and constant qualifiers. What's the purpose of those?
There also are some implicit qualifiers. For example, the specifications mention the generic address space, which is used for arguments with no qualifiers, I think. What does this do exactly? Then there also are local function variables. What's the address space for those?
Here is an example using my intuition, but without knowing what I'm actually doing:
Say I pass an array of type long and length 10000 to a kernel which I will only use to read, then I would declare it global const as it must be available to all workers and it will not change. Why wouldn't I use the constant qualifier? When setting the buffer for this array via the CPU, I actually also just could have made the array read-only, which in my eyes says the same as declaring it const. So again, when and why would I declare something constant or global const?
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
Say I also want to pass the length of this array, then I would add const int length to the arguments of my kernel, but I'm unsure why I would omit the global qualifier except because I have seen other people do it. After all, length must be accessible for all workers. If I'm right, then length would have a generic address space, but again, I don't really know what that means.
I hope someone with some experience can clear this up. That would be great not only for me, but I hope also for other enthusiasts who want to gain some practical knowledge concerning memory management on the GPU.
Constant: a small portion of cached global memory visible to all workers. Use it if you can; it is read-only.
Global: slow, visible to all, read or write. It is where all your data will end up, so some accesses to it are always necessary.
Local: do you need to share something within a work-group? Use local! Do all your local workers access the same global memory? Use local!
Local memory is only visible to workers within the same work-group, and it is limited in size, but it is very fast.
Private: memory that is only visible to a single worker; think of it like registers. All variables declared without a qualifier are private by default.
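To put the four address spaces side by side, here is an illustrative (made-up) kernel signature showing where each qualifier typically appears:

__kernel void example(__global float *data,      // big read/write buffer in global memory
                      __constant float *coeffs,  // small read-only table, cached, shared by all workers
                      __local float *tile,       // per-work-group scratch: fast but small
                      const int n)               // passed by value; plain variables are private
{
    float acc = 0.0f;                            // __private: usually ends up in a register
    /* ... */
}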
Say I pass an array of type long and length 10000 to a kernel which I
will only use to read, then I would declare it global const as it must
be available to all workers and it will not change. Why wouldn't I use
the constant qualifier?
Actually, yes, you can and you should use the constant qualifier, as long as the data fits within the device's constant buffer size (CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, at least 64 KB). It places your data in constant memory, a small portion of read-only memory that is quickly accessible by all workers. This is what GPUs use, for example, to transfer uniforms to all vertex shaders.
When setting the buffer for this array via the CPU, I actually also
just could have made the array read-only, which in my eyes says the
same as declaring it const. So again, when and why would I declare
something constant or global const?
Not really: when you create a read-only buffer you are only telling OpenCL that you plan to use it read-only, so it can optimize things behind the scenes, but you can still write to it from a kernel.
global const is just a safeguard for the developer, so that you don't accidentally write to the data; doing so will give an error at compile time.
Basically, the same as in plain C host side computing. Programs will also work fine if all memory is non-const.
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
It is only worth it if the data is read by many workers. If each worker reads a single distinct value from global memory, then it is not worth it (a short sketch of the useful case follows the examples below).
Useful here:
Worker0 -> Reads 0,1,2,3
Worker1 -> Reads 0,1,2,3
Worker2 -> Reads 0,1,2,3
Worker3 -> Reads 0,1,2,3
Not useful here:
Worker0 -> Reads 0
Worker1 -> Reads 1
Worker2 -> Reads 2
Worker3 -> Reads 3
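A hedged sketch of the "useful" case, assuming a work-group of four workers that each need all four values (kernel and buffer names are invented):

__kernel void sum4(__global const float *in,
                   __global float *out,
                   __local float *shared)              // 4 floats, sized by the host
{
    size_t lid = get_local_id(0);                      // 0..3 in this example

    shared[lid] = in[get_group_id(0) * 4 + lid];       // one global read per worker
    barrier(CLK_LOCAL_MEM_FENCE);

    // now every worker reads 0,1,2,3 from fast local memory instead of global
    out[get_global_id(0)] = shared[0] + shared[1] + shared[2] + shared[3];
}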
Say I also want to pass the length of this array, then I would add
const int length to the arguments of my kernel, but I'm unsure why I
would omit the global qualifier except because I have seen other
people do it. After all, length must be accessible for all workers. If
I'm right, then length would have a generic address space, but again,
I don't really know what that means.
When you don't specify an address space for a non-pointer kernel parameter, it is formally __private, but in practice implementations pass such small values through fast constant/uniform hardware paths, which is what you want here: quick access by all workers.
The rule OpenCL compilers typically follow for kernel parameters is: if the data is only read and fits in constant memory, place it in constant; otherwise, use global.
I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointee itself, finding and loading it from either the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ is principally a pay-for-what-you-need ecosystem.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one just has to maintain the indices of the buffer's front and end positions atomically. The container could, at its leisure, update an internal 'dirty front index' while it copies ahead of the external front (the copy can use relaxed memory ordering). Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and catches up with the back, this is a sweet deal. I think this idea was popularized in the Disruptor library (of LMAX fame). You get mechanical resonance from:
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
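To make the idea concrete, here is a hedged, minimal single-producer/single-consumer ring buffer along those lines; it is not Boost's code, and the power-of-two capacity and naming are choices made purely for the sketch:

#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer: one writer thread, one reader thread.
// The indices are published with release and observed with acquire,
// so the element copy itself can be a plain (non-atomic) copy.
template <typename T, std::size_t Capacity>   // Capacity must be a power of two
class SpscRing {
public:
    bool push(const T& value) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        const std::size_t r = read_.load(std::memory_order_acquire);
        if (w - r == Capacity) return false;             // full: front would wrap onto back
        buffer_[w & (Capacity - 1)] = value;             // copy the payload first...
        write_.store(w + 1, std::memory_order_release);  // ...then publish it
        return true;
    }

    bool pop(T& out) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        const std::size_t w = write_.load(std::memory_order_acquire);
        if (r == w) return false;                        // empty
        out = buffer_[r & (Capacity - 1)];               // copy out before releasing the slot
        read_.store(r + 1, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    // padding between the two indices would avoid false sharing, as Boost does
    std::atomic<std::size_t> write_{0};
    std::atomic<std::size_t> read_{0};
};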
How Does Boost's spsc_queue Actually Do This?
Yes, spsc_queue stores the raw element values in a contiguous aligned block of memory (e.g. from compile_time_sized_ringbuffer, which underlies spsc_queue when the maximum capacity is supplied at compile time):
typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;

storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there is only a single index on each of the read and write sides. This is possible because there is only one writing thread and only one reading thread, which means that the worst that can happen is that there is more free space at the end of a write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::uninitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that, e.g., SSE instructions will be employed if your architecture supports them)
All in all, we're looking at a best-in-class realization of the ringbuffer idea.
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing the frequently used data right there in the value, and referencing the less frequently used data using a pointer (the cost of the pointer is low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can get sensible memory access patterns there too.
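For example, a hedged sketch of an element type that keeps the hot fields inline in a single 64-byte cache line and refers to the rarely needed data through a pointer (the field names and sizes are invented):

#include <cstdint>

struct BulkAttachment;            // rarely needed payload, allocated elsewhere

// One queue element, aligned and padded to a typical 64-byte cache line:
// the frequently used fields travel by value, the cold data by pointer.
struct alignas(64) Message {
    std::uint64_t   sequence;     // hot
    std::uint64_t   timestamp_ns; // hot
    double          price;        // hot
    std::uint32_t   quantity;     // hot
    BulkAttachment* attachment;   // cold: only dereferenced when actually needed
};

static_assert(sizeof(Message) == 64, "keep one element per cache line");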
Let your profiler guide you.
[1] I suppose say for spsc acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)
I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs and then cast the value to int. This will give the code 32 bit memory transactions per thread, allowing for memory coalescing, and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32 bit word. On non Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.
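A hedged sketch of that short2 idea (the kernel and buffer names are invented; the host still copies the raw shorts, and the device reinterprets them two at a time):

#include <cuda_runtime.h>

// Each thread loads one 32-bit short2 (two packed shorts), widens both halves
// to int in registers, and writes them out. Loads stay coalesced because
// adjacent threads read adjacent 32-bit words.
__global__ void widen_shorts(const short2 *in, int *out, int n_pairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        short2 packed = in[i];
        out[2 * i]     = (int)packed.x;
        out[2 * i + 1] = (int)packed.y;
    }
}

// Host side (sketch): copy the short array as raw bytes, launch with N/2 pairs.
//   cudaMemcpy(d_shorts, buf_h, N * sizeof(short), cudaMemcpyHostToDevice);
//   widen_shorts<<<grid, block>>>((const short2 *)d_shorts, buf_d, N / 2);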
How is a program (e.g. C or C++) arranged in computer memory? I kind of know a little about segments, variables etc, but basically I have no solid understanding of the entire structure.
Since the in-memory structure may differ, let's assume a C++ console application on Windows.
Some pointers to what I'm after specifically:
Outline of a function, and how is it called?
Each function has a stack frame, what does that contain and how is it arranged in memory?
Function arguments and return values
Global and local variables?
const static variables?
Thread local storage..
Links to tutorial-like material and such is welcome, but please no reference-style material assuming knowledge of assembler etc.
Might this be what you are looking for:
http://en.wikipedia.org/wiki/Portable_Executable
The PE file format is the binary file structure of Windows binaries (.exe, .dll, etc.). Basically, they are mapped into memory like that. More details are described here, with an explanation of how you can take a look at the binary representation of loaded DLLs in memory:
http://msdn.microsoft.com/en-us/magazine/cc301805.aspx
Edit:
Now I understand that you want to learn how source code relates to the binary code in the PE file. That's a huge field.
First, you have to understand the basics about computer architecture which will involve learning the general basics of assembly code. Any "Introduction to Computer Architecture" college course will do. Literature includes e.g. "John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach" or "Andrew Tanenbaum, Structured Computer Organization".
After reading this, you should understand what a stack is and its difference to the heap. What the stack-pointer and the base pointer are and what the return address is, how many registers there are etc.
Once you've understood this, it is relatively easy to put the pieces together:
A C++ object contains code and data, i.e., member variables. A class
class SimpleClass {
public:
    int m_nInteger;
    double m_fDouble;
    double SomeFunction() { return m_nInteger + m_fDouble; }
};
will occupy 4 + 8 consecutive bytes in memory (plus any padding the compiler inserts for alignment). What happens when you do:
SimpleClass c1;
c1.m_nInteger = 1;
c1.m_fDouble = 5.0;
c1.SomeFunction();
First, object c1 is created on the stack, i.e., the stack pointer esp is decreased by 12 bytes (in this simplified picture) to make room. Then the constant 1 is written to memory address esp-12 and the constant 5.0 is written to esp-8.
Then we call a function, which means two things.
The computer has to load the part of the binary PE file into memory that contains function SomeFunction(). SomeFunction will only be in memory once, no matter how many instances of SimpleClass you create.
The computer has to execute function SomeFunction(). That means several things:
Calling the function also implies passing all parameters, often this is done on the stack. SomeFunction has one (!) parameter, the this pointer, i.e., the pointer to the memory address on the stack where we have just written the values "1" and "5.0"
Save the current program state, i.e., the current instruction address which is the code address that will be executed if SomeFunction returns. Calling a function means pushing the return address on the stack and setting the instruction pointer (register eip) to the address of the function SomeFunction.
Inside function SomeFunction, the old stack is saved by storing the old base pointer (ebp) on the stack (push ebp) and making the stack pointer the new base pointer (mov ebp, esp).
The actual binary code of SomeFunction is executed, which runs the machine instructions that convert m_nInteger to a double and add it to m_fDouble. m_nInteger and m_fDouble are found on the stack, at ebp - x bytes.
The result of the addition is stored in a register and the function returns. That means the stack is discarded which means the stack pointer is set back to the base pointer. The base pointer is set back (next value on the stack) and then the instruction pointer is set to the return address (again next value on the stack). Now we're back in the original state but in some register lurks the result of the SomeFunction().
I suggest, you build yourself such a simple example and step through the disassembly. In debug build the code will be easy to understand and Visual Studio displays variable names in the disassembly view. See what the registers esp, ebp and eip do, where in memory your object is allocated, where the code is etc.
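If you want a complete program to step through, a minimal version of the example might look like this (the cout call just keeps the optimizer from discarding everything in release builds):

#include <iostream>

class SimpleClass {
public:
    int m_nInteger;
    double m_fDouble;
    double SomeFunction() { return m_nInteger + m_fDouble; }
};

int main()
{
    SimpleClass c1;                     // lives in main's stack frame
    c1.m_nInteger = 1;
    c1.m_fDouble = 5.0;
    double result = c1.SomeFunction();  // step into this call in the disassembly view
    std::cout << result << '\n';
    return 0;
}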
What a huge question!
First you want to learn about virtual memory. Without that, nothing else will make sense. In short, C/C++ pointers are not physical memory addresses. Pointers are virtual addresses. There's a special CPU feature (the MMU, memory management unit) that transparently maps them to physical memory. Only the operating system is allowed to configure the MMU.
This provides safety (there is no C/C++ pointer value you can possibly make that points into another process's virtual address space, unless that process is intentionally sharing memory with you) and lets the OS do some really magical things that we now take for granted (like transparently swap some of a process's memory to disk, then transparently load it back when the process tries to use it).
A process's address space (a.k.a. virtual address space, a.k.a. addressable memory) contains:
a huge region of memory that's reserved for the Windows kernel, which the process isn't allowed to touch;
regions of virtual memory that are "unmapped", i.e. nothing is loaded there, there's no physical memory assigned to those addresses, and the process will crash if it tries to access them;
parts of the various modules (EXE and DLL files) that have been loaded (each of these contains machine code, string constants, and other data); and
whatever other memory the process has allocated from the system.
Now typically a process lets the C Runtime Library or the Win32 libraries do most of the super-low-level memory management, which includes setting up:
a stack (for each thread), where local variables and function arguments and return values are stored; and
a heap, where memory is allocated if the process calls malloc or does new X.
For more about how the stack is structured, read about calling conventions. For more about how the heap is structured, read about malloc implementations. In general the stack really is a stack, a last-in-first-out data structure, containing arguments, local variables, and the occasional temporary result, and not much more. Since it is easy for a program to write straight past the end of the stack (the common C/C++ bug after which this site is named), the system libraries typically make sure that there is an unmapped page adjacent to the stack. This makes the process crash instantly when such a bug happens, so it's much easier to debug (and the process is killed before it can do any more damage).
The heap is not really a heap in the data structure sense. It's a data structure maintained by the CRT or Win32 library that takes pages of memory from the operating system and parcels them out whenever the process requests small pieces of memory via malloc and friends. (Note that the OS does not micromanage this; a process can to a large extent manage its address space however it wants, if it doesn't like the way the CRT does it.)
A process can also request pages directly from the operating system, using an API like VirtualAlloc or MapViewOfFile.
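For instance, a hedged sketch of requesting pages directly with VirtualAlloc (the size and flags are chosen just for illustration):

#include <windows.h>

int main()
{
    // Reserve and commit 1 MiB of read/write pages straight from the OS,
    // bypassing the CRT heap entirely.
    SIZE_T size = 1 << 20;
    void *pages = VirtualAlloc(nullptr, size,
                               MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (pages != nullptr) {
        // ... use the memory ...
        VirtualFree(pages, 0, MEM_RELEASE);   // give the pages back to the OS
    }
    return 0;
}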
There's more, but I'd better stop!
For understanding stack frame structure you can refer to
http://en.wikipedia.org/wiki/Call_stack
It gives you information about the structure of the call stack and how locals, globals, and the return address are stored on it.
Another good illustration
http://www.cs.uleth.ca/~holzmann/C/system/memorylayout.pdf
It might not be the most accurate information, but MS Press provides some sample chapters of the book Inside Microsoft® Windows® 2000, Third Edition, containing information about processes and their creation along with images of some important data structures.
I also stumbled upon this PDF that summarizes some of the above information in a nice chart.
But all the provided information is more from the OS point of view and not too detailed about the application aspects.
Actually, you won't get far in this matter without at least a little bit of knowledge of assembler. I'd recommend a reversing (tutorial) site, e.g. OpenRCE.org.