opencl - use image object with local memory

I'm programming with OpenCL.
There are two types of memory object:
one is the buffer and the other is the image.
Some blogs, web sites and white papers say that image objects are a little faster than buffers because of caching.
I'm trying to use image objects, and the reason is 'clamp': it should make the kernel code simpler and faster (my opinion).
My question is: is it possible to use an image object together with local memory, and is it faster than using a buffer object with local memory?
Data -> image object -> copy to local memory -> operations -> write back to other image object.
As far as I understand, I cannot use the async_work_group_copy instruction to fill local memory in this case,
so I have to copy into local memory and synchronize manually, which adds a lot of overhead.

The only real answer to that is "it depends". Most implementations don't really gain anything from async_work_group_copy. Image reads may have slightly higher latency than buffer reads when there is a cache hit, but you may get better cache behaviour from them on some architectures. Clamping, address calculation and filtering are effectively free operations performed by dedicated hardware that you'd otherwise have to shift into kernel code when using buffers, so that reduces your read latency and may increase throughput.
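To make the clamping point concrete, here is a rough OpenCL C sketch (the kernel names, the 3-tap horizontal fetch and the width parameter are made up for illustration): with an image, out-of-range coordinates are handled by the sampler; with a buffer, the clamp and the address arithmetic become ordinary kernel instructions.

    /* Assumes the image was created with an unsigned-integer channel type. */
    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

    __kernel void blur3_image(__read_only image2d_t src,
                              __write_only image2d_t dst)
    {
        int2 p = (int2)(get_global_id(0), get_global_id(1));
        /* The sampler clamps the -1/+1 neighbours at the image border for free. */
        uint4 sum = read_imageui(src, smp, p + (int2)(-1, 0)) +
                    read_imageui(src, smp, p) +
                    read_imageui(src, smp, p + (int2)(1, 0));
        write_imageui(dst, p, sum / 3u);
    }

    __kernel void blur3_buffer(__global const uint *src, __global uint *dst,
                               int width)
    {
        int x = get_global_id(0), y = get_global_id(1);
        /* With a buffer, the clamp and address calculation run as ALU code. */
        int xm = clamp(x - 1, 0, width - 1);
        int xp = clamp(x + 1, 0, width - 1);
        dst[y * width + x] = (src[y * width + xm] +
                              src[y * width + x]  +
                              src[y * width + xp]) / 3u;
    }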
If you are going to get big caching benefits from images, local memory may just get in the way. The extra work of writing to it, synchronizing, reading from it, calculating addresses and so on may outweigh what you save.
Sadly this is just one of those things you'll have to experiment with on your target architectures.
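If you do want to experiment with the image -> local memory -> image path from the question, the copy-and-synchronize step looks something like the sketch below (the 16x16 tile size and the pass-through "operation" are placeholder assumptions):

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

    #define TILE 16   /* assumes a 16x16 work-group size */

    __kernel void process_tiled(__read_only image2d_t src,
                                __write_only image2d_t dst)
    {
        __local uint4 tile[TILE][TILE];

        int2 g  = (int2)(get_global_id(0), get_global_id(1));
        int  lx = get_local_id(0), ly = get_local_id(1);

        /* Manual copy into local memory: async_work_group_copy cannot take
           an image as its source, so each work-item fetches its own pixel. */
        tile[ly][lx] = read_imageui(src, smp, g);
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Placeholder "operation": a real kernel would now read neighbours
           out of the tile repeatedly, which is where local memory pays off. */
        uint4 result = tile[ly][lx];

        write_imageui(dst, g, result);
    }

Benchmark that against a plain read_imageui/write_imageui version with no local memory; on architectures with a good texture cache the simpler kernel may well win.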

Related

Need for CPU-GPU sync if write and read happens on different memory pages of MTLBuffer?

I am using an MTLBuffer in Metal that I created by allocating several memory pages (using vm_allocate) with
device.makeBuffer(bytesNoCopy:length:options:deallocator:).
I write the buffer with the CPU and the GPU only reads it. I know that generally I need to synchronise between CPU and GPU.
However, I know more precisely where in the MTLBuffer the writes (by the CPU) and reads (by the GPU) happen, and in my case, within a given time interval, the writes go to different memory pages than the reads.
My question: Do I need to sync between CPU and GPU even if the relevant data that is written and read are on different memory pages (but in the same MTLBuffer)? Intuitively, I would think not, but then MTLBuffer is a bit opaque and I don't really know what kind of processing/requirement the GPU actually does/has with the MTLBuffer.
Additional info: This is a question for iOS and MTLStorageMode is shared.
Thank you very much for help!
Assuming the buffer was created with MTLStorageModeManaged, you can use the function didModifyRange to sync CPU to GPU for only a portion (a page for example) of the buffer.

shared memory vs texture memory in opencl

I am writing deinterlacing code in OpenCL. I am reading the pixels using the read_imageui() API into local memory.
Just like the code at:
https://opencl-book-samples.googlecode.com/svn-history/r29/trunk/src/Chapter_19/oclFlow/lkflow.cl
As per my understanding, when we read pixels using this API we are reading from texture memory. I doubt that staging the pixels in shared (local) memory first will gain me any speed, since texture memory already acts as a cache and provides fast access to data.
Can anyone clarify my doubt?
If you can fit all your data in private memory after reading it with read_imageui, you should definitely do that. Keep in mind that you only have 256 bytes of private memory per work item if your kernel compiles SIMD16 and 512 bytes if it compiles SIMD8.
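As a rough sketch of that suggestion (the kernel name and the two-line neighbourhood are made up, not taken from the linked sample): fetch the few pixels each work-item needs with read_imageui, keep them in private variables (registers), and skip local memory entirely.

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_NEAREST;

    __kernel void deinterlace_sketch(__read_only image2d_t src,
                                     __write_only image2d_t dst)
    {
        int2 p = (int2)(get_global_id(0), get_global_id(1));

        /* The handful of pixels this work-item needs live in private
           memory (registers); no staging in local memory. */
        uint4 above = read_imageui(src, smp, p + (int2)(0, -1));
        uint4 below = read_imageui(src, smp, p + (int2)(0,  1));

        /* Placeholder: average the neighbouring lines of the other field. */
        write_imageui(dst, p, (above + below) / 2u);
    }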
Whether you should use local memory or not really depends on the access pattern. Indeed, samplers have their own L1 and L2 caches, so if your data accesses always hit the caches, you should be fine. Remember that local memory is banked, so you have 16 banks from which you can fetch 4 bytes at a time, which means that you get full bandwidth if you hit all 16 banks from all work items in one hardware thread (typically 16 or 8 of them). So, you might have a situation where you are better off reading image data into local memory first and then accessing local memory in an orderly fashion. Good examples of this are algorithms like SIFT or SURF, where you access the image in such a way that the sampler cache really does not help much (you still get sampler interpolation benefits), but then you place all that data in local memory and access it repeatedly in a fairly regular pattern.
In general, that's true. However, even a cached read from a texture might be slower than a read from shared local memory, so an algorithm that makes many overlapping reads from adjacent locations could still benefit somewhat from using shared local memory. However, it will make the kernel more complicated, so for many cases (and certainly during algorithm development) it's best to just rely on the cached texture reads.

Short question about CUDA memory access

hey there,
assuming I have a problem where each thread calculates something (reading some parameters out of constant memory and using them for the calculation) and then stores it to a global memory matrix. This matrix is never read, only written... does it make any sense to store all the calculated values in shared memory first and then later write them to global memory? I think not, because the total number of global memory writes stays the same, so the writes to shared memory would just be added on top of the writes I already had....
Thanks!
There can be, depending on the access patterns in the kernel code. Using a shared memory buffer to "stage" output can be a useful way of ensuring writes are coalesced when the naive write would not be. This was pretty crucial for performance in the first couple of generations of CUDA compatible hardware (G80/G90). On newer hardware the case for this is a lot less strong. Fermi cards have a pretty effective L1 and L2 cache scheme which can (within reason) get close to what used to be achievable only through shared memory, without any extra code.
There isn't really a general answer to this question, because it depends a lot on the specifics of what any given code does and what target hardware it is expected to run well on.
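To illustrate the staging idea (written here in OpenCL C to match the rest of this page; the CUDA version has the same shape with __shared__, __syncthreads() and blockIdx/threadIdx, and TILE plus compute_value() are invented placeholders): each thread's natural store would be strided, so the block first collects results on-chip and then writes them out contiguously.

    #define TILE 16   /* assumes a 16x16 work-group / thread block */

    uint compute_value(int row, int col)   /* stand-in for the real per-thread work */
    {
        return (uint)(row * 31 + col);
    }

    __kernel void store_transposed(__global uint *out, int n)
    {
        /* +1 padding avoids local-memory (shared-memory) bank conflicts. */
        __local uint tile[TILE][TILE + 1];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);

        /* This work-item's result belongs at out[gx * n + gy].  Storing it
           there directly would be a strided, uncoalesced write, because gx
           varies fastest across adjacent work-items. */
        tile[lx][ly] = compute_value(gx, gy);
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Re-map so that adjacent work-items write adjacent addresses. */
        int row = get_group_id(0) * TILE + ly;
        int col = get_group_id(1) * TILE + lx;
        out[row * n + col] = tile[ly][lx];
    }

On Fermi-class and later hardware with a decent L2, the naive strided store may already be close enough that this extra machinery isn't worth it, which is exactly the "it depends" above.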

Determining available video memory

When developing an OpenGL program, is there a way to poll from the system to find out just how many megabytes are available to store textures, etc?
Or is the standard approach these days just allocate memory and forget about everything?
Although the official stance remains "you don't need to know, you don't want to know, and it would not help you anyway", luckily at least two IHVs have shown a little more insight lately and offer extensions to query that information:
NVX_gpu_memory_info
ATI_meminfo
One nice thing about these extensions is that they have a least common denominator which is just what most people need, and you don't need to query extension support or do anything special, as they both work via glGetIntegerv.
In the easiest case, you can just initialize an array of 4 integers to zero (or some minimum default value that you'll assume in case the extensions don't work), then you call glGetIntegerv twice (with GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX and TEXTURE_FREE_MEMORY_ATI, respectively), and finally call glGetError to clear the error state. glGetIntegerv does not modify the pointed-to memory if it fails, nor does it crash or any other bad thing -- it merely sets the error state to GL_INVALID_ENUM.
Both extensions return a value in the first array position, the ATI one returns some values in the other 3 too.
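A minimal C sketch of that approach (the enum values are the ones from the two extension specs; both report kilobytes, and the fallback value is whatever default you want to assume):

    #include <GL/gl.h>   /* or your extension loader's header */

    /* Tokens from NVX_gpu_memory_info and ATI_meminfo, in case the
       header doesn't define them. */
    #ifndef GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX
    #define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
    #endif
    #ifndef GL_TEXTURE_FREE_MEMORY_ATI
    #define GL_TEXTURE_FREE_MEMORY_ATI 0x87FC
    #endif

    /* Returns an estimate of free video memory in KiB, or fallback_kib if
       neither extension is present.  Requires a current GL context. */
    GLint free_vidmem_kib(GLint fallback_kib)
    {
        GLint info[4] = { fallback_kib, 0, 0, 0 };

        /* If a token is unknown, glGetIntegerv leaves info[] untouched
           and only raises GL_INVALID_ENUM. */
        glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, info);
        glGetIntegerv(GL_TEXTURE_FREE_MEMORY_ATI, info);

        while (glGetError() != GL_NO_ERROR)
            ;   /* clear the error state, as described above */

        return info[0];   /* both extensions put a usable figure here */
    }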
n.b.: glAreTexturesResident has not been supported for almost a decade on mainstream hardware, in the same manner as texture priorities. The common mantra is that the driver writer knows much better than you anyway.
OpenGL doesn't give you this information. And frankly, there's little benefit, simply because today we have multitasking operating systems. The OpenGL driver is responsible for swapping texture data to/from system memory if there's demand for it.
What OpenGL can do for you is tell you whether the textures you've uploaded are still resident in fast memory. The function is called glAreTexturesResident. You can use this to gradually upload stuff to the GPU until you've filled up the GPU's memory. But keep in mind that you're not the only user of the GPU.

How does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes; it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, an implementation may very well not copy the data but just send a reference to it, or it may use a combination of both methods. As always, there is no best solution, and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.
Yes, shared state could be faster in this case, but only if you can forgo the locks, and that is only doable if it's absolutely read-only. If it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no heavier than writing a small API to access the data. Think of it like an object (in the OOP sense) that 'owns' the data. Splitting the data into chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
Even if NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from the cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared-state/locks is getting less and less feasible.
In short: get used to message passing and server processes, it's all the rage.
Edit: revisiting this answer, I want to add a note about a phrase found in Go's documentation:
share memory by communicating, don't communicate by sharing memory.
The idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference; a thread only accesses the memory when receiving the message. It relies on some measure of programmer discipline, but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
The advantage is that you don't have to copy big blocks of data on every message, and you don't have to flush caches as with some lock implementations. It's still somewhat early to say whether the style leads to higher-performance designs or not (especially since the current Go runtime is somewhat naive about thread scheduling).
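To show how little machinery the pattern needs, here is a hedged sketch in plain C with POSIX threads rather than Go (the one-slot "channel" and the work done by the consumer are invented for illustration): the producer fills a block, sends the pointer, and by convention never touches the block again.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A one-slot "channel" carrying ownership of a heap block. */
    typedef struct {
        pthread_mutex_t mu;
        pthread_cond_t  cv;
        double         *payload;      /* NULL means "empty" */
    } channel_t;

    static channel_t ch = { PTHREAD_MUTEX_INITIALIZER,
                            PTHREAD_COND_INITIALIZER, NULL };

    static void ch_send(double *block)
    {
        pthread_mutex_lock(&ch.mu);
        ch.payload = block;               /* hand over the reference...   */
        pthread_cond_signal(&ch.cv);
        pthread_mutex_unlock(&ch.mu);     /* ...and never touch it again  */
    }

    static double *ch_recv(void)
    {
        pthread_mutex_lock(&ch.mu);
        while (ch.payload == NULL)
            pthread_cond_wait(&ch.cv, &ch.mu);
        double *block = ch.payload;
        ch.payload = NULL;
        pthread_mutex_unlock(&ch.mu);
        return block;                     /* receiver now owns the memory */
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        double *data = ch_recv();         /* access only after receiving  */
        double sum = 0.0;
        for (int i = 0; i < 1000; i++)
            sum += data[i];
        printf("sum = %f\n", sum);
        free(data);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, consumer, NULL);

        double *data = malloc(1000 * sizeof *data);
        for (int i = 0; i < 1000; i++)
            data[i] = i;                  /* produce, then...                 */
        ch_send(data);                    /* ...share memory by communicating */

        pthread_join(t, NULL);
        return 0;
    }

Nothing here is copied: only the pointer crosses the "channel", and the mutex hand-off gives the consumer a happens-before edge over the producer's writes.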
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention - there's nothing but convention to prevent you from sending someone a pointer over a channel and then modifying the data pointed to - so once again there's no need to copy the message.
Most modern processors use variants of the MESI protocol. Because of the Shared state, passing read-only data between different threads is very cheap. Modified shared data is very expensive, though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
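For instance (a toy C sketch, not any particular language's implementation): with an immutable singly-linked list, a "new version" is just a new head node that shares the entire old tail, so readers holding the old version are never disturbed.

    #include <stdio.h>
    #include <stdlib.h>

    /* An immutable cons cell: once built, never modified. */
    typedef struct node {
        int                value;
        const struct node *next;
    } node;

    /* "Updating" means building a new cell that shares the old tail. */
    const node *cons(int value, const node *tail)
    {
        node *n = malloc(sizeof *n);
        n->value = value;
        n->next  = tail;
        return n;
    }

    int main(void)
    {
        const node *v1 = cons(1, cons(2, cons(3, NULL)));  /* version 1 */
        const node *v2 = cons(0, v1->next);                /* version 2 shares
                                                              nodes 2 and 3   */

        /* Both versions stay valid and can be read concurrently, lock-free. */
        for (const node *p = v1; p; p = p->next) printf("%d ", p->value);
        printf("| ");
        for (const node *p = v2; p; p = p->next) printf("%d ", p->value);
        printf("\n");   /* prints: 1 2 3 | 0 2 3 */
        return 0;
    }

(Reclaiming old versions is the hard part without a garbage collector; Erlang and Clojure lean on GC for that.)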
What is a large data structure?
One persons large is another persons small.
Last week I talked to two people. One person was making embedded devices; he used the word "large" - I asked him what it meant - he said over 256 KBytes. Later in the same week a guy was talking about media distribution; he used the word "large" - I asked him what he meant - he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.
In Erlang terms "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM, data structures > 100 MBytes might be considered large - copying a 500 MBytes data structure might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang.
Really large data structures (i.e. ones that won't fit on one machine) have to be copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem - since they are small, data processing times are fast, copying is fast and so on (just because they are small).
Big data structures are a problem - because they don't fit on one machine - so copying is essential.
Note that your questions are technically nonsensical because message passing can use shared state, so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to it out to all slaves, which perform the update on their own copies.
This is especially interesting if one wants to scale to several machines that don't even have the possibility to share memory without very artificial setups (mmap of a block device that reads/writes a remote computer's memory?).
A variant of this is to have a transaction manager that one asks nicely to update the replicated data structure, and which makes sure that it serves only one update request at a time. This is more like the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time a cleverly written multi-threaded algorithm that tries to eliminate most of the locking will be faster - and a lot faster with modern lock-free data structures. Especially when you have well-designed cache systems like Sun's Niagara chip-level multi-threading.
If your system/problem cannot easily be broken down into a few simple data accesses, then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium-based supercomputers sold: they have terabytes of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared-nothing approach make them even more maintainable.
As an example, Erlang was never designed to make things faster but instead to use a large number of lightweight processes to structure complex data and event flows.
I guess this was one of the main points in its design. In the web world of Google you usually don't care about single-machine performance - as long as it can run in parallel in the cloud. And with message passing you can ideally just add more computers without changing the source code.
Usually message-passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between processes (for local processes only, of course: you'll want to plan your network distribution pattern wisely), so this isn't much of an issue.
The other concurrency paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses

Resources