Is OpenCL a shared, distributed or hybrid memory system?

I'm having a hard time understanding whether OpenCL, and in particular OpenCL 2.0+, is a shared, distributed or distributed shared memory architecture, especially on a computer that has many OpenCL devices in the same PC.
In particular, I can see that it's a shared memory system in that they can all access global memory, but there's a network-like aspect to the compute units that makes me question whether it could classically be classed as a distributed shared memory architecture.

Looking at it from a generic OpenCL coding perspective, your answer is "yes, maybe, unless it's not."
If you are talking about some specific hardware, there are (somewhere) clear and concise answers of how things work on the chip(s) and how OpenCL uses them.
By examining the OpenCL capacities and capabilities at runtime, you could modify some parameters of your OpenCL program or choose one of various kernels that is the best fit.


Is it possible for multiple GPUs to work as one with more memory?

I have a deep learning workstation where there are 4 GPUs with 6 GB of memory each. Would it be possible to make a docker container see the 4 GPUs as one but with 24 GB?
Thank you.
I haven't worked with Docker before, but I have worked a lot with CUDA on multiple GPUs. Since the GPUs are physically separate, working with several of them requires a lot of memory synchronization at the code level.
I don't think Docker can virtually merge all the GPU memory, as that would make the computation very complicated on the GPU side. Working with multiple GPUs requires custom kernels that synchronize with each other.
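To make the point about separate memories concrete, here is a toy Python model of the 4 x 6 GB workstation from the question. It is only a sketch: the 10 GB request and both function names are hypothetical, and no real GPU or Docker API is involved. It shows that a single allocation larger than one device has to be split explicitly by the program, even though the combined capacity would be enough.

```python
# Toy model: four separate 6 GB devices are not one 24 GB device.
GIB = 1024 ** 3

def fits_on_single_device(tensor_bytes, device_capacities):
    """True if at least one device can hold the whole allocation by itself."""
    return any(tensor_bytes <= cap for cap in device_capacities)

def shard_across_devices(tensor_bytes, device_capacities):
    """Greedily split one allocation into per-device shards.

    Returns a list of shard sizes in bytes, one per device, or raises
    MemoryError if even the combined capacity is too small.
    """
    if tensor_bytes > sum(device_capacities):
        raise MemoryError("allocation exceeds combined device memory")
    shards, remaining = [], tensor_bytes
    for cap in device_capacities:
        take = min(cap, remaining)
        shards.append(take)
        remaining -= take
    return shards

devices = [6 * GIB] * 4   # the 4 x 6 GB workstation from the question
request = 10 * GIB        # hypothetical allocation larger than any one GPU

single_ok = fits_on_single_device(request, devices)
shards = shard_across_devices(request, devices)
print(single_ok)                    # False
print([s // GIB for s in shards])   # [6, 4, 0, 0]
```

Once the data is split like this, every kernel that touches it has to know which shard lives where, which is the code-level synchronization burden described above.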
The best analogy I can offer is: "Can you get two bare-metal computers to merge their RAM and run Microsoft Word as if they were a single machine?"
Short answer: No.
Alternate answer: Yes, but it requires additional hardware that is expensive and probably incompatible with your existing hardware.
It is possible if your GPUs are connected using NVIDIA NVLink (take a look at the details here).
Usually NVLink is used for pairs of GPUs, e.g. GPU0 connected with GPU1 and GPU2 connected with GPU3; in that case the best you can obtain is 2 GPUs with doubled memory each.
Another option is a special InfiniBand module, installed in modern GPU servers by some vendors.

CUDA 'memory bound' vs 'latency bound' vs 'bandwidth bound' vs 'compute bound'

In the many resources online it is possible to find different usages of 'memory', 'bandwidth' and 'latency' bound kernels. It seems to me that authors sometimes use their own definitions of these terms, and I think it would be very beneficial for someone to make a clear distinction.
To my understanding:
Bandwidth bound kernels approach the physical limits of the device in terms of access to global memory. E.g. an application uses 170GB/s out of 177GB/s on an M2090 device.
A latency bound kernel is one whose predominant stall reason is due to memory fetches. So we are not saturating the global memory bus, but still have to wait to get the data into the kernel.
A compute bound kernel is one in which computation dominates the kernel time, under the assumption that there is no problem feeding the kernel with memory, and there is good overlap of arithmetic and latency.
If I got these correct, what would a 'memory bound' kernel be? Is there ambiguity, and if yes, should we limit the conversation to the three above terms?
what would a 'memory bound' kernel be?
Memory bound refers to the general case where a code is limited by memory access, i.e. it includes codes that are latency bound and codes that are bandwidth bound. You've defined pretty much all the other terms correctly.
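The four terms can be put into a small decision rule. The Python sketch below is only an illustration of the definitions in this thread: the 0.8 saturation threshold, the function name and the GFLOP/s figures are assumptions made for the example, and "neither unit saturated" is treated as latency bound, which is a simplification of what a profiler's stall reasons would tell you.

```python
def classify_kernel(achieved_gbps, peak_gbps, achieved_gflops, peak_gflops,
                    saturation=0.8):
    """Label a kernel using the definitions from this thread.

    `saturation` is an arbitrary illustrative threshold, not a standard value.
    """
    bw_util = achieved_gbps / peak_gbps
    flop_util = achieved_gflops / peak_gflops
    if flop_util >= saturation:
        return "compute bound"
    if bw_util >= saturation:
        return "memory bound (bandwidth bound)"
    # Neither the memory bus nor the arithmetic units are saturated,
    # so the kernel is assumed to be mostly waiting on memory fetches.
    return "memory bound (latency bound)"

# 170 of 177 GB/s is the example from the question; the GFLOP/s numbers
# are made up for illustration.
print(classify_kernel(170, 177, 100, 1000))  # memory bound (bandwidth bound)
print(classify_kernel(40, 177, 100, 1000))   # memory bound (latency bound)
print(classify_kernel(40, 177, 900, 1000))   # compute bound
```

Both of the first two results carry the "memory bound" label, which is the sense in which it is the umbrella term.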
Is there ambiguity, and if yes, should we limit the conversation to the three above terms?
I don't think there's much ambiguity (you've clearly demarcated 3 of the 4 terms, anyway), and you're not going to impose order on the world in a SO question/answer.

Calling BLAS routines inside OpenCL kernels

Currently I am implementing some image processing algorithms using OpenCL. Basically, my algorithm requires solving a linear system of equations for each pixel. Each system is independent of the others, so going for a parallel implementation is natural.
I have looked at several BLAS packages such as ViennaCL and AMD APPML, but it seems all of them have the same use pattern (host calling BLAS subroutines to be executed on CL device).
What I need is a BLAS library that could be called inside an OpenCL kernel so that I can solve many linear systems in parallel.
I found this similar question on the AMD forums.
Calling APPML BLAS functions from the kernel
It's not possible. clBLAS routines make a series of kernel launches, and the kernel launches of some 'solve' routines are really complicated. clBLAS routines take cl_mem and commandQueues as args, so if your buffer is already on the device, clBLAS will act on it directly. It doesn't accept host buffers or manage host->device transfers.
If you want to have a look at which kernels are generated and launched, uncomment this line and build clBLAS. It will dump all kernels being called.
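Since the library routines cannot be called from inside a kernel, one possible workaround (my assumption, not something the answer above proposes) is for each work-item to run a small hand-written solver on its own pixel's system. The Python sketch below models that idea on the host: `solve_small_system` and `solve_per_pixel` are hypothetical names, the solver is plain Gaussian elimination with partial pivoting, and the two 2x2 systems are made-up test data. A real kernel would need the same logic written in OpenCL C.

```python
def solve_small_system(a, b):
    """Solve a x = b for a small dense system using Gaussian elimination
    with partial pivoting. `a` is a list of rows, `b` a list of values."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def solve_per_pixel(systems):
    """Each (a, b) pair is independent, like one work-item per pixel."""
    return [solve_small_system(a, b) for a, b in systems]

pixels = [
    ([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]),
    ([[4.0, 0.0], [0.0, 2.0]], [8.0, 2.0]),
]
solutions = solve_per_pixel(pixels)
print(solutions)
```

Because no system depends on another, the loop in `solve_per_pixel` is exactly the part that would map onto parallel work-items.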

OpenCL compliant DSP

On the Khronos website, OpenCL is said to be open to DSPs. But when I look at the websites of DSP manufacturers such as Texas Instruments, Freescale, NXP or Analog Devices, I can't find any mention of OpenCL.
So does anyone know whether an OpenCL-compliant DSP exists?
Edit: As this question seems surprising, I add the reason why I asked it. From the page:
"OpenCL 1.0 at a glance
OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs"
So I think it would be interesting to know whether it's true, i.e. whether DSPs, which are particularly suited to some complex calculations, can really be programmed using OpenCL.
The OpenCL spec seems to support using a chip that has one or more programmable GPU shader cores as an expensive DSP. It does not appear that the spec makes many allowances for DSP chips that were not designed to support being used as a programmable GPU shader in a graphics pipeline.
I finally found one: the SNU-Samsung OpenCL Framework is able to use Texas Instruments C64x DSPs. More info here:

Why is MPI considered harder than shared memory and Erlang considered easier, when they are both message-passing?

There's a lot of interest these days in Erlang as a language for writing parallel programs on multicore. I've heard people argue that Erlang's message-passing model is easier to program than the dominant shared-memory models such as threads.
Conversely, in the high-performance computing community the dominant parallel programming model has been MPI, which also implements a message-passing model. But in the HPC world, this message-passing model is generally considered very difficult to program in, and people argue that shared memory models such as OpenMP or UPC are easier to program in.
Does anybody know why there is such a difference in the perception of message-passing vs. shared memory in the IT and HPC worlds? Is it due to some fundamental difference in how Erlang and MPI implement message passing that makes Erlang-style message-passing much easier than MPI? Or is there some other reason?
I agree with all previous answers, but I think a key point that is not made totally clear is that one reason that MPI might be considered hard and Erlang easy is the match of model to the domain.
Erlang is based on a concept of local memory, asynchronous message passing, and shared state solved by using some form of global database that all threads can get to. It is designed for applications that do not move a whole lot of data around and that are not supposed to explode out to 100k separate nodes that need coordination.
MPI is based on local memory and message passing, and is intended for problems where moving data around is a key part of the domain. High-performance computing is very much about taking the dataset for a problem and splitting it up among a host of compute resources. That is pretty hard work in a message-passing system, as data has to be explicitly distributed with balancing in mind. Essentially, MPI can be viewed as a grudging admission that shared memory does not scale, and it targets high-performance computation spread across 100k processors or more.
Erlang is not trying to achieve the highest possible performance, but rather to decompose a naturally parallel problem into its natural threads. It was designed with a totally different type of programming task in mind compared to MPI.
So Erlang is best compared to pthreads and other rather local heterogeneous thread solutions, rather than MPI which is really aimed at a very different (and to some extent inherently harder) problem set.
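The "explicitly distributed with balancing in mind" part can be illustrated with a small Python sketch. It is not MPI code: `block_partition` is a hypothetical helper, and the list comprehension merely stands in for what separate ranks would each do on their own slice before the partial results are combined through explicit messages.

```python
def block_partition(n_items, n_ranks):
    """Return (start, stop) index ranges, one per rank, whose sizes
    differ by at most one item."""
    base, extra = divmod(n_items, n_ranks)
    ranges, start = [], 0
    for rank in range(n_ranks):
        # The first `extra` ranks each take one additional item.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

data = list(range(10))
parts = block_partition(len(data), 4)

# Each "rank" reduces only its own slice; summing the partial results
# stands in for an explicit gather or reduce step.
partial_sums = [sum(data[lo:hi]) for lo, hi in parts]
total = sum(partial_sums)

print(parts)         # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(partial_sums)  # [3, 12, 13, 17]
print(total)         # 45
```

In a shared-memory model the partitioning bookkeeping can often be left to the runtime; in a message-passing model the programmer owns it, which is a large part of the perceived difficulty.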
Parallelism in Erlang is still pretty hard to implement. By that I mean that you still have to figure out how to split up your problem, but there are a few minor things that ease this difficulty when compared to some MPI library in C or C++.
First, since Erlang's message-passing is a first-class language feature, the syntactic sugar makes it feel easier.
Also, Erlang libraries are all built around Erlang's message passing. This support structure helps give you a boost into parallel-processing land. Take a look at the components of OTP like gen_server, gen_fsm, gen_event. These are very easy to use structures that can help your program become parallel.
I think it's more the robustness of the available standard library that differentiates Erlang's message passing from other MPI implementations, not really any specific feature of the language itself.
Usually concurrency in HPC means working on large amounts of data. This kind of parallelism is called data parallelism and is indeed easier to implement using a shared memory approach like OpenMP, because the operating system takes care of things like scheduling and placement of tasks, which one would have to implement oneself if using a message passing paradigm.
In contrast, Erlang was designed to cope with task parallelism encountered in telephone systems, where different pieces of code have to be executed concurrently with only a limited amount of communication and strong requirements for fault tolerance and recovery.
This model is similar to what most people use PThreads for. It fits applications like web servers, where each request can be handled by a different thread, while HPC applications do pretty much the same thing on huge amounts of data which also have to be exchanged between workers.
I think it has something to do with the mindset when you're programming with MPI and when you're programming with Erlang. For instance, MPI is not built into the language, whereas Erlang has built-in support for message passing. Another possible reason is the disconnect between merely sending/receiving messages and partitioning solutions into concurrent units of execution.
With Erlang you are forced to think in a functional programming frame where data actually zips by from function call to function call -- and receiving is an active act which looks like a normal construct in the language. This gives you a closer connection between the computation you're actually performing and the act of sending/receiving messages.
With MPI on the other hand you are forced to think merely about the actual message passing but not really the decomposition of work. This frame of thinking requires somewhat of a context switch between writing the solution and the messaging infrastructure in your code.
The discussion can go on but the common view is that if the construct for message passing is actually built into the programming language and paradigm that you're using, usually that's a better means of expressing the solution compared to something else that is "tacked on" or exists as an add-on to a language (in the form of a library or extension).
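As a rough illustration of the mailbox style discussed here, the following Python sketch uses a thread with two `queue.Queue` objects. It only mimics the shape of the model: Python threads share memory, the `worker` function and the "stop" message are made up for the example, and the messaging comes from a library rather than from the language, which is itself an instance of the "tacked on" situation described above.

```python
import queue
import threading

def worker(mailbox, replies):
    """Receive messages until a 'stop' message arrives, replying to each."""
    while True:
        message = mailbox.get()   # blocking receive from the mailbox
        if message == "stop":
            break
        replies.put(message * 2)  # reply by sending a message back

mailbox, replies = queue.Queue(), queue.Queue()
thread = threading.Thread(target=worker, args=(mailbox, replies))
thread.start()

for value in (1, 2, 3):
    mailbox.put(value)            # asynchronous send: does not wait for a reply
mailbox.put("stop")
thread.join()

results = [replies.get() for _ in range(3)]
print(results)  # [2, 4, 6]
```

The worker never touches the sender's variables directly; everything it knows arrives through its mailbox, which is the property the Erlang answers credit for making concurrent code easier to reason about.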
Does anybody know why there is such a difference in the perception of message-passing vs. shared memory in the IT and HPC worlds? Is it due to some fundamental difference in how Erlang and MPI implement message passing that makes Erlang-style message-passing much easier than MPI? Or is there some other reason?
The reason is simply parallelism vs concurrency. Erlang is bred for concurrent programming. HPC is all about parallel programming. These are related but different objectives.
Concurrent programming is greatly complicated by heavily non-deterministic control flow and latency is often an important objective. Erlang's use of immutable data structures greatly simplifies concurrent programming.
Parallel programming has much simpler control flow and the objective is all about maximal total throughput and not latency. Efficient cache usage is much more important here, which renders both Erlang and immutable data structures largely unsuitable. Mutating shared memory is both tractable and substantially better in this context. In effect, cache coherence is providing hardware-accelerated message passing for you.
Finally, in addition to these technical differences there is also a political issue. The Erlang guys are trying to ride the multicore hype by pretending that Erlang is relevant to multicore when it isn't. In particular, they are touting great scalability so it is essential to consider absolute performance as well. Erlang scales effortlessly from poor absolute performance on one core to poor absolute performance on any number of cores. As you can imagine, that does not impress the HPC community (but it is adequate for a lot of heavily concurrent code).
Regarding MPI vs OpenMP/UPC: MPI forces you to slice the problem into small pieces and take responsibility for moving data around. With OpenMP/UPC, "all the data is there", you just have to dereference a pointer. The MPI advantage is that 32-512 CPU clusters are much cheaper than 32-512 CPU single machines. Also, with MPI the expense is upfront, when you design the algorithm. OpenMP/UPC can hide the latencies that you'll get at runtime, if your system uses NUMA (and all big systems do) - your program won't scale and it will take a while to figure out why.
This article actually explains it well: Erlang is best when we are sending small pieces of data around, and MPI does much better on more complex things. Also, the Erlang model is easy to understand :-)
Erlang Versus MPI - Final Results and Source Code
