ImageMagick and OpenMP, use all cpu os a cluster? - imagemagick

I know i can use ImageMagick with OpenMP to use all cores of my CPU, but the question is: Can i use all the cores of my 10 computer cluster? or imagemagick will only use the cores of my local CPU?
Thanks a lot.

OpenMP is based on the fork-join paradigm which is bound to shared-memory systems because of the very nature of the fork function. Thus, you won't be able to use all the cores of your cluster since they don't share their memory. OpenMP-enabled programs are then limited to cores that share their memory (cores from the local CPU).
There is a way around this, though, but it may not be worth it. You can simulate a virtual NUMA architecture over your 10 computer cluster and execute ImageMagick on this virtual machine. ImageMagick will hence believe it runs on a single system containing all your cores. That's what ScaleMP offers with their vSMP software. But the performance gains are strongly related to the program's memory accesses, as they might occur over network in this VM, which is magnitudes slower than cache or RAM access. You may get a significant performance hit depending on how memory is accessed in ImageMagick.
In order to directly perform what you ask, instead of trying to use OpenMP outside its use cases, you could use a Message-Passing Interface (MPI) framework (such as OpenMPI) to parallelize the program over multiple networked computers. That would require to rewrite the parallel portion of ImageMagick to use OpenMPI instead, which may be a pretty daunting task depending on ImageMagick's code base.

Related

Why would one chose many smaller machine types instead of fewer big machine types?

In a clustering high-performance computing framework such as Google Cloud Dataflow (or for that matter even Apache Spark or Kubernetes clusters etc), I would think that it's far more performant to have fewer really BIG machine types rather than many small machine types, right? As in, it's more performant to have 10 n1-highcpu-96 rather than say 120 n1-highcpu-8 machine types, because
the cpus can use shared memory, which is way way faster than network communications
if a single thread needs access to lots of memory for a single threaded operation (eg sort), it has access to that greater memory in a BIG machine rather than a smaller one
And since the price is the same (eg 10 n1-highcpu-96 costs the same as 120 n1-highcpu-8 machine types), why would anyone opt for the smaller machine types?
As well, I have a hunch that for the n1-highcpu-96 machine type, we'd occupy the whole host, so we don't need to worry about competing demands on the host by another VM from another Google cloud customer (eg contention in the CPU caches
or motherboard bandwidth etc.), right?
Finally, although I don't think the google compute VMs correctly report the "true" CPU topology of the host system, if we do chose the n1-highcpu-96 machine type, the reported CPU topology may be a touch closer to the "truth" because presumably the VM is using up the whole host, so the reported CPU topology is a little closer to the truth, so any programs (eg the "NUMA" aware option in Java?) running on that VM that may attempt to take advantage of the topology has a better chance of making the "right decisions".
It will depend on many factors if you want to choose many instances with smaller machine type or a few instances with big machine types.
The VMs sizes differ not only in number of cores and RAM, but also on network I/O performance.
Instances with small machine types have are limited in CPU and I/O power and are inadequate for heavy workloads.
Also, if you are planning to grow and scale it is better to design and develop your application in several instances. Having small VMs gives you a better chance of having them distributed across physical servers in the datacenter that have the best resource situation at the time the machines are provisioned.
Having a small number of instances helps to isolate fault domains. If one of your small nodes crashes, that only affects a small number of processes. If a large node crashes, multiple processes go down.
It also depends on the application you are running on your cluster and the workload.I would also recommend going through this link to see the sizing recommendation for an instance.

Main difference between Shared memory and Distributed memory

I'm a bit confused between about the difference between shared memory and distributed memory. Can you clarify?
Is shared memory for one processor and distributed for many (for network)?
Why do we need distributed memory, if we have shared memory?
Short answer
Shared memory and distributed memory are low-level programming abstractions that are used with certain types of parallel programming. Shared memory allows multiple processing elements to share the same location in memory (that is to see each others reads and writes) without any other special directives, while distributed memory requires explicit commands to transfer data from one processing element to another.
Detailed answer
There are two issues to consider regarding the terms shared memory and distributed memory. One is what do these mean as programming abstractions, and the other is what do they mean in terms of how the hardware is actually implemented.
In the past there were true shared memory cache-coherent multiprocessor systems. The systems communicated with each other and with shared main memory over a shared bus. This meant that any access from any processor to main memory would have equal latency. Today these types of systems are not manufactured. Instead there are various point-to-point links between processing elements and memory elements (this is the reason for non-uniform memory access, or NUMA). However, the idea of communicating directly through memory remains a useful programming abstraction. So in many systems this is handled by the hardware and the programmer does not need to insert any special directives. Some common programming techniques that use these abstractions are OpenMP and Pthreads.
Distributed memory has traditionally been associated with processors performing computation on local memory and then once it using explicit messages to transfer data with remote processors. This adds complexity for the programmer, but simplifies the hardware implementation because the system no longer has to maintain the illusion that all memory is actually shared. This type of programming has traditionally been used with supercomputers that have hundreds or thousands of processing elements. A commonly used technique is MPI.
However, supercomputers are not the only systems with distributed memory. Another example is GPGPU programming which is available for many desktop and laptop systems sold today. Both CUDA and OpenCL require the programmer to explicitly manage sharing between the CPU and the GPU (or other accelerator in the case of OpenCL). This is largely because when GPU programming started the GPU and CPU memory was separated by the PCI bus which has a very long latency compared to performing computation on the locally attached memory. So the programming models were developed assuming that the memory was separate (or distributed) and communication between the two processing elements (CPU and GPU) required explicit communication. Now that many systems have GPU and CPU elements on the same die there are proposals to allow GPGPU programming to have an interface that is more like shared memory.
In modern x86 terms, for example, all the CPUs in one physical computer share memory. e.g. 4-socket system with four 18-core CPUs. Each CPU has its own memory controllers, but they talk to each other so all the CPUs are part of one coherency domain. The system is NUMA shared memory, not distributed.
A room full of these machines form a distributed-memory cluster which communicates by sending messages over a network.
Practical considerations are one major reasons for distributed memory: it's impractical to have thousands or millions of CPU cores sharing the same memory with any kind of coherency semantics that make it worth calling it shared memory.

How do I increase the "global memory" available to the Intel CPU OpenCL driver?

My system has 32GB of ram, but the device information for the Intel OpenCL implementation says "CL_DEVICE_GLOBAL_MEM_SIZE: 2147352576" (~2GB).
I was under the impression that on a CPU platform the global memory is the "normal" ram and thus something like ~30+GB should be available to the OpenCL CPU implementation. (ofcourse I'm using the 64bit version of the SDK)
Is there some sort of secret setting to tell the Intel OpenCL driver to increase global memory and use all the system memory ?
SOLVED: Got it working by recompiling everything to 64bit. Quite stupid as it seems, but I thought that OpenCL was working similar to OpenGL, where you can easily allocate e.g. 8GB texture memory from a 32bit process and the driver handles the details for you (ofcourse you can't allocate 8GB in one sweep, but e.g. transfer multiple textures that add up to more that 4GB).
I still think that limiting the OpenCL memory abstraction to the adress space of the process (at least for intel/amd drivers) is irritating, but maybe there are some subtle details or performance tradeoff why this implementation was chosen.

MPI/Pthread program does not scale

I have a MPI/Pthread program in which each MPI process will be running on a separate computing node. Within each MPI process, certain number of Pthreads (1-8) are launched. However, no matter how many Pthreads are launched within a MPI process, the overall performance is pretty much the same. I suspect all the Pthreads are running on the same CPU core. How can I assign threads to different CPU cores?
Each computing node has 8 cores.(two Quad core Nehalem processors)
Open MPI 1.4
Linux x86_64
Questions like this are often dependent on the problem at hand. Most likely, you are running into a resource lock issue (where the threads are competing for a lock) -- this would look like only one core was doing any work, because only one thread can (effectively) do any work at any given time.
Setting CPU affinity for a certain thread is not a good solution. You should allow for the OS scheduler to optimally determine the physical core assignment for a given pthread.
Look at your code and try to figure out where you are locking where you shouldn't be, or if you've come up with a correct parallel solution to the problem at hand. You should also test a version of the program using only pthreads (not MPI) and see if scaling is achieved.

Get the number of cores in Erlang with Linux

i am writing a concurrent program and i need to know the number of cores of the system so then the program will know how many processes to open.
Is there command to get this inside Erlang code?
Thnx.
You can use
erlang:system_info(logical_processors_available)
to get the number of cores that can be used by the erlang runtime system.
There is also:
erlang:system_info(schedulers_online)
which tells you how many scheduler threads are actually running.
To get the number of available cores, use the logical_processors flag to erlang:system_info/1:
1> erlang:system_info(logical_processors).
8
There are two companion flags to this one: logical_processors_online shows how many are in use, and logical_processors_available show how many are available (it will return unknown when all logical processors available are online).
To know how to parallelize your code, you should rely on schedulers_online which will return the number of actual Erlang schedulers that are available in your current VM instance:
1> erlang:system_info(schedulers_online).
8
Note however that parallelizing on this value alone might not be enough. Sometimes you have other processes running that need some CPU time and sometimes your algorithm would benefit from even more parallelism (waiting on IO for example). A rule of thumb is to use the value obtained from schedulers_online as a multiplier for parallelism, but always test with different multiples to see what works best for your application.
How this information is exposed will be very operating system specific (unless you happen to be writing an operating system of course).
You didn't say what operating system you're working on. In the case of Linux, you can get the data from /proc/cpuinfo, however there are subtleties with the meaning of hyperthreading and the issue of multiple cores on the same die using a shared L2 cache (effectively you've got a NUMA architecture).

Resources