Does anybody know of a simulator that I can use to measure statistics of memory access latencies for multicore processors?
Are there such statistics (for any kind of multicore) already published somewhere?
You might try CodeAnalyst from AMD, which monitors the performance counter registers during program execution on AMD processors, including multi-core processors where applicable.
I don't know the name of Intel's equivalent product.
I am learning the concept of virtual memory, but this question has been confusing me for a while. Since most modern computers use virtual memory, when a program is executing, the OS is supposed to page data in and out between RAM and disk. So why do we still encounter "out of memory" issues? Could you please correct me if I have misunderstood the concept? I really appreciate your explanation.
PS: For example, I was analyzing a large amount of data (>100 GB) output from a simulation on a computing cluster, reading the data into a C array. Very often the system crashed and complained about a memory error.
First: Modern computers do indeed use virtual memory, but there is no magic here. Memory is not created out of nothing. Virtual memory schemes typically allow a portion of the mass-storage subsystem (i.e. the hard disk) to be used to hold the portions of a process that are (hopefully) less frequently used.
This technique allows processes to use more memory than is available as RAM. However, nothing is infinite. Eventually all RAM and hard-drive resources will be used up, and the process will get an out-of-memory error.
Second: It is not unheard of for operating systems to place a cap on the memory that a process may use. Hit that cap and, again, the process gets an out-of-memory error.
Even with virtual memory, the available memory is not unlimited:
Limit 1) Architectural limits. The processor and operating system impose some maximum virtual address space size.
Limit 2) System parameters. Many operating systems make the maximum virtual memory size configurable.
Limit 3) Process quotas. Many operating systems have process quotas that limit the maximum virtual memory size.
Limit 4) System resources. Notably, page file space.
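As an illustration, here is a minimal, hypothetical C sketch (the 100 GB figure is just borrowed from the question) showing how hitting any of these limits surfaces to a program: the allocation call fails, and the code has to check for that rather than crash later.

```
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Try to reserve far more memory than most systems can back
       with RAM plus page-file space; adjust the size to taste. */
    size_t want = (size_t)100 * 1024 * 1024 * 1024;  /* ~100 GB */
    double *data = malloc(want);

    if (data == NULL) {
        /* One of the limits above was hit: the OS refused to grow
           the process's virtual address space. */
        fprintf(stderr, "allocation of %zu bytes failed\n", want);
        return 1;
    }

    /* ... use data ... */
    free(data);
    return 0;
}
```

Note that on Linux, memory overcommit may let such an allocation appear to succeed, with the failure only surfacing once the pages are actually touched, which can look like the crashes described in the question.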
I'm a bit confused about the difference between shared memory and distributed memory. Can you clarify?
Is shared memory for one processor and distributed memory for many (communicating over a network)?
Why do we need distributed memory, if we have shared memory?
Short answer
Shared memory and distributed memory are low-level programming abstractions that are used with certain types of parallel programming. Shared memory allows multiple processing elements to share the same locations in memory (that is, to see each other's reads and writes) without any other special directives, while distributed memory requires explicit commands to transfer data from one processing element to another.
Detailed answer
There are two issues to consider regarding the terms shared memory and distributed memory. One is what they mean as programming abstractions, and the other is what they mean in terms of how the hardware is actually implemented.
In the past there were true shared-memory, cache-coherent multiprocessor systems. The processors communicated with each other and with shared main memory over a shared bus. This meant that any access from any processor to main memory had equal latency. Today these types of systems are not manufactured; instead there are various point-to-point links between processing elements and memory elements (this is the reason for non-uniform memory access, or NUMA). However, the idea of communicating directly through memory remains a useful programming abstraction, so in many systems this is handled by the hardware and the programmer does not need to insert any special directives. Some common programming techniques that use this abstraction are OpenMP and Pthreads.
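A minimal OpenMP sketch of that abstraction (purely illustrative, not tied to any particular system): every thread reads and writes the same array directly, and no explicit data transfer appears anywhere in the code.

```
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];

    /* Each thread fills its share of the same array; no explicit
       communication is needed because all threads see one memory. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[42] = %f\n", a[42]);
    return 0;
}
```

Built with something like gcc -fopenmp, the parallelism is expressed entirely through the directive; the shared memory itself is implicit.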
Distributed memory has traditionally been associated with processors performing computation on local memory and then, once that computation is done, using explicit messages to transfer data to and from remote processors. This adds complexity for the programmer, but simplifies the hardware implementation because the system no longer has to maintain the illusion that all memory is actually shared. This type of programming has traditionally been used with supercomputers that have hundreds or thousands of processing elements. A commonly used technique is MPI.
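By contrast, here is a minimal MPI sketch (again purely illustrative): each rank owns its local variables, and data only moves when it is explicitly sent and received.

```
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;  /* lives only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 cannot simply read rank 0's variable; the data must be
           transferred with an explicit message. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run with at least two ranks (e.g. mpirun -np 2) to see the explicit transfer in action.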
However, supercomputers are not the only systems with distributed memory. Another example is GPGPU programming, which is available for many desktop and laptop systems sold today. Both CUDA and OpenCL require the programmer to explicitly manage sharing between the CPU and the GPU (or other accelerator, in the case of OpenCL). This is largely because, when GPU programming started, GPU and CPU memory were separated by the PCI bus, which has a very long latency compared to performing computation on locally attached memory. So the programming models were developed assuming that the memory was separate (or distributed), and communication between the two processing elements (CPU and GPU) required explicit transfers. Now that many systems have GPU and CPU elements on the same die, there are proposals to allow GPGPU programming to have an interface that is more like shared memory.
In modern x86 terms, for example, all the CPUs in one physical computer share memory, e.g. a 4-socket system with four 18-core CPUs. Each CPU has its own memory controllers, but they talk to each other, so all the CPUs are part of one coherency domain. The system is NUMA shared memory, not distributed.
A room full of these machines forms a distributed-memory cluster, which communicates by sending messages over a network.
Practical considerations are one major reason for distributed memory: it's impractical to have thousands or millions of CPU cores sharing the same memory with any kind of coherency semantics that would make it worth calling it shared memory.
On a Linux machine, I need to count the number of read and write accesses to memory (DRAM) performed by a process. The machine has a NUMA configuration and I am binding the process to access memory from a single remote NUMA node using numactl. The process is running on CPUs in node 0 and accessing memory in node 1.
Currently, I am using perf to count the number of LLC load-miss and LLC store-miss events to serve as an estimate for read and write accesses to memory, because I guessed LLC misses would have to be served by memory accesses. Is this approach correct, i.e. are these events relevant? And are there any alternatives for obtaining the read and write access information?
Processor: Intel Xeon E5-4620
Kernel: Linux 3.9.0+
Depending on your hardware, you should be able to access performance counters located on the memory side, to count memory accesses exactly. On Intel processors, these events are called uncore events. I know that you can also count the same thing on AMD processors.
Counting LLC misses is not totally correct, because some mechanisms such as the hardware prefetcher may cause a significant number of additional memory accesses.
Regarding your hardware, unfortunately you will have to use raw events (in the perf terminology). These events can't be generalized by perf because they are processor-specific, so you will have to look into your processor's manual to find the raw encoding of the event to give to perf. For your Intel processor you should look at chapter 18.9.8, "Intel® Xeon® Processor E5 Family Uncore Performance Monitoring Facility", and chapter 19, "Performance-Monitoring Events", of the Intel software developer manual available here. In these documents you'll need the exact ID of your processor, which you can get from /proc/cpuinfo.
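If you prefer to count from inside the program rather than via the perf tool, a minimal sketch using the perf_event_open(2) syscall looks like the following. The config value is a placeholder, not a real event code; you must substitute the event/umask encoding from the manual, and note that uncore (memory-side) events usually live under a separate PMU whose type id comes from /sys/bus/event_source/devices/uncore_*/type rather than PERF_TYPE_RAW.

```
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Thin wrapper: glibc provides no prototype for this syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x0000;            /* PLACEHOLDER raw event encoding */
    attr.disabled = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code whose memory accesses you want to measure ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("event count: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```

From the command line, the same raw encoding can be passed as perf stat -e rNNNN, where NNNN is the event/umask byte pair taken from the manual.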
I know I can use ImageMagick with OpenMP to use all the cores of my CPU, but the question is: can I use all the cores of my 10-computer cluster, or will ImageMagick only use the cores of my local CPU?
Thanks a lot.
OpenMP is based on the fork-join paradigm, which is bound to shared-memory systems because the threads it forks all work in one shared address space. Thus, you won't be able to use all the cores of your cluster, since the machines don't share their memory. OpenMP-enabled programs are limited to cores that share their memory (the cores of the local machine).
There is a way around this, though, but it may not be worth it. You can simulate a virtual NUMA architecture over your 10-computer cluster and execute ImageMagick on this virtual machine. ImageMagick will then believe it runs on a single system containing all your cores. That's what ScaleMP offers with their vSMP software. But the performance gains are strongly tied to the program's memory access patterns, as accesses may go over the network in this VM, which is orders of magnitude slower than cache or RAM access. You may take a significant performance hit depending on how memory is accessed in ImageMagick.
To directly do what you ask, instead of trying to use OpenMP outside its use cases, you could use a Message-Passing Interface (MPI) framework (such as OpenMPI) to parallelize the program over multiple networked computers. That would require rewriting the parallel portion of ImageMagick to use MPI instead, which may be a pretty daunting task depending on ImageMagick's code base.
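As a rough illustration of the kind of restructuring that implies (this is a hypothetical sketch, not ImageMagick's actual code): the image would have to be split into chunks, scattered across the nodes, processed locally, and gathered back.

```
#include <stdlib.h>
#include <mpi.h>

#define WIDTH  1024
#define HEIGHT 1024   /* assume HEIGHT divides evenly by the number of ranks */

/* Hypothetical per-pixel operation standing in for an image filter. */
static unsigned char invert(unsigned char p) { return 255 - p; }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = (HEIGHT / size) * WIDTH;      /* pixels handled per rank */
    unsigned char *image = NULL;
    unsigned char *local = malloc(chunk);

    if (rank == 0)
        image = calloc(WIDTH * HEIGHT, 1);    /* full image lives on rank 0 */

    /* Distribute row blocks to the nodes, process locally, gather back. */
    MPI_Scatter(image, chunk, MPI_UNSIGNED_CHAR,
                local, chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++)
        local[i] = invert(local[i]);
    MPI_Gather(local, chunk, MPI_UNSIGNED_CHAR,
               image, chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    free(local);
    free(image);
    MPI_Finalize();
    return 0;
}
```

The hard part in practice is not the scatter/gather boilerplate but carving a real codebase's parallel regions into independent, message-passing-friendly chunks.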
My system has 32 GB of RAM, but the device information for the Intel OpenCL implementation says "CL_DEVICE_GLOBAL_MEM_SIZE: 2147352576" (~2 GB).
I was under the impression that on a CPU platform the global memory is the "normal" RAM, and thus something like 30+ GB should be available to the OpenCL CPU implementation. (Of course I'm using the 64-bit version of the SDK.)
Is there some sort of secret setting to tell the Intel OpenCL driver to increase the global memory and use all of the system memory?
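For reference, a minimal host-side sketch of how this value is typically queried (nothing Intel-specific assumed): it simply asks the first CPU device for CL_DEVICE_GLOBAL_MEM_SIZE.

```
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong global_mem = 0;

    /* Take the first platform and its first CPU device. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL) != CL_SUCCESS)
        return 1;

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    printf("CL_DEVICE_GLOBAL_MEM_SIZE: %llu bytes\n",
           (unsigned long long)global_mem);
    return 0;
}
```

Built as a 32-bit binary this reports a value capped far below the installed RAM, which is consistent with the fix described below.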
SOLVED: I got it working by recompiling everything as 64-bit. As stupid as it seems, I thought that OpenCL worked similarly to OpenGL, where you can easily allocate e.g. 8 GB of texture memory from a 32-bit process and the driver handles the details for you (of course you can't allocate 8 GB in one sweep, but you can e.g. transfer multiple textures that add up to more than 4 GB).
I still think that limiting the OpenCL memory abstraction to the address space of the process (at least for the Intel/AMD drivers) is irritating, but maybe there are some subtle details or performance tradeoffs behind why this implementation was chosen.