I am working with CMA reserved memory and getting high latency.
Is there any way to reduce CMA allocation latency?
The latency is probably due to the kernel migrating movable pages (block device caches and the like) out of the CMA region in order to obtain a large enough contiguous block for your request. One way to avoid the latency is to reserve a specific memory region for your device in the device tree, so it doesn't have to allocate from the global CMA pool.
The following assumes you are developing an embedded system that uses device trees and can modify driver code to make use of a reserved memory region.
If reserving a memory region specifically for your device, you can reserve the region statically or dynamically (letting the kernel determine the start address of the region). See reserved-memory.txt for details.
Set the no-map property to stop the kernel mapping the memory as part of its standard system memory.
DO NOT set the reusable property if you do not want the memory to be used for temporary storage by other parts of the kernel (block page cache etc.).
If you want to allocate coherent DMA memory from the buffer, set compatible = "shared-dma-pool";. Although the pool is "shared", it is only shared by devices that specifically use this reserved memory region.
In your device node, set the memory-region property as a phandle referring to the reserved memory region.
For example:
/ {
	/* ... */

	reserved-memory {
		#address-cells = <1>;
		#size-cells = <1>;
		ranges;

		my_dev_reserved: region_my_dev_buffer {
			compatible = "shared-dma-pool";
			no-map;
			size = <0x02000000>;   /* 32 MiB size */
			alignment = <0x1000>;  /* 4 KiB alignment */
		};
	};

	/* ... */

	my_dev: mydev@xxxx {
		compatible = "foo,barbaz-1.0";
		/* ... */
		memory-region = <&my_dev_reserved>;
		/* ... */
	};

	/* ... */
};
In your device driver's probe function, you can call of_reserved_mem_device_init(hwdev); (where hwdev is a pointer to the struct device for your hardware device, usually embedded inside some other structure such as struct platform_device). A return value of 0 indicates that the device will use the device-specific reserved memory for coherent DMA allocations, provided the region's compatible string is "shared-dma-pool"; otherwise coherent DMA allocations fall back to the global pool.
If of_reserved_mem_device_init(hwdev); was successful, you should call of_reserved_mem_device_release(hwdev); in your device driver's remove function to release the resources. (You may also need to call it in the error clean-up code of your probe function.)
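For example, a minimal probe/remove sketch (assuming a platform driver; the mydev names are illustrative, not from any real driver):

#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/of_reserved_mem.h>

static int mydev_probe(struct platform_device *pdev)
{
	int ret;

	/* Attach the device to the region named by its memory-region property. */
	ret = of_reserved_mem_device_init(&pdev->dev);
	if (ret)
		return ret;

	/* ... map registers, set DMA masks, etc.; on any later failure,
	 * call of_reserved_mem_device_release(&pdev->dev) before returning. */

	return 0;
}

static int mydev_remove(struct platform_device *pdev)
{
	of_reserved_mem_device_release(&pdev->dev);
	return 0;
}

After of_reserved_mem_device_init() succeeds, dma_alloc_coherent() on &pdev->dev draws from the device-specific reserved region instead of the global pool.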
I have a target (STM32F030R8) I am using with FreeRTOS and the newlib reentrant heap implementation (http://www.nadler.com/embedded/newlibAndFreeRTOS.html). This shim defines sbrk in addition to implementing the actual FreeRTOS memory allocation shim. The project is built with GNU ARM GCC, using --specs=nosys.specs.
For the following example, FreeRTOS has not been started yet and malloc has not been called before. The MCU is fresh off boot: it has only just copied initialized data from flash into RAM and configured the clocks and basic peripherals.
Simply by having the lib in my project, I am seeing sbrk being called with very large increment values for seemingly small malloc calls.
The target has 8 KB of RAM, of which 0x12b8 bytes (~4.7 KB) lie between the start of the heap and the end of RAM (the top of the stack).
I am seeing that if I allocate 1000 bytes with str = (char*) malloc(1000);, sbrk gets called twice: first with an increment of 0x07e8, then again with an increment of 0x0c60. The resulting total increment is 0x1448 (5192 bytes!), which of course overflows not just the stack but all available RAM.
What on earth is going on here? Why are these huge increment values being used by malloc for such a relatively small desired buffer allocation?
I think it may not be possible to answer this definitively, only to advise on debugging. The simplest approach is to step through the allocation code to determine where and why the allocation size request is being corrupted (as it appears to be). You will of course need to build the library from source, or at least include mallocr.c in your code to override any static library implementation.
In Newlib Nano the call-stack to _sbrk_r is rather simple (compared to regular Newlib). The increment is determined from the allocation size s in nano_malloc() in nano-mallocr.c as follows:
malloc_size_t alloc_size;
alloc_size = ALIGN_TO(s, CHUNK_ALIGN); /* size of aligned data load */
alloc_size += MALLOC_PADDING; /* padding */
alloc_size += CHUNK_OFFSET; /* size of chunk head */
alloc_size = MAX(alloc_size, MALLOC_MINCHUNK);
Then, if a chunk of alloc_size is not found on the free list (always the case when free() has not yet been called), sbrk_aligned(alloc_size) is called, which in turn calls _sbrk_r(). None of the padding, alignment, or minimum-allocation macros adds anywhere near such a large amount.
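For instance, assuming typical 32-bit values (a chunk header of one or two words and 8-byte alignment; I have not verified these against your build), s = 1000 should yield an alloc_size of roughly 1008 bytes, nowhere near the 0x07e8 (2024) of the first call you observed.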
sbrk_aligned(alloc_size) calls _sbrk_r() a second time if the memory returned by the first call is not aligned. That second call should never request more than CHUNK_ALIGN - sizeof(void*) bytes.
In your debugger you should be able to inspect the call stack, or step through the calls, to see where the parameter becomes incorrect.
I am working on a PCIe-based network driver. Different examples use one of pci_alloc_consistent or dma_alloc_coherent to get memory for the transmit and receive descriptors. Which one is better, if any, and what is the difference between the two?
The difference is subtle but quite important.
pci_alloc_consistent() is the older function of the two and legacy drivers still use it.
Nowadays, pci_alloc_consistent() just calls dma_alloc_coherent().
The difference? The GFP flags used for the allocation.
pci_alloc_consistent() allocates memory with GFP_ATOMIC: the allocation does not sleep, making it safe in, e.g., interrupt handlers and bottom halves.
dma_alloc_coherent() lets you specify yourself what type of allocation to request. You should not use the high-priority GFP_ATOMIC unless you need it; in most cases you will be fine with GFP_KERNEL.
The kernel 3.18 definition of pci_alloc_consistent() is very simple, namely:

static inline void *
pci_alloc_consistent(struct pci_dev *hwdev, size_t size,
		     dma_addr_t *dma_handle)
{
	return dma_alloc_coherent(hwdev == NULL ? NULL : &hwdev->dev,
				  size, dma_handle, GFP_ATOMIC);
}
In short, use dma_alloc_coherent().
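For instance, here is a hedged sketch of a descriptor-ring allocation in a driver's probe path, where sleeping is allowed and GFP_KERNEL is therefore appropriate (struct my_desc, NUM_DESC, and the variable names are illustrative, not from any real driver):

#include <linux/pci.h>
#include <linux/types.h>
#include <linux/dma-mapping.h>

#define NUM_DESC 256

struct my_desc {                  /* illustrative descriptor layout */
	__le64 buf_addr;
	__le32 len_flags;
	__le32 status;
};

static struct my_desc *ring;      /* CPU virtual address of the ring */
static dma_addr_t ring_dma;       /* bus address to program into the NIC */

static int my_alloc_ring(struct pci_dev *pdev)
{
	ring = dma_alloc_coherent(&pdev->dev, NUM_DESC * sizeof(*ring),
				  &ring_dma, GFP_KERNEL);
	if (!ring)
		return -ENOMEM;
	return 0;
}

Free it with dma_free_coherent(&pdev->dev, NUM_DESC * sizeof(*ring), ring, ring_dma); when the device is torn down.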
Quick curious question: are memory allocation addresses chosen by the language compiler, or is it the OS that chooses the addresses for the memory requested?
This comes from a doubt about virtual memory, which can be quickly summed up as "let the process think it owns all the memory". But what happens on 64-bit architectures, where only 48 bits are used for memory addresses, if the process wants a higher address?
Let's say you do int *a = malloc(sizeof(int)); and there is no memory left from the previous system call, so you need to ask the OS for more. Is the compiler the one that determines the memory address at which this variable is allocated, or does it just ask the OS for memory, which is then allocated at whatever address the OS returns?
It would not be the compiler, especially since this is dynamic memory allocation. Compilation is done well before you actually execute your program.
Memory reservation for static variables happens at compile time, but the actual allocation of static memory happens at start-up, before the user-defined main.
Static variables can be given space in the executable file itself, which is then memory-mapped into the process address space. This is one of the few times(?) I can imagine the compiler actually "deciding" on an address.
During dynamic memory allocation, your program asks the OS for some memory, and it is the OS that returns a memory address. That address is then stored in a pointer.
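As a concrete illustration (a POSIX-specific sketch; the C standard itself says nothing about how the library obtains memory), passing NULL as the address hint to mmap lets the kernel pick where the mapping goes:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* NULL hint: the OS, not the program, chooses the address. */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	printf("the OS placed the mapping at %p\n", p);
	munmap(p, 4096);
	return 0;
}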
Dynamic memory allocation in C/C++ is done by runtime library functions. Those functions can do pretty much as they please as long as their behavior is standards-compliant. A trivial, compliant but useless implementation of malloc() looks like this:
void * malloc(size_t size) {
	return NULL;
}
The requirements are fairly relaxed -- the pointer has to be suitably aligned, and the pointers must be unique unless they've been previously free()d. You could have a rather silly, but somewhat portable and absolutely not thread-safe, memory allocator written the way below. There, the addresses come from a pool whose location was decided upon by the compiler.
#include "stdint.h"
// 1/4 of available address space, but at most 2^30.
#define HEAPSIZE (1UL << ( ((sizeof(void*)>4) ? 4 : sizeof(void*)) * 2 ))
// A pseudo-portable alignment size for pointerĊbwitary types. Breaks
// when faced with SIMD data types.
#define ALIGNMENT (sizeof(intptr_t) > sizeof(double) ? sizeof(intptr_t) : siE 1Azeof(double))
void * malloc(size_t size)
{
static char buffer[HEAPSIZE];
static char * next = NULL;
void * result;
if (next == NULL) {
uintptr_t ptr = (uintptr_t)buffer;
ptr += ptr % ALIGNMENT;
next = (char*)ptr;
}
if (size == 0) return NULL;
if (next-buffer > HEAPSIZE-size) return NULL;
result = next;
next += size;
next += size % ALIGNMENT;
return result;
}
void free(void * ptr)
{}
Practical memory allocators don't depend upon such static memory pools, but rather call the OS to provide them with newly mapped memory.
The proper way of thinking about it is: you don't know what particular pointer you are going to get from malloc(). You can only know that it's unique and points to properly aligned memory if you've called malloc() with a non-zero argument. That's all.
I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction is to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs, and then cast that value to int. This gives the code 32-bit memory transactions per thread, allowing for memory coalescing and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block read the same 32-bit word. On non-Fermi GPUs, you could probably use a shared-memory scheme to efficiently retrieve all the values for a block using coalesced reads.
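A minimal sketch of that idea (the kernel and its names are illustrative; it assumes the element count is even and that the host short array was copied verbatim into a short2 allocation of half as many elements):

// Each thread loads one short2 (a single 32-bit transaction) and widens
// both packed values to int.
__global__ void widen_short2_to_int(const short2 *in, int *out, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        short2 v = in[i];            // one coalesced 32-bit load
        out[2 * i]     = (int)v.x;   // first packed short
        out[2 * i + 1] = (int)v.y;   // second packed short
    }
}

The host-side copy stays exactly as in the question (100*sizeof(short) bytes); only the device pointer is declared short2* and sized as 50 elements, and the kernel then expands into a separate int buffer.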
I'm working on a Linux kernel driver that makes a chunk of physical memory available to user space. I have a working version of the driver, but it's currently very slow. So, I've gone back a few steps and tried making a small, simple driver to recreate the problem.
I reserve the memory at boot time using the kernel parameter memmap=2G$1G. Then, in the driver's __init function, I ioremap some of this memory, and initialize it to a known value. I put in some code to measure the timing as well:
#define RESERVED_REGION_SIZE   (1 * 1024 * 1024 * 1024) // 1 GB
#define RESERVED_REGION_OFFSET (1 * 1024 * 1024 * 1024) // 1 GB

static int __init memdrv_init(void)
{
	struct timeval t1, t2;

	printk(KERN_INFO "[memdriver] init\n");

	// Remap reserved physical memory (that we grabbed at boot time)
	do_gettimeofday(&t1);
	reservedBlock = ioremap(RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE);
	do_gettimeofday(&t2);
	printk(KERN_ERR "[memdriver] ioremap() took %d usec\n", usec_diff(&t2, &t1));

	// Set the memory to a known value
	do_gettimeofday(&t1);
	memset(reservedBlock, 0xAB, RESERVED_REGION_SIZE);
	do_gettimeofday(&t2);
	printk(KERN_ERR "[memdriver] memset() took %d usec\n", usec_diff(&t2, &t1));

	// Register the character device
	...

	return 0;
}
I load the driver, and check dmesg. It reports:
[memdriver] init
[memdriver] ioremap() took 76268 usec
[memdriver] memset() took 12622779 usec
That's 12.6 seconds for the memset. That means the memset is running at 81 MB/sec. Why on earth is it so slow?
This is kernel 2.6.34 on Fedora 13, and it's an x86_64 system.
EDIT:
The goal behind this scheme is to take a chunk of physical memory and make it available to both a PCI device (via the memory's bus/physical address) and a user space application (via a call to mmap, supported by the driver). The PCI device will then continually fill this memory with data, and the user-space app will read it out. If ioremap is a bad way to do this (as Ben suggested below), I'm open to other suggestions that'll allow me to get any large chunk of memory that can be directly accessed by both hardware and software. I can probably make do with a smaller buffer also.
See my eventual solution below.
ioremap maps the region's pages uncached, as you'd want for access to a memory-mapped I/O device. That would explain your poor performance.
You probably want kmalloc or vmalloc. The usual reference materials will explain the capabilities of each.
I don't think ioremap() is what you want there. You should only access the result (what you call reservedBlock) with readb, readl, writeb, memcpy_toio, etc. It is not even guaranteed that the return value is virtually mapped memory you can dereference (although it apparently is on your platform). I'd guess the region is being mapped uncached (as is suitable for I/O registers), leading to the terrible performance.
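If you do keep the ioremap() mapping, a hedged sketch of that accessor-based approach (copy_out and shadow are illustrative names) is to bulk-copy the data into ordinary cached memory before working on it:

#include <linux/io.h>
#include <linux/vmalloc.h>

static int copy_out(void __iomem *reservedBlock, size_t len)
{
	void *shadow = vmalloc(len);    /* ordinary cached kernel memory */

	if (!shadow)
		return -ENOMEM;
	memcpy_fromio(shadow, reservedBlock, len);  /* MMIO-safe bulk copy */
	/* ... work on the cached shadow copy ... */
	vfree(shadow);
	return 0;
}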
It's been a while, but I'm updating this since I did eventually find a workaround for the ioremap problem.
Since we had custom hardware writing directly to the memory, it was probably more correct to mark it uncacheable, but that was unbearably slow and wasn't working for our application. Our solution was to read from that memory (a ring buffer) only once there was enough new data to fill a whole cache line on our architecture (I think that was 256 bytes). This guaranteed we never got stale data, and it was plenty fast.
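A rough sketch of that consumption rule (all names and the 256-byte line size are illustrative; it assumes a cacheable mapping and that head and tail are monotonically increasing byte counters, with head advanced by the producer):

#include <string.h>

#define CACHE_LINE 256                /* cache-line size on our architecture */
#define RING_SIZE  (1024 * 1024)      /* ring buffer size (power of two) */

/* Consume the ring only in whole cache-line units, so the CPU never
 * reads a line the device might still be filling. Returns the new tail. */
static size_t drain(const char *ring, size_t head, size_t tail, char *out)
{
	while (head - tail >= CACHE_LINE) {
		memcpy(out, ring + (tail % RING_SIZE), CACHE_LINE);
		out  += CACHE_LINE;
		tail += CACHE_LINE;
	}
	return tail;
}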
I have also tried reserving a huge memory chunk with memmap.
ioremapping that chunk gave me a mapped address space deep in the kernel's vmalloc/ioremap region.
For example, when you ask to reserve 128 GB of memory starting at 64 GB, you see the following in /proc/vmallocinfo:
0xffffc9001f3a8000-0xffffc9201f3a9000 137438957568 0xffffffffa00831c9 phys=1000000000 ioremap
Thus the mapped address space starts at the virtual address 0xffffc9001f3a8000 (which looks way too large, but is simply where the kernel places such mappings).
Secondly, your observation is correct: even memset_io results in extremely large delays (tens of minutes) to touch all of this memory.
So the time taken has to do mainly with the address-space conversion and the loading of non-cacheable pages.