I'm reading the Vulkan specification's section on host memory allocation, and it seems that VkAllocationCallbacks can be implemented using naive malloc/realloc/free functions:
typedef struct VkAllocationCallbacks {
void* pUserData;
PFN_vkAllocationFunction pfnAllocation;
PFN_vkReallocationFunction pfnReallocation;
PFN_vkFreeFunction pfnFree;
PFN_vkInternalAllocationNotification pfnInternalAllocation;
PFN_vkInternalFreeNotification pfnInternalFree;
} VkAllocationCallbacks;
But I see only two possible reasons to implement my own VkAllocationCallbacks:
Log and track memory usage by the Vulkan API;
Implement a kind of heap memory management, i.e. a large chunk of memory to be used and reused over and over. Obviously, this can be overkill and can suffer the same sorts of problems as managed memory (as in the Java JVM).
Am I missing something here?
What sort of application would make it worth implementing VkAllocationCallbacks?
From the spec:
Since most memory allocations are off the critical path, this is not
meant as a performance feature. Rather, this can be useful for certain
embedded systems, for debugging purposes (e.g. putting a guard page
after all host allocations), or for memory allocation logging.
With an embedded system, you might have grabbed all the memory right at the start, so you don't want the driver calling malloc because there might be nothing left in the tank. Guard pages and memory logging (for debug builds only) could be useful for the cautious/curious.
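To make the embedded case concrete, here is a minimal sketch (my own illustration, not from the spec or any driver) of callbacks that carve all host allocations out of one statically reserved arena, so the driver never calls malloc. It's a simple bump allocator: pfnFree is a no-op and the realloc path copies, which is wasteful but keeps the sketch short; error handling and thread safety are omitted.
#include <stdint.h>
#include <string.h>
#include <vulkan/vulkan.h>

static uint8_t g_arena[8 * 1024 * 1024]; /* grabbed "right at the start" */
static size_t g_offset;

typedef struct { size_t size; } BlockHeader; /* remembered for realloc */

static void* arenaAlloc(void* pUserData, size_t size, size_t alignment,
                        VkSystemAllocationScope scope) {
    if (alignment < sizeof(BlockHeader)) alignment = sizeof(BlockHeader);
    /* leave room for a header, then round up to the requested alignment */
    size_t base = (g_offset + sizeof(BlockHeader) + alignment - 1) & ~(alignment - 1);
    if (base + size > sizeof(g_arena)) return NULL; /* nothing left in the tank */
    ((BlockHeader*)(g_arena + base))[-1].size = size;
    g_offset = base + size;
    return g_arena + base;
}

static void arenaFree(void* pUserData, void* pMemory) {
    /* bump allocator: individual frees are no-ops; the arena outlives Vulkan */
}

static void* arenaRealloc(void* pUserData, void* pOriginal, size_t size,
                          size_t alignment, VkSystemAllocationScope scope) {
    if (pOriginal == NULL) return arenaAlloc(pUserData, size, alignment, scope);
    size_t oldSize = ((BlockHeader*)pOriginal)[-1].size;
    void* p = arenaAlloc(pUserData, size, alignment, scope);
    if (p != NULL) memcpy(p, pOriginal, oldSize < size ? oldSize : size);
    return p;
}

static const VkAllocationCallbacks g_arenaCallbacks = {
    NULL, arenaAlloc, arenaRealloc, arenaFree, NULL, NULL
};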
I read on a slide somewhere (can't remember where, sorry) that you definitely should not implement allocation callbacks that just feed through to malloc/realloc/free because you can generally assume that the drivers are doing a much better job than that (e.g. consolidating small allocations into pools).
I think that if you're not sure whether you ought to be implementing allocation callbacks, then you don't need to implement allocation callbacks and you don't need to worry that maybe you should have.
I think they're there for those specific use cases and for those who really want to be in control of everything.
This answer is an attempt to clarify and correct some of the information in the other answers...
Whatever you do, don't use plain malloc/realloc/free for a Vulkan allocator. Vulkan can, and probably does, use aligned memory operations to move memory around, so handing it unaligned allocations can cause memory corruption, and bad things will happen; the corruption may not show itself in an obvious way, either. Use aligned allocation functions instead: C11's aligned_alloc plus free on POSIX-like systems (note that there is no standard aligned realloc there, so you have to emulate it), or the _aligned_malloc/_aligned_realloc/_aligned_free family from malloc.h under Windows. _aligned_realloc is not well known, but it is there (and has been for years). BTW, the allocations for my test card had alignment requests all over the place.
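For illustration, here is a sketch of what such an allocator can look like on Windows, where the _aligned_* trio (including the realloc) is available. Treat it as an outline under those assumptions, not production code:
#include <malloc.h> /* _aligned_malloc, _aligned_realloc, _aligned_free */
#include <vulkan/vulkan.h>

static void* alignedVkAlloc(void* pUserData, size_t size, size_t alignment,
                            VkSystemAllocationScope scope) {
    return _aligned_malloc(size, alignment); /* honors Vulkan's alignment */
}

static void* alignedVkRealloc(void* pUserData, void* pOriginal, size_t size,
                              size_t alignment, VkSystemAllocationScope scope) {
    return _aligned_realloc(pOriginal, size, alignment);
}

static void alignedVkFree(void* pUserData, void* pMemory) {
    _aligned_free(pMemory);
}

static const VkAllocationCallbacks g_alignedCallbacks = {
    NULL, alignedVkAlloc, alignedVkRealloc, alignedVkFree, NULL, NULL
};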
One thing that is non-obvious about passing an application-specific allocator to Vulkan is that at least some Vulkan objects "remember" the allocator. For example, I passed an allocator to vkCreateInstance and was very surprised to see messages coming from my allocator when other objects were created (objects for which I had passed NULL as the allocator). It made sense once I stopped to think about it, since objects that interact with the Vulkan instance may cause the instance to make additional allocations.
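Here's a minimal illustration of that surprise (createInfo, physicalDevice, deviceCreateInfo, and myCallbacks are assumed to be set up elsewhere):
VkInstance instance;
vkCreateInstance(&createInfo, &myCallbacks, &instance);

/* Passing NULL here does NOT mean "no callbacks": per my reading of the
 * spec, when no per-call allocator is given, the implementation falls
 * back to the allocator of the parent instance (or device), so
 * myCallbacks may still fire for this call. */
VkDevice device;
vkCreateDevice(physicalDevice, &deviceCreateInfo, NULL, &device);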
This all plays into Vulkan's performance, since individual allocators could be written and tuned for specific allocation tasks, which could have an impact on process startup time. More importantly, a "block" allocator that places, say, instance allocations near each other could affect overall performance by increasing cache coherency, instead of having the allocations scattered all over memory. I realize this kind of performance "enhancement" is very speculative, but a carefully tuned application could see an impact. (Not to mention the numerous other performance-critical paths in Vulkan that deserve more attention.)
Whatever you do, don't attempt to use the aligned_alloc class of functions as a "release" allocator, as they have very poor performance compared to Vulkan's built-in allocator (on my test card). Even in simple programs there was a very noticeable performance difference compared to Vulkan's allocator. (Sorry, I didn't collect any timing information, but there was no way I was going to repeatedly sit through those lengthy startup times.)
When it comes to debugging, even something as simple as plain old printfs inside the allocators can be enlightening. It is also easy to add simple statistics collection, but expect a severe performance penalty. The callbacks can also serve as debug hooks without writing a fancy debug allocator or adding yet another debug layer.
BTW, my test card was an NVIDIA card using release drivers.
I implemented my own VkAllocationCallbacks using plain C's malloc()/realloc()/free(). It is a naive implementation that completely ignores the alignment parameter. Taking into account that malloc on a 64-bit OS always returns pointers with 16-byte (!) alignment, which is a pretty generous alignment, that was not a problem in my tests. See Reference.
For completeness, a 16-byte-aligned address is also 8/4/2-byte aligned.
My code is the following:
/**
* PFN_vkAllocationFunction implementation
*/
void* allocationFunction(void* pUserData, size_t size, size_t alignment, VkSystemAllocationScope allocationScope){
printf("pAllocator's allocationFunction: <%s>, size: %u, alignment: %u, allocationScope: %d",
(USER_TYPE)pUserData, size, alignment, allocationScope);
// the allocation itself - ignores alignment for now (see the discussion above)
void* ptr = malloc(size); // could be _aligned_malloc(size, alignment)
memset(ptr, 0, size);
printf(", return ptr* : 0x%p \n", ptr);
return ptr;
}
/**
* The PFN_vkFreeFunction implementation
*/
void freeFunction(void* pUserData, void* pMemory){
printf("pAllocator's freeFunction: <%s> ptr: 0x%p\n",
(USER_TYPE)pUserData, pMemory);
// now, the free operation !
free(pMemory);
}
/**
* The PFN_vkReallocationFunction implementation
*/
void* reallocationFunction(void* pUserData, void* pOriginal, size_t size, size_t alignment, VkSystemAllocationScope allocationScope){
printf("pAllocator's REallocationFunction: <%s>, size %u, alignment %u, allocationScope %d \n",
(USER_TYPE)pUserData, size, alignment, allocationScope);
return realloc(pOriginal, size);
}
/**
* PFN_vkInternalAllocationNotification implementation
*/
void internalAllocationNotification(void* pUserData, size_t size, VkInternalAllocationType allocationType, VkSystemAllocationScope allocationScope){
printf("pAllocator's internalAllocationNotification: <%s>, size %uz, alignment %uz, allocationType %uz, allocationScope %s \n",
(USER_TYPE)pUserData,
size,
allocationType,
allocationScope);
}
/**
* PFN_vkInternalFreeNotification implementation
**/
void internalFreeNotification(void* pUserData, size_t size, VkInternalAllocationType allocationType, VkSystemAllocationScope allocationScope){
printf("pAllocator's internalFreeNotification: <%s>, size %uz, alignment %uz, allocationType %d, allocationScope %s \n",
(USER_TYPE)pUserData, size, allocationType, allocationScope);
}
/**
* Creates the pAllocator.
* @param info - string used to tag and track this allocator's usage
*/
static VkAllocationCallbacks* createPAllocator(const char* info){
VkAllocationCallbacks* m_allocator = (VkAllocationCallbacks*)malloc(sizeof(VkAllocationCallbacks));
memset(m_allocator, 0, sizeof(VkAllocationCallbacks));
m_allocator->pUserData = (void*)info;
m_allocator->pfnAllocation = (PFN_vkAllocationFunction)(&allocationFunction);
m_allocator->pfnReallocation = (PFN_vkReallocationFunction)(&reallocationFunction);
m_allocator->pfnFree = (PFN_vkFreeFunction)&freeFunction;
m_allocator->pfnInternalAllocation = (PFN_vkInternalAllocationNotification)&internalAllocationNotification;
m_allocator->pfnInternalFree = (PFN_vkInternalFreeNotification)&internalFreeNotification;
// storePAllocator(m_allocator);
return m_allocator;
}
I used the Cube.c example from the VulkanSDK to test my code and assumptions. A modified version is available here: GitHub
A sample of output:
pAllocator's allocationFunction: <Device>, size: 800, alignment: 8, allocationScope: 1, return ptr* : 0x00000000061ECE40
pAllocator's allocationFunction: <RenderPass>, size: 128, alignment: 8, allocationScope: 1, return ptr* : 0x000000000623FAB0
pAllocator's allocationFunction: <ShaderModule>, size: 96, alignment: 8, allocationScope: 1, return ptr* : 0x00000000061F2C30
pAllocator's allocationFunction: <ShaderModule>, size: 96, alignment: 8, allocationScope: 1, return ptr* : 0x00000000061F8790
pAllocator's allocationFunction: <PipelineCache>, size: 152, alignment: 8, allocationScope: 1, return ptr* : 0x00000000061F2590
pAllocator's allocationFunction: <Device>, size: 424, alignment: 8, allocationScope: 1, return ptr* : 0x00000000061F8EB0
pAllocator's freeFunction: <ShaderModule> ptr: 0x00000000061F8790
pAllocator's freeFunction: <ShaderModule> ptr: 0x00000000061F2C30
pAllocator's allocationFunction: <Device>, size: 3448, alignment: 8, allocationScope: 1, return ptr* : 0x000000000624D260
pAllocator's allocationFunction: <Device>, size: 3448, alignment: 8, allocationScope: 1, return ptr* : 0x0000000006249A80
Conclusions:
The user-implemented PFN_vkAllocationFunction, PFN_vkReallocationFunction, and PFN_vkFreeFunction really do perform malloc/realloc/free operations on behalf of Vulkan. I'm not sure whether they perform ALL allocations, as Vulkan may choose to allocate/free some portions by itself.
The output from my implementation shows that the typical requested alignment is 8 bytes on my Win 7 64-bit/NVIDIA setup. This shows there is room for optimization, such as a kind of managed memory where you grab a large chunk of memory and sub-allocate it for your Vulkan app (a memory pool). It may reduce memory usage (think 8 bytes before and up to 8 bytes after each allocated block), and it may also be faster, since a malloc() call can take longer than handing out a pointer from your own already-allocated pool of memory.
At least with my current Vulkan drivers, PFN_vkInternalAllocationNotification and PFN_vkInternalFreeNotification are never called. Perhaps it's a bug in my NVidia drivers; I'll check on my AMD card later.
pUserData can be used for both debug info and management. You can even use it to pass a C++ object and do all the required performance work over there. It's sort of obvious, but note that you can change it for each call or each VkCreateXXX object.
You can use a single generic VkAllocationCallbacks allocator for the whole application, but I suspect a customized allocator may lead to better results. In my test, VkSemaphore creation showed a typical pattern of intense alloc/free of small chunks (72 bytes), which could be addressed by reusing previously freed chunks of memory in a customized allocator; see the sketch just after this list. malloc()/free() already reuse memory when possible, but it is tempting to try our own memory manager, at least for short-lived small blocks of memory.
Memory alignment may be an issue when implementing VkAllocationCallbacks (there is no aligned realloc in POSIX, only aligned allocation and free; Windows does provide _aligned_realloc), but only if Vulkan requests alignments bigger than malloc's default (8 bytes on x86, 16 on AMD64; ARM defaults would need checking). So far, it seems Vulkan actually requests memory with lower alignment than malloc's default, at least on 64-bit OSes.
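To make that last point concrete, here is a purely illustrative sketch of a pfnAllocation/pfnFree pair that recycles blocks of one hot size on a free list. It assumes the 8-byte alignments and the 72-byte VkSemaphore churn seen in my logs, and it leaves out the realloc callback and any thread safety:
#include <stdlib.h>
#include <vulkan/vulkan.h>

#define HOT_SIZE 72 /* the size VkSemaphore churns on */

typedef struct { size_t size; } Hdr; /* lets pfnFree sort blocks */
typedef struct FreeNode { struct FreeNode* next; } FreeNode;
static FreeNode* g_freeList; /* recycled HOT_SIZE blocks */

static void* poolAlloc(void* pUserData, size_t size, size_t alignment,
                       VkSystemAllocationScope scope) {
    size_t payload = size < HOT_SIZE ? HOT_SIZE : size;
    Hdr* h;
    if (payload == HOT_SIZE && g_freeList != NULL) {
        h = (Hdr*)g_freeList; /* reuse a previously freed chunk */
        g_freeList = g_freeList->next;
    } else {
        h = malloc(sizeof(Hdr) + payload);
        if (h == NULL) return NULL;
    }
    h->size = payload;
    return h + 1; /* 8 bytes past a 16-aligned block: still 8-aligned */
}

static void poolFree(void* pUserData, void* pMemory) {
    if (pMemory == NULL) return;
    Hdr* h = (Hdr*)pMemory - 1;
    if (h->size == HOT_SIZE) { /* keep the hot size around for reuse */
        FreeNode* n = (FreeNode*)h;
        n->next = g_freeList;
        g_freeList = n;
    } else {
        free(h);
    }
}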
Final Thought:
You can live happily until the end of time just setting every VkAllocationCallbacks* pAllocator you find to NULL ;)
Possibly Vulkan's default allocator already does it all better than you would.
BUT...
One of the highlighted benefits of Vulkan was that the developer would be put in control of everything, including memory management. See the Khronos presentation, slide 6.
Related
I have implemented a custom allocator to track the memory that is allocated by vkAllocateMemory.
I have also studied the official Khronos documentation on this topic, but I could not find an answer to the following questions:
What actually is the size parameter of PFN_vkAllocationFunction? I have found that apparently it is not the size of the actual buffer we want to allocate memory for. I suspect it is the size of a Vulkan structure or some other Vulkan-internal buffer. No matter how big a memory chunk I want to allocate, the size is always 200 (it is a machine/GPU/driver-dependent value, but it is constant).
For testing purposes I used the triangle example from https://github.com/SaschaWillems/Vulkan and an allocator from a similar question: Vulkan's VkAllocationCallbacks implemented with malloc/free().
triangle before vkAllocateMemory: memAlloc.allocationSize: 7864320 sizeof(memAlloc): 32
triangle after vkAllocateMemory: memAlloc.allocationSize: 7864320
pAllocator's allocationFunction: <Memory>, size: 200, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564ac917b20
I have also found that this size differs between Vulkan calls, e.g. vkCreateRenderPass or vkCreateImageView:
pAllocator's allocationFunction: <ImageView>, size: 160, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564acdf8390
pAllocator's allocationFunction: <ImageView>, size: 824, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564acdf8440
pAllocator's allocationFunction: <RenderPass>, size: 200, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564ac950ee0
pAllocator's allocationFunction: <RenderPass>, size: 88, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564acdf8780
pAllocator's allocationFunction: <RenderPass>, size: 56, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564acdf7a00
pAllocator's allocationFunction: <RenderPass>, size: 344, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564ac6a07c0
pAllocator's allocationFunction: <RenderPass>, size: 8, alignment: 8, allocationScope: 1, return ptr* : 0x0x5564acdf87e0
Is it possible to allocate host-visible memory using those callbacks? I would like to emulate the behaviour of VK_EXT_external_memory_host, VK_KHR_external_memory_fd, or VK_EXT_external_memory_dma_buf, but I don't know whether this approach (implementing my own allocator) can be useful for that.
To understand what's going on, you need to understand what VkAllocationCallbacks are for. So let's dive in.
When you call vkCreateDevice, for example, a Vulkan implementation needs to return a VkDevice handle. This is undoubtedly a pointer to an object. So... where did its memory come from? There must be some internal, implementation-defined data structure in play, and that likely requires heap allocation.
However, for a low-level system, it is generally considered rude for it to arbitrarily allocate heap memory without the knowledge or consent of the application using that system. The larger application may want to pre-allocate a bunch of memory specifically for the Vulkan implementation's use, for any number of reasons.
Being able to do that requires being able to replace the Vulkan implementation's heap allocation functions with your own. That is what VkAllocationCallbacks are for: allowing your program to provide heap-allocation services to the Vulkan implementation, which will use them for its own internal data structures. And that last part is important: these allocations are for internal use, in certain contexts.
vkAllocateMemory is a function for allocating device-accessible memory. Now yes, depending on the memory type used, it may allocate from the same pool of memory as CPU-accessible memory. But the memory vkAllocateMemory allocates is not for the Vulkan implementation's usage; it's for your application's usage, mediated through the Vulkan API.
Now, above I said "in certain contexts". When you create a device and provide VkAllocationCallbacks, those callbacks will be used to allocate implementation memory for any Vulkan function using that device. That is the context of those callbacks. Other functions have their own contexts. For example, when you call vkCreateDescriptorPool, you can also give it callbacks; these callbacks, if specified, will be used for any allocations made on behalf of that descriptor pool.
Now, as stated above, vkAllocateMemory allocates device-accessible memory. However, it also creates a VkDeviceMemory object that represents this storage. This object lives in CPU space, and therefore it must have been allocated from CPU storage.
That's what the 200 bytes you saw represents: not the device memory allocation itself, but the allocation of CPU storage used to manage that device memory. You could say that the internal implementation object represented by VkDeviceMemory takes up 200 bytes.
Overall, you cannot track device-accessible allocations through a system intended to provide you with access to how much CPU-accessible memory is being allocated by Vulkan.
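To make the distinction concrete, here is a small hedged sketch (device, memoryTypeIndex, and myCallbacks are assumed to exist already; the sizes in the comments are the ones reported in the question):
VkMemoryAllocateInfo info = {
    .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .allocationSize  = 7864320, /* ~7.5 MiB of device-accessible memory */
    .memoryTypeIndex = memoryTypeIndex,
};
VkDeviceMemory memory;
/* The callbacks observe one ~200-byte CPU-side allocation here: the
 * bookkeeping object behind the VkDeviceMemory handle, never the
 * 7.5 MiB itself. */
vkAllocateMemory(device, &info, &myCallbacks, &memory);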
I've been using and modifying this library: https://github.com/sile/patricia_tree
One thing that bothered me a bit was how much unsafe is used in node.rs; in particular, a node is defined as just a pointer to some heap location. When running the first benchmark listed on the readme page (Wikipedia inputs), the PatriciaSet uses ~700 MB (a PatriciaSet is just holding a Node at its root):
pub struct Node<V> {
// layout:
// all these fields accessed with ptr.offset
// - flags: u8
// - label_len: u8
// - label: [u8; label_len]
// - value: Option<V>
// - child: Option<Node<V>>
// - sibling: Option<Node<V>>
ptr: *mut u8,
_value: PhantomData<V>,
}
and uses malloc for allocation:
let ptr = unsafe { libc::malloc(block_size) } as *mut u8;
I was told this memory is not aligned properly, so I tried to move to the new alloc API and use Layout/alloc. This is still not aligned properly either; it just seems to 'work'. Full PR:
let layout = Layout::array::<u8>(block_size).expect("Failed to get layout");
let ptr = unsafe { alloc::alloc(layout) as *mut u8 };
This single change, which also stores the layout in the block of memory pointed to by ptr, caused memory consumption to go up 40% under the performance tests for very large trees. The Layout type is just two words wide, so this was unexpected. For the same tests, this version uses closer to ~1000 MB (compared to the previous 700).
In another attempt, I tried to remove most of the unsafe and go with something a bit more Rust-y. Full PR here:
pub struct Node<V> {
value: Option<V>,
child: Option<*mut Node<V>>,
sibling: Option<*mut Node<V>>,
label: SmallVec<[u8; 10]>,
_value: PhantomData<V>,
}
creating the node in the manner you might expect in Rust:
let child = child.map(|c| Box::into_raw(Box::new(c)));
let sibling = sibling.map(|c| Box::into_raw(Box::new(c)));
Node {
value,
child,
sibling,
label: SmallVec::from_slice(label),
_value: PhantomData,
}
Performance-wise, it's about equivalent to the original unmodified library, but its memory consumption appears to be not much better than just inserting every single item into a HashSet: around ~1700 MB for the first benchmark.
The final data structure that uses Node is a compressed trie, or 'patricia tree'. No other code was changed other than the structure of Node and the implementations of those of its methods that idiomatically fall out of these changes.
I was hoping someone could tip me off about what exactly is causing such a drastic difference in memory consumption between these implementations. In my mind, they should be about equivalent: they all allocate around the same number of fields with about the same widths. The unsafe first one is able to store a dynamic label length in-line, so that could be one reason. But SmallVec should be able to do something similar for small label sizes (using just Vec was even worse).
I'm looking for any suggestions or help on why the end results are so different. If you're curious, the code to run these is here, though it is spread across the original author's repo and my own.
I'd also be open to hearing about tools for investigating the differences between these!
You're seeing increased memory use for a couple of reasons. I'll assume a standard 64-bit Unix system.
First, a pointer is 8 bytes. An Option<*mut Node<V>> is 16 bytes, because raw pointers aren't subject to the null-pointer optimization that happens with references. References can never be null, so the compiler can encode an Option<&'a V> as a null pointer when the value is None and a regular pointer when it's Some; raw pointers, however, may legitimately be null, so that can't happen here. The discriminant needs its own space, and padding rounds the whole Option up to two pointer widths, so you pay 16 bytes per pointer field.
The easiest and most type-safe way to deal with this is to use Option<NonNull<Node<V>>>. NonNull is guaranteed non-null, so the null-pointer optimization applies again, and this change shrinks your structure by 16 bytes total.
Second, your SmallVec is 32 bytes in size. SmallVecs avoid a heap allocation in some cases but, despite the name, are not necessarily small. You can use a regular Vec or a boxed slice, which will likely result in lower memory usage at the cost of an additional allocation.
With those changes and using a Vec, your structure will be 48 bytes in size. With a boxed slice, it will be 40. The original used 72. How much savings you see will depend on how big your labels are, since you'll need to allocate space for them.
The required alignment for this structure is 8 bytes because the largest alignment of any type (the pointer) is 8 bytes. Even on architectures like x86-64 where alignment is not required for all types, it is still faster, and sometimes significantly so, so the compiler always does it.
The original code was not properly aligned at all and will either outright fail (on SPARC), perform badly (on PowerPC), or require an alignment trap into the kernel if they're enabled (on MIPS) or fail if they're not. An alignment trap into the kernel for unaligned access performs terribly because you have to do a full context switch just to load and shift two words, so most people turn them off.
The reason that this is not properly aligned is because Node contains a pointer and it appears in the structure at an offset which is not guaranteed to be a multiple of 8. If it were rewritten such that the child and sibling attributes came first, then it would be properly aligned provided the memory were suitably aligned (which malloc guarantees but your Rust allocation does not). You could create a suitable Layout with Layout::from_size_align(block_size, std::mem::align_of::<*mut Node>()).
So while the original code worked on x86-64 and saved a bunch of memory, it performed badly and was not portable.
The code I used for this example is simply the following, plus some knowledge of how Rust handles nullable types and of C and memory allocation:
extern crate smallvec;
use smallvec::SmallVec;
use std::marker::PhantomData;
use std::ptr::NonNull;
pub struct Node<V> {
value: Option<V>,
child: Option<NonNull<Node<V>>>,
sibling: Option<NonNull<Node<V>>>,
label: Vec<u8>,
_value: PhantomData<V>,
}
fn main() {
println!("size: {}", std::mem::size_of::<Node<()>>());
}
I'm looking to allocate a vector of small-sized structs.
This takes 30 milliseconds and increases linearly:
let v = vec![[0, 0, 0, 0]; 1024 * 1024];
This takes tens of microseconds:
let v = vec![0; 1024 * 1024];
Is there a more efficient solution to the first case? I'm okay with unsafe code.
Fang Zhang's answer is correct in the general case. The code you asked about is a little bit special: it could use alloc_zeroed, but it does not. As Stargateur also points out in the question comments, with future language and library improvements it is possible both cases could take advantage of this speedup.
This usually should not be a problem. Initializing a whole big vector at once probably isn't something you do extremely often. Big allocations are usually long-lived, so you won't be creating and freeing them in a tight loop -- the cost of initializing the vector will only be paid rarely. Sooner than resorting to unsafe, I would take a look at my algorithms and try to understand why a single memset is causing so much trouble.
However, if you happen to know that all-bits-zero is an acceptable initial value, and if you absolutely cannot tolerate the slowdown, you can do an end-run around the standard library by calling alloc_zeroed and creating the Vec using from_raw_parts. Vec::from_raw_parts is unsafe, so you have to be absolutely sure the size and alignment of the allocated memory is correct. Since Rust 1.44, you can use Layout::array to do this easily. Here's an example:
pub fn make_vec() -> Vec<[i8; 4]> {
let layout = std::alloc::Layout::array::<[i8; 4]>(1_000_000).unwrap();
// I copied the following unsafe code from Stack Overflow without understanding
// it. I was advised not to do this, but I didn't listen. It's my fault.
unsafe {
Vec::from_raw_parts(
std::alloc::alloc_zeroed(layout) as *mut _,
1_000_000,
1_000_000,
)
}
}
See also
How to perform efficient vector initialization in Rust?
vec![0; 1024 * 1024] is a special case. If you change it to vec![1; 1024 * 1024], you will see performance degrade dramatically.
Typically, for a non-zero element e, vec![e; n] will clone the element n times, which is the major cost. For an element equal to 0, the system provides another way to initialize the memory (the allocator can hand back pages that are already zeroed), which is much faster.
So the answer to your question is no.
When using Metal to rapidly draw pixel buffers to the screen from memory, we create MTLBuffer objects using MTLDevice.makeBuffer(bytesNoCopy:...) so the GPU can read the pixels directly from memory without copying them. Shared memory is really a must-have for achieving good pixel-transfer performance.
The catch is that makeBuffer requires a page-aligned memory address and a page aligned length. Those requirements are not only in the documentation -- they are also enforced using runtime assertions.
The code I am writing has to deal with a variety of incoming resolutions and pixel formats, and occasionally I get unaligned buffers or unaligned lengths. After researching this, I discovered a hack that lets me use shared memory in those cases.
Basically, I round the unaligned buffer address down to the nearest page boundary and use the offset parameter of makeTexture to ensure that the GPU starts reading from the right place. Then I round the length up to the nearest page size. That memory is obviously going to be valid (because allocations can only occur on page boundaries), and I think it's safe to assume the GPU isn't writing to or corrupting it.
Here is the code I'm using to allocate shared buffers from unaligned buffers:
extension MTLDevice {
func makeTextureFromUnalignedBuffer(textureDescriptor : MTLTextureDescriptor, bufferPtr : UnsafeMutableRawPointer, bufferLength : UInt, bytesPerRow : Int) -> MTLTexture? {
var calculatedBufferLength = bufferLength
let pageSize = UInt(getpagesize())
let pageSizeBitmask = UInt(getpagesize()) - 1
let alignedBufferAddr = UnsafeMutableRawPointer(bitPattern: UInt(bitPattern: bufferPtr) & ~pageSizeBitmask)
let offset = UInt(bitPattern: bufferPtr) & pageSizeBitmask
assert(bytesPerRow % 64 == 0 && offset % 64 == 0, "Supplied bufferPtr and bytesPerRow must be aligned on a 64-byte boundary!")
calculatedBufferLength += offset
if (calculatedBufferLength & pageSizeBitmask) != 0 {
calculatedBufferLength &= ~(pageSize - 1)
calculatedBufferLength += pageSize
}
let buffer = self.makeBuffer(bytesNoCopy: alignedBufferAddr!, length: Int(calculatedBufferLength), options: .storageModeShared, deallocator: nil)
return buffer.makeTexture(descriptor: textureDescriptor, offset: Int(offset), bytesPerRow: bytesPerRow)
}
}
I've tested this on numerous different buffers and it seems to work perfectly (only tested on iOS, not on macOS). My question is: Is this approach safe? Any obvious reasons why this wouldn't work?
Then again, if it is safe, why were the requirements imposed in the first place? Why isn't the API just doing this for us?
I have submitted an Apple TSI (Technical Support Incident) for this question, and the answer is basically yes, it is safe. Here is the exact response in case anyone is interested:
After discussing your approach with engineering we concluded that it
was valid and safe. Some noteworthy quotes:
“The framework shouldn’t care about the fact that the user doesn’t own
the entire page, because it shouldn’t ever read before the offset
where the valid data begins.”
“It really shouldn’t [care], but in general if the developer can use
page-allocators rather than malloc for their incoming images, that
would be nice.”
As to why the alignment constraints/assertions are in place:
“Typically mapping memory you don’t own into another address space is
a bit icky, even if it works in practice. This is one reason why we
required mapping to be page aligned, because the hardware really is
mapping (and gaining write access) to the entire page.”
I am working on a PCIe-based network driver. Different examples use one of pci_alloc_consistent() or dma_alloc_coherent() to get memory for the transmission and reception descriptors. Which one is better, if any, and what is the difference between the two?
The difference is subtle but quite important.
pci_alloc_consistent() is the older function of the two and legacy drivers still use it.
Nowadays, pci_alloc_consistent() just calls dma_alloc_coherent().
The difference? The type of the allocated memory.
pci_alloc_consistent() - allocates memory with GFP_ATOMIC. The allocation does not sleep, so it is usable in e.g. interrupt handlers and bottom halves.
dma_alloc_coherent() - you specify yourself what type of memory to allocate. You should not use the high-priority GFP_ATOMIC memory unless you need it; in most cases, GFP_KERNEL allocations are fine.
The kernel 3.18 definition of pci_alloc_consistent() is very simple:
static inline void *
pci_alloc_consistent(struct pci_dev *hwdev, size_t size,
dma_addr_t *dma_handle)
{
return dma_alloc_coherent(hwdev == NULL ? NULL : &hwdev->dev, size, dma_handle, GFP_ATOMIC);
}
In short, use dma_alloc_coherent().
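For illustration, here is a hedged sketch of what that typically looks like in a PCIe NIC driver's probe path (the ring size and function name are made up):
#include <linux/dma-mapping.h>
#include <linux/pci.h>

#define RING_BYTES 4096

static int example_alloc_ring(struct pci_dev *pdev)
{
    dma_addr_t ring_dma; /* bus address to program into the device */
    void *ring;          /* CPU virtual address for the driver */

    /* probe runs in process context, so GFP_KERNEL is fine here */
    ring = dma_alloc_coherent(&pdev->dev, RING_BYTES, &ring_dma, GFP_KERNEL);
    if (!ring)
        return -ENOMEM;

    /* ... write ring_dma into the descriptor-base register, use ring ... */

    dma_free_coherent(&pdev->dev, RING_BYTES, ring, ring_dma);
    return 0;
}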