Is it safe to feed unaligned buffers to MTLBuffer?

When trying to use Metal to rapidly draw pixel buffers to the screen from memory, we create MTLBuffer objects using MTLDevice.makeBuffer(bytesNoCopy:...) so the GPU can read the pixels directly from memory without having to copy them. Shared memory is really a must-have for achieving good pixel transfer performance.
The catch is that makeBuffer requires a page-aligned memory address and a page aligned length. Those requirements are not only in the documentation -- they are also enforced using runtime assertions.
The code I am writing has to deal with a variety of incoming resolutions and pixel formats, and occasionally I get unaligned buffers or unaligned lengths. After researching this I discovered a hack that allows me to use shared memory for those instances.
Basically what I do is round the unaligned buffer address down to the nearest page boundary, and use the offset parameter of makeTexture to ensure that the GPU starts reading from the right place. Then I round the length up to the nearest multiple of the page size. That memory is going to be valid (mappings are page-granular, so the rest of each page is mapped), and I think it's safe to assume the GPU isn't writing to or corrupting that memory.
Here is the code I'm using to allocate shared buffers from unaligned buffers:
extension MTLDevice {
    func makeTextureFromUnalignedBuffer(textureDescriptor: MTLTextureDescriptor,
                                        bufferPtr: UnsafeMutableRawPointer,
                                        bufferLength: UInt,
                                        bytesPerRow: Int) -> MTLTexture? {

        var calculatedBufferLength = bufferLength
        let pageSize = UInt(getpagesize())
        let pageSizeBitmask = pageSize - 1

        // Round the buffer address down to the nearest page boundary and remember
        // how far into that page the pixel data actually starts.
        let alignedBufferAddr = UnsafeMutableRawPointer(bitPattern: UInt(bitPattern: bufferPtr) & ~pageSizeBitmask)
        let offset = UInt(bitPattern: bufferPtr) & pageSizeBitmask

        assert(bytesPerRow % 64 == 0 && offset % 64 == 0,
               "Supplied bufferPtr and bytesPerRow must be aligned on a 64-byte boundary!")

        // Round the length up to a whole number of pages.
        calculatedBufferLength += offset
        if (calculatedBufferLength & pageSizeBitmask) != 0 {
            calculatedBufferLength &= ~pageSizeBitmask
            calculatedBufferLength += pageSize
        }

        guard let buffer = self.makeBuffer(bytesNoCopy: alignedBufferAddr!,
                                           length: Int(calculatedBufferLength),
                                           options: .storageModeShared,
                                           deallocator: nil) else { return nil }

        return buffer.makeTexture(descriptor: textureDescriptor, offset: Int(offset), bytesPerRow: bytesPerRow)
    }
}
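For context, here's a hypothetical call site. Everything in it (frame geometry, pixel format, and the stand-in pointer) is made up purely for illustration; in practice the pointer would come from a capture or decode API and need not be page aligned:
import Metal

let device = MTLCreateSystemDefaultDevice()!

// Stand-in frame: 1080p BGRA; bytesPerRow is already a multiple of 64 here.
let frameWidth = 1920, frameHeight = 1080
let bytesPerRow = frameWidth * 4
let frameLength = bytesPerRow * frameHeight

// Stand-in for a base address handed to us by another API
// (e.g. CVPixelBufferGetBaseAddress); it is not necessarily page aligned.
let frameBaseAddress = UnsafeMutableRawPointer.allocate(byteCount: frameLength, alignment: 64)

let descriptor = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .bgra8Unorm, width: frameWidth, height: frameHeight, mipmapped: false)

let texture = device.makeTextureFromUnalignedBuffer(
    textureDescriptor: descriptor,
    bufferPtr: frameBaseAddress,
    bufferLength: UInt(frameLength),
    bytesPerRow: bytesPerRow)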
I've tested this on numerous different buffers and it seems to work perfectly (only tested on iOS, not on macOS). My question is: Is this approach safe? Any obvious reasons why this wouldn't work?
Then again, if it is safe, why were the requirements imposed in the first place? Why isn't the API just doing this for us?

I have submitted an Apple TSI (Technical Support Incident) for this question, and the answer is basically yes, it is safe. Here is the exact response in case anyone is interested:
After discussing your approach with engineering we concluded that it
was valid and safe. Some noteworthy quotes:
“The framework shouldn’t care about the fact that the user doesn’t own
the entire page, because it shouldn’t ever read before the offset
where the valid data begins.”
“It really shouldn’t [care], but in general if the developer can use
page-allocators rather than malloc for their incoming images, that
would be nice.”
As to why the alignment constraints/assertions are in place:
“Typically mapping memory you don’t own into another address space is
a bit icky, even if it works in practice. This is one reason why we
required mapping to be page aligned, because the hardware really is
mapping (and gaining write access) to the entire page.”

Related

Is there a way to bind assets once instead of with every command encoder?

I'm rendering with a vertex/fragment shader as well as a compute kernel.
Every frame I am binding large assets (such as a 450MB texture) in the usual way:
computeEncoder.setTexture(highResTexture, index: 0)
computeEncoder.setBuffer(largeBuffer, offset: 0, index: 0)
...
renderEncoder.setVertexTexture(highResTexture, index: 0)
renderEncoder.setVertexBuffer(largeBuffer, offset: 0, index: 0)
So that is close to 1 GB in bandwidth for a single texture, and I have many more assets totaling a few hundred megabytes, which comes to roughly 1.5 GB that I bind every frame.
Is there any way to bind textures/buffers to the GPU once so that they would then be available in the kernel and vertex functions without binding every frame?
I could be wrong, but I thought something was introduced in one of the last couple of WWDCs, so I thought I would ask to make sure I'm not missing anything.
EDIT:
Simply binding a texture in the vertex function that I have already bound in the compute encoder does indeed show more texture bandwidth used, even though I am not actually using it in the capture.
GPU Read Bandwidth:
6.3920 GiB/s without binding
7.1919 GiB/s with binding
(Screenshots omitted: one capture without binding the texture, and one with the texture bound but not used in any way.)
Also, if it works as you describe, why does using multiple command encoders warn about wasted bandwidth? If I use more than one emitter, each with a separate encoder, I get the performance warning even though they bind identical resources.
I think you are confused. Setting a texture to a command encoder doesn't consume bandwidth. Reading it or sampling it inside the shader does.
When you set a texture or any other buffer on an encoder, the driver just passes a small amount of metadata to the shader, likely through some internal buffer that's not visible to you as the API user. It doesn't "load" the texture anywhere. There's an exception for buffers that are marked as constant address buffers in the shaders, because those may get pre-fetched by the GPU for better performance.
Another thing that happens is that the resource is made resident, meaning the GPU driver will map a range of addresses in the GPU's virtual memory table to point to the physical memory that stores the texture contents. This also does not consume memory, but it does consume available virtual address space. You might run out of virtual address space in some cases, but that's not a bandwidth issue.
Still, if you do have a lot of textures, you might be actually spending a lot of CPU time just encoding those setTexture commands. Instead, you can use argument buffers. If the hardware you are targeting supports argument buffers tier 2, you can put every texture in an argument buffer. This will require calling useResource on all of those textures, because the driver needs to know that you are going to use those textures to make them resident, so you will still spend CPU time encoding those commands. To avoid that, you can allocate all the textures from one or more heaps and call useHeaps on those heaps. This will make the whole heap resident, and you won't need to call useResource on individual resources. There are a bunch of WWDC talks on this topic, latest one being Explore bindless rendering in Metal.
But again, to reiterate: nothing I mentioned here "wastes" bandwidth.
Update:
A very basic example of using argument buffers would look like this:
let argumentDescriptor = MTLArgumentDescriptor()
argumentDescriptor.index = 0
argumentDescriptor.dataType = .texture
argumentDescriptor.textureType = .type2D

let argumentEncoder = device.makeArgumentEncoder(arguments: [argumentDescriptor])!
let argumentBuffer = device.makeBuffer(length: argumentEncoder.encodedLength,
                                       options: [.storageModeShared])!

argumentEncoder.setArgumentBuffer(argumentBuffer, offset: 0)
argumentEncoder.setTexture(someTexture, index: 0)

commandEncoder.setBuffer(argumentBuffer, offset: 0, index: 0)
commandEncoder.useResource(someTexture, usage: .read)
And in the shader you would write a struct like this:
struct MyTexture
{
    texture2d<float> texture [[ id(0) ]];
};
and then bind it like
device MyTexture& myTexture [[ buffer(0) ]]
and use it like any other struct. This is a very basic example; you can actually use reflection to create those MTLArgumentEncoder instances for you from functions and binding indices.
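For the heap route mentioned earlier, a rough sketch might look like the following; the heap size and the existing device, textureDescriptor and commandEncoder variables are assumptions, and in practice you would size the heap with device.heapTextureSizeAndAlign(descriptor:):
// Allocate the big assets from a heap so one useHeap call makes all of them
// resident, instead of one useResource call per texture.
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.storageMode = .private            // textureDescriptor.storageMode must match
heapDescriptor.size = 512 * 1024 * 1024          // assumed budget for the large assets

let heap = device.makeHeap(descriptor: heapDescriptor)!
let highResTexture = heap.makeTexture(descriptor: textureDescriptor)!

// When encoding a frame:
commandEncoder.useHeap(heap)                     // makes every resource in the heap resident
commandEncoder.setBuffer(argumentBuffer, offset: 0, index: 0)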

Rust - Why such a big difference in memory usage between malloc/alloc and more 'idiomatic' approaches

I've been using and modifying this library https://github.com/sile/patricia_tree
One thing that bothered me a bit was how much unsafe is used in node.rs; in particular, a Node is defined as just a pointer to some heap location. When running the first benchmark listed on the readme page (Wikipedia inputs), the PatriciaSet uses ~700 MB (a PatriciaSet just holds a Node at its root):
pub struct Node<V> {
    // layout:
    // all these fields accessed with ptr.offset
    // - flags: u8
    // - label_len: u8
    // - label: [u8; label_len]
    // - value: Option<V>
    // - child: Option<Node<V>>
    // - sibling: Option<Node<V>>
    ptr: *mut u8,
    _value: PhantomData<V>,
}
and uses malloc for allocation:
let ptr = unsafe { libc::malloc(block_size) } as *mut u8;
I was told this memory is not aligned properly, so I tried to switch to the newer alloc API and use Layout/alloc. This is also still not aligned properly, it just seems to 'work' (full PR):
let layout = Layout::array::<u8>(block_size).expect("Failed to get layout");
let ptr = unsafe { alloc::alloc(layout) as *mut u8 };
This single change, which also stores the Layout in the block of memory pointed to by ptr, caused memory consumption to go up 40% under the performance tests for very large trees. The Layout type is just two words wide, so this was unexpected. For the same tests this uses closer to ~1000 MB (compared to the previous 700).
In another attempt, I tried to remove most of the unsafe and go with something a bit more Rust-y (full PR here):
pub struct Node<V> {
    value: Option<V>,
    child: Option<*mut Node<V>>,
    sibling: Option<*mut Node<V>>,
    label: SmallVec<[u8; 10]>,
    _value: PhantomData<V>,
}
creating the node in the manner you might expect in Rust:
let child = child.map(|c| Box::into_raw(Box::new(c)));
let sibling = sibling.map(|c| Box::into_raw(Box::new(c)));
Node {
    value,
    child,
    sibling,
    label: SmallVec::from_slice(label),
    _value: PhantomData,
}
Performance-wise, it's about equivalent to the original unmodified library, but its memory consumption appears to be not much better than just inserting every single item into a HashSet, using around ~1700 MB for the first benchmark.
The final data structure which uses node is a compressed trie or a 'patricia tree'. No other code was changed other than the structure of the Node and the implementations of some of its methods that idiomatically fall out of those changes.
I was hoping someone could tip me off as to what exactly is causing such a drastic difference in memory consumption between these implementations. In my mind, they should be about equivalent. They all allocate around the same number of fields with about the same widths. The unsafe first one is able to store a dynamic label length in-line, so that could be one reason. But smallvec should be able to do something similar with smaller label sizes (using just Vec was even worse).
Looking for any suggestions or help on why the end results are so different. If curious, the code to run these is here, though it is spread across the original author's repo and my own.
I'd also be open to hearing about tools for investigating differences like these!
You're seeing increased memory use for a couple of reasons. I'll assume a standard 64-bit Unix system.
First, a pointer is 8 bytes, but an Option<*mut Node<V>> is 16 bytes because raw pointers aren't subject to the null-pointer (niche) optimization that references get. References can never be null, so the compiler can represent an Option<&'a V> as a null pointer when the value is None and as a regular pointer when it's Some; raw pointers can be null, so that can't happen here. With no niche to reuse, the Option has to store its discriminant separately and pad out to pointer alignment, so each of those fields is 16 bytes.
The easiest and most type-safe way to deal with this is just to use Option<NonNull<Node<V>>>. Doing that drops your structure by 16 bytes total.
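To see the difference concretely, a standalone snippet like this prints 16 and then 8 on a typical 64-bit target:
use std::ptr::NonNull;

fn main() {
    // Raw pointers may legitimately be null, so the Option needs a separate
    // discriminant and pads out to pointer alignment: 16 bytes.
    println!("{}", std::mem::size_of::<Option<*mut u8>>());
    // NonNull promises "never null", so None can reuse the all-zero bit
    // pattern and the whole Option stays 8 bytes.
    println!("{}", std::mem::size_of::<Option<NonNull<u8>>>());
}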
Second, your SmallVec is 32 bytes in size. They avoid needing a heap allocation in some cases, but they aren't, despite the name, necessarily small. You can use a regular Vec or a boxed slice, which will likely result in lower memory usage at the cost of an additional allocation.
With those changes and using a Vec, your structure will be 48 bytes in size. With a boxed slice, it will be 40. The original used 72. How much savings you see will depend on how big your labels are, since you'll need to allocate space for them.
The required alignment for this structure is 8 bytes because the largest alignment of any type (the pointer) is 8 bytes. Even on architectures like x86-64 where alignment is not required for all types, it is still faster, and sometimes significantly so, so the compiler always does it.
The original code was not properly aligned at all and will either outright fail (on SPARC), perform badly (on PowerPC), or, on MIPS, require an alignment trap into the kernel if those are enabled and fail if they're not. An alignment trap into the kernel for unaligned access performs terribly because you have to do a full context switch just to load and shift two words, so most people turn them off.
The reason that this is not properly aligned is because Node contains a pointer and it appears in the structure at an offset which is not guaranteed to be a multiple of 8. If it were rewritten such that the child and sibling attributes came first, then it would be properly aligned provided the memory were suitably aligned (which malloc guarantees but your Rust allocation does not). You could create a suitable Layout with Layout::from_size_align(block_size, std::mem::align_of::<*mut Node>()).
So while the original code worked on x86-64 and saved a bunch of memory, it performed badly and was not portable.
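A minimal sketch of that fix for the pointer-based version, assuming the rest of the code stays the same (the block size below is just a stand-in for whatever the node layout computes):
use std::alloc::{alloc, dealloc, Layout};

// Request the alignment the node's pointer fields need, rather than the
// 1-byte alignment Layout::array::<u8>() gives.
fn node_block_layout(block_size: usize) -> Layout {
    Layout::from_size_align(block_size, std::mem::align_of::<*mut u8>())
        .expect("invalid layout")
}

fn main() {
    let block_size = 64; // stand-in value
    let layout = node_block_layout(block_size);
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null());
    // ... write the node's fields at their (now suitably aligned) offsets ...
    unsafe { dealloc(ptr, layout) };
}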
The code I used for this example is simply the following, plus some knowledge about how Rust does nullable types and knowledge about C and memory allocation:
extern crate smallvec;

use smallvec::SmallVec;
use std::marker::PhantomData;
use std::ptr::NonNull;

pub struct Node<V> {
    value: Option<V>,
    child: Option<NonNull<Node<V>>>,
    sibling: Option<NonNull<Node<V>>>,
    label: Vec<u8>,
    _value: PhantomData<V>,
}

fn main() {
    println!("size: {}", std::mem::size_of::<Node<()>>());
}

How to efficiently create a large vector of items initialized to the same value?

I'm looking to allocate a vector of small-sized structs.
This takes 30 milliseconds and increases linearly:
let v = vec![[0, 0, 0, 0]; 1024 * 1024];
This takes tens of microseconds:
let v = vec![0; 1024 * 1024];
Is there a more efficient solution to the first case? I'm okay with unsafe code.
Fang Zhang's answer is correct in the general case. The code you asked about is a little bit special: it could use alloc_zeroed, but it does not. As Stargateur also points out in the question comments, with future language and library improvements it is possible both cases could take advantage of this speedup.
This usually should not be a problem. Initializing a whole big vector at once probably isn't something you do extremely often. Big allocations are usually long-lived, so you won't be creating and freeing them in a tight loop -- the cost of initializing the vector will only be paid rarely. Sooner than resorting to unsafe, I would take a look at my algorithms and try to understand why a single memset is causing so much trouble.
However, if you happen to know that all-bits-zero is an acceptable initial value, and if you absolutely cannot tolerate the slowdown, you can do an end-run around the standard library by calling alloc_zeroed and creating the Vec using from_raw_parts. Vec::from_raw_parts is unsafe, so you have to be absolutely sure the size and alignment of the allocated memory is correct. Since Rust 1.44, you can use Layout::array to do this easily. Here's an example:
pub fn make_vec() -> Vec<[i8; 4]> {
    let layout = std::alloc::Layout::array::<[i8; 4]>(1_000_000).unwrap();
    // I copied the following unsafe code from Stack Overflow without understanding
    // it. I was advised not to do this, but I didn't listen. It's my fault.
    unsafe {
        Vec::from_raw_parts(
            std::alloc::alloc_zeroed(layout) as *mut _,
            1_000_000,
            1_000_000,
        )
    }
}
See also
How to perform efficient vector initialization in Rust?
vec![0; 1024 * 1024] is a special case. If you change it to vec![1; 1024 * 1024], you will see performance degrades dramatically.
Typically, for a non-zero element e, vec![e; n] will clone the element n times, which is the major cost. For an element equal to 0, the memory can be requested pre-zeroed from the system (alloc_zeroed), which is much faster.
So the answer to your question is no.
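A quick, unscientific way to see the gap (the element type is simplified to u8 here; exact numbers vary by machine and allocator):
use std::time::Instant;

fn main() {
    let t = Instant::now();
    let zeros = vec![0u8; 64 * 1024 * 1024]; // can come straight from zeroed pages
    println!("zeros: {:?} ({} bytes)", t.elapsed(), zeros.len());

    let t = Instant::now();
    let ones = vec![1u8; 64 * 1024 * 1024]; // every byte has to be written
    println!("ones:  {:?} ({} bytes)", t.elapsed(), ones.len());
}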

MPSImageHistogramEqualization throws assertion that offset must be < [buffer length]

I'm trying to do histogram equalization using MPSImageHistogramEqualization on iOS, but it ends up throwing an assertion I do not understand. Here is my code:
// Calculate Histogram
var histogramInfo = MPSImageHistogramInfo(
    numberOfHistogramEntries: 256,
    histogramForAlpha: false,
    minPixelValue: vector_float4(0, 0, 0, 0),
    maxPixelValue: vector_float4(1, 1, 1, 1))

let calculation = MPSImageHistogram(device: self.mtlDevice, histogramInfo: &histogramInfo)
let bufferLength = calculation.histogramSize(forSourceFormat: sourceTexture.pixelFormat)
let histogramInfoBuffer = self.mtlDevice.makeBuffer(length: bufferLength, options: [.storageModePrivate])!

calculation.encode(to: commandBuffer,
                   sourceTexture: sourceTexture,
                   histogram: histogramInfoBuffer,
                   histogramOffset: 0)

let histogramEqualization = MPSImageHistogramEqualization(device: self.mtlDevice, histogramInfo: &histogramInfo)
histogramEqualization.encodeTransform(to: commandBuffer, sourceTexture: sourceTexture, histogram: histogramInfoBuffer, histogramOffset: 0)
And here is the resulting assert that happens on that last line:
-[MTLDebugComputeCommandEncoder setBuffer:offset:atIndex:]:283: failed assertion `offset(4096) must be < [buffer length](4096).'
Any suggestions on what might be going on here?
This appears to be a bug in a specialized path in MPSImageHistogramEqualization, and I encourage you to file feedback on it.
When numberOfHistogramEntries is greater than 256, the image kernel allocates an internal buffer large enough to hold the data it needs to work with (for N=512, this is 8192 bytes), plus an extra bit of space (32 bytes). When the internal optimized256BinsUseCase flag is set, it allocates exactly 4096 bytes, omitting that last bit of extra storage. My suspicion is that subsequent operations rely on having more space after the initial data chunk, and inadvertently set the buffer offset past the length of the internal buffer.
You may be able to work around this by using a different number of histogram bins, like 512. This wastes a little space and time, but I assume it will produce the same results.
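If you try that route, the only change is the bin count when filling out the descriptor (sketch; 512 is just the next size up):
import MetalPerformanceShaders
import simd

var histogramInfo = MPSImageHistogramInfo(
    numberOfHistogramEntries: 512, // avoids the special-cased 256-bin path
    histogramForAlpha: false,
    minPixelValue: vector_float4(0, 0, 0, 0),
    maxPixelValue: vector_float4(1, 1, 1, 1))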
Alternatively, you might be able to avoid this crash by disabling the Metal validation layer, but I strongly discourage that, since you'll just be masking the underlying issue till it gets fixed.
Note: I did my reverse-engineering of the MetalPerformanceShaders framework on macOS Catalina. Different platforms and different software versions likely have different code paths.

Reading from 16-bit hardware registers

On an embedded system we have a setup that allows us to read arbitrary data over a command-line interface for diagnostic purposes. For most data, this works fine, we use memcpy() to copy data at the requested address and send it back across a serial connection.
However, for 16-bit hardware registers, memcpy() causes some problems. If I try to access a 16-bit hardware register using two 8-bit accesses, the high-order byte doesn't read correctly.
Has anyone encountered this issue? I'm a 'high-level' (C#/Java/Python/Ruby) guy that's moving closer to the hardware and this is alien territory.
What's the best way to deal with this? I see some info, specifically, a somewhat confusing [to me] post here. The author of this post has exactly the same issue I do but I hate to implement a solution without fully understanding what I'm doing.
Any light you can shed on this issue is much appreciated. Thanks!
In addition to what Eddie said, you typically need to use a volatile pointer to read a hardware register (assuming a memory-mapped register, which is not the case for all systems, but it sounds like it is true for yours). Something like:
// using types from stdint.h to ensure particular size values
// most systems that access hardware registers will have typedefs
// for something similar (for 16-bit values it might be uint16_t, INT16U,
// or something else)
uint16_t volatile* pReg = (uint16_t volatile*) 0x1234abcd; // whatever the reg address is
uint16_t val = *pReg; // read the 16-bit wide register
Here's a series of articles by Dan Saks that should give you pretty much everything you need to know to be able to effectively use memory mapped registers in C/C++:
"Mapping memory"
"Mapping memory efficiently"
"More ways to map memory"
"Sizing and aligning device registers"
"Use volatile judiciously"
"Place volatile accurately"
"Volatile as a promise"
Each register in this hardware is exposed as a two-byte array; the first element is aligned on a two-byte boundary (its address is even). memcpy() runs a loop and copies one byte per iteration, so it copies from these registers this way (all loops unrolled, char is one byte):
*((char*)target) = *((char*)register);// evenly aligned - address is always even
*((char*)target + 1) = *((char*)register + 1);//oddly aligned - address is always odd
However, the second line works incorrectly for some hardware-specific reason. If you copy two bytes at a time instead of one at a time, it is instead done this way (short int is two bytes):
*((short int*)target) = *((short int*)register); // evenly aligned
Here you copy two bytes in one operation and the first byte is evenly aligned. Since there's no separate copying from an oddly aligned address, it works.
The modified memcpy checks whether the addresses are evenly aligned and copies in two-byte chunks if they are.
If you require access to hardware registers of a specific size, then you have two choices:
Understand how your C compiler generates code so you can use the appropriate integer type to access the memory, or
Embed some assembly to do the access with the correct byte or word size.
Reading hardware registers can have side effects, depending on the register and its function, of course, so it's important to access hardware registers with the proper-sized access so you can read the entire register in one go.
Usually it's sufficient to use an integer type that is the same size as your register. On most compilers, a short is 16 bits.
#include <stddef.h>

void wordcpy(short *dest, const short *src, size_t bytecount)
{
    size_t i;
    for (i = 0; i < bytecount / 2; ++i)
        *dest++ = *src++;
}
I think all the detail is contained in the thread you posted, so I'll try to break it down a little.
Specifically:
If you access a 16-bit hardware register using two 8-bit
accesses, the high-order byte doesn't read correctly (it
always read as 0xFF for me). This is fair enough since
TI's docs state that 16-bit hardware registers must be
read and written using 16-bit-wide instructions, and
normally would be, unless you're using memcpy() to
read them.
So the problem here is that the hardware registers only report the correct value if their values are read in a single 16-bit read. This would be equivalent to doing;
uint16 value = *(regAddress);
This reads from the address into value using a single 16-bit read. On the other hand you have memcpy, which is copying data a single byte at a time. Something like:
while (n--)
{
    *(uint8*)pDest++ = *(uint8*)pSource++;
}
So this causes the registers to be read 8-bits (1 byte) at a time, resulting in the values being invalid.
The solution posted in that thread is to use a version of memcpy that copies the data using 16-bit reads wherever the source and destination are 16-bit aligned.
What do you need to know? You've already found a separate post explaining it. Apparently the CPU documentation requires that 16-bit hardware registers are accessed with 16-bit reads and writes, but your implementation of memcpy uses 8-bit reads/writes. So they don't work together.
The solution is simply not to use memcpy to access this register.
Instead, write your own routine which copies 16-bit values.
Not sure exactly what the question is - I think that post has the right solution.
As you stated, the issue is that the standard memcpy() routine reads a byte at a time, which does not work correctly for memory-mapped hardware registers. That is a limitation of the processor - there's simply no way to get a valid value reading one byte at a time.
The suggested solution is to write your own memcpy() which only works on word-aligned addresses and reads 16-bit words at a time. This is fairly straightforward - the link gives both a C and an assembly version. The only gotcha is to make sure you always do the 16-bit copies from a validly aligned address. You can do that in two ways: either use linker commands or pragmas to make sure things are aligned, or add a special case for the extra byte at the front of an unaligned buffer.
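Here's a sketch of that second option; the function name, the endian-agnostic byte handling, and the assumption that the destination is ordinary RAM are mine, not taken from the linked post:
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy out of a region of 16-bit registers using only 16-bit reads. If the
 * requested start address is odd, the containing register is still read as a
 * full 16-bit word and only the needed byte is kept. */
void reg_read_copy(void *dest, uintptr_t src, size_t len)
{
    uint8_t *d = dest;

    /* Special case: leading byte of an unaligned source. */
    if ((src & 1u) && len > 0) {
        uint16_t word = *(volatile uint16_t *)(src & ~(uintptr_t)1);
        memcpy(d, (uint8_t *)&word + 1, 1); /* odd-address half of that register */
        d++; src++; len--;
    }

    /* Bulk: one aligned 16-bit read per pair of bytes. */
    while (len >= 2) {
        uint16_t word = *(volatile uint16_t *)src;
        memcpy(d, &word, 2);                /* preserves the in-memory byte order */
        d += 2; src += 2; len -= 2;
    }

    /* Trailing odd byte: read the whole register, keep the even-address half. */
    if (len) {
        uint16_t word = *(volatile uint16_t *)src;
        memcpy(d, &word, 1);
    }
}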
