Which is faster in DirectX when sending data to the vertex shader?
struct VertexInputType
{
float4 data : DATA; // x,y - POSITION, z - distance, w - size
};
vs
struct VertexInputType
{
float2 pos : POSITION;
float distance : DISTANCE;
float size : SIZE;
};
A wild guess would be that the first one is faster because it packs into a single 128-bit register, but I suspect there is a better answer.
If you are thinking about memory transfer between CPU and GPU:
If these all come from the same buffer object, it shouldn't matter. The second layout is just an interpretation of the data that is known to the shader; it has nothing to do with the actual data being transferred. If case 2 used multiple vertex streams, performance might differ, but that difference would come from the buffer setup, not from the format declared in the shader.
If you are worried about vertex cache efficiency:
In both cases 16 bytes will be stored and retrieved per vertex so there is no difference here either.
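To make the "same bytes, different interpretation" point concrete, here is a small C++ sketch (the semantic names and offsets are assumptions based on the structs above, not code from the question) showing that both HLSL input structs can be fed from the very same 16-byte-per-vertex buffer; only the input layout describing it differs:
#include <d3d11.h>

// Input layout matching "float4 data : DATA" -- one 16-byte element.
const D3D11_INPUT_ELEMENT_DESC packedLayout[] =
{
    { "DATA", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};

// Input layout matching the split struct -- three elements over the same 16 bytes.
const D3D11_INPUT_ELEMENT_DESC splitLayout[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32_FLOAT, 0, 0,  D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "DISTANCE", 0, DXGI_FORMAT_R32_FLOAT,    0, 8,  D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "SIZE",     0, DXGI_FORMAT_R32_FLOAT,    0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};
Either layout pulls the same 16 bytes per vertex from the buffer, which is why the transfer and cache cost is identical.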
I'm about to port an iOS app written in C++ that utilizes OpenGL to Apple's Metal. The goal is to completely get rid of OpenGL and replace it with Metal.
The OpenGL code is layered and I'm attempting to just replace the renderer, i.e. the class that actually calls OpenGL functions. However, the entire code base utilizes the GLM math library to represent vectors and matrices.
For example there is a camera class that provides the view and projection matrix. Both of them are of type glm::mat4 and are simply passed to the GLSL vertex shader where they are compatible with the mat4 data type given by GLSL. I would like to utilize that camera class as it is to send those matrices to the Metal vertex shader. Now, I'm not sure whether glm::mat4 is compatible with Metal's float4x4.
I don't have a working example where I can test this because I literally just started with Metal and can't find anything useful online.
So my questions are as follows:
Are GLM types such as glm::mat4 and glm::vec4 compatible with Metal's float4x4 / float4?
If the answer to question 1 is yes, are there any disadvantages to using GLM types directly in Metal shaders?
The background for question 2 is that I came across Apple's SIMD library, which provides another set of data types that I would not be able to use in that case, right?
The app is iOS only; I don't care about running Metal on macOS at all.
Code snippets (preferably Objective-C (yes, no joke)) would be very welcome.
Overall, the answer is yes, GLM is a good fit for apps that utilize Apple's Metal. However, there are a couple of things that need to be considered. Some of those things have already been hinted at in the comments.
First of all, the Metal Programming Guide mentions that
Metal defines its Normalized Device Coordinate (NDC) system as a 2x2x1 cube with its center at (0, 0, 0.5)
This means that Metal NDC coordinates are different from OpenGL NDC coordinates because OpenGL defines the NDC coordinate system as a 2x2x2 cube with its center at (0, 0, 0), i.e. valid OpenGL NDC coordinates must be within
// Valid OpenGL NDC coordinates
-1 <= x <= 1
-1 <= y <= 1
-1 <= z <= 1
Because GLM was originally tailored for OpenGL, its glm::ortho and glm::perspective functions create projection matrices that transform coordinates into OpenGL NDC coordinates. Because of this, it is necessary to adjust those matrices for Metal. How this could be achieved is outlined in this blog post.
However, there is a more elegant way to fix those coordinates. Interestingly, Vulkan utilizes the same NDC coordinate system as Metal, and GLM has already been adapted to work with Vulkan (a hint for this can be found here).
By defining the C/C++ preprocessor macro GLM_FORCE_DEPTH_ZERO_TO_ONE, the mentioned GLM projection matrix functions will transform coordinates to work with Metal's / Vulkan's NDC coordinate system. That #define hence solves the problem of the different NDC coordinate systems.
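A minimal C++ sketch of that define (the projection parameters are just placeholders); note that GLM's GLM_FORCE_* macros must be defined before the first GLM header is included:
#define GLM_FORCE_DEPTH_ZERO_TO_ONE   // must come before any GLM include
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// With the macro defined, glm::perspective maps depth into [0, 1],
// matching Metal's / Vulkan's NDC convention.
glm::mat4 projection = glm::perspective(glm::radians(60.0f), // vertical field of view (placeholder)
                                        16.0f / 9.0f,        // aspect ratio (placeholder)
                                        0.1f,                // near plane (placeholder)
                                        100.0f);             // far plane (placeholder)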
Next, it is important to take both the size and alignment of GLM's and Metal's data types into account when exchanging data between Metal shaders and client side (CPU) code. Apple's Metal Shading Language Specification lists both size and alignment for some of its data types.
For the data types that aren't listed in there, size and alignment can be determined by utilizing C/C++'s sizeof and alignof operators. Interestingly, both operators are supported within Metal shaders. Here are a couple of examples for both GLM and Metal:
// Size and alignment of some GLM example data types
glm::vec2 : size: 8, alignment: 4
glm::vec3 : size: 12, alignment: 4
glm::vec4 : size: 16, alignment: 4
glm::mat4 : size: 64, alignment: 4
// Size and alignment of some of Metal example data types
float2 : size: 8, alignment: 8
float3 : size: 16, alignment: 16
float4 : size: 16, alignment: 16
float4x4 : size: 64, alignment: 16
packed_float2 : size: 8, alignment: 4
packed_float3 : size: 12, alignment: 4
packed_float4 : size: 16, alignment: 4
As can be seen from the above table the GLM vector data types match nicely with Metal's packed vector data types both in terms of size and alignment. Note however, that the 4x4 matrix data types don't match in terms of alignment.
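If you want to double-check these numbers for your own GLM version and configuration, a small C++ sketch like the following can verify them at compile time (the expected values simply restate the table above and will differ if, for example, GLM_FORCE_DEFAULT_ALIGNED_GENTYPES is defined):
#include <cstdio>
#include <glm/glm.hpp>

// Compile-time checks against the values listed above.
static_assert(sizeof(glm::vec2) == 8  && alignof(glm::vec2) == 4, "unexpected glm::vec2 layout");
static_assert(sizeof(glm::vec3) == 12 && alignof(glm::vec3) == 4, "unexpected glm::vec3 layout");
static_assert(sizeof(glm::vec4) == 16 && alignof(glm::vec4) == 4, "unexpected glm::vec4 layout");
static_assert(sizeof(glm::mat4) == 64 && alignof(glm::mat4) == 4, "unexpected glm::mat4 layout");

int main()
{
    // The same operators also work inside Metal shaders for the Metal-side types.
    std::printf("glm::mat4: size %zu, alignment %zu\n", sizeof(glm::mat4), alignof(glm::mat4));
    return 0;
}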
According to this answer to another SO question, alignment means the following:
Alignment is a restriction on which memory positions a value's first byte can be stored at. (It is needed to improve performance on processors and to permit the use of certain instructions that work only on data with a particular alignment; for example, SSE needs data aligned to 16 bytes, while AVX needs 32 bytes.)
Alignment of 16 means that memory addresses that are a multiple of 16 are the only valid addresses.
Therefore we need to be careful to factor in the different alignments when sending 4x4 matrices to Metal shaders. Let's look at an example:
The following Objective-C struct serves as a buffer to store uniform values to be sent to a Metal vertex shader:
typedef struct
{
glm::mat4 modelViewProjectionMatrix;
glm::vec2 windowScale;
glm::vec4 edgeColor;
glm::vec4 selectionColor;
} SolidWireframeUniforms;
This struct is defined in a header file that is included wherever it is required in client side (i.e. CPU side) code. To be able to utilize those values on the Metal vertex shader side we need a corresponding data structure. In case of this example the Metal vertex shader part looks as follows:
#include <metal_matrix>
#include <metal_stdlib>
using namespace metal;
struct SolidWireframeUniforms
{
float4x4 modelViewProjectionMatrix;
packed_float2 windowScale;
packed_float4 edgeColor;
packed_float4 selectionColor;
};
// VertexShaderInput struct defined here...
// VertexShaderOutput struct defined here...
vertex VertexShaderOutput solidWireframeVertexShader(VertexShaderInput input [[stage_in]], constant SolidWireframeUniforms &uniforms [[buffer(1)]])
{
VertexShaderOutput output;
// vertex shader code ...
return output;
}
To transmit data from the client side code to the Metal shader the uniform struct is packaged into a buffer. The below code shows how to create and update that buffer:
- (void)createUniformBuffer
{
_uniformBuffer = [self.device newBufferWithBytes:(void*)&_uniformData length:sizeof(SolidWireframeUniforms) options:MTLResourceCPUCacheModeDefaultCache];
}
- (void)updateUniforms
{
dispatch_semaphore_wait(_bufferAccessSemaphore, DISPATCH_TIME_FOREVER);
SolidWireframeUniforms* uniformBufferContent = (SolidWireframeUniforms*)[_uniformBuffer contents];
memcpy(uniformBufferContent, &_uniformData, sizeof(SolidWireframeUniforms));
dispatch_semaphore_signal(_bufferAccessSemaphore);
}
Note the memcpy call that is used to update the buffer. This is where things can go wrong if the size and alignment of the GLM and Metal data types don't match. Since we simply copy every byte of the Objective-C struct into the buffer and then reinterpret that data on the Metal shader side, the data will be misinterpreted there if the two data structures don't match.
In the case of that example, the memory layout looks as follows:
104 bytes
|<--------------------------------------------------------------------------->|
| |
| 64 bytes 8 bytes 16 bytes 16 bytes |
| modelViewProjectionMatrix windowScale edgeColor selectionColor |
|<------------------------->|<----------->|<--------------->|<--------------->|
| | | | |
+--+--+--+------------+--+--+--+-------+--+--+-----------+--+--+----------+---+
Byte index | 0| 1| 2| ... |62|63|64| ... |71|72| ... |87|88| ... |103|
+--+--+--+------------+--+--+--+-------+--+--+-----------+--+--+----------+---+
^ ^ ^
| | |
| | +-- Is a multiple of 4, aligns with glm::vec4 / packed_float4
| |
| +-- Is a multiple of 4, aligns with glm::vec4 / packed_float4
|
+-- Is a multiple of 4, aligns with glm::vec2 / packed_float2
With the exception of the 4x4 matrix alignment, everything matches well. The misalignment of the 4x4 matrix poses no problem here, as visible in the above memory layout. However, if the uniform struct gets modified, alignment or size could become a problem and padding might be necessary for it to work properly.
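For illustration, here is a hypothetical variant of the struct (not part of the original code): if the Metal side declared edgeColor as a plain float4 (alignment 16) instead of packed_float4, the CPU-side struct would need explicit padding so that edgeColor starts at offset 80 rather than 72:
#include <glm/glm.hpp>

// Hypothetical padded variant -- only needed if the Metal struct used float4
// (16-byte alignment) instead of packed_float4 for the color members.
typedef struct
{
    glm::mat4 modelViewProjectionMatrix; // bytes  0..63
    glm::vec2 windowScale;               // bytes 64..71
    float     _padding[2];               // bytes 72..79, mirrors the padding Metal would insert before a float4
    glm::vec4 edgeColor;                 // bytes 80..95
    glm::vec4 selectionColor;            // bytes 96..111
} SolidWireframeUniformsPadded;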
Lastly, there is something else to be aware of. The alignment of the data types has an impact on the size that needs to be allocated for the uniform buffer. Because the largest alignment that occurs in the SolidWireframeUniforms struct is 16, it seems that the length of the uniform buffer must also be a multiple of 16.
This is not the case in the above example, where the buffer length is 104 bytes which is not a multiple of 16. When running the app directly from Xcode, a built-in assertion prints the following message:
validateFunctionArguments:3478: failed assertion `Vertex Function(solidWireframeVertexShader): argument uniforms[0] from buffer(1) with offset(0) and length(104) has space for 104 bytes, but argument has a length(112).'
In order to resolve this, we need to make the size of the buffer a multiple of 16 bytes. To do so we just calculate the next multiple of 16 based on the actual length we need. For 104 that would be 112, which is what the assertion above also tells us.
The following function calculates the next multiple of 16 for a specified integer:
- (NSUInteger)roundUpToNextMultipleOf16:(NSUInteger)number
{
NSUInteger remainder = number % 16;
if(remainder == 0)
{
return number;
}
return number + 16 - remainder;
}
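As an aside, because 16 is a power of two, the same rounding can also be written with a bit mask; an equivalent C-style sketch (not used in the code below):
// Round up to the next multiple of 16 by clearing the low 4 bits of (number + 15).
size_t roundUpToNextMultipleOf16Alt(size_t number)
{
    return (number + 15) & ~static_cast<size_t>(15);
}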
Now we calculate the length of the uniform buffer using roundUpToNextMultipleOf16:, which changes the buffer creation method (posted above) as follows:
- (void)createUniformBuffer
{
NSUInteger bufferLength = [self roundUpToNextMultipleOf16:sizeof(SolidWireframeUniforms)];
_uniformBuffer = [self.device newBufferWithBytes:(void*)&_uniformData length:bufferLength options:MTLResourceCPUCacheModeDefaultCache];
}
That should resolve the issue detected by the mentioned assertion.
I'm doing depth peeling with a very simple fragment function:
struct VertexOut {
float4 position [[ position ]];
};
fragment void depthPeelFragment(VertexOut in [[ stage_in ]],
depth2d<float, access::read> previousDepth)
{
float4 p = in.position;
if(!is_null_texture(previousDepth) && p.z <= previousDepth.read(uint2(p.xy)))
{
discard_fragment();
}
}
(My depth buffer pixel format is MTLPixelFormatDepth32Float)
This works well on my Mac. In each pass I submit the same geometry, and eventually no more fragments are written and the process terminates. For example, with a test sphere, there are two passes with the same number of fragments written each pass (front and back hemispheres).
However on iPad, the process does not terminate. There are some (not all) fragments which, despite being rendered in the previous pass, are not discarded in subsequent passes.
What platform differences could cause this?
Is the z-coordinate of the position attribute always the value written to the depth buffer?
According to an Apple engineer, it's not a logarithmic depth buffer.
Note that I cannot simply limit the number of passes (I'm not using this for OIT).
Update:
Here's what the depth texture looks like on the 3rd pass via GPU capture (the green represents the bounds of rendered geometry):
The distribution of points makes this look like a floating point accuracy issue.
Furthermore, if I add an epsilon to the previous depth buffer:
p.z <= previousDepth.read(uint2(p.xy))+.0000001
then the process does terminate on the iPad. However the results aren't accurate enough for downstream use.
In OpenGL, depth buffer values are calculated based on the near and far clipping planes of the scene. (Reference: Getting the true z value from the depth buffer)
How does this work in WebGL? My understanding is that WebGL is unaware of my scene's far and near clipping planes. The near and far clipping planes are used to calculate my projection matrix, but I never tell WebGL what they are explicitly so it can't use them to calculate depth buffer values.
How does WebGL set values in the depth buffer when my scene is rendered?
WebGL (like modern OpenGL and OpenGL ES) gets the depth value from the value you supply to gl_Position.z in your vertex shader (though you can also write directly to the depth buffer using certain extensions, but that's far less common).
There is no scene in WebGL nor in modern OpenGL. That concept of a scene is part of legacy OpenGL, left over from the early 90s and long since deprecated. It doesn't exist in OpenGL ES (the OpenGL that runs on Android, iOS, ChromeOS, Raspberry Pi, WebGL, etc.).
Modern OpenGL and WebGL are just rasterization APIs. You write shaders which are small functions that run on the GPU. You provide those shaders with data through attributes (per iteration data), uniforms (global variables), textures (2d/3d arrays), varyings (data passed from vertex shaders to fragment shaders).
The rest is up to you and what your supplied shader functions do. Modern OpenGL and WebGL are, for all intents and purposes, just generic computing engines with certain limits. Getting them to do anything is up to the shaders you supply.
See webglfundamentals.org for more.
In the Q&A you linked to, it's the programmer-supplied shaders that use frustum math to set gl_Position.z. The frustum math is supplied by the programmer; WebGL/GL don't care how gl_Position.z is computed, only that the resulting gl_Position.z / gl_Position.w ends up between -1.0 and +1.0. So how to take a value from the depth buffer and go back to Z is solely determined by how the programmer decided to calculate it in the first place.
This article covers the most commonly used math for setting gl_Position.z when rendering 3d with WebGL/OpenGL. Based on your question though I'd suggest reading the preceding articles linked at the beginning of that one.
As for what actual values get written to the depth buffer, it's:
ndcZ = gl_Position.z / gl_Position.w;
depthValue = (far - near) / 2 * ndcZ + (far + near) / 2
near and far default to 0 and 1 respectively (you can set them with gl.depthRange), so assuming they are 0 and 1:
ndcZ = gl_Position.z / gl_Position.w;
depthValue = .5 * ndcZ + .5
That depthValue would then be in the 0 to 1 range and converted to whatever bit depth the depth buffer is. It's common to have a 24-bit depth buffer, so
bitValue = depthValue * (2^24 - 1)
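Putting those steps together, a minimal C++ sketch of the computation (assuming the default depth range of 0 to 1 and a 24-bit depth buffer):
#include <cstdint>

// clipZ and clipW are gl_Position.z and gl_Position.w for the fragment.
uint32_t depthBufferBits(float clipZ, float clipW)
{
    float ndcZ       = clipZ / clipW;          // perspective divide, expected in [-1, +1]
    float depthValue = 0.5f * ndcZ + 0.5f;     // depth-range transform to [0, 1]
    return static_cast<uint32_t>(depthValue * ((1u << 24) - 1)); // quantize to 24 bits
}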
I'm trying to render a forest scene for an iOS app with OpenGL. To make it a little bit nicer, I'd like to add a depth effect to the scene. However, I need a linearized depth value from the OpenGL depth buffer to do so. Currently I am using a computation in the fragment shader (which I found here).
Therefore my terrain fragment shader looks like this:
#version 300 es
precision mediump float;
uniform float nearz; // near clipping plane distance (declaration added; the original snippet omitted it)
uniform float farz;  // far clipping plane distance (declaration added; the original snippet omitted it)
layout(location = 0) out lowp vec4 out_color;
float linearizeDepth(float depth) {
return 2.0 * nearz / (farz + nearz - depth * (farz - nearz));
}
void main(void) {
float depth = gl_FragCoord.z;
float linearized = (linearizeDepth(depth));
out_color = vec4(linearized, linearized, linearized, 1.0);
}
However, this results in the following output:
As you can see, the "further" you get away, the more "stripy" the resulting depth value gets (especially behind the ship). If the terrain tile is close to the camera, the output is somewhat okay.
I even tried another computation:
float linearizeDepth(float depth) {
return 2.0 * nearz * farz / (farz + nearz - (2.0 * depth - 1.0) * (farz - nearz));
}
which resulted in a way too high value so I scaled it down by dividing:
float linearized = (linearizeDepth(depth) - 2.0) / 40.0;
Nevertheless, it gave a similar result.
So how do I achieve a smooth, linear transition between the near and the far plane, without any stripes? Has anybody had a similar problem?
The problem is that you store non-linear values which are truncated, so when you read the depth values back later you get a choppy result, because you lose accuracy the farther you are from the znear plane. No matter what you evaluate, you will not obtain better results unless you:
Lower accuracy loss
You can change the znear, zfar values so they are closer together. Enlarge znear as much as you can so that the more accurate area covers more of your scene.
Another option is to use more bits per depth value (16 bits is too low); I am not sure if you can do this in OpenGL ES, but in standard OpenGL you can use 24 or 32 bits on most cards (see the sketch below).
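A sketch of that second idea (assuming OpenGL ES 3.0 and that you render into your own framebuffer object; the function and variable names are made up): request a 24-bit depth attachment instead of a 16-bit one.
#include <OpenGLES/ES3/gl.h> // iOS header; use <GLES3/gl3.h> on other platforms

GLuint createDepth24Renderbuffer(GLsizei width, GLsizei height)
{
    GLuint depthRbo = 0;
    glGenRenderbuffers(1, &depthRbo);
    glBindRenderbuffer(GL_RENDERBUFFER, depthRbo);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, width, height);
    // Attach to the currently bound framebuffer object.
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, depthRbo);
    return depthRbo;
}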
Use a linear depth buffer
So store linear values in the depth buffer. There are two ways. One is to compute depth so that, after all the underlying operations, you end up with a linear value.
Another option is to use a separate texture/FBO and store the linear depths directly in it. The problem is that you cannot use its contents in the same rendering pass.
[Edit1] Linear Depth buffer
To linearize depth buffer itself (not just the values taken from it) try this:
Vertex:
varying float depth;
void main()
{
vec4 p=ftransform();
depth=p.z;
gl_Position=p;
gl_FrontColor = gl_Color;
}
Fragment:
uniform float znear,zfar;
varying float depth; // original z in camera space, used instead of gl_FragCoord.z because gl_FragCoord.z is already truncated
void main(void)
{
float z=(depth-znear)/(zfar-znear);
gl_FragDepth=z;
gl_FragColor=gl_Color;
}
Non linear Depth buffer linearized on CPU side (as you do):
Linear Depth buffer GPU side (as you should):
The scene parameters are:
// 24 bits per Depth value
const double zang = 60.0;
const double znear= 0.01;
const double zfar =20000.0;
and a simple rotated plate covering the whole depth field of view. Both images are taken by glReadPixels(0,0,scr.xs,scr.ys,GL_DEPTH_COMPONENT,GL_FLOAT,zed); and transformed to a 2D RGB texture on the CPU side, then rendered as a single QUAD covering the whole screen with unit matrices ...
Now, to obtain the original depth value from the linear depth buffer, you just do this:
z = znear + (zfar-znear)*depth_value;
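Applied to the glReadPixels read-back above, a small C++ sketch of that conversion (assuming zed holds scr.xs * scr.ys floats in the 0..1 range):
// Convert linear depth-buffer values back to camera-space depth in [znear, zfar].
void linearDepthToCameraZ(const float *zed, float *cameraZ, int count, float znear, float zfar)
{
    for (int i = 0; i < count; ++i)
        cameraZ[i] = znear + (zfar - znear) * zed[i];
}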
I used the ancient stuff just to keep this simple, so port it to your profile ...
Beware, I do not code in OpenGL ES nor iOS, so I hope I did not miss something related to that (I am used to Win and PC).
To show the difference, I added another rotated plate to the same scene (so they intersect) and used colored output (no depth read-back anymore):
As you can see, the linear depth buffer is much, much better (for scenes covering a large part of the depth FOV).
I have a vertex buffer with an unordered access view, which I'm using to fill the vertices using a compute shader, which treats the UAV as a RWStructuredBuffer, using an equivalent struct to the vertex definition. There are 216000 vertices (i.e. 60 x 60 x 60). But my compute shader seems to fill only about 8000 of them, leaving the rest with their initial values. Is there a limit on the number of elements in a structured buffer that can be written in this way?
As it turns out, if you turn on DirectX error-checking, assigning the UAV of a vertex buffer as an RWStructuredBuffer in the shader is considered an error. So although this actually works (for a limited number of vertices), it's not supported.
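For reference, "turning on DirectX error-checking" here means creating the device with the debug layer enabled so bindings like this are reported at runtime; a minimal C++ sketch (the helper name is made up):
#include <d3d11.h>

HRESULT createDeviceWithDebugLayer(ID3D11Device **device, ID3D11DeviceContext **context)
{
    UINT flags = 0;
#if defined(_DEBUG)
    flags |= D3D11_CREATE_DEVICE_DEBUG; // emits validation messages, e.g. about unsupported UAV bindings
#endif
    return D3D11CreateDevice(nullptr,                  // default adapter
                             D3D_DRIVER_TYPE_HARDWARE,
                             nullptr,                  // no software rasterizer module
                             flags,
                             nullptr, 0,               // default feature levels
                             D3D11_SDK_VERSION,
                             device,
                             nullptr,                  // created feature level (not needed here)
                             context);
}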