I have a Metal-based Core Image convolution kernel that was using half-precision variables to keep track of sums and weights. However, I have now found that the range of 16-bit half is not sufficient in some cases, which means I need 32-bit float for some variables.
Now I'm wondering what's more performant:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
or change all samplers and local vars to float type so that no conversion is necessary.
The latter would mean that all arithmetic is performed in 32-bit precision, even though that precision is only needed for some operations.
Is there any documentation or benchmark I can run to find the cost of float ↔ half conversion in Metal?
I believe you should go with option A:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
based on the discussion in the WWDC 2016 talk entitled "Advanced Metal Shader Optimization" linked here.
The relevant section for this topic runs from roughly 17:17 to 18:58. The speaker, Fiona, mentions a couple of important points:
A8 and later GPUs have 16-bit registers, which means that 32-bit floating-point types (like float) use twice as many registers, and therefore twice as much bandwidth, energy, etc. So using half saves registers (which is always good) and energy.
On A8 and later GPUs, "data type conversions are typically free, even between float and half [emphasis added]." Fiona even anticipates the questions you might be asking yourself about all of those conversions, and says that the code is still probably fast because the conversions are free. Furthermore, according to the Metal Shading Language Specification Version 2.3 (p. 218):
For textures that have half-precision floating-point pixel color values, the conversions from half to float are lossless
so that you don't have to worry about precision being lost as well.
There are some other relevant points in that section worth looking into as well, but I believe this is enough to justify going with option A.
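To make option A concrete, here is a minimal sketch in the Metal Shading Language, under a few assumptions: the kernel name, the 9-tap horizontal loop, and the weight buffer are hypothetical, not taken from the question. Texels are read as half, only the accumulators are promoted to float, and the result is converted back to half once at the end, so the per-iteration conversions are exactly the ones the talk says are typically free.

#include <metal_stdlib>
using namespace metal;

// Hypothetical convolution kernel: half samplers/texels, float accumulators.
kernel void convolve1D(texture2d<half, access::sample> src [[texture(0)]],
                       texture2d<half, access::write>  dst [[texture(1)]],
                       constant float                 *weights [[buffer(0)]],
                       uint2 gid [[thread_position_in_grid]])
{
    constexpr sampler s(coord::pixel, filter::nearest);

    float4 sum = 0.0f;          // 32-bit accumulators: enough range for large sums
    float  totalWeight = 0.0f;

    for (int i = -4; i <= 4; ++i) {
        half4 texel = src.sample(s, float2(gid) + float2(i, 0));
        sum         += float4(texel) * weights[i + 4];   // half -> float, typically free
        totalWeight += weights[i + 4];
    }

    // Convert back to half exactly once, when writing the result.
    dst.write(half4(sum / totalWeight), gid);
}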
I would like to understand whether using fixed point Q31 is better than floating-point (single precision) for DSP applications where accuracy is important.
More details: I am currently working with an ARM Cortex-M7 microcontroller and I need to perform an FFT with high accuracy using the CMSIS library. I understand that single precision has 24 bits of mantissa while Q31 has 31 bits, so the precision of Q31 should be better; yet I read everywhere that for algorithms involving multiplication and so on, floating-point representation should be used, and I do not understand why.
Thanks in advance.
Getting maximum value out of fixed point (that extra 6 or 7 bits of mantissa accuracy), as well as avoiding a ton of possible underflow and overflow problems, requires knowing precisely the bounds (min and max) of every arithmetic operation in your CMSIS algorithms for every valid set of input data.
In practice, a complete error analysis turns out to be difficult, and the added operations needed to rescale all intermediate values to optimal ranges reduce performance so much, that only a narrow set of cases seems worth the effort over using IEEE single or double precision, which the M7 supports in hardware, and whose floating-point exponent range hides an enormous amount (but not all!) of intermediate-result numerical scaling issues.
But for some simpler DSP algorithms, analyzing and fixing the scaling sometimes isn't a problem. It's hard to tell which without working out the numeric range of every arithmetic operation in the algorithm you need. Sometimes the work required to use integer arithmetic has to be done anyway, because the available processors don't support floating-point arithmetic well, or at all.
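To illustrate the scaling burden in code, here is a small C++ sketch. The helper names are invented (they are not CMSIS functions), but the Q1.31 multiply follows the same convention: a 64-bit intermediate product must be shifted back into range, and any sum of N such products needs explicit pre-scaling to stay within [-1, 1), while the float version lets the exponent absorb that growth.

#include <cstdint>

typedef int32_t q31_t;   // Q1.31: value = raw / 2^31, range [-1, 1)

// Multiply two Q31 values: the 64-bit product is Q2.62, shift back by 31.
static inline q31_t q31_mul(q31_t a, q31_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;   // widen first to avoid overflow
    return (q31_t)(p >> 31);               // renormalize to Q1.31
}

// A sum of N products can grow by a factor of N, so the caller must pre-scale
// the inputs (e.g. by 1/N, as each radix-2 FFT stage does) or the accumulator
// overflows. Floating point hides this bookkeeping in the exponent:
static inline float f32_mac(float acc, float a, float b)
{
    return acc + a * b;                    // no manual rescaling needed
}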
I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed pitched memory into the cuSPARSE routine but the results were incorrect (as expected, since there is no way to pass the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations, most routines don't directly support a "pitch" to the data per se; however, various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments can be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat is that the pitch must be evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect this to typically be possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
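As a hedged sketch of that idea (the wrapper name, and the assumption that each matrix was allocated with cudaMallocPitch using one column per pitched row to match CUBLAS's column-major convention, are mine, not from the answer):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cassert>

// C = A * B on pitched, column-major data: the byte pitch becomes the
// leading dimension (in elements) once divisibility has been checked.
void gemmPitched(cublasHandle_t handle, int m, int n, int k,
                 const float *A, size_t pitchA,   // m x k
                 const float *B, size_t pitchB,   // k x n
                 float       *C, size_t pitchC)   // m x n
{
    assert(pitchA % sizeof(float) == 0);   // otherwise a leading dimension
    assert(pitchB % sizeof(float) == 0);   // in elements cannot represent
    assert(pitchC % sizeof(float) == 0);   // the pitch

    const int lda = (int)(pitchA / sizeof(float));
    const int ldb = (int)(pitchB / sizeof(float));
    const int ldc = (int)(pitchC / sizeof(float));

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);
}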
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.
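For the fallback case, the pitched-to-contiguous copy mentioned above is a single cudaMemcpy2D call; the wrapper below is only an illustration.

#include <cuda_runtime.h>

// Flatten a pitched device allocation into a contiguous device buffer so it
// can be passed to routines that have no leading-dimension parameter.
cudaError_t unpitch(float *dst, const float *src, size_t srcPitchBytes,
                    size_t widthBytes, size_t height)
{
    // dst pitch == row width, i.e. the destination is densely packed.
    return cudaMemcpy2D(dst, widthBytes, src, srcPitchBytes,
                        widthBytes, height, cudaMemcpyDeviceToDevice);
}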
As stated in the Metal Shading Language Guide:
Writes to a buffer or a texture are disallowed from a fragment function.
I understand that this is the case, but I'm curious as to why. Being able to write to a buffer from within a fragment shader is incredibly useful; I understand that it is likely more complex on the hardware end to not know ahead of time the end location of memory writes for a particular thread, which you don't always know with raw buffer writes, but this is a capability exposed within Metal compute shaders, so why not within fragment shaders too?
Addendum
I should clarify why I think buffer writes from fragment functions are useful. In the most common usage case of the rasterization pipeline, triangles are being rasterized and shaded (per the fragment shader) and written into predefined memory locations, known before each fragment shader invocation and determined by the predefined mapping from the normalized device coordinates and the frame buffer. This fits most usage cases, since most of the time you just want to render triangles directly to a buffer or the screen.
There are other cases in which you might want to do a lazy write within the fragment shader, the end location of which is based off of fragment properties and not the fragment's exact location; effectively, rasterization with side effects. For instance, most GPU-based voxelization works by rendering the scene with orthographic projection from some desirable angle, and then writing into a 3D texture, mapping the XY coordinates of the fragment and its associated depth value to a location in the 3D texture. This is described here.
Other uses include some forms of order-independent transparency (transparency where draw order is unimportant, allowing for overlapping transparent objects). One solution is to use a multi-layered frame buffer, and then to sort and blend the fragments based upon their depth values in a separate pass. Since there's no hardware support for doing this on most GPUs (Intel's, I believe, have hardware acceleration for it), you have to maintain atomic counters and perform manual texture/buffer writes from each pixel to coordinate writes to the layered frame buffer.
Yet another example might be extraction of virtual point lights for GI through rasterization (i.e. you write out point lights for relevant fragments as you rasterize). In all of these usage cases, buffer writes from fragment shaders are required, because ROPs only store one resulting fragment for each pixel. The only way to achieve equivalent results without this feature is by some manner of depth peeling, which is horribly slow for scenes of high depth complexity.
Now I realize that the examples I gave aren't really all about buffer writes in particular, but more generally about the idea of dynamic memory writes from fragment shaders, ideally along with support for atomicity. Buffer writes just seem like a simple issue, and their inclusion would go a long way towards improving the situation.
Since I wasn't getting any answers here, I ended up posting the question on Apple's developer forums. I got more feedback there, but still no real answer. Unless I am missing something, it seems that virtually every OS X device which officially supports Metal has hardware support for this feature. And as I understand, this feature first started popping up in GPUs around 2009. It's a common feature in both current DirectX and OpenGL (not even considering DX12 or Vulkan), so Metal would be the only "cutting-edge" API which lacks it.
I realize that this feature might not be supported on PowerVR hardware, but Apple has had no issue differentiating the Metal Shading Language by feature set. For instance, Metal on iOS allows for "free" frame buffer fetches within fragment shaders, which is directly supported in hardware by the cache-heavy PowerVR architecture. This feature manifests itself directly in the Metal Shading Language, as it allows you to declare fragment function inputs with the [[color(m)]] attribute qualifier for iOS shaders. Arguably allowing declaration of buffers with the device storage space qualifier, or textures with access::write, as inputs to fragment shaders, would be no greater semantic change to the language than what Apple has done to optimize for iOS. So, as far as I'm concerned, a lack of support by PowerVR would not explain the lack of the feature I'm looking for on OS X.
Writes to buffers from fragment shaders are now supported, as mentioned in
What’s New in iOS 10, tvOS 10, and macOS 10.12
Function Buffer Read-Writes
Available in: iOS_GPUFamily3_v2, OSX_GPUFamily1_v2
Fragment functions can now write to buffers. Writable buffers must be declared in the device address space and must not be const. Use dynamic indexing to write to a buffer.
Moreover, the line specifying the restriction (quoted in the original question) is no longer present in the Metal Shading Language Specification 2.0.
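As a sketch of what this enables (the struct, buffer indices, and function name are invented for illustration, loosely following the virtual-point-light extraction use case from the question; it assumes atomic operations on device buffers are available to fragment functions on the listed GPU families), a fragment function can now take a non-const device buffer and write to it through a dynamically computed index:

#include <metal_stdlib>
using namespace metal;

struct VPL {
    float3 position;
    float3 flux;
};

// Hypothetical fragment function that appends a virtual point light per
// fragment while still returning a color for the attached render target.
fragment float4 extractVPL(float4              pos     [[position]],
                           device VPL         *vpls    [[buffer(0)]],
                           device atomic_uint *counter [[buffer(1)]])
{
    // Reserve a unique slot so concurrent fragments don't collide.
    uint slot = atomic_fetch_add_explicit(counter, 1u, memory_order_relaxed);

    vpls[slot].position = float3(pos.xy, 0.0);  // placeholder payload
    vpls[slot].flux     = float3(1.0);

    return float4(0.0);
}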
I don't think you can write to arbitrary pixels or texels from a fragment function in OpenGL or DirectX either. The rendering API is one thing; the fragment and vertex functions are another.
A fragment function is intended to produce one pixel/texel output per invocation, even if each output has multiple channels. Usually, if you want to write to a buffer or texture, you render something (a quad, a triangle, or similar) using your fragment function over a surface (a buffer or texture); as a result, each pixel/texel is rendered by your fragment function. Raycasting and raytracing fragment functions, for example, usually use this approach.
There is a good reason for not allowing you to write arbitrary pixels/texels: parallelization. On most GPUs the fragment function is executed for many different pixels/texels at once, in a highly parallel fashion; each GPU has its own way of parallelizing (SMP, vector units, ...), but all are highly parallel. So you can write only by returning one pixel's or texel's channel values as the result of the fragment function, in order to avoid common parallelization problems such as races. This applies to every graphics library I know of.
For building a scene graph a decision needs to be made between using TFixedPoint and TFloatPoint for all geometries and math. GR32 uses both Fixed and Float.
Why are there two point types in GR32?
Which is faster / more efficient?
Which is safer?
Any other suggestions re this issue?
Operational boundaries for the graph:
max 500 primitives/elements per node, avg is 20
max 2000 nodes per scene, avg is 250
Features for the graph:
Graphics are 2d
Graphics must be of a high visual quality
Animation is required
Isometric projection is required
The intended use for the graph:
Business graphics (charts, grids etc)
Modeling tool
Textual presentation
Process simulations
Fixed-point math is generally faster, so TFixedPoint will perform better for the mathematics.
Floating point can (depending on the degree of precision employed for the fixed-point values) provide greater precision than fixed point, but its mathematical routines will not perform as quickly.
"Safety" is too subjective to answer... safer how?
As per your final part-question, it depends what you value more: precision or performance.
If precision is your primary objective, go with TFloatPoint. If performance is your primary objective, go with TFixedPoint.
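For context on what the trade-off looks like at the arithmetic level: GR32's TFixedPoint stores its coordinates as TFixed, a 32-bit integer with 16 fractional bits (one unit = 65536). The sketch below (plain C++ with invented names, not GR32 code) shows the shift-based multiply that makes fixed point cheap, and the conversions that bound its precision to 1/65536.

#include <cstdint>

typedef int32_t Fixed;                 // 16.16 fixed point: 1.0 == 65536
const Fixed FixedOne = 65536;

// Multiply via a 64-bit intermediate, then drop the extra 16 fraction bits.
inline Fixed FixedMul(Fixed a, Fixed b)
{
    return (Fixed)(((int64_t)a * b) >> 16);
}

inline Fixed FloatToFixed(float f) { return (Fixed)(f * 65536.0f); }
inline float FixedToFloat(Fixed x) { return x / 65536.0f; }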
I am using Lua 5.1
print(10.08 - 10.07)
Rather than printing 0.01, the above prints 0.0099999999999998.
Any idea how to get 0.01 from this subtraction?
You did get (essentially) 0.01 from the subtraction; it is just that 0.01 is a repeating fraction in binary, so a tiny amount of precision is lost.
Lua uses the C type double to represent numbers. On nearly every platform you will use, this is a 64-bit binary floating-point value with a 53-bit significand, roughly 16 decimal digits of precision. However, no finite amount of binary precision is sufficient to represent 0.01 exactly. The situation is similar to attempting to write 1/3 in decimal.
Furthermore, you are subtracting two values that are very similar in magnitude. That all by itself causes an additional loss of precision.
The solution depends on what your context is. If you are doing accounting, then I would strongly recommend that you not use floating point values to represent account values because these small errors will accumulate and eventually whole cents (or worse) will appear or disappear. It is much better to store accounts in integer cents, and divide by 100 for display.
In general, the answer is to be aware of the issues that are inherent to floating point, and one of them is this sort of small loss of precision. It is easily handled by rounding answers to an appropriate number of digits for display, and never comparing results of calculations for equality.
Some resources for background:
The semi-official explanation at the Lua Users Wiki
This great page of IEEE floating point calculators where you can enter values in decimal and see how they are represented, and vice-versa.
Wiki on IEEE floating point.
Wiki on floating point numbers in general.
What Every Computer Scientist Should Know About Floating-Point Arithmetic is the most complete discussion of the fine details.
Edit: Added the WECSSKAFPA document after all. It really is worth the study, although it will likely seem a bit overwhelming on the first pass. In the same vein, Knuth Volume 2 has extensive coverage of arithmetic in general and a large section on floating point. And, since lhf reminded me of its existence, I inserted the Lua wiki explanation of why floating point is ok to use as the sole numeric type as the first item on the list.
Use Lua's string.format function:
print(string.format('%.02f', 10.08 - 10.07))
Use an arbitrary precision library.
Use a Decimal Number Library for Lua.