How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on the GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed pitched memory into the cuSPARSE routine, but the results were incorrect (as expected, since there is no way to pass the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function to enforce that the memory is not allocated with cudaMallocPitch?

For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations, most routines don't directly support a "pitch" to the data per se; however, various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch is evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect that this would be typically possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised, if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.
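To make the leading-dimension approach concrete, here is a minimal sketch (sizes and names are hypothetical, not from the original post) of feeding the byte pitch returned by cudaMallocPitch into cublasSgemm as the lda/ldb/ldc arguments. It assumes column-major storage (the cuBLAS convention), so each padded "row" of the pitched allocation holds one matrix column; error checking is omitted for brevity.

// Sketch: pitched allocations used directly in a cuBLAS gemm via leading dimensions.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int M = 512, N = 256, K = 384;          // example dimensions
    float *dA, *dB, *dC;
    size_t pA, pB, pC;
    cudaMallocPitch((void**)&dA, &pA, M * sizeof(float), K);   // A: M x K
    cudaMallocPitch((void**)&dB, &pB, K * sizeof(float), N);   // B: K x N
    cudaMallocPitch((void**)&dC, &pC, M * sizeof(float), N);   // C: M x N

    // The run-time check advised above: the pitch must be a whole number of elements.
    if (pA % sizeof(float) || pB % sizeof(float) || pC % sizeof(float)) {
        printf("pitch not element-aligned; fall back to an unpitched copy\n");
        return 1;
    }
    const int lda = (int)(pA / sizeof(float));    // leading dimensions in elements,
    const int ldb = (int)(pB / sizeof(float));    // covering the pitch padding
    const int ldc = (int)(pC / sizeof(float));

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, computed directly on the pitched buffers
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}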

Performance cost of float ↔︎ half conversion in Metal

I have a Metal-based Core Image convolution kernel that was using half-precision variables to keep track of sums and weights. However, I have now realized that the range of 16-bit half is not enough in some cases, which means I need 32-bit float for some variables.
Now I'm wondering what's more performant:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
or change all samplers and local vars to float type so that no conversion is necessary.
The latter would mean that all arithmetic is performed in 32-bit precision, though it would only be needed for some operations.
Is there any documentation or benchmark I can run to find the cost of float ↔︎ half conversion in Metal?
I believe you should go with option A:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
based on the discussion in the WWDC 2016 talk entitled "Advanced Metal Shader Optimization".
The relevant section for this topic runs from roughly 17:17 to 18:58. The speaker, Fiona, mentions a couple of important points:
A8 and later GPUs have 16-bit registers, which means that 32-bit floating-point formats (like float) use twice as many registers, which means twice as much bandwidth, energy, etc. So using half saves registers (which is always good) and energy
On A8 and later GPUs, "data type conversions are typically free, even between float and half [emphasis added]." Fiona even anticipates exactly the kind of question you are asking about all of those conversions, and says it is still probably fast because the conversions are free. Furthermore, according to the Metal Shading Language Specification, Version 2.3 (p. 218):
For textures that have half-precision floating-point pixel color values, the conversions from half to float are lossless
so you don't have to worry about losing precision either.
There are some other relevant points in that section worth looking into as well, but I believe this is enough to justify going with option A.
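To illustrate option A in code: the question is about Metal, but the same storage-in-half / accumulate-in-float pattern can be sketched as a CUDA kernel (the kernel name and the 1-D weighted sum are purely illustrative, not the poster's actual filter).

// Sketch of the mixed-precision pattern: samples and weights stay in 16-bit
// half (less register and bandwidth pressure); the running sums are widened
// to 32-bit float because half's range is not enough for them.
#include <cuda_fp16.h>

__global__ void weightedAverage(const __half* samples, const __half* weights,
                                float* out, int n, int ksize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + ksize > n) return;
    float sum = 0.0f, wsum = 0.0f;              // float accumulators
    for (int k = 0; k < ksize; ++k) {
        // Convert on use, inside the loop; on recent GPUs these conversions are cheap.
        float s = __half2float(samples[i + k]);
        float w = __half2float(weights[k]);
        sum  += s * w;
        wsum += w;
    }
    out[i] = (wsum != 0.0f) ? sum / wsum : 0.0f;
}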

Precision of Q31 and SP for FFT ARM Cortex-M7

I would like to understand whether using fixed-point Q31 is better than floating point (single precision) for DSP applications where accuracy is important.
More details: I am currently working with an ARM Cortex-M7 microcontroller and I need to perform an FFT with high accuracy using the CMSIS library. I understand that single precision has 24 bits of mantissa while Q31 has 31 bits, so the precision of Q31 should be better, but I read everywhere that floating point should be used for algorithms that involve multiplication and the like, and I do not understand why.
Thanks in advance.
Getting the maximum value out of fixed point (that extra 6 or 7 bits of mantissa accuracy), as well as avoiding a ton of possible underflow and overflow problems, requires knowing precisely the bounds (min and max) of every arithmetic operation in your CMSIS algorithms for every valid set of input data.
In practice, a complete error analysis turns out to be difficult, and the added operations needed to rescale all intermediate values to optimal ranges reduce performance so much, that only a narrow set of cases seems worth the effort over using IEEE single or double precision, which the M7 supports in hardware, and whose floating-point exponent range hides an enormous amount (but not all!) of the intermediate-result scaling issues.
But for some simpler DSP algorithms, analyzing and fixing the scaling sometimes isn't a problem. It is hard to tell which without working through the numeric range of every arithmetic operation in the algorithm you need. Sometimes the work required to use integer arithmetic has to be done anyway, because the available processors don't support floating-point arithmetic well, or at all.
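As a small illustration of the scaling burden described above (plain C++, not CMSIS code; the helper name is made up): every Q31 multiply needs an explicit rescale, and every sum needs known headroom, whereas floating point carries the scaling in its exponent.

#include <cstdint>
#include <cstdio>

// Illustrative Q31 multiply: (a/2^31)*(b/2^31) gives a Q62 product that must
// be shifted back down to Q31, truncating the low bits.
static int32_t q31_mul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * (int64_t)b) >> 31);
}

int main() {
    const double Q = 2147483648.0;              // 2^31
    int32_t a = (int32_t)(0.75 * Q);            // 0.75 in Q31
    int32_t b = (int32_t)(0.50 * Q);            // 0.50 in Q31
    printf("0.75 * 0.50 = %f (Q31)\n", q31_mul(a, b) / Q);   // ~0.375

    // Adding two values near full scale exceeds the Q31 range, so a fixed-point
    // FFT has to pre-scale its data (e.g. by 1/2 per butterfly stage); budgeting
    // that per-stage scaling is exactly the headroom analysis discussed above.
    int32_t big = (int32_t)(0.9 * Q);
    int64_t s = (int64_t)big + (int64_t)big;    // 1.8 in "Q31" would wrap an int32_t
    printf("0.9 + 0.9 %s int32 range\n", (s > INT32_MAX) ? "exceeds" : "fits in");
    return 0;
}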

Why is the method of im2col with GEMM more efficient than a direct implementation with SIMD in a CNN?

The convolutional layers are the most computationally intensive parts of convolutional neural networks (CNNs). Currently the common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. However, the im2col operation needs to load and store the image data, and it also needs another memory block to hold the intermediate data.
If I need to optimize the convolution implementation, I may choose a direct implementation with SIMD instructions. Such a method would not incur any extra memory-operation overhead.
the benefits from the very regular patterns of memory access outweigh the wasteful storage costs.
The quote above is from the end of the following post:
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
So I would like to know the reason. Do floating-point operations require more instruction cycles? Or is the input image not very large, so that it may reside in the cache, and the memory operations therefore don't need to access DDR and consume fewer cycles?
Cache-blocking a GEMM is possible so you get mostly L1 cache hits (see also What Every Programmer Should Know About Memory?).
Fitting in the large shared L3 cache on typical x86 CPUs is not sufficient to make things efficient. The per-core L2 caches are typically 256kiB, and even that's slower than the 32kiB L1d cache.
Memory latency is very slow compared to a CPU core clock, but memory/cache bandwidth is not terrible these days with fast DDR4, or L3 cache hits. (But like I said, for a matmul with good cache blocking / loop tiling you can reuse data while it's still hot in L1d if you only transpose parts of the input matrix on the fly. Reducing off-core bandwidth requirements is also important for efficient matmul, not just transposing one so its columns are sequential in memory.)
Beyond that, sequential access to memory is essential for efficient SIMD (loading a vector of multiple contiguous elements, letting you multiply / add / whatever 4 or 8 packed float elements with one CPU instruction). Striding down columns in a row-major matrix would hurt throughput even if the matrix was small enough to fit in L1d cache (32kiB).
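For concreteness, here is a minimal single-channel im2col sketch (assuming row-major storage, stride 1 and no padding; the function name and layout are just one common convention, not from the post above). After this rearrangement, the GEMM's inner loops stream over contiguous rows of the column matrix, which is what makes the SIMD loads and cache blocking described above effective.

#include <vector>

// Expand an H x W image into a (kh*kw) x (OH*OW) column matrix, row-major.
// Row r of the result collects, for one kernel offset (dy, dx), the pixel that
// offset sees at every output position, so each row is contiguous in memory.
std::vector<float> im2col(const std::vector<float>& img, int H, int W,
                          int kh, int kw) {
    const int OH = H - kh + 1, OW = W - kw + 1;
    std::vector<float> col((size_t)kh * kw * OH * OW);
    for (int r = 0; r < kh * kw; ++r) {
        const int dy = r / kw, dx = r % kw;
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox)
                col[(size_t)r * OH * OW + (size_t)oy * OW + ox] =
                    img[(size_t)(oy + dy) * W + (ox + dx)];
    }
    return col;
}
// The convolution then becomes a (1 x kh*kw) * (kh*kw x OH*OW) matrix product per
// output channel, at the cost of the kh*kw-fold extra storage the question mentions.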

Complex interpolation on an FPGA

I have a problem in that I need to implement an algorithm on an FPGA that requires a large array of data that is too large to fit into block or distributed memory. The array contains complex fixed-point values, and it turns out that I can do a good job by reducing the total number of stored values through decimation and then linearly interpolating the interim values on demand.
Though I have DSP blocks (and so fixed-point hardware multipliers) which could be used trivially for real and imaginary part interpolation, I actually want to do the interpolation on the amplitude and angle (of the polar form of the complex number) and then convert the result to real-imaginary form. The data can be stored in polar form if it improves things.
I think my question boils down to this: How should I quickly convert between polar complex numbers and real-imaginary complex numbers (and back again) on an FPGA (noting availability of DSP hardware)? The solution need not be exact, just close, but be speed optimised. Alternatively, better strategies are gladly received!
edit: I know about cordic techniques, so this would be how I would do it in the absence of a better idea. Are there refinements specific to this problem I could invoke?
Another edit: Following from #mbschenkel's question, and some more thinking on my part, I wanted to know if there were any known tricks specific to the problem of polar interpolation.
In my case, the dominant variation between samples is a phase rotation, with a slowly varying amplitude. Since the sampling grid is known ahead of time and is regular, one trick could be to precompute some complex interpolation factors. So, for two complex values a and b, if we wish to find (N-1) intermediate equally spaced values, we can precompute the factor
scale = (abs(b)/abs(a))**(1/N) * exp(1j*(angle(b)-angle(a))/N)
and then find each intermediate value iteratively as val[n] = scale * val[n-1] where val[0] = a.
This works well for me as I need the samples in order and I compute them all. For small variations in amplitude (i.e. abs(b)/abs(a) ~= 1) and 0 < n < N, (abs(b)/abs(a))**(n/N) is approximately linear (though linear is not necessarily better).
The above is all very good, but still results in a complex multiplication. Are there other options for approximating this? I'm interested in resource and speed constraints, not accuracy. I know I can do the rotation with CORDIC, but still need a pair of multiplications for the scaling, so I'm adding lots of complexity and resource usage for potentially limited results. I don't really have a feel for the convergence of CORDIC, so perhaps I just truncate early, or use lots of resources to converge quickly.
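A quick host-side sketch of that recurrence (plain C++ with std::complex and example endpoint values, just to check the arithmetic; this is not FPGA code):

#include <complex>
#include <cmath>
#include <cstdio>

int main() {
    const int N = 8;                              // steps between stored samples
    std::complex<double> a(3.0, 1.0);             // example stored endpoints
    std::complex<double> b(-1.0, 2.5);

    // Precomputed per-step factor: the N-th root of the amplitude ratio,
    // rotated by 1/N of the phase difference.
    double amp_step = std::pow(std::abs(b) / std::abs(a), 1.0 / N);
    double ang_step = (std::arg(b) - std::arg(a)) / N;
    std::complex<double> scale = std::polar(amp_step, ang_step);

    // val[n] = scale * val[n-1], val[0] = a: one complex multiply per sample.
    std::complex<double> val = a;
    for (int n = 1; n <= N; ++n) {
        val *= scale;
        printf("val[%d] = %+.4f %+.4fi\n", n, val.real(), val.imag());
    }
    // After N steps val lands on b (up to rounding and phase wrapping).
    return 0;
}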

FFT and decibel scales

If I take audio data on an iPhone (i.e. real data), perform an FFT and then take the magnitudes (Re^2 + Im^2).
These vary from just above 0 to some large numbers, so I do 10*log10(n) to get it in dB.
This gives me outputs that are negative (for the inputs that were < 1) to positive.
But the examples I've seen for this (and also drawing the spectrum in Sonic Visualiser) always have positive spectrums when measured in dB.
So what have I missed?!
On a wider note, as I understand it decibels are a ratio, so in this context when turning the FFT magnitudes into dB, what are they a ratio to?
The simple answer is that, for the most part, you can add an arbitrary number to the dB value to make the values all positive, or all negative, or whatever you prefer. With an uncalibrated microphone, like on the iPhone, this is all that makes sense anyway, since all you know are relative values.
For a more advanced technical approach, using a calibrated microphone, you could reference everything using dB (SPL), as a reasonable standard, but this is a hassle, and not meaningful in your use case anyway.
Rationale:
The main reason that shifting by an arbitrary amount is acceptable is that the log doesn't report the units of measurement. For example, even if you know the input amplitude is 0.1 Pascal, it's completely valid to say this is 100 milliPascal, where you'd be taking the log of 100 rather than 0.1 (so the log values are either 2 or -1). Both are completely valid and the choice is entirely arbitrary. When comparing to a standard reference, as in dB SPL, note that it's done as a ratio, log(P/Pref), removing the impact of changing the units.
Since an FFT is a linear operator, the scale of its output is related to the scale of the data fed into it. The scale of the input to the FFT on your iPhone depends on the gain of the mic, the audio filters, potentially the AGC, and the ADC reference. Since these are all undocumented and can vary (by position of the mic, model of device, input gain, which may depend on the audio session configuration, etc.), you won't know the ratio unless you perform some sort of calibration against a known reference.
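As a small sketch of that point (hypothetical bin values, plain C++): expressing the bin powers relative to a chosen reference, here the largest bin, makes every value non-positive, and choosing a different reference just shifts the whole scale by a constant.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical FFT bin powers (Re^2 + Im^2), spanning several decades.
    std::vector<double> power = {1e-7, 3e-4, 0.02, 0.5, 12.0, 850.0};

    // dB relative to the largest bin, so every value is <= 0 dB.
    double ref = *std::max_element(power.begin(), power.end());
    for (size_t i = 0; i < power.size(); ++i) {
        double db = 10.0 * std::log10(power[i] / ref);   // 10*log10 because these are powers
        printf("bin %zu: %7.1f dB (re max)\n", i, db);
    }
    // Using a different reference (or adding a constant to every dB value)
    // shifts the whole scale; the differences between bins are unchanged.
    return 0;
}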
