Metal performance on iPhone XR - ios

I have a Metal kernel function that basically looks like this:
struct Matrix {
    half arr[562500]; // enough to store a 750x750 matrix
};

struct Output {
    half arr[12288];
};

kernel void compute_features(device Output& buffer [[ buffer(0) ]],
                             const device Matrix& mtx_0 [[ buffer(1) ]],
                             const device Matrix& mtx_1 [[ buffer(2) ]],
                             constant short2& matSize [[ buffer(3) ]],
                             constant float& offset [[ buffer(4) ]],
                             ushort2 gid [[ thread_position_in_grid ]]) {
    for (int i = 0; i < 12; i++) {
        for (int j = 0; j < 12; j++) {
            int mat_id = i * matSize.x + j;
            half matrixValue_0 = mtx_0.arr[mat_id];
            half matrixValue_1 = mtx_1.arr[mat_id] - offset;
            short someId_0 = 0;
            short someId_1 = 0;
            short someId_2 = 0;
            short someId_3 = 0; // these ids are calculated in the code below
            half value = 0.h; // this value is calculated in the code below
            // some math where `someId` and `value` are calculated using `matrixValue_0` and `matrixValue_1`
            if (some_condition0) {
                buffer.arr[someId_0] += value;
            }
            if (some_condition1) {
                buffer.arr[someId_1] += value;
            }
            if (some_condition2) {
                buffer.arr[someId_2] += value;
            }
            if (some_condition3) {
                buffer.arr[someId_3] += value;
            }
        }
    }
}
I understand that this code has its downsides: dynamic indexing and a big loop. Unfortunately, the algorithm I'm trying to express cannot be implemented differently at this point.
This code runs very well on an iPhone 7+: it takes around 200 µs per iteration, and I'm very happy with that number.
However, when I ran the exact same algorithm on an iPhone XR, I was surprised to see that it takes around 1.0-1.2 ms to complete.
With the help of Xcode and its GPU pipeline debugging tool, I found that my bottlenecks are:
half matrixValue_0 = mtx_0.arr[mat_id];
half matrixValue_1 = mtx_1.arr[mat_id] - offset;
It seems that a significant part of the processing time is spent in the memory load operation.
if (some_condition0) {
    buffer.arr[someId_0] += value;
}
if (some_condition1) {
    buffer.arr[someId_1] += value;
}
if (some_condition2) {
    buffer.arr[someId_2] += value;
}
if (some_condition3) {
    buffer.arr[someId_3] += value;
}
Here most of the processing time is spent in the memory store operation.
To me it looks like the iPhone XR struggles when working with device memory, because the bottlenecks are exactly the places where I access containers in device memory.
I understand that I'm using dynamic indexing, so the compiler cannot predict which address in the container will be loaded or stored in a given iteration. But the code works very well on the iPhone 7+ and not on the iPhone XR.
I suspect that it might have something to do with byte alignment. Could it be related to that?
I would love to hear some suggestions on this. Thanks in advance!


About extending a Look Up Table at compile time

I'd like to extend my instrumenting profiler so that it affects performance less.
In my current implementation, I'm using a ProfilerHelper that takes one string and can be placed wherever you want in the function being profiled.
The constructor starts the measurement and the destructor stops it, logging the delta in an unordered_map entry whose key is that string.
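For reference, the current approach described above might look roughly like the following. This is only a minimal sketch, not the actual implementation: profile_totals(), the nanosecond accumulation and the member names are assumptions made up for illustration.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical global store: accumulated time per label, in nanoseconds.
inline std::unordered_map<std::string, long long>& profile_totals() {
    static std::unordered_map<std::string, long long> totals;
    return totals;
}

// RAII helper: the constructor starts the measurement, the destructor
// stops it and adds the delta to the entry keyed by the label.
class ProfilerHelper {
public:
    explicit ProfilerHelper(std::string label)
        : label_(std::move(label)),
          start_(std::chrono::steady_clock::now()) {}

    ~ProfilerHelper() {
        const auto delta = std::chrono::steady_clock::now() - start_;
        profile_totals()[label_] +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(delta).count();
    }

    ProfilerHelper(const ProfilerHelper&) = delete;
    ProfilerHelper& operator=(const ProfilerHelper&) = delete;

private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};
```

The cost this sketch makes visible is the string hashing and map lookup in every destructor call, which is what an index into a plain vector would avoid.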
Now I'd like to make all of that faster.
First of all, I'd like to create a string LUT (look-up table) containing the function names at compile time, and turn the unordered_map into a plain vector that is indexed in parallel with the string LUT.
Now the question is: I've managed to create a LUT of std::string_view, but I cannot find a way to extend it at compile time.
A first rough attempt looks like this:
#include <array>
#include <cstddef>
#include <iostream>
#include <string_view>

template<unsigned N>
constexpr auto LUT() {
    std::array<std::string_view, N> Strs{};
    for (unsigned n = 0; n < N; n++) {
        Strs[n] = "";
    }
    return Strs;
}

constexpr std::array<std::string_view, 0> StringsLUT { LUT<0>() };

constexpr auto AddString(std::string_view const& Str)
{
    constexpr auto Size = StringsLUT.size();
    std::array<std::string_view, Size + 1> Copy{};
    for (std::size_t i = 0; i < Size; ++i)
        Copy[i] = StringsLUT[i];
    Copy[Size] = Str;
    return Copy;
}

int main()
{
    constexpr auto Strs = AddString(__builtin_FUNCTION());
    //for (auto const Str : Strs)
    std::cout << Strs[0] << std::endl;
}
So my idea is to call AddString wherever needed in the functions to be profiled, extending this list at compile time.
But of course I would have to take the returned Copy and replace StringsLUT every time, to end up with a final StringsLUT containing all the function names.
Is there a way to do that at compile time?
Sorry, I'm only just entering the magic "new" world of constexpr applied to LUTs.
Thanks in advance for your support.

Real FFT output

I have implemented an FFT on an at32ucb-series microcontroller using the KISS FFT library, and I am currently struggling with the FFT output.
My intention is to analyse the sound coming from a piezo speaker.
Currently, the sounder's frequency is 420 Hz, which I successfully got from the FFT output (cross-checked with an oscilloscope). However, if I feed a function-generator waveform into the system, the output frequency is just half of what is expected.
I suspect I got the frequency-bin formula wrong; I am currently using fft_peak_magnitude_index * sampling_frequency / fft_size.
My input is real and I am doing a real FFT (output samples = N/2).
I am also doing IIR filtering and windowing before the FFT.
Any suggestions would be a great help!
// IIR filter calculation, n = 256 fft points
for (ctr=0; ctr<n; ctr++)
{
    // filter calculation
    y[ctr] = num_coef[0]*x[ctr];
    y[ctr] += (num_coef[1]*x[ctr-1]) - (den_coef[1]*y[ctr-1]);
    y[ctr] += (num_coef[2]*x[ctr-2]) - (den_coef[2]*y[ctr-2]);
    y1[ctr] = y[ctr] - 510; // eliminate dc offset
    // hamming window
    hamming[ctr] = (0.54-((0.46) * cos(2*M_PI*ctr/n)));
    window[ctr] = hamming[ctr]*y1[ctr];
    fft_input[ctr].r = window[ctr];
    fft_input[ctr].i = 0;
    fft_output[ctr].r = 0;
    fft_output[ctr].i = 0;
}
kiss_fftr_cfg fftConfig = kiss_fftr_alloc(n,0,NULL,NULL);
kiss_fftr(fftConfig, (kiss_fft_scalar * )fft_input, fft_output);
peak = 0;
freq_bin = 0;
for (ctr=0; ctr<n1; ctr++)
{
    fft_mag[ctr] = 10*(sqrt((fft_output[ctr].r * fft_output[ctr].r) + (fft_output[ctr].i * fft_output[ctr].i)))/(0.5*n);
    if(fft_mag[ctr] > peak)
    {
        peak = fft_mag[ctr];
        freq_bin = ctr;
    }
    frequency = (freq_bin*(10989/n)); // 10989 is the sampling freq
    // Usart write
    char filtResult[16]; // "%04d %04d %04d\n" needs 15 characters plus the terminating NUL
    //sprintf(filtResult, "%04d %04d %04d\n", (int)peak, (int)freq_bin, (int)frequency);
    sprintf(filtResult, "%04d %04d %04d\n", (int)x[ctr], (int)fft_mag[ctr], (int)frequency);
    char c;
    char *ptr = &filtResult[0];
    do
    {
        c = *ptr;
        ptr++; // advance to the next character so the loop reaches '\n'
        usart_bw_write_char(&AVR32_USART2, (int)c);
        // sendByte(c);
    } while (c != '\n');
}
The main problem is likely how you declared fft_input.
Based on your previous question, you are allocating fft_input as an array of kiss_fft_cpx. The function kiss_fftr, on the other hand, expects an array of scalars. By casting the input array to kiss_fft_scalar * with:
kiss_fftr(fftConfig, (kiss_fft_scalar * )fft_input, fft_output);
KissFFT essentially sees an array of real-valued data that contains a zero at every second sample (what you filled in as imaginary parts). This is effectively an upsampled version of your original signal (although without interpolation), i.e. a signal with twice the sampling rate, which is not accounted for in your freq_bin-to-frequency conversion. To fix this, I suggest you pack your data into a kiss_fft_scalar array:
kiss_fft_scalar fft_input[n];
for (ctr=0; ctr<n; ctr++)
{
    fft_input[ctr] = window[ctr];
}
kiss_fftr_cfg fftConfig = kiss_fftr_alloc(n,0,NULL,NULL);
kiss_fftr(fftConfig, fft_input, fft_output);
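To see why the interleaved zeros halve the reported frequency, here is a small stand-alone sketch. It does not use KissFFT at all: it assumes a naive DFT is a good enough stand-in, and dft_magnitude, make_tone and interleave_zeros are hypothetical helper names invented for this illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Magnitude of bin k of a naive DFT over all samples of a real signal.
double dft_magnitude(const std::vector<double>& x, std::size_t k) {
    const double pi = std::acos(-1.0);
    const std::size_t n = x.size();
    double re = 0.0, im = 0.0;
    for (std::size_t m = 0; m < n; ++m) {
        const double angle =
            2.0 * pi * static_cast<double>(k * m) / static_cast<double>(n);
        re += x[m] * std::cos(angle);
        im -= x[m] * std::sin(angle);
    }
    return std::sqrt(re * re + im * im);
}

// n samples of a cosine that sits exactly on bin tone_bin of an n-point DFT.
std::vector<double> make_tone(std::size_t n, std::size_t tone_bin) {
    const double pi = std::acos(-1.0);
    std::vector<double> x(n);
    for (std::size_t m = 0; m < n; ++m) {
        x[m] = std::cos(2.0 * pi * static_cast<double>(tone_bin * m) /
                        static_cast<double>(n));
    }
    return x;
}

// What the cast effectively hands to a real FFT of the same length:
// the first n/2 samples, each followed by the zero stored as imaginary part.
std::vector<double> interleave_zeros(const std::vector<double>& x) {
    std::vector<double> z(x.size(), 0.0);
    for (std::size_t m = 0; m < x.size() / 2; ++m) {
        z[2 * m] = x[m];
    }
    return z;
}
```

With a 64-sample tone on bin 8, the properly packed input has its energy at bin 8, while the zero-interleaved input has it at bin 4 (with a mirror image at bin 28) and nothing at bin 8, so a bin-to-frequency conversion that still uses the original sampling rate reports half the frequency.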
Note also that while looking for the peak magnitude, you are probably only interested in the final largest peak rather than the running maximum. As such, you could limit the loop to computing only the peak (using freq_bin instead of ctr as the array index in the subsequent sprintf statements if needed):
for (ctr=0; ctr<n1; ctr++)
{
    fft_mag[ctr] = 10*(sqrt((fft_output[ctr].r * fft_output[ctr].r) + (fft_output[ctr].i * fft_output[ctr].i)))/(0.5*n);
    if(fft_mag[ctr] > peak)
    {
        peak = fft_mag[ctr];
        freq_bin = ctr;
    }
} // close the loop here before computing "frequency"
Finally, when computing the frequency associated with the bin with the largest magnitude, you need to ensure the computation is done using floating-point arithmetic. If, as I suspect, n is an integer, your formula would evaluate the 10989/n factor using integer arithmetic, resulting in truncation. This can be remedied simply with:
frequency = (freq_bin*(10989.0/n)); // 10989 is the sampling freq

ID3D11DeviceContext::DrawIndexed() Failed

My program is a DirectX program that draws a container cube with smaller cubes inside it. These smaller cubes fall over time; I hope you understand what I mean.
The program isn't complete yet and should draw only the container for now, but it draws nothing: only the background color is visible. I have only included what I think is needed.
These are the routines that initialize the program:
bool Game::init(HINSTANCE hinst,HWND _hw){
Directx11 ::init(hinst , _hw);
return LoadContent();}
bool Directx11::init(HINSTANCE hinst,HWND hw){
RECT rc;
height= rc.bottom - rc.top;
width = rc.right - rc.left;
UINT flags=0;
#ifdef _DEBUG
if (d3dDevice == 0 || d3dDeviceContext == 0)
return 0;
if (m4xMsaaEnable)
IDXGIDevice *Device=0;
HR(d3dDevice->QueryInterface(__uuidof(IDXGIDevice),reinterpret_cast <void**> (&Device)));
HR(Device->GetParent(__uuidof(IDXGIAdapter),reinterpret_cast <void**> (&Ad)));
IDXGIFactory* fac=0;
HR(Ad->GetParent(__uuidof(IDXGIFactory),reinterpret_cast <void**> (&fac)));
ID3D11Texture2D *back = 0;
HR(swapchain->GetBuffer(0,__uuidof(ID3D11Texture2D),reinterpret_cast <void**> (&back)));
Tdesc.BindFlags = D3D11_BIND_DEPTH_STENCIL;
Tdesc.ArraySize = 1;
Tdesc.Height= height;
Tdesc.Width = width;
Tdesc.Usage = D3D11_USAGE_DEFAULT;
if (m4xMsaaEnable)
vp.Width = static_cast <float> (width);
vp.Height= static_cast <float> (height);
vp.MinDepth = 0.0f;
vp.MaxDepth = 1.0f;
d3dDeviceContext -> RSSetViewports(1,&vp);
return true;
SetBuild() prepares the matrices inside the container for the smaller cubes; I haven't programmed it to draw the smaller cubes yet.
And this is the function that draws the scene:
void Game::Render(){
d3dDeviceContext->ClearRenderTargetView(RenderTarget,reinterpret_cast <const float*> (&Colors::LightSteelBlue));
d3dDeviceContext->ClearDepthStencilView(depth,D3D11_CLEAR_DEPTH | D3D11_CLEAR_STENCIL,1.0f,0);
d3dDeviceContext-> IASetInputLayout(_layout);
d3dDeviceContext-> IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
UINT strides=sizeof(Vertex),off=0;
Floor * Lookup; /*is a variable to Lookup inside the matrices structure (Floor Contains XMMATRX Piese[9])*/
std::vector<XMFLOAT4X4> filled; // saves the matrices of the smaller cubes
XMMATRIX V=XMLoadFloat4x4(&View),P = XMLoadFloat4x4(&Proj);
for (UINT i = 0; i < des.Passes; i++)
wvp = XMLoadFloat4x4(&(B.Memory[0].Pieces[0])) * vp; // Loading The Matrix at translation(0,0,0)
HR(ShadeMat->SetMatrix(reinterpret_cast<float*> ( &wvp)));
UINT r1=B.GetSize(),r2=filled.size();
for (UINT j = 0; j < r1; j++)
Lookup = &B.Memory[j];
for (UINT r = 0; r < Lookup->filledindeces.size(); r++)
for (UINT j = 0; j < r2; j++)
ShadeMat->SetMatrix( reinterpret_cast<const float*> (&filled[i]));
Thanks in advance.
One bug in your program appears to be that you're using i, the index of the current pass, as an index into the filled vector, when you should apparently be using j.
Another apparent bug is that in the loop where you are supposed to be iterating over the elements of filled, you're not iterating over all of them. The value r2 is set to the size of filled before you append anything to it during that pass. During the first pass this means that nothing will be drawn by this loop. If your technique only has one pass then this means that the second DrawIndexed call in your code will never be executed.
It also appears that you should only be adding matrices to filled once, regardless of the number of passes the technique has. You should consider whether your code is actually meant to work with techniques that have multiple passes.
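The stale loop bound described above can be reproduced without any Direct3D code. The following is only a sketch of that pattern: the two functions and the use of int elements instead of matrices are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Bound captured before the vector is filled: the "draw" loop never runs.
std::size_t visited_with_stale_bound(std::size_t to_append) {
    std::vector<int> filled;
    const std::size_t r2 = filled.size(); // read too early, always 0 here
    for (std::size_t i = 0; i < to_append; ++i) {
        filled.push_back(static_cast<int>(i));
    }
    std::size_t visited = 0;
    for (std::size_t j = 0; j < r2; ++j) {
        ++visited; // stands in for SetMatrix + DrawIndexed
    }
    return visited;
}

// Bound read after filling: every appended element is visited.
std::size_t visited_with_fresh_bound(std::size_t to_append) {
    std::vector<int> filled;
    for (std::size_t i = 0; i < to_append; ++i) {
        filled.push_back(static_cast<int>(i));
    }
    std::size_t visited = 0;
    for (std::size_t j = 0; j < filled.size(); ++j) {
        ++visited;
    }
    return visited;
}
```

Appending five elements gives zero visits in the first version and five in the second, which matches the symptom of the second draw call never executing on the first pass.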

cuda: involuntary memory changes during kernels [closed]

I'm a beginner CUDA programmer.
I'm trying to build an application similar to the NVIDIA particle system example (many balls in a cube).
I have a kernel launcher function as below:
void Ccuda::sort_Particles_And_Find_Cell_Start(int *Cell_Start, // output
                                               int *Cell_End, // output
                                               float3 *Sorted_Pos, // output
                                               float3 *Sorted_Vel, // output
                                               int *Particle_Cell, // input
                                               int *Particle_Index, // input
                                               float3 *Old_Pos,
                                               float3 *Old_Vel,
                                               int Num_Particles,
                                               int Num_Cells)
{
    int numThreads, numBlocks;
    /*Cell_Start = (int*) cudaAlloc (Num_Cells, sizeof(int));
    Cell_End = (int*) cudaAlloc (Num_Cells, sizeof(int));
    Sorted_Pos = (float3*) cudaAlloc (Num_Particles, sizeof(int));
    Sorted_Vel = (float3*) cudaAlloc (Num_Particles, sizeof(int));*/
    int *h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell,Particle_Cell, Num_Particles*sizeof(int),cudaMemcpyDeviceToHost);
    free (h_p_cell);
    computeGridSize(Num_Particles, 512, numBlocks, numThreads);
    sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
    h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell,Particle_Cell, Num_Particles*sizeof(int),cudaMemcpyDeviceToHost);
    free (h_p_cell);
}
And this global kernel function:
__global__ void sort_Particles_And_Find_Cell_StartD(int *Cell_Start, // output
                                                    int *Cell_End, // output
                                                    float3 *Sorted_Pos, // output
                                                    float3 *Sorted_Vel, // output
                                                    int *Particle_Cell, // input
                                                    int *Particle_Index, // input
                                                    float3 *Old_Pos,
                                                    float3 *Old_Vel,
                                                    int Num_Particles)
{
    int hash;
    extern __shared__ int Shared_Hash[]; // blockSize + 1 elements
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index < Num_Particles)
    {
        hash = Particle_Cell[index];
        Shared_Hash[threadIdx.x+1] = hash;
        if (index > 0 && threadIdx.x == 0)
        {
            // first thread in block loads the previous particle's hash
            Shared_Hash[0] = Particle_Cell[index-1];
        }
    }
    if (index < Num_Particles)
    {
        // If this particle has a different cell index to the previous
        // particle then it must be the first particle in the cell,
        // so store the index of this particle in the cell.
        // As it isn't the first particle, it must also be the cell end of
        // the previous particle's cell
        if (index == 0 || hash != Shared_Hash[threadIdx.x]) // first thread in the grid, or cell index differs from that of the previous neighboring thread
        {
            Cell_Start[hash] = index;
            if (index > 0)
                Cell_End[Shared_Hash[threadIdx.x]] = index;
        }
        if (index == Num_Particles - 1)
        {
            Cell_End[hash] = index + 1;
        }
        // Now use the sorted index to reorder the pos and vel data
        int Sorted_Index = Particle_Index[index];
        //float3 pos = FETCH(Old_Pos, Sorted_Index); // macro does either global read or texture fetch
        //float3 vel = FETCH(Old_Vel, Sorted_Index); // see particles_kernel.cuh
        float3 pos = Old_Pos[Sorted_Index];
        float3 vel = Old_Vel[Sorted_Index];
        Sorted_Pos[index] = pos;
        Sorted_Vel[index] = vel;
    }
}
During execution I get the debug error message R6010, saying that abort() has been called.
As you can see in the launcher function (the first one), I use int *h_p_cell to view
the contents of Particle_Cell before and after the kernel execution, and the contents appear to have changed, although there is no assignment to Particle_Cell inside the kernel.
The Particle_Cell memory is allocated by cudaMemcpy during program init().
I have been trying to solve this issue for a few days, without success.
Can anyone help?
Your kernel is expecting dynamically allocated shared memory:
extern __shared__ int Shared_Hash[]; // blockSize + 1 elements
But you aren't allocating any in your kernel invocation:
sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
The launch configuration above is missing the shared memory size parameter.
You should provide a shared memory amount in your launch configuration. You probably want something like this:
sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads, ((numThreads+1)*sizeof(int))>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
This error will cause your kernel to abort when it tries to access shared memory.
You should also do CUDA error checking on all CUDA API calls and kernel calls. I don't see any evidence of that in your code.
Once you have all the API errors sorted out, run your code with cuda-memcheck. The reason for the unexpected writes to Particle_Cell may be due to out-of-bounds accesses from your kernel, which will become evident with cuda-memcheck.
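Once the launch is fixed, it can also help to compare Cell_Start and Cell_End against a host-side reference. Below is a minimal C++ sketch of the same start/end logic, assuming the per-particle cell indices have been copied back to the host and are already sorted. find_cell_ranges, CellRanges and the -1 marker for empty cells are names and conventions made up for this example, not part of the original code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type: start[c] is the first particle of cell c,
// end[c] is one past its last particle, and -1 marks an empty cell.
struct CellRanges {
    std::vector<int> start;
    std::vector<int> end;
};

// sorted_cell[i] is the cell index of the i-th particle, sorted by cell.
CellRanges find_cell_ranges(const std::vector<int>& sorted_cell, int num_cells) {
    CellRanges r{std::vector<int>(num_cells, -1), std::vector<int>(num_cells, -1)};
    const int n = static_cast<int>(sorted_cell.size());
    for (int i = 0; i < n; ++i) {
        const int hash = sorted_cell[i];
        // A change of cell index marks the start of a new cell and,
        // unless this is the first particle, the end of the previous one.
        if (i == 0 || hash != sorted_cell[i - 1]) {
            r.start[hash] = i;
            if (i > 0) {
                r.end[sorted_cell[i - 1]] = i;
            }
        }
        // The last particle closes its own cell.
        if (i == n - 1) {
            r.end[hash] = i + 1;
        }
    }
    return r;
}
```

For the sorted cell indices {0, 0, 2, 2, 2, 3} and four cells, this yields start = {0, -1, 2, 5} and end = {2, -1, 5, 6}. Any mismatch with the device arrays for the same input points at the kernel rather than the host code.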

Checking if removing an edge in a graph will result in the graph splitting

I have a graph structure from which I am removing edges one by one until some conditions are met. My brain has totally stopped and I can't find an efficient way to detect whether removing an edge will split my graph into two or more graphs.
The brute-force solution would be to run a BFS from a random node and check that all nodes can still be reached, but that will take too much time with large graphs...
Any ideas?
Edit: After a bit of searching, it seems that what I am trying to do is very similar to Fleury's algorithm, where I need to find out whether an edge is a "bridge" or not.
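Edges that make a graph disconnected when removed are called 'bridges'. You can find them in O(|V|+|E|) with a single depth-first search over the whole graph. A related algorithm, which finds all 'articulation points' (nodes that, if removed, make the graph disconnected), follows. Note that an edge between two articulation points is not necessarily a bridge, because both endpoints can also lie on a common cycle. The reliable test is on the DFS tree: a tree edge v->w is a bridge exactly when nothing in the subtree of w reaches v or one of its ancestors through any edge other than that tree edge.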
// g: graph; v: current vertex id;
// r_p: parents (r/w); r_v: visit order (r/w); r_a: ascents (r/w);
// r_ap: art. points, bool array (r/w)
// n_v: dfs order-of-visit counter
void dfs_art_i(graph *g, int v, int *r_p, int *r_v, int *r_a, int *r_ap, int *n_v) {
    int i;
    r_v[v] = *n_v;
    r_a[v] = *n_v;
    (*n_v) ++;
    // printf("entering %d (nv = %d)\n", v, *n_v);
    for (i=0; i<g->vertices[v].n_edges; i++) {
        int w = g->vertices[v].edges[i].target;
        // printf("\t evaluating %d->%d: ", v, w);
        if (r_v[w] == -1) {
            // printf("...\n");
            // This is the first time we find this vertex
            r_p[w] = v;
            dfs_art_i(g, w, r_p, r_v, r_a, r_ap, n_v);
            // printf("\n\t ... back in %d->%d", v, w);
            if (r_a[w] >= r_v[v]) {
                // printf(" - a[%d] %d >= v[%d] %d", w, r_a[w], v, r_v[v]);
                // Articulation point found: mark the vertex v, not the edge index i
                r_ap[v] = 1;
            }
            if (r_a[w] < r_a[v]) {
                // printf(" - a[%d] %d < a[%d] %d", w, r_a[w], v, r_a[v]);
                r_a[v] = r_a[w];
            }
            // printf("\n");
        }
        else {
            // printf("back");
            // We have already found this vertex before
            if (r_v[w] < r_a[v]) {
                // printf(" - updating ascent to %d", r_v[w]);
                r_a[v] = r_v[w];
            }
            // printf("\n");
        }
    }
}

int dfs_art(graph *g, int root, int *r_p, int *r_v, int *r_a, int *r_ap) {
    int i, n_visited = 0, n_root_children = 0;
    for (i=0; i<g->n_vertices; i++) {
        r_p[i] = r_v[i] = r_a[i] = -1;
        r_ap[i] = 0;
    }
    dfs_art_i(g, root, r_p, r_v, r_a, r_ap, &n_visited);
    // the root can only be an AP if it has more than 1 child
    for (i=0; i<g->n_vertices; i++) {
        if (r_p[i] == root) {
            n_root_children ++;
        }
    }
    r_ap[root] = n_root_children > 1 ? 1 : 0;
    return 1;
}
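The same kind of DFS can report bridges directly if it skips the tree edge it arrived through and compares low-link values strictly. The following C++ sketch is not a drop-in extension of the code above: Graph, add_edge, bridge_dfs and find_bridges are hypothetical names, and edges are identified by the order in which they were added.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Graph {
    int n;
    // adj[u] holds (neighbour, edge id) pairs of an undirected graph.
    std::vector<std::vector<std::pair<int, int>>> adj;
    int n_edges;

    explicit Graph(int n_) : n(n_), adj(n_), n_edges(0) {}

    void add_edge(int u, int v) {
        adj[u].push_back(std::make_pair(v, n_edges));
        adj[v].push_back(std::make_pair(u, n_edges));
        ++n_edges;
    }
};

// visit[v]: DFS visit order; low[v]: lowest visit order reachable from the
// subtree of v without going back through the edge used to enter v.
void bridge_dfs(const Graph& g, int v, int parent_edge, int& timer,
                std::vector<int>& visit, std::vector<int>& low,
                std::vector<int>& bridges) {
    visit[v] = low[v] = timer++;
    for (const auto& e : g.adj[v]) {
        const int w = e.first;
        const int id = e.second;
        if (id == parent_edge) continue; // do not reuse the entering edge
        if (visit[w] == -1) {
            bridge_dfs(g, w, id, timer, visit, low, bridges);
            low[v] = std::min(low[v], low[w]);
            // Nothing below w climbs to v or above: the edge is a bridge.
            if (low[w] > visit[v]) bridges.push_back(id);
        } else {
            low[v] = std::min(low[v], visit[w]);
        }
    }
}

// Returns the ids of all bridges, in increasing order.
std::vector<int> find_bridges(const Graph& g) {
    std::vector<int> visit(g.n, -1), low(g.n, -1), bridges;
    int timer = 0;
    for (int v = 0; v < g.n; ++v) {
        if (visit[v] == -1) bridge_dfs(g, v, -1, timer, visit, low, bridges);
    }
    std::sort(bridges.begin(), bridges.end());
    return bridges;
}
```

On a triangle 0-1-2 with pendant edges 0-3 and 1-4, vertices 0 and 1 are both articulation points, yet only the two pendant edges are reported as bridges; the edge 0-1 lies on the cycle and is not.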
If you remove the link between vertices A and B, can't you just check that you can still reach A from B after the edge removal? That's a little better than getting to all nodes from a random node.
How do you choose the edges to be removed?
Can you tell more about your problem domain?
Just how large is your graph? Maybe BFS is just fine!
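The reachability check suggested above could be sketched as follows. This assumes a simple undirected graph stored as adjacency lists; connected_without_edge is a hypothetical name, and the search stops as soon as B is found instead of visiting every node.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// adj[u] lists the neighbours of u in an undirected simple graph.
// Returns true if b is still reachable from a when the edge a-b is ignored.
bool connected_without_edge(const std::vector<std::vector<int>>& adj,
                            int a, int b) {
    std::vector<bool> seen(adj.size(), false);
    std::queue<int> frontier;
    seen[a] = true;
    frontier.push(a);
    while (!frontier.empty()) {
        const int u = frontier.front();
        frontier.pop();
        for (int w : adj[u]) {
            // Skip the removed edge in both directions.
            if ((u == a && w == b) || (u == b && w == a)) continue;
            if (w == b) return true;
            if (!seen[w]) {
                seen[w] = true;
                frontier.push(w);
            }
        }
    }
    return false;
}
```

In the worst case this is still a full traversal per removed edge, so it mainly pays off when the alternative path between the two endpoints tends to be short.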
Since you wrote that you are trying to find out whether an edge is a bridge or not, I suggest
you remove edges in decreasing order of their betweenness measure.
Essentially, betweenness is a measure of an edge's (or vertex's) centrality in a graph.
Edges with a higher betweenness value have greater potential of being a bridge in the graph.
Look it up on the web; the algorithm is called the 'Girvan-Newman algorithm'.
